Last night I came across an instance of poor software design that is all too common in modern applications, one that has troubled and annoyed me ever since I became involved with computers and software development. Worse than my ruffled feathers, this is an issue that is responsible for such a vast amount of wasted time and money that it never ceases to amaze me how little has been done to correct it.
I have a Canon Powershot A710 IS digital camera. It’s a middle of the range compact digital camera with just enough features to give me power without having any menus or buttons that I don’t understand. One of the features I enjoy and use regularly is the ability to shoot 640×480 videos by simply turning a dial to the little video icon. I recorded such a video last night of about nine minutes in length, which was actually the longest I had yet recorded on the device (as I discovered later). When I bought the camera, Canon supplied with it a suite of software for managing the download of pictures and video to my PC. I have installed this software on a couple of my computers but last night I happened to be synching with a PC that did not have the software installed. This was no big loss. I’ve synchronized without using the Canon software several times. I don’t use any of the advanced features, I simply transfer the pictures to a folder on my PC and use other software applications for catalog organization, photo touch-up, and publishing at a later date. Windows XP ships with the Microsoft Scanner and Camera Wizard, which had been servicing my “get them off the memory card and onto disk” requirements quite sufficiently for several months.
However, last night while transferring a handful of pictures, some video clips, and my nine-minute movie, I was presented with an error dialog stating that there was a problem with the current picture and it could not be transferred. I was presented with two options “Try Again” and “Cancel”. A note indicated that choosing “Cancel” would cause the pictures transferred thus far to be deleted from their destination; my hard drive. I clicked “Try Again” hoping that some power glitch or other anomaly had simply occurred and that all would be fine. After about thirty seconds of nothing, I received the same dialog with the same options. I chose to “Cancel” this time and watched a progress bar indicating that the pictures were being deleted from the disk. After the process completed, another dialog informed me that the transfer process had failed and embedded in an otherwise redundant paragraph of text was a link for more information about the error. Wanting to correct the problem and complete the transfer, I clicked over the link to receive another pop-up dialog that stated:
“The following problem occurred while copying pictures: Not enough storage is available to complete this operation.”
Aha! Progress was beginning to emerge. My first thought was to check the available space on my hard drive: 11.1GB. Not a massive amount, but certainly enough to transfer a video from a camera with a 2GB storage card. After a little more thought and research, I discovered that the software involved in transferring images and video from a camera was the most haphazard layering of TWAIN drivers on WIA drivers on USB drivers with some other drivel gluing them all together. I’m not a low-level digital transfer guru. I’m familiar with the terms and have a pretty good idea of what goes on at each layer from an interface and bits perspective, but I certainly couldn’t debug errors being logged by a TWAIN driver on my system from a mini-dump file.
I checked the camera, first of all to see that the original pictures and video hadn’t been erased during the “Cancel” process from before, but also to see if the nine-minute video would play back on the camera natively. I was suspecting a possible corruption problem on the camera, perhaps something on the SD card had flaked or perhaps the hardware didn’t perform well with little to no memory available natively on the device. I had almost filled the 2GB card in the camera and was beginning to wonder if it used a cache in the transfer process and if, perhaps, that was the source of the “Not enough storage is available to complete this operation” message. The video played fine, indicating no corruption natively, so I proceeded under the caching assumption that the camera needed some virtual breathing room in order to complete the transfer. I fired up the Microsoft Scanner and Camera Wizard again and this time selected to transfer everything except the long video that had caused the problem. Everything else transferred just fine, including several other short video clips I had taken. I confirmed that they had transferred to the PC and then deleted them from the camera to free up some space.
Here we go! Firing up the Microsoft Scanner and Camera Wizard for a third time, I began the transfer process of the one and only item left on the camera’s memory; my long video. After about thirty seconds of inactivity, I was presented with the same error message I had received earlier. Hmm, not the virtual breathing room then. My last thought before I wrote off the video (not something I wanted to do as it was of our cat Peaches playing with her toys, who is having kidney and urinary problems and is going to be put down tomorrow) was to install the Canon software that had come with the camera and see if it could transfer the video. I dug out the CD, which ended up being in my “Ultimate Dance Party” CD case out in my car (long story, don’t ask – I catalog digitally for a reason!) and installed the software on my PC. Voila (or Walla as seems to be the trend now – Google it), the video transferred without error. I confirmed that it played back successfully in Quicktime, approximately nine minutes in length, occupying about 975MB on disk.
Not satisfied that one application magically transferred it while the other spewed, I began investigating my hard drive for other clips that had transferred successfully. The largest I had previously transferred was 780MB in size, confirming that this was the longest clip I’d recorded on my A710 IS camera. I found it an odd place to draw an imaginary line in the sand, but began formulating the idea that perhaps the nine-minute video was too large for the Microsoft Scanner and Camera Wizard to transfer. Perhaps it caches some of the movie as part of the transfer, or uses some ancient component of WIA or TWAIN that hasn’t been re-written since the late eighties and has some arbitrary “no movie will ever be larger than 800MB” limitation in it. Perhaps part of the algorithm in one of the many moving parts just couldn’t figure out how to begin transferring that particular movie. Software is complex, and software operating upon multiple general-purpose moving layers is even more complex. I accept that as part of being a software engineer.
However, and this is a very big however, this is the job and purpose of error messages and reporting. I can count the number of times I’ve received a meaningful and useful error message from a piece of software on the fingers of one hand. I cannot imagine the number of times I’ve had a meaningless piece of rhetoric presented to me when a piece of software fails; it is probably close to the famously canonical 10^80 atoms in the universe. The only purpose, the very reason we have error messages, is to meaningfully convey to the user of an application an explanation of why an operation failed. This leads me to one of the most infuriating observations I have made time and time again in my career as a software engineer. When a piece of software is being developed by a programmer, in most cases the programmer can isolate a point of failure in a piece of code and tell you exactly the steps that would lead to the situation occurring. If you allow them to sit down in an office and explain the myriad reasons why a particular hardware bit being set or a third-party component returning a specific error message leads to their software encountering a failure, they will talk for hours and hours about the underlying reasons why the piece of software failed. Why then, when it comes to trapping that error in their software, a process which takes time and planning to achieve, do they settle for returning the error message “An error occurred. The operation cannot be completed.”? They have the information available to them. When you ask them to research the error at a later date, a volume of comments exists in the actual source code about why this particular error occurs. Why does this information become suddenly unimportant when it is time to present it to the user? Because most programmers are driven by the goal of making the software work and meeting a stated list of requirements by a specified deadline.
Furthermore, no part of those requirements details the error messages that should be presented to the user of their application. Everyone blindly assumes the software will always just work, and yet the same set of people will instantly concede that no software is perfect and there are a whole heap of reasons why their software might legitimately fail under certain circumstances.
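To make the contrast concrete, here is a minimal Python sketch of the difference between “An error occurred” and a message that passes along what the programmer already knows. Everything in it is hypothetical: the `TransferError` class, the `check_transfer` function, and especially the 800MB limit, which is only my guess from the story above and not a documented limit of any real transfer tool.

```python
class TransferError(Exception):
    """Carries what failed, why it failed, and what the user can try next."""

    def __init__(self, what, why, suggestion):
        # The full text is what a dialog box would show to the user.
        super().__init__(f"{what} {why} {suggestion}")
        self.what = what
        self.why = why
        self.suggestion = suggestion


# Hypothetical limit, invented purely for this illustration.
MAX_TRANSFER_BYTES = 800 * 1024 * 1024
BYTES_PER_MB = 1024 * 1024


def check_transfer(file_name, size_bytes, limit_bytes=MAX_TRANSFER_BYTES):
    """Return True if the file can be copied, else raise a descriptive error."""
    if size_bytes > limit_bytes:
        raise TransferError(
            what=f"'{file_name}' could not be copied from the camera.",
            why=(
                f"The file is {size_bytes // BYTES_PER_MB} MB, and this tool "
                f"can only copy files up to {limit_bytes // BYTES_PER_MB} MB."
            ),
            suggestion=(
                "Your other pictures are not affected. To copy this file, "
                "use a card reader or the software supplied with the camera."
            ),
        )
    return True
```

The programmer who wrote the size check already knows the file name, the size, and the limit at the moment the error is raised. Putting those three facts into the message costs a few extra lines and would have saved me an evening.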
This leads me to the conclusion that software development teams need to be more aware of, and take more responsibility for, the failure conditions of their application and make error situations a priority from the first phases of software design and architecture. Questions about what should happen when the camera is unplugged or a cat pees on the keyboard should be asked very early on and considered a fundamental part of the success conditions of the application. The idea that a piece of software has succeeded when it reports a fatal error probably raises many eyebrows skyward and causes much scratching of heads, but the concept is pretty obvious when you think about it. If a software application gracefully presents a meaningful reason as to why it has failed, providing all of the information to the user about where the failure occurred, then it has succeeded in correctly handling the situation. From a software quality perspective, this case is just as important as completing the requested task and displaying a confirmation number, and yet error cases are given some of the lowest priorities and the least development time. When was the last time you sat in a software planning meeting and discussed the cases under which a particular failure could occur, how this could affect layers consuming your software, and how you could most meaningfully provide information about the error either to an end user or a consumer of your component? Yet from a maintenance cost, ease of use, and customer satisfaction perspective, these discussions carry a phenomenally larger weight than whether or not to use a hyperlink or a button for sorting a DataGrid.
For software engineers who are looking for additional challenge in their work or for the ability to bring increased quality and value to the software that they develop, spending some quality time on exception planning and management is a gold mine of opportunity just waiting to be unlocked. Try revisiting a simple application you’ve written recently and listing out some of the known ways in which it could fail. Then examine the behavior of your application and consider whether your parents could digest the error information well enough to work out how to correct the problem and successfully complete their work. The majority of software users aren’t engineers, and even those who are tend to be engineers like me whose expertise lies in other fields. I can often diagnose why an ASP.NET web application failed and what the error might mean, even on sites I’ve never worked on. However, when presented with an error from a TWAIN device driver consumer I have absolutely no clue what happened and would have to walk a pretty long road of knowledge gathering to even start understanding where to look for the failure. Consider that the majority of software users have absolutely no development skills whatsoever and it becomes pretty clear that software needs to do an outstanding job of reporting why errors have occurred.
Several companies, Microsoft included, have made attempts to bring more richness to their error reporting. Unfortunately, this usually results in a troubleshooting guide with only the most basic reasons for failure being listed (see “Is the printer turned on?” for details) and these are all too often compiled into a single indigestible list of all possibilities instead of being more context specific. It will take a long time for software to become more verbose with error reporting, especially considering that many errors occur in external software components that haven’t been rewritten in over a decade. However, that does not excuse a programmer who has additional information about an error for leaving it out of their error message. Be verbose. Take the time to really document why the line of code that only runs when an error occurs is being executed. Put down everything you know and really try to think about the reasons why the error might have occurred rather than simply stating “There was an error with the camera.” It will not only improve the usability and quality of your application, but might also teach you something about the error case that allows you to handle it more effectively. Perhaps it is an error that can be handled and corrected without the user even knowing something went wrong. More often than not, programmers will choose to error out because something unexpected happened without even taking the time to figure out why it was unexpected. This masks real design flaws in an application and only perpetuates the “too little information” trend that causes such frustration and expense in software today. Break the trend and really document what is happening at all points. Also, educate your project managers and requirements teams on why error conditions are important and why taking the time to structure the output of error messages is cost effective.
Planning a list of 100 well thought out error messages, using resource files to allow for the translation of these error messages in future releases, and taking the time to really understand the reasons behind error conditions can save hundreds of thousands of dollars otherwise spent developing FAQs, supporting forums, and paying for a technical support department to answer questions that a simple error message could have conveyed instantly. How many times has a call come in regarding “Error 471: Failure to communicate” that was received by a level 1, then a level 2 technical support assistant, forwarded on to a developer, researched, and then reported as “Oh yeah, they have to disable the wireless adapter first or the transfer won’t work”? Next a patch is released to check if the wireless adapter is disabled before attempting transfer, and documentation is created to distribute and manage the use of the patch. Tech support lists are updated and a new code branch is created for the new version. Instead, this condition could have been investigated as a success scenario up front and a meaningful error message like “Try disabling the wireless adapter (click here for instructions how) and try the transfer again.” could have been displayed. Even better, the software could have been coded to attempt disabling the adapter first, or research could have been done into how the wireless software API could be utilized to achieve this. All of this would have been discovered if the error scenario had been taken seriously up front and would have cost vastly less than the “pick it up in support later” solution. Users really like software that just works. They gain confidence in your products and have a positive experience that they associate with using your solutions when those solutions speak to them in English and give meaningful reasons why they can’t obey instructions. Consider that the next time you don’t ask a developer how they handled the camera being unplugged.
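As a rough sketch of what a planned message list and an up-front check might look like, here is a small Python example. The error codes, the message wording, and the `preflight` and `user_message` functions are all made up for illustration, and a plain dictionary stands in for whatever resource file format your platform actually uses.

```python
# A tiny stand-in for a resource file: one table of messages per language,
# keyed by a stable error code so translators never touch program logic.
MESSAGES = {
    "en": {
        "WIRELESS_ADAPTER_ENABLED": (
            "The transfer cannot start while the wireless adapter is enabled. "
            "Disable the wireless adapter and try the transfer again."
        ),
        "CAMERA_UNPLUGGED": (
            "The camera was disconnected after {copied} of {total} files were "
            "copied. Reconnect the camera and the transfer will resume."
        ),
    },
}


def user_message(code, locale="en", **details):
    """Look up the planned message for an error code and fill in the details."""
    catalog = MESSAGES.get(locale, MESSAGES["en"])
    template = catalog.get(code) or MESSAGES["en"].get(code)
    if template is None:
        # An unplanned error: still report everything we know.
        return f"Unexpected error {code}. Technical details: {details}"
    return template.format(**details)


def preflight(wireless_enabled, camera_connected):
    """Check known failure conditions before starting; return error codes."""
    problems = []
    if not camera_connected:
        problems.append("CAMERA_UNPLUGGED")
    if wireless_enabled:
        problems.append("WIRELESS_ADAPTER_ENABLED")
    return problems
```

A caller would run `preflight(...)` before the transfer and show `user_message(code, ...)` for each code it returns, for example `user_message("CAMERA_UNPLUGGED", copied=3, total=12)`. The point is not this particular design but that the failure cases were written down, given names, and given words before the first support call came in.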