Thoughts on effective software testing

Written by Julie Zelenski

Beware of bugs in the above code; I have only proved it correct, not tried it.
---Don Knuth

The CS107 assignments involve fairly challenging coding tasks. Getting the program to work correctly on the ordinary cases will buff up your coding chops, but nailing down all variations, addressing edge cases, and gracefully handling client misuse presents a real challenge to your testing efficacy. To achieve a truly polished and robust submission, plan to mount an organized and thorough testing campaign. Testing should be an integral part of your development, something you pay attention to early and often, not a cursory check at the end before submitting. Testing is a skill you can hone, and you'll have lots of opportunity for practice this quarter. Time has shown that the investment (or lack thereof) in comprehensive testing is a key factor distinguishing the excellent CS107 submissions from the mediocre ones. I want all of you to experience the joy and satisfaction of completely crushing these assignments, so read on for our advice. Black-box, white-box, stress, and more: you'll want to develop a full arsenal of tactics and strategies you can bring to bear in a multi-pronged approach to testing.

Black-box testing

Black-box testing treats the program as a "black box", that is, without considering its code paths or internal structures. Working from a specification that dictates the expected behavior, you run experiments on the program to observe whether it behaves correctly. The idea is to brainstorm a broad set of inputs and interactions (variations for invoking the program, different files to feed as input, ways of interacting with the program as the user to direct execution to different outcomes) and construct small test cases for each. Achieving comprehensive black-box coverage requires creativity and sometimes even a bit of deviousness. Your goal is to "torture" the code into exposing its flaws.

For example, consider testing just the command-line usage of a program. The "usage" here means the range of ways the user can invoke the program with various command-line arguments. What happens if the user invokes the program with a missing argument? What if an argument is of the wrong type? What if the arguments are ordered incorrectly? What if a value for an argument is nonsensical, such as a negative size or a nonexistent file? Each of these can be tested with an individual case to verify the program gracefully handles the full range of possible invocations.
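
To make "graceful handling" concrete, here is a minimal sketch of the kind of argument validation those usage tests probe. The program name and arguments (a hypothetical wordfreq invoked as ./wordfreq <count> <filename>) are illustrative, not from any particular assignment:

    #include <stdio.h>
    #include <stdlib.h>

    // Hypothetical program expecting: ./wordfreq <count> <filename>
    // Each black-box usage test targets one of these rejection paths.
    int main(int argc, char *argv[])
    {
        if (argc != 3) {                          // missing/extra arguments
            fprintf(stderr, "Usage: %s <count> <filename>\n", argv[0]);
            return 1;
        }
        char *end;
        long count = strtol(argv[1], &end, 10);
        if (end == argv[1] || *end != '\0') {     // wrong type (not a number)
            fprintf(stderr, "error: count '%s' is not an integer\n", argv[1]);
            return 1;
        }
        if (count < 0) {                          // nonsensical value
            fprintf(stderr, "error: count cannot be negative\n");
            return 1;
        }
        FILE *fp = fopen(argv[2], "r");
        if (fp == NULL) {                         // nonexistent/unreadable file
            fprintf(stderr, "error: cannot open '%s'\n", argv[2]);
            return 1;
        }
        // ... normal processing would go here ...
        fclose(fp);
        return 0;
    }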

For file-based inputs, you construct cases that isolate particular behaviors. Consider a program which reads a file and finds the longest word. Try creating a test input file where the longest word appears first, another where it appears in the middle, and another where it is the last word. And why not also try an empty file and a file with only a single word? How about a file where there is a tie for longest? Does the spec say there is a limit on the maximum word length? Try files with a longest word of length max-1, max, and max+1 to observe the behavior at that fringe. When building a test case focused on a particular issue, you typically aim for the most minimal case that reproduces the desired behavior and avoids unnecessary interference. These small test cases are also the exact inputs you will want when debugging the issues found in testing.
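
Boundary files like these are easy to generate programmatically. As a sketch, assuming a hypothetical limit MAX_WORD_LEN, a small helper could emit the max-1/max/max+1 trio:

    #include <stdio.h>

    // Hypothetical helper: writes a file containing one word of exactly
    // 'wordlen' letters, handy for probing behavior at a length limit.
    void write_word_file(const char *path, int wordlen)
    {
        FILE *fp = fopen(path, "w");
        if (fp == NULL) return;
        for (int i = 0; i < wordlen; i++)
            fputc('a', fp);
        fputc('\n', fp);
        fclose(fp);
    }

    // e.g. write_word_file("at_max.txt", MAX_WORD_LEN);
    //      write_word_file("over_max.txt", MAX_WORD_LEN + 1);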

One of the limitations of black-box testing is that it is difficult to be confident you have covered all the code paths without knowledge of the code internals. For example, maybe the longest-word program above handles files larger than a megabyte with a completely distinct code path from smaller files, and thus it would be prudent to run tests near/at/over that size, but a true outsider would have no reason to even suspect this. When you are acting as both the tester and the author of the code, your insider information allows you to improve your test suite by adding white-box coverage.

White-box testing

White-box testing relies on knowledge of the design internals and code paths. As you write the code, you can be thinking about and planning for the test cases needed to exercise all the code paths. One helpful way to think about coverage is to map it to control flow. An if-else in a function suggests there will be two paths to verify, one through the if and the other through the else. You can similarly use your knowledge of the essential special cases to identify other non-overlapping paths. For example, deleting from a linked list might be broken down into the cases of deleting the first cell, the last cell, and a middle cell, as in the sketch below.
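
Here is a sketch of a hypothetical delete_value on a singly-linked list (the node type and function are illustrative); the branches mark off the non-overlapping paths that each deserve their own white-box test case:

    #include <stdlib.h>

    typedef struct node {
        int value;
        struct node *next;
    } node;

    // Removes the first cell containing value. Distinct paths to verify:
    // value in the first cell (head must be updated), in a middle cell,
    // in the last cell, and not present at all.
    void delete_value(node **list, int value)
    {
        node *prev = NULL;
        node *cur = *list;
        while (cur != NULL && cur->value != value) {
            prev = cur;
            cur = cur->next;
        }
        if (cur == NULL) return;        // path: value not found
        if (prev == NULL)
            *list = cur->next;          // path: deleting the first cell
        else
            prev->next = cur->next;     // path: middle or last cell
        free(cur);
    }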

White-box testing may be done by writing testing code and/or using the debugger to make directed calls to the function being tested. Consider testing a Contains function that computes whether one string is a substring of another. One way to test would be to stop in the debugger and execute manual calls to it (using gdb's print command) to check the result. Another technique is to write a throwaway function that prompts the user to enter the two strings and invokes Contains to show the result, allowing you to interactively enter various test cases. You could also add a testing function that exercises a range of calls to Contains with carefully chosen arguments (prefix match, suffix match, interior match, no match, equal strings, empty string, etc.), compares the result of each call to the expected value, and alerts you to any discrepancy. This testing function acts as a "unit test" for Contains and verifies its correctness on the identified cases. If you build such unit tests into the program, you can re-run those tests at any time, which is particularly helpful in verifying that code that previously worked hasn't gone wonky due to subsequent changes.
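
As a minimal sketch of such a unit test, assume Contains(haystack, needle) returns whether needle occurs within haystack; the stand-in implementation via strstr is just scaffolding so the sketch is self-contained:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    // Stand-in for the implementation under test.
    bool Contains(const char *haystack, const char *needle)
    {
        return strstr(haystack, needle) != NULL;
    }

    // Unit test: exercises carefully chosen cases, reports any discrepancy.
    void TestContains(void)
    {
        struct { const char *haystack, *needle; bool expected; } cases[] = {
            {"testing", "test",    true},   // prefix match
            {"testing", "ing",     true},   // suffix match
            {"testing", "estin",   true},   // interior match
            {"testing", "xyz",     false},  // no match
            {"testing", "testing", true},   // equal strings
            {"testing", "",        true},   // empty substring
            {"",        "test",    false},  // empty haystack
        };
        int ncases = sizeof(cases) / sizeof(cases[0]);
        for (int i = 0; i < ncases; i++) {
            bool result = Contains(cases[i].haystack, cases[i].needle);
            if (result != cases[i].expected)
                printf("FAIL: Contains(\"%s\", \"%s\") = %d, expected %d\n",
                       cases[i].haystack, cases[i].needle,
                       result, cases[i].expected);
        }
    }

    int main(void)
    {
        TestContains();
        return 0;
    }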

One disadvantage of white-box testing is that the same oversight that leads you to introduce an error into the code is also likely to cause you to miss testing for it. For example, if you didn't consider that one of the arguments might be an empty string, your code may not be written to correctly handle it, nor would you be likely to devise a test case with such an input.

Stress testing

Early in development, you will use small, focused tests to verify that the basics work in isolation, but later on, you need to mix in larger, unfocused inputs that scale up the size and bring in more complex interactions. Those larger inputs might be created by hand or generated by an automated or randomized tool (the idea of randomly generating inputs is known as fuzz testing). To automatically generate stress or randomized inputs, you could write a program in C or string together a sequence of unix utilities and/or custom scripts.
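
For example, here is a sketch of a small fuzz generator in C (the word lengths and alphabet are arbitrary choices); its output can be redirected to a file or piped straight into the program under test:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    // Sketch of a fuzz generator: emits random lowercase "words" to stdout.
    // Usage (illustrative): ./fuzzwords 50000 > big_input.txt
    int main(int argc, char *argv[])
    {
        int nwords = (argc > 1) ? atoi(argv[1]) : 10000;
        srand(time(NULL));
        for (int i = 0; i < nwords; i++) {
            int len = 1 + rand() % 20;                // arbitrary cap of 20
            for (int j = 0; j < len; j++)
                putchar('a' + rand() % 26);
            putchar((i + 1) % 10 == 0 ? '\n' : ' ');  // ten words per line
        }
        return 0;
    }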

Sometimes there are simple changes you can make that force the code to hit certain passages harder and give them a thorough workout. For example, consider testing a hashtable. Configuring the hashtable to initialize with a very small number of buckets and then adding a lot of entries will force it to repeatedly rehash and shake out problems in that code path. Or changing the hash function to map every key to the same hash code (say, zero) forces all entries to aggregate in a single bucket, so all operations now exercise one large list rather than many singleton lists. Such scenarios can be introduced temporarily and then removed when you're satisfied with the results.
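
As a sketch, assuming the hashtable's hash function has a shape along these lines, the degenerate version is a one-line temporary change:

    // Temporarily degenerate hash for stress testing: every key lands in
    // bucket zero, so every operation pounds on one long chain.
    // (Restore the real hash function once you're satisfied.)
    static int hash(const char *key, int nbuckets)
    {
        (void)key; (void)nbuckets;   // unused in the degenerate version
        return 0;
    }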

The nature of the larger stress tests often makes them unwieldy as debugging inputs. If one of your stress tests uncovers a new bug, you may want to first winnow the case down to the essentials needed to reproduce the bug with minimal additional baggage, so as to simplify the debugging task.

Regression testing, or why to never throw any testing support away

One of the sad ironies of software is that fixing a bug can sometimes break something else that previously worked. For this reason, make a habit of preserving test inputs and testing code rather than being quick to discard them when you think you're past that hurdle. Keeping them around means you can easily repeat those tests as you continue development and immediately spot when you've accidentally taken a step backward. Remember, you can use revision control to save a snapshot that you can retrieve later.

Test-driven development

One of the most celebrated features of the modern practice of extreme programming is the emphasis on test-driven development. Short-cycle test-driven development goes like this: write a test for the next small bit of functionality you intend to add, run it to watch it fail, write just enough code to make it pass, re-run the test to confirm, and repeat.

You change only a small amount of code at once and validate your results with a carefully constructed test before and after. This keeps your development process moving forward and gives you a nice confidence boost when you can immediately verify the results of your efforts.

Tools for testing

Testing is challenging enough that you don't want to exacerbate it by using a process full of laborious tedium. There are a variety of tools you can leverage to streamline and automate the work: the debugger for making directed calls to a function under test, unix utilities such as diff for comparing a program's output against an expected reference, shell scripts or Makefile targets for re-running your suite of saved test inputs, and revision control for snapshotting known-good states.

The testing mindset

Program testing can be used to show the presence of bugs, but never to show their absence.
---Edsger Dijkstra

Sometimes the biggest testing hurdle is the reluctance to even undertake the hunt. Let's be honest: the point of testing is to find flaws, and once found, you will feel compelled to fix them! If you don't go looking for bugs, you get to remain blissfully ignorant and submit feeling smug and happy. Well, happy until our nefarious grading tools torture your code into showing what is lurking within... It's so much more satisfying to find (and fix) the problems yourself than to wait for the dreaded autotester to smack your code around! Incorporate testing into your game plan from the get-go and you can submit with confidence that your code has already been subjected to intense scrutiny and you have nothing to fear from our puny tools! :-)