Assignment 2 advice

Written by Julie Zelenski

Better late than never

You have read our advice on "Best Practices" for 107 assignments, haven't you?? :)

void* client: a cautionary tale

Our provided CVector is a full-featured, memory-managed, re-sizable array and CMap is a super-efficient hash table. Each is a clean package that abstracts away the low-level details, allowing you to put your energy toward solving more interesting problems. Data abstraction is a power tool, and one you have taken advantage of throughout the CS106 courses. What makes it a bit different in C is that these generic collections have to operate using a void* interface that requires the client stay on their toes at all times.

The spellchecker is a short program (our solution is only 3 pages of code) and the handy CVector and CMap do a lot of the heavy lifting. However, don't mistakenly assume this will be a breeze. The code can be dense and it's frighteningly easy to make mistakes when handling memory using void* pointers. Go slow, develop in careful steps, do lots of testing along the way, and keep gdb and Valgrind in the loop.

Start by studying the interfaces in cvector.h and cmap.h. Using them as a client requires a solid understanding of how void* is used and steadfast vigilance about details. We strongly recommend doing this fill-in-the-blank warmup exercise before you even think about coding.

File reading

Data structures

C structs. C struct declarations are almost, but not exactly, the same as C++. In C, the following declares a new struct type:

struct point {
   int x, y;
};  // don't forget to end with semi-colon, without it you'll be mired in compiler errors

The name of the type is struct point and you cannot drop the struct keyword; declaring a point will not compile. You can wrap the struct definition in a typedef to set up a nickname:

typedef struct point {
   int x, y;
} Point;           // need to end with semi-colon here, too

This second version declares struct point and gives it the nickname Point. Either name can be used, they refer to the same type. Use the . (dot) operator to access the struct fields. If you have a pointer to a struct, your first attempt to access the fields is likely to run afoul of the fact that dot has higher precedence than * (dereference). You can add parentheses to force the desired precedence, or better, use the -> operator which combines . and * for this common need.

Point origin, *p = malloc(sizeof(Point));   //sizeof applies to structs, too!

origin.x = 0;    // access struct field from struct variable

                 // these next 3 lines access struct field via pointer to struct
*p.x = 0;        // doesn't compile because . is applied first, then *
(*p).x = 0;      // works, parens override default precedence
p->x = 0;        // preferred way to access via pointer

Interactive spellchecking

The common usage for the program is to spellcheck an entire document against the corpus, but there is a convenient alternate usage to interactively enter words to spellcheck against the corpus. We included the interactive mode as an aid to you in debugging and we don't intend for it to cause you extra work. Spellchecking words entered at the terminal should be able to share almost all its code with the document code path. The provided read_word function will read a valid word from stdin (ignoring any invalid words that are too long or contain non-letters) and then you just send that word through the find_corrections function you wrote for document misspellings. Running in interactive mode under gdb can be confusing to keep track of where your input is going (i.e. typing gdb commands versus words to check) so you may find it easier to run gdb on documents.

Algorithms

A particularly dense function in this program is the one that computes the edit distance between two strings. We recommend that you write and thoroughly test this function in isolation. The function to compute edit distance should not allocate/create or modify any strings, nor does it need to track which edits are performed. It merely counts the number of edits made along the way until the strings match.The code for a recursive formulation of edit distance is breathtakingly simple and elegant, but requires a good understanding of recursion. As a hint, each recursive call will handle only the first characters of the strings and delegates examining any suffixes to recursive call(s). Each call doesn't even look at any characters beyond the first.

The recursive edit distance is beautiful code, but its runtime performance is exponential. If you were to compute the full edit distance against every word in the model, the program will run much too slowly. However, it's not too difficult to drastically prune the work with a simple change. Once the function can determine that the edit distance is already too large for this correction to be in the set of the 5 best corrections, it can give up on it. For example, if after examining 100 words of the model, you have already found 5 corrections that have edit distance <= 3, there is no reason to consider a correction that requires more than 3 edits. Slogging on to find out whether this correction requires 4 or 10 edits is a waste of time. Add a parameter to your function with the current limit and have it bail as soon as the edit distance goes beyond the limit. Using a limit of INT_MAX (i.e. maximum value integer from limits.h) will still compute the full edit distance when needed.

NOTE: The implementation of edit distance should be your own. Our course Honor Code policy is unambiguous in its requirement that code submitted for grading must be original, independent work. Edit distance is a classic algorithm which seems to increase the temptation for students to hunt down someone else's code to use as a model instead of working out a solution for themselves. Adopting the work of others in your submission puts you in violation of the Stanford Honor Code. You shouldn't need to rely on someone else to do your thinking for you--- your CS106B programming chops are up to the task, so bust out your recursive superpowers and show them off! We also specifically require a recursive implementation. Some of you may have previously studied dynamic programming and edit distance is a standard example used to illustrate DP, but DP and other iterative formulations are not an acceptable substitute. If you are truly unable to design or debug your own recursive implementation and the only means for you to get a working version is by using an outside resource, you may do so without inviting Honor Code trouble if you openly and accurately cite the source. We will discount the credit awarded for edit distance based on how much help was taken from the resource. The small discount will be reasonable and much preferable to facing a plagiarism charge.

In your edit distance function, you will need to calculate the minimum of three values. There is no integer min function in the standard library, and you may think this task is so trivial than you couldn't possible get it wrong, but I have seen at least a few edit distance functions that would have worked perfectly if only they didn't have a faulty min calculation. The irony of getting the intricate recursion correct and yet botching a few comparisons is more than you should have to bear. The take-home is that no code is too simple to screw up :-) If you have separated min-of-three into its own function, at least you will be able to test/debug it independently, instead of having to deal with its bugs mixed in with your development of edit distance.

Another algorithmic issue comes in tracking the "leader board" of the best corrections so far. I have seen a surprisingly number of errors in how new entries are added to the board or which entry is knocked off when a better correction is found. Often these errors have their root cause in a misguided attempt to update the board "efficiently". Remember that the leader board is generally small and any time spent processing/sorting/searching it is likely to be trivial. A straightforward approach that is easy to verify correct is a better choice than an overly-clever variant that is subtly wrong.

Function pointers, client callbacks

The raw declaration of function types can be a bit goopy so CVector and CMap use typedef nicknames. For example, the cleanup function for CVector takes one void* parameter and returns void. The typedef below establishes the nickname CleanupElemFn as synonymous to the raw declaration of a void (*)(void *) function type.

typedef void (*CleanupElemFn)(void *addr);

This typedef allows parameters and variables to be declared using the typename CleanupElemFn. To write a cleanup function as a client, you merely need to define a function with the correct prototype. The function's name is used to access its function pointer . A function pointer is compatible with any function type matching its prototype.

void byebye_binky(void *ptr)
{
    Binky *b = (Binky *)ptr;
    // do cleanup of b here
}

int main()
{
    CVector *all_my_binkys = cvec_create(sizeof(Binky), 10, byebye_binky);

The comparison functions used for the CVector sort and search operations has exactly the same design as the qsort, bsearch, and lfind/lsearch functions. A comparison callback is a two-argument function that take pointers to elements (const void *) and returns an integer which indicates whether the first element is less than, equal, or greater than the second. You may want to refer to the lab2 exercises to refresh your memory.

A few additional tips about writing callback functions:

Efficiency

Code quality

Incremental development and testing

Some suggested coding milestones and testing strategies.

  1. Create test files. Build small corpus and document files with deliberate coverage of required functionality. Brainstorm the range of possibilities (differences in case, non-alpha tokens, duplicate misspellings, ties in frequency, etc.) and put together inputs with good coverage of the required functionality. Predict the results for your inputs and try out the sample solution to verify your understanding is correct.
  2. Compute edit distance. Design and write the function that reports the edit distance for a pair of strings. Test in isolation: make calls in gdb, add calls from main, operate on arguments from command-line, etc. Whatever you do, don't shortchange testing this critical function.
  3. Read/count corpus. Start with small files for which you can hand-verify counts. Use the CMap iterator to print the entire model and observe if counts are correct. Run under valgrind to verify that your model's memory use isn't raising any flags.
  4. Edit distance for single word versus corpus. Use the interactive mode to enter a single misspelling and iterate over the CMap invoking your edit distance function between that misspelling and every word in the model. Print all results and verify correctness. Test on small files, as running the full edit distance on a large model is unbearably slow.
  5. Find top corrections for single word. Now update the leader board as you iterate and print only the top correction when done. Compare those results against the sample solution. Verify with valgrind before moving on (use a small file because Valgrind slows things way down).
  6. Add cutoff into edit distance. Now add the cutoff into the edit distance. This shouldn't change what result are produced, but will drastically speed up the performance.
  7. Find misspellings in document. With spell correction working for a single word, applying it to the misspellings found in a document is pretty straightforward.
  8. Stress test. Work your way up to tests of the large corpus and document files. Compare outputs to sample solution on all varieties of inputs. Evaluate runtime and memory performance.
  9. Finalize. Run under valgrind and clean up memory leaks. Re-read the specification and re-test to make sure you cover all the required functionality. Review your code to make final improvements in quality. Submit, then go outside and take a nap in the warm sun!

A note about testing versus the sample solution. If you invoke our solution with the additional flag -v <number>, it prints verbosely (i.e. includes edit distance and frequency for each correction) and uses the number specified to set the size of the leader board. This can be helpful when verifying you are correctly choosing the same best corrections. Your program does not need to support the -v flag and should never print anything other than the required output to appease the autotester. As always, sanity check is your friend -- use it and conform to it!

Program facts

Some facts that might help you set expectation for what you're in for with this assignment.