Web

Assignment 2 advice

Written by Julie Zelenski

Better late than never

You have read our advice on "Best Practices" for 107 assignments, haven't you?? :)

void* client: a cautionary tale

Our provided CVector is a full-featured, memory-managed, re-sizable array and CMap is a super-efficient hash table. Each is a clean package that abstracts away the low-level details, allowing you to put your energy toward solving more interesting problems. Data abstraction is a power tool, and one you have taken advantage of throughout the CS106 courses. What makes it a bit different in C is that these generic collections have to operate using a void* interface that requires the client stay on their toes at all times.

The spellchecker is a short program (our solution is only 3 pages of code) and the handy CVector and CMap do a lot of the heavy lifting. However, don't mistakenly assume this will be a breeze. The code can be dense and it's frighteningly easy to make mistakes when handling memory using void* pointers. Go slow, develop in careful steps, do lots of testing along the way, and keep gdb and Valgrind in the loop.

Start by studying the interfaces in cvector.h and cmap.h. Using them as a client requires a solid understanding of how void* is used and steadfast vigilance about details. We strongly recommend doing this fill-in-the-blank warmup exercise before you even think about coding.

File reading

Our helper function takes care of reading well-formed words and discarding all other tokens. This function is used to read words from the corpus, from the document, and from stdin in interactive mode. Our tokenizing approach is intentionally naive to avoid it becoming a distraction. You'll notice oddities that may bother you. For example, the tokens "co-operate" or "doesn't" or "hello!" are discarded due to non-alphabetic characters. Obviously, there are more sophisticated ways to tokenize, but that's not the point of this program. We ask that you keep things simple-- use the helper function as-is and match the sample!
During file-reading, the strings are usually scanned into temporary storage. Think about when you do and don't need to copy a string to persistent storage. Any persistent string storage should be appropriately sized, not arbitrarily sized to some fixed maximum.
The corpus and document files must be read in one linear pass from start to end. Do not open the file more than once or jump around with the file (e.g. no seek/rewind) as this will cause your program to fail on streaming input.

Data structures

Use a CMap with string key and int value to store the counts from analyzing the corpus.
When tokenizing the document being checked, gather the misspellings in a CVector of strings. Be sure to add only one entry for each distinct misspelling. The CVector search functionality will be useful here.
Use another CVector to store the corrections for a misspelling. This CVector holds structs containing the information for a particular correction, including its edit distance and frequency. Keep this CVector sorted in order of likelihood and discard less likely corrections when better candidates are found.
Some of the data you store in the CVector and CMaps will be strings. It can be convenient/efficient to share the same strings (i.e. store the same char* pointer in more than one collection), but this necessitates that you coordinate the sharing to avoid surprises such as one collection freeing the strings out from under another. Alternatively, you can make distinct copies to avoid aliasing issues. Either is fine with us, but be sure to have a plan and document it.
We set a maximum-length on words to simplify file-reading and tokenizing, but it is not appropriate for all strings to be allocated of the full maximum length. Storage should be allocated to fit each string per its needs. When creating a CVector of strings, do not state that the size of every element will be MAX_WORD_LEN (or 50 or any fixed-size constant) characters! The elements in the CVector should be char*s, each of which points to an appropriately-sized sequence of characters. Note the difference between the two CVectors created below:
```
CVector *wrong   = cvec_create(MAX_WORD_LEN,  ...       // WRONG size, each element is char[MAX_WORD_LEN]
CVector *correct = cvec_create(sizeof(char *), ...      // correct size, each element is char *
```
Be sure to use the correct size and strategy. Some past submissions have used the wrong size and justified that they found this "fix" to be necessary, but oversizing the CVector like this is only working around or covering up other memory misuses that you should correct instead. Read through the fill-in-the-blank warmup exercise again to review how to properly use CVector to store/retrieve char* elements. Come by office hours if you're still stuck -- we are really good at spotting the common errors! :-)

C structs. C struct declarations are almost, but not exactly, the same as C++. In C, the following declares a new struct type:

struct point {
   int x, y;
};  // don't forget to end with semi-colon, without it you'll be mired in compiler errors

The name of the type is struct point and you cannot drop the struct keyword; declaring a point will not compile. You can wrap the struct definition in a typedef to set up a nickname:

typedef struct point {
   int x, y;
} Point;           // need to end with semi-colon here, too

This second version declares struct point and gives it the nickname Point. Either name can be used, they refer to the same type. Use the . (dot) operator to access the struct fields. If you have a pointer to a struct, your first attempt to access the fields is likely to run afoul of the fact that dot has higher precedence than * (dereference). You can add parentheses to force the desired precedence, or better, use the -> operator which combines . and * for this common need.

Point origin, *p = malloc(sizeof(Point));   //sizeof applies to structs, too!

origin.x = 0;    // access struct field from struct variable

                 // these next 3 lines access struct field via pointer to struct
*p.x = 0;        // doesn't compile because . is applied first, then *
(*p).x = 0;      // works, parens override default precedence
p->x = 0;        // preferred way to access via pointer

Interactive spellchecking

The common usage for the program is to spellcheck an entire document against the corpus, but there is a convenient alternate usage to interactively enter words to spellcheck against the corpus. We included the interactive mode as an aid to you in debugging and we don't intend for it to cause you extra work. Spellchecking words entered at the terminal should be able to share almost all its code with the document code path. The provided read_word function will read a valid word from stdin (ignoring any invalid words that are too long or contain non-letters) and then you just send that word through the find_corrections function you wrote for document misspellings. Running in interactive mode under gdb can be confusing to keep track of where your input is going (i.e. typing gdb commands versus words to check) so you may find it easier to run gdb on documents.

Algorithms

A particularly dense function in this program is the one that computes the edit distance between two strings. We recommend that you write and thoroughly test this function in isolation. The function to compute edit distance should not allocate/create or modify any strings, nor does it need to track which edits are performed. It merely counts the number of edits made along the way until the strings match.The code for a recursive formulation of edit distance is breathtakingly simple and elegant, but requires a good understanding of recursion. As a hint, each recursive call will handle only the first characters of the strings and delegates examining any suffixes to recursive call(s). Each call doesn't even look at any characters beyond the first.

The recursive edit distance is beautiful code, but its runtime performance is exponential. If you were to compute the full edit distance against every word in the model, the program will run much too slowly. However, it's not too difficult to drastically prune the work with a simple change. Once the function can determine that the edit distance is already too large for this correction to be in the set of the 5 best corrections, it can give up on it. For example, if after examining 100 words of the model, you have already found 5 corrections that have edit distance <= 3, there is no reason to consider a correction that requires more than 3 edits. Slogging on to find out whether this correction requires 4 or 10 edits is a waste of time. Add a parameter to your function with the current limit and have it bail as soon as the edit distance goes beyond the limit. Using a limit of INT_MAX (i.e. maximum value integer from limits.h) will still compute the full edit distance when needed.

NOTE: The implementation of edit distance should be your own. Our course Honor Code policy is unambiguous in its requirement that code submitted for grading must be original, independent work. Edit distance is a classic algorithm which seems to increase the temptation for students to hunt down someone else's code to use as a model instead of working out a solution for themselves. Adopting the work of others in your submission puts you in violation of the Stanford Honor Code. You shouldn't need to rely on someone else to do your thinking for you--- your CS106B programming chops are up to the task, so bust out your recursive superpowers and show them off! We also specifically require a recursive implementation. Some of you may have previously studied dynamic programming and edit distance is a standard example used to illustrate DP, but DP and other iterative formulations are not an acceptable substitute. If you are truly unable to design or debug your own recursive implementation and the only means for you to get a working version is by using an outside resource, you may do so without inviting Honor Code trouble if you openly and accurately cite the source. We will discount the credit awarded for edit distance based on how much help was taken from the resource. The small discount will be reasonable and much preferable to facing a plagiarism charge.

In your edit distance function, you will need to calculate the minimum of three values. There is no integer min function in the standard library, and you may think this task is so trivial than you couldn't possible get it wrong, but I have seen at least a few edit distance functions that would have worked perfectly if only they didn't have a faulty min calculation. The irony of getting the intricate recursion correct and yet botching a few comparisons is more than you should have to bear. The take-home is that no code is too simple to screw up :-) If you have separated min-of-three into its own function, at least you will be able to test/debug it independently, instead of having to deal with its bugs mixed in with your development of edit distance.

Another algorithmic issue comes in tracking the "leader board" of the best corrections so far. I have seen a surprisingly number of errors in how new entries are added to the board or which entry is knocked off when a better correction is found. Often these errors have their root cause in a misguided attempt to update the board "efficiently". Remember that the leader board is generally small and any time spent processing/sorting/searching it is likely to be trivial. A straightforward approach that is easy to verify correct is a better choice than an overly-clever variant that is subtly wrong.

Function pointers, client callbacks

The raw declaration of function types can be a bit goopy so CVector and CMap use typedef nicknames. For example, the cleanup function for CVector takes one void* parameter and returns void. The typedef below establishes the nickname CleanupElemFn as synonymous to the raw declaration of a void (*)(void *) function type.

typedef void (*CleanupElemFn)(void *addr);

This typedef allows parameters and variables to be declared using the typename CleanupElemFn. To write a cleanup function as a client, you merely need to define a function with the correct prototype. The function's name is used to access its function pointer . A function pointer is compatible with any function type matching its prototype.

void byebye_binky(void *ptr)
{
    Binky *b = (Binky *)ptr;
    // do cleanup of b here
}

int main()
{
    CVector *all_my_binkys = cvec_create(sizeof(Binky), 10, byebye_binky);

The comparison functions used for the CVector sort and search operations has exactly the same design as the qsort, bsearch, and lfind/lsearch functions. A comparison callback is a two-argument function that take pointers to elements (const void *) and returns an integer which indicates whether the first element is less than, equal, or greater than the second. You may want to refer to the lab2 exercises to refresh your memory.

A few additional tips about writing callback functions:

When you write a callback function, its prototype must match the callback prototype exactly as given in the CVector/CMap interface file, including matching use of const. The const qualifier on the void* is an indication that the pointee contents is only to be read, not written. Never cast a function pointer to get around type mismatch; instead, change the function prototype to correctly match.
Arguments to the callback function will be typed as void* , but you know they are really int* or struct binky * . You could add a typecast to the argument everywhere you use in the function body, but a better strategy declares a local variable of the correct pointer type, assigns it once from the parameter, and uses the type-safe local in the rest of the body.

Efficiency

Run our sample executable to gauge the target efficiency and use valgrind to spy on its memory usage.
Our code doesn't cheat, it operates using the data structures and algorithms as described (i.e. recursive edit distance with cutoff) and employs no trickery. If your performance is in the same ballpark as ours (say within a factor of 2 or 3), you're golden. But if your program is noticeably slower or uses much more memory, investigate to find out what unnecessary/redundant work or allocation can be streamlined.
Mild inefficiency (beyond 2-3x but not as far as 10x) loses the efficiency points. Gross inefficiency (> 10x) can lose additional functionality points if your submission runs past the hard timeout.
Note that the bulk of the runtime is spent in analyzing the corpus. Once the model is built, finding the set of best corrections is quite fast (assuming edit distance uses a distance cutoff).
A factor that can contribute to efficient memory use is appropriate capacity hints when creating a CVector or CMap. In situations where you have knowledge of the likely eventual size, allocate the collection to that size. In others (such as when making the model for the corpus), you have to guess somewhat blind. Starting small means many additional allocations will be needed to grow the collection to its eventual size whereas starting with a larger size will need to grow less, but can waste capacity in excess of what you use. Feel free to use the same estimates as our solution: 10K entries for the model and 50 for the number of misspellings.
The number of candidates kept on the leader board is determined by the constant NUM_CORRECTIONS. For efficiency purposes, it is best to maintain just the best N entries in your leader board vector. The alternative of filling an enormous vector with all candidates from the corpus and waiting until the end to prune to the top N corrections will result in significantly increased runtime and memory overhead.

Code quality

There are many features provided with the CVector and CMap implementations, so be sure to read the .h header files and leverage those features rather than rolling your own!
The CVector and CMap are capable of storing elements of any type, but whether your element is an int, a struct or a char*, you refer to the element by its address when adding or retrieving it from the collection. This doesn't mean that the elements stored in the collection have to themselves be pointers! Doing so adds an unnecessary level of indirection. For example, to store numbers, create a CVector of int elements, not int*.
You will need some pointer typecasts (most notably to convert the void* retrieved from the collection back into its specific type pointer) but don't become enamored with sprinkling them about. When you use a typecast, you are taking full responsibility for type-correctness and suppressing any help you might have otherwise gotten from the compiler. Use typecasts exactly and only when you must.

Incremental development and testing

Some suggested coding milestones and testing strategies.

Create test files. Build small corpus and document files with deliberate coverage of required functionality. Brainstorm the range of possibilities (differences in case, non-alpha tokens, duplicate misspellings, ties in frequency, etc.) and put together inputs with good coverage of the required functionality. Predict the results for your inputs and try out the sample solution to verify your understanding is correct.
Compute edit distance. Design and write the function that reports the edit distance for a pair of strings. Test in isolation: make calls in gdb, add calls from main, operate on arguments from command-line, etc. Whatever you do, don't shortchange testing this critical function.
Read/count corpus. Start with small files for which you can hand-verify counts. Use the CMap iterator to print the entire model and observe if counts are correct. Run under valgrind to verify that your model's memory use isn't raising any flags.
Edit distance for single word versus corpus. Use the interactive mode to enter a single misspelling and iterate over the CMap invoking your edit distance function between that misspelling and every word in the model. Print all results and verify correctness. Test on small files, as running the full edit distance on a large model is unbearably slow.
Find top corrections for single word. Now update the leader board as you iterate and print only the top correction when done. Compare those results against the sample solution. Verify with valgrind before moving on (use a small file because Valgrind slows things way down).
Add cutoff into edit distance. Now add the cutoff into the edit distance. This shouldn't change what result are produced, but will drastically speed up the performance.
Find misspellings in document. With spell correction working for a single word, applying it to the misspellings found in a document is pretty straightforward.
Stress test. Work your way up to tests of the large corpus and document files. Compare outputs to sample solution on all varieties of inputs. Evaluate runtime and memory performance.
Finalize. Run under valgrind and clean up memory leaks. Re-read the specification and re-test to make sure you cover all the required functionality. Review your code to make final improvements in quality. Submit, then go outside and take a nap in the warm sun!

A note about testing versus the sample solution. If you invoke our solution with the additional flag -v <number>, it prints verbosely (i.e. includes edit distance and frequency for each correction) and uses the number specified to set the size of the leader board. This can be helpful when verifying you are correctly choosing the same best corrections. Your program does not need to support the -v flag and should never print anything other than the required output to appease the autotester. As always, sanity check is your friend -- use it and conform to it!

Program facts

Some facts that might help you set expectation for what you're in for with this assignment.

Code size. Our solution has 110 lines, student submissions have ranged from 98 to 510 lines. These count programming statement lines (not including blank lines, comments, #include). Our solution is decomposed into 14 functions.
Number of mercurial commits. Average number of commits = 12. More commits is positively associated with higher scores in grading.
Time spent. Some data gathered on student hours spent (self-reported):
- < 5 hours 5%
- 5-10 hours 14%
- 10-15 hours 19%
- 15-20 hours 38%
- 20-25 hours 14%
- more than 25 9%
Points earned. The median assignment score for this project has generally been very strong ~93% of the total points (88/95).