Written by Julie Zelenski
You have read our advice on "Best Practices" for 107 assignments, haven't you?? :)
Our provided CVector is a full-featured, memory-managed, re-sizable array and CMap is a super-efficient hash table. Each is a clean package that abstracts away the low-level details, allowing you to put your energy toward solving more interesting problems. Data abstraction is a power tool, and one you have taken advantage of throughout the CS106 courses. What makes it a bit different in C is that these generic collections have to operate using a void*
interface that requires the client stay on their toes at all times.
The spellchecker is a short program (our solution is only 3 pages of code) and the handy CVector and CMap do a lot of the heavy lifting. However, don't mistakenly assume this will be a breeze. The code can be dense and it's frighteningly easy to make mistakes when handling memory using void*
pointers. Go slow, develop in careful steps, do lots of testing along the way, and keep gdb and Valgrind in the loop.
Start by studying the interfaces in cvector.h and cmap.h. Using them as a client requires a solid understanding of how void*
is used and steadfast vigilance about details. We strongly recommend doing this fill-in-the-blank warmup exercise before you even think about coding.
char*
pointer in more than one collection), but this necessitates that you coordinate the sharing to avoid surprises such as one collection freeing the strings out from under another. Alternatively, you can make distinct copies to avoid aliasing issues. Either is fine with us, but be sure to have a plan and document it.We set a maximum-length on words to simplify file-reading and tokenizing, but it is not appropriate for all strings to be allocated of the full maximum length. Storage should be allocated to fit each string per its needs. When creating a CVector of strings, do not state that the size of every element will be MAX_WORD_LEN (or 50 or any fixed-size constant) characters! The elements in the CVector should be char*
s, each of which points to an appropriately-sized sequence of characters. Note the difference between the two CVectors created below:
CVector *wrong = cvec_create(MAX_WORD_LEN, ... // WRONG size, each element is char[MAX_WORD_LEN]
CVector *correct = cvec_create(sizeof(char *), ... // correct size, each element is char *
Be sure to use the correct size and strategy. Some past submissions have used the wrong size and justified that they found this "fix" to be necessary, but oversizing the CVector like this is only working around or covering up other memory misuses that you should correct instead. Read through the fill-in-the-blank warmup exercise again to review how to properly use CVector to store/retrieve char*
elements. Come by office hours if you're still stuck -- we are really good at spotting the common errors! :-)
C structs. C struct declarations are almost, but not exactly, the same as C++. In C, the following declares a new struct type:
struct point {
int x, y;
}; // don't forget to end with semi-colon, without it you'll be mired in compiler errors
The name of the type is struct point
and you cannot drop the struct
keyword; declaring a point
will not compile. You can wrap the struct definition in a typedef to set up a nickname:
typedef struct point {
int x, y;
} Point; // need to end with semi-colon here, too
This second version declares struct point
and gives it the nickname Point.
Either name can be used, they refer to the same type. Use the . (dot) operator to access the struct fields. If you have a pointer to a struct, your first attempt to access the fields is likely to run afoul of the fact that dot has higher precedence than * (dereference). You can add parentheses to force the desired precedence, or better, use the -> operator which combines . and * for this common need.
Point origin, *p = malloc(sizeof(Point)); //sizeof applies to structs, too!
origin.x = 0; // access struct field from struct variable
// these next 3 lines access struct field via pointer to struct
*p.x = 0; // doesn't compile because . is applied first, then *
(*p).x = 0; // works, parens override default precedence
p->x = 0; // preferred way to access via pointer
The common usage for the program is to spellcheck an entire document against the corpus, but there is a convenient alternate usage to interactively enter words to spellcheck against the corpus. We included the interactive mode as an aid to you in debugging and we don't intend for it to cause you extra work. Spellchecking words entered at the terminal should be able to share almost all its code with the document code path. The provided read_word
function will read a valid word from stdin (ignoring any invalid words that are too long or contain non-letters) and then you just send that word through the find_corrections function you wrote for document misspellings. Running in interactive mode under gdb can be confusing to keep track of where your input is going (i.e. typing gdb commands versus words to check) so you may find it easier to run gdb on documents.
A particularly dense function in this program is the one that computes the edit distance between two strings. We recommend that you write and thoroughly test this function in isolation. The function to compute edit distance should not allocate/create or modify any strings, nor does it need to track which edits are performed. It merely counts the number of edits made along the way until the strings match.The code for a recursive formulation of edit distance is breathtakingly simple and elegant, but requires a good understanding of recursion. As a hint, each recursive call will handle only the first characters of the strings and delegates examining any suffixes to recursive call(s). Each call doesn't even look at any characters beyond the first.
The recursive edit distance is beautiful code, but its runtime performance is exponential. If you were to compute the full edit distance against every word in the model, the program will run much too slowly. However, it's not too difficult to drastically prune the work with a simple change. Once the function can determine that the edit distance is already too large for this correction to be in the set of the 5 best corrections, it can give up on it. For example, if after examining 100 words of the model, you have already found 5 corrections that have edit distance <= 3, there is no reason to consider a correction that requires more than 3 edits. Slogging on to find out whether this correction requires 4 or 10 edits is a waste of time. Add a parameter to your function with the current limit and have it bail as soon as the edit distance goes beyond the limit. Using a limit of INT_MAX (i.e. maximum value integer from limits.h) will still compute the full edit distance when needed.
NOTE: The implementation of edit distance should be your own. Our course Honor Code policy is unambiguous in its requirement that code submitted for grading must be original, independent work. Edit distance is a classic algorithm which seems to increase the temptation for students to hunt down someone else's code to use as a model instead of working out a solution for themselves. Adopting the work of others in your submission puts you in violation of the Stanford Honor Code. You shouldn't need to rely on someone else to do your thinking for you--- your CS106B programming chops are up to the task, so bust out your recursive superpowers and show them off! We also specifically require a recursive implementation. Some of you may have previously studied dynamic programming and edit distance is a standard example used to illustrate DP, but DP and other iterative formulations are not an acceptable substitute. If you are truly unable to design or debug your own recursive implementation and the only means for you to get a working version is by using an outside resource, you may do so without inviting Honor Code trouble if you openly and accurately cite the source. We will discount the credit awarded for edit distance based on how much help was taken from the resource. The small discount will be reasonable and much preferable to facing a plagiarism charge.
In your edit distance function, you will need to calculate the minimum of three values. There is no integer min function in the standard library, and you may think this task is so trivial than you couldn't possible get it wrong, but I have seen at least a few edit distance functions that would have worked perfectly if only they didn't have a faulty min calculation. The irony of getting the intricate recursion correct and yet botching a few comparisons is more than you should have to bear. The take-home is that no code is too simple to screw up :-) If you have separated min-of-three into its own function, at least you will be able to test/debug it independently, instead of having to deal with its bugs mixed in with your development of edit distance.
Another algorithmic issue comes in tracking the "leader board" of the best corrections so far. I have seen a surprisingly number of errors in how new entries are added to the board or which entry is knocked off when a better correction is found. Often these errors have their root cause in a misguided attempt to update the board "efficiently". Remember that the leader board is generally small and any time spent processing/sorting/searching it is likely to be trivial. A straightforward approach that is easy to verify correct is a better choice than an overly-clever variant that is subtly wrong.
The raw declaration of function types can be a bit goopy so CVector and CMap use typedef nicknames. For example, the cleanup function for CVector takes one void*
parameter and returns void. The typedef below establishes the nickname CleanupElemFn
as synonymous to the raw declaration of a void (*)(void *)
function type.
typedef void (*CleanupElemFn)(void *addr);
This typedef allows parameters and variables to be declared using the typename CleanupElemFn
. To write a cleanup function as a client, you merely need to define a function with the correct prototype. The function's name is used to access its function pointer . A function pointer is compatible with any function type matching its prototype.
void byebye_binky(void *ptr)
{
Binky *b = (Binky *)ptr;
// do cleanup of b here
}
int main()
{
CVector *all_my_binkys = cvec_create(sizeof(Binky), 10, byebye_binky);
The comparison functions used for the CVector sort and search operations has exactly the same design as the qsort, bsearch, and lfind/lsearch functions. A comparison callback is a two-argument function that take pointers to elements (const void *
) and returns an integer which indicates whether the first element is less than, equal, or greater than the second. You may want to refer to the lab2 exercises to refresh your memory.
A few additional tips about writing callback functions:
const
. The const
qualifier on the void*
is an indication that the pointee contents is only to be read, not written. Never cast a function pointer to get around type mismatch; instead, change the function prototype to correctly match.void*
, but you know they are really int*
or struct binky *
. You could add a typecast to the argument everywhere you use in the function body, but a better strategy declares a local variable of the correct pointer type, assigns it once from the parameter, and uses the type-safe local in the rest of the body.NUM_CORRECTIONS
. For efficiency purposes, it is best to maintain just the best N entries in your leader board vector. The alternative of filling an enormous vector with all candidates from the corpus and waiting until the end to prune to the top N corrections will result in significantly increased runtime and memory overhead.CVector
and CMap
implementations, so be sure to read the .h header files and leverage those features rather than rolling your own!int
, a struct
or a char*
, you refer to the element by its address when adding or retrieving it from the collection. This doesn't mean that the elements stored in the collection have to themselves be pointers! Doing so adds an unnecessary level of indirection. For example, to store numbers, create a CVector of int
elements, not int*
.void*
retrieved from the collection back into its specific type pointer) but don't become enamored with sprinkling them about. When you use a typecast, you are taking full responsibility for type-correctness and suppressing any help you might have otherwise gotten from the compiler. Use typecasts exactly and only when you must.Some suggested coding milestones and testing strategies.
A note about testing versus the sample solution. If you invoke our solution with the additional flag -v <number>
, it prints verbosely (i.e. includes edit distance and frequency for each correction) and uses the number specified to set the size of the leader board. This can be helpful when verifying you are correctly choosing the same best corrections. Your program does not need to support the -v flag and should never print anything other than the required output to appease the autotester. As always, sanity check is your friend -- use it and conform to it!
Some facts that might help you set expectation for what you're in for with this assignment.
Time spent. Some data gathered on student hours spent (self-reported):
Points earned. The median assignment score for this project has generally been very strong ~93% of the total points (88/95).