Due date: Mon Jan 26 11:59 pm - Hard deadline: Fri Jan 30 11:59 pm
Assignment by Julie Zelenski
[Quick Links: Implementation details, Advice page, Grading]
Completing this assignment should sharpen your skills in managing pointers and dynamic memory in C, acting as a client of generic ADTs (CVector and CMap), writing recursive algorithms, and processing text input.
Have you ever wondered how Google Instant substitutes a correction "stanferd: did you mean stanford?" before you even finish typing a query with a typo? Wonder no more; you're about to find out. Your assignment this week is to write a spellchecker that finds misspellings and suggests corrections for them. Your completed program will swallow a multi-megabyte text corpus, use it to find the misspellings within a document of thousands of words, and produce a prioritized set of corrections, all within seconds. Impressive? You bet!
What would be ideal for constructing this program is having some handy ADTs to build upon. Do you still remember those halcyon CS106 days when ArrayList and HashMap were there for all your collection needs (and you took them for granted)? No doubt you've already noticed that the C language has no explicit support for genericity like Java/C++ templates, nor any object-oriented features. This doesn't mean you can't build similar abstractions in C, but it does require some low-level craftiness and a willingness to engage in a bit more risk-taking. We provide CVector and CMap, the C equivalents of those generic collections, and these awesome ADTs help with the grunt work, but even so, there are plenty of challenges in keeping all the pointers straight. By the time you're done, you'll appreciate that C is an amazingly versatile language that puts few restrictions in your way when it comes to manipulating memory, but you'll also wish that it didn't always play it so fast and loose -- there is danger here!
Being able to identify misspellings relies on first having knowledge of the correct spelling of valid words. This knowledge comes from analyzing a text corpus. A corpus file is a large body of text aggregated from various sources. An ideal corpus is large and accurately representative, containing many occurrences of common words, few occurrences of rare words, and no occurrences of misspelled words. From the corpus, you create a language model consisting of each word in the corpus and its number of occurrences. This model is used to spellcheck words: any word in the model is assumed to be correctly spelled; any other word is classified as a misspelling.
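To make the model concrete, here is a minimal sketch of building the word-to-count map from a corpus file. It assumes a CMap API along the lines of cmap_create, cmap_get, and cmap_put from the provided cmap.h (check the header for the exact signatures) and simplifies the token-handling rules, so treat it as an illustration of the idea rather than a prescribed implementation.

#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include "cmap.h"

// Sketch: build the model as a map from lowercase word to occurrence count.
// The cmap_* names/signatures are assumed to match the provided cmap.h;
// the token rules here are simplified (only all-letter tokens are kept).
static CMap *build_model(FILE *corpus)
{
    CMap *model = cmap_create(sizeof(int), 10000, NULL);  // value is an int count
    char token[51];                                        // valid words are <= 50 letters

    while (fscanf(corpus, "%50s", token) == 1) {
        bool valid = true;
        for (char *p = token; *p != '\0'; p++) {
            if (!isalpha((unsigned char)*p)) { valid = false; break; }
            *p = tolower((unsigned char)*p);               // model stores lowercase words
        }
        if (!valid) continue;                              // discard tokens with non-letters

        int *count = cmap_get(model, token);               // NULL if word not yet seen
        if (count != NULL) {
            (*count)++;                                    // bump existing count in place
        } else {
            int one = 1;
            cmap_put(model, token, &one);                  // first occurrence of this word
        }
    }
    return model;
}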
The model is also used for spell correction. The words in the model are examined to find the best, or most probable, corrections for a given misspelling. What makes one correction more probable than another? The primary measure is similarity between the correction and the misspelling. A word with just one differing letter is more likely to have been intended than one with several extra or missing letters; e.g., 'computrs' is closer to 'computers' than it is to 'commute'. To measure similarity, we will use Levenshtein distance (a.k.a. edit distance for those of us who can neither spell nor pronounce Levenshtein). A correction with a lower edit distance from the misspelling is more probable than one with a larger edit distance. For corrections of the same edit distance, their relative frequencies are compared: a word that appears very frequently in the model is a more probable correction than a word that occurs only rarely. Given the words 'they', 'the', and 'theta', which are all equally close to the misspelling 'thet', examining their relative frequencies suggests the word 'the' is the best choice, followed by 'they' and lastly 'theta'.
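To make that ranking rule concrete, here is one way to compare two candidate corrections. The struct and its field names (word, dist, freq) are just for illustration, not part of the starter code.

// Illustrative record for a candidate correction; field names are not
// prescribed by the assignment.
typedef struct {
    const char *word;   // candidate drawn from the model
    int dist;           // edit distance from the misspelling
    int freq;           // number of occurrences in the corpus
} candidate;

// Negative means a is the more probable correction, positive means b is,
// zero means they rank equally under these two criteria.
static int compare_candidates(const candidate *a, const candidate *b)
{
    if (a->dist != b->dist) return a->dist - b->dist;   // smaller edit distance wins
    return b->freq - a->freq;                           // then higher frequency wins
}

Applied to the 'thet' example above, 'the', 'they', and 'theta' all tie on edit distance, so frequency alone decides the ordering.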
Using edit distance and frequency is enough to produce surprisingly accurate suggestions for correcting a misspelling. Industrial-strength spell correction may incorporate additional statistics, such as using contextual clues to further refine the likelihood estimate, but the basic premise is the same.
Russian scientist Vladimir Levenshtein is credited with inventing edit distance in 1965 to measure the dissimilarity between strings. The edit distance between two identical strings is zero; otherwise it is the minimum number of edits required to transform one string into the other. An edit is either the deletion of a character, the insertion of a character, or the substitution of one character for another. The word 'happy' becomes 'sappy' with one substitution; two deletions make 'school' into 'cool'. Transforming 'hennessy' into 'keenness' can be done with one insertion, one substitution, and one deletion, for a total of three edits. The smaller their edit distance, the more similar two strings are. You will compute edit distance using a lovely application of recursion.
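As a point of reference, the classic textbook recursion for Levenshtein distance looks roughly like the sketch below. It is correct but naive, making an exponential number of recursive calls, and per the requirements that follow, the edit distance code you submit must be your own independent implementation.

#include <string.h>

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

// Classic recursion: an empty string is strlen(other) edits away; matching
// first characters cost nothing; otherwise try deletion, insertion, and
// substitution of the first character and take the cheapest.
static int edit_distance(const char *s, const char *t)
{
    if (*s == '\0') return (int)strlen(t);
    if (*t == '\0') return (int)strlen(s);
    if (*s == *t) return edit_distance(s + 1, t + 1);
    return 1 + min3(edit_distance(s + 1, t),        // delete first char of s
                    edit_distance(s, t + 1),        // insert first char of t
                    edit_distance(s + 1, t + 1));   // substitute one for the other
}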
An example. If the corpus text contains these color names:
red blue green yellow orange brown red green brown yellow yellow
The model built from this corpus will have six words, with the frequencies shown below:
red 2
blue 1
green 2
yellow 3
orange 1
brown 2
Let's spellcheck the word 'rend'. It doesn't appear in the model, so it's flagged as a misspelling. Below, the model is annotated with the edit distance between 'rend' and each word:
red 2 ED from rend = 1
blue 1 ED from rend = 4
green 2 ED from rend = 3
yellow 3 ED from rend = 5
orange 1 ED from rend = 4
brown 2 ED from rend = 4
The best corrections are those with the smallest edit distances, and for those with equal edit distance, higher frequencies are better. In the above case, the three most probable corrections in decreasing order are {red, green, brown}.
The program operates in this general fashion: it builds the model of words and frequencies from the corpus, then spellchecks either a document or words entered interactively, and for each misspelling it finds, it reports the most probable corrections drawn from the model.
For more details about the specific requirements, read on:
Finding the best corrections. For each misspelling, examine every word in the model and compute its fitness as a correction for the misspelling. Maintain a leader board that tracks the top N most likely candidates seen so far, and update the leader board whenever you find a better candidate. Compare corrections using the metrics described earlier: a smaller edit distance from the misspelling is better, and among corrections of equal edit distance, a higher frequency is better. Note that the size of the leader board is determined by the constant NUM_CORRECTIONS. This constant is set to 5 by default, but can be changed either by editing the constant definition in spellcheck.c or by using make to build a spellcheck_N target. The number of corrections offered on the leader board should respect the value of this constant. A sketch of one possible leader board update appears just below.
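The sketch reuses the illustrative candidate struct and compare_candidates from the earlier example. A plain fixed-size array keeps it self-contained; in your actual program you are expected to manage the corrections with CVector/CMap as described under code review, and none of these names are prescribed by the starter code.

enum { NUM_CORRECTIONS = 5 };   // default board size; the real constant lives in spellcheck.c

// board[] holds the best candidates seen so far in decreasing order of
// likelihood; *nboard is how many slots are currently filled.
static void update_leaderboard(candidate board[], int *nboard, candidate newc)
{
    int pos = *nboard;                       // find where newc belongs in the ordering
    while (pos > 0 && compare_candidates(&newc, &board[pos - 1]) < 0)
        pos--;
    if (pos >= NUM_CORRECTIONS) return;      // not good enough to place on the board

    int last = (*nboard < NUM_CORRECTIONS) ? *nboard : NUM_CORRECTIONS - 1;
    for (int i = last; i > pos; i--)         // shift weaker entries down one slot
        board[i] = board[i - 1];
    board[pos] = newc;
    if (*nboard < NUM_CORRECTIONS) (*nboard)++;
}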
Edit distance algorithm. We require that your edit distance algorithm be implemented recursively (not using iteration or dynamic programming). As with all code submitted for grading, this algorithm must represent your own independent creation, never copied from or based on the code of others.
Document output format. When spellchecking a document, the output is a list of each unique misspelling from the document and its corrections, one per line. The misspellings are printed in alphabetical order. Each line contains a misspelling followed by its most likely corrections in order of decreasing likelihood. If no misspellings are found, nothing is printed. The misspellings and corrections are printed in all lowercase. The format of each line is as shown below:
low: yellow brown red blue green
rend: red green brown blue orange
...
Interactive output format. When operating interactively, the user enters words one at a time and each is immediately spellchecked. The response printed for each word is either the user's word followed by OK (to indicate the word is correctly spelled) or the word and its list of corrections using the same format as when spellchecking a document. In interactive mode, the words are processed in the order the user entered them. A sample interactive session is shown below:
Enter word(s) to spellcheck one per line. A valid word contains only letters and has length <= 50.
Invalid words will be ignored. End with control-D on a line by itself.
red
red: OK
rend
rend: red green brown blue orange
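The validity rule quoted in the prompt, only letters and length at most 50, amounts to a small check along these lines; the helper name is illustrative, not part of the starter code.

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

// True if token qualifies as a valid word per the interactive prompt:
// nonempty, letters only, and length <= 50.
static bool is_valid_word(const char *token)
{
    size_t len = strlen(token);
    if (len == 0 || len > 50) return false;
    for (size_t i = 0; i < len; i++)
        if (!isalpha((unsigned char)token[i])) return false;
    return true;
}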
Memory. We expect your program to run cleanly under Valgrind with no errors or leaks reported.
Have you read how assignments are graded? For this assignment, the expected point breakdown will be in the neighborhood of:
Functionality (95 points)
Basic cases. (45 points) Simple cases, e.g. a single word interactively spellchecked against a small corpus, as well as more complex inputs that test the range of basic functionality, e.g. token discard rules, sorting, case-insensitivity, and so on.
Robustness cases. (15 points) Atypical inputs (e.g. no misspellings found) and required error conditions (e.g. file cannot be opened).
Stress cases. (15 points) Large inputs touching on all required features.
Clean compile. (2 points) We expect your code to compile cleanly without warnings.
Clean run under valgrind. (10 points) We expect your program to run cleanly under Valgrind during all normal operation. Memory errors are significant deductions; leaks are more minor.
Reasonable efficiency. (8 points) We expect programs to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and efforts to do so could detract from your code quality...). Programs with extreme runtime inefficiencies (requiring more than 10x the time of the sample solution) may lose additional functionality points from failed functionality tests that time out under the autotester.
Code review (buckets weighted to contribute ~20 points). Here we will read and evaluate your code in areas such as:
Style and readability. We expect your code to be clean and readable. We will look for descriptive names, helpful comments, and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, and so on.
Client use of CVector/CMap. We expect your program to use CVector and CMap to manage and manipulate the model, misspellings, and corrections. You should leverage the features provided and not re-implement the functionality.
Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. When the C library provides needed functionality, you should leverage the existing library functions rather than re-implement that functionality.
The starter project contains the source file spellcheck.c, our CVector/CMap implementation provided as a compiled library, and a Makefile. The project is distributed as a Mercurial repository. Check out a copy of your cs107 repository using the command
hg clone /afs/ir/class/cs107/repos/assign2/$USER assign2
The shared directory /afs/ir/class/cs107/samples/assign2 contains a sample executable along with document and corpus files. Your repo contains slink, which is a shortcut way to refer to this directory if you find typing the full path tedious. The enormous English corpus in the samples is courtesy of the ANC and contains 11 million occurrences of 100K distinct words. It is perfect for stress-testing at the end, but you'll want to use smaller files earlier in development (any ordinary text file will do for the document and corpus).
Be sure to use sanity check to verify your program produces conforming output. When your code is finished, don't forget the all-important submit step to send it to us for grading. If needed, familiarize yourself with our late policy.