Assignment 2: Spellcheck

Due date: Mon Jan 26 11:59 pm - Hard deadline: Fri Jan 30 11:59 pm

Assignment by Julie Zelenski

[Quick Links: Implementation details, Advice page, Grading]

Learning goals

Completing this assignment should sharpen your skills in:

  1. client use of generic void * interfaces in C
  2. writing void * client callback functions
  3. debugging memory errors (because we are sure you will encounter at least one or two :-)
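
To give a small, concrete taste of goals 1 and 2, the sketch below uses the standard library's qsort, whose generic interface traffics in void * and whose comparison callback must cast those pointers back to the client's real element type. The provided CVector and CMap follow the same general pattern, though their interfaces differ in the details; this example uses only standard C.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Comparison callback for qsort: each argument is a void * that actually
     * points to one element of the array (here, a char * to a string).
     * The client is responsible for casting back to the true type. */
    static int cmp_str(const void *a, const void *b)
    {
        const char *first  = *(const char **)a;
        const char *second = *(const char **)b;
        return strcmp(first, second);
    }

    int main(void)
    {
        const char *words[] = {"yellow", "red", "green", "blue"};
        int nwords = sizeof(words) / sizeof(words[0]);

        qsort(words, nwords, sizeof(words[0]), cmp_str);

        for (int i = 0; i < nwords; i++)
            printf("%s\n", words[i]);   /* blue green red yellow */
        return 0;
    }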

The problem

Have you ever wondered how Google Instant substitutes a correction "stanferd: did you mean stanford?" before you even finish typing a query with a typo? Wonder no more, you're about to find out. Your assignment this week is to write a spellchecker to find misspellings and suggest corrections for them. Your completed program will swallow a multi-megabyte text corpus, use it to find the misspellings within a document of thousands of words, and produce a prioritized set of corrections, all within seconds. Impressive? You bet!

What would be ideal for constructing this program is having some handy ADTs to build upon. Perhaps you still remember those halcyon CS106 days when ArrayList and HashMap were there for all your collection needs (and you took them for granted). No doubt you've already noticed that the C language has no explicit support for genericity like Java/C++ templates nor any object-oriented features. This doesn't mean you can't build similar abstractions in C, but it will require some low-level craftiness and a willingness to engage in a bit more risk-taking. We provide CVector and CMap, the C equivalents of those generic collections, and these awesome ADTs help with the grunt work, but even so, there are plenty of challenges in keeping all the pointers straight. By the time you're done, you'll appreciate that C is an amazingly versatile language that puts few restrictions in your way when it comes to manipulating memory, but you'll also wish that it didn't always play it so fast and loose -- there is danger here!

A statistical model of language

Being able to identify misspellings relies on first having knowledge of the correct spelling of valid words. The language model comes from analyzing a text corpus. A corpus file is a large body of text aggregated from various sources. An ideal corpus is large and accurately representative, containing many occurrences of common words, few occurrences of rare words, and no occurrences of misspelled words. From the corpus, you can create a language model. The model consists of each word in the corpus and its number of occurrences. This model is used to spellcheck words: any word in the model is assumed to be correctly spelled; any other word is classified as a misspelling.
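
As a rough illustration (not the required design), the sketch below builds exactly this kind of word-count model from a tiny made-up corpus. It uses a plain array with linear search in place of the CMap you will actually want, and the names add_word, words, and counts are invented for this example.

    #include <stdio.h>
    #include <string.h>

    #define MAX_WORDS 1000
    #define MAX_LEN   32

    /* Toy model: parallel arrays of words and counts, linear search.
     * A real implementation would use the provided CMap instead. */
    static char words[MAX_WORDS][MAX_LEN];
    static int  counts[MAX_WORDS];
    static int  nwords = 0;

    static void add_word(const char *w)
    {
        for (int i = 0; i < nwords; i++) {
            if (strcmp(words[i], w) == 0) { counts[i]++; return; }
        }
        strcpy(words[nwords], w);   /* first occurrence of this word */
        counts[nwords] = 1;
        nwords++;
    }

    int main(void)
    {
        const char *corpus = "the cat sat on the mat and the cat slept";
        char copy[256];

        strcpy(copy, corpus);
        for (char *token = strtok(copy, " "); token != NULL; token = strtok(NULL, " "))
            add_word(token);

        for (int i = 0; i < nwords; i++)
            printf("%-8s %d\n", words[i], counts[i]);   /* e.g. the 3, cat 2, ... */
        return 0;
    }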

The model is also used for spell correction. The words in the model are examined to find the best, or most probable, corrections for a given misspelling. What makes one correction more probable than another? The primary likelihood comes from a measure of similarity between the correction and the misspelling. A word with just one differing letter is more likely to have been intended than one with several extra or missing letters; for example, 'computrs' is closer to 'computers' than it is to 'commute'. To measure similarity, we will use Levenshtein distance (a.k.a. edit distance for those of us who can neither spell nor pronounce Levenshtein). A correction with a lower edit distance from the misspelling is more probable than one with a larger edit distance. For corrections of the same edit distance, their relative frequencies are compared. A word that appears very frequently in the model is a more probable correction than a word that occurs only rarely. Given the words 'they', 'the', and 'theta', which are all equally close to the misspelling 'thet', examining their relative frequencies will suggest the word 'the' is the best choice, followed by 'they' and lastly 'theta'.

Using edit distance and frequency is enough to produce surprisingly accurate suggestions for correcting a misspelling. Industrial-strength spell correction may incorporate additional statistics, such as using contextual clues to further refine the likelihood estimate, but the basic premise is the same.

Choosing the best corrections

Russian scientist Vladimir Levenshtein is credited with the invention of edit distance in 1965 to measure the dissimilarity between strings. The edit distance between two identical strings is zero; otherwise, it is the minimum number of edits required to transform one string into the other. An edit is either the deletion of a character, the insertion of a character, or the substitution of one character for another. The word 'happy' becomes 'sappy' with one substitution; two deletions make 'school' into 'cool'. Transforming 'hennessy' into 'keenness' can be done with one insertion, one substitution, and one deletion, for a total of three edits. The smaller their edit distance, the more similar two strings are. You will compute edit distance using a lovely application of recursion.
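
To make the recursive formulation concrete, here is one minimal (and deliberately naive) sketch of Levenshtein distance: it tries the match, delete, insert, and substitute cases recursively and takes the minimum. It is only an illustration; the assignment's own requirements for how you structure and speed up this computation take precedence.

    #include <stdio.h>
    #include <string.h>

    /* Smallest of three values. */
    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    /* Plain recursive Levenshtein distance between s and t.
     * Exponential in the worst case -- fine for short words, but a real
     * spellchecker will want to prune or otherwise speed this up. */
    static int edit_distance(const char *s, const char *t)
    {
        if (*s == '\0') return (int)strlen(t);   /* insert the rest of t */
        if (*t == '\0') return (int)strlen(s);   /* delete the rest of s */
        if (*s == *t) return edit_distance(s + 1, t + 1);  /* first chars match, no edit */
        return 1 + min3(edit_distance(s + 1, t),       /* delete from s */
                        edit_distance(s, t + 1),       /* insert into s */
                        edit_distance(s + 1, t + 1));  /* substitute */
    }

    int main(void)
    {
        printf("%d\n", edit_distance("rend", "red"));          /* prints 1 */
        printf("%d\n", edit_distance("hennessy", "keenness")); /* prints 3 */
        return 0;
    }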

An example: suppose the corpus text contains these color names:

red blue green yellow orange brown red green brown yellow yellow

The model built from this corpus will have six words, with the frequencies shown below:

    red       2
    blue      1
    green     2
    yellow    3
    orange    1
    brown     2

Let's spellcheck the word 'rend'. It doesn't appear in the model, so it's flagged as a misspelling. Below, the model is annotated with the edit distance between 'rend' and each word:

    red       2       ED from rend = 1
    blue      1       ED from rend = 4
    green     2       ED from rend = 3
    yellow    3       ED from rend = 5
    orange    1       ED from rend = 4
    brown     2       ED from rend = 4

The best corrections are those with the smallest edit distances, and for those with equal edit distance, higher frequencies are better. In the above case, the three most probable corrections, in decreasing order of likelihood, are {red, green, brown}.
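
Putting the two criteria together, ranking candidate corrections reduces to a two-level comparison: smaller edit distance wins, and frequency breaks ties. Below is a small sketch of such a comparison as a qsort-style callback; the candidate struct and its fields are invented for this illustration and are not part of the provided interfaces.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical candidate correction: its edit distance from the
     * misspelling and its frequency in the model. */
    typedef struct {
        const char *word;
        int distance;   /* edit distance from the misspelling */
        int frequency;  /* number of occurrences in the corpus */
    } candidate;

    /* qsort-style callback: lower edit distance is better; on ties,
     * higher frequency is better. */
    static int cmp_candidates(const void *a, const void *b)
    {
        const candidate *ca = a, *cb = b;
        if (ca->distance != cb->distance)
            return ca->distance - cb->distance;
        return cb->frequency - ca->frequency;
    }

    int main(void)
    {
        candidate c[] = {
            {"red", 1, 2}, {"blue", 4, 1}, {"green", 3, 2},
            {"yellow", 5, 3}, {"orange", 4, 1}, {"brown", 4, 2},
        };
        int n = sizeof(c) / sizeof(c[0]);

        qsort(c, n, sizeof(c[0]), cmp_candidates);
        for (int i = 0; i < 3; i++)     /* the three most probable corrections */
            printf("%s\n", c[i].word);  /* red, green, brown */
        return 0;
    }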

Implementation details

The program operates in this general fashion:

  1. The program is invoked with arguments of the corpus file and an optional document file. If no document is given to check, the program operates in interactive mode.
  2. It tokenizes the corpus file and counts the word frequencies to build the model.
  3. It tokenizes the document (or interactively reads words from the terminal) and identifies the misspellings.
  4. For each misspelling, it finds the most likely corrections from the model and prints them.
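
As a rough sketch of this top-level flow only: the helper functions named below (build_model, spellcheck_file, spellcheck_interactive) are placeholders invented for the illustration, not required names, and argument handling and error checking are kept to a minimum.

    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholders for the real work -- the names and signatures here are
     * invented for this sketch, not part of any provided interface. */
    static void build_model(const char *corpus_path)  { (void)corpus_path; /* ... */ }
    static void spellcheck_file(const char *doc_path) { (void)doc_path; /* ... */ }
    static void spellcheck_interactive(void)          { /* ... */ }

    int main(int argc, char *argv[])
    {
        if (argc < 2 || argc > 3) {
            fprintf(stderr, "Usage: %s corpus [document]\n", argv[0]);
            return EXIT_FAILURE;
        }

        build_model(argv[1]);          /* steps 1-2: read corpus, count word frequencies */

        if (argc == 3)
            spellcheck_file(argv[2]);  /* steps 3-4: check the document's words */
        else
            spellcheck_interactive();  /* no document: read words from the terminal */

        return EXIT_SUCCESS;
    }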

For more details about the specific requirements, read on:

Grading

Have you read how assignments are graded? For this assignment, the expected point breakdown will be in the neighborhood of:

Functionality (95 points)

Code review (buckets weighted to contribute ~20 points). Here we will read and evaluate your code in areas such as:

Getting started

Finishing

Be sure to use sanity check to verify your program produces conforming output. When your code is finished, don't forget the all-important submit step to send your code to us for grading. If needed, familiarize yourself with our late policy.