Assignment 4: Into the void*

Due: Mon May 6 11:59 pm
Ontime bonus 5%. Grace period for late submissions until Wed May 8 11:59 pm

Assignment by Julie Zelenski and Michael Chang, with modifications by Nick Troccoli

Learning Goals

This assignment focuses on generics with void *, function pointers, and generic memory handling. You will be building your skills with:

  • the purpose and use of function pointers in C
  • using generic void* interfaces as a client
  • implementing a generic void* function using raw memory operations

Overview

In this assignment, you will implement simplified versions of the unix commands ls and sort. Your ls command will print out the contents of directories in various ways. Your sort command will print out the lines in files in sorted orders.

To get started on the assignment, clone the starter project using the command

git clone /afs/ir/class/cs107/repos/assign4/$USER assign4

The starter project contains the following:

  • util.c mysort.c, myls.c and Makefile: three files that you modify, and their Makefile for compiling
  • custom_tests: the file where you should add custom tests for your programs
  • test_binsert.c: a program that calls your binsert function in util.c for testing purposes. You do not need to comment or modify this file.
  • samples: a symbolic link to the shared directory for this assignment. It contains:
    • SANITY.ini and prototypes.h: files to configure and run Sanity Check. You can ignore these.
    • myls_soln, mysort_soln and test_binsert_soln: executable solutions for the code you will implement.
    • colors, numbers and names: sample text files for testing your unix commands
  • tools: contains symbolic links to the sanitycheck and submit programs for testing and submitting your work.

Testing and Debugging

As you work through the problems below, make sure to develop incrementally and use good debugging practices. Incremental development means implementing small pieces in isolation, thoroughly testing them, and then building on them only once you know they are correct. This ensures a working implementation at each step, and if you encounter a later bug, you can be confident it is a result of your most recent changes.

Testing

As part of this assignment, you should also add at least 10 additional tests of your own (3 - 5 per portion: myls, binsert and mysort) in the custom_tests file that show thoughtful effort to develop comprehensive test coverage. When you add a test, also document your work by including comments in the file that explain why you included each test and how the tests relate to one another. The tests supplied with the default SanityCheck are a start that you should build on, with the goal of finding and fixing any bugs before submitting, just like how a professional developer is responsible for vetting any code through comprehensive testing before adding it to a team repository. As a reminder, you can visit the working on assignments linked to from the Assignments page for information about SanityCheck and how to use it with custom tests. We recommend you run SanityCheck early and often (and remember, it even makes a snapshot of your code to guard against editing mishaps!). You can also find suggested testing strategies on the testing page, linked to from the Assignments page.

IMPORTANT NOTE: by default, only files in the starter project are submitted when running the submit tool. If you create any additional files for your custom tests that you need to include in your submission, you must execute the following command to add it to the list of files that will be submitted: git add MY_FILE_NAME. For instance, if you create a custom text file for testing called myfile.txt that you need to submit, the command would be git add myfile.txt. Then, when you run the submit tool, this file will be submitted as well. You can confirm files were properly included in your submission by submitting and then re-cloning your starter code project into a different location, and you should see your submitted files there.

Debugging

Good testing can help you uncover bugs that you can then track down and fix. When debugging, you should follow the debugging checklist listed in an earlier lecture, lab3 and assign3.

1. Code Study: scandir

In a later part of this assignment, you'll be writing your own version of the ls command. To do this, you'll need to be a client of the scandir library function, which is used to create a directory listing (an array of strings of the contents of a directory) for a given path. It also uses two function pointers: a filter function which controls which entries to include in the list, and a comparison function which controls the order of entries in the list.

Your understanding of scandir will be critical for implementing myls, so before using it, let's look at how the function is implemented.

A scandir implementation (from musl) is shown below. This function bares the inner soul of C (function pointers, dynamic allocation, triple star, oh my!) but your last three weeks of immersion into deep C waters has made you ready for this. Go slow, draw pictures, and ask questions about anything you don't understand.

 1  int musl_scandir(const char *path, struct dirent ***res,
 2        int (*sel)(const struct dirent *),
 3        int (*cmp)(const struct dirent **, const struct dirent **))
 4  {
 5      DIR *d = opendir(path);
 6      struct dirent *de, **names = NULL, **tmp;
 7      size_t cnt = 0, len = 0;
 8
 9      if (!d) return -1;
10
11      while ((de = readdir(d))) {
12          if (sel && !sel(de)) continue;
13          if (cnt >= len) {
14              len = 2*len+1;
15              if (len > SIZE_MAX/sizeof(*names)) break;
16              tmp = realloc(names, len * sizeof(*names));
17              if (!tmp) break;
18              names = tmp;
19          }
20          names[cnt] = malloc(de->d_reclen);
21          if (!names[cnt]) break;
22          memcpy(names[cnt++], de, de->d_reclen);
23      }
24
25      closedir(d);
26
27      if (errno) {
28          if (names) while (cnt-- > 0) free(names[cnt]);
29          free(names);
30          return -1;
31      }
32      if (cmp) qsort(names, cnt, sizeof *names, 
32                   (int (*)(const void *, const void *))cmp);
33      *res = names;
34      return cnt;
35  }

Review the code above and consider the issues we highlight below. You do not need to submit answers to these questions, just work through them for your own understanding.

  • Review man readdir to see the definition of struct dirent. An oddity to note is that the last struct field is an array of characters that is described as of "unspecified size". The d_name portion will be custom-sized to fit a particular entry's filename including a null terminator.
  • Decipher the types of each of the function's arguments. (Here is it pasted into decl.org). Draw a picture of the memory layout of the stack frame for scandir including all parameters and local variables. Use this diagram when tracing the code to see what it happening with the memory.
  • That triple-star res parameter is pretty scary. What is this parameter used for? Why does it need a triple-star? (Hint: take a look at where that parameter is used. What is the code doing there?)
  • How does a client indicate that they want no filtering applied to the directory listing? No sorting?
  • The choice of the names len and cnt is rather unhelpful in indicating their purpose. The names variable is also misleading -- is this an array of string names?
  • Line 12 uses continue, a C construct you may not have seen before. What is its purpose here? This construct is not used that often, but good to know what it does when you see it.
  • Does a non-zero result from the filter callback function sel cause an entry to be kept or discarded from the list?
  • Line 15 is a near-repeat of a piece of code we looked at in calloc (lab3). What is the purpose of this test and what would be the consequence of removing it?
  • The number of directory entries is not known in advance, so scandir allocates the array using the classic "resize-as-you-go" allocation strategy. This code should look familiar, but there are a few details to examine closely:
    • It calls realloc on line 16 without having made an initial call to malloc. Why is this valid? (Hint: review the realloc man page to see how this particular case is handled).
    • It uses the expression len * sizeof(*names) to compute the number of total bytes. If names is NULL, this expression seems to be dereferencing a NULL pointer, but remember that sizeof doesn't evaluate the expression, it just determines the expression's type and reports the size of that type.
    • The size expression could have equivalently been written as len * sizeof (struct dirent *). What is the benefit of using sizeof(*names) instead of explicitly naming the type?
  • On line 16, it assigns the return value from realloc to tmp and two lines later copies from the pointer from tmp to names. Why does it not just assign directly to names?
  • Lines 20-22 make a heap copy of a struct dirent. Because of the variable-sized d_name field (see first issue above), it allocates and copies a custom size.
  • Line 27 refers to a mysterious errno which comes out of left field. Read man 3 errno to learn more about its purpose and function. If an allocation failure occurred, what will be the value of errno? (Hint: read the NOTES section of the malloc man page)
  • Line 32 is a little hard to parse, but it is applying a typecast to the function pointer being passed to qsort. Try compiling the code both with and without the cast. What warning/error do you get without the cast? (Casting a function pointer is a sketchy thing to do, and it should generally be avoided, but it's understandable why they chose to do so here.)
  • The scandir filter function receives its argument as a const struct dirent *; the comparison function receives its arguments as const struct dirent **. Why the inconsistency?
  • Trace the various malloc/free calls to understand what memory is allocated and deallocated within the function. What deallocation remains to be done after the function exits? What are the proper steps for the caller to properly free that memory?

Forging your way through this is a significant accomplishment. There is a lot going on in this code and you should feel crazy-proud of being able to dissect such a dense passage! Armed with that understanding, you're ready to move onto implementing myls. Your understanding of scandir will be critical for this next step.

2. Implement myls

You have used ls many a time to list a directory's contents. Now it's your turn to implement a simplified version of this workhorse utility, which is also a nice exercise with function pointers.

The myls program operates similarly to standard ls but with many simplifications and some differences. While it may be helpful to think of myls as the same as standard ls in spirit, please don't mistakenly attempt to match the more complex features of standard ls. The list below enumerates the required features for myls. If in doubt about the expected behavior for myls, observe the behavior of the sample solution in the samples folder rather than compare to standard ls.

These are the required features for myls:

  • myls takes zero or more directory paths as arguments. It lists the directory entries from each path. If invoked with no arguments, myls prints the entries from the current directory (.)
  • myls only supports directory paths as arguments. If an argument refers to a file or non-existent path, it should print an error message and skip that argument. You should match the error message exactly to the solution using the error function with the text "cannot access _____", filling in the name of the invalid argument, along with an error code of 0 and a status code of 0 (so the program still continues running).
  • myls prints the entries one per-line.
  • myls ignores entries whose names start with . unless invoked with the -a flag
  • myls adds a trailing slash when printing the name of an entry that is itself a directory. (Note: the provided code for discerning files and directories categorizes symbolic links, such as the samples directory in the assignment folder, as non-directories, and it's fine for your program to treat symbolic links as non-directories (this is what the sample solution does)
  • myls prints the entries in a directory sorted in order by name. Names are compared lexicographically (i.e. strcmp). If invoked with the -z flag, the sort order is instead by type (directories ahead of non-directories). Entries of the same type are sorted by name.
  • myls supports no command-line flags other than -a and -z
  • myls should call the standard scandir function and not re-implement its functionality. Given that scandir is most of an ls program already, you just need to write the filter/comparison functions, make a single call to scandir, and print the resulting list. As an additional note, the man page for scandir contains a short example program. Minor heads up: that example program happens to print the entries in reverse order which may not be what you want/expect. Be wary of dropping in code from the documentation without reviewing it first to know what you are getting and whether you will need to adapt it.

Note: an important restriction on code you write for this problem (and for this assignment in general) is you are not allowed to use the versionsort or alphasort functions that are part of the C library as part of this assignment. You should implement your own comparison functions. A solution that violates this restriction will receive zero credit, so please verify your approach is in compliance!

myls.c is given to you with a small amount of code to handle the command-line arguments. Before starting, first read and understand the given code, work out how to incorporate it into your plans, and finally add comments to document your strategy and document the provided code. The starter code for this assignment is intended to help you get going, but you are free to remove/change this code as you prefer.

TIP: you can use a typedef to give a nickname to a function pointer type to avoid repeating the raw syntax. It also makes it easier to make a variable of that function pointer type (function pointers can be variables just like any other type!). Check out the FAQ at the bottom of the page for how to do this - it can help make your code cleaner. Also check out the starter code for mysort.c for an example.

Some topics to explore as a self-test: (do not submit answers)

  • Processing command-line arguments is crufty work. The GNU extension getopt helps to clean it up somewhat. Use man 3 getopt to learn more about how it works. How is an invalid option detected/reported when using getopt?
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

3. Code Study: bsearch

Next you'll be studying the following code in order to write a generic binary insert function to efficiently insert unique elements into a sorted array using binary search. You will use this as part of your implementation of the sort command later in the assignment. Writing a fully correct binary search is notoriously difficult, and implementing a generic void* function only adds to the challenge. (Need a refresher on the binary search algorithm? Check out this link). Below, we've provided Apple's code for generic binary search, bsearch. You'll be modifying this implementation's code as part of your own binsert function later, so it's important that you know how it works. Review carefully! If you did not get to fully work through the lab4 exercises on void*, now is a good time to shore up that knowledge before moving forward.

void *apple_bsearch(const void *key, const void *base, size_t nmemb,  
               size_t width, int (*compar)(const void *, const void *)) {
    for (size_t nremain = nmemb; nremain != 0; nremain >>= 1) {
        void *p = (char *)base + (nremain >> 1) * width;
        int sign = compar(key, p);
        if (sign == 0) {
            return p;
        }
        if (sign > 0) {  /* key > p: move right */
            base = (char *)p + width;
            nremain--;
        }       /* else move left */
    }
    return NULL;
}

Consider the following questions to verify your understanding (do not submit answers):

  • In lab1, we read Joshua Bloch's article Nearly All Binary Searches and Mergesorts are Broken, talking about the issue of implementing a midpoint calculation. Does Apple's bsearch exhibit the bug called out in the article? Justify why or why not.
  • The comment /* else move left */ appears to apply to an empty statement. How does the search move left when sign is negative?
  • Pointer arithmetic on a void* is illegal, thus the casts to char*. In lab4, you reviewed musl_memmove, which copied its void* arguments into local variables of type char* at the start of the function and thereby avoided the need for further casting. In contrast, Apple's bsearch maintains its arguments as void* and explicitly casts before each pointer arithmetic. The argument behind this approach is the void* type communicates the special nature of this pointer, rather than misleadingly labeling it as an ordinary c-string. Keeping it a void* also means the compiler will complain should you accidentally dereference or apply pointer arithmetic. Each manipulation of the void* requires an explicit cast. That typecast confirms the programmer's deliberate intent and serves as documentation.
  • Assume the variable void *arr is the base address of the array, void *found is the address of the matching element and size_t width is the size of each element in bytes. What is the C expression that converts found into its corresponding array index?

Forging your way through this is a big accomplishment. There is a lot going on in this code and you should feel proud that you can understand library code like this to implement a generic binary search! Armed with that understanding, you're ready to move onto implementing binsert, a variation of Apple's bsearch.

4. Write binsert

lfind and lsearch provide two variants of a generic linear search. These functions are not standard C, but are commonly available, such as on our myth systems. Read man lsearch for an introduction. The feature that distinguishes lsearch from lfind is that lsearch will add the search key to the array if not found. This is a handy convenience!

Your next task is to write the generic function binsert which is a bsearch variant with this same functionality; a call to binsert will perform a binary search for the key and if no matching element is found, it will insert the key into the proper position in the sorted array. This will come in handy for the final part of the assignment, where you implement your own version of the sort command. Consider this function prototype for binsert:

void *binsert(const void *key, void *base, size_t *p_nelem, 
   size_t width, int (*compar)(const void *, const void *));

Here are the details of the function's operation:

  • The binsert arguments are patterned after the arguments to bsearch (review man bsearch). The one difference is that the number of elements is passed as a pointer. This is necessary as binsert will update the count when inserting a new element into the array.
  • If binsert does not find a matching element, it inserts the key into the array at the proper position and increments the number of elements. It is the client's responsibility to ensure the array has sufficient space to accommodate a new element. binsert does no allocation, reallocation, or deallocation of the client's array memory.
  • The function returns a pointer to the matching array member or to the newly added member if no existing match was found.
  • You can copy/paste the code from Apple's bsearch above into binsert as a starting point. It would be ideal for binsert to simply call bsearch and not repeat its code, but the standard bsearch doesn't expose the necessary information from an unsuccessful search that would allow you to properly position the new element. If you called bsearch and got a negative result, you would have to make a second search yourself to find the position, which means redundant code and duplicate execution! By subsuming the bsearch code into your binsert, the function will search once using one code path. HINT: you may not need to make significant modifications to the provided code. Our solution adds/modifies less than 10 lines.
  • If you need to move elements in the array to make room, instead of a loop to move array items one by one you should use memmove, which can scoot the whole array over in one call!

Write your implementation in the util.c file. You will write and test this function in isolation, and then use the function later when writing the mysort program. You can test your binsert function using our provided test_binsert.c program. The test_binsert program is integrated with sanitycheck.

Before continuing: have you thoroughly tested your implementation of binsert? An essential part of the development process is incremental development, where you implement small pieces in isolation, thoroughly test them, and then build on them once you know they are correct. This way, if you encounter a later bug, you can be confident it is a result of your most recent changes. mysort builds on this code, so make sure you are confident in its correctness before moving on.

5. Implement mysort

For the final part of the assignment, you'll implement your own version of the Unix sort command, called mysort, that uses your binsert function as a client. sort is another filter command like uniq and tail, which you implemented versions of in the last assignment. The standard Unix sort reads input line-by-line and then prints out the lines in sorted order. It supports various command-line flags that control the sort behavior. Consider the following sample uses of sort:

myth$ cat samples/colors
red
green
green
red
blue
blue
blue
red

myth$ sort samples/colors 
blue
blue
blue
green
green
red
red
red

myth$ sort -u -r samples/colors 
red
green
blue

The mysort program operates similarly to standard sort but with many simplifications and some differences. While it may be helpful to think of mysort as the same as standard sort in spirit, please don't mistakenly attempt to match the more complex features of standard sort. If in doubt about the expected behavior for mysort, observe the behavior of the sample solution rather than compare to standard sort.

Here are the required features for mysort:

  • mysort reads one file; either the named file (if specified as an argument) or standard input (if not).
  • The default sort order for mysort is lexicographic (alphabetical) order.
  • If invoked with the -l flag, the sort order is by increasing line length.
  • If invoked with the -n flag, the sort order is by string numerical value (applies atoi to each line and compares numbers).
  • There is no tie-breaker. Duplicate lines, i.e. lines that compare as equal according to sort order, can be output in any order.
  • If invoked with the -r flag, the sort order for mysort is reversed. (Hint: sort the input as you would normally, and just change the order you print the results)
  • If invoked with the -u flag, mysort discards duplicate lines. The sorted output contains one instance of each unique line from the input. The sort order determines which lines are duplicates (e.g. if sorting by length, two lines of same length are considered duplicates).
  • The flags to mysort may be used alone or in combination. If both -l and -n used, whichever flag is last on the command line "wins" as sort order.
  • mysort supports no command-line flags other than -l -n -r -u.

Internally, mysort should have the following behavior:

  • mysort reads a line into a stack array using fgets. This stack array should be sized to a large maximum size (see the MAX_LINE_LEN constant). We will not test on inputs containing any lines longer than this maximum. You may furthermore assume that all inputs will end with a final newline. This avoids having to make any special case out of the last line in the input.
  • After reading a line you intend to store, it should be copied to dynamically-allocated storage of the appropriate size. The function strdup will be handy here.
  • mysort should be able to handle an input containing any number of lines. Such a large array of lines could much too big for the stack, so this array must be heap-allocated. Because the number of lines cannot be determined in advance, mysort should allocate the array to a minimum initial number (see MIN_NLINES constant) and then each time it fills up, double the size.
  • The -u option of mysort must make repeated calls to your binsert function and only store lines that are unique. It should not store duplicate lines for later removal, and should make no calls to qsort.
  • If invoked without the -u flag, mysort should make exactly one call to qsort and no calls to binsert.

Important Note: in this implementation, lines should only be heap allocated if you intend to store them. In other words, you should not allocate a line on the heap if it will not ultimately need to be saved and printed out later. If you have significantly higher memory usage than the solution, issues with this are likely the reason why.

This may seem similar to the previous assignment, and you may initially be inclined to copy-paste code over, but realize there are some significant differences in this implementation. Unlike the last assignment, this assignment requires using a different (and also valid) strategy for reading text of unknown size - allocating it on the stack, and then only using heap allocation once you know the full size, can allocate exactly what you need, and know you will need to store it. There are pros and cons to the various approaches; for the last assignment, you potentially allocated slightly more than you needed on the heap, but you didn't use a ton of space on the stack first. Conversely, in this case you use significant stack space to store the initial string, but as a result only allocate exactly the heap memory that you need. We want you to get exposure to both valid patterns.

mysort.c is given to you with a small amount of code to handle the command-line arguments. Before starting on the program, first read and understand the given code, work out how to incorporate it into your plans, and finally add comments to document your strategy and the code. The starter code for this assignment is intended to help you get going, but you are free to remove/change this code as you prefer.

Some topics to explore as self-test: (do not submit answers)

  • Processing command-line arguments is crufty work. The GNU extension getopt helps to clean it up somewhat. Use man 3 getopt to learn more about how it works. How is an invalid option detected/reported when using getopt?
  • Note how a typedef is used in mysort.c to give a nickname to a function pointer type to avoid repeating the raw syntax. It also makes it easier to make a variable of that function pointer type (function pointers can be variables just like any other type!)
  • Do you see anything unexpected or erroneous? We intend for our code to be bug-free; if you find otherwise, please let us know!

One of the key goals for mysort is to cleanly handle the mix of sorting options without repeating code or overly complicating things. In particular, the 4 command-line flags can each be turned on independently, making a total of 24 combinations and if you aren't careful about it, you can end up making a distinct case out of each combination and debugging 16x more code than you needed to. Instead, take care to unify all the tasks that are in common, making only the tiniest of detours where needed to add code to handle their differences. You may have iterate a little to work out the details, but you can arrive at a very pleasing end result if you put your mind to it!

Memory/Pointers Advice

  • Stack versus heap. Take care to understand proper use of stack versus heap allocation and when each is appropriate. As a general rule of thumb, unless a situation requires dynamic allocation, stack allocation is preferred. Often both techniques are used together in a program. For example, it is desirable that a dynamically-allocated C-strings be of the proper size, e.g. the string "apple" needs 6 characters worth of storage, no more, no less. A useful idiom to learn is allocating a maximum-sized stack buffer to temporarily store a string of indeterminate length, which is later copied to a dynamically-allocated buffer of the correct size once the length is known. This is how the line-reading code for mysort should operate.
  • Dynamic resizing. The mysort program doesn't know the count of lines in the input in advance, so it allocates based on an estimate and grows as needed. Doubling-in-size is a classic technique you'll see again and again such as in the buffer-handling for last assignment's read_line and the entries gathered by musl_scandir.
  • Assert allocation requests. It's not likely you will exhaust the entire heap on a modern system, but NULL can also be returned when the heap is corrupted. Either way,there is nothing good that comes after a failed allocation, so best to just call a halt to things before it goes from bad to worse.
  • Pointee types should communicate the truth. If you know the type of a pointer, by all means, declare it with its specific type. Use void* only if the pointee type is not known or can vary. Conversely, don't declare what should be a void* pointer with a specific pointee type. Declaring a void* as a char* misleads the reader into thinking this pointer is a C-string and allows it to be mis-used as such with no compiler warning.
  • Use pointer arithmetic sparingly. Even though you can access ordinary array elements or struct fields by calculating an offset from the base address, you should use array subscripts and struct field names for readability and safety. Work within the type system whenever possible and only drop down to low-level typecasts and pointer arithmetic when raw memory is all you've got.
  • Prefer assignment to raw memcpy. Similarly, although memcpy can be used to do any kind of assignment, that doesn't mean it always should be used. Consider the three following ways of assigning an integer:
int dst, src;
void *pDst = &dst;
void *pSrc = &src;

dst = src;                          // option A
*(int *)pDst = *(int *)pSrc;        // option B
memcpy(pDst, pSrc, sizeof(int));    // option C

All three options accomplish the same thing (copying a 4-byte integer from src to dst). Option A, the ordinary assignment statement is direct, clean, and safe. This is always your best option as long as you are able to work within the type system. If you are forced to operate through a void*, you must instead consider options B and C. Option B, assignment using a typecast, is your second best choice. This option is readable, efficient, and relatively safe (given the correct typecast). Option C operates at the lowest-level, with no type safety whatsoever. A mismatch in the pointer types or size is silently accepted, yet causes havoc. When trying to read/write contents of a void* , go with Option B if possible. Drop down to the low-level memory copying of Option C only when you have no other choice. As a rule of thumb: if the size argument to memcpy is a compile-time constant (and especially so if it uses sizeof), this indicates you shouldn't be using memcpy/memmove. You only need to invoke raw byte-copying when the type/amount of memory being copied is determined at runtime.

  • Be wary of typecasts. Use typecasting sparingly. Your typecast defeats the type system and coerces the compiler to go along with it without complaint. With this power comes great responsibility. Before introducing one, consider: why is this cast necessary? what happens if it is not present? is there a safer way to communicate my intention? how might I work within the type system rather than against it? (i.e. correcting a mismatch in types rather than covering up with a cast). You should never cast away const-ness (fix the types instead!) and casting a function pointer is an extremely sketchy move (fix the function prototype instead!). You also never need to cast anything to a void*, since it is the universal recipient and is compatible with any pointer type (such a cast is redundant and can be removed). You will need a few typecasts, but add them exactly and only where you must. If your code repeatedly uses an expression that contains a necessarily unavoidable typecast, consider placing that expression into a shared helper to be invoked everywhere rather than repeat code.

Submitting

Once you are finished working and have saved all your changes, check out the guide to working on assignments for how to submit your work. We recommend you do a trial submit in advance of the deadline to allow time to work through any snags. You may submit as many times as you would like; we will grade the latest submission. Submitting a stable but unpolished/unfinished is like an insurance policy. If the unexpected happens and you miss the deadline to submit your final version, this previous submit will earn points. Without a submission, we cannot grade your work.

Only the following files are submitted by the submit tool for this assignment: util.c, mysort.c, myls.c, custom_tests

We would also appreciate if you filled out this homework survey to tell us what you think once you submit. We appreciate your feedback!

Grading

Here is the tentative point breakdown:

Functionality (95 points)

  • Sanity cases (22 points) Correct results on the default sanity check tests.
  • Comprehensive/stress cases (30 points) Correct results for additional test cases with broad, comprehensive coverage and larger, more complex inputs.
  • Clean compile (2 points) Compiles cleanly with no warnings.
  • Clean runs under valgrind (15 points) Clean memory reports for all programs when run under valgrind. Memory errors (invalid read/write, use of freed memory, etc) are significant deductions. Memory leaks are a minor deduction. Every normal execution path is expected to run cleanly with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind.
  • Reasonable efficiency (15 points) We expect your programs and code to be reasonably efficient in use of time and space. Full points are awarded for being on par (2-3x) with the sample program; deductions will apply for immoderate use of memory/time. There is no bonus for outperforming the benchmark (and efforts to do so could detract from your code quality...). Use valgrind and time to measure your programs' efficiency vs. the solutions'.
  • custom_tests (5 points) Your custom_tests file should include tests of your own that show thoughtful effort to develop comprehensive testing coverage. Please include comments that explain your choices. We will run your custom tests against your submission as well as review the cases to assess the strength of your testing efforts.

Code review (buckets together weighted to contribute ~20 points)

  • Use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, correct use of pointee types and typecasts, const correctness, and so on.
  • Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. When you need the same code in more than one place, you should unify, not copy and paste. If the C library provides functionality needed for a task, you should leverage these library functions rather than re-implement that functionality.
  • Style and readability. We expect your code to be clean and readable. We will look for descriptive names, defined constants (not magic numbers!), and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
  • Documentation. You should document both the code you wrote and what we provided. We expect program overview and per-function comments that explain the overall design along with sparing use of inline comments to draw attention to noteworthy details or shed light on a dense or obscure passage. The audience for the comments is your C-savvy peer.

For more information about style guidelines to use for assignments, take a look at the CS107 Style Guide, linked to from the Assignments page.

On-time Bonus (+5%) The on-time bonus for this assignment is 5%. Submissions received by the due date earn the on-time bonus. If you miss the due date, late work may be submitted during the grace period without penalty. No submissions will be accepted after the grace period ends, please plan accordingly! The bonus is calculated as a percentage of the point score earned by the submission. See the General Information handout for more information about the late policy.

Post-Assignment Check-in

How did the assignment go for you? We encourage you to take a moment to reflect on how far you've come and what new knowledge and skills you have to take forward. Once you finish this assignment, you will have written your own implementation of two workhorse Unix commands using raw memory, function pointers, and generics. Way to go! You should now have great command of memory management and the client use of a void* interface and the generic implementation. The critical mastery is correctly passing/receiving the void*s flying around: what level of indirection to use, what typecasts are needed, using * and & in exactly and only the necessary places. You should know without question how to determine the necessary level of indirection and correctly apply it. It's also valuable to understand and be able to reason through the consequences of getting it wrong.

To help you gauge your progress, for each assignment/lab, we identify some of its takeaways and offer a few thought questions you can use as a self-check on your post-task understanding. If you find the responses don't come easily, it may be a sign a little extra review is warranted. These questions are not to be handed in or graded. You're encouraged to freely discuss these with your peers and course staff to solidify any gaps in you understanding before moving on from a task. They could also be useful as review before the exams.

  • To read the input, the mysort program reads each line into a maximum-sized piece of stack memory used temporarily and the copies to right-sized dynamic memory for persistent storage. This stack-heap dance is a common technique in situations like this. Explain why this approach is necessary/appropriate given the characteristics of the stack and heap.
  • The strcmp function almost matches the prototype for the comparison function argument of bsearch and qsort. A sketchy typecast (such as the one exhibited by scandir) could be used to hush the compiler's complaint about the mismatch, but this would be a huge mistake. strcmp is pretty much never the right function to use as a callback comparison function. Explain why not.
  • Read the man page for lfind and describe how to use lfind with an appropriate callback to re-implement the functionality of strlen.

Further Reading

We highly recommend you check out: Engineering a sort function
This article is a great read on the thought and engineering that went into the qsort library function. Jon Bentley and Doug McIlroy are two Hall of Famers from the former Bell Labs and anything they write is worth reading!

Frequently Asked Questions

The last sanity check test's output is really long. Is there anything I can do to more easily read it or other test output?

The last test uses a long file of numbers for sorting. If you pass the test, the entire output will not be printed, but if your output does not match then it will print the entire diff, which can be annoying. One tip to better navigate potential long sanity check outputs is either to just scroll in your terminal if you're trying to view previous test output, or save the sanity check output to file using > and access it that way. One note is that saving it to file may display strange characters due to the formatting and coloring of the output. To get around this, you can open the file with proper formatting by using the following command: less -R mysanityfile.txt, where mysanityfile.txt is the name of the file where you saved the sanity check output. Then, you can use the mouse or arrow keys to scroll through the output more easily. To exit this view, press q. To summarize these steps:

myth$ tools/sanitycheck > outputfile.txt    # run sanitycheck and save its output to outputfile.txt
myth$ less -R outputfile.txt                # view outputfile.txt in its own scrolling view

Since a function pointer is just a variable, how do I declare a parameter/variable of function pointer type?

The syntax for this is a bit ugly. The name of the variable is buried in the middle of the declaration and you have to read from inside out. The declaration below says the variable c is a pointer to a function that takes two const void* parameters and returns an int.

int (*c)(const void *, const void *) = my_compare;

Making a typedef for the function pointer type is a nice way to clean this up. The typedef below establishes cmp_fn_t as a nickname for int (*)(const void *, const void*). This typedef means you can now use cmp_fn_t as the type for a variable, parameter, or return type, in place of the longhand form.

typedef int (*cmp_fn_t)(const void *, const void *);

cmp_fn_t c = my_compare;

See your C reference or web search for more info on typedef. A neat online resource for unraveling complex C declarations is cdecl.org.

How do I resolve the compiler warning pointer of type void * used in arithmetic?

The pointer arithmetic expression pointer + n implicitly scales n by the size of the pointee before adding it to pointer. If pointer is a void*, its pointee size is unknown, and the compiler will complain because it cannot apply the proper scaling. To perform pointer arithmetic on a void* you must first cast it to indicate the desired scaling. The typical approach is to cast to char* to get no scaling and then manually multiply yourself.

My custom sanitycheck tests for mysort report mismatch due to re-ordering of equal lines. Is this a problem?

There is no required tie-breaker for equal lines - the spec allows mysort to print them in any order. But sanity check is looking for an exact match so you should construct your custom tests so as to not trip on spurious discrepancies. In grading, our tests will accept any valid reordering of equal lines.