Assignment 3: CVector and CMap
Due date: Mon Feb 02 11:59 pm - Hard deadline: Fri Feb 06 11:59 pm
Assignment by Julie Zelenski
[Quick Links: Implementation details, Advice page, Grading]
Completing this assignment should provide you with:
- a clear understanding of the organization and management of raw memory
- extensive practice using low-level
void* memory operations (malloc/realloc/free, memcpy/memmove)
- experience implementing a generic
Last time you were the client of CVector and CMap, this time you will implement CVector and CMap. It's simple enough to understand the goal, but handling the nitty-gritty internals of managing
void*s will stretch your pointer skills to the limit. Start early. Draw pictures. Ask questions. Exercise on trivial test cases first. Use gdb and Valgrind early and often. Test more thoroughly than you have ever before in your life.
- The cvector.h interface describes the required behavior as seen by the client. Your implementation must adhere to this interface. You should change nothing in the cvector.h file.
The internal representation of a CVector uses one contiguous block of memory to store the elements, laid out like an ordinary C array. It is not an array of pointers to elements stored elsewhere, the elements themselves are stored contiguously. There is not the overhead of an extra level of indirection per-element. If the CVector elements are integers representing test scores, the internal storage of the elements would look like this:
A CVector is created empty (i.e. containing no elements). The capacity_hint parameter to
cvec_create dictates the initial storage allocation. As elements are appended or inserted, the capacity is consumed. Each time the capacity is exhausted, the CVector doubles the capacity to allow for future growth.
- The CVector elements are of client-chosen type. Elements are passed/returned by address (
void*). The CVector does not have knowledge of the internal structure of the client's chosen element type, it merely copies the specified number of bytes from the client's address into the CVector's internal storage. The CVector will call the client's cleanup function on the element when it is removed/replaced/disposed.
- The CVector operations are layered upon functions from the standard C library. Dynamic allocation is managed with malloc/realloc/free. Raw bytes are copied using memcpy/memmove. Sorting and searching use lfind/bsearch/qsort. Have your C reference or man pages at the ready!
- The CVector tries to defend against improper client-use by watching for indexes out of bounds or negative sizes. Your implementation should use
assert to detect and report fatal errors. The CVector interface lists exactly what conditions must be asserted.
- The cmap.h interface describes the required behavior as seen by the client. Your implementation must adhere to this interface. You should change nothing in the cmap.h file.
- The internal representation of a CMap is a hashtable. Our starter code provides a string hash function for your use. Hash collisions are resolved by chaining, i.e., all keys that hash to the same code are grouped in one bucket. The entries within a bucket are stored in a linked list.
- A CMap is created empty (i.e. containing no entries). The capacity_hint parameter to
cmap_create dictates the initial number of buckets. Although you could use a CVector for the array of buckets, there isn't much advantage since the array rarely changes size and has a fixed type (each entry in the bucket array is a pointer to cell). In this situation, using an ordinary C array is simpler and more type-safe.
- The CMap monitors the load factor, which is the number of entries divided by the number of buckets. This is a measure of how "full" the map is relative to its capacity. When the load factor exceeds 1.5, the CMap expands the capacity by increasing the number of buckets and rehashes to redistribute the entries across the larger number of buckets. The new number of buckets should be three times the old number + 1, e.g. if the initial capacity was set to 100, once the CMap contains 150 entries, the load factor triggers an expansion in capacity to 301 buckets.
- Rehashing must be efficiently implemented. It makes a single linear pass over the current entries to re-wire each into its new bucket. It should not make new copies of entries nor allocate or deallocate any linked list cells. It should not allocate temporary arrays of pointers/entries.
There are important differences in how keys and values are handled by the CMap. CMap keys are C-strings (
char*). The CMap duplicates the key string by allocating internal storage for the characters and copying them. The CMap owns that storage and cleans up that storage when an entry is removed/replaced/disposed. CMap values are of client-chosen type. Values are passed/returned by address (
void*). The CMap does not have knowledge of the internal structure of the client's chosen value type, it merely copies the specified number of bytes from the client's address into the CMap's internal storage. The CMap will call the client's cleanup function on the value when it is removed/replaced/disposed.
Internally the CMap stores the key and value together. It manually builds a blob of memory for a linked list cell which contains the link pointer, the characters for the key string, and the client's value laid out contiguously. For example, consider a CMap associating the average GPA (value type double) with a group's name. If "lsjumb" and "swe" hash to the same code, then the linked list for this bucket could look like this:
Note the value and the key are not pointers to data stored elsewhere, the key and value data are stored directly in the cell, without the overhead of an extra level of indirection for either the key or the value. This means the memory for each entry is exactly the sum of sizeof(pointer) + storage for characters of key + size in bytes of value. You cannot define a struct that describes this cell (as the size of the value per-CMap and length of string per-key vary!) so you will have to manually construct the cell at runtime using your knowledge of how the memory is laid out. This part is tricky! Think through how you will construct and deconstruct a list cell before writing any code.
The CMap uses assert to defend against various client misuses. The CMap interface lists exactly what conditions must be asserted.
Requirements that apply to both
- Testing. The starter project contains the test programs
synonyms. These programs are helpful, but are by no means comprehensive. Part of your job is figuring out where additional tests are needed. You can change/extend/cannibalize the given test programs and add new test programs of your own. Edit the Makefile to name your additional test programs and they will be built automatically by make. You can also apply custom sanity check to your test programs to compare your output against a version using our sample CVec/CMap--- very handy!
- Memory. We expect CVector and CMap to run cleanly under Valgrind with no errors nor leaks reported.
- Efficiency. The CVector should have array-based big-Oh performance (e.g. insert/remove is O(n), append/replace is O(1)). The CMap should have hashtable-based big-Oh performance (e.g. O(1) for put/get/remove). The memory required per-entry should be as diagrammed above without extra levels of indirection or extraneous padding.
- Code unification/reuse. Make a vow to never rewrite or copy/paste code. Instead decompose and unify. For example, CVector insert and append have a lot in common, and thus should share code, not duplicate it. Even a small bit of code, such as the tricky expression used to calculate the nth address for CVector, is worth factoring out to a helper. The use of the helper improves readability everywhere and you no longer have to stare down that intricate pointer arithmetic in multiple places to verify it is correct each time.
- Design/style/readability. When writing this kind of gnarly memory-handling code, it is incredibly important that you keep the code as clean and readable as possible. We'll be watching!
- Don't miss our page of advice and hints specific to this assignment!
Read our page on how assignments are graded for background information on our grading process and policies.
Functionality (125 points)
- CVector/CMap layout. The data layout for storing CVector and CMap elements is a strict requirement. The correct layout will receive all its points; submissions in violation of the spec will earn a very significant deduction off the top. Take care to understand the implementation details section and compare to your memory model to the given diagrams to ensure you are in compliance!
- CVector single operations. (35 points) Test the behavior of each CVector operation in isolation as much as possible, i.e. use cvec_search without mixing in other operations. Each test will start by allocating and appending some elements, and then go on to thoroughly exercise just one operation on the CVector.
- CVector combined operations. (5 points) Test mixed sequences of all CVector operations.
- CMap single operations. (35 points) Test the behavior of each CMap operation in isolation as much as possible.
- CMap combined operations. (5 points) Test mixed sequences of all CMap operations.
- CVector/CMap assertions. (10 points) Verify asserts are raised on required errors described in the CVector and CMap interfaces.
- CVector and CMap stress cases. (10 points) Test a mixed sequence of all operations from both CVector and CMap on large data sets.
- Clean compile. (2 points) We expect your code to compile cleanly without warnings.
- Clean run under valgrind. (12 points) We expect your program to run cleanly under valgrind during all normal operation sequences. Memory errors are significant issues, leaks are more minor.
- Reasonable efficiency. (12 points) We expect your implementations to meet the required big-Oh performance. Where efficiency details are specified (e.g. CMap rehash reuses cells), you must meet those requirements. Your internal memory storage should be structured exactly as diagrammed previously (e.g. CMap cells consist of link pointer + key chars + value).
Code quality (buckets weighted to contribute ~30 points)
Here we will read and evaluate your code in areas such as:
- Style and readability. We expect your code to be clean and readable. We will look for descriptive names, helpful comments, and consistent layout. Be sure to use the most clear and direct C syntax and constructs available to you.
- Clean use of pointers and memory. We expect you to show proficiency in handling pointers/memory, as demonstrated by appropriate use of stack versus heap allocation, no unnecessary levels of indirection, pointee types and typecasts used correctly, and so on. Show you are wise in the ways of pointers!
- Program design. We expect your code to show thoughtful design and appropriate decomposition. Data should be logically structured and accessed. Control flow should be clear and direct. Common code between the operations should be factored out and unified, not duplicated. Add private shared helpers to enable code-sharing where appropriate. When the C library provides needed functionality, you should leverage these library functions rather than re-implement that functionality.
The assign3 project contains the source files, our test programs, and a Makefile. Check out the starter project from your cs107 repo using the command
hg clone /afs/ir/class/cs107/repos/assign3/$USER assign3
When finished, use submit to turn in your code for grading. The sole output conformance requirement is to be sure there is no output of any kind from your operations. Double-check that you have removed/disabled all print statements-- any extraneous output can interfere with the autotester and cause loss of points.