Assignment 7 advice
Written by Julie Zelenski
Choosing your design
Unlike our previous assignments which have dictated narrow design requirements, the allocator design is very open-ended. You will choose your own data structures, strategies, and implementation without our direction or interference. It is a wonderful opportunity (but a little frightening, too). The design process can be fun and creative but it is best to start early to have enough time to think through the options, weigh trade-offs, make decisions, experiment, and iterate until your design solidifies into a success.
Here are just a few of the decisions you will have to consider:
- How to store and access housekeeping data: block headers and/or footers, separate table/list of block information?
- How to manage the free list: implicit? explicit? sorted? array/list/tree/hashtable? segregated by size?
- Whether and how to split and coalesce blocks
- How to find available space: first-fit, next-fit, best-fit, or something else entirely
- Are the blocks themselves segregated by size in the storage? Do they have buddies?
- Will you use the same strategy for all sizes of request, or have different handling by size?
- Will you implement a standalone realloc or just delegate to malloc/memcpy/free?
Section 9.9 of Bryant and O'Hallaron contains helpful reading on heap implementation strategies and presents partial code for an allocator case study. This allocator uses boundary tags and an implicit free list and demonstrates some worthwhile motifs such as abstracting the pointer arithmetic/typecasts (although we strongly recommend using inline functions instead of macros) and strategically installing a few extra blocks to avoid making special cases. However, this design is not a particularly great performer, the code is missing some key pieces, and I know of at least one bug. Consider it code you can learn from, but be wary of adopting from it without careful scrutiny of what you're getting. We expect you are itching to put your hard-earned CS107 superpowers to use creating your own version from scratch rather than following along with the text anyway! If you do need to borrow from it to get a leg up, we have a specific carve-out in our Honor Code policy that allows you to incorporate code from the textbook (with appropriate citation). Use of all other code is entirely off-limits -- you should not be reviewing or adopting any code for your allocator from any other resource.
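To make the inline-function motif concrete, here is a sketch of one possible header encoding, assuming a hypothetical design where each block carries an 8-byte header storing its size with the in-use flag tucked into a low-order bit (all of the names and the layout below are invented for illustration, not a prescription):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical 8-byte header: block sizes are always multiples of 8,
// so the low-order bit is free to hold the in-use flag.
typedef size_t header_t;

static inline size_t hdr_size(header_t h) { return h & ~(size_t)0x7; }
static inline bool hdr_used(header_t h) { return h & 0x1; }
static inline header_t hdr_pack(size_t size, bool used) {
    assert(size % 8 == 0);   // sizes are assumed rounded up to 8
    return size | (used ? 0x1 : 0);
}

// The payload sits immediately after its header.
static inline void *payload_of(header_t *hdr) {
    return (char *)hdr + sizeof(header_t);
}
static inline header_t *header_of(void *payload) {
    return (header_t *)((char *)payload - sizeof(header_t));
}
```

Because these are inline functions rather than macros, the compiler type-checks the arguments, and you can still step through them in gdb when built without optimization.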
Our performance test will measure utilization and throughput on a set of varied script files. These grading scripts will be similar in spirit to the published samples, i.e., a mix of real-world use cases and generated patterns, but will not be identical. Your allocator should be designed to perform well in a variety of situations, rather than be tailored to fit exactly one set of known inputs.
The sample scripts reflect a range of scenarios for how the allocator might be used. Studying the relative performance shows which situations your allocator handles efficiently and which it does not, allowing you to identify opportunities for optimization. For example, if you have bottlenecks in realloc, you will see them reflected in poorer performance on the scripts that make heavy use of realloc as opposed to those that use it little or not at all. Sometimes improving the results for one particular script comes at the expense of degrading others, so you'll have to weigh the results on balance. The per-script results tell you where you have strengths/weaknesses, but your allocator will be evaluated on aggregate performance, not any one script in isolation.
Some suggestions to help get you thinking about designing for performance:
- How much internal fragmentation does your design carry? This overhead is in the noise for large blocks, but not so for small blocks. Can you squeeze some bytes out of the general per-block overhead or make a special case out of small blocks?
- How is external fragmentation affecting your utilization? You can trim this waste with more strategic choices in fit and placement: how you choose which block to use, whether/when you split and coalesce, the effects of segregated storage, pairing for buddies, and so on.
- Malloc throughput is mostly about search: examining blocks to find an appropriate one. How might you spend less time with each block and/or examine fewer blocks to find a good one? Can you organize the free list to allow you to more quickly narrow in on what you need or be less picky about which block you take?
- Throughput on free is largely constrained by how quickly you can access/update your data structures.
- Realloc has its own issues that directly pit throughput against utilization. It's worth noting that realloc is used less frequently than malloc/free, so optimizing realloc at the expense of malloc/free is of questionable merit. Here's a little tip: most blocks are never reallocated, but a block that is reallocated at least once is often reallocated repeatedly (e.g. the storage backing a CVector or merging strings in reassemble), so anticipating future reallocations might pay off.
- Coalescing seems like a good idea in theory, but can be disappointing in practice if not thoughtfully implemented. It may help to experiment with easily-tunable parameters to find the sweet spot for when/where/how to use coalescing for best effect.
- What choices make for cache-friendly access for the allocator itself and/or the client's blocks? For example, a segregated storage design that places same-size blocks (e.g. a gazillion homogeneously-sized linked list or tree nodes) in a contiguous run would be very cache-friendly.
- Some changes improve both utilization and throughput. For example, tightly-packing blocks both increases utilization and ups the cache hit rate. Other improvements sacrifice one for the other. Searching for the very best fit improves utilization but costs cycles. A deliberately-over-allocated block is easy to realloc but wastes space. Fancier free lists (doubly-linked, trees, segregated) are faster to search, but add overhead, may require extra cycles to update, and necessitate a larger investment in complex code and careful debugging. Most design questions don't have a single "best" answer, just trade-offs to be balanced.
- Improvements in utilization tend to come from changes in strategy (placement choices, fit algorithms, coalescing approach, and the like). You'll likely find hitting 50% utilization to be relatively straightforward even with rather simple approaches. However, it tends to get trickier after the low-hanging fruit has been cleared, and you'll work harder for gains from there. It's somewhat challenging to get utilization above 80%, and achieving even tighter packing will often consume a lot of throughput.
- Other than the one healthy boost you'll get from engaging compiler optimizations, improvements in throughput will tend to be smaller and more continuous across the entire range. Streamline one bottleneck at a time and you'll accumulate small, steady gains. No matter where your throughput currently stands, whether at 10% or 110%, there are usually a fair number of modest improvements in reach. Efforts can be applied across the spectrum, from changing up a strategy at the high level down to the micro-detail of streamlining a few heavily-trafficked assembly instructions.
- The sample scripts include a variety of workloads, but be wary of optimizing for exactly and only the sample mix. Our grading scripts will be similar, but not identical. It may help to play with creating some scripts to broaden the mix of workloads on which you have tested. It is typical that there are some scenarios that play into your best-case handling and others that don't, so ups and downs are expected, but far outliers can dominate the aggregate, so keep an eye on your extreme cases. This means both trying to keep your worst case close to the norm and being wary of depending too much on your best case making up for the others. If your performance is very consistent across cases, you gain higher confidence that your results will be replicable across a wider variety of scenarios. Look for the opportunities that moderate between extremes and shoot for handling a broad mix of inputs gracefully.
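To make the fit trade-off mentioned above concrete, here is a sketch of first-fit versus best-fit search over a hypothetical singly-linked explicit free list (the freenode layout and names are invented for illustration; in a real allocator the node would be embedded in the free block's own storage):

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical free-list node; in practice this would live inside the
// payload of the free block it describes.
typedef struct freenode {
    size_t size;
    struct freenode *next;
} freenode;

// First fit: grab the first block that's big enough. Cheap per node,
// but the front of the list gets picked over and splinters.
static freenode *first_fit(freenode *head, size_t need) {
    for (freenode *cur = head; cur != NULL; cur = cur->next)
        if (cur->size >= need) return cur;
    return NULL;
}

// Best fit: traverse the whole list for the tightest fit. Better
// utilization, but every request pays for a full scan.
static freenode *best_fit(freenode *head, size_t need) {
    freenode *best = NULL;
    for (freenode *cur = head; cur != NULL; cur = cur->next)
        if (cur->size >= need && (best == NULL || cur->size < best->size))
            best = cur;
    return best;
}
```

Neither is "the" answer: first-fit spends fewer cycles per request, best-fit wastes fewer bytes per placement, and measuring on your scripts is the only way to know which wins on balance.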
No amount of optimization can compensate for a design/algorithm that is irredeemably slow, so first shore up weaknesses in your design before putting on your optimization hat. Once you have a solid design that is correctly implemented, you can then work to reduce the constant factors by applying code optimization techniques. The first step in any optimization effort is measuring. The performance summary given by the test harness gives you a high-level view of the performance of your design as a whole, but you'll need more specific information to make effective optimizations in the code itself. You can run the alloctest program under the Valgrind callgrind tool to profile the code and obtain instruction counts and cache hit/miss counts to identify the heavily-traveled code passages or those that are particularly cache-unfriendly. This tells you where to focus your attention for maximum gain. Once you have identified critical passage(s), you apply optimization techniques to streamline code and reduce constant factors. Even just a slight decrease in the instruction count or cache miss rate for a high-traffic area can pay off handsomely in absolute results.
Some notes on optimization:
- As we saw in lecture, gcc's optimization can provide a significant boost on many common patterns. Don't preemptively optimize on these fronts (i.e. your efforts to hand-fold constants or convert multiplication by 2 into a shift are redundant); let the optimizer do it for you. You will likely be quite pleased with how much help even just -O2 will be, and be sure to try out higher levels and experimental options (gcc optimizations) as well. It can be educational to examine and compare the disassembly to understand how it has been transformed.
- Your manual efforts are needed for optimizations the compiler can't or won't do. You have knowledge about the code that is deeper than what the compiler can glean. You know where you can cache results and avoid unnecessary re-computation. You can manually hoist constant work out of a loop that the compiler cannot determine is redundant. If you know that one of the variables to a particular multiply/divide operation is always a power of two, you can convert into the equivalent shift. And most importantly, you measure carefully first to identify which passages are critical so you know where to be ultra-aggressive and where not to bother.
- Decomposing and unifying code into helper functions is still a great idea for managing complexity, but you may be concerned about the cost of the function-call overhead. Never fear, you can have your cake and eat it too by using gcc's support for inlining. Inlining is enabled at gcc optimization level -O2. Inlines are a great way to get the speed of preprocessor macros without any of their accompanying headaches.
- Gcc offers a number of built-in functions (gcc builtins) intended specifically for optimization purposes. Many are faster versions of expensive math functions and not relevant for an allocator, but there are a few bit operations (such as finding the first/last bit set) that might come in handy. Check them out!
- Experiment and measure! Unsure about the trade-offs between immediate versus deferred coalescing? Curious about whether or how finely to segregate your free list? Rather than rely on hunches or speculation, try some A/B experimentation. Introduce some global settings that allow you to experiment with different values or enable/disable features, then run performance trials under various scenarios to gauge the effects and choose the right parameters to tune your implementation.
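As one example of putting a builtin to work, a hypothetical segregated-list design might map a request size to a power-of-two size class in just a few instructions (the bucket scheme below is invented for illustration; __builtin_clzl counts leading zero bits in an unsigned long and is undefined for an argument of 0, hence the early return):

```c
#include <assert.h>

// Hypothetical bucket scheme: bucket 0 holds requests up to 32 bytes,
// bucket 1 up to 64, bucket 2 up to 128, and so on -- bucket k holds
// sizes up to 2^(k+5). The clz-based math replaces a loop of compares.
static inline int bucket_for(unsigned long size) {
    if (size <= 32) return 0;
    return (63 - __builtin_clzl(size - 1)) - 4;
}
```

On x86-64, gcc compiles __builtin_clzl down to a single bit-scan instruction, so finding the right segregated list costs essentially nothing.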
Script files for alloctest
The alloctest harness runs on script files. A script file contains a sequence of allocator requests in a simple text-based format. The three request types are a (allocation), r (reallocation), and f (free). Each allocation result is assigned an id-number that can be referred to in a subsequent realloc or free.
a id-number size
r id-number size
f id-number
A script file containing:
a 0 24
a 1 100
f 0
r 1 300
f 1
is translated by alloctest into these calls to your allocator:
void *ptr0 = mymalloc(24);
void *ptr1 = mymalloc(100);
myfree(ptr0);
ptr1 = myrealloc(ptr1, 300);
myfree(ptr1);
You can open any script file in your text editor. A script can begin with a comment that summarizes the scenario presented by this script. You can also peruse the entire sequence to see the pattern of requests in more detail. You can make your own simple scripts using a text editor and create larger, complex scripts programmatically.
Testing and debugging
- Understand alloctest and script files. You will do much of your allocator testing using the harness provided in alloctest.c, and it's worth investing time to sort out how it works and how to use it effectively for debugging, testing, and measuring. Peruse the various correctness checks in the code so you know how to investigate further if an ALLOCATOR_ERROR message pops up. Read the given scripts and learn how you can make additional scripts of your own to experiment with other patterns.
- Invest in validate_heap. Implement validate_heap as early in development as possible and continually extend it as you further develop your implementation. Especially when your heap data structures become complex, it is essential to monitor their internal integrity so you can identify and correct problems sooner rather than later. This routine is written by you and is your opportunity to scan your heap housekeeping and look for any signs of trouble. For example, you could walk your entire heap and verify such things as:
- Does the housekeeping information (location, size, free status) for each block seem consistent?
- Is every block in the free list marked as free? Is every block marked free listed on the free list? Is each listed only once?
- Have adjacent free blocks been coalesced according to your policy?
- Do any blocks wrongly overlap?
The test harness makes a call to validate_heap between each allocator request. If validate checks out before a particular request is processed and the next validate fails, this pinpoints the exact operation which put the heap out of whack. This will be an enormous help in focusing your attention on the right place and saving you a lot of guesswork in debugging.
The more complete your validity check, the more useful it will be to you. Adding a little redundancy, e.g. simple counters for the number of free or in-use blocks, can provide another point of comparison to double-check. The point is to be as thorough as you need it to be. Do not be at all concerned about efficiency; this is a development/debugging aid and will not be invoked during any performance measurements.
It is plain crazy to try to debug an allocator without a decent validate_heap, so do yourself a favor and invest in yours from the get-go. If you somehow manage to get your allocator working without one and then throw together a lame version on your way to submit, you will have missed the entire point.
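To give a flavor of what such checks look like, here is a sketch of a validate_heap over a deliberately simplified, hypothetical bookkeeping array (block_info and every name here are invented; your real version would walk your actual headers and free list instead):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Hypothetical per-block record; a real allocator would derive this
// information by walking its headers and free list.
typedef struct {
    size_t size;          // payload size of the block
    bool free;            // is the block marked free?
    bool on_free_list;    // did we find it on the free list?
} block_info;

static bool validate_heap(const block_info *blocks, size_t nblocks,
                          size_t heap_size) {
    size_t total = 0, nfree = 0, nlisted = 0;
    for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i].size == 0) {   // a zero-size block is nonsense
            fprintf(stderr, "block %zu has size 0\n", i);
            return false;
        }
        if (blocks[i].on_free_list && !blocks[i].free) {
            fprintf(stderr, "in-use block %zu is on the free list\n", i);
            return false;
        }
        if (blocks[i].free && !blocks[i].on_free_list) {
            fprintf(stderr, "free block %zu missing from free list\n", i);
            return false;
        }
        total += blocks[i].size;
        if (blocks[i].free) nfree++;
        if (blocks[i].on_free_list) nlisted++;
    }
    if (total > heap_size) {   // blocks must fit within the segment
        fprintf(stderr, "blocks overrun the heap segment\n");
        return false;
    }
    return nfree == nlisted;   // redundant counter cross-check
}
```

Note the redundant counters at the end: independent tallies that must agree give you an extra tripwire at no design cost, and since this never runs during performance measurement, pile on as many checks as you like.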
- Use your mad gdb skills. Previous labs and assignments have built up your gdb skills; be sure to put them to good use here. Allocators are prone to the odd flaw that flares up only once in a thousand calls. A mix of gdb breakpoints, conditions, and commands can help narrow in on such elusive beasts. A memory watchpoint can monitor and report when a location is being changed; this can be valuable when something in your heap is being corrupted but you aren't sure when the deadly action is triggered.
- Debugging optimized code. Code compiled with optimization complicates debugging. The debugger will do its best, but the optimizing transformations can produce strange results: single-stepping may appear to jump through the code erratically, some variables will have been optimized out entirely and be unknown to gdb, and so on. You'll have your best luck doing your debugging on code compiled without optimization. Some bugs only show up when optimized, in which case you have no choice. It may be more helpful to view the disassembly than work from the C code in those cases.
- gdb and protection violations. gdb doesn't heed the protection bits on memory segments, so it can access read-protected locations without complaint, while accessing these same locations in a running program will cause a segmentation fault. If your allocator is faulting due to trying to access a read-protected location, it could be due to accessing part of the heap segment that was previously mapped but has since been marked unreadable. This symptom is highly suggestive that your myinit doesn't correctly reset the state of the allocator back to a clean slate.
- Valgrind memtool is no help. Valgrind is designed to spy on the actions of the real malloc/free and will pay no attention to what you are doing over in your own heap segment. Running the alloctest program under Valgrind memtool is not only unhelpful; its complaints may needlessly worry you (e.g. barking about a large range for permissions or the big reachable block used by the timing functions).
Logistics for success
Our thoughts on how to make your last assignment a productive and joyful experience:
- Always have a working program. It may be tempting to rip the code to shreds and eagerly set out to build a fancy design from the get-go, but we strongly recommend you morph slowly, applying improvements one step at a time, constantly testing and measuring as you go. Analyze why the naive implementation is lame, figure out a way to make it better (how about re-using freed memory or not requiring a whole page for a 2-byte request?), implement the improvement, measure the gain, and repeat. You can stop at any point and still have a working allocator to submit, whereas the tear-it-down-and-build-it-up approach could leave you stranded with an ambitious-yet-non-functional allocator at the time of the deadline.
- Commit often. Make regular Mercurial commits. Optimization often involves trial-and-error experiments. If you commit before going off on an exploratory path, you will have a known good snapshot to revert to should it not work out.
- Optimize after correctness. If your allocator doesn't work correctly, it doesn't matter how fast it runs. Add features in the most straightforward way and make it correct first. Then iteratively rework to make it faster.
- Optimize intelligently. Don't assume that a data structure/algorithm performs better on a hunch or runs faster because it is expressed in fewer lines of code. Run performance tests to objectively measure results.
- Optimized compile. The flags for compiling allocator.c are set by the ALLOCATOR_EXTRA_CFLAGS variable in the Makefile. These flags are initially set to -O0. While still in active development, unoptimized code will be more debugger-friendly, but once you're working on performance, experiment with more aggressive levels to see what boost you can get from the optimizing compiler. We will compile your allocator using the settings of ALLOCATOR_EXTRA_CFLAGS in your submitted Makefile, so be sure that you submit with the flags configured to work best for your code. We will use our default build settings for all modules other than allocator.c, so that is how you will also want to compile those modules for testing so you will observe behavior consistent with what will happen during grading.
- Partners. Don't miss our advice for successful partnership.
Code quality
- Our grading priorities heretofore have put little emphasis on efficiency: we designated only a few points for meeting the "reasonable efficiency" benchmark and were quick to ding for hacky algorithms, code duplication, lack of decomposition, unnecessary special cases, and other choices that make for ugly code, even if they offered some efficiency advantage. The priorities change for this assignment, where efficiency is the focus. This doesn't mean you should set out to write something ghastly, but it does shift the trade-offs. Obviously the best outcome is streamlined code that is also clean and high-quality, but when the two are in opposition, compromise is inevitable. When making the choice to sacrifice a bit of quality in the name of performance, only do so if the performance gain is measurably important and there is no elegant way to get the same boost.
- There will be some unavoidable use of void* and typecasts, but you can tame the complexity by writing helpers that clear away the dense pointer arithmetic/typecasting and increase the clarity of the surrounding code. As always, unify common code -- if it was tricky to write once, you surely don't want to write it more than once.
- Decomposition is essential. A page-long malloc or realloc can be dense code that makes your head swim. Break it down into tasks: finding a block, updating the housekeeping, and so on. If needed, use inlining to keep down the cost of function calls.
- Don't slap in quick fixes -- adding +1 here, a typecast there, an extra * somewhere else -- without thinking about the underlying root cause and finding the right way to fix it.
- The standard malloc interface takes its arguments as a size_t, which is an unsigned integer type. Carelessly co-mingling signed and unsigned types can lead to unexpected trouble, so be thoughtful and consistent.
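As a tiny illustration of the kind of trouble (the helper below is hypothetical): subtraction on a size_t wraps around rather than going negative, so a remaining-space check written the wrong way always passes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: can a free block of 'available' bytes satisfy
// a request for 'requested' bytes?
static bool fits(size_t available, size_t requested) {
    // WRONG: (available - requested >= 0) is always true -- when
    // requested > available, the subtraction wraps to a huge value.
    return available >= requested;   // compare first; never subtract blindly
}
```

Compiling with -Wall -Wextra flags many (though not all) of these signed/unsigned mix-ups for you.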
Program facts
Some facts that might help you gauge what to expect.
- Code size. Submissions have ranged from 71 to 482 lines; the average is around 170. These are counts of programming statement lines (not including blank lines, comments, #include).
- Performance metrics. Utilization has ranged from single digits to a high of 88 (median 68) and throughput from single digits to a high of 150 (average 72). A couple of submissions achieved combined utilization > 80 and throughput > 95!
- Points earned. The median assignment score has been about 92% of the total points, almost a third of the submissions end up exceeding the full-credit mark and earning bonus points for performance.
- Partners. Typically about half the class has chosen to work individually and the remainder pair up. The average score was roughly the same for the two groups, although there was a smaller variance in the partnered group (i.e. solo submissions have more outcomes at the extreme edges).