Lab sessions Mon Mar 02 to Thu Mar 05
Lab written by Julie Zelenski
After this lab, you will have
Share a computer with somebody new. Brag to each other about your plan to crush the heap allocator assignment.
Get started. Clone the lab starter project using the command
hg clone /afs/ir/class/cs107/repos/lab8/shared lab8
This creates the lab8 directory which contains source files and a Makefile.
Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.
Understanding malloc. A conforming malloc implementation has requirements such as returning NULL on failure and freeing the old pointer when realloc moves the block. However, some behaviors are unspecified, such as what happens when freeing a pointer twice or realloc'ing a non-heap pointer. In those situations, implementations can, and do, behave differently. Given the heavy use of the allocator, it is critical that it run fast. Including code to deflect errant requests might help a careless programmer, but it would come at the expense of slowing down every call. The preference for efficiency over safety means that allocators typically blunder through bad calls.
The malloctest.c program has code that makes incorrect use of the allocator functions. If the argument to the program is a number from 1 to 6, the number identifies an error to make. Execute the program 6 times (once for each different value of the argument) to see what happens on each of the errors. In some cases, you might see no visible effects. At best, this means the allocator detected that your request was invalid and ignored it. At worst, it means you have subtly corrupted the heap and the program will only later happen upon the damage.
Run the program again for each of the 6 arguments, but run under valgrind this time to see whether it can help. Valgrind's virtual environment instruments the code with extra memory checks. It uses its own heap allocator with different behaviors such as inserting padding around heap blocks, never reusing memory, and storing the heap housekeeping information outside the heap so it is less likely to be stomped on. It also tracks when each address is written to detect use of uninitialized data and watches for leaks. These extra checks mean your program runs differently and more slowly under valgrind, but it can provide invaluable help in identifying and fixing memory errors and leaks.
Optional malloc reverse-engineering challenge. The heap allocator on the myths uses a pre-node header to track node housekeeping, like that discussed in lecture/textbook. By poking around, casting pointers, examining the memory, reading the disassembly, and so on, you can use your now-killer reverse-engineering skills to figure out the layout of the pre-node header. Given a pointer returned by malloc, how can you dig out the number of bytes allocated at that address? The true size can be larger than requested because of rounding/alignment. Run malloctest 7 to do some spying on the libc heap. While you're digging around, read <malloc.h> or the man page for mallinfo for some interesting additional allocator functions. The function malloc_stats prints a brief summary of heap use. The function malloc_usable_size(void *) returns the size allocated to a malloc block. What happens if you pass a non-heap pointer to this function? Note that these functions are not standard C and you should not rely on them in production code, but they can be handy for debugging.
Execution timing. The matrix, swap, and copy programs allow you to experiment with various means of timing a program's execution. The matrix program sums the elements in a 2-d array. An argument given to the program controls the order in which it accesses the elements and which timer is used.
The simplest timing tool is the Unix time command (see its man page). This measurement is not that precise, but it's convenient in that you can time any program without making code changes. Try time ./matrix rows and time ./matrix cols a few times and read the seconds reported. How consistent are the times from run to run? Even at this coarse granularity, you should observe a difference in summing by rows versus columns. Let's explore this further with more precise measurements.
The matrix program also has two code-based timers (gtd and rtc). The gtd timer uses gettimeofday to access the OS clock and reports the time used in seconds. This is a simple and portable way to measure time, but its accuracy can vary depending on the system implementation. The rtc timer uses the chip's real-time clock and the K-best cycle counting code from Ch 5 of B&O. The real-time clock timer achieves high accuracy by using IA32 assembly and sophisticated statistics, and it reports the results in counts of cycles. Execute the program a few times with ./matrix gtd and then again with ./matrix rtc to review the results printed by the code-based timers. How much variability do you see in the gettimeofday timer from run to run? What about the real-time clock? (Note that the observed elapsed time for rtc is longer because the fcyc code times multiple trials until the results converge.)
The swap program explores the relative cost of stack versus heap allocation for the temporary space needed to swap values during a sort routine. The program insertion sorts a large array using three different swap functions: one that uses a fixed-size stack buffer, one a variable-length stack buffer, and a third a dynamically-allocated heap buffer. The program runs all three versions, times each via RTC, and prints the results. Run the swap program several times and review the printed results. How do the strategies rank in terms of time required?
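The three strategies can be sketched as follows (hypothetical function names; see swap.c for the real code). Each swaps nbytes of data, but the temporary buffer comes from a different place:

```c
#include <stdlib.h>
#include <string.h>

void swap_fixed(void *a, void *b, size_t nbytes)
{
    char tmp[256];                 // fixed-size stack buffer (assumes nbytes <= 256)
    memcpy(tmp, a, nbytes);
    memcpy(a, b, nbytes);
    memcpy(b, tmp, nbytes);
}

void swap_vla(void *a, void *b, size_t nbytes)
{
    char tmp[nbytes];              // C99 variable-length array, still on the stack
    memcpy(tmp, a, nbytes);
    memcpy(a, b, nbytes);
    memcpy(b, tmp, nbytes);
}

void swap_heap(void *a, void *b, size_t nbytes)
{
    char *tmp = malloc(nbytes);    // a malloc/free pair on every single swap
    if (tmp == NULL) return;
    memcpy(tmp, a, nbytes);
    memcpy(a, b, nbytes);
    memcpy(b, tmp, nbytes);
    free(tmp);
}
```

The stack versions cost only a pointer adjustment to "allocate", while the heap version pays for a full malloc and free per call, which is what the timing comparison exposes.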
The copy program explores the relative cost of different means of copying a large array. The total amount of data copied is constant, but changing how much work is done per iteration and the number of iterations can have a profound impact on performance. The program experiments with a variety of ways: a char-by-char loop, an int-by-int loop, a memcpy in a loop, and a single call to memcpy or memmove on the entire block. The program uses the RTC counter to time each operation and prints the timing results. Run the copy program several times and look at the cycle counts for the char and int loops. About how much faster is copying by ints instead of chars? The int version demonstrates the value of "loop unrolling": amortizing the loop overhead by doing more work in each iteration and iterating fewer times.
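The char and int loops might look roughly like this (a sketch, not the actual copy.c code). The int loop moves sizeof(int) bytes per iteration, so it pays the loop overhead (test, increment, branch) a quarter as often:

```c
#include <stddef.h>

void copy_by_char(char *dst, const char *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)   // one byte moved per loop iteration
        dst[i] = src[i];
}

void copy_by_int(char *dst, const char *src, size_t nbytes)
{
    // assumes nbytes is a multiple of sizeof(int) and both pointers are
    // suitably aligned for int access
    int *d = (int *)dst;
    const int *s = (const int *)src;
    for (size_t i = 0; i < nbytes / sizeof(int); i++)
        d[i] = s[i];
}
```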
Optimization. Higher levels of gcc optimization can do impressive things to streamline the generated assembly. By becoming knowledgeable about the compiler's capabilities, you'll be better able to focus your attention on passages that require human intervention and not bother where gcc can work its magic. The trials.c program has code that implements a few simple numeric operations in a variety of ways (some clever, some less so, and some laughably lame) and runs time trials on each. The Makefile builds two versions of the program: trials using -O0 and trials_opt using -O2.
Just for fun: here's a cute quiz on what gcc can/will optimize -- check it out!
Optimization "bugs". Compiler optimizations shouldn't change a program's behavior, just the time needed to execute it. If a program behaves differently under optimization, this is typically attributable to a bug or an implementation-defined dependency. Read through the buggy.c program, which contains buggy code reprised from a previous lab. The Makefile builds two versions of the program: buggy using -O0 and buggy_opt using -O2.
When compiled -O0, the code in which init appeared to be "passed" to the clear_array function goes into an infinite loop, but not when compiled -O2. (I guess you might call that an infinite speedup!) Disassemble the optimized and unoptimized code to explain the discrepancy.
When compiled -O0, factorial(-1) ends with a segmentation fault, but it returns 0 when compiled -O2. Disassemble the optimized and unoptimized functions and figure out what happened.
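A plausible shape for such a factorial bug (hypothetical; the real code is in buggy.c):

```c
// The base case tests == 0, so a call with a negative n recurses past zero
// forever. At -O0, each call pushes a stack frame until the stack overflows
// (segfault). At -O2, gcc may transform the recursion into a loop, so no
// frames accumulate and the program instead returns whatever garbage the
// wrapped int multiplications produce.
int factorial(int n)
{
    if (n == 0)                    // never true when called with a negative n
        return 1;
    return n * factorial(n - 1);
}
```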
The callgrind tool (part of the valgrind suite) is a code profiler that provides statistics about instruction counts and cache efficiency. If you haven't already, see our guide to callgrind for a quick introduction on how to run callgrind and use the annotator on the callgrind.out file to produce a readable report. Here are two experiments to run:
Run valgrind callgrind on the swap program to generate a callgrind.out file and then run the annotator on that file:
    valgrind --tool=callgrind ./swap
    ==24256== Callgrind, a call-graph generating cache profiler
    ==24256== Command: ./swap
    ...
    callgrind_annotate --auto=yes callgrind.out.24256    // replace 24256 with your process num
In the callgrind report, each line is annotated with the instruction counts for executing that line. Use those counts to identify hot spots and get a sense of the relative cost of operations. How much more expensive is calling malloc versus allocating on the stack? Is a malloc call more costly than a free call?
If you add the --cache-sim=yes option when running valgrind callgrind, it simulates the processor caches and reports on cache hits/misses. Try this on the matrix program:
    valgrind --tool=callgrind --cache-sim=yes ./matrix rows
and a second run with the ./matrix cols argument. Read the miss counts from the report. What is the difference in the number of data cache (D1) misses when traversing by rows versus columns?
Before you leave, be sure to submit your checkoff sheet (in the browser) and have the lab TA come by and confirm so you will be properly credited for lab. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own!