Lab 8: Memory and optimization

Lab sessions Mon Mar 02 to Thu Mar 05

Lab written by Julie Zelenski

Learning goals

After this lab, you will have

  1. observed results of memory misuse detection by libc and valgrind
  2. experimented with different mechanisms for execution timing
  3. disassembled and compared code compiled with and without optimization
  4. run a profiler to analyze a program's runtime behavior

Share a computer with somebody new. Brag to each other about your plan to crush the heap allocator assignment.

Lab exercises

  1. Get started. Clone the lab starter project using the command

    hg clone /afs/ir/class/cs107/repos/lab8/shared lab8
    

    This creates the lab8 directory which contains source files and a Makefile.

    Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.

  2. Understanding malloc. A conforming malloc implementation has requirements such as returning NULL on failure and freeing the old pointer when realloc moves the block. However, some behaviors are unspecified, such as what happens when freeing a pointer twice or realloc'ing a non-heap pointer. In those situations, implementations can, and do, behave differently. Given the heavy use of the allocator, it is critical that it run fast. Including code to deflect errant requests might help a careless programmer, but it would come at the expense of slowing down every call. The preference for efficiency over safety means that allocators typically blunder through bad calls.

    The malloctest.c program has code that makes incorrect use of the allocator functions. If the argument to the program is a number from 1 to 6, the number identifies an error to make. Execute the program 6 times (once for each different value of the argument) to see what happens on each of the errors. In some cases, you might see no visible effects. At best, this means the allocator detected that your request was invalid and ignored it. At worst, it means you have subtly corrupted the heap and only later will the program happen upon the damage.
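
    To give a flavor of the kinds of errant calls malloctest.c makes, here is a minimal sketch (hypothetical code, not the actual test program); each marked line is undefined behavior that the allocator may or may not catch:

      #include <stdlib.h>
      #include <string.h>

      int main(void)
      {
          char *p = malloc(16);
          free(p);
          free(p);                          // double free of the same pointer

          char stackbuf[16];
          char *q = realloc(stackbuf, 32);  // realloc of a non-heap pointer
          (void)q;

          char *r = malloc(8);
          memset(r, 0, 32);                 // write past the end of an 8-byte block
          free(r);
          return 0;
      }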

    Run the program again for each of the 6 arguments, but run under valgrind this time to see whether it can help. Valgrind's virtual environment instruments the code with extra memory checks. It uses its own heap allocator with different behaviors such as inserting padding around heap blocks, never reusing memory, and storing the heap housekeeping information outside the heap so it is less likely to be stomped on. It also tracks when each address is written to detect use of uninitialized data and watches for leaks. These extra checks mean your program runs differently and more slowly under valgrind, but it can provide invaluable help in identifying and fixing memory errors and leaks.

    Optional malloc reverse-engineering challenge. The heap allocator on the myths uses a pre-node header to track node housekeeping, like the one discussed in lecture/textbook. By poking around, casting pointers, examining the memory, reading the disassembly, and so on, you can use your now-killer reverse-engineering skills to figure out the layout of the pre-node header. Given a pointer returned by malloc, how can you dig out the number of bytes allocated to that address? The true size can be larger than requested because of rounding/alignment. Run malloctest 7 to do some spying on the libc heap. While you're digging around, read <malloc.h> or the man page for mallinfo for some interesting additional allocator functions. The function malloc_stats prints a brief summary of the heap use, and malloc_usable_size(void *) returns the size allocated to a malloc block. What happens if you pass a non-heap pointer to malloc_usable_size? Note that these functions are not standard C and you should not rely on them in production code, but they can be handy for debugging.
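
    As a quick taste of these introspection helpers, here is a minimal sketch (assuming a glibc system such as the myths) that compares the usable size against the requested size:

      #include <stdio.h>
      #include <stdlib.h>
      #include <malloc.h>

      int main(void)
      {
          void *p = malloc(10);
          // the usable size is often larger than requested due to rounding/alignment
          printf("requested 10 bytes, usable size %zu\n", malloc_usable_size(p));
          malloc_stats();                   // prints a brief heap summary to stderr
          free(p);
          return 0;
      }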

  3. Execution timing. The matrix, swap, and copy programs allow you to experiment with various means of timing a program's execution.

    a) The matrix program sums the elements in a 2-d array. A command-line argument controls the order in which the elements are accessed and which timer is used.

    • The simplest way to get a rough runtime estimate is using the system process timer via the Unix time command (see its man page). This measurement is not that precise but it's convenient in that you can time any program without making code changes. Try time ./matrix rows and time ./matrix cols a few times and read the seconds reported. How consistent are the times from run to run? Even at this coarse granularity, you should observe a difference in summing by rows versus columns. Let's explore this further with more precise measurements.
    • The matrix program can also time itself using a code-based timer if you execute it with one of its two timer arguments (either gtd or rtc). The gtd timer uses gettimeofday to access the OS clock and reports the time used in seconds. This is a simple and portable way to measure time, but its accuracy can vary depending on the system implementation. The rtc timer uses the chip's real-time clock and the K-best cycle counting code from Ch 5 of B&O. The real-time clock timer achieves high accuracy by using IA32 assembly and sophisticated statistics, and it reports the results in counts of cycles. Execute the program a few times with ./matrix gtd as the argument and then again with ./matrix rtc to review the results printed by the code-based timers. How much variability do you see in the gettimeofday timer from run to run? What about the real-time clock? (Note that the observed elapsed time for rtc is longer because the fcyc code times multiple trials until the results converge.)
    • Use the timing data to answer the question: does C arrange the elements of a 2-d array in row-major or column-major order? (The two traversal orders are sketched just below.)
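
    For reference, here is a minimal sketch (hypothetical code, not the lab's matrix.c) of the two traversal orders wrapped in a simple gettimeofday timer:

      #include <stdio.h>
      #include <sys/time.h>

      #define NROWS 2000
      #define NCOLS 2000

      static int grid[NROWS][NCOLS];

      // elapsed seconds between two gettimeofday samples
      static double elapsed(struct timeval start, struct timeval end)
      {
          return (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec)/1E6;
      }

      int main(void)
      {
          struct timeval before, after;
          long sum = 0;

          gettimeofday(&before, NULL);
          for (int r = 0; r < NROWS; r++)      // row-by-row: consecutive accesses
              for (int c = 0; c < NCOLS; c++)  // touch adjacent addresses
                  sum += grid[r][c];
          gettimeofday(&after, NULL);
          printf("by rows: %g secs\n", elapsed(before, after));

          gettimeofday(&before, NULL);
          for (int c = 0; c < NCOLS; c++)      // column-by-column: consecutive
              for (int r = 0; r < NROWS; r++)  // accesses stride a whole row apart
                  sum += grid[r][c];
          gettimeofday(&after, NULL);
          printf("by cols: %g secs (sum %ld)\n", elapsed(before, after), sum);
          return 0;
      }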

    b) The swap program explores the relative cost of stack versus heap allocation for the temporary space needed to swap values during a sort routine. The program insertion sorts a large array using three different swap functions: one using a fixed-size stack buffer, one using a variable-length stack buffer, and one using a dynamically-allocated heap buffer. The three strategies are sketched below. The program runs all three versions, times each via RTC, and prints the results.
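
    A minimal sketch of the three strategies (hypothetical names and code, not the lab's swap.c) for a generic byte-wise swap:

      #include <stdlib.h>
      #include <string.h>

      #define MAX_ELEM_SIZE 128      // hypothetical cap, assumes sz never exceeds it

      static void swap_fixed(void *a, void *b, size_t sz)
      {
          char tmp[MAX_ELEM_SIZE];   // compile-time constant subtracted from %esp
          memcpy(tmp, a, sz);
          memcpy(a, b, sz);
          memcpy(b, tmp, sz);
      }

      static void swap_vla(void *a, void *b, size_t sz)
      {
          char tmp[sz];              // stack pointer adjusted at runtime
          memcpy(tmp, a, sz);
          memcpy(a, b, sz);
          memcpy(b, tmp, sz);
      }

      static void swap_heap(void *a, void *b, size_t sz)
      {
          char *tmp = malloc(sz);    // malloc/free on every single swap
          memcpy(tmp, a, sz);
          memcpy(a, b, sz);
          memcpy(b, tmp, sz);
          free(tmp);
      }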

    • Run the swap program several times and review the printed results. How do the strategies rank in terms of time required?
    • Disassemble and compare the code for the fixed-stack versus variable-stack versions. Allocating a constant-length stack array takes one instruction (subtracting a compile-time constant from %esp); how many instructions are needed to set up a variable-length stack array?

    c) The copy program explores the relative cost of different means of copying a large array. The total amount of data copied is constant, but changing how much work is done per iteration and the number of iterations can have a profound impact on performance. The program experiments with a variety of approaches: a char-by-char loop, an int-by-int loop, a memcpy in a loop, and a single call to memcpy or memmove on the entire block. The program uses the RTC counter to time each operation and prints the timing results.
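
    A minimal sketch of the four approaches (hypothetical names and code, not the lab's copy.c):

      #include <string.h>

      #define N 1000000

      static int src[N], dst[N];

      static void copy_by_char(void)       // one byte per iteration
      {
          char *s = (char *)src, *d = (char *)dst;
          for (size_t i = 0; i < sizeof src; i++)
              d[i] = s[i];
      }

      static void copy_by_int(void)        // four bytes per iteration
      {
          for (size_t i = 0; i < N; i++)
              dst[i] = src[i];
      }

      static void copy_memcpy_loop(void)   // memcpy call overhead on every element
      {
          for (size_t i = 0; i < N; i++)
              memcpy(&dst[i], &src[i], sizeof(int));
      }

      static void copy_one_memcpy(void)    // entire block in a single call
      {
          memcpy(dst, src, sizeof src);
      }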

    • Run the copy program several times and look at the cycle counts for the char and int loops. About how much faster is copying by ints instead of chars? The int version demonstrates the value of "loop unrolling" to amortize the loop overhead by doing more work in each iteration and iterating fewer times.
    • Now compare the char/int loops to the memcpy loop (it memcpy's 4 bytes each iteration, same number of iterations as int loop) and learn the cost of using memcpy when a simple assignment would do-- ouch!
    • Now consider the cycle counts for the single memcpy/memmove -- what an enormous improvement over the manual loops! Now you know why inserting into a CVector should move the entire block in one call rather than memcpy-ing each element individually. memcpy can have a performance advantage over memmove (since memcpy can ignore overlapping regions and memmove has to deal with them) but it is generally slight-- can you observe much difference between the two?

  4. Optimization. Higher levels of gcc optimization can do impressive things to streamline the generated assembly. By becoming knowledgeable about the compiler's capabilities, you'll be better able to focus your attention on passages that require human intervention and not bother where gcc can work its magic. The trials.c program has code that implements a few simple numeric operations in a variety of ways (some clever, some less so, and some laughably lame) and runs time trials on each. The Makefile builds two versions of the program: trials using -O0 and trials_opt using -O2.

    • Before you run the program, first look at the code for the different implementations (version A, B, etc.) of the operations and make predictions about how you think the variants will stack up in speed and how large the differences will be. Run the unoptimized version and use the timings reported to find out if your intuition holds up. After you see the data, I want you to promise to never again use pow/exp2 when a simple shift will do... (the contrast is sketched after this list).
    • Now run the optimized version and compare to the unoptimized times. Which operations gain large improvements from the optimizer? Which don't? Can you pick out some of the features that make code amenable to being optimized by the compiler?
    • Pick a function or two that show a large improvement when optimized. Examine the unoptimized and optimized disassembly to identify what transformations have been made.
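
    To make the pow/exp2-versus-shift contrast concrete, here is a minimal sketch (hypothetical code, not the lab's trials.c) of two ways to compute 2 to the n:

      #include <math.h>    // link with -lm for exp2

      static int pow2_mathlib(int n)   // converts through double and calls
      {                                // into the math library
          return (int)exp2(n);
      }

      static int pow2_shift(int n)     // a single shift instruction
      {
          return 1 << n;
      }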

    Just for fun: Here's a cute quiz on what gcc can/will optimize -- fun to check out!

  5. Optimization "bugs". Compiler optimizations shouldn't change a program's behavior, just the time needed to execute it. If a program behaves differently under optimization, this is typically attributable to a bug or a reliance on implementation-defined behavior. Read through the buggy.c program, which contains buggy code reprised from a previous lab. The Makefile builds two versions of the program: buggy using -O0 and buggy_opt using -O2.

    • In the channeling example, the values from init appeared to be "passed" to print, despite no parameters being passed in or values returned. This "works" when compiled without optimization, but breaks down when compiled -O2. Disassemble the optimized version -- what changed?
    • When compiled without optimization, the clear_array function goes into an infinite loop, but not when compiled -O2. (I guess you might call that an infinite speedup!) Disassemble the optimized and unoptimized code to explain the discrepancy.
    • When compiled without optimization, a call to factorial(-1) ends with a segmentation fault, but it returns 0 when compiled -O2. Disassemble the optimized and unoptimized functions and figure out what happened. (A plausible factorial with this failure mode is sketched after this list.)
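
    The buggy.c source isn't reproduced here, but a plausible (hypothetical) factorial with the failure mode described looks like the sketch below; for a negative argument the base case is never reached, and the disassembly tells the rest of the story:

      static int factorial(int n)
      {
          if (n == 0)
              return 1;      // never reached when n starts out negative
          return n * factorial(n - 1);
      }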

  6. Callgrind. The callgrind tool (part of the valgrind suite) is a code profiler that provides statistics about instruction counts and cache efficiency. If you haven't already, see our guide to callgrind for a quick introduction on how to run callgrind and use the annotator on the callgrind.out file to produce a readable report. Here are two experiments to run:

    • Run callgrind on the swap program to generate a callgrind.out file and then run the annotator on that file:

      valgrind --tool=callgrind ./swap
      
      ==24256== Callgrind, a call-graph generating cache profiler
      ==24256== Command: ./swap
      ...
      
      callgrind_annotate --auto=yes callgrind.out.24256               // replace 24256 with your process num
      

      In the callgrind report, each line is annotated with the count of instructions executed on its behalf. Use those counts to identify hot spots and get a sense of the relative cost of operations. How much more expensive is calling malloc versus allocating on the stack? Is a malloc call more costly than a free call?

    • If you add the --simulate-cache=yes option when running callgrind, it simulates the processor caches and reports on cache hits/misses. Try this on the matrix program: valgrind --tool=callgrind --simulate-cache=yes ./matrix rows and then a second run with the ./matrix cols argument. Read the miss counts from the report. What is the difference in the number of data cache (D1) misses when traversing by rows versus columns?

Check off with TA

Before you leave, be sure to submit your checkoff sheet (in the browser) and have the lab TA come by and confirm your work so you will be properly credited for lab. If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own!