Web

Assignment 6 advice

Written by Julie Zelenski

Getting oriented

There is not a particularly large amount of code to write for this project and it doesn't involve a tricky algorithm like edit distance or the delicate pointer-rewiring of CMap rehash. Past students have told us the most daunting issues were getting started in the first place and keeping track of the big picture: which data is where, how to access it, and how the parts relate. So take a deep breath, spend time up front to get oriented, ask questions, draw pictures, and poke around in gdb to be sure your understanding of the big picture is solid before tackling the code.

ELF format

The ELF file format in its entirety can be crufty and overwhelming, but take heart that you have to extract only a little slice of the data (the function symbols) and only need to concern yourself with that portion.
Review the starter code to see the struct definitions for data you will access within the ELF file: RawElfFileHeader, RawElfSectionHeader, and RawElfSymbols. Each of these struct definitions corresponds to the raw data as found in an Elf file, and enables simple, direct access to the file contents.
The standard elf.h header is where constants such as SHT_SYMTAB and STB_LOCAL are #define-d.
The colored diagram in the writeup provides an overview of the ELF features specifically relevant to the assignment. Chapter 7 of B&O contains broader background details on ELF in general.
The readelf tool can be a great debugging aid. For example, to verify that you are correctly accessing the section headers, use readelf -S to dump the section header table and compare its entries to those your code is finding. By using readelf, you will learn there are some surprising entries (such as the first section header or symbol containing all zero fields), that might have otherwise had you wondering if your code was malfunctioning.
Be wary of profligate casting and avoid unnecessary use of void *, pointer arithmetic, and manual scaling. Accessing the data within the ELF file requires a little pointer arithmetic with an offset and one typecast to the content type, but once oriented, you can (and should) use ordinary C array/struct access. In our solution, there is a single typecast per each layer: a typecast on the base address to access the file header, another typecast on the address of the table containing section header structs, another typecast on the address of the section data containing symbol structs, and a typecast on the address of the section data containing name strings.

Symbols module

A good interface provides abstraction and encapsulation. The symbols interface allows the client to make a high-level request for information and returns a tidy package designed for the convenience of the client. The goopy details of the raw ELF format and whatever machinations are required to harvest and process it, is completely hidden inside the symbols implementation.
In cvector.h/cmap.h, the struct types were declared as incomplete types (details hidden, access only through functions) as CVector/CMap were designed to be used as abstract data types. The struct type you create to hold a symbol doesn't need this sort of secrecy, it's just an ordinary struct declaration with public access to its fields.
Past students have asked if they can use a CVector to store the function symbols. I salute this desire to leverage the code you've worked hard on, but I can't say I recommend it. CVector requires you to deal with void* and forfeits type-safety, so you want to be getting serious bang for your buck (i.e. using a lot of the CVector features) to make it worth it. In this case, you are filling an array of structs which you later print/search, all of which can be easily done on a strongly-typed C-array using the stdlib functions, so using a CVector doesn't offer much advantage.
When dynamically allocating memory, it is desirable that the memory be "right-sized", but sometimes you won't know until after you've finished processing what the right size is. To get around this, consider adopting an idiom used in assignments 1 & 2: use the stack for cheap temporary storage of maximum-size, fill it, and copy the used part to dynamically-allocated memory of the correct size when the size is known.
The sort order for symbol names is case-insensitive lexicographic order. The library function strcasecmp will be handy here.
You are to search for a symbol by address using lfind. Treat the address you are searching for as a symbol with a singleton range, i.e. a range that starts at address and has size one. Your comparator callback should report two symbols as equal if one symbol range encloses the other. A properly written comparator is always symmetric--- i.e. if cmp(a, b) reports equal, so should cmp(b, a).
There are a few oddballs in the symbol table. We want you to be aware of them, but do not want you to introduce any special-case handling.
- There is the occasional "synonym" symbol where two or more symbol names are associated with the same or overlapping ranges. When looking up an address (either for namelist or crash reporter), you are to match to the lexicographically first symbol name of a synonym. If you have stored the symbols sorted by name, lfind will match the first synonym without extra effort.
- In rare cases, a symbol has zero size. That gives the symbol an empty range and it is unclear whether the start address should match the empty range or not. We do not want you to make a special case either way and will not test on zero-size symbols.
You don't need to worry about how/why these quirky symbols are created, nor should you go out of your way to make a special case for them. We want to alert you to their presence so you don't stumble across them and mistakenly think they were due to an error in your code.

Namelist

Given the heavy lifting done in the symbols module, the namelist program is a trivial amount of code.
You can compare against our sample namelist program to get an exact match for output. Custom sanity check is available to facilitate further testing. The standard nm utility (see man page) invoked with the options nm --defined-only -S is a somewhat close match, but includes all symbols, not just functions, and its sort order is locale-dependent instead of ordinary strcasecmp.
Your namelist works for 32-bit ELF files only. The myth machines compile to 64-bit ELF by default, you must compile with the flag -m32 to create 32-bit ELF files.
The strip utility (see man page) will destructively modify an object file and discard symbols by removing the entire symbol table section. There is no way to reverse/unstrip, but you can compile again to regenerate the object file with its symbol table. First make clean to discard any existing build products and make anew.
When looking to validate a hex address or convert from hex to decimal, take advantage of the standard library rather than attempting to roll your own. Check out man pages or your C reference on the library functions strtoul and sscanf. Your call on which to pursue; I find strtoul more full-featured but its interface is awkward, sscanf is simpler to use but you have to do a bit more for yourself.

Dissecting the runtime stack

The runtime stack, its calling conventions, and the call/return protocol have been studied in lecture, lab, the textbook, the previous assignment, and by now you have a solid understanding of its operation. Despite this, you may still be skeptical that the runtime reality truly matches the abstraction as we've studied it. They are one and the same! If you can accurately draw the contents of the runtime stack and identify the parts (return address, saved base pointer, parameters, etc.), then you have the necessary knowledge to dissect that stack to produce the backtrace.

Verify that you can answer these critical questions about how the stack operates:

Consider how to get oriented at the topmost stack frame. What is something accessible in C code that is known to be placed at fixed location relative to the current base pointer? You will not need inline assembly, digging into the ucontext or other unfamiliar structures, no jumping through crazy hoops, just plain old ordinary C. There is an awesome a-ha! moment in here that each of you deserves to enjoy for yourself. :-)
Once you have determined the current base pointer, how can you access the return address within this stack frame? What does a return address represent? How does it relate to the current callee/caller functions?
Where is the saved base pointer within this stack frame? What does the saved base pointer represent? How does it relate to the current callee/caller functions? How can you use it to unwind the stack?
What happens when the return address and/or saved base pointer have been overwritten? What is the consequence of a bad return address? a bad saved base pointer? What simple heuristic can identify and halt on blatant stack corruption, where the saved base pointer points outside the current stack segment?
How can you determine the current extent of the stack segment? The address for stack bottom is randomized (as part of protection against malicious code) and will vary from run to run. However, once the program is running, you can reliably determine its bottom address as long as you do it when the stack is still intact. (hint!)

Take care to mind the difference between a saved ebp (base/frame pointer) and saved eip (return address). Although they are neighbors in the stack frame and have similar register names, they serve completely different purposes and conflating the two only leads to heartache and headache.

A word of therapy: You may feel vaguely unsettled by how the crash reporter code seems overly dependent on the precise details of the IA32 and linux calling conventions. It's great if your instincts are picking this up, as it shows you are developing an intuition about what makes for portable code and what doesn't. Stack dissection by its very nature is only made possible by leveraging intimate knowledge of the particular details of how the stack is laid out and how a function call operates, making a stack crawl platform-specific and unportable. Don't let this fluster you! Accept the inevitability of the presumptions you must make, but use good practice to avoid adding any unnecessary assumptions. For example, your program will absolutely embed a fixed assumption about the relative position of the saved ebp versus the saved eip that cannot be avoided or finessed away. But don't let that lull you into discarding all consideration of portability. There is no reason to use a number when you could use sizeof to determine the number of bytes, and if you needed to access a memory location two pointers away from another, advancing a void** by 2 would be preferable to hard-coding the constant 8.

Crash reporter

Note that the absolute addresses in the backtrace (printed inside square brackets) can vary depending on executable layout, but the function names and offsets are stable. Sanity check and grading tests verify that the names and offsets, ignoring the minor variations in absolute addresses. If you look closely during sanity check, you can see where it elides the low part of the address altogether before comparing the output.
You can force a signal in the buggy client program with a deliberate error (dereferencing NULL, divide by zero) or can fake a signal using raise, e.g., raise(SIGBUS). Unfortunately, raise is frameless, so does not cooperate with the expectation that it stores a saved ebp on the stack, and thus the stack behind a raise may be untraceable.
To test your handling of an untraceable stack, you need to deliberately corrupt a saved base pointer. Consider how that same insight you had earlier about orienting reporter on the stack can be used in the client program to locate (and destroy) the stack housekeeping.
Be conscious of the fragile state when the signal handler is operating and tread lightly! Do as much as you can in advance before any signal (e.g. symbol harvesting/sorting/prep) and leave only what is absolutely necessary for the signal handler (e.g. searching for exactly those symbols used in backtrace and displaying them). When the signal handler is called, the top of the stack has been patched up enough that you can proceed with making function calls. Calling stdlib functions who don't have dependencies on global/external state (e.g. lfind) is generally safe, but avoid calls to stdlib functions such as malloc/free that have their tentacles into areas that may be corrupted or unreliable after the signal.
When running the buggy program under the debugger, gdb will stop when a signal is raised. Don't try to step/next at this point-- (walking through code that raises a signal can confuse gdb) instead use continue to return the debugged program to executing and let control pass to your signal handler. To debug the code executed after the signal is raised, set a breakpoint on your signal handler and you can debug normally from there. For these deliberately buggy programs, you may find it convenient to change the default gdb behavior to not stop on signals using the handle command. For example handle SIGSEGV nostop tells gdb to not stop on segmentation faults.

Code quality

As always, working to unify common code and avoiding special cases both improves your code quality and reduces your testing load. For example, crash reporter needs a lookup operation to associate an address with a symbol name that is basically identical to the one used by namelist. The best designs will place this functionality into the symbols module so it can be shared rather than duplicated.
Knowing that addresses are represented as unsigned int sometimes leads to careless co-mingling of these types or haphazard coercing/casting between the two. In the ELF file, symbol addresses are stored as the type uintptr_t. This type is declared in <stdint.h> as an unsigned int guaranteed to have sufficient range to store any address. This type is intended for addresses that you intend to manipulate numerically. The uintptr_t type is a numeric type -- it is compatible with unsigned int (and other numeric types), but it is not a pointer type and cannot be dereferenced. The symbols module and namelist program manipulate symbol addresses solely as numbers and thus using uintptr_t type is completely appropriate. In crash reporter, a return address stored on the stack is a live address in the executing program-- you may wonder if void* would be the better type to use. Note that crash reporter does not dereference return addresses or otherwise treat them as pointers, they are just treated as numeric values, so uintptr_t is still the correct choice. Contrast this to crash reporter's use of the saved base pointer which is used as a true pointer and is dereferenced. It is therefore correct to declare and manipulate base pointers as pointer, not numeric, type. Starting with a thoughtful choice of whether to declare an address as numeric type versus pointer type will lead to fewer conversions later and cleaner code.
Do not throw typecasts around indiscriminately. Whenever possible, you should work within the type system, not against it. Choose the appropriate type for a value, declare it properly, and stick with that type. In the rare situations where you must overrule the declared type, the typecast can be your friend, but it is your worst enemy when used inappropriately. Make it a rule to stop and think twice before inserting a cast and justify to yourself why each cast is necessary and correct.

Testing

Thankfully, this assignment has a more moderate testing burden. Unlike some of the earlier assignments containing multi-branched code paths and diversity of inputs, namelist and crash reporter tend to have one unified path of operation, which simplifies testing. There are a few edge cases to consider, notably in terms of robustness, but otherwise, but once the code works for simple cases, it tends to scale up to stress cases without too many surprises.
The namelist program allows you to test your symbol-harvesting and address lookup in isolation. It will be much easier to find and fix bugs here than having to hunt them down in the more complex context of the crash reporter library.
One complicating factor for testing crash reporter is that you are deliberately working with buggy code and thus expect the program to have errors. However, if it crashes in the reporter code, you might accidentally dismiss the symptom, attributing it to the intentional error, and not realize the crash came from a flaw in your own code. Take special care to verify that the crash you are observing is the one you expected!
Be sure to develop and test on myth to avoid any surprises from platform-dependent discrepancies.
You may want to write additional test programs for crash reporter beyond the one buggy program we give in the starter. Follow these steps to add a new program to the project:
1. Create a myprogram.c file with the desired client program.
2. Use hg add myprogram.c to add the file to your repo.
3. Edit the Makefile to add myprogram to the names listed for PROGRAMS.
Now myprogram is a part of your project and make will build it.
The Makefile includes a target to build solution versions of your crash reporter test programs. A solution version is built from the same client program code but substitutes our sample implementation of libreporter for yours. For example, you can build buggy_soln (which uses our crash reporter) and compare its behavior to buggy (which uses your crash reporter). You can make the solution version of a single program by name, e.g. make buggy_soln, or use make soln to build solution versions of all programs named in the Makefile. Custom sanity check can also used to compare the output of a custom program with its solution version. Make the program and its solution, and then you can use that program as the executable in a custom sanity test. Pretty cool!

Program facts

Code size. Our solution has 160 lines (90 symbol.c, 20 namelist.c, 50 in reporter.c, about 50 of these lines given in starter). Student submissions have ranged from 141 to over 300 lines. This are counts of programming statement lines (not including blank lines, comments, #include).
Number of mercurial commits. Submissions ranged from a low of 1 to a high of 56, average of 9. More commits is positively associated with higher scores in grading.
Time spent. Student hours spent (self-reported):
- < 5 hours 2%
- 5-10 hours 17%
- 10-15 hours 41%
- 15-20 hours 27%
- 20-25 hours 9%
- more than 25 hours 3%
Points earned. The median assignment score has been about 83% of the total points.