Web

Lab 7: Compilation tools and executables

Lab sessions Mon Feb 23 to Thu Feb 26

Lab written by Julie Zelenski

Learning goals

After this lab, you should be able to:

describe the steps and tools used to build an executable from C code
interpret the symptoms of a build error and what to do to fix it
diagram the program address space

Find an open computer and somebody new to sit with. Introduce yourself and share war stories about your efforts to defuse your binary bomb.

Lab exercises

Get started. Clone the starter project using the command
```
hg clone /afs/ir/class/cs107/repos/lab7/shared lab7
```
This creates the lab7 directory which contains source files and a Makefile.

Pull up the online lab checkoff and have it open in a browser so you'll be able to jot down things as you go. At the end of the lab period, you will submit that sheet and have the TA check off your work.
Unix tools for dissecting object files. There are several Unix commands that can be used to poke around in compiled object files. Try out each of the commands listed below to learn how to use them and what information they provide. Each tool has a man page you can check into for further information.
- The strings command extracts printable character sequences from a file. When run on an object file, it will find string literals from the original C source and other character sequences. (The way this tool works is surprisingly simple--- it mostly just walks through the file and prints any byte sequence found that includes 4 or more printable characters in a row) Try strings on emacs, gcc, or your reassemble program to see what if finds. How does the output of strings relate to what you observe when opening the executable in a text editor?
- The size command lists the section sizes in an object file. The text section contains the compiled code, the data section contains initialized global/static data, the bss is uninitialized global/static data. You can experiment with declaring a large global array (initialized and not) and recompiling to see changes in the section sizes of the object file.
- readelf is a comprehensive tool for dissecting ELF object files (ELF = Executable and Linking Format used by our myth machines). readelf accepts many flags to depending on what information you wish to extract. readelf -e on an object file will dump the file header and section/program headers, which serve as a road map to the contents of the object file. This information is used by the OS loader to properly configure the address space of the new process when starting the executable.
- The Unix nm command prints the symbol table from an object file. This is a list of the functions and global variables referenced in the object file, giving the address, status, and segment (code, data, etc.) for each symbol. The symbol table can be removed from an object file using the strip command. If you invoke nm on a system executable like emacs, it reports "no symbols" because these executables have been stripped to save the space that would have been used for the symbol table.
- We think of gcc as the compiler, but technically it is a compiler driver. When you invoke gcc, it sequences together the necessary tools to do the build. Using the -v flag it runs verbosely so you can observe the preprocessor, compiler, assembler, and linker each getting a turn, and adding the -save-temps flag will leave the intermediate files behind so you can examine the transformation stage by stage. Our makefile is configured to build addrspace with these flags-- use make addrspace to see the full build process in action and poke around into the intermediate files.
  - (file.c -> file.i) The preprocessor cpp/cc1 does various text transformations on the source first. You can run just the preprocessor with gcc -E.
  - (file.i -> file.s) The compiler cc parses the C code and generates assembly for it. You can use gcc -S to stop after compiling and before assembling.
  - (file.s -> file.o) The assembler as converts assembly into an object file containing binary-encoded machine instructions. Running gcc -c stops after assembling but before before linking.
  - (file.o + file2.o + libs -> executable) The linker ld/collect combines multiple object files and any libraries into a executable file. Now you're ready to run the program!
The preprocessor cpp. As the first step of compilation, the preprocessor does a variety of text-based transformations such as:
- Removing comments and rearranging white space
- Handling #include by inserting the entire contents of the named file
- Expanding #define constants and macros
- Removing code based on the conditional compilation macros #ifdef/#if/#endif etc.
- Expanding various preprocessor-defined macros, __LINE__, __DATE__ etc.
Read through the pre.c file and predict what it would look like after preprocessing. Run just the preprocessor using gcc -E pre.c and verify you have the correct ideas.

Below is a list of a few of the things involving the preprocessor that can go wrong. Edit the pre.c file to create each of the problems listed below and try to build. If it fails to build, when is the problem detected and by which tool (preprocessor, compiler, linker)? Is it a hard error, a warning, or a total non-issue? If it builds despite the problem, does the program run correctly?
- typo in #define (e.g. #define MY_STRING "CS107 without closing quote)
- typo in #include (e.g. #include <dstio.h>)
- include the same header more than once
- missing type declaration (e.g. use FILE * without including stdio.h)
- missing function prototype (e.g. call qsort without including stdlib.h)
- missing constant define (e.g. use NULL without including stddef.h or any other header that includes stddef.h) Yes, it's true, NULL is not a C keyword, instead just a #define. What does NULL expand to after preprocessing?
- missing macro define (e.g. use assert without including assert.h) Note the difference between the consequences of using qsort without include stdlib.h and using assert without include assert.h. Why is it different?
Preprocessor macros. Given the pitfalls of macros, you should be wary of choosing to write code using macros. However, you may encounter macros in the code of others and it can useful to understand the mechanism and why macros can be problematic. Start by reviewing the code in macro.c.
- The code defines the one-argument macro SQUARED. (By convention, macros are often given uppercase names to serve as a reminder they are macros). One "feature" of macros is that they are type-less. Do you see how the macro can work for a variety of numeric inputs? When used on ints, it will do integer operations, and used on floats does floating point. But this lack of types can also lead to errors. What happens if you apply the macro to a string constant? How is the error reported? If you were to see this symptom, would it be obvious what the root cause was?
- This first attempt at the macro is sloppy and has several issues. The macro program invokes the macro in various ways to trigger its problems. Compile and run the program to see its output. Consider inputs #1 and #2. How can you change the macro to give a correct result for input #1? What about input #2? Fixing input #3 is more involved. The macro needs to evaluate its argument once and store in a variable which means committing to an argument of a certain type. This can be done, but the macro ends up practically a function after all, so what's the point?
- You can get most of the performance benefits of macros (and avoid all these drawbacks) via inline functions. The inline keyword recommends to the compiler that a function is a good candidate for being inlined (i.e. code incorporated directly into the calling function instead of operating as function call/return). Trying nm macro.o you'll find that squared doesn't even appear as a symbol and if you ask gdb to break at squared or disassemble squared, you'll discover the debugger knows nothing about it either. The code for squared was completely absorbed into the caller. Disassemble main and you'll see no setup up for function, no params, no call <squared>, but instead look for the instructions pased in from body of squared. Inlining avoids all the function call overhead (setting up stack, writing params, transfer control, return, etc.) at the cost of duplicating the instructions in the function's body at every calling site. This micro-optimization might be appropriate for a very small function that is called repeatedly on a performance-critical path. Adding the inline keyword is treated as "advisory" -- the compiler can disregard your advice and either inline what you didn't ask for or not inline what you did. In particular, gcc only inlines if compiling at optimization level -O2 or higher. You can examine the disassembly (now that you are a superb reader of IA32!) to find out what the compiler actually did.
Linking. The relationship between the compiler and linker is one of the more misunderstood aspects of the build process. The compiler operates on a single .c file at a time and produces an object file (also referred to as a .o file or relocatable file). A .o file contains the compiled version of the symbols defined in the .c file. To form a full program, one or more object file are linked with the system libraries. It is the linker that joins all the symbols from the various object files, resolves cross-module references, ties in the symbols from external libraries, and relocates addresses. A key task for the linker is resolving symbols-- ensuring there is at least one and no more than one definition for each symbol name. The linker detects exactly two kinds of errors-- undefined symbols and multiply-defined symbols.
- The linker reads the symbol table of an object file to get information on what symbols are defined in the module and what symbols are as-yet-unresolved. The nm utility shows the entries in the symbol table. Use nm addrspace.o to see the symbols in a relocatable file. You should note that the symbol addresses are all small-valued numbers, these numbers are offsets relative to start of this module. Now do nm addrspace. In a fully-linked executable, the addresses are much larger. During linking, each symbol is relocated to the final address that the symbol will occupy in the executing program's address space. Run gdb on addrspace and examine a few symbols with the gdb command info address symbolname to verify the addresses from symbol table match the executing program.
- It's important to know that the linker works solely on symbols and addresses, without type information. To the linker, any two symbols named binky "match", no matter if one version of binky is a function with five arguments that returns void, the other a function with no arguments that returns int, or even if the other binky is a variable! The files foo6.c and bar6.c contain the code from Problem 7.9 on page 694 of Bryant and O'Hallaron. Compile these two files together using gcc foo6.c bar6.c and note you get no complaints. Executing a.out prints 0x55, despite the variable main never being initialized in bar6. 0x55? Does this ring a distant bell from assignment 4 disassemble problem? Hmmmm.... What gives?
- Sections 7.6.2 and 7.10 of Bryant & O'Hallaron discusses the differences between linking with static libraries versus dynamic linking with shared libraries. By default, gcc produces a dynamic executable. Use the a.out created from foo/bar in the previous exercise to experiment with this. Invoke the size command to see its component sizes and run ldd command to find out which shared libraries it depends on. Now, rebuild using static linking: gcc -static foo6.c bar6.c. Use the size and ldd commands again. What is the percentage change in the executable size? What shared libraries does it now depend on?
Who detects what? One of the most important benefits of understanding the entire tool chain is that you are in a better position to know the right fix when you hit a build error. Below are a few common build errors. First, think through how you think each would be handled, then try making the error and building to verify your understanding is correct.
- You can't remember the right header for lfind but make a call to it anyway in your program. Does this compile? Does it link? Does it execute? Why or why not? What is affected if you decide to quiet the warning by adding your own prototype into your source? What if the prototype you hacked up is wrong -- you forget that the third argument is a size_t* and instead use a size_t in your prototype. How and when will you see a symptom of this mismatch? (As a rule, you should seek out the correct #include for a needed prototype instead of ignoring the warning or making your own)
- Your code makes calls to printf without #including the stdio.h header. Does this compile/link/execute? Why or why not? What if your code uses the global variable stdin from stdio.h without #including it? Does this compile/link/execute? Why or why not?
- Your code makes calls to math functions such as cos or sqrt without #including the math.h header or linking the math library Does it compile/link/execute? What changes if you only fix the #include? What changes if you only link the library?
- Your code invokes a macro such as assert and you don't #include the header file that defines it. Does this compile/link/execute?
- You include the header cvector.h and make calls to cvec_create and the other functions declared in the header file, but don't link with the module/library containing the CVector object code. Does this compile/link/execute?
Charting the address space. A program's address space is divided into segments: text (code), stack, global data, etc. The segments tend to be placed in predictable locations. Developing a feel for the address range used for each purpose can help you theorize about what has gone wrong when memory is out of whack. Run the addrspace program under gdb to answer these questions:
- Where are global variables being stored? What changes if they are marked static?
- Where are string constants placed? If the same string constant appears more than once in the code, are multiple distinct copies made or do they all refer to one shared copy? What happens if you write to a string constant?
- Where is your code positioned in memory? (i.e. find the address of main) What about the code for library functions such as printf? Are functions at the same locations for different runs of the same program? How do these addresses relate to the symbol addresses printed by nm?
- Edit the code to attempt to write to an address in the code segment. (i.e. try to write over the instructions for a function by casting the function pointer and dereferencing through it) What happens?
- Where does the stack start? Does it start at the same location for different runs of the same program? How big can the stack grow (do you remember from the stack lab?)
- Where is the heap located, i.e. addresses being returned by malloc? How big can the heap grow -- how many bytes can you allocate before malloc returns NULL?
- While you have the addrspace program stopped in gdb, use the gdb command info proc mapping to see the list of memory segments. Set a breakpoint at main and view the initial memory map, then view again executing the function that allocates gobs of heap memory. Can you identify which segment in the memory map contains the heap? Can you identify what each segment holds (e.g. either your code, library code, stack, heap, global data, etc.)?
Chart the address space, label segments and note where gaps occur. Given a troublesome address, you can use this chart to identify whether the address is located within the stack/heap/global/code, which is a helpful clue when tracking down the problem. Of the entire 4GB addressable region, about what percentage appears in use?
Optional extra diversion: binary hacking. The loader is responsible for running an executable file by starting the new process and configuring its address space. The code and data segments of the address space consist of data directly mapped in from the executable file. The executable file contains object code along with string constants, global data, and possibly symbol and debugging information. If you are very careful, there are ways to directly edit an executable file to change its runtime behavior, for example, by directly modifying data in the segments that will be mapped in. To be clear, there is rarely legitimate cause to do this, but we can play around with binary hacking to better understand the contents of the executable and its relationship to the executing program. If you open the binary file in emacs and invoke M-x hexl-mode, emacs will act as a raw hex editor. We're going to experiment with editing the addrspace executable file.
- One simple change is to edit string constants. Inserting characters into a string constant will cause havoc but it is ok to overwrite existing chars with different chars (including overwriting with null characters to make the string appear shorter). Edit the executable file to change the string that says "Hello world!n" to "Hello cs107!n". Run your hacked executable and see the change.
- Similarly, the values for initialized global variables are directly stored in the executable and can be edited. Find the location of the initial values for the extern and static global variables by searching the addrspace executable file for their distinctive values. Find and change the value for one of the variables in order to pass the test and "win". The same trick won't work to "double-win" but there is a different binary hack that can achieve it if you're determined, what might that be?
- Now that you're warmed up, what kind of binary hacking could you devise to suppress explosions from your binary bomb?

Just for fun followups

Reverse engineering has a number of frivolous, yet entertaining, uses. Check out the kcheat project that lets you forcibly manipulate values in an executing program. How does it work? It scans the address space looking for a particular value (e.g. the current value of your "strength") to identify the right location to poke a new value into to give you the strength you deserve.
Another web tidbit, this one on creating the smallest executable file by torturing the ELF format.
You gotta love a language that inspires competition to write the most hideous, unreadable code. The winners of the Obfuscated C Contest often abuse the preprocessor as one of their tricks. Look at hague-1986 or herrmann-2001 for particularly crazy examples! The Wikipedia entry on obfuscation gives more history and background on this crazy competition.
Another ode to the nature of C is found in the Underhanded C Contest which is about writing seemingly innocent, simple C code that secretly does something evil. That's C for ya!

Check off with TA

Before you leave, be sure to submit your checkoff sheet (in the browser) and have lab TA come by and confirm so you will be properly credited for lab If you don't finish everything before lab is over, we strongly encourage you to finish the remainder on your own!