Web

Assignment 1 advice

Written by Julie Zelenski

Welcome to the assignment advice page!

We use the assignment writeup to lay out the assigned task, detail the implementation requirements, and explain how submissions will be graded. In this companion advice page, we offer recommendations and hints, as well as address common questions. Whereas the specifications given the writeup are non-negotiable, the advice page is intended for ideas and suggestions that you can incorporate as you see fit. We strongly encourage you to take advantage of what we share here--- we know a lot about how to keep you on track for a great outcome!

As our first recommendation, review our best practices for completing cs107 assignments for broadly applicable advice for tackling any assignment!

File-reading

As your welcome-to-cs107 gift, the starter provides a first-pass at the code to read the fragments into an array. The code as given will work for correctly formatted inputs, but you need to modify/extend it to handle the robustness cases. You can rearrange this code in whatever way you deem best, even discarding it entirely to re-write in your own style.
Anytime you are scanning a string, you must take care to ensure that number of characters read is constrained to fit into the buffer being used to store it. For example, the format %[A-Za-z] matches any sequence of one or more alphabetic letters. The format specifies no field width and reads without bound, stopping at the first non-letter or EOF. If the sequence of letters is longer than expected, the read will overflow the buffer and cause data corruption or a crash. To avoid such sadness, you specify a field width that constrains the maximum number of characters to read. For a buffer of length 16, you should stop after at most 15 letters, leaving one slot for the null terminator. The corrected format including a field width is %15[A-Za-z] which stops at the first non-letter, EOF, or after reading 15 characters, whichever comes first. Don't be lulled into thinking a format without a field width is ok as long as you declare a "huge" enough buffer. The only way to truly be sure is to take action to stop reading when full. One of the most common vulnerabilities exploited by malware is the buffer overflow which deliberately feeds excessively long input into program that doesn't take the necessary precautions to prevent overflow. Don't let your program be one that succumbs to this attack!
One sad fact is that the field width value you need (1 smaller than your buffer size) cannot easily depend on the variable declaration or a shared constant, so typically you use a raw number in the format string. This is fine with us. There are some slightly goopy ways to tie the two together, but none of them are especially obvious or pleasing.
There is a lot of functionality wedged into printf/scanf family of functions. Now is a great time to familiarize yourself with what's available, but don't bother memorizing the minutiae. My copy of K&R has permanent bookmarks on the format specifications so I can look up details on a need-to-know basis.
When adding error handling into file reading, you'll find it easiest to directly report and halt at the point of detection. Attempting to pass out codes and information to be used further back up the chain is less convenient and messier. See handling exceptional conditions below.
The <stdio.h> header declares stdin for the already-opened FILE* connected to the terminal. Code that reads fragments from an ordinary FILE* is equally capable of reading from stdin without changes -- you will not need to make a special case for either. Since you can only move forward through stdin (i.e must start at beginning and read to end once with no ability to jump ahead or rewind), this means your code must read the fragments in a single linear pass over the input.
When running the program under gdb, choosing to enter fragments at the terminal makes it confusing to keep track of where your input is going (i.e. typing gdb commands versus fragments) so you may find it easier to read from a fragment file when debugging.

String-handling

Dynamically-allocated C-strings should be the proper size, e.g. the string "apple" needs 6 characters worth of storage, no more, no less. One common idiom to learn is the practice of allocating a maximum-sized stack buffer to temporarily store a string of unknown length, which is later copied to a dynamically-allocated buffer of the correct size once the length is known.
Learn and use the functions from <string.h>. Read the man pages or your C reference to acquaint yourself with strcmp/strncmp, strstr, strcpy/strncpy, and strcat/strncat. These functions for comparing and copying strings will be very useful; there is no reason to re-invent the wheel by writing manual loops to do these tasks. A program that makes appropriate use of the strxxx functions won't access any individual chars within the strings at all!

Arrays

Relying on the stated upper-bound on the total number of fragments allows you to declare a maximum-sized array on the stack to hold the fragment pointers, which is an easy and efficient storage solution. Most files will only use a small portion of that array, but given that stack storage is cheap and plentiful, the excess is of no consequence.
Each time you merge a pair of fragments, the array will contain one fewer entry. Do not attempt to re-size the array, allow it to remain oversized and continue to ignore the excess. It is a much better design to maintain the current set of fragments in the first K consecutive slots of the array, rather than have to pick your way through the entire array to find the K in-use entries scattered among many unused ones. This choice will result in cleaner code and more efficient access to the array.

Algorithms and program design

One key coding challenge is the find-max-overlap function that operates on a pair of strings. One possible design is a fancy all-purpose routine that is internally complex, but easy to use. Alternatively, you can simplify the function's implementation at the cost of moving complexity to the calling site. Some questions to ask yourself:

Given two strings A and B, does the function consider alignment in both orders (A-B and B-A)? Would it help or hurt if you required the function to be called in such a way that only one order has to be examined?
Would it help if the function were guaranteed that the first argument was the shorter string?
Does the function need to consider both overlapping and containing matches or can it assume that overlapping matches have already been pruned?
Given there can be many possible alignments to try for a given pair, is there an efficiency or design advantage to trying the possibilities in a particular order?
You only need the length of the overlap for each pair to compare them when determining the max overlap. But at the end of the round, merging needs the position and order of the alignment. How can you avoid repeating code/work between find-max-overlap and merge-overlap?

There is no one "right" answer to these questions; just as there is no one "best" design. But there are better and worse approaches and you should be thoughtful and deliberate about the trade-offs that come with the choices you make.

The find-max-overlap function will require more testing that any other part of the code. Take care to brainstorm possibilities and test with a wide variety of inputs to ensure that all ordinary and edge cases are properly handled. Once you have this function working correctly and efficiently, you're at the home stretch.

Efficiency

A round operates on the collection of N fragments, considers all N² pairs, computes the max overlap for each pair, finds the largest such overlap, and finally merges the maximal pair. The process then repeats; the next round operates on the N-1 fragments remaining. Almost all of those fragments are unchanged from the previous round, so there is a lot of re-computation in figuring the max overlap. We do not want you to do anything clever to avoid this. A round retains no memory of past rounds and requires a complete re-computation of the max overlap for all pairs.

However, given you will make N² calls to the find-max-overlap function per round, any inefficiency within this function can dramatically decrease program performance. Making find-max-overlap reasonably zippy is not about hacking and zealotry, but does depend on common sense. The string.h library has highly-efficient functions for string comparison and you should absolutely use them. However, be sure to choose the right function(s) and only consider valid alignments (e.g. a 2-char fragment can have an overlapping, non-containing match to a 1000-char fragment in just a few places, not 1000). Scrutinize high-traffic inner loops to identify if there are any expensive computations that can be hoisted out. For example, strlen runs in linear time to count characters, if the string isn't changing inside the loop, calculating the length once outside the loop avoids that unnecessary recalculation. Avoiding duplicate calls within in a function is generally worthwhile, don't take this advice to some crazy extreme like calculating each string length just once and then storing/passing it around with the string throughout the program -- such redundancy creates opportunity for error in synchronization, eats up memory, and complicates the code, and sadly, won't even likely give you the benefit you seek! When working to improve performance, be sure your attention is focused on where changes that are truly beneficial-- it is not that case that every call to strlen needs to be avoided or that every use of malloc is problematic!

Run our sample executable to gauge the target efficiency, especially noting the time required for the large sample files. Our code doesn't cheat, it does the full re-computation (without caching any previous results) and employs no trickery. If your performance is in the same ballpark as ours (say within a factor of 2 or 3), you're golden. But if your program is noticeably slower, it indicate you have some unnecessary or redundant work that should be eliminated. Evaluate your memory efficiency similarly. Run the sample executable under Valgrind to see its memory footprint, and aim to be within a factor or 2 or 3. When considering tradeoffs of space versus time, the sample executable shows you the balance we are looking for.

Handling exceptional conditions

There are several error in usage and robustness you are required to detect and gracefully handle for this program. For fatal conditions, the appropriate handling is to report the error to the user and exit. An error message must be accurate and actionable, giving the user sufficient information to understand the problem and what is needed to resolve it. The sample solution models appropriate handling and feedback on errors. You are not required to match the same wording as the sample, but doing so is a one way to ensure that your messages will be judged as sufficient.

We do not require recovery, memory deallocation, or other cleanup from fatal errors. When an error surfaces mid-execution, it can be rather convoluted to manually route control through exactly and only the deallocation steps appropriate from that context, so we don't ask for it and nor will we test runs that lead to fatal error conditions under Valgrind. Just halt the program by calling exit(1) and you're done. The minor convenience function error() function combines error-reporting with exit if you'd rather. This is a nonstandard-C GNU extension; you can learn more from its man page.

Incremental development and testing

Some suggested coding milestones and testing strategies.

Read fragments from a well-formed input. The starter code is a good start on handling well-formed inputs, but be sure to take the time to figure out what it does and doesn't do. Start with tiny inputs that you can hand-verify. For larger files, you could print out the fragments read and review what you read to verify everything matches. Use valgrind to verify that your use of memory isn't raising any flags.
Malformed inputs. Adding robustness handling is a fussy little job, so you may find it more satisfying to jump ahead to the meat of align-merge and come back to this later. For robustness, you will want to cook up a suite of ill-formed input files and update read_frag to correctly detect/report each type of malformation. Be sure to verify that correct file-reading doesn't regress along the way.
Find-max-overlap. Design and write the function that reports the maximal overlap for a given pair. There are various strategies for testing this function in isolation: making calls in gdb, add calls from main, applying to pairs read from the file-reading code, etc. Whatever you do, don't shortchange testing this critical function.
Merge-overlap. The function that constructs the merged result for a pair can be tested similarly, ensuring that all variants of pairs can be correctly merged in all situations. Be sure Valgrind has no complaints, too.
Do a round. Once find and merge are solid, it is straightforward to add code to try all pairs and merge the max.
Do all rounds. A loop around the previous step and now you have a program able to compare its output to the sample solution. First test on simple cases such as sanity check and hand-constructed inputs that are designed to exercise specific code paths. Work your way up to the large stress cases.
Finalize. Find and clean up memory leaks using valgrind. Re-read the specification and re-test to make sure you cover all the required functionality. Review your code to make final improvements in quality. Reward yourself with an entire pint of ice cream!

A note about output variations. Since you can break ties arbitrarily, there can be more than one correct input for some inputs. For example, given the fragments ['ax', 'xb', 'xx'] each pair has a max-overlap of length 1, thus any of those pairs can be chosen to merge in the first round. If you first merge 'xx' with either 'ax' or 'xb', your final result with be 'axxb'. If you first merge 'ax' with 'xb' then the output can be either 'xxaxb' or 'axbxx'. All of these are considered correct outputs. When comparing your output to that of the sample solution, be aware of these possible (harmless) discrepancies.

Program facts

Some facts that might help you gauge what you're in for with this assignment.

Code size. Our solution has 92 lines, student submissions have ranged from 70 to almost 300 lines. These are counts of programming statement lines (not including blank lines, comments, #include). Our solution is decomposed into 11 functions.
Number of mercurial commits. Submissions have an average of 10 commits. Note that larger number of commits is positively associated with higher scores in grading.
Time spent. Student hours spent (self-reported):
- < 5 hours 7%
- 5-10 hours 26%
- 10-15 hours 26%
- 15-20 hours 23%
- 20-25 hours 9%
- more than 25 9%
Points earned. Median assignment score has generally been around ~90% of the total points (95/105).

Frequently asked questions about assign1

How do I measure the runtime of a program?

For a rough estimate, your watch will do, but for more accuracy, the unix time command will report on the time used by a executing program, e.g. time ./reassemble slink/preamble_frags

I have an idea about how to make my program run faster/use less memory. Should I pursue it?

Performance will be prized in later assignments, but for the early assignments we strongly discourage heroic efforts to improve performance that increase complexity, compromise style, or add invalid shortcuts that put correctness at risk. Grading reserves only a few points for matching the efficiency of the sample, and there is no bonus for outperforming. That said, we know of a few clever ideas that provide a satisfying boost for delightfully simple tweaks. We didn't implement them in our solution so as to retain a very achievable baseline, but you might find it fun to show off by finding one of the cute ways to outdo us.

Will I be penalized if I choose to ignore the advice page and do things my own way?

Not necessarily. The suggestions here are worth considering, as they are intended to guide you toward choices that we know lead to better outcomes, but if your alternative approach also leads to an elegant solution that meets the requirements, great! Before going renegade, you may want to run your ideas by us to vet that they won't lead you astray or contradict the requirements.