Written by Julie Zelenski
Welcome to the assignment advice page!
We use the assignment writeup to lay out the assigned task, detail the implementation requirements, and explain how submissions will be graded. In this companion advice page, we offer recommendations and hints, as well as address common questions. Whereas the specifications given the writeup are non-negotiable, the advice page is intended for ideas and suggestions that you can incorporate as you see fit. We strongly encourage you to take advantage of what we share here--- we know a lot about how to keep you on track for a great outcome!
As our first recommendation, review our best practices for completing cs107 assignments for broadly applicable advice for tackling any assignment!
%[A-Za-z] matches any sequence of one or more alphabetic letters. The format specifies no field width and reads without bound, stopping at the first non-letter or EOF. If the sequence of letters is longer than expected, the read will overflow the buffer and cause data corruption or a crash. To avoid such sadness, you specify a field width that constrains the maximum number of characters to read. For a buffer of length 16, you should stop after at most 15 letters, leaving one slot for the null terminator. The corrected format including a field width is
%15[A-Za-z] which stops at the first non-letter, EOF, or after reading 15 characters, whichever comes first. Don't be lulled into thinking a format without a field width is ok as long as you declare a "huge" enough buffer. The only way to truly be sure is to take action to stop reading when full. One of the most common vulnerabilities exploited by malware is the buffer overflow which deliberately feeds excessively long input into program that doesn't take the necessary precautions to prevent overflow. Don't let your program be one that succumbs to this attack!
stdinfor the already-opened
FILE*connected to the terminal. Code that reads fragments from an ordinary
FILE*is equally capable of reading from
stdinwithout changes -- you will not need to make a special case for either. Since you can only move forward through
stdin(i.e must start at beginning and read to end once with no ability to jump ahead or rewind), this means your code must read the fragments in a single linear pass over the input.
<string.h>. Read the man pages or your C reference to acquaint yourself with strcmp/strncmp, strstr, strcpy/strncpy, and strcat/strncat. These functions for comparing and copying strings will be very useful; there is no reason to re-invent the wheel by writing manual loops to do these tasks. A program that makes appropriate use of the strxxx functions won't access any individual chars within the strings at all!
One key coding challenge is the find-max-overlap function that operates on a pair of strings. One possible design is a fancy all-purpose routine that is internally complex, but easy to use. Alternatively, you can simplify the function's implementation at the cost of moving complexity to the calling site. Some questions to ask yourself:
There is no one "right" answer to these questions; just as there is no one "best" design. But there are better and worse approaches and you should be thoughtful and deliberate about the trade-offs that come with the choices you make.
The find-max-overlap function will require more testing that any other part of the code. Take care to brainstorm possibilities and test with a wide variety of inputs to ensure that all ordinary and edge cases are properly handled. Once you have this function working correctly and efficiently, you're at the home stretch.
A round operates on the collection of N fragments, considers all N2 pairs, computes the max overlap for each pair, finds the largest such overlap, and finally merges the maximal pair. The process then repeats; the next round operates on the N-1 fragments remaining. Almost all of those fragments are unchanged from the previous round, so there is a lot of re-computation in figuring the max overlap. We do not want you to do anything clever to avoid this. A round retains no memory of past rounds and requires a complete re-computation of the max overlap for all pairs.
However, given you will make N2 calls to the find-max-overlap function per round, any inefficiency within this function can dramatically decrease program performance. Making find-max-overlap reasonably zippy is not about hacking and zealotry, but does depend on common sense. The string.h library has highly-efficient functions for string comparison and you should absolutely use them. However, be sure to choose the right function(s) and only consider valid alignments (e.g. a 2-char fragment can have an overlapping, non-containing match to a 1000-char fragment in just a few places, not 1000). Scrutinize high-traffic inner loops to identify if there are any expensive computations that can be hoisted out. For example, strlen runs in linear time to count characters, if the string isn't changing inside the loop, calculating the length once outside the loop avoids that unnecessary recalculation. Avoiding duplicate calls within in a function is generally worthwhile, don't take this advice to some crazy extreme like calculating each string length just once and then storing/passing it around with the string throughout the program -- such redundancy creates opportunity for error in synchronization, eats up memory, and complicates the code, and sadly, won't even likely give you the benefit you seek! When working to improve performance, be sure your attention is focused on where changes that are truly beneficial-- it is not that case that every call to strlen needs to be avoided or that every use of malloc is problematic!
Run our sample executable to gauge the target efficiency, especially noting the time required for the large sample files. Our code doesn't cheat, it does the full re-computation (without caching any previous results) and employs no trickery. If your performance is in the same ballpark as ours (say within a factor of 2 or 3), you're golden. But if your program is noticeably slower, it indicate you have some unnecessary or redundant work that should be eliminated. Evaluate your memory efficiency similarly. Run the sample executable under Valgrind to see its memory footprint, and aim to be within a factor or 2 or 3. When considering tradeoffs of space versus time, the sample executable shows you the balance we are looking for.
There are several error in usage and robustness you are required to detect and gracefully handle for this program. For fatal conditions, the appropriate handling is to report the error to the user and exit. An error message must be accurate and actionable, giving the user sufficient information to understand the problem and what is needed to resolve it. The sample solution models appropriate handling and feedback on errors. You are not required to match the same wording as the sample, but doing so is a one way to ensure that your messages will be judged as sufficient.
We do not require recovery, memory deallocation, or other cleanup from fatal errors. When an error surfaces mid-execution, it can be rather convoluted to manually route control through exactly and only the deallocation steps appropriate from that context, so we don't ask for it and nor will we test runs that lead to fatal error conditions under Valgrind. Just halt the program by calling
exit(1) and you're done. The minor convenience function
error() function combines error-reporting with exit if you'd rather. This is a nonstandard-C GNU extension; you can learn more from its man page.
Some suggested coding milestones and testing strategies.
A note about output variations. Since you can break ties arbitrarily, there can be more than one correct input for some inputs. For example, given the fragments ['ax', 'xb', 'xx'] each pair has a max-overlap of length 1, thus any of those pairs can be chosen to merge in the first round. If you first merge 'xx' with either 'ax' or 'xb', your final result with be 'axxb'. If you first merge 'ax' with 'xb' then the output can be either 'xxaxb' or 'axbxx'. All of these are considered correct outputs. When comparing your output to that of the sample solution, be aware of these possible (harmless) discrepancies.
Some facts that might help you gauge what you're in for with this assignment.
Time spent. Student hours spent (self-reported):
Points earned. Median assignment score has generally been around ~90% of the total points (95/105).
For a rough estimate, your watch will do, but for more accuracy, the unix
time command will report on the time used by a executing program, e.g.
time ./reassemble slink/preamble_frags
Performance will be prized in later assignments, but for the early assignments we strongly discourage heroic efforts to improve performance that increase complexity, compromise style, or add invalid shortcuts that put correctness at risk. Grading reserves only a few points for matching the efficiency of the sample, and there is no bonus for outperforming. That said, we know of a few clever ideas that provide a satisfying boost for delightfully simple tweaks. We didn't implement them in our solution so as to retain a very achievable baseline, but you might find it fun to show off by finding one of the cute ways to outdo us.
Not necessarily. The suggestions here are worth considering, as they are intended to guide you toward choices that we know lead to better outcomes, but if your alternative approach also leads to an elegant solution that meets the requirements, great! Before going renegade, you may want to run your ideas by us to vet that they won't lead you astray or contradict the requirements.