Assignment 1 advice

Written by Julie Zelenski

Welcome to the assignment advice page!

We use the assignment writeup to lay out the assigned task, detail the implementation requirements, and explain how submissions will be graded. In this companion advice page, we offer recommendations and hints, as well as address common questions. Whereas the specifications given the writeup are non-negotiable, the advice page is intended for ideas and suggestions that you can incorporate as you see fit. We strongly encourage you to take advantage of what we share here--- we know a lot about how to keep you on track for a great outcome!

As our first recommendation, review our best practices for completing cs107 assignments for broadly applicable advice for tackling any assignment!

File-reading

String-handling

Arrays

Algorithms and program design

One key coding challenge is the find-max-overlap function that operates on a pair of strings. One possible design is a fancy all-purpose routine that is internally complex, but easy to use. Alternatively, you can simplify the function's implementation at the cost of moving complexity to the calling site. Some questions to ask yourself:

There is no one "right" answer to these questions; just as there is no one "best" design. But there are better and worse approaches and you should be thoughtful and deliberate about the trade-offs that come with the choices you make.

The find-max-overlap function will require more testing that any other part of the code. Take care to brainstorm possibilities and test with a wide variety of inputs to ensure that all ordinary and edge cases are properly handled. Once you have this function working correctly and efficiently, you're at the home stretch.

Efficiency

A round operates on the collection of N fragments, considers all N2 pairs, computes the max overlap for each pair, finds the largest such overlap, and finally merges the maximal pair. The process then repeats; the next round operates on the N-1 fragments remaining. Almost all of those fragments are unchanged from the previous round, so there is a lot of re-computation in figuring the max overlap. We do not want you to do anything clever to avoid this. A round retains no memory of past rounds and requires a complete re-computation of the max overlap for all pairs.

However, given you will make N2 calls to the find-max-overlap function per round, any inefficiency within this function can dramatically decrease program performance. Making find-max-overlap reasonably zippy is not about hacking and zealotry, but does depend on common sense. The string.h library has highly-efficient functions for string comparison and you should absolutely use them. However, be sure to choose the right function(s) and only consider valid alignments (e.g. a 2-char fragment can have an overlapping, non-containing match to a 1000-char fragment in just a few places, not 1000). Scrutinize high-traffic inner loops to identify if there are any expensive computations that can be hoisted out. For example, strlen runs in linear time to count characters, if the string isn't changing inside the loop, calculating the length once outside the loop avoids that unnecessary recalculation. Avoiding duplicate calls within in a function is generally worthwhile, don't take this advice to some crazy extreme like calculating each string length just once and then storing/passing it around with the string throughout the program -- such redundancy creates opportunity for error in synchronization, eats up memory, and complicates the code, and sadly, won't even likely give you the benefit you seek! When working to improve performance, be sure your attention is focused on where changes that are truly beneficial-- it is not that case that every call to strlen needs to be avoided or that every use of malloc is problematic!

Run our sample executable to gauge the target efficiency, especially noting the time required for the large sample files. Our code doesn't cheat, it does the full re-computation (without caching any previous results) and employs no trickery. If your performance is in the same ballpark as ours (say within a factor of 2 or 3), you're golden. But if your program is noticeably slower, it indicate you have some unnecessary or redundant work that should be eliminated. Evaluate your memory efficiency similarly. Run the sample executable under Valgrind to see its memory footprint, and aim to be within a factor or 2 or 3. When considering tradeoffs of space versus time, the sample executable shows you the balance we are looking for.

Handling exceptional conditions

There are several error in usage and robustness you are required to detect and gracefully handle for this program. For fatal conditions, the appropriate handling is to report the error to the user and exit. An error message must be accurate and actionable, giving the user sufficient information to understand the problem and what is needed to resolve it. The sample solution models appropriate handling and feedback on errors. You are not required to match the same wording as the sample, but doing so is a one way to ensure that your messages will be judged as sufficient.

We do not require recovery, memory deallocation, or other cleanup from fatal errors. When an error surfaces mid-execution, it can be rather convoluted to manually route control through exactly and only the deallocation steps appropriate from that context, so we don't ask for it and nor will we test runs that lead to fatal error conditions under Valgrind. Just halt the program by calling exit(1) and you're done. The minor convenience function error() function combines error-reporting with exit if you'd rather. This is a nonstandard-C GNU extension; you can learn more from its man page.

Incremental development and testing

Some suggested coding milestones and testing strategies.

  1. Read fragments from a well-formed input. The starter code is a good start on handling well-formed inputs, but be sure to take the time to figure out what it does and doesn't do. Start with tiny inputs that you can hand-verify. For larger files, you could print out the fragments read and review what you read to verify everything matches. Use valgrind to verify that your use of memory isn't raising any flags.
  2. Malformed inputs. Adding robustness handling is a fussy little job, so you may find it more satisfying to jump ahead to the meat of align-merge and come back to this later. For robustness, you will want to cook up a suite of ill-formed input files and update read_frag to correctly detect/report each type of malformation. Be sure to verify that correct file-reading doesn't regress along the way.
  3. Find-max-overlap. Design and write the function that reports the maximal overlap for a given pair. There are various strategies for testing this function in isolation: making calls in gdb, add calls from main, applying to pairs read from the file-reading code, etc. Whatever you do, don't shortchange testing this critical function.
  4. Merge-overlap. The function that constructs the merged result for a pair can be tested similarly, ensuring that all variants of pairs can be correctly merged in all situations. Be sure Valgrind has no complaints, too.
  5. Do a round. Once find and merge are solid, it is straightforward to add code to try all pairs and merge the max.
  6. Do all rounds. A loop around the previous step and now you have a program able to compare its output to the sample solution. First test on simple cases such as sanity check and hand-constructed inputs that are designed to exercise specific code paths. Work your way up to the large stress cases.
  7. Finalize. Find and clean up memory leaks using valgrind. Re-read the specification and re-test to make sure you cover all the required functionality. Review your code to make final improvements in quality. Reward yourself with an entire pint of ice cream!

A note about output variations. Since you can break ties arbitrarily, there can be more than one correct input for some inputs. For example, given the fragments ['ax', 'xb', 'xx'] each pair has a max-overlap of length 1, thus any of those pairs can be chosen to merge in the first round. If you first merge 'xx' with either 'ax' or 'xb', your final result with be 'axxb'. If you first merge 'ax' with 'xb' then the output can be either 'xxaxb' or 'axbxx'. All of these are considered correct outputs. When comparing your output to that of the sample solution, be aware of these possible (harmless) discrepancies.

Program facts

Some facts that might help you gauge what you're in for with this assignment.

Frequently asked questions about assign1

How do I measure the runtime of a program?

For a rough estimate, your watch will do, but for more accuracy, the unix time command will report on the time used by a executing program, e.g. time ./reassemble slink/preamble_frags

I have an idea about how to make my program run faster/use less memory. Should I pursue it?

Performance will be prized in later assignments, but for the early assignments we strongly discourage heroic efforts to improve performance that increase complexity, compromise style, or add invalid shortcuts that put correctness at risk. Grading reserves only a few points for matching the efficiency of the sample, and there is no bonus for outperforming. That said, we know of a few clever ideas that provide a satisfying boost for delightfully simple tweaks. We didn't implement them in our solution so as to retain a very achievable baseline, but you might find it fun to show off by finding one of the cute ways to outdo us.

Will I be penalized if I choose to ignore the advice page and do things my own way?

Not necessarily. The suggestions here are worth considering, as they are intended to guide you toward choices that we know lead to better outcomes, but if your alternative approach also leads to an elegant solution that meets the requirements, great! Before going renegade, you may want to run your ideas by us to vet that they won't lead you astray or contradict the requirements.