Assignment 1: Reassemble

Due date: Mon Jan 19 11:59 pm - Hard deadline: Wed Jan 21 11:59 pm

Assignment by Julie Zelenski, with ideas borrowed from Rich Pattis (UC Irvine) and Owen Astrachan (Duke)

[Quick Links: Implementation details, Advice page, Grading]

Learning goals

Completing this assignment will jump-start your proficiency in:

  1. editing, compiling, testing, and debugging C programs under Unix
  2. writing code to manipulate C-strings
  3. using the C facilities for dynamic memory management (malloc & free)

The problem

You love The C Programming Language text so much that you own 10 copies of it, but your evil nemesis knows how dear it is to you and has taken a pair of scissors to every one of your copies and randomly cut each into thousands of pieces. Amidst your tears, you attempt to reassemble the fragments to recreate the original. Little did you know that solving this problem would prep you for a career in computational biology and international espionage.

For the assignment, you are to write a program that reads a input of text fragments and uses them to reconstruct the original document. The fragments were created by duplicating the original document many times over and chopping each copy into pieces. The process of reconstructing the original will rely on finding overlapping portions and aligning them to match.

This task is based on a technique used for genome shotgun sequencing in the Human Genome Project. Current sequencing technology can only sequence chunks of 500 or so base pairs, yet the human genome is billions of pairs long. Thus, the DNA to be sequenced is cloned several times and the clones are randomly cut into sequenceable-sized chunks. The reassembly step aligns those chunks to reconstruct the original strand. Nova had a fascinating show on Cracking the Code of Life about the race to decode the human genome. In addition to biology, a number of intelligence gathering tasks involve reassembly. The carpet weavers piecing together photos in Argo are based in truth, archivists are reassembling Stasi trash, and the team "All Your Shreds are Belong to U.S." (great name!) just won the DARPA-sponsored shredder challenge. All of these tasks have in common the process of matching fragments to piece them into a unified whole, just as you will do for this assignment.

Greedy match and merge

The program reads the collection of fragments and then processes the fragments in a series of rounds. A round does a single match/merge operation. To match, it searches the collection to find the pair of fragments with the longest-length overlap. To merge, the pair is aligned at their point of overlap and joined into a superstring that has the two original fragments as substrings. Each round decreases the total number of fragments by one and the process continues until only one fragment remains. In each round, it is the the pair of fragments with the longest-length overlap that is chosen for merging.

Consider a collection with fragments shown below (extra spaces were inserted between letters for clarity):

a l l   i s   w e l l
e l l   t h a t   e n
h a t   e n d
t   e n d s   w e l l

On the first round, the longest overlap found is a 6-character overlap between the middle two fragments when aligned as below:

e l l   t h a t   e n
          h a t   e n d

These two fragments would be removed and replaced with their merged result:

e l l   t h a t   e n d

The merged fragment is a candidate for future rounds; the two fragments it was composed from are discarded. On the next round, the longest overlap is 5 characters when these two fragments are aligned as below:

e l l   t h a t   e n d
              t   e n d s   w e l l

These fragments would be removed and replaced with their merged result:

e l l   t h a t   e n d s   w e l l

The last round merges the remaining two fragments aligned at their maximal overlap of 3 characters:

a l l   i s   w e l l
                e l l   t h a t   e n d s   w e l l

The one fragment remaining is the final result:

a l l   i s   w e l l   t h a t   e n d s   w e l l

A match is also possible when one fragment is completely contained within another. Consider:

s   w e l l   t  h  a
      e l l

The second fragment is entirely contained within the first. When these two fragments are merged, the result is simply the longer string.

Your program can break ties arbitrarily. If more than one pair have the same maximal length overlap, you can select any as the winner for that round. If a pair has two equally-maximal alignments (e.g. abxy and xyab can be aligned to merge to either abxyab or xyabxy), you can choose either. If a pair has no overlap, they are merged by concatenation, with your choice of order (e.g. ab and xy can merge to either abxy or xyab). Because of differences in how these ties are broken, some inputs can have more than one possible correct reconstruction. However, most fragment collections have such extensive redundancy that there is only one possible reassembly.

Note that all characters in the overlap must match exactly in sequence. Unlike DNA strands which have "noisy" data in which gaps and mutations which must be accommodated, the sequences in our simplified process are always "clean" and must match perfectly to be merged.

A little computer science theory aside. An optimal reassembly is equivalent to the shortest common superstring, a classic problem in theoretical computer science. Given a set of strings S, find the shortest string that contains all the strings in S as substrings. The shortest superstring program is NP-hard, and no efficient, polynomial-time solution is believed to exist. We are not asking your program to find an optimal result--- this difficult problem is much too computationally expensive. Instead, we approximate using a greedy strategy that finds a local maximum and optimistically pursues this locally optimal choice in the hopes that it will lead to the globally optimal result. This will find a common superstring, but it is not guaranteed to be the shortest.

Implementation details

Usage. The program can be invoked either with a single argument or no arguments. If an argument is given, it is the name of the fragment file. If the named file cannot be opened, the program responds with a helpful error message and exits. If no argument is given, the program reads fragments from the terminal (i.e. from stdin). This allows the user to type fragments at the command-line. The user signals the end of the fragments by typing control-d on a line by itself. If the fragments are successfully read (whether from file or terminal), the program reassembles and prints the merged result.

Fragment input format. A collection of fragments follows this format:

    {fragment 1}{fragment 2} {fragment 3} ... {fragment N}

Each fragment is a sequence of characters wrapped in curly braces. In between fragments (outside the braces), there can be arbitrary amounts of whitespace (tab, space, newline). The body of a fragment can contain any character other than braces. A fragment must contain at least 1 character and at most 1000 characters. A fragment file contains at most 20,000 fragments.

Malformed inputs. A robust program behaves reasonably even in the presence of incorrect inputs. Specific invalid inputs we may test on include a malformed fragment (i.e. doesn't fit the required format {chars}) or a fragment longer than the 1000-character maximum. Your program should detect these situations, report the problem to the user, and exit. An error message is expected to be specific, informative, and actionable. We will not test on files that exceed the maximum number of fragments, i.e., you may ignore the possibility of file containing more than 20,000 fragments. An input that consists of zero or one fragment reassembles to itself; these are considered valid inputs and should not result in an error.

Output conformance. The expected output is the correctly reassembled result with no other additional printing. Use our sanity check tool to verify that your output conforms to our specification. Your error messages are not required to exactly match the sample solution, but whatever wording you choose should be equally informative and appropriate. Feedback should be actionable; e.g. specific information that informs the user what/where/how to resolve the error.

Testing. We provide some fragment files and a sample executable that can be used to compare for expected output. You will want to create additional test inputs, especially small ones that allow you to focus on specific behaviors. Take advantage of the convenience of entering inputs via the terminal and use the custom tests feature of sanity check for tests you want to repeatedly run.

Memory. Any normal execution path is expected to run cleanly under valgrind with no memory errors nor leaks reported. We will not test exceptional/error cases under Valgrind. We also use Valgrind to verify your program's total memory usage is reasonable. Run the sample executable under Valgrind to get a baseline for comparison.

Efficiency. We have modest requirements for runtime efficiency. It is expected that each round re-examine all N2 pairings to find the maximal overlap, and it will necessarily bog down on large fragment files. You are not asked to devise anything clever to avoid this. Our sample solution takes the straightforward approach and can be used as a benchmark of what constitutes reasonable performance. Run both your program and the sample on the same input to verify you are in the ballpark. If your program runs noticeably more slowly than the sample, then you have a design/algorithmic weakness to address.

Design/style/readability. From CS106, you should be well-versed in good coding practice and we expect your solutions to show thoughtful design and attention to readability. You know the drill: decompose into manageable functions, comment where appropriate, factor common code to avoid duplication, no global variables, strive for readability in all ways, and so on. See Nick Parlante's stolen prose on landmarks in coding quality.

Don't miss our companion page of advice and suggestions specific to this assignment!

Grading

Please read our background information on how we grade assignments to acquaint yourself with the grading system used in CS107. Below is the tentative grading rubric for this assignment. The automated tests are scored in points; the code review is graded using buckets to stress the qualitative features of the review over the quantitative.

Functionality (105 points)

Code review (buckets weighted to contribute ~20 points) Here we will read and evaluate your code in areas such as:

Getting started

The assign1 project contains the source file reassemble.c and a Makefile. These files are distributed as a Mercurial repository. Get the starter project by cloning your cs107 repository using the command

hg clone /afs/ir/class/cs107/repos/assign1/$USER assign1

The $USER shell variable automatically expands to your sunet id. If there is no repo for your sunet, this means you were not registered when the repos were distributed, please register asap and send email to cs107@cs asking us to create your repo. There is a public 'guest' repo that can be accessed by auditors, the guest repo is read-only and does not have submit privileges.

You are to add your code in reassemble.c. You should not need to edit our starter Makefile. Given the small scope of this project, maintaining all code in one file is perfectly appropriate, do not divide the code across multiple files.

The directory samples/assign1 in our class directory contains a sample executable and fragment files. You can access these via their full path (e.g. /afs/ir/class/cs107/samples/assign1/reassemble_soln), but it is a bit unwieldy. Your repo contains slink, a symbolic link to the directory, which you can use to more easily refer to it (e.g. slink/reassemble_soln). Note that the samples directory is shared and read-only, so you will not be able to add or edit files within that directory.

Finishing

Your final step is to submit your finished code for grading. We recommend you do a trial submit well in advance of the deadline to familiarize yourself with the process and allow time to work through any snags. You can replace that submission with a subsequent one if desired.

Note that this first assignment can be turned in at most 2 days late. We move up the hard deadline to force you to conserve some grace days for later and allow us to move quickly on grading and returning feedback on the first assignment. Plan on submitting what you have by the hard deadline, absolutely no assignments will be accepted any later.

Break a vase, and the love that reassembles the fragments is stronger than that love which took its symmetry for granted when it was whole.
---Derek Walcott