
Assignment 4: RSS News Feed Aggregator

Virtually all major newspapers and TV news stations have bought into this whole Internet thing. What you may not know is that all major media corporations serve up RSS feeds summarizing the news stories that’ve aired or gone to press in the preceding 24 hours. RSS news feeds are XML documents with information about online news articles. If we can get the feeds, we can get the articles, and if we can get the articles, we can build an index of information similar to that held by news.google.com.  That’s precisely what you’ll be doing for Assignment 4.

Assignment 4 is probably the largest of the quarter in terms of the amount of code you need to write, so don't dilly dally. :)  In previous quarters, it's been two assignments instead of one, but the first of the two was fairly straightforward and stole time away from the more demanding followup.  The first half or so of the assignment will make sure you gracefully manage the transition from C to C++ while gaining some experience with C++ containers like the map and the vector, C++11 threads, mutexes, condition_variable_anys, and semaphores.  The latter part of the assignment will revisit some of the initially imposed design decisions and force you to be much more conservative about the number of actual threads you create over the program's lifetime.

Due: Monday, May 18th at 11:59 p.m.

Indexing a news article amounts to little more than breaking the story down into its individual words and noting how often each word appears. If a particular word appears a good number of times, then said word is probably a good indicator as to what the web page is all about. Once all of the world’s online RSS news feeds have been indexed, you can ask the index for a list of stories about a specific person, place, or thing.

If, for instance, you’re curious about what's up with all things presidential, you can query your friendly news index, and it’s sure to come back with something informative:

    myth22> ./aggregate --verbose --url large-feeds.xml

    // lots of logging output omitted

    Enter a search term [or just hit <enter> to quit]: president

    That term appears in 33 articles. Here are the top 15 of them:

       1.) "Obama’s Nike stop puts focus on outsourcing, labor standards" [appears 6 times]

           "http://www.seattletimes.com/business/international-trade/obamas-n.....ards/"

       2.) "Obama taps two for top spots on Joint Chiefs of Staff" [appears 4 times]

           "http://www.dispatch.com/content/stories/national_world/2015/05/06......html"

       3.) "Williams, Tufano to speak at Delco commencements" [appears 3 times].

           "http://www.philly.com/r?19=961&43=165761&44=302860641&32=3796&7=1......html"

       4.) "U.S., Saudis call for 5-day cease-fire in Yemen to let in aid" [appears 3 times].

           "http://www.seattletimes.com/nation-world/u-s-saudis-call-for-5-da.....-aid/"

       5.) "Aliens? No, just a mirage of Chicago skyline over Lake Michigan" [appears 2 times].

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       6.) "Chicago Teachers Union files labor complaint against school board" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       7.) "ALS ice sculpture exhibit opens Thursday in Chicago" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       8.) "3 men wounded in shootings on South Side" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       9.) "Warmest day yet of 2015 may break record" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

      10.) "WorldViews: On Election Day in Britain, publishing the results of..... jail" [appears 2 times]

           "http://feeds.washingtonpost.com/c/34656/f/636708/s/4618ff0f/sc/3/.....1.htm"

      11.) "Cape May and Coast Guard to celebrate a long marriage" [appears 2 times].

           "http://www.philly.com/r?19=961&43=165761&44=302842181&32=3796&7=1......html"

      12.) "Maine tidal power company establishes presence in Ireland" [appears 2 times]

           "http://www.sfgate.com/business/energy/article/Maine-tidal-power-c.....9.php"

      13.) "Baltimore mayor softens tone and finds wide support for federal c.....ation" [appears 2 times].

           "http://www.startribune.com/nation/302730811.html"

      14.) "Remembering Nate Robinson's Top 5 Moments As a Celtic" [appears 1 time].

           "http://feeds.boston.com/c/35022/f/646891/s/42559c54/sc/10/l/0L0Sb.....1.htm"

      15.) "Two Americans Make History Climbing El Capitan With Their Bare Hands" [appears 1 time]

           "http://feeds.boston.com/c/35022/f/646891/s/425bf93b/sc/10/l/0L0Sb.....1.htm"

If you’re contemplating a summer internship abroad, you might see what Paris is up to these days:

    Enter a search term [or just hit <enter> to quit]: Paris

    That term appears in 3 articles. Here they are:

       1.) "Saudis propose a 5-day cease-fire in Yemen, if rebels agree" [appears 2 times].

           "http://www.sfgate.com/news/world/article/US-Saudis-call-for-5-day.....0.php

       2.) "Center City elevated rail park now on fast track" [appears 1 time].

           "http://www.philly.com/r?19=961&43=165761&44=302862271&32=3796&7=1......html"

Some search terms are so focused (and our index is, in the grand scheme of indices, so small) that they produce just one result:

    Enter a search term [or just hit <enter> to quit]: Salvador

    That term appears in 1 article. Here it is:

       1.) "An undocumented immigrant’s dream deferred" [appears 1 time].

           "http://feeds.washingtonpost.com/c/34656/f/636708/s/3fd4339f/sc/1.....1.htm"

And very occasionally, something perfectly awesome doesn’t get mentioned at all:

    Enter a search term [or just hit <enter> to quit]: CS110
    Ah, we didn't find the term "CS110". Try again.

    Task 1: Sequential aggregate

    Your LiberalNewsAggregator class ultimately needs to provide a multithreaded version of processAllFeeds, subject to a laundry list of constraints we’ll eventually specify. You’re free to implement the multithreaded version right from the start and forgo an intermediate, sequential version. My strong feeling, however, is that you should invest the time getting a sequential version working first (sans any threading) to make sure everything works predictably, that you’ve made proper and elegant use of all of the classes we’ve provided, and that your sequential application’s output matches that produced by the sample (see note below). For what it’s worth, this is precisely the approach I took as I developed my own solution, so don’t go on thinking you’re weak because you don’t try to manage the full implementation of LiberalNewsAggregator in one pass. You know this already, but I’ll say it again: Incremental development is the key to arriving at a robust, bug-free executable much more quickly.

    Note: Honestly, you might see small differences between the sample application and your own, because a small fraction of an article’s content—the advertising, actually—is dynamically generated for each download and therefore impacts how the index is built each time.  The sanitycheck tool, however, has been seeded with static RSS feeds and HTML news articles, so you can rely on it as a reproducible test framework.

    There are two things I would do while implementing the sequential version, however, and they are these:

    • I would test with the small-feeds.xml file until you’re convinced you’ve reached the sequential milestone. The huge drawback of the sequential version—in fact, a drawback I deliberately highlight so we later see why programming with threads is so incredibly powerful—is that it takes a long time to load each article one after another. The small-feeds.xml file should be good enough for the vast majority of your development until you’re ready to take on the concurrency requirements, and at that point you can test the sequential version one last time by launching your aggregate executable against large-feeds.xml, walking away to read War and Peace from start to finish, and then coming back to see if everything’s finally indexed properly. It just might be.
    • Note that I've already placed an RSSIndex instance in the protected section of the NewsAggregator base class. The fact that it's protected means that it's visible to and easily manipulated by all classes in the NewsAggregator class hierarchy.

    Task 2: Implementing Multithreaded aggregate

    Practically speaking, the sequential, single-threaded implementation of aggregate is unacceptably slow.  Users don’t want to wait ten minutes while an application fires up, so you should ultimately introduce multithreading to speed things up. The sequential version of the aggregator is painfully slow, as each and every request for an article from some remote server takes on the order of seconds. If a single thread sequentially processes each and every article, then said articles are requested, pulled, parsed, and indexed one after another. There’s no overlap of network stall times whatsoever, so the lag associated with network connectivity scales with the number of processed articles, and it’s like watching paint dry in 100% humidity.

    The implementation of NewsAggregator and LiberalNewsAggregator should be updated so that each RSS news feed can be downloaded in its own child thread, and each article identified within each feed thread can be downloaded in its own grandchild thread as well. Each article, of course, still needs to be parsed and indexed. But much of the dead time spent just waiting for content to come over the wire can be scheduled to overlap. The news servers hosted by the New York Times, the Philadelphia Inquirer, and the Chicago Tribune are industrial strength and can handle hundreds if not thousands of requests per minute (or at least I’m hoping so, lest Stanford IP addresses be blacklisted between now and the assignment deadline. I guess we'll see, won’t we?)

    Naturally, multithreading and concurrency have their shortcomings. Each thread needs exclusive access to the index while doing surgery, so you’ll need to introduce some mutexes to prevent race conditions from compromising your data structures.
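
    Here’s a minimal sketch of that idea, assuming a hypothetical private mutex named indexLock living alongside the protected RSSIndex, and assuming add accepts an article and its tokens (inspect rss-index.h for the real signature):

       void LiberalNewsAggregator::addToIndex(const Article& article,
                                              const std::vector<std::string>& tokens) {
          std::lock_guard<std::mutex> lg(indexLock); // released automatically, even if add throws
          index.add(article, tokens);
       }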

    Rather than outline a large and very specific set of constraints, I’m granting you the responsibility of doing whatever it takes to make the application run more quickly without introducing any synchronization issues. The assignment is high on intellectual content—after all, this is the first time where an algorithmically sound application can break because of race conditions and deadlock—but honestly, there isn’t much additional coding once you get the sequential version up and running. You’ll spend a good amount of time figuring out where the sequential version should spawn off child threads, when those threads should spawn off grandchild threads, and how you’re going to use concurrency directives to coordinate thread communication.

    Here’s the fairly short list of must-haves:

    • Each news feed should be downloaded in its own child thread, though you should limit the number of such threads that can exist at any one time to 6.
    • Each news article should be downloaded in its own grandchild thread, though you should limit the number of threads maintaining connections to any one server (e.g. www.philly.com) to 6.  Even heavy-duty servers have a hard time responding to a flash mob of requests, so it’s considered good thread etiquette to limit the number of active conversations with any one server to some small number. (Browsers are even more conservative and usually limit it to 2 or 3 by default.) Implementing this will be tricky, but you should be able to introduce a std::map<std::string, std::unique_ptr<semaphore>> to the private section of LiberalNewsAggregator (making it thread-safe to the extent you need to) and leverage the implementation strategies I use to implement the oslock and osunlock manipulators, the implementation of which can be viewed by looking in /usr/class/cs110/local/src/threads/ at ostreamlock.cc. (See the first sketch following this list.)

    • Limit the total number of article download threads executing at any one time to 24.  There’s little sense in overwhelming the thread manager with a large thread count.  In practice, this number would be fine-tuned to the number of processors, the number of cores per processor, and performance analytics, but we’re not going to allow this to evolve into some analytics assignment on how to optimize performance. You’re optimizing more than enough by just introducing clever use of threading in the first place. (The second part of this assignment will have you revisit this decision and implement something called a thread pool, which is all kinds of awesome.)
    • Ensure that you never download (or even attempt to download) the exact same URL twice. The feed lists and RSS feeds may surface the same URL more than once (e.g. the New York Times's world-news and religion feeds may each include the same URL to an article about the Vatican), so you need to ensure that you detect the duplicates.  You should download an article the first time you see its URL, but ignore all of the others. As it turns out, the XML and HTML parsing libraries don't deal well when asked to pull two copies of the same document at the same time.  If you try to pull a document and the networking fails for whatever reason (i.e. RSSFeed::parse or HTMLDocument::parse throws an exception), you shouldn't try to download it a second time, and you should still ignore any other repeated requests to download the same URL during program execution.  Long story short: every URL gets one chance to contribute, and that's it.
    • When two or more HTML articles share the same title and are hosted by the same server, treat them as the same story, even though their downloaded contents won't be byte-for-byte duplicates. Download all of them anyway, compute the running intersection of their token sets, and file that intersection under the lexicographically smallest URL. So if, for instance, http://www.jerrynews.com/a.html, http://www.jerrynews.com/b.html, and http://www.jerrynews.com/c.html all lead to articles titled "Rock Me Like a Jerry Cain", and independent downloads produce token vectors of ["a", "a", "b", "c", "a", "a"], ["a", "a", "b", "b", "a"], and ["a", "a", "b", "b", "c", "a", "d", "a"], respectively, ensure that the RSSIndex gets "Rock Me Like a Jerry Cain" under http://www.jerrynews.com/a.html with ["a", "a", "a", "b"] as tokens.  This is an easily explained technique for overcoming the fact that many news articles have dynamically generated advertising content that can be intersected away via multiple downloads.  Implementing this will require you to investigate the sort and set_intersection functions, iterators, and/or back_inserters (at least, these are the C++'isms I used to implement this part; see the second sketch following this list).
    • Ensure that access to all shared data structures is thread-safe, meaning that there’s zero chance of deadlock, and there are no race conditions whatsoever. Use mutexes to provide the binary locks needed to protect against race conditions within critical regions. Make sure that all locks are released no matter the outcome of a download (e.g. inside catch clauses as well).
    • Ensure that the full index is built before advancing to the query loop.
    • Ensure that all memory is freed when the full application exits. If you elect to dynamically allocate memory (as you almost certainly will need to with your per-server semaphore map), then make sure that’s all properly donated back to the heap before the main function returns.
    • You can use the provided NewsAggregatorLog class to emulate the sample application's logging output, but you aren't required to.  We’ll be testing your solutions using the --quiet flag, but you can take delight in adding logging just so you can see how much faster everything runs once you introduce threading.  Weee!
    • You may not change the implementation of the NewsAggregator::queryIndex method.  It outputs the results in the exact format we’ll expect to be used by the auto-grading process, so changing it would be, you know, a clear path to a 0, which would be less than ideal.
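
    To make the per-server cap concrete, here’s a minimal sketch of the map-of-semaphores strategy mentioned above. It assumes the course-provided semaphore class; the method name, the mutex, and the constant are all hypothetical:

       static const size_t kMaxConnectionsPerServer = 6;

       semaphore& LiberalNewsAggregator::getServerSemaphore(const std::string& server) {
          std::lock_guard<std::mutex> lg(serverLocksLock); // serverLocksLock: hypothetical private mutex
          std::unique_ptr<semaphore>& up = serverLocks[server]; // creates a null entry on first access
          if (up == nullptr) up.reset(new semaphore(kMaxConnectionsPerServer));
          return *up; // the reference stays valid, since entries are never removed
       }

    An article download thread would then call getServerSemaphore(getURLServer(url)).wait() before connecting, and invoke the matching signal() once the download completes. The unique_ptr is there because operator[] default-constructs its mapped values, and the indirection lets each semaphore be built with the right initial count.

    And here’s a sketch of the running-intersection technique for same-titled articles, treating sorted token vectors as multisets (runningTokens and freshTokens are hypothetical names, with runningTokens assumed to be kept sorted):

       std::sort(freshTokens.begin(), freshTokens.end()); // set_intersection requires sorted ranges
       std::vector<std::string> intersection;
       std::set_intersection(runningTokens.cbegin(), runningTokens.cend(),
                             freshTokens.cbegin(), freshTokens.cend(),
                             std::back_inserter(intersection));
       runningTokens = intersection; // yields ["a", "a", "a", "b"] for the example above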

    If you take the care to introduce threads as I’ve outlined above, then you’ll dramatically speed up the construction of your index!  Actually, dramatically isn’t dramatic enough of a word to describe how dramatic the speedup will be. (Oh, you’ll also get genuine experience with networking and the various concurrency patterns that come up in networked applications. That’s important, too!)

    Once you get to this point, you should locally commit your changes to your NewsAggregator and LiberalNewsAggregator classes, and run through the submit script like you're done.  You may make some more changes to your NewsAggregator and LiberalNewsAggregator implementations as you move forward and find some functionality that should be tapped by the ConservativeNewsAggregator class that's coming up next, but it's still wise to commit something as an intermediate point just in case you compromise the code base as you move on to the latter half of the assignment.

    Getting Code

    Go ahead and clone the mercurial repository that we’ve set up for you by typing:

        myth22> hg clone /usr/class/cs110/repos/assign4/$USER assign4

    Compile often, test incrementally and almost as often as you compile, hg commit a bunch so you don’t lose your work if someone reboots the myth machine you’re working on, and run /usr/class/cs110/tools/submit when you’re done.  There's also a sample executable called aggregate_soln in /usr/class/cs110/samples/assign4 that you can run to see what it does.

    Files

    Here’s a description of each of the files contributing to the overall code base we’re giving you:

    aggregate.cc

    aggregate.cc is a super small file that defines the main function and nothing else.  The file is so small that I present it here:

       int main(int argc, char *argv[]) {
          unique_ptr<NewsAggregator> aggregator(NewsAggregator::createNewsAggregator(argc, argv));
          aggregator->buildIndex();
          aggregator->queryIndex();
          return 0;
       }

    As part of your implementation, you'll be designing and implementing three classes—an abstract base class called NewsAggregator  and two concrete subclasses called LiberalNewsAggregator and ConservativeNewsAggregator—to manage all of the details.  The createNewsAggregator factory method returns an instance of one of the subclasses—the command line arguments help to select which one—and passes the buck on to that instance to do all of the real work.
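
    Just so the factory idiom is concrete, here’s a hedged sketch of the shape of createNewsAggregator (the provided version does fuller flag parsing, and the subclass constructor arguments here are invented):

       NewsAggregator *NewsAggregator::createNewsAggregator(int argc, char *argv[]) {
          bool conserveThreads = false, verbose = false;
          std::string rssFeedListURI;
          // ... walk argv, noting --verbose, --quiet, --url <uri>, and --conserve-threads ...
          if (conserveThreads) return new ConservativeNewsAggregator(rssFeedListURI, verbose);
          return new LiberalNewsAggregator(rssFeedListURI, verbose);
       }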

    You shouldn’t need to change this file.

    news-aggregator.h/cc

    The news-aggregator files collectively define the abstract base class where state and functionality common to both subclasses should be placed.  The code that allows a user to query the index is common to both subclasses, so it belongs here (and I've more or less implemented this for you already, since it doesn't involve threading). The two subclasses, however, each take different approaches to how they build the index, so the processAllFeeds method (which is called by the already-implemented buildIndex method) is abstract and must be implemented by each of the two subclasses.

    You'll update these two files as you add state and functionality common to both subclasses.

    news-aggregator-liberal.h/cc

    The news-aggregator-liberal files define the first concrete subclass used to build an index.  You'll update the news-aggregator and news-aggregator-liberal files to implement the version of the assignment that allocates new threads without bound.  That's why it's liberal: it allocates threads without much concern for system cost (subject to the per-server and overall limits discussed above).

    You'll update these two files (and the news-aggregator files) as you implement the first part of the assignment.

    news-aggregator-conservative.h/cc

    The news-aggregator-conservative files define the second concrete subclass used to build an index (and its use is activated by passing --conserve-threads as a flag on the command line). You'll update the news-aggregator and news-aggregator-conservative files (and maybe even other files as you see opportunities for code sharing) to implement the version of the assignment that uses a limited number of threads over the lifetime of the executable, using what's called a thread pool.  It's conservative because it's more conscious of the expense that comes with creating and destroying threads, so it tries to limit the number of threads ever created—regardless of the number of articles that get downloaded—to a reasonably small number while still allowing the parallelism of threading to speed things up.

    You'll update these two files (and the news-aggregator files) as you implement the second part of the assignment, to be discussed soon. 

    news-aggregator-utils.h/cc

    news-aggregator-utils.h and .cc define and implement a small number of URL and string utility functions that contribute to the implementation of several functions in the NewsAggregator class hierarchy.  The provided implementation of queryIndex calls the shouldTruncate and truncate functions, and you’ll ultimately want to call the getURLServer function as you impose per-server thread-count limits.

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    news-aggregator-log.h/cc

    news-aggregator-log.h defines a small class that helps manage all of the progress and error messages that come up during program execution, and news-aggregator-log.cc implements it.  You can read through the interface and implementation files to figure out which method is likely called where within working implementations of buildIndex and processAllFeeds.  In fact, the logging might help you work through the workflow of program execution as you implement, test, and debug.  You, however, are not required to use the logging facilities provided by the NewsAggregatorLog class if you don't want to.

    article.h

    article.h defines a simple record—the Article—that pairs a news article URL with its title. operator< has been overloaded so that it can compare two Article records, which means that Articles can be used as the keys in STL maps.
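
    The record is small enough that a sketch of its shape fits here (consult article.h for the authoritative version; the exact field and comparison details may differ):

       struct Article {
          std::string url;
          std::string title;
       };

       // compare by URL first (then title) so Articles can key an STL map
       inline bool operator<(const Article& one, const Article& two) {
          return one.url < two.url || (one.url == two.url && one.title < two.title);
       }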

    You shouldn’t need to change this file, although you're welcome to if it helps you arrive at a working program more quickly.

    html-document.h/cc

    The html-document files define and implement the HTMLDocument class, which models a traditional HTML page. It provides functionality to pull an HTML document from its server and surface its payload—specifically, the plain text content under its <body> tag—as a sequence of tokens. The parse method manages the networking and processes the content, and the getTokens method provides access to the sequence of tokens making up the page. All of the news articles referenced by the RSS feeds are standard web pages, so you’ll rely on this HTMLDocument class to pull each of them and parse them into their constituent tokens.
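
    Based on that interface, a typical interaction looks something like this (assuming HTMLDocument is constructed around the article's URL; error handling is discussed with the exception classes below):

       HTMLDocument document(article.url);
       try {
          document.parse(); // pulls the page over the network and tokenizes its <body>
       } catch (const HTMLDocumentException& hde) {
          return; // skip articles that can't be pulled (logging the failure, if you'd like)
       }
       const std::vector<std::string>& tokens = document.getTokens();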

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    rss-feed.h/cc

    The rss-feed files define and implement the RSSFeed class, which models a particular type of standardized XML file known as an RSS feed. The structure of an RSS feed document isn’t important, since an RSSFeed object, when constructed around a URL of such a feed, knows how to pull and parse the XML document just as an HTMLDocument knows how to pull and parse an HTML document. The primary difference is that an RSSFeed produces a sequence of Articles, not a sequence of tokens. So, RSSFeeds produce Articles housing URLs which can be fed to HTMLDocuments to produce words.

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    rss-feed-list.h/cc

    The rss-feed-list files define and implement the RSSFeedList class, which models another XML file type whose format was invented for the purpose of this assignment. RSSFeedList’s story is similar to that for RSSFeed, except that it surfaces a feed-title-to-feed-URL map.

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    rss-index.h/cc

    The rss-index files define and implement the RSSIndex class, which models the news article index we’ve been talking about. An RSSIndex maintains information about all of the words in all of the various news articles that’ve been indexed. The code you add to each of the NewsAggregator subclasses will directly interface with a single protected instance of this class introduced in news-aggregator.h, so you’ll want to be familiar with it—in particular, you should inspect the rss-index.h file so you’re familiar with the add method. Note that the RSSIndex implementation is not thread-safe.

    My own solution actually changed these two files to codify the rules on how to handle articles with the same title but different URLs on the same server.  You might consider doing the same thing.

      *-exception.h

      These three header files each define and inline-implement a custom exception class that gets instantiated and thrown when some network drama prevents one of the three parse methods from doing its job. It’s possible the URL was malformed, or it’s possible your WiFi connection hiccuped and your network connection was dropped mid-parse. Recall that functions and methods capable of throwing exceptions can be optionally placed under the jurisdiction of a try/catch block, as with this:

          try {
              document.parse();
          } catch (const HTMLDocumentException& hde) { // exception of type HTMLDocumentException? catch and handle here
              log.noteSingleArticleDownloadFailure(article);
              return;
          }

          // got here? no exception was thrown, so carry on as usual

      You shouldn’t need to change any of these header files, though you're welcome to if you'd like.

      stream-tokenizer.h/cc

      The stream-tokenizer files provide the C++ equivalent of the Java StreamTokenizer class. The implementation is not pretty, but that’s because it needs to handle UTF-8 encodings of strings that aren’t necessarily ASCII. Fortunately, you should be able to ignore this class, since it’s really only used by the already implemented HTMLDocument class to decompose pages into tokens. Feel free to peruse the implementation if you want, or ignore it. Just understand that it’s there and contributing to the overall solution.

      thread-pool.h/cc

      thread-pool.h presents the interface for the ThreadPool class, which you are responsible for implementing as part of the ConservativeNewsAggregator work outlined below.  Naturally, you’ll want to extend the private section of the class definition to give it more state and the set of helper methods needed to build it, and you’ll need to flesh out the implementations of the public methods in thread-pool.cc.

      thread-pool-test.cc

      thread-pool-test.cc contains the trivial program presented below, and can be used (and should be extended substantially) to fully exercise your ThreadPool class.

      Unlike most of the files above, this is one you're expected to change as you build out your own test suite.

      Continuing: Going Conservative 

      By this point you've arrived at a multithreaded RSS News Feed Aggregator that uses as many threads as it wants to concurrently download news articles from around the globe and build a news.google.com-like search index. Now, you’re going to provide an alternate concrete class, the ConservativeNewsAggregator, that relies on a limited number of threads (on the order of 30) to do everything. This new approach relies on the notion of a thread pool, which is what you’ll spend the vast majority of your remaining time implementing.

      To be clear, the LiberalNewsAggregator class uses semaphores and mutexes to limit the number of threads that can exist at any one time, but it doesn't limit the total number of threads created over the lifetime of an aggregate run.  Even if we guarantee that no more than, say, 32 threads ever exist at any one time, there’s no sense creating a thread to do some work only to let it die if we’re going to need another thread to do pretty much the same thing later on. Building up and tearing down a thread is a relatively expensive process, so we should prefer a smaller number of threads that live very long lives to a larger number of threads that live very short ones.

      An industrial-strength aggregator needs to balance competing requirements (employing a bounded number of threads while getting everything off of the main thread as quickly as possible), and nothing we did with LiberalNewsAggregator does that.

      Here’s a new idea: implement a ThreadPool class, which exports the following public interface:

          class ThreadPool {
          public:
              ThreadPool(size_t numThreads);
              void schedule(const std::function<void(void)>& thunk);
              void wait();
              ~ThreadPool();
          };

      Here is a simple program that uses a ThreadPool to execute a collection of 10 function calls using 4 threads.

          static const size_t kNumThreads = 4;
          static const size_t kNumFunctions = 10;

          int main(int argc, char *argv[]) {
              ThreadPool pool(kNumThreads);
              for (size_t id = 0; id < kNumFunctions; id++) {
                  pool.schedule([id] {
                      cout << oslock << "Thread (ID: " << id << ") has started."
                           << endl << osunlock;
                      size_t sleepTime = (id % 3) * 100;
                      sleep_for(sleepTime); // course-provided helper: sleeps for sleepTime milliseconds
                      cout << oslock << "Thread (ID: " << id << ") has finished."
                           << endl << osunlock;
                  });
              }
              pool.wait(); // block until all scheduled functions have executed
              cout << "All done!" << endl;
              return 0;
          }

      The output of the above program might look like this:

          myth22> ./thread-pool-test
          Thread (ID: 3) has started.
          Thread (ID: 2) has started.
          Thread (ID: 1) has started.
          Thread (ID: 0) has started.
          Thread (ID: 0) has finished.
          Thread (ID: 4) has started.
          Thread (ID: 1) has finished.
          Thread (ID: 5) has started.
          Thread (ID: 2) has finished.
          Thread (ID: 6) has started.
          Thread (ID: 3) has finished.
          Thread (ID: 7) has started.
          Thread (ID: 7) has finished.
          Thread (ID: 8) has started.
          Thread (ID: 4) has finished.
          Thread (ID: 8) has finished.
          Thread (ID: 9) has started.
          Thread (ID: 5) has finished.
          Thread (ID: 9) has finished.
          Thread (ID: 6) has finished.
          All done!
          myth22>

      In a nutshell, the program’s ThreadPool creates a small number of worker threads (in this example, four of them) and relies on those four workers to collectively execute all of the scheduled functions (in this example, 10 of them). Yes, yes, we could have spawned ten separate threads and not used a thread pool at all, but that’s the unscalable approach your LiberalNewsAggregator went with, and we’re trying to improve on that by using a fixed number of threads to maximize parallelism without overwhelming the thread manager.

      Task 3: Implementing the ThreadPool class, version 1

      How does one implement this thread pool thing?  Well, your ThreadPool constructor—at least initially—should do the following:

      • launch a single dispatcher thread like this (assuming dt is a private thread data member):

              dt = thread([this]() {
                  dispatcher();
              });

      • launch a specific number of worker threads like this (assuming wts is a private vector<thread> data member):

              for (size_t workerID = 0; workerID < numThreads; workerID++) {
                  wts[workerID] = thread([this](size_t workerID) {
                      worker(workerID);
                  }, workerID); // note: assumes wts was already sized to hold numThreads threads
              }

      The implementation of schedule should append the provided function (expressed as a function<void(void)>, which is the C++11 way to type anything invokable with no arguments) to the end of a queue of such functions.  Each time a zero-argument function is scheduled, the dispatcher thread should be notified. Once the implementation of schedule has notified the dispatcher that the queue of outstanding functions to be executed has been extended, it should return right away so even more functions can be scheduled without delay.
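
      Here’s one possible shape for schedule, assuming hypothetical private members queueLock (a std::mutex), queueCV (a std::condition_variable_any), and thunks (a std::queue<std::function<void(void)>>):

          void ThreadPool::schedule(const std::function<void(void)>& thunk) {
              std::lock_guard<std::mutex> lg(queueLock);
              thunks.push(thunk);
              queueCV.notify_all(); // wake the dispatcher; we return immediately after
          }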

      The implementation of the private dispatcher method should loop interminably, blocking within each iteration until it has confirmation the queue of outstanding functions is nonempty. It should then wait for a worker thread to become available, select it, mark it as unavailable, dequeue the least recently scheduled function, put a copy of that function in a place where the selected worker (and only that worker) can find it, and then signal the worker thread to execute it.

      The implementation of the private worker method should also loop interminably, blocking within each iteration until the dispatcher thread signals it to execute a previously scheduled function. Once signaled, the worker should go ahead and call the function, wait for it to execute, and then mark itself as available so that it can be discovered and selected again (and again, and again) by the dispatcher.

      The implementation of wait should block until all previously-scheduled-but-yet-to-be-executed functions have been executed. The ThreadPool destructor should wait until all previously-scheduled-but-yet-to-be-executed thunks have executed to completion, somehow inform the dispatcher and worker threads to exit (and wait for them to exit), and then otherwise dispose of all ThreadPool resources.  (Functions that take no arguments at all are called thunks. The function<void(void)> type is a more general type than void (*)(), and can be assigned to anything invokable—a function pointer, or an anonymous function—that doesn’t require any arguments.)
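
      To make the handoff concrete, here’s a rough skeleton of the worker side, assuming the course-provided semaphore class (which defaults to a count of zero) and a hypothetical vector<Worker> named workers sitting in the private section; the full coordination scheme is yours to design:

          struct Worker {
              semaphore ready;                  // signaled by the dispatcher once thunk is installed
              std::function<void(void)> thunk;  // the function this worker should run next
              bool available = true;            // guarded by a pool-wide mutex
          };

          void ThreadPool::worker(size_t workerID) {
              while (true) {
                  workers[workerID].ready.wait(); // block until the dispatcher signals us
                  workers[workerID].thunk();      // execute the scheduled function
                  // reacquire the pool-wide lock, mark ourselves available again,
                  // and notify the dispatcher and anyone blocked in wait()
              }
          }

      The dispatcher’s loop is the mirror image: block until the queue is nonempty, block until some worker is available, mark it busy, move the dequeued thunk into its record, and signal its ready semaphore.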

      Your ThreadPool implementation shouldn’t orphan any memory whatsoever.  We'll be analyzing your ThreadPool using valgrind to confirm that nothing is leaked.

      Task 4: Multithreaded aggregate, version 2

      Once you have a working ThreadPool, you should flesh out and implement the ConservativeNewsAggregator using two ThreadPools. Once you do so, you can activate its use over the LiberalNewsAggregator by passing the --conserve-threads flag to aggregate.

      Why two ThreadPools instead of just one? I want one ThreadPool (of size 6) to manage a collection of worker threads that download RSS XML documents, and I want a second ThreadPool (of size 24) to manage a second group of workers dedicated to news article downloads.
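
      In sketch form (feeds, and the body of each scheduled thunk, are placeholders for your own code), processAllFeeds ends up looking something like this:

          void ConservativeNewsAggregator::processAllFeeds() {
              ThreadPool feedPool(6), articlePool(24);
              for (const auto& feed : feeds) {
                  feedPool.schedule([this, &articlePool, feed] {
                      // pull and parse the RSS feed, and for each new article it surfaces:
                      //   articlePool.schedule([...] { /* download, parse, and index it */ });
                  });
              }
              feedPool.wait();    // all feeds parsed, so every article download has been scheduled
              articlePool.wait(); // all articles downloaded, parsed, and indexed
          }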

      To simplify the implementation, you should not worry about limiting the number of simultaneous connections to any given server.  The two ThreadPools you’re using are pretty small, so it’s pretty much impossible for any single news server to feel abused by the version of aggregate that taps your ConservativeNewsAggregator.   You should deal with duplicate articles the same way you did in liberal land: never download—or even try to download—the same URL twice, and when two articles share the same title and server, keep track of the running intersection of all token sets, and file them under the lexicographically smallest URL.  My primary interest is showing you that a constant number of threads can more or less accomplish what an unbounded number of threads accomplished for aggregate 1.0.

      Task 5: Optimizing the ThreadPool class

      Once you press through your ConservativeNewsAggregator implementation, I want you to go back and update your ThreadPool implementation to be a little less aggressive about spawning worker threads at construction time.

      Consider the scenario where you create a ThreadPool of size 32, but the surrounding executable only schedules two or three functions every few seconds. If the functions are relatively short and execute quickly, the vast majority of the worker threads spawned at construction time will be twiddling their thumbs with nothing to do. There’s no sense creating worker threads and burdening the thread manager with them until they’re really needed.

      To remedy this, you should update your ThreadPool implementation to lazily spawn a worker thread only when you need a worker and all existing workers are busy doing something else. (Note that a worker thread isn’t truly spawned until a thread with an installed thread routine is moved into the wts vector.)

      You should still ensure the number of worker threads never exceeds the thread pool size, but you should only spawn worker threads on an as-needed basis. (Once you spawn a worker thread, it can exist forever, even if it never gets used again).
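
      In sketch form, the policy amounts to something like this, executed by the dispatcher while it holds the pool-wide lock (someWorkerIsAvailable and numSpawned are made-up names, and wts is assumed to have been default-constructed to hold numThreads empty thread objects):

          bool idleWorkerExists = someWorkerIsAvailable(); // scan workers for an available one
          if (!idleWorkerExists && numSpawned < wts.size()) {
              size_t id = numSpawned++;
              wts[id] = std::thread([this, id] { worker(id); }); // install the routine on demand
          }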

      As always, compile more often than you blink, test incrementally, and hg commit with every bug-free advance toward your final solution. Of course, be sure to run /usr/class/cs110/tools/submit when you’re done, and invoke that sanitycheck a whole lot.

      Note that we will be exercising your ThreadPool and your NewsAggregator class hierarchy with many, many more tests than we've exposed via the sanitycheck.  Previous assignments have exposed all of the tests, but the onus is now on you to test your work to make sure it's bulletproof.  You can construct your own static RSS feed list to reference your own static RSS feeds to reference your own static HTML articles. Make sure you handle duplicates the same way as the sample executable, and ensure that per-server limits are respected even when multiple RSS feeds and their HTML articles reference the same server.

      I hope you enjoy this assignment as much as I enjoyed writing it!
