
Assignment 4: RSS News Feed Aggregator

Virtually all major newspapers and TV news stations have bought into this whole Internet thing. What you may not know is that all major media corporations serve up RSS feeds summarizing the news stories that’ve aired or gone to press in the preceding 24 hours. RSS news feeds are XML documents with information about online news articles. If we can get the feeds, we can get the articles, and if we can get the articles, we can build an index of information similar to that held by news.google.com.  That’s precisely what you’ll be doing for Assignment 4.

Assignment 4 is probably the largest of the quarter in terms of the amount of code you need to write, so don't dilly dally. :)  In previous quarters, it's been two assignments instead of one, but the first of the two was fairly straightforward and stole time away from the more demanding followup.  The first half or so of the assignment will make sure you gracefully manage the transition from C to C++ while gaining some experience with C++ containers like the map and the vector, C++11 threads, mutexes, condition_variable_anys, and semaphores.  The latter part of the assignment will revisit some of the initially imposed design decisions and force you to be much more conservative about the number of actual threads you create over the program's lifetime.

Due: Monday, May 18th at 11:59 p.m.

Indexing a news article amounts to little more than breaking the story down into its individual words and noting how often each word appears. If a particular word appears a good number of times, then said word is probably a good indicator as to what the web page is all about. Once all of the world’s online RSS news feeds have been indexed, you can ask the index for a list of stories about a specific person, place, or thing.

If, for instance, you’re curious about what's up with all things presidential, you can query your friendly news index, and it’s sure to come back with something informative:

    myth22> ./aggregate --verbose --url large-feeds.xml

    // lots of logging output omitted

    Enter a search term [or just hit <enter> to quit]: president

    That term appears in 33 articles. Here are the top 15 of them:

       1.) "Obama’s Nike stop puts focus on outsourcing, labor standards" [appears 6 times]

           "http://www.seattletimes.com/business/international-trade/obamas-n.....ards/"

       2.) "Obama taps two for top spots on Joint Chiefs of Staff" [appears 4 times]

           "http://www.dispatch.com/content/stories/national_world/2015/05/06......html"

       3.) "Williams, Tufano to speak at Delco commencements" [appears 3 times].

           "http://www.philly.com/r?19=961&43=165761&44=302860641&32=3796&7=1......html"

       4.) "U.S., Saudis call for 5-day cease-fire in Yemen to let in aid" [appears 3 times].

           "http://www.seattletimes.com/nation-world/u-s-saudis-call-for-5-da.....-aid/"

       5.) "Aliens? No, just a mirage of Chicago skyline over Lake Michigan" [appears 2 times].

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       6.) "Chicago Teachers Union files labor complaint against school board" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       7.) "ALS ice sculpture exhibit opens Thursday in Chicago" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       8.) "3 men wounded in shootings on South Side" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

       9.) "Warmest day yet of 2015 may break record" [appears 2 times]

           "http://feeds.chicagotribune.com/~r/chicagotribune/news/nationworl.....1.htm"

      10.) "WorldViews: On Election Day in Britain, publishing the results of..... jail" [appears 2 times]

           "http://feeds.washingtonpost.com/c/34656/f/636708/s/4618ff0f/sc/3/.....1.htm"

      11.) "Cape May and Coast Guard to celebrate a long marriage" [appears 2 times].

           "http://www.philly.com/r?19=961&43=165761&44=302842181&32=3796&7=1......html"

      12.) "Maine tidal power company establishes presence in Ireland" [appears 2 times]

           "http://www.sfgate.com/business/energy/article/Maine-tidal-power-c.....9.php"

      13.) "Baltimore mayor softens tone and finds wide support for federal c.....ation" [appears 2 times].

           "http://www.startribune.com/nation/302730811.html"

      14.) "Remembering Nate Robinson's Top 5 Moments As a Celtic" [appears 1 time].

           "http://feeds.boston.com/c/35022/f/646891/s/42559c54/sc/10/l/0L0Sb.....1.htm"

      15.) "Two Americans Make History Climbing El Capitan With Their Bare Hands" [appears 1 time]

           "http://feeds.boston.com/c/35022/f/646891/s/425bf93b/sc/10/l/0L0Sb.....1.htm"

If you’re contemplating a summer internship abroad, you might see what Paris is up to these days:

    Enter a search term [or just hit <enter> to quit]: Paris

    That term appears in 3 articles. Here they are:

       1.) "Saudis propose a 5-day cease-fire in Yemen, if rebels agree" [appears 2 times].

           "http://www.sfgate.com/news/world/article/US-Saudis-call-for-5-day.....0.php

       2.) "Center City elevated rail park now on fast track" [appears 1 time].

           "http://www.philly.com/r?19=961&43=165761&44=302862271&32=3796&7=1......html"

Some search terms are so focused (and our index is, in the grand scheme of indices, so small) that they produce just one result:

    Enter a search term [or just hit <enter> to quit]: Salvador

    That term appears in 1 article. Here it is:

       1.) "An undocumented immigrant’s dream deferred" [appears 1 time].

           "http://feeds.washingtonpost.com/c/34656/f/636708/s/3fd4339f/sc/1.....1.htm"

And very occasionally, something perfectly awesome doesn’t get mentioned at all:

    Enter a search term [or just hit <enter> to quit]: CS110
    Ah, we didn't find the term "CS110". Try again.

    Task 1: Sequential aggregate

    Your LiberalNewsAggregator class ultimately needs to provide a multithreaded version of processAllFeeds, subject to a laundry list of constraints we’ll eventually specify. You’re free to implement the multithreaded version right from the start and forgo an intermediate, sequential version. My strong feeling, however, is that you should invest the time getting a sequential version working first (sans any threading) to make sure everything works predictably, that you’ve made proper and elegant use of all of the classes we’ve provided, and that your sequential application’s output matches that produced by the sample (see note below). For what it’s worth, this is precisely the approach I took as I developed my own solution, so don’t go on thinking you’re weak because you don’t try to manage the full implementation of LiberalNewsAggregator in one pass. You know this already, but I’ll say it again: Incremental development is the key to arriving at a robust, bug-free executable much more quickly.

    Note: Honestly, you might see small differences between the sample application and your own, because a small fraction of an article’s content—the advertising, actually—is dynamically generated for each download and therefore impacts how the index is built each time.  The sanitycheck tool, however, has been seeded with static RSS feeds and HTML news articles, so you can rely on it as a reproducible test framework.

    There are two things I would do while implementing the sequential version, however, and they are these:

    • I would test with the small-feeds.xml file until you’re convinced you’ve reached the sequential milestone. The huge drawback of the sequential version—in fact, a drawback I deliberately highlight so we later see why programming with threads is so incredibly powerful—is that it takes a long time to load each article one after another. The small-feeds.xml file should be good enough for the vast majority of your development until you’re ready to take on the concurrency requirements, and at that point you can test the sequential version one last time by launching your aggregate executable against large-feeds.xml, walking away to read War and Peace from start to finish, and then coming back to see if everything’s finally indexed properly. It just might be.
    • Note that I've already placed an RSSIndex instance in the protected section of the NewsAggregator base class. The fact that it's protected means that it's visible to and easily manipulated by all classes in the NewsAggregator class hierarchy.

    Task 2: Implementing Multithreaded aggregate

    Practically speaking, the sequential, single-threaded implementation of aggregate is unacceptably slow.  Users don’t want to wait ten minutes while an application fires up, so you should ultimately introduce multithreading to speed things up. The sequential version of the aggregator is painfully slow, as each and every request for an article from some remote server takes on the order of seconds. If a single thread sequentially processes each and every article, then said articles are requested, pulled, parsed, and indexed one after another. There’s no overlap of network stall times whatsoever, so the lag associated with network connectivity scales with the number of processed articles, and it’s like watching paint dry in 100% humidity.

    The implementation of NewsAggregator and LiberalNewsAggregator should be updated so that each RSS news feed can be downloaded in its own child thread, and each article identified within each feed thread can be downloaded in its own grandchild thread as well. Each article, of course, still needs to be parsed and indexed. But much of the dead time spent just waiting for content to come over the wire can be scheduled to overlap. The news servers hosted by the New York Times, the Philadelphia Inquirer, and the Chicago Tribune are industrial strength and can handle hundreds if not thousands of requests per minute (or at least I’m hoping so, lest Stanford IP addresses be blacklisted between now and the assignment deadline. I guess we'll see, won’t we?)

    Naturally, multithreading and concurrency have their shortcomings. Each thread needs exclusive access to the index while doing surgery, so you’ll need to introduce some mutexes to prevent race conditions from compromising your data structures.
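
    Here’s a minimal sketch of that idea, assuming a hypothetical private mutex named indexLock living alongside the protected RSSIndex, and assuming add accepts an article and its tokens (inspect rss-index.h for the real signature):

       void LiberalNewsAggregator::addToIndex(const Article& article,
                                              const std::vector<std::string>& tokens) {
          std::lock_guard<std::mutex> lg(indexLock); // released automatically, even if add throws
          index.add(article, tokens);
       }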

    Rather than outline a large and very specific set of constraints, I’m granting you the responsibility of doing whatever it takes to make the application run more quickly without introducing any synchronization issues. The assignment is high on intellectual content—after all, this is the first time where an algorithmically sound application can break because of race conditions and deadlock—but honestly, there isn’t much additional coding once you get the sequential version up and running. You’ll spend a good amount of time figuring out where the sequential version should spawn off child threads, when those threads should spawn off grandchild threads, and how you’re going to use concurrency directives to coordinate thread communication.

    Here’s the fairly short list of must-haves:

    • Each news feed should be downloaded in its own child thread, though you should limit the number of such threads that can exist at any one time to 6.
    • Each news article should be downloaded in its own grandchild thread, though you should limit the number of threads maintaining connections to any one server (e.g. www.philly.com) to 6.  Even heavy-duty servers have a hard time responding to a flash mob of requests, so it’s considered good thread etiquette to limit the number of active conversations with any one server to some small number. (Browsers are even more conservative and usually limit it to 2 or 3 by default.) Implementing this will be tricky, but you should be able to introduce a std::map<std::string, std::unique_ptr<semaphore>> to the private section of LiberalNewsAggregator (making it thread-safe to the extent you need to) and leverage the implementation strategies I use to implement the oslock and osunlock manipulators, the implementation of which can be viewed by looking in /usr/class/cs110/local/src/threads/ at ostreamlock.cc. (See the first sketch following this list.)

    • Limit the total number of article download threads executing at any one time to 24.  There’s little sense in overwhelming the thread manager with a large thread count.  In practice, this number would be fine-tuned to the number of processors, the number of cores per processor, and performance analytics, but we’re not going to allow this to evolve into some analytics assignment on how to optimize performance. You’re optimizing more than enough by just introducing clever use of threading in the first place. (The second part of this assignment will have you revisit this decision and implement something called a thread pool, which is all kinds of awesome.)
    • Ensure that you never download (or even attempt to download) the exact same URL twice. The feed lists and RSS feeds may surface the same URL more than once (e.g. the New York Times's world-news and religion feeds may each include the same URL to an article about the Vatican), so you need to ensure that you detect the duplicates.  You should download an article the first time you see its URL, but ignore all of the others. As it turns out, the XML and HTML parsing libraries don't deal well when asked to pull two copies of the same document at the same time.  If you try to pull a document and the networking fails for whatever reason (i.e. RSSFeed::parse or HTMLDocument::parse throws an exception), you shouldn't try to download it a second time, and you should still ignore any other repeated requests to download the same URL during program execution.  Long story short: every URL gets one chance to contribute, and that's it.
    • When two or more HTML articles share the same title and are hosted by the same server, treat them as the same story, even though their downloaded contents won't be byte-for-byte duplicates. Download all of them anyway, compute the running intersection of their token sets, and file that intersection under the lexicographically smallest URL. So if, for instance, http://www.jerrynews.com/a.html, http://www.jerrynews.com/b.html, and http://www.jerrynews.com/c.html all lead to articles titled "Rock Me Like a Jerry Cain", and independent downloads produce token vectors of ["a", "a", "b", "c", "a", "a"], ["a", "a", "b", "b", "a"], and ["a", "a", "b", "b", "c", "a", "d", "a"], respectively, ensure that the RSSIndex gets "Rock Me Like a Jerry Cain" under http://www.jerrynews.com/a.html with ["a", "a", "a", "b"] as tokens.  This is an easily explained technique for overcoming the fact that many news articles have dynamically generated advertising content that can be intersected away via multiple downloads.  Implementing this will require you to investigate the sort and set_intersection functions, iterators, and/or back_inserters (at least, these are the C++'isms I used to implement this part; see the second sketch following this list).
    • Ensure that access to all shared data structures is thread-safe, meaning that there’s zero chance of deadlock, and there are no race conditions whatsoever. Use mutexes to provide the binary locks needed to protect against race conditions within critical regions. Make sure that all locks are released no matter the outcome of a download (e.g. inside catch clauses as well).
    • Ensure that the full index is built before advancing to the query loop.
    • Ensure that all memory is freed when the full application exits. If you elect to dynamically allocate memory (as you almost certainly will need to with your per-server semaphore map), then make sure that’s all properly donated back to the heap before the main function returns.
    • You can use the provided NewsAggregatorLog class to emulate the sample application's logging output, but you aren't required to.  We’ll be testing your solutions using the --quiet flag, but you can take delight in adding logging just so you can see how much faster everything runs once you introduce threading.  Weee!
    • You may not change the implementation of the NewsAggregator::queryIndex method.  It outputs the results in the exact format we’ll expect to be used by the auto-grading process, so changing it would be, you know, a clear path to a 0, which would be less than ideal.
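
    To make the per-server cap concrete, here’s a minimal sketch of the map-of-semaphores strategy mentioned above. It assumes the course-provided semaphore class; the method name, the mutex, and the constant are all hypothetical:

       static const size_t kMaxConnectionsPerServer = 6;

       semaphore& LiberalNewsAggregator::getServerSemaphore(const std::string& server) {
          std::lock_guard<std::mutex> lg(serverLocksLock); // serverLocksLock: hypothetical private mutex
          std::unique_ptr<semaphore>& up = serverLocks[server]; // creates a null entry on first access
          if (up == nullptr) up.reset(new semaphore(kMaxConnectionsPerServer));
          return *up; // the reference stays valid, since entries are never removed
       }

    An article download thread would then call getServerSemaphore(getURLServer(url)).wait() before connecting, and invoke the matching signal() once the download completes. The unique_ptr is there because operator[] default-constructs its mapped values, and the indirection lets each semaphore be built with the right initial count.

    And here’s a sketch of the running-intersection technique for same-titled articles, treating sorted token vectors as multisets (runningTokens and freshTokens are hypothetical names, with runningTokens assumed to be kept sorted):

       std::sort(freshTokens.begin(), freshTokens.end()); // set_intersection requires sorted ranges
       std::vector<std::string> intersection;
       std::set_intersection(runningTokens.cbegin(), runningTokens.cend(),
                             freshTokens.cbegin(), freshTokens.cend(),
                             std::back_inserter(intersection));
       runningTokens = intersection; // yields ["a", "a", "a", "b"] for the example above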

    If you take the care to introduce threads as I’ve outlined above, then you’ll dramatically speed up the construction of your index!  Actually, dramatically isn’t dramatic enough of a word to describe how dramatic the speedup will be. (Oh, you’ll also get genuine experience with networking and the various concurrency patterns that come up in networked applications. That’s important, too!)

    Once you get to this point, you should locally commit your changes to your NewsAggregator and LiberalNewsAggregator classes, and run through the submit script like you're done.  You may make some more changes to your NewsAggregator and LiberalNewsAggregator implementations as you move forward and find some functionality that should be tapped by the ConservativeNewsAggregator class that's coming up next, but it's still wise to commit something as an intermediate point just in case you compromise the code base as you move on to the latter half of the assignment.

    Getting Code

    Go ahead and clone the mercurial repository that we’ve set up for you by typing:

        myth22> hg clone /usr/class/cs110/repos/assign4/$USER assign4

    Compile often, test incrementally and almost as often as you compile, hg commit a bunch so you don’t lose your work if someone reboots the myth machine you’re working on, and run /usr/class/cs110/tools/submit when you’re done.  There's also a sample executable called aggregate_soln in /usr/class/cs110/samples/assign4 that you can run to see what it does.

    Files

    Here’s a description of each of the files contributing to the overall code base we’re giving you:

    aggregate.cc

    aggregate.cc is a super small file that defines the main function and nothing else.  The file is so small that I present it here:

       int main(int argc, char *argv[]) {
          unique_ptr<NewsAggregator> aggregator(NewsAggregator::createNewsAggregator(argc, argv));
          aggregator->buildIndex();
          aggregator->queryIndex();
          return 0;
       }

    As part of your implementation, you'll be designing and implementing three classes—an abstract base class called NewsAggregator  and two concrete subclasses called LiberalNewsAggregator and ConservativeNewsAggregator—to manage all of the details.  The createNewsAggregator factory method returns an instance of one of the subclasses—the command line arguments help to select which one—and passes the buck on to that instance to do all of the real work.
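
    Just so the factory idiom is concrete, here’s a hedged sketch of the shape of createNewsAggregator (the provided version does fuller flag parsing, and the subclass constructor arguments here are invented):

       NewsAggregator *NewsAggregator::createNewsAggregator(int argc, char *argv[]) {
          bool conserveThreads = false, verbose = false;
          std::string rssFeedListURI;
          // ... walk argv, noting --verbose, --quiet, --url <uri>, and --conserve-threads ...
          if (conserveThreads) return new ConservativeNewsAggregator(rssFeedListURI, verbose);
          return new LiberalNewsAggregator(rssFeedListURI, verbose);
       }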

    You shouldn’t need to change this file.

    news-aggregator.h/cc

    The news-aggregator files collectively define the abstract base class where state and functionality common to both subclasses should be placed.  The code that allows a user to query the index is common to both subclasses, so it belongs here (and I've more or less implemented this for you already, since it doesn't involve threading). The two subclasses, however, each take different approaches to how they build the index, so the processAllFeeds method (which is called by the already-implemented buildIndex method) is abstract and must be implemented by each of the two subclasses.

    You'll update these two files as you add state and functionality common to both subclasses.

    news-aggregator-liberal.h/cc

    The news-aggregator-liberal files define the first concrete subclass used to build an index.  You'll update the news-aggregator and news-aggregator-liberal files to implement the version of the assignment that allocates new threads without bound.  That's why it's liberal: it allocates threads without much concern for system cost (subject to the per-server and overall limits discussed above).

    You'll update these two files (and the news-aggregator files) as you implement the first part of the assignment.

    news-aggregator-conservative.h/cc

    The news-aggregator-conservative files define the second concrete subclass used to build an index (and its use is activated by passing --conserve-threads as a flag on the command line). You'll update the news-aggregator and news-aggregator-conservative files (and maybe even other files as you see opportunities for code sharing) to implement the version of the assignment that uses a limited number of threads over the lifetime of the executable, using what's called a thread pool.  It's conservative because it's more conscious of the expense that comes with creating and destroying threads, so it tries to limit the number of threads ever created—regardless of the number of articles that get downloaded—to a reasonably small number while still allowing the parallelism of threading to speed things up.

    You'll update these two files (and the news-aggregator files) as you implement the second part of the assignment, to be discussed soon. 

    news-aggregator-utils.h/cc

    news-aggregator-utils.h and .cc define and implement a small number of URL and string utility functions that contribute to the implementation of several functions in the NewsAggregator class hierarchy.  The provided implementation of queryIndex calls the shouldTruncate and truncate functions, and you’ll ultimately want to call the getURLServer function as you impose per-server thread-count limits.

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    news-aggregator-log.h/cc

    news-aggregator-log.h defines a small class that helps manage all of the progress and error messages that come up during program execution, and news-aggregator-log.cc implements it.  You can read through the interface and implementation files to figure out which method is likely called where within working implementations of buildIndex and processAllFeeds.  In fact, the logging might help you work through the workflow of program execution as you implement, test, and debug.  You, however, are not required to use the logging facilities provided by the NewsAggregatorLog class if you don't want to.

    article.h

    article.h defines a simple record—the Article—that pairs a news article URL with its title. operator< has been overloaded so that it can compare two Article records, which means that Articles can be used as the keys in STL maps.
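
    The record is small enough that a sketch of its shape fits here (consult article.h for the authoritative version; the exact field and comparison details may differ):

       struct Article {
          std::string url;
          std::string title;
       };

       // compare by URL first (then title) so Articles can key an STL map
       inline bool operator<(const Article& one, const Article& two) {
          return one.url < two.url || (one.url == two.url && one.title < two.title);
       }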

    You shouldn’t need to change this file, although you're welcome to if it helps you arrive at a working program more quickly.

    html-document.h/cc

    The html-document files define and implement the HTMLDocument class, which models a traditional HTML page. It provides functionality to pull an HTML document from its server and surface its payload—specifically, the plain text content under its <body> tag—as a sequence of tokens. The parse method manages the networking and processes the content, and the getTokens method provides access to the sequence of tokens making up the page. All of the news articles referenced by the RSS feeds are standard web pages, so you’ll rely on this HTMLDocument class to pull each of them and parse them into their constituent tokens.
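
    Based on that interface, a typical interaction looks something like this (assuming HTMLDocument is constructed around the article's URL; error handling is discussed with the exception classes below):

       HTMLDocument document(article.url);
       try {
          document.parse(); // pulls the page over the network and tokenizes its <body>
       } catch (const HTMLDocumentException& hde) {
          return; // skip articles that can't be pulled (logging the failure, if you'd like)
       }
       const std::vector<std::string>& tokens = document.getTokens();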

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    rss-feed.h/cc

    The rss-feed files define and implement the RSSFeed class, which models a particular type of standardized XML file known as an RSS feed. The structure of an RSS feed document isn’t important, since an RSSFeed object, when constructed around a URL of such a feed, knows how to pull and parse the XML document just as an HTMLDocument knows how to pull and parse an HTML document. The primary difference is that an RSSFeed produces a sequence of Articles, not a sequence of tokens. So, RSSFeeds produce Articles housing URLs which can be fed to HTMLDocuments to produce words.

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    rss-feed-list.h/cc

    The rss-feed-list files define and implement the RSSFeedList class, which models another XML file type whose format was invented for the purpose of this assignment. RSSFeedList’s story is similar to that for RSSFeed, except that it surfaces a feed-title-to-feed-URL map.

    You shouldn’t need to change either of these files, although you're welcome to if it helps you arrive at a working program more quickly.

    rss-index.h/cc

    The rss-index files define and implement the RSSIndex class, which models the news article index we’ve been talking about. An RSSIndex maintains information about all of the words in all of the various news articles that’ve been indexed. The code you add to each of the NewsAggregator subclasses will directly interface with a single protected instance of this class introduced in news-aggregator.h, so you’ll want to be familiar with it—in particular, you should inspect the rss-index.h file so you’re familiar with the add method. Note that the RSSIndex implementation is not thread-safe.

    My own solution actually changed these two files to codify the rules on how to handle articles with the same title but different URLs on the same server.  You might consider doing the same thing.

      *-exception.h

      These three header files each define and inline-implement a custom exception class that gets instantiated and thrown when some network drama prevents one of the three parse methods from doing its job. It’s possible the URL was malformed, or it’s possible your WiFi connection hiccuped and your network connection was dropped mid-parse. Recall that functions and methods capable of throwing exceptions can be optionally placed under the jurisdiction of a try/catch block, as with this:

          try {
              document.parse();
          } catch (const HTMLDocumentException& hde) { // exception of type HTMLDocumentException? catch and handle here
              log.noteSingleArticleDownloadFailure(article);
              return;
          }

          // got here? no exception was thrown, so carry on as usual

      You shouldn’t need to change any of these header files, though you're welcome to if you'd like.

      stream-tokenizer.h/cc

      The stream-tokenizer files provide the C++ equivalent of the Java StreamTokenizer class. The implementation is not pretty, but that’s because it needs to handle UTF-8 encodings of strings that aren’t necessarily ASCII. Fortunately, you should be able to ignore this class, since it’s really only used by the already implemented HTMLDocument class to decompose pages into tokens. Feel free to peruse the implementation if you want, or ignore it. Just understand that it’s there and contributing to the overall solution.

      thread-pool.h/cc

      thread-pool.h presents the interface for the ThreadPool class, which you are responsible for implementing as part of the ConservativeNewsAggregator work outlined below.  Naturally, you’ll want to extend the private section of the class definition to give it more state and the set of helper methods needed to build it, and you’ll need to flesh out the implementations of the public methods in thread-pool.cc.

      thread-pool-test.cc

      thread-pool-test.cc contains the trivial program presented below, and can be used (and should be extended substantially) to fully exercise your ThreadPool class.

      Unlike most of the files above, this is one you're expected to change as you build out your own test suite.

      Continuing: Going Conservative 

      By this point you've arrived at a multithreaded RSS News Feed Aggregator that uses as many threads as it wants to concurrently download news articles from around the globe and build a news.google.com-like search index. Now, you’re going to provide an alternate concrete class, the ConservativeNewsAggregator, that relies on a limited number of threads (on the order of 30) to do everything. This new approach relies on the notion of a thread pool, which is what you’ll spend the vast majority of your remaining time implementing.

      To be clear, the LiberalNewsAggregator class uses semaphores and mutexes to limit the number of threads that can exist at any one time, but it doesn't limit the total number of threads created over the lifetime of an aggregate run.  Even if we guarantee that no more than, say, 32 threads ever exist at any one time, there’s no sense creating a thread to do some work only to let it die if we’re going to need another thread to do pretty much the same thing later on. Building up and tearing down a thread is a relatively expensive process, so we should prefer a smaller number of threads that live very long lives to a larger number of threads that live very short ones.

      An industrial-strength aggregator needs to balance competing requirements (employing a bounded number of threads while getting everything off of the main thread as quickly as possible), and nothing we did with LiberalNewsAggregator does that.

      Here’s a new idea: implement a ThreadPool class, which exports the following public interface:

          class ThreadPool {
          public:
              ThreadPool(size_t numThreads);
              void schedule(const std::function<void(void)>& thunk);
              void wait();
              ~ThreadPool();
          };

      Here is a simple program that uses a ThreadPool to execute a collection of 10 function calls using 4 threads.

          static const size_t kNumThreads = 4;
          static const size_t kNumFunctions = 10;

          int main(int argc, char *argv[]) {
              ThreadPool pool(kNumThreads);
              for (size_t id = 0; id < kNumFunctions; id++) {
                  pool.schedule([id] {
                      cout << oslock << "Thread (ID: " << id << ") has started."
                           << endl << osunlock;
                      size_t sleepTime = (id % 3) * 100;
                      sleep_for(sleepTime); // course-provided helper: sleeps for sleepTime milliseconds
                      cout << oslock << "Thread (ID: " << id << ") has finished."
                           << endl << osunlock;
                  });
              }
              pool.wait(); // block until all scheduled functions have executed
              cout << "All done!" << endl;
              return 0;
          }

      The output of the above program might look like this:

          myth22> ./thread-pool-test
          Thread (ID: 3) has started.
          Thread (ID: 2) has started.
          Thread (ID: 1) has started.
          Thread (ID: 0) has started.
          Thread (ID: 0) has finished.
          Thread (ID: 4) has started.
          Thread (ID: 1) has finished.
          Thread (ID: 5) has started.
          Thread (ID: 2) has finished.
          Thread (ID: 6) has started.
          Thread (ID: 3) has finished.
          Thread (ID: 7) has started.
          Thread (ID: 7) has finished.
          Thread (ID: 8) has started.
          Thread (ID: 4) has finished.
          Thread (ID: 8) has finished.
          Thread (ID: 9) has started.
          Thread (ID: 5) has finished.
          Thread (ID: 9) has finished.
          Thread (ID: 6) has finished.
          All done!
          myth22>

      In a nutshell, the program’s ThreadPool creates a small number of worker threads (in this example, four of them) and relies on those four workers to collectively execute all of the scheduled functions (in this example, 10 of them). Yes, yes, we could have spawned ten separate threads and not used a thread pool at all, but that’s the unscalable approach your LiberalNewsAggregator went with, and we’re trying to improve on that by using a fixed number of threads to maximize parallelism without overwhelming the thread manager.

      Task 3: Implementing the ThreadPool class, version 1

      How does one implement this thread pool thing?  Well, your ThreadPool constructor—at least initially—should do the following:

      • launch a single dispatcher thread like this (assuming dt is a private thread data member):

              dt = thread([this]() {
                  dispatcher();
              });

      • launch a specific number of worker threads like this (assuming wts is a private vector<thread> data member):

              for (size_t workerID = 0; workerID < numThreads; workerID++) {
                  wts[workerID] = thread([this](size_t workerID) {
                      worker(workerID);
                  }, workerID); // note: assumes wts was already sized to hold numThreads threads
              }

      The implementation of schedule should append the provided function (expressed as a function<void(void)>, which is the C++11 way to type anything invokable with no arguments) to the end of a queue of such functions.  Each time a zero-argument function is scheduled, the dispatcher thread should be notified. Once the implementation of schedule has notified the dispatcher that the queue of outstanding functions to be executed has been extended, it should return right away so even more functions can be scheduled without delay.
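
      Here’s one possible shape for schedule, assuming hypothetical private members queueLock (a std::mutex), queueCV (a std::condition_variable_any), and thunks (a std::queue<std::function<void(void)>>):

          void ThreadPool::schedule(const std::function<void(void)>& thunk) {
              std::lock_guard<std::mutex> lg(queueLock);
              thunks.push(thunk);
              queueCV.notify_all(); // wake the dispatcher; we return immediately after
          }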

      The implementation of the private dispatcher method should loop interminably, blocking within each iteration until it has confirmation the queue of outstanding functions is nonempty. It should then wait for a worker thread to become available, select it, mark it as unavailable, dequeue the least recently scheduled function, put a copy of that function in a place where the selected worker (and only that worker) can find it, and then signal the worker thread to execute it.

      The implementation of the private worker method should also loop interminably, blocking within each iteration until the dispatcher thread signals it to execute a previously scheduled function. Once signaled, the worker should go ahead and call the function, wait for it to execute, and then mark itself as available so that it can be discovered and selected again (and again, and again) by the dispatcher.

      The implementation of wait should block until all previously-scheduled-but-yet-to-be-executed functions have been executed. The ThreadPool destructor should wait until all previously-scheduled-but-yet-to-be-executed thunks have executed to completion, somehow inform the dispatcher and worker threads to exit (and wait for them to exit), and then otherwise dispose of all ThreadPool resources.  (Functions that take no arguments at all are called thunks. The function<void(void)> type is a more general type than void (*)(), and can be assigned to anything invokable—a function pointer, or an anonymous function—that doesn’t require any arguments.)
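
      To make the handoff concrete, here’s a rough skeleton of the worker side, assuming the course-provided semaphore class (which defaults to a count of zero) and a hypothetical vector<Worker> named workers sitting in the private section; the full coordination scheme is yours to design:

          struct Worker {
              semaphore ready;                  // signaled by the dispatcher once thunk is installed
              std::function<void(void)> thunk;  // the function this worker should run next
              bool available = true;            // guarded by a pool-wide mutex
          };

          void ThreadPool::worker(size_t workerID) {
              while (true) {
                  workers[workerID].ready.wait(); // block until the dispatcher signals us
                  workers[workerID].thunk();      // execute the scheduled function
                  // reacquire the pool-wide lock, mark ourselves available again,
                  // and notify the dispatcher and anyone blocked in wait()
              }
          }

      The dispatcher’s loop is the mirror image: block until the queue is nonempty, block until some worker is available, mark it busy, move the dequeued thunk into its record, and signal its ready semaphore.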

      Your ThreadPool implementation shouldn’t orphan any memory whatsoever.  We'll be analyzing your ThreadPool using valgrind to confirm that nothing is leaked.

      Task 4: Multithreaded aggregate, version 2

      Once you have a working ThreadPool, you should flesh out and implement the ConservativeNewsAggregator using two ThreadPools. Once you do so, you can activate its use over the LiberalNewsAggregator by passing the --conserve-threads flag to aggregate.

      Why two ThreadPools instead of just one? I want one ThreadPool (of size 6) to manage a collection of worker threads that download RSS XML documents, and I want a second ThreadPool (of size 24) to manage a second group of workers dedicated to news article downloads.
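
      In sketch form (feeds, and the body of each scheduled thunk, are placeholders for your own code), processAllFeeds ends up looking something like this:

          void ConservativeNewsAggregator::processAllFeeds() {
              ThreadPool feedPool(6), articlePool(24);
              for (const auto& feed : feeds) {
                  feedPool.schedule([this, &articlePool, feed] {
                      // pull and parse the RSS feed, and for each new article it surfaces:
                      //   articlePool.schedule([...] { /* download, parse, and index it */ });
                  });
              }
              feedPool.wait();    // all feeds parsed, so every article download has been scheduled
              articlePool.wait(); // all articles downloaded, parsed, and indexed
          }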

      To simplify the implementation, you should not worry about limiting the number of simultaneous connections to any given server.  The two ThreadPools you’re using are pretty small, so it’s pretty much impossible for any single news server to feel abused by the version of aggregate that taps your ConservativeNewsAggregator.   You should deal with duplicate articles the same way you did in liberal land: never download—or even try to download—the same URL twice, and when two articles share the same title and server, keep track of the running intersection of all token sets, and file them under the lexicographically smallest URL.  My primary interest is showing you that a constant number of threads can more or less accomplish what an unbounded number of threads accomplished for aggregate 1.0.

      Task 5: Optimizing the ThreadPool class

      Once you press through your ConservativeNewsAggregator implementation, I want you to go back and update your ThreadPool implementation to be a little less aggressive about spawning worker threads at construction time.

      Consider the scenario where you create a ThreadPool of size 32, but the surrounding executable only schedules two or three functions every few seconds. If the functions are relatively short and execute quickly, the vast majority of the worker threads spawned at construction time will be twiddling their thumbs with nothing to do. There’s no sense creating worker threads and burdening the thread manager with them until they’re really needed.

      To remedy this, you should update your ThreadPool implementation to lazily spawn a worker thread only when you need a worker and all existing workers are busy doing something else. (Note that a worker thread isn’t truly spawned until a thread with an installed thread routine is moved into the wts vector.)

      You should still ensure the number of worker threads never exceeds the thread pool size, but you should only spawn worker threads on an as-needed basis. (Once you spawn a worker thread, it can exist forever, even if it never gets used again).
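
      In sketch form, the policy amounts to something like this, executed by the dispatcher while it holds the pool-wide lock (someWorkerIsAvailable and numSpawned are made-up names, and wts is assumed to have been default-constructed to hold numThreads empty thread objects):

          bool idleWorkerExists = someWorkerIsAvailable(); // scan workers for an available one
          if (!idleWorkerExists && numSpawned < wts.size()) {
              size_t id = numSpawned++;
              wts[id] = std::thread([this, id] { worker(id); }); // install the routine on demand
          }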

      As always, compile more often than you blink, test incrementally, and hg commit with every bug-free advance toward your final solution. Of course, be sure to run /usr/class/cs110/tools/submit when you’re done, and invoke that sanitycheck a whole lot.

      Note that we will be exercising your ThreadPool and your NewsAggregator class hierarchy with many, many more tests than we've exposed via the sanitycheck.  Previous assignments have exposed all of the tests, but the onus is now on you to test your work to make sure it's bulletproof.  You can construct your own static RSS feed list to reference your own static RSS feeds to reference your own static HTML articles. Make sure you handle duplicates the same way as the sample executable, and ensure that per-server limits are respected even when multiple RSS feeds and their HTML articles reference the same server.

      I hope you enjoy this assignment as much as I enjoyed writing it!
