Scraper
The scraper is a piece of software that runs daily to extract information from the Cook County Sheriff's Inmate Locator website.
The scraper now has configuration options that must be set for it to run the raw_inmate_data capture.
The first version of our scraper ran as a serial process, retrieving one inmate record at a time, and as a result took more than 6 hours to complete. An attempt to create a concurrent version failed, so limited concurrency was implemented using GNU Parallel, which reduced the processing time to just over 4 hours.
In January of 2014, the Cook County Sheriff's department added a CAPTCHA form item to the inmate search mechanism. This, along with a couple of other deficiencies, primarily in detecting missed inmate information, triggered another rewrite of the scraper. The current (as of March 2014) scraper uses a concurrent approach and is more aggressive in detecting missed inmate information.
In May of 2014 it was decided that the raw inmate information should be made available, so users can create their own database schemas or do their own selective analysis. The raw data is available at http://cookcountyjail.recoveredfactory.net/raw_inmate_data/. This location contains the years for which data has been captured, and the contents of each year sub-directory are the daily snapshots for that year. The data is stored in CSV format. To help you work with the raw inmate data, see Notes On Working With Raw Inmate Data. The starting date for data in the V2 API database is August 17th, 2013.
The rest of this wiki page documents the architecture of the concurrent scraper.
- All computing is to be done concurrently.
- Objects own and manage their own concurrency, including how many concurrent instances there are.
- No concurrent instance of an object will block when calling a method on another object.
- Use Gevent as the concurrency mechanism. Concurrent execution occurs in Greenlets.
The purpose of these design rules is to maximize separation of concerns, to ensure that full concurrency happens, and to minimize the chance of race conditions and deadlocks.
There are four objects at the heart of the Scraper:
- Controller - orchestrates the running and shutdown of the three other objects that do the majority of the work.
- SearchCommands - generates sets of commands that InmatesScraper runs
- InmatesScraper - fetches detailed inmate information pages from the Cook County Jail website
- Inmates - reads, writes, and updates inmate information stored in the database
The Controller is called via its run method. Once it has created the other three objects, it fetches from Inmates the data used to find information about new, existing, and departed inmates. The blue lines in the diagram show these calls. Once it has the information, it issues three commands to the SearchCommands object. Commands are the black paths. SearchCommands then generates three different types of calls to InmatesScraper, which fetches inmate detail pages from the Cook County Jail site. Depending on the result and the command, InmatesScraper then makes different types of calls to Inmates, which stores the inmate information. At different points during execution, SearchCommands, InmatesScraper, and Inmates send notifications to the Controller, which acts on them. Notification paths are in yellow. Once all the processing is completed, the Controller shuts everything down.
While the four objects perform different computations, they share a common architecture. To meet the design requirement that computation run in Greenlets owned and managed by the objects, each public method, when called, puts a message on an internal queue; that queue is read by a function running in a Greenlet, which performs the computation. By convention, each public API method has a shadow method with the same name prefixed by an underscore. The diagram below shows this. On the left is a public method, public_method. It is shadowed by _public_method, which contains the code that performs the actual computation associated with that method. When public_method is called, it pushes the name of its shadow method and any arguments it received onto an internal message queue. These values are retrieved by a method running in a Greenlet, which then calls _public_method with the passed arguments. When _public_method completes, it returns control to the calling method, which makes a blocking call to get the next message.
The internal message queue and the Greenlet worker or workers are spawned when the object instance is created. This encapsulation of the concurrency and its management is how the second design requirement is met. The intent of that requirement is that no user of the object needs to know how the concurrency is accomplished, today or in the future.
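The public/shadow-method pattern can be sketched in a few lines. This is a self-contained illustration, not the scraper's actual code: threads and queue.Queue stand in for Greenlets and gevent queues, and the class and method names are made up for the example.

```python
import queue
import threading

class QueuedObject:
    """Sketch of the public/shadow-method pattern: callers never block,
    and all real computation happens on the object's own worker."""

    def __init__(self):
        self.results = []
        self._queue = queue.Queue()
        # Queue and worker are created with the instance, so callers
        # never see how the concurrency is managed.
        self._worker = threading.Thread(target=self._process_queue, daemon=True)
        self._worker.start()

    def public_method(self, value):
        # Non-blocking: just enqueue the shadow method's name and arguments.
        self._queue.put(('_public_method', (value,)))

    def _public_method(self, value):
        # The actual computation, run only by the worker.
        self.results.append(('processed', value))

    def _process_queue(self):
        # Worker loop: blocking get, then dispatch to the shadow method.
        while True:
            method_name, args = self._queue.get()
            getattr(self, method_name)(*args)
            self._queue.task_done()
```

A call such as obj.public_method('2014-0001') returns immediately; the worker later invokes _public_method('2014-0001') with the dequeued arguments.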
- Controller - one.
- Inmates - one. As this object deals with disk I/O, there is no advantage to having more than one.
- InmatesScraper - many, currently 25. This limit exists because we do not want to overload the Sheriff's website. Fetching inmate pages is the bottleneck. When the Sheriff's website is lightly loaded, it takes between 1 and 2 seconds to load a page in a browser. The HTML page is approximately 13,600 characters, which is what the scraper fetches. The time to download it is significantly smaller than in a browser, which also fetches graphics, images, and script files. Even so, the download time is still a couple of orders of magnitude longer than the computation time, so many pages can be downloaded concurrently without burdening the machine the scraper runs on.
- SearchCommands - one. Note that command generation occurs very quickly.
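The many-workers case above amounts to a pool of workers draining one shared task queue. Here is a minimal sketch under the same caveats as before: threads stand in for Greenlets, the fetch function is a placeholder rather than the real HTTP request, and the names are illustrative.

```python
import queue
import threading

WORKER_COUNT = 25  # mirrors the InmatesScraper limit discussed above

def fetch_inmate_page(booking_id):
    # Placeholder for the real HTTP fetch of an inmate detail page.
    return 'page-for-' + booking_id

def worker(tasks, results):
    # Each worker drains the shared task queue until it sees a sentinel.
    while True:
        booking_id = tasks.get()
        if booking_id is None:
            tasks.task_done()
            return
        results.put(fetch_inmate_page(booking_id))
        tasks.task_done()

def scrape(booking_ids, worker_count=WORKER_COUNT):
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(worker_count)]
    for w in workers:
        w.start()
    for booking_id in booking_ids:
        tasks.put(booking_id)
    for _ in workers:
        tasks.put(None)  # one shutdown sentinel per worker
    for w in workers:
        w.join()
    return sorted(results.queue)
```

Because each fetch is dominated by network wait rather than computation, 25 such workers keep the pipeline busy without loading the local machine, which is the trade-off the paragraph above describes.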
Much of the information returned from computations in the scraper is passed as parameters in calls to other objects. This information flow terminates either in InmatesScraper, when no inmate with the specified booking id exists, or in Inmates, with the inmate information being stored in the database. The only information that does not flow this way is the information about inmates already in the system. This information is needed to generate the following search requests:
- Update the status of an inmate: are they still in the system or have they been discharged?
- Find new bookings or missed bookings
- Check whether an inmate was actually discharged or whether a problem interfered with the retrieval of their information
Design rule #3 states that no concurrent instance of an object will block when calling a method on a different object. This means that the computation the called object provides does not interfere with the execution of the caller. For the forward flow of information, the class architecture described above accomplishes this. The problem is with method calls that expect the result of a computation to be returned to them. There are two ways this can be accomplished:
- Through registering a callback method
- Providing a response queue
In the current implementation a response queue was chosen. Upon reflection, the callback approach is the better choice: it hides the implementation, whereas the response queue exposes it too much - what if a different mechanism is to be used? That said, the way the current mechanism works is that the caller provides a queue for the response and then spawns an instance that listens on the queue until the return value is written to it. Once received, the return value is either used in processing done by the listening instance or passed to the original calling instance. The result of this approach is that all of the design rules are met.
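The response-queue mechanism can be sketched as follows. Again, threads stand in for Greenlets, and the class, method names, and returned values are illustrative, not the scraper's actual API.

```python
import queue
import threading

class Inmates:
    """Callee side: accepts a response queue and writes its answer to it."""

    def active_booking_ids(self, response_queue):
        # Returns immediately; the answer is produced on this object's own
        # worker (here, a throwaway thread) and pushed onto the caller's queue.
        def respond():
            response_queue.put(['2014-0001', '2014-0002'])  # stand-in for a DB read
        threading.Thread(target=respond, daemon=True).start()

def controller_step():
    inmates = Inmates()
    response_queue = queue.Queue()
    inmates.active_booking_ids(response_queue)  # the caller is never blocked

    received = []
    def listener():
        # The spawned listener, not the caller's worker, blocks on the queue.
        received.append(response_queue.get())
    t = threading.Thread(target=listener)
    t.start()
    t.join()
    return received[0]
```

Because only the spawned listener blocks, the caller's own worker keeps draining its message queue, which is how rule #3 is preserved even for calls that need a return value.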
It is felt that the above is sufficient guidance to make understanding the code of the current scraper much easier and faster. You are the judges of that. If this documentation does not meet that goal, then either edit this text or create an issue describing what is missing, needs more explanation, or is misleading. My thanks in advance if you do so.