This system is responsible for managing the schema of the OpenTrials warehouse
database and for collecting data to populate it.
Collectors are fully compatible with Python 2.7.
We use PostgreSQL for our database and Alembic for migrations.
Collectors are deployed and run in production with DockerCloud.
The system's collectors are independent Python modules that share the following signature:
def collect(conf, conn, *args):
    pass
Where the arguments are:
- conf - config dict
- conn - connections dict
- args - collector arguments
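As a minimal sketch of a module with this signature, the collector below stores one record per positional argument. The `SOURCE` config key, the `warehouse` connection name, and the plain list standing in for a database handle are assumptions made for illustration; they are not taken from the project.

```python
# Hypothetical collector module; the config key and connection name
# below are illustrative only.

def collect(conf, conn, *args):
    """Store one record per positional argument in the 'warehouse' connection."""
    warehouse = conn['warehouse']  # a plain list stands in for a real DB handle
    for arg in args:
        warehouse.append({'source': conf.get('SOURCE', 'example'), 'value': arg})
    return len(args)


conf = {'SOURCE': 'example'}   # config dict
conn = {'warehouse': []}       # connections dict
count = collect(conf, conn, 'a', 'b')
```

Everything after conf and conn arrives in args, so each collector is free to interpret its own arguments.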
To run a collector from the command line:
$ make start <name> [<args>]
This command triggers a collectors.<name>.collect(conf, conn, *args) call.
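One way such a dispatch could work is to import the collector module by name and call its collect function. The sketch below is an assumption about the mechanism, not the project's actual entry point: `run_collector` and the `demo_collectors` stand-in package are hypothetical names, registered in memory so the example runs on its own.

```python
import importlib
import sys
import types


def run_collector(name, conf, conn, args, package='collectors'):
    """Import <package>.<name> and call its collect(conf, conn, *args)."""
    module = importlib.import_module('%s.%s' % (package, name))
    return module.collect(conf, conn, *args)


# Register a stand-in package and collector so the sketch is self-contained.
fake_package = types.ModuleType('demo_collectors')
fake_package.__path__ = []  # an empty __path__ marks the module as a package
fake_collector = types.ModuleType('demo_collectors.demo')
fake_collector.collect = lambda conf, conn, *args: list(args)
sys.modules['demo_collectors'] = fake_package
sys.modules['demo_collectors.demo'] = fake_collector

result = run_collector('demo', {}, {}, ['2013-11-30', '2013-12-01'],
                       package='demo_collectors')
```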
NOTE: Most collectors need date_from and date_to arguments that define the
time range from which we want to extract resources. For example:
$ make start nct 2013-11-30 2013-12-01
To check whether a collector needs these arguments, see the collect function of the collector you are interested in.
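Because these arguments arrive as plain strings, a collector may want to validate them before use. The helper below is a sketch under two assumptions: that dates use the YYYY-MM-DD format and that date_from must not be later than date_to. The name `parse_date_range` is hypothetical and not part of the project.

```python
from datetime import datetime


def parse_date_range(date_from, date_to, fmt='%Y-%m-%d'):
    """Return (date_from, date_to) as date objects, rejecting invalid input."""
    # strptime raises ValueError for malformed or nonexistent dates.
    start = datetime.strptime(date_from, fmt).date()
    end = datetime.strptime(date_to, fmt).date()
    if start > end:
        raise ValueError('date_from %s is after date_to %s' % (start, end))
    return start, end


start, end = parse_date_range('2013-11-30', '2013-12-01')
```

A nonexistent calendar date such as 2013-02-30 raises ValueError instead of being passed on to the source being scraped.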
Many collectors are scrapers. Scraping is based on the Scrapy framework. Here is
an example of how to use Scrapy in the collect function:
from scrapy.crawler import CrawlerProcess
from .spider import <name>Spider

def collect(conf, conn, <args>):
    # The config dict is passed to Scrapy as its settings
    process = CrawlerProcess(conf)
    process.crawl(<name>Spider, conn=conn, <args>)
    # Blocks until the crawl has finished
    process.start()
For more details, check the tutorial How to Write a Collector using Scrapy.
The folder collectors/base contains multiple reusable components and
helpers, including the base class for a database record
and the base class for a record's field.
Each collector that has a corresponding table in the warehouse database has to
define the schema for that table in a class that inherits from the record base class.
For example, the following class defines the schema for the table colors. This table has
two fields of type Text, one of which is the primary key:
class ColorRecord(base.Record):
    table = 'colors'

    # Fields
    id = Text(primary_key=True)
    color = Text()
To see how this connects to the other parts of the collector, check the How to Write a Collector tutorial.
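To illustrate how a declarative record class of this kind can expose its schema, the sketch below uses simplified stand-ins for the base classes. The `Field`, `Text`, and `Record` classes here, along with the `fields` and `primary_keys` helpers, are assumptions written for this example; the real classes in collectors/base may be structured differently.

```python
# Simplified stand-ins for the base classes in collectors/base.
# These are NOT the project's implementations; they only illustrate the idea.

class Field(object):
    """Base class for a record's field."""

    def __init__(self, primary_key=False):
        self.primary_key = primary_key


class Text(Field):
    """A text column."""


class Record(object):
    """Base class for a database record."""
    table = None

    @classmethod
    def fields(cls):
        """Return a {name: Field} mapping of fields declared directly on cls."""
        return dict((name, value) for name, value in vars(cls).items()
                    if isinstance(value, Field))

    @classmethod
    def primary_keys(cls):
        """Return the sorted names of the fields flagged as primary keys."""
        return sorted(name for name, field in cls.fields().items()
                      if field.primary_key)


class ColorRecord(Record):
    table = 'colors'

    # Fields
    id = Text(primary_key=True)
    color = Text()
```

With the schema declared as class attributes, generic code can derive the table name, column names, and primary key from the class itself, without per-collector boilerplate.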
- Define the table/field in the collector's record class as explained above.
- Create a migration for it (more details in Alembic docs).