A concurrent web scraping service built with Bocadillo that lets clients fetch metadata about websites.
Raspador demonstrates how to build:
1. An asynchronous web server exposing RESTful API endpoints.
2. A concurrent, non-blocking web service that doesn't resort to a message queue or broker.
To achieve (2), Raspador makes use of background tasks to run I/O-intensive jobs after the response has been sent.
This principle can easily be adapted to other I/O-bound operations, such as sending email (e.g. with aiosmtpd) or logging messages to an external service (e.g. with aiologstash).
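For instance, here is a minimal, self-contained sketch of the pattern (assuming Bocadillo's `res.background` hook, which schedules a coroutine to run once the response has been sent; the `/ping` route and `slow_io` task are illustrative):

```python
import asyncio

from bocadillo import App

app = App()

@app.route("/ping")
async def ping(req, res):
    # Registered with `res.background`, this coroutine runs only
    # after the response below has been sent to the client.
    @res.background
    async def slow_io():
        await asyncio.sleep(5)  # Simulated I/O-bound work.

    # The client receives this immediately; `slow_io` runs afterwards.
    res.media = {"message": "pong"}

if __name__ == "__main__":
    app.run()
```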
Raspador uses a simple, in-memory scraper job system. Jobs run asynchronously (i.e. in the background) while clients can use the server's REST API to create jobs and inspect their results.
In practice, a scraper job consists of fetching a single URL. It is delayed by 5 seconds in order to simulate longer I/O-bound processing.
Jobs are stored in memory, but it is entirely possible to extend Raspador to store them in a database instead. An asynchronous database client such as asyncpg would then be of great help.
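One possible shape for such a job system is sketched below. This is illustrative rather than Raspador's actual code: the `Job` dataclass, the `JOBS` dict, and the `scrape` coroutine are hypothetical names, and the title extraction is a crude regex stand-in for real metadata parsing.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Optional

import aiohttp

# Hypothetical in-memory job store: job key -> Job.
JOBS: dict = {}


@dataclass
class Job:
    key: int
    url: str
    state: str = "scheduled"
    results: Optional[dict] = None


async def scrape(job: Job) -> None:
    """Fetch the job's URL and record metadata on the job itself."""
    job.state = "in_progress"
    await asyncio.sleep(5)  # Simulate longer I/O-bound processing.
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(job.url) as response:
                html = await response.text()
                match = re.search(r"<title>(.*?)</title>", html, re.S)
                job.results = {
                    "title": match.group(1).strip() if match else None,
                    "status": response.status,
                }
        job.state = "success"
    except aiohttp.ClientError as exc:
        job.state = "failed"
        job.results = {"error": str(exc)}
```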
Using Pipenv:

```bash
pipenv install
pipenv shell
```
First, start the API server:

```bash
python api.py
```
For convenience, the `site/` directory provides a set of HTML pages you can serve with Python in order to run a local website for testing purposes. The following command serves them on http://localhost:5001:

```bash
python -m http.server 5001 -b localhost -d site
```
To create a scraper for one of the pages, e.g. `index.html`, make a call to the `POST /scrapers` endpoint:

```bash
curl http://localhost:8000/scrapers \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"url": "http://localhost:5001"}'
```
Example response:
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "scheduled",
  "results": null
}
```
The response is returned immediately, because the scraping occurs asynchronously after the response has been sent. This is implemented with a background task.
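Continuing the sketch from earlier (with `Job`, `JOBS` and `scrape` as defined above, and assuming Bocadillo's class-based views), the `POST /scrapers` handler might look like this:

```python
import itertools

# `app`, `Job`, `JOBS` and `scrape` are as defined in the sketches above.
_keys = itertools.count(1)  # Hypothetical monotonic key generator.


@app.route("/scrapers")
class ScraperList:
    async def post(self, req, res):
        payload = await req.json()
        job = Job(key=next(_keys), url=payload["url"])
        JOBS[job.key] = job

        # The actual scraping only runs after the response has been sent.
        @res.background
        async def run_job():
            await scrape(job)

        res.status_code = 201
        res.media = {
            "key": job.key,
            "url": job.url,
            "state": job.state,
            "results": job.results,
        }
```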
When creating a scraper job, the server will return a job key (a unique identifier), which can be used to retrieve results by making a call to the `GET /scrapers/{key}/results` endpoint.
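A matching results endpoint could then map job state to an HTTP status code, as in the example responses below. Again, this is a sketch building on the snippets above, assuming `HTTPError` is Bocadillo's HTTP exception:

```python
from bocadillo import HTTPError

# `app` and `JOBS` are as defined in the sketches above.
@app.route("/scrapers/{key}/results")
class ScraperResults:
    async def get(self, req, res, key):
        job = JOBS.get(int(key))
        if job is None:
            raise HTTPError(404)
        # 202 while the job is scheduled or running, 200 once it is done.
        res.status_code = 202 if job.state in ("scheduled", "in_progress") else 200
        res.media = {
            "key": job.key,
            "url": job.url,
            "state": job.state,
            "results": job.results,
        }
```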
Below are example responses for a call to `/scrapers/1/results` (assuming the scraper job we're interested in has a key of `1`).
- While the scraper is running (status: 202):
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "in_progress",
  "results": null
}
```
- When the scraper has successfully finished (status: 200):
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "success",
  "results": {
    "title": "Hello, world!",
    "description": "This fake website rocks.",
    "status": 200
  }
}
```
- If scraping has failed (status: 200):
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "failed",
  "results": {
    "error": "Cannot connect to host localhost:5001 ssl:None [No route to host]"
  }
}
```