A concurrent web scraping service built with Bocadillo that lets clients fetch metadata about websites.
Raspador demonstrates how to build:
1. An asynchronous web server exposing RESTful API endpoints.
2. A concurrent, non-blocking web service that doesn't resort to a message queue or broker.
To achieve (2), Raspador makes use of background tasks to run I/O-intensive jobs after the response has been sent.
This principle can easily be adapted to other I/O-bound operations, such as sending email (e.g. with aiosmtpd) or logging messages to an external service (e.g. with aiologstash).
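For instance, here is a minimal, self-contained sketch of the pattern (assuming Bocadillo's `res.background` hook, which schedules a coroutine to run once the response has been sent; the `/ping` route and `slow_io` task are illustrative):

```python
import asyncio

from bocadillo import App

app = App()

@app.route("/ping")
async def ping(req, res):
    # Registered with `res.background`, this coroutine runs only
    # after the response below has been sent to the client.
    @res.background
    async def slow_io():
        await asyncio.sleep(5)  # Simulated I/O-bound work.

    # The client receives this immediately; `slow_io` runs afterwards.
    res.media = {"message": "pong"}

if __name__ == "__main__":
    app.run()
```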
Raspador uses a simple, in-memory scraper job system. Jobs run asynchronously (i.e. in the background) while clients can use the server's REST API to create jobs and inspect their results.
In practice, a scraper job consists of fetching a single URL. It is delayed by 5 seconds in order to simulate longer I/O-bound processing.
Jobs are stored in memory, but it is entirely possible to extend Raspador to store them in a database instead. An asynchronous database client such as asyncpg would then be of great help.
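One possible shape for such a job system is sketched below. This is illustrative rather than Raspador's actual code: the `Job` dataclass, the `JOBS` dict, and the `scrape` coroutine are hypothetical names, and the title extraction is a crude regex stand-in for real metadata parsing.

```python
import asyncio
import re
from dataclasses import dataclass
from typing import Optional

import aiohttp

# Hypothetical in-memory job store: job key -> Job.
JOBS: dict = {}


@dataclass
class Job:
    key: int
    url: str
    state: str = "scheduled"
    results: Optional[dict] = None


async def scrape(job: Job) -> None:
    """Fetch the job's URL and record metadata on the job itself."""
    job.state = "in_progress"
    await asyncio.sleep(5)  # Simulate longer I/O-bound processing.
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(job.url) as response:
                html = await response.text()
                match = re.search(r"<title>(.*?)</title>", html, re.S)
                job.results = {
                    "title": match.group(1).strip() if match else None,
                    "status": response.status,
                }
        job.state = "success"
    except aiohttp.ClientError as exc:
        job.state = "failed"
        job.results = {"error": str(exc)}
```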
Using Pipenv:

```bash
pipenv install
pipenv shell
```
First, start the API server:

```bash
python api.py
```
For convenience, the `site/` directory provides a set of HTML pages you can serve with Python in order to run a local website for testing purposes. The following command serves them on http://localhost:5001:

```bash
python -m http.server 5001 -b localhost -d site
```
To create a scraper for one of the pages, e.g. `index.html`, make a call to the `POST /scrapers` endpoint:

```bash
curl http://localhost:8000/scrapers \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"url": "http://localhost:5001"}'
```
Example response:
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "scheduled",
  "results": null
}
```
The response is returned immediately, because the scraping occurs asynchronously after the response has been sent. This is implemented with a background task.
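Continuing the sketch from earlier (with `Job`, `JOBS` and `scrape` as defined above, and assuming Bocadillo's class-based views), the `POST /scrapers` handler might look like this:

```python
import itertools

# `app`, `Job`, `JOBS` and `scrape` are as defined in the sketches above.
_keys = itertools.count(1)  # Hypothetical monotonic key generator.


@app.route("/scrapers")
class ScraperList:
    async def post(self, req, res):
        payload = await req.json()
        job = Job(key=next(_keys), url=payload["url"])
        JOBS[job.key] = job

        # The actual scraping only runs after the response has been sent.
        @res.background
        async def run_job():
            await scrape(job)

        res.status_code = 201
        res.media = {
            "key": job.key,
            "url": job.url,
            "state": job.state,
            "results": job.results,
        }
```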
When creating a scraper job, the server will return a job key (a unique identifier), which can be used to retrieve results by making a call to the `GET /scrapers/{key}/results` endpoint.
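A matching results endpoint could then map job state to an HTTP status code, as in the example responses below. Again, this is a sketch building on the snippets above, assuming `HTTPError` is Bocadillo's HTTP exception:

```python
from bocadillo import HTTPError

# `app` and `JOBS` are as defined in the sketches above.
@app.route("/scrapers/{key}/results")
class ScraperResults:
    async def get(self, req, res, key):
        job = JOBS.get(int(key))
        if job is None:
            raise HTTPError(404)
        # 202 while the job is scheduled or running, 200 once it is done.
        res.status_code = 202 if job.state in ("scheduled", "in_progress") else 200
        res.media = {
            "key": job.key,
            "url": job.url,
            "state": job.state,
            "results": job.results,
        }
```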
Below are example responses for a call to `/scrapers/1/results` (assuming the scraper job we're interested in has a key of `1`).
- While the scraper is running (status: 202):
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "in_progress",
  "results": null
}
```
- When the scraper has successfully finished (status: 200):
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "success",
  "results": {
    "title": "Hello, world!",
    "description": "This fake website rocks.",
    "status": 200
  }
}
```
- If scraping has failed (status: 200):
```json
{
  "key": 1,
  "url": "http://localhost:5001",
  "state": "failed",
  "results": {
    "error": "Cannot connect to host localhost:5001 ssl:None [No route to host]"
  }
}
```