Scrapy-mq-redis is a Scrapy extension that lets you feed and queue URLs for your spiders from RabbitMQ.
It uses Redis for the duplicate filter (DupeFilter).
It is built as a combination of scrapy-rabbitmq and scrapy-redis.
Clone the repo, then from inside the scrapy-mq-redis directory run
python setup.py install
The queues below will have to be created in RabbitMQ before starting the crawl:
- ${spider_name}:start_urls
- ${spider_name}:requests
- ${spider_name}:items
Note: automatic queue creation will be added soon.
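Until automatic creation is added, the queues can be declared up front with pika, for example. This is a minimal sketch; the connection parameters and the spider name multidomain are assumptions, so adjust them to your setup:
#!/usr/bin/env python
# Sketch: declare the three queues for a spider named "multidomain".
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost', port=5672))
channel = connection.channel()

for suffix in ('start_urls', 'requests', 'items'):
    # If the scheduler declares these queues with different properties (e.g. durable),
    # use the same flags here to avoid a PRECONDITION_FAILED error.
    channel.queue_declare(queue='multidomain:%s' % suffix)

connection.close()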
# Enable the RabbitMQ scheduler, which stores the requests queue in RabbitMQ.
SCHEDULER = "scrapy_mq_redis.scheduler.Scheduler"

# Don't clean up the RabbitMQ queues; this allows pausing/resuming crawls.
SCHEDULER_PERSIST = True

# Queue class used to schedule requests. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_mq_redis.queue.SpiderQueue'

# Host and port of the RabbitMQ daemon.
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}

# Store scraped items in RabbitMQ for post-processing.
ITEM_PIPELINES = {
    'scrapy_mq_redis.pipelines.RabbitMQPipeline': 1
}

# Redis connection used by the duplicate filter.
REDIS_HOST = '192.168.1.200'
REDIS_PORT = 6379
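Because the RabbitMQPipeline stores scraped items in RabbitMQ, they can later be drained from the ${spider_name}:items queue for post-processing. Below is a minimal sketch, assuming a spider named multidomain and a local RabbitMQ on the default port; how the item body is serialized depends on the pipeline, so it is printed raw here:
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost', port=5672))
channel = connection.channel()

while True:
    # Pull one message at a time; basic_get returns (None, None, None) when the queue is empty.
    method, properties, body = channel.basic_get(queue='multidomain:items')
    if method is None:
        break
    print(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection.close()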
from scrapy_mq_redis.spiders import RabbitMQSpider


class MultiDomainSpider(RabbitMQSpider):
    name = 'multidomain'

    def parse(self, response):
        # parse all the things
        pass
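For illustration, a fuller parse callback might yield items and follow links; the requests it yields go back through the RabbitMQ-backed scheduler and are de-duplicated via Redis. The selectors and item fields below are assumptions for the sketch, not part of the library:
from scrapy.http import Request

from scrapy_mq_redis.spiders import RabbitMQSpider


class MultiDomainSpider(RabbitMQSpider):
    name = 'multidomain'

    def parse(self, response):
        # Yield a simple item; the fields are illustrative only.
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
        # Follow links; new requests are scheduled via RabbitMQ and
        # filtered against the Redis-backed dupe filter.
        for href in response.xpath('//a/@href').extract():
            yield Request(response.urljoin(href), callback=self.parse)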
Step 3: Run the spider using the Scrapy client
scrapy runspider multidomain_spider.py
#!/usr/bin/env python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

# The routing key follows the ${spider_name}:start_urls convention.
channel.basic_publish(exchange='',
                      routing_key='multidomain:start_urls',
                      body='http://example.com')

connection.close()
import pika
import pickle

from scrapy.http import Request
from scrapy.utils.reqser import request_to_dict

# Change the host and port as required.
conn = pika.BlockingConnection(pika.ConnectionParameters(host='localhost', port=5672))
channel = conn.channel()

# Change the URL as required. Any extra data can be passed via the meta argument.
# The callback is given by name and is resolved on the spider when the request is loaded.
req = Request("http://example.com", callback="parse")

# Change the routing key as per the spider name (${spider_name}:requests).
msg = pickle.dumps(request_to_dict(req), protocol=-1)
channel.basic_publish(exchange='', routing_key='multidomain:requests', body=msg)

conn.close()
This way the spider never dies and keeps on processing new requests.
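For example, a producer can keep feeding serialized requests into the queue while the spider stays up, mirroring the snippet above. This is only a sketch; urls.txt, the connection parameters, and the multidomain spider name are assumptions:
import pika
import pickle

from scrapy.http import Request
from scrapy.utils.reqser import request_to_dict

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost', port=5672))
channel = connection.channel()

# Serialize and publish a request for every URL in urls.txt.
with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        if not url:
            continue
        msg = pickle.dumps(request_to_dict(Request(url, callback="parse")), protocol=-1)
        channel.basic_publish(exchange='', routing_key='multidomain:requests', body=msg)

connection.close()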
See the changelog for release details.
Version | Release Date
--- | ---
0.1.0 | 2016-06-02
Copyright (c) 2016 Rakesh Chawda - Released under the GNU GPLv3 license.