Simple Scraper

Simple Scraper is a small gem that simplifies parsing web pages: you describe the data you want with XPath selectors and handlers, and it returns the extracted values.

How it works

The gem builds on several libraries that do most of the work:

HTTParty is an HTTP client
Parallel runs requests in multiple threads
Nokogiri is an HTML, XML, SAX, and Reader parser

Installation

Add this line to your application's Gemfile:

gem 'simple-scraper'

And then execute:

$ bundle

Or install it yourself:

$ gem install simple-scraper

Usage

require 'simple/scraper'

scraper = Simple::Scraper::Parser.new(
    title: { selector: "//h1[@class='title']", handler: ->(els) { els.first.text }, default: 'Ruby' },
    summary: { selector: "//h2[@class='summary']", handler: ->(els) { els.first.text } },
    link: { selector: "//a[@class='link']", handler: ->(els) { els.first['href'] } },
    text_array: { selector: "//*[@class='link']", handler: ->(els) { els.map(&:text) } }
)

result1 = scraper.parse('https://www.codica.com/')
result2 = scraper.parse(['https://www.codica.com/1', 'https://www.codica.com/2'])

The response will be similar to:

[
  {
    "title": "scraped title text",
    "summary": "scraped summary text",
    "link": "https://www.codica.com/blog/top-ruby-gems-we-cant-live-without/",
    "text_array": ["text", "text" ...]
  },
  ...
]

Or just find a page:

Simple::Scraper::Finder.find(url: 'https://www.codica.com/', query: {}, headers: {}) do |page|
  # page is an instance of Nokogiri::HTML::Document
end

Scraper attributes

title, summary, link, text_array - Arbitrary hash keys; you can name them whatever you want.
selector - An XPath expression used to find the desired elements on the page.
handler - Any Ruby object that responds to #call (a proc, a lambda, or an instance of a plain Ruby class that defines #call). The handler receives one argument: an array of the elements found on the page. Each element is an instance of Nokogiri::XML::Element. See the Nokogiri documentation for more info.
default - The value returned when the scraper cannot find the desired element using the selector.
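
Because a handler only needs to respond to #call, a small class can be used in place of a lambda. The sketch below is illustrative only: FirstText and FakeElement are hypothetical names, and FakeElement merely stands in for Nokogiri::XML::Element so the example runs without fetching a page.

```ruby
# Hypothetical handler class: any object that responds to #call can be a handler.
class FirstText
  # elements is the collection of nodes matched by the selector.
  # Returns the stripped text of the first node, or nil when nothing matched.
  def call(elements)
    first = elements.first
    first && first.text.strip
  end
end

# Stand-in for Nokogiri::XML::Element, used only to keep this sketch self-contained.
FakeElement = Struct.new(:text)

handler = FirstText.new
handler.call([FakeElement.new('  Ruby  ')]) # => "Ruby"
handler.call([])                            # => nil
```

An instance could then be passed wherever a lambda is accepted, for example title: { selector: "//h1[@class='title']", handler: FirstText.new }.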

Query parameters and headers

query = { page: 2 }
headers = { 'Authorization': 'Bearer' }
result = scraper.parse('https://www.codica.com/', query: query, headers: headers)
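
The gem's internals are not shown here, but since requests go through HTTParty, the query hash is presumably sent as the URL's query string. The sketch below uses only Ruby's standard library to show what the resulting URL would look like under that assumption; build_url is a hypothetical helper and not part of simple-scraper.

```ruby
require 'uri'

# Hypothetical helper: appends a query hash to a URL as a query string.
# Any query string already present in the URL is replaced.
def build_url(url, query = {})
  uri = URI.parse(url)
  return uri.to_s if query.empty?

  uri.query = URI.encode_www_form(query)
  uri.to_s
end

build_url('https://www.codica.com/', page: 2) # => "https://www.codica.com/?page=2"
```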

Configuration

Proxy

Simple::Scraper.configure do |config|
  config.proxy_addr = 'proxy.something.com'
  config.proxy_port = 80
  config.proxy_user = 'user:'
  config.proxy_pass = 'password'
end

Logging

Simple::Scraper.configure do |config|
  config.logger = Logger.new('path/to/my/logs')
end

Logging is turned off by default.

Multithreading

Simple::Scraper.configure do |config|
  config.number_of_threads = 20
end

By default, the scraper runs in a single thread.
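
To make the effect of number_of_threads more concrete, here is a standard-library sketch of spreading a list of URLs over a fixed number of worker threads. It is not the gem's implementation (the gem relies on Parallel for this); parallel_map and the stubbed block are hypothetical.

```ruby
# Hypothetical helper: maps items using a fixed number of worker threads
# and returns the results in the same order as the input.
def parallel_map(items, threads: 1)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  results = Array.new(items.size)

  workers = Array.new([threads, 1].max) do
    Thread.new do
      loop do
        item, index = begin
          queue.pop(true) # non-blocking pop; raises ThreadError when the queue is empty
        rescue ThreadError
          break # no work left, so this worker stops
        end
        results[index] = yield(item)
      end
    end
  end

  workers.each(&:join)
  results
end

urls = ['https://www.codica.com/1', 'https://www.codica.com/2']
# The block is a stub; a real scraper would perform an HTTP request here.
pages = parallel_map(urls, threads: 2) { |url| "fetched #{url}" }
```

More threads help mainly when the work is dominated by waiting on the network, which is typical for scraping.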

Reset

You might need to reset the configuration to its defaults:

Simple::Scraper.reset

After that, you can provide a new configuration if needed.

License

About Codica

simple-scraper is maintained and funded by Codica. The names and logos for Codica are trademarks of Codica.

We love open source software! See our other projects or hire us to design, develop, and grow your product.
