A rewritten and modernized clone of cigolpl/web-scraper (https://github.com/digestoo/web-scraper).
It performs web scraping using headless Google Chrome in the background.
Requires Node.js > 18.x or a Docker machine.
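If you plan to run it with npm rather than Docker, a quick way to check the installed Node.js version:

```sh
# Print the installed Node.js version; it should be v18.x or newer to run this project with npm.
node --version
```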
git clone [email protected]:digestoo/web-scraper.git
cd web-scraper
npm install
PORT=8080 npm start
You can use additional environment variables when running the API with npm (see the example after the list):
- PROXY_URL - proxy URL in a format like http://username:password@ip:port
- EXECUTABLE_PATH - path to a custom Google Chrome binary (you can find it in chrome://version)
- USER_DATA_DIR - path to the user profile
- SLOW_MO - slow down operations by the specified number of ms
- HEADLESS=false - run the server in headful mode
- USER_AGENT - global user agent
- PORT - port the API listens on
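As a sketch of how several of these variables can be combined in one start command (the proxy address, Chrome path, profile directory, and user agent below are placeholder values, not defaults of this project):

```sh
# Placeholder values for illustration only; substitute your own proxy, Chrome binary path,
# profile directory and user agent.
PROXY_URL=http://username:[email protected]:3128 \
EXECUTABLE_PATH=/usr/bin/google-chrome \
USER_DATA_DIR=/tmp/chrome-profile \
SLOW_MO=100 \
HEADLESS=false \
USER_AGENT="WebScraper" \
PORT=8080 npm start
```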
If you have any problems running it on your local machine, this page might be helpful: https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md
docker pull cigolpl/web-scraper
docker run -it -p 8080:8080 cigolpl/web-scraper
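If you need the same environment variables with the Docker image, they can presumably be passed through Docker's -e flags (the values below are placeholders):

```sh
# Assumes the image honours the same environment variables as the npm server; placeholder values shown.
docker run -it -p 8080:8080 \
  -e PROXY_URL=http://username:[email protected]:3128 \
  -e USER_AGENT="WebScraper" \
  cigolpl/web-scraper
```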
curl -XGET -H "Content-Type: application/json" -d '{"url":"http://stackoverflow.com/questions/3207418/crawler-vs-scraper","pageFunction":"function($) { return { title: $(\"title\").text() }}","userAgent":"WebScraper"}' http://localhost:8080
Body params (an example request combining them follows the list):
- url - address of the page to scrape
- pageFunction - function with a jQuery context, responsible for extracting data
- delay - delay between requests in ms
- userAgent - user agent for this request
- noCookies - disable cookies for this request
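As a sketch of a request that combines the remaining body params (the target URL is a placeholder, and noCookies is assumed here to take a boolean):

```sh
# Hypothetical request for illustration: scrape the first <h1> with a 1000 ms delay,
# a custom user agent and cookies disabled.
curl -XGET -H "Content-Type: application/json" \
  -d '{"url":"http://example.com","pageFunction":"function($) { return { heading: $(\"h1\").first().text() } }","delay":1000,"userAgent":"WebScraper","noCookies":true}' \
  http://localhost:8080
```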