
Passing a URL captured with Scrapy-Selenium to parse() in a Scrapy Spider #81

Open
Mathoholic opened this issue Nov 18, 2020 · 7 comments


Mathoholic commented Nov 18, 2020

I am trying to scrape a website that has some dropdowns, so I planned to use the Scrapy framework with Scrapy-Selenium (more here) to click through the dropdowns (a nested for loop), capture the URL with the code below, and pass it to the parse() function to look for the needed data and scrape it into a MySQL database.

now_url = self.driver.current_url
print('Current URL is: ' + now_url)
yield Request(now_url, callback=self.parse)  # Request imported from scrapy

def parse(self, response):
    # This function will loop through each page and capture the data sets
    # available on each page of medicine.

    # Create the item to be stored (declared in this crawler's items.py):
    items = GrxItem()

    # Loop over the items on each medicine page (from A-Z), add them to items,
    # and send them through the pipelines to the SQL DB.
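For context, GrxItem is declared in the crawler's items.py along these lines (the field names here are only illustrative):

import scrapy

class GrxItem(scrapy.Item):
    # Illustrative fields; the real item holds whatever ends up in MySQL.
    medicine_name = scrapy.Field()
    price = scrapy.Field()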

But the logic doesn't seem to work as expected. Any insight on how to deal with this is appreciated. The full code is here.

EDIT: I tried using SeleniumRequest() as well, but that doesn't seem to work either.

@Mathoholic (Author)

Is anybody here? I added a link so you can read the full code I wrote.


tristanlatr commented Nov 19, 2020

@Mathoholic, you should not have to use webdriver.Chrome directly in your start_requests method; it seems your code bypasses Scrapy's machinery and uses Selenium directly.

Also, you should use the SeleniumRequest object, e.g. yield SeleniumRequest(now_url, callback=self.parse).

Please review this project's spider example: https://github.com/tristanlatr/charityvillage_jobs/blob/master/charityvillage_jobs/spiders/charityvillage_com.py

Then you will find the webdriver.Chrome object under response.request.meta['driver'], which you can use to click on any appropriate dropdown, etc.
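Here is an untested sketch of the pattern (the CSS selector is a placeholder you would adapt to the site):

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By

class DrugsSpider(scrapy.Spider):
    name = 'drugs_example'  # hypothetical name

    def start_requests(self):
        # Let scrapy-selenium manage the browser; no manual webdriver.Chrome.
        yield SeleniumRequest(url='https://www.goodrx.com/drugs',
                              callback=self.parse)

    def parse(self, response):
        # scrapy-selenium exposes the driver that rendered this response.
        driver = response.request.meta['driver']
        # 'nav .dropdown a' is a placeholder selector. Collecting hrefs first
        # avoids stale-element errors once the page navigates away.
        urls = [a.get_attribute('href')
                for a in driver.find_elements(By.CSS_SELECTOR, 'nav .dropdown a')]
        for url in urls:
            yield SeleniumRequest(url=url, callback=self.parse_medicine)

    def parse_medicine(self, response):
        # Extract the data for each medicine page here.
        pass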

I hope the issue is clearer now.

@Mathoholic (Author)

Oh!! Yes, now I get it. Thanks for taking the time to share this much-needed insight. I will make changes to my Spider. Thanks a lot.


Mathoholic commented Nov 21, 2020

@tristanlatr, one last thing I want to know: with Scrapy alone we can randomize the user agent, but with scrapy_selenium the randomization doesn't work. The website throws a 403 with a captcha-like page.

The code was this:

import scrapy
from scrapy_selenium import SeleniumRequest


class GrxmedSpider(scrapy.Spider):
    name = 'grxmed'

    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.goodrx.com/drugs',
            wait_time=60,
            screenshot=True,
            callback=self.parse
        )

    def parse(self, response):
        # Save the screenshot that scrapy-selenium took of the rendered page.
        img = response.request.meta['screenshot']
        with open('screenshot.png', 'wb') as f:
            f.write(img)

(screenshot attached)
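From the scrapy-selenium README, the browser itself is configured through settings.py, so I assume a fixed user agent would have to be passed as a driver argument there rather than through Scrapy's user-agent middleware. A minimal sketch, assuming Chrome (the UA string is just an example):

# settings.py
from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    # Example UA; Scrapy's USER_AGENT setting never reaches the Selenium browser.
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
]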

@Mathoholic (Author)

@tristanlatr, please share some insight on this issue; I would be grateful.

@tristanlatr

Sorry, if you hit a captcha, you are out of luck, I think.


Flushot commented Apr 23, 2021

@Mathoholic You've hit something that's unfortunately pretty common nowadays, and it isn't just limited to user agents. There's no easy answer to this, because bot detection and countermeasures are an evolving arms race. There are many factors that go into evading bot detection (including machine learning), but it's not the fault of Selenium, Scrapy, or this project in particular.
