Fork of scrapy-selenium that works with recent versions of Selenium 4.
All settings except SELENIUM_DRIVER_NAME are now optional. The middleware should still work with existing Scrapy projects that integrate the upstream package.
Tested with Python 3.12. You will need a Selenium 4 compatible browser.
With Poetry:
poetry add git+https://github.com/jirpok/scrapy-selenium4.git
Set the browser, and optionally driver arguments, in your Scrapy settings, e.g. for Firefox (Edge and Safari are also supported):
SELENIUM_DRIVER_NAME = "firefox"
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]
Firefox preferences can be set via SELENIUM_BROWSER_FF_PREFS, e.g. to disable JavaScript and image loading:
SELENIUM_BROWSER_FF_PREFS = {
    "javascript.enabled": False,        # disable JavaScript
    "permissions.default.image": 2      # block all images from loading
}
to use a SOCKS proxy:
SELENIUM_BROWSER_FF_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.socks_remote_dns": True,
    "network.proxy.socks": "<HOST>",
    "network.proxy.socks_port": <PORT>
}
or to use an HTTP/HTTPS proxy:
SELENIUM_BROWSER_FF_PREFS = {
    "network.proxy.type": 1,
    "network.proxy.http": "<HOST>",
    "network.proxy.http_port": <PORT>,
    "network.proxy.ssl": "<HOST>",
    "network.proxy.ssl_port": <PORT>
}
SELENIUM_DRIVER_NAME = "chrome"
SELENIUM_DRIVER_ARGUMENTS=["--headless=new"]
Optionally, specify the path to the browser executable:
SELENIUM_BROWSER_EXECUTABLE_PATH = "path/to/browser/executable"
Selenium requires a driver (GeckoDriver, ChromeDriver, …) to interface with the chosen browser. Recent versions of Selenium 4 ship with Selenium Manager, which handles these dependencies automatically. To point to a specific driver executable instead, set:
SELENIUM_DRIVER_EXECUTABLE_PATH = "path/to/driver/executable"
To use a remote Selenium server instead of a local driver, set:
SELENIUM_COMMAND_EXECUTOR = "http://localhost:4444/wd/hub"
(Do not set SELENIUM_DRIVER_EXECUTABLE_PATH along with SELENIUM_COMMAND_EXECUTOR.)
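For example, a minimal settings sketch for a remote setup, assuming a Selenium server is already running and reachable at localhost:4444 (the URL is illustrative):
SELENIUM_DRIVER_NAME = "firefox"
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]
# route requests through the remote Selenium server instead of a local driver
SELENIUM_COMMAND_EXECUTOR = "http://localhost:4444/wd/hub"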
Enable the middleware in your project's DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium4.SeleniumMiddleware": 800
}
Use scrapy_selenium4.SeleniumRequest instead of the Scrapy built-in Request:
from scrapy_selenium4 import SeleniumRequest
yield SeleniumRequest(url=url, callback=self.parse)
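For context, a minimal spider sketch using SeleniumRequest (the spider name and URL are placeholders):
import scrapy
from scrapy_selenium4 import SeleniumRequest

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # placeholder URL; any page you want rendered by the browser
        yield SeleniumRequest(url="https://example.com", callback=self.parse)

    def parse(self, response):
        # regular Scrapy selectors work on the returned response
        yield {"title": response.css("title::text").get()}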
The request will have an additional meta key driver containing the Selenium driver.
def parse(self, response):
    print(response.request.meta["driver"].title)
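Assuming the middleware, like upstream scrapy-selenium, builds the response body from the driver's rendered page source, regular Scrapy selectors can be combined with direct driver access; a small sketch with a hypothetical selector:
def parse(self, response):
    driver = response.request.meta["driver"]
    # the live WebDriver is available alongside the rendered response
    self.logger.info("current URL in the browser: %s", driver.current_url)
    # "h1::text" is a hypothetical selector on the rendered HTML
    for heading in response.css("h1::text").getall():
        yield {"heading": heading}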
Perform an explicit wait before the response is returned to the spider:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
yield SeleniumRequest(
    url=url,
    callback=self.parse,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, "some_id"))
)
Take a screenshot of the page and add the binary data of the captured .png to the response meta:
yield SeleniumRequest(
    url=url,
    callback=self.parse,
    screenshot=True
)
def parse(self, response):
    with open("image.png", "wb") as image_file:
        image_file.write(response.meta["screenshot"])
Execute custom JavaScript code:
yield SeleniumRequest(
    url=url,
    callback=self.parse,
    script="window.scrollTo(0, document.body.scrollHeight);",
)
Run tests
pytest