AdaptivePlaywrightCrawler: programmatically deciding when to render JS #2446
Hello, and thank you for trying out AdaptivePlaywrightCrawler! Regarding manually specifying the rendering type: we considered adding it, but we felt it would be better not to offer too many options at first. Furthermore, the general idea is that the crawler should decide this automatically. We are open to discussing this, though; we might end up with something cool in the end 🙂 When considering this initially, our idea was another callback in the crawler options that would return a rendering type hint for a request.
This is, to an extent, done automatically: if an HTTP crawl throws an exception, the request is retried in a browser.
Since the data used for rendering type prediction is strictly categorized by request label, this should also work out of the box if you use different labels for product listings and details. It is true that this should be documented before we consider the feature stable.
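To make the label-based bookkeeping concrete, here is a rough sketch of how per-label rendering type prediction could work. This is purely illustrative and not crawlee's actual implementation; the `RenderingTypePredictor` class and its method names are invented for this example.

```typescript
// Illustrative model of per-label rendering type prediction.
// NOT crawlee's actual implementation; all names here are invented.
type RenderingType = 'clientOnly' | 'static';

class RenderingTypePredictor {
  // Per-label tallies of which rendering type worked for past requests.
  private stats = new Map<string, { clientOnly: number; static: number }>();

  // Record the rendering type that succeeded for a request with this label.
  record(label: string, type: RenderingType): void {
    const entry = this.stats.get(label) ?? { clientOnly: 0, static: 0 };
    entry[type] += 1;
    this.stats.set(label, entry);
  }

  // Predict the rendering type for a label; with no data yet, default to
  // the safe choice (full browser rendering).
  predict(label: string): RenderingType {
    const entry = this.stats.get(label);
    if (!entry) return 'clientOnly';
    return entry.static >= entry.clientOnly ? 'static' : 'clientOnly';
  }
}
```

Because each label (e.g. a listing label vs. a detail label) is tracked as a separate key, a site whose listings are static HTML but whose detail pages need JS would get the right treatment for each, which is why using distinct labels matters here.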
Please elaborate if you like 🙂
Thanks for your detailed answer, I think I understand better now: we can crawl using HTTP by default, and if we try selecting an element that is not rendered, or run a JS command (like scrolling), an error will be raised and the crawler will retry with JS rendering next. Is that correct? In terms of API design, I imagine combining a crawler-level default rendering parameter with tying the rendering type to the handler that will handle the request. Something like:

```js
const crawler = new AdaptivePlaywrightCrawler({
    renderJs: true, // renderingTypeDetectionRatio: 0.1,
    requestHandler: router,
});

await enqueueLinks({
    selector: "<selector>",
    label: "<handler name>",
    renderJs: false,
});
```
I don't really get this rendering type hint; I'd rather be in full control, but maybe my use case is too specific.
Well, almost, just a few clarifications:
Specifying a crawler-wide rendering type default doesn't seem useful to me. If your results can be extracted with plain HTTP, the crawler will detect that soon enough, and the few initial browser crawls should not present a problem. We may consider something like the
Yeah, I believe we mean the same thing: by passing the hint, you'd basically enforce the rendering type for a particular request.
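One way such a hint could interact with the automatic prediction is sketched below. This is a hypothetical design, not an existing crawlee API; `renderingTypeHint` and `resolveRenderingType` are names invented for the example.

```typescript
// Hypothetical sketch: a user-supplied hint callback enforces the
// rendering type when it returns a value; otherwise the label-based
// prediction is used. All names here are invented, not crawlee API.
type RenderingType = 'clientOnly' | 'static';

interface RequestLike {
  url: string;
  label?: string;
}

// What a user might pass in crawler options as `renderingTypeHint`.
type RenderingTypeHint = (request: RequestLike) => RenderingType | undefined;

function resolveRenderingType(
  request: RequestLike,
  predict: (label: string) => RenderingType,
  hint?: RenderingTypeHint,
): RenderingType {
  // An explicit hint wins: it enforces the rendering type per request...
  const hinted = hint?.(request);
  if (hinted !== undefined) return hinted;
  // ...otherwise fall back to the automatic, label-based prediction.
  return predict(request.label ?? 'default');
}
```

Returning `undefined` from the hint keeps the adaptive behaviour, so users who want full control (as above) can always return a concrete value, while everyone else gets the automatic detection for free.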
For our own internal spidering project we ended up building this functionality (before the 'adaptive' stuff was in place; it was a terrible kludge, but it still worked). Our use case was a little different: lots of 'mystery meat' links pointing at PDF files, mixed with dynamically rendered JS pages that needed the full Playwright browser. We implemented a pre-page-load stage, where we made a HEAD request and checked the status code, MIME type, etc. of the response, and used that to decide whether to save a local binary file, log an error, or render the full page. It's not quite the same use case, but it is an example of how making the per-request decision logic a bit more accessible could be quite helpful.
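The triage logic described above might look something like this. The function name and action categories are ours, and a real implementation would first issue the HEAD request with its HTTP client of choice (e.g. `fetch(url, { method: 'HEAD' })`) before classifying the response.

```typescript
// Simplified sketch of the pre-page-load triage described above:
// given a HEAD response's status code and Content-Type header, decide
// what to do with the URL. Names and categories are illustrative.
type CrawlAction = 'save-binary' | 'log-error' | 'render-page';

function triageHeadResponse(status: number, contentType: string): CrawlAction {
  // Error statuses are logged rather than crawled.
  if (status >= 400) return 'log-error';
  // Strip parameters like "; charset=utf-8" before comparing MIME types.
  const mime = contentType.split(';')[0].trim().toLowerCase();
  // Binary documents (PDFs etc.) are downloaded directly...
  if (mime === 'application/pdf' || mime === 'application/octet-stream') {
    return 'save-binary';
  }
  // ...and anything that looks like a page goes to the full browser.
  return 'render-page';
}
```

A HEAD request is cheap compared to a full browser render, so even a coarse classifier like this can save a lot of work when many links turn out to be binary files.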
Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)
Feature
Add the possibility to programmatically decide when to render JS
→ use HTTP crawling by default but if some condition is met switch to JS rendering.
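As a rough illustration of such a condition, one simple heuristic (not crawlee's actual detection algorithm) is to check whether the plain-HTTP response contains any meaningful visible text; an almost-empty body usually means the page is rendered client-side.

```typescript
// Illustrative heuristic only, not crawlee's detection algorithm:
// strip scripts, styles, and tags, then see how much visible text is
// left. A near-empty result suggests the page needs a real browser.
function looksClientRendered(html: string): boolean {
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<style[\s\S]*?<\/style>/gi, '')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return text.length < 50; // threshold is arbitrary for the example
}
```

A crawler could fetch each page over plain HTTP first, and only re-enqueue it for JS rendering when this check (or a thrown selector error) indicates the static HTML is not enough.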
Motivation
Use cases:
Ideal solution or implementation, and any additional constraints
Maybe adding a parameter in `enqueueLinks` to process the URL with JS rendering.
Alternative solutions or implementations
No response
Other context
No response