fix: Only apply requestHandlerTimeout to request handler #1474
base: master
```diff
@@ -1,9 +1,10 @@
-from ._abstract_http_crawler import AbstractHttpCrawler
+from ._abstract_http_crawler import AbstractHttpCrawler, HttpCrawlerOptions
 from ._abstract_http_parser import AbstractHttpParser
 from ._http_crawling_context import ParsedHttpCrawlingContext

 __all__ = [
     'AbstractHttpCrawler',
     'AbstractHttpParser',
+    'HttpCrawlerOptions',
     'ParsedHttpCrawlingContext',
 ]
```
```diff
@@ -3,9 +3,11 @@
 import asyncio
 import logging
 import warnings
+from datetime import timedelta
 from functools import partial
 from typing import TYPE_CHECKING, Any, Generic, Literal

+import playwright.async_api
 from more_itertools import partition
 from pydantic import ValidationError
 from typing_extensions import NotRequired, TypedDict, TypeVar
@@ -106,6 +108,7 @@ def __init__(
         fingerprint_generator: FingerprintGenerator | None | Literal['default'] = 'default',
         headless: bool | None = None,
         use_incognito_pages: bool | None = None,
+        navigation_timeout: timedelta | None = None,
         **kwargs: Unpack[BasicCrawlerOptions[PlaywrightCrawlingContext, StatisticsState]],
     ) -> None:
         """Initialize a new instance.
@@ -134,6 +137,8 @@ def __init__(
             use_incognito_pages: By default pages share the same browser context. If set to True each page uses its
                 own context that is destroyed once the page is closed or crashes.
                 This option should not be used if `browser_pool` is provided.
+            navigation_timeout: Timeout for navigation (the process between opening a Playwright page and calling
+                the request handler)
             kwargs: Additional keyword arguments to pass to the underlying `BasicCrawler`.
         """
         configuration = kwargs.pop('configuration', None)
```

Collaborator (author): Open question - should the @B4nan your opinion on this is also welcome. This is due to a change in crawlee js v4, so we should be aligned.

Member: I would include the hooks, otherwise we would need another two options for the timeouts of the pre- and post-navigation hooks. They need to be part of some timeout handler, and having three options just for the navigation feels like too much.

Collaborator (author): @B4nan include them in

Member: Yes, either that, or separate timeouts over pre- and post-navigation hooks. User code needs to be wrapped in timeout handlers.

Collaborator: I would say we can start with one shared timeout for all, and adjust later if needed.

Collaborator (author): Well, that's how we got here 😁 but yeah, including the hooks in the navigation timeout is fine with me.
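The design discussed in the thread above, one shared timeout covering the pre-navigation hooks, the navigation itself, and the post-navigation hooks, can be sketched as follows. This is an illustrative sketch, not the Crawlee implementation; `navigate_with_hooks` and its parameters are hypothetical names.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def navigate_with_hooks(
    pre_hooks: list[Callable[[], Awaitable[None]]],
    navigate: Callable[[], Awaitable[object]],
    post_hooks: list[Callable[[], Awaitable[None]]],
    timeout_seconds: float,
) -> object:
    """Hypothetical helper: run hooks and navigation under one shared timeout."""

    async def _run() -> object:
        for hook in pre_hooks:
            await hook()
        result = await navigate()
        for hook in post_hooks:
            await hook()
        return result

    # A hook that hangs now fails with asyncio.TimeoutError instead of
    # stalling the context pipeline indefinitely.
    return await asyncio.wait_for(_run(), timeout=timeout_seconds)
```

With this shape, no extra options for pre/post hook timeouts are needed; everything between opening the page and invoking the request handler shares one deadline.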
@@ -202,6 +207,8 @@ def __init__( | |
| if 'concurrency_settings' not in kwargs or kwargs['concurrency_settings'] is None: | ||
| kwargs['concurrency_settings'] = ConcurrencySettings(desired_concurrency=1) | ||
|
|
||
| self._navigation_timeout = navigation_timeout or timedelta(minutes=1) | ||
|
|
||
| super().__init__(**kwargs) | ||
|
|
||
| async def _open_page( | ||
|
|
@@ -266,6 +273,7 @@ async def _navigate( | |
| Raises: | ||
| ValueError: If the browser pool is not initialized. | ||
| SessionError: If the URL cannot be loaded by the browser. | ||
| TimeoutError: If navigation does not succeed within the navigation timeout. | ||
| Yields: | ||
| The enhanced crawling context with the Playwright-specific features (page, response, enqueue_links, | ||
|
|
@@ -297,7 +305,12 @@ async def _navigate( | |
| # Set route_handler only for current request | ||
| await context.page.route(context.request.url, route_handler) | ||
|
|
||
| response = await context.page.goto(context.request.url) | ||
| try: | ||
| response = await context.page.goto( | ||
| context.request.url, timeout=self._navigation_timeout.total_seconds() * 1000 | ||
| ) | ||
| except playwright.async_api.TimeoutError as exc: | ||
| raise asyncio.TimeoutError from exc | ||
|
|
||
| if response is None: | ||
| raise SessionError(f'Failed to load the URL: {context.request.url}') | ||
|
|
||
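Two mechanics in the hunk above are easy to miss: Playwright's `page.goto()` takes its timeout in milliseconds while the crawler option is a `datetime.timedelta`, and Playwright's own `TimeoutError` is re-raised as `asyncio.TimeoutError` so callers deal with a single exception type. A minimal sketch of both (the wrapper function is hypothetical, not the actual Crawlee code):

```python
import asyncio
from datetime import timedelta


def to_playwright_timeout_ms(timeout: timedelta) -> float:
    # Playwright APIs such as page.goto() expect milliseconds,
    # while the crawler option is a datetime.timedelta.
    return timeout.total_seconds() * 1000


async def goto_or_timeout(page, url: str, timeout: timedelta):
    # Hypothetical wrapper mirroring the diff: translate Playwright's
    # TimeoutError into asyncio.TimeoutError for a uniform failure mode.
    import playwright.async_api  # assumes playwright is installed

    try:
        return await page.goto(url, timeout=to_playwright_timeout_ms(timeout))
    except playwright.async_api.TimeoutError as exc:
        raise asyncio.TimeoutError from exc
```

So the default `navigation_timeout or timedelta(minutes=1)` translates to `60000` milliseconds at the Playwright boundary.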
```diff
@@ -104,6 +104,7 @@ async def crawl(
         session: Session | None = None,
         proxy_info: ProxyInfo | None = None,
         statistics: Statistics | None = None,
+        timeout: timedelta | None = None,
     ) -> HttpCrawlingResult:
         """Perform the crawling for a given request.

@@ -114,6 +115,7 @@ async def crawl(
             session: The session associated with the request.
             proxy_info: The information about the proxy to be used.
             statistics: The statistics object to register status codes.
+            timeout: Request timeout

         Raises:
             ProxyError: Raised if a proxy-related error occurs.
```

Collaborator: Something more descriptive? Although not sure whether "process" is the correct word, I cannot find a better one.
@@ -132,6 +134,7 @@ async def send_request( | |||||
| payload: HttpPayload | None = None, | ||||||
| session: Session | None = None, | ||||||
| proxy_info: ProxyInfo | None = None, | ||||||
| timeout: timedelta | None = None, | ||||||
| ) -> HttpResponse: | ||||||
| """Send an HTTP request via the client. | ||||||
|
|
||||||
|
|
@@ -144,6 +147,7 @@ async def send_request( | |||||
| payload: The data to be sent as the request body. | ||||||
| session: The session associated with the request. | ||||||
| proxy_info: The information about the proxy to be used. | ||||||
| timeout: Request timeout | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Something more descriptive? Although not sure whether process is the correct word, I cannot find a better one.
Suggested change
|
||||||
|
|
||||||
| Raises: | ||||||
| ProxyError: Raised if a proxy-related error occurs. | ||||||
|
|
||||||
I am just wondering: can the context pipeline now get stuck?

navigation_timeout (some parts of the context pipeline) -> time-unlimited parts of the context pipeline -> request_handler_timeout (request handler)
Realistically, that has always been the case, but yeah, this increases the odds by a bit. The hooks (discussed in #1474 (comment)) are probably the biggest risk.
Perhaps we could add some comically large timeout to the context pipeline execution as a whole, too.
In that case, it makes even more sense to include the hooks in the timeout.
Other steps do not appear to be a problem in our pipelines for now.
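The "comically large" outer timeout suggested above could look roughly like this: each step keeps its own tighter deadline, while the whole pipeline run gets a generous cap so that an unguarded step cannot hang forever. The names and the constant are illustrative, not the Crawlee implementation.

```python
import asyncio

# Generous upper bound on one whole context-pipeline run; individual steps
# (navigation, request handler) keep their own, much tighter timeouts.
PIPELINE_CAP_SECONDS = 30 * 60  # comically large on purpose


async def run_pipeline(steps):
    """Hypothetical sketch: run pipeline steps under one outer cap."""

    async def _run_all():
        for step in steps:
            await step()

    # Any step without its own timeout is still bounded by the outer cap,
    # so the pipeline as a whole cannot get stuck indefinitely.
    await asyncio.wait_for(_run_all(), timeout=PIPELINE_CAP_SECONDS)
```

Since the cap is far above any realistic per-step timeout, it never fires in normal operation; it only catches the pathological "time-unlimited part of the pipeline" case raised in the comment above.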