feat: Add ErrorSnapshotter to ErrorTracker #1125
TODO: Figure out how to run it in Playwright crawlers. By the time ErrorTracker and ErrorSnapshotter are run, the page is already closed.
TODO: Fix mypy issues
TODO: Add more tests
Pull Request Overview
This PR implements an ErrorSnapshotter to capture page snapshots (HTML and screenshot) on the first encountered error and updates error tracking to support asynchronous error handling with snapshot capture. Key changes include:
- Introducing the ErrorSnapshotter class and integrating it within ErrorTracker.
- Updating tests for both Playwright and HTTP crawlers to validate snapshot functionality.
- Refactoring error tracking calls to be asynchronous across the codebase.
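The "snapshot on the first encountered error" behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FirstErrorSnapshotTracker`, its `take_snapshot` callback, and the grouping key are all hypothetical names and simplifications chosen for the example.

```python
from __future__ import annotations

import asyncio
from collections import Counter
from typing import Awaitable, Callable, Optional


class FirstErrorSnapshotTracker:
    """Hypothetical stand-in for an error tracker with an async snapshot hook."""

    def __init__(
        self,
        take_snapshot: Optional[Callable[[Exception], Awaitable[None]]] = None,
    ) -> None:
        self._counts: Counter[str] = Counter()
        self._take_snapshot = take_snapshot

    @staticmethod
    def _group_key(error: Exception) -> str:
        # Assumption: errors are grouped by type and message only.
        return f'{type(error).__name__}: {error}'

    async def add(self, error: Exception) -> None:
        """Count the error and snapshot only the first one in its group."""
        key = self._group_key(error)
        self._counts[key] += 1
        if self._counts[key] == 1 and self._take_snapshot is not None:
            await self._take_snapshot(error)

    @property
    def total(self) -> int:
        return sum(self._counts.values())


async def demo() -> list[str]:
    captured: list[str] = []

    async def fake_snapshot(error: Exception) -> None:
        # A real hook would store HTML and a screenshot; here we only record the message.
        captured.append(str(error))

    tracker = FirstErrorSnapshotTracker(take_snapshot=fake_snapshot)
    for message in ('timeout', 'timeout', 'not found'):
        await tracker.add(ValueError(message))
    return captured
```

Because `add` is a coroutine, every call site has to `await` it, which is why the refactoring touches the retry and failure paths across the codebase.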
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
File | Description |
---|---|
tests/unit/server_endpoints.py | Adds HTML response constants for server endpoint tests. |
tests/unit/server.py | Refactors inline HTML responses to use defined constants. |
tests/unit/crawlers/_playwright/test_playwright_crawler.py | Adds tests for snapshot retrieval and error snapshots in the Playwright crawler. |
tests/unit/crawlers/_http/test_http_crawler.py | Adds tests to verify snapshot functionality in the HTTP crawler and updates error snapshot test. |
tests/unit/_statistics/test_error_tracker.py | Updates tests to use async error tracker methods. |
src/crawlee/statistics/_statistics.py | Introduces a new parameter to control error snapshot saving. |
src/crawlee/statistics/_error_tracker.py | Refactors the error tracker to support async snapshot capture via ErrorSnapshotter. |
src/crawlee/statistics/_error_snapshotter.py | Implements the ErrorSnapshotter class to capture and store HTML and JPEG snapshots. |
src/crawlee/crawlers/_playwright/_playwright_pre_nav_crawling_context.py | Adds a get_snapshot method to capture page content and screenshot. |
src/crawlee/crawlers/_playwright/_playwright_crawler.py | Modifies context yielding to capture errors for early snapshot collection. |
src/crawlee/crawlers/_basic/_context_pipeline.py | Updates pipeline middleware signature to support exception propagation. |
src/crawlee/crawlers/_basic/_basic_crawler.py | Updates error tracker calls to be asynchronous in retry and failure scenarios. |
src/crawlee/crawlers/_abstract_http/_http_crawling_context.py | Adds a get_snapshot method returning HTML from HTTP responses. |
src/crawlee/_types.py | Defines the PageSnapshot data class and updates the BasicCrawlingContext interface. |
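As a rough illustration of how a `PageSnapshot` data class and a per-context `get_snapshot` method could fit together, consider the sketch below. The field names (`screenshot`, `html`), the `FakeHttpContext` class, and the `snapshot_files` helper are assumptions made for this example and may not match the code in the PR.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class PageSnapshot:
    """Assumed shape: either part may be missing, depending on the crawler."""

    screenshot: Optional[bytes] = None
    html: Optional[str] = None


class SupportsSnapshot(Protocol):
    async def get_snapshot(self) -> PageSnapshot: ...


class FakeHttpContext:
    """Hypothetical HTTP-style context: it has a response body but no browser page."""

    def __init__(self, body: bytes) -> None:
        self._body = body

    async def get_snapshot(self) -> PageSnapshot:
        # An HTTP crawler can offer HTML only; there is nothing to screenshot.
        return PageSnapshot(html=self._body.decode('utf-8'))


async def snapshot_files(context: SupportsSnapshot, base_name: str) -> dict[str, bytes]:
    """Turn a snapshot into file-name/content pairs ready to be stored."""
    snapshot = await context.get_snapshot()
    files: dict[str, bytes] = {}
    if snapshot.html is not None:
        files[f'{base_name}.html'] = snapshot.html.encode('utf-8')
    if snapshot.screenshot is not None:
        files[f'{base_name}.jpg'] = snapshot.screenshot
    return files
```

Keeping both fields optional lets the same storing logic serve HTTP contexts (HTML only) and browser contexts (HTML plus a JPEG screenshot).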
Comments suppressed due to low confidence (2)
tests/unit/crawlers/_http/test_http_crawler.py:712
- [nitpick] The use of the variable 'key_info' after the for-loop may be unclear. Consider capturing the key from the dictionary explicitly for clarity.
assert key_info.key.endswith('.html')
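One way to address this nitpick is to collect the keys first and assert on an explicitly selected element, rather than relying on the loop variable surviving the loop. The sketch below uses a hypothetical `KeyInfo` stand-in and sample data; it is not the actual test from the PR.

```python
from dataclasses import dataclass


@dataclass
class KeyInfo:
    """Hypothetical stand-in for the key metadata objects iterated in the test."""

    key: str


def collect_keys(entries: list) -> list:
    # Materialize the keys so the assertions do not depend on a leaked loop variable.
    return [entry.key for entry in entries]


stored = [KeyInfo('error_abc.html')]  # sample data, assumed for illustration
keys = collect_keys(stored)
assert len(keys) == 1
assert keys[0].endswith('.html')
```

Asserting the length first also makes the test fail clearly when no snapshot was stored, instead of raising a `NameError` on an unbound loop variable.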
src/crawlee/statistics/_error_tracker.py:73
- The variable 'new_error_group_message' is initialized to an empty string and never updated, which could lead to confusion. Consider removing it or updating its value if it is intended for wildcard similarity matching.
new_error_group_message = '' # In case of wildcard similarity match
LGTM
Description
Added ErrorSnapshotter, which can take a page snapshot (screenshot or HTML) on the first occurrence of each unique error. Added documentation describing how to use it.
Issues
ErrorTracker #151
Testing
Added unit tests.
Example
PlaywrightCrawler-based actor run with ErrorSnapshotter: https://console.apify.com/actors/C0lWh1UCQvgdArp6R/runs/UNuaiRWBDgxiJau0U#storage