Description
I have a crawl with ~4,100 seeds (with scopeType: page, so we are just capturing the seeds) that has frequently gotten hung up after encountering timeouts closing the page or browser. Chromium seems to be hung somewhere; ps indicates it is still running, so it hasn't straight-up crashed, but something else is going wrong. The issue doesn't appear to be related to any particular URL as far as I can tell, but the crawl does seem to die after a similar number of page loads each time (~3,100), regardless of the amount of memory or CPU on the machine I'm using.
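For reference, the process check was roughly along these lines; the container ID is a placeholder (from docker ps), and this assumes ps is available inside the image:
# find the running crawler container, then look for Chromium processes inside it
docker ps
docker exec <container-id> ps aux | grep -i chrom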
I was hopeful #779 would fix it, since previous runs included log lines like "message":"New Window Timed Out","details":{"seconds":20000,"workerid":0}}; the absurd seconds value seemed in line with the huge number of retries that was fixed there. But running the crawl again today with v1.5.5, I am still hitting similar issues. The logs are slightly different, though! So maybe #779 fixed one problem only to reveal another, or just shifted how and where it's showing up.
Anyway! The logs are quite long, but this is the tail end of them (2 workers):
{"timestamp":"2025-02-27T05:39:31.584Z","logLevel":"error","context":"general","message":"Custom page load check timed out","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:39:36.590Z","logLevel":"error","context":"general","message":"Timed out getting page title, something is likely wrong","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:39.963Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed: will retry","details":{"retry":0,"retries":2,"msg":"Navigation timeout of 90000 ms exceeded","url":"https://ehp.niehs.nih.gov/about-ehp/staff","loadState":0,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:40:40.923Z","logLevel":"warn","context":"worker","message":"Page Worker Timeout","details":{"seconds":190,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:40.933Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":1,"page":"https://ehp.niehs.nih.gov/loi/cehp"}}
{"timestamp":"2025-02-27T05:40:40.934Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":3118,"total":4132,"pending":2,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":3129,\"started\":\"2025-02-27T05:40:40.933Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/loi\\/cehp\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}","{\"seedId\":3133,\"started\":\"2025-02-27T05:39:09.752Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/about-ehp\\/staff\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}"]}}
{"timestamp":"2025-02-27T05:40:49.963Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:42:10.935Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":90,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:42:20.947Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:44:37.407Z","logLevel":"warn","context":"general","message":"Failed to fetch favicon from browser /json endpoint","details":{"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
After that, I manually hit Ctrl+C to kill it, and it logged this but still hung:
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGNAL: interrupt request received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}
I gave it another few minutes before sending SIGINT again (via kill this time; a sketch of the command follows the log below) and it finally exited semi-gracefully:
{"timestamp":"2025-02-27T06:09:31.384Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:09:31.385Z","logLevel":"info","context":"general","message":"SIGNAL: stopping crawl now...","details":{}}
{"timestamp":"2025-02-27T06:09:31.435Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227060931-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:31.437Z","logLevel":"info","context":"general","message":"Removing old save-state: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227053729-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:36.445Z","logLevel":"warn","context":"browser","message":"Closing Browser Timed Out","details":{"seconds":5}}
{"timestamp":"2025-02-27T06:09:36.446Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}
Unfortunately the log level here is INFO; the next time I run this crawl, I'll try including debug logs. (Is the right way to do that via --logLevel 'debug,info,warn,error' or --logging 'debug,stats'? The docs seem a little unclear here…)
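Concretely, the two invocations I'm weighing would look something like this (same crawl command as below, with one flag or the other added; I don't know which is intended):
# option A: add debug to the log-level filter
crawl --config /app/config.yaml --logLevel 'debug,info,warn,error'
# option B: enable the debug logging context
crawl --config /app/config.yaml --logging 'debug,stats'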
If helpful, I am happy to send the full log file as a gist or e-mail or whatever. It’s 11 MB.
The crawl was run via Docker:
docker run \
--rm \
--attach stdout --attach stderr \
--volume "./crawl-config.yaml:/app/config.yaml" \
--volume "./crawls:/crawls/" \
webrecorder/browsertrix-crawler:1.5.5 \
crawl \
--config /app/config.yaml \
--collection "edgi-active-urls--20250227022739--combined" \
--saveState always
With a crawl config file like:
workers: 2
saveStateHistory: 1
scopeType: page
rolloverSize: 8000000000
warcinfo:
  # Extra WARC fields
seeds:
  # Long seed list here
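  # e.g., a flat list of page URLs; these two (taken from the logs above)
  # just show the shape:
  # - https://ehp.niehs.nih.gov/about-ehp/connect
  # - https://ehp.niehs.nih.gov/about-ehp/staff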