I have a crawl with ~4,100 seeds (with scopeType: page, so we are just capturing the seeds) that has frequently gotten hung up after encountering timeouts closing the page or browser. It seems like Chromium is hung somewhere; ps indicates Chromium is still running, so it hasn't straight-up crashed, but something else is going wrong. The issues don’t appear to be related to any particular URL as far as I can tell, but the crawl does seem to die after a similar number of page loads each time (~3,100), regardless of the amount of memory or CPU on the machine I’m using.
I was hopeful #779 would fix it, since previous runs included log lines like "message":"New Window Timed Out","details":{"seconds":20000,"workerid":0}}; the absurd seconds value seemed in line with the huge number of retries that was fixed there. But running the crawl again today with v1.5.5, I am still hitting similar issues. The logs are slightly different, though! So maybe #779 fixed one problem only to reveal another, or just shifted how and where it’s showing up.
Anyway! The logs are quite long, but this is the tail-end of them (2 workers):
{"timestamp":"2025-02-27T05:39:31.584Z","logLevel":"error","context":"general","message":"Custom page load check timed out","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:39:36.590Z","logLevel":"error","context":"general","message":"Timed out getting page title, something is likely wrong","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:39.963Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed: will retry","details":{"retry":0,"retries":2,"msg":"Navigation timeout of 90000 ms exceeded","url":"https://ehp.niehs.nih.gov/about-ehp/staff","loadState":0,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:40:40.923Z","logLevel":"warn","context":"worker","message":"Page Worker Timeout","details":{"seconds":190,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:40.933Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":1,"page":"https://ehp.niehs.nih.gov/loi/cehp"}}
{"timestamp":"2025-02-27T05:40:40.934Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":3118,"total":4132,"pending":2,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":3129,\"started\":\"2025-02-27T05:40:40.933Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/loi\\/cehp\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}","{\"seedId\":3133,\"started\":\"2025-02-27T05:39:09.752Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/about-ehp\\/staff\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}"]}}
{"timestamp":"2025-02-27T05:40:49.963Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:42:10.935Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":90,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:42:20.947Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:44:37.407Z","logLevel":"warn","context":"general","message":"Failed to fetch favicon from browser /json endpoint","details":{"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
After that, I manually hit ctrl+c to kill it, and it logged this but still hung:
I gave it another few minutes before sending SIGINT again (via kill this time) and it finally exited semi-gracefully:
{"timestamp":"2025-02-27T06:09:31.384Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:09:31.385Z","logLevel":"info","context":"general","message":"SIGNAL: stopping crawl now...","details":{}}
{"timestamp":"2025-02-27T06:09:31.435Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227060931-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:31.437Z","logLevel":"info","context":"general","message":"Removing old save-state: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227053729-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:36.445Z","logLevel":"warn","context":"browser","message":"Closing Browser Timed Out","details":{"seconds":5}}
{"timestamp":"2025-02-27T06:09:36.446Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}
Unfortunately the log level here is INFO; the next time I do this crawl, I’ll try including debug logs (is the right way to do this via --logLevel 'debug,info,warn,error' or --logging 'debug,stats'? the docs seem a little unclear here…).
If helpful, I am happy to send the full log file as a gist or e-mail or whatever. It’s 11 MB.
The crawl was run via Docker:
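Something along these lines (the mount paths and image tag below are placeholders rather than the literal command I ran):

```sh
# Approximate invocation; paths and image tag are illustrative
docker run \
  -v "$PWD/crawl-config.yaml:/app/crawl-config.yaml" \
  -v "$PWD/crawls:/crawls/" \
  webrecorder/browsertrix-crawler:1.5.5 \
  crawl --config /app/crawl-config.yaml
```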
With a crawl config file like:
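Trimmed to its shape (the real file lists all ~4,100 seed URLs; the three shown here are ones that appear in the logs above):

```yaml
# Shape of the crawl config; seed list heavily truncated
workers: 2
scopeType: page
seeds:
  - https://ehp.niehs.nih.gov/about-ehp/connect
  - https://ehp.niehs.nih.gov/about-ehp/staff
  - https://ehp.niehs.nih.gov/loi/cehp
  # ... ~4,100 seeds in total
```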