
Crawler hangs after logging "Page Close Timed Out" #780

Closed
@Mr0grog

Description

I have a crawl with ~4,100 seeds (with scopeType: page, so we are just capturing the seeds) that has frequently gotten hung up after encountering timeouts closing the page or browser. It seems like Chromium is hung somewhere; ps indicates Chromium is still running, so it hasn't straight-up crashed, but something else is going wrong. The issue doesn't appear to be related to any particular URL as far as I can tell, but the crawl does seem to die after a similar number of page loads each time (~3,100), regardless of the amount of memory or CPU on the machine I'm using.
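
(For what it's worth, the "still running" check is nothing fancy, just ps inside the container, roughly like this, where CONTAINER_ID is whatever docker ps reports for the crawler:)

docker exec CONTAINER_ID ps aux | grep -i chrom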

I was hopeful #779 would fix it, since previous runs included log lines like "message":"New Window Timed Out","details":{"seconds":20000,"workerid":0}}; the absurd seconds value seemed in line with the huge number of retries that was fixed there. But running the crawl again today with v1.5.5, I am still hitting similar issues. The logs are slightly different, though! So maybe #779 fixed one problem only to reveal another, or just shifted how and where it’s showing up.

Anyway! The logs are quite long, but this is the tail-end of them (2 workers):

{"timestamp":"2025-02-27T05:39:31.584Z","logLevel":"error","context":"general","message":"Custom page load check timed out","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:39:36.590Z","logLevel":"error","context":"general","message":"Timed out getting page title, something is likely wrong","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:39.963Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed: will retry","details":{"retry":0,"retries":2,"msg":"Navigation timeout of 90000 ms exceeded","url":"https://ehp.niehs.nih.gov/about-ehp/staff","loadState":0,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:40:40.923Z","logLevel":"warn","context":"worker","message":"Page Worker Timeout","details":{"seconds":190,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:40.933Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":1,"page":"https://ehp.niehs.nih.gov/loi/cehp"}}
{"timestamp":"2025-02-27T05:40:40.934Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":3118,"total":4132,"pending":2,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":3129,\"started\":\"2025-02-27T05:40:40.933Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/loi\\/cehp\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}","{\"seedId\":3133,\"started\":\"2025-02-27T05:39:09.752Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/about-ehp\\/staff\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}"]}}
{"timestamp":"2025-02-27T05:40:49.963Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:42:10.935Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":90,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:42:20.947Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:44:37.407Z","logLevel":"warn","context":"general","message":"Failed to fetch favicon from browser /json endpoint","details":{"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}

After that, I manually hit ctrl+c to kill it, and it logged this but still hung:

{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGNAL: interrupt request received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}

I gave it another few minutes before sending SIGINT again (via kill this time) and it finally exited semi-gracefully:

{"timestamp":"2025-02-27T06:09:31.384Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:09:31.385Z","logLevel":"info","context":"general","message":"SIGNAL: stopping crawl now...","details":{}}
{"timestamp":"2025-02-27T06:09:31.435Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227060931-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:31.437Z","logLevel":"info","context":"general","message":"Removing old save-state: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227053729-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:36.445Z","logLevel":"warn","context":"browser","message":"Closing Browser Timed Out","details":{"seconds":5}}
{"timestamp":"2025-02-27T06:09:36.446Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}

Unfortunately the log level here is INFO; the next time I do this crawl, I'll try including debug logs. (Is the right way to do that via --logLevel 'debug,info,warn,error' or --logging 'debug,stats'? The docs seem a little unclear here…)
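
For concreteness, the two spellings I'm debating would look something like this, appended to the crawl command shown further down (I'm not sure either is right):

crawl ... --logLevel 'debug,info,warn,error'

versus:

crawl ... --logging 'debug,stats'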

If helpful, I am happy to send the full log file as a gist or e-mail or whatever. It’s 11 MB.

The crawl was run via Docker:

docker run \
    --rm \
    --attach stdout --attach stderr \
    --volume "./crawl-config.yaml:/app/config.yaml" \
    --volume "./crawls:/crawls/" \
    webrecorder/browsertrix-crawler:1.5.5 \
    crawl \
    --config /app/config.yaml \
    --collection "edgi-active-urls--20250227022739--combined" \
    --saveState always

With a crawl config file like:

workers: 2
saveStateHistory: 1
scopeType: page
rolloverSize: 8000000000

warcinfo:
  # Extra WARC fields

seeds:
  # Long seed list here
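
(The actual seed list is elided above; shape-wise it's just a flat list of URLs, one per line, something like this with placeholder URLs standing in for the real ~4,100 entries:)

seeds:
  - https://example.gov/page-1
  - https://example.gov/page-2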
