Crawler hangs after logging "Page Close Timed Out" #780

Open
Mr0grog opened this issue Feb 27, 2025 · 1 comment

Mr0grog commented Feb 27, 2025

I have a crawl with ~4,100 seeds (with scopeType: page, so we are just capturing the seeds) that has frequently gotten hung up after encountering timeouts while closing the page or browser (it seems like Chromium is hung somewhere; ps shows Chromium is still running, so it hasn't crashed outright, but something else is going wrong). The issues don’t appear to be tied to a particular URL as far as I can tell, but the crawl does seem to die after a similar number of page loads each time (~3,100), regardless of how much memory or CPU the machine has.

I was hopeful #779 would fix it, since previous runs included log lines like "message":"New Window Timed Out","details":{"seconds":20000,"workerid":0}}; the absurd seconds value seemed in line with the huge number of retries that was fixed there. But running the crawl again today with v1.5.5, I am still hitting similar issues. The logs are slightly different, though! So maybe #779 fixed one problem only to reveal another, or just shifted how and where it’s showing up.

Anyway! The logs are quite long, but this is the tail-end of them (2 workers):

{"timestamp":"2025-02-27T05:39:31.584Z","logLevel":"error","context":"general","message":"Custom page load check timed out","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:39:36.590Z","logLevel":"error","context":"general","message":"Timed out getting page title, something is likely wrong","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:39.963Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed: will retry","details":{"retry":0,"retries":2,"msg":"Navigation timeout of 90000 ms exceeded","url":"https://ehp.niehs.nih.gov/about-ehp/staff","loadState":0,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:40:40.923Z","logLevel":"warn","context":"worker","message":"Page Worker Timeout","details":{"seconds":190,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:40.933Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":1,"page":"https://ehp.niehs.nih.gov/loi/cehp"}}
{"timestamp":"2025-02-27T05:40:40.934Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":3118,"total":4132,"pending":2,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":3129,\"started\":\"2025-02-27T05:40:40.933Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/loi\\/cehp\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}","{\"seedId\":3133,\"started\":\"2025-02-27T05:39:09.752Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/about-ehp\\/staff\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}"]}}
{"timestamp":"2025-02-27T05:40:49.963Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:42:10.935Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":90,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:42:20.947Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:44:37.407Z","logLevel":"warn","context":"general","message":"Failed to fetch favicon from browser /json endpoint","details":{"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}

After that, I manually hit Ctrl+C to kill it, and it logged this but still hung:

{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGNAL: interrupt request received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}

I gave it another few minutes before sending SIGINT again (via kill this time) and it finally exited semi-gracefully:

{"timestamp":"2025-02-27T06:09:31.384Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:09:31.385Z","logLevel":"info","context":"general","message":"SIGNAL: stopping crawl now...","details":{}}
{"timestamp":"2025-02-27T06:09:31.435Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227060931-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:31.437Z","logLevel":"info","context":"general","message":"Removing old save-state: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227053729-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:36.445Z","logLevel":"warn","context":"browser","message":"Closing Browser Timed Out","details":{"seconds":5}}
{"timestamp":"2025-02-27T06:09:36.446Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}

Unfortunately the log level here was the default (INFO); the next time I do this crawl, I’ll try including debug logs (is the right way to do this via --logLevel 'debug,info,warn,error' or --logging 'debug,stats'? the docs seem a little unclear here…).

If helpful, I am happy to send the full log file as a gist or e-mail or whatever. It’s 11 MB.

The crawl was run via Docker:

docker run \
    --rm \
    --attach stdout --attach stderr \
    --volume "./crawl-config.yaml:/app/config.yaml" \
    --volume "./crawls:/crawls/" \
    webrecorder/browsertrix-crawler:1.5.5 \
    crawl \
    --config /app/config.yaml \
    --collection "edgi-active-urls--20250227022739--combined" \
    --saveState always

With a crawl config file like:

workers: 2
saveStateHistory: 1
scopeType: page
rolloverSize: 8000000000

warcinfo:
  # Extra WARC fields

seeds:
  # Long seed list here

Mr0grog commented Feb 27, 2025

Just realized I forgot to note this is running on an AWS EC2 Ubuntu x86 box. uname output, if helpful:

Linux <ipaddr> 6.8.0-1021-aws #23-Ubuntu SMP Mon Dec  9 23:59:34 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

> is the right way to do this via --logLevel 'debug,info,warn,error' or --logging 'debug,stats'? the docs seem a little unclear here…

Nevermind, digging through the logging module makes it clear I want --logging debug,stats.
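
So for the next run I’m planning to just append that flag to the same command as above, something like this (not yet tested):

docker run \
    --rm \
    --attach stdout --attach stderr \
    --volume "./crawl-config.yaml:/app/config.yaml" \
    --volume "./crawls:/crawls/" \
    webrecorder/browsertrix-crawler:1.5.5 \
    crawl \
    --config /app/config.yaml \
    --collection "edgi-active-urls--20250227022739--combined" \
    --saveState always \
    --logging "debug,stats"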
