Description
I have a crawl with ~4,100 seeds (with scopeType: page, so we are just capturing the seeds) that has frequently gotten hung up after encountering timeouts closing the page or browser. Chromium seems to be hung somewhere; ps indicates it is still running, so it hasn't straight-up crashed, but something else is going wrong. The issue doesn't appear to be related to any particular URL as far as I can tell, but the crawl does seem to die after a similar number of page loads each time (~3,100), regardless of the amount of memory or CPU on the machine I'm using.
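For reference, the process check was roughly along these lines; the container ID is a placeholder (from docker ps), and this assumes ps is available inside the image:
# find the running crawler container, then look for Chromium processes inside it
docker ps
docker exec <container-id> ps aux | grep -i chrom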
I was hopeful #779 would fix it, since previous runs included log lines like "message":"New Window Timed Out","details":{"seconds":20000,"workerid":0}}; the absurd seconds value seemed in line with the huge number of retries that was fixed there. But running the crawl again today with v1.5.5, I am still hitting similar issues. The logs are slightly different, though! So maybe #779 fixed one problem only to reveal another, or just shifted how and where it's showing up.
Anyway! The logs are quite long, but this is the tail end of them (2 workers):
{"timestamp":"2025-02-27T05:39:31.584Z","logLevel":"error","context":"general","message":"Custom page load check timed out","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:39:36.590Z","logLevel":"error","context":"general","message":"Timed out getting page title, something is likely wrong","details":{"seconds":5,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:39.963Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed: will retry","details":{"retry":0,"retries":2,"msg":"Navigation timeout of 90000 ms exceeded","url":"https://ehp.niehs.nih.gov/about-ehp/staff","loadState":0,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:40:40.923Z","logLevel":"warn","context":"worker","message":"Page Worker Timeout","details":{"seconds":190,"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
{"timestamp":"2025-02-27T05:40:40.933Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":1,"page":"https://ehp.niehs.nih.gov/loi/cehp"}}
{"timestamp":"2025-02-27T05:40:40.934Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":3118,"total":4132,"pending":2,"failed":0,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":3129,\"started\":\"2025-02-27T05:40:40.933Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/loi\\/cehp\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}","{\"seedId\":3133,\"started\":\"2025-02-27T05:39:09.752Z\",\"extraHops\":0,\"url\":\"https:\\/\\/ehp.niehs.nih.gov\\/about-ehp\\/staff\",\"added\":\"2025-02-27T02:27:46.407Z\",\"depth\":0}"]}}
{"timestamp":"2025-02-27T05:40:49.963Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/about-ehp/staff","workerid":0}}
{"timestamp":"2025-02-27T05:42:10.935Z","logLevel":"error","context":"fetch","message":"Direct fetch of page URL timed out","details":{"seconds":90,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:42:20.947Z","logLevel":"error","context":"worker","message":"Page Close Timed Out","details":{"seconds":10,"page":"https://ehp.niehs.nih.gov/loi/cehp","workerid":1}}
{"timestamp":"2025-02-27T05:44:37.407Z","logLevel":"warn","context":"general","message":"Failed to fetch favicon from browser /json endpoint","details":{"page":"https://ehp.niehs.nih.gov/about-ehp/connect","workerid":1}}
After that, I manually hit Ctrl+C to kill it, and it logged this but still hung:
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"SIGNAL: interrupt request received...","details":{}}
{"timestamp":"2025-02-27T06:03:56.237Z","logLevel":"info","context":"general","message":"Crawler interrupted, gracefully finishing current pages","details":{}}
I gave it another few minutes before sending SIGINT again (via kill this time; a sketch of the command follows the log below) and it finally exited semi-gracefully:
{"timestamp":"2025-02-27T06:09:31.384Z","logLevel":"info","context":"general","message":"SIGINT received...","details":{}}
{"timestamp":"2025-02-27T06:09:31.385Z","logLevel":"info","context":"general","message":"SIGNAL: stopping crawl now...","details":{}}
{"timestamp":"2025-02-27T06:09:31.435Z","logLevel":"info","context":"general","message":"Saving crawl state to: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227060931-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:31.437Z","logLevel":"info","context":"general","message":"Removing old save-state: /crawls/collections/edgi-active-urls--20250227022739--combined/crawls/crawl-20250227053729-36f55b2ad938.yaml","details":{}}
{"timestamp":"2025-02-27T06:09:36.445Z","logLevel":"warn","context":"browser","message":"Closing Browser Timed Out","details":{"seconds":5}}
{"timestamp":"2025-02-27T06:09:36.446Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}
Unfortunately the log level here is INFO; the next time I run this crawl, I'll try including debug logs. (Is the right way to do that via --logLevel 'debug,info,warn,error' or --logging 'debug,stats'? The docs seem a little unclear here…)
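Concretely, the two invocations I'm weighing would look something like this (same crawl command as below, with one flag or the other added; I don't know which is intended):
# option A: add debug to the log-level filter
crawl --config /app/config.yaml --logLevel 'debug,info,warn,error'
# option B: enable the debug logging context
crawl --config /app/config.yaml --logging 'debug,stats'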
If helpful, I am happy to send the full log file as a gist or e-mail or whatever. It’s 11 MB.
The crawl was run via Docker:
docker run \
--rm \
--attach stdout --attach stderr \
--volume "./crawl-config.yaml:/app/config.yaml" \
--volume "./crawls:/crawls/" \
webrecorder/browsertrix-crawler:1.5.5 \
crawl \
--config /app/config.yaml \
--collection "edgi-active-urls--20250227022739--combined" \
--saveState always
With a crawl config file like:
workers: 2
saveStateHistory: 1
scopeType: page
rolloverSize: 8000000000
warcinfo:
  # Extra WARC fields
seeds:
  # Long seed list here
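  # e.g., a flat list of page URLs; these two (taken from the logs above)
  # just show the shape:
  # - https://ehp.niehs.nih.gov/about-ehp/connect
  # - https://ehp.niehs.nih.gov/about-ehp/staff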