"Invalid Page - not a valid URL" on particular website #770

Open
nickjanssen opened this issue Feb 13, 2025 · 3 comments · May be fixed by #775

@nickjanssen

Hello, I'm getting

│ {"timestamp":"2025-02-13T10:32:17.911Z","logLevel":"warn","context":"general","message":"Invalid Page - not a valid URL","details":{"url":"[","page":"https://www.exlayer.jp/","workerid":0}}

when trying to crawl https://www.exlayer.jp. Other sites work fine. Any ideas?
The crawler somehow thinks there is a URL "[" on the page.

@ikreymer
Member

Can confirm I'm able to repro this, very odd.

@ikreymer
Member

This is an odd one indeed, and it appears to be a bug...
It looks like JSON.stringify double-encodes its input on this page, perhaps tied to the Shift-JIS encoding?

On that page:

> JSON.stringify(["https://example.com"])
'"[\\"https://example.com\\"]"'

Normally you would get that by calling JSON.stringify(JSON.stringify(["https://example.com"]))

Correct behavior:

> JSON.stringify(["https://example.com"])
'["https://example.com"]'

Puppeteer uses JSON.stringify under the hood, though I think we can work around that...

@ikreymer
Member

ikreymer commented Feb 21, 2025

Ahh, it's because this page overrides Object.toJSON...

It looks like it's a known issue / unfixed in puppeteer:
puppeteer/puppeteer#4334 :/
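
For anyone who wants to see the effect outside the affected page, here is a minimal sketch. It assumes the override behaves like an old-style `Array.prototype.toJSON` that returns already-serialized text; the actual script on the site may differ, and `stringifyWithLegacyToJSON` is a hypothetical helper, not part of any library.

```javascript
// Hypothetical reproduction: temporarily install an Array.prototype.toJSON
// that returns a JSON *string* instead of a plain value, as some legacy
// libraries did. JSON.stringify then serializes that string a second time.
function stringifyWithLegacyToJSON(value) {
  Object.defineProperty(Array.prototype, "toJSON", {
    configurable: true,
    writable: true,
    enumerable: false,
    value: function () {
      // Build the JSON text by hand, element by element.
      return "[" + Array.from(this, (item) => JSON.stringify(item)).join(",") + "]";
    },
  });
  try {
    return JSON.stringify(value);
  } finally {
    // Always restore the native behavior.
    delete Array.prototype.toJSON;
  }
}

const urls = ["https://example.com"];
const expected = JSON.stringify(urls);           // native result
const doubled = stringifyWithLegacyToJSON(urls); // double-encoded result
const parsedOnce = JSON.parse(doubled);          // a string, not an array

console.log(expected);
console.log(doubled);
console.log(typeof parsedOnce, parsedOnce[0]);
```

Parsing the double-encoded value once yields a string whose first character is `[`. That would be consistent with the `"url":"["` in the log above, assuming the receiving side indexes or iterates the parsed value as if it were an array of URLs.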

ikreymer added a commit that referenced this issue Feb 21, 2025
…ctly to avoid issues with custom toJSON overrides:

- add Runtime.addBinding for each exposed function, handle in one place with Runtime.bindingCalled
- convert binding names to BxFunctionBindings enum
- update to browsertrix-behaviors 0.7.1 to avoid waiting for return value
- fixes #770
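
The binding approach listed above can be sketched roughly as follows. This is not the code from the commit: `exposeBindings`, the binding name `__demo_addLink`, and the in-memory `fakeSession` are hypothetical and only for illustration. The sketch assumes a CDP session object with `send(method, params)` and `on(event, handler)`, which is the shape Puppeteer's `CDPSession` provides, and uses the CDP `Runtime.addBinding` method and `Runtime.bindingCalled` event.

```javascript
// Register one Runtime.addBinding per exposed function and route every
// call through a single Runtime.bindingCalled listener. The payload arrives
// as a plain string, and nothing is returned to the page.
function exposeBindings(cdp, handlers) {
  // Attach the listener first so no early call is missed.
  cdp.on("Runtime.bindingCalled", (event) => {
    if (Object.prototype.hasOwnProperty.call(handlers, event.name)) {
      handlers[event.name](event.payload);
    }
  });
  return Promise.all(
    Object.keys(handlers).map((name) => cdp.send("Runtime.addBinding", { name }))
  );
}

// In-memory stand-in for a CDP session so the sketch runs without a browser.
const fakeSession = {
  sent: [],
  listeners: {},
  send(method, params) {
    this.sent.push({ method, params });
    return Promise.resolve({});
  },
  on(event, handler) {
    this.listeners[event] = handler;
  },
};

const seenLinks = [];
exposeBindings(fakeSession, {
  // Hypothetical binding name; the page would call window.__demo_addLink(url).
  __demo_addLink: (payload) => seenLinks.push(payload),
});

// Simulate the browser reporting two binding calls, one of them unknown.
fakeSession.listeners["Runtime.bindingCalled"]({
  name: "__demo_addLink",
  payload: "https://example.com/",
});
fakeSession.listeners["Runtime.bindingCalled"]({
  name: "__demo_unknown",
  payload: "ignored",
});

console.log(fakeSession.sent, seenLinks);
```

Because the page passes a single string to the binding and does not wait for a result, this path does not need to serialize arguments through a helper that may call a page-overridden `JSON.stringify`.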