ValueError("URL should be absolute") when crawling https://crawlee.dev/js/api/core/changelog and respecting robots.txt #1499
Description

@ericvg97

Hello, I was trying my crawler on your website (specifically on https://crawlee.dev/js/api/core/changelog) and I encountered this error:

    [crawlee.crawlers._basic._basic_crawler] WARN  Retrying request to https://crawlee.dev/js/api/core/changelog due to: URL should be absolute
      File "python3.12/site-packages/yarl/_url.py", line 628, in _origin
        raise ValueError("URL should be absolute")

This only happens when I set respect_robots_txt_file=True; with it set to False, the crawl doesn't fail. This is my crawler configuration in case it helps:

        crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
            playwright_crawler_specific_kwargs={
                "browser_type": "chromium",
                "headless": True,
            },
            configure_logging=True,
            use_session_pool=True,
            request_handler_timeout=timedelta(seconds=120),
            respect_robots_txt_file=True,
        )

I am not planning to crawl your site ;) I was just using it as an example, but it looks like there may be a bug where the robots.txt check receives a relative URL?
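For what it's worth, the traceback points at yarl's URL origin computation, which refuses relative URLs. A minimal stdlib sketch of the behavior the robots.txt lookup presumably needs (robots_url is a hypothetical helper for illustration, not crawlee's actual code):

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Derive the absolute robots.txt URL for the site hosting page_url.

    Mirrors yarl's behavior: computing an origin from a relative URL
    raises ValueError("URL should be absolute").
    """
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("URL should be absolute")
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url("https://crawlee.dev/js/api/core/changelog"))
# An absolute page URL resolves fine; a bare path like
# "/js/api/core/changelog" would raise the same ValueError seen above.
```

So my guess is that somewhere along the retry/redirect path a relative URL reaches the robots.txt check before being resolved against the page's base URL.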

Labels

bug: Something isn't working.
t-tooling: Issues with this label are in the ownership of the tooling team.
