Implement/document a way to pass information between handlers #524

Closed
@honzajavorek

Description

I came across a situation where I scrape half of an item's data in the listing page handler and the other half in the handler that takes care of the detail page. I think this must be quite a common case. I struggle to see how to pass the information from one handler to the other. See the concrete example below:

import re
import asyncio
from enum import StrEnum, auto

import click
from crawlee.beautifulsoup_crawler import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)
from crawlee.router import Router


LENGTH_RE = re.compile(r"(\d+)\s+min")


class Label(StrEnum):
    DETAIL = auto()


router = Router[BeautifulSoupCrawlingContext]()


@click.command()
def edison():
    asyncio.run(scrape())


async def scrape():
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data("edison.json", dataset_name="edison")


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)


@router.handler(Label.DETAIL)
async def detail_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Scraping {context.request.url}")

    description = context.soup.select_one(".filmy_page .desc3").text
    length_min = LENGTH_RE.search(description).group(1)
    # TODO get starts_at, then calculate ends_at

    await context.push_data(
        {
            "url": context.request.url,
            "title": context.soup.select_one(".filmy_page h1").text.strip(),
            "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
        },
        dataset_name="edison",
    )

I need to scrape starts_at in the default_handler, then add more details to the item on the detail page and calculate the ends_at time from the length of the film. Even if I replaced enqueue_links with something more fine-grained, how do I pass data from one request to another?
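
To make the question concrete, here is a minimal, library-free sketch of the pattern being asked for: the listing handler attaches what it already knows to each enqueued request, and the detail handler merges it with what it scrapes. The `Request` dataclass, its `user_data` field, the two handler functions, and the example URL are all hypothetical stand-ins written for illustration, not Crawlee's API. Crawlee's own `Request` is understood to carry a similar `user_data` mapping, but that is an assumption to check against the installed version.

```python
import re
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta

LENGTH_RE = re.compile(r"(\d+)\s+min")


@dataclass
class Request:
    """Hypothetical stand-in for a crawler request (not Crawlee's class)."""

    url: str
    label: str = ""
    # Arbitrary data that travels with the request from handler to handler.
    user_data: dict = field(default_factory=dict)


def listing_handler(rows: list[dict], queue: deque) -> None:
    """Enqueue one detail request per listing row, carrying starts_at along."""
    for row in rows:
        queue.append(
            Request(
                url=row["url"],
                label="detail",
                user_data={"starts_at": row["starts_at"]},
            )
        )


def detail_handler(request: Request, description: str) -> dict:
    """Combine data carried on the request with data from the detail page."""
    starts_at = datetime.fromisoformat(request.user_data["starts_at"])
    match = LENGTH_RE.search(description)
    # The film length may be missing; in that case ends_at stays unknown.
    length_min = int(match.group(1)) if match else None
    ends_at = None
    if length_min is not None:
        ends_at = starts_at + timedelta(minutes=length_min)
    return {
        "url": request.url,
        "starts_at": starts_at.isoformat(),
        "length_min": length_min,
        "ends_at": ends_at.isoformat() if ends_at else None,
    }
```

In this sketch a plain `deque` plays the role of the request queue: calling `listing_handler(rows, queue)` and then `detail_handler(queue.popleft(), description)` yields one item that holds both halves of the data. Keeping the carried data as simple values such as ISO 8601 strings is a deliberate choice, on the assumption that a real request queue has to serialize it.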

Assignees: none

Labels: documentation (improvements or additions to documentation), t-tooling (issues with this label are in the ownership of the tooling team)
