I came across a situation where I scrape half of an item's data in the listing page handler and the other half in a handler taking care of the detail page. I think this must be a quite common case. I struggle to see how I pass the information down from one handler to another. See the concrete example below:
```python
import re
import asyncio
from enum import StrEnum, auto

import click
from crawlee.beautifulsoup_crawler import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)
from crawlee.router import Router

LENGTH_RE = re.compile(r"(\d+)\s+min")


class Label(StrEnum):
    DETAIL = auto()


router = Router[BeautifulSoupCrawlingContext]()


@click.command()
def edison():
    asyncio.run(scrape())


async def scrape():
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data("edison.json", dataset_name="edison")


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)


@router.handler(Label.DETAIL)
async def detail_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Scraping {context.request.url}")
    description = context.soup.select_one(".filmy_page .desc3").text
    length_min = LENGTH_RE.search(description).group(1)
    # TODO get starts_at, then calculate ends_at
    await context.push_data(
        {
            "url": context.request.url,
            "title": context.soup.select_one(".filmy_page h1").text.strip(),
            "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
        },
        dataset_name="edison",
    )
```
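As a side note, the arithmetic behind the `ends_at` TODO is straightforward once both pieces of data are in one place; it is plain stdlib `datetime` (the function name and sample values here are illustrative):

```python
from datetime import datetime, timedelta


def calculate_ends_at(starts_at: datetime, length_min: int) -> datetime:
    """Screening end time = start time plus the film's length in minutes."""
    return starts_at + timedelta(minutes=length_min)


# e.g. a screening at 20:30 of a film whose description says "104 min"
starts_at = datetime(2024, 5, 1, 20, 30)
ends_at = calculate_ends_at(starts_at, 104)
print(ends_at)  # 2024-05-01 22:14:00
```

The hard part is not the calculation but getting `starts_at` (known only on the listing page) and the length (known only on the detail page) into the same handler.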
I need to scrape `starts_at` in the `default_handler`, then add more details to the item on the detail page and calculate the `ends_at` time from the length of the film. Even if I changed `enqueue_links` to something more delicate, how do I pass data from one request to another?