Implement/document a way to pass information between handlers #524
Comments
You probably want to use the `user_data` argument of `enqueue_links`:

```python
@crawler.router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    context.log.info(f'Processing {context.request.url}')
    extracted_data_in_default_handler = context.soup.title.string
    await context.enqueue_links(
        user_data={'extracted_data_in_default_handler': extracted_data_in_default_handler},
    )
```
That sounds about right. I haven't found much about it in the docs, at least using the built-in search. How do I access it in the other handler? Something like reading it from the context?
It's an attribute of the request, so you should be able to use `context.request.user_data`.
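For example, continuing the snippet above (a rough sketch, assuming the links were enqueued with a `label='detail'`, which the snippet does not show):

```python
@crawler.router.handler('detail')
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    # The user_data dict passed to enqueue_links travels with each enqueued request.
    extracted = context.request.user_data['extracted_data_in_default_handler']
    context.log.info(f'Received from the default handler: {extracted}')
```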
Cool, thanks! This was my main blocker when developing kino over the weekend. I can't promise getting back to this soon as it's just a hobby thing, but I'll assume this is enough info for me to make it work. Feel free to close this unless you want to turn it into a tracking issue for "this needs more examples in the docs".
Great, let us know once you try it. IMO we should add some examples to the docs on this topic, so we can leave this open.
It works, but it has a surprising quirk: it stringifies the values.

```python
import zoneinfo
from collections import defaultdict
from datetime import datetime

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    timetable = defaultdict(set)
    for i in range(5):
        url = f"https://example.com/{i}"
        timetable[url].add(datetime(2024, 11, 2, 15, 30, tzinfo=zoneinfo.ZoneInfo(key='Europe/Prague')))
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data={"timetable": timetable},
        label="detail",
    )

@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    for url, starts_ats in context.request.user_data["timetable"].items():
        print(starts_ats)        # {datetime.datetime(2024, 11, 2, 15, 30, tzinfo=zoneinfo.ZoneInfo(key='Europe/Prague'))}
        print(type(starts_ats))  # <class 'str'>
```

I didn't play with it further, so I don't know what else it stringifies this way, but obviously such data is unusable. If this is expected behavior, I'd have to JSON-encode the values and then JSON-decode them. I could use a `list` instead of a `set`, but that wouldn't help with the datetimes.
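For reference, the manual JSON-encode/decode route mentioned above could look roughly like this; it keeps the structure of the snippet and only swaps the sets of datetimes for lists of ISO strings:

```python
from collections import defaultdict
from datetime import datetime

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    timetable = defaultdict(set)
    ...  # filled with datetime objects, as in the snippet above
    # Lists of ISO strings are plain JSON values, so they survive the queue untouched.
    encodable = {url: sorted(dt.isoformat() for dt in starts) for url, starts in timetable.items()}
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data={"timetable": encodable},
        label="detail",
    )

@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    # Parse the strings back into datetime objects on the receiving side.
    timetable = context.request.user_data["timetable"]
    starts_ats = [datetime.fromisoformat(s) for s in timetable[context.request.url]]
```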
As a separate comment, I also want to add that I noticed there is nothing more delicate than `enqueue_links`. What I had in mind was something like this:

```python
# JUST PSEUDOCODE
for item in soup.select(".item"):
    time = datetime.fromisoformat(item.select_one(".screening-time").text)
    link = item.select_one(".movie-link")
    context.enqueue_link(link, user_data={"time": time}, label="detail")
```

This is the approach I've been used to all my life when creating scrapers, both with one-off scripts without any framework and with Scrapy, and I don't believe I'm alone. As far as I understand, there's currently no way to do that in Crawlee. I'm forced to scrape the whole timetable and then pass it down to the detail handler. In the detail handler, I (want to, if not for the bug above) look up the movie in the timetable by its URL and get the screening times. This feels unnatural. I'm not sure whether I've been doing it wrong all along and the way Crawlee does it is preferred for some good reasons, or whether it's just bad UX (actually DX). Because...

Getting the timetable first and then pairing the movies back takes care of duplicate requests and forces me to think about the situation when the same movie is screened multiple times. Crawlee forced me into an unnatural architecture for my scraper, resulting in a better algorithm! Should I ask for something like `enqueue_link`, or is this intentional?
I'm not sure I understand the whole thought process, but the example you posted is not far from being correct. You could just do this:

```python
for item in soup.select(".item"):
    time = datetime.fromisoformat(item.select_one(".screening-time").text)
    link = item.select_one(".movie-link")
    await context.add_requests([Request.from_url(link["href"], user_data={"time": time}, label="detail")])
```

Or even better, you could gather the links in a list and then pass the whole list of `Request` objects to a single `add_requests` call (see the sketch below).

Did I understand the question right?
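A rough sketch of that batched variant, inside the same handler; the selectors are the illustrative ones from above, and the time is kept as a plain string so the `user_data` stays JSON-serializable:

```python
from crawlee import Request

requests = []
for item in soup.select(".item"):
    time = item.select_one(".screening-time").text  # kept as a plain string
    link = item.select_one(".movie-link")["href"]   # assuming absolute URLs
    requests.append(Request.from_url(link, user_data={"time": time}, label="detail"))

# A single call adds the whole batch to the request queue.
await context.add_requests(requests)
```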
I think you did! I completely overlooked the existence of `add_requests`.
This is the relevant documentation: https://crawlee.dev/python/docs/guides/request-storage#request-related-helpers. Feel free to suggest a better place for it, or point out a place where a link to this page would be useful.
The short answer is that you need to put JSON-serializable data into `user_data`.
I think that guide goes through it well; I must have overlooked it when I went through the intro.

Uff, so I'll need to do the heavy lifting myself. Getting an exception is definitely better than just being surprised by the result, but sending only JSON-serializable data is very limiting, especially when working with dates or, e.g., money (decimals). I can use Pydantic or something to help me with serialization and deserialization, but it feels strange that I have to do it just to pass a dict from one function to another.
My workaround for now:

```python
from datetime import datetime, timedelta

from pydantic import RootModel

TimeTable = RootModel[dict[str, set[datetime]]]

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    ...
    timetable_json = TimeTable(timetable).model_dump_json()
    await context.enqueue_links(
        selector=".program_table .name a",
        user_data=dict(timetable=timetable_json),
        label="detail",
    )

@router.handler("detail")
async def detail_handler(context: BeautifulSoupCrawlingContext):
    ...
    timetable_json = context.request.user_data["timetable"]
    timetable = TimeTable.model_validate_json(timetable_json).model_dump()
    for starts_at in timetable[context.request.url]:
        ends_at = starts_at + timedelta(minutes=length_min)  # length_min comes from the detail page
        await context.push_data(...)
```

I couldn't come up with anything more beautiful.
That part of the docs is also pretty new, so that might have played a part as well...
The reason for this requirement is that the request queue needs to be able to handle millions of items, and the local implementation uses the filesystem for that. If you deploy to Apify, the request will be sent as a JSON payload. So there's JSON serialization involved every time - this is no artificial restriction 🙂
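To illustrate the constraint with nothing but the standard library (nothing Crawlee-specific here):

```python
import json
from datetime import datetime

starts_at = datetime(2024, 11, 2, 15, 30)

# A raw datetime cannot be written to the queue as JSON:
# json.dumps({'starts_at': starts_at})  ->  TypeError: Object of type datetime is not JSON serializable

# An ISO string survives the round trip just fine:
json.dumps({'starts_at': starts_at.isoformat()})  # '{"starts_at": "2024-11-02T15:30:00"}'
```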
Honestly, I think this is fine. You don't need to use…
I was able to get Crawlee working for the use case, and I learned a lot about parts I didn't know about. How surprising, necessary, or convenient that serialization is, I'll leave up to you and perhaps to aggregated feedback from more users than just me. I'll attach my two cents below, but I don't want to get stuck discussing this further, because I myself think this is just a small ergonomic annoyance and we both probably have better things to do. You work on a framework where this is just a tiny part, and I have some scrapers to finish 😄

**One user's POV on serialization of `user_data`**

I understand the limitation isn't arbitrary but conditioned by the architecture. However, as someone who wants to focus on writing scrapers, it has implications for my DX.
I'm only trying to provide a mere user's POV on the matter. I can live just fine with Pydantic and model dumps, the same way I can live with other things that annoy me when using Scrapy. It's just that I can see this framework is being built right now, and I care about it, so I'm keen to provide this kind of feedback. I don't have these discussions at the Scrapy repo - not because I love how it's done there, but because I don't care that much.
This feels quite unlikely - I doubt they store everything in memory, which is the only way to do what you want. Or maybe you lose the values if they no longer fit into memory. BTW, this is not just about the use case of millions of items not fitting into memory; it's also about being able to continue a failed/stopped run, or the infamous migrations on the Apify platform.
Well, they could be using something like `pickle`.
- resolves #524

This adds validation to `Request.user_data` so that the user cannot pass in data that is not JSON-serializable.
BTW, wouldn't `model_dump()` be enough here?
I don't think so. I haven't tested it, but I thought that just turns the model into a dictionary. It doesn't solve the problem with datetimes, for example, does it? I mean, if my data contains stuff like sets, dictionaries, or decimals, I never needed Pydantic in my code for that. I used it here only so I don't have to write my own serialization and deserialization logic.
You're right, I mixed up its behavior with something else.

Gotcha. If you have any ideas for a better API for user data handling, we're all ears. I'm very reluctant towards the likes of `pickle`, though.
I don't think there's anything else than pickle or Pydantic - I mean, widely used in the ecosystem, not something experimental or with marginal popularity. Pickle is native but has issues; Pydantic would be a dependency. Maybe a safe subset of pickle? Maybe dataclasses?

If Crawlee accepted dataclasses, though, it would have to take care of the serializing and deserializing logic itself, basically duplicating at least some of Pydantic's job. So it might be simpler to just use Pydantic for it and happily accept both dataclasses and Pydantic models in the argument, if users are keen to pass them:
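Roughly like this, just as a sketch of the idea with plain Pydantic (`Screening` is a made-up example, and none of this is an existing Crawlee API):

```python
from dataclasses import dataclass
from datetime import datetime

from pydantic import TypeAdapter

@dataclass
class Screening:
    movie_url: str
    starts_at: datetime

adapter = TypeAdapter(Screening)
screening = Screening('https://example.com/movie/1', datetime(2024, 11, 2, 15, 30))

# What the framework could store in the request queue: JSON with the datetime as an ISO string.
payload = adapter.dump_json(screening)

# What the handler could get back: the original dataclass instance.
assert adapter.validate_json(payload) == screening
```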
That way, unless I pass something convoluted with non-serializable types, I shouldn't need to know there's any serialization going on at all. I send what I have, and if that fails, the framework just asks me to send it as a dataclass, which is in the standard library. If I'm fancy and keen to read the docs, I can learn that a Pydantic model would do as well.
I came across a situation where I scrape half of an item's data in the listing page handler and the other half in a handler taking care of the detail page. I think this must be quite a common case. I struggle to see how to pass the information down from one handler to another. See the concrete example below.

I need to scrape `starts_at` at the `default_handler`, then add more details to the item on the detail page and calculate the `ends_at` time according to the length of the film. Even if I changed `enqueue_links` to something more delicate, how do I pass data from one request to another?
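For reference, a sketch that pulls together the suggestions from the thread above for this exact scenario. The selectors, the `detail` label, and the film length are illustrative placeholders, the scraped time is assumed to be an ISO string, and the import paths may differ between Crawlee versions:

```python
from datetime import datetime, timedelta

from crawlee import Request
from crawlee.crawlers import BeautifulSoupCrawlingContext  # older versions: crawlee.beautifulsoup_crawler
from crawlee.router import Router

router = Router[BeautifulSoupCrawlingContext]()

@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
    for item in context.soup.select('.item'):                 # illustrative selector
        starts_at = item.select_one('.screening-time').text   # assumed to be an ISO-formatted string
        link = item.select_one('.movie-link')['href']
        await context.add_requests([
            Request.from_url(link, user_data={'starts_at': starts_at}, label='detail'),
        ])

@router.handler('detail')
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    starts_at = datetime.fromisoformat(context.request.user_data['starts_at'])
    length_min = 120  # placeholder for the film length scraped from the detail page
    await context.push_data({
        'url': context.request.url,
        'starts_at': starts_at.isoformat(),
        'ends_at': (starts_at + timedelta(minutes=length_min)).isoformat(),
    })
```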