The BeautifulSoup crawler provides handlers with the `context.soup` attribute, which contains the parsed HTML of the handled page. This is the same `soup` object we used in our previous program. Let's locate and extract the same data as before:
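As a rough sketch, assuming the `.product-meta__title` and `.product-meta__vendor` selectors and the item fields carried over from the earlier lessons, the start of the detail handler could look like this:

```py
@crawler.router.handler("DETAIL")
async def handle_detail(context):
    # the parsed page is available as context.soup, just like our old `soup` object
    item = {
        "url": context.request.url,
        "title": context.soup.select_one(".product-meta__title").text.strip(),
        "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
    }
```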
Now for the price. We're not doing anything new here—just import `Decimal` and copy-paste the code from our old scraper.
The only change will be in the selector. In `main.py`, we looked for `.price` within a `product_soup` object representing a product card. Now, we're looking for `.price` within the entire product detail page. It's better to be more specific so we don't accidentally match another price on the same page:
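As a sketch, assuming the detail page wraps its price in a `.product-form__info-content` block (the exact wrapper class is an assumption based on the sample Warehouse store), the extraction could look like this:

```py
price_text = (
    context.soup
    .select_one(".product-form__info-content .price")
    .contents[-1]  # skip the "Sale price" label, keep the text node with the amount
    .strip()
    .replace("$", "")
    .replace(",", "")
)
price = Decimal(price_text)
```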
Finally, the variants. We can reuse the `parse_variant()` function as-is, and in the handler we'll again take inspiration from what we had in `main.py`. The full program will look like this:
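As a sketch of the relevant part, assuming `parse_variant()` is copied over unchanged from `main.py` and the elided pieces are the price and item code shown above, the detail handler's variant handling could look like this:

```py
@crawler.router.handler("DETAIL")
async def handle_detail(context):
    price_text = (
        ...  # price extraction as above
    )
    item = {
        ...  # url, title, price, and the other fields as in main.py
    }
    # one item per variant, or a single item if the product has no variants
    if variants := context.soup.select(".product-form__option.no-js option"):
        for variant in variants:
            print(item | parse_variant(variant))
    else:
        print(item)
```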
## Saving data
While we're at _letting the framework take care of everything else_, let's take a look at what it can do about saving data. As of now, the product detail page handler prints each item as soon as it's ready. Instead, we can push the item to Crawlee's default dataset:
```py
async def main():
    ...

    @crawler.router.handler("DETAIL")
    async def handle_detail(context):
        price_text = (
            ...
        )
        item = {
            ...
        }
        if variants := context.soup.select(".product-form__option.no-js option"):
            for variant in variants:
                # as before, emit one item per variant, now by pushing it to the dataset
                await context.push_data(item | parse_variant(variant))
        else:
            await context.push_data(item)
```
That's it! If you run the program now, there should be a `storage` directory alongside the `newmain.py` file. Crawlee uses it to store its internal state. If you go to the `storage/datasets/default` subdirectory, you'll see over 30 JSON files, each representing a single item.
We can also export all the items to a single file of our choice. We'll do it at the end of the `main()` function, after the crawler has finished scraping:
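A sketch of what that could look like, assuming the `export_data_json()` and `export_data_csv()` helpers available in recent versions of Crawlee for Python (the method names and keyword arguments are assumptions, so check the Crawlee docs for your version):

```py
async def main():
    ...

    # the listing URL used throughout the course
    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
    # export everything from the default dataset to files in the current directory
    await crawler.export_data_json(path="dataset.json", ensure_ascii=False, indent=2)
    await crawler.export_data_csv(path="dataset.csv")
```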
After running the scraper again, there should be two new files in your directory, `dataset.json` and `dataset.csv`, containing all the data. If you peek into the JSON file, it should have indentation.
## Logging
While Crawlee gives us statistics about HTTP requests and concurrency, we otherwise don't have much visibility into pages we're crawling or items we're saving. Let's add custom logging where we see fit given our use case:
```py
import asyncio
from decimal import Decimal
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def handle_listing(context):
        # highlight-next-line
        context.log.info("Looking for product detail pages")
        ...

    # the remaining handlers, crawler.run(), and the export calls stay the same,
    # with similar log messages added wherever they seem useful
```
Depending on what we find useful, we can add more or less information to the logs. The `context.log` or `crawler.log` objects are [standard Python loggers](https://docs.python.org/3/library/logging.html).
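Because they are regular loggers, all the usual methods and configuration options are available. For example, a hypothetical addition (not part of the original program) could warn about products where no price was found:

```py
@crawler.router.handler("DETAIL")
async def handle_detail(context):
    if context.soup.select_one(".price") is None:
        # standard logger API: warning(), error(), exception(), setLevel(), ...
        context.log.warning(f"No price found: {context.request.url}")
    ...
```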
Even after adding extensive logging, we've been able to shave off at least 20 lines of code compared with the original program. Over the course of this lesson, we've added more and more features to match the functionality of our old scraper, yet despite that, the new code still has a clear structure and remains readable. And we could focus on what's specific to the website we're scraping and the data we're interested in, while the framework took care of the rest.