sources/academy/webscraping/scraping_basics_python/12_framework.md
### Build a Crawlee scraper of F1 Academy drivers
Scrape information about all [F1 Academy](https://en.wikipedia.org/wiki/F1_Academy) drivers listed on the official [Drivers](https://www.f1academy.com/Racing-Series/Drivers) page. Each item you push to Crawlee's default dataset should include the following data:
- URL of the driver's f1academy.com page
- Name
- Date of birth (as a `date()` object)
- Instagram URL
If you export the dataset as JSON, it should look something like this:
<!-- eslint-skip -->
```json
[
  ...
]
```
Hints:
- Use Python's `datetime.strptime(text, "%d/%m/%Y").date()` to parse dates in the `DD/MM/YYYY` format. Check out the [docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) for more details.
- To locate the Instagram URL, use the attribute selector `a[href*='instagram']`. Learn more about attribute selectors in the [MDN docs](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors).
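The date-parsing hint can be sketched as a quick example. The sample date string below is invented for illustration, not taken from the actual site:

```python
from datetime import date, datetime

def parse_driver_dob(text: str) -> date:
    # Parse a DD/MM/YYYY string, such as one scraped from a driver's profile page
    return datetime.strptime(text, "%d/%m/%Y").date()

print(parse_driver_dob("01/05/2005"))  # 2005-05-01
```

Note that `strptime()` raises `ValueError` on malformed input, so wrap the call in `try`/`except` if the page might contain unexpected values.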
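To see the attribute selector in action, here's a minimal sketch using Beautiful Soup. The HTML snippet is made up for the example; the real markup on f1academy.com will differ:

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a driver's social-links section
html = """
<div class="socials">
  <a href="https://x.com/f1academy">X</a>
  <a href="https://www.instagram.com/f1academy/">Instagram</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# a[href*='instagram'] matches any link whose href contains "instagram"
link = soup.select_one("a[href*='instagram']")
print(link["href"])  # https://www.instagram.com/f1academy/
```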
<details>
<summary>Solution</summary>
</details>
### Use Crawlee to find the ratings of the most popular Netflix films
The [Global Top 10](https://www.netflix.com/tudum/top10) page has a table listing the most popular Netflix films worldwide. Scrape the movie names from this page, then search for each movie on [IMDb](https://www.imdb.com/). Assume the first search result is correct and retrieve the film's rating. Each item you push to Crawlee's default dataset should include the following data:
- URL of the film's imdb.com page
- Title
- Rating
If you export the dataset as JSON, it should look something like this:
<!-- eslint-skip -->
```json
[
  ...
]
```
To scrape IMDb data, you'll need to construct a `Request` object with the appropriate search URL for each movie title. The following code snippet gives you an idea of how to do this:
```py
...
```
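One building block for those requests is URL-encoding the movie title with the standard library. The exact query-string format of IMDb's search endpoint below is an assumption for illustration, not something confirmed by the lesson:

```python
from urllib.parse import quote_plus

def imdb_search_url(title: str) -> str:
    # URL-encode the title so spaces and special characters are safe;
    # the /find/?q= format is an assumption about imdb.com's search endpoint
    return f"https://www.imdb.com/find/?q={quote_plus(title)}"

print(imdb_search_url("The Super Mario Bros. Movie"))
# https://www.imdb.com/find/?q=The+Super+Mario+Bros.+Movie
```

You can then pass each generated URL to a `Request` and add it to the crawler's queue, labeling it so the right handler picks it up.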
When navigating to the first search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, letting you specify the max number of HTTP requests to enqueue.