
Commit baca447

docs(academy-advanced-crawling): comit my unfinished first articles
1 parent db7ff1c commit baca447

23 files changed, +18,019 -13 lines changed

content/academy/advanced_web_scraping.md

+14 -4
@@ -1,6 +1,6 @@
---
title: Advanced web scraping
-description: Take your scrapers to the next level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
+description: Take your scrapers to a production-ready level by learning various advanced concepts and techniques that will help you build highly scalable and reliable crawlers.
menuWeight: 6
category: web scraping & automation
paths:
@@ -9,11 +9,21 @@ paths:

# Advanced web scraping

-In this course, we'll be tackling some of the most challenging and advanced web-scraping cases, such as mobile-app scraping, scraping sites with limited pagination, and handling large-scale cases where millions of items are scraped. Are **you** ready to take your scrapers to the next level?
+In the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, we learned the necessary basics required to create a scraper. In the following courses, we enhanced our scraping toolbox by scraping APIs, using browsers, scraping dynamic websites, understanding website anti-scraping protections and making our code more maintainable by moving to TypeScript.

-If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎
+In this course, we will take all of that knowledge, add a few more advanced concepts, and apply it to learn how to build a production-ready web scraper.
+
+## [](#what-does-production-ready-mean) What does production-ready mean?
+
+Of course, there is no single, universally accepted definition of a production-ready system. Different companies and use cases will place different priorities on a project. But in general, a production-ready system is stable, reliable, scalable, performant, observable and maintainable.

-<!-- Just like the [**Web scraping for beginners**]({{@link web_scraping_for_beginners.md}}) course, this course is divided into two main sections: **Data collection** and **Crawling**. -->
+The following sections will cover the core concepts that will ensure your scraper is production-ready:
+
+- The advanced crawling section covers how to ensure we find all pages or products on a website.
+- The advanced data extraction section covers how to efficiently extract data from a particular page or API.
+
+Both of these sections will include guides for monitoring, performance, anti-scraping protections and debugging.
+
+If you've managed to follow along with all of the courses prior to this one, then you're more than ready to take these upcoming lessons on 😎

## [](#first-up) First up

@@ -0,0 +1,59 @@
---
title: Crawling sitemaps
description: Learn how to find, download and parse sitemaps to discover all of a website's pages. See code examples for setting up your scraper.
menuWeight: 2
paths:
    - advanced-web-scraping/crawling/crawling-sitemaps
---

In the previous lesson, we learned about the utility (and dangers) of crawling sitemaps. In this lesson, we will take an in-depth look at how to crawl them.

We will look at the following topics:

- How to find sitemap URLs
- How to set up HTTP requests to download sitemaps
- How to parse URLs from sitemaps

## [](#how-to-find-sitemap-urls) How to find sitemap URLs

Sitemaps are commonly limited to a maximum of 50,000 URLs each, so larger websites will usually have a whole list of them. There can be a master sitemap index containing the URLs of all the other sitemaps, or the sitemaps might simply be listed in robots.txt and/or have auto-incremented URLs like `/sitemap1.xml`, `/sitemap2.xml`, etc.
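
For illustration, a minimal sitemap index that points to other sitemaps looks something like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <!-- Each <sitemap> entry points to one child sitemap with up to 50,000 URLs. -->
    <sitemap>
        <loc>https://example.com/sitemap1.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://example.com/sitemap2.xml</loc>
    </sitemap>
</sitemapindex>
```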
### [](#google) Google

You can try your luck on Google by searching for `site:example.com sitemap.xml` or `site:example.com sitemap.xml.gz` and seeing if you get any results. If you do, you can try to download the sitemap and check whether it contains any useful URLs. The success of this approach depends on whether the website tells Google to index the sitemap file itself, which is rather uncommon.
### [](#robots-txt) robots.txt

If the website has a robots.txt file, it often contains sitemap URLs. They are usually listed under the `Sitemap:` directive.
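
As an illustration, here is a minimal sketch that downloads robots.txt with Node's built-in `fetch` and pulls out the `Sitemap:` directives (the domain is just a placeholder):

```js
// A minimal sketch: download robots.txt and extract the `Sitemap:` directives.
// The domain is just a placeholder - replace it with the website you are crawling.
const response = await fetch('https://example.com/robots.txt');
const robotsTxt = await response.text();

const sitemapUrls = robotsTxt
    .split('\n')
    .filter((line) => line.trim().toLowerCase().startsWith('sitemap:'))
    // Keep everything after the first colon, i.e. the sitemap URL itself.
    .map((line) => line.slice(line.indexOf(':') + 1).trim());

console.log(sitemapUrls);
```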
### [](#common-url-paths) Common URL paths

You can try to iterate over common URL paths like:

```
/sitemap.xml
/product_index.xml
/product_template.xml
/sitemap_index.xml
/sitemaps/sitemap_index.xml
/sitemap/product_index.xml
/media/sitemap.xml
/media/sitemap/sitemap.xml
/media/sitemap/index.xml
```

Also make sure to test the list with the `.gz`, `.tar.gz` and `.tgz` extensions and with capitalized words (e.g. `/Sitemap_index.xml.tar.gz`); the sketch below shows one way to automate these checks.

Some websites also provide an HTML version to help indexing bots find new content. Those include:

```
/sitemap
/category-sitemap
/sitemap.html
/sitemap_index
```

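Here is a rough sketch of one way to automate probing these candidates with plain `fetch`. The domain and the path list are only examples, and the content-type check is just a heuristic (some sites return a `200` HTML error page for missing files), so treat any hits as candidates to inspect:

```js
// A rough sketch: probe a list of common sitemap paths and keep the ones that
// respond successfully with an XML, gzip or HTML document.
// The domain and the candidate paths are only examples - extend them as needed.
const baseUrl = 'https://example.com';
const candidatePaths = [
    '/sitemap.xml',
    '/sitemap.xml.gz',
    '/sitemap_index.xml',
    '/sitemaps/sitemap_index.xml',
    '/sitemap',
    '/sitemap.html',
];

const foundSitemaps = [];
for (const path of candidatePaths) {
    const response = await fetch(`${baseUrl}${path}`).catch(() => null);
    if (!response || !response.ok) continue;

    const contentType = response.headers.get('content-type') ?? '';
    if (/xml|gzip|html/.test(contentType)) {
        foundSitemaps.push(`${baseUrl}${path}`);
    }
}

console.log(foundSitemaps);
```
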
Apify provides the [Sitemap Sniffer actor](https://apify.com/vaclavrut/sitemap-sniffer) (open-source code), which scans these URL variations automatically for you so that you don't have to check them manually.

## [](#how-to-set-up-http-requests-to-download-sitemaps) How to set up HTTP requests to download sitemaps

For most sitemaps, you can make a simple HTTP request and parse the downloaded XML text with Cheerio (or just use `CheerioCrawler`). Some sitemaps are compressed and have to be streamed and decompressed. [This article]({{@link node-js/parsing_compressed_sitemaps}}) provides a step-by-step guide on how to handle that.
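
As a simplified, non-streaming sketch (the URL is just a placeholder), a gzipped sitemap can be downloaded and decompressed with Node's built-in `zlib` module:

```js
import { gunzipSync } from 'node:zlib';

// A simplified (non-streaming) sketch: download a sitemap and decompress it
// when it is gzipped. The URL is just a placeholder.
const url = 'https://example.com/sitemap.xml.gz';
const response = await fetch(url);
const buffer = Buffer.from(await response.arrayBuffer());

// Decide whether to decompress based on the extension or the content type.
const isGzipped = url.endsWith('.gz')
    || (response.headers.get('content-type') ?? '').includes('gzip');

// `xml` now holds the plain sitemap text, ready to be parsed (see the next section).
const xml = isGzipped ? gunzipSync(buffer).toString('utf-8') : buffer.toString('utf-8');
console.log(xml.slice(0, 200));
```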
## [](#how-to-parse-urls-from-sitemaps) How to parse URLs from sitemaps

The easiest part is parsing the actual URLs from the sitemap. They are usually listed under `<loc>` tags. You can use Cheerio to parse the XML text and extract the URLs. Just be careful that the sitemap might contain other URLs that you don't want to crawl (e.g. `/about`, `/contact` or various special category sections). [This article]({{@link node-js/scraping-from-sitemaps}}) provides code examples for parsing sitemaps.
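
For example, a minimal sketch could look like this. The inline XML stands in for a real downloaded sitemap, and the filter is only an example:

```js
import * as cheerio from 'cheerio';

// A minimal sketch: extract URLs from <loc> tags and filter out pages we don't want to crawl.
// The inline XML stands in for a downloaded sitemap; the filter is just an example.
const xml = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/product/123</loc></url>
    <url><loc>https://example.com/about</loc></url>
</urlset>`;

const $ = cheerio.load(xml, { xmlMode: true });

const urls = $('loc')
    .map((i, el) => $(el).text().trim())
    .get()
    // Skip pages we are not interested in - adjust this filter to the target website.
    .filter((url) => !/\/(about|contact)/.test(url));

console.log(urls); // [ 'https://example.com/product/123' ]
```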
## [](#next) Next up

That's all we need to know about sitemaps for now. Let's dive into a much more interesting topic: search, filters and pagination.

content/academy/advanced_web_scraping/scraping_paginated_sites.md renamed to content/academy/advanced_web_scraping/crawling/crawling-with-search-i.md

+10 -7
@@ -1,12 +1,14 @@
---
-title: Scraping paginated sites
+title: Crawling with search I
description: Learn how to extract all of a website's listings even if they limit the number of results pages. See code examples for setting up your scraper.
-menuWeight: 1
+menuWeight: 3
paths:
-    - advanced-web-scraping/scraping-paginated-sites
+    - advanced-web-scraping/crawling/crawling-with-search-i
---

-# Scraping websites with limited pagination
+# Scraping websites with search I
+
+In this lesson, we will start with a simpler example: scraping HTML-based websites with limited pagination.

Limited pagination is a common practice on e-commerce sites and is becoming more popular over time. It makes sense: a real user will never want to look through more than 200 pages of results – only bots love unlimited pagination. Fortunately, there are ways to overcome this limit while keeping our code clean and generic.

@@ -18,7 +20,7 @@ Limited pagination is a common practice on e-commerce sites and is becoming more

Websites usually limit the pagination of a single (sub)category to somewhere between 1,000 to 20,000 listings. The site might have over a million listings in total. Without a proven algorithm, it will be very manual and almost impossible to scrape all listings.

-We will first look at a couple ideas that don't work so well and then present the [final robust solution](#using-filter-ranges).
+We will first look at a couple of ideas that might cross our mind but don't work so well, and then present the [most robust solution](#using-filter-ranges).

### [](#going-deeper-into-subcategories) Going deeper into subcategories

@@ -278,9 +280,10 @@ for (const filter of newFilters) {
await crawler.addRequests(requestsToEnqueue);
```

+Check out the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).
+
## [](#summary) Summary

-And that's it. We have an elegant and simple solution for a complicated problem. In a real project, you would want to make this a bit more robust and [save analytics data]({{@link expert_scraping_with_apify/saving_useful_stats.md}}). This will let you know what filters you went through and how many products each of them had.
+And that's it. We have an elegant and simple solution for a complicated problem. In the next lesson, we will explore how to refine this algorithm and apply it to bigger use cases like scraping APIs.

-Check out the [full code example](https://github.com/metalwarrior665/apify-utils/tree/master/examples/crawler-with-filters).
