`sources/academy/webscraping/scraping_basics_python/13_platform.md`

## Registering

First, let's [create a new Apify account](https://console.apify.com/sign-up). You'll go through a few checks to confirm you're human and your email is valid—annoying but necessary to prevent abuse of the platform.

Apify serves both as infrastructure where you can privately deploy and run your own scrapers, and as a marketplace where anyone can offer their ready-made scrapers to others for rent. But let's hold off on exploring the Apify Store for now.

## Getting access from the command line
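
In short, we install the Apify CLI and authenticate it with our account. A rough sketch of the flow—the exact commands and prompts may differ slightly, but the session should end with a confirmation like this:

```text
$ apify login
...
Success: You are logged in to Apify as user1234!
```
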
## Starting a real-world project

Until now, we've kept our scrapers simple, each with just a single Python module like `main.py`, and we've added dependencies only by installing them with `pip` inside a virtual environment.

If we sent our code to a friend, they wouldn't know what to install to avoid import errors. The same goes for deploying to a cloud platform.

To share our project, we need to package it. The best way is to follow the official [Python Packaging User Guide](https://packaging.python.org/), but for this course, we'll take a shortcut with the Apify CLI.

Change to a directory where you start new projects in your terminal. Then, run the following command—it will create a new subdirectory called `warehouse-watchdog` for the new project, containing all the necessary files:
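
A sketch of what that command looks like with the Apify CLI. The lesson's exact invocation and template choice may differ; the CLI typically asks us to pick a template, such as one of its Python scraping templates:

```text
$ apify create warehouse-watchdog
```
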

Inside the `warehouse-watchdog` directory, we should see a `src` subdirectory containing `main.py`, among other files.

The file contains a single asynchronous function, `main()`. At the beginning, it handles [input](https://docs.apify.com/platform/actors/running/input-and-output#input), then passes that input to a small crawler built on top of the Crawlee framework.

Every program that runs on the Apify platform first needs to be packaged as a so-called Actor—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code.

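
As a minimal sketch of what explicit input handling looks like with the Apify SDK for Python (the `start_urls` key below is illustrative, not necessarily this template's schema):

```python
import asyncio

from apify import Actor


async def main():
    # Entering the Actor context connects the program to the platform
    # (or to local storage when run with `apify run`).
    async with Actor:
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get("start_urls", [])  # illustrative input key
        Actor.log.info(f"Received {len(start_urls)} start URLs")


if __name__ == "__main__":
    asyncio.run(main())
```
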
We'll now adjust the template so it runs our program for watching prices. As a first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with the [final code](./12_framework.md#logging) from the previous lesson:

The Actor configuration from the template tells the platform to expect input, so we need to update that before running our scraper in the cloud.

Inside `warehouse-watchdog`, there's a directory called `.actor`. Within it, we'll edit the `input_schema.json` file, which looks like this by default:
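
The default file itself isn't reproduced here. As a rough illustration of the general shape of an Apify input schema—the names and fields below are examples, not the template's actual contents:

```json
{
    "title": "Warehouse Watchdog input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "start_urls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources",
            "description": "URLs where the crawler should start"
        }
    },
    "required": ["start_urls"]
}
```
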

Make sure there's no trailing comma after `{}`, or the file won't be valid JSON.


## Deploying the scraper

Now we can proceed to deployment:

```text
$ apify push
...
Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
? Do you want to open the Actor detail in your browser? (Y/n)
```

After agreeing to open the Actor details in our browser, assuming we're logged in, we'll see an option to **Start Actor**. Clicking it opens the execution settings. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper is running in the cloud.

When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more.

:::note Accessing data programmatically

You don't need to click buttons to download the data. You can also retrieve it using Apify's API.

:::

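
For example, with the `apify-client` package for Python—a hedged sketch where the token and dataset ID are placeholders you'd take from the Apify console:

```python
from apify_client import ApifyClient

# Both values are placeholders: the token comes from Apify account settings,
# the dataset ID from the run's storage details.
client = ApifyClient(token="YOUR_APIFY_TOKEN")
items = client.dataset("YOUR_DATASET_ID").list_items().items

for item in items:
    print(item)
```
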
## Running the scraper periodically

Now that our scraper is deployed, let's automate its execution. In the Apify web interface, we'll go to [Schedules](https://console.apify.com/schedules), click **Create new**, review the periodicity (daily by default), and specify the Actor to run. Then we'll click **Enable**—that's it!

From now on, the Actor will run daily. We can inspect each run, view its logs and collected data, see stats and monitoring charts, and even set up alerts.

## Adding support for proxies

If monitoring shows that our scraper frequently fails to reach the Warehouse Shop website, it's likely being blocked. To avoid this, we can configure proxies so our requests come from different locations, reducing the chances of detection and blocking.

Proxy configuration is a type of Actor input, so let's start by reintroducing the necessary code. We'll update `warehouse-watchdog/src/main.py` like this:
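
The full change isn't shown in this excerpt. A trimmed sketch of the idea, assuming the Apify SDK's `Actor.create_proxy_configuration()` helper—the `proxyConfig` input key and the `crawl()` function imported from `crawler.py` are placeholders for whatever the project actually uses:

```python
from apify import Actor

from .crawler import main as crawl  # placeholder: the crawler code from crawler.py


async def main():
    async with Actor:
        actor_input = await Actor.get_input() or {}

        # Build a proxy configuration from the Actor input, if one was provided.
        proxy_config = await Actor.create_proxy_configuration(
            actor_proxy_input=actor_input.get("proxyConfig")
        )
        Actor.log.info(f"Using proxy: {'yes' if proxy_config else 'no'}")

        # The crawler can then pass proxy_config to Crawlee,
        # e.g. BeautifulSoupCrawler(proxy_configuration=proxy_config).
        await crawl(proxy_config)
```
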

Finally, we'll modify the Actor configuration in the `.actor` directory so the platform knows about the new proxy input.

To verify everything works, we'll run the scraper locally. We'll use the `apify run` command again, but this time with the `--purge` option to ensure we're not reusing data from a previous run:
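
For reference, the invocation looks like this (its output is omitted here):

```text
$ apify run --purge
```
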

In the logs, we should see `Using proxy: no`, because local runs don't include proxy settings. All requests will be made from our own location, just as before. Now, let's update the cloud version of our scraper with `apify push`:

```text
$ apify push
...
Run: Building Actor warehouse-watchdog
...
? Do you want to open the Actor detail in your browser? (Y/n)
```

Back in the Apify console, we'll go to the **Source** screen and switch to the **Input** tab. There, we'll see the new **Proxy config** option, which defaults to **Datacenter - Automatic**. We'll leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform:

```text
(timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository.
...
```
## Congratulations!

You've reached the end of the course—congratulations! 🎉 Together, we've built a program that:

- Crawls a shop and extracts product and pricing data
- Exports the results in several formats
- Uses concise code, thanks to a scraping framework
- Runs on a cloud platform with monitoring and alerts
- Executes periodically without manual intervention, collecting data over time
- Uses proxies to avoid being blocked

We hope this serves as a solid foundation for your next scraping project. Perhaps you'll even start publishing scrapers for others to use—for a fee? 😉