
Commit db57bc2 ("improved and added")

1 parent 215cc5d commit db57bc2

3 files changed: 248 additions & 1 deletion

File tree

20210913/3_1___CSS.ipynb

Lines changed: 2 additions & 1 deletion

@@ -28,7 +28,8 @@
     "source": [
      "#### Documentation:\n",
      "- [Requests.py](http://docs.python-requests.org)\n",
-     "- [Beautifulsoup.py](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)"
+     "- [Beautifulsoup.py](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n",
+     "- [CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp)"
      ]
     },
     {
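The CSS selectors reference added above pairs with BeautifulSoup's `select()` method, which accepts the same selector syntax. A minimal offline sketch (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

# A small HTML fragment standing in for a fetched page:
html = """
<div class="prices">
  <div class="name">Dinner Plate</div>
  <div class="name">Salad Plate</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# 'div.prices div.name' selects every div.name nested inside div.prices:
names = [el.text for el in soup.select("div.prices div.name")]
print(names)  # → ['Dinner Plate', 'Salad Plate']
```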
Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise: Olympics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import re\n",
+    "headers = {'user-agent': 'scrapingCourseBot'}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- On **olympics.com**, find the page that lists all winter sports. Check the robots.txt for this page. If it is allowed, retrieve the page with a Python request."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Retrieve page:\n",
+    "r = requests.???????????\n",
+    "print(r.status_code)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Look into the result to see whether you can find any of the winter sports via a regexp or by searching in the browser"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Via regexp: for example, search for BIATHLON or CURLING:\n",
+    "print(re.????????)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Via search in browser:\n",
+    "print(?????)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
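One plausible way to fill the `???????` placeholders above (an assumption, not the official solution): fetch the page with `requests.get(url, headers=headers)` and scan `r.text` with `re.findall`. The regexp step is shown here on a small inline sample so the sketch runs without a network connection:

```python
import re

# Stand-in for r.text after e.g.
# r = requests.get('https://olympics.com/en/sports/winter-olympics', headers=headers)
sample = "<h1>BIATHLON</h1><h1>CURLING</h1><h1>LUGE</h1>"

# re.findall returns every non-overlapping match of the pattern:
print(re.findall(r"BIATHLON|CURLING", sample))  # → ['BIATHLON', 'CURLING']
```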

20210913/3_2_Headless.ipynb

Lines changed: 158 additions & 0 deletions

@@ -0,0 +1,158 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless scraping using Selenium"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install Selenium (needs to be executed only once)\n",
+    "!pip install selenium\n",
+    "\n",
+    "# Download the Selenium driver for your browser and machine from\n",
+    "# https://selenium-python.readthedocs.io/installation.html#drivers\n",
+    "# and add it to this local folder:\n",
+    "import os\n",
+    "import sys\n",
+    "os.path.dirname(sys.executable)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Imports:\n",
+    "import time  # for sleeping between multiple requests\n",
+    "from selenium import webdriver"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Documentation:\n",
+    "- [Requests.py](http://docs.python-requests.org)\n",
+    "- [Beautifulsoup.py](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n",
+    "- [Selenium with Python](https://selenium-python.readthedocs.io/)\n",
+    "- [Locating elements](https://selenium-python.readthedocs.io/locating-elements.html#locating-elements)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless request to price list (Chrome):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "options = webdriver.ChromeOptions()\n",
+    "options.headless = True\n",
+    "options.add_argument('user-agent=scrapingCourseBot')\n",
+    "\n",
+    "browser = webdriver.Chrome(options=options)\n",
+    "browser.get('http://testing-ground.webscraping.pro/price-list-1.html')\n",
+    "\n",
+    "# Retrieve first item name:\n",
+    "elem = browser.find_element_by_css_selector('div.name')\n",
+    "print(elem.text)\n",
+    "\n",
+    "browser.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless request to price list (Firefox):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from selenium.webdriver.firefox.options import Options\n",
+    "profile = webdriver.FirefoxProfile()\n",
+    "profile.set_preference(\"general.useragent.override\", \"scrapingCourseBot\")\n",
+    "\n",
+    "options = Options()\n",
+    "options.headless = True\n",
+    "\n",
+    "browser = webdriver.Firefox(profile, options=options)\n",
+    "browser.get('http://testing-ground.webscraping.pro/price-list-1.html')\n",
+    "\n",
+    "# Retrieve first item name:\n",
+    "elem = browser.find_element_by_css_selector('div.name')\n",
+    "print(elem.text)\n",
+    "\n",
+    "browser.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless retrieval of Olympic winter sports (Chrome):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "options = webdriver.ChromeOptions()\n",
+    "options.headless = True\n",
+    "options.add_argument('user-agent=scrapingCourseBot')\n",
+    "\n",
+    "browser = webdriver.Chrome(options=options)\n",
+    "browser.get('https://olympics.com/en/sports/winter-olympics')\n",
+    "\n",
+    "# Retrieve all sport titles:\n",
+    "elements = browser.find_elements_by_css_selector('div.-sportlist h1.article--title')\n",
+    "for e in elements:\n",
+    "    print(e.text)\n",
+    "\n",
+    "browser.close()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
