
Commit db57bc2 ("improved and added")

1 parent 215cc5d commit db57bc2

3 files changed: 248 additions & 1 deletion

File tree

20210913/3_1___CSS.ipynb

Lines changed: 2 additions & 1 deletion

@@ -28,7 +28,8 @@
     "source": [
      "#### Documentation:\n",
      "- [Requests.py](http://docs.python-requests.org)\n",
-     "- [Beautifulsoup.py](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)"
+     "- [Beautifulsoup.py](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n",
+     "- [CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp)"
      ]
     },
     {
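The CSS selectors reference added above pairs with BeautifulSoup's `select()` method, which accepts the same selector syntax. A minimal offline sketch (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

# A small HTML fragment standing in for a fetched page:
html = """
<div class="prices">
  <div class="name">Dinner Plate</div>
  <div class="name">Salad Plate</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# 'div.prices div.name' selects every div.name nested inside div.prices:
names = [el.text for el in soup.select("div.prices div.name")]
print(names)  # → ['Dinner Plate', 'Salad Plate']
```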
Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercise: Olympics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import requests\n",
+    "import re\n",
+    "headers = {'user-agent': 'scrapingCourseBot'}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- On **olympics.com**, find the page that lists all winter sports. Check the robots.txt for this page. If it is allowed, retrieve the page with a Python request."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Retrieve page:\n",
+    "r = requests.???????????\n",
+    "print(r.status_code)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- Look into the result to see whether you can find any of the winter sports via a regexp or by searching in the browser"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Via regexp: for example, search for BIATHLON or CURLING:\n",
+    "print(re.????????)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Via search in browser:\n",
+    "print(?????)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
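One plausible way to fill the `???????` placeholders above (an assumption, not the official solution): fetch the page with `requests.get(url, headers=headers)` and scan `r.text` with `re.findall`. The regexp step is shown here on a small inline sample so the sketch runs without a network connection:

```python
import re

# Stand-in for r.text after e.g.
# r = requests.get('https://olympics.com/en/sports/winter-olympics', headers=headers)
sample = "<h1>BIATHLON</h1><h1>CURLING</h1><h1>LUGE</h1>"

# re.findall returns every non-overlapping match of the pattern:
print(re.findall(r"BIATHLON|CURLING", sample))  # → ['BIATHLON', 'CURLING']
```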

20210913/3_2_Headless.ipynb

Lines changed: 158 additions & 0 deletions

@@ -0,0 +1,158 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless scraping using Selenium"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install Selenium (needs to be executed only once)\n",
+    "!pip install selenium\n",
+    "\n",
+    "# Download the Selenium driver for your browser and machine from\n",
+    "# https://selenium-python.readthedocs.io/installation.html#drivers\n",
+    "# and add it to this local folder:\n",
+    "import os\n",
+    "import sys\n",
+    "os.path.dirname(sys.executable)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Imports:\n",
+    "import time  # for sleeping between multiple requests\n",
+    "from selenium import webdriver"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Documentation:\n",
+    "- [Requests.py](http://docs.python-requests.org)\n",
+    "- [Beautifulsoup.py](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n",
+    "- [Selenium with Python](https://selenium-python.readthedocs.io/)\n",
+    "- [Locating elements](https://selenium-python.readthedocs.io/locating-elements.html#locating-elements)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless request to price list (Chrome):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "options = webdriver.ChromeOptions()\n",
+    "options.headless = True\n",
+    "options.add_argument('user-agent=scrapingCourseBot')\n",
+    "\n",
+    "browser = webdriver.Chrome(options=options)\n",
+    "browser.get('http://testing-ground.webscraping.pro/price-list-1.html')\n",
+    "\n",
+    "# Retrieve first item name:\n",
+    "elem = browser.find_element_by_css_selector('div.name')\n",
+    "print(elem.text)\n",
+    "\n",
+    "browser.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless request to price list (Firefox):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from selenium.webdriver.firefox.options import Options\n",
+    "profile = webdriver.FirefoxProfile()\n",
+    "profile.set_preference(\"general.useragent.override\", \"scrapingCourseBot\")\n",
+    "\n",
+    "options = Options()\n",
+    "options.headless = True\n",
+    "\n",
+    "browser = webdriver.Firefox(profile, options=options)\n",
+    "browser.get('http://testing-ground.webscraping.pro/price-list-1.html')\n",
+    "\n",
+    "# Retrieve first item name:\n",
+    "elem = browser.find_element_by_css_selector('div.name')\n",
+    "print(elem.text)\n",
+    "\n",
+    "browser.close()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Headless retrieval of Olympic winter sports (Chrome):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "options = webdriver.ChromeOptions()\n",
+    "options.headless = True\n",
+    "options.add_argument('user-agent=scrapingCourseBot')\n",
+    "\n",
+    "browser = webdriver.Chrome(options=options)\n",
+    "browser.get('https://olympics.com/en/sports/winter-olympics')\n",
+    "\n",
+    "# Retrieve all sport titles:\n",
+    "elements = browser.find_elements_by_css_selector('div.-sportlist h1.article--title')\n",
+    "for e in elements:\n",
+    "    print(e.text)\n",
+    "\n",
+    "browser.close()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
