This project is a Python-based script to scrape product and location data from different restaurant websites. It supports parallel execution for faster processing and makes it easy to add new parsers.
- Scrapes data from specified restaurants.
- Saves the scraped data to `.hlds` files for further use.
- Supports parallel and sequential execution modes, configurable via a boolean flag.
- Designed to be extendable with new restaurant parsers.
- Python 3.x installed on your system.
- Install the required dependencies (e.g., `requests`, `BeautifulSoup`, etc.).
- **Clone the Repository:**

  ```bash
  git clone https://github.com/ZeusWPI/haldis_een_prijsje.git
  cd haldis_een_prijsje
  ```

- **Install Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```
- **Run the Script:** To scrape data for a specific restaurant:

  ```bash
  python main.py
  ```

  - Set `restaurant_name` to the desired restaurant (e.g., `"simpizza"`) in the script.
  - Enable or disable parallelism by toggling the `use_parallelism` flag (`True` or `False`).
- **Run for All Restaurants:** Set `run_everything` to `True` in the script to scrape data from all available restaurants.
- `restaurant_name`: Set this to the name of the restaurant you want to scrape.
- `use_parallelism`: Set to `True` for parallel execution or `False` for sequential execution.
- `run_everything`: Set to `True` to scrape all restaurants; otherwise, leave it as `False`.
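For illustration, these flags could sit near the top of `main.py` roughly as follows (a minimal sketch with example values; the exact layout in the repository may differ):

```python
# Configuration flags used by main.py (illustrative values only)
restaurant_name = "simpizza"   # restaurant to scrape when run_everything is False
use_parallelism = True         # True: run scraper tasks in parallel; False: sequentially
run_everything = False         # True: scrape every supported restaurant
```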
To add support for a new restaurant scraper:
- **Check for Open Issues:**
  - Navigate to the Issues section of this repository.
  - Look for an unassigned issue related to the new parser.
  - Assign the issue to yourself.
- **Implement the Parser:**
  - Create a new scraper file under `scrapers/` (e.g., `newrestaurant_scraper.py`).
  - Implement a `get_prices()` method in the new scraper (see the interface in `scrapers/scraper.py`), returning:
    - A list of products.
    - Location information.
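  A minimal skeleton of such a scraper, assuming the interface in `scrapers/scraper.py` only requires a static `get_prices()` that returns the product list and the location (the exact base class and return types may differ):

  ```python
  # scrapers/newrestaurant_scraper.py -- illustrative sketch only
  class NewRestaurantScraper:
      @staticmethod
      def get_prices():
          # Fetch and parse the restaurant's menu page here, then build the
          # product list and location info that main.py expects.
          products = []    # list of scraped products
          location = None  # location information for the restaurant
          # ... restaurant-specific scraping logic ...
          return products, location
  ```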
- **Add the Parser to the Main Script:**
  - Define a new function in `main.py` (e.g., `run_newrestaurant`):

    ```python
    def run_newrestaurant():
        newrestaurant_products, newrestaurant_location = NewRestaurantScraper.get_prices()
        with open("hlds_files/newrestaurant.hlds", "w", encoding="utf-8") as file:
            file.write(str(newrestaurant_location) + "\n")
            file.write(translate_products_to_text(newrestaurant_products))
        print("newrestaurant done")
    ```

  - Add the function conditionally to the `tasks` list in `main.py`:

    ```python
    if restaurant_name.lower() == "newrestaurant" or run_everything:
        tasks.append(run_newrestaurant)
    ```
- **Test Your Parser:**
  - Run the script to ensure your parser works as expected.
  - Fix any bugs or errors.
- **Submit Your Work:**
  - Mark the issue as resolved and create a pull request to merge your changes.
- Always assign yourself an open issue before starting work.
- Follow the project structure and coding conventions.
- Test your changes thoroughly before submitting a pull request.
- Ensure your code is well-documented.
This document provides an overview of all utility functions included in the script.
Fetches and parses the HTML content from a given URL.
- Parameters:
  - `url` (str): The URL to fetch.
- Returns: A `BeautifulSoup` object containing the parsed HTML, or `None` if fetching fails.
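A sketch of how such a helper is commonly written with `requests` and `BeautifulSoup` (the function name here is hypothetical; the repository's implementation may differ):

```python
import requests
from bs4 import BeautifulSoup


def fetch_and_parse(url):  # hypothetical name
    """Fetch a page and return its parsed HTML, or None on failure."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    return BeautifulSoup(response.text, "html.parser")
```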
Filters out non-UTF-8 characters from the given text.
- Parameters:
  - `text` (str): The input text to filter.
- Returns: A string containing only UTF-8 characters.
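One way to implement this behaviour, assuming the goal is simply to drop anything that cannot be encoded as UTF-8 (hypothetical name):

```python
def filter_utf8(text):  # hypothetical name
    """Return the input with any non-UTF-8-encodable characters removed."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")
```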
Performs a GET request to the given URL and handles `ConnectionError`.

- Parameters:
  - `link` (str): The URL to fetch.
- Returns: A `requests.Response` object if successful, or an empty string if a `ConnectionError` occurs.
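A minimal sketch matching the described behaviour (hypothetical name; the repository's version may add headers or retries):

```python
import requests


def safe_get(link):  # hypothetical name
    """GET the URL, returning the Response, or "" on a connection error."""
    try:
        return requests.get(link)
    except requests.exceptions.ConnectionError:
        return ""
```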
Extracts all non-empty text content from `<span>` elements within a given `<div>`.

- Parameters:
  - `div`: A BeautifulSoup `<div>` element.
- Returns: A list of strings containing the extracted text.
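A possible implementation with BeautifulSoup (hypothetical name):

```python
def extract_span_texts(div):  # hypothetical name
    """Collect non-empty, stripped text from every <span> inside the div."""
    texts = [span.get_text(strip=True) for span in div.find_all("span")]
    return [text for text in texts if text]
```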
Filters `<div>` elements from parsed HTML based on a user-defined condition.

- Parameters:
  - `soup`: A `BeautifulSoup` object containing the HTML content.
  - `class_name` (str): The class name of the `<div>` elements to filter.
  - `condition` (callable): A function that takes a `<div>` element and returns `True` if the `<div>` matches the condition.
- Returns: A list of `<div>` elements that match the condition.
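A sketch of this filter (hypothetical name):

```python
def filter_divs(soup, class_name, condition):  # hypothetical name
    """Return the <div> elements of the given class that satisfy condition."""
    return [div for div in soup.find_all("div", class_=class_name)
            if condition(div)]
```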
Generates a condition function to check if a `<div>` contains an `<h2>` tag with the specified text.

- Parameters:
  - `text_to_search` (str): The text to search for within an `<h2>` tag.
- Returns: A function that takes a `<div>` and returns `True` if it contains an `<h2>` tag with the specified text.
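A sketch of the generated closure, usable with the `<div>` filter above (hypothetical name):

```python
def make_h2_condition(text_to_search):  # hypothetical name
    """Build a condition that matches divs whose <h2> contains the text."""
    def condition(div):
        h2 = div.find("h2")
        return h2 is not None and text_to_search in h2.get_text()
    return condition
```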
Downloads a PDF from a given URL and saves it locally.
- Parameters:
  - `url` (str): The URL of the PDF to download.
  - `save_path` (str): The local path to save the downloaded PDF.
- Returns: None. Prints a success or failure message.
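A sketch using `requests` (hypothetical name; the repository's version may differ):

```python
import requests


def download_pdf(url, save_path):  # hypothetical name
    """Download a PDF and write it to save_path, reporting the outcome."""
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, "wb") as pdf_file:
            pdf_file.write(response.content)
        print(f"PDF saved to {save_path}")
    else:
        print(f"Failed to download PDF from {url}")
```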
Extracts text from a PDF file. Optionally extracts text from a specified rectangular region.
- Parameters:
  - `file_path` (str): Path to the PDF file.
  - `coords` (tuple, optional): A tuple defining the rectangle (x0, top, x1, bottom). Defaults to `None` for full-page extraction.
- Returns: Extracted text as a string.
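The (x0, top, x1, bottom) rectangle follows the bounding-box convention of `pdfplumber`; assuming that library is what the script uses, a sketch could look like this (hypothetical name):

```python
import pdfplumber


def extract_pdf_text(file_path, coords=None):  # hypothetical name
    """Extract text from every page, optionally cropped to coords."""
    text_parts = []
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            region = page.within_bbox(coords) if coords else page
            text_parts.append(region.extract_text() or "")
    return "\n".join(text_parts)
```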
Retrieves the dimensions of a specified page in a PDF.
- Parameters:
  - `file_path` (str): Path to the PDF file.
  - `page_number` (int): The 1-based index of the page. Defaults to 1.
- Returns: A tuple containing the width and height of the page in points.
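A sketch, again assuming `pdfplumber` (hypothetical name):

```python
import pdfplumber


def get_page_dimensions(file_path, page_number=1):  # hypothetical name
    """Return (width, height) in points for the given 1-based page number."""
    with pdfplumber.open(file_path) as pdf:
        page = pdf.pages[page_number - 1]
        return page.width, page.height
```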
Converts a string representation of a number with a comma as a decimal separator to a float.
- Parameters:
  - `inp` (str): The input string containing the number (e.g., `"1,23"`).
- Returns: A `float` where the comma in the input string is replaced with a dot to adhere to standard decimal notation (e.g., `1.23`).
- Example:

  ```python
  number = comma_float("1,23")
  print(number)  # Output: 1.23
  ```