This repository contains the design documentation for a scalable web scraping system focused on collecting publicly available job listing data. This project was undertaken as a response to a data engineering challenge.
The primary output is a detailed design document, built using MkDocs, outlining the proposed architecture, technology choices, data flow, operational considerations, and justifications.
The full, browseable design document is deployed via GitHub Pages and can be accessed here:
➡️ https://jryusuf.github.io/web_scraper/ ⬅️
Navigate the documentation using the sidebar on the left, which follows the structure defined in the mkdocs.yml file.
The designed system aims to:
- Collect job postings from multiple job boards and company career pages.
- Target specific roles, sectors, and locations based on configuration.
- Handle common scraping challenges (dynamic content, pagination, rate limits, anti-scraping).
- Process and store data in a structured format suitable for analysis and potential ML applications.
- Be scalable, reliable, and maintainable.
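As a rough illustration of three of the goals above (configurable targeting, pagination, and rate limiting), the following is a minimal sketch only and is not taken from the design document. Every name in it (`JobPosting`, `RateLimiter`, `scrape_all_pages`, `matches_target`) is hypothetical, and it assumes a caller-supplied `fetch_page` function that returns an empty list once a job board has no more pages.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass(frozen=True)
class JobPosting:
    """Hypothetical structured record for a single job listing."""
    title: str
    company: str
    location: str
    url: str


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def scrape_all_pages(
    fetch_page: Callable[[int], List[JobPosting]],
    limiter: RateLimiter,
    max_pages: int = 100,
) -> Iterator[JobPosting]:
    """Yield postings page by page until an empty page or max_pages is hit."""
    for page in range(1, max_pages + 1):
        limiter.wait()                 # throttle before every request
        postings = fetch_page(page)    # assumed to return [] past the last page
        if not postings:
            break
        yield from postings


def matches_target(
    posting: JobPosting, roles: List[str], locations: List[str]
) -> bool:
    """Case-insensitive substring match against configured roles and locations."""
    title = posting.title.lower()
    location = posting.location.lower()
    return any(r.lower() in title for r in roles) and any(
        loc.lower() in location for loc in locations
    )
```

Passing the clock and sleep functions in as parameters keeps the limiter deterministic under test. A real deployment would more likely need per-domain limits and retry handling, which this sketch leaves out.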
This documentation site itself is built using:
- MkDocs: Static site generator for project documentation.
- MkDocs Material Theme: For enhanced visuals and features.
- mkdocs-mermaid2-plugin: For rendering embedded Mermaid diagrams.
To build and view the documentation site locally:
- Prerequisites: Ensure you have Python 3.x and pip installed.
- Clone Repository:
    git clone https://github.com/jryusuf/web_scraper.git
    cd web_scraper
- Set up Virtual Environment (Recommended):
    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
- Install Dependencies:
    pip install -r requirements.txt
- Run the Development Server:
    mkdocs serve
- View: Open your web browser and navigate to http://127.0.0.1:8000. The site will automatically reload when you save changes to the Markdown files or mkdocs.yml.
This site is automatically deployed to GitHub Pages, typically via the mkdocs gh-deploy command or a GitHub Actions workflow configured to build and deploy the contents of the docs/ directory on pushes to the main branch.