Job Scraper System - Design Document

Overview

This repository contains the design documentation for a scalable web scraping system focused on collecting publicly available job listing data. This project was undertaken as a response to a data engineering challenge.

The primary output is a detailed design document, built using MkDocs, outlining the proposed architecture, technology choices, data flow, operational considerations, and justifications.

📖 Accessing the Documentation

The full, browseable design document is deployed via GitHub Pages and can be accessed here:

➡️ https://jryusuf.github.io/web_scraper/ ⬅️

Navigate the documentation using the sidebar on the left, which follows the structure defined in the mkdocs.yml file.

Project Goal

The designed system aims to:

Collect job postings from multiple job boards and company career pages.
Target specific roles, sectors, and locations based on configuration.
Handle common scraping challenges (dynamic content, pagination, rate limits, anti-scraping).
Process and store data in a structured format suitable for analysis and potential ML applications.
Be scalable, reliable, and maintainable.

Documentation Stack

This documentation site itself is built using:

MkDocs: Static site generator for project documentation.
MkDocs Material Theme: For enhanced visuals and features.
mkdocs-mermaid2-plugin: For rendering embedded Mermaid diagrams.

Running Locally

To build and view the documentation site locally:

Prerequisites: Ensure you have Python 3.x and pip installed.

Clone Repository:

git clone https://github.com/jryusuf/web_scraper.git
cd web_scraper

Set up Virtual Environment (Recommended):

python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Install Dependencies:

pip install -r requirements.txt

mkdocs serve

View: Open your web browser and navigate to http://127.0.0.1:8000. The site will automatically reload when you save changes to the Markdown files or mkdocs.yml.

Deployment

This site is automatically deployed to GitHub Pages, typically via the mkdocs gh-deploy command or a GitHub Actions workflow configured to build and deploy the contents of the docs/ directory on pushes to the main branch.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github/workflows		.github/workflows
project		project
.gitignore		.gitignore
LISENCE		LISENCE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Job Scraper System - Design Document

Overview

📖 Accessing the Documentation

Project Goal

Documentation Stack

Running Locally

Deployment

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

jryusuf/web_scraper

Folders and files

Latest commit

History

Repository files navigation

Job Scraper System - Design Document

Overview

📖 Accessing the Documentation

Project Goal

Documentation Stack

Running Locally

Deployment

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Packages