Python Web Scraping Portfolio

Professional web scraping & data extraction demos — built with the same techniques used in production e-commerce monitoring, lead generation, and API reverse-engineering projects.

What's Inside

Demo	Technique	Target
`ecommerce_scraper.py`	Pagination + detail crawling	books.toscrape.com
`dynamic_scraper.py`	Playwright headless browser	quotes.toscrape.com (JS)
`api_reverse.py`	Hidden API discovery	Hacker News Firebase API

Every demo includes:

Full browser-matching headers (not just User-Agent)
Randomized request delays to avoid rate limits
Exponential backoff retry logic
Multi-format export: CSV / JSON / Excel

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Install Playwright browser (for dynamic scraper)
playwright install chromium

# 3. Run any demo
python examples/ecommerce_scraper.py
python examples/dynamic_scraper.py
python examples/api_reverse.py

Output lands in `output/` as timestamped CSV, JSON, and XLSX files.

Demo Breakdown

1. E-commerce Product Scraper

[catalog] page 1 ... 20 items
[catalog] page 2 ... 20 items
...
Total: 1000 products across 50 pages
[detail] 1/10: A Light in the Attic ... done
...
Exported:
  CSV  → output/books_catalog_20260612_143022.csv
  JSON → output/books_catalog_20260612_143022.json
  XLSX → output/books_catalog_20260612_143022.xlsx

What it proves:

Multi-page pagination with termination detection
Detail-page enrichment (catalog → individual pages)
Clean data modeling with list[dict]
Works on real e-commerce patterns (pagination, product grids, breadcrumbs)
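
The pagination loop with termination detection can be sketched roughly as follows. This is a standard-library illustration, not the code in `ecommerce_scraper.py` (which, per the tech stack, uses BeautifulSoup). The `li.next` link and `h3 > a` product markup are assumptions about the target page, and `fetch`, `CatalogPageParser`, and `crawl_catalog` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CatalogPageParser(HTMLParser):
    """Collects product links and the 'next page' link from one catalog page."""

    def __init__(self):
        super().__init__()
        self.items = []        # one dict per product found on the page
        self.next_href = None  # stays None on the last page
        self._in_next = False
        self._in_h3 = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self._in_next = True
        elif tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._in_next:
            self.next_href = attrs.get("href")
        elif tag == "a" and self._in_h3:
            self.items.append({"title": attrs.get("title"), "href": attrs.get("href")})

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_next = False
        elif tag == "h3":
            self._in_h3 = False


def crawl_catalog(fetch, start_url, max_pages=100):
    """Follow 'next' links until none is found; fetch(url) must return HTML text."""
    rows, url, seen = [], start_url, set()
    # Stop on a missing next link, a repeated URL, or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = CatalogPageParser()
        parser.feed(fetch(url))
        for item in parser.items:
            rows.append({"title": item["title"], "url": urljoin(url, item["href"])})
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return rows
```

Passing `fetch` in as a callable keeps the loop independent of the HTTP client, so the same function can be driven by a `requests` session or by canned HTML in a test. The collected `url` values are what a detail-page pass would visit next.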

2. Dynamic / JS-Rendered Page Scraper

Uses Playwright headless Chromium to scrape content that only exists after JavaScript executes. Includes:

wait_for_selector for reliable content detection
Infinite-scroll handling with height-delta detection
Cookie extraction for hybrid pipelines (Playwright → requests)
Anti-detection flags (--disable-blink-features=AutomationControlled)
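
As an illustration of the height-delta idea, here is a minimal sketch of an infinite-scroll loop. It assumes `page` is a Playwright sync `Page` (only its `evaluate` and `wait_for_timeout` methods are used); the function name `scroll_until_stable` and its defaults are hypothetical and may differ from what `dynamic_scraper.py` does.

```python
def scroll_until_stable(page, pause_ms=800, max_rounds=30):
    """Scroll to the bottom until the document height stops growing.

    Returns the number of scrolls that caused new content to load.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    loaded = 0
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to render
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # no height delta: nothing more was appended
        last_height = new_height
        loaded += 1
    return loaded
```

The `max_rounds` cap is a safety net for feeds that never stop growing. A fixed pause is the simplest option; waiting on a selector count or a network-idle state may be more robust on slow pages.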

3. API Reverse-Engineering

Finds and calls the backend JSON API directly: no DOM parsing, no headless browser, just plain HTTP. When a site exposes such an API, this is usually the fastest and most reliable scraping pattern:

Identify the API endpoint the frontend calls
Call it directly with proper parameters
Parse clean JSON instead of HTML

Also demonstrates comment-thread fetching — practical for social-media monitoring use cases.
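
The steps above, applied to the public Hacker News Firebase API, might look like the sketch below. The `topstories.json` and `item/<id>.json` endpoints are part of that API; the helper names (`fetch_json`, `top_stories`, `comment_thread`) and the injectable `get` parameter are assumptions for illustration, not the interface of `api_reverse.py`.

```python
import json
from urllib.request import urlopen

BASE = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def top_stories(limit=5, get=fetch_json):
    """Return the first `limit` top stories as dicts."""
    ids = get(f"{BASE}/topstories.json")[:limit]
    return [get(f"{BASE}/item/{item_id}.json") for item_id in ids]


def comment_thread(story_id, get=fetch_json, max_depth=3):
    """Walk the 'kids' tree under a story and return comments as flat rows."""
    rows = []

    def walk(item_id, depth):
        item = get(f"{BASE}/item/{item_id}.json") or {}  # missing items come back as null
        if item.get("type") == "comment" and not item.get("deleted"):
            rows.append({"id": item_id, "depth": depth,
                         "by": item.get("by"), "text": item.get("text")})
        if depth < max_depth:
            for kid_id in item.get("kids", []):
                walk(kid_id, depth + 1)

    walk(story_id, 0)
    return rows
```

Each item is a separate request, so a deep thread can mean many calls; `max_depth` bounds that, and the randomized delays described under Anti-Detection would apply between requests.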

Anti-Detection (`utils/stealth.py`)

Real-world scrapers often fail silently without these:

Full header set: Sec-Ch-Ua, Accept-Language, Accept-Encoding — sites check the combination, not just UA
Jittered delays: random intervals between requests (constant timing = bot)
Exponential backoff: 429 → wait 2s, 4s, 8s with jitter
Session reuse: connection pooling across requests
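
A rough sketch of how the header set, jittered delays, and 429 backoff could fit together is shown below. It is not the contents of `utils/stealth.py`: the header values are illustrative examples of a Chrome-like profile, and `jittered_delay`, `backoff_delays`, `get_with_retry`, and the `send` callable are hypothetical names.

```python
import random
import time

# Illustrative values only; a real profile should match one browser version consistently.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
}


def jittered_delay(low=1.0, high=3.0):
    """Random pause length in seconds, so request timing is not constant."""
    return random.uniform(low, high)


def backoff_delays(retries=3, base=2.0, jitter=0.5):
    """Delays of roughly 2s, 4s, 8s, each with a little random jitter added."""
    return [base * 2 ** i + random.uniform(0, jitter) for i in range(retries)]


def get_with_retry(send, url, retries=3, sleep=time.sleep):
    """Call send(url) -> (status, body), backing off while the status is 429."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        status, body = send(url)
        if status != 429 or attempt == retries:
            return status, body
        sleep(delays[attempt])
```

With `requests`, `send` could wrap a single `requests.Session` whose headers were updated from `BROWSER_HEADERS`, which also gives the connection pooling mentioned in the last bullet.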

Export Formats (`utils/exporters.py`)

One function call, pick your format:

from utils import exporters

exporters.to_csv(rows, "my_data")
exporters.to_json(rows, "my_data")
exporters.to_excel(rows, "my_data")
# → output/my_data_<timestamp>.{csv|json|xlsx}
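
For readers curious what such exporters might look like, here is a minimal standard-library sketch of the CSV and JSON cases. It is a guess at the shape, not the actual `utils/exporters.py`: the `out_dir` parameter and the `_target` helper are assumptions, and the timestamp format is inferred from the sample filenames shown earlier. Excel export is left out because it needs openpyxl.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def _target(name, ext, out_dir="output"):
    """Build <out_dir>/<name>_<YYYYmmdd_HHMMSS>.<ext>, creating the folder if needed."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{name}_{stamp}.{ext}"


def to_csv(rows, name, out_dir="output"):
    """Write a list of dicts to CSV and return the file path."""
    path = _target(name, "csv", out_dir)
    # Union of keys in first-seen order, so rows with missing fields still export.
    fields = list(dict.fromkeys(key for row in rows for key in row))
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return path


def to_json(rows, name, out_dir="output"):
    """Write a list of dicts to pretty-printed JSON and return the file path."""
    path = _target(name, "json", out_dir)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Returning the path lets the caller print an "Exported:" summary like the one in the demo output.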

Tech Stack

requests · BeautifulSoup4 · lxml · Playwright · openpyxl

Project Structure

scraper-portfolio/
├── examples/
│   ├── ecommerce_scraper.py    # Static HTML scraping
│   ├── dynamic_scraper.py      # JS-rendered / headless browser
│   └── api_reverse.py          # API discovery + direct calls
├── utils/
│   ├── stealth.py              # Anti-detection toolkit
│   └── exporters.py            # CSV / JSON / Excel exporters
├── output/                     # Generated data files
├── requirements.txt
└── README.md

Real-World Experience

Beyond these demos, I've worked on:

E-commerce price monitoring: automated daily scraping with proxy rotation, JS cookie fingerprinting, and reverse-engineering of MD5-based parameter signing
Lead generation pipelines: multi-source data aggregation with deduplication
Anti-detection: TLS fingerprint matching, Cloudflare bypass strategies, fake-data detection

If you need custom data extraction, monitoring, or automation — let's talk.

Available for freelance Python scraping & automation projects.
