Professional web scraping & data extraction demos — built with the same techniques used in production e-commerce monitoring, lead generation, and API reverse-engineering projects.
| Demo | Technique | Target |
|---|---|---|
| ecommerce_scraper.py | Pagination + detail crawling | books.toscrape.com |
| dynamic_scraper.py | Playwright headless browser | quotes.toscrape.com (JS) |
| api_reverse.py | Hidden API discovery | Hacker News Firebase API |
Every demo includes:
- Full browser-matching headers (not just User-Agent)
- Randomized request delays to avoid rate limits
- Exponential backoff retry logic
- Multi-format export: CSV / JSON / Excel
# 1. Install dependencies
pip install -r requirements.txt
# 2. Install Playwright browser (for dynamic scraper)
playwright install chromium
# 3. Run any demo
python examples/ecommerce_scraper.py
python examples/dynamic_scraper.py
python examples/api_reverse.py
Output lands in output/ as timestamped files in CSV, JSON, and XLSX.
[catalog] page 1 ... 20 items
[catalog] page 2 ... 20 items
...
Total: 1000 products across 50 pages
[detail] 1/10: A Light in the Attic ... done
...
Exported:
CSV → output/books_catalog_20260612_143022.csv
JSON → output/books_catalog_20260612_143022.json
XLSX → output/books_catalog_20260612_143022.xlsx
What it proves:
- Multi-page pagination with termination detection
- Detail-page enrichment (catalog → individual pages)
- Clean data modeling with `list[dict]`
- Works on real e-commerce patterns (pagination, product grids, breadcrumbs)
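The pagination and enrichment flow listed above can be sketched roughly as follows. This is a minimal illustration, not the code in `ecommerce_scraper.py`: `crawl_catalog`, `enrich`, and the injected `fetch_page` / `fetch_detail` callables are hypothetical names, and the HTML parsing itself is left to whatever those callables do.

```python
from urllib.parse import urljoin


def crawl_catalog(start_url, fetch_page, max_pages=100):
    """Follow "next" links until a page has none, a URL repeats, or max_pages is hit.

    fetch_page(url) is assumed to return (items, next_href), where items is a
    list of dicts and next_href is a relative link or None on the last page.
    """
    rows, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, next_href = fetch_page(url)
        rows.extend(items)
        # Termination detection: no "next" link means this was the last page.
        url = urljoin(url, next_href) if next_href else None
    return rows


def enrich(rows, fetch_detail, limit=10):
    """Merge detail-page fields into the first `limit` catalog rows, in place."""
    for row in rows[:limit]:
        row.update(fetch_detail(row["url"]))
    return rows
```

Keeping the fetchers injectable means the crawl logic can be exercised with canned pages before it is pointed at a live site.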
Uses Playwright headless Chromium to scrape content that only exists after JavaScript executes. Includes:
- `wait_for_selector` for reliable content detection
- Infinite-scroll handling with height-delta detection
- Cookie extraction for hybrid pipelines (Playwright → requests)
- Anti-detection flags (`--disable-blink-features=AutomationControlled`)
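A rough sketch of how these pieces might fit together with Playwright's sync API. The function names (`scroll_until_stable`, `cookies_for_requests`, `scrape`) are hypothetical, and the selector passed to `scrape` is an assumption about the target page, so treat this as an outline rather than the contents of `dynamic_scraper.py`.

```python
def scroll_until_stable(page, max_rounds=20, pause_ms=800):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls that actually loaded new content.
    """
    grew = 0
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to arrive
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:  # zero height delta: nothing new was loaded
            break
        grew += 1
        last_height = height
    return grew


def cookies_for_requests(playwright_cookies):
    """Convert Playwright's list of cookie dicts into a plain name -> value dict."""
    return {c["name"]: c["value"] for c in playwright_cookies}


def scrape(url, selector):
    """Render `url` in headless Chromium and return (element texts, cookies)."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            args=["--disable-blink-features=AutomationControlled"],
        )
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        page.wait_for_selector(selector)  # block until JS has rendered content
        scroll_until_stable(page)
        texts = page.locator(selector).all_inner_texts()
        cookies = cookies_for_requests(context.cookies())
        browser.close()
    return texts, cookies
```

The returned cookie dict can be passed to a `requests.Session` (for example via `session.cookies.update(cookies)`) to continue the crawl over plain HTTP.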
Finds and calls the backend JSON API directly: no DOM parsing, no headless browser, just plain HTTP. When a site exposes such an API, this is usually the fastest and most reliable scraping pattern:
- Identify the API endpoint the frontend calls
- Call it directly with proper parameters
- Parse clean JSON instead of HTML
Also demonstrates comment-thread fetching — practical for social-media monitoring use cases.
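As an illustration of the pattern, here is a minimal sketch against the public Hacker News Firebase API (`topstories` and `item/<id>` endpoints). It uses the standard library's `urllib` instead of `requests` to stay dependency-free; the function names and the `max_depth` cap are assumptions made for this example, not necessarily what `api_reverse.py` does.

```python
import json
from urllib.request import urlopen

API = "https://hacker-news.firebaseio.com/v0"


def get_json(path):
    """GET {API}/{path}.json and decode the JSON body."""
    with urlopen(f"{API}/{path}.json", timeout=15) as resp:
        return json.load(resp)


def top_stories(limit=5, fetch=get_json):
    """Return the first `limit` top stories as dicts."""
    ids = fetch("topstories")[:limit]
    return [fetch(f"item/{i}") for i in ids]


def comment_thread(item_id, fetch=get_json, max_depth=3):
    """Recursively fetch an item's comments via its `kids` ids, up to max_depth levels."""
    item = fetch(f"item/{item_id}") or {}  # the API can return null for an item
    node = {
        "id": item_id,
        "by": item.get("by"),
        "text": item.get("text"),
        "replies": [],
    }
    if max_depth > 0:
        for kid in item.get("kids", []):
            node["replies"].append(comment_thread(kid, fetch, max_depth - 1))
    return node
```

Each comment costs one request, so the depth cap (and a `limit` on stories) keeps a single run from fanning out into thousands of calls.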
Real-world scrapers fail silently without these:
- Full header set: `Sec-Ch-Ua`, `Accept-Language`, `Accept-Encoding`; sites check the combination, not just the UA
- Jittered delays: random intervals between requests (constant timing = bot)
- Exponential backoff: 429 → wait 2s, 4s, 8s with jitter
- Session reuse: connection pooling across requests
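The four points above could be combined along these lines. This is a hedged sketch: `BROWSER_HEADERS`, `backoff_delay`, and `polite_get` are hypothetical names rather than the API of `utils/stealth.py`, and the header values are placeholders that should be copied from a real browser session.

```python
import random
import time

# Illustrative values only (an assumption); copy the real set from your browser.
# "br" is left out of Accept-Encoding because requests needs an extra package
# to decode Brotli responses.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
    "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
}


def backoff_delay(attempt, base=2.0, max_jitter=0.5):
    """Seconds to wait before retry `attempt` (0-based): about 2, 4, 8, ... plus jitter."""
    return base * (2 ** attempt) + random.uniform(0.0, max_jitter)


def polite_get(session, url, max_retries=3, pause=(1.0, 3.0), sleep=time.sleep):
    """GET `url` with a jittered pause before each request and backoff on HTTP 429.

    `session` is expected to behave like requests.Session, which reuses
    pooled connections across calls.
    """
    for attempt in range(max_retries + 1):
        sleep(random.uniform(*pause))  # random interval, never a constant beat
        response = session.get(url, headers=BROWSER_HEADERS, timeout=15)
        if response.status_code != 429 or attempt == max_retries:
            return response
        sleep(backoff_delay(attempt))  # rate limited: wait longer each time
```

Typical use would be `polite_get(requests.Session(), url)`; passing the same session for every call is what gives you connection pooling.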
One function call, pick your format:
from utils import exporters
exporters.to_csv(rows, "my_data")
exporters.to_json(rows, "my_data")
exporters.to_excel(rows, "my_data")
# → output/my_data_<timestamp>.{csv|json|xlsx}
requests · BeautifulSoup4 · lxml · Playwright · openpyxl
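For readers curious what sits behind calls like `exporters.to_csv(rows, "my_data")`, one plausible shape is sketched below using only the standard library. The `out_dir` parameter and the `_timestamped` helper are assumptions made for this example; the real `utils/exporters.py` may differ, and the Excel export (which would go through openpyxl) is omitted.

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def _timestamped(name, ext, out_dir="output"):
    """Build out_dir/name_YYYYMMDD_HHMMSS.ext, creating out_dir if needed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return Path(out_dir) / f"{name}_{stamp}.{ext}"


def to_csv(rows, name, out_dir="output"):
    """Write a list of dicts to CSV; columns are the union of keys in first-seen order."""
    path = _timestamped(name, "csv", out_dir)
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # keys missing from a row are written as empty cells
    return path


def to_json(rows, name, out_dir="output"):
    """Write a list of dicts to pretty-printed UTF-8 JSON."""
    path = _timestamped(name, "json", out_dir)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Taking the union of keys means rows scraped with and without detail-page fields can share one file without raising on unexpected columns.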
scraper-portfolio/
├── examples/
│ ├── ecommerce_scraper.py # Static HTML scraping
│ ├── dynamic_scraper.py # JS-rendered / headless browser
│ └── api_reverse.py # API discovery + direct calls
├── utils/
│ ├── stealth.py # Anti-detection toolkit
│ └── exporters.py # CSV / JSON / Excel exporters
├── output/ # Generated data files
├── requirements.txt
└── README.md
Beyond these demos, I've worked on:
- E-commerce price monitoring: automated daily scraping with proxy rotation, JS cookie fingerprinting, parameter signing (reverse-engineering MD5-based request signatures)
- Lead generation pipelines: multi-source data aggregation with deduplication
- Anti-detection: TLS fingerprint matching, Cloudflare bypass strategies, fake-data detection
If you need custom data extraction, monitoring, or automation — let's talk.
Available for freelance Python scraping & automation projects.