tlyyxjz/scraper-portfolio

Python Web Scraping Portfolio

Professional web scraping & data extraction demos — built with the same techniques used in production e-commerce monitoring, lead generation, and API reverse-engineering projects.

What's Inside

| Demo | Technique | Target |
|---|---|---|
| ecommerce_scraper.py | Pagination + detail crawling | books.toscrape.com |
| dynamic_scraper.py | Playwright headless browser | quotes.toscrape.com (JS) |
| api_reverse.py | Hidden API discovery | Hacker News Firebase API |

Every demo includes:

  • Full browser-matching headers (not just User-Agent)
  • Randomized request delays to avoid rate limits
  • Exponential backoff retry logic
  • Multi-format export: CSV / JSON / Excel

Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Install Playwright browser (for dynamic scraper)
playwright install chromium

# 3. Run any demo
python examples/ecommerce_scraper.py
python examples/dynamic_scraper.py
python examples/api_reverse.py

Output lands in output/ — timestamped files in CSV, JSON, and XLSX.

Demo Breakdown

1. E-commerce Product Scraper

[catalog] page 1 ... 20 items
[catalog] page 2 ... 20 items
...
Total: 1000 products across 50 pages
[detail] 1/10: A Light in the Attic ... done
...
Exported:
  CSV  → output/books_catalog_20260612_143022.csv
  JSON → output/books_catalog_20260612_143022.json
  XLSX → output/books_catalog_20260612_143022.xlsx

What it proves:

  • Multi-page pagination with termination detection
  • Detail-page enrichment (catalog → individual pages)
  • Clean data modeling with list[dict]
  • Works on real e-commerce patterns (pagination, product grids, breadcrumbs)
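
The pagination-with-termination idea can be sketched roughly as follows. This is not the code from ecommerce_scraper.py: the names `NextLinkParser` and `crawl_pages` are hypothetical, the `<li class="next">` markup is an assumption about how the target site marks its "next" link, and the standard-library HTML parser stands in for BeautifulSoup so the sketch has no dependencies. The `fetch` argument is any callable that returns the HTML for a URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first <a> found inside <li class="next">."""

    def __init__(self):
        super().__init__()
        self._in_next = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "next" in (attrs.get("class") or "").split():
            self._in_next = True
        elif tag == "a" and self._in_next and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_next = False


def crawl_pages(fetch, start_url, max_pages=100):
    """Follow "next" links until none exists, a URL repeats, or max_pages is reached."""
    pages, seen, url = [], set(), start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append((url, html))
        parser = NextLinkParser()
        parser.feed(html)
        # Relative hrefs are resolved against the page they appeared on.
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return pages
```

Three independent stop conditions (no "next" link, a repeated URL, a hard page cap) keep the loop from running forever if the site's markup changes or its links form a cycle.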

2. Dynamic / JS-Rendered Page Scraper

Uses Playwright headless Chromium to scrape content that only exists after JavaScript executes. Includes:

  • wait_for_selector for reliable content detection
  • Infinite-scroll handling with height-delta detection
  • Cookie extraction for hybrid pipelines (Playwright → requests)
  • Anti-detection flags (--disable-blink-features=AutomationControlled)
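
A minimal sketch of height-delta scroll detection, under a few assumptions: `scroll_until_stable` is a hypothetical name (not necessarily what dynamic_scraper.py uses), the `page` argument behaves like a Playwright sync `Page` (only its `evaluate` and `wait_for_timeout` methods are called), and the default round limit and pause are illustrative values.

```python
def scroll_until_stable(page, max_rounds=30, pause_ms=800):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls that actually loaded new content.
    """
    grew = 0
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give lazy-loaded content time to render
        height = page.evaluate("document.body.scrollHeight")
        if height <= last_height:  # zero delta: nothing new was appended
            break
        grew += 1
        last_height = height
    return grew
```

In a real run this would be called after `page.goto(...)` and an initial `page.wait_for_selector(...)`. The `max_rounds` cap matters on feeds that never end, where the height delta alone would never reach zero.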

3. API Reverse-Engineering

Finds and calls the backend JSON API directly: no DOM parsing and no headless browser, only plain HTTP. When a site exposes such an API, this is usually the fastest and most reliable scraping pattern:

  1. Identify the API endpoint the frontend calls
  2. Call it directly with proper parameters
  3. Parse clean JSON instead of HTML

Also demonstrates comment-thread fetching — practical for social-media monitoring use cases.
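
The comment-thread idea might look something like the sketch below, which is not the code from api_reverse.py. It uses the public Hacker News endpoints under `https://hacker-news.firebaseio.com/v0` (`item/<id>.json`, `topstories.json`). The helper names `get_json` and `fetch_thread`, the `children` key, and the depth limit are assumptions made for illustration, and `urllib` is used instead of `requests` to keep the example dependency-free.

```python
import json
from urllib.request import urlopen

HN_BASE = "https://hacker-news.firebaseio.com/v0"


def get_json(path, timeout=10):
    """GET a path under the Hacker News API and decode the JSON body."""
    with urlopen(f"{HN_BASE}/{path}.json", timeout=timeout) as resp:
        return json.load(resp)


def fetch_thread(item_id, fetch=get_json, max_depth=2):
    """Return an item with its comments nested under "children", depth-limited."""
    item = fetch(f"item/{item_id}")
    # The API returns null for unknown ids; deleted/dead items carry a flag.
    if item is None or item.get("deleted") or item.get("dead"):
        return None
    node = {key: item.get(key) for key in ("id", "by", "title", "text")}
    node["children"] = []
    if max_depth > 0:
        for kid_id in item.get("kids", []):
            child = fetch_thread(kid_id, fetch, max_depth - 1)
            if child is not None:
                node["children"].append(child)
    return node
```

A typical call would take a few ids from `get_json("topstories")` and pass each to `fetch_thread`. Each item costs one HTTP request, so the depth limit is what keeps large threads from turning into thousands of calls.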

Anti-Detection (utils/stealth.py)

Without these, real-world scrapers often get blocked or throttled with no obvious error:

  • Full header set: Sec-Ch-Ua, Accept-Language, Accept-Encoding — sites check the combination, not just UA
  • Jittered delays: random intervals between requests (perfectly regular timing is an easy bot signal)
  • Exponential backoff: on HTTP 429, wait 2s, 4s, 8s (plus jitter) before retrying
  • Session reuse: connection pooling across requests
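
The backoff schedule described above (2s, 4s, 8s plus jitter) can be sketched like this. The contents of utils/stealth.py are not shown here, so the function names, the 0.5s jitter ceiling, and the choice to also retry on 503 are assumptions. `send` is any zero-argument callable returning an object with a `status_code` attribute, for example a lambda wrapping `session.get(url)`.

```python
import random
import time


def backoff_delay(attempt, base=2.0, jitter=0.5, rng=random.random):
    """Delay before retry number `attempt` (0-based): base * 2**attempt plus jitter."""
    return base * (2 ** attempt) + rng() * jitter


def request_with_backoff(send, max_retries=3, retry_statuses=(429, 503), sleep=time.sleep):
    """Call send() until it returns a non-retryable status or retries run out.

    Returns the last response either way, so the caller can inspect it.
    """
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code not in retry_statuses:
            return response
        if attempt < max_retries:
            sleep(backoff_delay(attempt))
    return response
```

Passing `sleep` and `rng` as parameters is a testing convenience: the retry logic can be exercised without real waits or real randomness.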

Export Formats (utils/exporters.py)

One function call per format:

from utils import exporters

exporters.to_csv(rows, "my_data")
exporters.to_json(rows, "my_data")
exporters.to_excel(rows, "my_data")
# → output/my_data_<timestamp>.{csv|json|xlsx}
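
For readers curious what such exporters might do internally, here is a rough standard-library sketch. It is not the actual utils/exporters.py: the `out_dir` parameter, the `_timestamped_path` helper, and the returned path are assumptions, and `to_excel` is left out because it would depend on openpyxl. The timestamp format follows the sample filenames shown earlier (for example `books_catalog_20260612_143022.csv`).

```python
import csv
import json
from datetime import datetime
from pathlib import Path


def _timestamped_path(name, ext, out_dir="output"):
    """Build <out_dir>/<name>_<YYYYmmdd_HHMMSS>.<ext>, creating the folder if needed."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return folder / f"{name}_{stamp}.{ext}"


def to_csv(rows, name, out_dir="output"):
    """Write a list of dicts to CSV; the header is the union of keys in first-seen order."""
    path = _timestamped_path(name, "csv", out_dir)
    fields = list(dict.fromkeys(key for row in rows for key in row))
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)  # keys missing from a row are written as empty cells
    return path


def to_json(rows, name, out_dir="output"):
    """Write a list of dicts to pretty-printed UTF-8 JSON."""
    path = _timestamped_path(name, "json", out_dir)
    path.write_text(json.dumps(rows, ensure_ascii=False, indent=2), encoding="utf-8")
    return path
```

Taking the union of keys across all rows means detail-enriched rows and catalog-only rows can share one file without a fixed schema.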

Tech Stack

requests · BeautifulSoup4 · lxml · Playwright · openpyxl

Project Structure

scraper-portfolio/
├── examples/
│   ├── ecommerce_scraper.py    # Static HTML scraping
│   ├── dynamic_scraper.py      # JS-rendered / headless browser
│   └── api_reverse.py          # API discovery + direct calls
├── utils/
│   ├── stealth.py              # Anti-detection toolkit
│   └── exporters.py            # CSV / JSON / Excel exporters
├── output/                     # Generated data files
├── requirements.txt
└── README.md

Real-World Experience

Beyond these demos, I've worked on:

  • E-commerce price monitoring: automated daily scraping with proxy rotation, JS cookie fingerprinting, and reverse-engineered request signing (MD5-based parameter signatures)
  • Lead generation pipelines: multi-source data aggregation with deduplication
  • Anti-detection: TLS fingerprint matching, Cloudflare bypass strategies, fake-data detection

If you need custom data extraction, monitoring, or automation — let's talk.


Available for freelance Python scraping & automation projects.
