pyspark-api-lab

A small lab that fetches data from an HTTP API and writes it to Parquet. The goal is to practice simple ingestion and basic validation, and to write tests you can run locally.

Features

  • Fetch JSON from an API, with retries and exponential backoff
  • Convert the records to a Pandas DataFrame and write Parquet
  • Unit tests that mock the API (no real network calls)
  • Simple, CLI-driven runner
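The fetch-and-normalize step might look roughly like this. The function name fetch_api_data matches the one mentioned in the Tests section below, but the exact signature is an assumption:

```python
import time

import requests


def fetch_api_data(url, params=None, headers=None,
                   timeout_sec=20, retries=3, backoff_sec=1.5):
    """Fetch JSON from `url`, retrying failed requests with exponential backoff.

    A single JSON object is normalized to a one-element list so downstream
    code can always assume a list of records.
    """
    last_err = None
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, headers=headers,
                                timeout=timeout_sec)
            resp.raise_for_status()
            data = resp.json()
            return data if isinstance(data, list) else [data]
        except requests.RequestException as err:
            last_err = err
            time.sleep(backoff_sec * (2 ** attempt))  # e.g. 1.5s, 3s, 6s, ...
    raise RuntimeError(f"API fetch failed after {retries} attempts") from last_err
```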

Requirements

  • Python 3.10+ (macOS)
  • Recommended: virtualenv
  • Packages (install via requirements.txt): requests, PyYAML, pandas, pyarrow, pytest
  • Optional (for future Spark work): pyspark

Setup (macOS)

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Configuration

Create or edit config.yaml:

api_url: "https://jsonplaceholder.typicode.com/posts"
api_params: {}
api_headers: {}
timeout_sec: 20
retries: 3
backoff_sec: 1.5

app_name: "pyspark_api_lab"
output_path: "data/out.parquet"

Tip: Keep secrets out of the file; pass tokens via env vars and build headers in code if needed.
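For example, a token read from an environment variable can be merged into the configured headers; the variable name API_TOKEN here is only an illustration:

```python
import os


def build_headers(base_headers=None):
    """Merge config-file headers with a bearer token from the environment.

    API_TOKEN is a hypothetical variable name; use whatever your API expects.
    """
    headers = dict(base_headers or {})
    token = os.environ.get("API_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```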

Run the pipeline

python -m src.main --config config.yaml

This will:

  • Fetch records from api_url
  • Write them as Parquet to output_path (creating folders as needed)

Tests

Run all tests:

pytest -q

What the tests do:

  • tests/test_api.py
    • Verifies a single JSON object is normalized to a list
    • Verifies retries/backoff and raises after exhaustion (no real HTTP; requests.get is mocked)
    • Note: You do NOT need to change the URL used in the test; it’s mocked.
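The mocking pattern those tests rely on can be sketched like this. The fetch function is inlined so the example is self-contained; the real tests would import it from src.fetch_api instead:

```python
from unittest import mock

import requests


def fetch_json(url):
    # Stand-in for the project's fetch function: normalizes one object to a list.
    resp = requests.get(url, timeout=20)
    resp.raise_for_status()
    data = resp.json()
    return data if isinstance(data, list) else [data]


def test_single_object_normalized_to_list():
    with mock.patch("requests.get") as fake_get:
        fake_get.return_value.raise_for_status.return_value = None
        fake_get.return_value.json.return_value = {"id": 1}
        assert fetch_json("https://example.test/posts") == [{"id": 1}]
        fake_get.assert_called_once()  # no real HTTP happened
```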

Optional test you can add:

  • tests/test_main.py
    • Mocks fetch_api_data, runs src.main.run() with a temp config, asserts a Parquet file is written and readable.

Project layout

pyspark-api-lab/
├── README.md
├── requirements.txt
├── config.yaml
├── src/
│   ├── __init__.py
│   ├── fetch_api.py      # HTTP client with retries/backoff + normalization
│   └── main.py           # Reads config, fetches, writes Parquet
└── tests/
    └── test_api.py       # Unit tests for API client (mocked)

Troubleshooting

  • ModuleNotFoundError: pyarrow
    • Install pyarrow (it’s required for Parquet): pip install pyarrow
  • Empty output file or process exits with code 2
    • The API returned no records; check api_url/filters or try another endpoint
  • SSL/HTTP errors
    • Test with curl first; adjust headers/timeouts in config.yaml

Next steps

  • Add a test for main.run (end-to-end write) as tests/test_main.py
  • Introduce PySpark transforms and corresponding tests
  • Parameterize incremental fetch (by updated_at) and add idempotent writes
