A small lab for fetching data from an HTTP API and writing it to Parquet. The goal is to practice simple ingestion and basic validation, and to add tests you can run locally.
- Fetch JSON from an API (with retries/backoff in code)
- Convert to Pandas and write Parquet
- Unit tests that mock the API (no real network calls)
- Simple, CLI-driven runner
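As a sketch of the first two bullets, the fetch client with retries/backoff and normalization might look like this (the exact signature and the backoff policy are assumptions; the lab's `src/fetch_api.py` may differ):

```python
import time

import requests


def fetch_api_data(url, params=None, headers=None,
                   timeout_sec=20, retries=3, backoff_sec=1.5):
    """Fetch JSON from `url`, retrying failed requests with backoff.

    A single JSON object is normalized to a one-element list so the
    caller always receives a list of records.
    """
    last_err = None
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, headers=headers,
                                timeout=timeout_sec)
            resp.raise_for_status()
            data = resp.json()
            # Normalize: a single object becomes a one-element list
            return data if isinstance(data, list) else [data]
        except requests.RequestException as err:
            last_err = err
            if attempt < retries - 1:
                # Grow the delay on each retry (assumed policy)
                time.sleep(backoff_sec * (attempt + 1))
    raise RuntimeError(f"API fetch failed after {retries} attempts") from last_err
```

The `timeout_sec`, `retries`, and `backoff_sec` parameters line up with the config keys below, so `main` can pass them straight through.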
- Python 3.10+ (macOS)
- Recommended: virtualenv
- Packages (install via requirements.txt): requests, PyYAML, pandas, pyarrow, pytest
- Optional (for future Spark work): pyspark
```sh
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create or edit config.yaml:
```yaml
api_url: "https://jsonplaceholder.typicode.com/posts"
api_params: {}
api_headers: {}
timeout_sec: 20
retries: 3
backoff_sec: 1.5
app_name: "pyspark_api_lab"
output_path: "data/out.parquet"
```

Tip: keep secrets out of the file; pass tokens via env vars and build headers in code if needed.
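The env-var tip above can be sketched as a small helper that merges the config headers with a token from the environment. Both `build_headers` and the `API_TOKEN` variable name are hypothetical; adapt them to your provider.

```python
import os


def build_headers(base_headers=None):
    """Merge config-file headers with a bearer token from the environment.

    API_TOKEN is a hypothetical variable name; the config file itself
    never needs to contain the secret.
    """
    headers = dict(base_headers or {})
    token = os.environ.get("API_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```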
```sh
python -m src.main --config config.yaml
```

This will:
- Fetch records from api_url
- Write them as Parquet to output_path (creating folders as needed)
Run all tests:
```sh
pytest -q
```

What the tests do:
- tests/test_api.py
  - Verifies that a single JSON object is normalized to a list
  - Verifies retries/backoff and that an error is raised after retries are exhausted (no real HTTP; requests.get is mocked)
  - Note: you do NOT need to change the URL used in the test; it's mocked.
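The retry tests can follow this pattern. The `fetch_api_data` below is a minimal stand-in so the example is self-contained; in the lab you would import it from `src.fetch_api` instead (its real signature may differ):

```python
from unittest import mock

import requests


def fetch_api_data(url, retries=3, timeout_sec=5):
    """Minimal stand-in for the lab's API client: retry, then re-raise."""
    for attempt in range(retries):
        try:
            resp = requests.get(url, timeout=timeout_sec)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise


def test_retries_then_succeeds():
    ok = mock.Mock()
    ok.json.return_value = [{"id": 1}]
    # Fail twice, succeed on the third call -- no real network involved.
    with mock.patch("requests.get",
                    side_effect=[requests.ConnectionError("boom"),
                                 requests.ConnectionError("boom"),
                                 ok]) as fake_get:
        assert fetch_api_data("https://example.invalid") == [{"id": 1}]
        assert fake_get.call_count == 3


def test_raises_after_exhaustion():
    with mock.patch("requests.get",
                    side_effect=requests.ConnectionError("boom")):
        try:
            fetch_api_data("https://example.invalid", retries=2)
            assert False, "expected ConnectionError"
        except requests.ConnectionError:
            pass
```

If the real client sleeps between attempts, also patch `time.sleep` so the suite stays fast.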
Optional test you can add:
- tests/test_main.py
  - Mocks fetch_api_data, runs src.main.run() with a temp config, and asserts that a Parquet file is written and readable.
```
pyspark-api-lab/
├── README.md
├── requirements.txt
├── config.yaml
├── src/
│   ├── __init__.py
│   ├── fetch_api.py   # HTTP client with retries/backoff + normalization
│   └── main.py        # Reads config, fetches, writes Parquet
└── tests/
    └── test_api.py    # Unit tests for API client (mocked)
```

- ModuleNotFoundError: pyarrow
  - Install pyarrow (it's required for Parquet): `pip install pyarrow`
- Empty output file, or the process exits with code 2
  - The API returned no records; check api_url/filters or try another endpoint
- SSL/HTTP errors
  - Test with curl first; adjust headers/timeouts in config.yaml
- Add a test for main.run (end-to-end write) as tests/test_main.py
- Introduce PySpark transforms and corresponding tests
- Parameterize incremental fetch (by updated_at) and add idempotent writes