pyspark-api-lab

A small lab that fetches data from an HTTP API and writes it to Parquet. The goal is to practice simple ingestion and basic validation, and to write tests you can run locally.

Features

  • Fetch JSON from an API, with retries and exponential backoff
  • Convert the records to a Pandas DataFrame and write Parquet
  • Unit tests that mock the API (no real network calls)
  • Simple, CLI-driven runner
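The fetch-and-normalize step might look roughly like this. The function name fetch_api_data matches the one mentioned in the Tests section below, but the exact signature is an assumption:

```python
import time

import requests


def fetch_api_data(url, params=None, headers=None,
                   timeout_sec=20, retries=3, backoff_sec=1.5):
    """Fetch JSON from `url`, retrying failed requests with exponential backoff.

    A single JSON object is normalized to a one-element list so downstream
    code can always assume a list of records.
    """
    last_err = None
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, headers=headers,
                                timeout=timeout_sec)
            resp.raise_for_status()
            data = resp.json()
            return data if isinstance(data, list) else [data]
        except requests.RequestException as err:
            last_err = err
            time.sleep(backoff_sec * (2 ** attempt))  # e.g. 1.5s, 3s, 6s, ...
    raise RuntimeError(f"API fetch failed after {retries} attempts") from last_err
```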

Requirements

  • Python 3.10+ (macOS)
  • Recommended: virtualenv
  • Packages (install via requirements.txt): requests, PyYAML, pandas, pyarrow, pytest
  • Optional (for future Spark work): pyspark

Setup (macOS)

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Configuration

Create or edit config.yaml:

api_url: "https://jsonplaceholder.typicode.com/posts"
api_params: {}
api_headers: {}
timeout_sec: 20
retries: 3
backoff_sec: 1.5

app_name: "pyspark_api_lab"
output_path: "data/out.parquet"

Tip: Keep secrets out of the file; pass tokens via env vars and build headers in code if needed.
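For example, a token read from an environment variable can be merged into the configured headers; the variable name API_TOKEN here is only an illustration:

```python
import os


def build_headers(base_headers=None):
    """Merge config-file headers with a bearer token from the environment.

    API_TOKEN is a hypothetical variable name; use whatever your API expects.
    """
    headers = dict(base_headers or {})
    token = os.environ.get("API_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers
```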

Run the pipeline

python -m src.main --config config.yaml

This will:

  • Fetch records from api_url
  • Write them as Parquet to output_path (creating folders as needed)

Tests

Run all tests:

pytest -q

What the tests do:

  • tests/test_api.py
    • Verifies a single JSON object is normalized to a list
    • Verifies retries/backoff and raises after exhaustion (no real HTTP; requests.get is mocked)
    • Note: You do NOT need to change the URL used in the test; it’s mocked.
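The mocking pattern those tests rely on can be sketched like this. The fetch function is inlined so the example is self-contained; the real tests would import it from src.fetch_api instead:

```python
from unittest import mock

import requests


def fetch_json(url):
    # Stand-in for the project's fetch function: normalizes one object to a list.
    resp = requests.get(url, timeout=20)
    resp.raise_for_status()
    data = resp.json()
    return data if isinstance(data, list) else [data]


def test_single_object_normalized_to_list():
    with mock.patch("requests.get") as fake_get:
        fake_get.return_value.raise_for_status.return_value = None
        fake_get.return_value.json.return_value = {"id": 1}
        assert fetch_json("https://example.test/posts") == [{"id": 1}]
        fake_get.assert_called_once()  # no real HTTP happened
```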

Optional test you can add:

  • tests/test_main.py
    • Mocks fetch_api_data, runs src.main.run() with a temp config, asserts a Parquet file is written and readable.

Project layout

pyspark-api-lab/
├── README.md
├── requirements.txt
├── config.yaml
├── src/
│   ├── __init__.py
│   ├── fetch_api.py      # HTTP client with retries/backoff + normalization
│   └── main.py           # Reads config, fetches, writes Parquet
└── tests/
    └── test_api.py       # Unit tests for API client (mocked)

Troubleshooting

  • ModuleNotFoundError: pyarrow
    • Install pyarrow (it’s required for Parquet): pip install pyarrow
  • Empty output file or process exits with code 2
    • The API returned no records; check api_url/filters or try another endpoint
  • SSL/HTTP errors
    • Test with curl first; adjust headers/timeouts in config.yaml

Next steps

  • Add a test for main.run (end-to-end write) as tests/test_main.py
  • Introduce PySpark transforms and corresponding tests
  • Parameterize incremental fetch (by updated_at) and add idempotent writes
