An asynchronous ETL pipeline for fetching, transforming, and loading animal data into a target system.
This project demonstrates clean architecture with modular design, retry logic, concurrency, and batch posting.
- Async HTTP Client (`http_client.py`)
  - Built on top of `httpx.AsyncClient`.
  - Retry logic with exponential backoff & jitter.
  - Validation-aware error handling (HTTP `422`).
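The retry behavior can be sketched roughly as follows (a minimal sketch; the actual helper names and parameters in `http_client.py` may differ):

```python
import asyncio
import random

async def request_with_retry(send, max_retries=3, base_delay=0.5):
    """Retry an async callable with exponential backoff and full jitter.

    `send` is any zero-argument async callable (a hypothetical stand-in
    for an httpx request); the parameter names here are assumptions.
    """
    for attempt in range(max_retries + 1):
        try:
            return await send()
        except Exception:
            if attempt == max_retries:
                raise
            # Exponential backoff: base * 2^attempt, with full jitter.
            delay = base_delay * (2 ** attempt)
            await asyncio.sleep(random.uniform(0, delay))
```

Full jitter (a random delay between zero and the backoff ceiling) spreads retries out so many clients recovering from the same outage don't hammer the API in lockstep.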
- API Layer (`api.py`)
  - Wraps the `animals/v1` endpoints with typed interfaces.
  - Gracefully handles non-JSON responses.
- Pipeline (`pipeline.py`)
  - Concurrent fetching of animal details.
  - Transformation:
    - `friends` → split into a list of strings.
    - `born_at` → epoch timestamp → ISO 8601 UTC timestamp.
  - Batch posting (≤100 records per request).
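Under stated assumptions about the field shapes (a comma-delimited `friends` string and `born_at` as epoch seconds), the transform and batching steps might look like:

```python
from datetime import datetime, timezone

def transform(record: dict) -> dict:
    """Transform one raw animal record (field shapes are assumptions)."""
    out = dict(record)
    # `friends` arrives as a comma-delimited string -> list of strings.
    out["friends"] = [f for f in record.get("friends", "").split(",") if f]
    # `born_at` arrives as an epoch timestamp -> ISO 8601 UTC string.
    born_at = record.get("born_at")
    if born_at is not None:
        out["born_at"] = datetime.fromtimestamp(born_at, tz=timezone.utc).isoformat()
    return out

def batches(items: list, size: int = 100):
    """Yield chunks of at most `size` records for batch POSTing."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```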
- CLI Entrypoint (`cli.py`)
  - Configurable via arguments or environment.
  - Prints a runtime config summary.
  - Runs fetch → transform → load.
- Typed Models (`models.py`)
  - `TypedDict`s for raw, detailed, and transformed records.
  - Extensible for schema evolution.
- Testing
  - Pytest-based.
  - Fake async clients for retry tests.
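A retry test with a fake async client could follow this pattern (a self-contained sketch; the fake client and the minimal retrying fetch here are assumptions, not the project's actual test code):

```python
import asyncio

class FakeClient:
    """Stand-in for httpx.AsyncClient: fails N times, then succeeds."""
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    async def get(self, url: str):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated network failure")
        return {"status": 200, "url": url}

async def fetch_with_retry(client, url, max_retries=3):
    """Minimal retrying fetch, defined inline just for this sketch."""
    for attempt in range(max_retries + 1):
        try:
            return await client.get(url)
        except ConnectionError:
            if attempt == max_retries:
                raise

def test_retries_then_succeeds():
    client = FakeClient(failures=2)
    response = asyncio.run(fetch_with_retry(client, "/animals/v1/animals?page=1"))
    assert response["status"] == 200
    assert client.calls == 3
```

With pytest installed, `pytest -q` would collect and run this; it also runs as a plain function, since it uses no pytest-specific fixtures.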
- Python ≥ 3.10
- Dependencies: `httpx`, `pytest`
- `make` is usually pre-installed on macOS / Linux
- Run the challenge API locally:

  ```sh
  docker load -i lp-programming-challenge-1-1625610904.tar.gz
  docker run --rm -p 3123:3123 -ti lp-programming-challenge-1
  ```
- Clone the repository:

  ```sh
  git clone git@github.com:meghna0593/Project-Fauna.git
  cd project-fauna
  ```

- Create a virtual environment and install dependencies: `make setup`
- Execute the pipeline with optional arguments (use `make help` to understand how the commands work): `make run`
- Run tests: `make test`
- Remove venv, cache, and build artifacts: `make clean`
- Create and activate a virtual environment: `python3 -m venv .venv && source .venv/bin/activate`
- (Optional) Load env vars if you have a `.env` file: `source .env`
- Install dependencies: `pip3 install -r requirements.txt`
- Editable install so imports work: `pip3 install -e .`
- Run the barebones pipeline: `animals-etl`
- Run tests: `pytest -q`
- Exit the virtual environment: `deactivate`
- Create and activate a virtual environment: `python3 -m venv .venv && source .venv/bin/activate`
- (Optional) Load env vars if you have a `.env` file: `source .env`
- Install dependencies: `pip3 install -r requirements.txt`
- Run the ETL script: `python3 scripts/animals_etl.py`
- The ETL pipeline leverages Python `asyncio` + `httpx` for concurrent fetch / transform / post.
- Running under Uvicorn preserves async concurrency.
- This is an I/O-bound workload (HTTP calls to the challenge API), so `asyncio` was chosen.
- For CPU-heavy workloads, we can consider offloading parts to threads or processes.
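The concurrent-fetch idea reduces to gathering bounded coroutines. A minimal sketch, with a stub `fetch_detail` standing in for the real `httpx` call (the function names and the concurrency limit are assumptions):

```python
import asyncio

async def fetch_detail(animal_id: int) -> dict:
    # Stub standing in for an httpx GET to the detail endpoint.
    await asyncio.sleep(0)  # simulate I/O
    return {"id": animal_id}

async def fetch_all(ids: list[int], limit: int = 10) -> list[dict]:
    """Fetch details concurrently, capping in-flight requests with a semaphore."""
    sem = asyncio.Semaphore(limit)

    async def bounded(animal_id: int) -> dict:
        async with sem:
            return await fetch_detail(animal_id)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(i) for i in ids))
```

The semaphore keeps the event loop from opening hundreds of simultaneous connections against the (deliberately flaky) challenge API while still overlapping I/O waits.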
- Add more modules (`birds-etl`, `plants-etl`, etc.) with the same pipeline structure.
- Abstract pipeline stages into reusable building blocks.
- Ensure re-runs do not duplicate loads by:
  - Storing processed IDs in a local store (`sqlite` or a JSON file).
  - Implementing "upsert" semantics on POST.
- Add optional persistence:
  - `sqlite` backend for checkpoints and processed state.
  - JSON-based fallback for local runs.
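One way to make re-runs idempotent is a tiny `sqlite` store of processed IDs; a sketch under assumptions (the table layout and function names are hypothetical, not existing project code):

```python
import sqlite3

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint store of already-processed IDs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (id INTEGER PRIMARY KEY)")
    return conn

def mark_processed(conn: sqlite3.Connection, ids: list[int]) -> None:
    # INSERT OR IGNORE gives upsert-like semantics: re-marking is a no-op.
    conn.executemany("INSERT OR IGNORE INTO processed (id) VALUES (?)",
                     [(i,) for i in ids])
    conn.commit()

def filter_unprocessed(conn: sqlite3.Connection, ids: list[int]) -> list[int]:
    """Return only the IDs that have not been loaded yet."""
    seen = {row[0] for row in conn.execute("SELECT id FROM processed")}
    return [i for i in ids if i not in seen]
```

Pointing `open_store` at a file path instead of `:memory:` gives persistence across runs; the JSON fallback would expose the same three functions.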
- Use Pydantic models instead of `TypedDict` for:
  - Schema validation.
  - Input/output coercion.
  - Better error messages.
- POSTs are currently not parallelized (batches are fast).
- Add configurable concurrent POSTs if API latency grows.
- Code Quality: integrate linting and formatting.
- Branching Strategy: dev → staging → main.
- PR Template: enforce consistent review process.
- Health Check Endpoint: CLI or `/healthz` probe.
- Dockerize: containerize for reproducible deployments.