Welcome to ETL Problems, an open-source project designed for learning, experimenting, and contributing to real-world data engineering workflows.
This repository contains a deliberately broken ETL pipeline that mimics issues data engineers face daily. The goal is for contributors to identify, fix, and enhance the pipeline — while learning best practices in data extraction, transformation, and loading.
The pipeline follows a simple ETL flow:
- Extract → Reads data from a CSV file (with encoding fallback).
- Transform → Cleans, deduplicates, and prepares the dataset.
- Load → Stores processed data into an SQLite database (with idempotency).
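The three stages above can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the function names, the `records` table, and the `id`/`name` columns are assumptions made for the example, and rows are assumed to be well-formed (no extra columns).

```python
import csv
import sqlite3


def extract(path, encodings=("utf-8", "latin-1")):
    """Read a CSV into a list of dicts, trying each encoding in turn."""
    for encoding in encodings:
        try:
            with open(path, newline="", encoding=encoding) as handle:
                return list(csv.DictReader(handle))
        except UnicodeDecodeError:
            continue  # fall back to the next encoding
    raise ValueError(f"could not decode {path} with any of {encodings}")


def transform(rows):
    """Strip whitespace from values and drop exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = tuple((key, (value or "").strip()) for key, value in row.items())
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(dict(normalized))
    return cleaned


def load(rows, db_path):
    """Write rows keyed on 'id' so that re-running never duplicates data."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, name TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO records (id, name) VALUES (:id, :name)",
                rows,
            )
    finally:
        conn.close()
```

Idempotency here comes from the primary key plus `INSERT OR REPLACE`: loading the same rows twice leaves the table unchanged. Other designs, such as truncating the table before each load, would also work.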
The pipeline's bugs are intentionally introduced and marked in the code with comments of the form
# TODO (Find & Fix): ...
Contributors should search for these comments and fix the issues. The bugs include:
- Unused imports
- Incorrect default values
- Wrong file extension checks
- Missing error handling
- Print statements instead of logging
- Missing idempotency in database load
- No duplicate removal in transform
- Missing actual logic in extract/transform/load steps
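As an illustration of what fixing a few of these bug types might look like, the sketch below combines a correct file extension check, explicit error handling, and `logging` in place of `print`. The function names and the `etl` logger name are hypothetical and not taken from the repository.

```python
import logging
from pathlib import Path

logger = logging.getLogger("etl")


def validate_csv_path(path):
    """Return a Path if it points to an existing .csv file, else raise."""
    candidate = Path(path)
    # Compare the real suffix, case-insensitively, rather than a substring.
    if candidate.suffix.lower() != ".csv":
        raise ValueError(f"expected a .csv file, got {candidate.name!r}")
    if not candidate.is_file():
        raise FileNotFoundError(f"input file not found: {candidate}")
    return candidate


def safe_extract_path(path):
    """Validate the input path, logging failures instead of printing them."""
    try:
        validated = validate_csv_path(path)
    except (ValueError, FileNotFoundError) as exc:
        logger.error("Extraction aborted: %s", exc)
        return None
    logger.info("Reading %s", validated)
    return validated
```

Calling `logging.basicConfig(level=logging.INFO)` once at program start is enough to see these messages on the console.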
Ways to contribute:
- Fix bugs marked with # TODO (Find & Fix): ...
- Improve error handling and logging
- Add tests and validation
- Enhance documentation
- Add new features (scrapers, data quality checks, visualizations)
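For the tests and validation item, a contribution might start from something like the following. It is only a sketch: `find_invalid_rows`, the required fields `id` and `name`, and the file name are assumptions for the example, and the check is defined inline so the file is self-contained. The test functions use plain `assert` statements, which pytest collects from any function whose name starts with `test_`.

```python
# tests/test_validation.py (hypothetical file name)


def _is_blank(value):
    """Treat None and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def find_invalid_rows(rows, required=("id", "name")):
    """Return (row_index, missing_fields) for rows lacking a required value."""
    problems = []
    for index, row in enumerate(rows):
        missing = [field for field in required if _is_blank(row.get(field))]
        if missing:
            problems.append((index, missing))
    return problems


def test_complete_rows_pass():
    assert find_invalid_rows([{"id": "1", "name": "Ana"}]) == []


def test_missing_and_blank_values_are_reported():
    rows = [{"id": "1", "name": "  "}, {"name": "Bo"}]
    assert find_invalid_rows(rows) == [(0, ["name"]), (1, ["id"])]
```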
Clone the repo and install dependencies:
git clone https://github.com/<your-username>/etl-problems.git
cd etl-problems
pip install -r requirements.txt
Run the pipeline:
python main.py
Unit tests can be added in the tests/ folder.
Run them with:
pytest tests/
- Search for # TODO (Find & Fix): ... in the codebase.
- Check the Issues for tasks and guidance.
- If you find a new bug, open an issue and suggest a fix.
- All contributions, big or small, are welcome!
Open an issue or start a discussion in the repo. Happy hacking!