charlesrobison/vendee-globe-databricks

Vendee Globe Race Tracker

A data engineering project that ingests, processes, and analyzes data from the official Vendée Globe API using a modern data pipeline architecture.

The project demonstrates orchestration with Prefect and raw-to-curated modeling with the Bronze/Silver/Gold pattern. It is designed to be Databricks-ready for cloud-scale execution.


🚀 Project Architecture

airflow/                     # Legacy DAG version (kept for reference)
databricks_notebooks/
  00_fetch_vendee_data.py    # Fetch raw race snapshots from the API
  01_bronze_ingest.py        # Ingest raw JSON → Bronze
  02_silver_transform.py     # Clean + normalize → Silver
  03_gold_models.py          # Analytics tables → Gold
scripts/
  inspect_latest.py          # Quick JSON inspection utility
data/
  raw/                       # Raw snapshots from API
  processed/                 # Local Bronze/Silver/Gold outputs
docs/
  pipeline_architecture.md   # Documentation for design decisions
prefect_pipeline.py          # Prefect orchestration flow
run_pipeline.py              # Simple sequential runner (local simulation)
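
As a rough illustration of what a simple sequential runner such as run_pipeline.py could look like, here is a minimal sketch in plain Python. The four stage functions and the `run_pipeline` helper are hypothetical placeholders, not the repository's actual code; each stage just receives the previous stage's output.

```python
from typing import Any, Callable

# Hypothetical stand-ins for the four stages; the real scripts live in
# databricks_notebooks/ and are not reproduced here.
def fetch(_: Any) -> dict:
    return {"boats": [{"name": "A", "speed": "21.8 kts"}]}

def bronze(snapshot: dict) -> list:
    return list(snapshot["boats"])

def silver(rows: list) -> list:
    return [{**r, "speed": float(r["speed"].split()[0])} for r in rows]

def gold(rows: list) -> list:
    return sorted(rows, key=lambda r: r["speed"], reverse=True)

STAGES: list = [
    ("fetch", fetch),
    ("bronze", bronze),
    ("silver", silver),
    ("gold", gold),
]

def run_pipeline(stages: list = STAGES) -> Any:
    """Run each stage in order, feeding its output to the next one."""
    result = None
    for name, stage in stages:
        print(f"running {name}")
        result = stage(result)  # an exception here stops the whole run
    return result

if __name__ == "__main__":
    print(run_pipeline())
```

An orchestrator such as Prefect adds scheduling, retries, and observability on top of the same ordering idea.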

📊 Pipeline Flow

  1. Fetch
     • Calls the official Vendée Globe API
     • Saves raw JSON snapshots to data/raw/
  2. Bronze Layer
     • Loads JSON snapshots
     • Explodes boat data into structured rows
  3. Silver Layer
     • Cleans and normalizes data
     • Converts lat/lon into decimal degrees
     • Extracts numeric values from text fields (21.8 kts → 21.8)
  4. Gold Layer
     • Produces analytics-ready tables
     • Leaderboards, rank deltas, rolling averages
  5. Orchestration (Prefect)
     • Handles scheduling and execution order
     • Can run locally or be deployed to Prefect Cloud
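
The Bronze and Silver steps above can be sketched in plain Python. This is only an illustration: the project itself uses PySpark, and the snapshot shape, the field names, and the `46°30.00'N` coordinate format are assumptions, since the API payload is not shown here. The helper names are hypothetical.

```python
import re
from typing import Optional

# Assumed snapshot shape; the real API payload may differ.
SNAPSHOT = {
    "timestamp": "2024-11-10T12:00:00Z",
    "boats": [
        {"skipper": "Skipper A", "speed": "21.8 kts",
         "lat": "46°30.00'N", "lon": "001°45.00'W"},
    ],
}

def explode_boats(snapshot: dict) -> list:
    """Bronze: one row per boat, tagged with the snapshot timestamp."""
    return [{"timestamp": snapshot["timestamp"], **boat}
            for boat in snapshot["boats"]]

def extract_number(text: str) -> Optional[float]:
    """Silver: pull the first numeric value out of a text field."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

# Assumed coordinate format: degrees, decimal minutes, hemisphere letter.
_COORD = re.compile(r"(\d+)°(\d+(?:\.\d+)?)'([NSEW])")

def to_decimal_degrees(coord: str) -> float:
    """Silver: convert degrees + decimal minutes to signed decimal degrees."""
    match = _COORD.fullmatch(coord.strip())
    if match is None:
        raise ValueError(f"unrecognized coordinate: {coord!r}")
    degrees, minutes, hemisphere = match.groups()
    value = int(degrees) + float(minutes) / 60
    # South and West are negative by convention.
    return -value if hemisphere in "SW" else value

def clean_row(row: dict) -> dict:
    """Silver: keep typed, normalized columns only."""
    return {
        "timestamp": row["timestamp"],
        "skipper": row["skipper"],
        "speed_kts": extract_number(row["speed"]),
        "lat": to_decimal_degrees(row["lat"]),
        "lon": to_decimal_degrees(row["lon"]),
    }

silver_rows = [clean_row(r) for r in explode_boats(SNAPSHOT)]
```

In PySpark the same ideas would typically map to `explode` for the Bronze step and `regexp_extract` plus a cast for the numeric fields.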

🛠️ Tech Stack

  • Python 3.12
  • Prefect for orchestration
  • PySpark for scalable transformations
  • Databricks-ready architecture (Bronze/Silver/Gold pattern)
  • Git + GitHub for version control

⚡ How to Run

  1. Clone the repo:
     git clone https://github.com/<your-username>/VENDEE-GLOBE-DATABRICKS.git
     cd VENDEE-GLOBE-DATABRICKS
  2. Create a virtual environment and install dependencies:
     python -m venv vendee_env
     source vendee_env/bin/activate
     pip install -r requirements.txt
  3. Set your API key in .env:
     VGL_API_KEY=your_api_key_here
  4. Run the pipeline locally:
     python run_pipeline.py
  5. Orchestrate with Prefect:
     python prefect_pipeline.py
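
How the scripts read VGL_API_KEY is not shown here. The following is a minimal standard-library sketch of one way to do it, assuming a simple KEY=VALUE file; `read_env_file` and `get_api_key` are hypothetical helpers, and the project may well use a library such as python-dotenv instead.

```python
import os
from pathlib import Path

def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values

def get_api_key(path: str = ".env") -> str:
    """Prefer the real environment, then fall back to the .env file."""
    key = os.environ.get("VGL_API_KEY") or read_env_file(path).get("VGL_API_KEY")
    if not key:
        raise RuntimeError("VGL_API_KEY is not set; add it to .env")
    return key
```

Failing fast with a clear error when the key is missing avoids a confusing authentication failure later in the fetch step.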

📈 Example Output

  • Bronze Layer: 35 boats, raw metrics per update
  • Silver Layer: Cleaned metrics (speeds, headings, distances as floats)
  • Gold Layer: Leaderboard with ranks, trends, and deltas
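
To make the Gold output more concrete, here is a small plain-Python sketch of rank deltas and a rolling average speed per skipper. The sample rows, the window size of 2, and the `build_gold` helper are illustrative assumptions, not values or code taken from the pipeline.

```python
from collections import defaultdict, deque

# Hypothetical Silver rows: (snapshot index, skipper, rank, speed in knots).
SILVER = [
    (1, "Skipper A", 1, 20.0), (1, "Skipper B", 2, 18.0),
    (2, "Skipper A", 2, 16.0), (2, "Skipper B", 1, 22.0),
    (3, "Skipper A", 2, 18.0), (3, "Skipper B", 1, 20.0),
]

def build_gold(rows, window: int = 2) -> list:
    """Add rank_delta and a rolling mean speed for each skipper."""
    last_rank = {}
    recent = defaultdict(lambda: deque(maxlen=window))
    gold = []
    for snapshot, skipper, rank, speed in sorted(rows):
        recent[skipper].append(speed)
        speeds = recent[skipper]
        gold.append({
            "snapshot": snapshot,
            "skipper": skipper,
            "rank": rank,
            # Positive delta means the boat gained places since last snapshot.
            "rank_delta": last_rank.get(skipper, rank) - rank,
            "avg_speed": sum(speeds) / len(speeds),
        })
        last_rank[skipper] = rank
    return gold

# Leaderboard for the latest snapshot, best rank first.
leaderboard = [r for r in build_gold(SILVER) if r["snapshot"] == 3]
leaderboard.sort(key=lambda r: r["rank"])
```

In PySpark, the same logic would usually be expressed with window functions partitioned by skipper and ordered by snapshot time.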

🔮 Future Enhancements

  • Databricks integration for cloud-scale execution
  • Historical Vendée Globe datasets for richer analysis
  • Real-time dashboard (Sigma, Streamlit, or similar)
  • Automated deployment with Docker + CI/CD
