Fastest open-source tool for replicating Databases to Data Lake in Open Table Formats like Apache Iceberg. ⚡ Efficient, quick and scalable data ingestion for real-time analytics. Supporting Postgres, MongoDB and MySQL

datazip-inc/olake

OLake

The fastest open-source tool for replicating databases to Apache Iceberg. OLake provides an easy-to-use web interface and a CLI for efficient, scalable, real-time data ingestion. Visit olake.io/docs for the full documentation and benchmarks.

🧊 TL;DR: OLake — Super-fast Sync to Apache Iceberg

OLake is an open-source connector for replicating data from sources such as PostgreSQL, MySQL, MongoDB, Oracle, and Kafka to open data lakehouse formats like Apache Iceberg, at high speed and with minimal infrastructure cost.


🚀 Why OLake?

  • 🧠 Smart sync: Full + CDC replication with automatic schema discovery
  • High throughput: 46K RPS (Postgres) and 64K RPS (MySQL) on full load
  • 💾 Iceberg-native: Supports Glue, Hive, JDBC, REST catalogs
  • 🖥️ Self-serve UI: Deploy via Docker Compose and sync in minutes
  • 💸 Infra-light: No Spark, no Flink, no Kafka, no Debezium

📊 Benchmarks & possible connections

| Source → Destination | Throughput | Relative Performance | Full Report |
|---|---|---|---|
| Postgres → Iceberg | 46,262 RPS (Full load) | 101× faster than Airbyte | Full Report |
| MySQL → Iceberg | 64,334 RPS (Full load) | 9× faster than Airbyte | Full Report |
| MongoDB → Iceberg | WIP | | |
| Oracle → Iceberg | WIP | | |
| Postgres → Object Store (Parquet) | WIP | | |
| MySQL → Object Store (Parquet) | WIP | | |
| MongoDB → Object Store (Parquet) | WIP | | |
| Oracle → Object Store (Parquet) | WIP | | |

*These are preliminary results. Fully reproducible benchmark scores will be published soon.
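
To put the throughput figures in context, the sketch below converts a rate into an estimated full-load duration. It assumes that "RPS" means rows per second and that the rate stays constant for the whole load, which real workloads rarely guarantee. The one-billion-row table size and the function names are hypothetical.

```python
# Assumption: "RPS" in the benchmark table means rows per second,
# sustained at a constant rate for the entire load.
POSTGRES_FULL_LOAD_RPS = 46_262  # from the benchmark table
MYSQL_FULL_LOAD_RPS = 64_334     # from the benchmark table


def estimate_full_load_seconds(row_count: int, rows_per_second: float) -> float:
    """Return the estimated full-load duration in seconds at a constant rate."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if rows_per_second <= 0:
        raise ValueError("rows_per_second must be positive")
    return row_count / rows_per_second


def format_duration(seconds: float) -> str:
    """Format a duration as 'Hh MMm SSs', rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    rows = 1_000_000_000  # hypothetical table size
    for name, rps in [("Postgres", POSTGRES_FULL_LOAD_RPS),
                      ("MySQL", MYSQL_FULL_LOAD_RPS)]:
        print(name, format_duration(estimate_full_load_seconds(rows, rps)))
```

Under these assumptions, a billion-row table works out to roughly six hours from Postgres and a little over four hours from MySQL. Treat the result as a rough planning figure only, since the benchmark numbers are preliminary.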


🔧 Supported Sources and Destinations

Sources

| Source | Full Load | CDC | Incremental | Notes | Documentation |
|---|---|---|---|---|---|
| PostgreSQL | | wal2json | WIP | pgoutput support WIP | Postgres Docs |
| MySQL | | | | Binlog-based CDC | MySQL Docs |
| MongoDB | | | | Oplog-based CDC | MongoDB Docs |
| Oracle | | WIP | | JDBC based Full Load & Incremental | Oracle Docs |
| Kafka | WIP | WIP | WIP | | |
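
Full load and CDC are usually combined as two phases: copy a snapshot, then apply changes recorded after the snapshot position. The sketch below illustrates that general pattern with an in-memory table. It is not OLake's implementation; `ChangeEvent`, `ReplicaTable`, and the integer log position are hypothetical stand-ins for whatever a WAL, binlog, or oplog reader would provide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    """Hypothetical change record, loosely modeled on log-based CDC output."""
    position: int                         # monotonically increasing log position
    op: str                               # "insert", "update", or "delete"
    key: Any                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


@dataclass
class ReplicaTable:
    """In-memory stand-in for a destination table keyed by primary key."""
    rows: Dict[Any, Dict[str, Any]] = field(default_factory=dict)
    last_position: int = 0                # saved state: resume CDC after this

    def full_load(self, snapshot: Iterable[Dict[str, Any]],
                  key_column: str, snapshot_position: int) -> None:
        """Phase 1: copy every row and remember where the snapshot was taken."""
        self.rows = {row[key_column]: dict(row) for row in snapshot}
        self.last_position = snapshot_position

    def apply_changes(self, events: Iterable[ChangeEvent]) -> int:
        """Phase 2: apply events newer than the saved position; return the count."""
        applied = 0
        for event in sorted(events, key=lambda e: e.position):
            if event.position <= self.last_position:
                continue  # already covered by the snapshot or an earlier run
            if event.op == "delete":
                self.rows.pop(event.key, None)
            elif event.op in ("insert", "update"):
                self.rows[event.key] = dict(event.row or {})
            else:
                raise ValueError(f"unknown op: {event.op}")
            self.last_position = event.position
            applied += 1
        return applied
```

Because the table stores the last applied position, replaying the same batch of events is a no-op, which is the property that makes a sync safe to resume after an interruption.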

Destinations

| Destination Format | Supported Catalogs |
|---|---|
| Iceberg | Glue, Hive, JDBC, REST (Nessie, Polaris, Unity, Lakekeeper, AWS S3 tables) |
| Parquet | Filesystem |
| Other formats | 🔜 Planned: Delta Lake, Hudi |

Writer Docs
  1. Apache Iceberg Docs

    1. Catalogs
      1. AWS Glue Catalog
      2. REST Catalog
      3. JDBC Catalog
      4. Hive Catalog
    2. Azure ADLS Gen2
    3. Google Cloud Storage (GCS)
    4. MinIO (local)
    5. Iceberg Table Management
      1. S3 Tables Supported
  2. Parquet Writer

    1. AWS S3 Docs
    2. Google Cloud Storage (GCS)
    3. Local FileSystem Docs

🧪 Quickstart (UI + Docker)

OLake UI is a web-based interface for managing OLake jobs, sources, destinations, and configurations. You can run the entire OLake stack (UI, Backend, and all dependencies) using Docker Compose. This is the recommended way to get started. Run the UI, connect your source DB, and start syncing in minutes.

curl -sSL https://raw.githubusercontent.com/datazip-inc/olake-ui/master/docker-compose.yml | docker compose -f - up -d

Access the UI:

  • OLake UI: http://localhost:8000
  • Log in with the default credentials: admin / password.

A detailed getting-started guide for the OLake UI is available in the documentation at olake.io/docs.
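
Containers can take a moment to start after `docker compose up -d` returns, so a script that depends on the UI may want to wait until the address answers. The following is a minimal sketch using only the Python standard library. The helper names, retry count, and delay are assumptions, and it only checks that something responds at http://localhost:8000, not that OLake is fully initialized.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

UI_URL = "http://localhost:8000"  # address given in the Quickstart


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, DNS failure, ...


def wait_until_up(url: str = UI_URL, attempts: int = 30, delay: float = 2.0,
                  probe: Callable[[str], bool] = http_probe,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll the URL until it responds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to sleep after the final attempt
    return False


if __name__ == "__main__":
    print("UI reachable" if wait_until_up() else "UI did not respond in time")
```

The `probe` and `sleep` parameters exist so the polling logic can be exercised without a running stack.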

Creating Your First Job

With the UI running, you can create a data pipeline in a few steps:

  1. Create a Job: Navigate to the Jobs tab and click Create Job.
  2. Configure Source: Set up your source connection (e.g., PostgreSQL, MySQL, MongoDB).
  3. Configure Destination: Set up your destination (e.g., Apache Iceberg with a Glue, REST, Hive, or JDBC catalog).
  4. Select Streams: Choose which tables to sync and configure their sync mode (CDC or Full Refresh).
  5. Configure & Run: Give your job a name, set a schedule, and click Create Job to finish.

For a detailed walkthrough, refer to the Jobs documentation.
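
As a way to think about what the five steps collect, the sketch below models a job as a small record with basic validation. This is purely illustrative: `JobDraft`, its field names, the lowercase sync-mode spellings, and the cron-style schedule string are all hypothetical and do not reflect the OLake UI's actual data model or API.

```python
from dataclasses import dataclass
from typing import Dict, List

# Sync modes named in step 4. The lowercase spellings are hypothetical.
SYNC_MODES = {"cdc", "full_refresh"}


@dataclass
class JobDraft:
    """Hypothetical in-memory model of the fields the five steps collect."""
    name: str                # step 5: job name
    source: str              # step 2: e.g. "postgres"
    destination: str         # step 3: e.g. "iceberg"
    streams: Dict[str, str]  # step 4: table name -> sync mode
    schedule: str            # step 5: e.g. a cron-style string

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the draft is complete."""
        found = []
        if not self.name.strip():
            found.append("job name is empty")
        if not self.streams:
            found.append("no streams selected")
        for table, mode in self.streams.items():
            if mode not in SYNC_MODES:
                found.append(f"stream {table!r} has unknown sync mode {mode!r}")
        if not self.schedule.strip():
            found.append("schedule is empty")
        return found


if __name__ == "__main__":
    draft = JobDraft("orders-to-lake", "postgres", "iceberg",
                     {"orders": "cdc", "customers": "full_refresh"}, "0 * * * *")
    print(draft.problems())
```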


🛠️ CLI Usage (Advanced)

For advanced users and automation, OLake's core logic is exposed via a powerful CLI. The core framework handles state management, configuration validation, logging, and type detection. It interacts with drivers using four main commands:

  • spec: Returns a render-able JSON Schema for a connector's configuration.
  • check: Validates connection configurations for sources and destinations.
  • discover: Returns all available streams (e.g., tables) and their schemas from a source.
  • sync: Executes the data replication job, extracting from the source and writing to the destination.

See the CLI documentation at olake.io/docs for details.
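
The four-command contract can be pictured as a small interface that every driver implements. The sketch below is a toy model of that idea, not OLake's code. `InMemoryDriver`, the `run` dispatcher, the `host` config field, and all method signatures are hypothetical, and the "source" is just a dictionary of Python lists.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class InMemoryDriver:
    """Hypothetical driver backed by Python dicts instead of a real database."""

    def __init__(self, tables: Dict[str, List[Row]]) -> None:
        self.tables = tables

    def spec(self) -> Dict[str, Any]:
        """Return a JSON-Schema-like description of the expected configuration."""
        return {"type": "object", "required": ["host"],
                "properties": {"host": {"type": "string"}}}

    def check(self, config: Dict[str, Any]) -> bool:
        """Validate that every required configuration field is present."""
        return all(name in config for name in self.spec()["required"])

    def discover(self) -> Dict[str, List[str]]:
        """Return each stream with its column names, taken from the first row."""
        return {name: sorted(rows[0]) if rows else []
                for name, rows in self.tables.items()}

    def sync(self, streams: List[str], write: Callable[[str, Row], None]) -> int:
        """Copy every row of the selected streams to the writer; return the row count."""
        count = 0
        for stream in streams:
            for row in self.tables[stream]:
                write(stream, row)
                count += 1
        return count


def run(driver: InMemoryDriver, command: str, **kwargs: Any) -> Any:
    """Dispatch one of the four commands to the driver."""
    commands = {"spec": driver.spec, "check": driver.check,
                "discover": driver.discover, "sync": driver.sync}
    if command not in commands:
        raise ValueError(f"unknown command: {command}")
    return commands[command](**kwargs)
```

A typical sequence would call `spec` to learn what configuration is needed, `check` to validate it, `discover` to list streams, and `sync` with the chosen streams and a writer callback.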


Install OLake

You can run OLake in any of the following ways:

  1. OLake UI (Recommended)
  2. Standalone Docker container
  3. Airflow on EC2
  4. Airflow on Kubernetes
  5. Kubernetes using Helm (Coming soon!)

Playground

  1. OLake + Apache Iceberg + REST Catalog + Presto
  2. OLake + Apache Iceberg + AWS Glue + Trino
  3. OLake + Apache Iceberg + AWS Glue + Athena
  4. OLake + Apache Iceberg + AWS Glue + Snowflake

📦 Architecture

[Figure: OLake architecture diagram]


🌍 Use Cases

  • ✅ Migrate from OLTP to Iceberg without Spark or Flink
  • ✅ Enable BI over fresh CDC data using Athena, StarRocks, Trino, Presto, Dremio, Databricks, Snowflake and more!
  • ✅ Build a near real-time data lakehouse on cost-efficient cloud object stores
  • ✅ Move away from vendor lock-in in warehouses and tools by adopting an open data lakehouse
  • ✅ Keep a single copy of data for both analytics and machine learning

🧭 Roadmap Highlights

  • Oracle Full Load Support
  • Oracle Incremental
  • Filters for Full Load and Incremental
  • Real-time Streaming Mode (Kafka)
  • Iceberg V3 Support

📌 Check out our GitHub Project Roadmap and the Upcoming OLake Roadmap to track what's next. If you have ideas or feedback, please share them in our GitHub Discussions or by opening an issue.


🤝 Contributing

We ❤️ contributions, big or small!

Check out our Bounty Program. A huge thanks to all our amazing contributors!
