diegoQuinas/minifusion


minifusion

A tiny columnar analytical query engine written in Rust, built from scratch to learn how engines like Apache DataFusion work under the hood.

This is a learning project. The goal is not to compete with DataFusion, but to reproduce its key design decisions at small scale:

  • Apache Arrow as the internal columnar format.
  • A pull-based execution model built on async Streams — operators pull RecordBatches from their children on demand.
  • Separation between the logical plan and the physical plan.

Status: early stage / work in progress. The core execution abstraction, a CSV source, and a projection operator are implemented and tested. The remaining operators and the DataFrame / SQL frontends are on the roadmap below.

The central abstraction: ExecutionPlan

Every physical operator implements one trait and produces a stream of Arrow record batches:

#[async_trait]
pub trait ExecutionPlan: Send + Sync {
    fn schema(&self) -> SchemaRef;
    fn children(&self) -> Vec<Arc<dyn ExecutionPlan>>;
    fn execute(&self) -> Result<SendableRecordBatchStream>;
}

Operators compose by wrapping each other: ProjectionExec wraps CsvScan, a future Filter wraps ProjectionExec, and so on. It is the same idea as chained Iterators, but async and columnar. Calling execute() on the outermost plan returns a stream, and consuming that stream lazily drives the whole pipeline.
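To make the wrapping idea concrete, here is a minimal, std-only sketch of the same pull model using plain synchronous iterators instead of async streams. It is not the crate's code: `Batch`, `Plan`, `MemScan`, `Projection`, and `demo` are hypothetical stand-ins, columns are plain `Vec<i64>` rather than Arrow arrays, and the projection selects by index rather than by name to keep the example short.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow RecordBatch: one Vec<i64> per column.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<Vec<i64>>,
}

/// Synchronous analog of ExecutionPlan: `execute` returns a lazy iterator
/// of batches instead of an async stream.
pub trait Plan: Send + Sync {
    fn schema(&self) -> Vec<String>;
    fn execute(&self) -> Box<dyn Iterator<Item = Batch> + Send>;
}

/// Leaf operator: serves in-memory columns in batches of `batch_size` rows.
pub struct MemScan {
    pub names: Vec<String>,
    pub columns: Vec<Vec<i64>>,
    pub batch_size: usize,
}

impl Plan for MemScan {
    fn schema(&self) -> Vec<String> {
        self.names.clone()
    }

    fn execute(&self) -> Box<dyn Iterator<Item = Batch> + Send> {
        // The data is cloned so the iterator owns it; a real scan would read lazily.
        let columns = self.columns.clone();
        let size = self.batch_size.max(1);
        let rows = columns.first().map_or(0, |c| c.len());
        // Each batch is sliced out only when the consumer calls `next()`.
        Box::new((0..rows).step_by(size).map(move |start| {
            let end = (start + size).min(rows);
            Batch {
                columns: columns.iter().map(|c| c[start..end].to_vec()).collect(),
            }
        }))
    }
}

/// Wrapping operator: keeps only the child's columns listed in `indices`.
pub struct Projection {
    pub input: Arc<dyn Plan>,
    pub indices: Vec<usize>,
}

impl Plan for Projection {
    fn schema(&self) -> Vec<String> {
        let child = self.input.schema();
        self.indices.iter().map(|&i| child[i].clone()).collect()
    }

    fn execute(&self) -> Box<dyn Iterator<Item = Batch> + Send> {
        let indices = self.indices.clone();
        // Pull one batch from the child, project it, and hand it upward.
        Box::new(self.input.execute().map(move |batch| Batch {
            columns: indices.iter().map(|&i| batch.columns[i].clone()).collect(),
        }))
    }
}

/// Builds Projection(MemScan) and drains it, returning schema and batches.
pub fn demo() -> (Vec<String>, Vec<Batch>) {
    let scan: Arc<dyn Plan> = Arc::new(MemScan {
        names: vec!["id".to_string(), "age".to_string(), "score".to_string()],
        columns: vec![vec![1, 2, 3], vec![30, 41, 25], vec![7, 8, 9]],
        batch_size: 2,
    });
    let plan = Projection { input: scan, indices: vec![0, 2] };
    (plan.schema(), plan.execute().collect())
}
```

Nothing is read until `collect()` starts pulling in `demo`; each `next()` on the outer iterator triggers exactly one `next()` on the scan. The async version described above would presumably follow the same shape, with a stream in place of the boxed iterator.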

What works today

  • CsvScan — reads a CSV file into batched Arrow RecordBatches, inferring the schema from the header and first rows.
  • ProjectionExec — selects a subset of columns by name, projecting both the schema and each batch.
  • MiniFusionError — a single typed error (thiserror) with Io, Arrow, Schema, and NotImplemented variants, plus a Result<T> alias.
  • An end-to-end integration test that scans a CSV fixture and asserts batching, schema, and row counts.
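As an illustration of the single-error-type pattern listed above, the following std-only sketch hand-writes what thiserror would normally derive. The four variant names come from this README; the payload types, the message strings, and the helpers `column_index` and `read_header` are assumptions made for the example, and the `Arrow` variant carries a `String` here only because the sketch avoids depending on the arrow crate.

```rust
pub mod error {
    use std::error::Error;
    use std::{fmt, io, result};

    /// One error type for the whole engine (payload types are assumptions).
    #[derive(Debug)]
    pub enum MiniFusionError {
        Io(io::Error),
        /// Stand-in payload; the real variant presumably wraps Arrow's error type.
        Arrow(String),
        Schema(String),
        NotImplemented(String),
    }

    // thiserror would generate this impl from `#[error("...")]` attributes.
    impl fmt::Display for MiniFusionError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                MiniFusionError::Io(e) => write!(f, "I/O error: {e}"),
                MiniFusionError::Arrow(msg) => write!(f, "arrow error: {msg}"),
                MiniFusionError::Schema(msg) => write!(f, "schema error: {msg}"),
                MiniFusionError::NotImplemented(msg) => write!(f, "not implemented: {msg}"),
            }
        }
    }

    impl Error for MiniFusionError {}

    // thiserror would generate this impl from `#[from]` on the Io variant.
    impl From<io::Error> for MiniFusionError {
        fn from(e: io::Error) -> Self {
            MiniFusionError::Io(e)
        }
    }

    /// Crate-wide alias so signatures can say `Result<T>`.
    pub type Result<T> = result::Result<T, MiniFusionError>;
}

/// Hypothetical helper: resolve a column name or fail with a Schema error.
pub fn column_index(schema: &[&str], name: &str) -> error::Result<usize> {
    schema
        .iter()
        .position(|c| *c == name)
        .ok_or_else(|| error::MiniFusionError::Schema(format!("unknown column '{name}'")))
}

/// Hypothetical helper: `?` turns io::Error into MiniFusionError::Io via From.
pub fn read_header(path: &str) -> error::Result<String> {
    let text = std::fs::read_to_string(path)?;
    Ok(text.lines().next().unwrap_or("").to_string())
}
```

With a single error type and a `From` conversion per wrapped source, operators can use `?` freely without mapping errors by hand at every call site.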

The main.rs CLI, the dataframe DSL builder, and the execution (SessionContext) module are scaffolded but not yet implemented.

Running the tests

cargo test

The integration test lives in tests/csv_scan.rs and uses the fixture at tests/fixtures/people.csv.

Architecture

src/
├── lib.rs              # public re-exports
├── main.rs             # CLI entry point (stub)
├── error.rs            # MiniFusionError + Result alias
├── datasource/         # data readers (CSV today, Parquet planned)
│   └── csv.rs          # CsvScan
├── physical_plan/      # ExecutionPlan trait + physical operators
│   └── projection.rs   # ProjectionExec
├── execution/          # SessionContext / runtime (planned)
└── dataframe.rs        # DataFrame DSL builder (planned)

Modules are added progressively, one roadmap level at a time, rather than all at once.

Roadmap

The project is organized into levels, each closing with an integration test.

  • Level 1 — Basics: CSV scan ✅, projection ✅, limit, DataFrame DSL, SessionContext, and a minifusion run CLI.
  • Level 2 — Filters: row filtering (=, !=, <, >, <=, >=) with a minimal Expr tree and vectorized evaluation over Arrow arrays.
  • Level 3 — Aggregations: COUNT, SUM, AVG, MIN, MAX and GROUP BY via hash aggregation with incremental accumulators.
  • Level 4 — Parquet: a Parquet data source, with projection pushdown to the reader as a bonus.
  • Level 5 — Logical / physical plans: introduce a LogicalPlan tree and a planner that lowers it to Arc<dyn ExecutionPlan>, plus a simple optimizer rule (projection pushdown) and an optional minimal SQL frontend.
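Level 2 is not implemented yet, so the following is only one possible shape for a minimal Expr tree with column-at-a-time evaluation. Every name here (`Op`, `Expr`, `Value`, `evaluate`, `filter`, `demo`) is hypothetical, columns are plain `Vec<i64>` instead of Arrow arrays, and in the real engine Arrow's compute kernels would presumably replace the hand-written loops.

```rust
/// The six comparison operators listed for Level 2.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Op { Eq, NotEq, Lt, Gt, LtEq, GtEq }

impl Op {
    fn apply(self, a: i64, b: i64) -> bool {
        match self {
            Op::Eq => a == b,
            Op::NotEq => a != b,
            Op::Lt => a < b,
            Op::Gt => a > b,
            Op::LtEq => a <= b,
            Op::GtEq => a >= b,
        }
    }
}

/// Minimal expression tree: column reference, literal, or comparison.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(usize),
    Literal(i64),
    Compare(Box<Expr>, Op, Box<Expr>),
}

/// Evaluating an expression yields a whole column, never a single value.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Ints(Vec<i64>),
    Mask(Vec<bool>),
}

/// Evaluates `expr` against every row of `batch` in one pass per node.
pub fn evaluate(expr: &Expr, batch: &[Vec<i64>]) -> Result<Value, String> {
    let rows = batch.first().map_or(0, |c| c.len());
    match expr {
        Expr::Column(i) => batch
            .get(*i)
            .cloned()
            .map(Value::Ints)
            .ok_or_else(|| format!("column {i} out of range")),
        // A literal is broadcast to the batch length.
        Expr::Literal(v) => Ok(Value::Ints(vec![*v; rows])),
        Expr::Compare(left, op, right) => {
            match (evaluate(left, batch)?, evaluate(right, batch)?) {
                (Value::Ints(l), Value::Ints(r)) => Ok(Value::Mask(
                    l.iter().zip(r.iter()).map(|(a, b)| op.apply(*a, *b)).collect(),
                )),
                _ => Err("comparison operands must be integer columns".to_string()),
            }
        }
    }
}

/// Keeps the rows of `batch` for which `predicate` evaluates to true.
pub fn filter(batch: &[Vec<i64>], predicate: &Expr) -> Result<Vec<Vec<i64>>, String> {
    match evaluate(predicate, batch)? {
        Value::Mask(mask) => Ok(batch
            .iter()
            .map(|col| {
                col.iter()
                    .zip(mask.iter())
                    .filter(|(_, keep)| **keep)
                    .map(|(v, _)| *v)
                    .collect::<Vec<i64>>()
            })
            .collect()),
        Value::Ints(_) => Err("predicate must evaluate to a boolean mask".to_string()),
    }
}

/// Filters a two-column batch (id, age) with the predicate `age >= 30`.
pub fn demo() -> Result<Vec<Vec<i64>>, String> {
    let batch = vec![vec![1, 2, 3, 4], vec![30, 41, 25, 52]];
    let predicate = Expr::Compare(
        Box::new(Expr::Column(1)),
        Op::GtEq,
        Box::new(Expr::Literal(30)),
    );
    filter(&batch, &predicate)
}
```

The point of the two-step design is that the predicate is computed once per batch as a boolean mask, and the mask is then applied to every column, rather than the expression being interpreted row by row.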

Acknowledgements

Heavily inspired by Apache DataFusion and the broader Arrow ecosystem. Any good design idea here is theirs; any rough edge is mine.

License

MIT — see LICENSE.
