A tiny columnar analytical query engine written in Rust, built from scratch to learn how engines like Apache DataFusion work under the hood.
This is a learning project. The goal is not to compete with DataFusion, but to reproduce its key design decisions at small scale:
- Apache Arrow as the internal columnar format.
- A pull-based execution model built on async Streams: operators pull RecordBatches from their children on demand.
- Separation between the logical plan and the physical plan.
Status: early stage / work in progress. The core execution abstraction, a CSV source, and a projection operator are implemented and tested. The remaining operators and the DataFrame / SQL frontends are on the roadmap below.
Every physical operator implements one trait and produces a stream of Arrow record batches:
#[async_trait]
pub trait ExecutionPlan: Send + Sync {
    fn schema(&self) -> SchemaRef;
    fn children(&self) -> Vec<Arc<dyn ExecutionPlan>>;
    fn execute(&self) -> Result<SendableRecordBatchStream>;
}

Operators compose by wrapping each other: Projection wraps CsvScan, a
future Filter wraps Projection, and so on. It is the same idea as chained
Iterators, but async and columnar. Calling execute() on the outermost plan
lazily drives the whole pipeline.
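
To make the wrapping idea concrete, here is a minimal synchronous sketch of it that uses only the standard library. It is not the project's code: `Batch`, `Plan`, `MemoryScan`, `Projection`, and `demo` are hypothetical stand-ins, a plain `Iterator` replaces the async stream, and `Vec<i64>` columns replace Arrow arrays.

```rust
use std::sync::Arc;

/// Toy columnar batch: column names plus one Vec<i64> per column.
/// (Hypothetical stand-in for an Arrow RecordBatch.)
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<i64>>,
}

/// Synchronous stand-in for the async ExecutionPlan trait.
trait Plan {
    fn execute(&self) -> Box<dyn Iterator<Item = Batch> + '_>;
}

/// Leaf operator: yields pre-built batches, where a CSV scan would read a file.
struct MemoryScan {
    batches: Vec<Batch>,
}

impl Plan for MemoryScan {
    fn execute(&self) -> Box<dyn Iterator<Item = Batch> + '_> {
        Box::new(self.batches.iter().cloned())
    }
}

/// Wrapping operator: keeps only the named columns of each child batch.
struct Projection {
    input: Arc<dyn Plan>,
    keep: Vec<String>,
}

impl Plan for Projection {
    fn execute(&self) -> Box<dyn Iterator<Item = Batch> + '_> {
        // Nothing is computed here; batches are projected as the caller pulls them.
        Box::new(self.input.execute().map(move |batch| {
            let mut names = Vec::new();
            let mut columns = Vec::new();
            for wanted in &self.keep {
                // Columns missing from the batch are silently skipped in this sketch.
                if let Some(i) = batch.names.iter().position(|n| n == wanted) {
                    names.push(batch.names[i].clone());
                    columns.push(batch.columns[i].clone());
                }
            }
            Batch { names, columns }
        }))
    }
}

/// Builds Projection(MemoryScan) and drains it; work happens only during `collect`.
fn demo() -> Vec<Batch> {
    let scan = MemoryScan {
        batches: vec![Batch {
            names: vec!["id".to_string(), "age".to_string()],
            columns: vec![vec![1, 2], vec![30, 40]],
        }],
    };
    let plan = Projection {
        input: Arc::new(scan),
        keep: vec!["age".to_string()],
    };
    plan.execute().collect()
}
```

Calling `demo()` pulls one batch through both operators. In the async version the same shape would hold, with the iterator replaced by a stream that is polled instead of iterated.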
- CsvScan: reads a CSV file into batched Arrow RecordBatches, inferring the schema from the header and first rows.
- ProjectionExec: selects a subset of columns by name, projecting both the schema and each batch.
- MiniFusionError: a single typed error (thiserror) with Io, Arrow, Schema, and NotImplemented variants, plus a Result<T> alias.
- An end-to-end integration test that scans a CSV fixture and asserts batching, schema, and row counts.
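
As a rough illustration of the error type described above, the sketch below writes out by hand approximately what a thiserror derive would provide. The variant payloads are assumptions: the source names only the variants, so `Arrow` holds a `String` here rather than Arrow's own error type, and `column_index` is a hypothetical helper added for the example.

```rust
use std::fmt;
use std::io;

/// Hand-written approximation of the crate's single error type.
/// Payload types are assumed, not taken from the project.
#[derive(Debug)]
pub enum MiniFusionError {
    Io(io::Error),
    Arrow(String), // assumption: the real variant likely wraps Arrow's error type
    Schema(String),
    NotImplemented(String),
}

/// Crate-wide alias so functions can return `Result<T>`.
pub type Result<T> = std::result::Result<T, MiniFusionError>;

impl fmt::Display for MiniFusionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MiniFusionError::Io(e) => write!(f, "io error: {}", e),
            MiniFusionError::Arrow(msg) => write!(f, "arrow error: {}", msg),
            MiniFusionError::Schema(msg) => write!(f, "schema error: {}", msg),
            MiniFusionError::NotImplemented(msg) => write!(f, "not implemented: {}", msg),
        }
    }
}

impl std::error::Error for MiniFusionError {}

/// Lets `?` convert io::Error into MiniFusionError automatically.
impl From<io::Error> for MiniFusionError {
    fn from(e: io::Error) -> Self {
        MiniFusionError::Io(e)
    }
}

/// Hypothetical helper: resolve a column name to its index, or fail with Schema.
fn column_index(schema: &[&str], name: &str) -> Result<usize> {
    schema
        .iter()
        .position(|c| *c == name)
        .ok_or_else(|| MiniFusionError::Schema(format!("unknown column '{}'", name)))
}
```

A projection by name could use a lookup like `column_index` to reject unknown columns before any batch is read.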
The main.rs CLI, the dataframe DSL builder, and the execution
(SessionContext) module are scaffolded but not yet implemented.
cargo test

The integration test lives in tests/csv_scan.rs and uses the fixture at
tests/fixtures/people.csv.
src/
├── lib.rs # public re-exports
├── main.rs # CLI entry point (stub)
├── error.rs # MiniFusionError + Result alias
├── datasource/ # data readers (CSV today, Parquet planned)
│ └── csv.rs # CsvScan
├── physical_plan/ # ExecutionPlan trait + physical operators
│ └── projection.rs # ProjectionExec
├── execution/ # SessionContext / runtime (planned)
└── dataframe.rs # DataFrame DSL builder (planned)
Modules appear progressively, level by level, rather than all at once.
The project is organized into levels, each closing with an integration test.
- Level 1 (Basics): CSV scan ✅, projection ✅, limit, DataFrame DSL, SessionContext, and a minifusion run CLI.
- Level 2 (Filters): row filtering (=, !=, <, >, <=, >=) with a minimal Expr tree and vectorized evaluation over Arrow arrays.
- Level 3 (Aggregations): COUNT, SUM, AVG, MIN, MAX and GROUP BY via hash aggregation with incremental accumulators.
- Level 4 (Parquet): a Parquet data source, with projection pushdown to the reader as a bonus.
- Level 5 (Logical / physical plans): introduce a LogicalPlan tree and a planner that lowers it to Arc<dyn ExecutionPlan>, plus a simple optimizer rule (projection pushdown) and an optional minimal SQL frontend.
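
The Level 2 item is not implemented yet, so the following is only one possible sketch of a minimal expression tree evaluated a column at a time. Everything in it is hypothetical: the `Expr` shape, the `And` node (not listed in the roadmap), and the use of `Vec<i64>` columns in place of Arrow arrays.

```rust
/// The six comparison operators listed for Level 2.
#[derive(Debug, Clone, Copy)]
enum Op {
    Eq,
    NotEq,
    Lt,
    Gt,
    LtEq,
    GtEq,
}

impl Op {
    fn apply(self, a: i64, b: i64) -> bool {
        match self {
            Op::Eq => a == b,
            Op::NotEq => a != b,
            Op::Lt => a < b,
            Op::Gt => a > b,
            Op::LtEq => a <= b,
            Op::GtEq => a >= b,
        }
    }
}

/// Minimal expression tree: a column-vs-literal comparison, or a conjunction.
#[derive(Debug, Clone)]
enum Expr {
    Compare { column: usize, op: Op, literal: i64 },
    And(Box<Expr>, Box<Expr>), // assumption: added here only to make it a tree
}

/// Evaluates the expression over whole columns, producing one bool per row.
fn evaluate(expr: &Expr, columns: &[Vec<i64>]) -> Vec<bool> {
    match expr {
        Expr::Compare { column, op, literal } => columns[*column]
            .iter()
            .map(|v| op.apply(*v, *literal))
            .collect(),
        Expr::And(left, right) => {
            let l = evaluate(left, columns);
            let r = evaluate(right, columns);
            l.iter().zip(&r).map(|(a, b)| *a && *b).collect()
        }
    }
}

/// Applies a row mask to every column, keeping rows where the mask is true.
fn filter(columns: &[Vec<i64>], mask: &[bool]) -> Vec<Vec<i64>> {
    columns
        .iter()
        .map(|col| {
            col.iter()
                .zip(mask)
                .filter(|(_, keep)| **keep)
                .map(|(v, _)| *v)
                .collect()
        })
        .collect()
}
```

The split into `evaluate` (build a mask) and `filter` (apply it) mirrors how a vectorized engine typically separates predicate evaluation from row selection; with Arrow, both steps would operate on arrays rather than `Vec`s.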
Heavily inspired by Apache DataFusion and the broader Arrow ecosystem. Any good design idea here is theirs; any rough edge is mine.
MIT — see LICENSE.