Ballista will be a proof-of-concept distributed compute platform based on Kubernetes and the Rust implementation of Apache Arrow.
This is not my first attempt at building something like this. I originally wanted DataFusion to be a distributed compute platform, but that was overly ambitious at the time, and it ended up becoming an in-memory query execution engine for the Rust implementation of Apache Arrow. However, DataFusion now provides a good foundation for another attempt at building a modern distributed compute platform in Rust.
My goal is to use this repo to move fast and try out ideas that can eventually be contributed back to Apache Arrow, and to help drive requirements for Apache Arrow and DataFusion.
I will be working on this project in my spare time, which is limited, so progress will likely be slow.
- README describing project
- Define service and minimal query plan in protobuf file
- Generate code from protobuf file
- Implement skeleton gRPC server
- Implement skeleton gRPC client
- Client can send query plan
- Server can receive query plan
- CLI to create cluster using Kubernetes
- Server can translate protobuf query plan to DataFusion query plan
- Server can execute query plan using DataFusion
- Server can write results to CSV files
- Server can stream Arrow data back to client
- Benchmarks
- Implement Flight protocol
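As a rough illustration of the "translate protobuf query plan" item above, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: `PlanNodeMsg`, `ScanMsg`, `ProjectionMsg` and `LimitMsg` are hypothetical stand-ins for types a protobuf code generator might emit, and `Plan` is a hypothetical stand-in for a DataFusion logical plan, not the real DataFusion type.

```rust
// Hypothetical wire-level message. Protobuf generators commonly model
// "one of several operators" as a set of optional fields, so the
// translation step has to validate that exactly one is present.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct PlanNodeMsg {
    pub input: Option<Box<PlanNodeMsg>>,
    pub scan: Option<ScanMsg>,
    pub projection: Option<ProjectionMsg>,
    pub limit: Option<LimitMsg>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ScanMsg {
    pub path: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ProjectionMsg {
    pub columns: Vec<u32>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct LimitMsg {
    pub n: u64,
}

// Hypothetical in-memory plan, standing in for the engine's logical plan.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { path: String },
    Projection { columns: Vec<usize>, input: Box<Plan> },
    Limit { n: usize, input: Box<Plan> },
}

/// Translate a wire-level plan node into the typed in-memory plan,
/// rejecting messages that are structurally invalid.
pub fn translate(msg: &PlanNodeMsg) -> Result<Plan, String> {
    let operators = [
        msg.scan.is_some(),
        msg.projection.is_some(),
        msg.limit.is_some(),
    ];
    let set = operators.iter().filter(|present| **present).count();
    if set != 1 {
        return Err(format!("expected exactly one operator, found {}", set));
    }

    // A scan is a leaf: it reads from a file and has no child plan.
    if let Some(scan) = &msg.scan {
        if msg.input.is_some() {
            return Err("scan must not have an input".to_string());
        }
        return Ok(Plan::Scan { path: scan.path.clone() });
    }

    // Every other operator wraps exactly one child plan.
    let input = match &msg.input {
        Some(child) => Box::new(translate(child)?),
        None => return Err("operator is missing its input".to_string()),
    };

    match (&msg.projection, &msg.limit) {
        (Some(p), _) => Ok(Plan::Projection {
            columns: p.columns.iter().map(|c| *c as usize).collect(),
            input,
        }),
        (_, Some(l)) => Ok(Plan::Limit { n: l.n as usize, input }),
        _ => unreachable!("exactly one operator was verified above"),
    }
}
```

Keeping the wire types separate from the in-memory plan means validation happens once, at the boundary, and the rest of the server can work with a plan that is known to be well formed.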
The build currently depends on https://github.com/tower-rs/tower-grpc/tree/master/tower-grpc being cloned into a parallel directory next to this repo.
Open two terminal sessions. In the first session, run:
cargo run --bin server
In the second session, run:
cargo run --example client
So far, this just sends a logical query plan from the client to the server.
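To make the client-to-server round trip concrete, here is a minimal sketch that uses only the standard library. It is not how this repo does it: the real code uses gRPC via tower-grpc, whereas this sketch swaps in a plain TCP socket with a length-prefixed frame. The function names (`write_frame`, `read_frame`, `serve_once`, `send_plan`, `round_trip`) and the `"ok"` acknowledgement are all hypothetical, and the plan is treated as opaque bytes.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Write one frame: a 4-byte big-endian length followed by the payload.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)?;
    w.flush()
}

/// Read one frame written by `write_frame`.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}

/// Server side: accept a single connection, read the serialized plan,
/// acknowledge it, and return the bytes that were received.
pub fn serve_once(listener: TcpListener) -> io::Result<Vec<u8>> {
    let (mut stream, _peer) = listener.accept()?;
    let plan = read_frame(&mut stream)?;
    write_frame(&mut stream, b"ok")?;
    Ok(plan)
}

/// Client side: connect, send the serialized plan, and wait for the reply.
pub fn send_plan(addr: SocketAddr, plan: &[u8]) -> io::Result<Vec<u8>> {
    let mut stream = TcpStream::connect(addr)?;
    write_frame(&mut stream, plan)?;
    read_frame(&mut stream)
}

/// Run server and client in one process over the loopback interface.
/// Returns (bytes the server received, reply the client received).
pub fn round_trip(plan: &[u8]) -> io::Result<(Vec<u8>, Vec<u8>)> {
    // Port 0 asks the OS for any free port.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let server = thread::spawn(move || serve_once(listener));
    let reply = send_plan(addr, plan)?;
    let received = server
        .join()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "server thread panicked"))??;
    Ok((received, reply))
}
```

Calling `round_trip(b"some plan bytes")` should hand back the same bytes on the server side along with the acknowledgement. With gRPC the framing and connection handling come from the library, so only the plan encoding would carry over from a sketch like this.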