Description
Context
Apache DataFusion is putting together an application for GSoC 2025. For those who do not know, GSoC is a Google-sponsored program for students passionate about open-source to come in and contribute to cool projects for a duration of 12 weeks over the summer. The main body of this issue will serve as a repository for project ideas for prospective students, should we become a mentoring organization.
Process
As an ex-GSoC student who has been through the process (time flies!), I will be coordinating our application to the program and overseeing the progress on behalf of the PMC. I do not know how many student slots we will get, but in addition to myself, @jayzhan-synnada and @berkaysynnada are volunteering to be mentors to GSoC students and help with their projects. If there is a need for more mentors, I will ask the committers and the PMC for help (and I'm sure we will get as many mentors as necessary 🙂 ), but we have many interesting project ideas and certainly enough people to mentor the students.
Idea Proposals
Let's use the comments section of this issue to discuss and ideate about possible 12-week projects for GSoC students. Note that our aim is to answer the following questions positively as we graduate students from the program:
- Did the student learn new skills for building cutting-edge data systems? This involves things like (a) improving their Rust proficiency, (b) extending their knowledge of schedulers and sync/async computations, (c) developing their ability to grasp and implement novel ideas from recent academic papers, and many other practical computer science skills.
- Was the student able to complete a non-trivial project that actually integrates with (and is useful within) Apache DataFusion?
- Will the student leave the program with enthusiasm to become an active member of the Apache DataFusion community, or at least of another open-source community?
Ideas
TODO: I will be adding some ideas here during the course of this week, and hope to draw inspiration from the community feedback. Let's go! 🚀
Here is a quick summary of the project ideas, along with links to more details:
Project | Category | Difficulty Level | Possible Mentor(s) and/or Helper(s) | Skills | Expected Project Size |
---|---|---|---|---|---|
Implement Continuous Monitoring of DataFusion Performance | Tooling | Medium | @alamb and @mertak-synnada | DevOps, Cloud Computing, Web Development, Integrations | 175 to 350 hours* |
Supporting Correlated Subqueries | Core | Advanced | @jayzhan-synnada and @xudong963 | Databases, Algorithms, Data Structures, Testing Techniques | 350 hours |
Improving DataFusion DX (e.g. 1 and 2) | DX | Medium | @eliaperantoni and @mkarbo | Software Engineering, Terminal Visualizations | 175 to 350 hours* |
Robust WASM Support | Build | Medium | @alamb and @waynexia | WASM, Advanced Rust, Web Development, Software Engineering | 175 to 350 hours* |
High Performance Aggregations | Core | Advanced | @jayzhan-synnada and @Rachelint | Algorithms, Data Structures, Advanced Rust, Databases, Benchmarking Techniques | 350 hours |
Improving Python Bindings | Python Bindings | Medium | @timsaucer | APIs, FFIs, DataFrame Libraries | 175 to 350 hours* |
Optimizing DataFusion Binary Size | Core/Build | Medium | @comphead and @alamb | Software Engineering, Refactoring, Dependency Management, Compilers | 175 to 350 hours* |
Ergonomic SQL Features | SQL FE | Medium | @berkaysynnada | SQL, Planning, Parsing, Software Engineering | 350 hours |
Advanced Interval Analysis | Core | Advanced | @ozankabak and @berkaysynnada | Algorithms, Data Structures, Applied Mathematics, Software Engineering | 350 hours |
Spark-Compatible Functions Crate | Extensions | Medium | @alamb and @andygrove | SQL, Spark, Software Engineering | 175 to 350 hours* |
SQL Fuzzing Framework in Rust | Extensions | Advanced | @2010YOUY01 | SQL, Testing Techniques, Advanced Rust, Software Engineering | 175 to 350 hours* |
\* There is enough material to make this a 350-hour project, but it is granular enough to make it a 175-hour project as well.
Implement Continuous Monitoring of DataFusion Performance
DataFusion lacks continuous monitoring of how performance evolves over time; we do this somewhat manually today. Even though performance has been one of our top priorities for a while now, we have not yet built a continuous monitoring system. This linked issue contains a summary of all the previous efforts that made us inch closer to having such a system, but a functioning system needs to be built on top of that progress. A student successfully completing this project would gain experience in building an end-to-end monitoring system that integrates with GitHub, scheduling/running benchmarks on some sort of cloud infrastructure, and building a versatile web UI to expose the results. The outcome of this project will benefit Apache DataFusion on an ongoing basis in its quest for ever-more performance.
Supporting Correlated Subqueries
Correlated subqueries are an important SQL feature that enables some users to express their business logic more intuitively without thinking about "joins". Even though DataFusion has decent join support, it doesn't fully support correlated subqueries. The linked epic breaks the steps necessary to achieve full support into bite-size pieces. For students interested in the internals of data systems and databases, this project is a good opportunity to apply and/or improve their computer science knowledge. The experience of adding such a feature to a widely-used foundational query engine can also serve as a good opportunity to kickstart a career in the area of databases and data systems.
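To make the idea concrete, here is a minimal sketch in plain Rust (not DataFusion code) of why a correlated subquery can be rewritten as an aggregation plus a join. The `Order` struct, the `orders` table in the SQL comment, and both function names are assumptions made up for illustration; DataFusion's actual planner works on logical plans, not on slices like this.

```rust
use std::collections::HashMap;

// SQL being modeled (hypothetical table and columns):
//   SELECT o.id FROM orders o
//   WHERE o.amount > (SELECT avg(i.amount) FROM orders i
//                     WHERE i.customer = o.customer)

#[derive(Debug, Clone, Copy)]
struct Order {
    id: u32,
    customer: u32,
    amount: f64,
}

/// Naive evaluation: re-run the inner query for every outer row (O(n^2)).
fn correlated(orders: &[Order]) -> Vec<u32> {
    orders
        .iter()
        .filter(|o| {
            let (sum, n) = orders
                .iter()
                .filter(|i| i.customer == o.customer)
                .fold((0.0, 0u32), |(s, n), i| (s + i.amount, n + 1));
            o.amount > sum / n as f64
        })
        .map(|o| o.id)
        .collect()
}

/// Decorrelated evaluation: aggregate once per customer, then join (O(n)).
fn decorrelated(orders: &[Order]) -> Vec<u32> {
    // Equivalent of: SELECT customer, avg(amount) FROM orders GROUP BY customer
    let mut agg: HashMap<u32, (f64, u32)> = HashMap::new();
    for o in orders {
        let e = agg.entry(o.customer).or_insert((0.0, 0));
        e.0 += o.amount;
        e.1 += 1;
    }
    // Equivalent of joining the outer rows with the aggregate on `customer`.
    orders
        .iter()
        .filter(|o| {
            let (sum, n) = agg[&o.customer];
            o.amount > sum / n as f64
        })
        .map(|o| o.id)
        .collect()
}

fn sample_orders() -> Vec<Order> {
    vec![
        Order { id: 1, customer: 1, amount: 10.0 },
        Order { id: 2, customer: 1, amount: 30.0 },
        Order { id: 3, customer: 2, amount: 5.0 },
        Order { id: 4, customer: 2, amount: 5.0 },
        Order { id: 5, customer: 1, amount: 20.0 },
        Order { id: 6, customer: 2, amount: 11.0 },
    ]
}

fn main() {
    let orders = sample_orders();
    // Both strategies must select the same rows.
    assert_eq!(correlated(&orders), decorrelated(&orders));
    println!("{:?}", decorrelated(&orders));
}
```

Both functions return the same rows; the second does the per-customer work once instead of once per outer row, which is roughly the kind of rewrite a query engine performs when it decorrelates a subquery.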
Improving DataFusion DX
While performance, extensibility and customizability are DataFusion's strengths, we have much work to do in terms of user-friendliness and debuggability. This project aims to make strides in these areas by improving terminal visualizations of query plans and increasing the "deployment" of the newly-added diagnostics framework. This is a potentially high-impact project with highly visible output, and it would reduce the barrier to entry for new users.
Robust WASM Support
DataFusion can be compiled today to WASM with some care. However, it is somewhat tricky and brittle. Having robust WASM support improves the embeddability aspect of DataFusion, and can enable many practical use cases. A good conclusion of this project would be the addition of a live demo sub-page to the DataFusion homepage.
High Performance Aggregations
Aggregation is one of the most fundamental operations within a query engine. Practical performance in many use cases, and results in many well-known benchmarks (e.g. ClickBench), depend heavily on aggregation performance. The DataFusion community has been working on improving aggregation performance for a while now, but there is still work to do. A student working on this project will get the chance to hone their skills in high-performance, low(ish) level coding, the intricacies of measuring performance, data structures and more.
Improving Python Bindings
DataFusion offers Python bindings that enable users to build data systems using Python. However, the Python bindings are still relatively low-level, and do not expose all the APIs that end-user-focused libraries like Pandas and Polars offer. This project aims to improve DataFusion's Python bindings and bring them closer to such libraries in terms of built-in APIs and functionality.
Optimizing DataFusion Binary Size
DataFusion is a foundational library with a large feature set. Even though we try to avoid adding too many dependencies and implement many low-level functionalities inside the codebase, the fast-moving nature of the project results in an accumulation of dependencies over time. This inflates DataFusion's binary size, which reduces portability and embeddability. This project involves studying the codebase with compiler tooling to understand where code bloat comes from, reducing the number of dependencies through efficient in-house implementations, and avoiding code duplication.
Ergonomic SQL Features
DuckDB has many innovative features that significantly improve the SQL UX. Even though some of those features are already implemented in DataFusion, there are many others we can implement (and get inspiration from). This page contains a good summary of such features. Each such feature will serve as a bite-size, achievable milestone for a cool GSoC project with broad user-facing impact on the UX. The project will start with a survey of what is already implemented and what is missing, followed by a prioritization proposal and implementation plan.
Advanced Interval Analysis
DataFusion implements interval arithmetic and utilizes it for range estimations, which enables use cases in data pruning, optimizations and statistics. However, the current implementation only works efficiently for forward evaluation; i.e. calculating the output range of an expression given input ranges (ranges of columns). When propagating constraints using the same graph, the current approach requires multiple bottom-up and top-down traversals to narrow column bounds fully. This project aims to fix this deficiency by utilizing a better algorithmic approach. Note that this is a very advanced project for students with a deep interest in computational methods, expression graphs, and constraint solvers.
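The difference between forward evaluation and constraint propagation can be sketched in a few lines of plain Rust. The `Interval` type and `propagate_add` function below are hypothetical and written only for illustration; they are not DataFusion's interval arithmetic API, and they handle a single `a + b` node rather than a full expression graph.

```rust
/// A closed interval [lo, hi] over f64.
/// Hypothetical type for illustration; not DataFusion's `Interval`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lo: f64,
    hi: f64,
}

impl Interval {
    fn new(lo: f64, hi: f64) -> Self {
        assert!(lo <= hi, "empty interval");
        Interval { lo, hi }
    }

    /// Forward evaluation: the range of `a + b` given the ranges of a and b.
    fn add(self, other: Interval) -> Interval {
        Interval::new(self.lo + other.lo, self.hi + other.hi)
    }

    /// Forward evaluation: the range of `a - b`.
    fn sub(self, other: Interval) -> Interval {
        Interval::new(self.lo - other.hi, self.hi - other.lo)
    }

    /// Intersection of two intervals, or None if they are disjoint.
    fn intersect(self, other: Interval) -> Option<Interval> {
        let lo = self.lo.max(other.lo);
        let hi = self.hi.min(other.hi);
        if lo <= hi {
            Some(Interval { lo, hi })
        } else {
            None
        }
    }
}

/// Constraint propagation for `a + b ∈ target`:
/// first narrow the sum bottom-up, then push the narrowed sum back down
/// to the children via `a ∈ sum - b` and `b ∈ sum - a`.
/// Returns None when the constraint cannot be satisfied.
fn propagate_add(a: Interval, b: Interval, target: Interval) -> Option<(Interval, Interval)> {
    let sum = a.add(b).intersect(target)?; // bottom-up pass
    let new_a = a.intersect(sum.sub(b))?; // top-down pass for a
    let new_b = b.intersect(sum.sub(new_a))?; // top-down pass for b
    Some((new_a, new_b))
}

fn main() {
    let a = Interval::new(0.0, 10.0);
    let b = Interval::new(5.0, 20.0);

    // Forward evaluation: a + b lies in [5, 30].
    assert_eq!(a.add(b), Interval::new(5.0, 30.0));

    // Constraint `a + b <= 8` narrows a to [0, 3] and b to [5, 8].
    let target = Interval::new(f64::NEG_INFINITY, 8.0);
    let (new_a, new_b) = propagate_add(a, b, target).unwrap();
    assert_eq!(new_a, Interval::new(0.0, 3.0));
    assert_eq!(new_b, Interval::new(5.0, 8.0));
    println!("a in {:?}, b in {:?}", new_a, new_b);
}
```

Even this single-node case needs one bottom-up and one top-down step. In a larger expression graph, narrowing one column can enable further narrowing elsewhere, which is why repeated traversals become necessary.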
Spark-Compatible Functions Crate
In general, DataFusion aims to be compatible with PostgreSQL in terms of functions and behaviors. However, there are many users (and downstream projects, such as DataFusion Comet) that desire compatibility with Apache Spark. This project aims to collect Spark-compatible functions into a separate crate to help such users and/or projects. The project will be an exercise in creating the right APIs, explaining how to use them, and then telling the world about them (e.g. via creating a compatibility-tracking page cataloging such functions, writing blog posts etc.).
SQL Fuzzing Framework in Rust
Fuzz testing is a very important technique we utilize often in DataFusion. Having SQL-level fuzz testing enables us to battle-test DataFusion in an end-to-end fashion. The initial version of our fuzzing framework is Java-based, but the time has come to migrate to a Rust-native solution. This will simplify the overall implementation (by avoiding things like JDBC), enable us to implement more advanced algorithms for query generation, and attract more contributors over time. This project is a good blend of software engineering, algorithms and testing techniques (i.e. fuzzing techniques).
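As a rough illustration of what seeded query generation might look like, here is a toy generator in plain Rust with no external crates. Everything in it is an assumption for illustration: the table `t`, the columns `a`, `b`, `c`, the `Rng` type and the function names are made up, and the existing fuzzing framework is not structured this way.

```rust
/// Tiny xorshift64 PRNG so the sketch needs no external crates.
/// The seed must be non-zero.
struct Rng(u64);

impl Rng {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Pick one item from a non-empty slice.
    fn pick<'a>(&mut self, items: &[&'a str]) -> &'a str {
        items[(self.next_u64() % items.len() as u64) as usize]
    }
}

/// Generate a random arithmetic expression over hypothetical columns a, b, c.
/// `depth` bounds the nesting so generation always terminates.
fn gen_expr(rng: &mut Rng, depth: u32) -> String {
    if depth == 0 || rng.next_u64() % 3 == 0 {
        // Leaf: either a column reference or a small integer literal.
        if rng.next_u64() % 2 == 0 {
            rng.pick(&["a", "b", "c"]).to_string()
        } else {
            (rng.next_u64() % 100).to_string()
        }
    } else {
        let op = rng.pick(&["+", "-", "*"]);
        let left = gen_expr(rng, depth - 1);
        let right = gen_expr(rng, depth - 1);
        format!("({} {} {})", left, op, right)
    }
}

/// Generate a full query against a hypothetical table `t`.
fn gen_query(seed: u64) -> String {
    let mut rng = Rng(seed);
    let projection = gen_expr(&mut rng, 2);
    let lhs = gen_expr(&mut rng, 2);
    let cmp = rng.pick(&["<", "<=", "=", ">=", ">"]);
    let rhs = gen_expr(&mut rng, 2);
    format!("SELECT {} FROM t WHERE {} {} {}", projection, lhs, cmp, rhs)
}

fn main() {
    // The same seed always yields the same query, so failures are reproducible.
    assert_eq!(gen_query(42), gen_query(42));
    for seed in 1..=3 {
        println!("{}", gen_query(seed));
    }
}
```

A real framework would go on to run each generated query against DataFusion and a reference engine and compare the results. Seeding the generator keeps any failure reproducible.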