Project Ideas for GSoC 2025 (Google Summer of Code) #14478

@ozankabak

Context

Apache DataFusion is putting together an application for GSoC 2025. For those who do not know, GSoC is a Google-sponsored program for students passionate about open-source to come in and contribute to cool projects for a duration of 12 weeks over the summer. The main body of this issue will serve as a repository for project ideas for prospective students, should we become a mentoring organization.

Process

As an ex-GSoC student who has been through the process (time flies!), I will be coordinating our application to the program and overseeing progress on behalf of the PMC. I do not know how many student slots we will get, but in addition to myself, @jayzhan-synnada and @berkaysynnada are volunteering to mentor GSoC students and help with their projects. If we need more mentors, I will ask the committers and the PMC for help (and I'm sure we will get as many mentors as necessary 🙂 ), but we already have many interesting project ideas and certainly enough people to mentor the students.

Idea Proposals

Let's use the comments section of this issue to discuss and ideate about possible 12-week projects for GSoC students. Note that our aim is to answer the following questions positively as we graduate students from the program:

  1. Did the student learn new skills for building cutting-edge data systems? This involves things like (a) improving their Rust proficiency, (b) extending their knowledge on schedulers and sync/async computations, (c) developing their ability to grasp and implement novel ideas from recent academic papers, and many other practical computer science skills.
  2. Was the student able to complete a non-trivial project that actually integrates with (and is useful within) Apache DataFusion?
  3. Will the student leave the program with enthusiasm to become an active member of the Apache DataFusion community, or at least of another open-source community?

Ideas

TODO: I will be adding some ideas here during the course of this week, and hope to draw inspiration from the community feedback. Let's go! 🚀

Here is a quick summary of the project ideas, along with links to more details:

| Project | Category | Difficulty Level | Possible Mentor(s) and/or Helper(s) | Skills | Expected Project Size |
| --- | --- | --- | --- | --- | --- |
| Implement Continuous Monitoring of DataFusion Performance | Tooling | Medium | @alamb and @mertak-synnada | DevOps, Cloud Computing, Web Development, Integrations | 175 to 350 hours* |
| Supporting Correlated Subqueries | Core | Advanced | @jayzhan-synnada and @xudong963 | Databases, Algorithms, Data Structures, Testing Techniques | 350 hours |
| Improving DataFusion DX (e.g. 1 and 2) | DX | Medium | @eliaperantoni and @mkarbo | Software Engineering, Terminal Visualizations | 175 to 350 hours* |
| Robust WASM Support | Build | Medium | @alamb and @waynexia | WASM, Advanced Rust, Web Development, Software Engineering | 175 to 350 hours* |
| High Performance Aggregations | Core | Advanced | @jayzhan-synnada and @Rachelint | Algorithms, Data Structures, Advanced Rust, Databases, Benchmarking Techniques | 350 hours |
| Improving Python Bindings | Python Bindings | Medium | @timsaucer | APIs, FFIs, DataFrame Libraries | 175 to 350 hours* |
| Optimizing DataFusion Binary Size | Core/Build | Medium | @comphead and @alamb | Software Engineering, Refactoring, Dependency Management, Compilers | 175 to 350 hours* |
| Ergonomic SQL Features | SQL FE | Medium | @berkaysynnada | SQL, Planning, Parsing, Software Engineering | 350 hours |
| Advanced Interval Analysis | Core | Advanced | @ozankabak and @berkaysynnada | Algorithms, Data Structures, Applied Mathematics, Software Engineering | 350 hours |
| Spark-Compatible Functions Crate | Extensions | Medium | @alamb, @andygrove | SQL, Spark, Software Engineering | 175 to 350 hours* |
| SQL Fuzzing Framework in Rust | Extensions | Advanced | @2010YOUY01 | SQL, Testing Techniques, Advanced Rust, Software Engineering | 175 to 350 hours* |

(*) There is enough material to make this a 350-hour project, but it is granular enough to make it a 175-hour project as well.

Implement Continuous Monitoring of DataFusion Performance

DataFusion lacks continuous monitoring of how performance evolves over time; we do this somewhat manually today. Even though performance has been one of our top priorities for a while now, we haven't built a continuous monitoring system yet. This linked issue contains a summary of all the previous efforts that made us inch closer to having such a system, but a functioning system needs to be built on top of that progress. A student successfully completing this project would gain experience in building an end-to-end monitoring system that integrates with GitHub, scheduling/running benchmarks on some sort of cloud infrastructure, and building a versatile web UI to expose the results. The outcome of this project will benefit Apache DataFusion on an ongoing basis in its quest for ever-more performance.

Supporting Correlated Subqueries

Correlated subqueries are an important SQL feature that enables some users to express their business logic more intuitively without thinking about "joins". Even though DataFusion has decent join support, it doesn't fully support correlated subqueries. The linked epic contains bite-size pieces of the steps necessary to achieve full support. For students interested in internals of data systems and databases, this project is a good opportunity to apply and/or improve their computer science knowledge. The experience of adding such a feature to a widely-used foundational query engine can also serve as a good opportunity to kickstart a career in the area of databases and data systems.
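To make the "subquery versus join" point concrete, here is a minimal sketch in plain Rust (standard library only, not DataFusion code). The `Order` type, the function names, and the query in the comment are illustrative assumptions, not taken from the linked epic. The first function mimics a correlated evaluation that re-runs the inner aggregate for every outer row; the second mimics a decorrelated plan that aggregates once per key and then joins.

```rust
use std::collections::HashMap;

// Hypothetical query being modeled (illustration only):
//   SELECT * FROM orders o
//   WHERE o.amount > (SELECT avg(i.amount) FROM orders i
//                     WHERE i.customer = o.customer)

#[derive(Debug, Clone, PartialEq)]
struct Order {
    customer: u32,
    amount: f64,
}

/// Correlated form: the inner aggregate is recomputed for each outer row,
/// which costs O(n^2) comparisons.
fn above_customer_avg_correlated(orders: &[Order]) -> Vec<Order> {
    orders
        .iter()
        .filter(|outer| {
            let (sum, count) = orders
                .iter()
                .filter(|inner| inner.customer == outer.customer)
                .fold((0.0, 0u32), |(s, c), inner| (s + inner.amount, c + 1));
            // `count` is at least 1 because the outer row matches itself.
            outer.amount > sum / count as f64
        })
        .cloned()
        .collect()
}

/// Decorrelated form: aggregate once per customer, then "join" each row
/// with its customer's aggregate through a hash lookup.
fn above_customer_avg_decorrelated(orders: &[Order]) -> Vec<Order> {
    let mut totals: HashMap<u32, (f64, u32)> = HashMap::new();
    for o in orders {
        let entry = totals.entry(o.customer).or_insert((0.0, 0));
        entry.0 += o.amount;
        entry.1 += 1;
    }
    orders
        .iter()
        .filter(|o| {
            let (sum, count) = totals[&o.customer];
            o.amount > sum / count as f64
        })
        .cloned()
        .collect()
}
```

Both functions return the same rows for the same input; the second only touches each row a constant number of times, which is the kind of rewrite a query engine performs when it turns a correlated subquery into an aggregate plus a join.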

Improving DataFusion DX

While performance, extensibility and customizability are DataFusion's strengths, we have much work to do in terms of user-friendliness and ease of debugging. This project aims to make strides in these areas by improving terminal visualizations of query plans and increasing the "deployment" of the newly-added diagnostics framework. This is a potentially high-impact project with highly visible output, and it would lower the barrier to entry for new users.

Robust WASM Support

DataFusion can be compiled today to WASM with some care. However, it is somewhat tricky and brittle. Having robust WASM support improves the embeddability aspect of DataFusion, and can enable many practical use cases. A good conclusion of this project would be the addition of a live demo sub-page to the DataFusion homepage.

High Performance Aggregations

Aggregation is one of the most fundamental operations within a query engine. Practical performance in many use cases, and results in many well-known benchmarks (e.g. ClickBench), depend heavily on aggregation performance. The DataFusion community has been working on improving aggregation performance for a while now, but there is still work to do. A student working on this project will get the chance to hone their skills in high-performance, low(ish)-level coding, the intricacies of measuring performance, data structures and more.

Improving Python Bindings

DataFusion offers Python bindings that enable users to build data systems using Python. However, the Python bindings are still relatively low-level, and do not expose all the APIs that end-user-focused libraries like Pandas and Polars offer. This project aims to improve DataFusion's Python bindings and move them closer to such libraries in terms of built-in APIs and functionality.

Optimizing DataFusion Binary Size

DataFusion is a foundational library with a large feature set. Even though we try to avoid adding too many dependencies and implement many low-level functionalities inside the codebase, the fast-moving nature of the project results in an accumulation of dependencies over time. This inflates DataFusion's binary size, which reduces portability and embeddability. This project involves studying the codebase with compiler tooling to understand where code bloat comes from, reducing the number of dependencies through efficient in-house implementations, and eliminating code duplication.

Ergonomic SQL Features

DuckDB has many innovative features that significantly improve the SQL UX. Even though some of those features are already implemented in DataFusion, there are many others we can implement (and get inspiration from). This page contains a good summary of such features. Each such feature will serve as a bite-size, achievable milestone for a cool GSoC project that will have user-facing impact improving the UX on a broad basis. The project will start with a survey of what is already implemented, what is missing, and kick off with a prioritization proposal/implementation plan.

Advanced Interval Analysis

DataFusion implements interval arithmetic and utilizes it for range estimations, which enables use cases in data pruning, optimizations and statistics. However, the current implementation only works efficiently for forward evaluation; i.e. calculating the output range of an expression given input ranges (ranges of columns). When propagating constraints using the same graph, the current approach requires multiple bottom-up and top-down traversals to narrow column bounds fully. This project aims to fix this deficiency by utilizing a better algorithmic approach. Note that this is a very advanced project for students with a deep interest in computational methods, expression graphs, and constraint solvers.
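The difference between forward evaluation and constraint propagation can be sketched for a single `a + b` node. The snippet below is a toy model in plain Rust under stated assumptions: the `Bounds` type and `propagate_add` function are hypothetical names for illustration and are not DataFusion's interval API, and only closed `f64` ranges are handled.

```rust
/// A closed range [lo, hi]; a toy stand-in, not DataFusion's interval type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bounds {
    lo: f64,
    hi: f64,
}

impl Bounds {
    fn new(lo: f64, hi: f64) -> Self {
        assert!(lo <= hi, "empty range");
        Bounds { lo, hi }
    }

    /// Forward (bottom-up) evaluation: range of `a + b` from the input ranges.
    fn add(self, other: Bounds) -> Bounds {
        Bounds::new(self.lo + other.lo, self.hi + other.hi)
    }

    /// Range of `a - b`; used to invert an addition during propagation.
    fn sub(self, other: Bounds) -> Bounds {
        Bounds::new(self.lo - other.hi, self.hi - other.lo)
    }

    /// Intersection of two ranges, or `None` if they are disjoint.
    fn intersect(self, other: Bounds) -> Option<Bounds> {
        let lo = self.lo.max(other.lo);
        let hi = self.hi.min(other.hi);
        if lo <= hi {
            Some(Bounds { lo, hi })
        } else {
            None
        }
    }
}

/// Backward (top-down) step for the node `a + b = out`: given a narrowed
/// range for `out`, shrink `a` to lie within `out - b` and `b` to lie
/// within `out - a`. Returns `None` if the constraint is infeasible.
fn propagate_add(a: Bounds, b: Bounds, out: Bounds) -> Option<(Bounds, Bounds)> {
    let new_a = a.intersect(out.sub(b))?;
    let new_b = b.intersect(out.sub(new_a))?;
    Some((new_a, new_b))
}
```

For example, with `a` in [0, 10] and `b` in [5, 20], forward evaluation gives `a + b` in [5, 30]. Adding the constraint `a + b <= 8` narrows the output to [5, 8], and one call to `propagate_add` shrinks `a` to [0, 3] and `b` to [5, 8]. When several nodes share columns, a step like this generally has to be repeated across the graph until no range changes, which is the repeated-traversal cost the project description refers to.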

Spark-Compatible Functions Crate

In general, DataFusion aims to be compatible with PostgreSQL in terms of functions and behaviors. However, there are many users (and downstream projects, such as DataFusion Comet) that desire compatibility with Apache Spark. This project aims to collect Spark-compatible functions into a separate crate to help such users and/or projects. The project will be an exercise in creating the right APIs, explaining how to use them, and then telling the world about them (e.g. via creating a compatibility-tracking page cataloging such functions, writing blog posts etc.).

SQL Fuzzing Framework in Rust

Fuzz testing is a very important technique that we use often in DataFusion. Having SQL-level fuzz testing enables us to battle-test DataFusion in an end-to-end fashion. The initial version of our fuzzing framework is Java-based, but the time has come to migrate to a Rust-native solution. This will simplify the overall implementation (by avoiding things like JDBC), enable us to implement more advanced algorithms for query generation, and attract more contributors over time. This project is a good blend of software engineering, algorithms and testing techniques (i.e. fuzzing techniques).
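As a rough illustration of what seeded query generation can look like in Rust, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: the table `t`, the columns `a`, `b`, `c`, the xorshift generator, and the function names do not come from the existing framework or any planned design.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Pick one item from a non-empty slice.
    fn pick<'a>(&mut self, items: &[&'a str]) -> &'a str {
        items[(self.next() % items.len() as u64) as usize]
    }
}

// Hypothetical schema: a single table `t` with integer columns a, b, c.
const COLUMNS: [&str; 3] = ["a", "b", "c"];
const COMPARISONS: [&str; 4] = ["<", "<=", "=", ">"];

/// Recursively build a random boolean predicate of bounded depth.
fn gen_predicate(rng: &mut Rng, depth: u32) -> String {
    if depth == 0 || rng.next() % 2 == 0 {
        // Leaf: compare a column with a small integer literal.
        let col = rng.pick(&COLUMNS);
        let op = rng.pick(&COMPARISONS);
        let lit = rng.next() % 100;
        format!("{} {} {}", col, op, lit)
    } else {
        // Inner node: combine two sub-predicates with AND or OR.
        let left = gen_predicate(rng, depth - 1);
        let right = gen_predicate(rng, depth - 1);
        let op = rng.pick(&["AND", "OR"]);
        format!("({}) {} ({})", left, op, right)
    }
}

/// Generate one query; the same seed always yields the same query,
/// which keeps failures reproducible.
fn gen_query(seed: u64, max_depth: u32) -> String {
    // xorshift must not start from zero, so clamp the seed.
    let mut rng = Rng(seed.max(1));
    let col = rng.pick(&COLUMNS);
    let pred = gen_predicate(&mut rng, max_depth);
    format!("SELECT {} FROM t WHERE {}", col, pred)
}
```

Calling `gen_query(seed, 3)` over a range of seeds yields a stream of reproducible queries that could then be executed against the engine under test. A real framework would additionally need schema awareness, typed expressions, and an oracle for deciding whether a result is wrong.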
