Project Ideas for GSoC 2025 (Google Summer of Code) #14478

ozankabak · 2025-02-04T09:20:59Z

Context

Apache DataFusion is putting together an application for GSoC 2025. For those who do not know, GSoC is a Google-sponsored program for students passionate about open-source to come in and contribute to cool projects for a duration of 12 weeks over the summer. The main body of this issue will serve as a repository for project ideas for prospective students, should we become a mentoring organization.

Process

As an ex-GSoC student who has been through the process (time flies!), I will be coordinating our application to the program and overseeing the progress on behalf of the PMC. I do not know how many student slots we will get, but in addition to myself, @jayzhan-synnada and @berkaysynnada are volunteering to be mentors to GSoC students and help with their projects. If there is a need for more mentors, I will ask the committers and the PMC for help (and I'm sure we will get as many mentors as necessary 🙂 ), but we have many interesting project ideas and certainly enough people to mentor the students.

Idea Proposals

Let's use the comments section of this issue to discuss and ideate about possible 12-week projects for GSoC students. Note that our aim is to answer the following questions positively as we graduate students from the program:

Did the student learn new skills for building cutting-edge data systems? This involves things like (a) improving their Rust proficiency, (b) extending their knowledge on schedulers and sync/async computations, (c) developing their ability to grasp and implement novel ideas from recent academic papers, and many other practical computer science skills.
Was the student able to complete a non-trivial project that actually integrates with (and is useful within) Apache DataFusion?
Will the student leave the program with enthusiasm to become an active member of the Apache DataFusion community, or at least of another open-source community?

Ideas

TODO: I will be adding some ideas here during the course of this week, and hope to draw inspiration from the community feedback. Let's go! 🚀

Here is a quick summary of the project ideas, along with links to more details:

Project	Category	Difficulty Level	Possible Mentor(s) and/or Helper(s)	Skills
Implement Continuous Monitoring of DataFusion Performance	Tooling	Medium	@alamb and @mertak-synnada	DevOps, Cloud Computing, Web Development, Integrations
Supporting Correlated Subqueries	Core	Advanced	@jayzhan-synnada and @xudong963	Databases, Algorithms, Data Structures, Testing Techniques
Improving DataFusion DX (e.g. 1 and 2)	DX	Medium	@eliaperantoni and @mkarbo	Software Engineering, Terminal Visualizations
Robust WASM Support	Build	Medium	@alamb and @waynexia?	WASM, Advanced Rust, Web Development, Software Engineering
High Performance Aggregations	Core	Advanced	@jayzhan-synnada and @Rachelint	Algorithms, Data Structures, Advanced Rust, Databases, Benchmarking Techniques
Improving Python Bindings	Python Bindings	Medium	@timsaucer	APIs, FFIs, DataFrame Libraries
Optimizing DataFusion Binary Size	Core/Build	Medium	@comphead and @alamb	Software Engineering, Refactoring, Dependency Management, Compilers
Ergonomic SQL Features	SQL FE	Medium	@berkaysynnada	SQL, Planning, Parsing, Software Engineering
Advanced Interval Analysis	Core	Advanced	@ozankabak and @berkaysynnada	Algorithms, Data Structures, Applied Mathematics, Software Engineering
Spark-Compatible Functions Crate	Extensions	Medium	@alamb, @andygrove	SQL, Spark, Software Engineering
SQL Fuzzing Framework in Rust	Extensions	Advanced	@2010YOUY01	SQL, Testing Techniques, Advanced Rust, Software Engineering

Implement Continuous Monitoring of DataFusion Performance

DataFusion lacks continuous monitoring of how performance evolves over time -- we do this somewhat manually today. Even though performance has been one of our top priorities for a while now, we haven't built a continuous monitoring system yet. This linked issue contains a summary of all the previous efforts that made us inch closer to having such a system, but a functioning system needs to be built on top of that progress. A student successfully completing this project would gain experience in building an end-to-end monitoring system that integrates with GitHub, scheduling/running benchmarks on some sort of cloud infrastructure, and building a versatile web UI to expose the results. The outcome of this project will benefit Apache DataFusion on an ongoing basis in its quest for ever-more performance.
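To make the "track performance over time" idea a bit more concrete, here is a minimal sketch of the comparison step such a system might run after each benchmark job. Everything in it is an assumption for illustration: the function name `find_regressions`, the input layout (benchmark name mapped to a list of wall-clock timings in seconds), and the 10% threshold are hypothetical and do not come from any existing DataFusion tooling.

```python
from statistics import median


def find_regressions(baseline, current, threshold=1.10):
    """Return {benchmark: slowdown ratio} for benchmarks that got slower.

    `baseline` and `current` map a benchmark name to a list of timings in
    seconds (hypothetical layout). A benchmark is flagged when its median
    timing grew by more than `threshold` relative to the baseline median.
    """
    regressions = {}
    for name, base_times in baseline.items():
        if name not in current:
            continue  # benchmark was removed or renamed; nothing to compare
        ratio = median(current[name]) / median(base_times)
        if ratio > threshold:
            regressions[name] = round(ratio, 2)
    return regressions


# Made-up timings: q1 is unchanged, q2 is about 25% slower.
baseline = {"q1": [1.0, 1.1, 0.9], "q2": [2.0, 2.0, 2.0]}
current = {"q1": [1.0, 1.0, 1.05], "q2": [2.5, 2.6, 2.4]}
report = find_regressions(baseline, current)
```

Using the median rather than the mean is one simple way to dampen noisy runs on shared cloud machines; a real system would likely need something more robust.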

Supporting Correlated Subqueries

Correlated subqueries are an important SQL feature that enables some users to express their business logic more intuitively without thinking about "joins". Even though DataFusion has decent join support, it doesn't fully support correlated subqueries. The linked epic contains bite-size pieces of the steps necessary to achieve full support. For students interested in internals of data systems and databases, this project is a good opportunity to apply and/or improve their computer science knowledge. The experience of adding such a feature to a widely-used foundational query engine can also serve as a good opportunity to kickstart a career in the area of databases and data systems.
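For readers unfamiliar with the term, the sketch below shows a correlated subquery next to a join-based rewrite that returns the same rows. It uses Python's built-in `sqlite3` module purely as a convenient SQL engine; the `employees` table and its data are made up, and the rewrite is a hand-written illustration, not how DataFusion plans such queries.

```python
import sqlite3

# Hypothetical toy data; table and column names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("ann", "eng", 120),
        ("bob", "eng", 100),
        ("cat", "ops", 90),
        ("dan", "ops", 70),
        ("eve", "ops", 80),
    ],
)

# Correlated form: the inner query references e.dept from the outer query,
# so conceptually it is re-evaluated for every outer row.
CORRELATED = """
    SELECT e.name
    FROM employees AS e
    WHERE e.salary > (
        SELECT AVG(i.salary) FROM employees AS i WHERE i.dept = e.dept
    )
    ORDER BY e.name
"""

# Decorrelated form: compute the per-department average once, then join.
DECORRELATED = """
    SELECT e.name
    FROM employees AS e
    JOIN (
        SELECT dept, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept
    ) AS a ON a.dept = e.dept
    WHERE e.salary > a.avg_salary
    ORDER BY e.name
"""

correlated_rows = conn.execute(CORRELATED).fetchall()
decorrelated_rows = conn.execute(DECORRELATED).fetchall()
```

Both queries return the employees who earn more than their department's average. Supporting correlated subqueries in an engine largely means performing this kind of rewrite automatically, for arbitrary query shapes.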

Improving DataFusion DX

While performance, extensibility and customizability are DataFusion's strong suits, we have much work to do in terms of user-friendliness and debuggability. This project aims to make strides in these areas by improving terminal visualizations of query plans and increasing the "deployment" of the newly-added diagnostics framework. This is a potentially high-impact project with high output visibility, and it would lower the barrier to entry for new users.

Robust WASM Support

DataFusion can be compiled today to WASM with some care. However, it is somewhat tricky and brittle. Having robust WASM support improves the embeddability aspect of DataFusion, and can enable many practical use cases. A good conclusion of this project would be the addition of a live demo sub-page to the DataFusion homepage.

High Performance Aggregations

Aggregation is one of the most fundamental operations within a query engine. Practical performance in many use cases, and results in many well-known benchmarks (e.g. ClickBench), depend heavily on aggregation performance. The DataFusion community has been working on improving aggregation performance for a while now, but there is still work to do. A student working on this project will get the chance to hone their skills in high-performance, low(ish)-level coding, the intricacies of measuring performance, data structures and more.
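As a reminder of what the operation looks like at its simplest, here is a textbook hash-based grouped aggregation processed batch by batch. This is a generic sketch, not DataFusion's implementation; the `Batch` layout and the function name `hash_aggregate_avg` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Batch = List[Tuple[str, float]]  # (group key, value) rows; hypothetical layout


def hash_aggregate_avg(batches: Iterable[Batch]) -> Dict[str, float]:
    """Compute AVG(value) GROUP BY key over a stream of row batches."""
    # One accumulator per group: [running sum, row count].
    state = defaultdict(lambda: [0.0, 0])
    for batch in batches:
        for key, value in batch:
            acc = state[key]
            acc[0] += value
            acc[1] += 1
    # Finalize: turn each accumulator into the aggregate result.
    return {key: total / count for key, (total, count) in state.items()}


result = hash_aggregate_avg([[("a", 1.0), ("b", 4.0)], [("a", 3.0)]])
```

The performance work discussed in this project is about everything this sketch ignores: how keys are hashed and stored, how accumulators are laid out in memory, and how whole batches are processed at once instead of row by row.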

Improving Python Bindings

DataFusion offers Python bindings that enable users to build data systems using Python. However, the Python bindings are still relatively low-level, and do not expose all the APIs that end-user-focused libraries like Pandas and Polars offer. This project aims to bring DataFusion's Python bindings closer to such libraries in terms of built-in APIs and functionality.

Optimizing DataFusion Binary Size

DataFusion is a foundational library with a large feature set. Even though we try to avoid adding too many dependencies and implement many low-level functionalities inside the codebase, the fast-moving nature of the project results in an accumulation of dependencies over time. This inflates DataFusion's binary size, which reduces portability and embeddability. This project involves studying the codebase with compiler tooling to understand where code bloat comes from, reducing the number of dependencies by replacing them with efficient in-house implementations, and eliminating code duplication.

Ergonomic SQL Features

DuckDB has many innovative features that significantly improve the SQL UX. Even though some of those features are already implemented in DataFusion, there are many others we can implement (and get inspiration from). This page contains a good summary of such features. Each such feature will serve as a bite-size, achievable milestone for a cool GSoC project that will have user-facing impact improving the UX on a broad basis. The project will start with a survey of what is already implemented, what is missing, and kick off with a prioritization proposal/implementation plan.

Advanced Interval Analysis

DataFusion implements interval arithmetic and utilizes it for range estimations, which enables use cases in data pruning, optimizations and statistics. However, the current implementation only works efficiently for forward evaluation; i.e. calculating the output range of an expression given input ranges (ranges of columns). When propagating constraints using the same graph, the current approach requires multiple bottom-up and top-down traversals to narrow column bounds fully. This project aims to fix this deficiency by utilizing a better algorithmic approach. Note that this is a very advanced project for students with a deep interest in computational methods, expression graphs, and constraint solvers.
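The distinction between forward evaluation and constraint propagation can be sketched on a single `x + y` node. The code below is a minimal illustration under stated assumptions: the `Interval` class and `propagate_sum` function are hypothetical names, not DataFusion's API, and a real expression graph has many nodes, which is where repeated traversals become costly.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Forward evaluation of x + y: add the matching bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Forward evaluation of x - y: subtract the opposite bounds.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def intersect(self, other):
        lo, hi = max(self.lo, other.lo), min(self.hi, other.hi)
        if lo > hi:
            raise ValueError("empty intersection: constraint is infeasible")
        return Interval(lo, hi)


def propagate_sum(x, y, target):
    """Narrow x and y given the constraint that x + y lies within target."""
    # Bottom-up step: evaluate the node, then apply the constraint to it.
    total = (x + y).intersect(target)
    # Top-down step: x = total - y and y = total - x, intersected with
    # what was already known about each input.
    new_x = x.intersect(total - y)
    new_y = y.intersect(total - new_x)
    return new_x, new_y, total


x, y = Interval(1, 5), Interval(2, 3)
forward = x + y
narrowed = propagate_sum(x, y, Interval(-math.inf, 5))  # constraint: x + y <= 5
```

Here the forward pass gives `x + y` in [3, 8], and propagating `x + y <= 5` back down narrows `x` from [1, 5] to [1, 3] while `y` stays at [2, 3]. With one node a single up-and-down pass suffices; in a larger graph, narrowing one column can enable further narrowing elsewhere, hence the multiple traversals mentioned above.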

Spark-Compatible Functions Crate

In general, DataFusion aims to be compatible with PostgreSQL in terms of functions and behaviors. However, there are many users (and downstream projects, such as DataFusion Comet) that desire compatibility with Apache Spark. This project aims to collect Spark-compatible functions into a separate crate to help such users and/or projects. The project will be an exercise in creating the right APIs, explaining how to use them, and then telling the world about them (e.g. via creating a compatibility-tracking page cataloging such functions, writing blog posts etc.).

SQL Fuzzing Framework in Rust

Fuzz testing is a very important technique we utilize often in DataFusion. Having SQL-level fuzz testing enables us to battle-test DataFusion in an end-to-end fashion. The initial version of our fuzzing framework is Java-based, but the time has come to migrate to a Rust-native solution. This will simplify the overall implementation (by avoiding things like JDBC), enable us to implement more advanced algorithms for query generation, and attract more contributors over time. This project is a good blend of software engineering, algorithms and testing techniques (i.e. fuzzing techniques).
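To give a flavor of SQL-level fuzzing, here is a small differential-testing sketch: it generates random `WHERE` predicates, runs them against an engine, and compares the row counts with a plain Python evaluation of the same predicate. All of it is an assumption for illustration: SQLite (via Python's built-in `sqlite3`) merely stands in for the engine under test, the helper names `random_predicate` and `run_fuzz` are hypothetical, and this is not a description of how the existing fuzzer works.

```python
import random
import sqlite3

COLUMNS = ("a", "b")
OPS = ("<", "<=", "=", ">=", ">")


def random_predicate(rng, depth=0):
    """Return (sql_text, python_fn) for a random boolean predicate."""
    if depth < 2 and rng.random() < 0.5:
        left_sql, left_fn = random_predicate(rng, depth + 1)
        right_sql, right_fn = random_predicate(rng, depth + 1)
        if rng.random() < 0.5:
            return (f"({left_sql} AND {right_sql})",
                    lambda row: left_fn(row) and right_fn(row))
        return (f"({left_sql} OR {right_sql})",
                lambda row: left_fn(row) or right_fn(row))
    col = rng.choice(COLUMNS)
    op = rng.choice(OPS)
    const = rng.randint(0, 9)
    # Reference semantics for each comparison operator.
    compare = {
        "<": lambda v: v < const,
        "<=": lambda v: v <= const,
        "=": lambda v: v == const,
        ">=": lambda v: v >= const,
        ">": lambda v: v > const,
    }[op]
    return f"{col} {op} {const}", lambda row: compare(row[col])


def run_fuzz(seed, iterations=50):
    """Return the SQL predicates whose engine result differs from the oracle."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    rows = [{"a": rng.randint(0, 9), "b": rng.randint(0, 9)} for _ in range(20)]
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    conn.executemany("INSERT INTO t VALUES (:a, :b)", rows)
    mismatches = []
    for _ in range(iterations):
        sql, fn = random_predicate(rng)
        got = conn.execute(f"SELECT COUNT(*) FROM t WHERE {sql}").fetchone()[0]
        expected = sum(1 for row in rows if fn(row))
        if got != expected:
            mismatches.append(sql)
    return mismatches


mismatches = run_fuzz(seed=0)
```

The data deliberately contains no NULLs, so two-valued Python logic is a valid oracle here. A real framework has to deal with NULL semantics, many more expression types, and smarter oracles than an independent reimplementation.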

Omega359 · 2025-02-04T09:48:42Z

I do not know what level students would be for GSoC; however, @alamb had some ideas for CMU class projects at the university level.

ozankabak · 2025-02-04T10:14:18Z

IIRC, GSoC students can be undergraduates or graduate students. The main difference between CMU class projects and GSoC projects is that the former will be optimizer-focused, while the latter can be anything. Probably we will collect optimizer-related ideas in the CMU bucket, with the remaining falling in the GSoC bucket.

mkarbo · 2025-02-04T14:21:13Z

We are happy to provide input and feedback to anyone looking to get involved with #14429

ozankabak · 2025-02-04T14:26:08Z

Thank you @mkarbo! Can I assume @eliaperantoni would help too?

Some other CTA:

@timsaucer, would you be willing to mentor a student to help us improve Python bindings?
@alamb, can you help mentor students who would be interested in improving aggregation performance or correlated subqueries? I think @jayzhan-synnada can also help with mentoring aggregation performance work as he has spent time on it before as well.
@waynexia, can you help mentor a student working on improving WASM support?

eliaperantoni · 2025-02-04T14:27:55Z

Can I assume @eliaperantoni would help too?

Absolutely :) I'm actively working on this or related features e.g. #14439

timsaucer · 2025-02-04T14:49:50Z

@ozankabak Yes, I will always try to make time for students.

ozankabak · 2025-02-04T16:11:56Z

@mertak-synnada has been working on benchmarking Synnada's fork, so he may be a good co-mentor (along with someone from InfluxData) for the continuous monitoring project.

ozankabak · 2025-02-05T07:04:02Z

@comphead, would you be willing to mentor a student on a project to study our codebase and dependencies to reduce DF binary size?

berkaysynnada · 2025-02-05T13:47:17Z

I found three DuckDB posts related to these ideas. Perhaps they can provide some insights:

Project #5504 -> https://duckdb.org/2024/06/26/benchmarks-over-time.html
Project https://github.com/apache/datafusion-python -> https://duckdb.org/2023/07/07/python-udf.html
Project #5483 -> https://duckdb.org/2023/05/26/correlated-subqueries-in-sql.html

comphead · 2025-02-05T15:40:03Z

@comphead, would you be willing to mentor a student on a project to study our codebase and dependencies to reduce DF binary size?

Hi @ozankabak I would love to.

alamb · 2025-02-05T15:49:53Z

This is a great list -- thank you @ozankabak and everyone for the ideas

Idea proposal:

Note that our aim is to answer the following questions positively as we graduate students from the program:

I would like to suggest an additional goal which is:

Create a public written artifact (DataFusion blog post) explaining the project and why it is great. As an example of what I have in mind, check out the writeups by @XiangpengHao (our intern from last year): Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet and How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?

Projects I would be willing (and happy!) to help mentor:

✅ Run DataFusion benchmarks regularly and track performance history over time #5504 this is a great project and long overdue in my mind
✅ [EPIC] A collection of tickets for improved WASM support in DataFusion #13815 (this would also be great)
✅ Datafusion binary size has been getting bigger #13816

Probably Not: Aggregation Performance

@alamb, can you help mentor students who would be interested in improving aggregation performance or correlated subqueries? I think @jayzhan-synnada can also help with mentoring aggregation performance work as he has spent time on it before as well.

I would probably decline advising aggregation performance as an intern project as I don't think it would be a good intern experience (unless it was an exceptional intern -- see below). The code is already quite highly optimized and I don't have any simple ideas to make it faster (though maybe @Rachelint does). Any changes here must be made quite carefully to avoid regressions

The ideal candidate would be someone already with very strong Rust and low level optimization experience (we can teach them the needed database internals).

Probably Not: Correlated Subqueries

For this project:

[EPIC] More Subquery support #5483

For this project, I think a successful candidate (and the kind of person I would be happy to help mentor) is a graduate student who has a background in query optimizers (for example, can explain clearly what Unnesting Arbitrary Queries is about).

This isn't a great project for someone who doesn't already have a deep understanding of queries, join graphs, and subqueries, as it will take most of the summer just to understand what we are trying to do.

alamb · 2025-02-05T15:51:07Z

Another potential project that I think would be huge and very much in the realm is Spark Functions

[DISCUSSION] Add separate crate to cover spark builtin functions #5600

This set of functions is critical to many DataFusion users and the code already largely exists. This would be an exercise in creating the right APIs, explaining how to use them, and then telling the world about them

ozankabak · 2025-02-05T18:20:47Z

@alamb, thank you for sharing your thoughts. I added a difficulty column to guide students as they browse the projects. I agree that the "Advanced" level projects are either for graduate students or students with strong enthusiasm in this area. At least in my time, the student body included both kinds of students, in addition to more junior ones.

I updated the table to list you as a mentor for the projects you mentioned in your first comment. I will add your proposed project to the list as well. Do you have any recommendations for possible mentors for the correlated subquery project? We can help, but it would be better to find a mentor (or a co-mentor) for it.

@Rachelint, would you be willing to mentor a student who wants to work on high performance aggregations?

ozankabak · 2025-02-05T18:55:24Z

@andygrove, would you be willing to mentor a student if they choose to work on the Spark-compatible functions crate?

Rachelint · 2025-02-05T19:28:15Z

@ozankabak Willing to offer help, but I may not have enough bandwidth to be a main mentor (I will be busy for the next few months finding a new job...)

I agree with @alamb: in-memory aggregation is highly optimized now.

In my opinion, there may be more work we can do on improving larger-than-memory aggregation. @2010YOUY01 may also be interested in this.

ozankabak · 2025-02-05T19:38:57Z

Thanks @Rachelint -- you can be a co-mentor if you like.

@2010YOUY01, I'm not up-to-date with our status on larger-than-memory aggregation. If you think you can enrich the epic and divide it into small tasks, we can create a GSoC project out of it and you can mentor it.

XiangpengHao · 2025-02-05T19:39:32Z

Thanks for bringing me here @alamb. I'm also happy to help mentor students on performance-related projects (although I also seem to qualify for GSoC myself lol); larger-than-memory aggregation/join seems particularly interesting to me.

Rachelint · 2025-02-05T19:43:47Z

Thanks @Rachelint -- you can be a co-mentor if you like.

@2010YOUY01, I'm not up-to-date with our status on larger-than-memory aggregation. If you think you can enrich the epic and divide it into small tasks, we can create a GSoC project out of it and you can mentor it.

Thanks, I would like to.

ozankabak · 2025-02-05T19:47:52Z

Thanks for bringing me here @alamb. I'm also happy to help mentor students on performance-related projects (although I also seem to qualify for GSoC myself lol); larger-than-memory aggregation/join seems particularly interesting to me.

Great. As I said in my previous comment, if we divide the larger-than-memory aggregations epic into smaller tasks, it can be a great project. Maybe you can collaborate with @2010YOUY01 and be co-mentors if he is also interested.

alamb · 2025-02-05T23:05:58Z

Another possible project that would be neat would be to implement variant support in parquet and DataFusion

There is not a ticket yet in DataFusion that I know of but there is one for parquet in arrow-rs

[Parquet] Implement Variant type support in Parquet arrow-rs#6736

Variant is like BSON/JSON

2010YOUY01 · 2025-02-06T04:32:13Z

Thanks @Rachelint -- you can be a co-mentor if you like.

@2010YOUY01, I'm not up-to-date with our status on larger-than-memory aggregation. If you think you can enrich the epic and divide it into small tasks, we can create a GSoC project out of it and you can mentor it.

@ozankabak Thank you for putting this together.

Enhancing external aggregation/join is not a 100% clearly defined project from my perspective. This feature is not very mature, and we want to follow the path of: fixing existing bugs -> more testing -> benchmarking and improving performance.
I'm happy to be a helper (with @Rachelint and @XiangpengHao ), instead of a major mentor on this project

Besides, there are many well-defined tasks in our SQL fuzzer #11030 and I'm interested to mentor. I'll open an issue for GSOC project later

ozankabak · 2025-02-06T06:24:44Z

Besides, there are many well-defined tasks in our SQL fuzzer #11030 and I'm interested to mentor. I'll open an issue for GSOC project later.

@2010YOUY01, this is great -- let's create some sub-issues under the main issue you linked, and I will add it as a project idea with you as the mentor.

ozankabak · 2025-02-06T06:30:15Z

@XiangpengHao, do you think you can take a look at how we can divide up the work of larger-than-memory aggregations into smaller tasks? If this is possible, we can make it a project with you as the mentor.

andygrove · 2025-02-06T13:55:59Z

@andygrove, would you be willing to mentor a student if they choose to work on the Spark-compatible functions crate?

Yes, I would be happy to.

xudong963 · 2025-02-06T14:38:24Z

Probably Not: Correlated Subqueries

For this project:

[EPIC] More Subquery support #5483

For this project, I think a successful candidate (and the kind of person I would be happy to help mentor) is a graduate student who has a background in query optimizers (for example, can explain clearly what Unnesting Arbitrary Queries is about).

This isn't a great project for someone who doesn't already have a deep understanding of queries, join graphs, and subqueries, as it will take most of the summer just to understand what we are trying to do.

I have a strong background in Correlated Subqueries and have fully implemented the paper Unnesting Arbitrary Queries in practice, but I don't have enough time to mentor a student, so I'm happy to be a helper if someone wants to mentor a student.

ozankabak · 2025-02-06T15:27:17Z

Great, thanks @xudong963 - I will add you as a co-mentor/helper along with Jay

XiangpengHao · 2025-02-06T16:24:41Z

@XiangpengHao, do you think you can take a look at how we can divide up the work of larger-than-memory aggregations into smaller tasks? If this is possible, we can make it a project with you as the mentor.

Sorry I'm quite tied up with other things these days, but I can co-mentor the project if anyone takes the lead to write the proposal.

ozankabak · 2025-02-06T18:11:53Z

No worries - I will list you as a co-mentor if we find someone to write up the project and be a co-mentor alongside you

2010YOUY01 · 2025-02-07T04:44:18Z

Besides, there are many well-defined tasks in our SQL fuzzer #11030 and I'm interested to mentor. I'll open an issue for GSOC project later.

@2010YOUY01, this is great -- let's create some sub-issues under the main issue you linked, and I will add it as a project idea with you as the mentor.

@ozankabak I have created #14535 as a project idea

ozankabak · 2025-02-07T06:01:05Z

@2010YOUY01 -- fantastic, added the project idea to the list.

edmondop · 2025-02-08T00:14:08Z

I would love to help with "Improving Python Bindings"

berkaysynnada mentioned this issue Feb 5, 2025

[EPIC] DuckDB-Inspired Feature Enhancements #14514

Open

5 tasks

alamb changed the title ~~Project Ideas for GSoC 2025~~ Project Ideas for GSoC 2025 (Google Summer of Code) Feb 5, 2025

2010YOUY01 mentioned this issue Feb 7, 2025

Rewrite datafusion-sqlancer in Rust #14535

Open
