Project Ideas for GSoC 2025 (Google Summer of Code) #14478
Comments
I do not know what level the students would be for GSoC; however, @alamb had some ideas for CMU class projects at the university level |
IIRC, GSoC students can be undergraduates or graduate students. The main difference between CMU class projects and GSoC projects is that the former will be optimizer-focused, while the latter can be anything. Probably we will collect optimizer-related ideas in the CMU bucket, with the remaining falling in the GSoC bucket. |
We are happy to provide input and feedback to anyone looking to get involved with #14429 |
Thank you @mkarbo! Can I assume @eliaperantoni would help too? Some other CTA:
|
Absolutely :) I'm actively working on this or related features e.g. #14439 |
@ozankabak Yes, I will always try to make time for students. |
@mertak-synnada has been working on benchmarking Synnada's fork, so he may be a good co-mentor (along with someone from InfluxData) for the continuous monitoring project. |
@comphead, would you be willing to mentor a student on a project to study our codebase and dependencies to reduce DF binary size? |
I found three posts from DuckDB related to these ideas. Perhaps they can provide some insights: Project #5504 -> https://duckdb.org/2024/06/26/benchmarks-over-time.html |
Hi @ozankabak I would love to. |
This is a great list -- thank you @ozankabak and everyone for the ideas.
Idea proposal:
I would like to suggest an additional goal which is:
Projects I would be willing (and happy!) to help mentor:
Probably Not: Aggregation Performance
I would probably decline advising aggregation performance as an intern project as I don't think it would be a good intern experience (unless it was an exceptional intern -- see below). The code is already quite highly optimized and I don't have any simple ideas to make it faster (though maybe @Rachelint does). Any changes here must be made quite carefully to avoid regressions. The ideal candidate would be someone who already has very strong Rust and low-level optimization experience (we can teach them the needed database internals).
Probably Not: Correlated Subqueries
For this project, I think a successful candidate (and the kind of person I would be happy to help mentor) is a graduate student with a background in query optimizers (for example, someone who can explain clearly what Unnesting Arbitrary Queries is about). This isn't a great project for someone who doesn't already have a deep understanding of queries, join graphs, and subqueries, as it will take most of the summer just to understand what we are trying to do. |
Another potential project that I think would be huge and very much in the realm is Spark Functions. This set of functions is critical to many DataFusion users and the code already largely exists. This would be an exercise in creating the right APIs, explaining how to use them, and then telling the world about them. |
@alamb, thank you for sharing your thoughts. I added a difficulty column to guide students as they browse the projects. I agree that the "Advanced" level projects are either for graduate students or students with strong enthusiasm in this area. At least in my time, the student body included both kinds of students, in addition to more junior ones. I updated the table to list you as a mentor for the projects you mentioned in your first comment. I will add your proposed project to the list as well. Do you have any recommendations for possible mentors for the correlated subquery project? We can help, but it would be better to find a mentor (or a co-mentor) for it. @Rachelint, would you be willing to mentor a student who wants to work on high performance aggregations? |
@andygrove, would you be willing to mentor a student if they choose to work on the Spark-compatible functions crate? |
@ozankabak Willing to offer help, but I may not have enough bandwidth to be a main mentor (busy for the next few months finding a new job...). I agree with @alamb, in-memory aggregation is highly optimized now. In my opinion, there may be more work we can do on improving larger-than-memory aggregation. @2010YOUY01 may also be interested in this. |
Thanks @Rachelint -- you can be a co-mentor if you like. @2010YOUY01, I'm not up-to-date with our status on larger-than-memory aggregation. If you think you can enrich the epic and divide it into small tasks, we can create a GSoC project out of it and you can mentor it. |
Thanks for bringing me here @alamb. I'm also happy to help mentor students on performance-related projects (although I also seem to qualify for GSoC myself lol); larger-than-memory aggregation/join seems particularly interesting to me. |
Thanks, I would like to. |
Great. As I said in my previous comment, if we divide the larger-than-memory aggregations epic into smaller tasks, it can be a great project. Maybe you can collaborate with @2010YOUY01 and be co-mentors if he is also interested. |
Another possible project that would be neat would be to implement variant support in Parquet and DataFusion. There is not a ticket yet in DataFusion that I know of, but there is one for Parquet in arrow-rs. Variant is like BSON/JSON. |
@ozankabak Thank you for putting this together. Enhancing external aggregation/join is not a 100% clearly defined project from my perspective. This feature is not very mature, and we want to follow the path of: fixing existing bugs -> more testing -> benchmarking and improving performance. Besides, there are many well-defined tasks in our SQL fuzzer #11030 and I'm interested in mentoring them. I'll open an issue for the GSoC project later. |
@2010YOUY01, this is great -- let's create some sub-issues under the main issue you linked, and I will add it as a project idea with you as the mentor. |
@XiangpengHao, do you think you can take a look at how we can divide up the work of larger-than-memory aggregations into smaller tasks? If this is possible, we can make it a project with you as the mentor. |
Yes, I would be happy to. |
I have a strong background in Correlated Subqueries and have fully implemented the paper Unnesting Arbitrary Queries in practice, but I don't have enough time to mentor a student, so I'm happy to be a helper if someone wants to mentor a student. |
Great, thanks @xudong963 - I will add you as a co-mentor/helper along with Jay |
Sorry I'm quite tied up with other things these days, but I can co-mentor the project if anyone takes the lead to write the proposal. |
No worries - I will list you as a co-mentor if we find someone to write up the project and be a co-mentor alongside you |
@ozankabak I have created #14535 as a project idea |
@2010YOUY01 -- fantastic, added the project idea to the list. |
I would love to help with "Improving Python Bindings" |
Context
Apache DataFusion is putting together an application for GSoC 2025. For those who do not know, GSoC is a Google-sponsored program for students passionate about open-source to come in and contribute to cool projects for a duration of 12 weeks over the summer. The main body of this issue will serve as a repository for project ideas for prospective students, should we become a mentoring organization.
Process
As an ex-GSoC student who has been through the process (time flies!), I will be coordinating our application to the program and overseeing the progress on behalf of the PMC. I do not know how many student slots we will get, but in addition to myself, @jayzhan-synnada and @berkaysynnada are volunteering to be mentors to GSoC students and help with their projects. If there is a need for more mentors, I will ask the committers and the PMC for help (and I'm sure we will get as many mentors as necessary 🙂 ), but we have many interesting project ideas and certainly enough people to mentor the students.
Idea Proposals
Let's use the comments section of this issue to discuss and ideate about possible 12-week projects for GSoC students. Note that our aim is to answer the following questions positively as we graduate students from the program:
Ideas
TODO: I will be adding some ideas here during the course of this week, and hope to draw inspiration from the community feedback. Let's go! 🚀
Here is a quick summary of the project ideas, along with links to more details:
Implement Continuous Monitoring of DataFusion Performance
DataFusion lacks continuous monitoring of how performance evolves over time -- we do this somewhat manually today. Even though performance has been one of our top priorities for a while now, we haven't built a continuous monitoring system yet. This linked issue contains a summary of all the previous efforts that made us inch closer to having such a system, but a functioning system needs to be built on top of that progress. A student successfully completing this project would gain experience in building an end-to-end monitoring system that integrates with GitHub, scheduling/running benchmarks on some sort of cloud infrastructure, and building a versatile web UI to expose the results. The outcome of this project will benefit Apache DataFusion on an ongoing basis in its quest for ever-more performance.
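To make the core check of such a system concrete, here is a minimal sketch of the regression-detection step only. Everything in it is an assumption for illustration: the `find_regressions` name, the dictionary layout of the timings, the use of medians, and the 10% threshold are hypothetical and do not come from any existing DataFusion tooling.

```python
from statistics import median


def find_regressions(baseline, candidate, threshold=0.10):
    """Return {benchmark: relative slowdown} for benchmarks that got slower.

    `baseline` and `candidate` map a benchmark name to a list of run times
    (hypothetical layout). A benchmark is flagged when its median run time
    grew by more than `threshold` (0.10 means 10%).
    """
    regressions = {}
    for name, base_runs in baseline.items():
        if name not in candidate:
            continue  # benchmark missing in the new run; nothing to compare
        base = median(base_runs)
        new = median(candidate[name])
        change = (new - base) / base
        if change > threshold:
            regressions[name] = round(change, 3)
    return regressions


# Made-up timings in milliseconds, purely for illustration.
baseline = {"q1": [100, 102, 98], "q2": [200, 210, 205]}
candidate = {"q1": [101, 99, 100], "q2": [250, 245, 260]}
slower = find_regressions(baseline, candidate)
```

A real system would also have to deal with noisy runners, store results over time, and report back to GitHub; the sketch covers none of that.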
Supporting Correlated Subqueries
Correlated subqueries are an important SQL feature that enables some users to express their business logic more intuitively without thinking about "joins". Even though DataFusion has decent join support, it doesn't fully support correlated subqueries. The linked epic contains bite-size pieces of the steps necessary to achieve full support. For students interested in internals of data systems and databases, this project is a good opportunity to apply and/or improve their computer science knowledge. The experience of adding such a feature to a widely-used foundational query engine can also serve as a good opportunity to kickstart a career in the area of databases and data systems.
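For readers new to the topic, here is a small sketch of what "expressing logic without thinking about joins" means, and what an equivalent join-based form looks like. It uses Python's built-in `sqlite3` module purely as a convenient SQL engine; the `orders` table and its data are made up, and the rewrite shown is one hand-written decorrelation of one query shape, not DataFusion's planner logic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES
        (1, 10, 50.0), (2, 10, 150.0),
        (3, 20, 30.0), (4, 20, 40.0),
        (5, 30, 75.0);
    """
)

# Correlated form: the inner query refers to the outer row (o.customer_id).
correlated = """
    SELECT o.id FROM orders o
    WHERE o.amount > (SELECT AVG(i.amount) FROM orders i
                      WHERE i.customer_id = o.customer_id)
    ORDER BY o.id
"""

# Decorrelated form: compute the per-customer average once, then join on it.
decorrelated = """
    SELECT o.id FROM orders o
    JOIN (SELECT customer_id, AVG(amount) AS avg_amount
          FROM orders GROUP BY customer_id) a
      ON a.customer_id = o.customer_id
    WHERE o.amount > a.avg_amount
    ORDER BY o.id
"""

correlated_ids = [row[0] for row in conn.execute(correlated)]
decorrelated_ids = [row[0] for row in conn.execute(decorrelated)]
```

Both queries select the orders whose amount exceeds the average for the same customer. Full support means performing this kind of rewrite automatically and correctly for arbitrary query shapes, which is much harder than this single hand-picked case.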
Improving DataFusion DX
While performance, extensibility and customizability are DataFusion's strong aspects, we have much work to do in terms of user-friendliness and ease of debugging. This project aims to make strides in these areas by improving terminal visualizations of query plans and increasing the "deployment" of the newly-added diagnostics framework. This is a potentially high-impact project with high output visibility, and it would reduce the barrier to entry for new users.
Robust WASM Support
DataFusion can be compiled today to WASM with some care. However, it is somewhat tricky and brittle. Having robust WASM support improves the embeddability aspect of DataFusion, and can enable many practical use cases. A good conclusion of this project would be the addition of a live demo sub-page to the DataFusion homepage.
High Performance Aggregations
Aggregation is one of the most fundamental operations within a query engine. Practical performance in many use cases, and results in many well-known benchmarks (e.g. ClickBench), depend heavily on aggregation performance. The DataFusion community has been working on improving aggregation performance for a while now, but there is still work to do. A student working on this project will get the chance to hone their skills in high-performance, low(ish) level coding, the intricacies of measuring performance, data structures and more.
Improving Python Bindings
DataFusion offers Python bindings that enable users to build data systems using Python. However, the Python bindings are still relatively low-level, and do not expose all the APIs that end-user-focused libraries like Pandas and Polars offer. This project aims to improve DataFusion's Python bindings and move them closer to such libraries in terms of built-in APIs and functionality.
Optimizing DataFusion Binary Size
DataFusion is a foundational library with a large feature set. Even though we try to avoid adding too many dependencies and implement many low-level functionalities inside the codebase, the fast-moving nature of the project results in an accumulation of dependencies over time. This inflates DataFusion's binary size, which reduces portability and embeddability. This project involves a study of the codebase, using compiler tooling, to understand where code bloat comes from, simplifying/reducing the number of dependencies through efficient in-house implementations, and avoiding code duplication.
Ergonomic SQL Features
DuckDB has many innovative features that significantly improve the SQL UX. Even though some of those features are already implemented in DataFusion, there are many others we can implement (and get inspiration from). This page contains a good summary of such features. Each such feature will serve as a bite-size, achievable milestone for a cool GSoC project that will have user-facing impact improving the UX on a broad basis. The project will start with a survey of what is already implemented, what is missing, and kick off with a prioritization proposal/implementation plan.
Advanced Interval Analysis
DataFusion implements interval arithmetic and utilizes it for range estimations, which enables use cases in data pruning, optimizations and statistics. However, the current implementation only works efficiently for forward evaluation; i.e. calculating the output range of an expression given input ranges (ranges of columns). When propagating constraints using the same graph, the current approach requires multiple bottom-up and top-down traversals to narrow column bounds fully. This project aims to fix this deficiency by utilizing a better algorithmic approach. Note that this is a very advanced project for students with a deep interest in computational methods, expression graphs, and constraint solvers.
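As a rough illustration of the two directions mentioned above, here is a self-contained sketch for a single `x + y` node. The `Interval` class and `propagate_sum` helper are hypothetical names written for this example and are not DataFusion's API; the sketch does one bottom-up pass and one top-down pass, which is the basic building block that a larger expression graph would have to repeat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Forward evaluation of x + y: add the matching bounds.
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Forward evaluation of x - y: smallest minus largest, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def intersect(self, other):
        lo, hi = max(self.lo, other.lo), min(self.hi, other.hi)
        return Interval(lo, hi) if lo <= hi else None


def propagate_sum(x, y, target):
    """Narrow x and y given the constraint that x + y lies in `target`.

    Returns the narrowed (x, y) pair, or None if the constraint is infeasible.
    """
    total = (x + y).intersect(target)  # bottom-up pass
    if total is None:
        return None
    new_x = x.intersect(total - y)  # top-down: x = total - y
    new_y = y.intersect(total - x)  # top-down: y = total - x
    if new_x is None or new_y is None:
        return None
    return new_x, new_y


x, y = Interval(1, 5), Interval(2, 4)
forward = x + y
# Constraint x + y <= 5, written as the target interval (-inf, 5].
narrowed = propagate_sum(x, y, Interval(float("-inf"), 5))
```

Here forward evaluation gives `[3, 9]` for `x + y`, and propagating `x + y <= 5` narrows `x` to `[1, 3]` while `y` stays at `[2, 4]`. With deeper graphs, a narrowing at one node can enable further narrowing elsewhere, which is why repeated traversals are needed to reach the tightest bounds.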
Spark-Compatible Functions Crate
In general, DataFusion aims to be compatible with PostgreSQL in terms of functions and behaviors. However, there are many users (and downstream projects, such as DataFusion Comet) that desire compatibility with Apache Spark. This project aims to collect Spark-compatible functions into a separate crate to help such users and/or projects. The project will be an exercise in creating the right APIs, explaining how to use them, and then telling the world about them (e.g. via creating a compatibility-tracking page cataloging such functions, writing blog posts etc.).
SQL Fuzzing Framework in Rust
Fuzz testing is a very important technique we utilize often in DataFusion. Having SQL-level fuzz testing enables us to battle-test DataFusion in an end-to-end fashion. The initial version of our fuzzing framework is Java-based, but the time has come to migrate to a Rust-native solution. This will simplify the overall implementation (by avoiding things like JDBC), enable us to implement more advanced algorithms for query generation, and attract more contributors over time. This project is a good blend of software engineering, algorithms and testing techniques (i.e. fuzzing techniques).
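To give a feel for what SQL-level fuzzing involves, here is a toy sketch of random query generation plus a correctness oracle. It is not the existing framework and makes several assumptions for illustration: it runs against Python's built-in `sqlite3` instead of DataFusion, the `random_predicate` and `run_fuzz` helpers are hypothetical, and the oracle is a simple partitioning check (every row must satisfy exactly one of `p`, `NOT p`, or `p IS NULL`).

```python
import random
import sqlite3

COLUMNS = ["a", "b"]
OPS = ["<", "<=", "=", "<>", ">", ">="]


def random_predicate(rng, depth=2):
    """Build a random boolean expression over the columns of table t."""
    if depth == 0 or rng.random() < 0.4:
        return f"{rng.choice(COLUMNS)} {rng.choice(OPS)} {rng.randint(-3, 3)}"
    joiner = rng.choice(["AND", "OR"])
    left = random_predicate(rng, depth - 1)
    right = random_predicate(rng, depth - 1)
    return f"({left} {joiner} {right})"


def count(conn, where):
    return conn.execute(f"SELECT COUNT(*) FROM t WHERE {where}").fetchone()[0]


def run_fuzz(seed, iterations=50):
    """Return (row count, list of predicates that violated the oracle)."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    values = [None, *range(-3, 4)]  # include NULLs to exercise 3-valued logic
    rows = [(rng.choice(values), rng.choice(values)) for _ in range(20)]
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

    failures = []
    for _ in range(iterations):
        p = random_predicate(rng)
        parts = (
            count(conn, p)
            + count(conn, f"NOT ({p})")
            + count(conn, f"({p}) IS NULL")
        )
        if parts != total:
            failures.append(p)
    return total, failures


total, failures = run_fuzz(seed=42)
```

The oracle needs no second engine to compare against: if the three partitions do not add up to the table size, the engine under test evaluated at least one of the queries incorrectly.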