EPIC: Ballista roadmap proposal #1068

Open
15 of 19 tasks
milenkovicm opened this issue Oct 11, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@milenkovicm
Contributor

milenkovicm commented Oct 11, 2024

Ballista Reloaded - Roadmap Proposal

Since we seem to have reached some kind of consensus about moving Ballista from an application to a library, I'd like to propose a few targets that I see as short- to medium-term goals for Ballista. This would address comments from @alamb & @Dandandan.

Personally, I see two main short-term goals: improving Ballista usability and reducing the amount of code we have to maintain. Robustness may come up as another important goal, but I don't see the bandwidth or infrastructure for it at this point.

0. Keep up with DataFusion releases

Nothing else to add :)

1. Usability

It would be great if we could make writing a Ballista application as easy as writing a DataFusion one; ideally, it should be very hard to spot the difference between them.

1.1 BallistaContext removal or evolution

Can we replace BallistaContext with SessionContext? It would definitely improve usability, as we would get most of the methods available in SessionContext. Also, some DataFusion applications would be deployable to Ballista with a single-line change.

let ctx = SessionContext::ballista_standalone().await?;

This approach may bring DataFusion Python on board as well, though I'm not sure how easy that would be.

There are clear benefits to deprecating BallistaContext, but the decision may hurt us in the long run.

SessionContext may bring usability issues with UDF support, configuration, and basically all functionality that needs to be propagated across the cluster to work, which may not be trivial to address. We may try to address them by "turning off" those methods in Ballista or just by documenting them; either way, some effort is needed. Or maybe it's not an issue at all?
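One possible shape for this is an extension trait that adds Ballista constructors to the existing context type. The sketch below is only an illustration under assumptions: `SessionContext` here is a local stand-in struct (not the real DataFusion type), `SessionContextExt` and `ballista_remote` are hypothetical names, and the methods are synchronous to keep the example self-contained, whereas the one-liner above is async.

```rust
// Stand-in for DataFusion's `SessionContext`; the real type lives in the
// `datafusion` crate and is deliberately not used here.
#[derive(Debug, Default)]
pub struct SessionContext {
    // Hypothetical field: where queries would be sent.
    pub target: String,
}

// Hypothetical extension trait: adds Ballista constructors to an existing
// type without wrapping it in a separate `BallistaContext`.
pub trait SessionContextExt: Sized {
    fn ballista_standalone() -> Result<Self, String>;
    fn ballista_remote(url: &str) -> Result<Self, String>;
}

impl SessionContextExt for SessionContext {
    // In-process scheduler and executor (assumed meaning of "standalone").
    fn ballista_standalone() -> Result<Self, String> {
        Ok(SessionContext { target: "standalone".to_string() })
    }

    // Connect to an already running scheduler (hypothetical variant).
    fn ballista_remote(url: &str) -> Result<Self, String> {
        if url.is_empty() {
            return Err("scheduler URL must not be empty".to_string());
        }
        Ok(SessionContext { target: url.to_string() })
    }
}
```

With something like this in scope, the only line an existing application would change is the one that constructs the context; every other `SessionContext` method stays available as before.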

1.2 Scheduler/executor binaries

Ballista as a library should keep the scheduler and executor binaries, as they improve overall Ballista usability and provide a quick way to bootstrap a Ballista cluster for easy onboarding and testing.

We should focus our effort on providing methods that make it easy to build custom scheduler/executor binaries. We could provide a way to create a new scheduler/executor with default configurations, or add a way to plug in object store registries, configurations, protocols, session context factories ...
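To make the idea of "defaults plus plug-in points" concrete, here is a minimal sketch of a builder-style configuration. All names (`SchedulerConfig`, `ObjectStoreRegistry`, `with_object_store_registry`, the port value) are hypothetical and chosen for illustration only; none of them are taken from the current Ballista API.

```rust
// Hypothetical extension point: maps a URL scheme to a storage backend name.
pub trait ObjectStoreRegistry {
    fn resolve(&self, scheme: &str) -> Option<String>;
}

// Default registry that only knows about the local file system.
pub struct DefaultRegistry;

impl ObjectStoreRegistry for DefaultRegistry {
    fn resolve(&self, scheme: &str) -> Option<String> {
        if scheme == "file" { Some("local-fs".to_string()) } else { None }
    }
}

// Example of a user-provided registry plugged into a custom binary.
pub struct S3Registry;

impl ObjectStoreRegistry for S3Registry {
    fn resolve(&self, scheme: &str) -> Option<String> {
        match scheme {
            "file" => Some("local-fs".to_string()),
            "s3" => Some("s3-store".to_string()),
            _ => None,
        }
    }
}

// Hypothetical scheduler configuration with sensible defaults.
pub struct SchedulerConfig {
    pub bind_port: u16,
    pub registry: Box<dyn ObjectStoreRegistry>,
}

impl Default for SchedulerConfig {
    fn default() -> Self {
        // The port is an arbitrary illustrative value.
        SchedulerConfig { bind_port: 50050, registry: Box::new(DefaultRegistry) }
    }
}

impl SchedulerConfig {
    pub fn with_port(mut self, port: u16) -> Self {
        self.bind_port = port;
        self
    }

    pub fn with_object_store_registry(mut self, registry: Box<dyn ObjectStoreRegistry>) -> Self {
        self.registry = registry;
        self
    }
}
```

A custom binary would then start from `SchedulerConfig::default()` and override only the pieces it cares about, for example `.with_object_store_registry(Box::new(S3Registry))`.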

1.3 Ballista Contrib

Move some of the components that are now optional to separate sub-projects.

2. Protocol (client - scheduler - executor)

There are two protocols we may need to look at: client-scheduler and scheduler-executor.
The major use cases may be support for user defined functions, configuration propagation, and replacement of the protocol itself.

2.1 Propagate SessionContext configuration from client to executor

At the moment, SessionContext configuration or other state is not propagated from the client to the scheduler and executors.
Enabling this would simplify overall configuration and would enable use cases where the configuration holds
secret keys, object store configuration, or similar.
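As a rough illustration of what propagation could mean, the sketch below flattens client settings into key/value pairs (the kind of payload that could travel with a job submission) and overlays them on executor defaults. The function names and the `example.object_store.region` key are hypothetical; the wire format and the treatment of secrets are left open.

```rust
use std::collections::BTreeMap;

// Settings are modelled as plain string key/value pairs for this sketch.
pub type SessionSettings = BTreeMap<String, String>;

// Client side: flatten the settings into pairs that could be attached to a
// job submission message.
pub fn encode_settings(settings: &SessionSettings) -> Vec<(String, String)> {
    settings.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
}

// Executor side: start from local defaults and let received client settings
// override them.
pub fn apply_settings(
    defaults: &SessionSettings,
    received: &[(String, String)],
) -> SessionSettings {
    let mut merged = defaults.clone();
    for (key, value) in received {
        merged.insert(key.clone(), value.clone());
    }
    merged
}
```

The design choice here is that the client wins on conflicts; a real implementation might instead want an allow-list of keys that clients are permitted to override.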

2.2 Support for user defined functions

I'm not aware of any examples where Rust-based UDFs are made serializable and shipped from client to server,
but there are many examples where Python functions are shipped, so this effort may focus on Python UDFs. This effort
would probably impact DataFusion plans; more details to follow.

2.3 Make client-scheduler protocol pluggable

The current client-scheduler protocol will be improved; in addition, as new protocols are coming out, we may provide
a way to replace the default protocol.

One (new) protocol example is Spark Connect; it is a well-thought-out approach covering most if not all cases for layered data processing. Users could provide support for it and deploy frameworks like Sail, or even Spark applications, on top of Ballista. Personally, I find this interesting, and with the growing number of operators that DataFusion Comet supports, it might bring interesting possibilities.

Also, this is needed if flight-sql is made optional and moved to a 'contrib' project.

2.4 Bridge the gap between datafusion and ballista

Some operations are supported by DataFusion but not by Ballista; we should try to bridge the gap.

3. Shuffle improvements

As @andygrove mentioned, we could re-implement the shuffle writer/reader to reuse the logic in Comet, which has a more efficient shuffle implementation based on Spark. It would be great to see this implemented in the short term.

4. Scheduler

Improvements to the internal scheduler could be a mid- to long-term goal, where users can bring their own strategies. Not many use cases come to mind apart from HDFS colocation or caching.

Two possible items here:

  • Pluggable scheduler
  • Adding/improving Failure detector(s)
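For the failure detector item, the simplest variant is a heartbeat timeout: an executor is suspected once it has been silent for longer than a threshold. The sketch below assumes that approach; `HeartbeatDetector` and its methods are hypothetical names, and time is passed in explicitly as seconds so the behaviour is deterministic.

```rust
use std::collections::HashMap;

// Hypothetical timeout-based failure detector for executors.
pub struct HeartbeatDetector {
    timeout_secs: u64,
    // Executor id -> timestamp (in seconds) of the last heartbeat received.
    last_seen: HashMap<String, u64>,
}

impl HeartbeatDetector {
    pub fn new(timeout_secs: u64) -> Self {
        HeartbeatDetector { timeout_secs, last_seen: HashMap::new() }
    }

    // Record a heartbeat from an executor at the given time.
    pub fn heartbeat(&mut self, executor_id: &str, now_secs: u64) {
        self.last_seen.insert(executor_id.to_string(), now_secs);
    }

    // Return the ids of executors that have been silent for longer than the
    // timeout, sorted for stable output.
    pub fn suspected(&self, now_secs: u64) -> Vec<String> {
        let mut ids: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now_secs.saturating_sub(**seen) > self.timeout_secs)
            .map(|(id, _)| id.clone())
            .collect();
        ids.sort();
        ids
    }
}
```

More adaptive detectors could sit behind the same two methods, which is one argument for making this part pluggable.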

5. Observability

As the UI has been removed and the REST API may be moved to contrib, we need to come up with a notification mechanism that external systems can subscribe to in order to get scheduling events, execution metrics ... We would need to put some more effort into breaking down this functionality. I guess we could learn from Apache Spark.
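A notification mechanism of this kind could start as a simple publish/subscribe bus inside the scheduler. The sketch below uses standard library channels; `SchedulerEvent`, its variants, and `EventBus` are hypothetical names, and a real design would need to decide which events and metrics to expose.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical events an external system might want to observe.
#[derive(Debug, Clone, PartialEq)]
pub enum SchedulerEvent {
    JobSubmitted { job_id: String },
    JobFinished { job_id: String, elapsed_ms: u64 },
}

// Minimal in-process event bus: each subscriber gets its own channel.
#[derive(Default)]
pub struct EventBus {
    subscribers: Vec<Sender<SchedulerEvent>>,
}

impl EventBus {
    // Register a new subscriber and hand back the receiving end.
    pub fn subscribe(&mut self) -> Receiver<SchedulerEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    // Deliver the event to every subscriber; subscribers whose receiver has
    // been dropped are removed.
    pub fn publish(&mut self, event: SchedulerEvent) {
        self.subscribers.retain(|tx| tx.send(event.clone()).is_ok());
    }

    pub fn subscriber_count(&self) -> usize {
        self.subscribers.len()
    }
}
```

A REST API or UI living in a contrib project could then be just another subscriber rather than something core Ballista has to know about.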

6. Testing

Put effort into getting more tests and covering edge cases. It may not be easy, as it needs additional infrastructure and a lot of effort. It would be great if we could reuse the DataFusion test suite somehow.

7. Python bindings

Make Ballista available to datafusion-python users.

@drauschenbach

Re: 2.2 User Defined Functions:

Rust compiles to WASM, and WASM explicitly supports over-the-air deployment. This is one possible way to express UDFs in Rust.

Some other projects allow UDFs in "pickle-able Python", although use of transitive dependencies rapidly descends into Python packaging hell.

@wpf375516041

[image] Does StarRocks Java UDF have any reference value?

@tbar4
Contributor

tbar4 commented Oct 18, 2024

I think pictures and graphs help me better understand what is being updated, so I made this to reflect what I think you are saying should be updated. I am happy to update it and create a proposed architecture diagram that we can maybe put on the site.
[image: ballista drawio]

@milenkovicm
Contributor Author

Just one note: core Ballista should not care about UDFs. If they can be serialised into a logical/physical plan, Ballista should provide extension points to run them. My opinion is that rather than providing a specific UDF implementation in core Ballista, we should provide a way to plug in your own UDF implementation.
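To sketch what such an extension point might look like: the plan carries only a function name, and the executor asks a user-supplied resolver for the implementation. Everything here (`UdfResolver`, `MapResolver`, `execute_call`, the `i64 -> i64` signature) is a hypothetical simplification for illustration, not an existing Ballista or DataFusion API.

```rust
use std::collections::HashMap;

// Simplified scalar function type used only in this sketch.
pub type ScalarFn = Box<dyn Fn(i64) -> i64 + Send + Sync>;

// Hypothetical extension point: core code only knows this trait, not how
// the functions behind it were built (native Rust, WASM, Python, Java ...).
pub trait UdfResolver {
    fn resolve(&self, name: &str) -> Option<&ScalarFn>;
}

// One possible user-provided implementation backed by a map.
#[derive(Default)]
pub struct MapResolver {
    functions: HashMap<String, ScalarFn>,
}

impl MapResolver {
    pub fn register(&mut self, name: &str, function: ScalarFn) {
        self.functions.insert(name.to_string(), function);
    }
}

impl UdfResolver for MapResolver {
    fn resolve(&self, name: &str) -> Option<&ScalarFn> {
        self.functions.get(name)
    }
}

// Executor side: look up the function named in the plan and apply it.
pub fn execute_call(resolver: &dyn UdfResolver, name: &str, arg: i64) -> Result<i64, String> {
    match resolver.resolve(name) {
        Some(function) => Ok(function(arg)),
        None => Err(format!("unknown UDF: {}", name)),
    }
}
```

The point of the indirection is that core code never needs to know where a function came from; it only needs the resolver to be installed on the executor.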

@milenkovicm
Contributor Author

Does starrocks java udf have any reference value?

@wpf375516041 it's not too hard to implement Java UDFs on top of DataFusion. I did an example implementation some time ago: https://github.com/milenkovicm/adhesive

4 participants