Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DDL Statement Propagation (INSERT INTO support) #1164

Open
Tracked by #1068
milenkovicm opened this issue Jan 18, 2025 · 6 comments
Open
Tracked by #1068

DDL Statement Propagation (INSERT INTO support) #1164

milenkovicm opened this issue Jan 18, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

@milenkovicm
Copy link
Contributor

milenkovicm commented Jan 18, 2025

Is your feature request related to a problem or challenge?

With apache/datafusion#14079 merged, we're a step closed having support for INSERT INTO in ballista.
Latest issue is that scheduler can't find table reference specified in DML.table_name (type of TableReference).

This specific issue is due to having two different un-synchronized session contexts ballista has, client and corresponding scheduler context.

Describe the solution you'd like

I do not have a good or preferred solution at this point, asking for opinions.
Ideally it should be a solution which would be flexible.

Describe alternatives you've considered

I have few alternatives, not of which is ideal. Have I missed something?

Replace TableReference with actual table in the LogicalPlan::DML

Initial idea was to replace TableReference with actual table in the plan but that would not work due to
table provider lookup to create insert into exec https://github.com/milenkovicm/arrow-datafusion-fork/blob/dc22b3fc846c23f69325be6e11c8ef204c3dc6be/datafusion/core/src/physical_planner.rs#L550

I'm not convinced it will work

Propagate DDLs Statements to QueryPlanner

BallistaQueryPlanner is in charge of client-scheduler communication, at the moment it does not propagate DDL statements from client to scheduler. It could be modified to handle DDL statements, the problem is that SessionContext will execute DDL statements immediately and LogicalPlan::DDL will be swapped with LogicalPlan::Empty, thus no DDL information will reach the planner.

Looking at datafusion code, I'm not sure that this could be changed on the SessionContext without major disruption.

Synchronize Catalogs Between Client and Scheduler

INSERT INTO will work if scheduler catalog has table information, so some kind of remote catalog would help. As it would affect user experience if remote catalog had to be setup, this option is not the first choice .

We could come up with ballista catalog (schema registry) which could synchronize catalog state between client and the scheduler, it could be a bit of the work with non async methods exposed by SchemaCatalog.

At the end, as SchemaProvider.table is async, table could be lazy registered first time table is needed by a query plan. This would require custom SchemaProvider on the client side.

Synchronize Contexts on ExecuteQuery

Implement some kind of tracking logic, which would be triggered on ExecuteQuery which would synchronise SchemaRegistry between client and scheduler. I'm not really keen on this solution as I believe it will get very complicated very quickly.

On another thought we could synchronise only table definition on DML execution, maybe even easiest to implement.

Modify Ballista Protocol to send PhysicalPlans

At the moment client would send LogicalPlan to scheduler which would be then converted to physical plan on the scheduler. At this point we need table reference. I was wondering do can we resolve physical plan on the client side, but split them to stages on the server side.

This would be quite a big change, so i'm asking if anybody remembers why logical plan was selected to be exchange instead of physical plan.

Additional context

@milenkovicm milenkovicm added the enhancement New feature or request label Jan 18, 2025
@milenkovicm
Copy link
Contributor Author

@alamb, @andygrove, @Dandandan I would like to pick your brain on this issue, or anyone who could help. There is small gap to support INSERT INTO

thanks a lot

@alamb
Copy link
Contributor

alamb commented Jan 22, 2025

I will try and respond to this tomorrow

@alamb
Copy link
Contributor

alamb commented Jan 23, 2025

This is on my list, but I may not get to it tonight

@milenkovicm
Copy link
Contributor Author

Thanks @alamb! There is no rush, DF 45 needed to proceed with this task

@alamb
Copy link
Contributor

alamb commented Jan 24, 2025

the problem is that SessionContext will execute DDL statements immediately and LogicalPlan::DDL will be swapped with LogicalPlan::Empty, thus no DDL information will reach the planner.

I am sorry if I am misunderstanding and I am not faimilar with the Balilsta code. From the DataFusion side, however, you can avoid immediately executing DDL by using lower level APIs on SessionState

So instead of running SessionContext::sql

You could do what it does (source):

        let plan = self.state().create_logical_plan(sql).await?;

        match plan { 
          LogicalPlan::DML(...) => { // do whatever ballista needs here }
          _ => ctx.execute_logical_plan()
    }

Then you could perhaps Propagate DDLs Statements to QueryPlanner

@milenkovicm
Copy link
Contributor Author

thanks @alamb

one of the biggest changes changes in ballista is removal of ballista context and exposing datafusion context directly to the user. we get almost all operations for free, and applications running on vanilla datafusion can run on ballista with single line change. this means all required manipulations to get this working have done somehow in the background.
from what I know so far DML is the only operation which makes this approach challenging, but I don't think its as how stopper.

I also wanted to use this as opportunity to simply overall flow if possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants