Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce return_type_from_args for ScalarFunction. #14094

Merged
merged 30 commits into from
Jan 20, 2025
Merged

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jan 12, 2025

Which issue does this PR close?

Part of #13717

Rationale for this change

return_type_from_args that has less dependencies on Expr itself but the computed properties of Expr and Schema including data_type and nullability

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

TODO

Combine return_type_from_args and is_nullable_from_args_nullable

Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jan 12, 2025
pub fn is_nullable(&self, args: &[Expr], schema: &dyn ExprSchema) -> bool {
self.inner.is_nullable(args, schema)
}

pub fn is_nullable_from_args_nullable(&self, args_nullables: &[bool]) -> bool {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Expr dependency

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [String],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better name 🤔 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to unify the argument handling so that both return type and nullability are returned the same?

I wonder if it would somehow be possible to add the input nullable information here too 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

let name_column = &chunk[0];
let name = match name_column {
ColumnarValue::Scalar(ScalarValue::Utf8(Some(name_scalar))) => name_scalar,
_ => return exec_err!("named_struct even arguments must be string literals, got {name_column:?} instead at position {}", i * 2)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name_column output less readable array in this change, remove it for now.

@findepi
Copy link
Member

findepi commented Jan 12, 2025

As stated in #13717 (comment) , this new method doesn't necessarily simplify anything.

Can you please fill "Rationale for this change"? What problem are we solving?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- I think this is a step in the right direction, but I am worried it just makes the API more complicated (adds as many functions as it deprecates)

Challenge: Exprs / constants

It seems to me one challenge is that different information is known for computing return types at different points in the plan (e.g. sometimes we have Expr and sometimes we don't)

What would you think about making this more explicit in ReturnTypeArgs by making it an enum:

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// information known at logical planning time
    /// Note you can get get type and nullability for each arg
   // using the specified ExprSchema
    Planning {
       pub args: &'a[Expr],
       pub schema: &'a dyn ExprSchema
    },
    /// Information known during Execution
    Execution {
    /// The data types of the arguments to the function
      pub arg_types: &'a [DataType],
      pub arg_nullability: [bool],
    }
}

Challenge: Multiple APIs (Nullability and return type)

It is somewhat akward to have two functions, one for nullability and one for return type. Also I can imagine that the nullability calculation depends on the input type of arguments too (not just the input nullability) I wonder if we can combine them into a single API:

Maybe something like

/// Information about the output of the function
/// including the data type and nullability:
struct ReturnTypeInfo {
  data_type: DataType,
  nullable: bool,
}

trait ScalarUDFImpl {
    /// Returns the 
    pub fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnTypeInfo> 
}

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [String],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to unify the argument handling so that both return type and nullability are returned the same?

I wonder if it would somehow be possible to add the input nullable information here too 🤔

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [String],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

@jayzhan211
Copy link
Contributor Author

Multiple APIs (Nullability and return type)

Great, I also want this too.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jan 12, 2025

I am also not sure about only supporting string args, that is likely a regression in behavior for some users (For example, maybe they look for constant integers as well)

Yes, it might be, since I assume we don't really need Expr but String instead. However, in theory they can achieve what they need with String, but of course this is breaking change.

If the constant integer is the only concern, ScalarValue or ColumnarValue looks good for me too!

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jan 12, 2025

#[derive(Debug)]
pub enum ReturnTypeArgs<'a> {
    /// information known at logical planning time
    /// Note you can get get type and nullability for each arg
   // using the specified ExprSchema
    Planning {
       pub args: &'a[Expr],
       pub schema: &'a dyn ExprSchema
    },
    /// Information known during Execution
    Execution {
    /// The data types of the arguments to the function
      pub arg_types: &'a [DataType],
      pub arg_nullability: [bool],
    }
}

One good thing in this PR is that we don't need Expr anymore, we compute data type and nullable in datafusion core and they are not "public" for customization.

Do we really need Planning? My thought is that whenever we have Expr and Schema we can compute corresponding DataType and Nullability. Therefore, even for Planning stage, we can still get the information DataType + Nullability

Signed-off-by: Jay Zhan <[email protected]>
Signed-off-by: Jay Zhan <[email protected]>
@jayzhan211 jayzhan211 changed the title Add two new methods in ScalarFunction return_type_from_args and is_nullable_from_args_nullable Introduce return_type_from_args for ScalarFunction. Jan 13, 2025
@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

Do we really need Planning? My thought is that whenever we have Expr and Schema we can compute corresponding DataType and Nullability. Therefore, even for Planning stage, we can still get the information DataType + Nullability

I think the usecase is for expressions like arrow_cast whose output type depends on the actual value of the arguments (not just the type of the argument)

So like arrow-cast(String) can return Int64 or Float64 depending on the value of the argument -- the value is passed as an Expr

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jan 14, 2025

So like arrow-cast(String

For arrow_cast, the output type is determined by the 2nd argument which is string.

arrow_cast(a, 'Int64') returns I64, arrow_cast(a, 'List(Int64)') returns List(Int64)

All the functions in datafusion have constant string type, the only concern is that other users do the crazy things with Expr, but I don't think it is practical since String is enough in my opionion

Signed-off-by: Jay Zhan <[email protected]>
@jayzhan211
Copy link
Contributor Author

I plan to change to ScalarValue after #14167, so we can parse the string easily.

If anyone interested in this change can also take it.

Signed-off-by: Jay Zhan <[email protected]>
@alamb
Copy link
Contributor

alamb commented Jan 18, 2025

I plan to change to ScalarValue after #14167, so we can parse the string easily.

If anyone interested in this change can also take it.

I have just merged in #14167

@github-actions github-actions bot added the common Related to common crate label Jan 19, 2025
@jayzhan211
Copy link
Contributor Author

We can start return_type deprecation after this PR

@jayzhan211 jayzhan211 marked this pull request as draft January 19, 2025 03:32
@jayzhan211 jayzhan211 marked this pull request as ready for review January 19, 2025 07:16
@alamb
Copy link
Contributor

alamb commented Jan 19, 2025

We can start return_type deprecation after this PR

I recommend we do not deprecate return_type - after this PR I think we have a nice API. Namely most simple uses can use return_type and ones that need more control (like returning nullability, or getting scalar args) can use return_type_from_args

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- I think this looks really nice ❤️

I left some comment suggestions, but I can also make them as a follow on PR as well

let a = Arc::new(Int32Array::from(vec![3; 200])) as ArrayRef;
println!(
"display {}",
pretty_format_columns("ColumnarValue(ArrayRef)", &[a]).unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I forgot to remove this 😆


let (return_type, nullable) =
func.return_type_from_args(args)?.into_parts();
Ok((return_type, nullable))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [Option<&'a ScalarValue>],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also recommend naming this scalar_arguments

Suggested change
pub arguments: &'a [Option<&'a ScalarValue>],
pub scalar_arguments: &'a [Option<&'a ScalarValue>],

pub struct ReturnTypeArgs<'a> {
/// The data types of the arguments to the function
pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
/// Is argument `i` to the function a scalar (constant)
///
/// If argument `i` is not a scalar, it will be None
///
/// For example, if a function is called like `my_function(column_a, 5)`
/// this field will be `[None, Some(ScalarValue::Int32(Some(5)))]`

Comment on lines +360 to +364
#[derive(Debug)]
pub struct ReturnInfo {
return_type: DataType,
nullable: bool,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(Debug)]
pub struct ReturnInfo {
return_type: DataType,
nullable: bool,
}
/// Return metadata for this function.
///
/// See [`ScalarUDFImpl::return_type_from_args`] for more information
#[derive(Debug)]
pub struct ReturnInfo {
/// The type of the value return by the function
return_type: DataType,
/// Can the returned value (ever) be null?
nullable: bool,
}

@@ -342,6 +348,56 @@ pub struct ScalarFunctionArgs<'a> {
pub return_type: &'a DataType,
}

#[derive(Debug)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[derive(Debug)]
/// Information about arguments passed to the function
///
/// This structure contains metadata about how the function was called
/// such as the type of the arguments, any scalar arguments and if the
/// arguments can (ever) be null
///
/// See [`ScalarUDFImpl::return_type_from_args`] for more information
#[derive(Debug)]

pub arg_types: &'a [DataType],
/// The Utf8 arguments to the function, if the expression is not Utf8, it will be empty string
pub arguments: &'a [Option<&'a ScalarValue>],
pub nullables: &'a [bool],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub nullables: &'a [bool],
/// Can argument `i` (ever) null?
pub nullables: &'a [bool],

debug_assert_eq!(args.arguments.len(), 2);

args.arguments[1]
.and_then(|sv| sv.try_as_str().flatten().filter(|s| !s.is_empty()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty fancy functional programming :bowtie:

@github-actions github-actions bot removed the common Related to common crate label Jan 20, 2025
@alamb
Copy link
Contributor

alamb commented Jan 20, 2025

Love it. And love that this isn't an API change. Thank you so much @jayzhan211 -- this really looks great

@alamb alamb merged commit d3f1c9a into apache:main Jan 20, 2025
25 checks passed
) -> Result<DataType> {
self.return_type(arg_types)
fn return_type_from_args(&self, args: ReturnTypeArgs) -> Result<ReturnInfo> {
let return_type = self.return_type(args.arg_types)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was a bit problematic for users with custom UDFs in that it may cause breaks at runtime.

The #[deprecated()] apparently doesn't affect for people implementing this trait, only those who call it. As such, if someone (like us) had implemented return_type_from_exprs and had return_type just panic, that'd compile fine, but rather than calling the return_type_from_exprs DF would call this new return_type_from_args with its default impl delegating to return_type which would then panic.

Luckily, we had tests that caught it, but in general I think in this case it would have been better to fully remove the return_type_from_exprs so that users who override it would get compile failures, rather than runtime ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luckily, we had tests that caught it, but in general I think in this case it would have been better to fully remove the return_type_from_exprs so that users who override it would get compile failures, rather than runtime ones.

THis makes sense to me - can you file a ticket? I think this one is related but I am not sure it is entirely the same

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #14729.

invoke_batch doesn't suffer from this runtime issue since invoke_with_args by default delegates to invoke_batch, so it is enough for users to still just override invoke_batch.

It does however have the same issue in that the users who do override invoke_batch aren't alerted to it being deprecated, meaning its future removal will still be a sudden compile-time break.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants