Add fill_null method to DataFrame API for handling missing values #1019
Conversation
This looks like a very worthwhile and useful addition. Thank you!
We've tried to keep most of the heavier logic on the Rust side and to keep the Python wrappers as a way to convert from Rust to Pythonic interfaces. Do you think this is a case where doing the logic on the Python side makes more sense?
More generally, do you think this is something we can or should upstream to the core datafusion repo? I can assist with that if you like.
Hi @kosiew I moved this to draft since it looks like you're doing a good job on the upstream work, which would change how we would want to handle this.
thanks
The upstream PR for fill_null is included in datafusion 46.0.0.
- Implemented `fill_null` method in `dataframe.rs` to allow filling null values with a specified value for specific columns or all columns.
- Added a helper function `python_value_to_scalar_value` to convert Python values to DataFusion ScalarValues, supporting various types including integers, floats, booleans, strings, and timestamps.
- Updated the `count` method in `PyDataFrame` to maintain functionality.
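The semantics described above can be illustrated with a minimal stdlib Python sketch, using a column-oriented dict as a stand-in for a DataFrame. The helper and data below are hypothetical illustrations of the behavior, not the actual datafusion-python API:

```python
# Sketch of fill_null semantics on a column-oriented table:
# replace None entries with `value`, optionally only in `subset` columns.
# Illustrative stand-in, not the datafusion-python implementation.
def fill_null(table: dict, value, subset=None):
    cols = subset if subset is not None else list(table)
    return {
        name: [value if v is None and name in cols else v for v in col]
        for name, col in table.items()
    }

table = {"a": [1, None, 3], "b": [None, "x", None]}
print(fill_null(table, 0, subset=["a"]))
# {'a': [1, 0, 3], 'b': [None, 'x', None]}
```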
@@ -1236,3 +1236,57 @@ def test_between_default(df):
def test_alias_with_metadata(df):
    df = df.select(f.alias(f.col("a"), "b", {"key": "value"}))
    assert df.schema().field("b").metadata == {b"key": b"value"}


def test_coalesce(df):
Added this test because while researching this PR, I initially checked out the coalesce function and found there were no tests yet.
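For context, `coalesce` returns its first non-NULL argument. A minimal stdlib Python sketch of that scalar semantics (the helper is illustrative, not the DataFusion implementation):

```python
# Illustrative sketch of coalesce semantics: return the first
# argument that is not None, or None if every argument is None.
def coalesce(*values):
    return next((v for v in values if v is not None), None)

print(coalesce(None, None, 3))  # 3
print(coalesce(None, None))     # None
```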
fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> ScalarValue {
    if let Ok(value) = obj.extract::<bool>(py) {
        ScalarValue::Boolean(Some(value))
    } else if let Ok(value) = obj.extract::<i64>(py) {
        ScalarValue::Int64(Some(value))
    } else if let Ok(value) = obj.extract::<u64>(py) {
        ScalarValue::UInt64(Some(value))
    } else if let Ok(value) = obj.extract::<f64>(py) {
        ScalarValue::Float64(Some(value))
    } else if let Ok(value) = obj.extract::<String>(py) {
        ScalarValue::Utf8(Some(value))
    } else {
        panic!("Unsupported value type")
    }
}
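One detail worth noting in the chain above: the order of the extract attempts matters, because in Python `bool` is a subclass of `int`, so a boolean checked after the integer arms could be misclassified. A stdlib Python sketch of the same dispatch order (the `classify` helper is hypothetical, not code from this PR):

```python
# Dispatch order mirrors the Rust extract chain: bool must be
# checked before int, since isinstance(True, int) is True in Python.
def classify(obj):
    if isinstance(obj, bool):  # must precede the int check
        return "Boolean"
    if isinstance(obj, int):
        # the Rust chain tries i64 first, falling back to u64 for large values
        return "Int64" if obj <= 2**63 - 1 else "UInt64"
    if isinstance(obj, float):
        return "Float64"
    if isinstance(obj, str):
        return "Utf8"
    raise TypeError(f"unsupported value type: {type(obj).__name__}")

print(classify(True))  # Boolean, not Int64
print(classify(7))     # Int64
```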
Moved to src/utils.rs with a simpler implementation
pub(crate) fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> PyResult<ScalarValue> {
    // convert Python object to PyScalarValue to ScalarValue
    let pa = py.import("pyarrow")?;

    // Convert Python object to PyArrow scalar
    let scalar = pa.call_method1("scalar", (obj,))?;

    // Convert PyArrow scalar to PyScalarValue
    let py_scalar = PyScalarValue::extract_bound(scalar.as_ref())
        .map_err(|e| PyValueError::new_err(format!("Failed to extract PyScalarValue: {}", e)))?;

    // Convert PyScalarValue to ScalarValue
    Ok(py_scalar.into())
}
The above is simpler than the original
fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> ScalarValue {
    if let Ok(value) = obj.extract::<bool>(py) {
        ScalarValue::Boolean(Some(value))
    } else if let Ok(value) = obj.extract::<i64>(py) {
        ScalarValue::Int64(Some(value))
    } else if let Ok(value) = obj.extract::<u64>(py) {
        ScalarValue::UInt64(Some(value))
    } else if let Ok(value) = obj.extract::<f64>(py) {
        ScalarValue::Float64(Some(value))
    } else if let Ok(value) = obj.extract::<String>(py) {
        ScalarValue::Utf8(Some(value))
    } else {
        panic!("Unsupported value type")
    }
}
which did not handle other Python scalars, e.g. `datetime`.
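The pyarrow-based version handles such values because `pa.scalar` infers an Arrow type for them. As a rough stdlib illustration of what a dedicated timestamp arm would otherwise have had to compute, here is the epoch-microseconds conversion a microsecond-precision timestamp scalar roughly corresponds to (the helper name and the UTC assumption for naive datetimes are mine, not the PR's):

```python
from datetime import datetime, timezone

# Hedged sketch: convert a Python datetime to epoch microseconds,
# roughly the payload of a microsecond-precision timestamp scalar.
# Naive datetimes are assumed to be UTC for this illustration.
def to_timestamp_micros(dt: datetime) -> int:
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1_000_000)

print(to_timestamp_micros(datetime(1970, 1, 1, tzinfo=timezone.utc)))  # 0
```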
Which issue does this PR close?

Rationale for this change

Handling missing values is a common operation in data processing. This change introduces a convenient and expressive way to replace `NULL` values in DataFusion DataFrames using a single method, improving usability and parity with other data processing frameworks like pandas or PySpark.

What changes are included in this PR?

- A new `fill_null` method on the Python `DataFrame` API.
- Conversion of Python values to DataFusion `ScalarValue`.
- Tests and documentation for `fill_null`.
Are these changes tested?

✅ Yes. The PR includes an extensive set of unit tests in `test_dataframe.py`, validating behavior across data types, subsets, error tolerance, and null-handling edge cases.

Are there any user-facing changes?
✅ Yes:

- The `fill_null()` method is now available on Python DataFrames.
- Documentation added in `common-operations/functions.rst`.