Add fill_null method to DataFrame API for handling missing values #1019

kosiew · 2025-02-12T08:40:40Z

Which issue does this PR close?

Closes Add DataFrame fill_nan/fill_null #922

Rationale for this change

Handling missing values is a common operation in data processing. This change introduces a convenient and expressive way to replace NULL values in DataFusion DataFrames using a single method, improving usability and parity with other data processing frameworks like pandas or PySpark.

What changes are included in this PR?

- Introduced a new fill_null method on the Python DataFrame API.
- Added support in the Rust backend to apply null-filling logic using a scalar value with optional column selection.
- Implemented type-safe conversion of Python objects to Arrow-compatible ScalarValue.
- Added a comprehensive test suite covering various scenarios:
  - Filling across all columns or a specific subset
  - Handling of different data types (integers, floats, strings, booleans, dates)
  - Coercion behavior when casting between types
  - Immutability of original DataFrames
  - Edge cases including empty DataFrames and columns with all nulls
- Updated user-facing documentation to include usage examples and caveats for fill_null.
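
As a rough illustration of the subset and immutability behavior listed above, here is a minimal pure-Python model. It is not the DataFusion implementation: `fill_null_model` and the dict-of-lists table are hypothetical stand-ins, and Arrow type coercion is deliberately left out.

```python
from typing import Any, Optional

# Hypothetical stand-in for a DataFrame: column name -> list of values,
# with None playing the role of NULL.
Table = dict[str, list[Any]]


def fill_null_model(table: Table, value: Any, subset: Optional[list[str]] = None) -> Table:
    """Return a new table with None entries replaced by `value`.

    Only columns named in `subset` are filled; with subset=None every
    column is filled. The input table is never modified. Type coercion
    is NOT modeled here (assumption: the real method casts per column).
    """
    targets = set(table) if subset is None else set(subset)
    result: Table = {}
    for name, column in table.items():
        if name in targets:
            result[name] = [value if v is None else v for v in column]
        else:
            # Copy untouched columns so the result shares no lists with the input.
            result[name] = list(column)
    return result


table = {"a": [1, None, 3], "b": [None, "x", None]}
filled_all = fill_null_model(table, 0)
filled_a = fill_null_model(table, 0, subset=["a"])
```

In this model, `filled_a` keeps the `None` entries in column `b`, and `table` itself is unchanged after both calls, which matches the immutability scenario in the test list.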

Are these changes tested?

✅ Yes. The PR includes an extensive set of unit tests in test_dataframe.py, validating behavior across data types, subsets, error tolerance, and null-handling edge cases.

Are there any user-facing changes?

✅ Yes:

- A new fill_null() method is now available on Python DataFrames.
- Documentation has been updated with examples and behavior details under common-operations/functions.rst.

timsaucer

This looks like a very worthwhile and useful addition. Thank you!

We've tried to keep most of the heavier logic on the Rust side and to keep the Python wrappers as a way to convert from Rust to Pythonic interfaces. Do you think this is a case where doing the logic on the Python side makes more sense?

More generally, do you think this is something we can or should upstream to the core datafusion repo? I can assist with that if you like.

timsaucer · 2025-03-08T13:29:15Z

Hi @kosiew, I moved this to draft since it looks like you're doing a good job on the upstream work, which would change how we would want to handle this.

kosiew · 2025-03-10T08:46:08Z

thanks

kosiew · 2025-03-14T06:31:09Z

The upstream PR for fill_null is included in datafusion 46.0.0.
We can revisit this when datafusion-python upgrades its dependency to 46.0.0.

- Implemented `fill_null` method in `dataframe.rs` to allow filling null values with a specified value for specific columns or all columns.
- Added a helper function `python_value_to_scalar_value` to convert Python values to DataFusion ScalarValues, supporting various types including integers, floats, booleans, strings, and timestamps.
- Updated the `count` method in `PyDataFrame` to maintain functionality.

kosiew · 2025-04-30T12:26:12Z

python/tests/test_functions.py

@@ -1236,3 +1236,57 @@ def test_between_default(df):
 def test_alias_with_metadata(df):
    df = df.select(f.alias(f.col("a"), "b", {"key": "value"}))
    assert df.schema().field("b").metadata == {b"key": b"value"}
+
+
+def test_coalesce(df):


I added this test because, while researching this PR, I first looked at the coalesce function and found that it had no tests yet.

kosiew · 2025-04-30T12:28:14Z

src/config.rs

-fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> ScalarValue {
-    if let Ok(value) = obj.extract::<bool>(py) {
-        ScalarValue::Boolean(Some(value))
-    } else if let Ok(value) = obj.extract::<i64>(py) {
-        ScalarValue::Int64(Some(value))
-    } else if let Ok(value) = obj.extract::<u64>(py) {
-        ScalarValue::UInt64(Some(value))
-    } else if let Ok(value) = obj.extract::<f64>(py) {
-        ScalarValue::Float64(Some(value))
-    } else if let Ok(value) = obj.extract::<String>(py) {
-        ScalarValue::Utf8(Some(value))
-    } else {
-        panic!("Unsupported value type")
-    }
-}


Moved to src/utils.rs with a simpler implementation

kosiew · 2025-04-30T12:31:46Z

src/utils.rs

+pub(crate) fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> PyResult<ScalarValue> {
+    // convert Python object to PyScalarValue to ScalarValue
+
+    let pa = py.import("pyarrow")?;
+
+    // Convert Python object to PyArrow scalar
+    let scalar = pa.call_method1("scalar", (obj,))?;
+
+    // Convert PyArrow scalar to PyScalarValue
+    let py_scalar = PyScalarValue::extract_bound(scalar.as_ref())
+        .map_err(|e| PyValueError::new_err(format!("Failed to extract PyScalarValue: {}", e)))?;
+
+    // Convert PyScalarValue to ScalarValue
+    Ok(py_scalar.into())
+}


The above is simpler than the original:

fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> ScalarValue {
    if let Ok(value) = obj.extract::<bool>(py) {
        ScalarValue::Boolean(Some(value))
    } else if let Ok(value) = obj.extract::<i64>(py) {
        ScalarValue::Int64(Some(value))
    } else if let Ok(value) = obj.extract::<u64>(py) {
        ScalarValue::UInt64(Some(value))
    } else if let Ok(value) = obj.extract::<f64>(py) {
        ScalarValue::Float64(Some(value))
    } else if let Ok(value) = obj.extract::<String>(py) {
        ScalarValue::Utf8(Some(value))
    } else {
        panic!("Unsupported value type")
    }
}

The original did not handle other Python scalars, e.g. datetime.
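
To make the limitation concrete, here is a rough Python analogue of that extract chain. It is a sketch and not a line-for-line port: `classify_scalar` is a hypothetical name, it returns only the name of the ScalarValue variant, and it raises an error where the Rust version panicked. Integers outside the 64-bit range are simply rejected here, which is an assumption of the sketch.

```python
from datetime import datetime

I64_MIN, I64_MAX = -(2**63), 2**63 - 1
U64_MAX = 2**64 - 1


def classify_scalar(value: object) -> str:
    """Return the ScalarValue variant name a fixed type-check chain would pick."""
    # bool must be tested before int, because Python's bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        if I64_MIN <= value <= I64_MAX:
            return "Int64"
        if 0 <= value <= U64_MAX:
            return "UInt64"
        # Sketch assumption: anything wider than 64 bits is rejected.
        raise ValueError("integer does not fit in 64 bits")
    if isinstance(value, float):
        return "Float64"
    if isinstance(value, str):
        return "Utf8"
    # Every type not listed above falls through, including datetime.
    raise ValueError(f"Unsupported value type: {type(value).__name__}")


try:
    classify_scalar(datetime(2025, 4, 30))
    datetime_supported = True
except ValueError:
    datetime_supported = False
```

Every new scalar type needs another branch in a chain like this, which is the maintenance cost the PyArrow-based version avoids by delegating type inference to `pyarrow.scalar`.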

kosiew · 2025-05-16T04:09:51Z

@timsaucer

This PR is ready for review.

timsaucer

Very nice work, especially the excellent unit test coverage!

kosiew added 8 commits February 12, 2025 15:01

feat: add fill_null method to DataFrame for handling null values

106555e

test: add coalesce function tests for handling default values

cff9b7c

Resolve test cases for fill_null

4cf7496

feat: add fill_nan method to DataFrame for handling NaN values

df6208e

move imports out of functions

23ba1bd

docs: add documentation for fill_null and fill_nan methods in DataFrame

d6ca465

Add more tests

8582104

fix ruff errors

73b692f

kosiew force-pushed the fill-null branch from 9509f6d to 73b692f Compare February 12, 2025 09:03

kosiew marked this pull request as ready for review February 12, 2025 09:47

timsaucer reviewed Feb 15, 2025

This was referenced Feb 19, 2025

Add DataFrame fill_null apache/datafusion#14765

Closed

Add DataFrame fill_nan apache/datafusion#14770

Open

Merge branch 'main' into fill-null

07d4f4b

timsaucer marked this pull request as draft March 8, 2025 13:28

kosiew added 12 commits April 3, 2025 18:17

Merge branch 'main' into fill-null

8b51ee9

Merge branch 'main' into fill-null

924de28

refactor: remove fill_nan method documentation from functions.rst

4499e45

refactor: remove unused import of Enum from dataframe.py

bf9d7da

refactor: improve error handling and type extraction in python_value_to_scalar_value function

dc86e77

refactor: enhance datetime and date conversion logic in python_value_to_scalar_value function

6fbafcd

refactor: streamline type extraction in python_value_to_scalar_value function

681b2e5

fix try_convert_to_string

aa87a8e

refactor: improve type handling in python_value_to_scalar_value function

0dfbdfa

refactor: move py_obj_to_scalar_value function to utils module

ecc4376

refactor: update fill_null to use py_obj_to_scalar_value from utils

412029c

kosiew added 11 commits April 30, 2025 19:47

Remove python_object_to_scalar_value code

4c40b85

refactor: enhance py_obj_to_scalar_value to utilize PyArrow for complex type conversion

82bf6f4

refactor: update py_obj_to_scalar_value to handle errors and use extract_bound for PyArrow scalar conversion

b5d87b0

refactor: modify py_obj_to_scalar_value to return ScalarValue directly and streamline error handling

d546f7a

refactor: update py_obj_to_scalar_value to return a Result for better error handling

b89c695

test: add tests for fill_null functionality in DataFrame with null values

b140523

test: enhance null DataFrame tests to include date32 and date64 columns

3065773

refactor: simplify py_obj_to_scalar_value by removing direct extraction of basic types

d7cf099

refactor: remove unnecessary documentation from py_obj_to_scalar_value function

0aebd74

Fix ruff errors

e3d643b

test: update datetime handling in coalesce tests to include timezone information

68b520e

kosiew commented Apr 30, 2025

kosiew marked this pull request as ready for review April 30, 2025 12:38

Fix ruff errors

22519aa

kosiew force-pushed the fill-null branch from 8d4210d to 22519aa Compare April 30, 2025 12:40

kosiew changed the title ~~Add DataFrame fill_nan/fill_null~~ Add fill_null method to DataFrame API for handling missing values Apr 30, 2025

kosiew added 2 commits April 30, 2025 22:55

trigger ci

799b67c

Merge branch 'main' into fill-null

4681420

timsaucer approved these changes May 16, 2025

timsaucer merged commit f3c98ec into apache:main May 16, 2025
17 checks passed
