Add fill_null method to DataFrame API for handling missing values #1019

Open
wants to merge 35 commits into
base: main

Conversation

Contributor

@kosiew kosiew commented Feb 12, 2025

Which issue does this PR close?

Rationale for this change

Handling missing values is a common operation in data processing. This change introduces a convenient and expressive way to replace NULL values in DataFusion DataFrames using a single method, improving usability and parity with other data processing frameworks like pandas or PySpark.

What changes are included in this PR?

  • Introduced a new fill_null method on the Python DataFrame API.
  • Added support in the Rust backend for applying null-filling logic using a scalar value with optional column selection.
  • Implemented type-safe conversion of Python objects to Arrow-compatible ScalarValue.
  • Added a comprehensive test suite covering various scenarios:
    • Filling across all columns or a specific subset
    • Handling of different data types (integers, floats, strings, booleans, dates)
    • Coercion behavior when casting between types
    • Immutability of original DataFrames
    • Edge cases including empty DataFrames and columns with all nulls
  • Updated user-facing documentation to include usage examples and caveats for fill_null.
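
To make the intended semantics concrete, here is a minimal pure-Python sketch of the behavior listed above: filling every column or only a subset, and leaving the original data untouched. It is only a model over a dict of lists; `fill_null_sketch` and its `subset` parameter are hypothetical names chosen for illustration, not the method added by this PR, and type coercion is not modelled.

```python
from typing import Any, Dict, List, Optional

Columns = Dict[str, List[Any]]


def fill_null_sketch(
    columns: Columns, value: Any, subset: Optional[List[str]] = None
) -> Columns:
    """Return a new mapping with None replaced by `value` in the target columns.

    Hypothetical stand-in for illustration only; not the datafusion API.
    """
    # With no subset given, every column is a target.
    targets = set(columns) if subset is None else set(subset)
    # Build new lists so the input mapping is never mutated.
    return {
        name: [value if (v is None and name in targets) else v for v in values]
        for name, values in columns.items()
    }


data = {"a": [1, None, 3], "b": [None, "x", None]}
filled_all = fill_null_sketch(data, 0)              # fill every column
filled_a = fill_null_sketch(data, 0, subset=["a"])  # fill only column "a"
```

In this model, `data` still contains its `None` entries after both calls, which mirrors the immutability scenario in the test list.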

Are these changes tested?

✅ Yes. The PR includes an extensive set of unit tests in test_dataframe.py, validating behavior across data types, subsets, error tolerance, and null-handling edge cases.

Are there any user-facing changes?

✅ Yes:

  • A new fill_null() method is now available on Python DataFrames.
  • Documentation has been updated with examples and behavior details under common-operations/functions.rst.

@kosiew kosiew marked this pull request as ready for review February 12, 2025 09:47
Contributor

@timsaucer timsaucer left a comment

This looks like a very worthwhile and useful addition. Thank you!

We've tried to keep most of the heavier logic on the Rust side and to keep the Python wrappers as a way to convert from Rust to Pythonic interfaces. Do you think this is a case where doing the logic on the Python side makes more sense?

More generally, do you think this is something we can or should upstream to the core datafusion repo? I can assist with that if you like.

@timsaucer timsaucer marked this pull request as draft March 8, 2025 13:28
@timsaucer
Contributor

Hi @kosiew, I moved this to draft since it looks like you're doing a good job on the upstream work, which would change how we would want to handle this.

@kosiew
Contributor Author

kosiew commented Mar 10, 2025

thanks

@kosiew
Contributor Author

kosiew commented Mar 14, 2025

The upstream PR for fill_null is included in datafusion 46.0.0.
We can revisit this when datafusion-python upgrades its dependency to 46.0.0.

@@ -1236,3 +1236,57 @@ def test_between_default(df):
def test_alias_with_metadata(df):
    df = df.select(f.alias(f.col("a"), "b", {"key": "value"}))
    assert df.schema().field("b").metadata == {b"key": b"value"}


def test_coalesce(df):
Contributor Author

Added this test because while researching this PR, I initially checked out the coalesce function and found there were no tests yet.
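
For context, coalesce returns, row by row, the first argument that is not null, which is why it is a natural building block for null filling. The following is a small pure-Python model of that rule, not the `f.coalesce` implementation and not the test added here; `coalesce_rows` is a hypothetical helper name.

```python
from typing import Any, List, Optional


def coalesce_rows(*columns: List[Optional[Any]]) -> List[Optional[Any]]:
    """For each row, return the first value that is not None (else None)."""
    result = []
    for row in zip(*columns):
        # next() with a default yields None when every value in the row is None.
        result.append(next((v for v in row if v is not None), None))
    return result


a = [1, None, None]
b = [None, 2, None]
default = [0, 0, 0]

merged = coalesce_rows(a, b, default)  # the default only fills rows where a and b are both None
partial = coalesce_rows(a, b)          # the last row stays None: no argument supplies a value
```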

Comment on lines -87 to -101
fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> ScalarValue {
    if let Ok(value) = obj.extract::<bool>(py) {
        ScalarValue::Boolean(Some(value))
    } else if let Ok(value) = obj.extract::<i64>(py) {
        ScalarValue::Int64(Some(value))
    } else if let Ok(value) = obj.extract::<u64>(py) {
        ScalarValue::UInt64(Some(value))
    } else if let Ok(value) = obj.extract::<f64>(py) {
        ScalarValue::Float64(Some(value))
    } else if let Ok(value) = obj.extract::<String>(py) {
        ScalarValue::Utf8(Some(value))
    } else {
        panic!("Unsupported value type")
    }
}
Contributor Author

Moved to src/utils.rs with a simpler implementation

Comment on lines +93 to +107
pub(crate) fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> PyResult<ScalarValue> {
    // convert Python object to PyScalarValue to ScalarValue

    let pa = py.import("pyarrow")?;

    // Convert Python object to PyArrow scalar
    let scalar = pa.call_method1("scalar", (obj,))?;

    // Convert PyArrow scalar to PyScalarValue
    let py_scalar = PyScalarValue::extract_bound(scalar.as_ref())
        .map_err(|e| PyValueError::new_err(format!("Failed to extract PyScalarValue: {}", e)))?;

    // Convert PyScalarValue to ScalarValue
    Ok(py_scalar.into())
}
Contributor Author

The above is simpler than the original:

fn py_obj_to_scalar_value(py: Python, obj: PyObject) -> ScalarValue {
    if let Ok(value) = obj.extract::<bool>(py) {
        ScalarValue::Boolean(Some(value))
    } else if let Ok(value) = obj.extract::<i64>(py) {
        ScalarValue::Int64(Some(value))
    } else if let Ok(value) = obj.extract::<u64>(py) {
        ScalarValue::UInt64(Some(value))
    } else if let Ok(value) = obj.extract::<f64>(py) {
        ScalarValue::Float64(Some(value))
    } else if let Ok(value) = obj.extract::<String>(py) {
        ScalarValue::Utf8(Some(value))
    } else {
        panic!("Unsupported value type")
    }
}

which did not handle other Python scalars, e.g. datetime.
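
The gap can be illustrated with a rough Python model of the original branch chain. This is a sketch under assumptions, not the Rust code: `scalar_kind` and its string labels are illustrative names, it raises `TypeError` where the original panics, and integers outside both the signed and unsigned 64-bit ranges are simply rejected, which may differ from what the Rust extraction chain would do.

```python
from datetime import datetime
from typing import Any

I64_MIN, I64_MAX = -(2**63), 2**63 - 1
U64_MAX = 2**64 - 1


def scalar_kind(value: Any) -> str:
    """Name the scalar variant a fixed if/else chain would pick for `value`."""
    # bool is tested first: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        if I64_MIN <= value <= I64_MAX:
            return "Int64"
        if 0 <= value <= U64_MAX:
            return "UInt64"
        # Simplification of this sketch: out-of-range integers are rejected.
        raise TypeError("Unsupported value type: integer out of 64-bit range")
    if isinstance(value, float):
        return "Float64"
    if isinstance(value, str):
        return "Utf8"
    # Anything else (datetime, date, Decimal, ...) falls through the chain.
    raise TypeError(f"Unsupported value type: {type(value).__name__}")


kinds = [scalar_kind(v) for v in (True, 1, 2**63, 1.5, "x")]

try:
    scalar_kind(datetime(2025, 1, 1))
    datetime_supported = True
except TypeError:
    datetime_supported = False
```

Every new scalar type needs another branch in a chain like this, whereas delegating inference to `pyarrow.scalar`, as the replacement does, avoids maintaining the list by hand.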

@kosiew kosiew marked this pull request as ready for review April 30, 2025 12:38
@kosiew kosiew changed the title from "Add DataFrame fill_nan/fill_null" to "Add fill_null method to DataFrame API for handling missing values" Apr 30, 2025

Successfully merging this pull request may close these issues.

Add DataFrame fill_nan/fill_null