
Commit 3c91c58

[SPARK-56742][PYTHON][TESTS] Skip string-to-decimal failure assertion on pandas 3 in test_type_coercion_string_to_numeric
### What changes were proposed in this pull request?

Gate one `assertRaises(PythonException)` block in `ArrowPythonUDFTestsMixin.test_type_coercion_string_to_numeric` on `LooseVersion(pd.__version__) < "3.0.0"`. Specifically, the `string("1","2") -> decimal` failure assertion is skipped on pandas 3+. The other failure assertions (`"1.1" -> int`, `"1.1" -> decimal`) and all success cases are unchanged.

### Why are the changes needed?

`ArrowPythonUDFLegacyTests.test_type_coercion_string_to_numeric` is failing on the scheduled `Build / Python-only (master, Python 3.12, Pandas 3)` job, e.g. https://github.com/apache/spark/actions/runs/25402959034/job/74508177526.

Root cause: pandas 3's `StringDtype` implements `__arrow_array__`. In `PandasToArrowConversion.convert` (`python/pyspark/sql/conversion.py`), the path is

```python
mask = None if hasattr(series.array, "__arrow_array__") else series.isnull()
...
pa.Array.from_pandas(series, mask=mask, type=arrow_type, safe=safecheck)
```

On pandas 2 the result series of strings has object dtype, no `__arrow_array__`, and `from_pandas` with `type=decimal128(...)` raises `ArrowTypeError` ("int or Decimal object expected, got str"), which surfaces as `PythonException`. On pandas 3 the series has `StringDtype`, the mask is `None`, and the `__arrow_array__` protocol cleanly casts `"1"` to `Decimal("1")`: the conversion silently succeeds, so `assertRaises(PythonException)` fails.

The non-legacy `ArrowPythonUDF` path is unaffected because it converts a Python list directly via `pa.array(list, type=...)`, where pyarrow's per-element type check still rejects `str` for `Decimal`.

### Does this PR introduce _any_ user-facing change?

No. Test-only.

### How was this patch tested?

Verified locally in a Python 3.13 + pandas 3.0.2 + pyarrow 23.0.1 conda env.
All three suites pass:

```
$ python/run-tests --testnames \
  "pyspark.sql.tests.arrow.test_arrow_python_udf ArrowPythonUDFLegacyTests.test_type_coercion_string_to_numeric, \
  pyspark.sql.tests.arrow.test_arrow_python_udf ArrowPythonUDFTests.test_type_coercion_string_to_numeric, \
  pyspark.sql.tests.arrow.test_arrow_python_udf ArrowPythonUDFNonLegacyTests.test_type_coercion_string_to_numeric"
...
Tests passed in 11 seconds
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

Closes #55698 from zhengruifeng/fix-arrow-legacy-type-coercion-test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit c23e166)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
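As a footnote to the root-cause analysis above, the behavioral split between the two conversion paths can be mimicked with a stdlib-only sketch (no pandas or pyarrow required). Both helper names here, `cast_via_string_parse` and `cast_with_element_check`, are hypothetical stand-ins for illustration: the first mimics the `__arrow_array__` cast that parses integer-like strings into decimals, the second mimics pyarrow's per-element type check on a plain Python list.

```python
from decimal import Decimal


def cast_via_string_parse(values):
    # Mimics the pandas-3 legacy path: StringDtype.__arrow_array__ hands
    # pyarrow a typed string array, and the cast to decimal128 parses each
    # string, so "1" becomes Decimal("1") without raising.
    return [Decimal(v) for v in values]


def cast_with_element_check(values):
    # Mimics pa.array(list, type=decimal128(...)): each Python element is
    # type-checked first, so a str is rejected before any parsing happens.
    out = []
    for v in values:
        if not isinstance(v, (int, Decimal)):
            raise TypeError(f"int or Decimal object expected, got {type(v).__name__}")
        out.append(Decimal(v))
    return out


print(cast_via_string_parse(["1", "2"]))  # silently succeeds, like pandas 3
try:
    cast_with_element_check(["1", "2"])
except TypeError as e:
    print("rejected:", e)  # like the non-legacy path on any pandas version
```

This is why only the legacy (`ArrowPythonUDFLegacyTests`) suite needed the version gate: the non-legacy path never reaches the string-parsing cast.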
1 parent c90b96f commit 3c91c58

1 file changed

Lines changed: 10 additions & 2 deletions

python/pyspark/sql/tests/arrow/test_arrow_python_udf.py
```diff
@@ -19,6 +19,7 @@
 import unittest

 from pyspark.errors import AnalysisException, PythonException, PySparkNotImplementedError
+from pyspark.loose_version import LooseVersion
 from pyspark.sql import Row
 from pyspark.sql.functions import udf, col
 from pyspark.sql.tests.test_udf import BaseUDFTestsMixin
@@ -43,6 +44,9 @@
 )
 from pyspark.util import PythonEvalType

+if have_pandas:
+    import pandas as pd
+

 @unittest.skipIf(
     not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message
@@ -190,8 +194,12 @@ def test_type_coercion_string_to_numeric(self):
         with self.assertRaises(PythonException):
             df_floating_value.select(udf(lambda x: x, "int")("value").alias("res")).collect()

-        with self.assertRaises(PythonException):
-            df_int_value.select(udf(lambda x: x, "decimal")("value").alias("res")).collect()
+        # Skip on pandas 3+ legacy conversion: pandas 3's StringDtype implements
+        # __arrow_array__, which lets pyarrow coerce integer-like strings to
+        # decimal. Older pandas (object dtype) raised ArrowTypeError here.
+        if LooseVersion(pd.__version__) < "3.0.0":
+            with self.assertRaises(PythonException):
+                df_int_value.select(udf(lambda x: x, "decimal")("value").alias("res")).collect()

         with self.assertRaises(PythonException):
             df_floating_value.select(udf(lambda x: x, "decimal")("value").alias("res")).collect()
```
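The gating condition in the diff can be sketched without pyspark. The real test uses `pyspark.loose_version.LooseVersion`; the `numeric_prefix` helper below is a hypothetical minimal stand-in that compares only the leading dotted-integer part of a version string (it ignores pre-release tags, which `LooseVersion` handles more carefully), but it behaves the same for plain release strings like `"2.2.3"` vs `"3.0.0"`.

```python
def numeric_prefix(version):
    # Parse the leading dotted-integer part of a version string into a
    # tuple, e.g. "3.0.2" -> (3, 0, 2). Anything non-numeric is dropped.
    parts = []
    for p in version.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)


def should_assert_decimal_failure(pandas_version):
    # Mirror of the test's gate: only pandas < 3.0.0 is expected to raise
    # for the string -> decimal coercion on the legacy conversion path.
    return numeric_prefix(pandas_version) < (3, 0, 0)


print(should_assert_decimal_failure("2.2.3"))  # True: assertion still runs
print(should_assert_decimal_failure("3.0.2"))  # False: assertion is skipped
```

Gating only this one assertion, rather than skipping the whole test on pandas 3+, keeps the remaining failure and success cases exercised on every pandas version.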
