
Commit 3c91c58

[SPARK-56742][PYTHON][TESTS] Skip string-to-decimal failure assertion on pandas 3 in test_type_coercion_string_to_numeric
### What changes were proposed in this pull request?

Gate one `assertRaises(PythonException)` block in `ArrowPythonUDFTestsMixin.test_type_coercion_string_to_numeric` on `LooseVersion(pd.__version__) < "3.0.0"`. Specifically, the `string("1","2") -> decimal` failure assertion is skipped on pandas 3+. The other failure assertions (`"1.1" -> int`, `"1.1" -> decimal`) and all success cases are unchanged.

### Why are the changes needed?

`ArrowPythonUDFLegacyTests.test_type_coercion_string_to_numeric` is failing on the scheduled `Build / Python-only (master, Python 3.12, Pandas 3)` job, e.g. https://github.com/apache/spark/actions/runs/25402959034/job/74508177526.

Root cause: pandas 3's `StringDtype` implements `__arrow_array__`. In `PandasToArrowConversion.convert` (`python/pyspark/sql/conversion.py`), the path is

```python
mask = None if hasattr(series.array, "__arrow_array__") else series.isnull()
...
pa.Array.from_pandas(series, mask=mask, type=arrow_type, safe=safecheck)
```

On pandas 2 the result series of strings has object dtype, no `__arrow_array__`, and `from_pandas` with `type=decimal128(...)` raises `ArrowTypeError` ("int or Decimal object expected, got str"), which surfaces as `PythonException`. On pandas 3 the series has `StringDtype`, the mask is `None`, and the `__arrow_array__` protocol cleanly casts `"1"` to `Decimal("1")`: the conversion silently succeeds, so `assertRaises(PythonException)` fails.

The non-legacy `ArrowPythonUDF` path is unaffected because it converts a Python list directly via `pa.array(list, type=...)`, where pyarrow's per-element type check still rejects `str` for `Decimal`.

### Does this PR introduce _any_ user-facing change?

No. Test-only.

### How was this patch tested?

Verified locally in a Python 3.13 + pandas 3.0.2 + pyarrow 23.0.1 conda env.
All three suites pass:

```
$ python/run-tests --testnames \
  "pyspark.sql.tests.arrow.test_arrow_python_udf ArrowPythonUDFLegacyTests.test_type_coercion_string_to_numeric, \
  pyspark.sql.tests.arrow.test_arrow_python_udf ArrowPythonUDFTests.test_type_coercion_string_to_numeric, \
  pyspark.sql.tests.arrow.test_arrow_python_udf ArrowPythonUDFNonLegacyTests.test_type_coercion_string_to_numeric"
...
Tests passed in 11 seconds
```

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

Closes #55698 from zhengruifeng/fix-arrow-legacy-type-coercion-test.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
(cherry picked from commit c23e166)
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
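As a footnote to the root-cause analysis above, the behavioral split between the two conversion paths can be mimicked with a stdlib-only sketch (no pandas or pyarrow required). Both helper names here, `cast_via_string_parse` and `cast_with_element_check`, are hypothetical stand-ins for illustration: the first mimics the `__arrow_array__` cast that parses integer-like strings into decimals, the second mimics pyarrow's per-element type check on a plain Python list.

```python
from decimal import Decimal


def cast_via_string_parse(values):
    # Mimics the pandas-3 legacy path: StringDtype.__arrow_array__ hands
    # pyarrow a typed string array, and the cast to decimal128 parses each
    # string, so "1" becomes Decimal("1") without raising.
    return [Decimal(v) for v in values]


def cast_with_element_check(values):
    # Mimics pa.array(list, type=decimal128(...)): each Python element is
    # type-checked first, so a str is rejected before any parsing happens.
    out = []
    for v in values:
        if not isinstance(v, (int, Decimal)):
            raise TypeError(f"int or Decimal object expected, got {type(v).__name__}")
        out.append(Decimal(v))
    return out


print(cast_via_string_parse(["1", "2"]))  # silently succeeds, like pandas 3
try:
    cast_with_element_check(["1", "2"])
except TypeError as e:
    print("rejected:", e)  # like the non-legacy path on any pandas version
```

This is why only the legacy (`ArrowPythonUDFLegacyTests`) suite needed the version gate: the non-legacy path never reaches the string-parsing cast.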
1 parent c90b96f commit 3c91c58

1 file changed

Lines changed: 10 additions & 2 deletions

python/pyspark/sql/tests/arrow/test_arrow_python_udf.py
```diff
@@ -19,6 +19,7 @@
 import unittest

 from pyspark.errors import AnalysisException, PythonException, PySparkNotImplementedError
+from pyspark.loose_version import LooseVersion
 from pyspark.sql import Row
 from pyspark.sql.functions import udf, col
 from pyspark.sql.tests.test_udf import BaseUDFTestsMixin
@@ -43,6 +44,9 @@
 )
 from pyspark.util import PythonEvalType

+if have_pandas:
+    import pandas as pd
+

 @unittest.skipIf(
     not have_pandas or not have_pyarrow, pandas_requirement_message or pyarrow_requirement_message
@@ -190,8 +194,12 @@ def test_type_coercion_string_to_numeric(self):
         with self.assertRaises(PythonException):
             df_floating_value.select(udf(lambda x: x, "int")("value").alias("res")).collect()

-        with self.assertRaises(PythonException):
-            df_int_value.select(udf(lambda x: x, "decimal")("value").alias("res")).collect()
+        # Skip on pandas 3+ legacy conversion: pandas 3's StringDtype implements
+        # __arrow_array__, which lets pyarrow coerce integer-like strings to
+        # decimal. Older pandas (object dtype) raised ArrowTypeError here.
+        if LooseVersion(pd.__version__) < "3.0.0":
+            with self.assertRaises(PythonException):
+                df_int_value.select(udf(lambda x: x, "decimal")("value").alias("res")).collect()

         with self.assertRaises(PythonException):
             df_floating_value.select(udf(lambda x: x, "decimal")("value").alias("res")).collect()
```
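The gating condition in the diff can be sketched without pyspark. The real test uses `pyspark.loose_version.LooseVersion`; the `numeric_prefix` helper below is a hypothetical minimal stand-in that compares only the leading dotted-integer part of a version string (it ignores pre-release tags, which `LooseVersion` handles more carefully), but it behaves the same for plain release strings like `"2.2.3"` vs `"3.0.0"`.

```python
def numeric_prefix(version):
    # Parse the leading dotted-integer part of a version string into a
    # tuple, e.g. "3.0.2" -> (3, 0, 2). Anything non-numeric is dropped.
    parts = []
    for p in version.split("."):
        if not p.isdigit():
            break
        parts.append(int(p))
    return tuple(parts)


def should_assert_decimal_failure(pandas_version):
    # Mirror of the test's gate: only pandas < 3.0.0 is expected to raise
    # for the string -> decimal coercion on the legacy conversion path.
    return numeric_prefix(pandas_version) < (3, 0, 0)


print(should_assert_decimal_failure("2.2.3"))  # True: assertion still runs
print(should_assert_decimal_failure("3.0.2"))  # False: assertion is skipped
```

Gating only this one assertion, rather than skipping the whole test on pandas 3+, keeps the remaining failure and success cases exercised on every pandas version.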
