Commit c187333

tadeja, rok, and pitrou authored
GH-48241: [Python] Scalar inferencing doesn't infer UUID (#48727)
### Rationale for this change

This closes #48241, #44224 and #43855.

Currently uuid.UUID objects are not inferred/converted automatically in PyArrow, requiring users to explicitly specify the type.

### What changes are included in this PR?

Adding support for Python's uuid.UUID objects in PyArrow's type inference and conversion.

### Are these changes tested?

Yes, added test_uuid_scalar_from_python() and test_uuid_array_from_python() in `test_extension.py`.

### Are there any user-facing changes?

Users can now pass Python uuid.UUID objects directly to PyArrow functions like pa.scalar() and pa.array() without specifying the type:

```python
import uuid
import pyarrow as pa
pa.scalar(uuid.uuid4())
```
<pyarrow.UuidScalar: UUID('958174b9-3a5c-4cdd-8fc5-d51a2fc55784')>
```python
pa.array([uuid.uuid4()])
```
<pyarrow.lib.UuidArray object at 0x1217725f0>
[
  73611FD81F764A209C8B9CDBADDA1F53
]

* GitHub Issue: #48241

Lead-authored-by: Tadeja Kadunc <tadeja.kadunc@gmail.com>
Co-authored-by: tadeja <tadeja@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
1 parent cfbbf70 commit c187333

7 files changed

Lines changed: 265 additions & 71 deletions


docs/source/python/extending_types.rst

Lines changed: 41 additions & 3 deletions
@@ -476,8 +476,8 @@ You can find the official list of canonical extension types in the
 :ref:`format_canonical_extensions` section. Here we add examples on how to
 use them in PyArrow.
 
-Fixed size tensor
-"""""""""""""""""
+Fixed shape tensor
+""""""""""""""""""
 
 To create an array of tensors with equal shape (fixed shape tensor array) we
 first need to define a fixed shape tensor extension type with value type
@@ -487,7 +487,7 @@ and shape:
 
    >>> tensor_type = pa.fixed_shape_tensor(pa.int32(), (2, 2))
 
-Then we need the storage array with :func:`pyarrow.list_` type where ``value_type```
+Then we need the storage array with :func:`pyarrow.list_` type where ``value_type``
 is the fixed shape tensor value type and list size is a product of ``tensor_type``
 shape elements. Then we can create an array of tensors with
 ``pa.ExtensionArray.from_storage()`` method:
@@ -629,3 +629,41 @@ for ``NCHW`` format where:
 * C: number of channels of the image
 * H: height of the image
 * W: width of the image
+
+UUID
+""""
+
+The UUID extension type (``arrow.uuid``) represents universally unique
+identifiers as 16-byte fixed-size binary values. PyArrow provides integration
+with Python's built-in :mod:`uuid` module, including automatic type inference.
+
+Creating UUID scalars and arrays
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+PyArrow infers the UUID type from Python's ``uuid.UUID`` objects,
+so you can pass them directly to :func:`pyarrow.scalar` and :func:`pyarrow.array`:
+
+.. code-block:: python
+
+   >>> import uuid
+   >>> import pyarrow as pa
+
+   >>> pa.scalar(uuid.uuid4())
+   <pyarrow.UuidScalar: UUID('...')>
+
+   >>> uuids = [uuid.uuid4() for _ in range(3)]
+   >>> arr = pa.array(uuids)
+   >>> arr.type
+   UuidType(extension<arrow.uuid>)
+
+You can also explicitly specify the UUID type using :func:`pyarrow.uuid`:
+
+.. code-block:: python
+
+   >>> pa.array([uuid.uuid4(), uuid.uuid4()], type=pa.uuid())
+   <pyarrow.lib.UuidArray object at ...>
+   [
+     ...,
+     ...
+   ]
+
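
For orientation, the "16-byte fixed-size binary" storage described in the documentation above lines up with `uuid.UUID.bytes` from the standard library. The following is a minimal stdlib-only sketch (PyArrow is not involved, and the variable names are illustrative) using the UUID value shown in the commit message:

```python
import uuid

u = uuid.UUID("958174b9-3a5c-4cdd-8fc5-d51a2fc55784")

# uuid.UUID.bytes is the 16-byte big-endian form of the identifier,
# the same width as a 16-byte fixed-size binary value.
raw = u.bytes
print(len(raw))

# Uppercase hex of those bytes, in the same style as the element printed
# in the array repr of the commit message.
print(raw.hex().upper())

# The bytes round-trip back to an equal UUID object.
restored = uuid.UUID(bytes=raw)
print(restored == u)
```

Because the byte form is lossless, storing only `bytes` is enough to rebuild the original `uuid.UUID` on the way back out.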

python/pyarrow/src/arrow/python/common.h

Lines changed: 14 additions & 0 deletions
@@ -419,6 +419,20 @@ struct PyBytesView {
     return Status::OK();
   }
 
+  // Parse bytes from a uuid.UUID object (stores reference to keep bytes alive)
+  Status ParseUuid(PyObject* obj) {
+    ref.reset(PyObject_GetAttrString(obj, "bytes"));
+    RETURN_IF_PYERROR();
+    if (!PyBytes_Check(ref.obj())) {
+      return Status::TypeError("Expected uuid.UUID.bytes to return bytes, got '",
+                               Py_TYPE(ref.obj())->tp_name, "' object");
+    }
+    bytes = PyBytes_AS_STRING(ref.obj());
+    size = PyBytes_GET_SIZE(ref.obj());
+    is_utf8 = false;
+    return Status::OK();
+  }
+
  protected:
   OwnedRef ref;
 };
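
To make the control flow of `ParseUuid` easier to follow, here is a rough Python model of the same steps: fetch the `bytes` attribute, check that it really is a `bytes` object, then record the data, its size, and `is_utf8 = false`. The function name `parse_uuid` and the tuple return value are assumptions made for illustration and are not part of the commit.

```python
import uuid


def parse_uuid(obj):
    """Hypothetical Python model of the C++ ParseUuid helper."""
    # Mirrors PyObject_GetAttrString(obj, "bytes"); a missing attribute
    # raises AttributeError here, where the C++ code returns a Python error.
    raw = getattr(obj, "bytes")
    # Mirrors the PyBytes_Check guard and its TypeError message.
    if not isinstance(raw, bytes):
        raise TypeError(
            "Expected uuid.UUID.bytes to return bytes, got "
            f"'{type(raw).__name__}' object"
        )
    # The C++ view keeps a pointer, a size, and is_utf8 = false.
    return raw, len(raw), False


data, size, is_utf8 = parse_uuid(uuid.UUID(int=1))
print(size, is_utf8)
```

The type check matters because the attribute lookup alone would accept any object that happens to expose something called `bytes`.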

python/pyarrow/src/arrow/python/helpers.cc

Lines changed: 89 additions & 65 deletions
@@ -296,16 +296,69 @@ bool PyFloat_IsNaN(PyObject* obj) {
 
 namespace {
 
-// This needs a conditional, because using std::once_flag could introduce
-// a deadlock when the GIL is enabled. See
-// https://github.com/apache/arrow/commit/f69061935e92e36e25bb891177ca8bc4f463b272 for
-// more info.
+// Thread-safe one-time Python module import + attribute lookup. For Pandas and UUID.
+// Uses std::call_once when the GIL is disabled, or a simple boolean flag when
+// the GIL is enabled to avoid deadlocks. See ARROW-10519 for more details and
+// https://github.com/apache/arrow/commit/f69061935e92e36e25bb891177ca8bc4f463b272
+struct ModuleOnceRunner {
+  std::string module_name;
 #ifdef Py_GIL_DISABLED
-static std::once_flag pandas_static_initialized;
+  std::once_flag initialized;
 #else
-static bool pandas_static_initialized = false;
+  bool initialized = false;
 #endif
 
+  explicit ModuleOnceRunner(const std::string& module_name) : module_name(module_name) {}
+
+  template <typename Func>
+  void RunOnce(Func&& func) {
+    auto do_init = [&]() {
+      OwnedRef module;
+      if (ImportModule(module_name, &module).ok()) {
+#ifndef Py_GIL_DISABLED
+        // Since ImportModule can release the GIL, another thread could have
+        // already initialized the static data.
+        if (initialized) {
+          return;
+        }
+#endif
+        func(module);
+      }
+    };
+#ifdef Py_GIL_DISABLED
+    std::call_once(initialized, do_init);
+#else
+    if (!initialized) {
+      do_init();
+      initialized = true;
+    }
+#endif
+  }
+};
+
+static PyObject* uuid_UUID = nullptr;
+static ModuleOnceRunner uuid_runner("uuid");
+
+} // namespace
+
+bool IsPyUuid(PyObject* obj) {
+  uuid_runner.RunOnce([](OwnedRef& module) {
+    OwnedRef ref;
+    if (ImportFromModule(module.obj(), "UUID", &ref).ok()) {
+      uuid_UUID = ref.obj();
+    }
+  });
+  if (!uuid_UUID) return false;
+  int result = PyObject_IsInstance(obj, uuid_UUID);
+  if (result < 0) {
+    PyErr_Clear();
+    return false;
+  }
+  return result != 0;
+}
+
+namespace {
+
 // Once initialized, these variables hold borrowed references to Pandas static data.
 // We should not use OwnedRef here because Python destructors would be
 // called on a finalized interpreter.
@@ -315,72 +368,43 @@ static PyObject* pandas_Timedelta = nullptr;
 static PyObject* pandas_Timestamp = nullptr;
 static PyTypeObject* pandas_NaTType = nullptr;
 static PyObject* pandas_DateOffset = nullptr;
+static ModuleOnceRunner pandas_runner("pandas");
 
-void GetPandasStaticSymbols() {
-  OwnedRef pandas;
-
-  // Import pandas
-  Status s = ImportModule("pandas", &pandas);
-  if (!s.ok()) {
-    return;
-  }
-
-#ifndef Py_GIL_DISABLED
-  // Since ImportModule can release the GIL, another thread could have
-  // already initialized the static data.
-  if (pandas_static_initialized) {
-    return;
-  }
-#endif
-
-  OwnedRef ref;
-
-  // set NaT sentinel and its type
-  if (ImportFromModule(pandas.obj(), "NaT", &ref).ok()) {
-    pandas_NaT = ref.obj();
-    // PyObject_Type returns a new reference but we trust that pandas.NaT will
-    // outlive our use of this PyObject*
-    pandas_NaTType = Py_TYPE(ref.obj());
-  }
-
-  // retain a reference to Timedelta
-  if (ImportFromModule(pandas.obj(), "Timedelta", &ref).ok()) {
-    pandas_Timedelta = ref.obj();
-  }
+} // namespace
 
-  // retain a reference to Timestamp
-  if (ImportFromModule(pandas.obj(), "Timestamp", &ref).ok()) {
-    pandas_Timestamp = ref.obj();
-  }
+void InitPandasStaticData() {
+  pandas_runner.RunOnce([](OwnedRef& module) {
+    OwnedRef ref;
+
+    // set NaT sentinel and its type
+    if (ImportFromModule(module.obj(), "NaT", &ref).ok()) {
+      pandas_NaT = ref.obj();
+      // PyObject_Type returns a new reference but we trust that pandas.NaT will
+      // outlive our use of this PyObject*
+      pandas_NaTType = Py_TYPE(ref.obj());
+    }
 
-  // if pandas.NA exists, retain a reference to it
-  if (ImportFromModule(pandas.obj(), "NA", &ref).ok()) {
-    pandas_NA = ref.obj();
-  }
+    // retain a reference to Timedelta
+    if (ImportFromModule(module.obj(), "Timedelta", &ref).ok()) {
+      pandas_Timedelta = ref.obj();
+    }
 
-  // Import DateOffset type
-  if (ImportFromModule(pandas.obj(), "DateOffset", &ref).ok()) {
-    pandas_DateOffset = ref.obj();
-  }
-}
+    // retain a reference to Timestamp
+    if (ImportFromModule(module.obj(), "Timestamp", &ref).ok()) {
+      pandas_Timestamp = ref.obj();
+    }
 
-} // namespace
+    // if pandas.NA exists, retain a reference to it
+    if (ImportFromModule(module.obj(), "NA", &ref).ok()) {
+      pandas_NA = ref.obj();
+    }
 
-#ifdef Py_GIL_DISABLED
-void InitPandasStaticData() {
-  std::call_once(pandas_static_initialized, GetPandasStaticSymbols);
-}
-#else
-void InitPandasStaticData() {
-  // NOTE: This is called with the GIL held. We needn't (and shouldn't,
-  // to avoid deadlocks) use an additional C++ lock (ARROW-10519).
-  if (pandas_static_initialized) {
-    return;
-  }
-  GetPandasStaticSymbols();
-  pandas_static_initialized = true;
+    // Import DateOffset type
+    if (ImportFromModule(module.obj(), "DateOffset", &ref).ok()) {
+      pandas_DateOffset = ref.obj();
+    }
+  });
 }
-#endif
 
 bool PandasObjectIsNull(PyObject* obj) {
   if (!MayHaveNaN(obj)) {
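
The run-once pattern that `ModuleOnceRunner` factors out of the pandas code can be sketched in plain Python. This is only an analogue under stated assumptions: the lock stands in for `std::once_flag` (the free-threaded branch), the method name `run_once` and the `cache` dictionary are hypothetical, and the real GIL-enabled branch uses a bare boolean with no lock at all.

```python
import importlib
import threading


class ModuleOnceRunner:
    """Hypothetical Python analogue of the C++ helper (illustration only)."""

    def __init__(self, module_name):
        self.module_name = module_name
        self.initialized = False
        self._lock = threading.Lock()  # stand-in for std::once_flag

    def run_once(self, func):
        if self.initialized:
            return
        with self._lock:
            # Re-check: another thread may have finished while we waited.
            if self.initialized:
                return
            try:
                module = importlib.import_module(self.module_name)
            except ImportError:
                module = None
            if module is not None:
                func(module)
            # As in the C++ code, a failed import still counts as "done".
            self.initialized = True


cache = {}
calls = []


def load(module):
    calls.append(module.__name__)
    cache["UUID"] = module.UUID


runner = ModuleOnceRunner("uuid")
runner.run_once(load)
runner.run_once(load)  # second call does nothing
print(calls)
```

Marking the runner as initialized even when the import fails means a missing optional module such as pandas is probed once rather than on every call.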

python/pyarrow/src/arrow/python/helpers.h

Lines changed: 4 additions & 0 deletions
@@ -92,6 +92,10 @@ PyObject* BorrowPandasDataOffsetType();
 ARROW_PYTHON_EXPORT
 bool PyFloat_IsNaN(PyObject* obj);
 
+// \brief Check whether obj is a uuid.UUID instance
+ARROW_PYTHON_EXPORT
+bool IsPyUuid(PyObject* obj);
+
 inline bool IsPyBinary(PyObject* obj) {
   return PyBytes_Check(obj) || PyByteArray_Check(obj) || PyMemoryView_Check(obj);
 }

python/pyarrow/src/arrow/python/inference.cc

Lines changed: 8 additions & 0 deletions
@@ -27,6 +27,7 @@
 #include <utility>
 #include <vector>
 
+#include "arrow/extension/uuid.h"
 #include "arrow/scalar.h"
 #include "arrow/status.h"
 #include "arrow/util/decimal.h"
@@ -407,6 +408,7 @@ class TypeInferrer {
         arrow_scalar_count_(0),
         numpy_dtype_count_(0),
         interval_count_(0),
+        uuid_count_(0),
         max_decimal_metadata_(std::numeric_limits<int32_t>::min(),
                               std::numeric_limits<int32_t>::min()),
         decimal_type_() {
@@ -475,6 +477,9 @@ class TypeInferrer {
       ++decimal_count_;
     } else if (PyObject_IsInstance(obj, interval_types_.obj())) {
       ++interval_count_;
+    } else if (internal::IsPyUuid(obj)) {
+      ++uuid_count_;
+      *keep_going = make_unions_;
     } else {
       return internal::InvalidValue(obj,
                                     "did not recognize Python value type when inferring "
@@ -604,6 +609,8 @@ class TypeInferrer {
       *out = utf8();
     } else if (interval_count_) {
       *out = month_day_nano_interval();
+    } else if (uuid_count_) {
+      *out = extension::uuid();
     } else if (arrow_scalar_count_) {
       *out = scalar_type_;
     } else {
@@ -766,6 +773,7 @@ class TypeInferrer {
   int64_t arrow_scalar_count_;
   int64_t numpy_dtype_count_;
   int64_t interval_count_;
+  int64_t uuid_count_;
   std::unique_ptr<TypeInferrer> list_inferrer_;
   std::vector<std::pair<std::string, TypeInferrer>> struct_inferrers_;
   std::unordered_map<std::string, size_t> struct_field_index_;
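
The counting approach in the hunks above can be modelled with a toy function. This sketch covers only the new UUID branch; every other branch of the real `TypeInferrer` is left out, the name `infer_type_name` and the string return value are hypothetical, and the early exit assumes that `keep_going` controls whether the scan continues, as its name suggests.

```python
import uuid


def infer_type_name(values, make_unions=False):
    """Toy model: only the UUID branch of the inferrer is represented."""
    uuid_count = 0
    for value in values:
        if isinstance(value, uuid.UUID):
            uuid_count += 1
            # Mirrors `*keep_going = make_unions_`: without union support,
            # one UUID is treated as enough to settle the answer.
            keep_going = make_unions
        else:
            # Placeholder error; the real inferrer handles many more types.
            raise TypeError("unsupported value in this sketch")
        if not keep_going:
            break
    # Mirrors `else if (uuid_count_) *out = extension::uuid()`.
    return "extension<arrow.uuid>" if uuid_count else None


print(infer_type_name([uuid.uuid4(), uuid.uuid4()]))
```

An empty input leaves the counter at zero, so this model returns `None` there; what the real chain returns in that case is outside the lines shown in the diff.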

python/pyarrow/src/arrow/python/python_to_arrow.cc

Lines changed: 22 additions & 3 deletions
@@ -36,6 +36,7 @@
 #include "arrow/array/builder_primitive.h"
 #include "arrow/array/builder_time.h"
 #include "arrow/chunked_array.h"
+#include "arrow/extension_type.h"
 #include "arrow/result.h"
 #include "arrow/scalar.h"
 #include "arrow/status.h"
@@ -512,7 +513,12 @@ class PyValue {
 
   static Status Convert(const FixedSizeBinaryType* type, const O&, I obj,
                         PyBytesView& view) {
-    ARROW_RETURN_NOT_OK(view.ParseString(obj));
+    // Check if obj is a uuid.UUID instance
+    if (internal::IsPyUuid(obj)) {
+      ARROW_RETURN_NOT_OK(view.ParseUuid(obj));
+    } else {
+      ARROW_RETURN_NOT_OK(view.ParseString(obj));
+    }
     if (view.size != type->byte_width()) {
       std::stringstream ss;
       ss << "expected to be length " << type->byte_width() << " was " << view.size;
@@ -1268,16 +1274,24 @@ Result<std::shared_ptr<ChunkedArray>> ConvertPySequence(PyObject* obj, PyObject*
   // In some cases, type inference may be "loose", like strings. If the user
   // passed pa.string(), then we will error if we encounter any non-UTF8
   // value. If not, then we will allow the result to be a BinaryArray
+  std::shared_ptr<DataType> extension_type;
   if (options.type == nullptr) {
     ARROW_ASSIGN_OR_RAISE(options.type, InferArrowType(seq, mask, options.from_pandas));
     options.strict = false;
+    // If type inference returned an extension type, convert using
+    // the storage type and then wrap the result as an extension array
+    if (options.type->id() == Type::EXTENSION) {
+      extension_type = options.type;
+      options.type = checked_cast<const ExtensionType&>(*options.type).storage_type();
+    }
   } else {
     options.strict = true;
   }
   ARROW_DCHECK_GE(size, 0);
 
   ARROW_ASSIGN_OR_RAISE(auto converter, (MakeConverter<PyConverter, PyConverterTrait>(
                                             options.type, options, pool)));
+  std::shared_ptr<ChunkedArray> result;
   if (converter->may_overflow()) {
     // The converter hierarchy contains binary- or list-like builders which can overflow
     // depending on the input values. Wrap the converter with a chunker which detects
@@ -1288,7 +1302,7 @@ Result<std::shared_ptr<ChunkedArray>> ConvertPySequence(PyObject* obj, PyObject*
     } else {
       RETURN_NOT_OK(chunked_converter->Extend(seq, size));
     }
-    return chunked_converter->ToChunkedArray();
+    ARROW_ASSIGN_OR_RAISE(result, chunked_converter->ToChunkedArray());
   } else {
     // If the converter can't overflow spare the capacity error checking on the hot-path,
     // this improves the performance roughly by ~10% for primitive types.
@@ -1297,8 +1311,13 @@ Result<std::shared_ptr<ChunkedArray>> ConvertPySequence(PyObject* obj, PyObject*
     } else {
       RETURN_NOT_OK(converter->Extend(seq, size));
     }
-    return converter->ToChunkedArray();
+    ARROW_ASSIGN_OR_RAISE(result, converter->ToChunkedArray());
+  }
+  // If we inferred an extension type, wrap as an extension array
+  if (extension_type != nullptr) {
+    return ExtensionType::WrapArray(extension_type, result);
   }
+  return result;
 }
 
 } // namespace py
