Default string dtype (PDEP-14): naming convention to distinguish the dtype variants #58613
FWIW, there is also a fourth option: not having any keyword for this, and not giving users a way to control this directly.
I guess I was hinting at something similar for the semantics keyword in #58551 (comment), i.e. we don't have that public on the dtype constructor (but as a dtype property), and perhaps control it at the DataFrame (and maybe Series) level, so that a nullable array assigned to a DataFrame would coerce to NumPy semantics.
It seems to me that there are two choices that a user could make, along with options for those choices:
Aside from the numpy 2.0 option, I think we have implementations of all combinations of the storage and missing value options available, and it seems to me that when we implement support for the numpy strings that depend on numpy 2.0, we'd want to support those combinations as well. There are then a few questions to address:
I'd like to suggest the word `nature`:

```python
class StringDtype(StorageExtensionDtype):
    def __init__(
        self,
        nature: Literal["pyarrow", "numpy.object", "numpy.string"] = "pyarrow",
        missing: np.nan | pd.NA = pd.NA,
    ) -> None: ...
```

I'd also like to address question 6: we should use a new nomenclature for the strings that represent dtypes corresponding to strings. Let's NOT use the word …
For the missing value indicators, let's be explicit about whether np.nan or pd.NA is used.
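The suggested `(nature, missing)` parametrization above can be exercised with a small stand-in class. Note this is purely illustrative: the `nature` and `missing` names come from the suggestion in this thread, `ToyStringDtype` is a hypothetical helper, and nothing like this exists in pandas:

```python
from dataclasses import dataclass
from typing import Any

# The three proposed "nature" values from the suggestion above.
VALID_NATURES = {"pyarrow", "numpy.object", "numpy.string"}


@dataclass(frozen=True)
class ToyStringDtype:
    """Stand-in illustrating the proposed (nature, missing) parametrization."""

    nature: str = "pyarrow"
    missing: Any = None  # stands in for pd.NA / np.nan

    def __post_init__(self) -> None:
        if self.nature not in VALID_NATURES:
            raise ValueError(f"unknown nature: {self.nature!r}")


print(ToyStringDtype())                       # defaults: nature='pyarrow'
print(ToyStringDtype(nature="numpy.object"))  # explicit object-backed variant
```

The point of the two-keyword shape is that backing storage and missing value sentinel vary independently, so each combination has an unambiguous spelling.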
Some more suggestions:
This is maybe not explicit about the behavior, in the sense that the nullable string dtypes, for operations that return numeric output, will always return a nullable integer dtype, rather than either int or float dtype depending on the presence of NA values. So this behavior gives a more consistent return type, and using terms such as …
Is "python" an option for the dtype_backend keyword? The keywords are similar enough that it will cause confusion if they don't have matching behavior.
Yah I think there is confusion as to what "nullable" means depending on the writer/context.
I lean towards …
For the PDEP / pandas 3.0, I personally explicitly do not want typical users to make this choice. In each context (pyarrow installed or not), there is just one default string dtype, and that is all that most users should worry about. So while we still need some (public or internal) way to create the different variants explicitly (which is what we are discussing here, so thanks for your comments!), in context of the PDEP I would like to hide that as much as possible. For that reason, I am personally not really a fan of adding an explicit keyword to choose the missing value semantics.
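The "one default per context" idea above can be sketched in a few lines: pick pyarrow-backed storage when pyarrow is importable, and fall back to the python/object-backed variant otherwise. This is illustrative only — pandas' actual default selection lives internally, not in user code:

```python
import importlib.util

import pandas as pd

# Default storage depends on the environment, not on a user choice:
# pyarrow-backed if pyarrow is installed, python/object-backed otherwise.
storage = "pyarrow" if importlib.util.find_spec("pyarrow") else "python"
dtype = pd.StringDtype(storage=storage)
print(dtype.storage)
```

Under this scheme a typical user never spells out the variant; only the environment determines which concrete dtype instance they get.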
No, but currently you also can't choose the "default" dtypes through dtype_backend either.
But at the top of this issue, you wrote:
I agree that typical users should not make that choice, but if we use the …
Big +1 on this. I don't think there's a good way to resolve the ambiguity in a name like that. So I would be fine making users manually specify whatever keywords we decide on to create the pyarrow/python-backed string array with np.nan as the missing value, and having no way to create e.g. a …
+1 on this. There's precedent for `storage` for string arrays, and `na_value` has history elsewhere in the API. I would be against adding a new keyword that clashes with `storage` (or changing `storage`), since it makes for messy handling internally. I also don't think it's worth the churn. I'm less opinionated about `na_value`.
I realize this is for PDEP-14, which we want as a fast mover, but PDEP-13 proposed the following structure for a data type:

```python
class BaseType:

    @property
    def dtype_backend(self) -> Literal["pandas", "numpy", "pyarrow"]:
        """
        Library is responsible for the array implementation
        """
        ...

    @property
    def physical_type(self):
        """
        How does the backend physically implement this logical type? i.e. our
        logical type may be a "string" and we are using pyarrow underneath -
        is it a pa.string(), pa.large_string(), pa.string_view() or something else?
        """
        ...

    @property
    def missing_value_marker(self) -> pd.NA | np.nan:
        """
        Sentinel used to denote missing values
        """
        ...
```

Which may be of interest here too (though feedback so far is that …). It might be useful to separate the idea of the backend / provider (e.g. "pyarrow") from how that backend / provider is actually implementing things (e.g. "string", "large_string", "string_view"). @Dr-Irv I think you are going down that path with your "nature" suggestion, but rather than trying to cram all that metadata into a single string, having two properties might be more future proof.
Yes, the fact that a …
Thanks for the discussion here. I would like to do a next iteration of my proposal, based on:
That leads me to the following table (the first column is how the user can create a dtype, the second column the concrete dtype instance they would get, and the third column the string alias they see in displays and can use as the `dtype` argument):
(1) You get "pyarrow" or "python" depending on pyarrow being installed. Additional notes on the string aliases:
I think the rows annotated with footnote (2) are problematic, because it is a change in behavior for people currently using the existing aliases. Here's another idea: what if we deprecate the top-level namespace specifications for …
Yes, but as discussed on the PDEP, we could still add a deprecation warning for it. I know that doesn't change that it still is a behaviour change (and users will only see the deprecation warning for a short time), but at least we could do it with some warning in advance. Further, my guess is that … So if we are eventually on board with changing the behaviour for … Further, I think quite some users that now do …
And to be clear, I think that is a good idea long-term, but I would personally keep that for when we are ready to do that for all dtypes, instead of now only having a string dtype in such a namespace (while other default dtypes like categorical, datetimetz, etc. still live top-level).
Given what @WillAyd wrote here (#58551 (comment)), I think we need to be careful about this.
But if we did this for all dtypes as part of the change for strings (and deprecated the top-level dtypes), then we accomplish both goals, which includes a better migration path for strings (maybe).
I think many of the posts about the StringDtype are not necessarily about how NA is different from NaN, but just about how cool it is to have an actual string dtype instead of the confusing catch-all object dtype, and (in case it's about the pyarrow variant) about the performance improvements.
Consider this code (using 2.2):

```python
>>> s = pd.Series(["a", "b", "c"], dtype="string[pyarrow_numpy]")
>>> s
0    a
1    b
2    c
dtype: string
>>> s.str.len()
0    1
1    1
2    1
dtype: int64
>>> s2 = pd.Series(["a", "b", "c"], dtype="string")
>>> s2
0    a
1    b
2    c
dtype: string
>>> s2.str.len()
0    1
1    1
2    1
dtype: Int64
>>> s.shift(1)
0    NaN
1      a
2      b
dtype: string
>>> s2.shift(1)
0    <NA>
1       a
2       b
dtype: string
>>> s2.shift(1).str.len()
0    <NA>
1       1
2       1
dtype: Int64
>>> s.shift(1).str.len()
0    NaN
1    1.0
2    1.0
dtype: float64
```

If we adopt your proposal, then if you have … I'm not saying this is a reason to not adopt this proposal, but just wanted to point out this behavior.
For … Also, for …
I think this is going to be really problematic. While I like the spirit of having one …:

```python
>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=pd.NA))
>>> ser.str.len()
```

What data type gets returned here if arrow is installed? "pyarrow[int64]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow, does the same code now return "Int64"? What if they changed the na_value to:

```python
>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype())
```

Whether or not we have pyarrow installed, I assume this returns a float (?). I am definitely not a fan of our current "string[pyarrow_numpy]", but to its credit it is at least consistent in the data types it returns.

With PDEP-13 my expectation would be that, regardless of what is installed:

```python
>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=<na_value>))
>>> ser.str.len()
0    1
1    1
2    <na_value>
dtype: pd.Int64()
```

so I think that abstracts a lot of these issues. But I don't think only partially implementing that for a StringDtype is going to get us to a better place.
@WillAyd no, this is not correct. Or at least, that's not how the existing … If you use … (it's only when you use …
Yes, you will get a float, just as you get a float result right now with object dtype (nothing changes there compared to the status quo). And yes this is annoying, but that is not limited to strings: this is annoying whenever some operation casts ints to floats because of the introduction of missing values, given that our default integer dtypes do not support missing values. But this is not really about the naming anymore but about behaviour, so let's continue those discussions in the PDEP PR #58551 itself.
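The int-to-float cast mentioned here is easy to see in isolation with default (numpy-backed) dtypes versus the NA-based nullable `Int64` dtype:

```python
import pandas as pd

# Default numpy-backed integers: introducing a missing value (via shift)
# forces a cast to float64, because int64 cannot hold NaN.
s = pd.Series([1, 2, 3])
print(s.shift(1).dtype)     # float64

# The NA-based nullable integer dtype keeps its integer type instead,
# representing the missing value as pd.NA.
s_na = pd.Series([1, 2, 3], dtype="Int64")
print(s_na.shift(1).dtype)  # Int64
```

This is the same mechanism behind the `str.len()` return types shown earlier in the thread: the NaN-variant follows the numpy casting rules, the NA-variant stays integer.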
As mentioned on the call, if people think this would help, I am certainly fine with adding that alias. It is 1) consistent with the current string aliases of the other "nullable" (using pd.NA) Int/Float/Boolean dtypes, and 2) it makes the code edits to keep using the NA-variant of StringDtype a bit smaller (i.e. a user can change …). I think the main downside is that it makes the string aliases even more confusing. And also, the existing …
That sounds good to me. Currently that does not happen automatically: the default was simply … But I think it certainly makes sense to change this, and let it follow the "choose the default depending on whether pyarrow is installed" logic for the NaN-variant of the dtype (if the user did not set that explicitly).
Ah OK, I thought this PDEP wanted to re-use the functionality of … I am on board with the default string type continuing to return extension types regardless of what is installed (that aligns with the PDEP-13 proposal), although my remaining hang-up would just be not changing the nullability semantics for that StringDtype by default.
I guess just to be clear, both …
No, the … And in both cases, I say "variants" because this is true regardless of whether it is using pyarrow under the hood (…). So this means that the return type of operations does not depend on some library being installed or not, and only depends on the user explicitly opting in to NA-dtypes (by default, a user only gets default data types using NaN semantics). This last item is what I tried to clarify above, because in your comment you said "What data type gets returned here if arrow is installed? "pyarrow[int64]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow does the same code now return "Int64"?", and I wanted to clarify that the return type never depends on whether it uses pyarrow under the hood or not. (But again, this is more related to the actual proposed behaviour of the PDEP than to the naming scheme, so let's continue this discussion on the PDEP PR itself. I just pushed a commit to expand on the missing value semantics in 54a43b3.)
Ah... sorry, I was just replying to the GitHub notification. Thought this was the PDEP itself all this time... will try to port over that conversation.
In the PDEP, in the end we went with keeping the `storage` keyword and adding a separate `na_value` keyword to distinguish the variants. As this has been incorporated in the PDEP, and the PDEP was accepted, closing this issue.
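For reference, a minimal sketch of the two keywords involved. The `storage` keyword has existed since pandas 1.3 and is used below; the `na_value` keyword requires a pandas version implementing PDEP-14, so it is only shown in a comment rather than executed:

```python
import pandas as pd

# NA-variant via the long-standing storage keyword: python/object-backed
# string array, with a missing value in the second position.
s = pd.Series(["a", None], dtype=pd.StringDtype(storage="python"))
print(s.dtype.storage)  # python
print(s.isna().sum())   # 1

# On pandas versions implementing PDEP-14, the NaN-variant is spelled
# with the accepted second keyword (numpy imported as np):
#   pd.StringDtype(storage="python", na_value=np.nan)
```

Separating "which array backs the data" (`storage`) from "which sentinel marks missing values" (`na_value`) is exactly the two-axis distinction the thread converged on.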
Context: the future string dtype for 3.0 (currently enabled with `pd.options.future.infer_string = True`) is being formalized in a PDEP in #58551, and one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as missing value sentinel).

As explained in #54792 (comment), we introduced the NaN-variant of the dtype for 3.0 as `pd.StringDtype(storage="pyarrow_numpy")`, because we wanted to reuse the `storage` keyword but `"pyarrow"` is already taken (by the dtype using `pd.NA` introduced in pandas 1.3), and because we couldn't think of a better name at the time. But as also mentioned back then, that is far from a great name.

But as mentioned by @jbrockmendel in #58551 (comment), we don't necessarily need to reuse just the `storage` keyword, but we could also add new keywords to distinguish the dtype variants. That got me thinking and listing some possible options here:

- `pd.StringDtype(storage="python"|"pyarrow", <something>)`, where `<something>` could be:
  - `semantics="numpy"` (and the other would then be "nullable" or "arrow" or ..?)
  - `na_value=np.nan`
  - `na_marker=np.nan`
  - `missing=np.nan`
  - `nullable=False` (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)

  We don't want to require users to always spell out `pd.StringDtype(storage="pyarrow", na_value=np.nan)`, as that is not future proof. But defaulting to `na_value=np.nan` (to avoid requiring to specify it) is then not backwards compatible with the current `pd.StringDtype(storage="pyarrow")`.

- Keep `storage` as-is, and add a new keyword to determine the storage/backend that only controls the new variants with NaN. Given that we use `storage` right now but speak about "backend" in other places, we could add for example a `backend` keyword, where `StringDtype(storage="python"|"pyarrow")` keeps resulting in the dtypes using NA (backwards compatible), while doing `StringDtype(backend="python"|"pyarrow")` gives you the new dtypes using NaN (and specifying both then obviously errors). `backend` has prior use in the "dtype_backend" terminology. Irv suggested `nature` below.

- Use a better `storage` name than `"pyarrow_numpy"` and stick to that single existing keyword. Suggestions from the PDEP PR: `"pyarrow_nan"`, `"pyarrow_legacy"` (I wouldn't go with this one, because for users it is not "legacy" right now, rather it would be the default. It will only become legacy later if we decide on switching to NA later).

After writing this down, I think my current preference would go to `StringDtype(backend="python"|"pyarrow")`, as that seems the simplest for most users (it's a bit confusing for those who already explicitly used `storage`, but most users have never done that).
, but most users have never done that)The text was updated successfully, but these errors were encountered: