Description
Context: the future string dtype for 3.0 (currently enabled with pd.options.future.infer_string = True) is being formalized in a PDEP in #58551, and one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as missing value sentinel).
As explained in #54792 (comment), we introduced the NaN-variant of the dtype for 3.0 as pd.StringDtype(storage="pyarrow_numpy") because we wanted to reuse the storage keyword but "pyarrow" is already taken (by the dtype using pd.NA introduced in pandas 1.3), and because we couldn't think of a better name at the time. But as also mentioned back then, that is far from a great name.
But as mentioned by @jbrockmendel in #58551 (comment), we don't necessarily need to reuse just the storage keyword, but we could also add new keywords to distinguish the dtype variants.
That got me thinking and listing some possible options here:
- Add an extra keyword that distinguishes the NA sentinel (and with that implicitly the type of missing value semantics):
  - Possible names for pd.StringDtype(storage="python"|"pyarrow", <something>):
    - semantics="numpy" (and the other would then be "nullable" or "arrow" or ..?)
    - na_value=np.nan
    - na_marker=np.nan
    - missing=np.nan
    - nullable=False (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)
  - One drawback here is that I don't think users should ever explicitly write pd.StringDtype(storage="pyarrow", na_value=np.nan), as that is not future proof. But defaulting to na_value=np.nan (to avoid requiring users to specify it) is then not backwards compatible with the current pd.StringDtype(storage="pyarrow").
- Add a new keyword separate from storage to determine the storage/backend that only controls the new variants with NaN.
  - Given we are using storage right now, but speak about "backend" in other places, we could add for example a backend keyword, where StringDtype(storage="python"|"pyarrow") keeps resulting in the dtypes using NA (backwards compatible), while doing StringDtype(backend="python"|"pyarrow") gives you the new dtypes using NaN (and specifying both then obviously errors).
  - Having two mutually exclusive keywords that essentially control the same thing is not great API design, but it does avoid having to specify two keywords (or having the confusing names).
  - One question is which keyword name to use. backend has prior use in the "dtype_backend" terminology. Irv suggested nature below.
- For completeness, we can also still come up with a better storage name than "pyarrow_numpy" and stick to that single existing keyword. Suggestions from the PDEP PR:
  - "pyarrow_nan"
  - "pyarrow_legacy" (I wouldn't go with this one, because for users it is not "legacy" right now, rather it would be the default. It will only become legacy later if we decide on switching to NA later)
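The backwards-compatibility drawback noted under the first option can be illustrated with a small toy model. This is only a sketch in plain Python and does not call pandas: the functions `string_dtype_today` and `string_dtype_nan_default`, the `NA` stand-in object, and the `string[...]` label format are all hypothetical names made up for illustration.

```python
import math

# Stand-in for pd.NA in this toy model (hypothetical, not the pandas object).
NA = object()


def describe(storage, na_value):
    """Return a made-up label naming the storage and the missing-value sentinel."""
    is_nan = isinstance(na_value, float) and math.isnan(na_value)
    return f"string[{storage}, {'NaN' if is_nan else 'NA'}]"


def string_dtype_today(storage="python"):
    """Models the existing behaviour: `storage` alone selects the NA variant."""
    return describe(storage, NA)


def string_dtype_nan_default(storage="python", na_value=float("nan")):
    """Models the first option with `na_value` defaulting to NaN."""
    return describe(storage, na_value)


# The same user-facing call silently changes meaning once NaN is the default.
before = string_dtype_today(storage="pyarrow")
after = string_dtype_nan_default(storage="pyarrow")
print(before)
print(after)
```

Under these assumptions, the identical call with only `storage="pyarrow"` maps to the NA variant in the first function and to the NaN variant in the second, which is the incompatibility described above.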
 
After writing this down, I think my current preference would go to StringDtype(backend="python"|"pyarrow"), as that seems the simplest for most users (it's a bit confusing for those who already explicitly used storage, but most users have never done that)
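To make the preferred option concrete, here is a minimal sketch of how two mutually exclusive keywords could be resolved. It is a toy model in plain Python, not pandas code: `StringDtypeSpec`, `make_string_dtype`, and the `NA` stand-in are hypothetical names, and the fallback to NA-based "python" storage when neither keyword is given is an assumption that the discussion above does not settle.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Stand-in for pd.NA in this toy model (hypothetical, not the pandas object).
NA = object()

ALLOWED = ("python", "pyarrow")


@dataclass(frozen=True)
class StringDtypeSpec:
    """Hypothetical description of one string dtype variant."""
    storage: str      # "python" or "pyarrow"
    na_value: object  # the NA stand-in, or float("nan")


def make_string_dtype(
    storage: Optional[str] = None, backend: Optional[str] = None
) -> StringDtypeSpec:
    """Resolve the two keywords into a single variant description."""
    if storage is not None and backend is not None:
        # The two keywords control the same thing, so passing both is an error.
        raise ValueError("specify either 'storage' or 'backend', not both")
    if backend is not None:
        if backend not in ALLOWED:
            raise ValueError(f"unknown backend: {backend!r}")
        # New keyword: NaN-based variant.
        return StringDtypeSpec(storage=backend, na_value=float("nan"))
    # Assumption: with no keyword at all, fall back to NA-based "python" storage.
    chosen = "python" if storage is None else storage
    if chosen not in ALLOWED:
        raise ValueError(f"unknown storage: {chosen!r}")
    # Existing keyword: NA-based variant (backwards compatible).
    return StringDtypeSpec(storage=chosen, na_value=NA)


nan_variant = make_string_dtype(backend="pyarrow")
na_variant = make_string_dtype(storage="pyarrow")
```

In this sketch, code that already passes `storage` keeps getting the NA variant, while `backend` opts into the NaN variant, at the cost of two keywords that cannot be combined.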