Default string dtype (PDEP-14): naming convention to distinguish the dtype variants #58613
FWIW, there is also a fourth option: not having any keyword for this, and not giving users a way to control this directly.
I guess I was hinting at something similar for the semantics keyword in #58551 (comment), i.e. we don't have that public on the dtype constructor (but as a dtype property), and perhaps control it at the DataFrame (and maybe Series) level, so that a nullable array assigned to a DataFrame would coerce to NumPy semantics.
It seems to me that there are two choices that a user could make, along with options for those choices:
Aside from the numpy 2.0 option, I think we have implementations of all combinations of the storage and missing value options available, and it seems to me that when we implement support for the numpy strings that depend on numpy 2.0, we'd want to support those combinations as well. There are then a few questions to address:
I'd like to suggest the word `nature`:

```python
class StringDtype(StorageExtensionDtype):
    def __init__(
        self,
        nature: Literal["pyarrow", "numpy.object", "numpy.string"] = "pyarrow",
        missing: np.nan | pd.NA = pd.NA,
    ) -> None: ...
```

I'd also like to address question 6: we should use a new nomenclature for the strings that represent dtypes corresponding to strings. Let's NOT use the word …
For the missing value indicators, let's be explicit about whether np.nan or pd.NA is used.
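The suggested `(nature, missing)` parametrization above can be exercised with a small stand-in class. Note this is purely illustrative: the `nature` and `missing` names come from the suggestion in this thread, `ToyStringDtype` is a hypothetical helper, and nothing like this exists in pandas:

```python
from dataclasses import dataclass
from typing import Any

# The three proposed "nature" values from the suggestion above.
VALID_NATURES = {"pyarrow", "numpy.object", "numpy.string"}


@dataclass(frozen=True)
class ToyStringDtype:
    """Stand-in illustrating the proposed (nature, missing) parametrization."""

    nature: str = "pyarrow"
    missing: Any = None  # stands in for pd.NA / np.nan

    def __post_init__(self) -> None:
        if self.nature not in VALID_NATURES:
            raise ValueError(f"unknown nature: {self.nature!r}")


print(ToyStringDtype())                       # defaults: nature='pyarrow'
print(ToyStringDtype(nature="numpy.object"))  # explicit object-backed variant
```

The point of the two-keyword shape is that backing storage and missing value sentinel vary independently, so each combination has an unambiguous spelling.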
Some more suggestions:
This is maybe not explicit about the behavior, in the sense that the nullable string dtypes, for operations that return numeric output, will always return a nullable integer dtype, rather than either int or float dtype depending on the presence of NA values. So this behavior gives a more consistent return type, and using terms such as …
Is "python" an option for the dtype_backend keyword? The keywords are similar enough that it will cause confusion if they don't have matching behavior.
Yah I think there is confusion as to what "nullable" means depending on the writer/context.
I lean towards …
For the PDEP / pandas 3.0, I personally explicitly do not want typical users to make this choice. In each context (pyarrow installed or not), there is just one default string dtype, and that is all that most users should worry about. So while we still need some (public or internal) way to create the different variants explicitly (which is what we are discussing here, so thanks for your comments!), in context of the PDEP I would like to hide that as much as possible. For that reason, I am personally not really a fan of adding an explicit keyword to choose the missing value semantics.
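The "one default per context" idea above can be sketched in a few lines: pick pyarrow-backed storage when pyarrow is importable, and fall back to the python/object-backed variant otherwise. This is illustrative only — pandas' actual default selection lives internally, not in user code:

```python
import importlib.util

import pandas as pd

# Default storage depends on the environment, not on a user choice:
# pyarrow-backed if pyarrow is installed, python/object-backed otherwise.
storage = "pyarrow" if importlib.util.find_spec("pyarrow") else "python"
dtype = pd.StringDtype(storage=storage)
print(dtype.storage)
```

Under this scheme a typical user never spells out the variant; only the environment determines which concrete dtype instance they get.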
No, but currently you also can't choose the "default" dtypes through dtype_backend either.
But at the top of this issue, you wrote:
I agree that typical users should not make that choice, but if we use the …
Big +1 on this. I don't think there's a good way to resolve the ambiguity in a name like that. So I would be fine making users manually specify whatever keywords we decide on to create the pyarrow/python-backed string array with np.nan as the missing value, and having no way to create e.g. a …
+1 on this. There's precedent for `storage` for string arrays, and `na_value` has history elsewhere in the API. I would be against adding a new keyword that clashes with `storage` (or changing `storage`), since it makes for messy handling internally. I also don't think it's worth the churn. I'm less opinionated about `na_value`.
I realize this is for PDEP-14, which we want as a fast mover, but PDEP-13 proposed the following structure for a data type:

```python
class BaseType:

    @property
    def dtype_backend(self) -> Literal["pandas", "numpy", "pyarrow"]:
        """
        Library is responsible for the array implementation
        """
        ...

    @property
    def physical_type(self):
        """
        How does the backend physically implement this logical type? i.e. our
        logical type may be a "string" and we are using pyarrow underneath -
        is it a pa.string(), pa.large_string(), pa.string_view() or something else?
        """
        ...

    @property
    def missing_value_marker(self) -> pd.NA | np.nan:
        """
        Sentinel used to denote missing values
        """
        ...
```

Which may be of interest here too (though feedback so far is that …). It might be useful to separate the idea of the backend / provider (e.g. "pyarrow") from how that backend / provider is actually implementing things (e.g. "string", "large_string", "string_view"). @Dr-Irv I think you are going down that path with your "nature" suggestion, but rather than trying to cram all that metadata into a single string, having two properties might be more future proof.
Yes, the fact that a …
Thanks for the discussion here. I would like to do a next iteration of my proposal, based on:
That leads me to the following table (the first column is how the user can create a dtype, the second column the concrete dtype instance they would get, and the third column the string alias they see in displays and can use as the `dtype` argument):
(1) You get "pyarrow" or "python" depending on pyarrow being installed. Additional notes on the string aliases:
I think the rows annotated with footnote (2) are problematic, because it is a change in behavior for people currently using the existing aliases. Here's another idea: what if we deprecate the top-level namespace specifications for …
Yes, but as discussed on the PDEP, we could still add a deprecation warning for it. I know that doesn't change that it still is a behaviour change (and users will only see the deprecation warning for a short time), but at least we could do it with some warning in advance. Further, my guess is that … So if we are eventually on board with changing the behaviour for … Further, I think quite some users that now do …
And to be clear, I think that is a good idea long-term, but I would personally keep that for when we are ready to do that for all dtypes, instead of now only having a string dtype in such a namespace (while other default dtypes like categorical, datetimetz, etc. still live top-level).
Given what @WillAyd wrote here (#58551 (comment)), I think we need to be careful about this.
But if we did this for all dtypes as part of the change for strings (and deprecated the top-level dtypes), then we accomplish both goals, which includes a better migration path for strings (maybe).
I think many of the posts about the StringDtype are not necessarily about how NA is different from NaN, but just about how cool it is to have an actual string dtype instead of the confusing catch-all object dtype, and (in case it's about the pyarrow variant) about the performance improvements.
Consider this code (using 2.2):

```python
>>> s = pd.Series(["a", "b", "c"], dtype="string[pyarrow_numpy]")
>>> s
0    a
1    b
2    c
dtype: string
>>> s.str.len()
0    1
1    1
2    1
dtype: int64
>>> s2 = pd.Series(["a", "b", "c"], dtype="string")
>>> s2
0    a
1    b
2    c
dtype: string
>>> s2.str.len()
0    1
1    1
2    1
dtype: Int64
>>> s.shift(1)
0    NaN
1      a
2      b
dtype: string
>>> s2.shift(1)
0    <NA>
1       a
2       b
dtype: string
>>> s2.shift(1).str.len()
0    <NA>
1       1
2       1
dtype: Int64
>>> s.shift(1).str.len()
0    NaN
1    1.0
2    1.0
dtype: float64
```

If we adopt your proposal, then if you have … I'm not saying this is a reason to not adopt this proposal, but just wanted to point out this behavior.
For … Also, for …
I think this is going to be really problematic. While I like the spirit of having one …:

```python
>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=pd.NA))
>>> ser.str.len()
```

What data type gets returned here if arrow is installed? "pyarrow[int64]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow, does the same code now return "Int64"? What if they changed the na_value to:

```python
>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype())
```

Whether or not we have pyarrow installed, I assume this returns a float (?). I am definitely not a fan of our current "string[pyarrow_numpy]", but to its credit it is at least consistent in the data types it returns.

With PDEP-13 my expectation would be that, regardless of what is installed:

```python
>>> ser = pd.Series(["x", "y", None], dtype=pd.StringDtype(na_value=<na_value>))
>>> ser.str.len()
0    1
1    1
2    <na_value>
dtype: pd.Int64()
```

so I think that abstracts a lot of these issues. But I don't think only partially implementing that for a StringDtype is going to get us to a better place.
@WillAyd no, this is not correct. Or at least, that's not how the existing … If you use … (it's only when you use …
Yes, you will get a float, just as you get a float result right now with object dtype (nothing changes there compared to the status quo). And yes this is annoying, but that is not limited to strings: this is annoying whenever some operation casts ints to floats because of the introduction of missing values, given that our default integer dtypes do not support missing values. But this is not really about the naming anymore but about behaviour, so let's continue those discussions in the PDEP PR #58551 itself.
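The int-to-float cast mentioned here is easy to see in isolation with default (numpy-backed) dtypes versus the NA-based nullable `Int64` dtype:

```python
import pandas as pd

# Default numpy-backed integers: introducing a missing value (via shift)
# forces a cast to float64, because int64 cannot hold NaN.
s = pd.Series([1, 2, 3])
print(s.shift(1).dtype)     # float64

# The NA-based nullable integer dtype keeps its integer type instead,
# representing the missing value as pd.NA.
s_na = pd.Series([1, 2, 3], dtype="Int64")
print(s_na.shift(1).dtype)  # Int64
```

This is the same mechanism behind the `str.len()` return types shown earlier in the thread: the NaN-variant follows the numpy casting rules, the NA-variant stays integer.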
As mentioned on the call, if people think this would help, I am certainly fine with adding that alias. It is 1) consistent with the current string aliases of the other "nullable" (using pd.NA) Int/Float/Boolean dtypes, and 2) it makes the code edits to keep using the NA-variant of StringDtype a bit smaller (i.e. a user can change …). I think the main downside is that it makes the string aliases even more confusing. And also, the existing …
That sounds good to me. Currently that does not happen automatically: the default was simply … But I think it certainly makes sense to change this, and let it follow the "choose the default depending on whether pyarrow is installed" logic for the NaN-variant of the dtype (if the user did not set that explicitly).
Ah OK, I thought this PDEP wanted to re-use the functionality of … I am on board with the default string type continuing to return extension types regardless of what is installed (that aligns with the PDEP-13 proposal), although my remaining hang-up would just be not changing the nullability semantics for that StringDtype by default.
I guess just to be clear, both …
No, the … And in both cases, I say "variants" because this is true regardless of whether it is using pyarrow under the hood (…). So this means that the return type of operations does not depend on some library being installed or not, and only depends on the user explicitly opting in to NA-dtypes (by default, a user only gets default data types using NaN semantics). This last item is what I tried to clarify above, because in your comment you said "What data type gets returned here if arrow is installed? "pyarrow[int64]" maybe makes sense, but then if the user for whatever reason uninstalls pyarrow does the same code now return "Int64"?", and I wanted to clarify that the return type never depends on whether it uses pyarrow under the hood or not. (But again, this is more related to the actual proposed behaviour of the PDEP than to the naming scheme, so let's continue this discussion on the PDEP PR itself. I just pushed a commit to expand on the missing value semantics in 54a43b3.)
Ah... sorry, I was just replying to the GitHub notification. Thought this was the PDEP itself all this time... will try to port over that conversation.
In the PDEP, in the end we went with keeping the `storage` keyword and adding a separate `na_value` keyword to distinguish the variants. As this has been incorporated in the PDEP, and the PDEP was accepted, closing this issue.
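For reference, a minimal sketch of the two keywords involved. The `storage` keyword has existed since pandas 1.3 and is used below; the `na_value` keyword requires a pandas version implementing PDEP-14, so it is only shown in a comment rather than executed:

```python
import pandas as pd

# NA-variant via the long-standing storage keyword: python/object-backed
# string array, with a missing value in the second position.
s = pd.Series(["a", None], dtype=pd.StringDtype(storage="python"))
print(s.dtype.storage)  # python
print(s.isna().sum())   # 1

# On pandas versions implementing PDEP-14, the NaN-variant is spelled
# with the accepted second keyword (numpy imported as np):
#   pd.StringDtype(storage="python", na_value=np.nan)
```

Separating "which array backs the data" (`storage`) from "which sentinel marks missing values" (`na_value`) is exactly the two-axis distinction the thread converged on.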
Context: the future string dtype for 3.0 (currently enabled with `pd.options.future.infer_string = True`) is being formalized in a PDEP in #58551, and one of the discussion points is how to name the different variants of the StringDtype that will exist with the PDEP (whether using pyarrow or numpy object dtype for the data under the hood, and whether using NA or NaN as missing value sentinel).

As explained in #54792 (comment), we introduced the NaN-variant of the dtype for 3.0 as `pd.StringDtype(storage="pyarrow_numpy")`, because we wanted to reuse the `storage` keyword but `"pyarrow"` is already taken (by the dtype using `pd.NA` introduced in pandas 1.3), and because we couldn't think of a better name at the time. But as also mentioned back then, that is far from a great name.

But as mentioned by @jbrockmendel in #58551 (comment), we don't necessarily need to reuse just the `storage` keyword, but we could also add new keywords to distinguish the dtype variants. That got me thinking and listing some possible options here:

- `pd.StringDtype(storage="python"|"pyarrow", <something>)`, where `<something>` could be:
  - `semantics="numpy"` (and the other would then be "nullable" or "arrow" or ..?)
  - `na_value=np.nan`
  - `na_marker=np.nan`
  - `missing=np.nan`
  - `nullable=False` (although we have used "nullable dtypes" in the past to denote the dtypes using NA, it's also confusing here because the False variant does support missing values as well)

  We don't want to require users to always spell out `pd.StringDtype(storage="pyarrow", na_value=np.nan)`, as that is not future proof. But defaulting to `na_value=np.nan` (to avoid requiring to specify it) is then not backwards compatible with the current `pd.StringDtype(storage="pyarrow")`.

- Keep `storage` as-is, and add a new keyword to determine the storage/backend that only controls the new variants with NaN. Given that we use `storage` right now but speak about "backend" in other places, we could add for example a `backend` keyword, where `StringDtype(storage="python"|"pyarrow")` keeps resulting in the dtypes using NA (backwards compatible), while doing `StringDtype(backend="python"|"pyarrow")` gives you the new dtypes using NaN (and specifying both then obviously errors). `backend` has prior use in the "dtype_backend" terminology. Irv suggested `nature` below.

- Use a better `storage` name than `"pyarrow_numpy"` and stick to that single existing keyword. Suggestions from the PDEP PR: `"pyarrow_nan"`, `"pyarrow_legacy"` (I wouldn't go with this one, because for users it is not "legacy" right now, rather it would be the default. It will only become legacy later if we decide on switching to NA later).

After writing this down, I think my current preference would go to `StringDtype(backend="python"|"pyarrow")`, as that seems the simplest for most users (it's a bit confusing for those who already explicitly used `storage`, but most users have never done that).
, but most users have never done that)The text was updated successfully, but these errors were encountered: