awkward.isnan #227
Comments
I think this is a good idea, the only thing I'm on the fence about is the name. I already started down the path of imitating Pandas with If a user is thinking of floating-point Maybe |
pandas has |
Yeah. They have so many synonyms because they're courting a userbase from R, which has a distinction between na and null. But also, I'm arguing that the distinction matters less if all of the values in question are some kind of number (or string, but more likely dictionary-encoded category). I'm thinking we might want to make a sharper distinction between a floating-point |
Hey... I am interested in solving this issue. Can I be assigned to it? |
I certainly wouldn't be opposed. Be aware, though, that the primary development branch of Awkward is in the scikit-hep/awkward-1.0 repo. Contributions to the 0.x repo are welcome and affect existing users, but would have to be reimplemented in 1.0. Therefore, what you do here could be seen as a "trial run" for the feature, getting it in front of users to see how useful it is, and we might need to change some names in 1.0. Also, note that the direction Pandas is headed is to more fully distinguish between "NA" (for any data type) and "NaN" (for floating point). This seems to be their biggest change when they released Pandas 1.0. If they've renamed the functions for that, we should follow their new naming scheme. |
I would fully support this. An example of a similar global function that applies to MaskedArrays is
a = ak.fromiter([None, [1, 2, None, 3], []])
a.isna(axis=None).tolist() == [True, False, False]
a.isna(axis=0).tolist() == [True, False, False]
a.isna(axis=1).tolist() == [None, [False, False, True, False], []]
There may be reason to actually consider |
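To make the proposed axis semantics concrete, here is a minimal pure-Python sketch on nested lists. The helper name isna_axis is hypothetical and is not part of Awkward; it assumes, as in the example above, that axis=None behaves like axis=0 and that a missing list stays None at deeper axes.

```python
def isna_axis(data, axis=None):
    """Hypothetical stand-in for the proposed isna, operating on nested lists."""
    if axis is None or axis == 0:
        # Only ask whether each top-level entry is missing.
        return [item is None for item in data]
    # Deeper axes: a missing list has no entries to inspect, so it stays None.
    return [None if item is None else isna_axis(item, axis - 1) for item in data]


a = [None, [1, 2, None, 3], []]
print(isna_axis(a, axis=None))  # [True, False, False]
print(isna_axis(a, axis=0))     # [True, False, False]
print(isna_axis(a, axis=1))     # [None, [False, False, True, False], []]
```

This only mirrors the three results listed in the comment; how a real implementation should treat mixed depths is left open here.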
Probably |
Unless a compelling argument can be given, I'd rather not consider There are two schools of thought on this: Google's formats (Protobuf, Flatbuffers, and Dremel, which became Parquet) don't make a distinction between Other formats (Thrift, Avro, and Arrow) take the "programming language" approach that distinguishes |
So, is this functionality to be implemented for Awkward Arrays in general or just for Jagged Arrays? I was thinking that we could just flatten the array to find out the 'NaN' values and reconstruct them from scratch by the start and stop counts, with the content being True/False accordingly |
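For reference, the flatten-and-reconstruct idea could look roughly like the sketch below. It uses plain Python lists in place of a JaggedArray's content, starts, and stops, so the function name isnan_jagged and the list-based layout are assumptions for illustration, not Awkward's API. It covers only the jagged case.

```python
import math


def isnan_jagged(content, starts, stops):
    """Hypothetical sketch: NaN mask for a jagged layout given as flat lists."""
    # Step 1: evaluate the check once on the flat content.
    flat_mask = [math.isnan(x) for x in content]
    # Step 2: rebuild the nested structure from the start/stop offsets.
    return [flat_mask[start:stop] for start, stop in zip(starts, stops)]


content = [1.1, float("nan"), 3.3, 4.4, float("nan")]
starts = [0, 3, 3]
stops = [3, 3, 5]
print(isnan_jagged(content, starts, stops))
# [[False, True, False], [], [False, True]]
```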
All operations should apply to all Awkward Array types. One of the most persistent issues has been when somebody's code is running fine on JaggedArrays, then for some reason they have MaskedArray of JaggedArray, or maybe ChunkedArray of JaggedArray, or something similar, possibly without realizing it, and the script no longer works. In this environment, "preserving the abstraction" means only requiring the users to think about the logical meaning of their data, not the specific structures built up to represent it. |
So, should I implement this method in the base file of awkward array or implement it differently for each of the types of Awkward Arrays? Also, can I go ahead and use my above proposed solution or should I think of another one? (i.e. flattening and reconstruction) |
It should probably be separately implemented for each of the array classes (i.e. as a method on each) because you'll probably have to do something different in some cases. In base.py, there's a superclass for all array types that have one Meanwhile,
Therefore, this function won't be unwrapping awkward.fromiter([[1.1, 2.2, None, 3.3], [], [4.4, None, 5.5]]).isna() would return [False, False, False] because none of those three lists are missing. (That is, A function that descends all the way, giving [[False, False, True, False], [], [False, True, False]] in the above example, should have a different name or be a non-default parameterized version of |
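The difference between the shallow default and a version that descends all the way can be sketched on nested lists as follows. The name isna_nested and its deep flag are hypothetical, chosen only to illustrate a non-default parameter; they are not an existing Awkward function.

```python
def isna_nested(data, deep=False):
    """Hypothetical sketch: shallow check by default, full descent if deep=True."""
    if not deep:
        # Default: report only whether each top-level entry is missing.
        return [item is None for item in data]
    # deep=True: recurse into sublists, test everything else against None.
    return [
        isna_nested(item, deep=True) if isinstance(item, list) else item is None
        for item in data
    ]


data = [[1.1, 2.2, None, 3.3], [], [4.4, None, 5.5]]
print(isna_nested(data))             # [False, False, False]
print(isna_nested(data, deep=True))  # [[False, False, True, False], [], [False, True, False]]
```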
Currently, there is no convenient way for checking if a given array has masked entries if it is not a MaskedArray type, i.e. the mask methods are not universal. For float arrays, a workaround is numpy.isnan(array.fillna(numpy.nan)).
I would propose array.isnan(axis=-1) or awkward.isnan(array, axis=-1) signatures, where the axis chooses the depth of the structure at which to evaluate the masking status, much like the array.flatten(axis=-1) function.
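A small pure-Python sketch of the fill-then-test workaround, using a flat list with None for masked entries. The fillna helper here is a hypothetical stand-in written for this example, not the Awkward method itself.

```python
import math


def fillna(values, fill):
    """Hypothetical stand-in: replace masked (None) entries with a fill value."""
    return [fill if v is None else v for v in values]


values = [1.5, None, float("nan"), 4.0]
mask = [math.isnan(v) for v in fillna(values, math.nan)]
print(mask)  # [False, True, True, False]
```

Note that the masked entry and the genuine NaN both come out as True, so this workaround cannot tell a missing value from a floating-point NaN.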