This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Are nested masked arrays a valid type? #218

Open
nsmith- opened this issue Nov 27, 2019 · 1 comment


@nsmith-
Member

nsmith- commented Nov 27, 2019

I managed to end up with something like

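import awkward as ak  # Awkward Array 0.x
import numpy as np

# Three subarrays (lengths 1, 0, 3) whose content is a masked array
a = ak.JaggedArray.fromcounts(
    np.array([1, 0, 3]),
    ak.MaskedArray(
        np.array([True, False, True, False]),
        np.arange(4),
    )
)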

which gives <JaggedArray [[None] [] [1 None 3]] at 0x000111048690>. I then proceeded to select some index inside the array with

af = a[a.argmax()].pad(1, clip=True).flatten()

leaving me with <MaskedArray [None None 3] at 0x00013a865750>. All good so far, but the type is very strange: ArrayType(3, OptionType(OptionType(dtype('int64')))). I don't understand what a nested OptionType means. I can collapse it at least: af[~af.boolmask()].content.content returns array([3]).

@jpivarski
Member

It's valid and it should be equivalent to the union of the two masks (from maskedwhen=True). I had thought there was logic to say that OptionType(OptionType(X)) is an equal type to OptionType(X); I put in a few of these algebraic things, but that's a rabbit hole.

Yeah, it's true:

>>> import awkward, numpy
>>> array = awkward.MaskedArray([False, True, False, True, False, True],
...             awkward.MaskedArray([False, False, False, True, True, True],
...             [1.1, 2.2, 3.3, 4.4, 5.5, 6.6]))

>>> # checkerboard unions with half-and-half
>>> array
<MaskedArray [1.1 None 3.3 None None None] at 0x78d638ac5a90>

>>> # two levels deep
>>> array.type
ArrayType(6, OptionType(OptionType(dtype('float64'))))

>>> # is equivalent to one level deep
>>> array.type == awkward.type.ArrayType(6,
...                   awkward.type.OptionType(numpy.dtype("float64")))
True
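
For readers without the archived library installed, the "union of the two masks" reading can be sketched in plain NumPy. This is only an illustration under assumptions: numpy.ma stands in for a single-level masked container, and True is taken to mean "masked out", as with maskedwhen=True.

```python
import numpy as np

# Same masks and content as in the session above; True means "masked out".
outer = np.array([False, True, False, True, False, True])
inner = np.array([False, False, False, True, True, True])
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5, 6.6])

# An element is missing if either layer masks it, so the two
# nested option layers collapse to one mask via a logical OR.
combined = outer | inner

# numpy.ma is used here only as a stand-in for a one-level masked array;
# tolist() renders masked entries as None.
flat = np.ma.MaskedArray(content, mask=combined)
result = flat.tolist()
print(result)
```

The printed list has values only where both masks are False, which matches the checkerboard-plus-half pattern shown in the session.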

I have to decide how much of that should survive into the new era. One good thing about reimplementation is that stuff that seemed like a good idea at the time but never actually got used goes away. Users won't be encouraged to make their own array structures anymore, so I guess I don't need to police it. I guess you've found that pad needs to be smarter: if it's already looking at a MaskedArray, it should add to its mask, rather than introduce another layer.
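
A rough sketch of what "add to its mask" could look like, using plain NumPy arrays in place of the library's classes. The helper name pad_masked and its signature are hypothetical, not part of awkward, and it handles only a flat array rather than each subarray of a JaggedArray.

```python
import numpy as np

def pad_masked(content, mask, length):
    """Hypothetical helper: pad or clip a flat (content, mask) pair to `length`.

    Padding slots are marked missing by extending the existing mask,
    so no second option layer is introduced. True means "masked out".
    """
    content = np.asarray(content)
    mask = np.asarray(mask, dtype=bool)
    n = min(len(content), length)            # number of original slots kept

    new_content = np.zeros(length, dtype=content.dtype)  # filler for padding
    new_mask = np.ones(length, dtype=bool)                # padding is masked
    new_content[:n] = content[:n]
    new_mask[:n] = mask[:n]                  # keep the original missing values
    return new_content, new_mask

# Padding [1, None] to length 4 yields one mask covering old and new gaps.
padded_content, padded_mask = pad_masked([1, 2], [False, True], 4)
print(padded_content, padded_mask)
```

The result still has a single mask, so its type would stay one option level deep.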

Also seeking opinions: I want to change the name from "MaskedArray" to something else because of how often we use the word "mask" to refer to slicing with a boolean array—a concept that's similar enough but different from what MaskedArray does to cause confusion. "Masked" is what NumPy calls it, though maybe it's a bad thing to use a similar word for not-really-the-same classes (numpy.ma.MaskedArray isn't interchangeable with awkward.MaskedArray: the latter can contain jagged data, for instance). Besides, "masked" describes the how, not the what.

It seems to me that we have two other words for this, "nullable" and "optional." "Nullable" is an SQL term and "optional" or "option" is popular among modern programming languages. Haskell uses "maybe." I'm leaning toward

  • MaskedArray → BoolOptionalArray
  • BitMaskedArray → BitOptionalArray
  • IndexedMaskedArray → IndexedOptionalArray
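
As a side note on the "optional" terminology: Python's standard typing module applies the same algebraic collapse to nested options. This is offered only as an analogy, not as a description of how awkward compares types.

```python
from typing import Optional

# Optional[X] is Union[X, None], and nested unions are flattened,
# so a doubly optional int compares equal to a singly optional int.
nested = Optional[Optional[int]]
flat = Optional[int]

same = nested == flat
print(same)
```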

(I'm not ignoring your other issue, #217; it just looks more difficult at the moment.)
