Categorical does not work with nan #36

cancan101 · 2014-03-18T17:34:51Z

I have a column whose unique values look like:

array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)

I would expect that adding C(col_name) to the formula would create 4 dummy variables (5 values - 1), but in fact it only adds 3.

When I tried to explicitly set the control (reference) level to nan, I got an exception:

C(col_name, Treatment(reference=nan))

PatsyError: specified level nan not found


njsmith · 2014-03-18T17:42:20Z

By default, patsy thinks that 'nan' indicates missing data, and is dropping
those rows from your data rather than treating them like a 5th category.
(If they really are missing then treating them like a 5th category is
pretty statistically suspect, I think...) If this isn't what you want, then
dmatrix and friends take an NA_action= argument, to which you can pass an
NAAction object set up to tell patsy what you really want it to do:
http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.NAAction
(Notice that by default NA_types includes "nan" -- this is what's causing
your problem.)

If you want to just disable missing value handling altogether, that can be
accomplished with something like:
dmatrix(..., NA_action=NAAction(NA_types=[]))
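For readers who want to see the difference, here is a minimal sketch of the two behaviours described above. The column name `kind` and the five-row data are made up for illustration; only `dmatrix`, `NAAction`, and the `NA_action=` / `NA_types=` arguments come from the comment itself.

```python
import numpy as np
from patsy import dmatrix, NAAction

# Hypothetical data mirroring the column from the report. dtype=object keeps
# the float nan as-is instead of coercing it to the string "nan".
data = {
    "kind": np.array(
        [np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"], dtype=object
    )
}

# Default: nan counts as missing, so it is not a level and its row is dropped.
default = dmatrix("C(kind)", data)

# Missing-value handling disabled: nan is kept as an ordinary fifth level.
keep_nan = dmatrix("C(kind)", data, NA_action=NAAction(NA_types=[]))

print(default.shape)   # intercept + 3 dummies, nan row dropped
print(keep_nan.shape)  # intercept + 4 dummies, all rows kept
print(keep_nan.design_info.column_names)
```

Comparing the two shapes shows both effects at once: one fewer row and one fewer dummy column under the default settings.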

Does that help?


Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

cancan101 · 2014-03-18T17:54:11Z

Currently I am using patsy through statsmodels:

from statsmodels.formula.api import ols
model = ols("y ~ x...", data)

so how would I make changes to nan handling?

Also the rows with nan in them are definitely not being dropped.

njsmith · 2014-03-18T18:04:01Z

I don't know -- I just tested what I said against patsy itself, and:

- by default it did in fact both ignore the nan when deciding how many
  levels there were, and then drop that row when building the design matrix
- when I set NA_action like I said, it did include the nan when deciding
  how many levels there were, and did include it correctly in the design
  matrix.

So I guess it's a bug in how statsmodels is calling patsy...?

@jseabold @josef-pkt


jseabold · 2014-03-18T18:08:58Z

Related, I guess: statsmodels/statsmodels#805

I haven't looked at this in a while, and we didn't coordinate well on this in the beginning. We tried to keep missing data handling mostly on our side because we have more than y/X to deal with.

jseabold · 2014-03-18T18:11:39Z

...and patsy didn't have any missing data handling when I wrote that.

njsmith · 2014-03-18T18:20:25Z

Brainstorming:

As a workaround if you want to be in charge of missing data handling you
could just always disable patsy's. But this might make it tricky to handle
categorical variables and stateful transforms right...

Ideal solution might be to move all NA handling into patsy, but to do that
we'd need to add a way to pass parallel vectors through patsy (parallel =
parallel to y/X, things like weights).

If you don't care about eliminating NA values in weights, then you could
let patsy do the missing value handling and then peek at the index on the
returned dataframe to see which rows got eliminated, and throw those out of
the other vectors. I remember I ran into some problem trying to do this
though in my own code and ended up with a hack instead:
https://github.com/rerpy/rerpy/blob/master/rerpy/rerp.py#L339
I don't remember what exactly the problem was, I could probably find some
notes somewhere...
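A rough sketch of the "peek at the index" idea, not the rerpy hack linked above. The data frame, the `weights` vector, and the formula are hypothetical; the sketch assumes `dmatrices` is called with `return_type="dataframe"` so that the outputs carry the input's index.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Hypothetical data: row 0 has a missing category, and `weights` is a
# parallel vector that patsy never sees.
df = pd.DataFrame(
    {
        "y": [1.0, 2.0, 3.0, 4.0, 5.0],
        "kind": pd.Series(
            [np.nan, "FORUM", "ANALYST", "FORUM", "SEMINAR"], dtype=object
        ),
    }
)
weights = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5], index=df.index)

# With return_type="dataframe" the outputs keep the input index, so the
# labels of the surviving rows are visible after patsy drops the NA row.
y, X = dmatrices("y ~ C(kind)", df, return_type="dataframe")

# Throw the same rows out of the parallel vector.
weights_kept = weights.loc[X.index]

print(list(X.index))
print(len(weights_kept) == len(X))
```

Note that this only removes rows patsy decided to drop; missing values inside `weights` itself are not detected this way, which is the caveat raised above.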

Alex: Your best quick workaround might be to swap your nan values for a
string, like "nan" or "--" or whatever that value actually means to you :-).


cancan101 · 2014-03-18T18:30:52Z

Yep, putting this in the formula seems to work:

C(col_name.fillna("N/A"), Treatment(reference="N/A"))
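As a self-contained illustration of that workaround, the sketch below runs the same term through patsy directly rather than through statsmodels. The data frame is made up, and it assumes `col_name` is a pandas Series so that `.fillna` is available inside the formula.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical frame with the five unique values from the original report.
df = pd.DataFrame(
    {
        "col_name": pd.Series(
            [np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"], dtype=object
        )
    }
)

# Replace nan with an ordinary string level, then make it the reference.
formula = 'C(col_name.fillna("N/A"), Treatment(reference="N/A"))'
design = dmatrix(formula, df, return_type="dataframe")

print(design.shape)          # intercept + 4 dummies, no rows dropped
print(design.iloc[0, 1:])    # the "N/A" row is the reference: all dummies 0
```

Because "N/A" is a plain string, patsy no longer treats those rows as missing, so all five rows stay and the remaining four values each get a dummy column.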

josef-pkt · 2014-03-19T02:35:42Z

If there is no explicit missing='drop' when creating the model, then statsmodels doesn't check for nans at all. The nan handling is all patsy in this case, and if there are no extra arrays, then I think there are not, or should not be, any problems.
The model is initialized with whatever endog and exog patsy returns.

If my reading of the statsmodels source is correct:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/formula/formulatools.py#L38
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/model.py#L109

josef-pkt · 2014-03-19T02:36:50Z

Is there a replicable example or test case for this?

jseabold · 2014-03-19T02:49:19Z

There's one at the top of the issue I linked to above. Note that the current behavior on that issue is the opposite of what was causing the problem before and what is causing the issue here. The nan category is dropped in patsy by default now I guess, and we don't do anything to control this.

josef-pkt · 2014-03-19T02:58:35Z

Yes, I understand mostly our problems with statsmodels 805, however, I think in this issue, patsy 36, the missing data handling of statsmodels is not involved at all. So this issue should be all patsy, even if the call goes through statsmodels.

Maybe I'm late and cancan101's solution/workaround already made this clear.

jseabold · 2014-03-19T03:04:51Z

See the second comment above. The issue from our end is that we don't pass any NA handling to patsy under the hood, so we don't have any way to suppress its dropping of NAs in the categoricals. So the issue with #805 is actually resolved, but it's because the defaults in patsy changed / missing data handling was added. We don't allow users to treat NaNs as a category right now. (I'm not convinced we should, though.)

josef-pkt · 2014-03-19T04:31:15Z

Ok, I see, I didn't understand that part.
So the from_formula method needs to hand off some patsy_options to dmatrices?
That might collide with whatever deterministic (not user-influenced) behavior we want to expect from patsy. Users should have the option to turn off patsy's nan checking if they don't want any at all.

jankatins · 2014-08-14T13:37:05Z

Just for reference: in pandas you can now add np.nan as a level:

a = array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)
df[cats] = pd.Categorical(a, levels=a) # works here because a has only unique values

Not sure what patsy makes of that, or how it picks the reference level, though.
