Categorical does not work with nan #36

Open
cancan101 opened this issue Mar 18, 2014 · 14 comments

Comments

@cancan101

I have a column whose unique values look like:

array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)

I would expect that adding C(col_name) to the formula would create 4 dummy variables (5 values - 1), but in fact it only adds 3.

When I tried to explicitly set the control (reference) level to nan, I got an exception:

C(col_name, Treatment(reference=nan)) 
PatsyError: specified level nan not found
@njsmith
Member

njsmith commented Mar 18, 2014

By default, patsy thinks that 'nan' indicates missing data, and is dropping
those rows from your data rather than treating them like a 5th category.
(If they really are missing then treating them like a 5th category is
pretty statistically suspect, I think...) If this isn't what you want, then
dmatrix and friends take an NA_action= argument, to which you can pass an
NAAction object set up to tell patsy what you really want it to do:
http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.NAAction
(Notice that by default NA_types includes "NaN" -- this is what's causing
your problem.)

If you want to just disable missing value handling altogether, that can be
accomplished with something like:
dmatrix(..., NA_action=NAAction(NA_types=[]))
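
The difference between the default and the disabled NA handling can be sketched as follows. This is a minimal illustration, not a tested recipe for every patsy version: the `data` dict and the variable names are hypothetical, and it assumes that `NA_types` accepts the strings "None" and "NaN" (so an empty list turns off both checks).

```python
import numpy as np
from patsy import dmatrix, NAAction

# Hypothetical data mirroring the column from the report.
data = {
    "col_name": np.array(
        [np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"], dtype=object
    )
}

# Default: NaN counts as missing, so that row is dropped and only
# four levels are seen (intercept + 3 dummy columns).
default_dm = dmatrix("C(col_name)", data)

# NA handling disabled: NaN is kept as an ordinary fifth level
# (intercept + 4 dummy columns) and no rows are dropped.
keep_nan_dm = dmatrix("C(col_name)", data, NA_action=NAAction(NA_types=[]))

print(default_dm.shape)
print(keep_nan_dm.shape)
```

Comparing the two shapes shows whether the NaN row was dropped or counted as a level.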

Does that help?


Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@cancan101
Author

Currently I am using patsy through statsmodels:

from statsmodels.formula.api import ols
model = ols("y ~ x...", data)

so how would I make changes to nan handling?

Also the rows with nan in them are definitely not being dropped.

@njsmith
Member

njsmith commented Mar 18, 2014

I don't know -- I just tested what I said against patsy itself, and:

  • by default it did in fact both ignore the nan when deciding how many
    levels there were, and then drop that row when building the design matrix
  • and if I set NA_action like I said, then it did include the nan when
    deciding how many levels there were, and did include it correctly in the
    design matrix.

So I guess it's a bug in how statsmodels is calling patsy...?

@jseabold @josef-pkt


@jseabold
Member

Related, I guess: statsmodels/statsmodels#805

I haven't looked at this in a while, and we didn't coordinate well on this in the beginning. We tried to keep missing data handling mostly on our side because we have more than y/X to deal with.

@jseabold
Member

...and patsy didn't have any missing data handling when I wrote that.

@njsmith
Member

njsmith commented Mar 18, 2014

Brainstorming:

As a workaround if you want to be in charge of missing data handling you
could just always disable patsy's. But this might make it tricky to handle
categorical variables and stateful transforms right...

Ideal solution might be to move all NA handling into patsy, but to do that
we'd need to add a way to pass parallel vectors through patsy (parallel =
parallel to y/X, things like weights).

If you don't care about eliminating NA values in weights, then you could
let patsy do the missing value handling and then peek at the index on the
returned dataframe to see which rows got eliminated, and throw those out of
the other vectors. I remember I ran into some problem trying to do this
though in my own code and ended up with a hack instead:
https://github.com/rerpy/rerpy/blob/master/rerpy/rerp.py#L339
I don't remember what exactly the problem was, I could probably find some
notes somewhere...
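
A rough sketch of the "peek at the index" idea, assuming pandas input and `return_type="dataframe"` so that the surviving row labels are available on the result. The frame `df` and the `weights` vector are made up for illustration, and this does not reproduce whatever problem led to the hack linked above.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Hypothetical data: row 1 has a missing predictor value.
df = pd.DataFrame({"y": [1.0, 2.0, 3.0, 4.0], "x": [0.5, np.nan, 1.5, 2.5]})

# A parallel vector that patsy never sees (e.g. observation weights).
weights = pd.Series([10.0, 20.0, 30.0, 40.0], index=df.index)

# Let patsy drop rows with missing values; the returned DataFrames
# keep the index labels of the rows that survived.
y, X = dmatrices("y ~ x", df, return_type="dataframe")

# Use the surviving index to throw the same rows out of the weights.
kept_weights = weights.loc[X.index]

print(list(X.index))
print(kept_weights.tolist())
```

Note that this only aligns the weights with y/X; missing values inside `weights` itself would still go unnoticed.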

Alex: Your best quick workaround might be to swap your nan values for a
string, like "nan" or "--" or whatever that value actually means to you :-).


@cancan101
Author

Yep, putting this in formula seems to work:

C(col_name.fillna("N/A"), Treatment(reference="N/A")) 
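
For anyone wanting to try the workaround in isolation, here is a small sketch that calls patsy directly rather than going through statsmodels. The DataFrame is hypothetical and simply mirrors the column described at the top of the issue.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Hypothetical frame with the same values as the reported column.
df = pd.DataFrame({
    "col_name": pd.Series(
        [np.nan, "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"], dtype=object
    )
})

# Replace NaN with an explicit string inside the formula, and make
# that string the reference level of the treatment coding.
formula = 'C(col_name.fillna("N/A"), Treatment(reference="N/A"))'
dm = dmatrix(formula, df)

print(dm.shape)
print(dm.design_info.column_names)
```

Because "N/A" is now an ordinary string level, no rows are dropped and the four real categories each get a dummy column relative to "N/A".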

@josef-pkt

If there is no explicit missing='drop' when creating the model, then statsmodels doesn't check for nans at all. The nan handling is all patsy in this case, and if there are no extra arrays then, I think, there are not (or should not be) any problems.
The model is initialized with whatever endog and exog patsy returns.

If my reading of the statsmodels source is correct:
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/formula/formulatools.py#L38
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/model.py#L109

@josef-pkt

Is there a replicable example or test case for this?

@jseabold
Member

There's one at the top of the issue I linked to above. Note that the current behavior on that issue is the opposite of what was causing the problem before and of what is causing the issue here. The nan category is dropped in patsy by default now, I guess, and we don't do anything to control this.

@josef-pkt

Yes, I mostly understand our problems with statsmodels 805. However, I think that in this issue, patsy 36, the missing data handling of statsmodels is not involved at all. So this issue should be all patsy, even if the call goes through statsmodels.

Maybe I'm late and cancan101's solution/workaround already made this clear.

@jseabold
Member

See the second comment above. The issue from our end is that we don't pass any NA handling to patsy under the hood, so we don't have any way to suppress its dropping of NAs in the categoricals. So the issue with #805 is actually resolved, but it's because the defaults in patsy changed / missing data handling was added. We don't allow users to treat NaNs as a category right now. (I'm not convinced we should, though.)

@josef-pkt

Ok, I see, I didn't understand that part.
So the from_formula method needs to hand off some patsy_options to dmatrices?
That might collide with whatever deterministic (not user-influenced) behavior we want to expect from patsy. Users should have the option to turn off patsy's nan checking if they don't want any at all.

@jankatins

Just for reference: in pandas you can now add np.nan as a level:

a = array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)
df[cats] = pd.Categorical(a, levels=a) # works here because a has only unique values

Not sure what patsy makes of that or how it picks the reference level, though.
