Categorical does not work with nan #36
Comments
By default, patsy thinks that 'nan' indicates missing data, and is dropping ...
If you want to just disable missing value handling altogether, that can be ...
Does that help?
Nathaniel J. Smith |
Currently I am using patsy through statsmodels:
so how would I make changes to nan handling? Also the rows with nan in them are definitely not being dropped. |
I don't know -- I just tested what I said against patsy itself, and:
So I guess it's a bug in how statsmodels is calling patsy...?
Nathaniel J. Smith |
Related, I guess: statsmodels/statsmodels#805. I haven't looked at this in a while, and we didn't coordinate well on this in the beginning. We tried to keep missing data handling mostly on our side because we have more than y/X to deal with. |
...and patsy didn't have any missing data handling when I wrote that. |
Brainstorming:
As a workaround if you want to be in charge of missing data handling you ...
Ideal solution might be to move all NA handling into patsy, but to do that ...
If you don't care about eliminating NA values in weights, then you could ...
Alex: Your best quick workaround might be to swap your nan values for a ...
Nathaniel J. Smith |
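The quick workaround suggested to Alex above is cut off in this copy of the thread. A minimal sketch of one reading of it follows, assuming the idea is to replace NaN with an ordinary placeholder label before the column reaches the formula. The helper `label_missing` and the `"MISSING"` label are hypothetical names chosen for illustration; they are not part of patsy or statsmodels, and the sample values are the level names quoted elsewhere in the thread.

```python
import math


def label_missing(values, placeholder="MISSING"):
    """Return a copy of `values` with None/NaN replaced by `placeholder`.

    Hypothetical helper for illustration only; not a patsy or statsmodels API.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    return [placeholder if is_missing(v) else v for v in values]


# Sample column built from the level names quoted in this thread.
col = [float("nan"), "CONFERENCE", "ANALYST", "FORUM", "SEMINAR", float("nan")]

cleaned = label_missing(col)
levels = sorted(set(cleaned))

# The placeholder is now an ordinary level, so there are five levels in total.
print(levels)
```

Because the placeholder is a plain string, it is no longer recognisable as missing data, so a categorical coding step should see it as one more level.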
Yep, putting this in formula seems to work:
|
If there is not explicit ...
If my reading of the statsmodels source is correct: |
Is there a replicable example or test case for this? |
There's one at the top of the issue I linked to above. Note that the current behavior on that issue is the opposite of what was causing the problem before and what is causing the issue here. The nan category is dropped in patsy by default now I guess, and we don't do anything to control this. |
Yes, I mostly understand our problems with statsmodels #805. However, I think that in this issue, patsy #36, the missing data handling of statsmodels is not involved at all. So this issue should be all patsy, even if the call goes through statsmodels. Maybe I'm late and cancan101's solution/workaround already made this clear. |
See the second comment above. The issue from our end is that we don't pass any NA handling to patsy under the hood, so we don't have any way to suppress its dropping of NAs in the categoricals. So the issue with #805 is actually resolved, but it's because the defaults in patsy changed / missing data handling was added. We don't allow users to treat NaNs as a category right now. (I'm not convinced we should, though.) |
Ok, I see, I didn't understand that part. |
Just for reference: in pandas you can now add ...
a = array([nan, 'CONFERENCE', 'ANALYST', 'FORUM', 'SEMINAR'], dtype=object)
df[cats] = pd.Categorical(a, levels=a)  # works here because a has only unique values
Not sure what patsy makes of that and how it gets the reference level, though. |
I have a column whose unique values look like:
I would expect that adding C(col_name) to the formula would create 4 dummy variables (5 values - 1), but in fact it only adds 3. When I tried to explicitly set control to be nan, I got an exception:
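The 3-versus-4 count can be reproduced with a small, library-independent sketch of treatment coding. It is not patsy's implementation: `treatment_dummies` and its `drop_nan` flag are hypothetical names used only to show the arithmetic, and the sample values are the level names quoted elsewhere in the thread. When NaN rows are treated as missing and dropped, 4 levels remain and 3 dummy columns are produced. When NaN is kept as its own level, there are 5 levels and 4 columns.

```python
import math


def is_nan(v):
    return isinstance(v, float) and math.isnan(v)


def treatment_dummies(values, drop_nan=True):
    """Treatment-code `values`: one 0/1 column per level except the reference.

    Hypothetical illustration only; this is not how patsy is implemented.
    """
    # Optionally discard rows whose value is NaN (treated as missing data).
    kept = [v for v in values if not (drop_nan and is_nan(v))]
    # When NaN is kept, use the plain label "nan" so levels can be sorted.
    labels = ["nan" if is_nan(v) else v for v in kept]
    levels = sorted(set(labels))
    columns = levels[1:]  # the first level acts as the reference
    rows = [[int(lab == lev) for lev in columns] for lab in labels]
    return columns, rows


col = [float("nan"), "CONFERENCE", "ANALYST", "FORUM", "SEMINAR"]

cols_dropped, rows_dropped = treatment_dummies(col, drop_nan=True)
cols_kept, rows_kept = treatment_dummies(col, drop_nan=False)

print(len(cols_dropped), len(rows_dropped))  # 3 columns, 4 rows
print(len(cols_kept), len(rows_kept))        # 4 columns, 5 rows
```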