coding categorical response variables for use with scikit-learn #77
Comments
+1 |
Yeah, I think statsmodels has similar trouble with their categorical models. Only question is what the interface to this should look like. What do you want to be able to request from patsy, and what kind of output should patsy support? Array of strings? Array of ints corresponding to the categories? (Annoyingly we still don't have a standard way to represent categorical data in numpy, so I guess we'll have to do something ad hoc...) When you process a formula, do you know ahead of time whether you want the response variable to be categorical, or do there exist models that want to do one thing if R's way of handling this is that whatever is on the left-hand side of formulas gets treated as R code rather than formula code, so in |
My guess is that for statsmodels it would be helpful to have a keyword in dmatrices to turn off categorical "treatment" on the left hand side of We would still need patsy to extract the arrays from the For GLM-Binomial I used left hand |
Just fyi, for classification, scikit-learn accepts anything that is not a float. It'll do a |
@amueller correct me if I'm wrong, but when sklearn does |
@pkch wrong, because |
@amueller Ah thanks, the I guess not only is this information preserved in |
PR very welcome, I'm surprised it's not there. Small doc fixes are very valuable. |
@amueller ah I didn't think it's that terrible to depend on the stability of If Also, what about the default scorer of those estimators that use |
Optional.
|
scikit-learn expects the response variable to be a 1d array. For example,
However, if y is an array of strings, patsy will convert it to dummy variables, which scikit-learn will not accept as a valid response y. Would it be useful perhaps to be able to tell patsy that a given (string-type) variable should remain a string and/or be converted to numeric labels?
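To make the two requested output forms concrete, here is a minimal sketch that uses only NumPy; it does not use patsy or scikit-learn. The sample labels and variable names are made up for illustration, and the dummy matrix is built by hand as a stand-in for the kind of dummy-coded response described above, not taken from actual patsy output.

```python
import numpy as np

# Hypothetical string-valued response, standing in for the y described above.
y = np.array(["cat", "dog", "cat", "bird", "dog"])

# Form 1: integer labels. `classes` holds the sorted unique categories and
# `codes` is a 1d array of indices into `classes`.
classes, codes = np.unique(y, return_inverse=True)

# Hand-built dummy (one-hot) matrix with one column per category. This is
# an assumption about the general shape of a dummy-coded response.
dummies = (y[:, None] == classes[None, :]).astype(float)

# Form 2: collapse the dummy matrix back to a 1d array of the original strings.
recovered = classes[dummies.argmax(axis=1)]

print(codes)      # 1d integer labels
print(recovered)  # 1d string labels
```

Either 1d array has the shape the issue says scikit-learn expects for a response; which of the two patsy should return is the open interface question in this thread.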