coding categorical response variables for use with scikit-learn #77

pkch · 2015-11-11T10:01:52Z

scikit-learn expects the response variable to be a 1d array. For example,

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)  # y is expected to be a 1d array of string or numeric labels

However, if y is an array of strings, patsy will convert it to dummy variables, which scikit-learn will not accept as a valid response y.

Would it perhaps be useful to be able to tell patsy that a given (string-type) variable should remain a string and/or be converted to numeric labels?
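As an illustration of one possible workaround (a minimal sketch, not a patsy feature): build only the right-hand side with patsy's dmatrix and take the response directly from the data, so it stays a 1d array of strings. The data and the column names x1, x2 and label are made up for this example.

```python
import numpy as np
from patsy import dmatrix

# Hypothetical toy data; the column names are invented for illustration.
data = {
    "x1": [0.1, 0.4, 0.35, 0.8],
    "x2": [1.0, 0.0, 1.0, 0.0],
    "label": ["cat", "dog", "cat", "bird"],
}

# Build only the right-hand side with patsy. "- 1" drops the intercept
# column, since scikit-learn estimators fit their own intercept by default.
X = np.asarray(dmatrix("x1 + x2 - 1", data))

# Take the response straight from the data as a 1d array of strings,
# bypassing patsy's categorical coding of the left-hand side.
y = np.asarray(data["label"])

print(X.shape)  # (4, 2)
print(y.ndim)   # 1
```

The resulting X and y can then be passed to lr.fit(X, y) as in the snippet above.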

datnamer · 2015-11-11T13:37:32Z

+1

njsmith · 2015-11-12T23:37:40Z

Yeah, I think statsmodels has similar trouble with their categorical models. Only question is what the interface to this should look like. What do you want to be able to request from patsy, and what kind of output should patsy support? Array of strings? Array of ints corresponding to the categories? (Annoyingly we still don't have a standard way to represent categorical data in numpy, so I guess we'll have to do something ad hoc...) When you process a formula, do you know ahead of time whether you want the response variable to be categorical, or do there exist models that want to do one thing if y is categorical and a different-but-equally-valid thing if y is numerical?

R's way of handling this is that whatever is on the left-hand side of formulas gets treated as R code rather than formula code, so in x + y ~ z + w, the first + does addition and the second + concatenates columns, which is rather confusing. And they have the luxury of having a single standard way to represent categorical data, so it's reasonable for the formula system to just get out of the way and let the end-user and the underlying model talk to each other directly. Not very helpful for us, unfortunately.

CC @josef-pkt @amueller

josef-pkt · 2015-11-13T00:13:47Z

My guess is that for statsmodels it would be helpful to have a keyword in dmatrices to turn off categorical "treatment" on the left hand side of ~. We would need a way to adjust the treatment of categorical without directly manipulating the formula string.
This is currently model specific and we are still missing some models. For ordered Logit it would be nice to keep pandas ordered Categorical but the ordered/ordinal model is still just a basic prototype without formulas yet.

We would still need patsy to extract the arrays from the data, DataFrame or dictionary, because we don't have string parsing.

For GLM-Binomial I used the left-hand + for standard formula concatenation: success + fail ~ ... for Binomial counts. My guess is that left-hand formulas will also be useful for multivariate models directly.
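For readers unfamiliar with this usage, here is a small sketch of what the left-hand + does in patsy today: the two numeric terms become two columns of the response matrix. The counts and the predictor name dose are hypothetical.

```python
import numpy as np
from patsy import dmatrices

# Hypothetical binomial counts per observation, plus one made-up predictor.
data = {
    "success": [3, 5, 2, 7],
    "fail": [7, 5, 8, 3],
    "dose": [0.1, 0.5, 0.2, 0.9],
}

# On the left-hand side, "+" concatenates columns just as on the right.
endog, exog = dmatrices("success + fail ~ dose", data)

print(endog.design_info.column_names)  # ['success', 'fail']
print(np.asarray(endog).shape)         # (4, 2)
print(exog.design_info.column_names)   # ['Intercept', 'dose']
```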

amueller · 2015-11-14T01:58:38Z

Just fyi, for classification, scikit-learn accepts anything that is not a float. It'll do a np.unique on it.
If you give models 2d data, they assume it is multi-label or multi-output data.

pkch · 2016-01-24T12:13:07Z

@amueller correct me if I'm wrong, but when sklearn does np.unique on the 1D non-float data, it becomes impossible to use the trained learner to predict on the new data (since the conversion map is irretrievably lost). In other words, if I later call sklearn's predict_proba function, I will have no way of knowing which probability refers to which class (it's just a 2D array of numbers, with no labels).

amueller · 2016-01-24T18:28:49Z

@pkch wrong, because unique actually returns the unique values, which are stored in the classes_ attribute.
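A minimal sketch of the point, using made-up toy data: after fitting, classes_ holds the sorted unique labels, and the columns of predict_proba follow that order.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data invented for illustration: one feature, three string classes.
X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array(["low", "low", "mid", "mid", "high", "high"])

clf = LogisticRegression().fit(X, y)

# The sorted unique labels found during fit are kept on the estimator.
print(clf.classes_)  # ['high' 'low' 'mid']

# Column j of predict_proba corresponds to clf.classes_[j].
proba = clf.predict_proba(np.array([[0.1]]))
for label, p in zip(clf.classes_, proba[0]):
    print(label, round(float(p), 3))
```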

pkch · 2016-01-24T19:44:08Z

@amueller Ah thanks, the classes_ attribute wasn't mentioned in the LogisticRegression documentation; not sure if it's worth submitting a PR for such a small issue though.

I guess not only is this information preserved in classes_, but it can also be deduced by rerunning LabelBinarizer.transform() on the set of original labels. This is important because without this guarantee that LabelBinarizer is fully deterministic, it would be impossible to write custom scoring functions that require probabilities. For example, the built-in log_loss metric starts by calling LabelBinarizer.transform() without actually being able to see the classes_ attribute and without the access to the LabelBinarizer instance from the estimator.

amueller · 2016-01-24T19:57:30Z

PR very welcome, I'm surprised it's not there. Small doc fixes are very valuable.
We want to make predictors and scoring metrics as easy to use as possible, but that does have the drawback you mention. More and more metrics have a labels parameter, which lets you pass estimator.classes_.
It seems not to be present in log_loss at the moment, but it would be a welcome addition.
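To illustrate the pattern with a metric that does accept labels, here is a sketch using confusion_matrix. The toy data is made up, and the evaluation subset deliberately lacks one class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy data invented for illustration: one feature, three string classes.
X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array(["a", "a", "b", "b", "c", "c"])
clf = LogisticRegression().fit(X, y)

# Evaluation subset whose true labels contain only "a" and "b".
X_eval, y_eval = X[:4], y[:4]
y_pred = clf.predict(X_eval)

# Without labels=, the matrix is sized by whatever labels happen to occur in
# y_eval and y_pred. Passing clf.classes_ pins rows and columns to all classes.
cm = confusion_matrix(y_eval, y_pred, labels=clf.classes_)
print(cm.shape)  # (3, 3)
```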

pkch · 2016-01-24T20:22:43Z

@amueller ah, I didn't think it was that terrible to depend on the stability of LabelBinarizer, but I guess it's not ideal. Did you mean labels as a required argument, or as an optional one with the current behavior as the default?

If labels is added to log_loss, it will make sense to add it to the API for the user-defined function score_func accepted by make_scorer (requires a modest code change in make_scorer).

Also, what about the default scorer of those estimators that use log_loss? Where is the code that needs to be changed to make them use the new labels argument? If it's not done, then GridSearchCV and cross_val_score (which use the estimator's scorer by default) will still use the old behavior.

amueller · 2016-01-24T20:25:34Z

Optional.
Well, it's not only about stability; it's that different subsets of the data (as happens in cross-validation) can have different label sets.
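A tiny sketch of that situation, with made-up labels and a hypothetical fold: inferring the label set from a subset silently drops any class the subset never sees.

```python
import numpy as np

# Made-up labels; the rare class "c" appears only once.
y = np.array(["a", "b", "c", "a", "b", "a"])

# A hypothetical cross-validation fold that never sees "c".
fold_idx = np.array([0, 1, 3, 4])

all_labels = np.unique(y)
fold_labels = np.unique(y[fold_idx])
print(all_labels)   # ['a' 'b' 'c']
print(fold_labels)  # ['a' 'b']

# Any code that infers classes from the fold alone misses these labels.
missing = np.setdiff1d(all_labels, fold_labels)
print(missing)      # ['c']
```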

pkch mentioned this issue Jan 26, 2016

Scoring functions don't know classes_ scikit-learn/scikit-learn#6231
