Better handling for unrecognized categorical levels? #110
If I may enter here: initially I thought that would be a nice feature. I pull data from a database where "codes" mean different things, and are occasionally wrong. For example, in the year 2014, for a particular field, "A" might mean "apple" and "B" means "banana". But a change to the system flips it so "B" means "apple" and "A" means "banana". (I'm not the data architect... and we do this for a reason... our datasets are far more complicated, and that flipping occurs at somewhat regular intervals.) Anyway, in this merry-go-round of codes to meaningful names, something always gets screwed up. Our programmers catch it, but not before a handful of records have been coded with meaningless data: they are coded with a "C", which has no meaning at that particular time. If I split the dataset into 2 pieces--one for actual model-building and a holdout dataset for model validation--it happens (not irregularly) that I end up with one dataset with no "C"s, and another with all of them, and prediction on the holdout set then chokes on the unseen level.

What I've done to account for this is to just spend a little more time looking at the domain of values in the full dataset. Additionally, I've taken a liking to writing lots of formula helper functions... and generally will write a statement that groups "leftovers" into a large, somewhat heterogeneous categorical, along the lines of the sketch below. But you really have to know the data to know when to do that.

tl;dr: Initially, I would've agreed with you. But in reality, if you spend reasonable time getting to know your data (which isn't a bad thing), you can easily write helper functions to deal with this.
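(A hypothetical sketch of the kind of formula helper described above; the `CODE_MAPS` contents, the `decode` name, and the `UNKNOWN` level are all made up for illustration, not anything from a real codebase.)

```python
import pandas as pd

# Hypothetical year-dependent code maps; the real ones would live in the database.
CODE_MAPS = {
    2014: {"A": "apple", "B": "banana"},
    2015: {"A": "banana", "B": "apple"},  # the flip described above
}

def decode(codes, years, unknown="UNKNOWN"):
    """Translate raw codes to meaningful names, sending codes that have no
    meaning for that year (like the stray "C"s) to a single catch-all level.

    Usable inside a patsy formula, e.g.:  y ~ C(decode(code, year)) + x
    """
    return pd.Series(
        [CODE_MAPS.get(y, {}).get(c, unknown) for c, y in zip(codes, years)],
        dtype="object",
    )
```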
I think that in ridge-regularized, high-dimensional, one-hot-encoded regression it's OK to use an all-zero encoding for unknown categories, for two reasons; either way, the all-zero encoding will tend toward the average value.
I believe even lasso could be doing something like what I described above, since both ridge and lasso impose priors that assume coefficients are more likely to be near zero (Gaussian and Laplacian, respectively). It is then preferable to spread the average weight of a given categorical variable across the many coefficients of the other variables than to put its burden on the relatively few dummies that encode the variable in question. A quick numerical sketch of this effect is below.
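(A small sketch of the shrinkage argument above, on made-up data: a ridge fit on a full-rank one-hot encoding, where an all-zeros row falls back on the intercept alone and so is predicted near the grand mean.)

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Three known levels with different means; full-rank one-hot encoding.
levels = rng.integers(0, 3, size=300)
X = np.eye(3)[levels]                      # one column per level, no reference level
y = np.array([1.0, 5.0, 9.0])[levels] + rng.normal(0, 0.5, size=300)

model = Ridge(alpha=1.0).fit(X, y)         # fit_intercept=True by default

# An unseen level encoded as all zeros leaves only the intercept.
unseen = np.zeros((1, 3))
print(model.predict(unseen))               # close to y.mean(), i.e. ~5
print(y.mean())
```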
(Found by chance.) Otherwise, I don't see how a general estimation procedure could absorb the data-cleaning step and the modelling decisions that go with it, replacing a user who is supposed to know the context, as in @thequackdaddy's comment.
There's a request here to add an option so that when an unrecognized categorical level is encountered, it gets encoded as all-zeros, which is apparently similar to what scikit-learn's `DictVectorizer` does. Technically this is something patsy could do. But AFAICT this would lead to terribly incorrect behavior in any kind of linear-ish model, and AFAIK linear-ish models are what one-hot encoding is for, so I don't understand what's going on here or why people want this, and I like to understand things before implementing them :-).

Specifically, the kind of issue I'm thinking about is... say you have a logistic regression model you're using to predict whether an apartment is occupied, with a model like `occupancy ~ C(city) + bedrooms + baths`. In this model, patsy will use treatment coding, so returning all-zeros for unrecognized cities is the same as predicting that they act just like whichever city was assigned as the reference category (probably the one that's first alphabetically). OK, but that's not what `DictVectorizer` does -- it always uses a full-rank encoding, so it's more like patsy's `occupancy ~ 0 + C(city) + bedrooms + baths`. Now in this model, the beta for each `city` gives something like the (logistic-transformed) mean occupancy for that city, and using all-zeros for unrecognized cities is equivalent to assuming that their mean occupancy is exactly 0 on the logistic scale, which is a terrible guess. You really want it to do something like... return a vector of `[1/n, 1/n, ..., 1/n]`, so that you're assuming unseen cities have occupancy similar to the average of the seen cities. Of course, high-frequency cities and low-frequency cities are probably different too... And other categorical encodings (e.g. polynomial coding) are even more of a mess. (The sketch below makes the two encodings concrete.)
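(A minimal sketch on made-up data contrasting the two encodings discussed above, using patsy's `dmatrix`.)

```python
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({"city": ["austin", "boston", "chicago", "boston"]})

# Treatment coding: an intercept plus (n - 1) dummies; an all-zeros row in
# the dummies is indistinguishable from the reference level ("austin").
print(dmatrix("C(city)", data))

# Full-rank encoding (like DictVectorizer): one column per level and no
# intercept; an all-zeros row asserts a mean of exactly 0 on the link scale.
print(dmatrix("0 + C(city)", data))
```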
So I'm not sure what to do here, if anything.
I feel like 99% of the time, if you have an open category like this and don't want to use some principled solution like Bayesian nonparametrics, then what you instead want is something like: keep the top 100 categories and bin everything else into "other", so that you actually have training data on the "other" category that's plausibly representative of what you'll see later (because in both cases it's relatively low-frequency items). I guess this is also something patsy could potentially provide helpers for, though maybe it's more of a pandas thing -- something like the sketch below.
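(A rough pandas sketch of the top-N-plus-"other" binning; the `bin_rare_levels` name and the cutoff of 100 are just for illustration.)

```python
import pandas as pd

def bin_rare_levels(series, top_n=100, other="other"):
    """Keep the top_n most frequent levels and collapse everything else into
    one catch-all level, so the model sees "other" at training time too."""
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other)

# At prediction time, any level outside `top` (including never-before-seen
# ones) lands in the same "other" bucket the model was trained on.
```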
CC: @amueller