Better handling for unrecognized categorical levels? #110
If I may enter here: initially I thought that would be a nice feature. I pull data from a database where "codes" mean different things, and are occasionally wrong. For example, in the year 2014, for a particular field, "A" might mean "apple" and "B" means "banana". But a change to the system flips it so "B" means "apple" and "A" means "banana". (I'm not the data architect... and we do this for a reason... our datasets are far more complicated, and that flipping occurs at somewhat regular intervals.) Anyway, in this merry-go-round of codes to meaningful names, something always gets screwed up. Our programmers catch it, but not before a handful of records have been coded with meaningless data: they are coded with a "C", which has no meaning at that particular time. If I split the dataset into 2 pieces--one for actual model-building and a holdout dataset for model validation--it happens (not irregularly) that I end up with one dataset with no "C"s, and another with all of them, and prediction on the holdout set then chokes on the unseen level.

What I've done to account for this is to just spend a little more time looking at the domain of values in the full dataset. Additionally, I've taken a liking to writing lots of formula helper functions... and generally will write a statement that groups "leftovers" into a large, somewhat heterogeneous categorical, along the lines of the sketch below. But you really have to know the data to know when to do that.

tl;dr: Initially, I would've agreed with you. But in reality, if you spend reasonable time getting to know your data (which isn't a bad thing), you can easily write helper functions to deal with this.
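(A hypothetical sketch of the kind of formula helper described above; the `CODE_MAPS` contents, the `decode` name, and the `UNKNOWN` level are all made up for illustration, not anything from a real codebase.)

```python
import pandas as pd

# Hypothetical year-dependent code maps; the real ones would live in the database.
CODE_MAPS = {
    2014: {"A": "apple", "B": "banana"},
    2015: {"A": "banana", "B": "apple"},  # the flip described above
}

def decode(codes, years, unknown="UNKNOWN"):
    """Translate raw codes to meaningful names, sending codes that have no
    meaning for that year (like the stray "C"s) to a single catch-all level.

    Usable inside a patsy formula, e.g.:  y ~ C(decode(code, year)) + x
    """
    return pd.Series(
        [CODE_MAPS.get(y, {}).get(c, unknown) for c, y in zip(codes, years)],
        dtype="object",
    )
```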
I think that in ridge-regularized, high-dimensional, one-hot-encoded regression it's OK to use an all-zero encoding for unknown categories, for two reasons; either way, the all-zero encoding will tend toward the average value.
I believe even lasso could be doing something like what I described above, since both ridge and lasso impose priors that assume coefficients are more likely to be near zero (Gaussian and Laplacian, respectively). It is then preferable to spread the average weight of a given categorical variable across the many coefficients of the other variables than to put its burden on the relatively few dummies that encode the variable in question. A quick numerical sketch of this effect is below.
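(A small sketch of the shrinkage argument above, on made-up data: a ridge fit on a full-rank one-hot encoding, where an all-zeros row falls back on the intercept alone and so is predicted near the grand mean.)

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Three known levels with different means; full-rank one-hot encoding.
levels = rng.integers(0, 3, size=300)
X = np.eye(3)[levels]                      # one column per level, no reference level
y = np.array([1.0, 5.0, 9.0])[levels] + rng.normal(0, 0.5, size=300)

model = Ridge(alpha=1.0).fit(X, y)         # fit_intercept=True by default

# An unseen level encoded as all zeros leaves only the intercept.
unseen = np.zeros((1, 3))
print(model.predict(unseen))               # close to y.mean(), i.e. ~5
print(y.mean())
```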
(Found by chance.) Otherwise, I don't see how a general estimation procedure could absorb the data-cleaning step and the modelling decisions that go with it, replacing a user who is supposed to know the context, as in @thequackdaddy's comment.
There's a request here to add an option so that when an unrecognized categorical level is encountered, it gets encoded as all-zeros, which is apparently similar to what scikit-learn's `DictVectorizer` does. Technically this is something patsy could do. But AFAICT this would lead to terribly incorrect behavior in any kind of linear-ish model, and AFAIK linear-ish models are what one-hot encoding is for, so I don't understand what's going on here or why people want this, and I like to understand things before implementing them :-).

Specifically, the kind of issue I'm thinking about is... say you have a logistic regression model you're using to predict whether an apartment is occupied, with a model like `occupancy ~ C(city) + bedrooms + baths`. In this model, patsy will use treatment coding, so returning all-zeros for unrecognized cities is the same as predicting that they act just like whichever city was assigned as the reference category (probably the one that's first alphabetically). OK, but that's not what `DictVectorizer` does -- it always uses a full-rank encoding, so it's more like patsy's `occupancy ~ 0 + C(city) + bedrooms + baths`. Now in this model, the beta for each `city` gives something like the (logistic-transformed) mean occupancy for that city, and using all-zeros for unrecognized cities is equivalent to assuming that their mean occupancy is exactly 0 on the logistic scale, which is a terrible guess. You really want it to do something like... return a vector of `[1/n, 1/n, ..., 1/n]`, so that you're assuming unseen cities have occupancy similar to the average of the seen cities. Of course, high-frequency cities and low-frequency cities are probably different too... And other categorical encodings (e.g. polynomial coding) are even more of a mess. (The sketch below makes the two encodings concrete.)
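(A minimal sketch on made-up data contrasting the two encodings discussed above, using patsy's `dmatrix`.)

```python
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({"city": ["austin", "boston", "chicago", "boston"]})

# Treatment coding: an intercept plus (n - 1) dummies; an all-zeros row in
# the dummies is indistinguishable from the reference level ("austin").
print(dmatrix("C(city)", data))

# Full-rank encoding (like DictVectorizer): one column per level and no
# intercept; an all-zeros row asserts a mean of exactly 0 on the link scale.
print(dmatrix("0 + C(city)", data))
```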
So I'm not sure what to do here, if anything.
I feel like 99% of the time, if you have an open category like this and don't want to use some principled solution like Bayesian nonparametrics, then what you instead want is something like: keep the top 100 categories and bin everything else into "other", so that you actually have training data on the "other" category that's plausibly representative of what you'll see later (because in both cases it's relatively low-frequency items). I guess this is also something patsy could potentially provide helpers for, though maybe it's more of a pandas thing -- something like the sketch below.
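(A rough pandas sketch of the top-N-plus-"other" binning; the `bin_rare_levels` name and the cutoff of 100 are just for illustration.)

```python
import pandas as pd

def bin_rare_levels(series, top_n=100, other="other"):
    """Keep the top_n most frequent levels and collapse everything else into
    one catch-all level, so the model sees "other" at training time too."""
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), other)

# At prediction time, any level outside `top` (including never-before-seen
# ones) lands in the same "other" bucket the model was trained on.
```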
CC: @amueller