
Categorical encodings, entropic approach #119

Open
sjLambda opened this issue Jan 26, 2018 · 5 comments

Comments

sjLambda commented Jan 26, 2018

One of my favorite parts of patsy is the categorical encodings. That feature alone is worth using the library for. While playing with those options, I could not find any encodings based on standard entropy calculations. Is that covered somewhere and I missed it?

For an example of using entropy for categorical encodings, see this video: https://youtu.be/IPkRVpXtbdY?t=4m49s


njsmith commented Jan 26, 2018

You didn't miss anything. I'm not aware of any categorical encoding schemes that use information theory. I skimmed the video, and all I saw was a general discussion of how to calculate entropy, not its connection to encoding categorical variables for use in linear modeling. Is this a thing you've encountered, and if so, do you have a link for it? (Preferably not a video.)


sjLambda commented Jan 27, 2018 via email


njsmith commented Jan 27, 2018

> He calls it "binning", which is the same as nominal categorization. But more importantly, he compares the different value ranges to find the best information gain. This is the central reason for categorical encoding in statistical inference. The coding schemes in statistics are just a convenience; the real purpose is finding information gain.

Here it sounds like you're talking about the problem of, given a continuous random variable, finding a discretization that preserves the most information (I guess under the constraint that each discrete value has to correspond to a contiguous range of the continuous space)? Patsy's categorical encodings are about taking a categorical variable and encoding it as a multi-dimensional real vector. I'm not sure what connection you see between these two problems, or whether you mean something else.
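(For reference, here is a minimal sketch of what that encoding step looks like today; the data frame and column name are invented for illustration:)

```python
# Sketch: patsy turns a categorical column into real-valued contrast columns.
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Default treatment (dummy) coding: one reference level, k-1 indicator columns.
print(dmatrix("C(color)", data))

# Sum-to-zero (deviation) coding, another built-in contrast scheme.
print(dmatrix("C(color, Sum)", data))
```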

The b-ary section just gives the definition of the entropy of a categorical random variable (with b setting the base of the log). I know what entropy is; what I don't know is what it has to do with categorical encodings :-).
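(Concretely, that definition is just H_b(X) = -Σ_i p_i log_b p_i; a minimal sketch of estimating it from observed labels:)

```python
# Sketch: entropy of a categorical variable in base b,
# H_b(X) = -sum_i p_i * log_b(p_i), estimated from observed frequencies.
import numpy as np

def entropy(labels, base=2):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy(["red", "green", "blue", "green", "red"]))  # ~1.52 bits
```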

> For implementing in patsy, I envision one or more categorical variables fed via the "formula" along with a target variable that is a continuous variable. This target variable is what gets the ranges or bins (as in the bins parameter for np.digitize). Patsy will then calculate a maximum "information gain" figure (which is a probability value between 0 and 1) for the categories based on a variety of ranges of the continuous variable. Instead of zeros and ones, it will use the floating point values for each of the categories.
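(One purely speculative reading of the above, with invented helper names and no actual patsy API: bin the continuous target with np.digitize, then score the categorical predictor by the information gain of the binned target. How a gain figure would then become the per-category floating point encoding is exactly the gap the question below is about.)

```python
# Speculative sketch only: bin a continuous target, then compute the
# information gain IG = H(binned) - H(binned | category) for a categorical
# predictor, over a few candidate binnings. Names here are invented.
import numpy as np

def entropy_bits(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(categories, target, bin_edges):
    binned = np.digitize(target, bin_edges)
    h_total = entropy_bits(binned)
    h_cond = sum((categories == c).mean() * entropy_bits(binned[categories == c])
                 for c in np.unique(categories))
    return h_total - h_cond

categories = np.array(["a", "a", "b", "b", "c", "c"])
target = np.array([0.1, 0.2, 0.8, 0.9, 0.4, 0.5])

# Try a few candidate binnings of the continuous target and keep the best.
candidate_edges = [np.array([0.5]), np.array([0.3, 0.7])]
print(max(information_gain(categories, target, e) for e in candidate_edges))
```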

Can you give a concrete, worked example?


sjLambda commented Jan 27, 2018 via email


njsmith commented Jan 27, 2018

I'm asking because I can't make head or tail of anything you've said so far, so just referring back to it isn't going to clarify anything. Also, I'm pretty sure none of those videos had examples of patsy formulas.
