Simpler term names #19

Open
jtornero opened this issue May 24, 2013 · 6 comments

Comments

@jtornero

First I want to thank the developer team for their excellent work.

I feel that the predictor names produced by, for instance, C(predictor, Treatment(5)) are too long and somewhat confusing. When you make interactions between predictors, you get things like:

C(trimcod, Treatment(4))[T.3]:C(flota, Treatment(11))[T.10]

It would be nice to be able to assign an alias, or to drop everything apart from the predictor name, to get something like:

[trimcod][T.3]

or just [trimcod 3]

I've been playing with the MyTreat example, but I haven't had any success.

Thank you very much

Jorge Tornero

@njsmith
Member

njsmith commented May 24, 2013

There are two parts to the name -- the "C(trimcod, Treatment(4))" is the
literal Python code that was executed to get the variable, and the "[T.3]"
part is added on by the categorical variable coder.

There really isn't any way to pull out "trimcod" from "C(trimcod,
Treatment(4))", because that would require parsing Python source code...
note that C() and Treatment() are just regular Python functions.

In the short run, you can store the output of C() to a temporary variable
with whatever name you want, and use that in your formula:

import patsy

Ctrimcod = patsy.C(trimcod, patsy.Treatment(4))
Cflota = patsy.C(flota, patsy.Treatment(11))
lm("y ~ Ctrimcod * Cflota", ...)

But Ctrimcod and Cflota will be strange opaque objects that you can't do
much else with, so you'll want to keep the original variables around as
well.

The real solution in the long run will be to implement a proper data type
in Python for storing categorical data, and which can have default coding
options attached to it -- basically turning the output of C() into an
object that's actually useful. That's how this stuff works in R -- if you
store your data as a "factor" object, you can attach the equivalent of
Treatment(4) to it directly. But this will take a while, since it needs
enhancements in numpy, in pandas, etc.

@jtornero
Author

Thank you very much for your fast answer.

I've made a workaround by forcing the dmatrices (or dmatrix) output to be a pandas DataFrame and substituting text in its column names. For example, if predictors is our DataFrame:

newcol = predictors.rename(columns=lambda x: x.replace('C(anocod, Treatment(23))', 'ANOCOD'))
newcol = newcol.rename(columns=lambda x: x.replace('C(trimcod, Treatment(4))', 'TRIMCOD'))
newcol = newcol.rename(columns=lambda x: x.replace('C(flota, Treatment(11))', 'FLEETYPE'))

and then

predictors.columns = newcol.columns
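
The same substitution can be written once over a plain list of names. This is only a sketch: ALIASES and apply_aliases are hypothetical names introduced here, and the column list is a made-up sample in the style of the names above.

```python
# Hypothetical alias table: full patsy factor code -> short display name.
ALIASES = {
    "C(anocod, Treatment(23))": "ANOCOD",
    "C(trimcod, Treatment(4))": "TRIMCOD",
    "C(flota, Treatment(11))": "FLEETYPE",
}


def apply_aliases(names, aliases=ALIASES):
    """Return a new list with every alias key replaced by its short name."""
    renamed = []
    for name in names:
        for long_form, short_form in aliases.items():
            name = name.replace(long_form, short_form)
        renamed.append(name)
    return renamed


# Sample column names, as they might appear in a design matrix.
columns = [
    "Intercept",
    "C(trimcod, Treatment(4))[T.3]",
    "C(trimcod, Treatment(4))[T.3]:C(flota, Treatment(11))[T.10]",
]
short_columns = apply_aliases(columns)
print(short_columns)
```

With a DataFrame, the result could be assigned back with something like predictors.columns = apply_aliases(predictors.columns).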

Cheers

Jorge Tornero

@njsmith
Member

njsmith commented May 24, 2013

Looks like a good workaround. The only thing to watch out for here is that
the .design_info attribute on your design matrix will still have the old
term names in it. Right now that probably won't affect anything, but
someday statsmodels and friends will probably be smart enough to use that
metadata for various things, so keep an eye out for that.

(This comment is partly directed at people who google this thread years
from now.)

@jtornero
Author

Well, I've been playing a little with git and I've been able to modify the source code in my local repo. The idea is to pass an additional parameter to Treatment, say display_name, that overrides, if provided, the default name that Treatment generates (i.e., [T.1]) with whatever you want. It looks very nice, but I don't know how to separate the "very first part" from the "Treatment" part. I guessed that it would be stored somewhere in the .design_info variables, but I haven't been able to find it. Any suggestions on how to proceed?

I see that modifying the source goes a step further, and is more dangerous, than implementing the MyTreat recipe from the documentation, but I haven't been able to get that kind of construction to work, sorry.

Thank you very much

Jorge Tornero

@njsmith
Member

njsmith commented Jun 10, 2013

I'm afraid I don't really understand what you're asking. A coding class like Treatment gets to set the [T.1] part to be whatever it wants. You can easily do that with a custom class like MyTreat too: the built-in classes like Treatment just use the same APIs that you can use yourself in a custom class. You don't have to separate the "very first part" from the [T.1] part, because you can only affect the [T.1] part. The rest comes from the factor's .name (https://patsy.readthedocs.org/en/latest/expert-model-specification.html#patsy.factor_protocol.name). The .design_info variables don't have anything to do with this AFAICT; that's just where patsy puts the names after it has figured them out, to pass them back to the user.
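
To make the point about custom coding classes concrete, here is a minimal sketch modelled on the MyTreat pattern from the patsy documentation. ShortTreatment and its template parameter are hypothetical names, not part of patsy, and the data is made up. Note that reference is a positional index into the sorted levels here, not a level value.

```python
import numpy as np
from patsy import ContrastMatrix, dmatrix


class ShortTreatment:
    """Treatment-style coding with a custom column suffix (illustrative only)."""

    def __init__(self, reference=0, template="[%s]"):
        self.reference = reference  # positional index of the reference level
        self.template = template    # how each level is rendered in the suffix

    def code_with_intercept(self, levels):
        # Full-rank coding: one indicator column per level.
        suffixes = [self.template % (level,) for level in levels]
        return ContrastMatrix(np.eye(len(levels)), suffixes)

    def code_without_intercept(self, levels):
        # Reduced-rank coding: drop the reference level's indicator column.
        keep = [i for i in range(len(levels)) if i != self.reference]
        matrix = np.eye(len(levels))[:, keep]
        suffixes = [self.template % (levels[i],) for i in keep]
        return ContrastMatrix(matrix, suffixes)


# Made-up data with levels 1..4; index 3 makes level 4 the reference.
data = {"trimcod": np.array([1, 2, 3, 4, 1, 2, 3, 4])}
design = dmatrix("C(trimcod, ShortTreatment(reference=3))", data)
names = design.design_info.column_names
print(names)
```

Only the suffix changes. The leading C(trimcod, ...) part still comes from the factor's name, which is why a coding class alone cannot shorten it.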

@jtornero
Author

Dear njsmith,

I'm sorry, maybe I mixed up statsmodels and patsy. What I meant concerns the names that appear in statsmodels.GLMResults.summary(). Those are the names I referred to in my first message, such as

C(trimcod, Treatment(4))[T.3]:C(flota, Treatment(11))[T.10]

for instance.

That's why I asked for simpler names. AFAIK those names are stored in design_info.column_names, but they can't be modified by the user. However, if you get pandas DataFrames instead of design matrices as output from patsy.dmatrices(), you can tweak those names by replacing text in the DataFrame's column list.

So what I am asking for is:

A way to rename the output column names to whatever you want; maybe something like a design_info.setColumnNames for both design-matrix and DataFrame output. Or just an additional parameter in patsy.dmatrix and/or patsy.dmatrices to provide a list of column names, that is, display names.

The point about Treatment is that it would be a nice way to provide, at least, nicer names for part of the final column names in the dmatrices/dmatrix output. Maybe an option in dmatrices/dmatrix like columns_names_from_treatment=True could do the trick as well.

Here is a small example. With three or four interactions, the GLMResults.summary() output gets a little confusing:

C(trimcod, Treatment(4))[T.3]:C(flota, Treatment(11))[T.10]:C(flota, Treatment(6))[T.11]

when the relevant information is that the term is formed by the interaction of

trimcod 3, flota 11 and flota 6

I hope I have explained myself better this time. Sorry for the inconvenience.
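
As a rough illustration of the kind of display name I mean, the following sketch post-processes the column names with a regular expression. display_name is a hypothetical helper, not a patsy or statsmodels function, and it assumes the simple pattern C(name, Something(args))[T.level]. It is not a general parser for Python expressions.

```python
import re

# Matches "C(name, ...)[T.level]" where the optional second argument
# contains at most one level of parentheses, e.g. "Treatment(4)".
_TERM = re.compile(
    r"\bC\(\s*(?P<name>[A-Za-z_]\w*)\s*"
    r"(?:,[^()]*(?:\([^()]*\))?[^()]*)?\)"
    r"\[T\.(?P<level>[^\]]+)\]"
)


def display_name(column_name):
    """Rewrite each 'C(x, Treatment(r))[T.v]' piece as 'x v'."""
    return _TERM.sub(
        lambda m: f"{m.group('name')} {m.group('level')}", column_name
    )


long_name = "C(trimcod, Treatment(4))[T.3]:C(flota, Treatment(11))[T.10]"
short_name = display_name(long_name)
print(short_name)
```

Names that do not follow this pattern, such as Intercept, pass through unchanged.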

Jorge Tornero
