How to use multi-column functions in Patsy which return categorical output? #79
Comments
Interesting use case! Unfortunately, right now there's no way to do what you want. I'm not quite seeing a clean way to include something like your (As a starting point, the current
@njsmith thanks for the suggestion. I looked at the I have updated my function into a stateful object for doing this now:

import numpy as np
import pandas as pd
import patsy


class MultiVal(object):
    def __init__(self):
        print("Using class based MultiVal")
        self.levels = []
        self.reference = None
        self.colnames = []
        self.level_len = 0
        self.nx = 0

    def memorize_chunk(self, *x, **kwargs):
        levels, reference = kwargs.get("levels", None), kwargs.get("reference", None)
        self.nx = len(x)
        if self.nx < 2:
            raise ValueError("Need at least 2 columns to do multival. Otherwise just use C(column_name)")
        # If number of columns are 2 allow single weight
        # Else for each column there should be a weight
        if len(x[0].shape) != 1:
            raise ValueError("Mismatching Shapes. All arrays should be 1d and should have the same shape")
        for k in x:
            if k.shape != x[0].shape:
                raise ValueError("Mismatching Shapes. All arrays should be 1d and should have the same shape")
        if levels is None:
            self.levels.extend(np.sort(np.unique(np.hstack(x))).tolist())
        else:
            self.levels.extend(levels)
        self.reference = reference

    def memorize_finish(self):
        # Sort the unique levels so the default reference is deterministic
        self.levels = np.array(sorted(set(self.levels)))
        if self.reference is None:
            self.reference = self.levels[0]
        self.levels = self.levels[self.levels != self.reference]  # Remove reference from levels
        self.level_len = len(self.levels)
        self.colnames = ["T.%s" % k for k in self.levels]

    def transform(self, *x, **kwargs):
        n_rows = x[0].shape[0]
        out = np.zeros((n_rows, self.level_len))
        weights = kwargs.get("weights", None)
        if weights is None:
            weights = np.ones((n_rows, 1))  # Create 1 column all ones weight matrix
        weights = np.asarray(weights)
        if len(weights.shape) == 1:
            if self.nx == 2:
                print("using complementary weights for 2 columns: w and 1-w")
                weights = np.column_stack([weights, 1. - weights])  # Add complementary weights
            else:
                weights = weights[:, np.newaxis]  # Create weights into 1 column matrix
        elif self.nx > 1 and weights.shape[1] != self.nx and weights.shape[1] != 1:
            raise ValueError("Either weights should be a 1d array or 2d array with number of columns equal to %s" % self.nx)
        for i, v in enumerate(self.levels):
            for j, col in enumerate(x):
                mask = np.asarray(col) == v  # Rows where this column carries level v
                w_col = min(weights.shape[1] - 1, j)  # Reuse the last weight column if there are fewer
                out[mask, i] = weights[mask, w_col]
        return pd.DataFrame(out, columns=self.colnames)


MC = patsy.stateful_transform(MultiVal)
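For readers unfamiliar with the three methods used above, here is a minimal sketch of the order in which a stateful transform's methods get called: memorize_chunk one or more times, then memorize_finish once, then transform. The sketch drives the object by hand and does not import patsy; the class name LevelCollector and its one-hot output are hypothetical and chosen only for illustration, not taken from the thread.

```python
import numpy as np


class LevelCollector(object):
    """Hypothetical toy transform: learns the levels seen across all chunks."""

    def __init__(self):
        self._seen = set()
        self.levels = None

    def memorize_chunk(self, *cols):
        # May be called several times, once per chunk of data.
        for col in cols:
            self._seen.update(np.asarray(col).tolist())

    def memorize_finish(self):
        # Called once after the last chunk; freeze the learned state.
        self.levels = sorted(self._seen)

    def transform(self, *cols):
        # Mark a level as present if any of the input columns carries it.
        out = np.zeros((len(cols[0]), len(self.levels)))
        for col in cols:
            col = np.asarray(col)
            for j, lev in enumerate(self.levels):
                out[col == lev, j] = 1.0
        return out


t = LevelCollector()
t.memorize_chunk(["A", "B"], ["B", "C"])  # first chunk, two label columns
t.memorize_chunk(["C"], ["A"])            # second chunk
t.memorize_finish()
encoded = t.transform(["A", "D"], ["C", "B"])  # "D" was never memorized
```

Because the levels are fixed in memorize_finish, a value that first shows up at transform time (here "D") simply produces no indicator, which is the property that makes the learned encoding reusable on new data.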
I am dealing with a situation where each item can have 1 or 2 labels, both of which come from the same set. In this case I want only one set of categorical variables, which encode whether each level is present or absent relative to the reference. I also want to weight the values of these categorical variables inversely by how many labels are present.
E.g. Let my categories be
{A, B, C}
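To make the intended encoding concrete, here is a small sketch using numpy only. It assumes that "inversely weight" means each label of an item contributes 1 / (number of labels on that item), that A is the reference level, and that a missing second label is marked with None. The function name encode_multilabel and the sample data are hypothetical and are not the code from the original post.

```python
import numpy as np


def encode_multilabel(label_columns, levels, reference):
    """Weighted treatment-style indicators for items carrying several labels.

    label_columns: list of 1-D sequences, one per label slot; None marks a
    missing label (an assumption of this sketch).
    """
    cols = [np.asarray(c, dtype=object) for c in label_columns]
    n_rows = cols[0].shape[0]
    kept = [lev for lev in levels if lev != reference]  # drop the reference level
    # Count how many labels are actually present in each row.
    counts = np.zeros(n_rows)
    for c in cols:
        counts += np.array([v is not None for v in c], dtype=float)
    out = np.zeros((n_rows, len(kept)))
    for j, lev in enumerate(kept):
        for c in cols:
            mask = c == lev
            out[mask, j] += 1.0 / counts[mask]  # each label weighs 1 / n_labels
    names = ["T.%s" % lev for lev in kept]
    return out, names


first = ["A", "B", "C", "B"]
second = [None, "C", "A", None]
out, names = encode_multilabel([first, second], levels=["A", "B", "C"], reference="A")
```

With this data, the second item (labels B and C) gets 0.5 in both columns, the fourth item (label B only) gets 1.0 in the B column, and the first item (reference A only) is all zeros.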
I tried to solve this issue using the following code:
There is no way of identifying the column names generated by patsy:
E.g. The names are like this:
I would much prefer column names like the following:
OR if I am passing the levels variable:
Is there a way to achieve this in patsy?
Duplicate of statsmodels/statsmodels#2843