How to use multi-column functions in Patsy which return categorical output? #79

Open
napsternxg opened this issue Mar 6, 2016 · 2 comments


@napsternxg

I am dealing with a situation where each item can have one or two labels, both drawn from the same set. In this case I want a single set of categorical (indicator) columns that encode whether each level is present or absent relative to the reference level. I also want to weight these columns inversely by the number of labels present.
For example, let my categories be {A, B, C}:

import numpy as np
import pandas as pd
import patsy

data = pd.DataFrame({"col1": ["A", "A", "B"], "col2": ["O", "B", "C"], "num_vals": [1,2,2]})

def inverse_val(x):
    return 1.0/x

# Using categorical coding will not give the correct output
X = patsy.dmatrix("(C(col1) + C(col2)):inverse_val(num_vals)", data, return_type="dataframe")
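For reference, the encoding described above can be computed by hand with plain pandas. This is only a sketch of the target output, assuming "O" is the reference level and that a level counts as present if it appears in either label column; the names `levels`, `present` and `target` are illustrative and not part of patsy.

```python
import pandas as pd

data = pd.DataFrame({"col1": ["A", "A", "B"],
                     "col2": ["O", "B", "C"],
                     "num_vals": [1, 2, 2]})

# Non-reference levels; "O" is assumed to be the reference and gets no column.
levels = ["A", "B", "C"]

# One shared indicator column per level: 1.0 if the level appears in either label column.
present = pd.DataFrame({
    lvl: ((data["col1"] == lvl) | (data["col2"] == lvl)).astype(float)
    for lvl in levels
})

# Inverse weighting: divide each row by the number of labels it carries.
target = present.div(data["num_vals"], axis=0)
print(target)
```

Each row then sums to 1 across the levels it carries, which is the shared, weighted coding that `C(col1) + C(col2)` does not produce, since that formula builds a separate block of columns for each variable.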

I tried to solve this issue using the following code:

def multival(*x, **kwargs):
  #raise Exception("Not Implemented")
  levels, reference = kwargs.get("levels", None), kwargs.get("reference", None)
  weights = kwargs.get("weights", None)
  if len(x[0].shape) != 1:
    raise Exception("Mismatching Shapes. All arrays should be 1d and should have the same shape")
  for k in x:
    if k.shape != x[0].shape:
      raise Exception("Mismatching Shapes. All arrays should be 1d and should have the same shape")
  if levels is None:
    levels = np.unique(np.hstack(x)) # np.unique already returns sorted unique values; use this ordering as levels
  levels = np.asarray(levels) # Also accept a plain list of levels
  if reference is None:
    reference = levels[0]
  levels = levels[levels != reference] # Remove reference from levels
  level_len = len(levels)
  #print x[0].shape[0], level_len
  out = np.zeros((x[0].shape[0], level_len))
  for i, v in enumerate(levels):
    # print i, v
    for col in x:
      out[np.where(np.array(col) == v), i] = 1
  #print "Created matrix with shape: ", out.shape
  colnames = ["T.%s" % k for k in levels]
  if weights is not None:
    weights = np.asarray(weights) # Works for both a pandas Series and a NumPy array
    return pd.DataFrame(out, columns=colnames).divide(weights, axis=0)

  return pd.DataFrame(out, columns=colnames)

X = patsy.dmatrix("multival(col1, col2, weights=num_vals)", data, return_type="dataframe")

The problem is that the column names generated by patsy do not identify the levels. For example, the names look like this:

multival(col1, col2, weights=num_vals, reference='O')[0]
multival(col1, col2, weights=num_vals, reference='O')[1]
multival(col1, col2, weights=num_vals, reference='O')[2]

I would much prefer column names like the following:

multival(col1, col2, weights=num_vals, reference='O')[T.0]
multival(col1, col2, weights=num_vals, reference='O')[T.1]
multival(col1, col2, weights=num_vals, reference='O')[T.2]

Or, if I am passing the levels argument:

multival(col1, col2, weights=num_vals, reference='O')[T.A]
multival(col1, col2, weights=num_vals, reference='O')[T.B]
multival(col1, col2, weights=num_vals, reference='O')[T.C]

Is there a way to achieve this in patsy?
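In the meantime, one possible stopgap is to rename the columns of the returned DataFrame after the fact. The sketch below is only an illustration: `relabel_columns` is a hypothetical helper, not a patsy function, and it assumes the positional index in `[i]` lines up with the order of the supplied level names.

```python
import re

def relabel_columns(columns, level_names, factor_prefix="multival("):
    """Hypothetical helper: swap a trailing [i] for [T.<level>] on matching columns."""
    pattern = re.compile(r"^(?P<factor>.*)\[(?P<idx>\d+)\]$")
    renamed = []
    for name in columns:
        match = pattern.match(name)
        if match and name.startswith(factor_prefix):
            level = level_names[int(match.group("idx"))]
            renamed.append("%s[T.%s]" % (match.group("factor"), level))
        else:
            renamed.append(name)  # leave Intercept and other terms untouched
    return renamed

old = [
    "Intercept",
    "multival(col1, col2, weights=num_vals, reference='O')[0]",
    "multival(col1, col2, weights=num_vals, reference='O')[1]",
    "multival(col1, col2, weights=num_vals, reference='O')[2]",
]
new = relabel_columns(old, ["A", "B", "C"])
for name in new:
    print(name)
```

With a design matrix `X` this would be applied as `X.columns = relabel_columns(X.columns, ["A", "B", "C"])`. As far as I can tell this is purely cosmetic: the metadata patsy attaches to the matrix would still carry the original `[0]`, `[1]`, `[2]` names.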

Duplicate of statsmodels/statsmodels#2843

@njsmith
Member

njsmith commented Mar 6, 2016

Interesting use case!

Unfortunately, right now there's no way to do what you want. I'm not quite seeing a clean way to include something like your multival as part of patsy's regular interface (do you?), but it would certainly be nice if it were possible for a multi-column numerical factor to tell patsy what names to use for the columns. I'm unlikely to have time to work on implementing this soon, but if you're interested then it would require figuring out (a) what the interface should be for specifying column names (pull them out of a DataFrame? re-use the ContrastMatrix wrapper that patsy already supports for the similar case of naming categorical columns? both?), (b) how this information should be represented in the DesignInfo metadata, (c) how to teach the machinery in build.py to extract the column names and put them into the DesignInfo, (d) how to teach the machinery in build.py to take the column names from the DesignInfo and put them into a design matrix.

(As a starting point, the current [0], [1], style names are hard-coded in _subterm_column_names_iter in build.py -- a few lines below the hard-coded part you can see how the categorical names get pulled out of a user-specified ContrastMatrix.)
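To make option (a) a little more concrete, here is a standalone sketch of the "pull them out of a DataFrame" idea. It is not patsy code and does not reflect how build.py is actually organised; `column_suffixes` is a hypothetical helper that prefers names carried by the factor's value and otherwise falls back to the positional `[0]`, `[1]` style.

```python
import numpy as np
import pandas as pd

def column_suffixes(value):
    """Hypothetical helper: name the columns of a 2-D multi-column numeric factor."""
    num_columns = np.asarray(value).shape[1]
    names = getattr(value, "columns", None)  # a DataFrame carries names, an ndarray does not
    if names is not None and len(names) == num_columns:
        return ["[%s]" % name for name in names]
    # Fallback: positional names, as in the current hard-coded style.
    return ["[%d]" % i for i in range(num_columns)]

named = column_suffixes(pd.DataFrame(np.zeros((2, 3)), columns=["T.A", "T.B", "T.C"]))
unnamed = column_suffixes(np.zeros((2, 3)))
print(named)
print(unnamed)
```

Where such names would be stored in the DesignInfo, and how they would be read back when building the matrix, are points (b) to (d) and are left open here.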

@napsternxg
Author

napsternxg commented May 19, 2016

@njsmith thanks for the suggestion. I looked at build.py and it appears that some API-level changes might be required to support this. Are there any classes I can subclass to create a new feature based on some functions?

I have now turned my function into a stateful object for this:

class MultiVal(object):

  def __init__(self):
    print("Using class based MultiVal")
    self.levels = [] 
    self.reference = []
    self.colnames = []
    self.level_len = 0 
    self.nx = 0 

  def memorize_chunk(self,*x,**kwargs):
    levels, reference = kwargs.get("levels", None), kwargs.get("reference", None)
    self.nx = len(x)
    if self.nx < 2:
      raise Exception("Need at least 2 columns to do multival. Otherwise just use C(column_name)")
    # If number of columns are 2 allow single weight
    # Else for each column there should be a weight
    if len(x[0].shape) != 1:
      raise Exception("Mismatching Shapes. All arrays should be 1d and should have the same shape")
    for k in x:
      if k.shape != x[0].shape:
        raise Exception("Mismatching Shapes. All arrays should be 1d and should have the same shape")
    if levels is None:
      self.levels.extend(np.sort(np.unique(np.hstack(x))).tolist())
    else:
      self.levels.extend(levels)
    self.reference = reference

  def memorize_finish(self):
    self.levels = np.array(sorted(set(self.levels))) # Sort so the level order (and the default reference) is deterministic
    if self.reference is None:
      self.reference = self.levels[0]
    self.levels = self.levels[self.levels != self.reference] # Remove reference from levels
    self.level_len = len(self.levels)
    self.colnames = ["T.%s" % k for k in self.levels]

  def transform(self, *x, **kwargs):
    out = np.zeros((x[0].shape[0], self.level_len))
    weights = kwargs.get("weights", None)
    if weights is None:
      weights = np.ones((x[0].shape[0], 1)) # Create 1 column all ones weight matrix
    weights = np.asarray(weights, dtype=float) # A pandas Series does not support [:, np.newaxis] indexing
    if len(weights.shape) == 1:
      if self.nx == 2:
        print("using complementary weights for 2 columns. w and 1-w")
        weights = np.insert(weights[:, np.newaxis], 1, 1. - weights, axis=1) # Add complementary weights
      else:
        weights = weights[:, np.newaxis] # Reshape weights into a 1-column matrix
    elif self.nx > 1 and weights.shape[1] != self.nx and weights.shape[1] != 1:
      raise Exception("Either weights should be a 1d array or 2d array with number of columns equal to %s" % self.nx)
    for i, v in enumerate(self.levels):
      for j,col in enumerate(x):
        idx = np.where(np.array(col) == v)
        w_col = min(weights.shape[1] -1,j)
        out[idx, i] = weights[idx, w_col]
    return pd.DataFrame(out, columns=self.colnames)

MC = patsy.stateful_transform(MultiVal)
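For anyone following along, the three methods above are called in a fixed order: memorize_chunk once per chunk of data, memorize_finish once afterwards, then transform on the actual data. The toy class below drives that sequence by hand without patsy. `LevelIndicator` is a hypothetical, simplified stand-in for MultiVal (no weights, no reference level), meant only to show how state accumulates across chunks.

```python
import numpy as np

class LevelIndicator(object):
    """Toy stateful transform (hypothetical) with the memorize/transform methods."""

    def __init__(self):
        self._seen = set()
        self.levels = None

    def memorize_chunk(self, *cols):
        # Called once per chunk of data: only accumulate state here.
        for col in cols:
            self._seen.update(np.asarray(col).tolist())

    def memorize_finish(self):
        # Called once after all chunks: freeze a sorted, deterministic level list.
        self.levels = sorted(self._seen)

    def transform(self, *cols):
        # Called on the actual data: one indicator column per memorized level.
        cols = [np.asarray(col) for col in cols]
        out = np.zeros((len(cols[0]), len(self.levels)))
        for i, level in enumerate(self.levels):
            for col in cols:
                out[col == level, i] = 1.0
        return out

t = LevelIndicator()
t.memorize_chunk(np.array(["A", "A"]), np.array(["O", "B"]))  # first chunk
t.memorize_chunk(np.array(["B"]), np.array(["C"]))            # second chunk
t.memorize_finish()
result = t.transform(np.array(["A", "A", "B"]), np.array(["O", "B", "C"]))
print(t.levels)
print(result)
```

Because levels are only finalised in memorize_finish, a level that first shows up in a later chunk still gets its own column.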
