Transforms like 'scale' need some way to handle missing data #82

rsgmon · 2016-04-06T16:48:48Z

In[92]: df = pd.DataFrame([(1,3),(2,6),(4,2),(6,5),(7,3),(4,6),(2,2),(6,4)], columns=['y','X'])
In[93]: pt.dmatrices('y ~ X.diff()', df)
Out[93]: 
(DesignMatrix with shape (7, 1)
   y
   2
   4
   6
   7
   4
   2
   6
   Terms:
     'y' (column 0),
 DesignMatrix with shape (7, 2)
   Intercept  X.diff()
           1         3
           1        -4
           1         3
           1        -2
           1         3
           1        -4
           1         2
   Terms:
     'Intercept' (column 0)
     'X.diff()' (column 1))

In[94]: pt.dmatrices('y ~ scale(X.diff())', df)

Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\IPython\core\formatters.py", line 222, in catch_format_error
    r = method(self, *args, **kwargs)
  File "C:\Python34\lib\site-packages\IPython\core\formatters.py", line 699, in __call__
    printer.pretty(obj)
  File "C:\Python34\lib\site-packages\IPython\lib\pretty.py", line 368, in pretty
    return self.type_pprinters[cls](obj, self, cycle)
  File "C:\Python34\lib\site-packages\IPython\lib\pretty.py", line 552, in inner
    p.pretty(x)
  File "C:\Python34\lib\site-packages\IPython\lib\pretty.py", line 382, in pretty
    return meth(obj, self, cycle)
  File "C:\Python34\lib\site-packages\patsy\design_info.py", line 1089, in _repr_pretty_
    for col in formatted_cols]
  File "C:\Python34\lib\site-packages\patsy\design_info.py", line 1089, in <listcomp>
    for col in formatted_cols]
ValueError: max() arg is an empty sequence
Out[94]:


njsmith · 2016-04-06T23:49:50Z

FYI -- to paste multi-line code blocks on github, use triple-backquotes. (I just fixed your original post -- if you click "edit" on it you can see how I modified it.)

The main problem you are hitting here is that your X.diff() thing has a NaN in it:

In [11]: df["X"].diff()
Out[11]: 
0   NaN
1     3
2    -4
3     3
4    -2
5     3
6    -4
7     2
Name: X, dtype: float64

Then when you pass that to scale, it tries to calculate the mean/stddev of the array, the NaN propagates, and it returns an array of all NaNs:

In [12]: pt.builtins.scale(df["X"].diff())
Out[12]: 
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
Name: X, dtype: float64

And then patsy's missing-data handling kicks in and throws away every row that contains a NaN, which here is all of them, so you get back a design matrix with zero rows in it.

And then there's a bug in patsy which I should fix, where if you try to print a design matrix with zero rows then it throws an error. But that's not really your main problem, it just obscures it :-)
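
To make the chain of events concrete, here is a minimal sketch that reproduces it with plain NumPy rather than patsy itself. The array mirrors the `X` column from the report; the centering/scaling line and the row filter only approximate what `scale` and the default missing-data handling do, so treat it as an illustration, not as patsy's actual code path.

```python
import numpy as np

# The X column from the original report.
x = np.array([3, 6, 2, 5, 3, 6, 2, 4], dtype=float)

# First differences, with a leading NaN like pandas' Series.diff() produces
# (there is no previous row to subtract from the first one).
dx = np.concatenate([[np.nan], np.diff(x)])

# A plain mean/std propagates the NaN, so centering and scaling
# turns every entry into NaN, not just the first one.
scaled = (dx - dx.mean()) / dx.std()

# Dropping every row that contains a NaN (roughly what the default
# missing-data handling does) then leaves no rows at all.
kept = scaled[~np.isnan(scaled)]

print(scaled)
print(kept.shape)
```

The single NaN at the start is enough to wipe out the whole column, which is why the unscaled formula gives 7 rows and the scaled one gives 0.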


njsmith · 2016-04-07T00:04:27Z

The deeper issue, which is a genuine issue, is that scale doesn't have any way to handle data with missing values inside it :-/. I never implemented this because I'm not really sure what the right approach is -- there are different ways to handle missing values, and there are different ways to flag them, and there isn't really any way right now to propagate the current settings (see the NA_action argument to dmatrix and friends) into scale. So there's definitely something to fix here, but I don't know how to do it right now, so I'll rename this issue to serve as a marker and hopefully come back to it at some point...
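
Until something like that exists, one possible stopgap is a NaN-aware helper that skips missing values when computing the mean and standard deviation. The sketch below is only that: `nan_scale` is a hypothetical name, not part of patsy, and `ddof=0` is an assumption made for this example. It picks one particular policy (ignore NaNs in the statistics, leave them as NaN in the output) out of the several that are possible.

```python
import numpy as np


def nan_scale(x, ddof=0):
    """Center and rescale x, ignoring NaNs when computing mean and std.

    Hypothetical helper for illustration; not part of patsy.
    NaN entries stay NaN in the result, so a later missing-data step
    can still drop just those rows.
    """
    x = np.asarray(x, dtype=float)
    mean = np.nanmean(x)
    std = np.nanstd(x, ddof=ddof)
    return (x - mean) / std


# The differenced column from the report, with its leading NaN.
dx = np.array([np.nan, 3, -4, 3, -2, 3, -4, 2])
scaled = nan_scale(dx)
print(scaled)
```

Note that a plain function like this is not a stateful transform: it recomputes the mean and standard deviation from whatever data it is handed, so it would not reuse the original statistics when building a design matrix for new data.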

rsgmon · 2016-04-07T00:07:36Z

Thanks Nathaniel for the edit tip and the explanation of the underlying issue. I can find a workaround for it now that you've explained it.

njsmith added a commit to njsmith/patsy that referenced this issue Apr 6, 2016

Fix a crash in DesignMatrix.__repr__ when shape[0] == 0

bfe145b

Discovered by @rsgmon in pydatagh-82.

njsmith mentioned this issue Apr 6, 2016

Fix a crash in DesignMatrix.__repr__ when shape[0] == 0 #83

Merged

njsmith changed the title ~~Can't seem to get scaled diff's~~ Transforms like 'scale' need some way to handle missing data Apr 7, 2016

njsmith mentioned this issue Nov 15, 2017

Formula splines with missing values statsmodels/statsmodels#4122

Open
