Complex Pipeline process & show_prediction #213

armgilles · 2017-06-08T16:09:28Z

Hi

I'm struggling to use show_prediction with a more complex pipeline and heterogeneous data... I know it is a pretty hot topic in scikit-learn and eli5.

I would like to use it as in your Titanic dataset example, but with more than one text column.

# X_train and X_test are DataFrames.
# cust_regression_vals, cust_txt_col and xgb_model are custom objects not shown here.
import eli5
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

count_vec_txt_1 = CountVectorizer(analyzer='word', max_features=75)
count_vec_txt_2 = CountVectorizer(analyzer='word', max_features=35)

clf = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            # Feature engineering is already done, so these columns are kept as-is
            ('cst', cust_regression_vals()),
            ('text_feature_1', Pipeline([
                ('text_feature_1', cust_txt_col(key='text_feature_1')),  # Selector
                ('count_vec_txt_1', count_vec_txt_1),
            ])),
            ('text_feature_2', Pipeline([
                ('text_feature_2', cust_txt_col(key='text_feature_2')),  # Selector
                ('count_vec_txt_2', count_vec_txt_2),
            ])),
        ],
    )),
    ('algo', xgb_model),
])

# Learning
clf.fit(X_train, y_train)

# My goal now is to get all my feature names (the pipeline has no get_feature_names() yet).
# Start from the DataFrame columns:
features = X_train.columns.tolist()

# Remove the raw text columns, which are replaced by CountVectorizer features
for col in ['text_feature_1', 'text_feature_2']:
    features.remove(col)

count_vec_txt_1.fit(X_train.text_feature_1)
features.extend(count_vec_txt_1.get_feature_names())

count_vec_txt_2.fit(X_train.text_feature_2)
features.extend(count_vec_txt_2.get_feature_names())
# I have all my feature names now

# I want to debug some rows (out of curiosity, etc.)

eli5.show_prediction(clf, X_test[X_test.index == 42], feature_names=features)
# ERROR

Error: estimator Pipeline(steps=[('union', FeatureUnion(n_jobs=1, transformer_list=[('cst', 
cust_regression_vals()), ('text_feature_1', Pipeline(steps=[('text_feature_1', 
cust_txt_col(key='text_feature_1')), ('count_vec_txt_1', CountVectorizer(analyzer='word', 
binary=False, ........))]) is not supported

I tried many things, but I'm stuck here...
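The name-collection step above depends on the order in which the feature blocks are stacked: the numeric columns first, then each vectorizer's vocabulary. Here is a minimal sketch of that bookkeeping, assuming toy data and a hypothetical helper `vocab_names` (not part of eli5 or scikit-learn). `scipy.sparse.hstack` stands in for the FeatureUnion, and the prefixes are only an illustration of how to keep tokens from the two text columns apart.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer


def vocab_names(vec, prefix):
    """Hypothetical helper: prefixed vocabulary of a fitted CountVectorizer."""
    # Newer scikit-learn has get_feature_names_out, older releases get_feature_names
    getter = getattr(vec, "get_feature_names_out", None) or vec.get_feature_names
    return ["{}__{}".format(prefix, token) for token in getter()]


# Toy stand-ins for the numeric block and the two text columns (assumed data)
numeric = np.array([[22.0, 7.25], [35.0, 53.1]])
numeric_names = ["age", "fare"]
text_1 = ["red apple", "green apple"]
text_2 = ["fast car", "slow car"]

vec_1 = CountVectorizer().fit(text_1)
vec_2 = CountVectorizer().fit(text_2)

# Names are concatenated in the same order as the stacked blocks
features = (
    numeric_names
    + vocab_names(vec_1, "text_feature_1")
    + vocab_names(vec_2, "text_feature_2")
)

stacked = hstack([csr_matrix(numeric), vec_1.transform(text_1), vec_2.transform(text_2)])

# One name per column, otherwise the explanation labels would be shifted
assert stacked.shape[1] == len(features)
print(features)
```

If the lengths do not match, the names passed as feature_names will not line up with the columns the model was trained on.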

lopuhin · 2017-06-08T16:15:22Z

@armgilles wow, thanks for a great example - it would be great to try applying eli5 in this case. Could you post a complete notebook somewhere, if it's convenient to you?

armgilles · 2017-06-08T16:17:19Z

Sure, but I can't share the data... I can use the classic fetch_20newsgroups if that's OK with you?

lopuhin · 2017-06-08T16:19:04Z

@armgilles ah sorry, I thought it was based directly on the titanic tutorial. I think what you provided is already enough.

kmike · 2017-06-08T19:14:53Z

Currently Pipeline support is not implemented for explain_prediction - it is implemented only for explain_weights; that's the reason #15 is still open.

Could you try passing clf.named_steps['algo'] as an estimator and clf.named_steps['union'] as vec? Does it work?

eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42],
    feature_names=features, vec=clf.named_steps['union'])

armgilles · 2017-06-09T06:55:40Z

Nope, it doesn't work:

eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42],
    feature_names=features, vec=clf.named_steps['union'])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-60-34ed186ec620> in <module>()
      1 eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42], 
----> 2                      feature_names=features, vec=clf.named_steps['union'])

/root/anaconda2/lib/python2.7/site-packages/eli5/ipython.pyc in show_prediction(estimator, doc, **kwargs)
    261     """
    262     format_kwargs, explain_kwargs = _split_kwargs(kwargs)
--> 263     expl = explain_prediction(estimator, doc, **explain_kwargs)
    264     html = format_as_html(expl, **format_kwargs)
    265     return HTML(html)

/root/anaconda2/lib/python2.7/site-packages/singledispatch.pyc in wrapper(*args, **kw)
    208 
    209     def wrapper(*args, **kw):
--> 210         return dispatch(args[0].__class__)(*args, **kw)
    211 
    212     registry[object] = func

/root/anaconda2/lib/python2.7/site-packages/eli5/xgboost.pyc in explain_prediction_xgboost(xgb, doc, vec, top, top_targets, target_names, targets, feature_names, feature_re, feature_filter, vectorized)
    123     Weights of all features sum to the output score of the estimator.
    124     """
--> 125     xgb_feature_names = xgb.booster().feature_names
    126     vec, feature_names = handle_vec(
    127         xgb, doc, vec, vectorized, feature_names, num_features=len(xgb_feature_names))

TypeError: 'str' object is not callable

Some more information:

clf.named_steps['algo']
# Returns
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05.......)

clf.named_steps['union']
# Returns
FeatureUnion(n_jobs=1,
       transformer_list=[('cst', cust_regression_vals()), ('text_feature_1', Pipeline(steps=[('text_feature_1', cust_txt_col(key='text_feature_1')), ('count_vec_txt_1', CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u...trip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None))]))],
       transformer_weights=None)
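One possible reading of this traceback, not confirmed in the thread: the repr above shows booster='gbtree', so on this xgboost version `xgb.booster` looks like a plain string parameter rather than a method, and calling it raises the TypeError. The toy class below (hypothetical, not xgboost code) reproduces that kind of shadowing in plain Python. Newer xgboost releases expose the trained model through get_booster().

```python
class ToyModel:
    """Hypothetical stand-in for an estimator whose parameter shadows a method."""

    def __init__(self, booster="gbtree"):
        # The instance attribute hides the method of the same name defined below
        self.booster = booster

    def booster(self):
        return "trained booster"

    def get_booster(self):
        # A differently named accessor is not affected by the shadowing
        return "trained booster"


model = ToyModel()

try:
    model.booster()  # looks up the string 'gbtree' and tries to call it
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
print(model.get_booster())
```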

armgilles · 2017-06-13T09:50:37Z

Hey, I created a notebook using the Titanic dataset with this kind of pipeline.

I use a specific function to build some features and then apply my pipeline.

Let me know if I can help with anything.

lopuhin · 2017-06-15T09:20:51Z

Thanks for the example, @armgilles!

Actually, this last error is related to how we handle pandas dataframes. Currently we assume that the vectorizer can handle a list of inputs as its input, but that is not the case here. A way to make your example work with current eli5 is to pass an already vectorized document:

eli5.show_prediction(clf.named_steps['algo'], 
                     clf.named_steps['union'].transform(X_test[X_test.index == 809]), 
                     feature_names=features)

This gives an explanation:

There is also a way to make your original example work (a9ec021), but I'm not sure it's consistent with our API: currently we always advise passing a single document, not a container of length 1. To be fair, passing X_test[X_test.index == 809].iloc[0] instead of X_test[X_test.index == 809] also fails currently. So it requires more thought about the API we advertise, and probably more pandas support, because it seems natural to have the vectorizer operate on pandas dataframes.
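The difference between the two inputs mentioned above is easy to see in isolation. A small sketch, assuming a made-up two-row DataFrame (column names and values are illustrative only): boolean indexing keeps a one-row DataFrame, while .iloc[0] yields a Series, so a transformer written for DataFrames receives a different type.

```python
import pandas as pd

# Toy frame standing in for X_test (assumed data)
X_test = pd.DataFrame(
    {"age": [22, 35], "text_feature_1": ["first doc", "second doc"]},
    index=[42, 809],
)

one_row_frame = X_test[X_test.index == 809]  # container of length 1
single_doc = one_row_frame.iloc[0]           # a single "document"

print(type(one_row_frame).__name__, one_row_frame.shape)
print(type(single_doc).__name__, single_doc.shape)

# Selecting a column by key works on both, but the results differ in type
frame_column = one_row_frame["text_feature_1"]  # Series of length 1
series_value = single_doc["text_feature_1"]     # plain Python str
```

A column selector that returns `frame_column` can feed a CountVectorizer directly, whereas `series_value` is a bare string and would need to be wrapped in a list first.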

armgilles · 2017-06-15T10:21:10Z

Thanks @lopuhin for the reply.

I updated my notebook and added some comments.

I wish I could help with a PR, but I'm outside my comfort zone here. Maybe I can help with some examples and documentation.

armgilles · 2017-06-27T13:38:20Z

I have a strange bug in this notebook. After fitting my model (a simple xgboost, no pipeline), I explain a single row with eli5.show_prediction:

y=1 is wrong here (probability 0.061); it should be y=0.

If I force targets in eli5.show_prediction with xgb_model_1.classes_ (array([0, 1])), I get the same result:

To fix it I have to set targets=[1, 0]:

Did I miss something?

I could open a new issue to make this clearer.

lopuhin · 2017-06-27T13:40:30Z

@armgilles currently y=1 is shown for binary classifiers in any case, but @kmike is working on this issue: #223

kmike · 2017-06-27T20:32:35Z

@armgilles if you have a binary classification task with class names (e.g. "red" and "blue") it is not that bad - y="red" (probability=0.061) kind of makes sense. So currently y=1 (probability=0.061) should be read as "y=1 with probability 0.061". But as @lopuhin said, it'll be fixed.
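To make that reading concrete, here is a tiny sketch in plain Python. The helper name `read_binary_explanation` and the 0.5 threshold are assumptions for illustration, not eli5 behaviour: the explanation reports the probability of the second class, and the predicted label follows from comparing it with its complement.

```python
def read_binary_explanation(proba_shown, classes=(0, 1)):
    """Interpret 'y=<classes[1]> (probability=proba_shown)' for a binary model."""
    proba_other = 1.0 - proba_shown
    # The label actually predicted is the class with the larger probability
    predicted = classes[1] if proba_shown >= 0.5 else classes[0]
    return {classes[1]: proba_shown, classes[0]: proba_other}, predicted


probabilities, predicted = read_binary_explanation(0.061)
print(probabilities, predicted)
```

With the value from the report above, class 1 has probability 0.061, so class 0 has the complementary probability and is the predicted label.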

sathyz · 2018-09-28T11:29:09Z

I'm trying a simple pipeline and show_prediction() and it is failing.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

import eli5

fnames = ["sepal_length", "sepal_width", "petal_length", "petal_width",]
tnames = ["Setosa", "Versicolour", "Virginica"]

Xs, ys = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(Xs, ys, shuffle=True, test_size=0.2)

scaler = StandardScaler()
lr = LogisticRegression()
pipeline = make_pipeline(scaler, lr)
pipeline.fit(X_train, y_train)

random_sample = np.random.randint(len(X_test))
doc = X_test[random_sample]
eli5.show_prediction(pipeline, doc, feature_names=fnames, target_names=tnames)

The error is:

Error: estimator Pipeline(memory=None, steps=[('standardscaler', StandardScaler(copy=True, 
with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None,
dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, 
penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False))]) is not
supported

I did the following to get it to work:

# Apply the scaler by hand, then explain the final estimator on the scaled sample
doc_raw = np.expand_dims(X_test[random_sample], axis=0)
doc = np.squeeze(scaler.transform(doc_raw))
eli5.explain_prediction(lr, doc, feature_names=fnames, target_names=tnames)

Version: 0.8
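The workaround above can be generalised to any pipeline whose leading steps are transformers. Below is a minimal sketch, assuming a hypothetical helper `split_pipeline` (not an eli5 function) and a fixed random_state for reproducibility. It applies every step except the last to one raw sample and checks that the final estimator then gives the same probabilities as the full pipeline. The eli5 call itself is left out so the snippet depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def split_pipeline(pipeline, doc):
    """Hypothetical helper: run every step but the last on one raw sample."""
    row = np.asarray(doc).reshape(1, -1)
    for _, step in pipeline.steps[:-1]:
        row = step.transform(row)
    final_estimator = pipeline.steps[-1][1]
    return final_estimator, np.squeeze(row)


Xs, ys = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    Xs, ys, test_size=0.2, random_state=0
)

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# estimator and doc are what would be passed to eli5.explain_prediction
estimator, doc = split_pipeline(pipeline, X_test[0])

proba_pipeline = pipeline.predict_proba(X_test[:1])
proba_manual = estimator.predict_proba(doc.reshape(1, -1))
print(np.allclose(proba_pipeline, proba_manual))
```

Because both paths give the same probabilities, an explanation of the final estimator on the transformed sample describes the same prediction the pipeline makes, although the feature values shown are the scaled ones.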

lopuhin added a commit that referenced this issue Jun 15, 2017

Pass dataframe to vectorizer unchanged

a9ec021

See GH-213 - this makes this example work, but I'm not sure if this is a correct thing to do.

kmike mentioned this issue Jun 23, 2017

allow to pass multiple documents to explain_prediction #225

Open
