Complex Pipeline process & show_prediction #213

Open
armgilles opened this issue Jun 8, 2017 · 12 comments

@armgilles

Hi

I'm struggling to use show_prediction with a more complex pipeline and heterogeneous data. I know this is a pretty hot topic in scikit-learn and eli5.

I would like to use it as in your Titanic dataset example, but with more than one text column.

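import eli5
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

# X_train & X_test are DataFrames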

count_vec_txt_1 = CountVectorizer(analyzer='word', max_features=75)
count_vec_txt_2 = CountVectorizer(analyzer='word', max_features=35)

clf = Pipeline([
        ('union', FeatureUnion(
                    transformer_list = [
                        ('cst',  cust_regression_vals()),     # Feature engineering is already done, so I just keep these columns
                        ('text_feature_1', Pipeline([
                            ('text_feature_1', cust_txt_col(key='text_feature_1')), # Selector
                            ('count_vec_txt_1', count_vec_txt_1)
                        ])),
                        ('text_feature_2', Pipeline([
                            ('text_feature_2', cust_txt_col(key='text_feature_2')), # Selector
                            ('count_vec_txt_2', count_vec_txt_2)
                        ])),
       
                    ]
        )),
        ('algo', xgb_model)
    ])

# Learning
clf.fit(X_train, y_train)

## My goal now is to get all my feature names (the pipeline has no get_feature_names() yet)
# Start from the DataFrame columns:
features = X_train.columns.tolist()

# Remove the raw text columns, which are replaced by the vectorizer output
for col in ['text_feature_1', 'text_feature_2']:
    features.remove(col)

count_vec_txt_1.fit(X_train.text_feature_1)
features.extend(count_vec_txt_1.get_feature_names())

count_vec_txt_2.fit(X_train.text_feature_2)
features.extend(count_vec_txt_2.get_feature_names())
# I have all my feature names now

# Want to debug some rows (curiosity etc...)

eli5.show_prediction(clf, X_test[X_test.index == 42], feature_names=features)
# ERROR
Error: estimator Pipeline(steps=[('union', FeatureUnion(n_jobs=1, transformer_list=[('cst', 
cust_regression_vals()), ('text_feature_1', Pipeline(steps=[('text_feature_1', 
cust_txt_col(key='text_feature_1')), ('count_vec_txt_1', CountVectorizer(analyzer='word', 
binary=False, ........))]) is not supported 

I tried many things but I'm stuck here...
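
The manual feature-name bookkeeping above could also be sketched as a small recursive helper. This is only an illustration, not part of eli5 or scikit-learn: `collect_feature_names` and its `fallback_names` argument are hypothetical names, and the helper assumes that the last step of each nested pipeline defines the output columns and that leaf vectorizers expose `get_feature_names()` (or `get_feature_names_out()` in newer scikit-learn releases).

```python
def collect_feature_names(transformer, fallback_names=None):
    """Recursively gather output feature names from a fitted transformer.

    Duck-typed sketch: relies only on the `transformer_list` attribute of a
    FeatureUnion, the `steps` attribute of a Pipeline, and a name-returning
    method on leaf vectorizers. `fallback_names` maps a FeatureUnion branch
    name to a list of names for branches that cannot report their own.
    """
    fallback_names = fallback_names or {}

    # FeatureUnion: concatenate the names of every branch, in order.
    if hasattr(transformer, "transformer_list"):
        names = []
        for branch_name, branch in transformer.transformer_list:
            if branch_name in fallback_names:
                names.extend(fallback_names[branch_name])
            else:
                names.extend(collect_feature_names(branch, fallback_names))
        return names

    # Pipeline: assume the last step defines the output columns.
    if hasattr(transformer, "steps"):
        return collect_feature_names(transformer.steps[-1][1], fallback_names)

    # Leaf vectorizer such as CountVectorizer (newer method name first).
    for method in ("get_feature_names_out", "get_feature_names"):
        if hasattr(transformer, method):
            return list(getattr(transformer, method)())

    raise ValueError("cannot infer feature names for %r" % (transformer,))


# Possible usage with the pipeline above (numeric_columns is a hypothetical
# list holding the column names produced by the 'cst' branch):
# features = collect_feature_names(clf.named_steps['union'],
#                                  fallback_names={'cst': numeric_columns})
```

Because the names are read from the fitted objects inside the union, the vectorizers do not need to be fitted a second time outside the pipeline.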

@lopuhin
Contributor

lopuhin commented Jun 8, 2017

@armgilles wow, thanks for a great example - it would be great to try applying eli5 in this case. Could you post a complete notebook somewhere, if it's convenient to you?

@armgilles
Author

Sure, but I can't share the data... I can use the classic fetch_20newsgroups dataset if that's OK with you?

@lopuhin
Copy link
Contributor

lopuhin commented Jun 8, 2017

@armgilles ah sorry, I thought it was based directly on the Titanic tutorial. I think what you provided is already enough.

@kmike
Contributor

kmike commented Jun 8, 2017

Currently Pipeline support is not implemented for explain_prediction - it is implemented only for explain_weights; that's the reason #15 is still open.

Could you try passing clf.named_steps['algo'] as an estimator and clf.named_steps['union'] as vec? Does it work?

eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42], 
    feature_names=features, vec=clf.named_steps['union'])

@armgilles
Author

Nope, it doesn't work:

eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42], 
    feature_names=features, vec=clf.named_steps['union'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-60-34ed186ec620> in <module>()
      1 eli5.show_prediction(clf.named_steps['algo'], X_test[X_test.index == 42], 
----> 2                      feature_names=features, vec=clf.named_steps['union'])

/root/anaconda2/lib/python2.7/site-packages/eli5/ipython.pyc in show_prediction(estimator, doc, **kwargs)
    261     """
    262     format_kwargs, explain_kwargs = _split_kwargs(kwargs)
--> 263     expl = explain_prediction(estimator, doc, **explain_kwargs)
    264     html = format_as_html(expl, **format_kwargs)
    265     return HTML(html)

/root/anaconda2/lib/python2.7/site-packages/singledispatch.pyc in wrapper(*args, **kw)
    208 
    209     def wrapper(*args, **kw):
--> 210         return dispatch(args[0].__class__)(*args, **kw)
    211 
    212     registry[object] = func

/root/anaconda2/lib/python2.7/site-packages/eli5/xgboost.pyc in explain_prediction_xgboost(xgb, doc, vec, top, top_targets, target_names, targets, feature_names, feature_re, feature_filter, vectorized)
    123     Weights of all features sum to the output score of the estimator.
    124     """
--> 125     xgb_feature_names = xgb.booster().feature_names
    126     vec, feature_names = handle_vec(
    127         xgb, doc, vec, vectorized, feature_names, num_features=len(xgb_feature_names))

TypeError: 'str' object is not callable

Some information:

clf.named_steps['algo']
# Return
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.05.......)
clf.named_steps['union']
# Return 
FeatureUnion(n_jobs=1,
       transformer_list=[('cst', cust_regression_vals()), ('text_feature_1', Pipeline(steps=[('text_feature_1', cust_txt_col(key='text_feature_1')), ('count_vec_txt_1', CountVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u...trip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None))]))],
       transformer_weights=None)

@armgilles
Author

Hey, I created a notebook using the Titanic dataset with this kind of pipeline.

I use a custom function to build some features and then apply my pipeline.

Let me know if I can help with anything.

lopuhin added a commit that referenced this issue Jun 15, 2017
See GH-213 - this makes this example work, but I'm not sure
if this is a correct thing to do.
@lopuhin
Contributor

lopuhin commented Jun 15, 2017

Thanks for an example @armgilles !

Actually, this last error is related to how we handle pandas DataFrames. Currently we assume that the vectorizer can take a list of documents as its input, but that is not the case here. A way to make your example work with current eli5 is to pass an already vectorized document:

eli5.show_prediction(clf.named_steps['algo'], 
                     clf.named_steps['union'].transform(X_test[X_test.index == 809]), 
                     feature_names=features)

This gives an explanation:
image

There is also a way to make your original example work (a9ec021), but I'm not sure it's consistent with our API: currently we always advise passing a single document, not a container of length 1. To be fair, passing X_test[X_test.index == 809].iloc[0] instead of X_test[X_test.index == 809] also fails currently. So it requires more thought about the API we advertise, and probably more pandas support, because it seems natural to have a vectorizer operate on pandas DataFrames.

@armgilles
Author

Thanks @lopuhin for the reply.

I updated my notebook and added some comments.

I wish I could help with a PR, but I'm out of my comfort zone here. Maybe I can help with some examples and documentation.

@armgilles
Author

I have a strange bug in this notebook: after fitting my model (a simple xgboost, no pipeline), I explain the prediction for one row with eli5.show_prediction:

image

y=1 is wrong here (probability 0.061); it should be y=0.

If I force targets in eli5.show_prediction to xgb_model_1.classes_ (array([0, 1])), I get the same result:

image

To fix it I have to set targets=[1, 0]:

image

Did I miss something?

I could open a new issue if that makes it easier to follow.

@lopuhin
Contributor

lopuhin commented Jun 27, 2017

@armgilles currently y=1 is shown for binary classifiers in any case, but @kmike is working on this issue: #223

@kmike
Contributor

kmike commented Jun 27, 2017

@armgilles if you have a binary classification task with class names (e.g. "red" and "blue") it is not that bad - y="red" (probability=0.061) kind of makes sense. So currently y=1 (probability=0.061) should be read as "y=1 with probability 0.061". But as @lopuhin said, it'll be fixed.
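
To make the reading above concrete, here is a minimal sketch of how the displayed positive-class probability maps to the class that would actually be predicted. `describe_binary_prediction` is a hypothetical helper, not an eli5 function, and the 0.5 decision threshold is an assumption (it matches picking the more probable of two classes).

```python
def describe_binary_prediction(proba_positive, classes=(0, 1)):
    """Return (predicted_class, probability_of_predicted_class).

    `proba_positive` is the probability reported for classes[1], the class
    the explanation is displayed for. The 0.5 threshold is an assumption.
    """
    negative, positive = classes
    if proba_positive >= 0.5:
        return positive, proba_positive
    # The displayed class is the unlikely one, so the other class wins.
    return negative, 1.0 - proba_positive


predicted, confidence = describe_binary_prediction(0.061)
print("predicted y=%s with probability %.3f" % (predicted, confidence))
# prints: predicted y=0 with probability 0.939
```

So "y=1 (probability=0.061)" and "y=0 (probability=0.939)" describe the same prediction from the two sides.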

@sathyz

sathyz commented Sep 28, 2018

I'm trying show_prediction() with a simple pipeline, and it fails.

import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

import eli5

fnames = ["sepal_length", "sepal_width", "petal_length", "petal_width",]
tnames = ["Setosa", "Versicolour", "Virginica"]

Xs, ys = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(Xs, ys, shuffle=True, test_size=0.2)

scaler = StandardScaler()
lr = LogisticRegression()
pipeline = make_pipeline( scaler, lr)
pipeline.fit(X_train, y_train)

random_sample = np.random.randint(len(X_test))
doc = X_test[random_sample]
eli5.show_prediction(pipeline, doc, feature_names=fnames, target_names=tnames)

The error is:

Error: estimator Pipeline(memory=None, steps=[('standardscaler', StandardScaler(copy=True, 
with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None,
dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, 
penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False))]) is not
supported

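I did the following to get it to work:

# Transformers expect a 2D batch, so wrap the sample in a batch of one
doc_raw = np.expand_dims(X_test[random_sample], axis=0)
# Scale it, then drop the batch axis again before explaining
doc = np.squeeze(scaler.transform(doc_raw))
eli5.explain_prediction(lr, doc, feature_names=fnames, target_names=tnames)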

Version: 0.8
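
The same workaround can be sketched for a pipeline with any number of preprocessing steps. This is an illustration rather than an eli5 feature: `split_pipeline` and `transform_single_doc` are hypothetical helpers, and they assume that the pipeline exposes `steps` as a list of (name, estimator) pairs and that every transformer accepts a list containing one sample (which, as discussed earlier in this thread, is not true for transformers that expect a pandas DataFrame).

```python
def split_pipeline(pipeline):
    """Split a fitted pipeline into (list_of_transformers, final_estimator)."""
    estimators = [step for _name, step in pipeline.steps]
    return estimators[:-1], estimators[-1]


def transform_single_doc(transformers, doc):
    """Run one raw sample through every transformer and return one row.

    Transformers operate on batches, so the sample is wrapped into a batch
    of size one and unwrapped again at the end.
    """
    batch = [doc]
    for transformer in transformers:
        batch = transformer.transform(batch)
    return batch[0]


# Possible usage with the iris pipeline above:
# transformers, lr = split_pipeline(pipeline)
# doc = transform_single_doc(transformers, X_test[random_sample])
# eli5.show_prediction(lr, doc, feature_names=fnames, target_names=tnames)
```

The final estimator is explained on its own, and the document it receives has already been through the preprocessing steps, so the Pipeline object itself never reaches eli5.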
