Combine SMOTENC and TomekLink and Classifier together in a pipeline for Mixed Datatype Datasets #1082

Open
Sehjbir opened this issue May 22, 2024 · 0 comments

Sehjbir commented May 22, 2024

Description:

I have a dataset that contains both numeric and categorical variables, and I want to combine over-sampling and under-sampling. SMOTETomek is only applicable to purely numeric datasets, so I chained SMOTENC and TomekLinks in a pipeline instead.

Code Snippet:

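# Imports assumed; the original snippet omitted them.
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import make_pipeline  # scikit-learn's make_pipeline rejects samplers
from imblearn.under_sampling import TomekLinks
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# category_cols, data_train and target_train are defined elsewhere.
# Over-sample with SMOTENC, clean with TomekLinks, then fit the classifier.
model_oversampler_smotenc = make_pipeline(
    SMOTENC(random_state=44, categorical_features=category_cols),
    TomekLinks(sampling_strategy='auto'),
    GradientBoostingClassifier())

scoring = ['balanced_accuracy', 'f1', 'precision', 'recall']
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=3)
cv_results_oversampler_smotenc = cross_validate(
    model_oversampler_smotenc, data_train, target_train, scoring=scoring,
    return_train_score=True, return_estimator=True, cv=cv,
    n_jobs=-1)

print(
    f"Balanced accuracy mean +/- std. dev.: "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].mean():.3f} +/- "
    f"{cv_results_oversampler_smotenc['test_balanced_accuracy'].std():.3f}"
)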
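
For context on the TomekLinks step, here is a minimal pure-Python sketch of what a Tomek link is: two samples from different classes that are each other's nearest neighbour. The function `tomek_links` and the toy data are hypothetical illustrations, not imbalanced-learn's implementation, and the sketch assumes plain Euclidean distance on numeric features. With mixed data, how encoded categorical columns enter the distance is a separate assumption worth checking.

```python
from math import dist


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""
    def nearest(i):
        # Index of the closest other sample to sample i (Euclidean distance).
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    links = []
    for i, j in enumerate(nn):
        # Mutual nearest neighbours with different labels form a Tomek link.
        if i < j and nn[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links


# Hypothetical toy data: two tight same-class pairs plus one borderline pair.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.5, 0.5), (0.55, 0.5)]
y = [0, 0, 1, 1, 0, 1]

links = tomek_links(X, y)
print(links)  # [(4, 5)]
```

Only the borderline pair (samples 4 and 5) is reported, because the other mutual nearest neighbours share a label. An under-sampler built on this idea then drops one or both members of each link, depending on its sampling strategy.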

Questions:

  • Is this the right approach? If so, can I also use other under-samplers in the pipeline?
  • The code runs without any errors, but what is the underlying process?
  • If this logic is wrong, is there an alternative?
Sehjbir changed the title from "Combine SMOTENC and TomekLInk together in a pipeline for Mixed Datatype Datasets" to "Combine SMOTENC and TomekLink together in a pipeline for Mixed Datatype Datasets" on May 23, 2024
Sehjbir changed the title from "Combine SMOTENC and TomekLink together in a pipeline for Mixed Datatype Datasets" to "Combine SMOTENC and TomekLink and Classifier together in a pipeline for Mixed Datatype Datasets" on May 23, 2024