From 0c49bfb3df906098152f65c0678998f7713dbe63 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Fri, 1 Mar 2019 14:40:29 +0100 Subject: [PATCH 01/22] SLEP005: Outlier Rejection API --- index.rst | 1 + slep005/proposal.rst | 98 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 99 insertions(+) create mode 100644 slep005/proposal.rst diff --git a/index.rst b/index.rst index 9713a84..cbe75c6 100644 --- a/index.rst +++ b/index.rst @@ -26,6 +26,7 @@ slep002/proposal slep003/proposal slep004/proposal + slep005/proposal .. toctree:: :maxdepth: 1 diff --git a/slep005/proposal.rst b/slep005/proposal.rst new file mode 100644 index 0000000..33dcf4d --- /dev/null +++ b/slep005/proposal.rst @@ -0,0 +1,98 @@ +.. _slep_005: + +===================== +Outlier rejection API +===================== + +:Author: Oliver Raush (oliverrausch99@gmail.com), Guillaume Lemaitre (g.lemaitre58@gmail.com) +:Status: Draft +:Type: Standards Track +:Created: created on, in 2019-03-01 +:Resolution: <url> + +Abstract +-------- + +We propose a new mixin ``OutlierRejectionMixin`` implementing a +``fit_resample(X, y)`` method. This method will remove samples from +``X`` and ``y`` to get a outlier-free dataset. This method is also +handle in ``Pipeline``. + +Detailed description +-------------------- + +Fitting a machine learning model on an outlier-free dataset can be +beneficial. Currently, the family of outlier detection algorithms +allows to detect outliers using `estimator.fit_predict(X, y)`. However, +there is no mechanism to remove outliers without any manual step. It +is even impossible when a ``Pipeline`` is used. + +We propose the following changes: + +* implement an ``OutlierRejectionMixin``; +* this mixin add a method ``fit_resample(X, y)`` removing outliers + from ``X`` and ``y``; +* ``fit_resample`` should be handled in ``Pipeline``. 
+ +Implementation +-------------- + +API changes are implemented in +https://github.com/scikit-learn/scikit-learn/pull/13269 + +Estimator implementation +........................ + +The new mixin is implemented as:: + + class OutlierRejectionMixin: + _estimator_type = "outlier_rejector" + def fit_resample(self, X, y): + inliers = self.fit_predict(X) == 1 + return safe_mask(X, inliers), safe_mask(y, inliers) + +This will be used as follows for the outlier detection algorithms:: + + class IsolationForest(BaseBagging, OutlierMixin, OutlierRejectionMixin): + ... + +One can use the new algorithm with:: + + from sklearn.ensemble import IsolationForest + estimator = IsolationForest() + X_free, y_free = estimator.fit_resample(X, y) + +Pipeline implementation +....................... + +To handle outlier rejector in ``Pipeline``, we enforce the following: + +* an estimator cannot implement both ``fit_resample(X, y)`` and + ``fit_transform(X)`` / ``transform(X)``. +* ``fit_predict(X)`` (i.e., clustering methods) should not be called if an + outlier rejector is in the pipeline. + +Backward compatibility +---------------------- + +There is no backward incompatibilities with the current API. + +Discussion +---------- + +* https://github.com/scikit-learn/scikit-learn/pull/13269 + +References and Footnotes +------------------------ + +.. [1] Each SLEP must either be explicitly labeled as placed in the public + domain (see this SLEP as an example) or licensed under the `Open + Publication License`_. + +.. _Open Publication License: https://www.opencontent.org/openpub/ + + +Copyright +--------- + +This document has been placed in the public domain. 
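A note on the mixin shown above: ``sklearn.utils.safe_mask`` returns an indexing mask (converted to indices for sparse input), not the selected rows, so ``fit_resample`` as written would return masks rather than the resampled dataset. A runnable sketch of the intended behaviour — the ``RejectingIsolationForest`` subclass and the toy dataset are invented for illustration and are not part of the PR:

```python
import numpy as np
from sklearn.ensemble import IsolationForest


class OutlierRejectionMixin:
    # Mixin proposed in the patch above: rely on fit_predict, which returns
    # +1 for inliers and -1 for outliers, and keep only the inlier rows.
    _estimator_type = "outlier_rejector"

    def fit_resample(self, X, y):
        inliers = self.fit_predict(X) == 1
        return X[inliers], y[inliers]


class RejectingIsolationForest(OutlierRejectionMixin, IsolationForest):
    """Hypothetical subclass mixing rejection into an outlier detector."""


rng = np.random.RandomState(0)
# 100 regular samples plus one obvious outlier at (10, 10)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[10.0, 10.0]]])
y = np.r_[np.zeros(100), 1.0]

est = RejectingIsolationForest(contamination=0.05, random_state=0)
X_free, y_free = est.fit_resample(X, y)
```

``X_free`` and ``y_free`` stay aligned, with the flagged rows removed from both.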
[1]_ From 4ecc51bc00f113f8be9b6c4aad8ee477e3c9a02c Mon Sep 17 00:00:00 2001 From: Oliver Rausch <Oliverrausch99@gmail.com> Date: Sat, 2 Mar 2019 19:55:11 +0100 Subject: [PATCH 02/22] Update slep005/proposal.rst Co-Authored-By: glemaitre <g.lemaitre58@gmail.com> --- slep005/proposal.rst | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 33dcf4d..6df9eed 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -71,6 +71,18 @@ To handle outlier rejector in ``Pipeline``, we enforce the following: ``fit_transform(X)`` / ``transform(X)``. * ``fit_predict(X)`` (i.e., clustering methods) should not be called if an outlier rejector is in the pipeline. +* We propose that resamplers are only applied during fit time. Specifically, the pipeline will act as follows: +===================== ================================ +Method Resamplers applied +===================== ================================ +``fit`` Yes +``fit_transform`` Yes +``transform`` Yes +``fit_resample`` Yes +``predict`` No +``score`` No +``fit_predict`` not supported +===================== ================================ Backward compatibility ---------------------- From c855ffe16c14266a8241e37d04c3e3fcc32845ba Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Sat, 2 Mar 2019 23:09:04 +0100 Subject: [PATCH 03/22] Update slep --- slep005/proposal.rst | 123 +++++++++++++++++++++---------------------- 1 file changed, 61 insertions(+), 62 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 6df9eed..7f34af5 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -1,10 +1,12 @@ .. 
_slep_005: -===================== -Outlier rejection API -===================== +============= +Resampler API +============= -:Author: Oliver Raush (oliverrausch99@gmail.com), Guillaume Lemaitre (g.lemaitre58@gmail.com) +:Author: Oliver Raush (oliverrausch99@gmail.com), + Christos Aridas (char@upatras.gr), + Guillaume Lemaitre (g.lemaitre58@gmail.com) :Status: Draft :Type: Standards Track :Created: created on, in 2019-03-01 @@ -13,77 +15,74 @@ Outlier rejection API Abstract -------- -We propose a new mixin ``OutlierRejectionMixin`` implementing a -``fit_resample(X, y)`` method. This method will remove samples from -``X`` and ``y`` to get a outlier-free dataset. This method is also -handle in ``Pipeline``. +We propose the inclusion of a new type of estimator: resampler. The +resampler will change the samples in ``X`` and ``y``. In short: -Detailed description --------------------- +* resamplers will reduce or augment the number of samples in ``X`` and + ``y``; +* ``Pipeline`` should treat them as a separate type of estimator. -Fitting a machine learning model on an outlier-free dataset can be -beneficial. Currently, the family of outlier detection algorithms -allows to detect outliers using `estimator.fit_predict(X, y)`. However, -there is no mechanism to remove outliers without any manual step. It -is even impossible when a ``Pipeline`` is used. +Motivation +---------- -We propose the following changes: +Sample reduction or augmentation are part of machine-learning +pipeline. The current scikit-learn API does not offer support for such +use cases. -* implement an ``OutlierRejectionMixin``; -* this mixin add a method ``fit_resample(X, y)`` removing outliers - from ``X`` and ``y``; -* ``fit_resample`` should be handled in ``Pipeline``. +Two possible use cases are currently reported: +* sample rebalancing to correct bias toward class with large cardinality; +* outlier rejection to fit a clean dataset. 
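As a sketch of the first use case, a hypothetical ``fit_resample`` for majority-class undersampling could look like the following plain function (illustrative only, not part of the referenced PR):

```python
import numpy as np


def fit_resample_undersample(X, y, random_state=0):
    # Hypothetical resampler: balance classes by randomly undersampling
    # every class down to the size of the smallest one.
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original sample order
    return X[keep], y[keep]


X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # 8-vs-2 class imbalance
Xt, yt = fit_resample_undersample(X, y)
```

The returned dataset is balanced (two samples per class) while ``X`` and ``y`` themselves are untouched.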
+ Implementation -------------- -API changes are implemented in -https://github.com/scikit-learn/scikit-learn/pull/13269 - -Estimator implementation -........................ - -The new mixin is implemented as:: - - class OutlierRejectionMixin: - _estimator_type = "outlier_rejector" - def fit_resample(self, X, y): - inliers = self.fit_predict(X) == 1 - return safe_mask(X, inliers), safe_mask(y, inliers) +To handle outlier rejector in ``Pipeline``, we enforce the following: -This will be used as follows for the outlier detection algorithms:: +* an estimator cannot implement both ``fit_resample(X, y)`` and + ``fit_transform(X)`` / ``transform(X)``. If both are implemented, + ``Pipeline`` will not be able to know which of the two methods to + call. +* resamplers are only applied during ``fit``. Otherwise, scoring will + be harder. Specifically, the pipeline will act as follows: - class IsolationForest(BaseBagging, OutlierMixin, OutlierRejectionMixin): - ... - -One can use the new algorithm with:: + ===================== ================================ + Method Resamplers applied + ===================== ================================ + ``fit`` Yes + ``fit_transform`` Yes + ``fit_resample`` Yes + ``transform`` No + ``predict`` No + ``score`` No + ``fit_predict`` not supported + ===================== ================================ + +* ``fit_predict(X)`` (i.e., clustering methods) should not be called + if an outlier rejector is in the pipeline. The output will be of + different size than ``X`` breaking metric computation. +* in a supervised scheme, resampler will need to validate which type + of target is passed. Up to our knowledge, supervised are used for + binary and multiclass classification. - from sklearn.ensemble import IsolationForest - estimator = IsolationForest() - X_free, y_free = estimator.fit_resample(X, y) +Alternative implementation +.......................... -Pipeline implementation -....................... 
+Alternatively ``sample_weight`` could be used as a placeholder to +perform resampling. However, the current limitations are: -To handle outlier rejector in ``Pipeline``, we enforce the following: - -* an estimator cannot implement both ``fit_resample(X, y)`` and - ``fit_transform(X)`` / ``transform(X)``. -* ``fit_predict(X)`` (i.e., clustering methods) should not be called if an - outlier rejector is in the pipeline. -* We propose that resamplers are only applied during fit time. Specifically, the pipeline will act as follows: -===================== ================================ -Method Resamplers applied -===================== ================================ -``fit`` Yes -``fit_transform`` Yes -``transform`` Yes -``fit_resample`` Yes -``predict`` No -``score`` No -``fit_predict`` not supported -===================== ================================ +* ``sample_weight`` is not available for all estimators; +* ``sample_weight`` will implement only sample reductions; +* ``sample_weight`` can be applied at both fit and predict time; +* ``sample_weight`` need to be passed and modified within a + ``Pipeline``. + +Current implementation +...................... +* Outlier rejection are implemented in: + https://github.com/scikit-learn/scikit-learn/pull/13269 + Backward compatibility ---------------------- @@ -92,7 +91,7 @@ There is no backward incompatibilities with the current API. 
Discussion ---------- -* https://github.com/scikit-learn/scikit-learn/pull/13269 +* https://github.com/scikit-learn/scikit-learn/pull/13269{ References and Footnotes ------------------------ From c16ef7b21c20ada88d59baa475e42b610f520643 Mon Sep 17 00:00:00 2001 From: Adrin Jalali <adrin.jalali@gmail.com> Date: Tue, 5 Mar 2019 13:23:29 +0100 Subject: [PATCH 04/22] Update slep005/proposal.rst Co-Authored-By: glemaitre <g.lemaitre58@gmail.com> --- slep005/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 7f34af5..7848635 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -91,7 +91,7 @@ There is no backward incompatibilities with the current API. Discussion ---------- -* https://github.com/scikit-learn/scikit-learn/pull/13269{ +* https://github.com/scikit-learn/scikit-learn/pull/13269 References and Footnotes ------------------------ From e2f6a7059ec2f04681949d953313949736e76df9 Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Tue, 25 Jun 2019 23:51:53 +0200 Subject: [PATCH 05/22] Update proposal based on discussion - removed the proposal for pipeline modification - added some more usecases - added a description of the api and the constraints --- slep005/proposal.rst | 87 +++++++++++++++++++++++--------------------- 1 file changed, 45 insertions(+), 42 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 7848635..685f87f 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -4,7 +4,7 @@ Resampler API ============= -:Author: Oliver Raush (oliverrausch99@gmail.com), +:Author: Oliver Rausch (oliverrausch99@gmail.com), Christos Aridas (char@upatras.gr), Guillaume Lemaitre (g.lemaitre58@gmail.com) :Status: Draft @@ -18,53 +18,57 @@ Abstract We propose the inclusion of a new type of estimator: resampler. The resampler will change the samples in ``X`` and ``y``. 
In short: -* resamplers will reduce or augment the number of samples in ``X`` and - ``y``; -* ``Pipeline`` should treat them as a separate type of estimator. +* resamplers will reduce and/or augment the number of samples in ``X`` and + ``y`` during ``fit``, but will perform no changes during ``predict``. +* a new verb/method that all resamplers must implement is introduced: ``fit_resample``. +* A new meta-estimator, ``ResampledTrainer``, that allows for the composition of + resamplers and estimators is proposed. + Motivation ---------- -Sample reduction or augmentation are part of machine-learning -pipeline. The current scikit-learn API does not offer support for such +Sample reduction or augmentation are common parts of machine-learning +pipelines. The current scikit-learn API does not offer support for such use cases. -Two possible use cases are currently reported: +Usecases +........ + +* sample rebalancing to correct bias toward class with large cardinality +* outlier rejection to fit a clean dataset +* representing a dataset by generating centroids of clustering methods. +* adding unlabeled samples to a dataset during semi-supervised fit time for + cross validation (simply passing a semi-supervised dataset to cross validation + methods doesn't work since the cross validation will treat the label -1 as a + separate class). Alternative approach is a new cv splitter. -* sample rebalancing to correct bias toward class with large cardinality; -* outlier rejection to fit a clean dataset. - Implementation -------------- +API and Constraints +................... +Resamplers implement a method ``fit_resample(X, y)``, a pure function which +returns ``Xt, yt`` corresponding to the resampled dataset, where samples may +have been added and/or removed. + +Resamplers cannot be transformers, that is, a resampler cannot implement +``fit_transform`` or ``transform``. Similarly, transformers cannot implement ``fit_resample``. 
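A hypothetical helper can illustrate the exclusivity rule above: an estimator exposes ``fit_resample`` or the transformer methods, never both, so a composite estimator can always decide which method to call. The helper and the toy resampler are invented for illustration:

```python
from sklearn.preprocessing import StandardScaler


def estimator_kind(est):
    # Mutual-exclusivity check sketched from the constraint above.
    is_resampler = hasattr(est, "fit_resample")
    is_transformer = hasattr(est, "fit_transform") or hasattr(est, "transform")
    if is_resampler and is_transformer:
        raise TypeError(
            "estimator implements both fit_resample and (fit_)transform; "
            "a composite could not choose which method to call"
        )
    return "resampler" if is_resampler else "transformer or predictor"


class ToyResampler:
    # minimal resampler: implements only fit_resample
    def fit_resample(self, X, y):
        return X, y


kind_scaler = estimator_kind(StandardScaler())
kind_resampler = estimator_kind(ToyResampler())
```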
+ +Resamplers may not change the order, meaning or format of features (This is left +to Transformers). + +ResampledTrainer +................ +This metaestimator composes a resampler and a predictor. It +behaves as follows: + + ``fit(X, y)``: resample ``X, y`` with the resampler, then fit on the resampled + dataset. +* ``predict(X)``: simply predict on ``X`` with the predictor. +* ``score(X)``: simply score on ``X`` with the predictor. + +See PR #13269 for an implementation. -To handle outlier rejector in ``Pipeline``, we enforce the following: - -* an estimator cannot implement both ``fit_resample(X, y)`` and - ``fit_transform(X)`` / ``transform(X)``. If both are implemented, - ``Pipeline`` will not be able to know which of the two methods to - call. -* resamplers are only applied during ``fit``. Otherwise, scoring will - be harder. Specifically, the pipeline will act as follows: - - ===================== ================================ - Method Resamplers applied - ===================== ================================ - ``fit`` Yes - ``fit_transform`` Yes - ``fit_resample`` Yes - ``transform`` No - ``predict`` No - ``score`` No - ``fit_predict`` not supported - ===================== ================================ - -* ``fit_predict(X)`` (i.e., clustering methods) should not be called - if an outlier rejector is in the pipeline. The output will be of - different size than ``X`` breaking metric computation. -* in a supervised scheme, resampler will need to validate which type - of target is passed. Up to our knowledge, supervised are used for - binary and multiclass classification. - Alternative implementation .......................... @@ -76,13 +80,12 @@ perform resampling. However, the current limitations are: * ``sample_weight`` can be applied at both fit and predict time; * ``sample_weight`` need to be passed and modified within a ``Pipeline``. - + Current implementation ...................... 
-* Outlier rejection are implemented in: - https://github.com/scikit-learn/scikit-learn/pull/13269 - +https://github.com/scikit-learn/scikit-learn/pull/13269 + Backward compatibility ---------------------- From 8f8ebb6b81e7a619a09e6aa6e33a2ffe3582dbf1 Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Wed, 26 Jun 2019 00:08:44 +0200 Subject: [PATCH 06/22] Reword semisupervised usecase --- slep005/proposal.rst | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 685f87f..dd4448d 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -32,16 +32,17 @@ Sample reduction or augmentation are common parts of machine-learning pipelines. The current scikit-learn API does not offer support for such use cases. -Usecases -........ +Possible Usecases +................. * sample rebalancing to correct bias toward class with large cardinality * outlier rejection to fit a clean dataset * representing a dataset by generating centroids of clustering methods. -* adding unlabeled samples to a dataset during semi-supervised fit time for - cross validation (simply passing a semi-supervised dataset to cross validation - methods doesn't work since the cross validation will treat the label -1 as a - separate class). Alternative approach is a new cv splitter. +* currently semi-supervised learning is not supported by scoring-based + functions like ``cross_val_score``, ``GridSearchCV`` or ``validation_curve`` + since the scorers will regard "unlabeled" as a separate class. A resampler + could add the unlabeled samples to the dataset during fit time to solve this + (note that this can also be solved by a new cv splitter). 
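The semi-supervised use case above could be served by a resampler along these lines (class name and behaviour are illustrative only; ``-1`` is the conventional "unlabeled" marker used by scikit-learn's semi-supervised estimators):

```python
import numpy as np


class UnlabeledAugmenter:
    # Hypothetical resampler: append stored unlabeled samples at fit time,
    # labeling them -1. Scoring never sees them because resampling is
    # proposed to happen only during fit.
    def __init__(self, X_unlabeled):
        self.X_unlabeled = X_unlabeled

    def fit_resample(self, X, y):
        Xt = np.vstack([X, self.X_unlabeled])
        yt = np.concatenate(
            [y, np.full(len(self.X_unlabeled), -1, dtype=y.dtype)]
        )
        return Xt, yt


X_lab = np.array([[0.0], [1.0], [2.0]])
y_lab = np.array([0, 1, 1])
X_unl = np.array([[0.5], [1.5]])

Xt, yt = UnlabeledAugmenter(X_unl).fit_resample(X_lab, y_lab)
```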
Implementation
--------------

From ae03400215adfb0dc210f8a20b11e78fe775c20e Mon Sep 17 00:00:00 2001
From: Oliver Rausch <oliverrausch99@gmail.com>
Date: Wed, 26 Jun 2019 21:07:16 +0200
Subject: [PATCH 07/22] Add description of first few pipeline methods

---
 slep005/proposal.rst | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/slep005/proposal.rst b/slep005/proposal.rst
index dd4448d..8439aaa 100644
--- a/slep005/proposal.rst
+++ b/slep005/proposal.rst
@@ -43,6 +43,7 @@ Possible Usecases
   since the scorers will regard "unlabeled" as a separate class. A resampler
   could add the unlabeled samples to the dataset during fit time to solve this
   (note that this can also be solved by a new cv splitter).
+* Dataset augmentation (very common in vision problems)
 
 Implementation
 --------------
@@ -55,8 +56,10 @@ have been added and/or removed.
 Resamplers cannot be transformers, that is, a resampler cannot implement
 ``fit_transform`` or ``transform``. Similarly, transformers cannot implement ``fit_resample``.
 
-Resamplers may not change the order, meaning or format of features (This is left
-to Transformers).
+Resamplers may not change the order, meaning, dtype or format of features (this is left
+to transformers).
+
+Resamplers should also resample any kwargs that are array-like and have the same `shape[0]` as `X` and `y`.
 
 ResampledTrainer
 ................
@@ -70,6 +73,41 @@ behaves as follows:
 
 See PR #13269 for an implementation.
 
+Modifying Pipeline
+..................
+As an alternative to ``ResampledTrainer``, ``Pipeline`` could be modified to
+accommodate resamplers.
+The functionality is described in terms of the head (all stages except the last)
+and the tail (the last stage) of the ``Pipeline``. Note that we assume
+resamplers and transformers are exclusive so that the pipeline can decide which
+method to call. Further note that ``Xt, yt`` are the outputs of the stage, and
+``X, y`` are the inputs to the stage.
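The ``fit``-time head/tail dispatch described above can be sketched as a standalone function (a simplified stand-in for the proposed ``Pipeline`` change; ``DropFirstRow`` is a toy resampler invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def fit_pipeline_with_resamplers(steps, X, y):
    # Sketch of the ``fit`` rules above: head resamplers call fit_resample,
    # head transformers call fit_transform, and the tail estimator is fit on
    # whatever data reaches it. Resampling never happens at predict time.
    *head, tail = steps
    for est in head:
        if hasattr(est, "fit_resample"):
            X, y = est.fit_resample(X, y)
        else:
            X = est.fit_transform(X, y)
    return tail.fit(X, y)


class DropFirstRow:
    # toy resampler, for illustration only
    def fit_resample(self, X, y):
        return X[1:], y[1:]


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = fit_pipeline_with_resamplers(
    [DropFirstRow(), StandardScaler(), LogisticRegression()], X, y
)
```

The tail classifier ends up fitted on the three resampled, scaled rows.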
+ +``fit``: + head for resamplers: `Xt, yt = est.fit_resample(X, y)` + head for transformers: `Xt, yt = est.fit_transform(X, y)` + tail for transformers and predictors: `est.fit(X, y)` + tail for resamplers: `pass` + +``fit_transform``: + Equivalent to `fit(X, y).transform(X)` overall + +``predict`` + head for resamplers: `Xt = X` + head for transformers: `Xt = est.transform(X)` + tail for predictors: `return est.predict(X)` + tail for transformers and resamplers: `error` + +``transform`` + head for resamplers: `Xt = X` + head for transformers: `Xt = est.transform(X)` + tail for predictors and resamplers: `error` + tail for transformers: `return est.transform(X)` + +``score`` + see predict + + Alternative implementation .......................... From 10c85ff580ae2aa9b753a5f3d08781570e90db01 Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Wed, 26 Jun 2019 21:13:33 +0200 Subject: [PATCH 08/22] Add code examples --- slep005/proposal.rst | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 8439aaa..0725aaf 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -73,6 +73,19 @@ behaves as follows: See PR #13269 for an implementation. +Example Usage +""""""""""""" +:: + est = ResamplingTrainer(RandomUnderSampler(), SVC()) + est = make_pipeline( + StandardScaler(), + ResamplingTrainer(Birch(), make_pipeline(SelectKBest(), SVC())) + ) + est = ResamplingTrainer( + RandomUnderSampler(), + make_pipeline(StandardScaler(), SelectKBest(), SVC()), + ) + Modifying Pipeline .................. As an alternative to ``ResampledTrainer``, ``Pipeline`` could be modified to @@ -107,6 +120,12 @@ method to call. 
Further note that ``Xt, yt`` are the outputs of the stage, and ``score`` see predict +Example Usage:: + est = make_pipeline(RandomUnderSampler(), SVC()) + est = make_pipeline(StandardScaler(), Birch(), SelectKBest(), SVC()) + est = make_pipeline( + RandomUnderSampler(), StandardScaler(), SelectKBest(), SVC() + ) Alternative implementation .......................... From c39d615439746c785f8900d242de9f2c58a79d8b Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Wed, 26 Jun 2019 21:16:20 +0200 Subject: [PATCH 09/22] formatting --- slep005/proposal.rst | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 0725aaf..a64326e 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -73,9 +73,7 @@ behaves as follows: See PR #13269 for an implementation. -Example Usage -""""""""""""" -:: +Example Usage:: est = ResamplingTrainer(RandomUnderSampler(), SVC()) est = make_pipeline( StandardScaler(), From 2de0d4885005ec552ed106e3eb62b645cb473889 Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Thu, 27 Jun 2019 02:48:16 +0200 Subject: [PATCH 10/22] Formatting and cleanup --- index.rst | 2 +- slep005/proposal.rst | 82 +++++++++++++++++++++++++++++--------------- 2 files changed, 55 insertions(+), 29 deletions(-) diff --git a/index.rst b/index.rst index cbe75c6..68c4028 100644 --- a/index.rst +++ b/index.rst @@ -10,6 +10,7 @@ :caption: Under review under_review + slep005/proposal .. toctree:: :maxdepth: 1 @@ -26,7 +27,6 @@ slep002/proposal slep003/proposal slep004/proposal - slep005/proposal .. toctree:: :maxdepth: 1 diff --git a/slep005/proposal.rst b/slep005/proposal.rst index a64326e..0317d5f 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -18,11 +18,15 @@ Abstract We propose the inclusion of a new type of estimator: resampler. The resampler will change the samples in ``X`` and ``y``. 
In short:
 
-* resamplers will reduce and/or augment the number of samples in ``X`` and
-  ``y`` during ``fit``, but will perform no changes during ``predict``.
-* a new verb/method that all resamplers must implement is introduced: ``fit_resample``.
-* A new meta-estimator, ``ResampledTrainer``, that allows for the composition of
-  resamplers and estimators is proposed.
+* a new verb/method that all resamplers must implement is introduced:
+  ``fit_resample``.
+* resamplers are able to reduce and/or augment the number of samples in
+  ``X`` and ``y`` during ``fit``, but will perform no changes during
+  ``predict``.
+* to facilitate this behavior a new meta-estimator (``ResampledTrainer``) that
+  allows for the composition of resamplers and estimators is proposed.
+  Alternatively we propose changes to ``Pipeline`` that also enable similar
+  compositions.
 
 
 Motivation
 ----------
@@ -35,34 +39,41 @@ use cases.
 Possible Usecases
 .................
 
-* sample rebalancing to correct bias toward class with large cardinality
-* outlier rejection to fit a clean dataset
-* representing a dataset by generating centroids of clustering methods.
-* currently semi-supervised learning is not supported by scoring-based
+* Sample rebalancing to correct bias toward a class with large cardinality.
+* Outlier rejection to fit a clean dataset.
+* Sample reduction e.g. representing a dataset by its k-means centroids.
+* Currently semi-supervised learning is not supported by scoring-based
   functions like ``cross_val_score``, ``GridSearchCV`` or ``validation_curve``
   since the scorers will regard "unlabeled" as a separate class. A resampler
  could add the unlabeled samples to the dataset during fit time to solve this
-  (note that this can also be solved by a new cv splitter).
+  (note that this could also be solved by a new cv splitter).
+* NaNRejector (drop all samples that contain nan).
+* Dataset augmentation (as is commonly done in deep learning).
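``NaNRejector``, named in the list above, does not exist in scikit-learn; a minimal sketch of what such a resampler could look like:

```python
import numpy as np


class NaNRejector:
    # Hypothetical resampler: drop every sample (row) containing a NaN,
    # keeping X and y aligned.
    def fit_resample(self, X, y):
        keep = ~np.isnan(X).any(axis=1)
        return X[keep], y[keep]


X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([0, 1, 0])
Xt, yt = NaNRejector().fit_resample(X, y)
```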
Implementation -------------- + API and Constraints ................... -Resamplers implement a method ``fit_resample(X, y)``, a pure function which -returns ``Xt, yt`` corresponding to the resampled dataset, where samples may -have been added and/or removed. -Resamplers cannot be transformers, that is, a resampler cannot implement -``fit_transform`` or ``transform``. Similarly, transformers cannot implement ``fit_resample``. +* Resamplers implement a method ``fit_resample(X, y, **kwargs)``, a pure function which + returns ``Xt, yt, kwargs`` corresponding to the resampled dataset, where + samples may have been added and/or removed. +* An estimator may only implement either ``fit_transform`` or ``fit_resample``. +* Resamplers may not change the order, meaning, dtype or format of features + (this is left to transformers). +* Resamplers should also resample any kwargs. -Resamplers may not change the order, meaning, dtype or format of features (this is left -to transformers). +Composition +----------- -Resamplers should also resample any kwargs that are array-like and have the same `shape[0]` as `X` and `y`. +An key part of the proposal is the introduction of a way of composing resamplers +with predictors. We present two options: ``ResampledTrainer`` and modifications +to ``Pipeline``. ResampledTrainer ................ + This metaestimator composes a resampler and a predictor. It behaves as follows: @@ -73,7 +84,10 @@ behaves as follows: See PR #13269 for an implementation. -Example Usage:: +Example Usage: + +.. code-block:: python + est = ResamplingTrainer(RandomUnderSampler(), SVC()) est = make_pipeline( StandardScaler(), @@ -83,6 +97,11 @@ Example Usage:: RandomUnderSampler(), make_pipeline(StandardScaler(), SelectKBest(), SVC()), ) + clf = ResampledTrainer( + NaNRejector(), # removes samples containing NaN + ResampledTrainer(RandomUnderSampler(), + make_pipeline(StandardScaler(), SGDClassifier())) + ) Modifying Pipeline .................. 
@@ -91,17 +110,17 @@ accommodate resamplers.
 The functionality is described in terms of the head (all stages except the last)
 and the tail (the last stage) of the ``Pipeline``. Note that we assume
 resamplers and transformers are exclusive so that the pipeline can decide which
-method to call. Further note that ``Xt, yt`` are the outputs of the stage, and
-``X, y`` are the inputs to the stage.
+method to call. Further note that ``Xt, yt, kwt`` are the outputs of the stage, and
+``X, y, **kw`` are the inputs to the stage.
 
 ``fit``:
-  head for resamplers: `Xt, yt = est.fit_resample(X, y)`
-  head for transformers: `Xt, yt = est.fit_transform(X, y)`
-  tail for transformers and predictors: `est.fit(X, y)`
-  tail for resamplers: `pass`
+  head for resamplers: `Xt, yt, kwt = est.fit_resample(X, y, **kw)`.
+  head for transformers: `Xt, yt = est.fit_transform(X, y, **kw)`.
+  tail for transformers and predictors: `est.fit(X, y, **kw)`.
+  tail for resamplers: `pass`.
 
 ``fit_transform``:
-  Equivalent to `fit(X, y).transform(X)` overall
+  Equivalent to `fit(X, y).transform(X)` overall.
 
 ``predict``
   head for resamplers: `Xt = X`
   head for transformers: `Xt = est.transform(X)`
   tail for predictors: `return est.predict(X)`
   tail for transformers and resamplers: `error`
 
 ``transform``
   head for resamplers: `Xt = X`
   head for transformers: `Xt = est.transform(X)`
   tail for predictors and resamplers: `error`
   tail for transformers: `return est.transform(X)`
 
 ``score``
   see predict
 
-Example Usage::
+Example Usage:
+
+.. code-block:: python
+
   est = make_pipeline(RandomUnderSampler(), SVC())
   est = make_pipeline(StandardScaler(), Birch(), SelectKBest(), SVC())
   est = make_pipeline(
       RandomUnderSampler(), StandardScaler(), SelectKBest(), SVC()
   )
+  est = make_pipeline(
+      NaNRejector(), RandomUnderSampler(), StandardScaler(), SGDClassifier()
+  )
 
 Alternative implementation
 ..........................
From 387b338f0e6eb7fadf1a98852d34095648dbe463 Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Thu, 27 Jun 2019 02:51:00 +0200 Subject: [PATCH 11/22] even more formatting --- slep005/proposal.rst | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 0317d5f..f7f870b 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -77,7 +77,7 @@ ResampledTrainer This metaestimator composes a resampler and a predictor. It behaves as follows: - ``fit(X, y)``: resample ``X, y`` with the resampler, then fit on the resampled +* ``fit(X, y)``: resample ``X, y`` with the resampler, then fit on the resampled dataset. * ``predict(X)``: simply predict on ``X`` with the predictor. * ``score(X)``: simply score on ``X`` with the predictor. @@ -114,25 +114,25 @@ method to call. Further note that ``Xt, yt, kwt`` are the outputs of the stage, ``X, y, **kw`` are the inputs to the stage. ``fit``: - head for resamplers: `Xt, yt, kwt = est.fit_resample(X, y, **kw)`. - head for transformers: `Xt, yt = est.fit_transform(X, y, **kw)`. - tail for transformers and predictors: `est.fit(X, y, **kw)`. - tail for resamplers: `pass`. + head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)``. + head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)``. + tail for transformers and predictors: ``est.fit(X, y, **kw)``. + tail for resamplers: ``pass``. ``fit_transform``: - Equivalent to `fit(X, y).transform(X)` overall. + Equivalent to ``fit(X, y).transform(X)`` overall. 
``predict`` - head for resamplers: `Xt = X` - head for transformers: `Xt = est.transform(X)` - tail for predictors: `return est.predict(X)` - tail for transformers and resamplers: `error` + head for resamplers: ``Xt = X`` + head for transformers: ``Xt = est.transform(X)`` + tail for predictors: ``return est.predict(X)`` + tail for transformers and resamplers: ``error`` ``transform`` - head for resamplers: `Xt = X` - head for transformers: `Xt = est.transform(X)` - tail for predictors and resamplers: `error` - tail for transformers: `return est.transform(X)` + head for resamplers: ``Xt = X`` + head for transformers: ``Xt = est.transform(X)`` + tail for predictors and resamplers: ``error`` + tail for transformers: ``return est.transform(X)`` ``score`` see predict From 5ecfead026aa618288baf4fd893d5dec37fab5ec Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Thu, 27 Jun 2019 02:52:37 +0200 Subject: [PATCH 12/22] more formatting --- slep005/proposal.rst | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index f7f870b..8fdc3e5 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -114,25 +114,25 @@ method to call. Further note that ``Xt, yt, kwt`` are the outputs of the stage, ``X, y, **kw`` are the inputs to the stage. ``fit``: - head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)``. - head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)``. - tail for transformers and predictors: ``est.fit(X, y, **kw)``. - tail for resamplers: ``pass``. +* head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)``. +* head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)``. +* tail for transformers and predictors: ``est.fit(X, y, **kw)``. +* tail for resamplers: ``pass``. ``fit_transform``: - Equivalent to ``fit(X, y).transform(X)`` overall. +* Equivalent to ``fit(X, y).transform(X)`` overall. 
``predict`` - head for resamplers: ``Xt = X`` - head for transformers: ``Xt = est.transform(X)`` - tail for predictors: ``return est.predict(X)`` - tail for transformers and resamplers: ``error`` +* head for resamplers: ``Xt = X`` +* head for transformers: ``Xt = est.transform(X)`` +* tail for predictors: ``return est.predict(X)`` +* tail for transformers and resamplers: ``error`` ``transform`` - head for resamplers: ``Xt = X`` - head for transformers: ``Xt = est.transform(X)`` - tail for predictors and resamplers: ``error`` - tail for transformers: ``return est.transform(X)`` +* head for resamplers: ``Xt = X`` +* head for transformers: ``Xt = est.transform(X)`` +* tail for predictors and resamplers: ``error`` +* tail for transformers: ``return est.transform(X)`` ``score`` see predict From e7faa6ee3151540d5759c04c4d308ae1bc18792e Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Thu, 27 Jun 2019 02:54:57 +0200 Subject: [PATCH 13/22] try these headings --- slep005/proposal.rst | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 8fdc3e5..58921de 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -113,28 +113,33 @@ resamplers and transformers are exclusive so that the pipeline can decide which method to call. Further note that ``Xt, yt, kwt`` are the outputs of the stage, and ``X, y, **kw`` are the inputs to the stage. -``fit``: +fit +~~~ * head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)``. * head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)``. * tail for transformers and predictors: ``est.fit(X, y, **kw)``. * tail for resamplers: ``pass``. -``fit_transform``: +fit_transform +~~~~~~~~~~~~~ * Equivalent to ``fit(X, y).transform(X)`` overall. 
-``predict`` +predict +~~~~~~~ * head for resamplers: ``Xt = X`` * head for transformers: ``Xt = est.transform(X)`` * tail for predictors: ``return est.predict(X)`` * tail for transformers and resamplers: ``error`` -``transform`` +transform +~~~~~~~~~ * head for resamplers: ``Xt = X`` * head for transformers: ``Xt = est.transform(X)`` * tail for predictors and resamplers: ``error`` * tail for transformers: ``return est.transform(X)`` -``score`` +score +~~~~~ see predict Example Usage: From a4019ed832ffe833d8d61999765005e79efb8b41 Mon Sep 17 00:00:00 2001 From: Oliver Rausch <oliverrausch99@gmail.com> Date: Thu, 27 Jun 2019 02:55:45 +0200 Subject: [PATCH 14/22] last one --- slep005/proposal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 58921de..025ae8d 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -140,7 +140,7 @@ transform score ~~~~~ - see predict +* see predict Example Usage: From 5ddc6f9c8b72a63c1222b9c767a0380fdfec2283 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Wed, 3 Jul 2019 11:23:40 +0200 Subject: [PATCH 15/22] minor rephrasing --- slep005/proposal.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 025ae8d..4af1f94 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -163,8 +163,8 @@ Alternatively ``sample_weight`` could be used as a placeholder to perform resampling. However, the current limitations are: * ``sample_weight`` is not available for all estimators; -* ``sample_weight`` will implement only sample reductions; -* ``sample_weight`` can be applied at both fit and predict time; +* ``sample_weight`` will implement only simple resampling (only when resampling + uses original samples); * ``sample_weight`` need to be passed and modified within a ``Pipeline``. 
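[Editor's sketch] The head/tail dispatch that the patches above reformat can be illustrated with a toy pipeline. This is not scikit-learn's ``Pipeline`` and not the code of PR #13269: ``MiniPipeline``, ``EveryOther``, ``AddOne`` and ``MeanModel`` are hypothetical names, resamplers are detected by the presence of ``fit_resample``, and the ``kwt`` kwarg routing from the table is omitted for brevity.

```python
class MiniPipeline:
    """Toy sketch of the head/tail dispatch described in the SLEP."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for est in self.steps[:-1]:               # head
            if hasattr(est, "fit_resample"):
                X, y = est.fit_resample(X, y)     # resample the training set
            else:
                X = est.fit_transform(X, y)       # ordinary transformer
        tail = self.steps[-1]
        if not hasattr(tail, "fit_resample"):     # tail resampler: pass
            tail.fit(X, y)
        return self

    def predict(self, X):
        for est in self.steps[:-1]:               # head: resamplers pass through
            if not hasattr(est, "fit_resample"):
                X = est.transform(X)
        return self.steps[-1].predict(X)


# Hypothetical toy steps, only to exercise the dispatch.
class EveryOther:
    def fit_resample(self, X, y):
        return X[::2], y[::2]                     # keep every second sample

class AddOne:
    def fit_transform(self, X, y=None):
        return self.transform(X)
    def transform(self, X):
        return [x + 1 for x in X]

class MeanModel:
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)              # trained on resampled y
        return self
    def predict(self, X):
        return [self.mean_] * len(X)


pipe = MiniPipeline([EveryOther(), AddOne(), MeanModel()])
pipe.fit([1, 2, 3, 4], [10, 20, 30, 40])   # fits on the resampled half only
preds = pipe.predict([5, 6])               # resampler is skipped at predict time
```

At predict time only the transformer runs, matching the "head for resamplers: ``Xt = X``" rule above.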
From cde164b52ce7dd82e7680b598bd8ada8e7989da7 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Wed, 3 Jul 2019 11:32:58 +0200 Subject: [PATCH 16/22] address comments --- slep005/proposal.rst | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 4af1f94..4a33cb4 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -8,7 +8,6 @@ Resampler API Christos Aridas (char@upatras.gr), Guillaume Lemaitre (g.lemaitre58@gmail.com) :Status: Draft -:Type: Standards Track :Created: created on, in 2019-03-01 :Resolution: <url> @@ -16,7 +15,8 @@ Abstract -------- We propose the inclusion of a new type of estimator: resampler. The -resampler will change the samples in ``X`` and ``y``. In short: +resampler will change the samples in ``X`` and ``y`` and return both +``Xt`` and ``yt``. In short: * a new verb/method that all resamplers must implement is introduced: ``fit_resample``. @@ -85,6 +85,7 @@ behaves as follows: See PR #13269 for an implementation. Example Usage: +~~~~~~~~~~~~~~ .. code-block:: python @@ -105,6 +106,7 @@ Example Usage: Modifying Pipeline .................. + As an alternative to ``ResampledTrainer``, ``Pipeline`` could be modified to accomodate resamplers. The functionality is described in terms of the head (all stages except the last) @@ -115,10 +117,10 @@ method to call. Further note that ``Xt, yt, kwt`` are the outputs of the stage, fit ~~~ -* head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)``. -* head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)``. -* tail for transformers and predictors: ``est.fit(X, y, **kw)``. -* tail for resamplers: ``pass``. 
+* head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)`` +* head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)`` +* tail for transformers and predictors: ``est.fit(X, y, **kw)`` +* tail for resamplers: ``pass`` fit_transform ~~~~~~~~~~~~~ @@ -143,6 +145,7 @@ score * see predict Example Usage: +~~~~~~~~~~~~~~ .. code-block:: python From e87fd7e906804843d3b7df6d80edf8e75c64a839 Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Wed, 3 Jul 2019 13:19:19 +0200 Subject: [PATCH 17/22] Apply suggestions from code review Co-Authored-By: Joel Nothman <joel.nothman@gmail.com> --- slep005/proposal.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 4a33cb4..59270f2 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -39,8 +39,8 @@ use cases. Possible Usecases ................. -* Sample rebalancing to correct bias toward class with large cardinality - outlier rejection to fit a clean dataset. +* Sample rebalancing to correct bias toward class with large cardinality. +* Outlier rejection to fit a clean dataset. * Sample reduction e.g. representing a dataset by its k-means centroids. * Currently semi-supervised learning is not supported by scoring-based functions like ``cross_val_score``, ``GridSearchCV`` or ``validation_curve`` @@ -168,8 +168,8 @@ perform resampling. However, the current limitations are: * ``sample_weight`` is not available for all estimators; * ``sample_weight`` will implement only simple resampling (only when resampling uses original samples); -* ``sample_weight`` need to be passed and modified within a - ``Pipeline``. +* ``sample_weight`` needs to be passed and modified within a + ``Pipeline``, which isn't possible without something like resamplers. Current implementation ...................... 
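[Editor's sketch] The ``fit_resample`` contract discussed above — a pure function returning the resampled ``Xt, yt`` together with resampled sample-aligned kwargs such as ``sample_weight`` — can be sketched as follows. ``NaNRowRejector`` is a hypothetical illustration (loosely in the spirit of the SLEP's outlier-rejection use case), not an estimator from scikit-learn or PR #13269.

```python
import math

class NaNRowRejector:
    """Hypothetical resampler following the proposed contract: a pure
    fit_resample(X, y, **kw) that drops rows of X containing NaN and
    resamples every sample-aligned kwarg alongside X and y."""

    def fit_resample(self, X, y, **kw):
        keep = [i for i, row in enumerate(X)
                if not any(isinstance(v, float) and math.isnan(v) for v in row)]
        Xt = [X[i] for i in keep]
        yt = [y[i] for i in keep]
        # kwargs (e.g. sample_weight) are resampled with the same mask
        kwt = {name: [vals[i] for i in keep] for name, vals in kw.items()}
        return Xt, yt, kwt


X = [[1.0, 2.0], [float("nan"), 0.0], [3.0, 4.0]]
y = [0, 1, 0]
Xt, yt, kwt = NaNRowRejector().fit_resample(X, y, sample_weight=[0.5, 1.0, 2.0])
# Row 1 is dropped from X, y and sample_weight alike.
```

Note that, per the constraints above, the features themselves are untouched: only rows are removed.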
From ad4e94fdbe2b07f55b0e80d013fddc6c230cabe7 Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Wed, 3 Jul 2019 21:39:52 +1000 Subject: [PATCH 18/22] Some text about resampling pipelines and their issues --- slep005/proposal.rst | 126 +++++++++++++++++++++++++++---------------- 1 file changed, 80 insertions(+), 46 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 4a33cb4..7ffeef5 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -71,30 +71,44 @@ An key part of the proposal is the introduction of a way of composing resamplers with predictors. We present two options: ``ResampledTrainer`` and modifications to ``Pipeline``. -ResampledTrainer -................ +Alternative 1: ResampledTrainer +............................... This metaestimator composes a resampler and a predictor. It behaves as follows: -* ``fit(X, y)``: resample ``X, y`` with the resampler, then fit on the resampled - dataset. +* ``fit(X, y)``: resample ``X, y`` with the resampler, then fit the predictor + on the resampled dataset. * ``predict(X)``: simply predict on ``X`` with the predictor. * ``score(X)``: simply score on ``X`` with the predictor. See PR #13269 for an implementation. +One benefit of the ``ResampledTrainer`` is that it does not stop the resampler +having other methods, such as ``transform``, as it is clear that the +``ResampledTrainer`` will only call ``fit_resample``. + +There are complications around supporting ``fit_transform``, ``fit_predict`` +and ``fit_resample`` methods in ``ResampledTrainer``. ``fit_transform`` support +is only possible by implementing ``fit_transform(X, y)`` as ``fit(X, +y).transform(X)``, rather than calling ``fit_transform`` of the predictor. +``fit_predict`` would have to behave similarly. Thus ``ResampledTrainer`` +would not work with non-inductive estimators (TSNE, AgglomerativeClustering, +etc.) as their final step. 
If the predictor of a ``ResampledTrainer`` is +itself a resampler, it's unclear how ``ResampledTrainer.fit_resample`` should +behave. These caveats also apply to the Pipeline modification below. + Example Usage: ~~~~~~~~~~~~~~ .. code-block:: python - est = ResamplingTrainer(RandomUnderSampler(), SVC()) + est = ResampledTrainer(RandomUnderSampler(), SVC()) est = make_pipeline( StandardScaler(), - ResamplingTrainer(Birch(), make_pipeline(SelectKBest(), SVC())) + ResampledTrainer(Birch(), make_pipeline(SelectKBest(), SVC())) ) - est = ResamplingTrainer( + est = ResampledTrainer( RandomUnderSampler(), make_pipeline(StandardScaler(), SelectKBest(), SVC()), ) @@ -104,45 +118,65 @@ Example Usage: make_pipeline(StandardScaler(), SGDClassifier())) ) -Modifying Pipeline -.................. - -As an alternative to ``ResampledTrainer``, ``Pipeline`` could be modified to -accomodate resamplers. -The functionality is described in terms of the head (all stages except the last) -and the tail (the last stage) of the ``Pipeline``. Note that we assume -resamplers and transformers are exclusive so that the pipeline can decide which -method to call. Further note that ``Xt, yt, kwt`` are the outputs of the stage, and -``X, y, **kw`` are the inputs to the stage. - -fit -~~~ -* head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)`` -* head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)`` -* tail for transformers and predictors: ``est.fit(X, y, **kw)`` -* tail for resamplers: ``pass`` - -fit_transform -~~~~~~~~~~~~~ -* Equivalent to ``fit(X, y).transform(X)`` overall. 
- -predict -~~~~~~~ -* head for resamplers: ``Xt = X`` -* head for transformers: ``Xt = est.transform(X)`` -* tail for predictors: ``return est.predict(X)`` -* tail for transformers and resamplers: ``error`` - -transform -~~~~~~~~~ -* head for resamplers: ``Xt = X`` -* head for transformers: ``Xt = est.transform(X)`` -* tail for predictors and resamplers: ``error`` -* tail for transformers: ``return est.transform(X)`` - -score -~~~~~ -* see predict +Alternative 2: Prediction Pipeline +.................................. + +As an alternative to ``ResampledTrainer``, ``Pipeline`` can be modified to +accomodate resamplers. The essence of the operation is this: one or more steps +of the pipeline may be a resampler. When fitting the Pipeline, ``fit_resample`` +will be called on each resampler instead of ``fit_transform``, and the output +of ``fit_resample`` will be used in place of the original ``X``, ``y``, etc., +to fit the subsequent step (and so on). When predicting in the Pipeline, +the resampler will act as a passthrough step. + +Limitations +~~~~~~~~~~~ + +.. rubric:: Prohibiting ``transform`` on resamplers + +It may be problematic for a resampler to provide ``transform`` if Pipelines +support resampling: + +1. It is unclear what to do at test time if a resampler has a transform + method. +2. Adding fit_resample to the API of an an existing transformer may + drastically change its behaviour in a Pipeline. + +For this reason, it may be best to reject resamplers supporting ``transform`` +from being used in a Pipeline. + +.. rubric:: Prohibiting ``transform`` on resampling Pipelines + +Providing a ``transform`` method on a Pipeline that contains a resampler +presents several problems: + +1. A resampling Pipeline needs to use a special code path for ``fit_transform`` + that would call ``fit(X, y, **kw).transform(X)`` on the Pipeline. + Ordinarily a Pipeline would pass the transformed data to ``fit_transform`` + of the left step. 
If the Pipeline contains a resampler, it rather needs to + fit the Pipeline excluding the last step, then transform the original + training data until the last step, then fit_transform the last step. This + means special code paths for pipelines containing resamplers; the effect of + the resampler is not localised in terms of code maintenance. +2. As a result of issue 1, appending a step to the transformation Pipeline + means that the transformer which was previously last, and previously trained + on the full dataset, will now be trained on the resampled dataset. +3. As a result of issue 1, the last step cannot be 'passthrough' as in other + transformer pipelines. + +For this reason, it may be best to disable ``fit_transform`` and ``transform`` +on the Pipeline. A resampling Pipeline would therefore not be usable as a +transformation within a ``FeatureUnion`` or ``ColumnTransformer``. Thus the +``ResampledTrainer`` would be strictly more expressive than a resampling +Pipeline. + +.. rubric:: Handling ``fit`` parameters + +Sample props or weights cannot be routed to steps downstream of a resampler in +a Pipeline, unless they too are resampled. It's very unclear how this would +work with Pipeline's current prefix-based fit parameter routing. + +TODO: propose solutions Example Usage: ~~~~~~~~~~~~~~ From b989562f865e5dbd1d4e2895580fcb04dbea1f4a Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Wed, 3 Jul 2019 21:48:39 +1000 Subject: [PATCH 19/22] Some text about resampling pipelines and their issues (#2) --- slep005/proposal.rst | 126 +++++++++++++++++++++++++++---------------- 1 file changed, 80 insertions(+), 46 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 59270f2..1e375ea 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -71,30 +71,44 @@ An key part of the proposal is the introduction of a way of composing resamplers with predictors. 
We present two options: ``ResampledTrainer`` and modifications to ``Pipeline``. -ResampledTrainer -................ +Alternative 1: ResampledTrainer +............................... This metaestimator composes a resampler and a predictor. It behaves as follows: -* ``fit(X, y)``: resample ``X, y`` with the resampler, then fit on the resampled - dataset. +* ``fit(X, y)``: resample ``X, y`` with the resampler, then fit the predictor + on the resampled dataset. * ``predict(X)``: simply predict on ``X`` with the predictor. * ``score(X)``: simply score on ``X`` with the predictor. See PR #13269 for an implementation. +One benefit of the ``ResampledTrainer`` is that it does not stop the resampler +having other methods, such as ``transform``, as it is clear that the +``ResampledTrainer`` will only call ``fit_resample``. + +There are complications around supporting ``fit_transform``, ``fit_predict`` +and ``fit_resample`` methods in ``ResampledTrainer``. ``fit_transform`` support +is only possible by implementing ``fit_transform(X, y)`` as ``fit(X, +y).transform(X)``, rather than calling ``fit_transform`` of the predictor. +``fit_predict`` would have to behave similarly. Thus ``ResampledTrainer`` +would not work with non-inductive estimators (TSNE, AgglomerativeClustering, +etc.) as their final step. If the predictor of a ``ResampledTrainer`` is +itself a resampler, it's unclear how ``ResampledTrainer.fit_resample`` should +behave. These caveats also apply to the Pipeline modification below. + Example Usage: ~~~~~~~~~~~~~~ .. 
code-block:: python
 
-    est = ResamplingTrainer(RandomUnderSampler(), SVC())
+    est = ResampledTrainer(RandomUnderSampler(), SVC())
 
     est = make_pipeline(
         StandardScaler(),
-        ResamplingTrainer(Birch(), make_pipeline(SelectKBest(), SVC()))
+        ResampledTrainer(Birch(), make_pipeline(SelectKBest(), SVC()))
     )
 
-    est = ResamplingTrainer(
+    est = ResampledTrainer(
         RandomUnderSampler(),
         make_pipeline(StandardScaler(), SelectKBest(), SVC()),
     )
@@ -104,45 +118,65 @@ Example Usage:
         make_pipeline(StandardScaler(), SGDClassifier()))
     )
 
-Modifying Pipeline
-..................
-
-As an alternative to ``ResampledTrainer``, ``Pipeline`` could be modified to
-accomodate resamplers.
-The functionality is described in terms of the head (all stages except the last)
-and the tail (the last stage) of the ``Pipeline``. Note that we assume
-resamplers and transformers are exclusive so that the pipeline can decide which
-method to call. Further note that ``Xt, yt, kwt`` are the outputs of the stage, and
-``X, y, **kw`` are the inputs to the stage.
-
-fit
-~~~
-* head for resamplers: ``Xt, yt, kwt = est.fit_resample(X, y, **kw)``
-* head for transformers: ``Xt, yt = est.fit_transform(X, y, **kw)``
-* tail for transformers and predictors: ``est.fit(X, y, **kw)``
-* tail for resamplers: ``pass``
-
-fit_transform
-~~~~~~~~~~~~~
-* Equivalent to ``fit(X, y).transform(X)`` overall.
-
-predict
-~~~~~~~
-* head for resamplers: ``Xt = X``
-* head for transformers: ``Xt = est.transform(X)``
-* tail for predictors: ``return est.predict(X)``
-* tail for transformers and resamplers: ``error``
-
-transform
-~~~~~~~~~
-* head for resamplers: ``Xt = X``
-* head for transformers: ``Xt = est.transform(X)``
-* tail for predictors and resamplers: ``error``
-* tail for transformers: ``return est.transform(X)``
-
-score
-~~~~~
-* see predict
+Alternative 2: Prediction Pipeline
+..................................
+
+As an alternative to ``ResampledTrainer``, ``Pipeline`` can be modified to
+accommodate resamplers.
The essence of the operation is this: one or more steps +of the pipeline may be a resampler. When fitting the Pipeline, ``fit_resample`` +will be called on each resampler instead of ``fit_transform``, and the output +of ``fit_resample`` will be used in place of the original ``X``, ``y``, etc., +to fit the subsequent step (and so on). When predicting in the Pipeline, +the resampler will act as a passthrough step. + +Limitations +~~~~~~~~~~~ + +.. rubric:: Prohibiting ``transform`` on resamplers + +It may be problematic for a resampler to provide ``transform`` if Pipelines +support resampling: + +1. It is unclear what to do at test time if a resampler has a transform + method. +2. Adding fit_resample to the API of an an existing transformer may + drastically change its behaviour in a Pipeline. + +For this reason, it may be best to reject resamplers supporting ``transform`` +from being used in a Pipeline. + +.. rubric:: Prohibiting ``transform`` on resampling Pipelines + +Providing a ``transform`` method on a Pipeline that contains a resampler +presents several problems: + +1. A resampling Pipeline needs to use a special code path for ``fit_transform`` + that would call ``fit(X, y, **kw).transform(X)`` on the Pipeline. + Ordinarily a Pipeline would pass the transformed data to ``fit_transform`` + of the left step. If the Pipeline contains a resampler, it rather needs to + fit the Pipeline excluding the last step, then transform the original + training data until the last step, then fit_transform the last step. This + means special code paths for pipelines containing resamplers; the effect of + the resampler is not localised in terms of code maintenance. +2. As a result of issue 1, appending a step to the transformation Pipeline + means that the transformer which was previously last, and previously trained + on the full dataset, will now be trained on the resampled dataset. +3. 
As a result of issue 1, the last step cannot be 'passthrough' as in other + transformer pipelines. + +For this reason, it may be best to disable ``fit_transform`` and ``transform`` +on the Pipeline. A resampling Pipeline would therefore not be usable as a +transformation within a ``FeatureUnion`` or ``ColumnTransformer``. Thus the +``ResampledTrainer`` would be strictly more expressive than a resampling +Pipeline. + +.. rubric:: Handling ``fit`` parameters + +Sample props or weights cannot be routed to steps downstream of a resampler in +a Pipeline, unless they too are resampled. It's very unclear how this would +work with Pipeline's current prefix-based fit parameter routing. + +TODO: propose solutions Example Usage: ~~~~~~~~~~~~~~ From ee197cbc0b88cbfdaa18910827afe8ab1c2cdaec Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Wed, 3 Jul 2019 13:58:30 +0200 Subject: [PATCH 20/22] minor changes --- slep005/proposal.rst | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 1e375ea..2402cb3 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -56,18 +56,20 @@ Implementation API and Constraints ................... -* Resamplers implement a method ``fit_resample(X, y, **kwargs)``, a pure function which - returns ``Xt, yt, kwargs`` corresponding to the resampled dataset, where - samples may have been added and/or removed. -* An estimator may only implement either ``fit_transform`` or ``fit_resample``. +* Resamplers implement a method ``fit_resample(X, y, **kwargs)``, a pure + function which returns ``Xt, yt, kwargs`` corresponding to the resampled + dataset, where samples may have been added and/or removed. +* An estimator may only implement either ``fit_transform`` or ``fit_resample`` + if support for ``Resamplers`` in ``Pipeline`` is enabled. * Resamplers may not change the order, meaning, dtype or format of features (this is left to transformers). 
-* Resamplers should also resample any kwargs. +* Resamplers should also handled (e.g. resample, generate anew, etc.) any + kwargs. Composition ----------- -An key part of the proposal is the introduction of a way of composing resamplers +A key part of the proposal is the introduction of a way of composing resamplers with predictors. We present two options: ``ResampledTrainer`` and modifications to ``Pipeline``. From 35c140d01a290a8af8348df9c04f2fbb9d83c79f Mon Sep 17 00:00:00 2001 From: Guillaume Lemaitre <g.lemaitre58@gmail.com> Date: Wed, 3 Jul 2019 14:05:20 +0200 Subject: [PATCH 21/22] iter --- slep005/proposal.rst | 40 +++++++++++++++++++++------------------- 1 file changed, 21 insertions(+), 19 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 2402cb3..79f839f 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -60,7 +60,8 @@ API and Constraints function which returns ``Xt, yt, kwargs`` corresponding to the resampled dataset, where samples may have been added and/or removed. * An estimator may only implement either ``fit_transform`` or ``fit_resample`` - if support for ``Resamplers`` in ``Pipeline`` is enabled. + if support for ``Resamplers`` in ``Pipeline`` is enabled + (see Sect. "Limitations"). * Resamplers may not change the order, meaning, dtype or format of features (this is left to transformers). * Resamplers should also handled (e.g. resample, generate anew, etc.) any @@ -136,13 +137,13 @@ Limitations .. rubric:: Prohibiting ``transform`` on resamplers -It may be problematic for a resampler to provide ``transform`` if Pipelines +It may be problematic for a resampler to provide ``transform`` if ``Pipeline``s support resampling: 1. It is unclear what to do at test time if a resampler has a transform method. -2. Adding fit_resample to the API of an an existing transformer may - drastically change its behaviour in a Pipeline. +2. 
Adding ``fit_resample`` to the API of an existing transformer may
+   drastically change its behaviour in a ``Pipeline``.
 
 For this reason, it may be best to reject resamplers supporting ``transform``
 from being used in a Pipeline.
 
@@ -152,31 +153,32 @@ from being used in a Pipeline.
 Providing a ``transform`` method on a Pipeline that contains a resampler
 presents several problems:
 
-1. A resampling Pipeline needs to use a special code path for ``fit_transform``
-   that would call ``fit(X, y, **kw).transform(X)`` on the Pipeline.
-   Ordinarily a Pipeline would pass the transformed data to ``fit_transform``
-   of the left step. If the Pipeline contains a resampler, it rather needs to
-   fit the Pipeline excluding the last step, then transform the original
-   training data until the last step, then fit_transform the last step. This
-   means special code paths for pipelines containing resamplers; the effect of
-   the resampler is not localised in terms of code maintenance.
-2. As a result of issue 1, appending a step to the transformation Pipeline
+1. A resampling ``Pipeline`` needs to use a special code path for
+   ``fit_transform`` that would call ``fit(X, y, **kw).transform(X)`` on the
+   ``Pipeline``. Ordinarily a ``Pipeline`` would pass the transformed data to
+   ``fit_transform`` of the left step. If the ``Pipeline`` contains a
+   resampler, it rather needs to fit the ``Pipeline`` excluding the last step,
+   then transform the original training data until the last step, then
+   ``fit_transform`` the last step. This means special code paths for pipelines
+   containing resamplers; the effect of the resampler is not localised in terms
+   of code maintenance.
+2. As a result of issue 1, appending a step to the transformation ``Pipeline``
    means that the transformer which was previously last, and previously trained
    on the full dataset, will now be trained on the resampled dataset.
-3. As a result of issue 1, the last step cannot be 'passthrough' as in other
-   transformer pipelines.
+3. As a result of issue 1, the last step cannot be ``'passthrough'`` as in + other transformer pipelines. For this reason, it may be best to disable ``fit_transform`` and ``transform`` -on the Pipeline. A resampling Pipeline would therefore not be usable as a +on the Pipeline. A resampling ``Pipeline`` would therefore not be usable as a transformation within a ``FeatureUnion`` or ``ColumnTransformer``. Thus the ``ResampledTrainer`` would be strictly more expressive than a resampling -Pipeline. +``Pipeline``. .. rubric:: Handling ``fit`` parameters Sample props or weights cannot be routed to steps downstream of a resampler in -a Pipeline, unless they too are resampled. It's very unclear how this would -work with Pipeline's current prefix-based fit parameter routing. +a ``Pipeline``, unless they too are resampled. It's very unclear how this would +work with ``Pipeline``'s current prefix-based fit parameter routing. TODO: propose solutions From bc45d6aba464398b5d6ecf755a3570b087a3ecdd Mon Sep 17 00:00:00 2001 From: Joel Nothman <joel.nothman@gmail.com> Date: Tue, 27 Aug 2019 00:23:30 +1000 Subject: [PATCH 22/22] Some comments on fit params --- slep005/proposal.rst | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/slep005/proposal.rst b/slep005/proposal.rst index 7ffeef5..7f13530 100644 --- a/slep005/proposal.rst +++ b/slep005/proposal.rst @@ -173,10 +173,15 @@ Pipeline. .. rubric:: Handling ``fit`` parameters Sample props or weights cannot be routed to steps downstream of a resampler in -a Pipeline, unless they too are resampled. It's very unclear how this would -work with Pipeline's current prefix-based fit parameter routing. - -TODO: propose solutions +a Pipeline, unless they too are resampled. To support this, a resampler +would need to be passed all props that are required downstream, and +``fit_resample`` should return resampled versions of them. 
Note that these
+must be distinct from parameters that affect the resampler's fitting.
+That is, consider the signature ``fit_resample(X, y=None, props=None, sample_weight=None)``.
+The ``sample_weight`` passed in should affect the resampling, but does not
+itself need to be resampled. A Pipeline would pass ``props`` including the fit
+parameters required downstream, which would be resampled and returned by
+``fit_resample``.
 
 Example Usage:
 ~~~~~~~~~~~~~~
@@ -191,6 +196,7 @@ Example Usage:
     est = make_pipeline(
         NaNRejector(), RandomUnderSampler(), StandardScaler(), SGDClassifier()
     )
+    est.fit(X, y, sgdclassifier__sample_weight=my_weight)
 
 Alternative implementation
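[Editor's sketch] The composition behaviour described across these patches — resample at fit time, forward the resampled fit params, and act as a plain wrapper at predict/score time — might look like the following. This is an illustrative reading of the SLEP, not the code of PR #13269; ``DropLast`` and ``CountingModel`` are hypothetical toys.

```python
class ResampledTrainer:
    """Sketch of the metaestimator described in this SLEP: resample the
    training data once at fit time, then delegate to the predictor."""

    def __init__(self, resampler, predictor):
        self.resampler = resampler
        self.predictor = predictor

    def fit(self, X, y, **kw):
        # fit_resample returns the resampled dataset plus resampled kwargs,
        # which are forwarded to the predictor's fit.
        Xt, yt, kwt = self.resampler.fit_resample(X, y, **kw)
        self.predictor.fit(Xt, yt, **kwt)
        return self

    def predict(self, X):
        return self.predictor.predict(X)   # no resampling at predict time

    def score(self, X, y):
        return self.predictor.score(X, y)  # no resampling at score time


# Hypothetical toy components, only to show the data flow.
class DropLast:
    def fit_resample(self, X, y, **kw):
        kwt = {name: vals[:-1] for name, vals in kw.items()}
        return X[:-1], y[:-1], kwt

class CountingModel:
    def fit(self, X, y, sample_weight=None):
        self.n_fit_samples_ = len(X)       # records how many samples it saw
        return self
    def predict(self, X):
        return [0] * len(X)
    def score(self, X, y):
        return sum(p == t for p, t in zip(self.predict(X), y)) / len(y)


est = ResampledTrainer(DropLast(), CountingModel())
est.fit([[1], [2], [3]], [0, 0, 1], sample_weight=[1.0, 1.0, 1.0])
# The predictor was fit on 2 samples, with a 2-element sample_weight.
```

Because ``predict`` and ``score`` never touch the resampler, evaluation always runs on the unresampled test data, which is the key semantic point of the metaestimator.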