From ca3b3345fcef372140c4798382e769142a75e560 Mon Sep 17 00:00:00 2001
From: "Allen.YL"
Date: Tue, 11 Aug 2020 15:01:48 +0800
Subject: [PATCH] Fix: Found array with 0 sample(s)

Symptom:
When SVMSMOTE is used on a dataset whose minority class has very few
samples (possibly fewer than 10), it raises
`ValueError: Found array with 0 sample(s) (shape=(0, 600)) while a minimum of 1 is required.`

Root cause:
The line `noise_bool = self._in_danger_noise(...)` flags noise samples
based on the `n_neighbors` attribute of the nearest-neighbors estimator
`self.nn_m_`, which is derived from the `m_neighbors` parameter of
`SVMSMOTE`. If `SVMSMOTE` is initialized with a very large `m_neighbors`,
for example `SVMSMOTE(m_neighbors=1000)`, the error goes away: the
neighborhood searched is then large enough to contain another minority
sample, so the center sample is not treated as noise by the check
`n_maj == nn_estimator.n_neighbors - 1`. But when `m_neighbors` is small
(the default is 10) and the minority class has very few samples, every
minority sample may be flagged as noise. `noise_bool` is then all True,
the subsequent `_safe_indexing(...)` removes all of these samples, and
zero support vectors remain.

Solution:
Save the support vectors before trimming the noise samples. If the
trimming leaves zero support vectors, restore the previously saved ones,
so that all of the original support vectors are used.
---
 imblearn/over_sampling/_smote.py | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/imblearn/over_sampling/_smote.py b/imblearn/over_sampling/_smote.py
index bb5b3bd01..ee652928f 100644
--- a/imblearn/over_sampling/_smote.py
+++ b/imblearn/over_sampling/_smote.py
@@ -554,12 +554,19 @@ def _fit_resample(self, X, y):
             support_vector = _safe_indexing(X, support_index)
 
             self.nn_m_.fit(X)
+
+            prev_support_vector = support_vector
+
             noise_bool = self._in_danger_noise(
                 self.nn_m_, support_vector, class_sample, y, kind="noise"
             )
             support_vector = _safe_indexing(
                 support_vector, np.flatnonzero(np.logical_not(noise_bool))
             )
+
+            if len(support_vector) == 0:
+                support_vector = prev_support_vector
+
             danger_bool = self._in_danger_noise(
                 self.nn_m_, support_vector, class_sample, y, kind="danger"
             )
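
Note (not part of the patch): below is a minimal reproduction sketch of the
symptom. The 2-D toy dataset, the cluster layout, and the class sizes are
illustrative assumptions chosen so that every minority support vector is
flagged as noise with the default m_neighbors=10; on versions without this
change the call is expected to raise the ValueError quoted above, while with
this change it should complete.

    import numpy as np
    from imblearn.over_sampling import SVMSMOTE

    rng = np.random.RandomState(0)

    # 10 well-separated majority clusters of 20 points each -> 200 majority samples.
    centers = np.array([[50.0 * i, 0.0] for i in range(10)])
    X_maj = np.vstack([c + rng.normal(scale=1.0, size=(20, 2)) for c in centers])

    # 6 minority samples, each placed at the center of a different majority
    # cluster, so the m_neighbors=10 nearest neighbors of every minority
    # sample are all majority points and the noise check flags all of them.
    X_min = centers[:6]

    X = np.vstack([X_maj, X_min])
    y = np.array([0] * len(X_maj) + [1] * len(X_min))

    # Without this patch the call below is expected to fail with
    #   ValueError: Found array with 0 sample(s) (shape=(0, 2)) while a minimum of 1 is required.
    # because every minority support vector is discarded as noise.
    # With the patch the saved support vectors are restored and resampling succeeds.
    X_res, y_res = SVMSMOTE(random_state=0).fit_resample(X, y)
    print(X_res.shape, np.bincount(y_res))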