
Transform with update_outlier_params=False will still change Hotelling T2 outlier results on the fit data #54

lambdatascience opened this issue Jan 26, 2024 · 1 comment



lambdatascience commented Jan 26, 2024

I have found edge cases where transforming new, unseen data changes the results in the 'outliers' dataframe for the original data used in fit, even with update_outlier_params=False. This applies specifically to the Hotelling T2 statistic.

Digging into it, the cause is the use of all rows of the PC dataframe in hotellingsT2(), which is called by compute_outliers() from transform().

The hotellingsT2() function uses all rows of the PC dataframe to compute the outliers on the new data. For the original rows, the results don't change for the y_score calculation (since the mean and variance are locked), nor for y_proba or the Pcomb variables.

However, the calculation of Pcorr via multitest_correction() is directly affected by using more rows than before, and it is this column that is compared against alpha to determine the y_bool column in results['outliers'].

So, in short, fitting data and then transforming more data with update_outlier_params=False can change the y_proba and y_bool of the original fit data in certain cases.
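The mechanism can be sketched outside the library with statsmodels' multipletests. This assumes a Benjamini-Hochberg-style correction (an assumption about what multitest_correction delegates to; the exact method may differ): correcting a longer p-value vector changes the corrected p-values of the original entries even though their raw p-values never moved.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Raw p-values for the original "fit" rows
p_orig = np.array([0.001, 0.2, 0.5, 0.049])

# Correction computed on the original rows alone
_, p_corr_before, _, _ = multipletests(p_orig, alpha=0.05, method='fdr_bh')

# Append raw p-values from newly transformed rows and correct again
p_new = np.array([0.3, 0.6, 0.8, 0.9])
_, p_corr_after, _, _ = multipletests(
    np.concatenate([p_orig, p_new]), alpha=0.05, method='fdr_bh'
)

# The corrected p-values of the ORIGINAL rows differ between the two runs,
# even though their raw p-values are identical
print("before:", p_corr_before)
print("after: ", p_corr_after[:len(p_orig)])
```

A borderline corrected p-value that was below alpha in the first run can end up above it in the second, which is exactly the y_bool flip described above.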

I experimented and created a simple dummy-data example that replicates this. To be fair, I'm not even sure this is a huge concern, but I figure the expectation is that the outlier params of previously fit data won't change when update_outlier_params=False. And it showed up in the application I'm building.

This example changes the number of HotellingT2 outliers (as determined by y_bool) of original fit data from 1 to 0.

import numpy as np
import pandas as pd

from pca import pca

# Create dataset
np.random.seed(42)
X_orig = pd.DataFrame(np.random.randint(low=1, high=10, size=(10000, 10)))
# Insert Outliers
X_orig.iloc[500:510, 8:] = 15

# PCA Training
model = pca(n_components=5, alpha=0.05, n_std=3, normalize=True, random_state=42)
results = model.fit_transform(X=X_orig)

outliers_original = model.results['outliers']

# Create New Data
X_new = pd.DataFrame(np.random.randint(low=1, high=10, size=(1000, 10)))

# Transform New Data
model.transform(X=X_new, update_outlier_params=False)
outliers_new = model.results['outliers']

# Compare original points' outlier results before and after transform
n_orig = len(X_orig)
print("Before:", outliers_original['y_bool'].value_counts())
print("After:", outliers_new.iloc[:n_orig]['y_bool'].value_counts())

I'm not sure what the fix is from a statistics standpoint, whether it's running the multitest correction differently or checking for changes, but I wanted to raise the question.

I understand that it inherently makes sense for y_proba to change for the previous data once more is added, so it seems more a philosophical problem than a statistical one. But as someone tracking outliers as more and more data is transformed, it showed up.
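One possible direction, sketched purely as an illustration (corrected_pvalues and its parameters are hypothetical and not part of pca's API; whether correcting new rows separately is statistically defensible is an open question): freeze the corrected p-values of the fit rows and run the multitest correction only over newly transformed rows.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

def corrected_pvalues(p_raw, n_fit, frozen_pcorr=None):
    """Sketch: keep fit-row corrected p-values fixed across transforms.

    p_raw        : raw p-values for all rows (fit rows first).
    n_fit        : number of rows that were in the original fit.
    frozen_pcorr : previously computed corrected p-values for the fit rows.
    """
    if frozen_pcorr is None:
        # First call: correct the fit rows once and freeze the result
        _, frozen_pcorr, _, _ = multipletests(p_raw[:n_fit], method='fdr_bh')
    # Correct only the newly transformed rows among themselves, so the
    # fit rows' Pcorr (and hence y_bool) can never move afterwards
    _, p_new_corr, _, _ = multipletests(p_raw[n_fit:], method='fdr_bh')
    return np.concatenate([frozen_pcorr, p_new_corr])
```

This trades family-wide correction across all rows for stability of the already-reported results, so it may not be the right answer, but it shows the shape of a fix.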

erdogant (Owner) commented Mar 6, 2024

Thank you for observing and mentioning this.
I need to chew on this a bit.
