Jordan Pfeifer, Ekin Secilmis, Egor Serebriakov
We have three "black boxes" meant to be representative of actual LHC data. Each "black box" contains 1M events. The events in a box might include signals that we treat as anomalies.
Additionally, we have a background sample of 1M events simulated using Pythia8 and Delphes. This sample was simulated to aid anomaly detection in the "black boxes". However, some assumptions made during the simulation might not exactly reflect the "black box" data.
All datasets are stored as pandas DataFrames saved in compressed HDF5 (.h5) format. Each event consists of up to 700 particles (events with fewer particles are zero-padded), and each particle has three coordinates (pT, eta, phi).
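As a rough illustration of this layout, the sketch below reshapes a flat event table into an array of shape (events, 700, 3) and counts the non-padded particles in each event. It assumes that each row holds one event with columns ordered pT, eta, phi for particle 0, then particle 1, and so on, and that padded particles have pT equal to zero. The helper names `events_to_array` and `particle_counts` are hypothetical, the file path in the comment is a placeholder, and a small synthetic table stands in for a real file.

```python
import numpy as np
import pandas as pd

N_PARTICLES = 700
N_FEATURES = 3  # (pT, eta, phi)


def events_to_array(df):
    """Reshape a flat event table into shape (n_events, 700, 3).

    Assumption: columns are ordered pT_0, eta_0, phi_0, pT_1, eta_1, ...
    Any columns beyond the first 2100 are ignored.
    """
    values = df.to_numpy(dtype=np.float32)
    flat = values[:, : N_PARTICLES * N_FEATURES]
    return flat.reshape(-1, N_PARTICLES, N_FEATURES)


def particle_counts(events):
    """Count particles per event, treating pT == 0 as zero padding."""
    return (events[..., 0] > 0).sum(axis=1)


# With a real file, the DataFrame would come from something like:
#   df = pd.read_hdf("path/to/black_box.h5")  # placeholder path
# Here a synthetic table with the same width is used instead.
rng = np.random.default_rng(0)
demo = np.zeros((4, N_PARTICLES * N_FEATURES), dtype=np.float32)
demo[0, :6] = rng.uniform(0.5, 2.0, size=6)    # event with 2 particles
demo[1, :30] = rng.uniform(0.5, 2.0, size=30)  # event with 10 particles
df = pd.DataFrame(demo)                        # events 2 and 3 are fully padded

events = events_to_array(df)
counts = particle_counts(events)
print(events.shape, counts.tolist())
```

Keeping the particle axis explicit makes it straightforward to mask padded entries before feeding events to a model, though whether masking is needed depends on the architecture used.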
- Google Drive with Split Data (TensorFlow) and Preprocessed Data (torch)
The files X_train_small.csv, X_test_small.csv, and X_valid_small.csv are smaller versions of the background data, useful for building and testing a model more quickly. X_test_first.csv contains the data from the first black box, and the files for the other boxes follow the same pattern.