SamuelHomberg/mutagenicity-dataset

The Mutagenicity Dataset

Note: In some papers such as GNNExplainer the dataset is referred to as MUTAG instead of Mutagenicity. However, in the TUDataset collection, MUTAG is a different, smaller dataset.

This dataset contains 4337 molecules, 2401 of which are confirmed mutagens. In the original publication the data is only available upon request. However, the data is available as graphs for machine learning from the TUDataset collection or directly within PyTorch Geometric.
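As a minimal sketch of the PyTorch Geometric route mentioned above, the dataset can be requested by name through the `TUDataset` class. The helper name `load_mutagenicity` and the `root` directory are illustrative choices, not part of this repository. The import is kept inside the function so the snippet can be read and run without PyTorch Geometric installed until the loader is actually called.

```python
# Name used by the TUDataset collection. Note that "MUTAG" would load a
# different, smaller dataset.
DATASET_NAME = "Mutagenicity"


def load_mutagenicity(root="data/TUDataset"):
    """Download (on first use) and return the Mutagenicity graphs.

    `root` is a hypothetical cache directory chosen for this example.
    """
    # Imported lazily so that defining this helper does not require
    # torch_geometric to be installed.
    from torch_geometric.datasets import TUDataset

    return TUDataset(root=root, name=DATASET_NAME)
```

Calling `len(load_mutagenicity())` should report the number of graphs in the original, uncurated dataset, which can be compared against the 4337 molecules stated above.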

It is therefore possible to reverse engineer the molecules from the datasets.
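The reverse engineering step could look roughly like the sketch below. It is not the code used to build this repository. The label tables are an assumption based on the label description shipped with the TUDataset files and should be checked against the downloaded data. The function names `decode_graph` and `to_smiles` are hypothetical, and `to_smiles` assumes RDKit is installed.

```python
# ASSUMPTION: node and edge label tables as described in the TUDataset
# Mutagenicity label file; verify before relying on them.
NODE_LABELS = ["C", "O", "Cl", "H", "N", "F", "Br", "S", "P", "I",
               "Na", "K", "Li", "Ca"]
EDGE_LABELS = {0: 1, 1: 2, 2: 3}  # edge label -> bond order


def decode_graph(node_labels, edges, edge_labels):
    """Turn integer labels into element symbols and undirected bonds.

    Graph datasets usually store each bond twice (i->j and j->i), so
    duplicates are collapsed onto the pair (min, max).
    """
    atoms = [NODE_LABELS[label] for label in node_labels]
    bonds = {}
    for (i, j), label in zip(edges, edge_labels):
        if i == j:
            continue  # ignore self-loops
        bonds[(min(i, j), max(i, j))] = EDGE_LABELS[label]
    return atoms, sorted((i, j, order) for (i, j), order in bonds.items())


def to_smiles(atoms, bonds):
    """Build an RDKit molecule from the decoded graph and return SMILES."""
    from rdkit import Chem  # imported lazily; requires RDKit

    bond_types = {1: Chem.BondType.SINGLE,
                  2: Chem.BondType.DOUBLE,
                  3: Chem.BondType.TRIPLE}
    mol = Chem.RWMol()
    for symbol in atoms:
        mol.AddAtom(Chem.Atom(symbol))
    for i, j, order in bonds:
        mol.AddBond(i, j, bond_types[order])
    Chem.SanitizeMol(mol)  # raises if valences are chemically invalid
    return Chem.MolToSmiles(Chem.RemoveHs(mol))
```

With a PyTorch Geometric `data` object, the inputs could be obtained as `data.x.argmax(dim=1).tolist()`, `data.edge_index.t().tolist()` and `data.edge_attr.argmax(dim=1).tolist()`, assuming the node and edge labels are stored one-hot encoded.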

The dataset has been used in multiple studies to explain the predictions of graph neural networks (GNNs), and displaying the actual molecules will help domain experts better evaluate the performance of GNN explainers on molecular data.

However, I uncovered some mistakes in the original preparation of the dataset and had to manually look through the data to remove salts, duplicates and mixtures of compounds. An example of a strange graph can be seen below; in the original graph the hydrogens were individual unconnected nodes.

Example of weird molecule.
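Graphs like this one, as well as salts and mixtures, share one property: they consist of more than one connected component. The sketch below shows one way such candidates could be flagged for manual inspection. It is an illustration only and not the procedure used for this repository, which relied on manual review. The names `connected_components` and `is_suspicious` are hypothetical.

```python
def connected_components(num_nodes, edges):
    """Group node indices into connected components using union-find."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    groups = {}
    for node in range(num_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


def is_suspicious(num_nodes, edges):
    """Flag graphs that are not a single connected molecule.

    Unconnected hydrogens, counter-ions of salts and mixtures of
    compounds all show up as extra components.
    """
    return len(connected_components(num_nodes, edges)) > 1
```

A flagged graph is only a candidate: deciding whether it is a salt, a mixture or a preparation error still requires looking at the molecule.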

The new dataset contains 4247 molecules, 2362 of which are confirmed mutagens.

Distribution of molecular weight. t-SNE dimensionality reduction with PCA.

About

A curated version of the Mutagenicity dataset, as SMILES instead of graphs.
