Note: In some papers, such as GNNExplainer, this dataset is referred to as MUTAG instead of Mutagenicity. However, in the TUDataset collection, MUTAG is a different, smaller dataset.
This dataset contains 4337 molecules, 2401 of which are confirmed mutagens. In the original publication the data is only available upon request. However, the data is available as graphs for machine learning from the TUDataset collection or directly within PyTorch Geometric.
It is therefore possible to reverse engineer the molecules from the datasets.
The dataset has been used in multiple studies to explain the predictions of graph neural networks (GNNs), and displaying the actual molecules will help domain experts better evaluate the performance of GNN explainers on molecular data.
However, I uncovered some mistakes in the original preparation of the dataset and had to manually go through it to remove salts, duplicates, and mixtures of compounds. An example of a strange graph can be seen below: in the original graph, the hydrogens were individual, unconnected nodes.
The new dataset contains 4247 molecules, 2362 of which are confirmed mutagens.
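Salts, mixtures, and graphs with unconnected hydrogens all share one property: the graph has more than one connected component. The sketch below shows one way such graphs could be flagged for manual review. It is not the procedure I actually used for the cleanup, only an illustration; the function names `connected_components` and `needs_manual_review` and the toy graph are hypothetical, and it assumes each graph is given as a node count plus a list of undirected edges.

```python
from collections import defaultdict


def connected_components(num_nodes, edges):
    """Return the connected components of an undirected graph as sorted node lists."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen = set()
    components = []
    for start in range(num_nodes):
        if start in seen:
            continue
        # Depth-first traversal from every node that has not been visited yet.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def needs_manual_review(num_nodes, edges):
    """Flag graphs with more than one component (possible salt, mixture, or stray atoms)."""
    return len(connected_components(num_nodes, edges)) > 1


# Hypothetical toy graph: a three-atom chain (0-1-2) plus two isolated nodes (3 and 4),
# mimicking hydrogens that were left unconnected.
toy_edges = [(0, 1), (1, 2)]
toy_components = connected_components(5, toy_edges)
```

A check like this only flags candidates. Whether a flagged graph is a salt, a mixture, or a molecule with detached hydrogens still has to be decided by looking at the structure, and duplicates need a separate comparison of the reconstructed molecules.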


