Currently, the structural_prior/ output directory for a given enum_factor will look something like this:
$ ls
smiles_SMILES_0_CV_ranks_structure.csv.gz smiles_SMILES_7_CV_ranks_structure.csv.gz
smiles_SMILES_0_CV_tc.csv.gz smiles_SMILES_7_CV_tc.csv.gz
smiles_SMILES_1_CV_ranks_structure.csv.gz smiles_SMILES_8_CV_ranks_structure.csv.gz
smiles_SMILES_1_CV_tc.csv.gz smiles_SMILES_8_CV_tc.csv.gz
smiles_SMILES_2_CV_ranks_structure.csv.gz smiles_SMILES_9_CV_ranks_structure.csv.gz
smiles_SMILES_2_CV_tc.csv.gz smiles_SMILES_9_CV_tc.csv.gz
smiles_SMILES_3_CV_ranks_structure.csv.gz smiles_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz
smiles_SMILES_3_CV_tc.csv.gz smiles_SMILES_min1_all_freq-avg_CV_tc.csv.gz
smiles_SMILES_4_CV_ranks_structure.csv.gz smiles_SMILES_min2_all_freq-avg_CV_ranks_structure.csv.gz
smiles_SMILES_4_CV_tc.csv.gz smiles_SMILES_min2_all_freq-avg_CV_tc.csv.gz
smiles_SMILES_5_CV_ranks_structure.csv.gz smiles_SMILES_min3_all_freq-avg_CV_ranks_structure.csv.gz
smiles_SMILES_5_CV_tc.csv.gz smiles_SMILES_min3_all_freq-avg_CV_tc.csv.gz
smiles_SMILES_6_CV_ranks_structure.csv.gz smiles_SMILES_min4_all_freq-avg_CV_ranks_structure.csv.gz
smiles_SMILES_6_CV_tc.csv.gz smiles_SMILES_min4_all_freq-avg_CV_tc.csv.gz
This has the potential to mislead users. In particular, the fold-specific rank files (fold indices 0 through 9 in the listing) are misleading, because any structure in the training folds is excluded from the sampled SMILES. The problem is in effect made artificially easier: it is implicitly assumed that any given structure is novel, i.e. not found in the training folds.
The fix is simply to stop writing these files. All plots should be made from the min1 file anyway.
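As a rough illustration of the fix, the sketch below separates fold-specific rank files from the aggregated ones by name, so that only the latter are kept. The regular expression is inferred from the listing in this issue, and `is_fold_specific_rank_file` and `outputs_to_write` are hypothetical helpers, not functions from the codebase. It assumes only the rank files should be dropped, since those are the ones named as misleading.

```python
import re

# Assumed naming pattern, inferred from the directory listing in this issue:
# a fold-specific rank file has an integer fold index right after
# "smiles_SMILES_", while aggregated files have a "minN_all_freq-avg" tag.
FOLD_SPECIFIC_RANKS = re.compile(
    r"^smiles_SMILES_\d+_CV_ranks_structure\.csv\.gz$"
)


def is_fold_specific_rank_file(filename: str) -> bool:
    """Return True for per-fold rank files such as smiles_SMILES_3_CV_ranks_structure.csv.gz."""
    return FOLD_SPECIFIC_RANKS.match(filename) is not None


def outputs_to_write(filenames):
    """Keep every output except the fold-specific rank files (hypothetical helper)."""
    return [f for f in filenames if not is_fold_specific_rank_file(f)]


if __name__ == "__main__":
    candidates = [
        "smiles_SMILES_0_CV_ranks_structure.csv.gz",
        "smiles_SMILES_0_CV_tc.csv.gz",
        "smiles_SMILES_min1_all_freq-avg_CV_ranks_structure.csv.gz",
    ]
    for name in outputs_to_write(candidates):
        print(name)
```

In the actual pipeline, the change would more likely be made where the outputs are declared than by filtering names after the fact. The pattern above is only meant to make explicit which files are affected.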