Summary: Tc is currently computed via pairwise FingerprintSimilarity loops and can be accelerated using BulkTanimotoSimilarity.
While profiling Tc-based nearest-neighbor computations, I noticed that FingerprintSimilarity(fp1, fp2) is used inside nested Python loops.
The calls in question:
- `if max_tc < (fps := FingerprintSimilarity(target_fps, ref_fp)):`
- `FingerprintSimilarity(row["fp"], target_fp)`
- `FingerprintSimilarity(input_fp, target_fp) for input_fp in input_fps`
- `tcs.append(FingerprintSimilarity(fp1, fp2))`
RDKit provides a bulk API (BulkTanimotoSimilarity) that computes the same Tanimoto scores but is significantly faster for this use case. Here is a simple benchmark comparing the following approaches. Using Morgan bit vectors, all methods produced identical outputs, but performance differed substantially:
- pairwise FingerprintSimilarity: ~32 sec
- pairwise TanimotoSimilarity: ~24 sec
- BulkTanimotoSimilarity: ~1.4 sec
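To illustrate the kind of change being proposed, here is a minimal sketch of a nearest-neighbor Tc computation written both ways. It is not the CLM code: the SMILES strings, the helper names (`to_fps`, `nn_tc_pairwise`, `nn_tc_bulk`), and the Morgan settings (radius 2, 2048 bits) are assumptions chosen for illustration. The only RDKit calls used are `DataStructs.FingerprintSimilarity`, `DataStructs.BulkTanimotoSimilarity`, and the `rdFingerprintGenerator` Morgan generator.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Hypothetical toy inputs; a real run would load these from files.
target_smiles = ["CCO", "c1ccccc1O", "CC(=O)O"]
ref_smiles = ["CCN", "c1ccccc1", "CC(=O)OC", "CCCO"]

# Assumed fingerprint settings (the issue only says "Morgan bit vectors").
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def to_fps(smiles_list):
    """Convert SMILES strings to Morgan bit-vector fingerprints."""
    return [gen.GetFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]


def nn_tc_pairwise(target_fps, ref_fps):
    """Nearest-neighbor Tc with one FingerprintSimilarity call per pair."""
    return [
        max(DataStructs.FingerprintSimilarity(t, r) for r in ref_fps)
        for t in target_fps
    ]


def nn_tc_bulk(target_fps, ref_fps):
    """Nearest-neighbor Tc with one BulkTanimotoSimilarity call per target."""
    return [
        max(DataStructs.BulkTanimotoSimilarity(t, ref_fps))
        for t in target_fps
    ]


target_fps = to_fps(target_smiles)
ref_fps = to_fps(ref_smiles)

pairwise = nn_tc_pairwise(target_fps, ref_fps)
bulk = nn_tc_bulk(target_fps, ref_fps)
```

The bulk version still loops over targets in Python, but the inner loop over reference fingerprints runs inside RDKit, which is presumably where most of the speedup in the benchmark comes from.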
Call sites at commit 2cf5e22:
- CLM/src/clm/commands/write_nn_Tc.py, line 63
- CLM/src/clm/commands/write_structural_prior_CV.py, line 169
- CLM/src/clm/commands/create_training_sets.py, line 136
- CLM/src/clm/functions.py, line 408