Skip to content

TFD Draft#127

Draft
evasnow1992 wants to merge 25 commits intoNVIDIA-Digital-Bio:mainfrom
evasnow1992:evax/tfd
Draft

TFD Draft#127
evasnow1992 wants to merge 25 commits intoNVIDIA-Digital-Bio:mainfrom
evasnow1992:evax/tfd

Conversation

@evasnow1992
Copy link
Copy Markdown
Collaborator

No description provided.

Implement CPU-based TFD computation as a drop-in replacement for
RDKit's TorsionFingerprints.GetTFDMatrix(). This provides the
foundational data structures and algorithms that will also be used
by the GPU implementation.

Key components:
- tfd_common.h/cpp: TFD system data structures (TFDSystemHost,
  TFDSystemDevice) and buildTFDSystem() which extracts torsion
  definitions, weights, coordinates and precomputes flattened
  work items for both dihedral and TFD pair computations
- tfd_cpu.h/cpp: CPU TFD generator with OpenMP parallelization
  over conformers and conformer pairs
- tfd_detail.h: Shared math utilities (computeDihedralAngle,
  circularDifference) marked __host__ __device__ for CPU/GPU use
- test_tfd_cpu.cpp: Comprehensive tests including unit tests for
  each stage of the pipeline and end-to-end validation against
  hardcoded RDKit reference values

Torsion extraction replicates RDKit's Python implementation in C++,
including the _calculateBeta weight calculation and symmetric bond
type classification. Numerical results match RDKit to within 1e-4
tolerance.
Implement GPU-accelerated kernels for the two core TFD operations:
computing dihedral angles across all conformer-torsion pairs, and
computing pairwise TFD values across all conformer pairs.

Key components:
- tfd_kernels.h/cu: Two flattened kernels, each using one thread
  per work item. The dihedral kernel computes one angle per
  (conformer, torsion) pair. The TFD matrix kernel computes one
  TFD value per conformer pair (i > j) with a serial loop over
  torsions. Work item indices are precomputed on the host in
  buildTFDSystem() to avoid CSR lookups in the kernels.
- tfd_transfer.cu: Host-to-device transfer (transferToDevice) and
  stream management (setStream) for TFDSystemDevice. Placed in the
  tfd_cpu CMake target which has access to both RDKit headers and
  CUDA, bridging the compilation boundary.
- test_tfd_kernels.cu: GPU kernel tests including basic dihedral
  verification, GPU-vs-CPU consistency for single and batched
  molecules, direct comparison against hardcoded RDKit reference
  values, and edge cases (two-conformer minimum, batch with
  zero-torsion molecule).

TFDKernelParams is minimal (two work item counts). All indexing
arrays live in TFDSystemDevice and are passed as raw pointers to
kernel launch wrappers.
Stage 3 of TFD implementation: GPU orchestration and unified API.

- Add TFDGpuGenerator (tfd_gpu.h/cpp) for GPU-accelerated TFD
  computation. Uses TFDSystemDevice and transferToDevice() from
  Stage 2 for host-to-device transfer. Launches dihedral and TFD
  matrix kernels, returns results as host vectors or GPU-resident
  TFDGpuResult with extractMolecule/extractAll methods.

- Add TFDGenerator unified API (tfd.h/cpp) with automatic backend
  selection (CPU/GPU) and lazy initialization of backends.

- Simplify kernel launch interface: remove TFDKernelParams struct,
  pass totalWorkItems as plain int to launchDihedralKernel and
  launchTFDMatrixKernel.

- Add comprehensive test suite (test_tfd_gpu.cpp):
  GPU vs CPU correctness, unweighted mode, batch processing,
  single conformer and no-torsion edge cases, generator reuse,
  GPU result extraction, unified API backend selection, error
  handling for CPU backend with GPU buffer request.

- Update CMakeLists.txt for tfd_gpu, tfd libraries and test target.
Stage 4 of TFD implementation:

- Add Boost.Python bindings (nvmolkit/tfd.cpp)
  - Module-level functions: GetTFDMatrix, GetTFDMatrices, GetTFDMatricesGpuBuffer
  - Support for all TFD options: useWeights, maxDev, symmRadius, ignoreColinearBonds
  - Backend selection: 'gpu' or 'cpu'
  - Use toOwnedPyArray pattern for correct PyArray ownership in tuple returns

- Add Python wrapper API (nvmolkit/tfd.py)
  - RDKit-compatible GetTFDMatrix() and GetTFDMatrices()
  - GPU-resident output: GetTFDMatrixGpu(), GetTFDMatricesGpu()
  - TFDGpuResult class for batch GPU results with extraction methods

- Add type stubs (nvmolkit/_TFD.pyi)

- Add comprehensive pytest test suite (nvmolkit/tests/test_tfd.py)
  - Single molecule and batch processing tests
  - GPU/CPU backend consistency tests
  - GPU-resident output tests with torch tensor extraction
  - RDKit comparison tests (symmRadius, ignoreColinearBonds, weighted/unweighted)
  - Edge case and error handling tests

- Update nvmolkit/CMakeLists.txt with _TFD module target
- Update nvmolkit/__init__.py documentation
Extend TFD to correctly handle torsions with multiple quartets:

- Ring torsions (e.g. cyclohexane): average abs(signed dihedral) across
  all ring quartets per conformer, then compare the averages.
- Symmetric torsions (e.g. isopentane CC(C)CC): compute the minimum
  circular difference across all (qi, qj) cross-product pairs.

Key changes:
- Add TorsionType enum (Single, Ring, Symmetric) and quartetStarts CSR
  index to TFDSystemHost/TFDSystemDevice for mapping torsions to their
  quartet ranges in torsionAtoms.
- buildTFDSystem() now stores all quartets (not just the first) and
  classifies each torsion by type. Dihedral storage uses totalQuartets
  stride instead of numTorsions.
- CPU computeTFDPair() implements type-based branching matching RDKit's
  CalculateTFD semantics exactly.
- Single unified GPU tfdMatrixKernel handles all three torsion types
  with per-torsion type dispatch, eliminating the need for separate
  fast-path and multi-quartet kernel paths.
- Transfer new quartetStarts and torsionTypes fields to device.

Tests include hardcoded RDKit reference values for single-quartet,
symmetric, ring, mixed, and AddHs molecules on both CPU and GPU paths.
buildTFDSystem(vector) now builds each molecule in parallel via OpenMP
then merges the per-molecule results, eliminating serial RDKit extraction
as the bottleneck in the GPU path. Extracted sequential single-molecule
logic into static buildTFDSystemImpl and added mergeTFDSystems to stitch
per-molecule CSR arrays with adjusted offsets.
Add ScopedNvtxRange annotations across the TFD pipeline for nsys
profiling:
- tfd_common.cpp: buildTFDSystem, parallel RDKit extraction,
  mergeTFDSystems (cyan)
- tfd_cpu.cpp: GetTFDMatrices, computeDihedralAngles,
  computeTFDMatrix (green)
- tfd_gpu.cpp: GetTFDMatricesGpuBuffer, transferToDevice (yellow),
  kernel launches (orange), extractAll D2H (red)

Link nvtx to tfd_cpu and tfd_gpu CMake targets.

Add benchmarks/tfd_profile.py for use with nsys, supporting CPU, GPU,
and GPU-buffer backends with Python-level nvtx annotations for setup,
warmup, and profiled run phases.
- Remove hasMultiQuartet flag (torsionTypes/quartetStarts encode this)
- Replace molDihedralStarts CSR with single totalDihedrals_ counter
- Compute getBondsForTorsions once in buildTFDSystemImpl and pass to
  extractTorsionListImpl/computeTorsionWeightsImpl to avoid duplicate work
- Optimize Equal maxDevMode path to skip d1/d2 size branching
- Merge parallel-array loops in mergeTFDSystems for fewer passes
- Move computeTFDMatrixFromAngles to private in TFDCpuGenerator
… RDKit

Three correctness bugs in TFD computation:
- getMorganGenerator used wrong params (countSimulation=true, useBondTypes=false);
  fixed to match RDKit defaults (false, true)
- Atom invariants stored as int, causing signed overflow for bitIds > INT_MAX;
  changed to uint32_t to preserve unsigned sort order matching RDKit
- Ring torsion averaging used min(angle, 360-angle) which is incorrect under
  the 180-degree dihedral offset; fixed to abs(angle-180) in CPU and GPU paths

Also consolidates diagnostic tests into existing test suites:
- Torsion list checks merged into ExtractTorsionListRDKitReference
- Weight checks merged into ComputeTorsionWeightsRDKitReference
- End-to-end TFD checks merged into CompareWithRDKitReference
- Complex molecules added to GPU BatchMultipleMolecules test
- Add bool skipGpuWorkItems parameter to buildTFDSystem and
  buildTFDSystemImpl (default: false) to skip building flattened GPU
  kernel work item vectors (dihedralConfIdx/TorsIdx/OutIdx,
  tfdAnglesI/J/TorsStart/NumTorsions/OutIdx) when only the CPU path
  is used
- CPU path passes skipGpuWorkItems=true to avoid unnecessary allocation
…arks

- Add benchmarks/prepare_mols.py to precompute molecules with conformers
  as per-conformer-count pickle files, with multiprocessing and progress
  tracking. Avoids expensive ETKDG embedding during benchmark/profile runs.
- Update tfd_bench.py and tfd_profile.py to load from precomputed pickle
  files (prepared_mols_{N}confs.pkl) when available, falling back to
  generation from scratch with a message indicating which path is used.
Replace 8 per-work-item GPU index arrays with compact per-molecule
MolDescriptor struct (32 bytes/molecule). Extract TorsionType and
MolDescriptor into tfd_types.h for CUDA-safe inclusion.

- dihedralKernel: binary search on cumulative work-item CSR to find
  molecule, then compute (confIdx, quartetIdx, outIdx) arithmetically
- tfdMatrixKernel: one block per molecule with grid-stride loop over
  conformer pairs; eliminates cross-molecule warp divergence
- Zero output buffers before kernel launch to handle no-torsion molecules
- Detach CPU path from TFDSystemHost: uses TorsionList + weights directly
- Add __restrict__ to kernel pointer parameters
- Tighten ethane zero-output assertion in BatchedMultipleMolecules test
…action

- Add return_type parameter to GetTFDMatrix/GetTFDMatrices: 'list' (default,
  RDKit-compatible), 'numpy' (CPU arrays, no Python object overhead), or
  'tensor' (GPU torch.Tensor, no D2H copy)
- GPU Python path now routes through GetTFDMatricesGpuBuffer directly,
  bypassing C++ extractAll and its N vector allocations + float-to-double
- Rename TFDGpuResult.extract_all to to_tensors, add to_numpy and to_lists
- Optimize to_lists: single bulk numpy tolist + list slicing
- Optimize extractAll: separate allocation from conversion, use raw pointers
- Update tfd_bench.py: benchmark all GPU return variants (list/numpy/tensor),
  report speedups vs RDKit only
- Update tfd_profile.py: configurable sweep over mol/conf counts
- Increase verify molecule count from 5 to 50
Regenerate hardcoded TFD reference values for molecules affected by
RDKit 2025 ETKDG changes: CC(=O)[C@H]1CCCC[C@@h]1CN (ring torsions)
and C[C@@h](CN(C)C(=O)C#CCN)C1CC1 (cyclopropane + triple bond).
Also update AddHs reference values for CCCCC and CC(C)CC.
Original 6 simple molecule references are unchanged across versions.
- Remove GetTFDMatricesGpuBuffer from TFDGenerator (tfd.h/tfd.cpp);
  GPU-buffer access goes directly through TFDGpuGenerator
- Simplify Python binding (nvmolkit/tfd.cpp) to two C++ functions:
  GetTFDMatricesCpu and GetTFDMatricesGpuBuffer, each routing directly
  to the corresponding generator without TFDGenerator indirection
- Clean up Python API (nvmolkit/tfd.py): remove GetTFDMatricesGpu,
  GetTFDMatrixGpu, extract_molecule; make TFDGpuResult private;
  GetTFDMatrix is now a wrapper over GetTFDMatrices
- Update profile script: remove gpu_buffer backend, GPU uses tensor output
- Update tests: replace GetTFDMatricesGpu calls with return_type parameter,
  remove GpuBufferWithCPUBackendThrows test, use TFDGpuGenerator directly
…tion

- Add tests/test_tfd.cpp: extract TFDGeneratorTest (UnifiedAPIBackendSelection,
  BatchProcessing, GpuBufferMethodWorks, EmptyInput) from test_tfd_gpu.cpp
- test_tfd_gpu.cpp now links tfd_gpu instead of tfd, contains only
  GPU-specific TFDGpuTest tests
- Add backend string validation in Python tfd.py: raises ValueError
  for invalid backend (previously silently fell through to GPU)
CPU path: remove [list(r) for r in results] that unnecessarily copied
inner lists already produced as Python lists by C++ nestedVectorToList.
GPU path: remove list() wrapping of output_starts and conformer_counts
already returned as Python lists from C++ intVectorToList.
For molecules with near-perfect C2 topological symmetry (e.g. macrocyclic
peptides), the central bond selection can differ from RDKit Python due to
machine-epsilon differences in the STD computation (~4e-16). This affects
weighted TFD for ~0.8% of ChEMBL molecules; unweighted TFD is unaffected.
Both choices are equally valid per the TFD paper.
Support benchmarking directly from precomputed pickle files (e.g. ChEMBL
stratified datasets) via --pkl-file, bypassing SMILES loading and conformer
generation. Conformer count is auto-detected from the loaded molecules.

Add --verify-only to run correctness checks without the full benchmark
sweep, exiting with code 0/1 for CI-friendly pass/fail reporting.
Add NVTX instrumentation to TFD Python bindings and docs for TFD feature

Add NVTX ranges to capture time spent in C++/Python boundary conversions:
- CPU: nestedVectorToList (C++ vectors → boost::python lists)
- GPU: result metadata packaging and per-molecule tensor slicing
Clean up tfd_profile.py to remove diagnostic ranges.

Add TFD to docs/index.rst feature list and docs/api/nvmolkit.rst API
reference (GetTFDMatrix, GetTFDMatrices).
…mentation

Replace nestedVectorToList (per-element boost::python append) with zero-copy
numpy views: C++ results are moved to a PyCapsule-owned heap allocation, and
each inner vector<double> is wrapped as a numpy array via from_data. No data
copies at any step.

Unify CPU/GPU binding patterns: both return buffer + metadata, with symmetric
_extract_cpu_result/_extract_gpu_result in Python. Remove unused
conformer_counts from GPU return tuple.

Add NVTX ranges to tfd.cpp (C++/Python boundary conversions) and tfd.py
(result splitting). Add CPU numpy benchmark to tfd_bench.py and tfd_profile.py.

Add TFD to docs/index.rst feature list and docs/api/nvmolkit.rst API reference.
Replace sequential mergeTFDSystems with a two-pass parallel approach:
Pass 1 extracts per-molecule torsion data concurrently, Pass 2 fills
pre-allocated arrays via memcpy to non-overlapping slices, eliminating
O(N) push_back/insert overhead during merge.

Simplify C++ extension API: remove GetTFDMatrix/GetTFDMatrices from the
C++ layer (now pure Python), rename GetTFDMatricesCpu to
GetTFDMatricesCpuBuffer returning numpy arrays, and drop the redundant
conformer_counts return from GetTFDMatricesGpuBuffer.
@evasnow1992 evasnow1992 requested a review from scal444 April 3, 2026 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant