Draft
Conversation
Implement CPU-based TFD computation as a drop-in replacement for RDKit's TorsionFingerprints.GetTFDMatrix(). This provides the foundational data structures and algorithms that will also be used by the GPU implementation. Key components: - tfd_common.h/cpp: TFD system data structures (TFDSystemHost, TFDSystemDevice) and buildTFDSystem() which extracts torsion definitions, weights, coordinates and precomputes flattened work items for both dihedral and TFD pair computations - tfd_cpu.h/cpp: CPU TFD generator with OpenMP parallelization over conformers and conformer pairs - tfd_detail.h: Shared math utilities (computeDihedralAngle, circularDifference) marked __host__ __device__ for CPU/GPU use - test_tfd_cpu.cpp: Comprehensive tests including unit tests for each stage of the pipeline and end-to-end validation against hardcoded RDKit reference values Torsion extraction replicates RDKit's Python implementation in C++, including the _calculateBeta weight calculation and symmetric bond type classification. Numerical results match RDKit to within 1e-4 tolerance.
Implement GPU-accelerated kernels for the two core TFD operations: computing dihedral angles across all conformer-torsion pairs, and computing pairwise TFD values across all conformer pairs. Key components: - tfd_kernels.h/cu: Two flattened kernels, each using one thread per work item. The dihedral kernel computes one angle per (conformer, torsion) pair. The TFD matrix kernel computes one TFD value per conformer pair (i > j) with a serial loop over torsions. Work item indices are precomputed on the host in buildTFDSystem() to avoid CSR lookups in the kernels. - tfd_transfer.cu: Host-to-device transfer (transferToDevice) and stream management (setStream) for TFDSystemDevice. Placed in the tfd_cpu CMake target which has access to both RDKit headers and CUDA, bridging the compilation boundary. - test_tfd_kernels.cu: GPU kernel tests including basic dihedral verification, GPU-vs-CPU consistency for single and batched molecules, direct comparison against hardcoded RDKit reference values, and edge cases (two-conformer minimum, batch with zero-torsion molecule). TFDKernelParams is minimal (two work item counts). All indexing arrays live in TFDSystemDevice and are passed as raw pointers to kernel launch wrappers.
Stage 3 of TFD implementation: GPU orchestration and unified API. - Add TFDGpuGenerator (tfd_gpu.h/cpp) for GPU-accelerated TFD computation. Uses TFDSystemDevice and transferToDevice() from Stage 2 for host-to-device transfer. Launches dihedral and TFD matrix kernels, returns results as host vectors or GPU-resident TFDGpuResult with extractMolecule/extractAll methods. - Add TFDGenerator unified API (tfd.h/cpp) with automatic backend selection (CPU/GPU) and lazy initialization of backends. - Simplify kernel launch interface: remove TFDKernelParams struct, pass totalWorkItems as plain int to launchDihedralKernel and launchTFDMatrixKernel. - Add comprehensive test suite (test_tfd_gpu.cpp): GPU vs CPU correctness, unweighted mode, batch processing, single conformer and no-torsion edge cases, generator reuse, GPU result extraction, unified API backend selection, error handling for CPU backend with GPU buffer request. - Update CMakeLists.txt for tfd_gpu, tfd libraries and test target.
Stage 4 of TFD implementation: - Add Boost.Python bindings (nvmolkit/tfd.cpp) - Module-level functions: GetTFDMatrix, GetTFDMatrices, GetTFDMatricesGpuBuffer - Support for all TFD options: useWeights, maxDev, symmRadius, ignoreColinearBonds - Backend selection: 'gpu' or 'cpu' - Use toOwnedPyArray pattern for correct PyArray ownership in tuple returns - Add Python wrapper API (nvmolkit/tfd.py) - RDKit-compatible GetTFDMatrix() and GetTFDMatrices() - GPU-resident output: GetTFDMatrixGpu(), GetTFDMatricesGpu() - TFDGpuResult class for batch GPU results with extraction methods - Add type stubs (nvmolkit/_TFD.pyi) - Add comprehensive pytest test suite (nvmolkit/tests/test_tfd.py) - Single molecule and batch processing tests - GPU/CPU backend consistency tests - GPU-resident output tests with torch tensor extraction - RDKit comparison tests (symmRadius, ignoreColinearBonds, weighted/unweighted) - Edge case and error handling tests - Update nvmolkit/CMakeLists.txt with _TFD module target - Update nvmolkit/__init__.py documentation
Extend TFD to correctly handle torsions with multiple quartets: - Ring torsions (e.g. cyclohexane): average abs(signed dihedral) across all ring quartets per conformer, then compare the averages. - Symmetric torsions (e.g. isopentane CC(C)CC): compute the minimum circular difference across all (qi, qj) cross-product pairs. Key changes: - Add TorsionType enum (Single, Ring, Symmetric) and quartetStarts CSR index to TFDSystemHost/TFDSystemDevice for mapping torsions to their quartet ranges in torsionAtoms. - buildTFDSystem() now stores all quartets (not just the first) and classifies each torsion by type. Dihedral storage uses totalQuartets stride instead of numTorsions. - CPU computeTFDPair() implements type-based branching matching RDKit's CalculateTFD semantics exactly. - Single unified GPU tfdMatrixKernel handles all three torsion types with per-torsion type dispatch, eliminating the need for separate fast-path and multi-quartet kernel paths. - Transfer new quartetStarts and torsionTypes fields to device. Tests include hardcoded RDKit reference values for single-quartet, symmetric, ring, mixed, and AddHs molecules on both CPU and GPU paths.
buildTFDSystem(vector) now builds each molecule in parallel via OpenMP then merges the per-molecule results, eliminating serial RDKit extraction as the bottleneck in the GPU path. Extracted sequential single-molecule logic into static buildTFDSystemImpl and added mergeTFDSystems to stitch per-molecule CSR arrays with adjusted offsets.
Add ScopedNvtxRange annotations across the TFD pipeline for nsys profiling: - tfd_common.cpp: buildTFDSystem, parallel RDKit extraction, mergeTFDSystems (cyan) - tfd_cpu.cpp: GetTFDMatrices, computeDihedralAngles, computeTFDMatrix (green) - tfd_gpu.cpp: GetTFDMatricesGpuBuffer, transferToDevice (yellow), kernel launches (orange), extractAll D2H (red) Link nvtx to tfd_cpu and tfd_gpu CMake targets. Add benchmarks/tfd_profile.py for use with nsys, supporting CPU, GPU, and GPU-buffer backends with Python-level nvtx annotations for setup, warmup, and profiled run phases.
- Remove hasMultiQuartet flag (torsionTypes/quartetStarts encode this) - Replace molDihedralStarts CSR with single totalDihedrals_ counter - Compute getBondsForTorsions once in buildTFDSystemImpl and pass to extractTorsionListImpl/computeTorsionWeightsImpl to avoid duplicate work - Optimize Equal maxDevMode path to skip d1/d2 size branching - Merge parallel-array loops in mergeTFDSystems for fewer passes - Move computeTFDMatrixFromAngles to private in TFDCpuGenerator
… RDKit Three correctness bugs in TFD computation: - getMorganGenerator used wrong params (countSimulation=true, useBondTypes=false); fixed to match RDKit defaults (false, true) - Atom invariants stored as int, causing signed overflow for bitIds > INT_MAX; changed to uint32_t to preserve unsigned sort order matching RDKit - Ring torsion averaging used min(angle, 360-angle) which is incorrect under the 180-degree dihedral offset; fixed to abs(angle-180) in CPU and GPU paths Also consolidates diagnostic tests into existing test suites: - Torsion list checks merged into ExtractTorsionListRDKitReference - Weight checks merged into ComputeTorsionWeightsRDKitReference - End-to-end TFD checks merged into CompareWithRDKitReference - Complex molecules added to GPU BatchMultipleMolecules test
- Add bool skipGpuWorkItems parameter to buildTFDSystem and buildTFDSystemImpl (default: false) to skip building flattened GPU kernel work item vectors (dihedralConfIdx/TorsIdx/OutIdx, tfdAnglesI/J/TorsStart/NumTorsions/OutIdx) when only the CPU path is used - CPU path passes skipGpuWorkItems=true to avoid unnecessary allocation
…arks
- Add benchmarks/prepare_mols.py to precompute molecules with conformers
as per-conformer-count pickle files, with multiprocessing and progress
tracking. Avoids expensive ETKDG embedding during benchmark/profile runs.
- Update tfd_bench.py and tfd_profile.py to load from precomputed pickle
files (prepared_mols_{N}confs.pkl) when available, falling back to
generation from scratch with a message indicating which path is used.
Replace 8 per-work-item GPU index arrays with compact per-molecule MolDescriptor struct (32 bytes/molecule). Extract TorsionType and MolDescriptor into tfd_types.h for CUDA-safe inclusion. - dihedralKernel: binary search on cumulative work-item CSR to find molecule, then compute (confIdx, quartetIdx, outIdx) arithmetically - tfdMatrixKernel: one block per molecule with grid-stride loop over conformer pairs; eliminates cross-molecule warp divergence - Zero output buffers before kernel launch to handle no-torsion molecules - Detach CPU path from TFDSystemHost: uses TorsionList + weights directly - Add __restrict__ to kernel pointer parameters - Tighten ethane zero-output assertion in BatchedMultipleMolecules test
…action - Add return_type parameter to GetTFDMatrix/GetTFDMatrices: 'list' (default, RDKit-compatible), 'numpy' (CPU arrays, no Python object overhead), or 'tensor' (GPU torch.Tensor, no D2H copy) - GPU Python path now routes through GetTFDMatricesGpuBuffer directly, bypassing C++ extractAll and its N vector allocations + float-to-double - Rename TFDGpuResult.extract_all to to_tensors, add to_numpy and to_lists - Optimize to_lists: single bulk numpy tolist + list slicing - Optimize extractAll: separate allocation from conversion, use raw pointers - Update tfd_bench.py: benchmark all GPU return variants (list/numpy/tensor), report speedups vs RDKit only - Update tfd_profile.py: configurable sweep over mol/conf counts - Increase verify molecule count from 5 to 50
Regenerate hardcoded TFD reference values for molecules affected by RDKit 2025 ETKDG changes: CC(=O)[C@H]1CCCC[C@@h]1CN (ring torsions) and C[C@@h](CN(C)C(=O)C#CCN)C1CC1 (cyclopropane + triple bond). Also update AddHs reference values for CCCCC and CC(C)CC. Original 6 simple molecule references are unchanged across versions.
- Remove GetTFDMatricesGpuBuffer from TFDGenerator (tfd.h/tfd.cpp); GPU-buffer access goes directly through TFDGpuGenerator - Simplify Python binding (nvmolkit/tfd.cpp) to two C++ functions: GetTFDMatricesCpu and GetTFDMatricesGpuBuffer, each routing directly to the corresponding generator without TFDGenerator indirection - Clean up Python API (nvmolkit/tfd.py): remove GetTFDMatricesGpu, GetTFDMatrixGpu, extract_molecule; make TFDGpuResult private; GetTFDMatrix is now a wrapper over GetTFDMatrices - Update profile script: remove gpu_buffer backend, GPU uses tensor output - Update tests: replace GetTFDMatricesGpu calls with return_type parameter, remove GpuBufferWithCPUBackendThrows test, use TFDGpuGenerator directly
…tion - Add tests/test_tfd.cpp: extract TFDGeneratorTest (UnifiedAPIBackendSelection, BatchProcessing, GpuBufferMethodWorks, EmptyInput) from test_tfd_gpu.cpp - test_tfd_gpu.cpp now links tfd_gpu instead of tfd, contains only GPU-specific TFDGpuTest tests - Add backend string validation in Python tfd.py: raises ValueError for invalid backend (previously silently fell through to GPU)
CPU path: remove [list(r) for r in results] that unnecessarily copied inner lists already produced as Python lists by C++ nestedVectorToList. GPU path: remove list() wrapping of output_starts and conformer_counts already returned as Python lists from C++ intVectorToList.
For molecules with near-perfect C2 topological symmetry (e.g. macrocyclic peptides), the central bond selection can differ from RDKit Python due to machine-epsilon differences in the STD computation (~4e-16). This affects weighted TFD for ~0.8% of ChEMBL molecules; unweighted TFD is unaffected. Both choices are equally valid per the TFD paper.
Support benchmarking directly from precomputed pickle files (e.g. ChEMBL stratified datasets) via --pkl-file, bypassing SMILES loading and conformer generation. Conformer count is auto-detected from the loaded molecules. Add --verify-only to run correctness checks without the full benchmark sweep, exiting with code 0/1 for CI-friendly pass/fail reporting.
Add NVTX instrumentation to TFD Python bindings and docs for TFD feature Add NVTX ranges to capture time spent in C++/Python boundary conversions: - CPU: nestedVectorToList (C++ vectors → boost::python lists) - GPU: result metadata packaging and per-molecule tensor slicing Clean up tfd_profile.py to remove diagnostic ranges. Add TFD to docs/index.rst feature list and docs/api/nvmolkit.rst API reference (GetTFDMatrix, GetTFDMatrices).
…mentation Replace nestedVectorToList (per-element boost::python append) with zero-copy numpy views: C++ results are moved to a PyCapsule-owned heap allocation, and each inner vector<double> is wrapped as a numpy array via from_data. No data copies at any step. Unify CPU/GPU binding patterns: both return buffer + metadata, with symmetric _extract_cpu_result/_extract_gpu_result in Python. Remove unused conformer_counts from GPU return tuple. Add NVTX ranges to tfd.cpp (C++/Python boundary conversions) and tfd.py (result splitting). Add CPU numpy benchmark to tfd_bench.py and tfd_profile.py. Add TFD to docs/index.rst feature list and docs/api/nvmolkit.rst API reference.
Replace sequential mergeTFDSystems with a two-pass parallel approach: Pass 1 extracts per-molecule torsion data concurrently, Pass 2 fills pre-allocated arrays via memcpy to non-overlapping slices, eliminating O(N) push_back/insert overhead during merge. Simplify C++ extension API: remove GetTFDMatrix/GetTFDMatrices from the C++ layer (now pure Python), rename GetTFDMatricesCpu to GetTFDMatricesCpuBuffer returning numpy arrays, and drop the redundant conformer_counts return from GetTFDMatricesGpuBuffer.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.