29 commits
418ff93
add simple pipeline
drawadiagram Sep 17, 2025
b05f5f2
update pipeline
drawadiagram Jan 15, 2026
5c396b6
remove file
drawadiagram Jan 15, 2026
1523b8f
update run script
drawadiagram Jan 15, 2026
08b86a3
example input
drawadiagram Jan 15, 2026
9e2c122
mpnn thru sc filtering
drawadiagram Jan 20, 2026
f8bc347
update
drawadiagram Jan 30, 2026
3039176
rfd thru af2
drawadiagram Feb 3, 2026
d1b9fc4
rfd working
drawadiagram Feb 6, 2026
9a28d76
env stuff
drawadiagram Feb 20, 2026
0b15c77
Implement adaptive routing in SmallMoleculeBindingPipeline
drawadiagram Feb 27, 2026
d74edf5
rework adaptivity
drawadiagram Mar 8, 2026
743655b
pre exec bug
drawadiagram Mar 11, 2026
1a2f57a
fixes, inc fr metric and threshold
drawadiagram Mar 11, 2026
abd5ca3
fix filename length accumulation
drawadiagram Mar 12, 2026
97c508b
move mocks out
drawadiagram Mar 12, 2026
9ab7a4c
mock and alr
drawadiagram Mar 12, 2026
ab2d53b
name thing again
drawadiagram Mar 12, 2026
642fa98
revise interface bb adaptivity
drawadiagram Mar 12, 2026
d474404
fix af2 forward pass
drawadiagram Mar 12, 2026
a8d11ca
rmv preexec, add scripts and claudemd
drawadiagram Apr 2, 2026
b23272b
rmv None taskdescs
drawadiagram Apr 2, 2026
3d8e0ca
fix wf inputs
drawadiagram Apr 2, 2026
c57840f
update readme
drawadiagram Apr 2, 2026
68c7416
debugging fastrelax
drawadiagram Apr 3, 2026
3390b61
fix fixedres, use localbackend
drawadiagram Apr 6, 2026
87aa6ff
claudemereadme
drawadiagram Apr 6, 2026
4a1bd5c
env reqs
drawadiagram Apr 7, 2026
7015b73
rfd batch size
drawadiagram Apr 7, 2026
181 changes: 181 additions & 0 deletions examples/small_molecule_binding/CLAUDE.md
@@ -0,0 +1,181 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Revision History

| Date | Commit | Notes |
|---|---|---|
| 2026-04-06 | 3390b61 | change log added |

## Context

This directory is an **example workflow** within the larger [IMPRESS framework](https://github.com/radical-collaboration/IMPRESS) (Integrated Machine-learning for PRotEin Structures at Scale). IMPRESS is an HPC framework for protein inverse design using Foundation Models.

The framework package lives at `../../` (two levels up). Install it with:
```shell
cd ../../
pip install .
```

## Running the Pipeline

```shell
python run_small_molecule_binding.py
```

Before running on HPC, edit the path constants at the top of `run_small_molecule_binding.py` and the `__init__` kwargs in `SmallMoleculeBindingPipeline` (`foundry_sif_path`, `colabfold_path`, `mpnn_dir`, `ligand_params`, etc.) to match the target system.

## Architecture

### Two-file structure

- **`small_molecule_binding.py`** — defines `SmallMoleculeBindingPipeline(ImpressBasePipeline)`, all step constants, ensemble utility functions (`_ca_rmsd`, `_seq_identity`, `_ensemble_selective_avg`), and the inner `_run_refine_cycle()` loop. All pipeline tasks (HPC and local analysis) are registered via `@self.auto_register_task()` inside `_register_real_tasks()`. The `run()` method drives a state-machine loop; `_run_refine_cycle()` handles the MPNN+PackMin inner loop with per-cycle sequence retry support.

- **`run_small_molecule_binding.py`** — entry point. Sets threshold constants, defines the `adaptive_decision()` callback, creates an `ImpressManager`, and launches via `manager.start(pipeline_setups=[...])`.

### Step constants (state-machine constants in `small_molecule_binding.py`)

| Constant | Value | Meaning |
|---|---|---|
| `STEP_DONE` | 0 | pipeline complete |
| `STEP_RFD3` | 1 | backbone diffusion |
| `STEP_MPNN` | 2 | MPNN + PackMin refinement cycle |
| `STEP_FASTRELAX` | 3 | Rosetta FastRelax |
| `STEP_INTERFACE` | 4 | filter_shape (PyRosetta, gates AF2) |
| `STEP_AF2` | 5 | fold prediction |
| `STEP_RETRY_SEQ` | 6 | internal: retry sequence prediction without backbone restart |
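
The table above can be written down as code. This is a minimal sketch: the values are copied from the table, but the use of `IntEnum` and the class name `Step` are assumptions for illustration (the source may simply define plain integers).

```python
from enum import IntEnum


class Step(IntEnum):
    """State-machine steps; values copied from the table above."""
    DONE = 0
    RFD3 = 1
    MPNN = 2
    FASTRELAX = 3
    INTERFACE = 4
    AF2 = 5
    RETRY_SEQ = 6  # internal: retry sequence prediction, keep the backbone


# Module-level aliases matching the constant names used in this document.
STEP_DONE, STEP_RFD3, STEP_MPNN = Step.DONE, Step.RFD3, Step.MPNN
STEP_FASTRELAX, STEP_INTERFACE = Step.FASTRELAX, Step.INTERFACE
STEP_AF2, STEP_RETRY_SEQ = Step.AF2, Step.RETRY_SEQ
```

Because `IntEnum` members compare equal to plain integers, code that checks `next_step != 0` would keep working under this sketch.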

### Pipeline tasks and scripts

| Task (registered name) | Type | Script / Tool | Resource |
|---|---|---|---|
| `rfd3` | HPC | `scripts/rfd3.sh` (RFDiffusion3 via `apptainer exec`) | GPU |
| `analysis_backbone` | local | reads JSON metrics from `rfd3` output dir | CPU |
| `mpnn` | HPC | `scripts/mpnn.sh` → `scripts/mpnn_wrapper.sh` (LigandMPNN) | CPU |
| `analysis_sequence` | local | reads `.fa` headers from MPNN `seqs/` output | CPU |
| `packmin` | HPC | `scripts/packmin.sh` → `scripts/packmin.py` (PyRosetta pack+minimize) | CPU |
| `analysis_packmin` | local | reads `_packmin_score.json` from packmin output | CPU |
| `fastrelax` | HPC | `scripts/fastrelax.sh` → `scripts/fastrelax.py` (Rosetta FastRelax) | CPU |
| `analysis_fastrelax` | local | reads `.fasc` score file from fastrelax output | CPU |
| `filter_shape` | HPC | `scripts/filter_shape.sh` → `scripts/filter_shape.py` (PyRosetta shape complementarity) | CPU |
| `analysis_interface` | local | reads `shape_complementarity_values.txt` | CPU |
| `af2` | HPC | `scripts/af2.sh` (ColabFold/LocalColabFold) | GPU |
| `analysis_fold` | local | reads ColabFold `_scores.json` files | CPU |
| `filter_energy` | HPC | `scripts/filter_energy.sh` → `scripts/filter_energy.py` (ligand energy filter) | CPU |
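
As an illustration of what a local analysis task does, the sketch below reads ColabFold score files and applies the pLDDT threshold. It is not the repository's `analysis_fold`: the function names, the `*scores*.json` glob, the choice to keep the single best model, and the assumption that each file stores per-residue values under a `plddt` key are all illustrative.

```python
import json
from pathlib import Path


def mean_plddt(scores_json):
    """Mean of the per-residue pLDDT list stored in one scores file."""
    data = json.loads(Path(scores_json).read_text())
    plddt = data["plddt"]  # assumed key: list of per-residue confidence values
    return sum(plddt) / len(plddt)


def best_fold(out_dir, fold_min_plddt=70.0):
    """Return (best_path, best_mean_plddt, passed) over all *scores*.json files."""
    scored = [(mean_plddt(p), p) for p in sorted(Path(out_dir).glob("*scores*.json"))]
    if not scored:
        # No score files found: report a failure rather than raising.
        return None, 0.0, False
    best_score, best_path = max(scored, key=lambda item: item[0])
    return best_path, best_score, best_score >= fold_min_plddt
```

The other analysis tasks in the table follow the same read-then-score pattern against their own output files.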

### State-machine execution flow

The pipeline runs as a `while self.next_step != STEP_DONE` loop. After each stage, `run_adaptive_step()` calls `adaptive_decision()` to set `pipeline.next_step`.

```
STEP_RFD3 → analysis_backbone → adaptive_decision()
STEP_MPNN → _run_refine_cycle():
  for each cycle:
    mpnn → analysis_sequence → adaptive_decision()
    packmin → analysis_packmin → adaptive_decision()
STEP_FASTRELAX → analysis_fastrelax → adaptive_decision()
STEP_INTERFACE → analysis_interface → adaptive_decision()
STEP_AF2 → analysis_fold → adaptive_decision()
```

After the fold stage, `adaptive_decision()` always returns to `STEP_RFD3` for the next backbone generation, whether the fold passed or failed; only the value of `rfd3_input_pdb` differs. The pipeline terminates when `max_tasks` ensemble entries have been accumulated or `STEP_DONE` is set.
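
The loop described above can be sketched in a few lines. This is a toy driver, not the pipeline's `run()` method: `run_state_machine`, the `stages` dict, and the `decide` callback (standing in for `adaptive_decision()`) are hypothetical names, and the iteration cap exists only to keep the sketch from looping forever.

```python
STEP_DONE, STEP_RFD3, STEP_MPNN, STEP_FASTRELAX, STEP_INTERFACE, STEP_AF2 = range(6)


def run_state_machine(stages, decide, first_step=STEP_RFD3, max_iterations=100):
    """Run stages until decide() returns STEP_DONE; return the visited steps.

    stages: dict mapping a step constant to a callable returning a metrics dict.
    decide: callable (step, metrics) -> next step.
    """
    next_step = first_step
    trace = []
    for _ in range(max_iterations):  # safety cap for this sketch only
        if next_step == STEP_DONE:
            break
        metrics = stages[next_step]()
        trace.append(next_step)
        next_step = decide(next_step, metrics)
    return trace


# Toy run: every stage passes and the decision stops after the first fold.
ORDER = [STEP_RFD3, STEP_MPNN, STEP_FASTRELAX, STEP_INTERFACE, STEP_AF2]
toy_stages = {step: (lambda: {"pass": True}) for step in ORDER}


def toy_decide(step, metrics):
    return STEP_DONE if step == STEP_AF2 else ORDER[ORDER.index(step) + 1]


trace = run_state_machine(toy_stages, toy_decide)
print(trace)  # [1, 2, 3, 4, 5]
```

Keeping the routing in a callback is what lets the entry-point script change the adaptive policy without touching the stage implementations.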

### MPNN + PackMin inner refinement cycle

`_run_refine_cycle()` runs `num_refine_cycles` (default 3) iterations of MPNN→PackMin:
- **Cycle 0**: MPNN generates `mpnn_ensemble_size` (default 10) sequence candidates from `best_backbone_path`.
- **Cycles 1+**: MPNN generates a single candidate from the current `best_packed_pdb`.
- PackMin is skipped on the last cycle; the best-scoring packed PDB advances to FastRelax.
- If `analysis_sequence` triggers `STEP_RETRY_SEQ`, MPNN is re-run for the same cycle (up to 3 retries before escalating to `STEP_RFD3`).
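
The per-cycle rules above can be summarized as a small planning function. This is a sketch under the stated defaults; `refine_plan` is a hypothetical helper, and the retry handling for `STEP_RETRY_SEQ` is deliberately left out.

```python
def refine_plan(num_refine_cycles=3, mpnn_ensemble_size=10):
    """Return one (cycle, mpnn_input, n_candidates, run_packmin) tuple per cycle."""
    plan = []
    for cycle in range(num_refine_cycles):
        first = cycle == 0
        last = cycle == num_refine_cycles - 1
        plan.append((
            cycle,
            # Cycle 0 starts from the backbone; later cycles reuse the packed PDB.
            "best_backbone_path" if first else "best_packed_pdb",
            # Only cycle 0 generates a full ensemble of candidates.
            mpnn_ensemble_size if first else 1,
            # PackMin is skipped on the last cycle.
            not last,
        ))
    return plan


for row in refine_plan():
    print(row)
```

With the defaults this prints three rows: cycle 0 with 10 candidates and PackMin, cycle 1 with 1 candidate and PackMin, and cycle 2 with 1 candidate and no PackMin.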

### Adaptive decision logic

`adaptive_decision()` in `run_small_molecule_binding.py` uses ensemble history and pairwise similarity to decide next steps:

| Stage | Pass condition | Pass action | Fail action |
|---|---|---|---|
| `backbone` | no ligand clashes, `max_ca_deviation < threshold`, sufficient secondary structure | `STEP_MPNN` (with ensemble similarity gating) | `STEP_RFD3` |
| `sequence` | ensemble similarity check (sequence identity) | `STEP_MPNN` | `STEP_RETRY_SEQ` (up to 3x), then `STEP_RFD3` |
| `packmin` | always passes | `STEP_MPNN` | — |
| `fastrelax` | interaction energy, total score, fa_rep below thresholds | `STEP_INTERFACE` | `STEP_MPNN` |
| `interface` | shape complementarity `max_sc >= interface_min_sc` | `STEP_AF2` | `STEP_MPNN` (up to 5x), then `STEP_RFD3` |
| `fold` | mean pLDDT `>= fold_min_plddt` | sets `rfd3_input_pdb` for guided backbone → `STEP_RFD3` | clears `rfd3_input_pdb` → `STEP_RFD3` |
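
The routing in the table can be sketched as a single function. This is not the repository's `adaptive_decision()`: it ignores the ensemble similarity gating and the `rfd3_input_pdb` handling, the function name `next_step` is hypothetical, and the exact off-by-one behavior of the retry counters (whether the limit is reached on the 3rd or 4th failure, for example) is an assumption.

```python
STEP_RFD3, STEP_MPNN, STEP_FASTRELAX, STEP_INTERFACE, STEP_AF2, STEP_RETRY_SEQ = 1, 2, 3, 4, 5, 6
MAX_SEQ_RETRIES = 3       # "up to 3x" in the table
MAX_INTERFACE_FAILS = 5   # "up to 5x" in the table


def _count_failure(state, key, limit):
    """Increment state[key]; return True while retries remain, else reset and return False."""
    state[key] = state.get(key, 0) + 1
    if state[key] <= limit:  # assumed boundary: `limit` retries are allowed
        return True
    state[key] = 0
    return False


def next_step(stage, passed, state):
    """Map (stage, pass/fail) to the next step, updating retry counters in state."""
    if stage == "backbone":
        return STEP_MPNN if passed else STEP_RFD3
    if stage == "sequence":
        if passed:
            state["seq_retry_count"] = 0
            return STEP_MPNN
        retry = _count_failure(state, "seq_retry_count", MAX_SEQ_RETRIES)
        return STEP_RETRY_SEQ if retry else STEP_RFD3
    if stage == "packmin":
        return STEP_MPNN  # always passes
    if stage == "fastrelax":
        return STEP_INTERFACE if passed else STEP_MPNN
    if stage == "interface":
        if passed:
            state["interface_fail_count"] = 0
            return STEP_AF2
        retry = _count_failure(state, "interface_fail_count", MAX_INTERFACE_FAILS)
        return STEP_MPNN if retry else STEP_RFD3
    if stage == "fold":
        return STEP_RFD3  # both outcomes restart backbone generation
    raise ValueError(f"unknown stage: {stage!r}")
```

For example, four consecutive sequence failures on one backbone yield `STEP_RETRY_SEQ` three times and then `STEP_RFD3`.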

### Ensemble-guided backbone feedback

After a successful fold prediction, `adaptive_decision()` computes CA-RMSD between the current AF2 model and all prior fold ensemble entries. If the selective average score (for structurally similar models) exceeds the overall average, the current AF2 model is fed back as `rfd3_input_pdb` for the next RFDiffusion run (`scaffoldguided.target_pdb`), biasing the next backbone toward successful structural motifs.
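
For readers unfamiliar with the CA-RMSD comparison used here, the sketch below shows a standard Kabsch superposition with NumPy. It operates on coordinate arrays rather than PDB files, so extracting CA atoms is left out; `kabsch_rmsd` is an illustrative name, and the repository's `_ca_rmsd` may differ in detail.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("expected two arrays of identical shape (N, 3)")
    # Remove translation by centering both coordinate sets.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # SVD of the covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    # Rows of P are rotated onto Q; RMSD is taken over the N matched atoms.
    return float(np.sqrt(np.mean(np.sum((P @ R - Q) ** 2, axis=1))))
```

A rotated and translated copy of the same coordinates gives an RMSD of approximately zero, which is why the measure reflects structural difference rather than placement in space.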

Ensemble similarity utilities (all in `small_molecule_binding.py`):
- `_ca_rmsd(path1, path2)` — Kabsch-aligned CA RMSD between two PDB files
- `_seq_identity(fasta1, fasta2)` — fraction matching residues over shorter sequence
- `_ensemble_selective_avg(current, prior, sim_fn, similar_if_low)` — returns `(overall_avg, selective_avg, has_data)` for scores of entries whose similarity is on the "similar" side of the mean pairwise similarity
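
Two of these utilities are simple enough to sketch in pure Python. The versions below are simplified readings of the descriptions above, not the repository code: they take sequence strings and `(item, score)` pairs instead of file paths and ensemble tuples, and "mean pairwise similarity" is interpreted as the mean similarity between the current item and each prior item.

```python
def seq_identity(seq1, seq2):
    """Fraction of matching positions over the shorter of two sequences."""
    n = min(len(seq1), len(seq2))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(seq1, seq2)) / n


def ensemble_selective_avg(current, prior, sim_fn, similar_if_low):
    """Return (overall_avg, selective_avg, has_data).

    prior: list of (item, score) pairs; sim_fn(current, item) -> float.
    similar_if_low: True for distance-like measures (RMSD), False for identity.
    """
    if not prior:
        return 0.0, 0.0, False
    sims = [sim_fn(current, item) for item, _ in prior]
    scores = [score for _, score in prior]
    mean_sim = sum(sims) / len(sims)
    if similar_if_low:
        chosen = [s for sim, s in zip(sims, scores) if sim <= mean_sim]
    else:
        chosen = [s for sim, s in zip(sims, scores) if sim >= mean_sim]
    # `chosen` is never empty: at least one value lies on each side of the mean.
    overall = sum(scores) / len(scores)
    selective = sum(chosen) / len(chosen)
    return overall, selective, True


prior = [("AAAA", 80.0), ("AAAC", 90.0), ("CCCC", 50.0)]
overall, selective, has_data = ensemble_selective_avg(
    "AAAA", prior, seq_identity, similar_if_low=False
)
```

In this example the two sequences most similar to `"AAAA"` carry the higher scores, so the selective average exceeds the overall average.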

### Quality thresholds (configurable)

| Kwarg | Default | Metric |
|---|---|---|
| `backbone_max_ca_deviation` | 2.0 | max CA deviation (Å) from target |
| `backbone_min_ss_fraction` | 0.2 | minimum helix+sheet fraction |
| `fastrelax_max_interact` | 0.0 | interaction energy (REU) |
| `fastrelax_max_total_score` | 0.0 | total Rosetta score (REU) |
| `fastrelax_max_fa_rep` | 150.0 | fa_rep repulsion energy (REU) |
| `interface_min_sc` | 0.5 | minimum shape complementarity score |
| `fold_min_plddt` | 70.0 | minimum mean pLDDT |
| `max_tasks` | 300 | maximum ensemble entries before stopping |

Threshold constants in `run_small_molecule_binding.py` override these defaults at `PipelineSetup` construction.
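
One way to picture the defaults and an override is a frozen dataclass. This is a sketch: the field names and defaults come from the table above, while the `Thresholds` class, the `fastrelax_passes` helper, the metric key names, and the strict `<` comparison are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Defaults copied from the table above."""
    backbone_max_ca_deviation: float = 2.0   # Å
    backbone_min_ss_fraction: float = 0.2
    fastrelax_max_interact: float = 0.0      # REU
    fastrelax_max_total_score: float = 0.0   # REU
    fastrelax_max_fa_rep: float = 150.0      # REU
    interface_min_sc: float = 0.5
    fold_min_plddt: float = 70.0
    max_tasks: int = 300


def fastrelax_passes(metrics, t=Thresholds()):
    """All three energies must fall below their maxima (strict '<' is assumed)."""
    return (
        metrics["interact"] < t.fastrelax_max_interact
        and metrics["total_score"] < t.fastrelax_max_total_score
        and metrics["fa_rep"] < t.fastrelax_max_fa_rep
    )


# Overriding one default, as the entry-point script does for its own constants.
strict = Thresholds(fold_min_plddt=80.0)
```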

### Output directory structure

Each HPC task creates its working directory as `{base_path}/{taskcount}_{taskname}/in` and `.../out`. `taskcount` is a flat integer incremented for every HPC task (local analysis tasks do not increment it).

```
{base_path}/
  {name}_in/              # pipeline inputs (ALR_binder_design.json, ligand .params, etc.)
  1_rfd3/out/             # RFDiffusion3 outputs (.cif.gz + .json per model)
  2_mpnn/out/             # LigandMPNN outputs (seqs/*.fa, packed/*.pdb)
  3_packmin/out/          # packed+minimized PDB + _packmin_score.json
  4_mpnn/out/             # cycle 1 MPNN ...
  ...
  N_fastrelax/out/        # FastRelax PDB + .fasc score file
  N+1_filter_shape/out/
  N+2_alphafold/out/
```
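
The numbering scheme can be sketched with a small allocator. `TaskDirs` is a hypothetical helper written for this illustration; only the `{taskcount}_{taskname}/in` and `/out` layout and the flat counter come from the description above.

```python
from pathlib import Path


class TaskDirs:
    """Allocate numbered in/out directories for HPC tasks."""

    def __init__(self, base_path):
        self.base_path = Path(base_path)
        self.taskcount = 0  # flat counter; local analysis tasks never touch it

    def allocate(self, taskname):
        """Increment the counter and create {taskcount}_{taskname}/in and /out."""
        self.taskcount += 1
        task_dir = self.base_path / f"{self.taskcount}_{taskname}"
        (task_dir / "in").mkdir(parents=True, exist_ok=True)
        (task_dir / "out").mkdir(parents=True, exist_ok=True)
        return task_dir
```

Allocating `rfd3`, `mpnn`, `packmin`, and `mpnn` in that order produces `1_rfd3`, `2_mpnn`, `3_packmin`, and `4_mpnn`, matching the start of the tree above.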

MPNN copies the input backbone to a short fixed filename (`binder.cif.gz` or `binder.<ext>`) in `{taskdir}/in/` each cycle to avoid 255-character filename limits in AF2 result archives.
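
The renaming step might look like the following sketch. Both function names are hypothetical; the only behavior taken from the description is that `.cif.gz` inputs become `binder.cif.gz` and other inputs become `binder.<ext>`.

```python
import shutil
from pathlib import Path


def short_input_name(source):
    """Return 'binder.cif.gz' for .cif.gz inputs, else 'binder' plus the last suffix."""
    name = Path(source).name
    if name.endswith(".cif.gz"):
        return "binder.cif.gz"
    return "binder" + Path(name).suffix


def stage_backbone(source, task_in_dir):
    """Copy source into task_in_dir under the short fixed name; return the new path."""
    dest = Path(task_in_dir) / short_input_name(source)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, dest)
    return dest
```

Resetting the name each cycle keeps downstream output names from growing as each tool appends its own suffixes.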

### Inter-step state passing

Steps communicate via `self.state`:

**Set by HPC task wrappers / local analysis tasks:**
- `best_backbone_path` — path to best `.cif.gz` from `rfd3` (set by `analysis_backbone`)
- `best_packed_pdb` — path to best packed PDB (set by `analysis_sequence`, updated by `packmin`)
- `last_seq_fasta` — path to best FASTA from MPNN (set by `analysis_sequence`)
- `best_af2_model` — path to best AF2 PDB (set by `analysis_fold`)
- `last_analysis_step` — `'backbone'` / `'sequence'` / `'packmin'` / `'fastrelax'` / `'interface'` / `'fold'`
- `last_analysis_metrics` — dict with `pass` bool and step-specific score fields
- `ensemble` — list of `(etype, score, input_path, output_path)` tuples
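
The `ensemble` tuples can be given named fields for readability. This is a sketch only: the field order follows the bullet above, while `EnsembleEntry`, `record`, `reached_max_tasks`, and the `"fold"` label used in the example are illustrative assumptions.

```python
from typing import NamedTuple


class EnsembleEntry(NamedTuple):
    etype: str        # entry type; "fold" below is an illustrative label
    score: float
    input_path: str
    output_path: str


def record(state, etype, score, input_path, output_path):
    """Append one entry to state['ensemble'] and return the new ensemble size."""
    state.setdefault("ensemble", []).append(
        EnsembleEntry(etype, score, input_path, output_path)
    )
    return len(state["ensemble"])


def reached_max_tasks(state, max_tasks=300):
    """True once the ensemble holds max_tasks entries (the stop condition)."""
    return len(state.get("ensemble", [])) >= max_tasks
```

A `NamedTuple` still compares equal to a plain tuple, so code that unpacks `(etype, score, input_path, output_path)` positionally is unaffected.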

**Set by `adaptive_decision`:**
- `rfd3_input_pdb` — if set, passed to `rfd3` as `scaffoldguided.target_pdb` for guided diffusion
- `seq_retry_count` — retry counter for sequence stage (reset on new backbone or successful sequence)
- `interface_fail_count` — retry counter for interface stage (reset on pass or after 5 failures)

**Set at run start (`setdefault`):**
- `ensemble` — initialized to `[]`
- `rfd3_input_pdb` — initialized to `None`
- `seq_retry_count` — initialized to `0`
- `last_seq_fasta` — initialized to `None`
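
The run-start initialization listed above amounts to a handful of `dict.setdefault` calls. The sketch assumes `self.state` behaves like a plain dict; `init_state` is a hypothetical wrapper.

```python
def init_state(state):
    """Apply run-start defaults without overwriting values that are already set."""
    state.setdefault("ensemble", [])
    state.setdefault("rfd3_input_pdb", None)
    state.setdefault("seq_retry_count", 0)
    state.setdefault("last_seq_fasta", None)
    return state


state = init_state({"seq_retry_count": 2})
print(state["seq_retry_count"])  # 2
```

Using `setdefault` rather than plain assignment means any value already present in the state survives initialization.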

### Execution backends

`run_small_molecule_binding.py` has `LocalExecutionBackend(ProcessPoolExecutor())` active by default. `DragonExecutionBackendV3()` is commented out — swap it in for HPC production runs.

### Pipeline inputs

Each pipeline instance (named e.g. `p1`) expects a `{name}_in/` directory containing:
- `ALR_binder_design.json` — RFDiffusion3 input spec (contig, ligand, scaffold args)
- `<ligand_name>.params` — Rosetta ligand params file (default `ALR.params`)
- Optionally `common_filenames.txt` — used by `filter_energy` for cross-filtering