┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ │
│ ██████╗ ███████╗ █████╗ ██╗ ███╗ ███╗ │
│ ██╔══██╗██╔════╝██╔══██╗██║ ████╗ ████║ │
│ ██████╔╝███████╗███████║██║ ██╔████╔██║ │
│ ██╔═══╝ ╚════██║██╔══██║██║ ██║╚██╔╝██║ │
│ ██║ ███████║██║ ██║███████╗██║ ╚═╝ ██║ │
│ ╚═╝ ╚══════╝╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝ │
│ Protein Sequence Annotation using a Language Model │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Persistent session mode (load model once, scan many times):
psalm -d auto
# inside shell:
# scan -f path/to/seqs.fa
# scan --sort -f path/to/seqs.fa -c 4 --to-tsv hits.tsv
# scan -s "MSTNPKPQR..."
# quit
Quick usage:
psalm-scan -f path/to/your_sequence.fasta
CLI behavior notes:
- Default model: `ProteinSequenceAnnotation/PSALM-2`
- Default device: `auto` (cuda -> mps -> cpu)
- FASTA scans use fast batched scanning by default
  - `--serial` restores the legacy serial FASTA behavior
  - `--sort` remains opt-in
- `-c`/`--cpu-workers` is the number of fast-mode CPU decode helper processes
  - default behavior is equivalent to `-c 0`
  - if the interactive shell already has warmed workers, later default fast scans reuse that pool
- `--max-batch-size` controls the fast-mode embedding batch budget in tokens/amino acids
- `--max-queue-size` controls the fast-mode decode queue in sequences
  - default: 128
- `-q`/`--quiet` suppresses scan result output only; startup/status still prints
- `--to-tsv` and `--to-txt` work for single- or multi-sequence FASTA
- `-v`/`--verbose` enables detailed alignment and model tables
  - verbose FASTA scans use the serial path
  - without `-v`, PSALM prints the compact HITS report
- `-T` keeps domains with `Score >= threshold` (default: `0.5`)
- `-E` keeps domains with `E-value <= threshold` (default: `0.1`)
- `-Z` sets dataset size for E-value scaling
  - if omitted for `-s`: `Z = 1`
  - if omitted for `-f`: `Z` = number of sequences in the FASTA
- `--to-tsv` is the supported machine-readable output format
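The `-T`/`-E` thresholds above amount to a simple per-domain filter. A minimal sketch (the hit records here are hypothetical dicts keyed like the TSV columns; PSALM's internal representation may differ):

```python
# Sketch of the -T / -E domain filtering described above.
# Hit records are hypothetical dicts mirroring the TSV column names.

def filter_hits(hits, score_threshold=0.5, evalue_threshold=0.1):
    """Keep domains with Score >= -T and E-value <= -E."""
    return [
        h for h in hits
        if h["Score"] >= score_threshold and h["E-value"] <= evalue_threshold
    ]

hits = [
    {"Pfam": "PF00069", "Score": 0.92, "E-value": 1e-5},  # kept
    {"Pfam": "PF00400", "Score": 0.30, "E-value": 1e-3},  # dropped: low score
    {"Pfam": "PF07714", "Score": 0.80, "E-value": 0.5},   # dropped: high E-value
]
print([h["Pfam"] for h in filter_hits(hits)])  # ['PF00069']
```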
Common shell usage:
psalm
scan --sort -f path/to/seqs.fa --to-tsv hits.tsv
Useful output modes:
# compact terminal report + TSV
scan -f path/to/seqs.fa --to-tsv hits.tsv
# quiet mode: TSV output only
scan -q --sort -f path/to/seqs.fa --to-tsv hits.tsv
# verbose per-domain output
scan -v -f path/to/seqs.fa
For the full option set, run psalm --help, psalm-scan --help, or scan --help.
Create a fresh Python 3.10 environment, install PyTorch for your hardware, then install PSALM.
conda create -n psalm python=3.10 -y
conda activate psalm
python -m pip install --upgrade pip
# 1) Install PyTorch for your hardware
# Apple Silicon (MPS):
python -m pip install torch
# CPU-only (Linux/Windows):
# python -m pip install torch
# NVIDIA CUDA 12.1:
# python -m pip install --index-url https://download.pytorch.org/whl/cu121 \
# torch
# 2) Install PSALM
python -m pip install protein-sequence-annotation==2.1.9
If you are unsure which PyTorch command matches your GPU/driver, use the official selector: https://pytorch.org/get-started/locally/
Intel Mac (x86_64) tested path:
conda create -n psalm python=3.10 -y
conda activate psalm
conda install -y -c conda-forge "llvmlite=0.44.*" "numba=0.61.*"
conda install -y -c conda-forge "pytorch=2.5" torchvision torchaudio
python -m pip install protein-sequence-annotation==2.1.9
Optional: run without activating conda manually:
conda run -n psalm psalm-scan -f path/to/seqs.fa
from psalm.psalm_model import PSALM
psalm = PSALM(model_name="ProteinSequenceAnnotation/PSALM-2")
# Scan FASTA
results = psalm.scan(fasta="path/to/your_sequence.fasta")
print(results)
# Scan sequence string
results = psalm.scan(sequence="MSTNPKPQR...AA")

Output options:
- `to_tsv="results.tsv"` writes: `Sequence, E-value, Score, Pfam, Start, Stop, Model, Len Frac, Status`
- `to_txt="results.txt"` saves console-style output
- For multi-sequence FASTA, TSV rows are combined with the query id in the `Sequence` column
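Because `pandas` is already a core dependency, a TSV written with `to_tsv` can be post-processed directly. A minimal sketch with made-up rows following the column schema above:

```python
# Post-process scan TSV output with pandas (already a core dependency).
# Column names follow the to_tsv schema above; the rows here are invented.
import io
import pandas as pd

tsv = io.StringIO(
    "Sequence\tE-value\tScore\tPfam\tStart\tStop\tModel\tLen Frac\tStatus\n"
    "seq1\t1e-06\t0.93\tPF00069\t10\t270\tPSALM-2\t0.95\tOK\n"
    "seq1\t0.40\t0.22\tPF00400\t5\t60\tPSALM-2\t0.30\tOK\n"
    "seq2\t1e-03\t0.81\tPF07714\t15\t280\tPSALM-2\t0.90\tOK\n"
)
df = pd.read_csv(tsv, sep="\t")  # the same call works on a real hits.tsv path

# Keep confident calls and list Pfam hits per query sequence
confident = df[(df["Score"] >= 0.5) & (df["E-value"] <= 0.1)]
print(confident.groupby("Sequence")["Pfam"].apply(list).to_dict())
# {'seq1': ['PF00069'], 'seq2': ['PF07714']}
```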
The core workflow is:
- `scripts/data/augment_fasta.py` → slice sequences and generate augmented FASTA + domain dict
- `scripts/data/data_processing.py` → tokenize, label, batch, and shard datasets
- `scripts/train/train_psalm.py` → train/evaluate the PSALM model on shards
Splits long sequences into domain-preserving slices and optionally emits shuffled and negative variants. Produces a new FASTA and a new domain dict with aligned IDs.
Key inputs
- `--fasta`, `--domain-dict`
- `--output-fasta`, `--output-dict`
Common flags
- `--max-length`: slice length threshold
- `--negative-prob`: target fraction of negatives (approximate)
- `--include-domain-slices`, `--shuffle-only`, `--no-shuffle`, `--domain-slices-only`
- `--large-data` with `--p-shuffled`, `--domain-counts-tsv`, `--domain-slice-frac`
- `--seed`, `--verbose`
Tokenizes sequences, generates per-token labels from the domain dict and label mapping, batches by token budget, and saves shards.
Config handling
- This script is CLI-only; it does not read `config.yaml`.
Required args
- `--fasta`, `--domain-dict`, `--output-dir`, `--ignore-label`
- `--model-name`, `--max-length`, `--max-tokens-per-batch`
- `--label-mapping-dict`
Optional args
- `--chunk-size`, `--tmp-dir`, `--shard-size`, `--seed`, `--keep-tmp`
Notes
- ID normalization uses the FASTA header segment between `>` and the first space.
- `--ignore-label` must match the training `--ignore-label`.
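The normalization rule above can be sketched in a few lines (illustrative only; this is not the package's actual helper function):

```python
# Sketch of the ID normalization rule: take the FASTA header segment
# between '>' and the first space.
def normalize_id(header: str) -> str:
    return header.lstrip(">").split()[0]

print(normalize_id(">sp|P12345|EXAMPLE_HUMAN some description text"))
# sp|P12345|EXAMPLE_HUMAN
```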
Trains or evaluates PSALM on preprocessed shard datasets.
Config handling
- Training always uses a YAML config.
- If `--config` is provided without a value, the script looks for `psalm/config.yaml`.
- If `--config` is not provided, the script still looks for `psalm/config.yaml`.
Required args
- `--val-dir`, `--ignore-label`
- `--train-dir` if `training.total_steps > 0` in config
Optional args
- `--label-mapping-dict` to override config `model.label_mapping_path`
Checkpoint loading
- Supports `model.safetensors` or `pytorch_model.bin` within a checkpoint directory, or a direct path to a `.safetensors`/`.bin` file.
Logging
- `report_to=["wandb"]` is enabled by default.
Trains the CatBoost scoring model used by `scan()` (saved as `score.cbm`).
Required args
- `--pos`, `--neg`: Pickle or JSON files containing a list of 7-tuples: `(pfam, start, stop, bit_score, len_ratio, bias, status)` (or `scan()` output dicts containing 8-tuples with `cbm_score`).
Example
python scripts/train/train_cbm.py \
--pos path/to/positives.pkl \
--neg path/to/negatives.pkl \
--outdir cbm_outputs \
--model-out score.cbm
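A minimal sketch of preparing the `--pos`/`--neg` pickle inputs in the 7-tuple format above (all values and types below are invented for illustration):

```python
# Build --pos/--neg inputs for train_cbm.py as pickles of 7-tuples:
# (pfam, start, stop, bit_score, len_ratio, bias, status).
# Every value here is made up; real tuples come from your own scan data.
import pickle

positives = [
    ("PF00069", 12, 270, 145.2, 0.95, 3.1, 1),
    ("PF07714", 15, 280, 130.8, 0.90, 2.4, 1),
]
negatives = [
    ("PF00400", 5, 60, 8.3, 0.30, 7.9, 0),
]

with open("positives.pkl", "wb") as f:
    pickle.dump(positives, f)
with open("negatives.pkl", "wb") as f:
    pickle.dump(negatives, f)
```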
The scripts expect a YAML config with these sections:
model
- `model_name`
- `max_batch_size`
- `output_size`
- `freeze_esm`
- `use_fa`
- `pretrained_checkpoint_path`
- `label_mapping_path`
training
- `gradient_accumulation_steps`, `learning_rate`, `optimizer`, `gradient_clipping`
- `lr_scheduler`, `eval_strategy`, `eval_steps`, `total_steps`, `warmup_steps`
- `logging_steps`, `save_steps`, `output_dir`
- `mixed_precision`, `dataloader_num_workers`, `dataloader_prefetch_factor`, `dataloader_pin_memory`, `seed`
data
- `chunk_size`, `default_tmp_dir`, `default_shard_size`
`psalm/config.yaml` is provided as a template with null values. Populate it
before use, or pass all required values via CLI without `--config`.
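A skeleton of the expected structure, mirroring the shipped template (all values left null as placeholders; fill them in for your setup):

```yaml
# Skeleton of psalm/config.yaml; every value below is a placeholder.
model:
  model_name: null
  max_batch_size: null
  output_size: null
  freeze_esm: null
  use_fa: null
  pretrained_checkpoint_path: null
  label_mapping_path: null

training:
  gradient_accumulation_steps: null
  learning_rate: null
  optimizer: null
  gradient_clipping: null
  lr_scheduler: null
  eval_strategy: null
  eval_steps: null
  total_steps: null
  warmup_steps: null
  logging_steps: null
  save_steps: null
  output_dir: null
  mixed_precision: null
  dataloader_num_workers: null
  dataloader_prefetch_factor: null
  dataloader_pin_memory: null
  seed: null

data:
  chunk_size: null
  default_tmp_dir: null
  default_shard_size: null
```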
python scripts/data/augment_fasta.py \
--fasta input.fa \
--domain-dict domains.pkl \
--output-fasta augmented.fa \
--output-dict augmented.pkl
python scripts/data/data_processing.py \
--fasta augmented.fa \
--domain-dict augmented.pkl \
--label-mapping-dict labels.pkl \
--output-dir data/shards \
--model-name ProteinSequenceAnnotation/esm2_t33_650M_PFS90_leaky \
--max-length 4096 \
--max-tokens-per-batch 8196 \
--ignore-label -100
python scripts/train/train_psalm.py \
--config psalm/config.yaml \
--train-dir data/shards/train \
--val-dir data/shards/val \
--ignore-label -100
- `PyYAML` is required for config loading.
- `faesm` is required only if `use_fa: true` in config.
- Core inference runtime uses `torch`, `transformers`, `biopython`, `pandas`, `numba`, and `catboost`.