Ultra-fast SNP/indel-level distance calculator for core genome MLST analysis
cgDist is a high-performance Rust implementation for calculating genetic distances in bacterial genomics, specifically designed for epidemiological outbreak investigations and phylogenetic analysis.
- ⚡ Ultra-fast: Parallel processing with optimized algorithms
- 🎯 Precision: SNP/indel-level distance calculation
- 🔧 Flexible: Multiple hashing algorithms (CRC32, MD5, SHA256)
- 📊 Comprehensive: Built-in comparison tools and statistical analysis
- 🧬 Recombination-candidate flagging: Per-locus mutation-density screen to flag loci as recombination candidates for downstream phylogenetic confirmation
- 💾 Efficient: LZ4 compression for fast caching
- 📈 Scalable: Memory-efficient processing of large datasets
- Features
- Installation
- Quick Start
- Usage
- Recombination-Candidate Flagging
- Cache Inspector
- Custom Hashers Plugin System
- API Documentation
- Citation
- Support
- License
- Rust 1.88 or later (the minimum supported Rust version, MSRV, is also declared in Cargo.toml). The pinned dependency set requires a recent toolchain, so simply using the latest stable Rust is recommended. The easiest way to install or update Rust is via rustup.rs:
# Install rustup (skip if already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# If rustup is already installed but Rust is older than 1.88, update with:
rustup update stable
- Python 3.8+ (only for the validation scripts in validation_test/)
- System build dependencies for parasail-rs: the alignment backend is built from C source via CMake, which requires a C compiler and zlib development headers. Install once per machine:
# Debian / Ubuntu / WSL
sudo apt install build-essential cmake zlib1g-dev
# RHEL / AlmaLinux / Rocky / CentOS / Fedora
sudo dnf install gcc gcc-c++ cmake zlib-devel
# macOS (Homebrew; Xcode Command Line Tools provide compiler + zlib)
xcode-select --install
brew install cmake
Windows users are encouraged to use the Docker image or WSL2, which provide a ready-to-build Linux environment. Native Windows builds additionally require zlib via vcpkg (vcpkg install zlib) or MSYS2.
# Clone the repository
git clone https://github.com/genpat-it/cgDist.git
cd cgDist
# Build release version
cargo build --release
# The binary will be available at ./target/release/cgdist
cargo install cgdist
This fetches the latest published release from
crates.io, builds it locally with
your stable Rust toolchain, and installs the cgdist and
recombination_candidate_analyzer binaries to ~/.cargo/bin/
(which should already be on your PATH after a default rustup
install). Cache inspection is available via cgdist --inspector. The
deprecated recombination_analyzer binary is also installed and forwards
every argument to recombination_candidate_analyzer with a deprecation
notice — existing scripts continue to work.
To pin a specific published version:
cargo install cgdist --version 0.1.2
For a fully reproducible build that uses exactly the dependency
versions we tested against, add --locked (this reuses the published
Cargo.lock instead of re-resolving to the newest compatible
dependencies, which may otherwise require an even more recent Rust
toolchain):
cargo install cgdist --locked
To install directly from the GitHub repository (useful for installing an unreleased commit or for fully self-contained reproducibility when citing the manuscript):
# Specific release tag
cargo install --git https://github.com/genpat-it/cgDist --tag v0.1.2 cgdist
# Latest state on the default branch
cargo install --git https://github.com/genpat-it/cgDist cgdist
cgdist is a binary crate, so its Cargo.lock is committed to the repository to guarantee reproducible builds. This is the convention recommended in the official Cargo FAQ for binary crates.
A multi-arch (linux/amd64 + linux/arm64) image is published to GitHub Container Registry on every release:
# Pull the public image (no authentication required)
docker pull ghcr.io/genpat-it/cgdist:0.1.2
# or pin to the minor / major series:
# docker pull ghcr.io/genpat-it/cgdist:0.1
# docker pull ghcr.io/genpat-it/cgdist:latest # tracks master HEAD
# Run with the image (mount your working directory at /data).
# The image's ENTRYPOINT is `cgdist`, so flags are passed directly:
docker run --rm -v $(pwd):/data ghcr.io/genpat-it/cgdist:0.1.2 \
--schema /data/schema_dir --profiles /data/profiles.tsv \
--output /data/distances.tsv --mode snps-indel-bases
To build the image locally instead of pulling (useful for development):
docker build -t cgdist:dev .
docker run --rm -v $(pwd):/data cgdist:dev --help
# Calculate SNP distances from cgMLST profiles
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv
# Use different distance mode
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --mode snps-indel-bases
# Use different hashing algorithm
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --hasher-type sha256
# Enable cache for faster recomputation
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --cache-file cache.lz4
# Specify number of threads
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv --threads 16
A self-contained validation suite with a small embedded test dataset
(3 loci, 10 samples, ~3 KB) is provided in
validation_test/. It verifies algorithmic
correctness across all four distance modes (Hamming, SNPs,
SNPs+InDel-events, SNPs+InDel-bases), checks the mathematical invariant
cgDist ≥ Hamming, and confirms Parasail alignment integration.
# After installing or building cgDist (see Installation)
cd validation_test
pip install -r requirements.txt # one-time: installs pandas
# run_validation.py is self-contained: it locates the cgdist binary
# ($CGDIST_BIN, ../target/release, or PATH), regenerates the four distance
# matrices, and validates them.
python3 run_validation.py
Expected output: 🎉 ALL VALIDATION TESTS PASSED! See
validation_test/README.md for details on
the test design, expected distances, and how to regenerate the fixture
from scratch.
The validation suite also runs automatically in CI on every push and
pull request (see .github/workflows/ci-and-docker.yml).
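The invariant mentioned above (cgDist ≥ Hamming for every sample pair) is easy to check mechanically. The following is a minimal sketch, assuming both matrices are already loaded as nested dictionaries keyed by sample name. The function name and the toy values are hypothetical and are not taken from run_validation.py.

```python
def find_invariant_violations(cgdist_matrix, hamming_matrix):
    """Return (sample_a, sample_b, cgdist, hamming) tuples where cgDist < Hamming."""
    violations = []
    for sample_a, row in hamming_matrix.items():
        for sample_b, hamming_value in row.items():
            cgdist_value = cgdist_matrix[sample_a][sample_b]
            if cgdist_value < hamming_value:
                violations.append((sample_a, sample_b, cgdist_value, hamming_value))
    return violations


# Toy 3-sample matrices (hypothetical values, for illustration only).
hamming = {
    "S1": {"S1": 0, "S2": 1, "S3": 2},
    "S2": {"S1": 1, "S2": 0, "S3": 1},
    "S3": {"S1": 2, "S2": 1, "S3": 0},
}
snps = {
    "S1": {"S1": 0, "S2": 3, "S3": 5},
    "S2": {"S1": 3, "S2": 0, "S3": 1},
    "S3": {"S1": 5, "S2": 1, "S3": 0},
}

violations = find_invariant_violations(snps, hamming)
print("violations:", violations)
```

An empty list means the invariant holds for every pair; any entry points at the exact pair and values that break it.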
A configuration file is optional: every parameter accepted by cgdist also has
a CLI flag. The configuration file simply lets you persist commonly-used
settings without retyping them. CLI flags always override TOML values
when both are provided.
A canonical example is shipped at
examples/cgdist-config.toml; a
Hamming-mode variant is at
examples/hamming-config.toml. Both
files use the same flat key structure (no [sections]), and the same
key names as the corresponding CLI flags (the only normalization is
that CLI flag dashes become underscores in TOML — e.g. --hasher-type
becomes hasher_type).
You can also generate a fresh annotated sample with:
cgdist --generate-config > cgdist-config.toml
A minimal example (alignment-based mode):
profiles = "profiles.tsv"
schema = "schema/"
output = "distances.tsv"
hasher_type = "crc32"
mode = "snps" # legacy alias snps-indel-events == snps-indel-contiguous (deprecated)
format = "tsv"
missing_char = "-"
threads = 1 # default; set to 0 for auto-detect
hamming_fallback = false # opt-in (see Hamming Fallback section below)
# Use a configuration file
cgdist --config cgdist-config.toml
# CLI overrides example: config says threads=1, but the CLI wins → 16 threads used
cgdist --config cgdist-config.toml --threads 16
When both the configuration file and the command line specify the same
parameter, the command-line value wins. Internally this is
implemented by loading the TOML first, then overlaying any CLI flag
that the user explicitly set. The same rule applies to switches: e.g.
if the TOML says hamming_fallback = false but you pass
--hamming-fallback on the command line, the fallback will be enabled
for that run.
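The overlay rule described above can be illustrated with a short Python sketch. It is not cgdist's actual implementation (which is written in Rust); the function name and the dictionaries are hypothetical, and `None` stands for a flag the user did not pass.

```python
def merge_settings(toml_values, cli_values):
    """Overlay explicitly set CLI flags on top of values loaded from TOML."""
    merged = dict(toml_values)  # start from the configuration file
    for flag, value in cli_values.items():
        if value is None:  # flag not given on the command line: keep TOML value
            continue
        # Normalize the flag name to its TOML key: --hasher-type -> hasher_type
        key = flag.lstrip("-").replace("-", "_")
        merged[key] = value
    return merged


toml_values = {"threads": 1, "hasher_type": "crc32", "hamming_fallback": False}
cli_values = {"--threads": 16, "--hasher-type": None, "--hamming-fallback": True}

settings = merge_settings(toml_values, cli_values)
print(settings)
```

Here `threads` and `hamming_fallback` take the command-line values, while `hasher_type` keeps the value from the file because the flag was not set.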
cgdist [OPTIONS]
MAIN OPTIONS:
--schema <PATH> Path to FASTA schema directory or schema file
--profiles <PATH> Path to allelic profile matrix (.tsv or .csv)
--output <FILE> Output distance matrix file
--mode <MODE> Distance mode [default: snps]
Options: snps, snps-indel-contiguous, snps-indel-bases, hamming
(legacy alias snps-indel-events == snps-indel-contiguous, deprecated)
--format <FORMAT> Output format [default: tsv]
Options: tsv, csv, phylip, nexus
FILTERING OPTIONS:
--min-loci <N> Minimum shared loci for distance calculation [default: 0]
--sample-threshold <VAL> Sample quality filter (0.0-1.0) [default: 0.0]
--locus-threshold <VAL> Locus quality filter (0.0-1.0) [default: 0.0]
--include-samples <REGEX> Include only samples matching regex pattern
--exclude-samples <REGEX> Exclude samples matching regex pattern
--include-loci <REGEX> Include only loci matching regex pattern
--exclude-loci <REGEX> Exclude loci matching regex pattern
--include-samples-list <FILE> Include samples from file (one per line)
--exclude-samples-list <FILE> Exclude samples from file (one per line)
--include-loci-list <FILE> Include loci from file (one per line)
--exclude-loci-list <FILE> Exclude loci from file (one per line)
ALIGNMENT OPTIONS:
--alignment-mode <MODE> Alignment mode [default: dna]
Options: dna, dna-strict, dna-permissive, custom
--match-score <N> Custom match score (enables custom mode)
--mismatch-penalty <N> Custom mismatch penalty (enables custom mode)
--gap-open <N> Custom gap open penalty (enables custom mode)
--gap-extend <N> Custom gap extend penalty (enables custom mode)
--save-alignments <FILE> Save detailed alignments to TSV file
PERFORMANCE OPTIONS:
--threads <N> Number of threads [default: 1; pass 0 for auto-detect]
--cache-file <FILE> Cache file path (.lz4 extension)
--cache-note <TEXT> Note to save with cache
--cache-only Build cache only without computing distance matrix
--force-recompute Force recomputation ignoring cache
--hasher-type <TYPE> Allele hasher type [default: crc32]
Options: crc32, sha256, md5, sequence, hamming
CACHE ENRICHMENT OPTIONS:
--enrich-lengths Enrich cache with nucleotide sequence lengths from schema
--enrich-output <FILE> Output file for enriched cache [default: overwrites input cache]
OTHER OPTIONS:
--missing-char <CHAR> Missing data character [default: -]
--no-hamming-fallback Disable Hamming fallback for SNPs mode
--stats-only Show matrix statistics only
--benchmark Measure alignment processing speed
--benchmark-duration <N> Benchmark duration in seconds [default: 15]
--dry-run Validate inputs without computation
--inspector <FILE> Inspect cache file
--config <FILE> Path to TOML configuration file
--generate-config Generate sample configuration file
--help Display usage information
Schema (FASTA directory):
- Individual FASTA files per locus
- Each file contains allele sequences
- File names correspond to locus names
Profiles (allelic profiles):
- TSV: Tab-separated values
- CSV: Comma-separated values
- Format: Sample name | Locus1 | Locus2 | ... | LocusN
- Missing data represented by configurable character (default:
-)
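To make the profile layout concrete, here is a small sketch that parses a TSV profile and counts differing loci between two samples. It assumes the first column holds the sample name and that loci where either sample carries the missing character are skipped. The second point is a common cgMLST convention rather than something stated here, and the function names and toy data are hypothetical.

```python
import csv
import io


def read_profiles(handle, delimiter="\t"):
    """Parse an allelic profile table into (loci, {sample: {locus: allele}})."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    loci = header[1:]  # first column is the sample name
    profiles = {}
    for row in reader:
        if row:  # skip blank lines
            profiles[row[0]] = dict(zip(loci, row[1:]))
    return loci, profiles


def hamming_distance(profile_a, profile_b, loci, missing_char="-"):
    """Return (differing_loci, shared_loci), ignoring loci missing in either sample."""
    differing = 0
    shared = 0
    for locus in loci:
        allele_a, allele_b = profile_a[locus], profile_b[locus]
        if allele_a == missing_char or allele_b == missing_char:
            continue
        shared += 1
        if allele_a != allele_b:
            differing += 1
    return differing, shared


toy_tsv = (
    "Sample\tlocus1\tlocus2\tlocus3\n"
    "S1\t1\t4\t-\n"
    "S2\t1\t7\t2\n"
    "S3\t3\t7\t2\n"
)
loci, profiles = read_profiles(io.StringIO(toy_tsv))
print(hamming_distance(profiles["S1"], profiles["S2"], loci))
```

The second returned value is the kind of shared-locus count that a filter such as --min-loci would be compared against.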
Cache files:
- LZ4: Compressed cache files (.lz4 or .bin extension)
- Automatic compression/decompression
Output formats:
- TSV: Tab-separated distance matrix (default)
- CSV: Comma-separated distance matrix
- PHYLIP: Phylogenetic analysis format
- NEXUS: Nexus format for phylogenetic tools
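As an illustration of consuming the default TSV output without extra dependencies, the sketch below reads a square distance matrix and lists sample pairs at or below a distance cutoff. It assumes a header row of sample names with an empty leading cell, one row per sample, and integer distances; the function names, the cutoff, and the toy matrix are hypothetical.

```python
import csv
import io


def read_distance_matrix(handle):
    """Read a square TSV distance matrix into {row_sample: {col_sample: distance}}."""
    reader = csv.reader(handle, delimiter="\t")
    columns = next(reader)[1:]  # header: empty corner cell, then sample names
    matrix = {}
    for row in reader:
        if row:
            # Assumes integer counts of differences
            matrix[row[0]] = {col: int(value) for col, value in zip(columns, row[1:])}
    return matrix


def pairs_within(matrix, max_distance):
    """List each unordered sample pair once if its distance is <= max_distance."""
    names = sorted(matrix)
    return [
        (a, b, matrix[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if matrix[a][b] <= max_distance
    ]


toy_matrix = "\tS1\tS2\tS3\nS1\t0\t2\t9\nS2\t2\t0\t8\nS3\t9\t8\t0\n"
matrix = read_distance_matrix(io.StringIO(toy_matrix))
print(pairs_within(matrix, max_distance=5))
```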
cgDist includes a companion screen that flags candidate recombinant loci based on per-locus mutation density. This is not a recombination detector: confirmation of recombination requires downstream phylogeny-aware tools (e.g. Gubbins, ClonalFrameML, fastGEAR). The flagging output identifies which loci warrant that follow-up.
- Mutation Density Analysis: Flags loci with high SNP/indel density per alignment as recombination candidates
- Hamming Distance Filtering: Focuses analysis on genetically related sample pairs
- Pairwise Flagging Summary: Per sample-pair count of flagged loci
- EFSA Loci Support: Compatible with standardized loci sets for food safety applications
- Distance Matrix Correction: Recomputes distances excluding flagged loci
Recombination-candidate flagging is performed by the
recombination_candidate_analyzer binary, which reads an enriched cache
(a cgDist cache that also stores per-allele sequence lengths) and flags loci
whose per-locus nucleotide-difference density exceeds a threshold. This is the
workflow described in the paper (Supplementary §S6).
# Step 1: build an enriched cache alongside the distance matrix.
# On a freshly created cache, --enrich-lengths records the sequence lengths
# in place, so a single cgdist run is enough.
cgdist --schema schema_dir/ --profiles profiles.tsv --output distances.tsv \
--mode snps-indel-bases --cache-file cache.bin --enrich-lengths
# Step 2: flag candidate loci and write a corrected distance matrix.
recombination_candidate_analyzer \
--enriched-cache cache.bin \
--profiles profiles.tsv \
--distance-matrix distances.tsv \
--output-matrix corrected_distances.tsv \
--candidate-recombination-log candidate_recombination_loci.tsv \
--threshold 3.0 # mutation-density % (default: 3.0; e.g. 5.0 for a stricter screen)
Note: the binary recombination_analyzer and the flag --recombination-log are kept as deprecated aliases for backward compatibility (a deprecation notice is printed when invoked). Existing scripts continue to work.
cgDist consumes standard cgMLST outputs. Profiles and schemas can be generated, for example, with ChewBBACA (Silva et al. 2018) or downloaded from Chewie-NS (Mamede et al. 2020).
- Enriched cache (.bin): a cgDist cache built with --cache-file + --enrich-lengths
- Allelic profiles (TSV/CSV): sample-locus-allele matrix
- Distance matrix (TSV): the original distance matrix from cgdist
- EFSA loci (optional): TSV file listing loci of interest

- Corrected distance matrix (--output-matrix): distances recomputed with the flagged candidate loci excluded
- Candidate flagging log (--candidate-recombination-log, TSV): one row per flagged sample-pair/locus, with SNP/InDel counts, Hamming distance, average allele length, mutation-density %, and recombination-excess %

- --threshold: mutation-density percentage above which a locus is flagged (default: 3.0)
- --candidate-recombination-log: output flagging log path. Legacy alias --recombination-log is also accepted.
- --output-matrix: corrected distance-matrix output path
# Step 1: build an enriched cache + distance matrix
cgdist --schema schema/ --profiles samples.tsv --output distances.tsv \
--mode snps-indel-bases --cache-file cache.bin --enrich-lengths
# Step 2: flag candidate loci and write a corrected distance matrix
recombination_candidate_analyzer \
--enriched-cache cache.bin \
--profiles samples.tsv \
--distance-matrix distances.tsv \
--output-matrix corrected_distances.tsv \
--candidate-recombination-log candidate_recombination_loci.tsv \
--threshold 3.0
- High SNP Density: > 3% flags a locus as a recombination candidate (confirm with phylogeny-aware tools)
- High Indel Events: May indicate mobile genetic elements; warrants downstream inspection
- Pairwise Patterns: Multiple flagged loci between the same sample pair suggest related strains
- Hamming Filtering: Ensures focus on epidemiologically relevant comparisons
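The thresholding idea can be sketched in a few lines of Python. This is an illustration only: it assumes mutation density is computed as 100 × (nucleotide differences) / (average allele length), which is a plausible reading of the log columns listed above but is not confirmed here (see Supplementary §S6 of the paper for the actual definition). All names and numbers are hypothetical.

```python
def mutation_density_percent(differences, len_a, len_b):
    """Assumed definition: differences per 100 bases of average allele length."""
    average_length = (len_a + len_b) / 2
    return 100.0 * differences / average_length


def flag_and_correct(per_locus, threshold=3.0):
    """Flag loci above the density threshold and sum differences over the rest.

    per_locus maps locus -> (differences, len_a, len_b) for one sample pair.
    Returns ({flagged_locus: density_percent}, corrected_distance).
    """
    flagged = {}
    corrected_distance = 0
    for locus, (differences, len_a, len_b) in per_locus.items():
        density = mutation_density_percent(differences, len_a, len_b)
        if density > threshold:
            flagged[locus] = density  # excluded from the corrected distance
        else:
            corrected_distance += differences
    return flagged, corrected_distance


# One hypothetical sample pair: locus2 carries 45 differences in ~1000 bp.
pair = {
    "locus1": (1, 900, 900),
    "locus2": (45, 1000, 1000),
    "locus3": (0, 600, 600),
}
flagged, corrected = flag_and_correct(pair, threshold=3.0)
print("flagged:", flagged)
print("corrected distance:", corrected)
```

In this toy pair the raw distance is 46, but only locus2 exceeds the 3% screen, so the corrected distance drops to 1. A flagged locus remains a candidate until a phylogeny-aware tool confirms it.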
- Memory Usage: ~4-8GB for typical bacterial datasets (1000+ samples)
- Processing Time: 2-5 minutes for 21M cache entries on modern hardware
- Scalability: Linear with cache size, efficient for large epidemiological studies
- Outbreak Investigation: Flag candidate recombination loci in transmission chains for downstream confirmation
- Evolutionary Analysis: Identify candidate horizontal gene transfer events
- Food Safety: Screen for recombination signatures in foodborne pathogens
- Antimicrobial Resistance: Flag candidate resistance gene transfer events
- Population Genomics: Identify loci that may bias clonal-frame distance estimates
Inspect a cgDist cache — version, hasher type, distance mode, alignment parameters, per-locus entry counts, and any saved note — with the built-in inspector flag:
cgdist --inspector cache.lz4
This reads the cache format written by --cache-file (LZ4-compressed JSON), including length-enriched caches built with --enrich-lengths (the file extension, e.g. .lz4 or .bin, does not matter; the format is the same).
Use it to:
- Validate a cache file's integrity before reuse
- Audit which alignment parameters and distance mode a cache was built with
- Inspect cache size and per-locus entry distribution
- Troubleshoot cache compatibility issues
cgDist provides a powerful plugin architecture for implementing custom hashing algorithms. This is particularly useful for specialized applications or compatibility with other tools.
Create a new hasher by implementing the AlleleHasher trait:
use cgdist::hashers::{AlleleHasher, AlleleHash};
/// Example: Simple nucleotide composition hasher
#[derive(Debug)]
pub struct CompositionHasher;
impl AlleleHasher for CompositionHasher {
fn hash_sequence(&self, sequence: &str) -> AlleleHash {
// Count nucleotides: A, T, G, C
let mut counts = [0u32; 4]; // A, T, G, C (u32 so long alleles cannot overflow the counters)
for nucleotide in sequence.chars() {
match nucleotide.to_ascii_uppercase() {
'A' => counts[0] += 1,
'T' => counts[1] += 1,
'G' => counts[2] += 1,
'C' => counts[3] += 1,
_ => {} // Ignore ambiguous bases
}
}
// Create hash from the composition counts, e.g. A2T2G2C2
let hash_string = format!("A{}T{}G{}C{}",
counts[0], counts[1], counts[2], counts[3]);
AlleleHash::String(hash_string)
}
fn parse_allele(&self, allele_str: &str, missing_char: &str) -> Result<AlleleHash, String> {
if allele_str == missing_char {
Ok(AlleleHash::Missing)
} else {
// Parse composition string or return as-is
Ok(AlleleHash::String(allele_str.to_string()))
}
}
fn name(&self) -> &'static str {
"composition"
}
fn description(&self) -> &'static str {
"Nucleotide composition-based hasher (A/T/G/C counts)"
}
fn validate_sequence(&self, sequence: &str) -> Result<(), String> {
// Allow only A, T, G, C and the ambiguity code N
for ch in sequence.chars() {
match ch.to_ascii_uppercase() {
'A' | 'T' | 'G' | 'C' | 'N' => {}
_ => return Err(format!("Invalid nucleotide: {}", ch)),
}
}
Ok(())
}
}
use cgdist::hashers::HasherRegistry;
fn main() {
let mut registry = HasherRegistry::new();
// Register your custom hasher
registry.register_hasher("composition", Box::new(CompositionHasher));
// Use it like any built-in hasher
let hasher = registry.get_hasher("composition").unwrap();
let hash = hasher.hash_sequence("ATCGATCG");
println!("Hash: {}", hash); // Output: A2T2G2C2
}
#[derive(Debug)]
pub struct KmerHasher {
k: usize,
}
impl KmerHasher {
pub fn new(k: usize) -> Self {
Self { k }
}
}
impl AlleleHasher for KmerHasher {
fn hash_sequence(&self, sequence: &str) -> AlleleHash {
let mut kmers = Vec::new();
let seq_bytes = sequence.as_bytes();
if seq_bytes.len() >= self.k {
for i in 0..=(seq_bytes.len() - self.k) {
let kmer = std::str::from_utf8(&seq_bytes[i..i + self.k])
.unwrap_or("")
.to_string();
kmers.push(kmer);
}
}
kmers.sort();
let hash_string = kmers.join("|");
AlleleHash::String(hash_string)
}
// ... implement other required methods
}
#[derive(Debug)]
pub struct CustomNumericHasher;
impl AlleleHasher for CustomNumericHasher {
fn hash_sequence(&self, sequence: &str) -> AlleleHash {
// Convert sequence to custom numeric representation
let mut hash_value = 0u32;
for nucleotide in sequence.chars() {
let base_value = match nucleotide.to_ascii_uppercase() {
'A' => 0,
'T' => 1,
'G' => 2,
'C' => 3,
_ => 0, // Default for ambiguous
};
// Simple polynomial rolling hash
hash_value = hash_value.wrapping_mul(4).wrapping_add(base_value);
}
AlleleHash::Crc32(hash_value)
}
fn parse_allele(&self, allele_str: &str, missing_char: &str) -> Result<AlleleHash, String> {
if allele_str == missing_char {
Ok(AlleleHash::Missing)
} else {
match allele_str.parse::<u32>() {
Ok(value) => Ok(AlleleHash::Crc32(value)),
Err(_) => Err(format!("Invalid numeric allele: {}", allele_str)),
}
}
}
fn name(&self) -> &'static str {
"custom-numeric"
}
fn description(&self) -> &'static str {
"Custom polynomial rolling hash for sequences"
}
}
To use custom hashers with the cgdist command-line tool, you can:
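- Fork and modify: Add your hasher to the registry in src/main.rs
- Configuration file: Load hashers from a configuration file
- Dynamic loading: Load hashers from a shared library at run time (advanced; Rust has no built-in plugin system, so this typically relies on a crate such as libloading)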
Example integration in main.rs:
fn create_registry() -> HasherRegistry {
let mut registry = HasherRegistry::new();
// Add your custom hashers here
registry.register_hasher("composition", Box::new(CompositionHasher));
registry.register_hasher("kmer3", Box::new(KmerHasher::new(3)));
registry.register_hasher("custom-numeric", Box::new(CustomNumericHasher));
registry
}
- Legacy Compatibility: Match existing tool formats
- Domain-Specific: Specialized algorithms for specific organisms
- Research: Experimental hashing strategies
- Performance: Optimized for specific hardware or datasets
- Compliance: Meet specific regulatory or institutional requirements
- Deterministic: Ensure same sequence always produces same hash
- Collision-Resistant: Minimize hash collisions for your use case
- Performance: Consider computational overhead
- Validation: Implement robust input validation
- Documentation: Provide clear usage examples and limitations
The plugin architecture makes cgDist highly extensible while maintaining backward compatibility with existing workflows.
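The collision-resistance point is worth checking before adopting a custom hasher. The sketch below re-implements the composition and polynomial examples in Python purely for illustration and compares them with Python's zlib.crc32, used here as a stand-in for a checksum-style hasher (it is not claimed to reproduce cgDist's CRC32 values byte for byte). Function names are hypothetical.

```python
import zlib


def composition_hash(sequence):
    """Python stand-in for the CompositionHasher example: A/T/G/C counts."""
    seq = sequence.upper()
    return "A{}T{}G{}C{}".format(
        seq.count("A"), seq.count("T"), seq.count("G"), seq.count("C")
    )


def polynomial_hash(sequence):
    """Python stand-in for the CustomNumericHasher example (base 4, 32-bit wrap)."""
    values = {"A": 0, "T": 1, "G": 2, "C": 3}
    result = 0
    for base in sequence.upper():
        # Masking to 32 bits mirrors wrapping u32 arithmetic
        result = (result * 4 + values.get(base, 0)) & 0xFFFFFFFF
    return result


def crc32_hash(sequence):
    """Checksum over the whole sequence, so base order matters."""
    return zlib.crc32(sequence.upper().encode("ascii"))


sequences = ["ATCGATCGATCG", "AAATTTGGGCCC", "ATGCATGCATGC"]
print({composition_hash(s) for s in sequences})  # one shared value: a collision
print(len({crc32_hash(s) for s in sequences}))   # number of distinct checksums

# With 2 bits per base, a 32-bit value only retains the last 16 bases,
# so alleles that differ earlier than that collapse to the same hash.
long_a = "T" + "A" * 16
long_b = "G" + "A" * 16
print(polynomial_hash(long_a) == polynomial_hash(long_b))
```

Both illustrative hashers are deterministic, yet they map distinct alleles to the same value: the composition hash ignores base order, and the 32-bit polynomial hash forgets everything before the final 16 bases. Distinct alleles that share a hash would be treated as identical, so hashers like these suit demonstrations rather than production typing.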
See the complete working example:
# Run the custom hasher demonstration
cargo run --example custom_hasher
# Output shows different hashers applied to test sequences:
# 🔌 cgDist Custom Hasher Examples
# ===================================
#
# 📊 Available Hashers:
# • crc32: Fast CRC32 checksum (chewBBACA compatible)
# • composition: Nucleotide composition-based hasher (A/T/G/C counts)
# • kmer3: K-mer composition hasher (sorted k-mers)
# • polynomial: Polynomial rolling hash for sequences
#
# 🧬 Testing hasher: composition
# Description: Nucleotide composition-based hasher (A/T/G/C counts)
# ATCGATCGATCG → A3T3G3C3
# AAATTTGGGCCC → A3T3G3C3
# ATGCATGCATGC → A3T3G3C3
This example demonstrates practical implementation patterns for:
- Composition-based hashing: Count nucleotide frequencies
- K-mer analysis: Extract and sort sequence k-mers
- Polynomial hashing: Mathematical sequence encoding
- Error handling: Validation and missing data management
use cgdist::{DistanceCalculator, Config};
// Create calculator with custom config
let config = Config::new()
.hasher("crc32")
.threads(8)
.cache_enabled(true);
let calculator = DistanceCalculator::new(config);
// Calculate distances
let distances = calculator.calculate_from_file("sequences.fasta")?;
import subprocess
import pandas as pd
# Run cgdist from Python
result = subprocess.run([
'cgdist',
'--schema', 'schema_dir/',
'--profiles', 'profiles.tsv',
'--output', 'distances.tsv',
'--mode', 'snps-indel-bases'
], capture_output=True, text=True)
# Check for errors
if result.returncode != 0:
print(f"Error: {result.stderr}")
else:
# Load results
distances = pd.read_csv('distances.tsv', sep='\t', index_col=0)
print(f"Distance matrix shape: {distances.shape}")
print(distances.head())
If you use cgDist in your research, please cite our preprint:
de Ruvo, A.; Castelli, P.; Bucciacchio, A.; Mangone, I.; Mixao, V.; Borges, V.; Radomski, N.; Di Pasquale, A. (2025). cgDist: An Enhanced Algorithm for Efficient Calculation of pairwise SNP and InDel differences from Core Genome Multilocus Sequence Typing. bioRxiv. DOI: 10.1101/2025.10.16.682749
@article{deruvo2025cgdist,
title = {cgDist: An Enhanced Algorithm for Efficient Calculation of pairwise SNP and InDel differences from Core Genome Multilocus Sequence Typing},
author = {de Ruvo, Andrea and Castelli, Pierluigi and Bucciacchio, Andrea and Mangone, Iolanda and Mixao, Verónica and Borges, Vítor and Radomski, Nicolas and Di Pasquale, Adriano},
year = {2025},
month = {October},
doi = {10.1101/2025.10.16.682749},
journal = {bioRxiv},
note = {Preprint. Software: https://github.com/genpat-it/cgDist}
}
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: a.deruvo@izs.it
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ for the bioinformatics community