A high-performance Snakemake workflow for generating molecular fingerprints from DNA-encoded library (DEL) compounds, featuring a dedicated I/O cores pipeline that delivers 48% higher throughput (250 vs. 170 million molecules/hour).
This workflow is completely self-contained and only requires:
- Snakemake (with conda/mamba support)
- Conda or Mamba for environment management
No other dependencies need to be pre-installed!
# Using conda
conda install -c conda-forge snakemake=7 cookiecutter
# Or using mamba (recommended for faster installs)
mamba install -c conda-forge snakemake=7 cookiecutter

# If you have git
git clone <repository-url>
cd FingerDELicious
# Or download and extract the workflow files

Edit config/config.yaml to specify your input files:
input:
# Option 1: Specify individual files
files:
- "/path/to/your/library1.parquet"
- "/path/to/your/library2.parquet"
# Option 2: Scan a directory for files
directory: "/path/to/your/libraries/"
pattern: "*.parquet" # File pattern to match

# Full production run with robust job management
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--conda-frontend conda \
--rerun-incomplete \
--keep-going \
--jobs 13
# Alternative: Keep failed outputs for debugging
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--conda-frontend conda \
--rerun-incomplete \
--keep-going \
--keep-incomplete \
--jobs 13

# Test the workflow (dry run)
snakemake --configfile config/config.yaml --dry-run
# Local testing with default profile
snakemake --configfile config/config.yaml \
--profile profiles/default \
--use-conda \
--cores 8
# Quick test with small dataset
snakemake --configfile config/test_phase2.yaml \
--profile profiles/default \
--use-conda \
--cores 4

# Standard cluster execution
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--jobs 10
# Maximum parallelism (adjust based on cluster policies)
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--conda-frontend mamba \
--jobs 20 \
--max-jobs-per-second 5
# Restart failed workflow
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--rerun-incomplete \
--keep-going \
--jobs 13

# Generate workflow visualization
snakemake --configfile config/config.yaml --dag | dot -Tpng > workflow_dag.png
# Detailed dry run with reasons
snakemake --configfile config/config.yaml --dry-run --reason
# Force rerun of specific rule
snakemake --configfile config/config.yaml \
--forcerun generate_fingerprints \
--use-conda \
--jobs 4
# Monitor workflow status
snakemake --configfile config/config.yaml --summary

Key flags explained:
- --keep-going: Prevents a job-failure cascade by continuing other jobs even if some fail
- --keep-incomplete: Keeps partial outputs from failed jobs for debugging
- --rerun-incomplete: Automatically reruns jobs that didn't complete successfully
- --conda-frontend mamba: Uses mamba for faster environment creation (if available)
- --max-jobs-per-second: Limits the job submission rate to avoid overwhelming the scheduler
- --jobs N: Maximum number of simultaneous jobs (adjust based on cluster limits)
The workflow features an advanced two-approach processing architecture that automatically selects the optimal method based on available resources:
Sequential Streaming (≤7 cores):
- Memory-efficient sequential processing
- Optimal for resource-constrained environments
- Single-threaded I/O with parallel fingerprint computation
- Performance: ~170 million molecules/hour
Dedicated I/O Cores (≥8 cores):
- Overlapped I/O operations while CPU processes fingerprints
- Separate thread pools for reading, processing, and writing
- Bounded buffering prevents memory bloat
- Performance boost: 48% faster throughput (250 vs 170 million molecules/hour)
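The bounded-buffer idea behind the dedicated I/O cores design can be sketched with Python's standard `threading` and `queue` modules. This is a toy illustration, not the workflow's actual implementation; `run_pipeline`, `compute`, and `write` are hypothetical names:

```python
import threading
import queue

def run_pipeline(batches, compute, write, buffer_size=4):
    """Toy producer-consumer pipeline: a reader thread fills a bounded
    queue while the main thread computes, so I/O and computation overlap
    and the buffer caps memory use."""
    buf = queue.Queue(maxsize=buffer_size)  # bounded: prevents memory bloat
    SENTINEL = object()

    def reader():
        for batch in batches:   # stands in for reading parquet chunks
            buf.put(batch)      # blocks when the buffer is full
        buf.put(SENTINEL)       # signal end of input

    t = threading.Thread(target=reader, daemon=True)
    t.start()
    results = []
    while True:
        batch = buf.get()
        if batch is SENTINEL:
            break
        results.append(write(compute(batch)))  # overlapped with reading
    t.join()
    return results
```

Because `put` blocks once `buffer_size` batches are queued, memory stays bounded no matter how fast the reader outpaces the workers.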
The script automatically chooses the optimal approach based on available cores:
performance:
n_jobs: 12   # ≥8 cores → Dedicated I/O cores pipeline
# n_jobs: 4  # ≤7 cores → Sequential streaming
force_sequential: false # Set true to force sequential for debugging

- Automatic approach selection: Chooses optimal processing method based on available cores
- Dedicated I/O cores architecture: Overlaps I/O operations with computation for maximum throughput
- Memory-efficient streaming: Processes large datasets without loading everything into memory
- Robust error handling: Continues processing other libraries even if individual jobs fail
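The core-count selection rule described above can be sketched as follows (thresholds taken from the config comments; the function name and return values are hypothetical, not the script's actual API):

```python
def choose_approach(n_jobs: int, force_sequential: bool = False) -> str:
    """Pick a processing approach from the available core count."""
    if force_sequential or n_jobs <= 7:
        return "sequential_streaming"   # ≤7 cores, or forced for debugging
    return "dedicated_io_cores"         # ≥8 cores
```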
- fingerprints.yaml: Creates environment with scikit-fingerprints, pandas, numpy, scipy
- report.yaml: Creates environment with matplotlib, seaborn for visualizations
- All dependencies are automatically installed by Snakemake
- Input Validation: Scans parquet files and extracts library IDs
- Approach Selection: Automatically chooses Sequential Streaming or Dedicated I/O Cores based on core count
- Fingerprint Generation: Generates ECFP/FCFP fingerprints with optimized I/O and computation overlap
- Output Organization: Creates structured directories by library ID with incremental chunk numbering
- Performance Analytics: Comprehensive HTML report with visualizations
- Multi-type processing: Generate ECFP and FCFP fingerprints in a single execution
- Flexible input: Support for individual parquet files or directory scanning
- Library ID extraction: Uses library_ID from compound column for output naming
- Profile-based execution: Local development or cluster execution
- Threading control: Fine-tune read/process/write thread allocation
- Structured output: Organized directory structure with chunked files for ML workflows
- Robust job management: Prevents cascading failures with the --keep-going flag
FingerDELicious/
├── Snakefile # Main workflow definition
├── config/
│ ├── config.yaml # Main configuration file
│ └── config_test.yaml # Test configuration (smaller datasets)
├── scripts/
│ ├── generate_fingerprints.py # Fingerprint generation script
│ └── generate_report.py # Enhanced HTML report generation script
├── envs/
│ └── report.yaml # Conda environment for reporting
├── profiles/
│ ├── default/
│ │ └── config.yaml # Sequential execution (1 job)
│ └── slurm/
│ └── config.yaml # Parallel execution (4 jobs × 32 cores × 96GB)
├── reports/ # Generated HTML reports and performance data
└── logs/ # Job logs
Edit config/config.yaml to specify your input files:
# Option 1: List specific files
input:
files:
- "/path/to/library1.parquet"
- "/path/to/library2.parquet"
# Option 2: Scan directory
input:
directory: "/path/to/libraries"
pattern: "*.parquet"

🚀 Recommended Production Command (Prevents Job Failure Cascades):
cd FingerDELicious
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--conda-frontend conda \
--rerun-incomplete \
--keep-going \
--jobs 13

For Local Testing:
cd FingerDELicious
snakemake --configfile config/config.yaml \
--profile profiles/default \
--use-conda \
--cores 8

With Test Configuration:
cd FingerDELicious
snakemake --configfile config/test_phase2.yaml \
--profile profiles/default \
--use-conda \
--cores 4

The workflow automatically generates a comprehensive HTML report with:
- System information and configuration details
- Performance metrics and visualizations
- Processing statistics per library
- Resource usage analysis
Generate report only:
cd FingerDELicious
snakemake --configfile config/config.yaml \
reports/fingerprint_generation_report.html

# Input configuration
input:
files:
- "/path/to/library1.parquet"
- "/path/to/library2.parquet"
# Fingerprint generation settings
fingerprints:
types: ["ecfp", "fcfp"] # Generate both types in single execution
radius: 2 # ECFP radius (2 = ECFP4)
fp_size: 2048 # Fingerprint vector size
# Performance settings
performance:
batch_size: 100000 # Processing batch size
output_chunk_size: 2000000 # Output file chunk size
n_jobs: 20 # CPU cores (auto-selects Phase 1 vs Phase 2)
# Dedicated I/O Cores Threading Configuration
read_threads: 0 # Background data reading (0 = auto)
process_threads: 0 # Fingerprint computation (0 = auto)
write_threads: 0 # Background output writing (0 = auto)
buffer_size: 0 # Memory buffering (0 = auto)
force_sequential: false # Force sequential processing
# Resource allocation
resources:
base_memory_gb: 8 # Base memory per job
memory_multiplier: 2 # Retry scaling factor
max_memory_gb: 64 # Maximum memory limit
max_runtime_hours: 4 # Job timeout

Automatic thread allocation (recommended):
performance:
n_jobs: 24 # Let script choose optimal approach
read_threads: 0 # Auto: 1 thread for ≥8 cores
process_threads: 0 # Auto: 22 threads for ≥8 cores
write_threads: 0 # Auto: 1 thread for ≥8 cores
buffer_size: 0 # Auto: based on batch size

Manual thread allocation (advanced tuning):
performance:
n_jobs: 32 # Total cores available
read_threads: 2 # Dedicated I/O reading
process_threads: 28 # Fingerprint computation
write_threads: 2 # Dedicated I/O writing
buffer_size: 4 # 4 batches in memory

Force sequential streaming (debugging):
performance:
n_jobs: 24 # Even with many cores...
force_sequential: true # ...force sequential streaming

Default Profile: Sequential processing, 1 job at a time
# profiles/default/config.yaml
jobs: 1 # Sequential execution
local-cores: 1 # Use 1 core locally

Slurm Profile: Parallel processing, up to 4 jobs × 32 cores
# profiles/slurm/config.yaml
jobs: 4 # Up to 4 parallel jobs
__default__:
threads: 32 # 32 cores per job
mem_gb: 96 # 96GB RAM per job
runtime: 240 # 4 hours max per job

Each library generates its own directory structure:
results/
└── {library_ID}/
├── ecfp_fingerprints/
│ ├── {library_ID}_ecfp_chunk-0.npz
│ └── {library_ID}_ecfp_chunk-1.npz
├── fcfp_fingerprints/
│ ├── {library_ID}_fcfp_chunk-0.npz
│ └── {library_ID}_fcfp_chunk-1.npz
└── metadata/
├── {library_ID}_metadata_chunk-0.parquet
└── {library_ID}_metadata_chunk-1.parquet
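When reassembling chunked outputs for ML workflows, order the files numerically by their `chunk-N` suffix: a plain lexicographic sort would place `chunk-10` before `chunk-2`. A small sketch (the helper name is hypothetical, not part of the workflow):

```python
import re
from pathlib import Path

def sorted_chunks(paths):
    """Sort chunk files by the integer in their chunk-N suffix so rows
    concatenate in the order they were written."""
    def chunk_index(p):
        m = re.search(r"chunk-(\d+)", Path(p).name)
        return int(m.group(1)) if m else -1
    return sorted(paths, key=chunk_index)
```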
Local execution (default profile):
- CPU: 20 cores per library (configurable)
- Memory: ~60GB RAM per library
- Execution: Sequential (1 library at a time)
Cluster execution (Slurm profile):
- CPU: 32 cores per job, up to 4 jobs = 128 cores total
- Memory: 96GB RAM per job, up to 384GB total
- Execution: Up to 4 libraries in parallel
One of the most important considerations when running large-scale workflows is preventing individual job failures from stopping the entire workflow. Use these strategies:
# Recommended production command
# --keep-going: continue other jobs if some fail
# --rerun-incomplete: retry failed/incomplete jobs
# --keep-incomplete: keep partial outputs for debugging
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--keep-going \
--rerun-incomplete \
--keep-incomplete \
--jobs 13
# For development/debugging
# --printshellcmds: show actual commands being run
# --verbose: detailed logging
snakemake --configfile config/config.yaml \
--keep-going \
--printshellcmds \
--verbose \
--cores 4

- --keep-going: The most important flag - prevents one failed library from stopping the others
- --rerun-incomplete: Automatically retries jobs that didn't complete successfully
- --keep-incomplete: Preserves partial outputs for debugging failed jobs
- --max-jobs-per-second 5: Limits the job submission rate to avoid scheduler overload
The workflow includes automatic retry logic with memory scaling:
resources:
base_memory_gb: 8 # First attempt: 8GB
memory_multiplier: 2 # Second attempt: 16GB
max_memory_gb: 64 # Third attempt: 32GB; fourth attempt: 64GB (max)

- ≤7 cores: Automatically uses Phase 1 (sequential streaming)
- ≥8 cores: Automatically uses Phase 2 (producer-consumer pipeline)
- 16+ cores: Optimal for Phase 2 with custom threading
- 32+ cores: Consider custom thread allocation for maximum efficiency
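The retry memory scaling shown in the resources block follows a simple doubling-with-cap rule, sketched here (the function name is illustrative, not the workflow's API):

```python
def attempt_memory_gb(attempt, base_gb=8, multiplier=2, max_gb=64):
    """Memory requested on the Nth attempt: base memory doubled per
    retry, capped at the configured maximum."""
    return min(base_gb * multiplier ** (attempt - 1), max_gb)
```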
# Monitor specific library progress
tail -f logs/fingerprints_{library_id}.log
# Watch Snakemake workflow status
watch -n 30 "snakemake --configfile config/config.yaml --summary"
# Check cluster job status (SLURM)
squeue -u $USER
# Monitor resource usage
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--cluster-status scripts/slurm_status.py # If available

# Rerun only failed jobs
snakemake --configfile config/config.yaml \
--rerun-incomplete \
--keep-going \
--cores 4
# Force rerun specific library
snakemake --configfile config/config.yaml \
--forcerun results/{library_id} \
--cores 4
# Detailed error analysis
snakemake --configfile config/config.yaml \
--keep-going \
--printshellcmds \
--verbose \
--cores 1

# For small libraries (< 1M compounds)
batch_size: 50000
# For medium libraries (1-10M compounds)
batch_size: 100000 # Recommended default
# For large libraries (> 10M compounds)
batch_size: 200000

# For ML training datasets
output_chunk_size: 2000000 # 2M molecules per file
# For analysis workflows
output_chunk_size: 5000000 # 5M molecules per file
# For memory-constrained systems
output_chunk_size: 1000000 # 1M molecules per file

# Create workflow diagram
snakemake --configfile config/config.yaml --dag | dot -Tpng > workflow_dag.png
# Rule dependency graph
snakemake --configfile config/config.yaml --rulegraph | dot -Tpng > rules_dag.png
# File dependency graph
snakemake --configfile config/config.yaml --filegraph | dot -Tpng > files_dag.png

# Detailed job statistics
snakemake --configfile config/config.yaml --detailed-summary
# Resource usage report
snakemake --configfile config/config.yaml --stats stats.json
# Generate benchmark report (if benchmarking enabled)
snakemake --configfile config/config.yaml reports/fingerprint_generation_report.html

# Remove all output files
snakemake --configfile config/config.yaml --delete-all-output
# Remove specific library output
rm -rf results/{library_id}
# Clean conda environments (if needed)
snakemake --configfile config/config.yaml --use-conda --conda-cleanup-envs
# Clean up Snakemake metadata
rm -rf .snakemake/

# Check output sizes
du -sh results/*/
# Compress old fingerprint files
find results/ -name "*.npz" -mtime +30 -exec gzip {} \;
# Archive completed libraries
tar -czf library_{library_id}_$(date +%Y%m%d).tar.gz results/{library_id}/

Real-world performance comparison on 9.84 million molecules from qDOS24.parquet:
| Approach | Status | Time | Throughput | Memory | Efficiency Score |
|---|---|---|---|---|---|
| Sequential Streaming | ✅ SUCCESS | 3m 29s | 168.7 M/h | 7.1 GB | 1,389,591 |
| Dedicated I/O Cores | ✅ SUCCESS | 2m 21s | 250.8 M/h | 9.0 GB | 1,091,543 |
Performance Gains with Dedicated I/O Cores:
- 48% faster processing: 250.8 vs 168.7 million molecules/hour
- 32% time reduction: 2m 21s vs 3m 29s
- Minimal memory overhead: Only 1.9GB additional memory usage
- Scalable architecture: Performance gains increase with larger datasets
- Sequential Streaming (≤7 cores): Single-threaded I/O with parallel computation
- Dedicated I/O Cores (≥8 cores): Overlapped I/O operations with separate read/process/write threads
- Multi-type processing: ECFP + FCFP in single execution for maximum efficiency
- Memory efficiency: ~3-4GB per core during dedicated I/O operation
- Batch size 100K: Provides optimal balance of memory usage and throughput
- Output chunks 2M: Creates ML-friendly file sizes without excessive I/O overhead
- Threading (Dedicated I/O): 1 read + N-2 process + 1 write threads optimal for most workloads
- Buffer size: 3-4 batches provides good throughput without excessive memory
- Local workstation (8-16 cores): Use dedicated I/O cores with default auto-allocation
- Cluster node (24-32 cores): Custom threading with 2-4 I/O threads
- High-memory node (64+ cores): Increase batch size to 200K for optimal efficiency
- Multiple libraries: Use --jobs 10-20 for parallel library processing
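The "1 read + N-2 process + 1 write" auto-allocation rule described above can be sketched like this (hypothetical helper; the script's actual defaults may differ — explicit non-zero values pass through unchanged, mirroring the `0 = auto` config convention):

```python
def allocate_threads(n_jobs, read_threads=0, process_threads=0, write_threads=0):
    """Fill in thread counts left at 0 (auto) with the default split:
    one reader, one writer, and the remaining cores for processing."""
    read = read_threads or 1
    write = write_threads or 1
    process = process_threads or max(1, n_jobs - read - write)
    return read, process, write
```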
| Library Size | Recommended Config | Memory | Runtime | Throughput |
|---|---|---|---|---|
| < 1M compounds | Sequential streaming, batch 50K | 4-8GB | 15-30 min | ~170M mol/h |
| 1-10M compounds | Dedicated I/O cores, batch 100K | 8-16GB | 1-3 hours | ~250M mol/h |
| 10-50M compounds | Dedicated I/O cores, batch 200K | 16-32GB | 4-12 hours | ~280M mol/h |
| 50M+ compounds | Dedicated I/O cores, custom threads | 32-64GB | 12+ hours | ~300M mol/h |
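The batch-size bands from the sizing guidance above reduce to a small lookup, sketched here (`recommended_batch_size` is a hypothetical helper; thresholds follow the README's library-size bands):

```python
def recommended_batch_size(n_compounds):
    """Batch size per the sizing guidance: 50K for small libraries,
    100K for medium, 200K for large."""
    if n_compounds < 1_000_000:
        return 50_000       # small libraries (< 1M compounds)
    if n_compounds <= 10_000_000:
        return 100_000      # medium libraries (recommended default)
    return 200_000          # large libraries (> 10M compounds)
```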
Problem: One failed library causes entire workflow to stop
Solution: Use --keep-going flag
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--use-conda \
--keep-going \
--rerun-incomplete \
--jobs 13

Problem: Failed jobs are deleted and can't be debugged
Solution: Use --keep-incomplete flag
# --keep-incomplete: preserve failed outputs
# --printshellcmds: show exact commands
snakemake --configfile config/config.yaml \
--keep-going \
--keep-incomplete \
--printshellcmds \
--cores 4

Symptoms:
- MemoryError in logs
- Jobs killed by SLURM with OUT_OF_MEMORY
- Process killed with signal 9
Solutions:
# Reduce resource requirements
performance:
batch_size: 50000 # Reduce from 100000
n_jobs: 16 # Reduce from 24
# Or increase memory allocation
resources:
base_memory_gb: 16 # Increase from 8
max_memory_gb: 128 # Increase limit

Symptoms:
- Very slow performance despite many cores
- High CPU wait times
- Dedicated I/O cores slower than sequential streaming
Solutions:
# Force sequential streaming for debugging
performance:
force_sequential: true
# Or optimize dedicated I/O cores threading
performance:
read_threads: 1 # Single reader
process_threads: 22 # Most cores for processing
write_threads: 1 # Single writer
buffer_size: 2 # Reduce if memory limited

Symptoms:
- Slow performance with small batch sizes
- File not found errors
- Permission denied errors
Solutions:
# Check file permissions
ls -la /path/to/libraries/
# Increase batch size for better I/O efficiency
batch_size: 200000
# Use local scratch space if available
export TMPDIR=/local/scratch/$USER

Symptoms:
- Jobs stuck in pending state
- Jobs cancelled by scheduler
- Resource allocation errors
Solutions:
# Check cluster policies
sinfo
sqos
sacctmgr show qos
# Adjust job submission rate
snakemake --configfile config/config.yaml \
--profile profiles/slurm \
--max-jobs-per-second 2 \
--jobs 5 # Reduce parallel jobs

Symptoms:
- Package conflicts
- Environment creation failures
- Import errors
Solutions:
# Clean conda environments
snakemake --use-conda --conda-cleanup-envs
# Force environment recreation
rm -rf .snakemake/conda/
# Use mamba for faster resolution
snakemake --conda-frontend mamba \
--use-conda \
--cores 4

# Check workflow status
snakemake --configfile config/config.yaml --summary
# Look for failed jobs
snakemake --configfile config/config.yaml --detailed-summary
# Check specific job logs
cat logs/fingerprints_{library_id}.log

# Test with single library
snakemake --configfile config/test_phase2.yaml \
--cores 4 \
--verbose
# Force rerun specific rule
snakemake --forcerun generate_fingerprints \
--cores 4

# Try sequential streaming approach
python scripts/generate_fingerprints.py \
/path/to/library.parquet \
test_output \
--no-io-threads \
--batch-size 50000 \
--n-jobs 4 \
--log-level DEBUG
# Try dedicated I/O cores with minimal threading
python scripts/generate_fingerprints.py \
/path/to/library.parquet \
test_output \
--read-threads 1 \
--process-threads 6 \
--write-threads 1 \
--buffer-size 2 \
--batch-size 50000 \
--n-jobs 8

- Workflow logs: logs/fingerprints_{library_id}.log
- Snakemake logs: .snakemake/log/
- SLURM logs: usually in slurm-{job_id}.out
- Error logs: check both stdout and stderr in job logs
# System resource check
free -h # Memory
lscpu # CPU info
df -h # Disk space
# Process monitoring
htop # Real-time process monitor
iostat -x 1 # I/O statistics
# Cluster status (SLURM)
squeue -u $USER # Your jobs
sinfo # Node status

# Profile single library run
python -m cProfile -o profile.stats \
scripts/generate_fingerprints.py \
library.parquet output_dir
# Memory profiling
python -m memory_profiler \
scripts/generate_fingerprints.py \
library.parquet output_dir

- Python 3.8+
- scikit-fingerprints
- pandas
- numpy
- scipy
- psutil
- pyarrow
- snakemake
- CPU: Minimum 4 cores, optimal 16+ cores for Phase 2 pipeline
- Memory: 8GB minimum, 16-32GB recommended for large libraries
- Storage: ~10-20GB per million compounds (including intermediate files)
- OS: Linux/macOS (Windows with WSL2)
- Mamba: Faster conda environment resolution (--conda-frontend mamba)
- Local scratch storage: For temporary files during processing
- High-speed interconnect: For cluster-based parallel processing