FingerDELicious: DEL Fingerprint Generation Workflow

A high-performance Snakemake workflow for generating molecular fingerprints from DNA-encoded library (DEL) compounds, featuring an advanced dedicated I/O cores pipeline for 48% faster throughput (250 vs 170 million molecules/hour).

Requirements

This workflow is completely self-contained and only requires:

  • Snakemake (with conda/mamba support)
  • Conda or Mamba for environment management

No other dependencies need to be pre-installed!

Quick Start

1. Install Snakemake (if not already installed)

# Using conda
conda install -c conda-forge snakemake=7 cookiecutter

# Or using mamba (recommended for faster installs)
mamba install -c conda-forge snakemake=7 cookiecutter

2. Clone/Download the Workflow

# If you have git
git clone <repository-url>
cd FingerDELicious

# Or download and extract the workflow files

3. Configure Your Input

Edit config/config.yaml to specify your input files:

input:
  # Option 1: Specify individual files
  files:
    - "/path/to/your/library1.parquet"
    - "/path/to/your/library2.parquet"
  
  # Option 2: Scan a directory for files
  directory: "/path/to/your/libraries/"
  pattern: "*.parquet"  # File pattern to match
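
Conceptually, both options resolve to a single list of parquet files. A minimal sketch of that resolution (illustrative only; the workflow's actual resolution logic may differ):

```python
from pathlib import Path

def resolve_inputs(cfg: dict) -> list[str]:
    """Resolve the 'input' config section to a concrete file list (illustrative)."""
    if cfg.get("files"):                       # Option 1: explicit file list
        return list(cfg["files"])
    pattern = cfg.get("pattern", "*.parquet")  # Option 2: directory scan
    return sorted(str(p) for p in Path(cfg["directory"]).glob(pattern))
```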

4. Run the Workflow

Recommended Production Command

# Full production run with robust job management
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --use-conda \
          --conda-frontend conda \
          --rerun-incomplete \
          --keep-going \
          --jobs 13

# Alternative: Keep failed outputs for debugging
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --use-conda \
          --conda-frontend conda \
          --rerun-incomplete \
          --keep-going \
          --keep-incomplete \
          --jobs 13

Development & Testing Commands

# Test the workflow (dry run)
snakemake --configfile config/config.yaml --dry-run

# Local testing with default profile
snakemake --configfile config/config.yaml \
          --profile profiles/default \
          --use-conda \
          --cores 8

# Quick test with small dataset
snakemake --configfile config/test_phase2.yaml \
          --profile profiles/default \
          --use-conda \
          --cores 4

Slurm-Specific Commands

# Standard cluster execution
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --use-conda \
          --jobs 10

# Maximum parallelism (adjust based on cluster policies)
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --use-conda \
          --conda-frontend mamba \
          --jobs 20 \
          --max-jobs-per-second 5

# Restart failed workflow
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --use-conda \
          --rerun-incomplete \
          --keep-going \
          --jobs 13

Debugging & Monitoring Commands

# Generate workflow visualization
snakemake --configfile config/config.yaml --dag | dot -Tpng > workflow_dag.png

# Detailed dry run with reasons
snakemake --configfile config/config.yaml --dry-run --reason

# Force rerun of specific rule
snakemake --configfile config/config.yaml \
          --forcerun generate_fingerprints \
          --use-conda \
          --jobs 4

# Monitor workflow status
snakemake --configfile config/config.yaml --summary

Key Snakemake Flags Explained

  • --keep-going: Prevents job failure cascade - continues other jobs even if some fail
  • --keep-incomplete: Keeps partial outputs from failed jobs for debugging
  • --rerun-incomplete: Automatically reruns jobs that didn't complete successfully
  • --conda-frontend mamba: Uses mamba for faster environment creation (if available)
  • --max-jobs-per-second: Limits job submission rate to avoid overwhelming scheduler
  • --jobs N: Maximum number of simultaneous jobs (adjust based on cluster limits)

Architecture: Dedicated I/O Cores Pipeline

The workflow features an advanced two-approach processing architecture that automatically selects the optimal method based on available resources:

Sequential Streaming (≤7 cores)

  • Memory-efficient sequential processing
  • Optimal for resource-constrained environments
  • Single-threaded I/O with parallel fingerprint computation
  • Performance: ~170 million molecules/hour

Dedicated I/O Cores Pipeline (≥8 cores)

  • Overlapped I/O operations while CPU processes fingerprints
  • Separate thread pools for reading, processing, and writing
  • Bounded buffering prevents memory bloat
  • Performance boost: 48% faster throughput (250 vs 170 million molecules/hour)
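
The overlapped read → process → write pattern with bounded buffering can be sketched with standard-library queues (illustrative; the script's actual internals are not shown in this README):

```python
import queue
import threading

def run_pipeline(batches, process, write, buffer_size=4):
    """Overlap reading, processing, and writing via bounded queues."""
    read_q = queue.Queue(maxsize=buffer_size)   # bounded: backpressure on the reader
    write_q = queue.Queue(maxsize=buffer_size)  # bounded: backpressure on the processor
    DONE = object()

    def reader():
        for batch in batches:
            read_q.put(batch)                   # blocks when the buffer is full
        read_q.put(DONE)

    def processor():
        while (batch := read_q.get()) is not DONE:
            write_q.put(process(batch))
        write_q.put(DONE)

    def writer():
        while (result := write_q.get()) is not DONE:
            write(result)

    threads = [threading.Thread(target=fn) for fn in (reader, processor, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With one reader, one processor, and one writer thread, batch order is preserved end to end, and the bounded queues are what keep memory usage flat.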

Automatic Selection

The script automatically chooses the optimal approach based on available cores:

performance:
  n_jobs: 12                    # ≥8 cores → dedicated I/O cores pipeline
  # n_jobs of 7 or fewer → sequential streaming
  force_sequential: false       # Set true to force sequential for debugging
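
Written out as code, the documented selection rule looks like this (a hypothetical helper mirroring the thresholds above, not the script itself):

```python
def select_approach(n_jobs: int, force_sequential: bool = False) -> str:
    """Mirror the documented approach-selection rule (illustrative only)."""
    if force_sequential or n_jobs <= 7:
        return "sequential_streaming"
    return "dedicated_io_cores"
```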

What the Workflow Does

Intelligent Processing Pipeline

  • Automatic approach selection: Chooses optimal processing method based on available cores
  • Dedicated I/O cores architecture: Overlaps I/O operations with computation for maximum throughput
  • Memory-efficient streaming: Processes large datasets without loading everything into memory
  • Robust error handling: Continues processing other libraries even if individual jobs fail

Automatic Environment Management

  • fingerprints.yaml: Creates environment with scikit-fingerprints, pandas, numpy, scipy
  • report.yaml: Creates environment with matplotlib, seaborn for visualizations
  • All dependencies are automatically installed by Snakemake

Processing Steps

  1. Input Validation: Scans parquet files and extracts library IDs
  2. Approach Selection: Automatically chooses Sequential Streaming or Dedicated I/O Cores based on core count
  3. Fingerprint Generation: Generates ECFP/FCFP fingerprints with optimized I/O and computation overlap
  4. Output Organization: Creates structured directories by library ID with incremental chunk numbering
  5. Performance Analytics: Comprehensive HTML report with visualizations

Key Features

  • Multi-type processing: Generate ECFP and FCFP fingerprints in a single execution
  • Flexible input: Support for individual parquet files or directory scanning
  • Library ID extraction: Uses library_ID from compound column for output naming
  • Profile-based execution: Local development or cluster execution
  • Threading control: Fine-tune read/process/write thread allocation
  • Structured output: Organized directory structure with chunked files for ML workflows
  • Robust job management: Prevents cascading failures with --keep-going flag

Directory Structure

FingerDELicious/
├── Snakefile              # Main workflow definition
├── config/
│   ├── config.yaml        # Main configuration file
│   └── config_test.yaml   # Test configuration (smaller datasets)
├── scripts/
│   ├── generate_fingerprints.py  # Fingerprint generation script
│   └── generate_report.py         # Enhanced HTML report generation script
├── envs/
│   └── report.yaml        # Conda environment for reporting
├── profiles/
│   ├── default/
│   │   └── config.yaml    # Sequential execution (1 job)
│   └── slurm/
│       └── config.yaml    # Parallel execution (4 jobs × 32 cores × 96GB)
├── reports/               # Generated HTML reports and performance data
└── logs/                  # Job logs

Generating Reports

The workflow automatically generates a comprehensive HTML report with:

  • System information and configuration details
  • Performance metrics and visualizations
  • Processing statistics per library
  • Resource usage analysis

Generate report only:

cd FingerDELicious
snakemake --configfile config/config.yaml \
          reports/fingerprint_generation_report.html

Configuration

Main Settings (config.yaml)

# Input configuration
input:
  files:
    - "/path/to/library1.parquet"
    - "/path/to/library2.parquet"

# Fingerprint generation settings
fingerprints:
  types: ["ecfp", "fcfp"]  # Generate both types in single execution
  radius: 2                # ECFP radius (2 = ECFP4)
  fp_size: 2048           # Fingerprint vector size

# Performance settings
performance:
  batch_size: 100000        # Processing batch size
  output_chunk_size: 2000000  # Output file chunk size
  n_jobs: 20               # CPU cores (auto-selects sequential vs dedicated I/O)
  
  # Dedicated I/O Cores Threading Configuration
  read_threads: 0          # Background data reading (0 = auto)
  process_threads: 0       # Fingerprint computation (0 = auto)
  write_threads: 0         # Background output writing (0 = auto)
  buffer_size: 0           # Memory buffering (0 = auto)
  force_sequential: false  # Force sequential processing

# Resource allocation
resources:
  base_memory_gb: 8        # Base memory per job
  memory_multiplier: 2     # Retry scaling factor
  max_memory_gb: 64        # Maximum memory limit
  max_runtime_hours: 4     # Job timeout

Threading Configuration Examples

Auto-Configuration (Recommended)

performance:
  n_jobs: 24              # Let script choose optimal approach
  read_threads: 0         # Auto: 1 thread for ≥8 cores
  process_threads: 0      # Auto: 22 threads for ≥8 cores
  write_threads: 0        # Auto: 1 thread for ≥8 cores
  buffer_size: 0          # Auto: based on batch size
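
The auto-allocation comments above imply this split (a sketch of the documented rule, not the script's code):

```python
def auto_threads(n_jobs):
    """Documented auto split: 1 reader + (n_jobs - 2) processors + 1 writer for >=8 cores."""
    if n_jobs < 8:
        return None  # sequential streaming: no dedicated I/O threads
    return {"read_threads": 1, "process_threads": n_jobs - 2, "write_threads": 1}
```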

Custom Threading (Advanced)

performance:
  n_jobs: 32              # Total cores available
  read_threads: 2         # Dedicated I/O reading
  process_threads: 28     # Fingerprint computation
  write_threads: 2        # Dedicated I/O writing
  buffer_size: 4          # 4 batches in memory

Force Sequential Processing

performance:
  n_jobs: 24                    # Even with many cores...
  force_sequential: true        # ...force sequential streaming

Profile-Specific Settings

Default Profile: Sequential processing, 1 job at a time

# profiles/default/config.yaml
jobs: 1              # Sequential execution
local-cores: 1       # Use 1 core locally

Slurm Profile: Parallel processing, up to 4 jobs × 32 cores

# profiles/slurm/config.yaml
jobs: 4              # Up to 4 parallel jobs
__default__:
  threads: 32        # 32 cores per job
  mem_gb: 96         # 96GB RAM per job
  runtime: 240       # 4 hours max per job

Output Structure

Each library generates its own directory structure:

results/
└── {library_ID}/
    ├── ecfp_fingerprints/
    │   ├── {library_ID}_ecfp_chunk-0.npz
    │   └── {library_ID}_ecfp_chunk-1.npz
    ├── fcfp_fingerprints/
    │   ├── {library_ID}_fcfp_chunk-0.npz
    │   └── {library_ID}_fcfp_chunk-1.npz
    └── metadata/
        ├── {library_ID}_metadata_chunk-0.parquet
        └── {library_ID}_metadata_chunk-1.parquet
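
Downstream code can enumerate this layout by incrementing the chunk index. A small path-only sketch (file contents and loaders are out of scope here):

```python
from pathlib import Path

def chunk_paths(results_dir, library_id, fp_type="ecfp", n_chunks=2):
    """Yield (fingerprint, metadata) path pairs for the layout shown above."""
    root = Path(results_dir) / library_id
    for i in range(n_chunks):
        yield (
            root / f"{fp_type}_fingerprints" / f"{library_id}_{fp_type}_chunk-{i}.npz",
            root / "metadata" / f"{library_id}_metadata_chunk-{i}.parquet",
        )
```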

Resource Requirements

Default Profile

  • CPU: 20 cores per library (configurable)
  • Memory: ~60GB RAM per library
  • Execution: Sequential (1 library at a time)

Slurm Profile

  • CPU: 32 cores per job, up to 4 jobs = 128 cores total
  • Memory: 96GB RAM per job, up to 384GB total
  • Execution: Up to 4 libraries in parallel

Advanced Usage

Preventing Job Failure Cascades

One of the most important considerations when running large-scale workflows is preventing individual job failures from stopping the entire workflow. Use these strategies:

Essential Flags for Robust Execution

# Recommended production command
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --use-conda \
          --keep-going \
          --rerun-incomplete \
          --keep-incomplete \
          --jobs 13
# --keep-going: continue other jobs if some fail
# --rerun-incomplete: retry failed/incomplete jobs
# --keep-incomplete: keep partial outputs for debugging

# For development/debugging
snakemake --configfile config/config.yaml \
          --keep-going \
          --printshellcmds \
          --verbose \
          --cores 4
# --printshellcmds: show the actual commands being run
# --verbose: detailed logging

Job Failure Management

  • --keep-going: Most important flag - prevents one failed library from stopping others
  • --rerun-incomplete: Automatically retries jobs that didn't complete successfully
  • --keep-incomplete: Preserves partial outputs for debugging failed jobs
  • --max-jobs-per-second 5: Limits job submission rate to avoid scheduler overload

Memory and Resource Management

Automatic Memory Scaling

The workflow includes automatic retry logic with memory scaling:

resources:
  base_memory_gb: 8        # First attempt: 8GB
  memory_multiplier: 2     # Second attempt: 16GB
  max_memory_gb: 64        # Third attempt: 32GB, Fourth: 64GB (max)
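
The scaling rule in the comments can be expressed directly (illustrative; parameter names follow the config keys):

```python
def memory_for_attempt(attempt, base_memory_gb=8, memory_multiplier=2, max_memory_gb=64):
    """Memory (GB) requested on a given 1-based retry attempt, per the documented rule."""
    return min(base_memory_gb * memory_multiplier ** (attempt - 1), max_memory_gb)
```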

Core Count Guidelines

  • ≤7 cores: Automatically uses sequential streaming
  • ≥8 cores: Automatically uses the dedicated I/O cores pipeline
  • 16+ cores: Optimal for the dedicated I/O cores pipeline with custom threading
  • 32+ cores: Consider custom thread allocation for maximum efficiency

Monitoring and Debugging

Real-time Monitoring

# Monitor specific library progress
tail -f logs/fingerprints_{library_id}.log

# Watch Snakemake workflow status
watch -n 30 "snakemake --configfile config/config.yaml --summary"

# Check cluster job status (SLURM)
squeue -u $USER

# Monitor resource usage
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --cluster-status scripts/slurm_status.py  # If available

Debugging Failed Jobs

# Rerun only failed jobs
snakemake --configfile config/config.yaml \
          --rerun-incomplete \
          --keep-going \
          --cores 4

# Force rerun specific library
snakemake --configfile config/config.yaml \
          --forcerun results/{library_id} \
          --cores 4

# Detailed error analysis
snakemake --configfile config/config.yaml \
          --keep-going \
          --printshellcmds \
          --verbose \
          --cores 1

Performance Optimization

Batch Size Optimization

# For small libraries (< 1M compounds)
batch_size: 50000

# For medium libraries (1-10M compounds)  
batch_size: 100000         # Recommended default

# For large libraries (> 10M compounds)
batch_size: 200000

Output Chunk Optimization

# For ML training datasets
output_chunk_size: 2000000   # 2M molecules per file

# For analysis workflows
output_chunk_size: 5000000   # 5M molecules per file

# For memory-constrained systems
output_chunk_size: 1000000   # 1M molecules per file
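
The number of output files per library and fingerprint type follows directly from the chunk size; for instance, the 9.84M-molecule benchmark library at the 2M default:

```python
import math

def n_output_chunks(n_molecules, output_chunk_size=2_000_000):
    """Number of chunk files produced for one library and fingerprint type."""
    return math.ceil(n_molecules / output_chunk_size)

n_output_chunks(9_840_000)  # -> 5 chunk files
```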

Workflow Visualization and Analysis

Generate Workflow DAG

# Create workflow diagram
snakemake --configfile config/config.yaml --dag | dot -Tpng > workflow_dag.png

# Rule dependency graph
snakemake --configfile config/config.yaml --rulegraph | dot -Tpng > rules_dag.png

# File dependency graph  
snakemake --configfile config/config.yaml --filegraph | dot -Tpng > files_dag.png

Performance Analysis

# Detailed job statistics
snakemake --configfile config/config.yaml --detailed-summary

# Resource usage report
snakemake --configfile config/config.yaml --stats stats.json

# Generate benchmark report (if benchmarking enabled)
snakemake --configfile config/config.yaml reports/fingerprint_generation_report.html

Clean Up and Maintenance

Cleanup Commands

# Remove all output files
snakemake --configfile config/config.yaml --delete-all-output

# Remove specific library output
rm -rf results/{library_id}

# Clean conda environments (if needed)
snakemake --configfile config/config.yaml --use-conda --conda-cleanup-envs

# Clean up Snakemake metadata
rm -rf .snakemake/

Disk Space Management

# Check output sizes
du -sh results/*/

# Compress old fingerprint files
find results/ -name "*.npz" -mtime +30 -exec gzip {} \;

# Archive completed libraries
tar -czf library_{library_id}_$(date +%Y%m%d).tar.gz results/{library_id}/

Performance Notes

Dedicated I/O Cores Performance Benchmark

Real-world performance comparison on 9.84 million molecules from qDOS24.parquet:

Approach              Status     Time    Throughput  Memory  Efficiency Score
Sequential Streaming  ✅ SUCCESS  3m 29s  168.7 M/h   7.1 GB  1,389,591
Dedicated I/O Cores   ✅ SUCCESS  2m 21s  250.8 M/h   9.0 GB  1,091,543

Performance Gains with Dedicated I/O Cores:

  • 48% faster processing: 250.8 vs 168.7 million molecules/hour
  • 32% time reduction: 2m 21s vs 3m 29s
  • Minimal memory overhead: Only 1.9GB additional memory usage
  • Scalable architecture: Performance gains increase with larger datasets
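
These figures can be sanity-checked from the raw numbers (9.84 million molecules; small rounding differences from the reported values are expected):

```python
def throughput_m_per_hour(n_molecules, seconds):
    """Throughput in millions of molecules per hour."""
    return n_molecules / 1e6 * 3600 / seconds

sequential = throughput_m_per_hour(9_840_000, 3 * 60 + 29)  # ~169 M/h
dedicated = throughput_m_per_hour(9_840_000, 2 * 60 + 21)   # ~251 M/h
speedup = dedicated / sequential - 1                        # ~0.48, i.e. 48% faster
```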

Architecture Performance Details

  • Sequential Streaming (≤7 cores): Single-threaded I/O with parallel computation
  • Dedicated I/O Cores (≥8 cores): Overlapped I/O operations with separate read/process/write threads
  • Multi-type processing: ECFP + FCFP in single execution for maximum efficiency
  • Memory efficiency: ~3-4GB per core during dedicated I/O operation

Optimal Configuration Benchmarks

  • Batch size 100K: Provides optimal balance of memory usage and throughput
  • Output chunks 2M: Creates ML-friendly file sizes without excessive I/O overhead
  • Threading (Dedicated I/O): 1 read + N-2 process + 1 write threads optimal for most workloads
  • Buffer size: 3-4 batches provides good throughput without excessive memory

Scaling Guidelines

  • Local workstation (8-16 cores): Use dedicated I/O cores with default auto-allocation
  • Cluster node (24-32 cores): Custom threading with 2-4 I/O threads
  • High-memory node (64+ cores): Increase batch size to 200K for optimal efficiency
  • Multiple libraries: Use --jobs 10-20 for parallel library processing

Resource Requirements by Scale

Library Size      Recommended Config                   Memory   Runtime     Throughput
< 1M compounds    Sequential streaming, batch 50K      4-8GB    15-30 min   170K mol/h
1-10M compounds   Dedicated I/O cores, batch 100K      8-16GB   1-3 hours   250K mol/h
10-50M compounds  Dedicated I/O cores, batch 200K      16-32GB  4-12 hours  280K mol/h
50M+ compounds    Dedicated I/O cores, custom threads  32-64GB  12+ hours   300K mol/h

Troubleshooting

Job Failure Prevention

Problem: One failed library causes the entire workflow to stop.
Solution: Use the --keep-going flag:

snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --use-conda \
          --keep-going \
          --rerun-incomplete \
          --jobs 13

Problem: Partial outputs of failed jobs are deleted and can't be debugged.
Solution: Use the --keep-incomplete flag:

snakemake --configfile config/config.yaml \
          --keep-going \
          --keep-incomplete \
          --printshellcmds \
          --cores 4

Common Issues and Solutions

1. Memory-Related Failures

Symptoms:

  • MemoryError in logs
  • Jobs killed by SLURM with OUT_OF_MEMORY
  • Process killed with signal 9

Solutions:

# Reduce resource requirements
performance:
  batch_size: 50000        # Reduce from 100000
  n_jobs: 16              # Reduce from 24
  
# Or increase memory allocation
resources:
  base_memory_gb: 16       # Increase from 8
  max_memory_gb: 128       # Increase limit

2. Threading and Core Issues

Symptoms:

  • Very slow performance despite many cores
  • High CPU wait times
  • Dedicated I/O cores slower than sequential streaming

Solutions:

# Force sequential streaming for debugging
performance:
  force_sequential: true

# Or optimize dedicated I/O cores threading
performance:
  read_threads: 1          # Single reader
  process_threads: 22      # Most cores for processing
  write_threads: 1         # Single writer
  buffer_size: 2           # Reduce if memory limited

3. File System and I/O Issues

Symptoms:

  • Slow performance with small batch sizes
  • File not found errors
  • Permission denied errors

Solutions:

# Check file permissions
ls -la /path/to/libraries/

# Increase batch size for better I/O efficiency
batch_size: 200000

# Use local scratch space if available
export TMPDIR=/local/scratch/$USER

4. SLURM-Specific Issues

Symptoms:

  • Jobs stuck in pending state
  • Jobs cancelled by scheduler
  • Resource allocation errors

Solutions:

# Check cluster policies
sinfo
sqos
sacctmgr show qos

# Adjust job submission rate
snakemake --configfile config/config.yaml \
          --profile profiles/slurm \
          --max-jobs-per-second 2 \
          --jobs 5              # Reduce parallel jobs

5. Conda Environment Issues

Symptoms:

  • Package conflicts
  • Environment creation failures
  • Import errors

Solutions:

# Clean conda environments
snakemake --use-conda --conda-cleanup-envs

# Force environment recreation
rm -rf .snakemake/conda/

# Use mamba for faster resolution
snakemake --conda-frontend mamba \
          --use-conda \
          --cores 4

Debugging Workflow

Step 1: Identify the Problem

# Check workflow status
snakemake --configfile config/config.yaml --summary

# Look for failed jobs
snakemake --configfile config/config.yaml --detailed-summary

# Check specific job logs
cat logs/fingerprints_{library_id}.log

Step 2: Isolate the Issue

# Test with single library
snakemake --configfile config/test_phase2.yaml \
          --cores 4 \
          --verbose

# Force rerun specific rule
snakemake --forcerun generate_fingerprints \
          --cores 4

Step 3: Test Solutions

# Try sequential streaming approach
python scripts/generate_fingerprints.py \
    /path/to/library.parquet \
    test_output \
    --no-io-threads \
    --batch-size 50000 \
    --n-jobs 4 \
    --log-level DEBUG

# Try dedicated I/O cores with minimal threading
python scripts/generate_fingerprints.py \
    /path/to/library.parquet \
    test_output \
    --read-threads 1 \
    --process-threads 6 \
    --write-threads 1 \
    --buffer-size 2 \
    --batch-size 50000 \
    --n-jobs 8

Getting Help

Log File Locations

  • Workflow logs: logs/fingerprints_{library_id}.log
  • Snakemake logs: .snakemake/log/
  • SLURM logs: Usually in slurm-{job_id}.out
  • Error logs: Check both stdout and stderr in job logs

Diagnostic Commands

# System resource check
free -h                    # Memory
lscpu                     # CPU info
df -h                     # Disk space

# Process monitoring
htop                      # Real-time process monitor
iostat -x 1               # I/O statistics

# Cluster status (SLURM)
squeue -u $USER           # Your jobs
sinfo                     # Node status

Performance Profiling

# Profile single library run
python -m cProfile -o profile.stats \
    scripts/generate_fingerprints.py \
    library.parquet output_dir

# Memory profiling
python -m memory_profiler \
    scripts/generate_fingerprints.py \
    library.parquet output_dir

Dependencies

Automatically Managed (via Conda)

  • Python 3.8+
  • scikit-fingerprints
  • pandas
  • numpy
  • scipy
  • psutil
  • pyarrow
  • snakemake

System Requirements

  • CPU: Minimum 4 cores, optimal 16+ cores for the dedicated I/O cores pipeline
  • Memory: 8GB minimum, 16-32GB recommended for large libraries
  • Storage: ~10-20GB per million compounds (including intermediate files)
  • OS: Linux/macOS (Windows with WSL2)

Optional for Enhanced Performance

  • Mamba: Faster conda environment resolution (--conda-frontend mamba)
  • Local scratch storage: For temporary files during processing
  • High-speed interconnect: For cluster-based parallel processing
