
markdups and molcov #210

Merged
17 commits merged into main from markdups_improvement on Feb 24, 2025

Conversation

pdimens (Owner) commented Feb 22, 2025

This PR:

  1. Improves samtools markdup's detection of optical duplicates by using a -d value inferred from the sequencer ID of the alignments within a sample.
  2. Uses a binned-range Counter approach to calculate molecular coverage, which is much faster and very light. This eliminates the SQLite overhead and makes the code easier to read and maintain. Plus it's fast. Did I mention it's fast? And light.
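
For illustration, point 2 boils down to something like this minimal sketch (the window size and molecule coordinates are hypothetical, not the actual harpy code):

from collections import Counter

windowsize = 1000                      # hypothetical bin width
molecules = [(0, 2500), (1800, 3200)]  # hypothetical (start, end) pairs
coverage = Counter()

for start, end in molecules:
    # every window the molecule overlaps gets one more unit of coverage
    for bin_idx in range(start // windowsize, end // windowsize + 1):
        coverage[bin_idx] += 1

print(coverage)  # Counter({1: 2, 2: 2, 0: 1, 3: 1})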

Summary by CodeRabbit

  • New Features

    • Alignment commands now include an option for specifying the distance cutoff for molecule assignment.
    • Chart visualizations have updated axis scaling for inferred molecule lengths.
    • Workflows incorporate a dynamic optical duplicate buffer to enhance duplicate marking.
    • New functions for calculating molecular coverage with in-memory counting have been introduced.
    • New variables have been added for improved configuration in linked read simulations.
    • Updated command-line options for enhanced input validation across multiple commands.
    • Added a new dependency on click version 8.
  • Refactor

    • Optimized output processing with enhanced compression methods.
    • Improved data partitioning and demultiplexing for smoother workflow execution.
  • Chores

    • Centralized configuration and added cleanup routines to better manage temporary simulation files.


coderabbitai bot commented Feb 22, 2025

📝 Walkthrough

Walkthrough

This pull request introduces multiple enhancements across the codebase. It modifies command-line options for alignment commands, particularly changing the default value for --molecule-distance and updating relevant documentation. The compression workflow in the haplotag script now utilizes subprocess-based gzip, while the molecule coverage computation has been refactored to use in-memory counters instead of an SQLite database. Additionally, visualization axis configurations have been adjusted, and several Snakemake workflows have been updated to determine optical duplicate buffer sizes dynamically and improve partition handling and cleanup operations.

Changes

  • harpy/align.py: Modified command-line options for the ema and strobe commands; changed the default for --molecule-distance from 100000 to 0; updated function signatures and documentation strings.
  • harpy/bin/inline_to_haplotag.py: Replaced direct gzip file creation with standard file opening and subprocess-based gzip compression for forward and reverse read outputs, redirecting record data to the gzip subprocesses’ stdin.
  • harpy/bin/molecule_coverage.py: Removed SQLite-based counting and auxiliary functions; introduced a new approach using collections.Counter along with new functions (new_intervals, which_overlap, and print_depth_counts) to generate intervals and print depth counts.
  • harpy/reports/align_stats.qmd: Modified the highchart configuration by moving the maximum value setting from the y-axis to the x-axis, adjusting the visual representation of inferred molecule lengths.
  • harpy/snakefiles/align_bwa.smk, harpy/snakefiles/align_ema.smk, harpy/snakefiles/align_strobealign.smk: Introduced conditional assignment of OPTICAL_BUFFER based on the SAM file header; updated workflow_summary rules to reflect the new buffer logic.
  • harpy/snakefiles/demultiplex_gen1.smk: Updated rule groupings from a static "partition" to a dynamic "partition.{part}" and converted the checkpoint demultiplex into a standard rule to support partition-based processing.
  • harpy/snakefiles/simulate_linkedreads.smk: Consolidated variable declarations (outdir, envdir, gen_hap1, gen_hap2, barcode_file, barcode_len, genodict) and added cleanup steps (shutil.rmtree('_Inline', ignore_errors=True)) in both the onsuccess and onerror sections to remove temporary directories.
  • harpy/_cli_types_generic.py: Removed the IntList class, a custom parameter type for handling lists of integers in the click library.
  • harpy/assembly.py, harpy/container.py, harpy/deconvolve.py, harpy/demultiplex.py, harpy/diagnose.py, harpy/downsample.py, harpy/impute.py, harpy/imputeparams.py, harpy/metassembly.py, harpy/phase.py, harpy/popgroup.py, harpy/preflight.py, harpy/qc.py, harpy/resume.py, harpy/simulate_linkedreads.py, harpy/simulate_variants.py, harpy/snp.py, harpy/sv.py, harpy/view.py: Removed no_args_is_help=True from @click.command decorators and updated several @click.option types to include clamp=True for better input validation.
  • resources/harpy.yaml, resources/meta.yaml: Added a new dependency on click=8 in both configuration files.

Sequence Diagram(s)

sequenceDiagram
    participant Main as Main Process
    participant File as Uncompressed File Handler
    participant Gzip as Gzip Subprocess

    Main->>Gzip: Start gzip subprocess for forward/reverse reads
    Main->>Gzip: Pipe record data via stdin
    Gzip->>File: Write compressed output data

sequenceDiagram
    participant Config as Configuration
    participant Rule as Snakemake Rule (mark_duplicates)
    participant Samtools as Samtools markdup

    Config->>Rule: Evaluate SAM file header
    alt If header starts with "A", "B", "C", or "D"
      Note over Config,Rule: OPTICAL_BUFFER set to 2500
    else
      Note over Config,Rule: OPTICAL_BUFFER set to 100
    end
    Rule->>Samtools: Execute markdup with parameter "-d {OPTICAL_BUFFER}"

Possibly related PRs

  • haplotagging barcodes as default #162: Both PRs touch command-line options and function signatures in harpy/align.py and the argument handling in harpy/bin/inline_to_haplotag.py, focusing on the command-line interface and barcode processing.
  • change cli to remove inputs as options #185: Both PRs change command-line argument parsing, in harpy/align.py here and in harpy/bin/inline_to_haplotag.py in #185, affecting how options are structured and presented to users.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6796b03 and 25532d9.

📒 Files selected for processing (1)
  • .github/workflows/tests.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (17)
  • GitHub Check: align strobe
  • GitHub Check: simulate variants
  • GitHub Check: simulate linkedreads
  • GitHub Check: qc
  • GitHub Check: preflight
  • GitHub Check: phase
  • GitHub Check: sv naibr
  • GitHub Check: mpileup
  • GitHub Check: sv leviathan
  • GitHub Check: impute
  • GitHub Check: freebayes
  • GitHub Check: align EMA
  • GitHub Check: demux gen1
  • GitHub Check: deconvolve
  • GitHub Check: Rebuild Container
  • GitHub Check: align BWA
  • GitHub Check: assembly
🔇 Additional comments (1)
.github/workflows/tests.yml (1)

237-237: Ensure Consistent CLI Option Formatting

The change from a comma-separated list to a space-separated list for the -c option is aligned with the updated CLI input validation (using click.IntRange(0,100, clamp=True)) and helps clarify the expected input format. Please verify that the corresponding CLI help text and documentation in harpy/qc.py reflect this updated format for consistency.
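
For reference, a minimal sketch of how a space-separated, clamped multi-value option behaves in click (the option name and arity here are illustrative, not the actual harpy/qc.py definition):

import rich_click as click

@click.command()
@click.option("-c", "--cutoffs", type=click.IntRange(0, 100, clamp=True), nargs=3)
def cmd(cutoffs):
    # e.g. `cmd -c 5 200 50` prints (5, 100, 50): each value is clamped to 0-100
    click.echo(cutoffs)

if __name__ == "__main__":
    cmd()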



coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
harpy/bin/molecule_coverage.py (3)

21-21: Docstring mismatch
The documentation states "Prints binary to stdout," but the output is tab-delimited text. Consider updating it for clarity and accuracy.


66-70: Consider adding start coordinates to the output
Currently, only the bin’s stop coordinate is printed. Including both start and stop coordinates of each bin could give more context to downstream analysis.


98-99: Potential undefined variables
Static analysis raises concerns that counter and geno_intervals might be undefined here if the loop is empty. For clarity, consider initializing these variables before the loop to avoid edge-case errors and silence false positives.

🧰 Tools
🪛 Ruff (0.8.2)

98-98: Undefined name counter

(F821)


98-98: Undefined name geno_intervals

(F821)
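
A minimal sketch of the suggested fix (the loop body and variable names are illustrative; the point is binding the names before the loop):

import sys
from collections import Counter

counter = Counter()   # bound up front so the final flush below is always safe
geno_intervals = []
contig = None         # hypothetical tracker for the current contig

for line in sys.stdin:
    ...  # per-alignment counting as in molecule_coverage.py

if contig is not None:
    print_depth_counts(contig, counter, geno_intervals)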

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07c227b and 6f74ec3.

📒 Files selected for processing (9)
  • harpy/align.py (14 hunks)
  • harpy/bin/inline_to_haplotag.py (2 hunks)
  • harpy/bin/molecule_coverage.py (4 hunks)
  • harpy/reports/align_stats.qmd (1 hunks)
  • harpy/snakefiles/align_bwa.smk (5 hunks)
  • harpy/snakefiles/align_ema.smk (5 hunks)
  • harpy/snakefiles/align_strobealign.smk (5 hunks)
  • harpy/snakefiles/demultiplex_gen1.smk (2 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/bin/molecule_coverage.py

98-98: Undefined name counter

(F821)


98-98: Undefined name geno_intervals

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: align strobe
  • GitHub Check: simulate linkedreads
  • GitHub Check: demux gen1
  • GitHub Check: align EMA
  • GitHub Check: Rebuild Container
  • GitHub Check: align BWA
🔇 Additional comments (28)
harpy/bin/molecule_coverage.py (6)

8-8: Good introduction of Counter
Using an in-memory Counter is simpler than relying on an external database, which should significantly improve performance and reduce complexity.


55-60: Interval generation logic is well-structured
Appending the final segment when the contig length isn't evenly divisible by the window size ensures complete coverage. Good job handling partial windows at the end.


62-65: Check for full bin coverage
Since this checks only if the read's start or end lie in a bin, a read fully spanning multiple bins without landing its start/end in them might be missed. Verify whether this partial coverage is intended or if you need to capture all bins in the [start, end] range.


90-90: Header and comment skipping
It's good practice to skip comment lines starting with '#', ensuring only valid data lines are processed.


104-104: Validate overcount or undercount risk
If an alignment spans multiple bins but doesn’t place its start or end in each bin, some bins may remain uncounted. Confirm this logic aligns with the intended coverage model.


106-107: Proper final flush of coverage
Printing the coverage for the last contig ensures no data is missed at the end of the file. Good catch!

harpy/bin/inline_to_haplotag.py (2)

8-8: Subprocess import for flexible compression
Using subprocess allows greater control over compression processes compared to the built-in gzip module.


131-133: Ensure data flushing
Writing encoded data to the subprocess’s stdin is effective but remember to close or flush the stdin streams once done to avoid data loss in buffers.
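
A minimal sketch of the close-and-wait pattern being suggested (the file name and records are hypothetical):

import subprocess

records = ["@read1\nACGT\n+\nFFFF\n"]  # hypothetical FASTQ records
with open("reads.R1.fq.gz", "wb") as out_fh:
    gzip_proc = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE, stdout=out_fh)
    for record in records:
        gzip_proc.stdin.write(record.encode())
    gzip_proc.stdin.close()  # flush buffered data so none is lost
    gzip_proc.wait()         # let gzip write its trailer before the file is used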

harpy/snakefiles/simulate_linkedreads.smk (3)

10-16: Configuration-driven initialization
Defining these paths and references upfront clarifies the workflow. Ensure the config file consistently provides the necessary keys, or add fallback logic if they're missing.


22-22: Cleanup on success
Removing the _Inline directory helps keep the workspace tidy after a successful run.


25-25: Consistent cleanup on failure
Applying the same cleanup logic in error scenarios avoids leaving behind partial artifacts that might complicate reruns.

harpy/snakefiles/demultiplex_gen1.smk (3)

98-98: LGTM! Dynamic group name improves resource management.

The change from "partition" to "partition.{part}" allows for better parallelization by creating separate resource groups for each partition.


111-111: LGTM! Consistent group naming with partition_reads rule.

The change maintains consistency with the partition_reads rule's group naming convention.


117-118: LGTM! Improved rule definition and resource management.

The changes:

  1. Convert checkpoint to regular rule, simplifying the workflow.
  2. Add dynamic group name for better resource management.
harpy/snakefiles/align_bwa.smk (3)

25-25: LGTM! Sequencer-specific optical duplicate detection.

The conditional buffer size (100 for HiSeq, 2500 for others) aligns with different optical layouts of sequencing platforms.


126-127: LGTM! Consistent parameter handling in mark_duplicates.

The changes properly integrate the optical buffer parameter with existing barcode tag handling.


318-318: LGTM! Comprehensive workflow documentation.

The optical buffer parameter is properly included in the workflow summary for transparency.

harpy/snakefiles/align_strobealign.smk (3)

28-28: LGTM! Consistent sequencer-specific configuration.

The implementation matches the BWA workflow's approach to optical duplicate detection.


117-118: LGTM! Well-structured parameter handling.

The changes maintain consistency with other alignment workflows in parameter organization.


311-311: LGTM! Complete workflow documentation.

The optical buffer configuration is properly documented in the workflow summary.

harpy/snakefiles/align_ema.smk (3)

30-30: LGTM! Unified sequencer configuration across workflows.

The implementation maintains consistency with BWA and Strobealign workflows.


204-205: LGTM! Consistent parameter structure.

The optical buffer parameter is properly integrated with existing parameters.


394-395: LGTM! Thorough workflow documentation.

The optical buffer configuration is properly included in the workflow summary.

harpy/align.py (4)

36-36: LGTM! Consistent addition of the --sequencer option.

The --sequencer option has been consistently added to all three alignment commands in the docstring dictionary.

Also applies to: 48-48, 60-60


77-78: LGTM! Well-structured command-line options.

The command-line options are well-structured with:

  • Clear default values
  • Case-insensitive choices
  • Helpful descriptions
  • Consistent implementation across all three commands

Also applies to: 172-172, 281-282, 284-284


98-101: LGTM! Clear documentation of the new features.

The documentation clearly explains:

  • The purpose of the --molecule-distance parameter for barcode deconvolution
  • The purpose of the --sequencer option for optical duplicate detection

129-129: LGTM! Consistent configuration handling.

The sequencer value is consistently:

  • Converted to lowercase before storing in config
  • Added to all three alignment command configurations

Also applies to: 238-238, 337-337

harpy/reports/align_stats.qmd (1)

314-315: LGTM! Improved axis configuration.

The axis configuration has been updated to:

  • Set the x-axis type to logarithmic for better visualization of molecule lengths
  • Set a maximum value of 100 for the y-axis percentage scale

coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
harpy/bin/molecule_coverage.py (1)

55-60: 🛠️ Refactor suggestion

Add input validation and consider memory optimization.

The function could benefit from:

  1. Input validation for negative values
  2. Memory optimization for large contigs with small window sizes

Consider this implementation:

-def new_intervals(contig_len, windowsize) -> list:
-    starts = list(range(0, contig_len + 1, windowsize))
-    ends = [i for i in starts[1:]]
-    if ends[-1] != contig_len:
-        ends.append(contig_len)
-    return [range(i,j) for i,j in zip(starts,ends)]
+def new_intervals(contig_len: int, windowsize: int) -> list:
+    if contig_len < 0 or windowsize <= 0:
+        raise ValueError("contig_len must be non-negative and windowsize must be positive")
+    # Build intervals window by window; min() handles the final partial window
+    intervals = []
+    for start in range(0, contig_len, windowsize):
+        end = min(start + windowsize, contig_len)
+        intervals.append(range(start, end))
+    return intervals
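
For reference, the original and suggested implementations should yield the same bins; e.g., under the suggested version:

>>> new_intervals(10, 4)
[range(0, 4), range(4, 8), range(8, 10)]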
🧹 Nitpick comments (3)
harpy/snakefiles/align_strobealign.smk (1)

125-130: Document the rationale behind optical buffer values.

The implementation of dynamic optical buffer sizing based on sequencer type is good. However, consider adding comments to explain:

  1. Why these specific buffer values (2500 vs 100) were chosen
  2. What sequencer types correspond to headers A-D
+        # Set optical duplicate buffer size based on sequencer type:
+        # - Headers A-D: Illumina sequencers with higher optical duplicate density (e.g., NovaSeq)
+        # - Others: Standard sequencers with lower optical duplicate density
         SAMHEADER=$(samtools head -h 0 -n 1 {input.sam} | cut -d: -f1)
         if grep -q "^[ABCD]" <<< $SAMHEADER; then
+            # Higher buffer for Illumina sequencers to account for denser optical duplicates
             OPTICAL_BUFFER=2500
         else
+            # Standard buffer for other sequencers
             OPTICAL_BUFFER=100
         fi
harpy/bin/molecule_coverage.py (2)

62-67: Optimize interval overlap detection.

The current implementation performs a linear search through all intervals. For large contigs with many intervals, this could be inefficient.

Consider using binary search for better performance:

-def which_overlap(start: int, end: int, binlist: list):
-    """return a list of which genomic intervals the molecule spans"""
-    startstop = [idx for idx, val in enumerate(binlist) if start in val or end in val]
-    if startstop:
-        startstop = range(startstop[0], startstop[-1] + 1)
-    return startstop
+def which_overlap(start: int, end: int, binlist: list):
+    """Return a list of which genomic intervals the molecule spans"""
+    if not binlist or start < 0 or end < start:
+        return []
+    
+    def binary_search(pos):
+        left, right = 0, len(binlist) - 1
+        while left <= right:
+            mid = (left + right) // 2
+            if pos in binlist[mid]:
+                return mid
+            elif pos < binlist[mid].start:
+                right = mid - 1
+            else:
+                left = mid + 1
+        return left if left < len(binlist) else len(binlist) - 1
+    
+    start_idx = binary_search(start)
+    end_idx = binary_search(end)
+    return range(start_idx, end_idx + 1)

69-72: Consider buffered output for better performance.

Writing line by line to stdout can be inefficient for large datasets.

Consider using buffered output:

-def print_depth_counts(contig, counter_obj, intervals):
-    """Print the Counter object to stdout"""
-    for idx,int_bin in enumerate(intervals):
-        sys.stdout.write(f"{contig}\t{int_bin.stop}\t{counter_obj[idx]}\n")
+def print_depth_counts(contig, counter_obj, intervals):
+    """Print the Counter object to stdout with buffering"""
+    buffer = []
+    buffer_size = 1000  # Adjust based on memory constraints
+    
+    for idx, int_bin in enumerate(intervals):
+        buffer.append(f"{contig}\t{int_bin.stop}\t{counter_obj[idx]}\n")
+        if len(buffer) >= buffer_size:
+            sys.stdout.write("".join(buffer))
+            buffer = []
+    
+    if buffer:
+        sys.stdout.write("".join(buffer))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f74ec3 and 924a4a3.

📒 Files selected for processing (5)
  • harpy/align.py (5 hunks)
  • harpy/bin/molecule_coverage.py (4 hunks)
  • harpy/snakefiles/align_bwa.smk (2 hunks)
  • harpy/snakefiles/align_ema.smk (2 hunks)
  • harpy/snakefiles/align_strobealign.smk (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • harpy/snakefiles/align_bwa.smk
  • harpy/snakefiles/align_ema.smk
  • harpy/align.py
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/bin/molecule_coverage.py

101-101: Undefined name counter

(F821)


101-101: Undefined name geno_intervals

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: simulate linkedreads
  • GitHub Check: align EMA
  • GitHub Check: demux gen1
  • GitHub Check: Rebuild Container
  • GitHub Check: align BWA
🔇 Additional comments (2)
harpy/snakefiles/align_strobealign.smk (2)

134-134: LGTM! Good use of dynamic buffer size.

The modification to use the dynamic OPTICAL_BUFFER value in the markdup command is well implemented.


334-334: LGTM! Good addition of optical buffer parameter.

The inclusion of the optical buffer parameter in the workflow summary is helpful for documentation.

coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (8)
harpy/snakefiles/align_strobealign.smk (1)

332-332: Improve clarity of the workflow summary.

The current summary text could be clearer about the conditional nature of the optical buffer size.

-        duplicates += f"\tsamtools markdup -S {params.bx_mode} -d 100 (2500 for novaseq)"
+        duplicates += f"\tsamtools markdup -S {params.bx_mode} -d <optical buffer: 2500 for NovaSeq (A-D headers), 100 for other sequencers>"
resources/harpy.yaml (1)

8-8: Consider using a more flexible version constraint for click.

While pinning to exact versions can prevent unexpected behavior, it might be too restrictive for maintenance. Consider using >=8,<9 to allow patch updates while maintaining compatibility.

-  - click =8
+  - click >=8,<9
resources/meta.yaml (1)

38-38: Consider using a more flexible version constraint for click.

Similar to harpy.yaml, consider using a more flexible version constraint to allow patch updates.

-    - click =8
+    - click >=8,<9
harpy/imputeparams.py (1)

8-8: Consider keeping the help behavior for no arguments.

Since this command requires the -o/--output option, it would be user-friendly to show the help message when no arguments are provided.

-@click.command(context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/impute/#parameter-file")
+@click.command(no_args_is_help=True, context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/impute/#parameter-file")
harpy/simulate_variants.py (1)

114-114: Consider improving help discoverability for complex commands.

While removing no_args_is_help=True is consistent with other changes and well-supported by the command structure, these simulation commands are complex with many options. Consider adding a note in the docstring suggesting --help for new users.

Example docstring addition for each command:

 def snpindel(genome, ...):
     """
     Introduce snps and/or indels into a genome
+
+    Run with --help to see all available options.
 
     ### Haploid

Also applies to: 242-242, 342-342, 454-454

harpy/demultiplex.py (1)

42-42: Consider keeping the help behavior for no arguments.

Removing no_args_is_help=True changes the CLI behavior when no arguments are provided. Consider keeping this parameter to maintain a user-friendly experience where help is automatically shown when needed.

-@click.command(context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/demultiplex/")
+@click.command(no_args_is_help=True, context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/demultiplex/")
harpy/metassembly.py (1)

30-30: Consider the impact of removing no_args_is_help.

The removal of no_args_is_help=True from the command decorator might affect user experience when running the command without arguments. Consider keeping this parameter to maintain a user-friendly behavior where help is displayed when no arguments are provided.

LGTM: Thread clamping and memory validation improvements.

The addition of clamp=True to the threads option and the simplified memory validation are good improvements that:

  1. Prevent invalid thread counts by clamping values to the valid range.
  2. Simplify memory validation while maintaining the minimum requirement.

Also applies to: 34-34, 39-39

harpy/deconvolve.py (1)

32-34: Verify the impact of removing max_open from density and dropout options.

While the changes align with the standardization effort:

  1. The removal of max_open=True from density and dropout options might need validation to ensure there are no practical upper limits that should be enforced.
  2. The thread clamping addition is a good safety improvement.

Could you verify if there are any practical upper limits for density and dropout values that should be enforced?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 559186b and 8177436.

📒 Files selected for processing (26)
  • harpy/_cli_types_generic.py (0 hunks)
  • harpy/align.py (6 hunks)
  • harpy/assembly.py (1 hunks)
  • harpy/container.py (1 hunks)
  • harpy/deconvolve.py (1 hunks)
  • harpy/demultiplex.py (1 hunks)
  • harpy/diagnose.py (1 hunks)
  • harpy/downsample.py (1 hunks)
  • harpy/impute.py (1 hunks)
  • harpy/imputeparams.py (1 hunks)
  • harpy/metassembly.py (1 hunks)
  • harpy/phase.py (1 hunks)
  • harpy/popgroup.py (1 hunks)
  • harpy/preflight.py (2 hunks)
  • harpy/qc.py (3 hunks)
  • harpy/resume.py (1 hunks)
  • harpy/simulate_linkedreads.py (2 hunks)
  • harpy/simulate_variants.py (4 hunks)
  • harpy/snakefiles/align_bwa.smk (3 hunks)
  • harpy/snakefiles/align_ema.smk (3 hunks)
  • harpy/snakefiles/align_strobealign.smk (3 hunks)
  • harpy/snp.py (2 hunks)
  • harpy/sv.py (3 hunks)
  • harpy/view.py (1 hunks)
  • resources/harpy.yaml (1 hunks)
  • resources/meta.yaml (1 hunks)
💤 Files with no reviewable changes (1)
  • harpy/_cli_types_generic.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • harpy/snakefiles/align_ema.smk
  • harpy/snakefiles/align_bwa.smk
  • harpy/align.py
🔇 Additional comments (30)
harpy/snakefiles/align_strobealign.smk (3)

125-129: LGTM! Dynamic optical buffer determination based on sequencer type.

The implementation correctly determines the optical buffer size based on the first character of the SAM header:

  • 2500 for sequencers A-D (NovaSeq)
  • 100 for other sequencers

133-133: LGTM! Optical buffer parameter correctly applied.

The dynamic OPTICAL_BUFFER value is properly used in the samtools markdup command.


125-129: Verify the design choice: Automatic detection vs. CLI option.

The PR objectives mention adding a CLI option for sequencer type, but the implementation automatically detects the sequencer type from SAM headers. While this approach is more user-friendly as it requires no manual intervention, please verify if this meets the intended requirements.

harpy/container.py (1)

11-11: LGTM!

Good cleanup. Removing no_args_is_help=False simplifies the code as it was redundant with the default behavior.

harpy/view.py (1)

41-41: LGTM, but verify CLI behavior.

The removal of no_args_is_help=True is acceptable here since the command has a required argument and comprehensive error handling. However, we should verify that the new behavior provides a good user experience.

Please test the following scenarios to ensure good user experience:

  1. Running the command without arguments
  2. Running with --help
  3. Running with invalid arguments
harpy/popgroup.py (1)

12-12: LGTM!

The removal of no_args_is_help=True is well-supported by the command's structure:

  • Required argument ensures validation
  • Comprehensive error messages
  • Detailed documentation with epilog
harpy/diagnose.py (1)

10-10: LGTM!

The removal of no_args_is_help=True is appropriate as the command:

  • Has a required directory argument
  • Provides clear error messages
  • Has a well-defined purpose in the docstring
harpy/demultiplex.py (1)

46-46: LGTM! Thread range validation improved.

Adding clamp=True to the thread range validation is a good improvement. It prevents errors by clamping values to the valid range (2-999) instead of raising exceptions.

harpy/resume.py (1)

15-15: Similar changes as in demultiplex.py

The changes to the command decorator and thread range validation are consistent with those in demultiplex.py.

Also applies to: 17-17

harpy/downsample.py (1)

34-34:

❓ Verification inconclusive

Verify the minimum thread count requirement.

The thread range here is 1-999 while other files use 2-999. Please verify if this difference is intentional and if the module can operate with a single thread.


🏁 Script executed:

#!/bin/bash
# Search for any thread-specific code in the downsample workflow
rg -A 5 "thread|core|cpu" workflow/downsample.smk

Length of output: 115


Action required: Verify the minimum thread count for downsampling

The thread range in harpy/downsample.py is configured with click.IntRange(1,999, clamp=True), while other modules use a range of 2-999. The automated search for thread-specific code in the expected workflow file (workflow/downsample.smk) did not return any data because that file could not be found. Please manually verify whether the module is designed to operate correctly with a single thread. If single-thread operation is intended and supported, confirm this decision; otherwise, consider revising the range for consistency.

harpy/impute.py (1)

36-36:

❓ Verification inconclusive

Verify the minimum thread count requirement.

The thread range here is 4-999 while most other files use 2-999. Please verify if this higher minimum thread count is required for the STITCH imputation process.


🏁 Script executed:

#!/bin/bash
# Search for any thread-specific code in the impute workflow and STITCH parameters
rg -A 5 "thread|core|cpu" workflow/impute.smk
ast-grep --pattern 'nCores = $_'

Length of output: 142


Action Required: Verify Minimum Thread Count for STITCH Imputation Process

The current configuration in harpy/impute.py sets the thread range to 4–999, while other modules in the codebase use 2–999. Our initial automated searches (targeting workflow/impute.smk and a generic thread keyword search) did not yield relevant evidence—either because the searched file isn’t present or no pertinent thread-related code was found. Please manually verify that the decision to require a minimum of 4 threads is indeed necessary for the STITCH imputation process and not an oversight.

  • Location: harpy/impute.py (Line 36)
  • Points to Confirm:
    • Is the elevated minimum thread count (4 vs. 2) specifically required by the STITCH imputation process?
    • Should other imputation modules share the same lower bound, or is a higher thread count justified here?
harpy/simulate_linkedreads.py (2)

29-29: LGTM! CLI behavior change.

The removal of no_args_is_help=True aligns with the standardization of CLI behavior across the codebase.


39-39: LGTM! Enhanced thread count validation.

The addition of clamp=True ensures thread count values are automatically adjusted to fit within the 1-999 range instead of raising an error.

harpy/preflight.py (2)

41-42: LGTM! Consistent CLI behavior and thread validation.

The changes align with the standardization across the codebase:

  • Removal of no_args_is_help=True for consistent CLI behavior
  • Addition of clamp=True for thread count validation

102-104: LGTM! Consistent changes in fastq command.

The same standardization changes have been applied to the fastq command, maintaining consistency.

harpy/phase.py (2)

31-31: LGTM! CLI behavior standardization.

The removal of no_args_is_help=True aligns with the standardization of CLI behavior.


37-38: LGTM! Enhanced parameter validation.

The addition of clamp=True improves robustness:

  • Prune threshold is automatically adjusted to fit within 0-100 range
  • Thread count is automatically adjusted to fit within 2-999 range
harpy/qc.py (3)

31-32: LGTM! Enhanced parameter handling.

The changes improve robustness and usability:

  • Standardized CLI behavior by removing no_args_is_help=True
  • Enhanced deconvolve parameter validation with clamp=True

38-38: LGTM! Consistent thread validation.

The addition of clamp=True ensures thread count values are automatically adjusted to fit within the 4-999 range.


64-65: LGTM! Improved documentation.

The documentation has been updated to reflect the parameter format change from comma-separated to space-separated values.

harpy/metassembly.py (1)

30-30: Command-line interface improvements for better input validation.

The changes enhance input validation and error handling:

  1. Memory limit validation is now more strict without max_open=True
  2. Thread count is now clamped to stay within 1-999 range instead of throwing an error

Let's verify the thread clamping behavior works as expected:

#!/usr/bin/env python3
import rich_click as click

# Test thread clamping
def test_thread_clamping():
    # Create a test command with the same thread option
    @click.command()
    @click.option('-t', '--threads', type=click.IntRange(1,999, clamp=True))
    def cmd(threads):
        print(f"Threads: {threads}")
    
    # Test values
    test_values = [-1, 0, 1, 500, 999, 1000, 1500]
    print("Testing thread clamping with values:", test_values)
    for val in test_values:
        try:
            ctx = click.Context(cmd)
            param = next(p for p in cmd.params if p.name == "threads")
            result = param.type_cast_value(ctx, str(val))
            print(f"Input: {val} -> Output: {result}")
        except Exception as e:
            print(f"Input: {val} -> Error: {str(e)}")

if __name__ == "__main__":
    test_thread_clamping()

Also applies to: 34-34, 39-39

harpy/deconvolve.py (1)

29-29: Consistent command-line interface improvements across files.

The changes align with the improvements in metassembly.py:

  1. Density and dropout options have stricter validation
  2. Thread count is clamped to stay within 1-999 range

Also applies to: 32-34

harpy/snp.py (3)

54-54: Appropriate ploidy validation for different variant callers.

The changes correctly implement different ploidy validation rules:

  1. mpileup: Ploidy is restricted to 1-2 as it's not recommended for higher ploidies
  2. freebayes: Ploidy only has a minimum of 1 as it's recommended for higher ploidies (as mentioned in the docstring)
  3. Both commands have thread count clamping (4-999)

Also applies to: 58-58, 61-61, 156-156, 160-160, 163-163


58-58: LGTM: Appropriate ploidy ranges for different variant callers.

The different ploidy range constraints between mpileup (1-2) and freebayes (min=1) align well with their capabilities:

  1. mpileup is optimized for diploid organisms (ploidy ≤ 2).
  2. freebayes supports higher ploidy levels, making it suitable for polyploid organisms.

Also applies to: 160-160


61-61: LGTM: Consistent thread validation.

The addition of clamp=True to thread options in both commands maintains consistency with the codebase-wide standardization of thread validation.

Also applies to: 163-163

harpy/sv.py (4)

8-8: LGTM!

The import of convert_to_int is correctly placed and used in the quiet option's callback parameter.


68-68: LGTM! Thread option improvements.

Adding clamp=True to thread options is a good improvement as it ensures values remain within the specified range (4-999) by clamping them instead of raising an error.

Also applies to: 166-166


64-64: LGTM! Improved sharing-thresholds option type.

The change from IntList(3) to click.IntRange(0,100, clamp=True) with nargs=3 is a good improvement as it:

  1. Explicitly defines the valid range (0-100)
  2. Clamps values to the range instead of raising errors
  3. Maintains the requirement for exactly 3 values

61-63:

❓ Verification inconclusive

Verify the impact of removing maximum limits.

The removal of maximum limits from various IntRange options might allow unexpectedly large values. Let's verify if this could cause any issues:

Also applies to: 160-163


🏁 Script executed:

#!/bin/bash
# Description: Search for any hardcoded limits or array allocations that might be affected by these parameters
# Expected: No hardcoded limits that could cause issues with large values

# Search for array allocations or loops using these parameters
ast-grep --pattern 'for $_ in range($iterations)' # Check iterations usage
ast-grep --pattern 'array($duplicates)' # Check duplicates usage
ast-grep --pattern 'array($min_size)' # Check min_size usage
ast-grep --pattern 'array($min_barcodes)' # Check min_barcodes usage

Length of output: 177


Action Required: Manually verify maximum limits removal in IntRange options

While our automated AST searches didn't reveal any hardcoded limits or array allocations directly referencing the parameters (iterations, duplicates, min_size, min_barcodes), the absence of output means we could not conclusively confirm that no downstream usage relies on maximum boundaries. Please manually verify that:

  • No loops, allocations, or dependent calculations assume an upper bound for these parameters.
  • The similar changes applied at lines 160–163 are consistent and do not risk resource exhaustion or unexpected behavior when large values are provided.
harpy/assembly.py (1)

46-46: LGTM: Consistent validation improvements across options.

The addition of clamp=True to multiple options (min-quality, seq-identity, and threads) provides consistent and robust input validation:

  1. Values outside the range will be automatically adjusted to the nearest valid value.
  2. The ranges are sensible for each option (0-40 for quality, 0-100 for identity, 1-999 for threads).

Also applies to: 50-50, 54-54

pdimens merged commit 2579cd5 into main on Feb 24, 2025
20 checks passed
pdimens deleted the markdups_improvement branch on February 24, 2025 at 19:14