markdups and molcov #210
Conversation
📝 Walkthrough

This pull request introduces multiple enhancements across the codebase. It modifies command-line options for the alignment commands, including changed default values.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Main as Main Process
    participant File as Uncompressed File Handler
    participant Gzip as Gzip Subprocess
    Main->>Gzip: Start gzip subprocess for forward/reverse reads
    Main->>Gzip: Pipe record data via stdin
    Gzip->>File: Write compressed output data
```
```mermaid
sequenceDiagram
    participant Config as Configuration
    participant Rule as Snakemake Rule (mark_duplicates)
    participant Samtools as Samtools markdup
    Config->>Rule: Evaluate SAM file header
    alt If header starts with "A", "B", "C", or "D"
        Note over Config,Rule: OPTICAL_BUFFER set to 2500
    else
        Note over Config,Rule: OPTICAL_BUFFER set to 100
    end
    Rule->>Samtools: Execute markdup with parameter "-d {OPTICAL_BUFFER}"
```
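The sequencer check described above can be sketched as a small Python helper. This is illustrative only; the actual Snakemake rule implements the check in bash with `samtools head` and `grep`:

```python
def optical_buffer(instrument_id: str) -> int:
    """Pick the samtools markdup -d search distance from the instrument
    ID found in the SAM header's first read name. IDs beginning with
    A, B, C, or D get the larger buffer; everything else gets 100."""
    if instrument_id and instrument_id[0] in "ABCD":
        return 2500
    return 100
```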
Actionable comments posted: 2
🧹 Nitpick comments (3)
harpy/bin/molecule_coverage.py (3)

21-21: Docstring mismatch
The documentation states "Prints binary to stdout," but the output is tab-delimited text. Consider updating it for clarity and accuracy.
66-70: Consider adding start coordinates to the output
Currently, only the bin's stop coordinate is printed. Including both start and stop coordinates of each bin could give more context to downstream analysis.
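A minimal sketch of what this suggestion could look like. The function name mirrors the printer in molecule_coverage.py, but this variant (emitting `start` as well as `stop`, and returning the lines for testability) is hypothetical:

```python
import sys
from collections import Counter

def print_depth_counts(contig, counter_obj, intervals):
    """Hypothetical variant of the printer in molecule_coverage.py that
    emits the bin start as well as the stop (tab-delimited).
    `intervals` is assumed to be a list of range objects."""
    lines = [f"{contig}\t{b.start}\t{b.stop}\t{counter_obj[idx]}"
             for idx, b in enumerate(intervals)]
    sys.stdout.write("\n".join(lines) + "\n")
    return lines
```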
98-99: Potential undefined variables
Static analysis raises concerns that `counter` and `geno_intervals` might be undefined here if the loop is empty. For clarity, consider initializing these variables before the loop to avoid edge-case errors and silence false positives.

🧰 Tools
🪛 Ruff (0.8.2)
98-98: Undefined name `counter` (F821)
98-98: Undefined name `geno_intervals` (F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- harpy/align.py (14 hunks)
- harpy/bin/inline_to_haplotag.py (2 hunks)
- harpy/bin/molecule_coverage.py (4 hunks)
- harpy/reports/align_stats.qmd (1 hunks)
- harpy/snakefiles/align_bwa.smk (5 hunks)
- harpy/snakefiles/align_ema.smk (5 hunks)
- harpy/snakefiles/align_strobealign.smk (5 hunks)
- harpy/snakefiles/demultiplex_gen1.smk (2 hunks)
- harpy/snakefiles/simulate_linkedreads.smk (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/bin/molecule_coverage.py
98-98: Undefined name `counter` (F821)
98-98: Undefined name `geno_intervals` (F821)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: align strobe
- GitHub Check: simulate linkedreads
- GitHub Check: demux gen1
- GitHub Check: align EMA
- GitHub Check: Rebuild Container
- GitHub Check: align BWA
🔇 Additional comments (28)
harpy/bin/molecule_coverage.py (6)

8-8: Good introduction of Counter
Using an in-memory Counter is simpler than relying on an external database, which should significantly improve performance and reduce complexity.
55-60: Interval generation logic is well-structured
Appending the final segment when the contig length isn't evenly divisible by the window size ensures complete coverage. Good job handling partial windows at the end.
62-65: Check for full bin coverage
Since this checks only if the read's start or end lie in a bin, a read fully spanning multiple bins without landing its start/end in them might be missed. Verify whether this partial coverage is intended or if you need to capture all bins in the [start, end] range.
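One hypothetical way to capture every bin in the [start, end] range is to derive bin indices arithmetically rather than testing endpoint membership. The function name and signature here are illustrative, not part of the script under review:

```python
def bins_spanned(start: int, end: int, windowsize: int, n_bins: int) -> range:
    """Return every bin index the interval [start, end] touches, so bins
    fully contained inside the read are counted even when neither endpoint
    lands in them. Assumes uniform bins of `windowsize` (the last bin may
    be shorter, hence the clamp to n_bins - 1)."""
    first = min(start // windowsize, n_bins - 1)
    last = min(end // windowsize, n_bins - 1)
    return range(first, last + 1)
```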
90-90: Header and comment skipping
It's good practice to skip comment lines starting with '#', ensuring only valid data lines are processed.

104-104: Validate overcount or undercount risk
If an alignment spans multiple bins but doesn't place its start or end in each bin, some bins may remain uncounted. Confirm this logic aligns with the intended coverage model.

106-107: Proper final flush of coverage
Printing the coverage for the last contig ensures no data is missed at the end of the file. Good catch!

harpy/bin/inline_to_haplotag.py (2)
8-8: Subprocess import for flexible compression
Using `subprocess` allows greater control over compression processes compared to the built-in gzip module.
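A minimal sketch of this subprocess-based compression pattern. The function name and file handling are illustrative (not the script's actual API), and it assumes a `gzip` executable on PATH:

```python
import subprocess

def write_gzipped(path, records):
    """Pipe byte records through a gzip subprocess into `path`.
    Closing stdin signals EOF so gzip finalizes the stream, and
    wait() ensures all buffered data is flushed before returning."""
    with open(path, "wb") as out:
        proc = subprocess.Popen(["gzip", "-c"], stdin=subprocess.PIPE, stdout=out)
        for rec in records:
            proc.stdin.write(rec)
        proc.stdin.close()  # flush buffers; gzip sees EOF and writes the trailer
        proc.wait()
```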
131-133: Ensure data flushing
Writing encoded data to the subprocess's `stdin` is effective, but remember to close or flush the stdin streams once done to avoid data loss in buffers.

harpy/snakefiles/simulate_linkedreads.smk (3)
10-16: Configuration-driven initialization
Defining these paths and references upfront clarifies the workflow. Ensure the config file consistently provides the necessary keys, or add fallback logic if they're missing.

22-22: Cleanup on success
Removing the `_Inline` directory helps keep the workspace tidy after a successful run.

25-25: Consistent cleanup on failure
Applying the same cleanup logic in error scenarios avoids leaving behind partial artifacts that might complicate reruns.

harpy/snakefiles/demultiplex_gen1.smk (3)
98-98: LGTM! Dynamic group name improves resource management.
The change from `"partition"` to `"partition.{part}"` allows for better parallelization by creating separate resource groups for each partition.

111-111: LGTM! Consistent group naming with partition_reads rule.
The change maintains consistency with the `partition_reads` rule's group naming convention.

117-118: LGTM! Improved rule definition and resource management.
The changes:
- Convert checkpoint to regular rule, simplifying the workflow.
- Add dynamic group name for better resource management.
harpy/snakefiles/align_bwa.smk (3)
25-25: LGTM! Sequencer-specific optical duplicate detection.
The conditional buffer size (100 for HiSeq, 2500 for others) aligns with different optical layouts of sequencing platforms.

126-127: LGTM! Consistent parameter handling in mark_duplicates.
The changes properly integrate the optical buffer parameter with existing barcode tag handling.

318-318: LGTM! Comprehensive workflow documentation.
The optical buffer parameter is properly included in the workflow summary for transparency.
harpy/snakefiles/align_strobealign.smk (3)

28-28: LGTM! Consistent sequencer-specific configuration.
The implementation matches the BWA workflow's approach to optical duplicate detection.

117-118: LGTM! Well-structured parameter handling.
The changes maintain consistency with other alignment workflows in parameter organization.

311-311: LGTM! Complete workflow documentation.
The optical buffer configuration is properly documented in the workflow summary.

harpy/snakefiles/align_ema.smk (3)

30-30: LGTM! Unified sequencer configuration across workflows.
The implementation maintains consistency with BWA and Strobealign workflows.

204-205: LGTM! Consistent parameter structure.
The optical buffer parameter is properly integrated with existing parameters.

394-395: LGTM! Thorough workflow documentation.
The optical buffer configuration is properly included in the workflow summary.
harpy/align.py (4)

36-36: LGTM! Consistent addition of the `--sequencer` option.
The `--sequencer` option has been consistently added to all three alignment commands in the docstring dictionary.
Also applies to: 48-48, 60-60

77-78: LGTM! Well-structured command-line options.
The command-line options are well-structured with:
- Clear default values
- Case-insensitive choices
- Helpful descriptions
- Consistent implementation across all three commands
Also applies to: 172-172, 281-282, 284-284

98-101: LGTM! Clear documentation of the new features.
The documentation clearly explains:
- The purpose of the `--molecule-distance` parameter for barcode deconvolution
- The purpose of the `--sequencer` option for optical duplicate detection

129-129: LGTM! Consistent configuration handling.
The sequencer value is consistently:
- Converted to lowercase before storing in config
- Added to all three alignment command configurations
Also applies to: 238-238, 337-337

harpy/reports/align_stats.qmd (1)

314-315: LGTM! Improved axis configuration.
The axis configuration has been updated to:
- Set the x-axis type to logarithmic for better visualization of molecule lengths
- Set a maximum value of 100 for the y-axis percentage scale
Actionable comments posted: 2
♻️ Duplicate comments (1)
harpy/bin/molecule_coverage.py (1)

55-60: 🛠️ Refactor suggestion: Add input validation and consider memory optimization.
The function could benefit from:
- Input validation for negative values
- Memory optimization for large contigs with small window sizes

Consider this implementation:

```diff
-def new_intervals(contig_len, windowsize) -> list:
-    starts = list(range(0, contig_len + 1, windowsize))
-    ends = [i for i in starts[1:]]
-    if ends[-1] != contig_len:
-        ends.append(contig_len)
-    return [range(i,j) for i,j in zip(starts,ends)]
+def new_intervals(contig_len: int, windowsize: int) -> list:
+    if contig_len < 0 or windowsize <= 0:
+        raise ValueError("contig_len must be non-negative and windowsize must be positive")
+    # Yield intervals instead of storing all in memory
+    intervals = []
+    for start in range(0, contig_len, windowsize):
+        end = min(start + windowsize, contig_len)
+        intervals.append(range(start, end))
+    return intervals
```
🧹 Nitpick comments (3)
harpy/snakefiles/align_strobealign.smk (1)

125-130: Document the rationale behind optical buffer values.
The implementation of dynamic optical buffer sizing based on sequencer type is good. However, consider adding comments to explain:
- Why these specific buffer values (2500 vs 100) were chosen
- What sequencer types correspond to headers A-D

```diff
+    # Set optical duplicate buffer size based on sequencer type:
+    # - Headers A-D: Illumina sequencers with higher optical duplicate density (e.g., NovaSeq)
+    # - Others: Standard sequencers with lower optical duplicate density
     SAMHEADER=$(samtools head -h 0 -n 1 {input.sam} | cut -d: -f1)
     if grep -q "^[ABCD]" <<< $SAMHEADER; then
+        # Higher buffer for Illumina sequencers to account for denser optical duplicates
         OPTICAL_BUFFER=2500
     else
+        # Standard buffer for other sequencers
         OPTICAL_BUFFER=100
     fi
```

harpy/bin/molecule_coverage.py (2)

62-67: Optimize interval overlap detection.
The current implementation performs a linear search through all intervals. For large contigs with many intervals, this could be inefficient.
Consider using binary search for better performance:

```diff
-def which_overlap(start: int, end: int, binlist: list):
-    """return a list of which genomic intervals the molecule spans"""
-    startstop = [idx for idx, val in enumerate(binlist) if start in val or end in val]
-    if startstop:
-        startstop = range(startstop[0], startstop[-1] + 1)
-    return startstop
+def which_overlap(start: int, end: int, binlist: list):
+    """Return a list of which genomic intervals the molecule spans"""
+    if not binlist or start < 0 or end < start:
+        return []
+
+    def binary_search(pos):
+        left, right = 0, len(binlist) - 1
+        while left <= right:
+            mid = (left + right) // 2
+            if pos in binlist[mid]:
+                return mid
+            elif pos < binlist[mid].start:
+                right = mid - 1
+            else:
+                left = mid + 1
+        return left if left < len(binlist) else len(binlist) - 1
+
+    start_idx = binary_search(start)
+    end_idx = binary_search(end)
+    return range(start_idx, end_idx + 1)
```

69-72: Consider buffered output for better performance.
Writing line by line to stdout can be inefficient for large datasets.
Consider using buffered output:

```diff
-def print_depth_counts(contig, counter_obj, intervals):
-    """Print the Counter object to stdout"""
-    for idx,int_bin in enumerate(intervals):
-        sys.stdout.write(f"{contig}\t{int_bin.stop}\t{counter_obj[idx]}\n")
+def print_depth_counts(contig, counter_obj, intervals):
+    """Print the Counter object to stdout with buffering"""
+    buffer = []
+    buffer_size = 1000  # Adjust based on memory constraints
+
+    for idx, int_bin in enumerate(intervals):
+        buffer.append(f"{contig}\t{int_bin.stop}\t{counter_obj[idx]}\n")
+        if len(buffer) >= buffer_size:
+            sys.stdout.write("".join(buffer))
+            buffer = []
+
+    if buffer:
+        sys.stdout.write("".join(buffer))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- harpy/align.py (5 hunks)
- harpy/bin/molecule_coverage.py (4 hunks)
- harpy/snakefiles/align_bwa.smk (2 hunks)
- harpy/snakefiles/align_ema.smk (2 hunks)
- harpy/snakefiles/align_strobealign.smk (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- harpy/snakefiles/align_bwa.smk
- harpy/snakefiles/align_ema.smk
- harpy/align.py
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/bin/molecule_coverage.py
101-101: Undefined name `counter` (F821)
101-101: Undefined name `geno_intervals` (F821)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: simulate linkedreads
- GitHub Check: align EMA
- GitHub Check: demux gen1
- GitHub Check: Rebuild Container
- GitHub Check: align BWA
🔇 Additional comments (2)
harpy/snakefiles/align_strobealign.smk (2)
134-134: LGTM! Good use of dynamic buffer size.
The modification to use the dynamic `OPTICAL_BUFFER` value in the markdup command is well implemented.

334-334: LGTM! Good addition of optical buffer parameter.
The inclusion of the optical buffer parameter in the workflow summary is helpful for documentation.
Actionable comments posted: 1
🧹 Nitpick comments (8)
harpy/snakefiles/align_strobealign.smk (1)

332-332: Improve clarity of the workflow summary.
The current summary text could be clearer about the conditional nature of the optical buffer size.

```diff
-        duplicates += f"\tsamtools markdup -S {params.bx_mode} -d 100 (2500 for novaseq)"
+        duplicates += f"\tsamtools markdup -S {params.bx_mode} -d <optical buffer: 2500 for NovaSeq (A-D headers), 100 for other sequencers>"
```

resources/harpy.yaml (1)
8-8: Consider using a more flexible version constraint for click.
While pinning to exact versions can prevent unexpected behavior, it might be too restrictive for maintenance. Consider using `>=8,<9` to allow patch updates while maintaining compatibility.

```diff
-  - click =8
+  - click >=8,<9
```

resources/meta.yaml (1)

38-38: Consider using a more flexible version constraint for click.
Similar to harpy.yaml, consider using a more flexible version constraint to allow patch updates.

```diff
-  - click =8
+  - click >=8,<9
```

harpy/imputeparams.py (1)

8-8: Consider keeping the help behavior for no arguments.
Since this command requires the `-o/--output` option, it would be user-friendly to show the help message when no arguments are provided.

```diff
-@click.command(context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/impute/#parameter-file")
+@click.command(no_args_is_help=True, context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/impute/#parameter-file")
```

harpy/simulate_variants.py (1)
114-114: Consider improving help discoverability for complex commands.
While removing `no_args_is_help=True` is consistent with other changes and well-supported by the command structure, these simulation commands are complex with many options. Consider adding a note in the docstring suggesting `--help` for new users.
Example docstring addition for each command:

```diff
 def snpindel(genome, ...):
     """
     Introduce snps and/or indels into a genome
+
+    Run with --help to see all available options.
     ### Haploid
```

Also applies to: 242-242, 342-342, 454-454

harpy/demultiplex.py (1)

42-42: Consider keeping the help behavior for no arguments.
Removing `no_args_is_help=True` changes the CLI behavior when no arguments are provided. Consider keeping this parameter to maintain a user-friendly experience where help is automatically shown when needed.

```diff
-@click.command(context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/demultiplex/")
+@click.command(no_args_is_help=True, context_settings=dict(allow_interspersed_args=False), epilog = "Documentation: https://pdimens.github.io/harpy/workflows/demultiplex/")
```

harpy/metassembly.py (1)
30-30: Consider the impact of removing `no_args_is_help`.
The removal of `no_args_is_help=True` from the command decorator might affect user experience when running the command without arguments. Consider keeping this parameter to maintain a user-friendly behavior where help is displayed when no arguments are provided.

LGTM: Thread clamping and memory validation improvements.
The addition of `clamp=True` to the threads option and the simplified memory validation are good improvements that:
- Prevent invalid thread counts by clamping values to the valid range.
- Simplify memory validation while maintaining the minimum requirement.
Also applies to: 34-34, 39-39
harpy/deconvolve.py (1)

32-34: Verify the impact of removing `max_open` from density and dropout options.
While the changes align with the standardization effort:
- The removal of `max_open=True` from density and dropout options might need validation to ensure there are no practical upper limits that should be enforced.
- The thread clamping addition is a good safety improvement.

Could you verify if there are any practical upper limits for density and dropout values that should be enforced?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (26)
- harpy/_cli_types_generic.py (0 hunks)
- harpy/align.py (6 hunks)
- harpy/assembly.py (1 hunks)
- harpy/container.py (1 hunks)
- harpy/deconvolve.py (1 hunks)
- harpy/demultiplex.py (1 hunks)
- harpy/diagnose.py (1 hunks)
- harpy/downsample.py (1 hunks)
- harpy/impute.py (1 hunks)
- harpy/imputeparams.py (1 hunks)
- harpy/metassembly.py (1 hunks)
- harpy/phase.py (1 hunks)
- harpy/popgroup.py (1 hunks)
- harpy/preflight.py (2 hunks)
- harpy/qc.py (3 hunks)
- harpy/resume.py (1 hunks)
- harpy/simulate_linkedreads.py (2 hunks)
- harpy/simulate_variants.py (4 hunks)
- harpy/snakefiles/align_bwa.smk (3 hunks)
- harpy/snakefiles/align_ema.smk (3 hunks)
- harpy/snakefiles/align_strobealign.smk (3 hunks)
- harpy/snp.py (2 hunks)
- harpy/sv.py (3 hunks)
- harpy/view.py (1 hunks)
- resources/harpy.yaml (1 hunks)
- resources/meta.yaml (1 hunks)
💤 Files with no reviewable changes (1)
- harpy/_cli_types_generic.py
🚧 Files skipped from review as they are similar to previous changes (3)
- harpy/snakefiles/align_ema.smk
- harpy/snakefiles/align_bwa.smk
- harpy/align.py
🔇 Additional comments (30)
harpy/snakefiles/align_strobealign.smk (3)

125-129: LGTM! Dynamic optical buffer determination based on sequencer type.
The implementation correctly determines the optical buffer size based on the first character of the SAM header:
- 2500 for sequencers A-D (NovaSeq)
- 100 for other sequencers

133-133: LGTM! Optical buffer parameter correctly applied.
The dynamic `OPTICAL_BUFFER` value is properly used in the `samtools markdup` command.

125-129: Verify the design choice: Automatic detection vs. CLI option.
The PR objectives mention adding a CLI option for sequencer type, but the implementation automatically detects the sequencer type from SAM headers. While this approach is more user-friendly as it requires no manual intervention, please verify if this meets the intended requirements.
harpy/container.py (1)

11-11: LGTM!
Good cleanup. Removing `no_args_is_help=False` simplifies the code as it was redundant with the default behavior.

harpy/view.py (1)

41-41: LGTM, but verify CLI behavior.
The removal of `no_args_is_help=True` is acceptable here since the command has a required argument and comprehensive error handling. However, we should verify that the new behavior provides a good user experience.
Please test the following scenarios to ensure good user experience:
- Running the command without arguments
- Running with `--help`
- Running with invalid arguments

harpy/popgroup.py (1)

12-12: LGTM!
The removal of `no_args_is_help=True` is well-supported by the command's structure:
- Required argument ensures validation
- Comprehensive error messages
- Detailed documentation with epilog

harpy/diagnose.py (1)

10-10: LGTM!
The removal of `no_args_is_help=True` is appropriate as the command:
- Has a required directory argument
- Provides clear error messages
- Has a well-defined purpose in the docstring
harpy/demultiplex.py (1)

46-46: LGTM! Thread range validation improved.
Adding `clamp=True` to the thread range validation is a good improvement. It prevents errors by clamping values to the valid range (2-999) instead of raising exceptions.
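The clamping semantics can be illustrated without click itself. This is a plain-Python sketch of what `click.IntRange(2, 999, clamp=True)` does to an out-of-range value:

```python
def clamp_int(value, lo=2, hi=999):
    """Mimic click.IntRange(lo, hi, clamp=True): out-of-range values are
    snapped to the nearest bound instead of raising a UsageError."""
    return max(lo, min(hi, value))
```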
15-15
: Similar changes as in demultiplex.pyThe changes to the command decorator and thread range validation are consistent with those in demultiplex.py.
Also applies to: 17-17
harpy/downsample.py (1)
34-34
:❓ Verification inconclusive
Verify the minimum thread count requirement.
The thread range here is 1-999 while other files use 2-999. Please verify if this difference is intentional and if the module can operate with a single thread.
🏁 Script executed:
#!/bin/bash # Search for any thread-specific code in the downsample workflow rg -A 5 "thread|core|cpu" workflow/downsample.smkLength of output: 115
Action required: Verify the minimum thread count for downsampling
The thread range in
harpy/downsample.py
is configured withclick.IntRange(1,999, clamp=True)
, while other modules use a range of 2-999. The automated search for thread-specific code in the expected workflow file (workflow/downsample.smk
) did not return any data because that file could not be found. Please manually verify whether the module is designed to operate correctly with a single thread. If single-thread operation is intended and supported, confirm this decision; otherwise, consider revising the range for consistency.harpy/impute.py (1)
36-36
:❓ Verification inconclusive
Verify the minimum thread count requirement.
The thread range here is 4-999 while most other files use 2-999. Please verify if this higher minimum thread count is required for the STITCH imputation process.
🏁 Script executed:
#!/bin/bash # Search for any thread-specific code in the impute workflow and STITCH parameters rg -A 5 "thread|core|cpu" workflow/impute.smk ast-grep --pattern 'nCores = $_'Length of output: 142
Action Required: Verify Minimum Thread Count for STITCH Imputation Process
The current configuration in
harpy/impute.py
sets the thread range to 4–999, while other modules in the codebase use 2–999. Our initial automated searches (targetingworkflow/impute.smk
and a generic thread keyword search) did not yield relevant evidence—either because the searched file isn’t present or no pertinent thread-related code was found. Please manually verify that the decision to require a minimum of 4 threads is indeed necessary for the STITCH imputation process and not an oversight.
- Location:
harpy/impute.py
(Line 36)- Points to Confirm:
- Is the elevated minimum thread count (4 vs. 2) specifically required by the STITCH imputation process?
- Should other imputation modules share the same lower bound, or is a higher thread count justified here?
harpy/simulate_linkedreads.py (2)

29-29: LGTM! CLI behavior change.
The removal of `no_args_is_help=True` aligns with the standardization of CLI behavior across the codebase.

39-39: LGTM! Enhanced thread count validation.
The addition of `clamp=True` ensures thread count values are automatically adjusted to fit within the 1-999 range instead of raising an error.

harpy/preflight.py (2)

41-42: LGTM! Consistent CLI behavior and thread validation.
The changes align with the standardization across the codebase:
- Removal of `no_args_is_help=True` for consistent CLI behavior
- Addition of `clamp=True` for thread count validation

102-104: LGTM! Consistent changes in fastq command.
The same standardization changes have been applied to the fastq command, maintaining consistency.

harpy/phase.py (2)

31-31: LGTM! CLI behavior standardization.
The removal of `no_args_is_help=True` aligns with the standardization of CLI behavior.

37-38: LGTM! Enhanced parameter validation.
The addition of `clamp=True` improves robustness:
- Prune threshold is automatically adjusted to fit within the 0-100 range
- Thread count is automatically adjusted to fit within the 2-999 range

harpy/qc.py (3)

31-32: LGTM! Enhanced parameter handling.
The changes improve robustness and usability:
- Standardized CLI behavior by removing `no_args_is_help=True`
- Enhanced deconvolve parameter validation with `clamp=True`

38-38: LGTM! Consistent thread validation.
The addition of `clamp=True` ensures thread count values are automatically adjusted to fit within the 4-999 range.

64-65: LGTM! Improved documentation.
The documentation has been updated to reflect the parameter format change from comma-separated to space-separated values.
harpy/metassembly.py (1)

30-30: Command-line interface improvements for better input validation.
The changes enhance input validation and error handling:
- Memory limit validation is now more strict without `max_open=True`
- Thread count is now clamped to stay within the 1-999 range instead of throwing an error

Let's verify the thread clamping behavior works as expected:

```python
#!/usr/bin/env python3
import rich_click as click

# Test thread clamping
def test_thread_clamping():
    # Create a test command with the same thread option
    @click.command()
    @click.option('-t', '--threads', type=click.IntRange(1, 999, clamp=True))
    def cmd(threads):
        print(f"Threads: {threads}")

    # Test values
    test_values = [-1, 0, 1, 500, 999, 1000, 1500]
    print("Testing thread clamping with values:", test_values)
    for val in test_values:
        try:
            ctx = click.Context(cmd)
            param = next(p for p in cmd.params if p.name == "threads")
            result = param.type_cast_value(ctx, str(val))
            print(f"Input: {val} -> Output: {result}")
        except Exception as e:
            print(f"Input: {val} -> Error: {str(e)}")

if __name__ == "__main__":
    test_thread_clamping()
```

Also applies to: 34-34, 39-39
harpy/deconvolve.py (1)

29-29: Consistent command-line interface improvements across files.
The changes align with the improvements in metassembly.py:
- Density and dropout options have stricter validation
- Thread count is clamped to stay within the 1-999 range
Also applies to: 32-34

harpy/snp.py (3)

54-54: Appropriate ploidy validation for different variant callers.
The changes correctly implement different ploidy validation rules:
- mpileup: Ploidy is restricted to 1-2 as it's not recommended for higher ploidies
- freebayes: Ploidy only has a minimum of 1 as it's recommended for higher ploidies (as mentioned in the docstring)
- Both commands have thread count clamping (4-999)
Also applies to: 58-58, 61-61, 156-156, 160-160, 163-163

58-58: LGTM: Appropriate ploidy ranges for different variant callers.
The different ploidy range constraints between mpileup (1-2) and freebayes (min=1) align well with their capabilities:
- mpileup is optimized for diploid organisms (ploidy ≤ 2).
- freebayes supports higher ploidy levels, making it suitable for polyploid organisms.
Also applies to: 160-160

61-61: LGTM: Consistent thread validation.
The addition of `clamp=True` to thread options in both commands maintains consistency with the codebase-wide standardization of thread validation.
Also applies to: 163-163
harpy/sv.py (4)

8-8: LGTM!
The import of `convert_to_int` is correctly placed and used in the `quiet` option's callback parameter.

68-68: LGTM! Thread option improvements.
Adding `clamp=True` to thread options is a good improvement as it ensures values remain within the specified range (4-999) by clamping them instead of raising an error.
Also applies to: 166-166

64-64: LGTM! Improved sharing-thresholds option type.
The change from `IntList(3)` to `click.IntRange(0,100, clamp=True)` with `nargs=3` is a good improvement as it:
- Explicitly defines the valid range (0-100)
- Clamps values to the range instead of raising errors
- Maintains the requirement for exactly 3 values
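The combined behavior of the three properties above can be approximated in plain Python. The function name is hypothetical; it just mirrors what `nargs=3` with `click.IntRange(0, 100, clamp=True)` produces:

```python
def sharing_thresholds(values):
    """Approximate nargs=3 + click.IntRange(0, 100, clamp=True):
    require exactly three integers, each clamped into [0, 100]."""
    if len(values) != 3:
        raise ValueError("exactly 3 values are required")
    return tuple(max(0, min(100, int(v))) for v in values)
```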
61-63: ❓ Verification inconclusive
Verify the impact of removing maximum limits.
The removal of maximum limits from various IntRange options might allow unexpectedly large values. Let's verify if this could cause any issues:
Also applies to: 160-163

🏁 Script executed:

```shell
#!/bin/bash
# Description: Search for any hardcoded limits or array allocations that might be affected by these parameters
# Expected: No hardcoded limits that could cause issues with large values

# Search for array allocations or loops using these parameters
ast-grep --pattern 'for $_ in range($iterations)'  # Check iterations usage
ast-grep --pattern 'array($duplicates)'    # Check duplicates usage
ast-grep --pattern 'array($min_size)'      # Check min_size usage
ast-grep --pattern 'array($min_barcodes)'  # Check min_barcodes usage
```

Length of output: 177

Action Required: Manually verify maximum limits removal in IntRange options.
While our automated AST searches didn't reveal any hardcoded limits or array allocations directly referencing the parameters (`iterations`, `duplicates`, `min_size`, `min_barcodes`), the absence of output means we could not conclusively confirm that no downstream usage relies on maximum boundaries. Please manually verify that:
- No loops, allocations, or dependent calculations assume an upper bound for these parameters.
- The similar changes applied at lines 160-163 are consistent and do not risk resource exhaustion or unexpected behavior when large values are provided.

harpy/assembly.py (1)

46-46: LGTM: Consistent validation improvements across options.
The addition of `clamp=True` to multiple options (`min-quality`, `seq-identity`, and `threads`) provides consistent and robust input validation:
- Values outside the range will be automatically adjusted to the nearest valid value.
- The ranges are sensible for each option (0-40 for quality, 0-100 for identity, 1-999 for threads).
Also applies to: 50-50, 54-54
This PR:
- infers the samtools markdup `-d` value from the sequencer id of the alignments within a sample

Summary by CodeRabbit
- New Features
- Pinned `click` to version 8
- Refactor
- Chores