Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revisit demultiplexing, sv cli commands #191

Merged
merged 29 commits into from
Feb 5, 2025
Merged

revisit demultiplexing, sv cli commands #191

merged 29 commits into from
Feb 5, 2025

Conversation

pdimens
Copy link
Owner

@pdimens pdimens commented Feb 3, 2025

This PR:

  • introduces a python-based demultiplexing script that achieves the same as its C++ predecessor, but demultiplexes samples as its demultiplexing barcodes, which should boost speed/efficiency fixes Speedup Demultiplex workflow #190
  • the text that gets printed at the start of a workflow is now standardized in an internal function
    • removes importing two rich modules for each workflow
    • removes 3 lines of code in each workflow that instantiate the rich.Table with proper formatting
    • all input values (index 1) are converted to strings as a failsafe (names do not as they have to be strings)
    • introduces logic to skip None, making the row-adding more flexible than it was
  • makes the QX and RX demux tags optional and opt-in with new --qx-rx flag fixes Make writing QX and RX tags optional #195
  • revisits the leviathan CLI to add --duplicates and --sharing-thresholds fixes revisit sv leviathan cli options #196
    • --min-sv is now --min-size in both leviathan and naibr workfows
  • adds 0 as as special option to align --molecule-threshold to ignore aligment distance for MI assignment fixes add 0 as special value for align molecule threshold #199
  • adds --ignore-bx to align bwa and align strobe, along with qc to process the data without linked-read considerations fixes add option to disable linked-read stuff within align bwa/strobe #193

The original python code was provided by @BELKHIR and moderately modified to work in a Harpy workflow situation

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Enhanced the demultiplexing workflow with improved dependency management, modular script fetching, and refined barcoding validations.
    • Introduced a new command-line option to include additional tags in read headers.
    • Added new parameters for variant thresholds in structural variant analysis.
    • Added a new command-line option to ignore parts of the workflow specific to linked-read sequences.
  • Refactor

    • Standardized workflow display across commands, streamlining how key run details are presented.
    • Simplified input validation and error handling in simulation and analysis workflows for a more robust user experience.
    • Updated the method of constructing workflow information for better readability and maintainability.
    • Refined the handling of BAM file indexing within the workflow.
    • Renamed configuration parameters for clarity, enhancing the usability of the workflow.
  • Chore

    • Removed an outdated demultiplexing script and updated output file naming conventions for greater clarity.

Copy link

coderabbitai bot commented Feb 3, 2025

Warning

Rate limit exceeded

@pdimens has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 21 minutes and 33 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 34942c0 and dc0703d.

📒 Files selected for processing (2)
  • README.md (1 hunks)
  • harpy/_cli_types_generic.py (2 hunks)
📝 Walkthrough

Walkthrough

This pull request introduces a new conda environment category named "demultiplex" along with its associated dependencies. It refactors the presentation of workflow information by adding a common workflow_info function to the _printing module, replacing manual rich.Table constructions throughout the codebase. Additionally, output file naming conventions in the barcode processing scripts have been updated, a legacy demultiplexing file has been removed, and a new demultiplexing script along with snakefile modifications have been added. The overall control flow remains intact with the refactoring consolidating formatting logic.

Changes

File(s) Change Summary
harpy/_conda.py Added a new "demultiplex" conda environment category with dependencies: bioconda::pheniqs, bioconda::pysam=0.22, and conda-forge::python-levenshtein.
harpy/_printing.py Introduced the new workflow_info(*args) function to streamline workflow output formatting.
harpy/align.py, harpy/assembly.py, harpy/deconvolve.py, harpy/downsample.py, harpy/impute.py, harpy/metassembly.py, harpy/phase.py, harpy/preflight.py, harpy/qc.py, harpy/resume.py, harpy/snp.py, harpy/sv.py, harpy/simulate.py, harpy/demultiplex.py Replaced manual construction of workflow output (using rich.Table) with calls to workflow_info, updating import statements accordingly.
harpy/bin/haplotag_acbd.py Updated output file naming from BC_{BC}.txt to segment_{BC}.bc.
harpy/bin/dmux_gen1.py Removed legacy demultiplexing script handling raw linked-reads.
harpy/scripts/demultiplex_gen1.py Added a new demultiplexing script with functions for reading barcodes, schema, calculating minimum Levenshtein distance, and processing read codes.
harpy/snakefiles/demultiplex_gen1.smk Renamed rule from demultiplex_barcodes to demultiplex, updated input/output paths and parameters (e.g., output files now use segment_{letter}.bc format), and refined error handling.

Sequence Diagram(s)

sequenceDiagram
    participant U as User
    participant D as Demultiplex Command
    participant F as fetch_script (_misc)
    participant W as workflow_info (_printing)
    participant S as launch_snakemake

    U->>D: Execute demultiplex command with parameters
    D->>F: Fetch "demultiplex_gen1.py" script
    F-->>D: Return script content
    D->>W: Generate formatted workflow information
    W-->>D: Return start_text
    D->>S: Invoke launch_snakemake with start_text
    S-->>D: Process workflow execution
Loading

Assessment against linked issues

Objective Addressed Explanation
Enhance demultiplexing speed (190)
Replace two-step demultiplexing with a one-step process (190)
Make writing QX and RX tags optional (195)
Change --min-sv to --min-size (196)
Allow a value of 0 to disable a barcode distance threshold (199)

Suggested labels

enhancement


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (3)
harpy/scripts/demultiplex_gen1.py (2)

52-52: Avoid calling .keys() explicitly for membership

The code uses .keys() unnecessarily. Iteration over a dictionary automatically iterates its keys, which is slightly more idiomatic.

-for seq in bar_codes[code_letter].keys():
+for seq in bar_codes[code_letter]:
🧰 Tools
🪛 Ruff (0.8.2)

52-52: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


66-66: Ambiguous variable name I

The variable I can be confused with l (lowercase L) or 1 (one). Use a more descriptive name to improve readability.

For example:

-def get_read_codes(I, codeL, codeR):
+def get_read_codes(barcode_seq, codeL, codeR):
🧰 Tools
🪛 Ruff (0.8.2)

66-66: Ambiguous variable name: I

(E741)

harpy/snakefiles/demultiplex_gen1.smk (1)

147-161: Improve input file path presentation in workflow summary.

The input file paths could be formatted more consistently for better readability.

Apply this diff to improve formatting:

-        inputs += f"\tread 1: {params.R1}\n"
-        inputs += f"\tread 2: {params.R2}\n"
-        inputs += f"\tindex 1: {params.I1}\n"
-        inputs += f"\tindex 2: {params.I2}"
-        inputs += f"Sample demultiplexing schema: {samplefile}"
+        inputs += f"\tRead 1:  {params.R1}\n"
+        inputs += f"\tRead 2:  {params.R2}\n"
+        inputs += f"\tIndex 1: {params.I1}\n"
+        inputs += f"\tIndex 2: {params.I2}\n"
+        inputs += f"\tSchema:  {samplefile}"
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c997d2a and f3f0e89.

📒 Files selected for processing (20)
  • harpy/_conda.py (1 hunks)
  • harpy/_printing.py (1 hunks)
  • harpy/align.py (4 hunks)
  • harpy/assembly.py (2 hunks)
  • harpy/bin/dmux_gen1.py (0 hunks)
  • harpy/bin/haplotag_acbd.py (1 hunks)
  • harpy/deconvolve.py (2 hunks)
  • harpy/demultiplex.py (4 hunks)
  • harpy/downsample.py (2 hunks)
  • harpy/impute.py (2 hunks)
  • harpy/metassembly.py (2 hunks)
  • harpy/phase.py (2 hunks)
  • harpy/preflight.py (3 hunks)
  • harpy/qc.py (2 hunks)
  • harpy/resume.py (2 hunks)
  • harpy/scripts/demultiplex_gen1.py (1 hunks)
  • harpy/simulate.py (10 hunks)
  • harpy/snakefiles/demultiplex_gen1.smk (3 hunks)
  • harpy/snp.py (3 hunks)
  • harpy/sv.py (3 hunks)
💤 Files with no reviewable changes (1)
  • harpy/bin/dmux_gen1.py
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py

17-17: Undefined name parser

(F821)


52-52: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


66-66: Ambiguous variable name: I

(E741)


86-86: Undefined name snakemake

(F821)


88-88: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


93-93: Undefined name snakemake

(F821)


94-94: Undefined name snakemake

(F821)


95-95: Undefined name snakemake

(F821)


96-96: Undefined name snakemake

(F821)


97-97: Undefined name snakemake

(F821)


110-110: Use a context manager for opening files

(SIM115)


111-111: Use a context manager for opening files

(SIM115)


121-121: Undefined name snakemake

(F821)


122-122: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (25)
harpy/scripts/demultiplex_gen1.py (1)

86-123: Undefined snakemake references and file handles in dictionaries

  1. Frequent references to snakemake (lines 86-103, 121-122) may raise a NameError in a plain Python context. In Snakemake scripts, this variable is globally injected, so this is valid only if run within Snakemake.

  2. Multiple file handles are stored in dictionaries (lines 110-111), and they are manually closed (lines 158-160). This is acceptable. However, if you want to leverage context managers for better safety and clarity, you could consider using ExitStack from contextlib.

Example for a safer pattern:

from contextlib import ExitStack
import gzip
import pysam

with ExitStack() as stack:
    R1_output = {
        sample: stack.enter_context(gzip.open(f"{outdir}/{sample}.R1.fq.gz", "wb", compresslevel=6))
        for sample in samples
    }
    # ...
    # further steps that ensure all opened files will be closed automatically

Please confirm if the code is run exclusively within Snakemake, where snakemake is a known global variable.

✅ Verification successful

Snakemake Context Confirmation and File Handle Management

The code is intended to run exclusively within a Snakemake workflow—here, the global snakemake variable is injected automatically, so the references are valid in that context. The use of dictionaries to manage file handles, with manual closing, is acceptable though switching to a context manager like ExitStack could further guarantee resource safety.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found

🧰 Tools
🪛 Ruff (0.8.2)

86-86: Undefined name snakemake

(F821)


88-88: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


93-93: Undefined name snakemake

(F821)


94-94: Undefined name snakemake

(F821)


95-95: Undefined name snakemake

(F821)


96-96: Undefined name snakemake

(F821)


97-97: Undefined name snakemake

(F821)


110-110: Use a context manager for opening files

(SIM115)


111-111: Use a context manager for opening files

(SIM115)


121-121: Undefined name snakemake

(F821)


122-122: Undefined name snakemake

(F821)

harpy/resume.py (1)

50-53: Clean and consistent approach to workflow info

Using workflow_info in place of manually constructing tables is a neat refactoring. This keeps the code concise and readable.

harpy/bin/haplotag_acbd.py (1)

30-30: Changed output file extension from .txt to .bc

The new naming convention segment_{BC}.bc appears domain-specific, which is fine if it's well-documented. Ensure downstream consumers can handle .bc as a tab-delimited text.

harpy/deconvolve.py (1)

86-90: LGTM!

The implementation correctly uses the new workflow_info function to create a consistent workflow information display.

harpy/downsample.py (1)

96-101: LGTM!

The implementation correctly uses the new workflow_info function to create a consistent workflow information display.

harpy/demultiplex.py (5)

11-12: LGTM!

The new imports enhance modularity by separating script fetching and workflow info presentation concerns.


60-60: LGTM!

The docstring update improves clarity by better describing the barcode segment's purpose.


75-76: LGTM!

The addition of script fetching enhances modularity by ensuring the demultiplexing script is available in the workflow directory.


79-79: LGTM!

The addition of the "demultiplex" environment to conda_envs ensures proper dependency management.


104-109: LGTM!

The refactored workflow info presentation using workflow_info function improves consistency and maintainability.

harpy/metassembly.py (2)

12-12: LGTM!

The addition of workflow_info import aligns with the codebase-wide refactoring for consistent output formatting.


101-106: LGTM!

The refactored workflow info presentation using workflow_info function improves consistency and maintainability.

harpy/snakefiles/demultiplex_gen1.smk (3)

25-30: LGTM!

The improved error handling with specific ValueError catch enhances code robustness.


38-40: LGTM!

The updated file paths using workflow directory improve organization and maintainability.


46-69: LGTM!

The refactored demultiplex rule with explicit input structure and improved organization enhances clarity and maintainability.

harpy/impute.py (2)

14-14: LGTM!

The addition of workflow_info import aligns with the codebase-wide refactoring for consistent output formatting.


109-117: LGTM!

The refactored workflow info presentation using workflow_info function improves consistency and maintainability.

harpy/phase.py (1)

111-119: LGTM! Clean refactoring of workflow information display.

The refactoring maintains all the necessary information while improving code consistency. The conditional handling of genome information is well implemented.

harpy/qc.py (1)

117-124: LGTM! Well-structured workflow information display.

The refactoring effectively presents all workflow parameters with clear yes/no indicators for the operational flags.

harpy/assembly.py (1)

131-136: LGTM! Effective refactoring of assembly workflow information.

The implementation cleanly handles both auto and custom kmer length displays while maintaining consistent formatting.

harpy/preflight.py (2)

93-97: LGTM! Clean implementation of BAM preflight information display.

The refactoring maintains consistency with other workflow displays while clearly presenting BAM-specific information.


155-159: LGTM! Clean implementation of FASTQ preflight information display.

The refactoring maintains parallel structure with the BAM implementation while clearly presenting FASTQ-specific information.

harpy/snp.py (1)

14-14: LGTM! Clean refactoring of workflow information presentation.

The changes improve code consistency by using the new workflow_info function to standardize the presentation of workflow information across different commands.

Also applies to: 145-151, 249-255

harpy/sv.py (1)

14-14: LGTM! Consistent refactoring of workflow information presentation.

The changes align with the codebase-wide standardization of workflow information presentation using the workflow_info function.

Also applies to: 135-141, 238-245

harpy/simulate.py (1)

262-267: LGTM! Enhanced error handling for variant simulation.

The added error handling prevents users from specifying both random and known variants, providing clear error messages. This improves the robustness of the command-line interface.

Also applies to: 374-375, 484-485, 584-585

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
harpy/scripts/demultiplex_gen1.py (2)

49-65: Improve code readability and maintainability.

The function has several areas for improvement:

  1. Replace the magic number 999 with a named constant.
  2. Use more descriptive variable names.

Apply this diff to improve the code:

+MAX_DISTANCE = 999  # Maximum possible Levenshtein distance

def get_min_dist(needle, code_letter):
-    minDist = 999
-    nbFound = 0
-    minSeq =""
+    min_distance = MAX_DISTANCE
+    matches_found = 0
+    min_sequence = ""
    for seq in bar_codes[code_letter]:
        d = distance(needle, seq)
-        if (d < minDist):
-            minDist = d
-            nbFound = 1
-            minSeq= seq
-        elif (d == minDist):
-            nbFound += 1   
-    if (nbFound>1):
+        if d < min_distance:
+            min_distance = d
+            matches_found = 1
+            min_sequence = seq
+        elif d == min_distance:
+            matches_found += 1   
+    if matches_found > 1:
         code_min_dist = f"{code_letter}00"
     else:
-        code_min_dist =  bar_codes[code_letter][minSeq]   
+        code_min_dist = bar_codes[code_letter][min_sequence]   
     return code_min_dist
🧰 Tools
🪛 Ruff (0.8.2)

53-53: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


67-85: Improve code readability and maintainability.

The function has several areas for improvement:

  1. Use a more descriptive parameter name instead of I.
  2. Extract the hardcoded slice length into a named constant.

Apply this diff to improve the code:

-def get_read_codes(I, codeL, codeR):
+BARCODE_LENGTH = 6  # Length of each barcode sequence

+def get_read_codes(sequence, code_left, code_right):
-  left  = I[0:6] # change depending on protocol
-  right = I[7:]
+  left = sequence[0:BARCODE_LENGTH]
+  right = sequence[BARCODE_LENGTH + 1:]  # Skip the separator character
   status = "found"
-  if left in bar_codes[codeL]:
-    lc = bar_codes[codeL][left]
+  if left in bar_codes[code_left]:
+    left_code = bar_codes[code_left][left]
   else:
-    lc = get_min_dist(left, codeL)
+    left_code = get_min_dist(left, code_left)
     status = "corrected"

-  if right in bar_codes[codeR]:
-    rc = bar_codes[codeR][right]
+  if right in bar_codes[code_right]:
+    right_code = bar_codes[code_right][right]
   else:
-    rc = get_min_dist(right, codeR)
+    right_code = get_min_dist(right, code_right)
     status = "corrected"

-  if (lc == f"{codeL}00" or rc == f"{codeR}00"):
+  if (left_code == f"{code_left}00" or right_code == f"{code_right}00"):
     status = "unclear"
-  return rc, lc, status
+  return right_code, left_code, status
🧰 Tools
🪛 Ruff (0.8.2)

67-67: Ambiguous variable name: I

(E741)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f3f0e89 and 1f5235e.

📒 Files selected for processing (5)
  • harpy/_conda.py (1 hunks)
  • harpy/_printing.py (1 hunks)
  • harpy/_validations.py (1 hunks)
  • harpy/align.py (4 hunks)
  • harpy/scripts/demultiplex_gen1.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • harpy/_conda.py
  • harpy/align.py
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/_validations.py

276-276: Undefined name seg_id

(F821)


284-284: Undefined name file_path

(F821)

harpy/scripts/demultiplex_gen1.py

53-53: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


67-67: Ambiguous variable name: I

(E741)


87-87: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


93-93: Undefined name snakemake

(F821)


94-94: Undefined name snakemake

(F821)


95-95: Undefined name snakemake

(F821)


96-96: Undefined name snakemake

(F821)


97-97: Undefined name snakemake

(F821)


98-98: Undefined name snakemake

(F821)


111-111: Use a context manager for opening files

(SIM115)


112-112: Use a context manager for opening files

(SIM115)


122-122: Undefined name snakemake

(F821)


123-123: Undefined name snakemake

(F821)

🔇 Additional comments (1)
harpy/_printing.py (1)

104-115: ⚠️ Potential issue

Fix the conditional check and add type safety.

The function has several issues that need to be addressed:

  1. The conditional check if i: is incorrect as it checks the entire argument tuple instead of checking if it's None.
  2. No validation for tuple/list length which could lead to runtime errors.

Apply this diff to fix the issues:

def workflow_info(*arg: tuple[str, str | int | float]) -> Table:
    """
    Accepts an unlimited number of length-2 lists or tuples and returns a rich.Table with the value of the first indices as the row names and the second indices as the values
    Use None instead of a list to ignore that entry (useful for conditionals). The second value will always be converted to a string.
    """
    table = Table(show_header=False,pad_edge=False, show_edge=False, padding = (0,0), box=box.SIMPLE)
    table.add_column("detail", justify="left", style="light_steel_blue", no_wrap=True)
    table.add_column("value", justify="left")
    for i in arg:
-        if i:
+        if i is not None and len(i) == 2:
            table.add_row(i[0], str(i[1]))
+        elif i is not None:
+            raise ValueError("Each argument must be a tuple/list of length 2")
    return table

Likely invalid or redundant comment.

Repository owner deleted a comment from coderabbitai bot Feb 4, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🔭 Outside diff range comments (1)
harpy/scripts/demultiplex_gen1.py (1)

83-170:

🧩 Analysis chain

Several improvements needed in the main execution block.

  1. File handling should use context managers for output files.
  2. The script relies on undefined snakemake variable.
  3. Variable I is ambiguous.

Apply these improvements:

+#!/usr/bin/env python3
+
+import argparse
+import sys
+import pysam
+from pathlib import Path
+from Levenshtein import distance
+
+def parse_args():
+    parser = argparse.ArgumentParser(description='Demultiplex haplotagged FASTQ files')
+    parser.add_argument('--r1', required=True, help='Read 1 FASTQ file')
+    parser.add_argument('--r2', required=True, help='Read 2 FASTQ file')
+    parser.add_argument('--i1', required=True, help='Index 1 FASTQ file')
+    parser.add_argument('--i2', required=True, help='Index 2 FASTQ file')
+    parser.add_argument('--schema', required=True, help='Demultiplexing schema file')
+    parser.add_argument('--outdir', required=True, help='Output directory')
+    parser.add_argument('--qxrx', action='store_true', help='Include QX:Z and RX:Z tags')
+    return parser.parse_args()

-with open(snakemake.log[0], "w") as f:
-    sys.stderr = sys.stdout = f
-    outdir = snakemake.params.outdir
-    qxrx = snakemake.params.qxrx
+def main():
+    args = parse_args()
+    outdir = args.outdir
+    qxrx = args.qxrx

     # Create output files using context managers
     output_files = {}
     try:
         for sample in samples:
             output_files[sample] = {
                 'R1': open(f"{outdir}/{sample}.R1.fq", 'w'),
                 'R2': open(f"{outdir}/{sample}.R2.fq", 'w')
             }
         
         with (
-            pysam.FastxFile(r1) as R1,
-            pysam.FastxFile(r2) as R2,
-            pysam.FastxFile(i1, persist=False) as I1,
-            pysam.FastxFile(i2, persist=False) as I2,
+            pysam.FastxFile(args.r1) as read1,
+            pysam.FastxFile(args.r2) as read2,
+            pysam.FastxFile(args.i1, persist=False) as index1,
+            pysam.FastxFile(args.i2, persist=False) as index2,
         ):
             # Process reads...
     finally:
         # Ensure all files are closed
         for sample_files in output_files.values():
             for f in sample_files.values():
                 f.close()

+if __name__ == '__main__':
+    main()

Run this script to check for undefined variables:


🏁 Script executed:

#!/bin/bash
# Description: Check for undefined variables in Python files
rg -l "snakemake" . | xargs ruff --select F821

Length of output: 160


I'll run a new shell script to check if any occurrences of the snakemake variable remain in the updated file:


🏁 Script executed:

#!/bin/bash
# Verify that the 'snakemake' variable is no longer present in the updated file
rg -n "snakemake" harpy/scripts/demultiplex_gen1.py

Length of output: 619

Undefined "snakemake" variable still present in the file.

  • The file still contains multiple instances of snakemake (e.g., lines 83–95 and 119–120), meaning the review-requested improvements haven’t been applied.
  • The suggested replacement using command-line arguments (i.e., using args from an argument parser) is not yet implemented.
🧰 Tools
🪛 Ruff (0.8.2)

83-83: Undefined name snakemake

(F821)


85-85: Undefined name snakemake

(F821)


86-86: Undefined name snakemake

(F821)


87-87: Undefined name snakemake

(F821)


88-88: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


93-93: Undefined name snakemake

(F821)


94-94: Undefined name snakemake

(F821)


95-95: Undefined name snakemake

(F821)


108-108: Use a context manager for opening files

(SIM115)


109-109: Use a context manager for opening files

(SIM115)


119-119: Undefined name snakemake

(F821)


120-120: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f5235e and c9a2153.

📒 Files selected for processing (4)
  • harpy/_validations.py (1 hunks)
  • harpy/demultiplex.py (7 hunks)
  • harpy/scripts/demultiplex_gen1.py (1 hunks)
  • harpy/snakefiles/demultiplex_gen1.smk (3 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/_validations.py

284-284: Undefined name file_path

(F821)

harpy/scripts/demultiplex_gen1.py

3-3: os imported but unused

Remove unused import: os

(F401)


49-49: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


63-63: Ambiguous variable name: I

(E741)


83-83: Undefined name snakemake

(F821)


85-85: Undefined name snakemake

(F821)


86-86: Undefined name snakemake

(F821)


87-87: Undefined name snakemake

(F821)


88-88: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


93-93: Undefined name snakemake

(F821)


94-94: Undefined name snakemake

(F821)


95-95: Undefined name snakemake

(F821)


108-108: Use a context manager for opening files

(SIM115)


109-109: Use a context manager for opening files

(SIM115)


119-119: Undefined name snakemake

(F821)


120-120: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (8)
harpy/demultiplex.py (5)

31-31: LGTM!

The docstring parameters section is correctly updated to include the new --qx-rx option.


44-44: LGTM!

The qx-rx option is properly implemented with clear help text and correctly added to the function signature.

Also applies to: 55-55


82-82: LGTM!

The conda environments list is updated to include the new "demultiplex" environment.


92-92: LGTM!

The qx_rx flag is correctly added to the config dictionary.


108-114: LGTM!

The workflow info presentation is nicely refactored using the new workflow_info function, improving code maintainability.

harpy/snakefiles/demultiplex_gen1.smk (3)

8-9: LGTM!

The config variables are correctly initialized from the config dictionary.


38-70: LGTM!

The rules are properly updated with:

  • Correct output paths for barcode segments
  • Proper input file handling
  • Clear parameter passing

161-174: LGTM!

The workflow summary is properly updated with input file information and correct terminology.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
harpy/scripts/demultiplex_gen1.py (3)

3-6: Remove unused import.

The os module is imported but never used in the code.

-import os
 import sys
 import pysam
 from Levenshtein import distance
🧰 Tools
🪛 Ruff (0.8.2)

3-3: os imported but unused

Remove unused import: os

(F401)


43-59: Improve code readability and maintainability.

Several issues need to be addressed:

  1. Magic number (999) should be replaced with a meaningful constant
  2. Variable names should follow Python naming conventions
  3. Unnecessary use of .keys()
 def get_min_dist(needle, code_letter):
-    minDist = 999
-    nbFound = 0
-    minSeq = ""
-    for seq in bar_codes[code_letter].keys():
+    MAX_DIST = float('inf')
+    min_dist = MAX_DIST
+    matches_found = 0
+    min_seq = ""
+    for seq in bar_codes[code_letter]:
         d = distance(needle, seq)
-        if (d < minDist):
-            minDist = d
-            nbFound = 1
-            minSeq = seq
-        elif (d == minDist):
-            nbFound += 1   
-    if (nbFound > 1):
+        if d < min_dist:
+            min_dist = d
+            matches_found = 1
+            min_seq = seq
+        elif d == min_dist:
+            matches_found += 1   
+    if matches_found > 1:
         code_min_dist = f"{code_letter}00"
     else:
-        code_min_dist = bar_codes[code_letter][minSeq]   
+        code_min_dist = bar_codes[code_letter][min_seq]   
     return code_min_dist
🧰 Tools
🪛 Ruff (0.8.2)

47-47: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


61-80: Improve code clarity and configurability.

The function has several issues:

  1. Ambiguous parameter name 'I'
  2. Hardcoded barcode length (6)
  3. Inconsistent spacing around operators
-def get_read_codes(I, codeL, codeR):
+def get_read_codes(index_seq, left_code, right_code):
+    """Extract and validate barcode sequences from the index read.
+    
+    Args:
+        index_seq: The index sequence to process
+        left_code: The expected left barcode segment
+        right_code: The expected right barcode segment
+    
+    Returns:
+        tuple: (right_code, left_code, status)
+    """
+    BARCODE_LENGTH = 6  # Make protocol-dependent value configurable
-    left = I[0:6]  # protocol-dependant
-    right = I[7:]
+    left = index_seq[0:BARCODE_LENGTH]
+    right = index_seq[7:]
     status = "found"
-    if left in bar_codes[codeL]:
-        lc = bar_codes[codeL][left]
+    if left in bar_codes[left_code]:
+        lc = bar_codes[left_code][left]
     else:
-        lc = get_min_dist(left, codeL)
+        lc = get_min_dist(left, left_code)
         status = "corrected"

-    if right in bar_codes[codeR]:
-        rc = bar_codes[codeR][right]
+    if right in bar_codes[right_code]:
+        rc = bar_codes[right_code][right]
     else:
-        rc = get_min_dist(right, codeR)
+        rc = get_min_dist(right, right_code)
         status = "corrected"

-    if (lc == f"{codeL}00" or rc == f"{codeR}00"):
+    if lc == f"{left_code}00" or rc == f"{right_code}00":
         status = "unclear"
     return rc, lc, status
🧰 Tools
🪛 Ruff (0.8.2)

61-61: Ambiguous variable name: I

(E741)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c9a2153 and 6390c63.

📒 Files selected for processing (1)
  • harpy/scripts/demultiplex_gen1.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py

3-3: os imported but unused

Remove unused import: os

(F401)


47-47: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


61-61: Ambiguous variable name: I

(E741)


81-81: Undefined name snakemake

(F821)


83-83: Undefined name snakemake

(F821)


84-84: Undefined name snakemake

(F821)


85-85: Undefined name snakemake

(F821)


86-86: Undefined name snakemake

(F821)


87-87: Undefined name snakemake

(F821)


88-88: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


93-93: Undefined name snakemake

(F821)


106-106: Use a context manager for opening files

(SIM115)


107-107: Use a context manager for opening files

(SIM115)


117-117: Undefined name snakemake

(F821)


118-118: Undefined name snakemake

(F821)

🔇 Additional comments (2)
harpy/scripts/demultiplex_gen1.py (2)

8-22: 🛠️ Refactor suggestion

Add input validation for nucleotide sequences.

The function should validate that sequences contain only valid nucleotides (ATCG).

 def read_barcodes(file_path, segment):
     """Read and parse input barcode (segment) file of segment<tab>sequence"""
     data_dict = {}
+    valid_nucleotides = set('ATCG')
     with open(file_path, 'r') as file:
         for line in file:
             try:
                 code, seq = line.rstrip().split()
+                if not set(seq.upper()).issubset(valid_nucleotides):
+                    sys.stderr.write(f"Invalid nucleotides in sequence {seq} in {file_path}\n")
+                    sys.exit(1)
                 if code[0].upper() != segment:
                     sys.stderr.write(f"Segments in {file_path} are expected to begin with {segment}, but begin with {code[0].upper()}\n")
                     sys.exit(1)
                 data_dict[seq] = code
             except ValueError:
                 # skip rows without two columns
                 continue
     return data_dict

Likely invalid or redundant comment.


81-81: Verify Snakemake environment.

The script assumes it's running within a Snakemake workflow. This should be verified to prevent runtime errors.

Add a check at the start of the script:

+if 'snakemake' not in globals():
+    raise RuntimeError("This script is meant to be run within a Snakemake workflow.")
+
 with open(snakemake.log[0], "w") as f:
🧰 Tools
🪛 Ruff (0.8.2)

81-81: Undefined name snakemake

(F821)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
harpy/_validations.py (2)

262-262: Make the regex pattern more specific.

The current pattern allows any number of digits after the letter. Based on the error message in line 269, it should specifically match two digits.

-    segment_pattern = re.compile(r'^[A-D]\d+$')
+    segment_pattern = re.compile(r'^[A-D]\d{2}$')

282-284: Improve error handling for malformed rows.

The current implementation silently skips rows that don't have two columns. Consider providing feedback about malformed rows to help users identify and fix issues in their input files.

             except ValueError:
-                # skip rows without two columns
-                continue
+                if line.strip() and not line.strip().startswith('#'):
+                    print_error("malformed row", f"Row '{line.strip()}' does not have the expected format of 'sample<tab>segment'.")
+                    print_solution("Each non-empty, non-comment row should have exactly two columns separated by a tab character.")
+                    sys.exit(1)
+                continue
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6390c63 and 79bcd53.

📒 Files selected for processing (1)
  • harpy/_validations.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (2)
harpy/_validations.py (2)

258-291: LGTM! The overall implementation is solid.

The function effectively validates the demultiplex schema file by:

  1. Ensuring segment IDs follow the correct format
  2. Preventing duplicate segment IDs
  3. Enforcing consistent code letters across all segments
  4. Providing clear error messages with solutions

289-289: 🛠️ Refactor suggestion

Use print_error for consistency.

The error message uses print instead of print_error, which is inconsistent with the error handling pattern used throughout the file.

-        print("invalid schema", f"Schema file {os.path.basename(infile)} has sample IDs expected to be identified across multiple barcode segments. All sample IDs for this technology should be in a single segment, such as [bold green]C[/bold green] or [bold green]D[/bold green].")
+        print_error("invalid schema", f"Schema file {os.path.basename(infile)} has sample IDs expected to be identified across multiple barcode segments ({', '.join(sorted(code_letters))}). All sample IDs for this technology should be in a single segment, such as [bold green]C[/bold green] or [bold green]D[/bold green].")

Likely invalid or redundant comment.

@pdimens pdimens self-assigned this Feb 5, 2025
Fix f string

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
harpy/scripts/demultiplex_gen1.py (1)

106-107: 🛠️ Refactor suggestion

Use context managers for file handling.

File objects should be managed using context managers to ensure proper cleanup.

-    R1_output = {sample: open(f"{outdir}/{sample}.R1.fq", 'w') for sample in samples}
-    R2_output = {sample: open(f"{outdir}/{sample}.R2.fq", 'w') for sample in samples}
+    class OutputFiles:
+        def __init__(self, outdir, samples):
+            self.files = {
+                'R1': {sample: f"{outdir}/{sample}.R1.fq" for sample in samples},
+                'R2': {sample: f"{outdir}/{sample}.R2.fq" for sample in samples}
+            }
+            self.handles = {'R1': {}, 'R2': {}}
+        
+        def __enter__(self):
+            for read in ['R1', 'R2']:
+                for sample, path in self.files[read].items():
+                    self.handles[read][sample] = open(path, 'w')
+            return self.handles['R1'], self.handles['R2']
+        
+        def __exit__(self, exc_type, exc_val, exc_tb):
+            for handles in self.handles.values():
+                for f in handles.values():
+                    f.close()
+
+    with OutputFiles(outdir, samples) as (R1_output, R2_output):

Also applies to: 152-154

🧰 Tools
🪛 Ruff (0.8.2)

106-106: Use a context manager for opening files

(SIM115)


107-107: Use a context manager for opening files

(SIM115)

🧹 Nitpick comments (2)
harpy/scripts/demultiplex_gen1.py (2)

3-6: Remove unused import.

The os module is imported but never used in the code.

-import os
 import sys
 import pysam
 from Levenshtein import distance
🧰 Tools
🪛 Ruff (0.8.2)

3-3: os imported but unused

Remove unused import: os

(F401)


81-82: Add type hints for snakemake workflow.

Add type hints to help IDE and static analysis tools recognize snakemake workflow variables.

+from snakemake.script import Snakemake
+
+# Type hint for snakemake workflow
+snakemake: Snakemake
+
 with open(snakemake.log[0], "w") as f:
     sys.stderr = sys.stdout = f
🧰 Tools
🪛 Ruff (0.8.2)

81-81: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 79bcd53 and 6d9dd59.

📒 Files selected for processing (1)
  • harpy/scripts/demultiplex_gen1.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py

3-3: os imported but unused

Remove unused import: os

(F401)


47-47: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


61-61: Ambiguous variable name: I

(E741)


81-81: Undefined name snakemake

(F821)


83-83: Undefined name snakemake

(F821)


84-84: Undefined name snakemake

(F821)


85-85: Undefined name snakemake

(F821)


86-86: Undefined name snakemake

(F821)


87-87: Undefined name snakemake

(F821)


88-88: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


93-93: Undefined name snakemake

(F821)


106-106: Use a context manager for opening files

(SIM115)


107-107: Use a context manager for opening files

(SIM115)


117-117: Undefined name snakemake

(F821)


118-118: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (3)
harpy/scripts/demultiplex_gen1.py (3)

43-59: Improve code clarity and efficiency.

  1. Replace magic number with float('inf').
  2. Use clearer variable names.
  3. Remove unnecessary .keys() call.
 def get_min_dist(needle, code_letter):
-    minDist = 999
-    nbFound = 0
-    minSeq =""
-    for seq in bar_codes[code_letter].keys():
+    min_dist = float('inf')
+    matches_found = 0
+    min_seq = ""
+    for seq in bar_codes[code_letter]:
         d = distance(needle, seq)
-        if (d < minDist):
-            minDist = d
-            nbFound = 1
-            minSeq= seq
-        elif (d == minDist):
-            nbFound += 1   
-    if (nbFound>1):
+        if d < min_dist:
+            min_dist = d
+            matches_found = 1
+            min_seq = seq
+        elif d == min_dist:
+            matches_found += 1   
+    if matches_found > 1:
         code_min_dist = f"{code_letter}00"
     else:
-        code_min_dist =  bar_codes[code_letter][minSeq]   
+        code_min_dist = bar_codes[code_letter][min_seq]   
     return code_min_dist
🧰 Tools
🪛 Ruff (0.8.2)

47-47: Use key in dict instead of key in dict.keys()

Remove .keys()

(SIM118)


61-79: Improve code maintainability and readability.

  1. Make protocol-dependent value configurable.
  2. Use descriptive parameter name.
  3. Add docstring.
-def get_read_codes(I, codeL, codeR):
+def get_read_codes(sequence, left_code, right_code, barcode_length=6):
+    """Extract and validate barcode sequences from input read.
+    
+    Args:
+        sequence: Input sequence to extract barcodes from
+        left_code: Expected left barcode code letter
+        right_code: Expected right barcode code letter
+        barcode_length: Length of barcode sequence (default: 6)
+        
+    Returns:
+        tuple: (right_code, left_code, status)
+    """
-    left = I[0:6]  # protocol-dependant
-    right = I[7:]
+    left = sequence[0:barcode_length]
+    right = sequence[barcode_length + 1:]
     status = "found"
-    if left in bar_codes[codeL]:
-        lc = bar_codes[codeL][left]
+    if left in bar_codes[left_code]:
+        lc = bar_codes[left_code][left]
     else:
-        lc = get_min_dist(left, codeL)
+        lc = get_min_dist(left, left_code)
         status = "corrected"

-    if right in bar_codes[codeR]:
-        rc = bar_codes[codeR][right]
+    if right in bar_codes[right_code]:
+        rc = bar_codes[right_code][right]
     else:
-        rc = get_min_dist(right, codeR)
+        rc = get_min_dist(right, right_code)
         status = "corrected"

-    if (lc == f"{codeL}00" or rc == f"{codeR}00"):
+    if (lc == f"{left_code}00" or rc == f"{right_code}00"):
         status = "unclear"
     return rc, lc, status
🧰 Tools
🪛 Ruff (0.8.2)

61-61: Ambiguous variable name: I

(E741)


8-22: 🛠️ Refactor suggestion

Add input validation and improve error handling.

  1. Add validation for nucleotides (ATCG only).
  2. Replace sys.exit() with proper exception handling.
 def read_barcodes(file_path, segment):
     """Read and parse input barcode (segment) file of segment<tab>sequence"""
     data_dict = {}
+    valid_nucleotides = set('ATCG')
     with open(file_path, 'r') as file:
         for line in file:
             try:
                 code, seq = line.rstrip().split()
+                if not set(seq.upper()).issubset(valid_nucleotides):
+                    raise ValueError(f"Invalid nucleotides in sequence {seq} in {file_path}")
                 if code[0].upper() != segment:
-                    sys.stderr.write(f"Segments in {file_path} are expected to begin with {segment}, but begin with {code[0].upper()}\n")
-                    sys.exit(1)
+                    raise ValueError(
+                        f"Segments in {file_path} are expected to begin with {segment}, "
+                        f"but begin with {code[0].upper()}"
+                    )
                 data_dict[seq] = code
             except ValueError:
                 # skip rows without two columns
                 continue
     return data_dict

Likely invalid or redundant comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🔭 Outside diff range comments (1)
harpy/snakefiles/align_strobealign.smk (1)

130-138: Remove redundant index_duplicates rule.

This rule should be removed as the BAM indexing is now handled in the assign_molecules rule. The rule's presence is inconsistent with the changes in align_bwa.smk and could lead to redundant operations.

Apply this diff to remove the redundant rule:

-rule index_duplicates:
-    input:
-        outdir + "/samples/{sample}/{sample}.markdup.bam"
-    output:
-        temp(outdir + "/samples/{sample}/{sample}.markdup.bam.bai")
-    container:
-        None
-    shell:
-        "samtools index {input}"
♻️ Duplicate comments (1)
harpy/scripts/demultiplex_gen1.py (1)

105-106: 🛠️ Refactor suggestion

Use context managers for file handling.

File handles should be managed using context managers to ensure proper cleanup.

🧰 Tools
🪛 Ruff (0.8.2)

105-105: Use a context manager for opening files

(SIM115)


106-106: Use a context manager for opening files

(SIM115)

🧹 Nitpick comments (2)
harpy/scripts/demultiplex_gen1.py (1)

119-132: Improve error handling and logging.

Add error handling for file I/O operations and improve logging of read processing.

Apply this diff to improve error handling:

-        for r1_rec, r2_rec, i1_rec, i2_rec in zip(R1, R2, I1, I2):
+        try:
+            for r1_rec, r2_rec, i1_rec, i2_rec in zip(R1, R2, I1, I2):
+                if not all([r1_rec, r2_rec, i1_rec, i2_rec]):
+                    sys.stderr.write("Mismatched number of reads in input files\n")
+                    sys.exit(1)
+                try:
+                    segments['A'], segments['C'], statusR1 = get_read_codes(i1_rec.sequence, "C", "A", bar_codes)
+                    segments['B'], segments['D'], statusR2 = get_read_codes(i2_rec.sequence, "D", "B", bar_codes)
+                except Exception as e:
+                    sys.stderr.write(f"Error processing read: {e}\n")
+                    continue
harpy/_validations.py (1)

258-291: Improve documentation and error handling.

The function needs:

  1. Better docstring with parameters and returns
  2. Consistent error message formatting
  3. Use of print_error instead of print

Apply this diff to improve the code:

 def validate_demuxschema(infile):
-    """Validate the file format of the demultiplex schema"""
+    """Validate the file format of the demultiplex schema.
+    
+    Args:
+        infile: Path to the schema file.
+        
+    Returns:
+        None
+        
+    Raises:
+        SystemExit: If validation fails with descriptive error message.
+    """
     code_letters = set() #codes can be Axx, Bxx, Cxx, Dxx
     segment_ids = set()
     segment_pattern = re.compile(r'^[A-D]\d{2}$')
     with open(infile, 'r') as file:
         for line in file:
             try:
                 sample, segment_id = line.rstrip().split()
                 if not segment_pattern.match(segment_id):
                     print_error("invalid segment format", f"Segment ID '{segment_id}' does not follow the expected format.")
                     print_solution("This haplotagging design expects segments to follow the format of letter [green bold]A-D[/green bold] followed by [bold]two digits[/bold], e.g. [green bold]C51[/green bold]). Check that your ID segments or formatted correctly and that you are attempting to demultiplex with a workflow appropriate for your data design.")
                     sys.exit(1)
                 code_letters.add(segment_id[0])
                 if segment_id in segment_ids:
                     print_error("ambiguous segment ID", "An ID segment must only be associated with a single sample.")
                     print_solution_with_culprits(
                         "A barcode segment can only be associated with a single sample. For example: [green bold]C05[/green bold] cannot identify both [green]sample_01[/green] and [green]sample_2[/green]. In other words, a segment can only appear once.",
                         "The segment triggering this error is:"
                         )
                     click.echo(segment_id)
                     sys.exit(1)
                 else:
                     segment_ids.add(segment_id)
             except ValueError:
                 # skip rows without two columns
                 continue
     if not code_letters:
         print_error("incorrect schema format", f"Schema file {os.path.basename(infile)} has no valid rows. Rows should be sample<tab>segment, e.g. sample_01<tab>C75")
         sys.exit(1)
     if len(code_letters) > 1:
-        print("invalid schema", f"Schema file {os.path.basename(infile)} has sample IDs expected to be identified across multiple barcode segments. All sample IDs for this technology should be in a single segment, such as [bold green]C[/bold green] or [bold green]D[/bold green].")
+        print_error("invalid schema", f"Schema file {os.path.basename(infile)} has sample IDs expected to be identified across multiple barcode segments ({', '.join(sorted(code_letters))}). All sample IDs for this technology should be in a single segment, such as [bold green]C[/bold green] or [bold green]D[/bold green].")
         sys.exit(1)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d9dd59 and feaeb83.

📒 Files selected for processing (4)
  • harpy/_validations.py (1 hunks)
  • harpy/scripts/demultiplex_gen1.py (1 hunks)
  • harpy/snakefiles/align_bwa.smk (1 hunks)
  • harpy/snakefiles/align_strobealign.smk (1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/scripts/demultiplex_gen1.py

80-80: Undefined name snakemake

(F821)


82-82: Undefined name snakemake

(F821)


83-83: Undefined name snakemake

(F821)


84-84: Undefined name snakemake

(F821)


85-85: Undefined name snakemake

(F821)


86-86: Undefined name snakemake

(F821)


87-87: Undefined name snakemake

(F821)


88-88: Undefined name snakemake

(F821)


89-89: Undefined name snakemake

(F821)


90-90: Undefined name snakemake

(F821)


91-91: Undefined name snakemake

(F821)


92-92: Undefined name snakemake

(F821)


105-105: Use a context manager for opening files

(SIM115)


106-106: Use a context manager for opening files

(SIM115)


116-116: Undefined name snakemake

(F821)


117-117: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (3)
harpy/snakefiles/align_bwa.smk (1)

144-144: LGTM! Workflow optimization achieved.

The integration of BAM indexing into the assign_molecules rule and marking the index as temporary streamlines the workflow by reducing the number of rules and properly managing intermediate files.

harpy/snakefiles/align_strobealign.smk (1)

145-145: LGTM! Workflow optimization achieved.

The integration of BAM indexing into the assign_molecules rule and marking the index as temporary streamlines the workflow by reducing the number of rules and properly managing intermediate files.

harpy/scripts/demultiplex_gen1.py (1)

23-40: 🛠️ Refactor suggestion

Add input validation and handle edge cases.

The function needs to:

  1. Validate non-empty input files
  2. Validate segment ID format
  3. Handle empty code_letters set safely

Apply this diff to add validation:

 def read_schema(file_path):
     """Read and parse schema file of sample<tab>id_segment"""
     data_dict = {}
     code_letters = set()
+    segment_pattern = re.compile(r'^[A-D]\d{2}$')
     with open(file_path, 'r') as file:
+        content = file.readlines()
+        if not content:
+            sys.stderr.write(f"Empty schema file: {file_path}\n")
+            sys.exit(1)
         for line in file:
             try:
                 sample, segment_id = line.rstrip().split()
+                if not segment_pattern.match(segment_id):
+                    sys.stderr.write(f"Invalid segment ID format: {segment_id}\n")
+                    sys.exit(1)
                 data_dict[segment_id] = sample
                 code_letters.add(segment_id[0])
             except ValueError:
                 # skip rows without two columns
                 continue
+    if not code_letters:
+        sys.stderr.write(f"No valid segment IDs found in {file_path}\n")
+        sys.exit(1)
     id_letter = code_letters.pop()
     return id_letter, data_dict

Likely invalid or redundant comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
harpy/bin/assign_mi.py (1)

115-115: LGTM! Addition of until_eof=True improves read processing.

The addition of until_eof=True ensures that all reads are processed, including those without coordinates. This is a good practice and aligns well with the existing unmapped read handling logic.

This change makes the script more robust by ensuring no reads are accidentally skipped, which is particularly important for downstream analysis.

harpy/snakefiles/sv_naibr.smk (1)

25-25: Maintain consistent parameter naming.

The configuration parameter is named min_size but is mapped to min_sv in the args dictionary. Consider using consistent naming to improve code clarity.

Apply this diff to maintain consistency:

-        "min_sv"   : min_size,
+        "min_size" : min_size,

Also applies to: 38-38

harpy/snakefiles/sv_leviathan.smk (2)

44-59: Remove commented-out rule if no longer needed.

The process_alignments rule is commented out. If this rule is no longer needed, it should be removed rather than left as commented code.


29-29: Fix typo in parameter name.

The parameter name contains a typo: duplcates_thresh should be duplicates_thresh.

Apply this diff to fix the typo:

-duplcates_thresh = config["variant_thresholds"]["duplicates"]
+duplicates_thresh = config["variant_thresholds"]["duplicates"]
harpy/sv.py (1)

9-9: Remove unused imports.

The following imports are not used in the code:

  • LeviathanParams
  • NaibrParams

Apply this diff to remove the unused imports:

-from ._cli_types_params import LeviathanParams, NaibrParams
🧰 Tools
🪛 Ruff (0.8.2)

9-9: ._cli_types_params.LeviathanParams imported but unused

Remove unused import

(F401)


9-9: ._cli_types_params.NaibrParams imported but unused

Remove unused import

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between feaeb83 and 6b8b46b.

📒 Files selected for processing (11)
  • harpy/bin/assign_mi.py (1 hunks)
  • harpy/snakefiles/align_bwa.smk (0 hunks)
  • harpy/snakefiles/align_strobealign.smk (0 hunks)
  • harpy/snakefiles/sv_leviathan.smk (5 hunks)
  • harpy/snakefiles/sv_leviathan_pop.smk (3 hunks)
  • harpy/snakefiles/sv_naibr.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_phase.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_pop.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_pop_phase.smk (2 hunks)
  • harpy/snp.py (5 hunks)
  • harpy/sv.py (11 hunks)
💤 Files with no reviewable changes (2)
  • harpy/snakefiles/align_strobealign.smk
  • harpy/snakefiles/align_bwa.smk
🚧 Files skipped from review as they are similar to previous changes (1)
  • harpy/snp.py
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/sv.py

9-9: ._cli_types_params.LeviathanParams imported but unused

Remove unused import

(F401)


9-9: ._cli_types_params.NaibrParams imported but unused

Remove unused import

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (9)
harpy/bin/assign_mi.py (1)

105-108: Clarify BAM index requirements.

There's an inconsistency between the deprecation of index creation and the documentation. Line 25 still states "If bam, a matching index file should be in the same directory", but the index creation code is being deprecated.

Please clarify:

  1. Is a BAM index still required?
  2. If yes, how should users ensure the index exists?
  3. If no, should we update the help text to remove the index requirement?

Also applies to: 25-25

harpy/snakefiles/sv_naibr_phase.smk (1)

27-27: Maintain consistent parameter naming.

The configuration parameter is named min_size but is mapped to min_sv in the args dictionary. Consider using consistent naming to improve code clarity.

Apply this diff to maintain consistency:

-        "min_sv"   : min_size,
+        "min_size" : min_size,

Also applies to: 45-45

harpy/snakefiles/sv_naibr_pop.smk (1)

24-24: Maintain consistent parameter naming.

The configuration parameter is named min_size but is mapped to min_sv in the args dictionary. Consider using consistent naming to improve code clarity.

Apply this diff to maintain consistency:

-        "min_sv"   : min_size,
+        "min_size" : min_size,

Also applies to: 38-38

harpy/snakefiles/sv_leviathan_pop.smk (2)

23-29: LGTM! Configuration parameters are well-structured.

The renaming of min_sv to min_size improves clarity, and the new variant threshold parameters provide granular control over variant size categories.


167-173: LGTM! Parameters are properly passed to LEVIATHAN.

The parameters are correctly formatted and passed to the LEVIATHAN command, ensuring proper variant size categorization.

harpy/snakefiles/sv_naibr_pop_phase.smk (1)

29-29: LGTM! Configuration parameter is well-named.

The renaming of min_sv to min_size improves clarity.

harpy/sv.py (3)

58-60: LGTM! New command options are well-designed.

The new options for duplicates and sharing thresholds are:

  • Well-documented with clear descriptions
  • Have appropriate type constraints
  • Include sensible default values

118-123: LGTM! Variant thresholds configuration is well-structured.

The dictionary structure for variant thresholds is:

  • Clear and organized
  • Properly typed
  • Includes all necessary threshold categories

144-150: LGTM! Workflow info implementation is clean and consistent.

The new implementation using workflow_info:

  • Improves code readability
  • Maintains consistent formatting across workflows
  • Properly handles optional parameters

Also applies to: 245-252

@pdimens pdimens changed the title use one-and-done demultiplexing script revisit demultiplexing, sv cli commands Feb 5, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
harpy/snakefiles/align_bwa.smk (1)

233-246: Add error handling for file operations.

While the rule correctly handles file operations, it could benefit from error handling to ensure data integrity.

Consider this improvement:

     shell:
         """
-        mv {input.bam} {output.bam}
-        samtools index {output.bam}
+        if mv {input.bam} {output.bam}; then
+            samtools index {output.bam} || {{ echo "Failed to index BAM file"; exit 1; }}
+        else
+            echo "Failed to move BAM file"
+            exit 1
+        fi
         """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9dd9349 and 31379c9.

📒 Files selected for processing (3)
  • harpy/align.py (13 hunks)
  • harpy/snakefiles/align_bwa.smk (6 hunks)
  • harpy/snakefiles/align_strobealign.smk (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • harpy/align.py
  • harpy/snakefiles/align_strobealign.smk
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (3)
harpy/snakefiles/align_bwa.smk (3)

21-21: LGTM! Configuration for optional barcode handling.

The addition of ignore_bx configuration aligns with the PR objectives to make barcode-related features optional.


120-120: LGTM! Proper implementation of conditional barcode handling.

The changes correctly implement optional barcode handling by:

  1. Conditionally changing output paths
  2. Adding barcode mode parameter
  3. Updating the samtools markdup command

Also applies to: 124-125, 137-137


124-125: Verify implementation of QX and RX demux tags.

The PR objectives mention making QX and RX demux tags optional with a new --qx-rx flag, but this implementation only shows BX tag handling.

Please confirm if the QX and RX demux tags are handled elsewhere or if they need to be added here. Let me help you search the codebase:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Search for QX and RX tag handling
rg -i "qx|rx" --type py

Length of output: 1540


Verify QX/RX flag implementation across workflow components.

The search confirms that while the align_bwa.smk file only handles the BX tag, the QX/RX demux tags are implemented in the demultiplexing code (e.g., in harpy/demultiplex.py and harpy/scripts/demultiplex_gen1.py), where the --qx-rx flag is defined and used. This is consistent with the design, as alignment-specific scripts need not manage these tags directly.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
harpy/snakefiles/qc.smk (1)

22-22: Use .get() method with a default value for optional parameter.

To prevent potential KeyError and maintain consistency with other optional parameters in the file, consider using the .get() method.

-ignore_bx    = config["ignore_bx"]
+ignore_bx    = config.get("ignore_bx", False)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31379c9 and 4c0be04.

📒 Files selected for processing (3)
  • harpy/align.py (12 hunks)
  • harpy/qc.py (5 hunks)
  • harpy/snakefiles/qc.smk (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • harpy/align.py
🔇 Additional comments (7)
harpy/snakefiles/qc.smk (1)

199-199: LGTM! Barcode report is now correctly conditional.

The condition properly combines both skip_reports and ignore_bx flags to determine whether to include the barcode report, aligning with the PR objectives of making barcode-related features optional.

harpy/qc.py (6)

13-13: LGTM!

The import is correctly placed and follows Python's import style guidelines.


24-24: LGTM!

The --ignore-bx option is correctly documented in the workflow controls section.


41-41: LGTM!

The --ignore-bx option is well-defined with clear help text and follows the CLI options style.


96-96: LGTM!

The ignore_bx configuration is correctly added to pass the flag to the workflow.


119-126: LGTM!

The workflow information presentation is well-structured and consistent with the standardization goals mentioned in the PR objectives.


46-46: Verify the handling of BX tags in dependent workflows.

The function signature correctly includes the new ignore_bx parameter. Since this makes BX tags optional, we should verify the impact on dependent workflows.

Let's check for any workflows that might be affected by the BX tag changes:

✅ Verification successful

🏁 Script executed:

#!/bin/bash
# Description: Search for workflows that handle BX tags
# Test: Look for BX-related code in workflow files
rg -i '\bbx\b' --type py

Length of output: 14452


Action: BX Tag Workflow Handling Verified
The search results confirm that multiple workflows acknowledge the --ignore-bx option and consistently handle BX tags—consistent with the changes made in the qc function signature. The dependent workflows in files like align.py, metassembly.py, and others demonstrate proper handling of linked-read BX tags. There’s no evidence to suggest that the new optional behavior has a negative impact on these workflows.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🔭 Outside diff range comments (1)
harpy/_cli_types_generic.py (1)

13-25: Improve validation logic and error handling.

The validation logic should handle the case where entries is None, allowing for an arbitrary number of integers.

Apply this diff to improve the validation logic:

     def convert(self, value, param, ctx):
         try:
             parts = [i.strip() for i in value.split(',')]
-            if len(parts) != self.entries:
+            if self.entries is not None and len(parts) != self.entries:
                 raise ValueError
             for i in parts:
                 try:
                     int(i)
                 except:
                     raise ValueError
             return [int(i) for i in parts]
         except ValueError:
-            self.fail(f"{value} is not a valid list of integers. The value should be {self.entries} integers separated by a comma.", param, ctx)
+            expected = f"{self.entries} integers" if self.entries is not None else "a list of integers"
+            self.fail(f"{value} is not valid. The value should be {expected} separated by a comma.", param, ctx)
🧰 Tools
🪛 Ruff (0.8.2)

21-21: Do not use bare except

(E722)


22-22: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🧹 Nitpick comments (1)
harpy/_cli_types_generic.py (1)

6-8: Update docstring to reflect actual behavior.

The current docstring states that the class accepts "an arbitrary number of integers", but the implementation enforces a specific number of entries when entries is provided.

Apply this diff to update the docstring:

 class IntList(click.ParamType):
-    """A class for a click type which accepts an arbitrary number of integers, separated by a comma."""
+    """A class for a click type which accepts a comma-separated list of integers.
+    
+    Args:
+        entries (int, optional): If provided, enforces the exact number of integers required.
+                               If None, accepts an arbitrary number of integers.
+    """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4c0be04 and 34942c0.

📒 Files selected for processing (3)
  • harpy/_cli_types_generic.py (2 hunks)
  • harpy/snakefiles/align_bwa.smk (6 hunks)
  • harpy/snakefiles/align_strobealign.smk (6 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/_cli_types_generic.py

11-11: Undefined name max_entries

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (8)
harpy/snakefiles/align_bwa.smk (4)

21-21: LGTM!

The ignore_bx configuration variable is properly added and used consistently throughout the workflow.


120-120: LGTM!

The mark_duplicates rule has been properly updated to handle both BX and non-BX modes:

  • Output path is conditionally modified based on ignore_bx
  • New bx_mode parameter correctly implements optional BX tag handling

Also applies to: 124-125


233-246: LGTM!

The new index_bam rule is well-implemented:

  • Correctly conditional on ignore_bx
  • Properly handles BAM indexing when BX tags are ignored
  • Uses appropriate container configuration

303-305: LGTM!

The workflow_summary rule updates are correct:

  • Reports collection is properly conditioned on both skip_reports and ignore_bx
  • The bx_mode parameter is correctly integrated

Also applies to: 309-309

harpy/snakefiles/align_strobealign.smk (4)

26-26: LGTM!

The ignore_bx configuration variable is consistently implemented across both alignment workflows.


111-111: LGTM!

The mark_duplicates rule changes maintain consistency with the BWA workflow:

  • Output path and bx_mode parameter are handled identically

Also applies to: 115-116


224-238: LGTM!

The index_bam rule implementation is consistent with the BWA workflow:

  • Same conditional logic
  • Identical BAM indexing handling

295-296: LGTM!

The workflow_summary rule changes maintain consistency with the BWA workflow:

  • Reports collection and bx_mode parameter are handled identically

Also applies to: 302-302

@pdimens pdimens merged commit ad259db into main Feb 5, 2025
20 checks passed
@pdimens pdimens deleted the python_demult branch February 5, 2025 20:28
@coderabbitai coderabbitai bot mentioned this pull request Feb 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant