-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
standardize scripts to write to stdout #183
Conversation
📝 WalkthroughWalkthroughThe pull request introduces changes to multiple files in the harpy project, focusing on modifying how barcode statistics are generated and processed. The primary modifications include updating the Changes
Sequence DiagramsequenceDiagram
participant Input as Input BAM File
participant BXStats as bx_stats.py
participant Output as stdout/Output File
Input->>BXStats: Provide alignment data
BXStats->>BXStats: Process barcode statistics
BXStats->>Output: Redirect processed data
sequenceDiagram
participant Input as Input BAM File
participant Deconvolve as deconvolve_alignments.py
participant Output as stdout/Output BAM
Input->>Deconvolve: Provide alignment records
Deconvolve->>Deconvolve: Check BX and MI tags
Deconvolve->>Deconvolve: Assign/Update Molecular Identifiers
Deconvolve->>Output: Write processed alignments
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
harpy/bin/bx_stats.py
(3 hunks)harpy/bin/deconvolve_alignments.py
(1 hunks)harpy/snakefiles/align_bwa.smk
(1 hunks)harpy/snakefiles/align_ema.smk
(1 hunks)harpy/snakefiles/align_strobealign.smk
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Build/Cache Container
🔇 Additional comments (6)
harpy/bin/deconvolve_alignments.py (2)
123-128
: Verify the regex pattern for detecting invalid barcodes.The regex pattern
"[ABCD]0{2,4}"
used to identify invalid barcodes might not cover all intended cases or could potentially flag valid barcodes incorrectly. Ensure that the pattern accurately matches only invalid barcodes with sequences like 'A00', 'B000', etc.Consider testing the regex with various barcode examples to confirm its accuracy.
189-192
: Review the condition for updatinglastpos
.The current condition:
if record.is_reverse or (record.is_forward and not record.is_paired):may not cover all scenarios, especially with paired-end reads and supplementary alignments. Verify whether this logic correctly updates
d[bx]["lastpos"]
for all read orientations and pairing statuses to ensure accurate tracking of molecule positions.✅ Verification successful
The condition for updating
lastpos
is correctly implemented.The current logic appropriately handles all read scenarios:
- Reverse reads mark the end of a molecule segment
- Unpaired forward reads represent complete molecules
- Paired forward reads are correctly skipped as they await their mate
- Supplementary alignments are handled separately in the pipeline
🏁 Scripts executed
The following scripts were executed for the analysis:
Error: No shell scripts found
Script:
#!/bin/bash # Find the complete function implementation ast-grep --pattern 'def $FUNC($$$) { $$$ lastpos $$$ }' # Search for lastpos updates and related context rg "lastpos" -A 5 -B 5 # Look for supplementary alignment handling rg "supplementary|is_supplementary" -A 3 -B 3Length of output: 10211
harpy/bin/bx_stats.py (1)
50-55
: Ensure proper handling of output redirection and gzip compression.The code correctly writes the output to
sys.stdout.buffer
using gzip compression. Confirm that this approach integrates smoothly with downstream processes that consume the output, and that they are capable of reading gzipped data from standard output without issues.harpy/snakefiles/align_bwa.smk (1)
178-178
: Update the Snakemake rule to reflect the new output handling.The shell command in the
barcode_stats
rule has been correctly updated to:"bx_stats.py {input.bam} > {output}"
This change aligns with the modification in
bx_stats.py
to write output to standard output, ensuring consistency across the workflow.harpy/snakefiles/align_strobealign.smk (1)
169-169
: LGTM! Command successfully standardized to use stdout.The change aligns with the PR objective of standardizing scripts to write to stdout.
harpy/snakefiles/align_ema.smk (1)
274-274
: LGTM! Command successfully standardized and maintains consistency.The change aligns with the PR objective and maintains consistency with the modification in align_strobealign.smk.
dist = pos_start - d[bx]["lastpos"] | ||
# if the distance between alignments is > cutoff, it's a different molecule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle potential negative distance calculations.
The calculation dist = pos_start - d[bx]["lastpos"]
may result in negative distances if pos_start
is less than d[bx]["lastpos"]
, which can occur due to overlaps or sequencing errors. To ensure accurate molecule identification, consider using the absolute value of the distance or adjusting the logic to handle negative distances appropriately.
Apply this diff to modify the distance calculation:
- dist = pos_start - d[bx]["lastpos"]
+ dist = abs(pos_start - d[bx]["lastpos"])
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
dist = pos_start - d[bx]["lastpos"] | |
# if the distance between alignments is > cutoff, it's a different molecule | |
dist = abs(pos_start - d[bx]["lastpos"]) | |
# if the distance between alignments is > cutoff, it's a different molecule |
if not os.path.exists(args.input): | ||
parser.error(f"{args.input} was not found") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a check to confirm the input BAM file is coordinate-sorted.
While the usage instructions state that the input file must be coordinate-sorted, there is no explicit check in the code to enforce this requirement. Consider adding a validation step to ensure the input BAM file is coordinate-sorted before processing to prevent potential errors.
Apply this diff to add a check for coordinate sorting:
if not os.path.exists(args.input):
parser.error(f"{args.input} was not found")
+with pysam.AlignmentFile(args.input, "rb") as test_bam:
+ if not test_bam.header.get('HD', {}).get('SO') == 'coordinate':
+ parser.error(f"{args.input} is not coordinate-sorted")
def write_validbx(bam, alnrecord, bx_tag, mol_id):
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if not os.path.exists(args.input): | |
parser.error(f"{args.input} was not found") | |
if not os.path.exists(args.input): | |
parser.error(f"{args.input} was not found") | |
with pysam.AlignmentFile(args.input, "rb") as test_bam: | |
if not test_bam.header.get('HD', {}).get('SO') == 'coordinate': | |
parser.error(f"{args.input} is not coordinate-sorted") | |
Summary by CodeRabbit
New Features
deconvolve_alignments.py
for processing BX-tagged barcodes and assigning Molecular Identifier (MI) tags.Changes
bx_stats.py
to redirect output to stdout instead of using an explicit output file.align_bwa.smk
,align_ema.smk
,align_strobealign.smk
) to use new output redirection method forbx_stats.py
.Improvements
bx_stats.py
.bx_stats.py
.