
Fix blacklist BED file parsing in cooler balance CLI (Fixes #196) #462


Open
wants to merge 14 commits into
base: master
from

Conversation

ShigrafS

PR Description:

Fixes: #209
Original Issues: #196

Overview

This PR fixes issue #196 by improving how blacklist BED files are parsed in the cooler balance CLI. Previously, single-line BED files, files with metadata headers (track=), and empty blacklist files were handled incorrectly, which caused parsing errors and crashes.

The key fixes include:

  • Switching blacklist parsing from custom logic to bioframe.read_table.
  • Skipping track= headers in blacklist files.
  • Handling empty blacklist files to prevent np.concatenate errors.
  • Fixing an HDF5 writing scope error by correcting indentation.
  • Explicitly setting dtype="float64" for HDF5 weights to handle NaN values.
  • Adjusting test parameters to improve robustness.

Changes Made

Replaced custom blacklist parsing with bioframe.read_table

  • The previous method failed on single-line BED files because of assumptions made by csv.Sniffer.
  • bioframe.read_table now reads the blacklist as a BED file directly, without delimiter sniffing.

Added header skipping for track= metadata lines

  • If the first line starts with track=, it is skipped before parsing.
  • Prevents issues where the header was mistakenly interpreted as a chromosome entry.
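
To illustrate the header-skipping behaviour described above, here is a minimal standard-library sketch. It is not the code in this PR (which uses bioframe.read_table); the function name read_blacklist_bed is hypothetical, and treating lines that start with "track", "browser", or "#" as headers is an assumption based on common BED conventions.

```python
import io


def read_blacklist_bed(stream):
    """Parse BED-like text into (chrom, start, end) tuples.

    Hypothetical helper for illustration only. Blank lines, comments,
    and 'track'/'browser' header lines are skipped before parsing.
    """
    regions = []
    for line in stream:
        line = line.strip()
        # Skip empty lines and metadata headers so they are never
        # mistaken for a chromosome entry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 BED columns, got: {line!r}")
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


# A single-line file, a file with a track header, and an empty file
# all go through the same code path.
single_line = read_blacklist_bed(io.StringIO("chr1\t1000\t2000\n"))
with_header = read_blacklist_bed(
    io.StringIO("track name=blacklist\nchr1\t1000\t2000\nchr2\t0\t500\n")
)
empty = read_blacklist_bed(io.StringIO(""))
```

Because no delimiter sniffing is involved, the number of lines in the file does not affect how a record is split.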

Handled empty blacklist files ("") gracefully

  • Avoided np.concatenate errors by checking for empty results.
  • Ensured that blacklist filtering works even if no regions are provided.
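
The empty-file case can be sketched as an early return before any per-region results are combined. np.concatenate raises a ValueError when given an empty sequence, so the guard has to come first. The sketch below uses only the standard library; the function blacklisted_bins, the chrom_offsets mapping, and the assumption of fixed-width bins are hypothetical and are not taken from the PR.

```python
def blacklisted_bins(regions, chrom_offsets, binsize):
    """Return sorted, unique bin IDs overlapped by blacklist regions.

    Hypothetical helper for illustration only.
    regions:       list of (chrom, start, end) tuples, possibly empty
    chrom_offsets: maps chromosome name -> ID of its first bin (assumed)
    binsize:       fixed bin width in base pairs (assumed)
    """
    # Guard: with no regions there is nothing to combine, so return an
    # empty result instead of trying to concatenate zero pieces.
    if not regions:
        return []

    bad = set()
    for chrom, start, end in regions:
        # Ignore unknown chromosomes and zero-length intervals.
        if chrom not in chrom_offsets or end <= start:
            continue
        first = start // binsize
        last = (end - 1) // binsize  # BED end coordinate is exclusive
        offset = chrom_offsets[chrom]
        bad.update(range(offset + first, offset + last + 1))
    return sorted(bad)


offsets = {"chr1": 0, "chr2": 10}
no_bins = blacklisted_bins([], offsets, 1000)
some_bins = blacklisted_bins([("chr1", 1000, 2500), ("chr2", 0, 500)], offsets, 1000)
```

Returning an empty list keeps the downstream filtering step uniform: it masks nothing rather than failing.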

Fixed indentation issue in HDF5 weight storage

  • Ensured create_dataset operations remain within the with h5py.File(...) block.
  • Prevented KeyError when writing to HDF5 files.

Explicitly set dtype="float64" in HDF5 options

  • Prevented ValueError: cannot convert float NaN to integer.
  • Ensured robust handling of blacklist-masked bins.
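
The error message quoted above is the one Python itself raises when a NaN is cast to an integer. The following standard-library sketch is an analogy, not h5py code: it shows why a 64-bit float container can hold masked (NaN) weights while an integer container cannot. The function store_weights is hypothetical.

```python
import math
from array import array


def store_weights(weights, typecode):
    """Pack weights into a typed array (hypothetical helper).

    'd' is a 64-bit float; 'q' is a 64-bit signed integer.
    """
    if typecode == "d":
        return array("d", weights)
    # Integer storage needs an explicit cast, and int(nan) raises
    # ValueError, so masked bins cannot be represented.
    return array(typecode, (int(w) for w in weights))


weights = [0.5, float("nan"), 1.25]  # NaN marks a blacklist-masked bin

as_float = store_weights(weights, "d")

try:
    store_weights(weights, "q")
    int_error = None
except ValueError as exc:
    int_error = str(exc)
```

Declaring the dtype up front avoids relying on type inference from the data, which is the motivation given for dtype="float64" in this PR.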

Adjusted test parameters for better stability

  • Modified the test_balancing_with_blacklist parameters (tol=0.1, max-iters=1000, nproc=1).
  • Improved test reliability across different platforms.

Impact

  • Fixes parsing for single-line BED files, ensuring correctness.
  • Supports blacklist files with metadata headers (track=) without errors.
  • Prevents crashes when the blacklist file is empty.
  • Improves compatibility with bioframe for BED file handling.
  • Ensures proper HDF5 weight storage, avoiding scope-related errors.
  • Tests now pass reliably, improving maintainability.

Closes:

Fixes #196

ShigrafS and others added 14 commits February 26, 2025 14:18
Co-authored-by: Nezar Abdennur <[email protected]>
- Added validate_pairs_columns to check column indices against file content.

- Updated get_header to use readline() instead of peek() for test compatibility.

- Modified pairs to use validate_pairs_columns stream, added kwargs = {}, removed duplicate call.
- Created 7 pytest cases in test_cload.py to test valid/invalid indices, stdin, headers, empty files, and extra fields.

- Ensures validate_pairs_columns and pairs handle file and stdin inputs correctly.
@vedatonuryilmaz vedatonuryilmaz requested a review from nvictus March 17, 2025 18:24
Development

Successfully merging this pull request may close these issues.

csv sniffer fails, when given only a single line of BED-like input
2 participants