Skip to content
Merged

0.4.1 #362

Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/source/tutorials/cli_usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ The load command builds the raw AnnData object from your raw sequencing data. It
- adata.X contains binarized modification data (conversion/deaminase), or modification probabilitiesc (native).
- Adds basic read-level QC annotations (Read start, end, length, mean quality).
- Adds layers encoding read DNA sequences, base quality scores, base mismatches.
- Maintains BAM Tags/Flags in adata.obs.
- Maintains BAM tags/flags in adata.obs (UMI and barcode annotations loaded from Parquet sidecars).
- Writes the raw AnnData to the canonical output path and runs MultiQC.
- Optionally deletes intermediate BAMs, H5ADs, and TSVs.

Expand Down
29 changes: 20 additions & 9 deletions docs/source/tutorials/experiment_config.md
Original file line number Diff line number Diff line change
Expand Up @@ -57,18 +57,25 @@ Below are some of the most commonly edited fields and how they affect the CLI wo
- Lists are written in bracketed form, e.g. `[5mC]` or `[5mC_5hmC]`.
- If you update the CSV, re-run the CLI command pointing at the updated file.

## BAM tags
## Read annotations

smftools writes and/or propagates the following BAM tags when loading data. These are also loaded
into `adata.obs` when `load_adata` reads BAM tags.
smftools annotates reads during `load_adata` and stores the results in `adata.obs`. Standard BAM
tags (e.g. `NM`, `MD`, `MM`, `ML`) are read directly from BAM files. UMI and barcode annotations
are computed in parallel and written to Parquet sidecar files alongside the aligned BAM, then loaded
into `adata.obs` from those sidecars. The aligned BAM itself is not modified.

**UMI tags**
**UMI annotations** (written to `.umi_tags.parquet`)

- `U1`: UMI from the *top* flank (read start or read end depending on match).
- `U2`: UMI from the *bottom* flank.
- `U1`: Orientation-corrected UMI for the *left* reference end of the mapped fragment (forward reads: US, reverse reads: UE).
- `U2`: Orientation-corrected UMI for the *right* reference end of the mapped fragment (forward reads: UE, reverse reads: US).
- `US`: Positional UMI from read start (delimited `UMI_seq;slot;flank_seq`).
- `UE`: Positional UMI from read end (delimited `UMI_seq;slot;flank_seq`).
- `RX`: Combined UMI string (`U1-U2`, or `U1`/`U2` if only one is present).
- `FC`: Flank context of the U1/U2 pair (e.g. `top-bottom`).

**Barcode tags (smftools demux backend)**
When `threads` is set, UMI extraction is parallelized across multiple CPU cores.

**Barcode annotations (smftools demux backend)** (written to `.barcode_tags.parquet`)

- `BC`: Assigned barcode name, or `unclassified`.
- `BM`: Match type (`both`, `read_start_only`, `read_end_only`, `mismatch`, `unclassified`).
Expand All @@ -79,9 +86,13 @@ into `adata.obs` when `load_adata` reads BAM tags.
- `B5`: Barcode name matched at the read start (corresponds to `B1`/`B3`).
- `B6`: Barcode name matched at the read end (corresponds to `B2`/`B4`).

**Barcode tags (dorado demux backend)**
When `threads` is set, barcode extraction is parallelized across multiple CPU cores.
Demultiplexing (splitting reads into per-barcode BAMs) uses the sidecar `BC` assignments.
Only primary alignments are included in split BAMs and sidecar files.

**Barcode annotations (dorado demux backend)**

- `BC`: Assigned barcode name.
- `BC`: Assigned barcode name (read from BAM tag).
- `bi`: Dorado barcode info array (if present; expanded into columns during load).

Notes:
Expand Down
56 changes: 40 additions & 16 deletions src/smftools/cli/load_adata.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
from typing import Iterable, Union

import numpy as np
import pandas as pd

from smftools.constants import BARCODE_KIT_ALIASES, LOAD_DIR, LOGGING_DIR, UMI_KIT_ALIASES
from smftools.logging_utils import get_logger, setup_logging
Expand Down Expand Up @@ -516,6 +517,8 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
########################################################################################################################

################################### 4.5) Optional UMI annotation #############################################
umi_sidecar = None
barcode_sidecar = None
if getattr(cfg, "use_umi", False):
logger.info("Annotating UMIs in aligned and sorted BAM before demultiplexing")

Expand All @@ -538,20 +541,20 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
umi_kit_config = load_umi_config_from_yaml(umi_yaml_path)
resolved_umi = resolve_umi_config(umi_kit_config, cfg)

annotate_umi_tags_in_bam(
umi_sidecar = annotate_umi_tags_in_bam(
aligned_sorted_output,
use_umi=True,
umi_adapters=getattr(cfg, "umi_adapters", None),
umi_kit_config=umi_kit_config,
umi_length=getattr(cfg, "umi_length", None),
umi_search_window=getattr(cfg, "umi_search_window", 200),
umi_adapter_matcher=getattr(cfg, "umi_adapter_matcher", "exact"),
umi_adapter_matcher=getattr(cfg, "umi_adapter_matcher", "edlib"),
umi_adapter_max_edits=resolved_umi["umi_adapter_max_edits"],
samtools_backend=cfg.samtools_backend,
umi_kit_config=umi_kit_config,
umi_ends=resolved_umi["umi_ends"],
umi_flank_mode=resolved_umi["umi_flank_mode"],
umi_amplicon_max_edits=resolved_umi["umi_amplicon_max_edits"],
same_orientation=resolved_umi.get("same_orientation", False),
threads=cfg.threads,
)
########################################################################################################################

Expand Down Expand Up @@ -603,7 +606,7 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
resolved_bc = resolve_barcode_config(barcode_kit_config, cfg)

logger.info("Extracting and assigning barcodes to aligned BAM using smftools backend")
barcoded_bam = extract_and_assign_barcodes_in_bam(
barcode_sidecar = extract_and_assign_barcodes_in_bam(
aligned_sorted_output,
barcode_adapters=getattr(cfg, "barcode_adapters", [None, None]),
barcode_references=barcode_references,
Expand All @@ -619,10 +622,9 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
barcode_kit_config=barcode_kit_config,
barcode_ends=resolved_bc["barcode_ends"],
barcode_amplicon_gap_tolerance=resolved_bc["barcode_amplicon_gap_tolerance"],
threads=cfg.threads,
)
# Update aligned_sorted_output to point to the barcoded BAM
aligned_sorted_output = barcoded_bam
logger.info(f"smftools barcode extraction complete: {barcoded_bam}")
logger.info(f"smftools barcode extraction complete: {barcode_sidecar}")
########################################################################################################################

################################### 5) Demultiplexing ######################################################################
Expand All @@ -646,6 +648,7 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
cfg.split_path,
cfg.bam_suffix,
samtools_backend=cfg.samtools_backend,
barcode_sidecar=barcode_sidecar,
)

unclassified_bams = [p for p in all_bam_files if "unclassified" in p.name]
Expand Down Expand Up @@ -716,7 +719,7 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
# Annotate BM tag from bi per-end scores on each demuxed BAM
for bam in bam_files:
if "unclassified" not in bam.name:
annotate_demux_type_from_bi_tag(bam, threshold=0.0)
annotate_demux_type_from_bi_tag(bam)

se_bam_files = bam_files
bam_dir = cfg.split_path
Expand Down Expand Up @@ -956,14 +959,10 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
default_tags = ["NM", "MD", "fn"]
if cfg.smf_modality == "direct":
default_tags.extend(["MM", "ML"])
# Add UMI tags if UMI extraction was enabled
if getattr(cfg, "use_umi", False):
default_tags.extend(["U1", "U2", "RX"])
# Add barcode tags if smftools barcode extraction was used
if demux_backend == "smftools" and cfg.barcode_kit:
default_tags.extend(["BC", "BM", "B1", "B2", "B3", "B4", "B5", "B6"])
# UMI tags are loaded from Parquet sidecar below (not from BAM)
# Barcode tags are loaded from Parquet sidecar below (not from BAM)
# Add barcode tags from dorado single-pass demux (BM annotated from bi tag)
elif demux_backend == "dorado" and cfg.barcode_kit and not cfg.input_already_demuxed:
if demux_backend == "dorado" and cfg.barcode_kit and not cfg.input_already_demuxed:
dorado_ver = _get_dorado_version()
if dorado_ver is not None and dorado_ver >= (1, 3, 1):
default_tags.extend(["BC", "BM", "bi"])
Expand All @@ -980,6 +979,31 @@ def load_adata_core(cfg, paths: AdataPaths, config_path: str | None = None):
samtools_backend=cfg.samtools_backend,
)

# Load UMI tags from Parquet sidecar (written by annotate_umi_tags_in_bam)
if getattr(cfg, "use_umi", False) and umi_sidecar and Path(umi_sidecar).exists():
logger.info("Loading UMI tags from Parquet sidecar: %s", umi_sidecar)
umi_df = pd.read_parquet(umi_sidecar).set_index("read_name")
umi_df = umi_df.reindex(raw_adata.obs_names)
for col in ["U1", "U2", "RX", "FC", "US", "UE"]:
if col in umi_df.columns:
raw_adata.obs[col] = umi_df[col].values
del umi_df

# Load barcode tags from Parquet sidecar (written by extract_and_assign_barcodes_in_bam)
if (
demux_backend == "smftools"
and cfg.barcode_kit
and barcode_sidecar
and Path(barcode_sidecar).exists()
):
logger.info("Loading barcode tags from Parquet sidecar: %s", barcode_sidecar)
bc_df = pd.read_parquet(barcode_sidecar).set_index("read_name")
bc_df = bc_df.reindex(raw_adata.obs_names)
for col in ["BC", "BM", "B1", "B2", "B3", "B4", "B5", "B6"]:
if col in bc_df.columns:
raw_adata.obs[col] = bc_df[col].values
del bc_df

# Expand dorado bi array tag into individual float score columns
if "bi" in bam_tag_names:
expand_bi_tag_columns(raw_adata, bi_column="bi")
Expand Down
1 change: 0 additions & 1 deletion src/smftools/config/default.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -89,7 +89,6 @@ demux_backend: dorado # smftools|dorado - smftools uses YAML-based barcode refs,
barcode_both_ends: False # Require barcode match at both ends
trim: False # dorado adapter and barcode removal during demultiplexing
use_umi: False # Whether to detect and annotate UMIs in aligned_sorted BAM before demultiplexing
umi_adapters: [null, null] # Two-slot list [left_ref_end_adapter, right_ref_end_adapter]
umi_length: null # Length of each UMI (required when use_umi is true)
umi_search_window: 200 # Max distance from read ends to consider an adapter match for UMI extraction
umi_adapter_matcher: edlib # exact|edlib
Expand Down
Loading