Skip to content

Commit

Permalink
Merge pull request #64 from rajewsky-lab/barcode_preprocess_optimize
Browse files Browse the repository at this point in the history
Optimize flow cell spatial mapping
  • Loading branch information
danilexn authored Oct 3, 2024
2 parents 69b3363 + 3732d64 commit 18a48bb
Show file tree
Hide file tree
Showing 10 changed files with 659 additions and 86 deletions.
24 changes: 24 additions & 0 deletions docs/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,30 @@ options:
--unsorted (Optional) set when file is unsorted respect to tiles; might be slower
```

## `flowcell_map`
Convert basecalls from a whole flow cell into barcodes-to-spatial coordinates maps (tabular format) that can be used with spacemake
for spatial mapping of transcriptomic libraries.

Usage:
```text
openst flowcell_map [-h] --bcl-in BCL_IN --tiles-out TILES_OUT [--out-suffix OUT_SUFFIX] [--out-prefix OUT_PREFIX]
[--crop-seq CROP_SEQ] [--rev-comp] [--parallel-processes PARALLEL_PROCESSES]
options:
-h, --help show this help message and exit
--bcl-in BCL_IN Input directory containing BCL files
--tiles-out TILES_OUT
Output directory for tile coordinate files
--out-suffix OUT_SUFFIX
(Optional) Suffix for output files
--out-prefix OUT_PREFIX
(Optional) Prefix for output files
--crop-seq CROP_SEQ (Optional) Python slice format for cropping sequences
--rev-comp (Optional) Reverse complement the sequences
--parallel-processes PARALLEL_PROCESSES
(Optional) Number of parallel processes to use
```

## `image_stitch`
Stitch image fields of view into a single image. Currently, it only supports `--microscope keyence`, for the
default microscopy setup used in our paper.
Expand Down
117 changes: 68 additions & 49 deletions docs/computational/preprocessing_capture_area.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,14 @@ information:
|GGTCCCGTCCAAGAAGTAAATCGAA|9272|1016|
|...|...|...|

Where `cell_bc` is the 32 nucleotide-long spatial barcode, and `x_pos`/`y_pos` are 2D spatial coordinates
Where `cell_bc` is the 25 nucleotide-long spatial barcode, and `x_pos`/`y_pos` are 2D spatial coordinates
of a specific **tile** in the capture area (see below).

Before diving into the code, let's clarify some of the
[terms](#flow-cell-related-terms) that are specific to using Illumina flow cells as capture areas. We quote from
[Illumina's documentation](https://support-docs.illumina.com/IN/NextSeq_550-500/Content/IN/NextSeq/FlowCell_Tiles_fNS.htm)

### Flow cell-related terms
## Flow cell-related terms
!!! info "Tiles"
"Small imaging areas on the flow cell defined as the field of view by the
camera. The total number of tiles depends on the number of lanes, swaths, and surfaces that are imaged on
Expand All @@ -36,67 +36,86 @@ Before diving into the code, let's clarify some of the
!!! info "Swath"
"A column of tiles in a lane."

### Retrieve spatial barcodes coordinates for one tile
The `x_pos` and `y_pos` coordinates from the table above are given for each tile, separately. This information is
encoded in the `bcl` and `fastq` files. To obtain per-tile barcodes and coordinates, run the following code:
## Generating barcode-to-coordinate map

For each flow cell, we generate plain text files with three columns: `cell_bc`, `x_pos`, and `y_pos`.
These files are later used by `spacemake` to reconstruct the spatial coordinates from transcriptomic libraries.
This process is performed only once per barcoded flow cell.

!!! warning "Software dependencies"
Running `openst flowcell_map` below requires installing either `bcl2fastq` or `bclconvert`.
You can find instructions for [`bcl2fastq`](https://emea.support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html),
and [`bclconvert`](https://emea.support.illumina.com/sequencing/sequencing_software/bcl-convert.html).

Then, make sure they are added to the `PATH` environment variable.

For instance, in Linux:
```bash
export PATH=/path/to/bcl2fastq:$PATH
# or
# export PATH=/path/to/bclconvert:$PATH
```

Make sure you use a version of these softwares compatible with your sequencer.

Once dependencies have been installed, create the barcode-to-coordinate map for all tiles:

```sh
openst barcode_preprocessing \
--fastq-in <fastq_of_tile> \
--tilecoords-out <out_path> \
--out-suffix <out_suffix> \
--out-prefix <out_prefix> \
--crop-seq <len_int> \
--rev-comp \
--single-tile
openst flowcell_map \
--bcl-in /path/to/fc/bcl \
--tiles-out /path/to/fc_tiles \
--crop-seq 5:30 \ # for default Open-ST sequencing recipe
--rev-comp
```

Make sure to replace the placeholders:
`<fastq_of_tile>` to the `fastq` file of a specific tile; `<out_path>` where the table-like
files will be written; `<out_suffix>` and `<out_prefix>` are suffixes and prefixes that are added to the tile file names;
`<len_int>` from the `--crop-seq` argument is a string in the [Python slice](https://docs.python.org/3/tutorial/datastructures.html) format
(e.g., 2:32 will take nucleotides 2nd until 32th of the sequence in the `fastq` file); `--rev-comp` is provided whether the barcode sequences
must be written into the `csv` as their reverse-complementary; `--single-tile` argument is provided when the `fastq` file only contains data for
a single tile (**our recommendation**).
Make sure to specify the arguments:
- `--crop-seq`: Use a compatible Python slice (e.g., 5:30 will take 25 nucleotides, from the 6th to the 30th from the input reads)
- `--rev-comp`: After cropping the sequences, will compute and store the reverse complement of the barcode sequences

### Retrieve spatial barcodes coordinates for all tiles
Above you generated a single tile coordinate. To process all tiles from a flow cell (in parallel), you can run the
following snippets for Linux, assuming you have access to the basecalls folder.
This command will create as many barcode-to-coordinate compressed text files as there are tiles in the flow cell under the folder `/path/to/fc_tiles`

First create a `lanes_and_tiles.txt` file:
### Workflow Details
The `openst flowcell_map` command executes a multi-step workflow to process the barcode data:

```sh
cat RunInfo.xml | grep "<Tile>" | sed 's/ *<Tile>//' | sed 's/<\/Tile>//' | sed 's/^[ \t]*//;s/[ \t]*$//' > lanes_and_tiles.txt
```
1. **Tile Processing**: Each of the 3,744 tiles (for the S4 flow cell) is processed individually using `bcl2fastq` (or `bclconvert`).
2. **Barcode Preprocessing**: The `barcode_preprocessing` function from our openst tools is applied to each tile. This step:
- Trims barcodes according to the specified `--crop-seq` parameter
- Computes the reverse complement of the barcodes (if `--rev-comp` is specified)
- Adds spatial coordinates (`x_pos` and `y_pos`) to each barcode

3. **Individual Tile Deduplication**: Each processed tile file is deduplicated to remove duplicate barcodes within the same tile.
4. **Cross-tile Merging and Deduplication**: All deduplicated tile files are merged, and a second round of deduplication is performed to remove duplicate barcodes across different tiles.
5. **File Splitting**: The merged and deduplicated data is split back into individual files, one for each original tile. This step facilitates faster processing with `spacemake` in subsequent analyses. The final tile files are compressed to save storage space.

where `RunInfo.xml` is a file contained in the basecalls directory.
!!! info "Coordinate system of tiles"
The spatial coordinates acquired with `bcl2fastq` (or `bclconvert`) are in a tile-specific coordinate system. For samples spanning multiple tiles, mapping to a global coordinate system becomes necessary. This global mapping is typically done using the `puck_collection` functionality from `spacemake`.

Then, run demultiplexing and conversion to `fastq`
simultaneously to generating the barcode spatial coordinate file:
## (Optional) Retrieve spatial barcodes coordinates for one tile
It is also possible to obtain per-tile barcodes and coordinates:

```sh
cat lanes_and_tiles.txt | xargs -n 1 -P <parallel_processes> -I {} \
sh -c 'bcl2fastq -R <bcl_in> --no-lane-splitting \
-o <bcl_out>/"{}" --tiles s_"{}"; \
openst barcode_preprocessing \
--fastq-in <bcl_out>/{}/Undetermined_S0_R1_001.fastq.gz \
--tilecoords-out <out_path> \
--out-suffix .txt \
--out-prefix <out_prefix>"{}" \
--crop-seq <len_int> \
--rev-comp \
--single-tile'
openst barcode_preprocessing \
--fastq-in /path/to/tile.fastq \
--tilecoords-out /path/to/fc_tiles \
--out-prefix fc_1_ \
--crop-seq 5:30 \
--rev-comp \
--single-tile
```

Again, make sure to replace the placeholders: `<bcl_in>` and `<bcl_out>` are the directories where the
basecall files are contained and where the converted output `fastq` files will be saved; The rest of
arguments have the same meaning as above. If you generated full `fastq` yourself, you can adapt the command
above to remove the call to `bcl2fastq`.
!!! warning
These files do not undergo deduplication, therefore some barcodes might be repeated. Run `openst flowcell_map` to make sure there
are no duplicated barcodes across the flow cell.

Make sure to replace the placeholders.
`/path/to/tile.fastq` to the `fastq` file of a specific tile; `/path/to/fc_tiles` where the table-like
files will be written; `--out-prefix` (and `--out-suffix`) are prefixes and suffixes that are added to the tile file names;
`--crop-seq 5:30` is a [Python slice](https://docs.python.org/3/tutorial/datastructures.html)
(e.g., 5:30 will take nucleotides 6th until 30th of the sequence in the `fastq` file); `--rev-comp` is provided whether the barcode sequences
must be written into the `csv` as their reverse-complementary, **after cropping**; `--single-tile` argument is provided when the `fastq` file only contains data for
a single tile (**our recommendation**).

## Expected output

After running all the steps of this section, you will have a folder with many `*.txt.gz` files
containing the spatial coordinates of flow cell tiles (you will only need to generate this once per flow cell),
in the tab format described above.
After running all the steps of this section, you will have a folder `/path/to/fc_tiles` with `*.txt.gz` files
containing the spatial coordinates of flow cell tiles. **You only need to generate this once per flow cell.**
4 changes: 2 additions & 2 deletions docs/computational/preprocessing_openst_library.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,8 +42,8 @@ The above will add a new Open-ST project with `barcode_flavor`, `run_mode`, `puc

These are generated by the `openst` package as [previously described](preprocessing_capture_area.md).

All `puck_barcode_files` generated in the [previous step](preprocessing_capture_area.md) need to be specified after
`--puck_barcode_file`, e.g., with the wildcards `path_to_puck_files/*.txt.gz`.
All `puck_barcode_files` generated in the [previous step](preprocessing_capture_area.md) at the folder `/path/to/fc_tiles`
need to be specified after `--puck_barcode_file`, e.g., with the wildcards `/path/to/fc_tiles/*.txt.gz`.

To generate output files and reports only for the relevant tiles per sample, you can configure the variable
`spatial_barcode_min_matches` under `run_mode` (see [spacemake documentation](https://spacemake.readthedocs.io/en/latest/config.html#configure-run-modes)).
Expand Down
38 changes: 37 additions & 1 deletion openst/cli.py
Original file line number Diff line number Diff line change
Expand Up @@ -506,7 +506,7 @@ def get_image_stitch_parser():
"--microscope",
type=str,
help="microscope model or imaging strategy that was used for imaging",
choices=["keyence"],
choices=["keyence", "axio"],
required=True,
)

Expand Down Expand Up @@ -807,6 +807,41 @@ def cmd_run_barcode_preprocessing(args):

_run_barcode_preprocessing(args)

FLOWCELL_MAP_HELP = "Convert basecalls from a whole flow cell into a map of barcodes to spatial coordinates"
def get_flowcell_map_parser():
parser = argparse.ArgumentParser(
allow_abbrev=False,
add_help=False,
description=FLOWCELL_MAP_HELP,
)
parser.add_argument("--bcl-in", type=str, required=True, help="Input directory containing BCL files")
parser.add_argument("--tiles-out", required=True, help="Output directory for tile coordinate files")
parser.add_argument("--out-suffix", type=str, default=".txt.gz", help="Suffix for output files")
parser.add_argument("--out-prefix", type=str, default="fc_1_", help="Prefix for output files")
parser.add_argument("--crop-seq", type=str, default=":", help="Python slice format for cropping sequences")
parser.add_argument("--rev-comp", action="store_true", help="Reverse complement the sequences")
parser.add_argument("--parallel-processes", type=int, default=1, help="Number of parallel processes to use")

return parser


def setup_flowcell_map_parser(parent_parser):
parser = parent_parser.add_parser(
"flowcell_map",
help=FLOWCELL_MAP_HELP,
parents=[get_flowcell_map_parser()],
)
parser.set_defaults(func=cmd_run_flowcell_map)

return parser


def cmd_run_flowcell_map(args):
from openst.preprocessing.flowcell_map import _run_flowcell_map

_run_flowcell_map(args)


REPORT_HELP = "Generate HTML reports from metadata files (json)"
def get_report_parser():
parser = argparse.ArgumentParser(
Expand Down Expand Up @@ -1497,6 +1532,7 @@ def cmdline_args():

# TODO: do this iteratively
setup_barcode_preprocessing_parser(parent_parser_subparsers)
setup_flowcell_map_parser(parent_parser_subparsers)

setup_image_stitch_parser(parent_parser_subparsers)
setup_image_preprocess_parser(parent_parser_subparsers)
Expand Down
52 changes: 23 additions & 29 deletions openst/preprocessing/barcode_preprocessing.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,10 @@
import pandas as pd

import logging

import dnaio

tab = str.maketrans("ACTG", "TGAC")


def reverse_complement_table(seq):
return seq.translate(tab)[::-1]

Expand All @@ -23,35 +22,30 @@ def process_multiple_tiles(
in_fastq: str, out_path: str, out_prefix: str, out_suffix: str, sequence_preprocessor: callable = None
):
current_tile_number = None
sequences, xcoords, ycoords = [[], [], []]
idx = 0
with gzip.open(in_fastq, "rt") as f:
n_written = 0
with dnaio.open(in_fastq) as f:
for line in f:
if idx % 4 == 0:
tile_number, xcoord, ycoord = get_tile_number_and_coordinates(line.strip())
if current_tile_number is None:
current_tile_number = tile_number

if tile_number != current_tile_number:
logging.info(f"Writing {len(sequences):,} barcodes of file {current_tile_number} to disk")
df = pd.DataFrame({"cell_bc": sequences, "x_pos": xcoords, "y_pos": ycoords})
df.to_csv(
os.path.join(
out_path,
out_prefix + current_tile_number + out_suffix,
),
index=False,
sep="\t",
)
current_tile_number = tile_number
sequences, xcoords, ycoords = [[], [], []]
elif idx % 4 == 1:
sequence = sequence_preprocessor(line) if sequence_preprocessor is not None else line
sequences.append(sequence)
xcoords.append(xcoord)
ycoords.append(ycoord)
tile_number, xcoord, ycoord = get_tile_number_and_coordinates(line.name)
if current_tile_number != tile_number:

idx += 1
if current_tile_number is not None:
logging.info(f"Written {n_written} spatial barcodes to {current_tile_number}")
tile_file.close()

current_tile_number = tile_number

tile_fname = os.path.join(
out_path,
out_prefix + current_tile_number + out_suffix,
)

tile_file = open(tile_fname, "w")
tile_file.write("\t".join(["cell_bc", "xcoord", "ycoord"])+"\n")
n_written = 0

sequence = sequence_preprocessor(line.sequence) if sequence_preprocessor is not None else line.sequence
tile_file.write("\t".join([sequence, xcoord, ycoord])+"\n")
n_written += 1


def process_single_tile(in_fastq: str, sequence_preprocessor: callable = None) -> pd.DataFrame:
Expand Down
Loading

0 comments on commit 18a48bb

Please sign in to comment.