03-pre_processing.Rmd

# Pre-processing 

CopyKit's pre-processing module workhorse is `runVarbin()` 

`runVarbin()` is a wrapper for a series of functions that perform three main processes: 

1. Binning and read counting from the `.bam` files
2. Applying a variance stabilization transformation to the bin counts
3. Piece-wise segmentating stabilized bin counts

We load CopyKit with the `library()` function.
```{r}
library(copykit)
```

**NOTE:** If the BiocParallel framework was registered, CopyKit functions will, whenever possible, run in parallel. 
For more information check the [parallelization](#parallelization) section.

## From Single Cell BAM files

### runVarbin()

The runVarbin() function takes in the file path of the folder where your *marked duplicate BAM files are located*. 

Each cell must be its own BAM file. For Chromium single cell CNA users (10X Genomics), this means splitting the possorted.bam file by cell barcode into individual BAM files. 

If files are originated from *paired-end* sequencing, please ensure that the argument "is_paired_end" is set to TRUE during `runVarbin().

Once run, `runVarbin()` uses the variable binning method to count the number of reads in each genomic bin. - [Learn More!](#varbin-lm).

A [snakemake](https://snakemake.readthedocs.io/en/stable/) pipeline is provided [here](https://github.com/navinlabcode/copykit/tree/master/snakemake_pipelines) to convert FASTQ files to marked BAM files, but it is not a required step. If you already have marked BAM files from another source, you can skip this step. Note that the reads must be aligned to the same genome assembly as used in CopyKit.

The input for `runVarbin()` is the path of the folder containing your *marked duplicate `.bam` files*. An optional second argument `remove_Y` provides a convenient shortcut to exclude chromosome Y from the dataset.

```
To learn about all arguments that runVarbin() accepts use ?runVarbin
```

```{r run_varbin, eval=FALSE}
tumor <- runVarbin("~/path/to/bam/files/", 
                   remove_Y = TRUE)
``` 
 
```{r run_varbin_hidden, eval=TRUE, echo=FALSE}
tumor <- runVarbin("/mnt/lab/users/dminussi/projects/copy_number_pipeline_hg38/test_tumor_data/PMTC6/marked/", 
                   remove_Y = TRUE)
```

The `runVarbin()` function uses the hg38 genome assembly and a 220kb bin resolution by default, but these can be adjusted using the `"genome"` or `"resolution"` arguments. If your BAM files come from paired-end sequencing, be sure to set the `"is_paired_end"` argument to TRUE. (See the function documentation for more information: ?runVarbin)

```{r run_varbin_pe, eval=FALSE}
tumor <- runVarbin("~/path/to/bam/files/", 
                   remove_Y = TRUE,
                   is_paired_end = TRUE)
```

The resulting object is the CopyKit object.
```{r copykit_object}
tumor
```

CopyKit objects inherit from the [SingleCellExperiment](https://bioconductor.org/packages/devel/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) class.

Each column corresponds to a single cell and each row corresponds to a bin. The bincounts assay, which can be accessed using the `bincounts()` function, stores the read counts for each bin.
```{r}
head(bincounts(tumor)[,1:5])
```

To address overdispersion, CopyKit performs a variance stabilization transformation (VST) of the count matrix using the Freeman-Tukey transformation. Alternatively, log transformation is also available (see the function documentation for more information ?runVst())

The resulting transformation is stored within the *ft* assay, which can be accessed with `assay()`:
```{r}
head(assay(tumor, 'ft')[,1:5])
```

We use segmentation to divide the genome-ordered bin counts into piecewise constant segments, and the means of these segments can be used to infer copy number states across the genomic regions.

The runVarbin() function allows for the selection of different segmentation methods using the `method` argument, which can be set to either `CBS` or `multipcf`. By default, the function uses the Circular Binary Segmentation (CBS) - [Learn More!](#cbs-lm) - method from the [DNAcopy](https://bioconductor.org/packages/release/bioc/html/DNAcopy.html) package. An additional argument, `alpha`, controls the significance levels required to accept change-points. 

The second segmentation option is the Multi-sample Piecewise Constant Fit (multipcf) segmentation from the [copynumber](https://www.bioconductor.org/packages/release/bioc/html/copynumber.html) package, which performs a joint segmentation of the samples and identifies common breakpoints across all samples.

The resulting segmentation is stored within the CopyKit object into two different assays: *ratios* and *segment_ratios*. [Learn More!](#infercn-lm) which can be accessed with the  functions `ratios()` and `segment_ratios()`
```{r r_sr_accessors}
head(ratios(tumor)[,1:5])
head(segment_ratios(tumor)[,1:5])
```

---

## runVarbin() modules

The `runVarbin()` module includes the following functions, which are detailed below.
**It is important to note that if you have already started from runVarbin(), you do not need to run these functions again.**
The details of these functions is to enable you to run the modules under different conditions without requiring you to re-run all `runVarbin()` module.

#### runCountReads()

The `runCountReads()` function counts the reads into bins, and performs GC correction of the counts - [Learn More!](#gccor-lm). 

The genome argument defines the genome assembly ("hg38", "hg19") and the resolution argument specifies the size of the variable bins ("55kb", "110kb", "195kb", "220kb", "280kb", "500kb", "1Mb", "2.8Mb").
The `remove_Y` argument provides a convenient shortcut to exclude the chromosome Y from the dataset.

#### runVst()

The `runVst()` function performs variance stabilization transformation on the bin counts, using either the freeman-tukey ('ft') or 'log' transformation.

#### runSegmentation()

The `runSegmentation()` function runs either the 'CBS' or 'multipcf' segmentation, and then merges consecutive segments that don't reach the significance threshold [Learn More!](#mergelevels-lm).
The argument _alpha_ controls the significance threshold for `CBS` segmentation, whil th _gamma_ argument controls the penalty for the `multipcf` segmentation.


#### logNorm()

`logNorm()` performs a log transformation of the _segment_ratios_ assay and stores into the _logr_ assay. The _logr_ assay is used downstream by functions such as `runUmap()`. 

---

## From external count data or user-defined scaffolds

To use CopyKit's downstream functions with processed datasets, we can create a `CopyKit` object meeting the following requirements:

1. Either a bin count matrix where columns represent cells and rows represent bins (to be stored in the `bincounts` assay) or a segment mean ratios matrix where columns represent cells and rows represent segment mean values for each bin. An integer matrix of copy number calls can also be used at this step (to be stored in the `segment_ratios` assay). 

2. A [GenomicRanges](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) object with length equal to the number of bins from the cell assay matrix. As long as the length requirement is respected this allows user-defined scaffolds to be used within CopyKit.

To construct the CopyKit object, the following mock code is presented as an example. 
If providing bincounts:
```{r, eval=FALSE}
obj <- CopyKit(list(bincounts = cell_bincount_matrix),
               rowRanges = genomic_ranges_scaffold)
```

If providing bincounts, the functions `runVst()`, `runSegmentation()`, and `logNorm()` can be used to continue with the analysis. To obtain variance stabilized counts, and segment ratio means.

If providing segment mean ratios:
```{r, eval=FALSE}
obj <- CopyKit(list(segment_ratios = cell_bincount_matrix),
               rowRanges = genomic_ranges_scaffold)
```
It is useful to call `logNorm()` and set the _logr_ slot for downstream functions.
The resulting object can then be passed on to the Quality Control and Analysis modules of CopyKit.

Furthermore, it is useful for downstream functions to add the genome assembly used on the dataset.
```{r, eval=FALSE}
metadata(obj)$genome <- "hg38"
```
**NOTE:** If no bincount matrix is provided, functions that require a matrix of bincounts, such as `runMetrics()` can't be used with this object.