navinlabcode
diff --git a/‎05-knnsmoothing.Rmd
+14 b/‎05-knnsmoothing.Rmd
+14
diff --git a/‎06-analysis_visualization.Rmd
+365 b/‎06-analysis_visualization.Rmd
+365
@@ -0,0 +1,14 @@
+# kNN smoothing
+
+scDNA-Seq experiments are often sparsely sequenced and present inherent noise from low-coverage datasets. To mitigate this, CopyKit is capable of smoothing profiles based on the k-nearest neighbors of each single cell.
+
+This way, CopyKit aggregates the genomic bin counts based on a cell k-nearest neighbors, followed by re-segmentation of copy number profiles.
+
+In order to ensure optimal performane it is essential to carefully consider the number of neighbors (k) used in the smoothing process. We recommend to use conservative k values, which are below the number of cells that compose the smallest subclone, and to visually inspect and compare smoothed single cells to the original profiles
+
+```{r knn_smoothing}
+tumor <- knnSmooth(tumor)
+```
+
+This step is recommended after the filtering process. Though we have noticed that smoothing can also rescue cells with low-depth quality. If done so, additional inspection is recommended due to the possibility of increased noise.
+
@@ -0,0 +1,365 @@
+# Analysis and Visualization module
+
+The analysis and visualization module from CopyKit work in synergy to help you analyze your single cell copy number results.
+
+## plotMetrics()
+
+The `plotMetrics()` function can plot any information store in `colData()`, which is passedto the `metric` argument. If a `label` argument is supplied, the points will be colored based on that information.
+```{r}
+plotMetrics(tumor, metric = c("overdispersion", 
+                              "breakpoint_count",
+                              "reads_total",
+                              "reads_duplicates",
+                              "reads_assigned_bins",
+                              "percentage_duplicates"),
+            label = "percentage_duplicates")
+```
+
+## plotRatio()
+
+Visualizing the seegmeentation to eensure it follows the ratios as expected is crucial to a copy number analysis.
+This helps to verify the accuracy of the segmentation an assess the quality of the data visually.
+
+The `plotRatio()` function is a useful tool for this, offering two different modes. When the input is the CopyKit object, an interactive app will open, allowing you to choose which cell to visualize from the drop-down option menu.
+```{r plot_ratio_app, eval = FALSE}
+plotRatio(tumor)
+```
+<center>
+![Ratio Plot App](images/plot_ratio_shiny.png)
+</center>
+
+Providing a sample name to the `plotRatio()` function with the argument `sample_name`, will display the plot only for the selected cell.
+
+```{r plot_ratio_sample, warning = FALSE}
+plotRatio(tumor, "PMTC6LiverC117AL4L5S1_S885_L003_R1_001")
+```
+
+## runUmap()
+
+The `runUmap()` function generates a [UMAP](https://umap-learn.readthedocs.io/en/latest/) embedding, which is stored in the `reducedDim` slot. 
+
+`runUmap()` is an important pre-processing step for the `findClusters()` feature.
+
+```{r run_umap}
+tumor <- runUmap(tumor)
+```
+
+You can pass additional arguments to control UMAP parameters using the `'...'` argument. The full list of additional arguments that can be passed on to `uwot::umap`  with the '...' argument can be found in the [uwot manual](https://cran.r-project.org/web/packages/uwot/uwot.pdf) and information on their influence on clustering can be seen in the [UMAP webpage](https://umap-learn.readthedocs.io/en/latest/clustering.html)
+
+## Clustering
+
+The `findClusters()` function performs clustering using the UMAP embedding generated by `runUmap()`. For detecting subclones, CopyKit uses
+different clustering algorithms such as:
+
+1) [hdbscan](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) (recommended)
+
+2) [Leiden](https://www.nature.com/articles/s41598-019-41695-z)
+
+3) [Louvain](https://arxiv.org/abs/0803.0476)
+
+The hdbscan method is recommended and has been previously successfully applied in the work from Laks *et al.* and Minussi *et al.*.
+
+### findSugestedK()
+
+The `findSuggestedK()` is a helper function that provides guidance to choose the `k` parameter for clustering algorithms. The function 
+`findSuggestedK` bootstraps clustering over a range of k values, and returns the value that maximizes the jaccard similarity, with *median* as the default metric (argument `metric`). 
+
+While `findSuggestedK` does not guarantee optimal clustering. It  provides a guide to select k values.
+
+```{r find_suggested_k}
+tumor <- findSuggestedK(tumor)
+```
+
+The `plotSuggestedK()` function can be used to inspect the results of `findSuggestedK()`. It can plot a boxplot, heatmap (tile), or scatterplot
+
+```{r plot_suggested_k_boxplot}
+plotSuggestedK(tumor)
+```
+
+If the argument `geom` is set to *tile*, `plotSuggestedK()` plots a heatmap where each row is a detected subclone, each column is a k assessed during the grid search and the color represents the jaccard similarity for a given clone.
+
+The suggested value can be accessed from the metadata:
+```{r find_sug_meta}
+S4Vectors::metadata(tumor)$suggestedK
+```
+
+### findClusters()
+
+To run `findClusters()`, pass the CopyKit object as its input. If `findSuggestedK()` was run beforehand, `findClusters()` will automatically use the stored value from `findSuggestedK()` as the argument for k_subclones, unless otherwise specified.
+
+
+```{r findCL_findSuggested}
+tumor <- findClusters(tumor)
+```
+
+If hdbscan is used for clustering, singleetons [outliers](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) may bee idenetifieed and added to subgroup `c0`.  Copykit will notify you of any cells classified as c0 after running findClusters(). To remove outliers, subset the CopyKit object with:
+
+```{r filter_outliers}
+tumor <- tumor[,colData(tumor)$subclones != 'c0']
+```
+
+### plotUmap()
+
+`plotUmap()` can be used to visualize the reduced dimensional embedding. 
+`plotUmap` can be colored by any element of the colData with the argument 'label'.
+
+```{r}
+plotUmap(tumor)
+```
+
+Cells can be colored with the argument 'label' by any element of `colData()`:
+
+```{r plot_umap_subclones}
+plotUmap(tumor, label = 'subclones')
+```
+
+## runPhylo()
+
+To run a phylogenetic analysis of cells' copy number profiles, use the function `runPhylo()`. 
+Available methods are [Neighbor Joining](https://pubmed.ncbi.nlm.nih.gov/3447015/) and [Balanced Minimum evolution](https://pubmed.ncbi.nlm.nih.gov/8412650/).
+
+The resulting tree is stored within the CopyKit object in the phylo slot:
+```{r run_phylo}
+tumor <- runPhylo(tumor, metric = 'manhattan')
+```
+
+### plotPhylo()
+
+To visualize the resulting phylogenetic trees from `runPhylo()` we can use the `plotPhylo()` function.
+
+```{r plot_phylo, fig.height=8}
+plotPhylo(tumor)
+```
+
+`plotPhylo()` can use any element of the colData to color the leaves of the tree.
+
+```{r plot_phylo_label, fig.height=8}
+plotPhylo(tumor, label = 'subclones')
+```
+
+## calcInteger()
+
+The true underlying copy number states for a given region of the genome are given in integer states. 
+We can use the segment ratio means to make an inference of the copy number state for a given genomic region.
+Keep in mind that those are inferences and, as such, are subjected to errors of different sources, like erroneous inference or rounding errors.
+
+Segment ratio means can be scaled to integer values with `calcInteger()`. 
+
+CopyKit supports different methods of calculating integer copy number profiles.
+
+To calculate computational ploidies, a method is in development and should be available soon within copykit.
+
+By setting the argument `method` to *fixed*, a fixed value of ploidy (generally determined using Flow Cytometry) will scale all cells.
+
+```{r calc_integer_fixed}
+tumor <- calcInteger(tumor, method = 'fixed', ploidy_value = 4.3)
+```
+
+The integer values are stored in the slot `integer` that can be accessed with the function `assay()`.
+
+If the assay _integer_ exists, `plotRatio()` will use it to plot the integer copy number values as a secondary axis.
+```{r plot_ratio_integer, warning=FALSE}
+plotRatio(tumor, "PMTC6LiverC117AL4L5S1_S885_L003_R1_001")
+```
+
+## plotHeatmap()
+
+To visualize copy number profiles with a heatmap we can use `plotHeatmap()`. 
+
+The heatmap can be annotated with elements of colData.
+
+To order subclones, one option is to calculate a consensus phylogeny, explained in later sections:
+```{r}
+tumor <- calcConsensus(tumor)
+tumor <- runConsensusPhylo(tumor)
+```
+
+To plot the heatmap.
+```{r plot_heatmap_subclones, fig.height=8}
+plotHeatmap(tumor, label = 'subclones', order_cells = 'consensus_tree')
+```
+
+Genes can be annotated to a heatmap in `plotHeatmap()` with the _genes_ argument.
+The genes argument is a vector of gene HUGO symbols:
+```{r plot_heatmap_subclones_genes, fig.height=8}
+plotHeatmap(tumor, label = 'subclones', genes = c("TP53", "BRAF", "MYC"), order_cells = 'consensus_tree')
+```
+
+To plot integer copy number heatmaps pass the argument `assay = 'integer'`, importantly the integer matrix must be in the `'integer'`slot.
+```{r plot_ht_int, fig.height=8}
+plotHeatmap(tumor, assay = 'integer', order_cells = 'consensus_tree')
+```
+
+New information can be added to `colData()` and used in conjunction with the plotting functions. 
+
+The example dataset has macro-spatial information. 
+The information is encoded in the sample name by the letter L followed by a number. 
+
+We can extract that information and add it as an extra column to the metadata:
+
+```{r add_spatial_info}
+colData(tumor)$spatial_info <- stringr::str_extract(colData(tumor)$sample, "L[0-9]")
+```
+
+Once the information has been added, we can use it to color the umap by their spatial information:
+
+```{r umap_spatial}
+plotUmap(tumor, label = 'spatial_info')
+```
+
+It is also possible to annotate the heatmap with that information:
+The 'label' argument for `plotHeatmap()` can add as many annotations as specified by the user as long as they are elements in `colData()` of the CopyKit object.
+```{r heat_spatial, fig.height=8}
+plotHeatmap(tumor, label = c("spatial_info", "subclones"), order_cells = 'consensus_tree')
+```
+
+## plotFreq()
+
+Computing the frequencies of genomic gain or losses across the genome can be useful to visualize differences between groups. This can be done with `plotFreq()`. 
+For every region of the genome, `plotFreq()` will calculate the frequency of gain or losses according to a threshold across all samples.
+The thresholds are controlled with the arguments `low_threshold` (values below will be counted as genomic losses) and `high_threshold` (values above will be counted as genomic gains).
+
+Ideally thresholds will be set according to the ploidy of the sample. 
+
+The argument `assay` can be provided to pass on the `integer` assay instead of the `segment_ratios` assay (adjust thresholds accordingly).
+
+Two geoms are available `area` or `line`.
+```{r plot_freq_sample}
+plotFreq(tumor)
+```
+
+Group information can be set to any categorical element of the `colData()` and is provided with the argument `group`.
+
+```{r plot_freq_label, fig.height=6}
+plotFreq(tumor,
+         group = 'subclones')
+```
+
+## calcConsensus()
+
+Consensus sequences can help visualize the different segments across subclones. To calculate consensus matrices we can use `calcConsensus()`.
+```{r calc_consensus}
+tumor <- calcConsensus(tumor)
+```
+
+`plotHeatmap()` can plot a consensus heatmap:
+```{r plot_heat_consensus}
+plotHeatmap(tumor, consensus = TRUE, label = 'subclones')
+```
+
+`plotHeatmap()` can annotate the consensus heatmap with information from the metadata as long as label is the same as the information used to build the consensus matrix:
+```{r plot_heat_freq_bar}
+plotHeatmap(tumor, consensus = TRUE, label = 'subclones', group = 'spatial_info')
+```
+
+By default `calcConsensus()` uses the subclones information to calculate a consensus for each subclone.
+Any element of the `colData()` can be used to calculate the consensus.
+
+**Note:** Consensus matrices can be calculated from the integer assay. Importantly, the integer matrix must be in the `assay(tumor, 'integer')` slot. Check `calcInteger()` for more info.
+```{r eval = FALSE}
+tumor <- calcConsensus(tumor, consensus_by = 'subclones', assay = 'integer')
+```
+
+## plotConsensusLine()
+
+To compare the differences among subclones, `plotConsensusLine()` opens an interactive app where the consensus sequences are plotted as lines.
+
+```{r calc_consensus_again, echo = FALSE}
+tumor <- calcConsensus(tumor)
+```
+
+```{r plot_cons_line, eval=FALSE}
+plotConsensusLine(tumor)
+```
+
+<center>
+![Plot Consensus Line](images/plot_consensus_shiny.png)
+</center>
+
+## plotGeneCopy()
+
+To check copy number states across of genes we can use `plotGeneCopy()`. Two different geoms: “swarm” (default) or “violin” can be applied. 
+
+As with other plotting functions, points can be colored with the argument 'label'.
+
+```{r}
+plotGeneCopy(tumor, genes = c("CDKN2A",
+                              "FGFR1",
+                              "TP53",
+                              "PTEN",
+                              "MYC",
+                              "CDKN1A",
+                              "MDM2",
+                              "AURKA",
+                              "PIK3CA",
+                              "CCND1",
+                              "KRAS"),
+                      label = 'spatial_info')
+```
+
+A positional dodge can be added to facilitate the visualization across groups:
+```{r}
+plotGeneCopy(tumor, 
+             genes = c("CDKN2A",
+                              "FGFR1",
+                              "TP53",
+                              "PTEN",
+                              "MYC",
+                              "CDKN1A",
+                              "MDM2",
+                              "AURKA",
+                              "PIK3CA",
+                              "CCND1",
+                              "KRAS"),
+             label = 'subclones',
+             dodge.width = 0.8)
+```
+
+
+A barplot geom is also provided to visualize the integer data as a frequency barplot for each gene:
+```{r plot_gene_copy_integer}
+plotGeneCopy(tumor, genes = c("CDKN2A",
+                              "FGFR1",
+                              "TP53",
+                              "PTEN",
+                              "MYC",
+                              "CDKN1A",
+                              "MDM2",
+                              "AURKA",
+                              "PIK3CA",
+                              "CCND1",
+                              "KRAS"),
+             geom = 'barplot',
+             assay = 'integer')
+```
+
+## plotAlluvial()
+
+To visualize frequencies across elements of the metadata we can use `plotAlluvial()` 
+```{r plot_alluvial}
+plotAlluvial(tumor, label = c("subclones", "spatial_info"))
+```
+
+## runPca()
+
+Principal Component Analysis can be run with `runPca()`
+```{r run_pca}
+tumor <- runPca(tumor)
+```
+
+The results are stored in the `reducedDim(tumor, 'PCA')` slot.
+
+Principal Components can be used as an alternative pre-processing step to the `findClusters()` function by changing `findClusters()` _embedding_ argument to 'PCA'. The number of principal components to be used for clustering can also be set within `findClusters() with the argument ncomponents. At this time better results are achieved by using the default UMAP embedding and it is the recommend pre-processing step. However PCA remains as a viable alternative
+
+### plotScree
+To determine the number of useful principal components a scree plot can be shown with:
+```{r plot_scree}
+plotScree(tumor)
+```
+
+### plotPCA
+To visualize the first 2 principal components you can use `plotPca()`.
+```{r}
+plotPca(tumor, label = 'spatial_info')
+```
+