-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathThymoma_Geomx_Nanostring.Rmd
1708 lines (1336 loc) · 78.9 KB
/
Thymoma_Geomx_Nanostring.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Analyzing GeoMx-NGS Data with GeomxTools"
shorttitle: "GeomxTools Workflow"
author:
- name: Jason Reeves
affiliation:
- NanoString Technologies, Inc
- name: Prajan Divaker
affiliation: NanoString Technologies, Inc
- name: Nicole Ortogero
affiliation: NanoString Technologies, Inc
- name: Maddy Griswold
affiliation: NanoString Technologies, Inc
- name: Zhi Yang
affiliation: NanoString Technologies, Inc
- name: Stephanie Zimmerman
affiliation: NanoString Technologies, Inc
- name: Rona Vitancol
affiliation: NanoString Technologies, Inc
- name: David Henderson
affiliation: NanoString Technologies, Inc
package: GeomxTools
abstract: >
Here we walk through an end-to-end Geomx-NGS gene expression analysis workflow. We start with raw gene expression count files. Using a combination of Nanostring-developed (GeoMxTools & NanoStringNCTools) and open source R packages, we evaluate samples and expression targets and prepare gene-level count data for downstream analysis. To understand our spatial data, we perform unsupervising clustering, dimension reduction, and differential gene expression analyses and visually explore the results.
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
BiocStyle::html_document:
toc: true
toc_float: true
self_contained: false
fig_width: 5
fig_height: 4.5
mainfont: Arial
fontfamily: Arial
vignette: >
%\VignetteIndexEntry{Analyzing GeoMx-NGS Data with GeomxTools}
%\VignetteEncoding{UTF-8}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: inline
---
```{r style, echo = FALSE, results = "asis"}
BiocStyle::markdown()
```
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 5,
fig.height = 4.5,
dpi = 200
)
```
<p>
**R version**: `r R.version.string`
<br>
**Bioconductor version**: `r BiocManager::version()`
<br>
**Package**: `r packageVersion("GeomxTools")`
<\p>
# Introduction
In this vignette, we will introduce a data analysis workflow for GeoMx-NGS mRNA expression data.
The GeoMx Digital Spatial Profiler (DSP) is a platform for capturing spatially resolved high-plex gene (or protein) expression data from tissue [Merritt et al., 2020](https://pubmed.ncbi.nlm.nih.gov/32393914/).In particular, formalin-fixed paraffin-embedded (FFPE) or fresh-frozen (FF) tissue sections are stained with barcoded in-situ hybridization probes that bind to endogenous mRNA transcripts. The user then selects regions of the interest (ROI) to profile; if desired, each ROI segment can be further sub-divided into areas of illumination (AOI) based on tissue morphology. The GeoMx then photo-cleaves and collects expression barcodes for each AOI segment separately for downstream sequencing and data processing.
The final results are spatially resolved unique expression datasets for every protein-coding gene (>18,000 genes) from every individual segments profiled from tissue.
## Motivation & Scope
The motivation for this vignette is to enable a scientist to work with GeoMx-NGS gene expression data and understand a standard data analysis workflow.
Our specific objectives:
* Load GeoMx raw count files and metadata (DCC, PKC, and annotation file)
* Perform quality control (QC), filtering, and normalization to prepare the data
* Perform downstream visualizations and statistical analyses including:
+ Dimension reduction with UMAP or t-SNE
+ Heatmaps and other visualizations of gene expression
+ Differential expression analyses with linear mixed effect models
# Getting started
Let's install and load the GeoMx packages we need:
```{r InstallGeomxTools, echo = TRUE, eval = FALSE}
install.packages("devtools")
devtools::install_github("Nanostring-Biostats/NanoStringNCTools", force = TRUE)
devtools::install_github("Nanostring-Biostats/GeomxTools", ref = "dev", force = TRUE)
```
```{r libs, message = FALSE, warning = FALSE, eval = TRUE}
library(NanoStringNCTools)
library(GeomxTools)
```
## Loading Data
In this vignette, we will analyze a GeoMx kidney dataset created with the human whole transcriptome atlas (WTA) assay. The dataset includes 4 diabetic kidney disease (DKD) and 3 healthy kidney tissue samples. Regions of interest (ROI) were spatially profiled to focus on two different kidney structures: tubules or glomeruli. One glomerular ROI contains the entirety of a single glomerulus. Each tubular ROI contains multiple tubules that were segmented into distal (PanCK+) and proximal (PanCK-) tubule areas of illumination (AOI).
Download and the unzip the kidney data set found [on the NanoString Website](http://nanostring-public-share.s3-website-us-west-2.amazonaws.com/GeoScriptHub/Kidney_Dataset_for_GeomxTools.zip)
The key data files are:
* DCCs files - expression count data and sequencing quality metadata
* PKCs file(s) - probe assay metadata describing the gene targets present in the data
* Annotation file - useful tissue information, including the type of segment profiled (ex: glomerulus vs. tubule), segment area/nuclei count, and other tissue characteristics (ex: diseased vs. healthy)
We first locate the downloaded files:
```{r quickstart, message = FALSE, warning = FALSE}
# Reference the main folder 'file.path' containing the sub-folders with each
# data file type:
datadir <- file.path("")
# automatically list files in each directory for use
DCCFiles <- dir(file.path(datadir, "dccs"), pattern = ".dcc$",
full.names = TRUE, recursive = TRUE)
PKCFiles <- dir(file.path(datadir, "pkcs"), pattern = ".pkc$",
full.names = TRUE, recursive = TRUE)
SampleAnnotationFile <-
dir(file.path(datadir, "annotation"), pattern = ".xlsx$",
full.names = TRUE, recursive = TRUE)
```
We then load the data to create a data object using the `readNanoStringGeoMxSet` function.
``` {r loadData, message = FALSE, warning = FALSE}
# load data
demoData <-
readNanoStringGeoMxSet(dccFiles = DCCFiles,
pkcFiles = PKCFiles,
phenoDataFile = SampleAnnotationFile,
phenoDataSheet = "Template",
phenoDataDccColName = "Sample_ID",
protocolDataColNames = c("aoi", "roi", "scan name"),
experimentDataColNames = c("panel"))
```
All of the expression, annotation, and probe information are now linked and stored together into a single data object.
For more details on this object's structure and accessors, please refer to the "GeoMxSet Object Overview" section at the end of this vignette.
# Study Design
## Modules Used
First let's access the PKC files, to ensure that the expected PKCs have been loaded for this study. For the demo data we are using the file `Hsa_WTA_1.0.pkc`.
``` {r modules}
library(knitr)
pkcs <- annotation(demoData)
modules <- gsub(".pkc", "", pkcs)
kable(data.frame(PKCs = pkcs, modules = modules))
```
## Sample Overview
Now that we have loaded the data, we can visually summarize the experimental design for our dataset to look at the different types of samples and ROI/AOI segments that have been profiled. We present this information in a Sankey diagram.
``` {r sampleFlow, fig.width = 10, fig.height = 8, fig.wide = TRUE, message = FALSE, warning = FALSE}
library(dplyr)
library(ggforce)
# select the annotations we want to show, use `` to surround column names with
# spaces or special symbols
count_mat <- count(pData(demoData), `slide name`, Phenotype, `CellType`)
# simplify the slide names
#count_mat$`slide name` <- gsub("disease", "d",
# gsub("normal", "n", count_mat$`slide name`))
# gather the data and plot in order: class, slide name, region, segment
test_gr <- gather_set_data(count_mat, 1:3)
test_gr$x <- factor(test_gr$x,
levels = c("slide name", "Phenotype", "CellType"))
# plot Sankey
ggplot(test_gr, aes(x, id = id, split = y, value = n)) +
geom_parallel_sets(aes(fill = `CellType`), alpha = 0.5, axis.width = 0.1) +
geom_parallel_sets_axes(axis.width = 0.2) +
geom_parallel_sets_labels(color = "white", size = 5) +
theme_classic(base_size = 17) +
theme(legend.position = "bottom",
axis.ticks.y = element_blank(),
axis.line = element_blank(),
axis.text.y = element_blank()) +
scale_y_continuous(expand = expansion(0)) +
scale_x_discrete(expand = expansion(0)) +
labs(x = "", y = "") +
annotate(geom = "segment", x = 4.25, xend = 4.25, y = 20, yend = 120, lwd = 2) +
annotate(geom = "text", x = 4.19, y = 70, angle = 90, size = 5,
hjust = 0.5, label = "100 segments")
```
# QC & Pre-processing
![](./GeomxQCWorkflow.png)
The steps above encompass the standard pre-processing workflow for GeoMx data. In short, they represent the selection of ROI/AOI segments and genes based on quality control (QC) or limit of quantification (LOQ) metrics and data normalization.
Before we begin, we will shift any expression counts with a value of 0 to 1 to enable in downstream transformations.
``` {r shiftCounts, eval = TRUE}
# Shift counts to one
demoData <- shiftCountsOne(demoData, useDALogic = TRUE)
```
## Segment QC
We first assess sequencing quality and adequate tissue sampling for every ROI/AOI segment.
Every ROI/AOI segment will be tested for:
* Raw sequencing reads: segments with >1000 raw reads are removed.
* % Aligned,% Trimmed, or % Stitched sequencing reads: segments below ~80% for one or more of these QC parameters are removed.
* % Sequencing saturation ([1-deduplicated reads/aligned reads]%): segments below ~50% require additional sequencing to capture full sample diversity and are not typically analyzed until improved.
* Negative Count: this is the geometric mean of the several unique negative probes in the GeoMx panel that do not target mRNA and establish the background count level per segment; segments with low negative counts (<5-10) are not necessarily removed but may be studied closer for low endogenous gene signal and/or insufficient tissue sampling.
* No Template Control (NTC) count: values >1,000 could indicate contamination for the segments associated with this NTC; however, in cases where the NTC count is between 1,000- 10,000, the segments may be used if the NTC data is uniformly low (e.g. 0-2 counts for all probes).
* Nuclei: >100 nuclei per segment is generally recommended; however, this cutoff is highly study/tissue dependent and may need to be reduced; what is most important is consistency in the nuclei distribution for segments within the study.
* Area: generally correlates with nuclei; a strict cutoff is not generally applied based on area.
### Select Segment QC
First, we select the QC parameter cutoffs, against which our ROI/AOI segments will be tested and flagged appropriately. We have selected the appropriate study-specific parameters for this study. Note: the default QC values recommended above are advised when surveying a new dataset for the first time.
```{r setqcflagupdated, eval = TRUE}
# Default QC cutoffs are commented in () adjacent to the respective parameters
# study-specific values were selected after visualizing the QC results in more
# detail below
QC_params <-
list(minSegmentReads = 1000, # Minimum number of reads (1000)
percentTrimmed = 65, # Minimum % of reads trimmed (80%)
percentStitched = 65, # Minimum % of reads stitched (80%)
percentAligned = 65, # Minimum % of reads aligned to known targets (80%)
percentSaturation = 50, # Minimum sequencing saturation (50%)
minNegativeCount = 1, # Minimum negative control counts (10)
maxNTCCount = 9000, # Maximum counts observed in NTC well (1000)
minNuclei = 20, # Minimum # of cells observed in a segment (100)
minArea = 1000) # Minimum segment area (5000)
demoData <-
setSegmentQCFlags(demoData,
qcCutoffs = QC_params)
# Collate QC Results
QCResults <- protocolData(demoData)[["QCFlags"]]
flag_columns <- colnames(QCResults)
QC_Summary <- data.frame(Pass = colSums(!QCResults[, flag_columns]),
Warning = colSums(QCResults[, flag_columns]))
QCResults$QCStatus <- apply(QCResults, 1L, function(x) {
ifelse(sum(x) == 0L, "PASS", "WARNING")
})
QC_Summary["TOTAL FLAGS", ] <-
c(sum(QCResults[, "QCStatus"] == "PASS"),
sum(QCResults[, "QCStatus"] == "WARNING"))
```
### Visualize Segment QC
Before excluding any low-performing ROI/AOI segments, we visualize the distributions of the data for the different QC parameters. Note that the "Select Segment QC" and "Visualize Segment QC" sections are performed in parallel to fully understand low-performing segments for a given study. Iteration may follow to select the study-specific QC cutoffs.
For QC visualization, we write a quick function to draw histograms of our data.
``` {r qcflagHistogramsCode, eval = TRUE, warning = FALSE, message = FALSE}
library(ggplot2)
col_by <- "segment"
# Graphical summaries of QC statistics plot function
QC_histogram <- function(assay_data = NULL,
annotation = NULL,
fill_by = NULL,
thr = NULL,
xlims = NULL) {
if(is.null(xlims)) {
xlims <- range(assay_data[[annotation]])
}
plt <- ggplot(assay_data,
aes_string(x = paste0("unlist(`", annotation, "`)"),
fill = fill_by)) +
geom_histogram(bins = 50) +
geom_vline(xintercept = thr, lty = "dashed", color = "black") +
theme_bw() + guides(fill = "none") +
facet_wrap(as.formula(paste("~", fill_by)), nrow = 4) +
xlim(xlims) +
labs(x = annotation, y = "Segments, #", title = annotation)
plt
}
```
Now we explore each of the QC metrics for the segments.
``` {r plotQCHist, warning = FALSE, message = FALSE}
QC_histogram(sData(demoData), "Trimmed (%)", col_by, 65, c(0,101))
QC_histogram(sData(demoData), "Stitched (%)", col_by, 65, c(0,101))
QC_histogram(sData(demoData), "Aligned (%)", col_by, 65, c(0,101))
QC_histogram(sData(demoData), "Saturated (%)", col_by, 50, c(0,101)) +
labs(title = "Sequencing Saturation (%)",
x = "Sequencing Saturation (%)")
QC_histogram(sData(demoData), "area", col_by, 1000) +
scale_x_continuous(trans = "log10")
QC_histogram(sData(demoData), "nuclei", col_by, 20)
# calculate the negative geometric means for each module
negativeGeoMeans <-
esBy(negativeControlSubset(demoData),
GROUP = "Module",
FUN = function(x) {
assayDataApply(x, MARGIN = 2, FUN = ngeoMean, elt = "exprs")
})
protocolData(demoData)[["NegGeoMean"]] <- negativeGeoMeans
# explicitly copy the Negative geoMeans from sData to pData
negCols <- paste0("NegGeoMean_", modules)
pData(demoData)[, negCols] <- sData(demoData)[["NegGeoMean"]]
for(ann in negCols) {
plt <- QC_histogram(pData(demoData), ann, col_by, 2) +
scale_x_continuous(trans = "log10")
print(plt)
}
# detatch neg_geomean columns ahead of aggregateCounts call
pData(demoData) <- pData(demoData)[, !colnames(pData(demoData)) %in% negCols]
# show all NTC values, Freq = # of Segments with a given NTC count:
kable(table(NTC_Count = sData(demoData)$NTC),
col.names = c("NTC Count", "# of Segments"))
```
Finally we plot all of the QC Summary information in a table.
```{r QCSummaryTable, results = "asis"}
kable(QC_Summary, caption = "QC Summary Table for each Segment")
```
### Remove flagged segments
As the final step in Segment QC, we remove flagged segments that do not meet our QC cutoffs.
```{r removeQCSampleProbe, eval = TRUE}
demoData <- demoData[, QCResults$QCStatus == "PASS"]
# Subsetting our dataset has removed samples which did not pass QC
dim(demoData)
```
## Probe QC
Before we summarize our data into gene-level count data, we will remove low-performing probes. In short, this QC is an outlier removal process, whereby probes are either removed entirely from the study (global) or from specific segments (local). The QC applies to gene targets for which there are multiple distinct probes representing the count for a gene per segment. In WTA data, only one probe exists per target gene; thus, Probe QC does not apply to the endogenous genes in the panel. Rather, it is performed on the negative control probes; there are multiple probes representing our negative controls, which do not target any sequence in the genome. These probes enable calculation of the background per segment and will be important for determining gene detection downstream.
After Probe QC, there will always remain at least one probe representing every gene target. In other words, Probe QC never removes genes from your data.
### Set Probe QC Flags
A probe is removed globally from the dataset if either of the following is true:
* the geometric mean of that probe's counts from all segments divided by the geometric mean of all probe counts representing the target from all segments is less than 0.1
* the probe is an outlier according to the Grubb's test in at least 20% of the segments
A probe is removed locally (from a given segment) if the probe is an outlier according to the Grubb's test in that segment.
We do not typically adjust these QC parameters.
```{r setbioprobeqcflag, eval = TRUE}
#saved_demoData <- demoData
#demoData <- saved_demoData
# Generally keep the qcCutoffs parameters unchanged. Set removeLocalOutliers to
# FALSE if you do not want to remove local outliers
demoData <- setBioProbeQCFlags(demoData,
qcCutoffs = list(minProbeRatio = 0.1,
percentFailGrubbs = 20),
removeLocalOutliers = TRUE)
ProbeQCResults <- fData(demoData)[["QCFlags"]]
# Define QC table for Probe QC
qc_df <- data.frame(Passed = sum(rowSums(ProbeQCResults[, -1]) == 0),
Global = sum(ProbeQCResults$GlobalGrubbsOutlier),
Local = sum(rowSums(ProbeQCResults[, -2:-1]) > 0
& !ProbeQCResults$GlobalGrubbsOutlier))
qc_df
```
We report the number of global and local outlier probes.
```{r bioprobeQCTable, echo = FALSE, results = "asis"}
kable(qc_df, caption = "Probes flagged or passed as outliers")
```
### Exclude Outlier Probes
```{r excludeOutlierProbes}
#Subset object to exclude all that did not pass Ratio & Global testing
ProbeQCPassed <-
subset(demoData,
fData(demoData)[["QCFlags"]][,c("LowProbeRatio")] == FALSE &
fData(demoData)[["QCFlags"]][,c("GlobalGrubbsOutlier")] == FALSE)
dim(ProbeQCPassed)
demoData <- ProbeQCPassed
```
## Create Gene-level Count Data
With our Probe QC steps complete, we will generate a gene-level count matrix. The count for any gene with multiple probes per segment is calculated as the geometric mean of those probes.
```{r aggregateCounts, eval = TRUE}
# Check how many unique targets the object has
length(unique(featureData(demoData)[["TargetName"]]))
# collapse to targets
demoData[fData(demoData)[["TargetName"]] %in% c("0610009B22Rik", "Trgc4"),]
target_demoData <- aggregateCounts(demoData)
dim(target_demoData)
exprs(target_demoData)[is.na(exprs(target_demoData))] <- 1
exprs(target_demoData)[1:5, 1:2]
```
## Limit of Quantification
In addition to Segment and Probe QC, we also determine the limit of quantification (LOQ) per segment. The LOQ is calculated based on the distribution of negative control probes and is intended to approximate the quantifiable limit of gene expression per segment. Please note that this process is more stable in larger segments. Likewise, the LOQ may not be as accurately reflective of true signal detection rates in segments with low negative probes counts (ex: <2). The formula for calculating the LOQ in a $i^{th}$ segment is:
$$LOQ_{i} = geomean(NegProbe_{i}) * geoSD(NegProbe_{i})^{n}$$
We typically use 2 geometric standard deviations ($n = 2$) above the geometric mean as the LOQ, which is reasonable for most studies. We also recommend that a minimum LOQ of 2 be used if the LOQ calculated in a segment is below this threshold.
``` {r calculateLOQ, eval = TRUE}
# Define LOQ SD threshold and minimum value
cutoff <- 2
minLOQ <- 2
# Calculate LOQ per module tested
LOQ <- data.frame(row.names = colnames(target_demoData))
for(module in modules) {
vars <- paste0(c("NegGeoMean_", "NegGeoSD_"),
module)
if(all(vars[1:2] %in% colnames(pData(target_demoData)))) {
LOQ[, module] <-
pmax(minLOQ,
pData(target_demoData)[, vars[1]] *
pData(target_demoData)[, vars[2]] ^ cutoff)
}
}
pData(target_demoData)$LOQ <- LOQ
```
## Filtering
After determining the limit of quantification (LOQ) per segment, we recommend filtering out either segments and/or genes with abnormally low signal. Filtering is an important step to focus on the true biological data of interest.
We determine the number of genes detected in each segment across the dataset.
```{r LOQMat, eval = TRUE}
LOQ_Mat <- c()
for(module in modules) {
ind <- fData(target_demoData)$Module == module
Mat_i <- t(esApply(target_demoData[ind, ], MARGIN = 1,
FUN = function(x) {
x > LOQ[, module]
}))
LOQ_Mat <- rbind(LOQ_Mat, Mat_i)
}
# ensure ordering since this is stored outside of the geomxSet
LOQ_Mat <- LOQ_Mat[fData(target_demoData)$TargetName, ]
```
### Segment Gene Detection
We first filter out segments with exceptionally low signal. These segments will have a small fraction of panel genes detected above the LOQ relative to the other segments in the study. Let's visualize the distribution of segments with respect to their % genes detected:
```{r segDetectionBarplot}
# Save detection rate information to pheno data
pData(target_demoData)$GenesDetected <-
colSums(LOQ_Mat, na.rm = TRUE)
pData(target_demoData)$GeneDetectionRate <-
pData(target_demoData)$GenesDetected / nrow(target_demoData)
# Determine detection thresholds: 1%, 5%, 10%, 15%, >15%
pData(target_demoData)$DetectionThreshold <-
cut(pData(target_demoData)$GeneDetectionRate,
breaks = c(0, 0.01, 0.05, 0.1, 0.15, 1),
labels = c("<1%", "1-5%", "5-10%", "10-15%", ">15%"))
# stacked barplot of different cutpoints (1%, 5%, 10%, 15%)
ggplot(pData(target_demoData),
aes(x = DetectionThreshold)) +
geom_bar(aes(fill = segment)) +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5, size = 8) +
theme_bw() + theme(text = element_text(size = 30))
scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +
labs(x = "Gene Detection Rate",
y = "Segments, #",
fill = "Segment Type")
```
We can also create a table to review what kidney tissue type (DKD vs normal) is going to be impacted by each threshold:
```{r segTable}
# cut percent genes detected at 1, 5, 10, 15
# personal note: segments are areas within an Rregion of Interest (ROI) which are defined by combinations of the presence and absence of specific morphology markers
kable(table(pData(target_demoData)$DetectionThreshold,
pData(target_demoData)$segment))
```
In this example, we choose to remove segments with less than 10% of the genes detected. Generally, 5-10% detection is a reasonable segment filtering threshold. However, based on the experimental design (e.g. segment types, size, nuclei) and tissue characteristics (e.g. type, age), these guidelines may require adjustment.
```{r filterSegments}
# save target_demoData in case I mess up downstream steps
clean_target_demoData <- target_demoData
target_demoData <-
target_demoData[, pData(target_demoData)$GeneDetectionRate >= .15]
dim(target_demoData)
```
Let's re-plot the Sankey diagram showing our current working dataset. This is now a dataset that no longer contains segments flagged by Segment QC or that have low gene detection rates.
``` {r replotSankey, fig.width = 10, fig.height = 8, fig.wide = TRUE, message = FALSE, warning = FALSE}
# select the annotations we want to show, use `` to surround column names with
# spaces or special symbols
count_mat <- count(pData(demoData), `slide name`, Phenotype, segment)
# simplify the slide names
#count_mat$`slide name` <-
# gsub("disease", "d",
# gsub("normal", "n", count_mat$`slide name`))
# gather the data and plot in order: class, slide name, region, segment
test_gr <- gather_set_data(count_mat, 1:3)
test_gr$x <-
factor(test_gr$x,
levels = c("slide name", "Phenotype", "segment"))
# plot Sankey
ggplot(test_gr, aes(x, id = id, split = y, value = n)) +
geom_parallel_sets(aes(fill = segment), alpha = 0.5, axis.width = 0.1) +
geom_parallel_sets_axes(axis.width = 0.2) +
geom_parallel_sets_labels(color = "white", size = 5) +
theme_classic(base_size = 17) +
theme(legend.position = "bottom",
axis.ticks.y = element_blank(),
axis.line = element_blank(),
axis.text.y = element_blank()) +
scale_y_continuous(expand = expansion(0)) +
scale_x_discrete(expand = expansion(0)) +
labs(x = "", y = "") +
annotate(geom = "segment", x = 4.25, xend = 4.25, y = 20, yend = 120, lwd = 2) +
annotate(geom = "text", x = 4.19, y = 70, angle = 90, size = 5,
hjust = 0.5, label = "100 segments")
```
### Gene Detection Rate
Next, we determine the detection rate for genes across the study. To illustrate this idea, we create a small gene list (`goi`) to review.
``` {r goi detection}
library(scales) # for precent
# Calculate detection rate:
LOQ_Mat <- LOQ_Mat[, colnames(target_demoData)]
fData(target_demoData)$DetectedSegments <- rowSums(LOQ_Mat, na.rm = TRUE)
fData(target_demoData)$DetectionRate <-
fData(target_demoData)$DetectedSegments / nrow(pData(target_demoData))
# Gene of interest detection table
# personal note: Cd45 - Ptprc, Tp63 - Trp63
goi <- c("Ptprc", "Cd3e","Cd3g","Cd3d", "Cd4", "Cd8a", "Cdh1", "Epcam",
"Krt5", "Krt8", "Trp63")
goi_df <- data.frame(
Gene = goi,
Number = fData(target_demoData)[goi, "DetectedSegments"],
DetectionRate = percent(fData(target_demoData)[goi, "DetectionRate"]))
goi_df
```
```{r tableGOI, echo = FALSE, results = "asis"}
kable(goi_df, caption = "Detection rate for Genes of Interest", align = "c",
col.names = c("Gene", "Detection, # Segments", "Detection Rate, % of Segments"))
```
We can see that individual genes are detected to varying degrees in the segments, which leads us to the next QC we will perform across the dataset.
### Gene Filtering
We will graph the total number of genes detected in different percentages of segments. Based on the visualization below, we can better understand global gene detection in our study and select how many low detected genes to filter out of the dataset. Gene filtering increases performance of downstream statistical tests and improves interpretation of true biological signal.
```{r plotDetectionRate, eval = TRUE}
# Plot detection rate:
plot_detect <- data.frame(Freq = c(1, 5, 10, 20, 30, 50))
plot_detect$Number <-
unlist(lapply(c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5),
function(x) {sum(fData(target_demoData)$DetectionRate >= x)}))
plot_detect$Rate <- plot_detect$Number / nrow(fData(target_demoData))
rownames(plot_detect) <- plot_detect$Freq
ggplot(plot_detect, aes(x = as.factor(Freq), y = Rate, fill = Rate)) +
geom_bar(stat = "identity") +
geom_text(aes(label = formatC(Number, format = "d", big.mark = ",")),
vjust = 1.6, color = "black", size = 4) +
scale_fill_gradient2(low = "orange2", mid = "lightblue",
high = "dodgerblue3", midpoint = 0.65,
limits = c(0,1),
labels = scales::percent) +
theme_bw() +
scale_y_continuous(labels = scales::percent, limits = c(0,1),
expand = expansion(mult = c(0, 0))) +
labs(x = "% of Segments",
y = "Genes Detected, % of Panel > LOQ")
```
We typically set a % Segment cutoff ranging from 5-20% based on the biological diversity of our dataset. For this study, we will select 10% as our cutoff. In other words, we will focus on the genes detected in at least 10% of our segments; we filter out the remainder of the targets.
Note: if we know that a key gene is represented in only a small number of segments (<10%) due to biological diversity, we may select a different cutoff or keep the target gene by manually selecting them for inclusion in the data object.
``` {r subsetGenes, eval = TRUE}
# Subset to target genes detected in at least 10% of the samples.
# Also manually include the negative control probe, for downstream use
negativeProbefData <- subset(fData(target_demoData), CodeClass == "Negative")
neg_probes <- unique(negativeProbefData$TargetName)
target_demoData <-
target_demoData[fData(target_demoData)$DetectionRate >= 0.3 |
fData(target_demoData)$TargetName %in% neg_probes, ]
dim(target_demoData)
# retain only detected genes of interest
goi <- goi[goi %in% rownames(target_demoData@assayData$exprs)]
```
# Normalization
We will now normalize the GeoMx data for downstream visualizations and differential expression. The two common methods for normalization of DSP-NGS RNA data are i) quartile 3 (Q3) or ii) background normalization.
Both of these normalization methods estimate a normalization factor per segment to bring the segment data distributions together. More advanced methods for normalization and modeling are under active development. However, for most studies, these methods are sufficient for understanding differences between biological classes of segments and samples.
Q3 normalization is typically the preferred normalization strategy for most DSP-NGS RNA studies. Given the low negative probe counts in this particular dataset as shown during Segment QC, we would further avoid background normalization as it may be less stable.
Before normalization, we will explore the relationship between the upper quartile (Q3) of the counts in each segment with the geometric mean of the negative control probes in the data. Ideally, there should be a separation between these two values to ensure we have stable measure of Q3 signal. If you do not see sufficient separation between these values, you may consider more aggressive filtering of low signal segments/genes.
``` {r, previewNF, fig.width = 8, fig.height = 8, fig.wide = TRUE, eval = TRUE, warning = FALSE, message = FALSE}
library(reshape2) # for melt
library(cowplot) # for plot_grid
# Graph Q3 value vs negGeoMean of Negatives
ann_of_interest <- "segment"
Stat_data <-
data.frame(row.names = colnames(exprs(target_demoData)),
Segment = colnames(exprs(target_demoData)),
Annotation = pData(target_demoData)[, ann_of_interest],
Q3 = unlist(apply(exprs(target_demoData), 2,
quantile, 0.75, na.rm = TRUE)),
NegProbe = exprs(target_demoData)[neg_probes, ])
Stat_data_m <- melt(Stat_data, measure.vars = c("Q3", "NegProbe"),
variable.name = "Statistic", value.name = "Value")
plt1 <- ggplot(Stat_data_m,
aes(x = Value, fill = Statistic)) +
geom_histogram(bins = 40) + theme_bw() + theme(text = element_text(size = 30)) +
scale_x_continuous(trans = "log2") +
facet_wrap(~Annotation, nrow = 1) +
scale_fill_brewer(palette = 3, type = "qual") +
labs(x = "Counts", y = "Segments, #")
plt2 <- ggplot(Stat_data,
aes(x = NegProbe, y = Q3, color = Annotation)) +
geom_abline(intercept = 0, slope = 1, lty = "dashed", color = "darkgray") +
geom_point() + guides(color = "none") + theme_bw() + theme(text = element_text(size = 30)) +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2") +
theme(aspect.ratio = 1) +
labs(x = "Negative Probe GeoMean, Counts", y = "Q3 Value, Counts")
plt3 <- ggplot(Stat_data,
aes(x = NegProbe, y = Q3 / NegProbe, color = Annotation)) +
geom_hline(yintercept = 1, lty = "dashed", color = "darkgray") +
geom_point() + theme_bw() + theme(text = element_text(size = 30)) +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2") +
theme(aspect.ratio = 1) +
labs(x = "Negative Probe GeoMean, Counts", y = "Q3/NegProbe Value, Counts")
btm_row <- plot_grid(plt2, plt3, nrow = 1, labels = c("B", ""),
rel_widths = c(0.43,0.57))
plot_grid(plt1, btm_row, ncol = 1, labels = c("A", ""))
```
As expected, we see separation of the Q3 and negative probe counts at both the distribution (A) and per segment (B) levels.
For additional conceptual guidance, please refer to our [Data Analysis White Paper for DSP-NGS Assays](https://www.nanostring.com/resources/geomx-cta-data-whitepaper/). Next, we normalize our data. We will use Q3 normalized data moving forward.
We use the `normalize` function from `NanoStringNCTools` to create normalization factors reflecting each data type. Upper quartile (Q3) normalization is performed using `norm_method = "quant"` setting the `desiredQuantile` flag to 0.75. Other quantiles could be specified by changing that value. We save the normalized data to a specific slot using `toELT = "q_norm"`. Similarly background normalization is performed by setting `norm_method = "neg"` and `toElt = "neg_norm"`.
```{r normalizeObject, eval = TRUE}
# Q3 norm (75th percentile) for WTA/CTA with or without custom spike-ins
target_demoData <- normalize(target_demoData , data_type = "RNA",
norm_method = "quant",
desiredQuantile = .75,
toElt = "q_norm")
# Background normalization for WTA/CTA without custom spike-in
target_demoData <- normalize(target_demoData , data_type = "RNA",
norm_method = "neg",
fromElt = "exprs",
toElt = "neg_norm")
```
To demonstrate the effects of normalization, we graph a representative boxplots of the data for individual segments before and after normalization.
```{r normplot, fig.small = TRUE}
# visualize the first 10 segments with each normalization method
boxplot(exprs(target_demoData)[,1:10],
col = "#9EDAE5", main = "Raw Counts",
log = "y", names = 1:10, xlab = "Segment",
ylab = "Counts, Raw")
boxplot(assayDataElement(target_demoData[,1:10], elt = "q_norm"),
col = "#2CA02C", main = "Q3 Norm Counts",
log = "y", names = 1:10, xlab = "Segment",
ylab = "Counts, Q3 Normalized")
boxplot(assayDataElement(target_demoData[,1:10], elt = "neg_norm"),
col = "#FF7F0E", main = "Neg Norm Counts",
log = "y", names = 1:10, xlab = "Segment",
ylab = "Counts, Neg. Normalized")
```
# Unsupervised Analysis
## UMAP & t-SNE
One common approach to understanding high-plex data is dimension reduction. Two common methods are UMAP and tSNE, which are non-orthogonally constrained projections that cluster samples based on overall gene expression. In this study, we see by either UMAP (from the `umap` package) or tSNE (from the `Rtsne` package), clusters of segments related to structure (glomeruli or tubules) and disease status (normal or diabetic kidney disease).
``` {r dimReduction, eval = TRUE}
library(umap)
library(Rtsne)
library(tidyverse)
# update defaults for umap to contain a stable random_state (seed)
custom_umap <- umap::umap.defaults
custom_umap$random_state <- 42
pData(target_demoData) <- pData(target_demoData) %>% mutate(
General_CellType = case_when(grepl("Bcells", pData(target_demoData)$SampleCellType) ~ "Bcells",
grepl("Tcells", pData(target_demoData)$SampleCellType) ~ "Tcells",
grepl("Epithelial", pData(target_demoData)$SampleCellType) ~ "Epithelial")
)
# Label indices for General CellTypes
Bcells_ind <- pData(target_demoData)$General_CellType == "Bcells"
Tcells_ind <- pData(target_demoData)$General_CellType == "Tcells"
Epithelial_ind <- pData(target_demoData)$General_CellType == "Epithelial"
# Subset by indices
target_demoData_Bcells <- target_demoData[,Bcells_ind]
# # Cre expression on Bcells
# pData(target_demoData_Bcells)$Cre_expr <- target_demoData_Bcells@assayData$exprs["Cre",]
# pData(target_demoData_Bcells)$kmeans_cluster <- "Not_applicable"
target_demoData_Tcells <- target_demoData[,Tcells_ind]
target_demoData_Epithelial <- target_demoData[,Epithelial_ind]
# pData(target_demoData_Epithelial)$Cre_expr <- target_demoData_Epithelial@assayData$exprs["Cre",]
#
# # Save Cre expression data from Bcells and Epithelial cells
# Cre_data <- rbind(pData(target_demoData_Bcells)[,match(colnames(pData(target_demoData_Epithelial)),colnames(pData(target_demoData_Bcells)))],pData(target_demoData_Epithelial))
#
# Cre_data <- Cre_data[,-c(12,13,14)]
# write.csv(Cre_data,"Bcell_Epithelial_Cre_data.csv")
# run UMAP
set.seed(1)
umap_out <-
umap(t(log2(assayDataElement(target_demoData_Epithelial , elt = "q_norm"))),
config = custom_umap)
pData(target_demoData_Epithelial)[, c("UMAP1", "UMAP2")] <- umap_out$layout[, c(1,2)]
set.seed(1)
umap_out <-
umap(t(log2(assayDataElement(target_demoData , elt = "q_norm"))),
config = custom_umap)
pData(target_demoData)[, c("UMAP1", "UMAP2")] <- umap_out$layout[, c(1,2)]
# kmeans clustering
library(factoextra)
umap_coord <- as.matrix(pData(target_demoData_Epithelial)[,c("UMAP1","UMAP2")])
# 4 or 5 clusters appear to be the optimal
fviz_nbclust(umap_coord, FUNcluster = kmeans, method = "wss")
set.seed(4)
km_res <- kmeans(umap_coord, 5, nstart = 25)
km_clusters <- as.factor(km_res$cluster)
pData(target_demoData_Epithelial)$kmeans_cluster <- km_clusters[match(rownames(pData(target_demoData_Epithelial)),names(km_clusters))]
set.seed(9)
g <- ggplot(pData(target_demoData_Epithelial),
aes(x = UMAP1, y = UMAP2, color = km_clusters, shape = `slide name`
)) +
geom_point(size = 4) + theme(legend.text = element_text(size = 10), legend.title = element_text(size = 10), axis.title.x = element_text(size = 100)) +
theme_bw()
pData(target_demoData)$Case <- as.character(pData(target_demoData)$Case)
table(pData(target_demoData)$SampleCellType,pData(target_demoData)$Case)
set.seed(1001)
g <- ggplot(pData(target_demoData),
aes(x = UMAP1, y = UMAP2, color = General_CellType, shape = `slide name`)) +
# )) + scale_shape_manual(values=1:length(unique(pData(target_demoData)$Case))) +
geom_point(size = 4) + theme(legend.text = element_text(size = 10), legend.title = element_text(size = 10), axis.title.x = element_text(size = 100)) +
theme_bw()
## for line plot figures
# epithelial_subcluster_meta <- as.data.frame(pData(target_demoData_Epithelial))[,c(-9,-12,-13,-14)]
# colnames(epithelial_subcluster_meta) <- gsub(" ","_",colnames(epithelial_subcluster_meta))
# write.csv(epithelial_subcluster_meta, "epithelial_subcluster_meta.csv")
# # run tSNE
# set.seed(42) # set the seed for tSNE as well
# tsne_out <-
# Rtsne(t(log2(assayDataElement(target_demoData , elt = "q_norm"))),
# perplexity = ncol(target_demoData)*.15)
# pData(target_demoData)[, c("tSNE1", "tSNE2")] <- tsne_out$Y[, c(1,2)]
# ggplot(pData(target_demoData),
# aes(x = tSNE1, y = tSNE2, color = SampleCellType, shape = 'SampleCellType')) +
# geom_point(size = 3) +
# theme_bw()
```
## Clustering high CV Genes
Another approach to explore the data is to calculate the coefficient of variation (CV) for each gene ($g$) using the formula $CV_g = SD_g/mean_g$. We then identify for genes with high CVs that should have large differences across the various profiled segments. This unbiased approach can reveal highly variable genes across the study.
We plot the results in a heatmap.
``` {r CVheatmap, eval = TRUE, echo = TRUE, fig.width = 8, fig.height = 6.5, fig.wide = TRUE}
library(pheatmap) # for pheatmap
# create a log2 transform of the data for analysis
assayDataElement(object = target_demoData, elt = "log_q") <-
assayDataApply(target_demoData, 2, FUN = log, base = 2, elt = "q_norm")
# create CV function
calc_CV <- function(x) {sd(x) / mean(x)}
CV_dat <- assayDataApply(target_demoData,
elt = "log_q", MARGIN = 1, calc_CV)
# show the highest CD genes and their CV values
sort(CV_dat, decreasing = TRUE)[1:5]
# Identify genes in the top 3rd of the CV values
GOI <- names(CV_dat)[CV_dat > quantile(CV_dat, 0.75)]
arranged_pData <- arrange(pData(target_demoData),pData(target_demoData)$General_CellType)
# Original pheat matrix
GOI <- c(GOI, "Ptprc")
pheat_matrix_orig <- assayDataElement(target_demoData[GOI, ], elt = "log_q")
set.seed(2);
p <- pheatmap(pheat_matrix_orig,
kmeans_k = 50,
scale = "row",
show_rownames = TRUE, show_colnames = TRUE,
border_color = NA,
cluster_rows = TRUE,
cluster_cols = TRUE,
clustering_method = "average",
clustering_distance_rows = "correlation",
clustering_distance_cols = "correlation",
fontsize = 5,
breaks = seq(-3, 3, 0.05),
color = colorRampPalette(c("purple3", "black", "yellow2"))(120),
annotation_col = pData(target_demoData)[, c("General_CellType","SampleCellType", "Phenotype")])
### Remove genes through kmeans clustering
print(p)
# Find clusters with suspicious genes
Krt5_cluster <- p$kmeans$cluster["Krt5"]
print(Krt5_cluster)
genes_in_Krt5_cluster <- names(p$kmeans$cluster[p$kmeans$cluster == Krt5_cluster])
possible_epithelial_genes_to_rm <- names(p$kmeans$cluster[p$kmeans$cluster == 23 | p$kmeans$cluster == 36 | p$kmeans$cluster == 31 |
p$kmeans$cluster == 13 | p$kmeans$cluster == 1 | p$kmeans$cluster == 43 |
p$kmeans$cluster == 34])
library(transcripTools)
# Create matrix subsets of each celltypes
all_cells_data <- pheat_matrix_orig
epithelial_data <- pheat_matrix_orig[,colnames(pheat_matrix_orig) %in% rownames(pData(target_demoData))[pData(target_demoData)$General_CellType == "Epithelial"]]
Bcells_data <- pheat_matrix_orig[,colnames(pheat_matrix_orig) %in% rownames(pData(target_demoData))[pData(target_demoData)$General_CellType == "Bcells"]]
# If iteration is on Tcells, then remove possible epithelial contaminant genes before doing pheatmap
Tcells_data <- pheat_matrix_orig[,colnames(pheat_matrix_orig) %in% rownames(pData(target_demoData))[pData(target_demoData)$General_CellType == "Tcells"]]
Tcells_data <- Tcells_data[!rownames(Tcells_data) %in% epithelial_genes_to_rm,]
matrix_store <- list(all_cells_data, epithelial_data, Bcells_data, Tcells_data)
# Store heatmaps
pheat_store <- list()
for (i in seq_along(matrix_store)){
# Remove mCherry from pheatmap
pheat_matrix_orig_no_cherry <- matrix_store[[i]][!rownames(matrix_store[[i]]) %in% "mCherry/tdTomato",]
filtered_pheat_matrix <- mostVar(pheat_matrix_orig_no_cherry, n = 250, i_want_most_var = TRUE)
# If i = 1 (i.e. the matrix containing all cells, then arrange by general celltype)
if (i == 1){
filtered_pheat_matrix <- filtered_pheat_matrix[,match(rownames(arranged_pData),colnames(filtered_pheat_matrix))]
}
colnames(filtered_pheat_matrix) <- gsub(".dcc","",colnames(filtered_pheat_matrix))
p_filtered <- pheatmap(filtered_pheat_matrix,
# kmeans_k = 250,
scale = "row",
show_rownames = TRUE, show_colnames = TRUE,
border_color = NA,
cluster_cols = FALSE,
clustering_method = "average",
clustering_distance_rows = "correlation",
clustering_distance_cols = "correlation",
fontsize = 3,
breaks = seq(-3, 3, 0.05),
color = colorRampPalette(c("purple3", "black", "yellow2"))(120),
annotation_col = pData(target_demoData)[, c("General_CellType","SampleCellType", "Phenotype")])
pheat_store[[i]] <- p_filtered
}
```
# Differential Expression
A central method for exploring differences between groups of segments or samples is to perform differential gene expression analysis. A common statistical approach is to use a linear mixed-effect model (LMM). The LMM allows the user to account for the subsampling per tissue; in other words, we adjust for that fact that the multiple regions of interest placed per tissue section are not independent observations, as is the assumption with other traditional statistical tests. The formulation of the LMM model depends on the scientific question being asked.
Overall, there are two flavors of the LMM model when used with GeoMx data: i) with and ii) without random slope.
When comparing features that co-exist in a given tissue section (e.g. glomeruli vs tubules in DKD kidneys), a random slope is included in the LMM model. When comparing features that are mutually exclusive in a given tissue section (healthy glomeruli versus DKD glomeruli) the LMM model does not require a random slope. We represent the two variations on the LMM in the schematic below:
![](./GeomxDETypes.png)
For more details on the LMM, please refer to the [lme4 package](https://cran.r-project.org/web/packages/lme4/) and the [lmerTest package](https://cran.r-project.org/web/packages/lmerTest/).
## Within Slide Analysis: Glomeruli vs Tubules
One informative exploration is to study differences between morphological structures. In this example, we can study differential expression between glomeruli and tubules. we will focus on the diseased kidney tissues. Because we are comparing structures that co-exist within the a given tissue we will use the LMM model with a random slope. Morphological structure (`Region`) is our test variable. We control for tissue subsampling with `slide name` using a random slope and intercept; the intercept adjusts for the multiple regions placed per unique tissue, since we have one tissue per slide. If multiple tissues are placed per slide, we would change the intercept variable to the unique tissue name (ex: tissue name, Block ID, etc).
In this analysis we save log~2~ fold change estimates and P-values across all levels in the factor of interest. We also apply a Benjamini-Hochberg multiple test correction.
``` {r cheatDE, eval = TRUE, echo = FALSE}
# below is work-around text to quick load DE results from a saved csv for convenience
# THIS SECTION WILL NOT BE SHOWN, is only for speed of iteration
#results <- as.data.frame(read.csv("~/DE_results_for_testing.csv", row.names = 1))
#colnames(results) <- c('Gene','Subset','Contrast','Estimate','Pr(>|t|)','FDR')
# turn eval to TRUE in next code block to re-run LMM code (will take ~15-30 minutes on a single core local machine)
```
``` {r addPhenoData, eval = TRUE}
# Line plots
library(tidyverse)
ROIs <- c("DSP-1001660007393-A-H08.dcc",
"DSP-1001660007393-A-E12.dcc",
"DSP-1001660007393-A-F02.dcc",
"DSP-1001660007393-A-A03.dcc",
"DSP-1001660007393-A-A07.dcc",
"DSP-1001660007393-A-A09.dcc",
"DSP-1001660007393-A-B01.dcc",
"DSP-1001660007393-A-B03.dcc",
"DSP-1001660007393-A-D08.dcc"
)
lineplot_genes_list <- list(list_1 = c("E2f1","E2f2","Myc"),
list_2 = c("Mki67","Pcna","Top2a",
"Ube2c","Birc5"),
list_3 = c("Akt1","Cdk4","Ppp1ca",
"Mapk1","Arf1"),
list_4_1 = c("Prss16","Psmb11","Ccl25"),
list_4_2 = c("Ccl21a",
"Aire","Fezf2"),