Commit

Pearson nolog incorporated
rmflight committed Feb 18, 2022
1 parent 9e14ffd commit 126ea59
Showing 56 changed files with 2,155 additions and 114 deletions.
58 changes: 28 additions & 30 deletions doc/ici_kt_manuscript.Rmd

Large diffs are not rendered by default.

Binary file modified doc/ici_kt_manuscript.docx
Binary file not shown.
2,099 changes: 2,060 additions & 39 deletions doc/icikt_references.json

Large diffs are not rendered by default.

69 changes: 28 additions & 41 deletions doc/supplemental_materials.Rmd
@@ -22,8 +22,8 @@ knitr::opts_chunk$set(echo = FALSE, warning = FALSE,
dev.args = list(png = list(type = "cairo")))
-supp_figure_count = dn_counter$new("Figure S", "_")
-supp_table_count = dn_counter$new("Table S", "_")
+supp_figure_count = dn_counter$new("Figure ", "_", "S")
+supp_table_count = dn_counter$new("Table ", "_", "S")
supp_figure_count$increment("kt_pearson")
supp_figure_count$increment("ici_distribution")
@@ -35,7 +35,7 @@ supp_figure_count$increment("kt_distribution")
The simplest simulated data set includes three samples, where two are perfectly correlated, and the third perfectly anti-correlated.
We then created missing values (replaced value with NA) in each sample systematically, and computed the ICI-Kt, Pearson, and Kendall-tau correlations.
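As a rough illustration of this setup (not the code used to build the supplement), the sketch below constructs three toy samples in base R, makes some values missing in one of them, replaces NA with zero, and computes Pearson and Kendall-tau with `stats::cor()`. The variable names and the choice of which values go missing are assumptions made for the example, and the ICI-Kt implementation itself is not reproduced here.

```r
# Toy version of the three-sample simulation; all names are illustrative.
n_feature = 100
base_values = as.numeric(seq_len(n_feature))

sample_1 = base_values                      # reference sample
sample_2 = 2 * base_values                  # perfectly correlated with sample_1
sample_3 = (n_feature + 1) - base_values    # perfectly anti-correlated with sample_1

# Without missing values, Kendall-tau is +1 and -1 respectively.
kt_positive = cor(sample_1, sample_2, method = "kendall")
kt_negative = cor(sample_1, sample_3, method = "kendall")

# Assumption for this sketch: the ten lowest values of sample_1 go missing.
sample_1_missing = sample_1
sample_1_missing[1:10] = NA

# Replace NA with zero before computing the correlations.
sample_1_zero = sample_1_missing
sample_1_zero[is.na(sample_1_zero)] = 0

kt_zero = cor(sample_1_zero, sample_2, method = "kendall")
pearson_zero = cor(sample_1_zero, sample_2, method = "pearson")
```

Because the zeros introduce ties in `sample_1_zero`, both `kt_zero` and `pearson_zero` fall slightly below 1 in this toy case.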

-```{r compare_simple, dn_id = supp_figure_count$label_file("kt_pearson")}
+```{r kt_pearson, dn_id = supp_figure_count}
loadd(positive_kt)
loadd(positive_pearson)
loadd(negative_kt)
@@ -62,7 +62,7 @@ We can also examine the full set of positive and negative correlations generated
These distributions are shown in `r supp_figure_count$label_text("kt_distribution")`.
We can see that the distributions from both ICI-Kt and Kendall-tau are the same, which is expected given we replaced missing values (NA) with zero *within* the ICI-Kt code, and replaced missing values (NA) with zero prior to calculating Kendall-tau correlations.

-```{r ici_full_distribution, dn_id = supp_figure_count$label_file("ici_distribution")}
+```{r ici_distribution, dn_id = supp_figure_count}
loadd(all_kt)
ici_plot = ggplot(all_kt, aes(x = ici_kt)) +
geom_histogram(bins = 100) +
@@ -74,7 +74,7 @@ ici_plot
`r supp_figure_count$label_text("ici_distribution")`.
ICI-Kendall-tau correlation as missing values are varied between two samples.

-```{r kt_full_distribution, dn_id = supp_figure_count$label_file("kt_distribution")}
+```{r kt_distribution, dn_id = supp_figure_count}
kt_plot = ggplot(all_kt, aes(x = kendall)) +
geom_histogram(bins = 100) +
facet_wrap(~ comp, ncol = 1) +
@@ -94,7 +94,7 @@ supp_figure_count$increment("complexity")
In addition to comparing their performance, we can check that the algorithmic complexity fits the theoretically expected complexity by fitting a regression line of the run time to the number of items.
The run times and fitted lines for each method (R's Pearson, the ICI-Kt, and R's Kendall-tau) are shown in `r supp_figure_count$label_text("complexity")`.
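One way such a check could look is sketched below, under stated assumptions: `fit_complexity()` is a hypothetical helper (it is not `create_complexity_figure()`), and the run times are synthetic values generated from a known quadratic law rather than measurements. The idea is to regress run time on a transformed item count and see which transformation gives the better fit.

```r
# Hypothetical helper: regress run time on a transformed number of items.
fit_complexity = function(n_items, run_time, scale_fun = identity) {
  fit_data = data.frame(x = scale_fun(n_items), run_time = run_time)
  lm(run_time ~ x, data = fit_data)
}

n_items = seq(1000, 10000, by = 1000)

# Synthetic run times (NOT measured): quadratic growth with a small,
# deterministic 1% wobble so the fit is not numerically perfect.
run_time = 2e-8 * n_items^2 * (1 + 0.01 * sin(n_items))

linear_fit = fit_complexity(n_items, run_time)
quadratic_fit = fit_complexity(n_items, run_time, function(n) n^2)

linear_r2 = summary(linear_fit)$r.squared
quadratic_r2 = summary(quadratic_fit)$r.squared
```

For run times that really do grow quadratically, `quadratic_r2` exceeds `linear_r2`, and the fitted slope of `quadratic_fit` recovers the generating constant. Swapping in `function(n) n * log(n)` would test a linearithmic expectation in the same way.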

-```{r plot_complexity, dn_id = supp_figure_count$label_file("complexity")}
+```{r complexity, dn_id = supp_figure_count}
single_core_perf = readRDS(here::here("doc", "single_core_perf.rds"))
complex_figure = create_complexity_figure(single_core_perf)
complex_figure
@@ -161,7 +161,7 @@ yeast_single2 = rbind(yeast_single, tmp_yeast)
compare_yeast = c("icikt", "icikt_complete", "pearson_log1p", "pearson_base_nozero", "manuscript")
```

-```{r yeast_single_plot, dn_id = supp_figure_count$label_file("yeast_outliers")}
+```{r yeast_outliers, dn_id = supp_figure_count}
yeast_single2 %>%
add_method(map_method = manual_method) %>%
dplyr::filter(which %in% compare_yeast) %>%
@@ -181,19 +181,6 @@ yeast_table = compare_outlier_tables(yeast_single2, compare_yeast, sort_var = "m
supp_table_count$increment("yeast_outliers")
```

-`r supp_table_count$label_text("yeast_outliers")`.
-Yeast dataset median correlation values and outlier determination for each outlier from each of the correlation methods.
-Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; M: Manuscript; PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.
-
-```{r yeast_table2}
-#yeast_ft_out = set_caption(yeast_table,
-# caption = paste0(paste_label("Table S", supp_table_count, "yeast_outliers"), ". Yeast dataset median correlation values and outlier determination for each outlier from each of the correlation methods."), style = "paragraph")
-yeast_ft_out = yeast_table %>%
-set_nice_widths() %>%
-put_outlier_right()
-yeast_ft_out
-```
-
In `r supp_figure_count$label_text("yeast_outliers")` and `r supp_table_count$label_text("yeast_outliers")` we can see how the determination of outliers was not made solely on the basis of correlation alone, but on a combination of factors that lead to some of the higher correlating samples (using raw counts and Pearson correlation) being considered outliers where lower correlating samples were not listed as being outliers.

Regardless, using ICI-Kt or ICI-Kt * Completeness in this instance, the outliers using the simple distribution summary statistics and an outlier having to be > 1.5 median error, the outliers are mostly a superset of the outliers determined by Gierlinski et al, with the exception of three samples specific to their data: Snf2.24, WT.22, and WT.28.
@@ -220,7 +207,7 @@ brainson_single = all_brainson %>%
compare_brainson = c("icikt", "icikt_complete", "pearson_log1p")
```

-```{r show_brainson, dn_id = supp_figure_count$label_file("brainson_outliers")}
+```{r brainson_outliers, dn_id = supp_figure_count}
brainson_single %>%
add_method() %>%
dplyr::filter(which %in% compare_brainson) %>%
@@ -284,7 +271,7 @@ adeno_single = all_adeno2 %>%
compare_adeno = c("icikt", "icikt_complete", "pearson_log1p")
```

-```{r show_adeno_outliers, dn_id = supp_figure_count$label_file("adeno_outliers")}
+```{r adeno_outliers, dn_id = supp_figure_count}
adeno_single %>%
add_method() %>%
dplyr::filter(which %in% compare_adeno) %>%
@@ -356,7 +343,7 @@ supp_figure_count$increment("yeast_by_keep_num")
supp_figure_count$increment("yeast_differences")
```

-```{r yeast_by_method, dn_id = supp_figure_count$label_file("yeast_by_method")}
+```{r yeast_by_method, dn_id = supp_figure_count}
all_yeast2 = all_yeast %>%
add_method(other_method) %>%
clear_outliers()
@@ -370,7 +357,7 @@ ggplot(all_yeast2, aes(x = sample_class, y = med_cor, color = outlier, group = s
Median correlations by correlation method and applying different fractional cutoffs.
Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; Kt: Kendall-tau; PB: Pearson Base (raw values); PL: Pearson Log(x); PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.

-```{r yeast_by_fraction, dn_id = supp_figure_count$label_file("yeast_by_keep_num")}
+```{r yeast_by_keep_num, dn_id = supp_figure_count}
ggplot(all_yeast2, aes(x = sample_class, y = med_cor, color = outlier, group = sample_class)) +
geom_sina() +
facet_grid(method ~ keep_num) +
@@ -381,7 +368,7 @@ ggplot(all_yeast2, aes(x = sample_class, y = med_cor, color = outlier, group = s
Median correlations by applying different fractional cutoffs and different correlation methods.
Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; Kt: Kendall-tau; PB: Pearson Base (raw values); PL: Pearson Log(x); PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.

-```{r yeast_plot_differences, dn_id = supp_figure_count$label_file("yeast_differences")}
+```{r yeast_differences, dn_id = supp_figure_count}
yeast_summary = calculate_variation(all_yeast2)
yeast_diffs = calculate_differences(yeast_summary)
@@ -404,7 +391,7 @@ supp_figure_count$increment("brainson_by_keep_num")
supp_figure_count$increment("brainson_differences")
```

-```{r brainson_plot1, dn_id = supp_figure_count$label_file("brainson_by_method")}
+```{r brainson_by_method, dn_id = supp_figure_count}
brainson_classes = c("sorted", "total")
all_brainson_method = add_method(all_brainson, map_method = other_method) %>%
dplyr::filter(sample_class %in% brainson_classes)
@@ -419,7 +406,7 @@ ggplot(all_brainson_method, aes(x = sample_class, y = med_cor, color = outlier,
Median correlations by correlation method and applying different fractional cutoffs for Brainson RNA-seq data.
Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; Kt: Kendall-tau; PB: Pearson Base (raw values); PL: Pearson Log(x); PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.

-```{r brainson_plot2, dn_id = supp_figure_count$label_file("brainson_by_keep_num")}
+```{r brainson_by_keep_num, dn_id = supp_figure_count}
ggplot(all_brainson_method, aes(x = sample_class, y = med_cor, color = outlier, group = sample_class)) +
geom_sina() +
facet_grid(method ~ keep_num) +
@@ -430,7 +417,7 @@ ggplot(all_brainson_method, aes(x = sample_class, y = med_cor, color = outlier,
Median correlations by correlation method and applying different fractional cutoffs for Brainson RNA-seq data.
Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; Kt: Kendall-tau; PB: Pearson Base (raw values); PL: Pearson Log(x); PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.

-```{r brainson_plot_differences, dn_id = supp_figure_count$label_file("brainson_differences")}
+```{r brainson_differences, dn_id = supp_figure_count}
brainson_summary = calculate_variation(all_brainson_method)
brainson_diffs = calculate_differences(brainson_summary)
@@ -453,7 +440,7 @@ supp_figure_count$increment("adeno_by_keep_num")
supp_figure_count$increment("adeno_differences")
```

-```{r adeno_by_method, dn_id = supp_figure_count$label_file("adeno_by_method")}
+```{r adeno_by_method, dn_id = supp_figure_count}
adeno_method = add_method(all_adeno2, other_method) %>%
clear_outliers()
ggplot(adeno_method, aes(x = sample_class, y = med_cor, color = outlier, group = sample_class)) +
@@ -467,7 +454,7 @@ ggplot(adeno_method, aes(x = sample_class, y = med_cor, color = outlier, group =
Median correlations by correlation method and applying different fractional cutoffs for TCGA adenocarcinoma RNA-seq data.
Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; Kt: Kendall-tau; PB: Pearson Base (raw values); PL: Pearson Log(x); PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.

-```{r adeno_by_fraction, dn_id = supp_figure_count$label_file("adeno_by_keep_num")}
+```{r adeno_by_keep_num, dn_id = supp_figure_count}
ggplot(adeno_method, aes(x = sample_class, y = med_cor, color = outlier, group = sample_class)) +
geom_sina() +
facet_grid(method ~ keep_num) +
@@ -478,7 +465,7 @@ ggplot(adeno_method, aes(x = sample_class, y = med_cor, color = outlier, group =
Median correlations by correlation method and applying different fractional cutoffs for TCGA adenocarcinoma RNA-seq data.
Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; Kt: Kendall-tau; PB: Pearson Base (raw values); PL: Pearson Log(x); PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.

-```{r adeno_plot_differences, dn_id = supp_figure_count$label_file("adeno_differences")}
+```{r adeno_differences, dn_id = supp_figure_count}
adeno_summary = calculate_variation(adeno_method)
adeno_diffs = calculate_differences(adeno_summary)
@@ -501,7 +488,7 @@ Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Com
supp_figure_count$increment("icikt_subsampling")
```

-```{r load_icikt_combined}
+```{r icikt_subsampling, dn_id = supp_figure_count}
loadd(combined_random)
icikt_overrange =
@@ -527,7 +514,7 @@ supp_figure_count$increment("pearson_subsampling")
```


-```{r pearson_subsampling, fig.height = 8, fig.width = 8, dn_id = supp_figure_count$label_file("pearson_subsampling")}
+```{r pearson_subsampling, fig.height = 8, fig.width = 8, dn_id = supp_figure_count}
loadd(run_random_pearson_log_select_random_fraction_0.01)
loadd(run_random_pearson_log_select_random_fraction_0.1)
@@ -585,7 +572,7 @@ Right: Histogram of the residual differences of the sample-sample correlations,
supp_figure_count$increment("pearson_sub_overrange")
```

-```{r pearson_subsampling_range, dn_id = supp_figure_count$label_file("pearson_sub_overrange")}
+```{r pearson_sub_overrange, dn_id = supp_figure_count}
loadd(combined_random_pearson_log)
pearson_overrange =
(ggplot(combined_random_pearson_log, aes(x = fraction, y = median)) +
@@ -609,7 +596,7 @@ supp_figure_count$increment("pearson_subsampling_0")
```


-```{r pearson_subsampling_0, fig.height = 8, fig.width = 8, dn_id = supp_figure_count$label_file("pearson_subsampling_0")}
+```{r pearson_subsampling_0, fig.height = 8, fig.width = 8, dn_id = supp_figure_count}
loadd(run_random_pearson_0_log_select_random_fraction_0.01)
loadd(run_random_pearson_0_log_select_random_fraction_0.1)
@@ -667,7 +654,7 @@ Right: Histogram of the residual differences of the sample-sample correlations,
supp_figure_count$increment("pearson_sub_overrange_0")
```

-```{r pearson_subsampling_range_0, dn_id = supp_figure_count$label_file("pearson_sub_overrange_0")}
+```{r pearson_sub_overrange_0, dn_id = supp_figure_count}
loadd(combined_random_pearson_0_log)
pearson_0_overrange =
(ggplot(combined_random_pearson_0_log, aes(x = fraction, y = median)) +
@@ -691,7 +678,7 @@ supp_figure_count$increment("pearson_sub_overrange_nolog")
```


-```{r pearson_subsampling_nolog, fig.height = 8, fig.width = 8, dn_id = supp_figure_count$label_file("pearson_subsampling_nolog")}
+```{r pearson_subsampling_nolog, fig.height = 8, fig.width = 8, dn_id = supp_figure_count}
nolog_limits = c(-0.6, 1)
loadd(run_random_pearson_select_random_fraction_0.01)
loadd(run_random_pearson_select_random_fraction_0.1)
@@ -745,7 +732,7 @@ Left: The sample-sample correlations using all features are plotted against the
Right: Histogram of the residual differences of the sample-sample correlations, where the difference is the (random subset - all features).
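The comparison described in this caption can be sketched in base R as follows. This is a minimal illustration on synthetic data, not the pipeline behind the figures: the matrix `counts`, the 10% fraction, and all variable names are assumptions for the example, and Pearson is used only because it is fast (any method accepted by `stats::cor()` could be substituted).

```r
set.seed(42)
n_feature = 2000
n_sample = 6

# Synthetic features-by-samples matrix: a shared signal plus sample-specific noise.
shared_signal = rnorm(n_feature, mean = 10, sd = 2)
counts = sapply(seq_len(n_sample), function(i) shared_signal + rnorm(n_feature))
colnames(counts) = paste0("sample_", seq_len(n_sample))

# Sample-sample correlations using all features.
cor_all = cor(counts, method = "pearson")

# Sample-sample correlations using a random 10% subset of the features.
keep_fraction = 0.1
keep_rows = sample(n_feature, size = round(keep_fraction * n_feature))
cor_subset = cor(counts[keep_rows, ], method = "pearson")

# One row per sample pair; the residual is (random subset - all features).
pair_index = upper.tri(cor_all)
residuals_df = data.frame(
  all_features = cor_all[pair_index],
  random_subset = cor_subset[pair_index]
)
residuals_df$difference = residuals_df$random_subset - residuals_df$all_features
```

Plotting `random_subset` against `all_features` with a line of perfect agreement would correspond to the left panel, and a histogram of `difference` to the right panel.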


-```{r pearson_subsampling_range_nolog, dn_id = supp_figure_count$label_file("pearson_sub_overrange_nolog")}
+```{r pearson_sub_overrange_nolog, dn_id = supp_figure_count}
loadd(combined_random_pearson)
pearson_overrange_nolog =
(ggplot(combined_random_pearson, aes(x = fraction, y = median)) +
@@ -768,7 +755,7 @@ supp_figure_count$increment("kendall_subsampling")
supp_figure_count$increment("kendall_overrange")
```

-```{r kendall_subsampling, fig.height = 8, fig.width = 8, dn_id = supp_figure_count$label_file("kendall_subsampling")}
+```{r kendall_subsampling, fig.height = 8, fig.width = 8, dn_id = supp_figure_count}
loadd(run_random_kendall_select_random_fraction_0.01)
loadd(run_random_kendall_select_random_fraction_0.1)
@@ -821,7 +808,7 @@ kendall_f1_f3
Left: The sample-sample correlations using all features are plotted against the sample-sample correlations using a random subset. Red line indicates perfect agreement.
Right: Histogram of the residual differences of the sample-sample correlations, where the difference is the (random subset - all features).

-```{r kendall_subsampling_range, dn_id = supp_figure_count$label_file("kendall_overrange")}
+```{r kendall_overrange, dn_id = supp_figure_count}
loadd(combined_random_kendall)
loadd(combined_random_pearson_log)
kendall_overrange =
@@ -844,7 +831,7 @@ Median differences in correlation (top) and standard deviation of differences (b
supp_figure_count$increment("compare_overrange")
```

-```{r compare_overrange, fig.height = 10, fig.width = 10, dn_id = supp_figure_count$label_file("compare_overrange")}
+```{r compare_overrange, fig.height = 10, fig.width = 10, dn_id = supp_figure_count}
combined_random$method = "ICI-Kt"
combined_random_pearson_log$method = "Pearson"
combined_random_pearson_0_log$method = "Pearson 0"
Binary file modified doc/supplemental_materials.docx
Binary file not shown.
34 changes: 34 additions & 0 deletions doc/supplemental_table_1.Rmd
@@ -0,0 +1,34 @@
---
title: 'Supplemental Table 1'
output:
rmarkdown::word_document:
reference_docx: 'table_template.docx'
keep_md: true
editor_options:
chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE,
message = FALSE, fig.width = 8,
fig.height = 6, fig.keep = "all",
fig.process = dn_modify_path,
tab.cap.style = "paragraph",
tab.cap.pre = "",
tab.cap.sep = "",
dpi = 600,
dev.args = list(png = list(type = "cairo")))
```

`r supp_table_count$label_text("yeast_outliers")`.
Yeast dataset median correlation values and outlier determination for each outlier from each of the correlation methods.
Abbreviations for different measures and data are: IK: ICI-Kt; IKC: ICI-Kt * Completeness; M: Manuscript; PL1: Pearson Log(x + 1); PN0: Pearson No Zeros.

```{r yeast_table2}
#yeast_ft_out = set_caption(yeast_table,
# caption = paste0(paste_label("Table S", supp_table_count, "yeast_outliers"), ". Yeast dataset median correlation values and outlier determination for each outlier from each of the correlation methods."), style = "paragraph")
yeast_ft_out = yeast_table %>%
set_nice_widths() %>%
put_outlier_right()
yeast_ft_out
```
Binary file added doc/supplemental_table_1.docx
Binary file not shown.
Binary file added doc/table_template.docx
Binary file not shown.
1 change: 1 addition & 0 deletions generate_manuscript.R
@@ -4,6 +4,7 @@ lapply(list.files("./R", full.names = TRUE), source)
# we process the supplement first so we can refer to the figures and tables
# in there in the main manuscript
rmarkdown::render("./doc/supplemental_materials.Rmd")
+rmarkdown::render("./doc/supplemental_table_1.Rmd")
#beepr::beep(2)
rmarkdown::render("./doc/ici_kt_manuscript.Rmd")

8 changes: 4 additions & 4 deletions renv.lock
@@ -562,15 +562,15 @@
},
"documentNumbering": {
"Package": "documentNumbering",
-"Version": "0.0.3",
+"Version": "0.0.4",
"Source": "GitHub",
"RemoteType": "github",
"RemoteHost": "api.github.com",
-"RemoteRepo": "documentNumbering",
+"RemoteRepo": "DocumentNumbering",
"RemoteUsername": "rmflight",
"RemoteRef": "HEAD",
-"RemoteSha": "868d414f40cae8f65c31fc2d42c9a4a007832a13",
-"Hash": "8e601e6e14c55671edc199c4e29a6a95"
+"RemoteSha": "08285e17300556341dc3ab505db0d4edc6191676",
+"Hash": "a54ca716e6017cd4e9f257fccde7c957"
},
"dotenv": {
"Package": "dotenv",