Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
3231ef6
Merge pull request #1 from MerrimanLab/master
BenRangihuna Mar 3, 2021
ce7c9dd
should run ready for test
BenRangihuna Mar 3, 2021
bf9ca94
testing
BenRangihuna Mar 3, 2021
df7a669
more timestamps
BenRangihuna Mar 8, 2021
ec3ce13
comment out testing code that was causing issues
BenRangihuna Mar 8, 2021
e557c32
calc pca for individual pops
BenRangihuna May 31, 2021
a6566ea
calc pca for individual pops
BenRangihuna Jun 1, 2021
3997125
exclude missing data before making models
BenRangihuna Jun 4, 2021
45d9ac7
Update whole_pop_pipeline.sh
BenRangihuna Jun 22, 2021
4a0c47e
use separate pheno file per model
BenRangihuna Jun 22, 2021
f985064
use separate pheno for each model
BenRangihuna Jun 22, 2021
3ee50ac
use gcta suffix on pheno file
BenRangihuna Jun 22, 2021
d2c3eb0
Update whole_pop_pipeline.sh
BenRangihuna Jun 22, 2021
ffeb7d8
extract results at the end
BenRangihuna Jun 22, 2021
23a58d9
Merge branch 'master' of https://github.com/BenRangihuna/GenomicPred
BenRangihuna Jun 22, 2021
9d69910
extract from actual original data
BenRangihuna Jun 22, 2021
5f6cd93
new scripts for running models on pca selected populations
BenRangihuna Jul 13, 2021
cb360a5
all pops
BenRangihuna Jul 13, 2021
f5981da
update prevalences
BenRangihuna Jul 13, 2021
2c5a01b
adding subshells ({}) into "for covar in.." lines of code
BenRangihuna Jul 21, 2021
2735592
remove {} from "for covar in.." lines
BenRangihuna Jul 22, 2021
75dfe6b
updates to pipeline and supplimental scripts
BenRangihuna Sep 22, 2021
a152b3d
Merge pull request #2 from MerrimanLab/ben
BenRangihuna Oct 8, 2021
764c564
Update README.md
BenRangihuna Apr 11, 2022
56a1ab4
Update model_residuals_whole_pop.R
BenRangihuna Jul 31, 2023
cccc9a7
Update model_residuals_whole_pop.R
BenRangihuna Jul 31, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 1 addition & 2 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,7 @@
`scripts/` contains the scripts that are used for the calculation of genomic profile scores

The main workflow script is `cv_pipeline.sh` which will run each of the required steps.
The main workflow script is `cv_pipeline.sh` which will run each of the required steps. (This pipeline is used to estimate trait heritabilities for each Polynesian Population of interest (i.e not for the CV-trait profile score stages)).

It starts with a plink formatted dataset containing all populations of interest, subsets out each population, and does cross-validation for each trait of interest in each population using both GCTA and LDAK.


`scripts/create_model_reports.R` will create reports for each pop/trait combo as listed in `data/pop_trait_models.csv` and apply a template RMarkdown to pull in and summarise the results from LDAK and GCTA. The template document is `scripts/model_selection_doc.Rmd`.
115 changes: 115 additions & 0 deletions scripts/LDAK_GCTA_heritability_with_trait_values.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
echo NPH
POP=nphpca
RESULTS=${POP}_results
TRAIT=gout
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
for covar in ls ${RESULTS}/*.covar
do

MODEL=$(basename ${covar} .covar)
PREV=0.049
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4

done > ${POP}_${TRAIT}.log

TRAIT=t2d
for covar in ls ${RESULTS}/*.covar
do
MODEL=$(basename ${covar} .covar)
PREV=0.078
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4
done > ${POP}_${TRAIT}.log


## East
echo East
POP=eastpca
RESULTS=${POP}_results
TRAIT=gout
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
for covar in ls ${RESULTS}/*.covar
do
MODEL=$(basename ${covar} .covar)
PREV=0.043
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4

done > ${POP}_${TRAIT}.log

TRAIT=t2d
for covar in ls ${RESULTS}/*.covar
do
MODEL=$(basename ${covar} .covar)
PREV=0.084
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4
done > ${POP}_${TRAIT}.log

## West
echo WEST
POP=westpca
RESULTS=${POP}_results
TRAIT=gout
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
for covar in ls ${RESULTS}/*.covar
do
MODEL=$(basename ${covar} .covar)
PREV=0.051
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4

done > ${POP}_${TRAIT}.log

TRAIT=t2d
for covar in ls ${RESULTS}/*.covar
do

MODEL=$(basename ${covar} .covar)
PREV=0.146
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4
done > ${POP}_${TRAIT}.log


## Euro

echo EURO
POP=europca
RESULTS=${POP}_results
TRAIT=gout
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
for covar in ls ${RESULTS}/*.covar
do
MODEL=$(basename ${covar} .covar)
PREV=0.024
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4

done > ${POP}_${TRAIT}.log

TRAIT=t2d
for covar in ls ${RESULTS}/*.covar
do
MODEL=$(basename ${covar} .covar)
PREV=0.049
mkdir -p ${RESULTS}/{LDAK,GCTA}/${TRAIT}
software/LDAK/ldak5.linux --reml ${RESULTS}/LDAK/${TRAIT}/${MODEL} --pheno data/tanya_${TRAIT}.pheno --grm ${RESULTS}/${POP}_ldak_kinships --prevalence ${PREV} --covar ${covar}
software/gcta64 --reml --reml-pred-rand --grm ${RESULTS}/${POP}_gcta_grm --pheno data/tanya_${TRAIT}.pheno --prevalence ${PREV} --out ${RESULTS}/GCTA/${TRAIT}/${MODEL} --qcovar ${covar} --threads 4
done > ${POP}_${TRAIT}.log


#Code for transferring directories to local computer..
# scp -r [email protected]:/Volumes/scratch/merrimanlab/ben/genomic_predictions/{nph,euro,west,east}pca_results ~/Documents/genomic_prediction_results/




2 changes: 1 addition & 1 deletion scripts/cv_pipeline.sh
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ parallel 'bash scripts/generate_pop_pca.sh {}' ::: nph east west euro
# create the cv splits and residuals for all pops and trait combos

# **** traits must not contain underscores in their names ****
CV=5 # number fo folds for cross validation
CV=2 # number fo folds for cross validation

for POP in nph east west euro
do
Expand Down
72 changes: 72 additions & 0 deletions scripts/make_pheno_and_covar_files.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
library(tidyverse)
library(here)

#CREBRF MASTER FILE (FROM TANYA)
tanya <- read_csv(file = here("data/tanya_data.csv"), col_names = TRUE) %>%
filter(SUBJECT != "Blank")



nph <- read_delim(here("nphpca_results/nphpca_pcafile.eigenvec"), col_names = c("FID","IID",paste0("PCA",1:10)), delim = ' ')
east <- read_delim(here("eastpca_results/eastpca_pcafile.eigenvec"), col_names = c("FID","IID",paste0("PCA",1:10)), delim = ' ')
west <- read_delim(here("westpca_results/westpca_pcafile.eigenvec"), col_names = c("FID","IID",paste0("PCA",1:10)), delim = ' ')
euro <- read_delim(here("europca_results/europca_pcafile.eigenvec"), col_names = c("FID","IID",paste0("PCA",1:10)), delim = ' ')

# pheno files

all_IIDS <- c(nph$IID, east$IID, west$IID, euro$IID)

tanya <- tanya %>% filter(SUBJECT %in% all_IIDS)

tanya %>% select(IID = SUBJECT, GOUT) %>%
mutate(GOUT_recode = case_when(str_detect(GOUT,"Control") ~ 1,
GOUT %in% c("ACR Gout","GP Gout") ~ 2,
TRUE ~ NA_real_
),
FID = IID) %>% select(FID, IID, GOUT = GOUT_recode) %>%
write_tsv(here("data/tanya_gout.pheno"))


tanya %>% select(IID = SUBJECT, TYPE2D) %>%
mutate(TYPE2D_recode = case_when(TYPE2D == "No" ~ 1,
TYPE2D == "Yes" ~ 2,
TRUE ~ NA_real_
),
FID = IID) %>% select(FID, IID, TYPE2D = TYPE2D_recode) %>% write_tsv(here("data/tanya_t2d.pheno"))

# covar files
nph_covar <- tanya %>% mutate(FID = SUBJECT, IID = SUBJECT) %>%
select(FID,IID, AGECOL, SEX) %>%
right_join(nph)

east_covar <- tanya %>% mutate(FID = SUBJECT, IID = SUBJECT) %>%
select(FID,IID, AGECOL, SEX) %>%
right_join(east)

west_covar <- tanya %>% mutate(FID = SUBJECT, IID = SUBJECT) %>%
select(FID,IID, AGECOL, SEX) %>%
right_join(west)

euro_covar <- tanya %>% mutate(FID = SUBJECT, IID = SUBJECT) %>%
select(FID,IID, AGECOL, SEX) %>%
right_join(euro)

cols <- names(nph_covar)[-1:-2]
for(n in cols){
p <- nph_covar %>% select(FID, IID:!!n)
fn <- paste(names(p)[-1:-2], collapse = "_")
print(fn)
write_tsv(p, here("nphpca_results/",paste0("nphpca_", fn,".covar")))

east_covar %>% select(FID, IID:!!n) %>%
write_tsv( here("eastpca_results/",paste0("eastpca_", fn,".covar")))

west_covar %>% select(FID, IID:!!n) %>%
write_tsv( here("westpca_results/",paste0("westpca_", fn,".covar")))

euro_covar %>% select(FID, IID:!!n) %>%
write_tsv( here("europca_results/",paste0("europca_", fn,".covar")))

}


49 changes: 49 additions & 0 deletions scripts/make_pop_summary_stats.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
## Remove Pukapuka from West Polynesian Pops


tanya <- read_csv(file = here("data/tanya_data.csv"), col_names = TRUE) %>%
filter(SUBJECT != "Blank") %>%
mutate(GOUT_orig_code = GOUT, GOUT = case_when(str_detect(GOUT,"Control") ~ 1,
GOUT %in% c("ACR Gout","GP Gout") ~ 2,
TRUE ~ NA_real_
),
T2D = case_when(TYPE2D == "No" ~ 1,
TYPE2D == "Yes" ~ 2,
TRUE ~ NA_real_
))

## west_pca_tanya.keep is the same as the current westpca.keep file (same PATIENT IDs)
keep_files <- list.files("/Volumes/scratch/merrimanlab/ben/genomic_predictions/data/", pattern = "pca.keep", full.names = TRUE)

pops <- map_dfr(keep_files, read_tsv, col_names = FALSE)
tanya_filtered <- tanya %>% filter(SUBJECT %in% pops$X1)

tanya_no_puka <- tanya_filtered %>% mutate(Westkeep= ifelse(ETH_SPECIFIC == "Pukapukan",0,1))

tanya_no_puka2 <- filter(tanya_no_puka, Westkeep == 1)

height <- tanya_no_puka2 %>% drop_na(HEIGHT)
BMI <- tanya_no_puka2 %>% drop_na(BMI)
HDL <- tanya_no_puka2 %>% drop_na(HDL)
T2D<- tanya_no_puka2 %>% drop_na(T2D)
GOUT <- tanya_no_puka2 %>% drop_na(GOUT)

library(dplyr)

# There are 2 Mixed Polys, these are NPH participants (NPH group wasnt run)
data.table::setDT(height)[,list(Mean=mean(HEIGHT), Max=max(HEIGHT), Min=min(HEIGHT), Mean=as.numeric(mean(HEIGHT)), Std=sd(HEIGHT)), by=ANALYSISGROUP_EASTWEST]
data.table::setDT(BMI)[,list(Mean=mean(BMI), Max=max(BMI), Min=min(BMI), Mean=as.numeric(mean(BMI)), Std=sd(BMI)), by=ANALYSISGROUP_EASTWEST]
data.table::setDT(HDL)[,list(Mean=mean(HDL), Max=max(HDL), Min=min(HDL), Median=as.numeric(median(HDL)), Std=sd(HDL)), by=ANALYSISGROUP_EASTWEST]

table(height$ANALYSISGROUP_EASTWEST)
table(BMI$ANALYSISGROUP_EASTWEST)
table(HDL$ANALYSISGROUP_EASTWEST)
table(T2D$ANALYSISGROUP_EASTWEST)
table(height$ANALYSISGROUP_EASTWEST)
table(height$ANALYSISGROUP_EASTWEST)
table(tanya$ANALYSISGROUP_EASTWEST)

# Gout and T2D tables
tanya_no_puka2 %>% group_by(T2D, ANALYSISGROUP_EASTWEST) %>% tally(sort = F)
tanya_no_puka2 %>% group_by(GOUT, ANALYSISGROUP_EASTWEST) %>% tally(sort = F)
tanya_no_puka2 %>% group_by(SEX, ANALYSISGROUP_EASTWEST) %>% tally(sort = F)
60 changes: 60 additions & 0 deletions scripts/make_sex_specific_pop_files.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
# Make sex specific Euro and West Poly keepfiles..

tanya <- read_csv(file = here("data/tanya_data.csv"), col_names = TRUE) %>%
filter(SUBJECT != "Blank") %>%
mutate(GOUT_orig_code = GOUT, GOUT = case_when(str_detect(GOUT,"Control") ~ 1,
GOUT %in% c("ACR Gout","GP Gout") ~ 2,
TRUE ~ NA_real_
),
T2D = case_when(TYPE2D == "No" ~ 1,
TYPE2D == "Yes" ~ 2,
TRUE ~ NA_real_
))


## Read in Euro and West (no Puka) keep files
westpca <- read.delim(file = "data/westnopukapca.keep", sep = "\t", header = F)
euro <- read.delim(file = "data/europca.keep", sep = "\t", header = F)



## List of Sex specific IDs
tanya_men <- filter(tanya, SEX == 1) %>% select(PATIENT, ANALYSISGROUP_EASTWEST)
tanya_women <- filter(tanya, SEX == 2) %>% select(PATIENT, ANALYSISGROUP_EASTWEST)

## Renaming column so that "euro" or "westpca" have matching columns.
names(tanya_men)[1] <- 'V1'
names(tanya_women)[1] <- 'V1'



euro_men <- filter(tanya_men, ANALYSISGROUP_EASTWEST == "European") %>% select(V1)
euro_women <- filter(tanya_women, ANALYSISGROUP_EASTWEST == "European") %>% select(V1)

west_men <- filter(tanya_men, ANALYSISGROUP_EASTWEST == "West Polynesian") %>% select(V1)
west_women <- filter(tanya_women, ANALYSISGROUP_EASTWEST == "West Polynesian") %>% select(V1)


## Make sex specific population files
euromale <- merge(euro,euro_men,all = F)
eurofemale <- merge(euro,euro_women, all = F)

westmale <- merge(westpca,west_men, all = F)
westfemale <- merge(westpca, west_women, all = F)
sum(is.na(westfemale$V1))
sum(is.na(westfemale$V2))
sum(is.na(westmale$V1))
sum(is.na(westmale$V2))

ifelse(westfemale$V1==westfemale$V2,"Yes","No")

## make keep files.
write_delim(euromale, file = "/Volumes/scratch/merrimanlab/ben/genomic_predictions/data/euromalepca.keep", col_names = F, delim = "\t")

write_delim(eurofemale, file = "/Volumes/scratch/merrimanlab/ben/genomic_predictions/data/eurofemalepca.keep", col_names = F, delim = "\t")

write_delim(westmale, file = "/Volumes/scratch/merrimanlab/ben/genomic_predictions/data/westmalepca.keep", col_names = F, delim = "\t")

write_delim(westfemale, file = "/Volumes/scratch/merrimanlab/ben/genomic_predictions/data/westfemalepca.keep", col_names = F, delim = "\t")


Empty file.
Loading