This pipeline runs bolt-lmm (Loh et al., Nat Genet 2015; Loh et al., Nat Genet 2018; https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html) with UK Biobank data on the Imperial HPC cluster. It formats the data, divides them into chunks and runs the chunks through bolt-lmm in parallel.
The pipeline carries out association testing by running bolt-lmm on UKB imputed SNPs using a mixed model built on a subset of hard-called, PLINK-format UKB genotypes. It thus first fits its model on the PLINK-format genotypes and then applies the model to scan any provided imputed SNPs.
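Per chunk, the bolt-lmm call that the pipeline assembles has roughly the following shape (a sketch based on the BOLT-LMM manual, not on the pipeline code; all file names are placeholders, and the actual options, paths and chunking are determined by the pipeline and its configuration):
# illustrative sketch only: every file name below is a placeholder
# --lmm requests the default BOLT-LMM analysis (see the output description below);
# --statsFileBgenSnps receives the association statistics for the imputed SNPs
bolt \
    --bed=ukb_gen_chr{1:22}.bed \
    --bim=ukb_gen_chr{1:22}.bim \
    --fam=ukb_gen.fam \
    --remove=samples_to_remove.txt \
    --phenoFile=phenotype.txt \
    --phenoCol=my_phenotype \
    --covarFile=phenotype.txt \
    --covarCol=Sex \
    --qCovarCol=Age \
    --lmm \
    --LDscoresFile=LDSCORE.1000G_EUR.tab.gz \
    --numThreads=8 \
    --statsFile=genotype_snps.stats \
    --bgenFile=ukb_imp_chr1.bgen \
    --sampleFile=ukb_imp.sample \
    --statsFileBgenSnps=chunk_001.bolt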
The pipeline needs software and python packages installed in the environment path. On the Imperial HPC cluster, this is achieved by two mechanisms:
- Environment modules, which load installed software into the search path. This is handled by the pipeline itself; the required modules are listed in the configuration file.
- A Conda environment, which provides python, python packages and other software defined by the user. For instructions on how to use a conda environment see https://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/support/applications/python/.
When using conda for the first time on the cluster, you need to set it up for your environment:
module load anaconda3/personal
anaconda-setup
- Before running this pipeline for the first time, you have to create a Conda environment called 'bolt', using the environment.yml file in the config directory:
module load anaconda3/personal
conda env create --file /path/to/config/environment.yml
- If the environment.yml file has been modified, e.g. in a newer version of the pipeline, the environment can be updated like this:
conda env update --file /path/to/config/environment.yml
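To verify that the environment exists after creation or update, standard conda commands can be used:
conda env list    # the listing should include an environment called 'bolt'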
The pipeline run is configured by the yaml-format file config.yml. An example configuration file is located at /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/config/config.yml. Copy this file to a convenient location and edit the configuration to your needs. For pipeline tests, the example phenotype file /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/data/sample.phenotype.txt can be used.
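Copying the example configuration might look like this (the target directory is arbitrary):
cp /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/config/config.yml ./config.yml
The pipeline needs the following input files: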
- Phenotype file, containing phenotypes and covariates, with the first line containing column headers and subsequent lines containing records, one per individual. bolt-lmm requires this to be a whitespace-delimited file, so tab-delimited will do. The first two columns must be FID and IID (the PLINK identifiers of an individual). Any number of columns may follow. Values of -9 and NA are interpreted as missing data; all other values should be numeric. An example is sketched below this list.
- Sample information file for genotype data in .fam format.
- Sample information file for imputed data in Oxford .sample format (used in bolt --sampleFile argument).
- Data directory containing core SNP files (in .bed and .bim format) and imputed SNP files (in .bgen format). Currently these are the ukb_gen_chr*.bim, ukb_gen_chr*.bed, and ukb_imp_chr*.bgen files in /rds/general/project/uk-biobank-2017/live/reference/sdata_latest/ by default.
- (TODO) A file listing samples to remove (for the bolt --remove argument), e.g. samples in the fam-file that are missing from the sample-file. This is a header-less, tab-delimited text file; FID and IID must be the first two columns. If samples are missing and no such remove-file is provided, bolt-lmm produces a file listing the samples to remove and exits with an error. The generated file can subsequently be used as the missing-samples list.
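For illustration, the first lines of a phenotype file could look like this (all column names after FID and IID, and all values, are made-up examples; fields here are whitespace-separated, tabs work equally well):
FID      IID      Sex  Center  Age  PC1    my_phenotype
1000001  1000001  1    11010   52   -1.23  27.4
1000002  1000002  0    11011   61   0.85   NA
1000003  1000003  1    11010   -9   0.02   31.2
A remove-file simply lists FID and IID, one sample per line, without a header:
1000042  1000042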
To use columns in the phenotype file as covariates in the model, the config file has the following form:
cov-1: cat_cov1,...,cat_covn;quant_cov1,...,quant_covn
i.e. a comma-separated list of categorical covariates, followed by a semicolon, followed by a comma-separated list of quantitative covariates. For example:
- Categorical and quantitative covariates:
cov-1: Sex,Center;Age,PC1,PC2,PC3,PC4
- Quantitative covariates only:
cov-1: ;age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
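Presumably these two lists end up as bolt-lmm's --covarCol (categorical) and --qCovarCol (quantitative) arguments described in the BOLT-LMM manual; this is an assumption about the pipeline internals, not something verified against its code. The first example above would then correspond roughly to:
# assumed translation of 'cov-1: Sex,Center;Age,PC1,PC2,PC3,PC4'
--covarFile <phenotype file> --covarCol Sex --covarCol Center \
--qCovarCol Age --qCovarCol PC1 --qCovarCol PC2 --qCovarCol PC3 --qCovarCol PC4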
The program produces temporary files and directories, the location of which can be set with the 'tempdir' variable. These files take a lot of space; it is therefore recommended to choose a location under the ephemeral directory (default is /rds/general/user/$USER/ephemeral/). The variable temp-delete (True/False) determines whether the temporary directory is deleted at the end of the pipeline run.
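The corresponding fragment of config.yml could therefore look like this (only the keys cov-1, tempdir and temp-delete are taken from this documentation; the values shown are examples, and all other keys, e.g. file and directory paths, should be taken from the example configuration file):
# illustrative fragment of config.yml; values are examples only
cov-1: Sex,Center;Age,PC1,PC2,PC3,PC4
tempdir: /rds/general/user/$USER/ephemeral/bolt-tmp
temp-delete: True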
- Loading the conda environment:
module load anaconda3/personal
source activate
conda activate bolt
- Starting the pipeline:
python /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/bin/initialise-pipeline.py --config-file config.yml
- Pipeline help message:
python /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/bin/initialise-pipeline.py -h
The output of the pipeline is a text file *.bolt with the following columns:
| SNP | CHR | BP | GENPOS | ALLELE1 | ALLELE0 | A1FREQ | INFO | CHISQ_LINREG | P_LINREG | BETA | SE | CHISQ_BOLT_LMM_INF | P_BOLT_LMM_INF | CHISQ_BOLT_LMM | P_BOLT_LMM |
Note that the last two columns (CHISQ_BOLT_LMM, P_BOLT_LMM) can be missing. By default, the pipeline runs with the option --lmm, which, according to the bolt-lmm manual:
Performs default BOLT-LMM analysis, which consists of (1a) estimating heritability parameters, (1b) computing the BOLT-LMM-inf statistic, (2a) estimating Gaussian mixture parameters, and (2b) computing the BOLT-LMM statistic only if an increase in power is expected. If BOLT-LMM determines based on cross-validation that the non-infinitesimal model is likely to yield no increase in power, the BOLT-LMM (Bayesian) mixed model statistic is not computed.
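To inspect the results, standard text tools work on the output; for example, pulling genome-wide significant hits on the infinitesimal-model p-value (P_BOLT_LMM_INF is column 14 in the header listed above; the output file name output.bolt is an example):
awk 'NR == 1 || $14 < 5e-8' output.bolt > hits.txt    # keep the header plus significant rows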
Warning:
This pipeline has been tested rigorously and so far has been shown to yield correct and complete results for the provided input and configuration. However, it cannot be ruled out that problems occur in the HPC environment, such as nodes getting stuck, an unavailable file system, or a lack of storage space. For this reason, it is good practice to review the log files for possible error messages. It is also recommended to run some plausibility tests on the output file, e.g. checking that the number of variants meets the expectation by counting lines with the wc command, or checking that all chromosomes are represented in the output: cut -f 2 output.txt | uniq -c
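Spelled out, with the header line skipped and the chromosome column sorted before counting (assuming tab-delimited output and an output file called output.bolt, which is an example name):
wc -l output.bolt                                        # number of variants + 1 header line
tail -n +2 output.bolt | cut -f 2 | sort -n | uniq -c    # number of variants per chromosome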
- v0.0.4 (2022-11-24) conda environment
- v0.0.3 (2022-08-22) Documentation, config file
- v0.0.2 (2022-08-12) Code organised in functions, documentation, updated config file, subprocesses
- v0.0.1 (2022-07-28) First version running on HPC cluster
- variant annotation
- multiple models in parallel
- mail upon job completion
- check queues (medbio?)
- check warning: Overlap of sample file and fam file < 50%
- dedicated conda environment?