This pipeline runs bolt-lmm (Loh et al., Nat Genet 2015; Loh et al., Nat Genet 2018; https://alkesgroup.broadinstitute.org/BOLT-LMM/BOLT-LMM_manual.html) with UK Biobank data on the Imperial HPC cluster. It formats the data, divides them into chunks and runs the chunks through bolt-lmm in parallel.
The pipeline carries out association testing by running bolt-lmm on UKB imputed SNPs using a mixed model built on a subset of hard-called, PLINK-format UKB genotypes. It thus first fits its model on the PLINK-format genotypes and then applies the model to scan any provided imputed SNPs.
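Per chunk, the bolt-lmm call that the pipeline assembles has roughly the following shape (a sketch based on the BOLT-LMM manual, not on the pipeline code; all file names are placeholders, and the actual options, paths and chunking are determined by the pipeline and its configuration):
# illustrative sketch only: every file name below is a placeholder
# --lmm requests the default BOLT-LMM analysis (see the output description below);
# --statsFileBgenSnps receives the association statistics for the imputed SNPs
bolt \
    --bed=ukb_gen_chr{1:22}.bed \
    --bim=ukb_gen_chr{1:22}.bim \
    --fam=ukb_gen.fam \
    --remove=samples_to_remove.txt \
    --phenoFile=phenotype.txt \
    --phenoCol=my_phenotype \
    --covarFile=phenotype.txt \
    --covarCol=Sex \
    --qCovarCol=Age \
    --lmm \
    --LDscoresFile=LDSCORE.1000G_EUR.tab.gz \
    --numThreads=8 \
    --statsFile=genotype_snps.stats \
    --bgenFile=ukb_imp_chr1.bgen \
    --sampleFile=ukb_imp.sample \
    --statsFileBgenSnps=chunk_001.bolt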
The pipeline needs software and python packages installed in the environment path. On the Imperial HPC cluster, this is achieved by two mechanisms:
- Environment modules, which load installed software into the search path. This is handled by the pipeline itself; the required modules are listed in the configuration file.
- A Conda environment, which provides python, python packages and other software defined by the user. For instructions on how to use a conda environment see https://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/support/applications/python/.
When using conda for the first time on the cluster, you need to set it up for your environment:
module load anaconda3/personal
anaconda-setup
- Before running this pipeline for the first time, you have to create a Conda environment called 'bolt', using the environment.yml file in the config directory:
module load anaconda3/personal
conda env create --file /path/to/config/environment.yml
- If the environment.yml file has been modified, e.g. in a newer version of the pipeline, the environment can be updated like this:
conda env update --file /path/to/config/environment.yml
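To verify that the environment exists after creation or update, standard conda commands can be used:
conda env list    # the listing should include an environment called 'bolt'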
The pipeline run is configured by the yaml-format file config.yml. An example configuration file is located at /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/config/config.yml. Copy this file to a convenient location and edit the configuration to your needs. For pipeline tests, the example phenotype file /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/data/sample.phenotype.txt can be used.
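Copying the example configuration might look like this (the target directory is arbitrary):
cp /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/config/config.yml ./config.yml
The pipeline needs the following input files: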
- Phenotype file, containing phenotypes and covariates, with the first line containing column headers and subsequent lines containing records, one per individual. bolt-lmm requires this to be a whitespace-delimited file, so tab-delimited will do. The first two columns must be FID and IID (the PLINK identifiers of an individual). Any number of columns may follow. Values of -9 and NA are interpreted as missing data; all other values should be numeric. An example is sketched below this list.
- Sample information file for genotype data in .fam format.
- Sample information file for imputed data in Oxford .sample format (used in bolt --sampleFile argument).
- Data directory containing core SNP files (in .bed and .bim format) and imputed SNP files (in .bgen format). Currently these are the ukb_gen_chr*.bim, ukb_gen_chr*.bed, and ukb_imp_chr*.bgen files in /rds/general/project/uk-biobank-2017/live/reference/sdata_latest/ by default.
- (TODO) A file listing samples to remove (for the bolt --remove argument), e.g. samples in the fam-file that are missing from the sample-file. This is a header-less, tab-delimited text file; FID and IID must be the first two columns. If samples are missing and no such remove-file is provided, bolt-lmm produces a file listing the samples to remove and exits with an error. The generated file can subsequently be used as the missing-samples list.
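For illustration, the first lines of a phenotype file could look like this (all column names after FID and IID, and all values, are made-up examples; fields here are whitespace-separated, tabs work equally well):
FID      IID      Sex  Center  Age  PC1    my_phenotype
1000001  1000001  1    11010   52   -1.23  27.4
1000002  1000002  0    11011   61   0.85   NA
1000003  1000003  1    11010   -9   0.02   31.2
A remove-file simply lists FID and IID, one sample per line, without a header:
1000042  1000042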
To use columns in the phenotype file as covariates in the model, the config file has the following form:
cov-1: cat_cov1,...,cat_covn;quant_cov1,...,quant_covn
i.e. a comma-separated list of categorical covariates, followed by a semicolon, followed by a comma-separated list of quantitative covariates. For example:
- Categorical and quantitative covariates:
cov-1: Sex,Center;Age,PC1,PC2,PC3,PC4
- Quantitative covariates only:
cov-1: ;age,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10
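Presumably these two lists end up as bolt-lmm's --covarCol (categorical) and --qCovarCol (quantitative) arguments described in the BOLT-LMM manual; this is an assumption about the pipeline internals, not something verified against its code. The first example above would then correspond roughly to:
# assumed translation of 'cov-1: Sex,Center;Age,PC1,PC2,PC3,PC4'
--covarFile <phenotype file> --covarCol Sex --covarCol Center \
--qCovarCol Age --qCovarCol PC1 --qCovarCol PC2 --qCovarCol PC3 --qCovarCol PC4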
The program produces temporary files and directories, the location of which can be set with the 'tempdir' variable. These files take a lot of space; it is therefore recommended to choose a location under the ephemeral directory (default is /rds/general/user/$USER/ephemeral/). The variable temp-delete (True/False) determines whether the temporary directory is deleted at the end of the pipeline run.
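The corresponding fragment of config.yml could therefore look like this (only the keys cov-1, tempdir and temp-delete are taken from this documentation; the values shown are examples, and all other keys, e.g. file and directory paths, should be taken from the example configuration file):
# illustrative fragment of config.yml; values are examples only
cov-1: Sex,Center;Age,PC1,PC2,PC3,PC4
tempdir: /rds/general/user/$USER/ephemeral/bolt-tmp
temp-delete: True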
- Loading the conda environment:
module load anaconda3/personal
source activate
conda activate bolt
- Starting the pipeline:
python /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/bin/initialise-pipeline.py --config-file config.yml
- Pipeline help message:
python /rds/general/project/uk-biobank-2020/live/software/bolt-lmm-pipeline/bin/initialise-pipeline.py -h
The output of the pipeline is a text file *.bolt with the following columns:
| SNP | CHR | BP | GENPOS | ALLELE1 | ALLELE0 | A1FREQ | INFO | CHISQ_LINREG | P_LINREG | BETA | SE | CHISQ_BOLT_LMM_INF | P_BOLT_LMM_INF | CHISQ_BOLT_LMM | P_BOLT_LMM |
Note that the last two columns (CHISQ_BOLT_LMM, P_BOLT_LMM) can be missing. By default, the pipeline runs with the option --lmm, which, according to the bolt-lmm manual:
Performs default BOLT-LMM analysis, which consists of (1a) estimating heritability parameters, (1b) computing the BOLT-LMM-inf statistic, (2a) estimating Gaussian mixture parameters, and (2b) computing the BOLT-LMM statistic only if an increase in power is expected. If BOLT-LMM determines based on cross-validation that the non-infinitesimal model is likely to yield no increase in power, the BOLT-LMM (Bayesian) mixed model statistic is not computed.
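To inspect the results, standard text tools work on the output; for example, pulling genome-wide significant hits on the infinitesimal-model p-value (P_BOLT_LMM_INF is column 14 in the header listed above; the output file name output.bolt is an example):
awk 'NR == 1 || $14 < 5e-8' output.bolt > hits.txt    # keep the header plus significant rows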
Warning:
This pipeline has been tested rigorously and so far has been shown to yield correct and complete results for the provided input and configuration. However, it cannot be ruled out that problems occur in the HPC environment, such as nodes getting stuck, an unavailable file system, or a lack of storage space. For this reason, it is good practice to review the log files for possible error messages. It is also recommended to run some plausibility tests on the output file, e.g. checking that the number of variants meets the expectation by counting lines with the wc command, or checking that all chromosomes are represented in the output: cut -f 2 output.txt | uniq -c
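Spelled out, with the header line skipped and the chromosome column sorted before counting (assuming tab-delimited output and an output file called output.bolt, which is an example name):
wc -l output.bolt                                        # number of variants + 1 header line
tail -n +2 output.bolt | cut -f 2 | sort -n | uniq -c    # number of variants per chromosome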
- v0.0.4 (2022-11-24) conda environment
- v0.0.3 (2022-08-22) Documentation, config file
- v0.0.2 (2022-08-12) Code organised in functions, documentation, updated config file, subprocesses
- v0.0.1 (2022-07-28) First version running on HPC cluster
- variant annotation
- multiple models in parallel
- mail upon job completion
- check queues (medbio?)
- check warning: Overlap of sample file and fam file < 50%
- dedicated conda environment?