A python package for learning mutational signatures and their multidimensional genomic properties

gerstung-lab/tensorsignatures

 
 

TensorSignatures

TensorSignatures is a tensor factorisation framework for mutational signature analysis. In contrast to other methods, it deciphers mutational processes not only in terms of mutational spectra but also assesses their properties with respect to various genomic variables, allows the inclusion of different mutation types, and integrates a robust noise model to perform the inference.

TensorSignatures is a young project and breaking changes are to be expected. We keep a changelog in which possible breakage will be clearly documented.

Quick install

TensorSignatures makes use of the TensorFlow 1.5.x framework, which requires the user to install a separate package to enable GPU support, i.e. tensorflow-gpu instead of tensorflow. We highly recommend installing TensorSignatures into an environment with tensorflow-gpu, as the tensor computations benefit greatly from GPU acceleration.
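
To see which of the two TensorFlow distributions an environment currently holds, a small check along the following lines may help. This is a minimal sketch using only the Python standard library; the function name is hypothetical and not part of TensorSignatures, and the fallback import assumes the importlib_metadata backport is installed on Python versions older than 3.8.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Assumption: the importlib_metadata backport is installed on older Pythons.
    import importlib_metadata as metadata


def installed_tensorflow_flavour():
    """Return the name of the installed TensorFlow distribution, or None.

    Hypothetical helper for illustration; not part of TensorSignatures.
    """
    # Check the GPU build first, since it is the recommended one.
    for name in ("tensorflow-gpu", "tensorflow"):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        return name
    return None


if __name__ == "__main__":
    flavour = installed_tensorflow_flavour()
    if flavour is None:
        print("No TensorFlow distribution found in this environment.")
    else:
        print("Found {} {}".format(flavour, metadata.version(flavour)))
```

The check only reports which distribution is installed; it does not verify that a GPU is actually visible to TensorFlow.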

Via GitHub

To obtain the most recent version of TensorSignatures, we recommend cloning the repository directly from GitHub and installing the package into a virtual environment. To get started, clone the repository by executing the following command in your terminal:

$ git clone https://github.com/gerstung-lab/tensorsignatures.git && cd tensorsignatures

Then, create a new virtual environment and install all dependencies. If you have access to a GPU with CUDA support, use requirements-gpu.txt instead of requirements.txt.

$ python -m venv env
$ source env/bin/activate
$ pip install --upgrade pip setuptools wheel && pip install -r requirements.txt

Finally, install TensorSignatures.

$ python setup.py install

Via PyPI

To install tensorsignatures via PyPI, simply type

$ pip install tensorsignatures

into your shell.

Via Docker (& Jupyter)

To run TensorSignatures within a docker environment, clone the repository

$ git clone https://github.com/gerstung-lab/tensorsignatures.git
$ cd tensorsignatures

and spin up the container using docker-compose

$ docker-compose up --build

This starts a Jupyter server, including notebooks with tutorials, on http://localhost:8889.

Getting started

Step 1: Data preparation

Running TensorSignatures involves three steps: preparing the input data, i.e. creating the mutation count tensor as well as the mutation count matrix, computing a trinucleotide normalisation to account for differences in the nucleotide composition of different genomic regions, and running TensorSignatures.

Preparing input data using docker

We provide a Docker image that contains all R and Bioconductor dependencies needed to create the variant tensor and the other mutation type matrix. To use it, pull the image from Docker. Note that the image is approximately 5 GB in size.

$ docker pull sagar87/tensorsignatures-data:latest

To use the image, switch into the folder containing your VCF data. Then run the image using the following command, supplying the VCF files followed by the name of the hdf5 output file (which must be the last argument).

$ docker run -v $PWD:/usr/src/app/mount sagar87/tensorsignatures-data <vcf1.vcf> <vcf2.vcf> ... <vcfn.vcf> <output.h5>

Then continue with Step 2.

Preparing input data using a custom installation

Make sure you have R 3.4.x (!) and the packages VariantAnnotation and rhdf5 installed. You can install them, if necessary, by executing

$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('VariantAnnotation')"

and

$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('rhdf5')"

from your command line.

To get started, download the following files and place them in the same directory:

Constants.RData (contains GRanges objects that annotate transcription/replication orientation, nucleosomal and epigenetic states)

mutations.R (all required functions to partition SNVs, MNVs and indels)

processVcf.R (loads VCF files and creates the SNV count tensor and the MNV and indel count matrix; it may need custom modification to run on your VCFs)

genome.zip

To obtain the SNV count tensor and the matrices containing other mutation types, execute processVcf.R and pass the VCF files you want to convert, as well as a name for an output hdf5 file as command line arguments, e.g.

$ Rscript processVcf.R <vcf1.vcf> <vcf2.vcf> ... <vcfn.vcf> <output.h5>

In case of errors, please check whether you have correctly specified the paths in lines 6-8. Also, take a look at the readVcfSave function and adjust it if it fails.
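
When many VCF files are involved, the call above can be assembled programmatically. The following is a minimal sketch that mirrors the documented argument order (VCF files first, hdf5 output name last); the helper names are hypothetical and the sketch assumes that Rscript is on the PATH and that processVcf.R sits in the current working directory.

```python
import subprocess
from pathlib import Path


def build_process_vcf_command(vcf_dir, output_h5, script="processVcf.R"):
    """Build the Rscript call: VCF files first, hdf5 output name last.

    Hypothetical helper; the script name and argument order follow the
    command shown in this README.
    """
    vcfs = sorted(str(p) for p in Path(vcf_dir).glob("*.vcf"))
    if not vcfs:
        raise FileNotFoundError("no .vcf files found in {}".format(vcf_dir))
    return ["Rscript", script] + vcfs + [output_h5]


def run_process_vcf(vcf_dir, output_h5):
    """Run the conversion; raises CalledProcessError if Rscript fails."""
    subprocess.run(build_process_vcf_command(vcf_dir, output_h5), check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.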

Step 2: Computing trinucleotide normalisation

TensorSignatures requires a trinucleotide normalisation constant to account for differences in the nucleotide composition of genomic states. To compute it, invoke the prep subroutine of TensorSignatures and pass the hdf5 file from Step 1 as well as the path for the output file as positional arguments to the programme.

$ tensorsignatures prep <output.h5> <tsdata.h5>
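
To illustrate why such a normalisation is needed, the sketch below compares the trinucleotide composition of one sequence with that of a reference sequence. It is a toy illustration of the general idea only and is not the computation performed by the prep subroutine; both function names are hypothetical.

```python
from collections import Counter


def trinucleotide_counts(sequence):
    """Count overlapping trinucleotides, skipping any that contain non-ACGT bases."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if set(tri) <= set("ACGT"):
            counts[tri] += 1
    return counts


def relative_composition(region_seq, reference_seq):
    """Ratio of trinucleotide frequencies in a region versus a reference.

    A ratio above 1 means the trinucleotide is over-represented in the
    region, so more mutations in that context are expected by chance.
    """
    region = trinucleotide_counts(region_seq)
    reference = trinucleotide_counts(reference_seq)
    region_total = sum(region.values())
    reference_total = sum(reference.values())
    ratios = {}
    for tri, ref_count in reference.items():
        region_freq = region[tri] / region_total if region_total else 0.0
        ratios[tri] = region_freq / (ref_count / reference_total)
    return ratios
```

For example, relative_composition("ACGACG", "ACGT") reports a ratio of 1.0 for ACG and 0.0 for CGT, since CGT does not occur in the first sequence.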

Step 3: Run TensorSignatures

There are two ways to run TensorSignatures: the refit option, which fits the exposures of a set of pre-defined signatures extracted from the PCAWG cohort to your dataset, or the train subroutine, which performs a de novo extraction of tensor signatures. Refitting tensor signatures is computationally fast but cannot discover new signatures, while extracting new signatures from scratch is computationally intensive (GPU required) and ideally requires a larger number of samples. For most use cases with a small number of samples, we advise using the refit option:

$ tensorsignatures --verbose refit tsData.h5 refit.pkl -n
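
Steps 2 and 3 can be chained from Python when the workflow needs to be scripted. The following is a minimal sketch that only reuses the command-line arguments shown in this README; the helper names are hypothetical and the sketch assumes the tensorsignatures executable is on the PATH.

```python
import subprocess


def prep_command(step1_h5, ts_data_h5):
    """Step 2: compute the trinucleotide normalisation."""
    return ["tensorsignatures", "prep", step1_h5, ts_data_h5]


def refit_command(ts_data_h5, result_pkl):
    """Step 3: refit the pre-defined signatures, as in the command above."""
    return ["tensorsignatures", "--verbose", "refit", ts_data_h5, result_pkl, "-n"]


def run_refit_pipeline(step1_h5, ts_data_h5, result_pkl):
    """Run prep and refit in order, stopping at the first failure."""
    commands = (
        prep_command(step1_h5, ts_data_h5),
        refit_command(ts_data_h5, result_pkl),
    )
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

A call such as run_refit_pipeline("output.h5", "tsData.h5", "refit.pkl") would then run both steps with the file names used in this README.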

To run a de novo extraction, use

$ tensorsignatures --verbose train tsData.h5 denovo.pkl <rank> -k <size> -n -ep <epochs>

where rank specifies the decomposition rank, size controls the dispersion of the model, and epochs sets the number of epochs used to fit the model. TensorSignatures outputs the value of the objective function (log likelihood) that is minimised during training, as well as the change of the objective during an epoch interval (delta). When deciding on the number of epochs, ensure that it is large enough for the objective function to converge, i.e. the delta value is close to, or fluctuates around, zero. For more information on how to run TensorSignatures in a practical setting, see the documentation. Running TensorSignatures yields a pickle dump, which can subsequently be inspected using the tensorsignatures package.
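
The convergence criterion described above can be expressed as a simple check on the logged objective values. This is a minimal sketch, assuming the objective values have been collected into a plain Python list; the function name, the window size and the tolerance are hypothetical choices and not settings of TensorSignatures.

```python
def has_converged(objective_values, window=5, tolerance=1e-3):
    """Return True when the recent changes in the objective average out near zero.

    objective_values: objective value recorded at successive epoch intervals.
    window: number of most recent deltas to average (hypothetical default).
    tolerance: largest absolute mean delta still counted as converged.
    """
    if len(objective_values) < window + 1:
        return False  # not enough history to judge
    recent = objective_values[-(window + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    # Averaging lets deltas fluctuate around zero without blocking convergence.
    mean_delta = sum(deltas) / len(deltas)
    return abs(mean_delta) <= tolerance
```

If the check still fails at the end of a run, the number of epochs was probably too small and the model should be trained for longer.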

Features

  • Run tensorsignatures on your dataset using the TensorSignature class provided by the package or via the command line tool.
  • Compute percentile-based bootstrap confidence intervals for inferred parameters.
  • Basic plotting tools to visualize tensor signatures and inferred parameters.
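
For readers unfamiliar with percentile-based bootstrap confidence intervals, the idea can be sketched in a few lines. This is a generic standard-library illustration on a toy statistic, not the bootstrap routine shipped with the package; the function name and all defaults are hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(values, statistic=statistics.mean,
                            n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic of `values`.

    Resamples `values` with replacement, recomputes the statistic each
    time, and returns the alpha/2 and 1 - alpha/2 percentiles of the
    resulting estimates.
    """
    rng = random.Random(seed)  # fixed seed for reproducible intervals
    n = len(values)
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((alpha / 2) * (n_resamples - 1))
    upper_index = int((1 - alpha / 2) * (n_resamples - 1))
    return estimates[lower_index], estimates[upper_index]


if __name__ == "__main__":
    data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]
    low, high = percentile_bootstrap_ci(data)
    print("95% CI for the mean: [{:.2f}, {:.2f}]".format(low, high))
```

The interval is read directly off the sorted bootstrap estimates, which is what distinguishes the percentile method from variants that assume a normal sampling distribution.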

Credits

  • Harald Vöhringer and Moritz Gerstung
