This repository contains the pipeline and configuration used for generating and evaluating the Genome-in-a-Bottle stratifications. The stratification files were developed with the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team, the Genome in a Bottle Consortium (GIAB), and the Telomere-to-Telomere Consortium (T2T). They are intended as a standard resource of BED files for use in stratifying true positive, false positive, and false negative variant calls in challenging and targeted regions of the genome.
The stratification BED files are only accessible on the GIAB FTP site. They can be used with GA4GH benchmarking tools such as hap.py.
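To make the idea of stratifying calls concrete, the following is a minimal Python sketch. It is not how hap.py or this pipeline works internally; it only illustrates the concept. It assumes single-position calls (1-based, as in VCF), BED intervals that do not overlap within a file, and hypothetical function names (`parse_bed`, `in_regions`, `stratify`) that do not come from this repository.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: sorted [(start, end), ...]}.

    BED coordinates are 0-based and half-open: [start, end).
    """
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions[chrom].append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return dict(regions)


def in_regions(regions, chrom, pos):
    """Return True if the 1-based position falls inside an interval.

    Assumption: intervals on each chromosome do not overlap.
    """
    intervals = regions.get(chrom, [])
    pos0 = pos - 1  # convert the 1-based position to 0-based
    # Index of the last interval whose start is <= pos0.
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def stratify(calls, regions):
    """Count call labels (e.g. TP/FP/FN) inside and outside the regions."""
    counts = {"inside": Counter(), "outside": Counter()}
    for chrom, pos, label in calls:
        key = "inside" if in_regions(regions, chrom, pos) else "outside"
        counts[key][label] += 1
    return counts


if __name__ == "__main__":
    bed = ["chr1\t100\t200", "chr1\t500\t600"]
    calls = [("chr1", 150, "TP"), ("chr1", 201, "FP"),
             ("chr1", 200, "FN"), ("chr2", 10, "TP")]
    print(stratify(calls, parse_bed(bed)))
```

Real benchmarking tools also handle indels and other variants that span several bases, which this sketch deliberately ignores.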
Note that this pipeline applies to stratification versions starting with v3.2. Older pipelines and scripts can be found here.
Within this repository, further details (including all source files) can be found in the configuration at config/all.yml. The pipeline itself is maintained in a separate, reference-agnostic repository here. Please refer to this pipeline for specific methodological information regarding how these files are produced.
Currently, this pipeline is set up to build stratifications for the following references:
- GRCh37
- GRCh38
- CHM13
Version tags follow the format X.Y-Z, where X is incremented for removal of old stratifications and/or large deletions, Y is incremented for additions and minor revisions, and Z is incremented for documentation updates in this repository. Note that X.Y (without Z) corresponds to the versions located on the FTP site.
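As an illustration of the tag scheme, here is a small Python sketch that splits a tag into its three components and derives the matching FTP version. The names `StratVersion` and `parse_tag` are hypothetical, and accepting an optional leading "v" is an assumption based on how versions such as v3.2 are written elsewhere in this README.

```python
import re
from typing import NamedTuple


class StratVersion(NamedTuple):
    major: int  # X: removals of old stratifications and/or large deletions
    minor: int  # Y: additions and minor revisions
    doc: int    # Z: documentation updates in this repository

    @property
    def ftp_version(self) -> str:
        """The X.Y part, which is what the FTP site uses."""
        return f"{self.major}.{self.minor}"


# Assumption: tags may or may not carry a leading "v".
_TAG_RE = re.compile(r"^v?(\d+)\.(\d+)-(\d+)$")


def parse_tag(tag: str) -> StratVersion:
    """Parse a tag of the form X.Y-Z into its integer components."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"not an X.Y-Z version tag: {tag!r}")
    return StratVersion(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    version = parse_tag("v3.2-1")
    print(version, version.ftp_version)
```

Because the result is a tuple of integers, versions compare numerically rather than as strings, so 3.10-0 sorts after 3.9-4.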
For those looking for precise differences between each stratification version, see the stratdiff tool.
Author Information
- Principal Investigator: Justin Zook, NIST
- Nate Olson, NIST
- Justin Wagner, NIST
- Jennifer McDaniel, NIST
- Nate Dwarshuis, NIST
Licenses/restrictions placed on the data, or limitations of reuse: Publicly released data are freely available for reuse without embargo.
Citations for stratifications are located in the associated READMEs.
If stratifications were used in benchmarking with GA4GH/GIAB best practices or hap.py, please reference:
Krusche, P., Trigg, L., Boutros, P.C. et al.
Best practices for benchmarking germline small-variant calls in human genomes.
Nat Biotechnol 37, 555-560 (2019). https://doi.org/10.1038/s41587-019-0054-x
- Individual stratification BED files as well as zipped directories (tar.gz) of files
- stratification READMEs
- .tsvs for benchmarking with hap.py
- MD5 checksums
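After downloading, the files can be checked against the MD5 checksums. The following Python sketch shows one way to do this. It assumes the checksum file uses the common `md5sum` layout (a hex digest, whitespace, then a path relative to the checksum file), which this README does not confirm; `md5sum` and `verify_checksums` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file):
    """Return the names of files whose MD5 digest does not match.

    Assumption: each non-empty line looks like '<digest>  <relative path>',
    as written by the md5sum command line tool.
    """
    checksum_file = Path(checksum_file)
    base = checksum_file.parent
    mismatched = []
    for line in checksum_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # md5sum marks binary mode with a leading '*'
        if md5sum(base / name) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty list from `verify_checksums` means every listed file matched its recorded digest.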
Stratifications can be binned into several types: Low Complexity, Functional Technically Difficult, Genome Specific, Functional Regions, GC Content, Mappability, Other Difficult, Segmental Duplications, Union, Ancestry, Telomeres, and XY. General information for each stratification type is provided below.
These categories are present for all indicated references unless otherwise noted.
Regions with different types and sizes of low complexity sequence, e.g., homopolymers, STRs, VNTRs and other locally repetitive sequences.
Regions with segmental duplications (generally defined as repeated regions >1kb with >90% similarity).
Chromosome X- and Y-specific regions, such as PAR, XTR, or ampliconic regions.
Regions to stratify variants inside and outside coding regions.
Regions with different ranges (%) of GC content.
Regions where short read mapping can be challenging.
Telomeric regions. These are included for CHM13 only.
Unions of different general types of difficult regions, or of any type of difficult region or complex variant. For example, performance can be measured in just the "easy" or "all difficult" regions of the genome.
Regions with inferred patterns of local ancestry.
These only exist for GRCh38.
Difficult regions due to potentially difficult variation in a NIST/GIAB sample, including 1) regions containing putative compound heterozygous variants, 2) small regions containing multiple phased variants, and 3) regions with potential structural or copy number variation.
These only exist for GRCh37 and GRCh38.
Functional, or potentially functional, regions that are also likely to be technically difficult to sequence.
These only exist for GRCh37 and GRCh38.
Regions that are difficult to sequence for miscellaneous reasons. In the case of GRCh37 and GRCh38, these include regions near gaps or errors in the reference, or regions that are known to be expanded or collapsed relative to most other genomes.
This category also includes highly polymorphic regions, such as those encoding the major histocompatibility complex (MHC), the VDJ regions, and the regions encoding the killer-cell immunoglobulin-like receptor (KIR). Their high polymorphism makes them difficult to sequence.
Finally, in the case of CHM13, this category also includes the rDNA arrays.
This data/work was created by employees of the National Institute of Standards and Technology (NIST), an agency of the Federal Government. Pursuant to title 17 United States Code Section 105, works of NIST employees are not subject to copyright protection in the United States. This data/work may be subject to foreign copyright.
The data/work is provided by NIST as a public service and is expressly provided AS IS. NIST MAKES NO WARRANTY OF ANY KIND, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NON-INFRINGEMENT AND DATA ACCURACY. NIST does not warrant or make any representations regarding the use of the data or the results thereof, including but not limited to the correctness, accuracy, reliability or usefulness of the data. NIST SHALL NOT BE LIABLE AND YOU HEREBY RELEASE NIST FROM LIABILITY FOR ANY INDIRECT, CONSEQUENTIAL, SPECIAL, OR INCIDENTAL DAMAGES (INCLUDING DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION, LOSS OF BUSINESS INFORMATION, AND THE LIKE), WHETHER ARISING IN TORT, CONTRACT, OR OTHERWISE, ARISING FROM OR RELATING TO THE DATA (OR THE USE OF OR INABILITY TO USE THIS DATA), EVEN IF NIST HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
To the extent that NIST may hold copyright in countries other than the United States, you are hereby granted the non-exclusive irrevocable and unconditional right to print, publish, prepare derivative works and distribute the NIST data, in any medium, or authorize others to do so on your behalf, on a royalty-free basis throughout the world.
You may improve, modify, and create derivative works of the data or any portion of the data, and you may copy and distribute such modifications or works. Modified works should carry a notice stating that you changed the data and should note the date and nature of any such change. Please explicitly acknowledge the National Institute of Standards and Technology as the source of the data: Data citation recommendations are provided at https://www.nist.gov/open/license.
Permission to use this data is contingent upon your acceptance of the terms of this agreement and upon your providing appropriate acknowledgments of NIST’s creation of the data/work.
This repository's existence on GitHub is primarily for documentation purposes, as the runtime is defined in terms of NIST-specific resources.
However, for those wishing to run the pipeline themselves, there are two options. Both require cloning this repo with submodules and setting up the conda env:
git clone --recurse-submodules https://github.com/ndwarshuis/giab-stratifications.git
cd giab-stratifications
conda env create -f env.yml
conda activate giab-strats-production
Just call the pipeline directly using snakemake:
snakemake --cores 20 --configfile=config/all.yml all
Note that most rules in the pipeline require 12MB of memory or less. The exception is mappability (which runs GEM), which requires ~32GB for references of the size used here.
First, create a profile based on the one located at workflow/profiles/nisaba/config.yml (which is designed for a NIST-specific cluster that runs Slurm).
Then run snakemake with this profile:
snakemake --cores 20 --configfile=config/all.yml --profile=workflow/profiles/yerprofile all