- Automatically calculate `coverage_interval` based on coverage calculations, avoiding need to set this directly in input configuration.
- Update vt decompose to handle additional multi-allelic adjustments including all format attributes, providing full support for new GEMINI changes. Thanks to Brent Pedersen and Adrian Tan.
- Add `default` configuration target to `bcbio_system.yaml`, reducing the need to set program specific arguments for everything.
- Ensure `resources` specified in input YAML get passed to global system configuration for making parallelization decisions. Thanks to Miika Ahdesmaki.
- Run upload process on distributed machines, allowing upload to S3 on AWS to take advantage of machines with multiple cores. Thanks to Lorena Pantano.
- Re-write interactions with external object stores like S3 to be more general and incorporate multiple regions and future support for non-S3 storage.
- Scale local jobs by total memory usage when memory constrains resource usage instead of cores. Thanks to Sven-Eric Schelhorn and Lorena Pantano.
- Disambiguation: improve parallelization by disambiguating on split alignment parts prior to merging. Thanks to Sven-Eric Schelhorn.
- Disambiguation: ensure ambiguous and other organism reads are sorted, merged and passed to final upload directory. Thanks to Sven-Eric Schelhorn.
- Fix problem with sambamba name sorting not being compatible with samtools. Thanks to Sven-Eric Schelhorn.
- FreeBayes: update to latest version (0.9.21-7) with validation (http://imgur.com/a/ancGz).
- Allow bz2 files in bcbio_prepare_sample.py script.
- Ensure GEMINI statistics run for project summary file. Thanks to Luca Beltrame.
- Better error checking for booleans in input configuration. Thanks to Daryl Waggott.
- Implement qualimap for RNAseq QC metrics, but not active yet.
- Run snpEff 4.1 in back-compatibility mode to work with GEMINI database loading. Fixes snpEff 4.1/GEMINI effects loading.
- Add PED file to GEMINI database load, containing family, gender and phenotype information from bcbio metadata. Thanks to Luca Beltrame and Roy Ronen.
- Enable specification of input PED files into template creation, extracting family, gender and phenotype information. Any sample rows from PED files get used when creating the GEMINI database.
- Fix preparation of multi-allelic inputs to GEMINI by implementing custom merge of bi-allelic and split multi-allelic. Previous implementation using GATK CombineVariants re-merged some split multi-allelic, losing effects annotations.
- Skip contig order naming checking with bedtools 2.23.0+ to avoid potential issues with complex naming schemes.
- Installation and upgrade: Set pip SSL certificates to point at installed conda SSL package if present. Avoids SSL errors when pip can't find system certificates. Thanks to Andrew Oler.
- Enable support for PBSPro schedulers through ipython-cluster-helper.
- Calculate high depth regions with more than 20x median coverage as targets for filtering in structural variants. Attempts to detect and avoid spurious calls in repetitive regions.
- Support snpEff 4.1, including re-download of snpEff databases on demand if out of sync with older versions.
- Split multi-allelic variants into bi-allelic calls prior to loading into GEMINI, since it only handles bi-allelic inputs. Thanks to Pär Larsson.
- Pass ploidy to GATK HaplotypeCaller, supporting multiple ploidies and correct calling of X/Y/MT chromosomes. Requires GATK 3.3.
- Remove extra 'none' sample when calling tumor-only samples using MuTect. Harmonizes headers with other tumor-only callers and enables tumor-only ensemble calling. Thanks to Miika Ahdesmaki.
- Perform variant prioritization as part of tumor-only calling, using population based frequencies like 1000 genomes and ExAC and presence in known disease causing databases like COSMIC and Clinvar.
- Switch to samtools sort from sambamba sort during alignment streaming. Saves steps in processing and conversions on single sample no deduplication inputs.
- On AWS, download inputs from S3 instead of streaming into fastq preparation to avoid issues with converting BAM to fasta. Thanks to Roy Ronen.
- Provide better defaults for mincores that packs together multiple single IPython processes on a single cluster request -- use core specification from input configuration. Thanks to Miika Ahdesmaki.
- No longer keep INFO fields with `vcfallelicprimitives` in FreeBayes, Platypus and Scalpel calling to prevent introduction of problematic fields for multi-allelic MNPs.
- Fix batching problem when using `coverage` and multiple shared batches like a global normal in cancer calling. Thanks to Luca Beltrame.
- Use `mincores` specification to ipython-cluster-helper to combine single core jobs into a single submission job for better memory sharing on resource constrained systems.
- Move disambiguation split work inside parallel framework so download and preparation occurs on worker nodes or inside Docker containers. Enables on demand download of disambiguation genomes.
- Ensure population databases created when some inputs do not have variant calls.
- Switch to seaborn as matplotlib wrapper, from prettyplotlib.
- Fixes for ensemble structural variant calling on single samples.
- Fixes for mixing joint and pooled calling in a single configuration file.
- Support for qSNP for tumor-normal calling.
- Add eXpress to RNA-seq pipeline.
- Add transcriptome-only mapping with STAR, bowtie2 or bwa.
- Change logging time stamps to be UTC and set explicitly as ISO 8601 compliant output. Improves benchmarking analysis and comparability across runs.
- Add support for RNA-seq variant calling with HaplotypeCaller
- Fix parallelization of DEXSeq.
- Improvements in VarDict calling on somatic samples.
- Fix compatibility issue with bedtools 2.22.0 when calculating genome coverage.
- Fix joint calling upload to avoid redundant inclusion of full VCF file in individual sample directories.
- Fixes for inclusion of GATK jars inside Docker containers when running distributed jobs.
- Enable generation of STAR indexes on demand to handle running STAR on AWS instances.
- Re-organize code to prepare samples and reference genomes so it runs inside distributed processing components. This isolates process to Docker containers on AWS and also enables complex operations like preparing reference genomes on demand.
- Improve tumor/normal calling with FreeBayes, MuTect, VarDict and VarScan by validating against DREAM synthetic 3 data.
- Validate ensemble based calling for somatic analysis using multiple callers.
- Improve ability to run on Amazon AWS, including up to date interaction with files originally stored in S3 and transfer to S3 on completion with encryption.
- Avoid race conditions during `bedprep` work on samples with shared input BED files. These are now processed sequentially on a single machine to avoid conflicts. Thanks to Justin Johnson.
- Add data checks and improved flexibility when specifying joint callers. Thanks to Luca Beltrame.
- Default to a reduced number of split regions (`nomap_split_targets` defaults to 200 instead of 2000) to avoid controller memory issues with large sample sizes.
- Avoid re-calculating depth metrics when running post variant calling annotation with GATK to provide accurate metrics on high depth samples. Thanks to Miika Ahdesmaki.
- Consistently keep annotations and genotype information for split MNPs from vcfallelicprimitives. Thanks to Pär Larsson.
- Enable VQSR for large batches of exome samples (50 or more together) to coincide with joint calling availability for large populations.
- Support retrieval of GATK and MuTect jars from S3 to enable integration with bcbio inside Docker.
- Bump pybedtools version to avoid potential open file handle issues. Thanks to Ryan Dale.
- Move to bgzipped and indexed human_ancestor.fa for LOFTEE to support access with new samtools that no longer uses razip.
- Fix bug in creating shared regions for analysis when using a single sample in multiple batches: for instance, when using a single normal sample for multiple tumors. Thanks to Miika Ahdesmaki.
- Unify approach to creating temporary directories. Allows specification of a global temporary directory in `resources: tmp:` used for all transactions. This enables full use of local temporary space during processing, with results transferred to the shared filesystem on completion.
- Fix issues with concatenating files that fail to work with GATK's CatVariants. Fall back to bcftools concat which correctly handles problem headers and overlapping segments.
- Enable flexible specification of `indelcaller` for `variantcaller` targets that do not have integrated indel methods. Thanks to Miika Ahdesmaki.
- Move to samtools 1.0 release. Update samtools variant calling to support new multiallelic approach.
- Improve Platypus integration: correctly pass multiple BAM files, make use of assembler, split MNPs, and correctly restrict to variant regions.
- Be more aggressive with system memory usage to try and make better use of available resources. The hope is to take advantage of Java memory fixes that previously forced us to be conservative.
- Support joint recalling with GATK HaplotypeCaller, FreeBayes and Platypus. The `jointcaller` configuration variable enables calling concurrently in large populations by independently calling on samples, then combining into a final callset with no-call/reference calls at any position called independently.
- Add qsignature tool to standard and variant analyses, which helps identify sample swaps. Add `mixup_check` configuration variable to enable.
- Fix issue with merging GATK produced VCF files with vcfcat by swapping to GATK's CatVariants. Thanks to Matt De Both.
- Initial support for ensemble calling on cancer tumor/normal calling. Now available for initial validation work. Thanks to Miika Ahdesmaki.
- Enable structural variant analyses on shared batches (two tumors with same normal). Thanks to Miika Ahdesmaki.
- Avoid Java out of memory errors for large numbers of running processes by avoiding Parallel GC collection. Thanks to Justin Johnson and Miika Ahdesmaki.
- Enable streaming S3 input to RNA-seq and variant processing. BAM and fastq inputs can stream directly into alignment and trimming steps.
- Speed improvements for re-running samples with large numbers of samples or regions.
- Improved cluster cleanup by providing better error handling and removal of controllers and engines in additional failure cases.
- Support variant calling for organisms without dbSNP files. Thanks to Mark Rose.
- Support the SNAP aligner, which provides improved speed on systems with larger amount of memory (64Gb for human genome alignment).
- Support the Platypus haplotype based variant caller for germline samples with both batched and joint calling.
- Fix GATK version detection when `_JAVA_OPTIONS` specified. Thanks to Miika Ahdesmaki.
- Use msgpack for ipython serialization to reduce message sizes and IPython controller memory instead of homemade json/zlib approach.
- Change defaults for installation: do not use sudo by default and require `--sudo` flag for installing system packages. No longer includes default genomes or aligners to enable more minimal installations. Users install genomes by specifically enumerating them on the command line.
- Add support for Ensembl variant effects predictor (VEP). Enables annotation of variants with dbNSFP and LOFTEE. Thanks to Daniel MacArthur for VEP suggestion.
- Support CADD annotations through new GEMINI database creation support.
- Rework parallelization during variant calling to enable additional multicore parallelization for effects prediction with VEP and backfilling/squaring off with bcbio-variation-recall.
- Rework calculation of callable regions to use bedtools/pybedtools thanks to groupby tricks from Aaron Quinlan. Improves speed and memory usage for coverage calculations. Use local temporary directories for pybedtools to avoid filling global temporary space.
- Improve parallel region generation to avoid large numbers of segments on organisms with many chromosomes.
- Initial support for tumor normal calling with VarDict. Thanks to Miika Ahdesmaki and Zhongwu Lai.
- Provide optional support for compressing messages on large IPython jobs to reduce memory usage. Enable by adding `compress_msg` to `algorithm` section of `bcbio_system.yaml`. There will be additional testing in future releases before making the default, and this may be replaced by new methods like transit (https://github.com/cognitect/transit-python).
- Add de-duplication support back for pre-aligned input files. Thanks to Severine Catreux.
- Generalize SGE support to handle additional system setups. Thanks to Karl Gutwin.
- Add reference guided transcriptome assembly with Cufflinks along with functions to classify novel transcripts as protein coding or not as well as generally clean the Cufflinks assembly of low quality transcripts.
- Developer: provide datadict.py with encapsulation functions for looking up and setting items in the data dictionary.
- Unit tests fixed. Unit test data moved to external repository: https://github.com/roryk/bcbio-nextgen-test-data
- Add exon-level counting with DEXseq.
- Bugfix: Fix for Tophat setting the PI flag as inner-distance-size and not insert size.
- Added kraken support for contamination detection (@lpatano): http://ccb.jhu.edu/software/kraken/
- Isoform-level FPKM combined output file generated (@klrl262)
- Use shared conda repository for tricky to install Python packages: https://github.com/chapmanb/bcbio-conda
- Added initial chanjo integration for coverage calculation (@kern3020): https://github.com/robinandeer/chanjo
- Initial support for automated evaluation of structural variant calling.
- Bugfix: set library-type properly for Cufflinks runs.
- Added `genome_setup.py`, a script to prepare your own genome and rnaseq files.
- Redo Illumina sequencer integration to be up to date with current code base. Uses external bcl2fastq demultiplexing and new bcbio integrated analysis server. Provide documentation on setting up automated infrastructure.
- Perform de-duplication of BAM files as part of streaming alignment process using samblaster or biobambam's bammarkduplicates. Removes need for secondary split of files and BAM preparation unless recalibration and realignment needed. Enables pre-processing of input files for structural variant detection.
- Rework batched regional analysis in variant calling to remove custom cases and simplify structure. Filtering now happens explicitly on the combined batch file. This is functionally equivalent to previous filters but now the workflow is clearer. Avoids special cases for tumor/normal inputs.
- Perform regional splitting of samples grouped by batch instead of globally, enabling multiple organisms and experiments within a single input sample YAML.
- Add temporary directory usage to enable use of local high speed scratch disk on setups with large enough global temporary storage.
- Update FreeBayes to latest version and provide improved filtering for high depth artifacts.
- Update VQSR support for GATK to be up to date with latest best practices. Re-organize GATK and filtering to be more modular to help with transition to GATK 3.x gVCF approaches.
- Support CRAM files as input to pipeline, including retrieval of reads from defined sequence regions.
- Support export of alignment data as CRAM instead of BAM for space storage and long term archiving.
- Provide configuration option, `remove_lcr`, to filter out variants in low complexity regions.
- Improve Galaxy upload for LIMS support: enable upload of FastQC as PDF reports with wkhtmltopdf installed. Provide tabular summaries of mapped reads.
- Improve checks for pre-aligned BAMs: ensure correct sample names and provide more context on errors around mismatching reference genomes.
- GATK HaplotypeCaller: ensure genotype depth annotation with DepthPerSampleHC annotation. Enable GATK 3.1 hardware specific optimizations.
- Use bgzipped VCFs for dbSNP, Cosmic and other resources to save disk space. Upgrade to Cosmic v68.
- Avoid VCF concatenation errors when first input file is empty. Thanks to Jiantao Shi.
- Added preliminary support for oncofuse for calling gene fusion events. Thanks to @tanglingfung.
- Add a check for mis-specified FASTQ format in the sample YAML file. Thanks to Alla Bushoy.
- Updated RNA-seq integration tests to have more specific tags (singleend, Tophat, STAR, explant).
- Fix contig ordering after Tophat alignment which was preventing GATK-based tools from running.
- Allow calculation of RPKM on more deeply sampled genes by setting `--max-bundle-frags` to 2,000,000. Thanks to Miika Ahdesmaki.
- Provide cleaner installation process for non-distributable tools like GATK. The `--toolplus` argument now handles jars from the GATK site or Appistry and correctly updates manifest version information.
- Use bgzipped/tabix indexed variant files throughout pipeline instead of raw uncompressed VCFs. Reduces space requirements and enables parallelization on non-shared filesystems or temporary space by avoiding transferring uncompressed outputs.
- Reduce memory usage during post-alignment BAM preparation steps (PrintReads downsampling, deduplication and realignment prep) to avoid reaching memory cap on limited systems like SLURM. Do not include for IndelRealigner which needs memory in high depth regions.
- Provide explicit targets for coverage depth (`coverage_depth_max` and `coverage_depth_min`) instead of `coverage_depth` enumeration. Provide downsampling of reads to max depth during post-alignment preparation to avoid repetitive centromere regions with high depth.
- Ensure read group information correctly supplied with bwa aln. Thanks to Miika Ahdesmaki.
- Fix bug in retrieval of snpEff databases on install. Thanks to Matan Hofree.
- Fix bug in normal BAM preparation for tumor/normal variant calling. Thanks to Miika Ahdesmaki.
- General removal of GATK for variant manipulation functionality to help focus on support for upcoming GATK 3.0. Use bcftools for splitting of variants into SNPs and indels instead of GATK. Use vcflib's vcfintersection to combine SNPs and indels instead of GATK. Use bcftools for sample selection from multi-sample VCFs. Use pysam for calculation of sample coverage.
- Use GATK 3.0 MIT licensed framework for remaining BAM and variant manipulation code (PrintReads, CombineVariants) to provide one consistent up to date set of functionality for GATK variant manipulation.
- Normalize input variant_regions BED files to avoid overlapping segments. Avoids out of order errors with FreeBayes caller which will call in each region without flattening the input BED.
- For cancer tumor/normal calling, attach final call information of both to the tumor sample. This provides a single downstream file for processing and analysis.
- Enable batch specification in metadata to be a list, allowing a single normal BAM file to serve as a control for multiple tumor files.
- Re-organization of parallel framework code to enable alternative approaches. Document plugging in new parallel frameworks. Does not expose changes to users but makes the code cleaner for developers.
- Default to 1Gb/core memory usage when not specified in any programs. Do not use default baseline if supplied in input file. Thanks to James Porter.
- Integrate plotting of variant evaluation results using prettyplotlib.
- Add `globals` option to configuration to avoid needing to specify the same shared file multiple times in a samples configuration.
- Remove deprecated Celery distributed messaging, replaced in favor of IPython.
- Remove algorithm/custom_algorithm from bcbio_system.yaml, preferring to set these directly in the sample YAML files.
- Remove outdated and unused custom B-run trimming.
- Remove ability to guess fastq files from directories with no specification in sample YAML. Prefer using generalized template functionality with explicit specification of files in sample YAML file.
- Remove deprecated multiplex support, which is outdated and not maintained. Prefer approaches in external tools upstream of bcbio-nextgen.
- Add `--tag` argument which labels job names on a cluster to help distinguish when multiple bcbio jobs run concurrently. Thanks to Jason Corneveaux.
- Connect min_read_length parameter with read_through trimming in RNA-seq. Thanks to James Porter.
- Map `variant` calling specification to `variant2` since original approach no longer supported.
- Fix issues with trying to upload directories to Galaxy. Thanks to Jim Peden.
- Made inner distance calculation for Tophat more accurate.
- Added gffutils GFF database to the RNA-seq indices.
- Add gene name annotation from the GFF file instead of from mygene.
- Expand template functionality to provide additional ability to add metadata to samples with input CSV. Includes customization of algorithm section and better matching of samples using input file names. Improve ability to distinguish fastq pairs.
- Generalize snpEff database preparation to use individual databases located with each genome. Enables better multi-organism support.
- Enable tumor/normal paired calling with FreeBayes. Contributed by Luca Beltrame.
- Provide additional parallelization of bgzip preparation, performing grabix indexing in parallel for paired ends.
- Fix downsampling with GATK-lite 2.3.9 releases by moving to sambamba based downsampling. Thanks to Przemek Lyszkiewicz.
- Handle Illumina format input files for bwa-mem alignment, and cleanly convert these when preparing bgzipped inputs for parallel alignment. Thanks to Miika Ahdesmaki.
- Provide better algorithm for distinguishing bwa-mem and bwa-aln usage. Now does random sampling of first 2 million reads instead of taking the first set of reads which may be non-generalizable. Also lowers requirement to use bwa-mem to 75% of reads being smaller than 70bp. Thanks to Paul Tang.
- Enable specification of a GATK key file in the bcbio_system resources `keyfile` parameter. Disables callbacks to GATK tracking. Thanks to Severine Catreux for keyfile to debug with.
- Correctly handle preparation of pre-aligned BAM files when sorting and coordinate specification needed. Thanks to Severine Catreux.
- Fix incorrect quality flag being passed to Tophat. Thanks to Miika Ahdesmaki.
- Fix Tophat not respecting the existing --transcriptome-index. Thanks to Miika Ahdesmaki.
- Keep original gzipped fastq files. Thanks again to Miika Ahdesmaki.
- Fixed incompatibility with complexity calculation and IPython.
- Added strand-specific RNA-seq support via the strandedness option.
- Added Cufflinks support.
- Set stranded flag properly in htseq-count. Thanks to Miika Ahdesmaki.
- Fix to ensure Tophat receives a minimum of 8 gb of memory, regardless of number of cores.
- Remove `hybrid_bait` and `hybrid_target` which were no longer used with new lightweight QC framework. Prefer better coverage framework moving forward.
- Added extra summary information to the project-summary.yaml file so downstream tools can locate what genome resources were used.
- Added `test_run` option to the sample configuration file. Set it to True to run a small subset of your data through the pipeline to make sure everything is working okay.
- Fusion support added by setting `fusion_mode: True` in the algorithm section. Not officially documented for now until we can come up with best practices for it.
- STAR support re-enabled.
- Fixed issue with the complexity calculation throwing an exception when there were not enough reads.
- Add disambiguation stats to final project-summary.yaml file. Thanks to Miika Ahdesmaki.
- Remove `Estimated Library Size` and `Complexity` from RNA-seq QC summary information as they were confusing and unnecessarily alarming, respectively. Thanks to Miika Ahdesmaki and Sara Dempster.
- Fixed several memory allocation errors that resulted in jobs getting killed in cluster environments for overusing their memory limit.
- Added JVM options by default to Picard to allocate enough memory for large BAM->FastQ conversion.
- Update overall project metrics summary to move to a flexible YAML format that handles multiple analysis types. Re-include target, duplication and variant metrics.
- Support disambiguation of mixed samples for RNA-seq pipelines. Handles alignment to two genomes, running disambiguation and continuation of disambiguated samples through the pipeline. Contributed by Miika Ahdesmaki and AstraZeneca.
- Handle specification of sex in metadata and correctly call X,Y and mitochondrial chromosomes.
- Fix issues with open file handles for large population runs. Ensure ZeroMQ contexts are closed and enable extension of ulimit soft file and user process limits within user available hard limits.
- Avoid calling in regions with excessively deep coverage. Reduces variant calling bottlenecks in repetitive regions with 25,000 or more reads.
- Improve `bcbio_nextgen.py upgrade` function to be more consistent on handling of code, tools and data. Now each require an implicit specification, while other options are remembered. Thanks to Jakub Nowacki.
- Generalize retrieval of RNA-seq resources (GTF files, transcriptome indexes) to use genome-resources.yaml. Updates all genome resources files. Contributed by James Porter.
- Use sambamba for indexing, which allows multicore indexing to speed up index creation on large BAM processing. Falls back to samtools index if not available.
- Remove custom Picard metrics runs and pdf generation. Eliminates dependencies on pdflatex and R for QC metrics.
- Improve memory handling by providing fallbacks during common memory intensive steps. Better handle memory on SLURM by explicitly allowing system memory in addition to that required for processing.
- Update fastqc runs to use BAM files downsampled to 10 million reads to avoid excessive run times. Part of general speed up of QC step.
- Add Qualimap to generate plots and metrics for BAM alignments. Off by default due to speed issues.
- Improve handling of GATK version detection, including support for Appistry versions.
- Allow interruption of read_through trimming with Ctrl-C.
- Improve test suite: use system configuration instead of requiring test specific setup. Install and use a local version of nose using the installer provided Python.
- Fix for crash with single-end reads in read_through trimming.
- Added a library complexity calculation for RNA-seq libraries as a QC metric
- Added sorting via sambamba. Internally bcbio-nextgen now inspects the headers of SAM/BAM files to find their sorting status, so make sure tools set it correctly.
- Framework for indexing input reads using parallel bgzip and grabix, to handle distributed alignment. Enables further distribution of alignment step beyond multicore nodes.
- Rework of ensemble calling approach to generalize to population level ensemble calls. Provide improved defaults for handling 3 caller consolidation.
- Support for Mouse (mm10) variant calling and RNA-seq.
- For recent versions of Gemini (0.6.3+) do not load filtered variants into database, only including passed variants.
- Improve specification of resource parameters, using multiple `-r` flags instead of single semi-colon separated input. Allow specification of pename resource parameter for selecting correct SGE environment when not automatically found.
- Support biobambam's bammarkduplicates2 for duplicate removal.
- Clean up logging handling code to be more resilient to interrupt messages.
- Speed improvements for selecting unanalyzed and unmapped reads to address bottlenecks during BAM prep phase.
- Bug fix for algorithm options incorrectly expanded to paths on re-runs. Thanks to Brent Pedersen for report.
- Fix for Tophat 2.0.9 support: remove reads with empty read names.
- Save installation and upgrade details to enable cleaner upgrades without needing to respecify genomes, tool directory and other options from installation.
- Move specification of supporting genome files for variation (dbSNP, training files) and RNA-seq (transcript GTF files) analyses into an organism specific resources file. Improves ability to support additional organisms and genome builds.
- Provide paired tumor/normal variant calling with VarScan. Thanks to Luca Beltrame.
- Require bash shell and use of pipefail for piped commands. Ensures rapid detection of failures during piped steps like alignment.
- Use samtools cat for post-BAM merging to avoid issues with bamtools requirement for open file handles.
- Add installation/upgrade options to enable commercially restricted and data intensive third party tools.
- Support for GATK 2.7
- Fixes for TopHat 2.0.9 support: remove extra non-mate match paired end reads from alignment output.
- Pull `description` sample names from BAM files if not present in input configuration file. Thanks to Paul Tang for suggestion.
- Bug fixes for non-paired RNA-seq analysis.
- Add custom filtration of FreeBayes samples using bcbio.variation.
- Default to phred33 format for Tophat alignment if none specified.
- Report memory usage for processes to cluster schedulers and use predicted memory usage to schedule cores per machine. Gets core and memory information for machines and uses to ensure submitted jobs can schedule with available resources.
- Provide error checking of input YAML configuration at run start. Avoids accidental typos or incorrect settings that won't error out until later in the process.
- Drop requirement for fc_name and fc_date in input YAML file. Individual sample names are instead used and required to be unique within a processing run.
- Remove original `variant` pipeline, replacing with the all around better `variant2` analysis method. Plan for the next version is to automatically redirect to `variant2`.
- Improve parallelization of BAM preparation and gemini database creation by moving to multicore versions.
- Move variant annotation to work on called sub-regions, to avoid bottlenecks when annotating a full whole genome VCF.
- Remove sequencer-specific integration functionality which is poorly maintained and better done with third party tools: demultiplexing and statistics from Illumina directories.
- Bug fix to re-enable template generation functionality.
- Improve BAM merging on large files using samtools for output sort.
- Uploading results works with the RNA-seq pipeline.
- Rework internals to provide a consistent dictionary of sample attributes up front, avoiding lane/sample dichotomy which provided confusing internal code.
- Drop calling htseq-count from the command line in favor of an internal implementation.
- Remove requirement for bcbio_system.yaml passed in on command line, defaulting to default file prepared by installer unless specified.
- Bug fixes for new approach to parsing *.loc files: handle Galaxy *.loc files with mixed tabs and spaces correctly and fall back to previous approaches when aligner specific *.loc files are missing.
- Bug fixes for preparing merged BAM files using bamtools: correctly sort after merging and avoid duplication of reads in noanalysis files.
- Bug fix for concatenating files when first file is empty.
- Recover from ZeroMQ logging errors, avoiding loss of logging output.
- RNA-seq pipeline updated: deprecate Tophat 1 in favor of Tophat 2. Perform automatic adapter trimming of common adapter sequences. STAR aligner support. RNA-SeQC support for RNA-seq specific quality control. Transcript quantitation with htseq-count.
- Updated installation and upgrade procedures, to make it easier to build an initial analysis pipeline and upgrade bcbio-nextgen and third-party tools and data in place.
- Add support for MuTect tumor/normal variant caller, contributed by Luca Beltrame.
- Generalize variant calling to support alternative callers like cancer-specific calling: provide additional associated files to variant calls and pass along sample specific metadata. Document implementation of new variant callers.
- Improve algorithms around post-variant calling preparation. Avoid unnecessary tries for VQSR on low coverage whole genome reads, and concatenate VCF files to avoid locking penalties.
- Fix logging and memory usage for multicore jobs run within ipython clusters.
- Improve logging for IPython cluster issues, including moving IPython logs inside project logging directory for better access.
- Options for improved cluster resiliency: minimize number of clusters started during processing with more extensive reuse, flexible timeouts for waiting on cluster start up, and expose options to allow job retries. Thanks to Zhengqiu Cai for suggestions and testing.
- Improve logging: Detailed debugging logs collect all process standard out and error and command lines across distributed systems.
- Piping improvements: provide fully piped analysis with GATK recalibration and gkno realignment. Handle smaller reads with novoalign piped analysis.
- Improve collapsing analysis regions into evenly sized blocks to better handle large numbers of samples analyzed together.
- Provide template functionality to ease generation of input sample.yaml files from lists of BAM or fastq files. Thanks to Brent Pedersen and Paul Tang.
- Updated program support: Improved novoalign support based on evaluation with reference genomes. Support GATK 2.5-2. Support VarScan 2.3.5.
- Fix naming of read group information (ID and SM) to be more robust. Identifies issues with duplicated read groups up front to avoid downstream errors during variant calling. Thanks to Zhengqiu Cai.
- Improve quality control metrics: Cleanup into custom qc directory and ensure correct selection of duplicate and other metrics for split post-alignment prep, even without merging.
- Fix IPython parallel usage for larger clusters, providing improved resiliency for long running jobs.
- Clean up handling of missing programs and input files with better error messages. From Brent Pedersen.
- Integrate fully with bcbio.variation to provide automated validation of variant calls against reference materials.
- Provide full list of all third party software versions used in analysis.
- Create GEMINI database as part of output process, allowing immediate queries of variants with associated population and annotation data.
- Collapse analysis regions into evenly sized blocks separated by non-callable regions. Provides better parallelism.
- Documentation and examples for NA12878 exome and whole genome pipelines.
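
The entry above on switching logging time stamps to UTC with explicit ISO 8601 output can be illustrated with a small standard-library sketch. This is not the pipeline's own logging code: `UTCISOFormatter` and `make_logger` are hypothetical names chosen for the example, and the message layout is an assumption.

```python
import logging
from datetime import datetime, timezone


class UTCISOFormatter(logging.Formatter):
    """Render record timestamps as ISO 8601 strings in UTC (hypothetical helper)."""

    def formatTime(self, record, datefmt=None):
        # record.created holds seconds since the epoch; attach UTC explicitly
        # so the offset (+00:00) is part of every timestamp.
        stamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return stamp.isoformat(timespec="seconds")


def make_logger(name="bcbio-example"):
    """Build a logger whose lines start with a UTC ISO 8601 timestamp."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(UTCISOFormatter("[%(asctime)s] %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Because every timestamp carries an explicit UTC offset, log lines from machines in different time zones sort and compare directly, which is the benchmarking benefit the entry describes.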
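
The entry above on normalizing input `variant_regions` BED files to avoid overlapping segments describes a standard interval merge. The following is a minimal sketch of that idea only; the changelog does not show how the pipeline implements it, `merge_bed_intervals` is a hypothetical name, and sorting chromosomes lexicographically is a simplification.

```python
from collections import defaultdict


def merge_bed_intervals(intervals):
    """Merge overlapping or book-ended (chrom, start, end) intervals per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):  # simplification: lexicographic contig order
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps (or touches) the open interval: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: emit the finished interval and start a new one.
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged
```

With flattened regions, a caller that processes each BED line independently never visits the same position twice, which is what avoids the out of order errors mentioned for FreeBayes.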
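
The entry above on scaling local jobs by total memory when memory, rather than cores, constrains resource usage comes down to taking the smaller of two limits. A rough sketch under stated assumptions: `jobs_for_machine` and its parameters are hypothetical, memory is assumed to be requested per core, and at least one job is always allowed.

```python
def jobs_for_machine(total_cores, total_mem_gb, cores_per_job, mem_gb_per_core):
    """Return how many jobs fit on one machine, limited by cores or by memory."""
    by_cores = total_cores // cores_per_job
    mem_per_job = cores_per_job * mem_gb_per_core
    by_memory = int(total_mem_gb // mem_per_job)
    # Whichever resource runs out first sets the limit; assume one job
    # is always attempted even on an undersized machine.
    return max(1, min(by_cores, by_memory))
```

For example, a 16 core machine with 32Gb of memory running single core jobs that need 4Gb each is limited to 8 concurrent jobs by memory, not 16 by cores.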