Doomsayer: detection of batch effects and outliers in next-generation sequencing data using mutation signature analysis
Whole-genome sequencing data must go through extensive quality control measures to ensure that the variants identified in an individual’s genome are true biological differences and not the result of errors that can occur throughout the many stages of sample preparation and sequencing. Many such errors can be avoided by collecting, storing, transferring, and preparing the biological samples according to established best practices. Human error is inevitable, however, and sometimes a few DNA samples will get degraded or oxidized or sloshed into another well of the plate, etc. etc. Furthermore, large samples are often generated over the course of several months and aggregated from multiple sequencing centers, which can result in batch effects, where there is non-biological variation between groups of samples that were handled on a particular date or by a particular technician or sequenced on a particular machine.
Sometimes we are aware of these problems immediately after they occur, and it’s possible to reacquire or resequence the affected samples. More often, we need to apply a series of computational QC methods to verify the quality of the data, identify potential signatures of errors or batch effects, and correct or remove the affected samples.
One common QC metric is the transition:transversion (Ts:Tv) ratio, which is simply the ratio between the number of transition variants (purine <–> purine or pyrimidine <–> pyrimidine) and the number of transversion variants (purine <–> pyrimidine). Because transitions are much more common than transversions, we expect a Ts:Tv ratio roughly around 2 for WGS data. Individual genomes with very low or very high Ts:Tv ratios are kicked out of the sample, because we assume that too many transitions or transversions cannot occur in a normal biological sample.
However, the Ts:Tv ratio is a very coarse approximation of sample quality–if we have a sample with 2000 transitions that are all A>G, and 1000 transversions that are all A>C, the Ts:Tv ratio will be 2, but there is clearly something wrong with a sample that only has these two types of variants. Moreover, the “ideal” Ts:Tv ratio is somewhat of a moving target–even in high-quality data, the Ts:Tv ratio varies according to the sample size, population structure, relatedness, coverage, reference genome, variant calling algorithm, etc. Consequently, sequencing studies can be terribly inconsistent in how they identify potential outlier samples.
To this end, I have developed a sample-level QC method that can be conceptualized as a more intricate version of the Ts:Tv ratio. Rather than looking at two classes of variants (transitions and transversions), this method looks at 96 classes of variants, based on the nucleotides immediately upstream and downstream from each variant site. This is important because many known sources of error have been shown to have sequence-specific biases, resulting in an excess of certain types of errors at certain sequence motifs. Rather than determining outliers by a pre-specified Ts/Tv threshold, this method uses a combination of dimensionality reduction algorithms and machine learning outlier detection procedures.
I have implemented this method in Doomsayer, a command-line tool available on GitHub.
Here are a few of Doomsayer’s features:
Doomsayer was designed to evaluate some of the largest sequencing datasets that have ever been collected, containing hundreds of millions of variants identified in tens of thousands of individuals. It uses the super-fast htslib sequence analysis libraries, as implemented in the cyvcf2 Python package. Options for parallel processing are baked right in.
Nobody likes having to download and compile a million dependencies, and I don’t like troubleshooting them. That’s why Doomsayer includes a pre-built Conda environment. Just clone the repository, activate the environment, and everything should work.
Doomsayer is also available as a pre-built Docker image, making it even easier to run.
You can also access a cloud-based instance of Doomsayer using Binder. Note that due to resource limitations on the Binder server, not all
Doomsayer includes a Jupyter notebook with an interactive tutorial on common use cases. This tutorial can be accessed through both the Docker and Binder instances.
Once Doomsayer has crunched the data and flagged some potential outliers, it will automatically render an HTML-formatted diagnostic report using RMarkdown.
One of the core functions of Doomsayer is the use of Principal Component Analysis (PCA) or Nonnegative Matrix Factorization (NMF) to extract the major components of variability across your sample. This is effectively the same method that is used to decipher unique mutation signatures in somatic sequencing datasets (e.g., the SomaticSignatures R package).
Preprint coming soon!
For now, please cite the repository:
Carlson J. carjed/doomsayer v0.6.1. (2018).