Bioinformatics

Overview

Providing tools and resources designed to help improve analysis of complex sequence data.

Bioinformatics, the analysis of complex biological data with sophisticated algorithms, is essential for understanding the vast amount of data generated by technologies such as next-generation sequencing (NGS). Advances in NGS have enabled deeper sequencing and higher genome coverage, yielding data of a quality that could once only be imagined. It is therefore imperative to develop powerful informatics software and algorithms that can interpret these data to call mutations precisely, including single nucleotide variants (SNVs), insertions/deletions (indels) and structural variants (SVs), and to apply that knowledge to the development of precise treatment plans for personalized medicine. Roche Sequencing Solutions has authored multiple algorithms, such as MetaSV and SomaticSeq, that improve the accuracy of variant calling, and has developed resources, such as VarSim and LongISLND, to facilitate the benchmarking of emerging algorithms.

  • LongISLND: A platform-agnostic, accurate simulator of single-molecule sequencing technologies that uses context-dependent error profiles.
  • VarSim: A high fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications. 
  • MetaSV: An accurate and integrative structural variant caller for NGS. 
  • SomaticSeq: An ensemble approach to accurately detect somatic mutations. 
  • HuRef Variant Gold Set: A comprehensive set of variants for benchmarking variant callers utilizing the Venter genome. 

LongISLND

Before investing resources in a single-molecule sequencing (SMS) experiment, it is valuable to first rapidly prototype a complementary pair of experimental procedure and bioinformatics tools. A versatile and realistic SMS read simulator would enable such prototyping, but none has been available. Currently available software, such as PBSIM and Alchemy, generates data formats with limited portability or compatibility with downstream analysis tools.1 More importantly, existing simulators assume idealized error characteristics and may fail to capture the true error characteristics arising from a broad range of sequencing mechanisms, from chemistry to signal processing.2 LongISLND is a software package that adapts to such mechanisms and simulates sequencing data according to the empirical characteristics of various third-generation SMS technologies.

LongISLND uses a learn-and-simulate approach to achieve accurate and realistic simulation while remaining platform agnostic. The software analyzes the alignments of real data to truth sequences to extract a non-parametric model. For example, the model captures empirical sequencing error by sequencing context, such as the length and base identity of homopolymers and the identity of the flanking bases. The resulting models are realistic enough to detect subtle yet significant changes in accuracy and error bias introduced by the P5-to-P6 chemistry revision of the Pacific Biosciences (PacBio) RSII sequencer (Figures 1a and 1b).

The non-parametric empirical model is then used to perform simulations for a test case of interest. Such realistic simulation enables rapid bioinformatics development that accounts for empirical sequencing defects rather than a merely idealized error model. LongISLND is a valuable tool for simulating single-molecule sequencing and for detecting bias in sequencing characteristics. It can be applied to any third-generation SMS platform or chemistry to predict the performance of a specific combination of experiment and bioinformatics analysis.
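The learn-and-simulate idea can be illustrated with a minimal Python sketch. It is not LongISLND's implementation: the function names, the context key (flanking bases plus a homopolymer length capped at 5), and the gapped-string alignment input are all assumptions made for illustration, and insertions are ignored to keep the example short.

```python
import random
from collections import Counter, defaultdict


def homopolymer_context(seq, i):
    """Hypothetical context key: (left flank, base, capped run length, right flank)."""
    base = seq[i]
    start = i
    while start > 0 and seq[start - 1] == base:
        start -= 1
    end = i
    while end + 1 < len(seq) and seq[end + 1] == base:
        end += 1
    left = seq[start - 1] if start > 0 else "^"
    right = seq[end + 1] if end + 1 < len(seq) else "$"
    return (left, base, min(end - start + 1, 5), right)


def learn_profile(aligned_pairs):
    """Count observed read outcomes per truth context (the "learn" step).

    Each pair is two equal-length gapped strings (truth, read); '-' in the
    read marks a deletion. Columns with '-' in the truth (insertions) are
    skipped in this sketch.
    """
    profile = defaultdict(Counter)
    for truth, read in aligned_pairs:
        ungapped = truth.replace("-", "")
        pos = 0
        for t, r in zip(truth, read):
            if t == "-":
                continue
            profile[homopolymer_context(ungapped, pos)][r] += 1
            pos += 1
    return profile


def simulate_read(reference, profile, rng):
    """Sample each base's outcome from its context's empirical counts (the "simulate" step)."""
    out = []
    for i, base in enumerate(reference):
        counts = profile.get(homopolymer_context(reference, i))
        if not counts:
            out.append(base)  # unseen context: copy the base unchanged
            continue
        outcomes = list(counts)
        choice = rng.choices(outcomes, weights=[counts[o] for o in outcomes])[0]
        if choice != "-":  # '-' means the base is deleted from the read
            out.append(choice)
    return "".join(out)
```

Because the profile is just a table of observed counts, no error distribution is assumed in advance, which is the sense in which such a model is non-parametric.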

VarSim

The lack of a comprehensive validation framework for real data has motivated the use of simulation as a validation methodology for the evaluation of high-throughput sequencing secondary analysis. VarSim, an automated, open-source and comprehensive framework, simulates and validates a wide range of variants for assessing alignment and variant calling accuracy in high-throughput genome sequencing.3

Instead of simulating a random mutation spectrum, VarSim takes databases of real experimental data as input and synthesizes diploid genomes with germline and somatic mutations based on a realistic model. In doing so, it leverages information such as previously reported mutations to make the synthetic genomes biologically relevant. The synthesized genomes are then used to generate reads with state-of-the-art read simulation. These reads are processed by the secondary analysis tools under evaluation, and VarSim validates their outputs to give a detailed assessment of their accuracy.
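As a rough illustration of synthesizing a diploid genome from known variants, the following Python sketch samples variants from a small database, assigns each a genotype, and builds two haplotypes. It is not VarSim's code: the function names, the (pos, ref, alt) tuple layout with 0-based positions, and the uniform genotype choice are assumptions, and the sampled variants are assumed not to overlap.

```python
import random


def apply_variants(reference, variants):
    """Apply non-overlapping (pos, ref, alt) variants (0-based pos) to a reference string."""
    out = []
    cursor = 0
    for pos, ref, alt in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants are not supported in this sketch")
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        out.append(reference[cursor:pos])  # unchanged sequence before the variant
        out.append(alt)                    # substituted allele
        cursor = pos + len(ref)
    out.append(reference[cursor:])
    return "".join(out)


def synthesize_diploid(reference, known_variants, n_variants, rng):
    """Sample known variants, assign genotypes, and return (hap1, hap2, truth)."""
    chosen = rng.sample(known_variants, n_variants)
    hap1_variants, hap2_variants, truth = [], [], []
    for variant in chosen:
        # Heterozygous on either haplotype, or homozygous alternate.
        genotype = rng.choice([(1, 0), (0, 1), (1, 1)])
        if genotype[0]:
            hap1_variants.append(variant)
        if genotype[1]:
            hap2_variants.append(variant)
        truth.append((variant, genotype))
    hap1 = apply_variants(reference, hap1_variants)
    hap2 = apply_variants(reference, hap2_variants)
    return hap1, hap2, truth
```

The returned truth list plays the role of the metadata that a validation step would later compare against the output of an aligner or variant caller.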

When validating read alignments, VarSim emphasizes the reads that align to rare variants and uses the previously generated metadata to stratify alignment accuracy by variant type (Figure 2). By contrast, traditional pooled comparisons are dominated by the majority of reads, which align to the rest of the genome, and this dilutes and obscures the accuracy of reads aligning to rare, complex variations. When validating variant calls, VarSim stratifies the variants by type and size to give a comprehensive view of variant-calling accuracy.
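Stratifying variant-calling accuracy by type and size can be sketched in a few lines of Python. This is an illustration only, not VarSim's validator: the size bins, the (chrom, pos, type, size) tuple layout, and the exact-match comparison are assumptions chosen to keep the example small.

```python
from collections import defaultdict

# Hypothetical size bins in base pairs; None means no upper bound.
SIZE_BINS = [(1, 1), (2, 49), (50, 999), (1000, None)]


def size_bin(size):
    """Return the label of the bin that contains a variant size."""
    for low, high in SIZE_BINS:
        if size >= low and (high is None or size <= high):
            return f"{low}+" if high is None else f"{low}-{high}"
    raise ValueError(f"size {size} does not fall in any bin")


def stratified_accuracy(truth, calls):
    """Compare two sets of (chrom, pos, type, size) and report per-stratum accuracy."""
    tally = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for variant in truth:
        key = (variant[2], size_bin(variant[3]))
        tally[key]["tp" if variant in calls else "fn"] += 1
    for variant in calls - truth:
        tally[(variant[2], size_bin(variant[3]))]["fp"] += 1

    report = {}
    for key, c in tally.items():
        called = c["tp"] + c["fp"]
        real = c["tp"] + c["fn"]
        report[key] = {
            "precision": c["tp"] / called if called else None,
            "recall": c["tp"] / real if real else None,
        }
    return report
```

Reporting each stratum separately keeps a large number of easy calls in one stratum from masking poor performance in another.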

VarSim’s basic workflow also has significant applications in cancer sequencing analysis as it can be adapted for the simulation of tumor/normal pairs and the validation of somatic variant callers.

MetaSV

Structural variations (SVs) are large genomic rearrangements that contribute to genomic diversity as well as genomic disorders.4 Due to their varying lengths, SVs are difficult to detect accurately, especially when analyzing the relatively short reads generated by next-generation sequencing (NGS).

Although several SV detection tools have been developed, each is limited to specific types of SVs, with differing accuracy and resolution. To improve SV detection accuracy, Roche has developed MetaSV, an integrated SV caller that intelligently merges SVs from multiple tools, namely BreakSeq, BreakDancer, Pindel and CNVnator, to generate high-quality SV call sets.5 Following the merge, MetaSV uses local assembly, in combination with alignment, to resolve SV breakpoints with high precision (Figure 3). MetaSV also analyzes soft-clipped reads from alignment to boost the accuracy of insertion SV detection. Finally, the high-quality SVs are genotyped and output as a VCF file, a standard format for secondary analysis.
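To make the merging step concrete, here is a minimal Python sketch that clusters SV calls of the same type from several tools when their intervals overlap reciprocally. It is not MetaSV's merging logic: the greedy clustering, the 50% reciprocal-overlap threshold, and the (tool, chrom, start, end, sv_type) tuple layout are assumptions made for illustration.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between half-open intervals a and b."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def merge_sv_calls(calls, min_overlap=0.5):
    """Greedily cluster (tool, chrom, start, end, sv_type) calls across tools."""
    clusters = []
    ordered = sorted(calls, key=lambda c: (c[1], c[4], c[2], c[3]))
    for tool, chrom, start, end, sv_type in ordered:
        for cluster in clusters:
            same_kind = (cluster["chrom"], cluster["type"]) == (chrom, sv_type)
            span = (cluster["start"], cluster["end"])
            if same_kind and reciprocal_overlap(span, (start, end)) >= min_overlap:
                # Widen the cluster to the union and record the supporting tool.
                cluster["start"] = min(cluster["start"], start)
                cluster["end"] = max(cluster["end"], end)
                cluster["tools"].add(tool)
                break
        else:
            clusters.append({
                "chrom": chrom, "type": sv_type,
                "start": start, "end": end, "tools": {tool},
            })
    return clusters
```

In a sketch like this, the set of supporting tools on each merged call is one simple signal of confidence; refining breakpoints by local assembly, as described above, is a separate step that is not shown here.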

MetaSV significantly improves the accuracy of SV calling through its ensemble approach. Furthermore, it is not limited to the four tools named above; it can be adapted to use additional or even different sets of tools.

SomaticSeq

Two major challenges in somatic mutation detection in cancer are the heterogeneous nature of tumors and the potential for cross-contamination between tumor and normal samples. These challenges also mean that while a somatic mutation detection algorithm may work well for one tumor, it might work poorly for another.

SomaticSeq, an open-source, accurate somatic mutation detection pipeline, integrates five state-of-the-art somatic mutation callers: MuTect, SomaticSniper, VarScan2, JointSNVMix2, and VarDict.6 Together, these yield a high-confidence somatic mutation call set for both single nucleotide variants and small indels.

SomaticSeq implements a stochastic boosting machine-learning algorithm. By combining the somatic mutation calls from these five tools, we can generate up to 72 genomic and sequencing features for each call. We use the SomaticSeq trainer to create a trained model based on these features. The trained model can then be used to call high-confidence somatic mutations in test data sets with similar characteristics, such as comparable sequencing depth and identical data preprocessing protocols.
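The general pattern of turning several callers' outputs into features and training a boosted classifier can be sketched in pure Python. This is not the SomaticSeq trainer: it uses plain discrete AdaBoost over one-feature decision stumps rather than stochastic boosting, and it uses only five hypothetical 0/1 features (whether each caller reported the site) instead of the up to 72 features described above. All function names are illustrative.

```python
import math

CALLERS = ["MuTect", "SomaticSniper", "VarScan2", "JointSNVMix2", "VarDict"]


def caller_features(calls_by_tool, site):
    """One 0/1 feature per caller: did that tool report this site?"""
    return [1 if site in calls_by_tool.get(tool, set()) else 0 for tool in CALLERS]


def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost over single-feature stumps; labels in y are +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, feature_index, polarity)
    for _ in range(rounds):
        best = None
        for j in range(len(X[0])):
            for polarity in (1, -1):
                preds = [polarity if row[j] else -polarity for row in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, polarity, preds)
        err, j, polarity, preds = best
        err = min(max(err, 1e-9), 1 - 1e-9)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, polarity))
        # Up-weight misclassified sites, then renormalize.
        weights = [w * math.exp(-alpha * p * t) for w, p, t in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, row):
    """Return +1 (somatic) or -1 (not somatic) from the weighted stump votes."""
    score = sum(alpha * (pol if row[j] else -pol) for alpha, j, pol in model)
    return 1 if score > 0 else -1
```

A model trained this way is only as transferable as its training data, which is why the test data should share characteristics such as sequencing depth and preprocessing with the training set.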

The SomaticSeq pipeline has been validated with the DREAM Challenge data (Figure 4), in silico titration of two genomes, as well as real tumor data.

HuRef Variant Gold Set

Precision medicine based on high-throughput sequencing requires accurate sequencing algorithms. Assessing the performance of these algorithms in turn requires high-confidence, comprehensive human variant sets. Despite multiple efforts to generate such a validation set and to characterize complex variation at a population scale, no variant gold set covers all major variant types, including structural variants (SVs).

To address this, we leveraged the large body of high-quality Sanger sequences from the HuRef genome to construct the first and most comprehensive gold set from a diploid male Caucasian genome.7 The set was cross-validated with deep Illumina sequencing, population datasets, and well-established algorithms.

The published variants for HuRef were mostly reported five years ago, so a thorough reanalysis and validation of the HuRef genome was required to generate a gold set with high specificity and sensitivity. In contrast to current variant call sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base pairs.7 This makes it comprehensive in both the number and the types of variants, over a large size range.