
Making the Most of Your NGS Data: Understanding Metrics for Target-enriched NGS

Introduction

Targeted next-generation sequencing (NGS) is frequently used for identifying mutations, single nucleotide polymorphisms (SNPs), and disease-associated variants, as well as for whole-exome sequencing [1,2]. One of the most common target enrichment (TE) methods is hybridization-based TE, which uses oligonucleotide probes to capture regions of interest for downstream sequencing and analysis.

However, while targeted sequencing can reduce sequencing costs and simplify data analysis compared to whole-genome sequencing, it is still time-consuming and expensive; understanding key sequencing metrics can help you to maximize the value of each sequencing run.

After each sequencing run, several key sequencing metrics are assessed to evaluate sequencing performance and data quality; these include base quality, cluster density, and the number of reads passing filter. Here, we review five additional metrics that provide more in-depth insights into the success of hybridization-based target enrichment experiments.

Understanding each of these metrics will allow NGS users to evaluate the results of their target enrichment experiments, and to plan for future experiments.

Depth of Coverage

The number of times that a particular base within the target region is represented in the sequence data is referred to as coverage or depth of coverage. This measurement is expressed as a multiple; for example, a base that is covered by (included in) 5 unique sequencing reads is said to have 5X coverage (Figure 1). Higher coverage of a given region increases the confidence in SNP identification and variant calling for that region; this is especially important for rare variants, where even infrequent sequencing errors can lead to misleading conclusions.
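As a minimal sketch of how per-base coverage can be counted, the example below represents each unique mapped read as a hypothetical half-open `(start, end)` interval on a single contig. The function name and the toy coordinates are illustrative assumptions, not the output of any particular aligner or analysis tool.

```python
def per_base_coverage(target_start, target_end, reads):
    """Return the coverage depth of every base in [target_start, target_end).

    reads: iterable of (start, end) half-open intervals, one per unique
    mapped read (an assumed, simplified representation).
    """
    depth = [0] * (target_end - target_start)
    for read_start, read_end in reads:
        # Clip the read to the target so only on-target bases are counted.
        lo = max(read_start, target_start)
        hi = min(read_end, target_end)
        for pos in range(lo, hi):
            depth[pos - target_start] += 1
    return depth


# Toy example: a 10-base target covered by four overlapping reads.
example_reads = [(0, 6), (2, 8), (4, 10), (5, 10)]
example_depth = per_base_coverage(0, 10, example_reads)
mean_depth = sum(example_depth) / len(example_depth)
```

In this toy example, the base at position 5 is included in all four reads and therefore has 4X coverage, while the first two bases have only 1X coverage.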

The coverage depth required for different types of experiments can vary greatly, and depends on many factors. These include the quality and amount of the input sample, the number and type of variants and their expected frequencies, and the coverage depth typically reported for similar studies [3]. Determining the desired coverage for your experiments during the planning stages will help you save time and money by ensuring that you select (or design) a suitable probe panel, choose the most appropriate library preparation kit, and sequence your target-enriched libraries using the most appropriate sequencing instrument and run type.
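For planning purposes, a rough back-of-the-envelope estimate of the number of reads needed can be sketched as below. This is an assumption-laden simplification rather than a validated planning method: it treats coverage as perfectly uniform, treats the on-target rate as the fraction of sequenced bases that land on target, and assumes duplicate reads contribute nothing. The function name and all input values are hypothetical.

```python
import math


def estimate_reads_needed(target_size_bp, desired_mean_coverage, read_length_bp,
                          on_target_fraction, duplication_rate):
    """Rough estimate of the reads required to reach a mean target coverage.

    Simplifying assumptions: uniform coverage, on_target_fraction is the
    fraction of sequenced bases on target, and duplicates are discarded.
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    if not 0 <= duplication_rate < 1:
        raise ValueError("duplication_rate must be in [0, 1)")
    # Average number of unique, on-target bases contributed by each read.
    usable_bases_per_read = (
        read_length_bp * on_target_fraction * (1 - duplication_rate)
    )
    return math.ceil(desired_mean_coverage * target_size_bp / usable_bases_per_read)


# Hypothetical inputs: 1 Mb panel, 100X mean coverage, 100-base reads,
# 75% bases on-target, 25% duplication rate.
reads_needed = estimate_reads_needed(1_000_000, 100, 100, 0.75, 0.25)
```

Because real coverage is never perfectly uniform, an estimate like this is best read as a lower bound; the Fold-80 base penalty discussed later indicates how much extra sequencing uneven coverage may demand.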

Figure 1. Illustration of coverage depth.
Blue bars represent target genomic regions, black bars indicate unique mapped reads, and shaded boxes show various coverage depths across the target region. In this example, coverage of target regions ranges from 1X to 5X; required coverage depth varies widely across applications.

On-target Rate

The on-target rate provides information about the specificity of the target enrichment experiment. There are two ways to determine on-target rate: percent bases on-target (also called the percent on-target bases) and percent reads on-target (also called percent on-target reads).

The percent bases on-target is the percentage of sequenced bases that map to the target region (Figure 2), while the percent reads on-target counts every sequencing read that overlaps the target region by even a single base. The more bases or reads that fall within the target region, the greater the overall on-target rate, indicating strong probe specificity, high-quality probes, and efficient hybridization-based target enrichment.

Occasionally the regions directly outside of the target will be included in the analysis. This is referred to as padding or buffering, and while it may aid in analysis by detecting intronic variants [4], it can also misrepresent the true on-target values.
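The difference between the two on-target calculations, and the effect of padding, can be illustrated with a short sketch. It assumes reads and targets are hypothetical half-open `(start, end)` intervals on one contig, and that the targets do not overlap one another even after padding is applied; the function name and coordinates are illustrative only.

```python
def on_target_rates(reads, targets, padding=0):
    """Return (percent_reads_on_target, percent_bases_on_target).

    reads, targets: lists of (start, end) half-open intervals on one contig.
    padding: bases added to each side of every target (assumed small enough
    that padded targets do not overlap each other).
    """
    padded = [(max(0, start - padding), end + padding) for start, end in targets]
    total_bases = 0
    on_target_bases = 0
    on_target_reads = 0
    for read_start, read_end in reads:
        total_bases += read_end - read_start
        # Sum the overlap between this read and every (padded) target.
        overlap = sum(
            max(0, min(read_end, t_end) - max(read_start, t_start))
            for t_start, t_end in padded
        )
        on_target_bases += overlap
        if overlap > 0:
            # A single overlapping base is enough to count the read.
            on_target_reads += 1
    return (100 * on_target_reads / len(reads),
            100 * on_target_bases / total_bases)


# Toy example: one 100-base target and four 50-base reads.
example_targets = [(100, 200)]
example_reads = [(90, 140), (150, 200), (195, 245), (300, 350)]
strict_rates = on_target_rates(example_reads, example_targets)
padded_rates = on_target_rates(example_reads, example_targets, padding=10)
```

Here three of the four reads touch the target, so the percent reads on-target is 75%, but fewer than half of the sequenced bases fall inside it. Adding 10 bases of padding raises the percent bases on-target without any change in the underlying data, which is why padded values can overstate enrichment specificity.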

Low on-target rates can result from suboptimal probe design, poorly optimized protocols, problems during the library preparation or hybrid capture, or low-quality reagents. To improve on-target rates, invest in well-designed, high-quality probes, robust reagents, and a validated, reliable enrichment method.


Figure 2. Illustration of on-target rates.
In this image, the on-target reads are blue and off-target reads are black. Any bases falling within the green box are on-target bases; a single read may contain both on-target and off-target bases.

GC-bias

The distribution of AT-rich and GC-rich regions (often described in terms of GC content) is uneven across the genomes of many organisms, including humans. Regions of high or low GC content are often unevenly represented during sequencing; this disproportionate coverage is known as GC-bias. Figure 3 provides an example of GC-bias distribution plots, which are used to visualize GC-bias across regions of variable GC content.

Substantial GC-bias can be introduced during library preparation (especially in workflows that include PCR), during hybrid capture, or during the sequencing run itself [5,6,7]. Identifying GC-bias during data analysis can help determine whether more sequencing is required to achieve the desired sequencing depth across all of the target regions.

To minimize GC-bias in sequencing experiments, prepare sequencing libraries with a robust library preparation workflow known to minimize GC-bias; optimize PCR enrichment steps and minimize the number of PCR cycles; use properly calibrated thermocyclers; and use well-designed, high-quality probes.


Figure 3. Examples of GC-Bias distribution plots.
The plot on the left demonstrates a successful, low-GC-bias experiment where the normalized coverage (green dots) closely resembles the %GC in the reference genome (blue bars). The plot on the right displays GC-bias, where the fraction of normalized coverage is much higher in GC-rich regions. In both plots, the blue bars show varying levels of GC content from the reference genome; the x-axis shows the %GC, calculated over 100-base windows; values on the secondary y-axis (on the right) show how many windows across the genome contain the indicated %GC. Green dots on the plot show the fraction of normalized coverage per window using the primary y-axis.
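The quantities plotted in a GC-bias distribution plot can be approximated with a simple sketch. The version below is an illustrative assumption rather than the method used to produce Figure 3: it uses non-overlapping 100-base windows, takes a hypothetical reference string and a per-base depth list as input, and normalizes each %GC bin by the mean coverage across all windows.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def normalized_coverage_by_gc(reference, depth, window=100):
    """Return {gc_percent: (n_windows, normalized_coverage)}.

    reference: reference sequence as a string.
    depth: per-base coverage, same length as reference.
    Windows are non-overlapping (a simplifying assumption).
    """
    if len(reference) != len(depth):
        raise ValueError("reference and depth must have the same length")
    per_bin = defaultdict(list)
    window_means = []
    for start in range(0, len(reference) - window + 1, window):
        mean_depth = sum(depth[start:start + window]) / window
        gc_pct = round(100 * gc_fraction(reference[start:start + window]))
        per_bin[gc_pct].append(mean_depth)
        window_means.append(mean_depth)
    overall_mean = sum(window_means) / len(window_means)
    if overall_mean == 0:
        raise ValueError("no coverage in any window")
    # Normalized coverage of 1.0 means a bin is covered at the overall mean.
    return {
        gc: (len(values), (sum(values) / len(values)) / overall_mean)
        for gc, values in sorted(per_bin.items())
    }


# Toy example: three windows with 0%, 100%, and 50% GC content.
example_reference = "AT" * 50 + "GC" * 50 + "ATGC" * 25
example_depth = [10] * 100 + [30] * 100 + [20] * 100
gc_profile = normalized_coverage_by_gc(example_reference, example_depth)
```

In this toy data set the GC-rich window is covered at 1.5 times the overall mean and the AT-rich window at only half of it, the same pattern shown in the right-hand plot. In an experiment with little GC-bias, the normalized coverage would stay close to 1.0 across all %GC bins.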

Fold-80 Base Penalty

Analysis of sequencing data typically reveals that some target regions have achieved higher coverage than others. The Fold-80 base penalty metric is one way to assess coverage uniformity. Once the mean target coverage is determined for an experiment, the Fold-80 base penalty describes how much more sequencing is required to bring 80% of the target bases to the mean coverage. Thus, a run with perfect coverage uniformity would have a Fold-80 base penalty score of 1, indicating an on-target rate of 100% and uniform coverage (see Figure 4). Values higher than 1 indicate uneven coverage. For instance, a Fold-80 value of 2 means that twice as much (2-fold) sequencing is needed for 80% of the target bases to reach the mean coverage.
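One common way to compute this metric, used here as an assumption rather than as the definitive formula, is to divide the mean target coverage by the coverage level that at least 80% of target bases meet or exceed (the 20th percentile of per-base coverage). The sketch below takes a hypothetical list of per-base depths across the target; widely used tools may restrict the calculation further, for example to targets with non-zero coverage.

```python
import math


def fold_80_base_penalty(depth):
    """Mean coverage divided by the coverage that 80% of bases meet or exceed.

    depth: per-base coverage values across the target region.
    """
    if not depth:
        raise ValueError("depth must not be empty")
    ordered = sorted(depth)
    mean_coverage = sum(ordered) / len(ordered)
    # Skipping the lowest 20% of bases leaves at least 80% of bases whose
    # coverage is greater than or equal to ordered[len(ordered) // 5].
    coverage_met_by_80_percent = ordered[len(ordered) // 5]
    if coverage_met_by_80_percent == 0:
        # More than 20% of bases are uncovered; no amount of scaling helps.
        return math.inf
    return mean_coverage / coverage_met_by_80_percent


# Perfectly uniform coverage gives a penalty of 1.0.
uniform_penalty = fold_80_base_penalty([30] * 10)

# Uneven coverage with the same mean (30X) gives a higher penalty.
uneven_penalty = fold_80_base_penalty([10, 10, 20, 20, 30, 30, 40, 40, 50, 50])
```

Both toy data sets have a mean coverage of 30X, but in the uneven one only the bases at 20X or above make up 80% of the target, so 1.5 times as much sequencing would be needed to lift those bases to the mean.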

The Fold-80 base penalty provides information about the capture efficiency of the probes in the panel, which is impacted by both probe design and the quality of the probes themselves. Thus, the use of high-quality, well-designed probes and high-quality reagents can decrease the Fold-80 base penalty, reducing the need for additional, costly sequencing runs.


Figure 4. Comparison of ideal and observed uniformity.
The plot on the left displays results with a 100% on-target rate and uniform coverage, yielding a Fold-80 base penalty of 1.0. The plot on the right shows results that include off-target reads and uneven coverage; thus, the Fold-80 base penalty is greater than 1.0.

Duplicate Rate

When multiple sequencing reads are mapped to the exact same location, including the coordinates of the 3’ and 5’ ends, they are considered duplicate reads (Figure 5). The duplication rate is the fraction of mapped reads marked as duplicate reads in a particular data set. In contrast to overlapping reads, duplicate reads offer no additional information and are removed from sequencing data during bioinformatic analysis, a process known as deduplication.
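The definition above can be sketched in a few lines. This simplified example assumes each mapped read is a hypothetical `(chrom, start, end, strand)` tuple and keeps the first read seen at each location; dedicated deduplication tools typically use more information (such as mate positions and base qualities) when deciding which reads to mark, so this is only an illustration of the idea.

```python
def duplication_rate(reads):
    """Return (duplication_rate, unique_reads).

    reads: list of (chrom, start, end, strand) tuples for mapped reads.
    Reads sharing all four values are treated as duplicates of each other.
    """
    seen = set()
    unique_reads = []
    duplicates = 0
    for read in reads:
        if read in seen:
            duplicates += 1
        else:
            # Keep the first read observed at each location.
            seen.add(read)
            unique_reads.append(read)
    rate = duplicates / len(reads) if reads else 0.0
    return rate, unique_reads


# Toy example: three reads share identical coordinates and strand.
example_reads = [
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),
    ("chr1", 100, 250, "+"),
    ("chr1", 120, 270, "+"),  # overlapping, but not a duplicate
    ("chr1", 100, 250, "-"),  # same coordinates, opposite strand
    ("chr2", 100, 250, "+"),
]
example_rate, example_unique = duplication_rate(example_reads)
```

In this toy example two of the six reads are marked as duplicates, giving a duplication rate of one third, while the read that merely overlaps the others is retained because its start and end coordinates differ.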

Although removing duplicate reads may initially appear to reduce the sequencing depth, this process actually increases the confidence in the coverage of target regions since all remaining reads are unique. Failure to remove duplicate reads inflates coverage in certain regions and may overrepresent SNPs or produce false variant calls from PCR- or sequencing-derived errors.

Factors that lead to high duplication rates include using low-input library preparation reagents, PCR over-amplification during or after library preparation and target capture, the use of low-complexity libraries in hybrid capture, and over-sequencing. Duplicates can also arise when the same DNA strand is used to form multiple clusters on the flow cell [7]. To minimize duplication rates, use adequate sample input and reduce PCR cycles when possible. The use of paired-end sequencing can also help lower the duplication rate.


Figure 5. Total reads mapped to a reference genome, showing duplicate rates.
Duplicate reads are highlighted in blue and green. The duplicates will be removed during analysis, increasing the accuracy of the data.

Summary

Sequencing metrics such as depth of coverage, on-target rate, GC-bias, Fold-80 base penalty, and duplication rate provide important information about the efficiency and specificity of hybridization-based NGS target enrichment experiments. These are only a small subset of the many sequencing metrics available, but understanding them allows you to better plan your targeted sequencing experiments, understand your data, and modify targeted NGS workflows to conserve resources and improve results.


Learn more about KAPA Target Enrichment

Discover robust new products for hybridization-based target capture, with streamlined workflows that focus sequencing resources on regions of interest. Increase efficiency and depth of targeted resequencing for human genetic disease and cancer research, as well as other applications. The KAPA Target Enrichment portfolio offers solutions for:

Human whole-exome sequencing (WES)
Custom-designed panels for human or nonhuman genomic regions
KAPA SARS-CoV-2 Target Enrichment Panel

References

1. Yang, Y., Muzny, D. M., Reid, J. G., Bainbridge, M. N., Willis, A., Ward, P. A., ... & Eng, C. M. (2013). Clinical whole-exome sequencing for the diagnosis of mendelian disorders. New England Journal of Medicine, 369(16), 1502-1511.
2. Mamanova, L., Coffey, A. J., Scott, C. E., Kozarewa, I., Turner, E. H., Kumar, A., ... & Turner, D. J. (2010). Target-enrichment strategies for next-generation sequencing. Nature Methods, 7(2), 111-118.
3. Sims, D., Sudbery, I., Ilott, N. E., Heger, A., & Ponting, C. P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics, 15(2), 121-132.
4. Zanti, M., Michailidou, K., Loizidou, M. A., Machattou, C., Pirpa, P., Christodoulou, K., ... & Hadjisavvas, A. (2021). Performance evaluation of pipelines for mapping, variant calling and interval padding, for the analysis of NGS germline panels. BMC Bioinformatics, 22(1), 1-21.
5. Dohm, J. C., Lottaz, C., Borodina, T., & Himmelbauer, H. (2008). Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research, 36(16), e105.
6. Chen, Y. C., Liu, T., Yu, C. H., Chiang, T. Y., & Hwang, C. C. (2013). Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS ONE, 8(4), e62856.
7. Aird, D., Ross, M. G., Chen, W. S., Danielsson, M., Fennell, T., Russ, C., ... & Gnirke, A. (2011). Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology, 12(2), 1-14.

For Research Use Only. Not for use in diagnostic procedures.
