The distribution of AT-rich and GC-rich regions—often referred to as GC content—is uneven across the genomes of many organisms, including humans. Regions of high or low GC content are often unevenly represented during sequencing; this disproportionate coverage is known as GC-bias. Figure 3 provides an example of GC-bias distribution plots, which are used to visualize GC bias across regions of variable GC content.
High amounts of GC-bias can be introduced during library preparation (especially in with-PCR workflows), during hybrid capture, or during the sequencing run itself [5,6,7]. Identification of GC-bias during data analysis can help determine if more sequencing is required to achieve the desired sequencing depth across all of the target regions.
To minimize GC-bias in sequencing experiments, prepare sequencing libraries with a robust library preparation workflow known to minimize GC-bias; optimize PCR enrichment steps and minimize the number of PCR cycles; use properly calibrated thermocyclers; and use well-designed, high-quality probes.
Figure 3. Examples of GC-bias distribution plots. The plot on the left demonstrates a successful experiment with low GC-bias, where the GC normalized coverage (green dots) closely resembles the %GC in the reference genome (blue bars). The plot on the right displays GC-bias, where the fraction of normalized coverage is much higher in GC-rich regions. In both plots, the blue bars show varying levels of GC content from the reference genome; the x-axis shows the %GC of 100-base windows; values on the secondary y-axis (on the right) show how many windows across the genome contain the indicated %GC. Green dots on the plot show the fraction of normalized coverage per window, read against the primary y-axis.
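The quantities plotted in Figure 3 can be approximated with a short script. The following is a minimal sketch in Python, not the method used by any particular analysis tool. It assumes that the reference has already been split into windows and that a read count is available for each window; the function names and the input format are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def gc_percent(seq):
    """Return the %GC of a window sequence, rounded to a whole number."""
    if not seq:
        raise ValueError("window sequence must not be empty")
    seq = seq.upper()
    gc_bases = sum(1 for base in seq if base in "GC")
    return round(100 * gc_bases / len(seq))


def normalized_coverage_by_gc(windows):
    """Summarize coverage per %GC bin.

    windows: iterable of (window_sequence, read_count) pairs (assumed input).
    Returns {gc_percent: (number_of_windows, normalized_coverage)}, where
    normalized coverage is the mean read count of windows in that %GC bin
    divided by the mean read count over all windows.
    """
    window_counts = defaultdict(int)
    read_counts = defaultdict(int)
    for seq, reads in windows:
        gc = gc_percent(seq)
        window_counts[gc] += 1
        read_counts[gc] += reads

    total_windows = sum(window_counts.values())
    total_reads = sum(read_counts.values())
    if total_windows == 0 or total_reads == 0:
        raise ValueError("need at least one window and one read")
    mean_reads_per_window = total_reads / total_windows

    return {
        gc: (
            window_counts[gc],
            (read_counts[gc] / window_counts[gc]) / mean_reads_per_window,
        )
        for gc in sorted(window_counts)
    }


# Toy example with 10-base windows (real plots use 100-base windows).
example = [("AAAAAAAAAA", 10), ("AAAAAGGGGG", 20), ("GGGGGCCCCC", 30)]
summary = normalized_coverage_by_gc(example)
for gc, (n_windows, norm_cov) in summary.items():
    print(gc, n_windows, norm_cov)
```

In this sketch, a normalized coverage near 1.0 in every %GC bin corresponds to the low-bias plot on the left of Figure 3, while values that rise with %GC, as in the toy data, correspond to the plot on the right.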
Analysis of sequencing data typically reveals that some target regions have achieved higher coverage than others. The Fold-80 base penalty metric is one way to assess coverage uniformity. Once the mean target coverage is determined for an experiment, the Fold-80 base penalty describes how much more sequencing is required to bring 80% of the target bases to the mean coverage. Thus, a run with perfect coverage uniformity would have a Fold-80 base penalty of 1, corresponding to an on-target rate of 100% and uniform coverage (see Figure 4). Values higher than 1 indicate uneven coverage. For instance, a Fold-80 value of 2 means that twice as much (2-fold) sequencing is needed for 80% of the target bases to reach the mean coverage.
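As a worked illustration, the Fold-80 base penalty can be estimated from per-base depths. The sketch below assumes one common formulation, which is not stated in this article: the mean target coverage divided by the depth that 80% of target bases meet or exceed (the 20th percentile of depth). Scaling every depth by that ratio lifts the 20th-percentile depth to the original mean, which matches the description above. The function name and the input list are hypothetical.

```python
def fold_80_base_penalty(depths):
    """Estimate the Fold-80 base penalty from per-base target depths.

    depths: list of coverage values, one per target base (assumed input).
    Assumed formulation: mean depth / depth met or exceeded by 80% of bases.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    ordered = sorted(depths)
    mean_depth = sum(ordered) / len(ordered)
    # At least 80% of bases have a depth >= the value at this index.
    depth_at_80_percent = ordered[len(ordered) // 5]
    if depth_at_80_percent == 0:
        # No amount of scaling brings zero-coverage bases up to the mean.
        return float("inf")
    return mean_depth / depth_at_80_percent


# Perfectly uniform coverage gives a penalty of 1.0.
uniform = [100] * 10
# Uneven coverage: mean depth is 100, and 80% of bases have depth >= 50.
uneven = [20, 40, 50, 80, 100, 100, 120, 140, 150, 200]

print(fold_80_base_penalty(uniform))  # 1.0
print(fold_80_base_penalty(uneven))   # 2.0
```

In the uneven example, twice as much sequencing would be needed for 80% of the bases to reach the current mean depth of 100.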
The Fold-80 base penalty provides information about the capture efficiency of the probes in the panel, which is impacted by both probe design and the quality of the probes themselves. Thus, the use of high-quality, well-designed probes and high-quality reagents can decrease the Fold-80 base penalty, reducing the need for additional, costly sequencing runs.
Figure 4. Comparison of ideal and observed uniformity. The plot on the left displays results with a 100% on-target rate and uniform coverage, yielding a Fold-80 base penalty of 1.0. The plot on the right shows results that include off-target reads and uneven coverage; thus, the Fold-80 base penalty is greater than 1.0.
Sequencing metrics such as depth of coverage, on-target rate, GC-bias, Fold-80 base penalty, and duplication rates provide important information about the efficiency and specificity of hybridization-based NGS target enrichment experiments. These are only a small subset of the many sequencing metrics available, but understanding them allows you to better plan your targeted sequencing experiments, understand your data, and modify targeted NGS workflows to conserve resources and improve results.
For a more in-depth explanation of these metrics and additional sequencing resources, watch our Ask a Scientist videos below.