Results: 5

1.
Figure 4

Figure 4. Differences between empirical and reported quality scores calculated by GATK BQSR for recalibration based on the spike-in standards. From: Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing.

Blue (<0) indicates the reported quality scores are too high, and red/yellow (>0) indicates the reported quality scores are too low. The mean differences across all samples are calculated for each combination of reported quality score and cycle (a and c) or reported quality score and dinucleotide (b and d) for Illumina (a and b) and SOLiD (c and d), respectively.
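As a rough illustration of the sign convention above, the following minimal sketch computes a Phred-scaled empirical quality from mismatch counts and subtracts the reported score. The function names and the optional pseudocount are assumptions made for this example; they are not taken from the paper or from the GATK code.

```python
import math

def empirical_quality(errors, observations, pseudocount=1):
    """Phred-scaled empirical quality from mismatch counts.

    The pseudocount is an illustrative assumption that keeps the rate
    away from 0 when no errors were observed; it is not claimed to be
    the smoothing GATK applies.
    """
    rate = (errors + pseudocount) / (observations + 2 * pseudocount)
    if rate <= 0:
        raise ValueError("error rate must be positive to take a logarithm")
    return -10.0 * math.log10(rate)

def quality_difference(errors, observations, reported_q, pseudocount=1):
    """Empirical minus reported quality.

    Negative values mean the reported score is too high (blue in the
    figure); positive values mean it is too low (red/yellow).
    """
    return empirical_quality(errors, observations, pseudocount) - reported_q
```

For example, with `pseudocount=0`, 10 mismatches in 10,000 observations give an empirical quality of about 30, so a base reported at Q35 has a difference of about -5, meaning the reported score is too high.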

Justin M. Zook, et al. PLoS One. 2012;7(7):e41356.
2.
Figure 3

Figure 3. Comparison of GATK BQSR scores for recalibration based on the genome vs. recalibration based on the spike-in standards. From: Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing.

The differences in quality score recalibration values are calculated for each combination of reported quality score and cycle (a and c) or reported quality score and dinucleotide (b and d) for Illumina (a and b) and SOLiD (c and d). White blocks correspond to very large differences, generally with very few errors. The differences are (genome – spike-in standard), so blue (<0) indicates that genome recalibration would result in recalibrated quality scores that are too low, and yellow/red (>0) results in recalibrated quality scores that are too high. The p values for the differences are shown in Figure S2.

Justin M. Zook, et al. PLoS One. 2012;7(7):e41356.
3.
Figure 5

Figure 5. Comparison of error rates for each type of nucleotide change for spike-in standard or genome recalibration with (a) SOLiD 4 data with standards spiked-in in a large dynamic concentration range or (b) Illumina HiSeq data with standards spiked-in at equimolar concentrations. From: Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing.

The plots are annotated with transition/transversion (Ti/Tv) ratios, where random base changes result in Ti/Tv = 0.5, and biological mutations result in Ti/Tv >>0.5. To determine the significance of biological variants in the data, only bases with reported base quality scores above 30 are included in this analysis. All values are the mean ± SD of 2 samples with 2 biological replicates or of 4 sequenced samples with no replicates.
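The Ti/Tv = 0.5 baseline for random changes follows from counting: each base has one possible transition and two possible transversions. The sketch below, with hypothetical helper names chosen for this example, checks that arithmetic by enumerating all 12 single-base substitutions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes.

    Assumes ref != alt and that both are one of A, C, G, T.
    """
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(changes):
    """Ti/Tv ratio for a list of (ref, alt) substitutions."""
    changes = list(changes)
    ti = sum(1 for ref, alt in changes if is_transition(ref, alt))
    tv = len(changes) - ti
    return ti / tv if tv else float("inf")

# Every possible substitution once: 4 transitions and 8 transversions.
all_changes = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
uniform_ratio = ti_tv_ratio(all_changes)  # 4 / 8 = 0.5
```

A set of observed changes dominated by sequencing errors would be expected to sit near this baseline, while one enriched for true biological variants would sit well above it.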

Justin M. Zook, et al. PLoS One. 2012;7(7):e41356.
4.
Figure 2

Figure 2. BQSR recalibration inaccuracies due to limited size and coverage of the ERCC spike-in standards, compared to the inaccuracies caused by recalibrating from the genome excluding known variant sites in dbSNP. From: Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing.

The errors due to the limitations of the spike-in standards are the mean absolute difference between the recalibration coefficients calculated from a randomly selected 50% of the spike-in standard bases (ERCC Set A) and the remaining 50% of the bases (ERCC Set B). Because the mean absolute differences are lower for the spike-in standards, they serve as a reasonable proxy for accuracy of the recalibration coefficients. Differences are calculated for the base quality score reported from the instrument (RpQS), dinucleotide context (Dinuc), and machine cycle (Cycle). The differences are the mean ± SD (n = 4) for SOLiD4 with spike-in standards spiked-in in a large dynamic concentration range with 250–700× mean coverage (SOLiD-DR), and for Illumina HiSeq with spike-in standards spiked-in at equimolar concentrations with 5500–8500× mean coverage (Illumina-EP). The use of spike-in standards for recalibration significantly improves upon the traditional genome recalibration in all cases (p < 10⁻⁴).
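The split-half comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the fixed random seed, and the representation of recalibration coefficients as a dictionary keyed by covariate value are all choices made for this example, not the authors' implementation.

```python
import random

def split_halves(items, seed=0):
    """Randomly split items into two halves (Set A and Set B)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def mean_absolute_difference(coeffs_a, coeffs_b):
    """Mean |A - B| over covariate values present in both halves.

    coeffs_a and coeffs_b map a covariate value (for example a cycle
    number or a dinucleotide) to a recalibration coefficient.
    """
    shared = sorted(set(coeffs_a) & set(coeffs_b))
    if not shared:
        raise ValueError("the two halves share no covariate values")
    total = sum(abs(coeffs_a[key] - coeffs_b[key]) for key in shared)
    return total / len(shared)
```

In use, the coefficients would be computed separately from each half and then compared; a smaller mean absolute difference indicates that the coefficients are more reproducible for that covariate.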

Justin M. Zook, et al. PLoS One. 2012;7(7):e41356.
5.
Figure 1

Figure 1. Systematic errors and base quality score recalibration. From: Synthetic Spike-in Standards Improve Run-Specific Systematic Error Analysis for DNA and RNA Sequencing.

(a) Observed variants in the reads can result from a variety of biological causes and sequencing errors and biases. (1) Random sequencing errors are relatively rare at any given position in the reference, and are generally reflected accurately in the reported base quality score from the instrument. (2) Biological variants that are included in the SNP database (e.g., dbSNP for humans) are excluded from the base quality score recalibration (BQSR), and therefore do not decrease the empirical quality scores. (3) RNA editing can occur at frequencies less than 50%, so it can be difficult to distinguish from SSEs. These observed variants are treated as SSEs by the BQSR algorithm, incorrectly decreasing their base quality scores and quality scores of similar bases in other locations in the genome. (4) Biological variants that are not in dbSNP are also treated as SSEs by the BQSR algorithm, again decreasing their and similar bases’ recalibrated quality scores. (5) Since variant bases are only seen on one strand, they are likely to be SSEs. In this case, the BQSR algorithm would decrease the quality scores of the dinucleotide on the forward strand (GG). (b) Example reads and the covariates for each base used by GATK BQSR. The red columns would be counted as errors when calculating empirical quality scores. (c) Schematic of the GATK BQSR process, in which reported quality scores from the instrument are adjusted (or “recalibrated”) using empirical quality scores associated with the covariates reported quality score, machine cycle, and dinucleotide context.
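Panels (b) and (c) can be illustrated with a simplified tabulation of covariates for one read. The sketch below is an assumption-laden toy, not GATK's algorithm: the function name, the key layout, and the treatment of the first base (which has no preceding base and so no dinucleotide) are all hypothetical choices for this example. It counts a mismatch to the reference as an error and skips positions listed as known variant sites.

```python
from collections import defaultdict

def tabulate_covariates(read, reference, qualities, known_sites=frozenset()):
    """Count [errors, observations] per covariate combination.

    read, reference: aligned strings of equal length (no indels).
    qualities: reported quality score for each base in the read.
    known_sites: 0-based positions excluded from counting, standing in
                 for known variant sites such as those in dbSNP.
    """
    counts = defaultdict(lambda: [0, 0])
    for i, (base, ref_base, q) in enumerate(zip(read, reference, qualities)):
        if i in known_sites:
            continue  # known variants are not treated as errors
        keys = [("cycle", q, i + 1)]  # machine cycle is 1-based
        if i > 0:
            # dinucleotide context: previous base plus current base
            keys.append(("dinuc", q, read[i - 1] + base))
        is_error = int(base != ref_base)
        for key in keys:
            counts[key][0] += is_error
            counts[key][1] += 1
    return dict(counts)
```

Each resulting error and observation count could then be converted to an empirical quality score, which is the quantity the recalibration compares against the reported score for that covariate combination.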

Justin M. Zook, et al. PLoS One. 2012;7(7):e41356.
