• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of narLink to Publisher's site
Nucleic Acids Res. 2004; 32(17): 5183–5191.
Published online Sep 30, 2004. doi:  10.1093/nar/gkh850
PMCID: PMC521663

Adjust quality scores from alignment and improve sequencing accuracy

Abstract

In shotgun sequencing, statistical reconstruction of a consensus from alignment requires a model of measurement error. Churchill and Waterman proposed one such model and an expectation–maximization (EM) algorithm to estimate sequencing error rates for each assembly matrix. Ewing and Green defined Phred quality scores for base-calling from sequencing traces by training a model on a large amount of data. However, sample preparations and sequencing machines may work under different conditions in practice and therefore quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete in the sense that they do not describe error patterns. We observe that each nucleotide base has its specific error pattern that varies across the range of quality values. We develop models of measurement error for shotgun sequencing by combining the two perspectives above. We propose a logistic model taking quality scores as covariates. The model is trained by a procedure combining an EM algorithm and model selection techniques. The training results in calibration of quality values and leads to a more accurate construction of consensus. Besides Phred scores obtained from ABI sequencers, we apply the same technique to calibrate quality values that come along with Beckman sequencers.

INTRODUCTION

Shotgun sequencing is the standard methodology for genome sequencing (1). Starting with a whole genome, or a large genomic region, short random fragments are generated and sequenced. Enough fragments are sequenced so that almost all positions in the genome or region are covered multiple times just by chance. The standard sequencing procedure is exemplified by the commonly used Phred/Phrap suite of software (2,3). First, each fragment is base-called from its chromatogram, i.e. a vector times series of four fluorescence intensities, and a sequence of A, C, G and T, is inferred. The commercial producers of sequencing devices include Applied Biosystems, Inc., Beckman Coulter, Inc., etc. Typically each base is accompanied by a quality value that is meant to convey some idea of how likely the base-calling is correct. Second, the base-called sequences are assembled into a contig using an ad hoc alignment algorithm that compares both strands and overlap between fragments. Quality values are taken into account during the alignment to eliminate low quality reads. Third, a consensus sequence is constructed from this alignment by comparing different reads for each position.

The accuracy of the consensus sequence depends on the coverage (i.e. how many independent observations we have for each nucleotide base pair in the genome) and the performance of the base-calling algorithm. The quality values of base-calling play a crucial role in the construction of consensus. If they are misleading or interpreted incorrectly, the consensus sequence will be less reliable. The Phred quality scores for base-calling are defined from sequencing traces in such a way that they have a probabilistic interpretation. This is achieved by training a model on a large amount of data. However, in practice, sample preparations and sequencing machines may work under different conditions and therefore quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete in the sense that they do not describe error patterns. We observe that each nucleotide base has its specific error pattern that varies across the range of quality values.

Churchill and Waterman (4) proposed another model to define a consensus. It is based on an assembly without assuming the availability of quality values. The parameters in the model include composition probabilities and sequencing error rates and are estimated by an expectation–maximization (EM) algorithm based on the alignment. The consensus is defined by the probability of the target sequence conditional on observations. This offers an evaluation of reliability.

In this article, we combine quality scores of base-calling and the idea in Churchill and Waterman's model (4), to improve sequencing accuracy. Specifically, we start with assembled contigs and quality scores to build up complete probabilistic error models. One option is to represent the error pattern of each nucleotide by a multinomial model. Since the true sequence is unknown, we develop an EM algorithm to deal with missing data. In a more sophisticated logistic model, we take quality scores as covariates. To parsimoniously represent the non-linear effect of quality scores, we adopt simple piecewise linear functions in the regression model. The model is trained by a procedure combining an EM algorithm, the Bayesian information criterion (BIC) criterion and backward deletion. The training results in calibration of quality values and leads to a more accurate consensus construction.

MATERIALS AND METHODS

Sequencing data

The first source of data in this article comes from the Campylobacter jejuni whole-genome shotgun sequencing project (5). The raw data, generated on ABI 373 and 377 automated sequencers were downloaded from the Sanger Center (ftp.sanger.ac.uk/pub/pathogens/cj). The total length of the genome sequence is 1 641 481 bp. There are 33 824 reads and the average coverage is 10-folds. The sequence assembly was obtained by Phrap (see http://www.phrap.org). We tested our methods on the first 100 kb of the reference sequence and the corresponding reads. The coverage of the C.jejuni sequencing project is unusually high, so we randomly removed some reads to decrease the average coverage from 10- to 6-fold. Since the reference sequence was obtained on reads of 10-fold, we will assume that it is close to the true sequence later when we calculate single base discrepancy (SBD).

To test our methods on data obtained using another sequencing technology, we analyzed data from an Arabidopsis thaliana re-sequencing project carried out at USC (http://walnut.usc.edu/2010) using Beckman Coulter CEQ automated sequencers. These data were obtained as part of a polymorphism survey and thus contain different haplotypes. Since we are interested only in sequencing error, in this paper, we selected ~500 kb of raw data from non-polymorphic regions.

Setup

Throughout the article, we represent random variables by capital letters and their values by small letters. First, reads are aligned into an assembly matrix. We introduce two alphabets: = {A, C, G, T, −} and = {A, C, G, T, −, N, [var phi]}, where − denotes an internal gap, N denotes any ambiguous determination of a base and the null symbol [var phi] is for non-aligned positions beyond the ends of a fragment. Each fragment is either in direct or in reverse complemented orientation. To deal with the issue of orientation, we introduce a complementary operation ~ on the alphabet as follows:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ24.gif

,

An external file that holds a picture, illustration, etc.
Object name is gkh850equ25.gif

and

An external file that holds a picture, illustration, etc.
Object name is gkh850equ26.gif

. An illustrative example of assembly matrices is shown in Figure Figure11.

Figure 1
An illustrative example of the problem. The bases with a ~ sign represent their complementary bases.

We denote the target sequence by S = S1S2 ··· Sn, where Sj takes any value from the alphabet . Random fragments generated from the template are aligned by an assembler. This results in an assembly matrix {Xij}m×n. The elements of the fragment assembly matrix, denoted by xij, take values from the alphabet . Each row in {Xij} contains the ordered sequence of bases and possible gaps in a particular fragment. The column index j = 1, …, n runs from the leftmost base in the assembly to the rightmost. We represent the orientation of the ith fragment in the assembly by

An external file that holds a picture, illustration, etc.
Object name is gkh850equ1.gif

The observations Xij are subject to measurement error. We denote the true base of fragment i at position j by Yij [set membership] . Therefore

An external file that holds a picture, illustration, etc.
Object name is gkh850equ2.gif

We denote the compositional probability by αa = Pr(Sj = a), a [set membership] .

Phred quality scores

After appropriate preprocessing, each sequencing chromatogram contains a series of peaks of four colors. The rationale of base-calling is that each peak represents one base, and the order of peaks from the four channels is consistent with the order of nucleotide bases on the underlying DNA fragment. In addition to base-calling, Phred also assigns each base-call a quality score q, which takes integer values from 0 to Q (Q is 64 for Phred scores) (2). Quality scores are based on trace features such as peak spacing, uncalled/called peak ratio and peak resolution. The model that defines quality scores was so trained, on a large amount of sequencing traces, that the scores could be interpreted as probabilities. Mathematically, the score is defined by

An external file that holds a picture, illustration, etc.
Object name is gkh850equ3.gif

where, εij is the error probability of base-calling. We randomly select one position from an assembly and let Y and X be its true base and called base, respectively. Let denote the event that the base-calling is incorrect, namely, = {XY}. Then the correct calling probability given base a is: 1 − ε = Pr(Y = a|X = a), where a [set membership] . Notice that

An external file that holds a picture, illustration, etc.
Object name is gkh850equ4.gif

If the assumption of unbiased base-calling is valid, namely, Pr(X = a) = Pr(Y = a), then we have Pr(X = a|Y = a) = Pr(Y = a|X = a) = 1 − ε. Consequently, we are able to interpret the Phred scores as probabilities by

An external file that holds a picture, illustration, etc.
Object name is gkh850equ5.gif

Even though Phred scores are valuable information for the construction of consensus, they are not the complete picture of measurement error. In general, for ab [set membership] , we have

An external file that holds a picture, illustration, etc.
Object name is gkh850equ6.gif

We denote sequencing error rates, conditional on event , by w(b|a) = Pr(X = b|Y = a, ) for ab, and arrange them in the following table:

where {w(b|a)} satisfy

An external file that holds a picture, illustration, etc.
Object name is gkh850equ7.gif

The sequencing error rates relate to the conditional probabilities as follows.

An external file that holds a picture, illustration, etc.
Object name is gkh850equ8.gif

Since Phred scores provide only partial information about sequencing error rates, we need to estimate the rest. For the sake of simplicity, we skip the issue of fragment orientation when we describe the sequencing error models.

Conditional sequencing error model

Our perspective is to incorporate Phred quality scores into the Churchill–Waterman model (4). We first adopt the parameterization in Equation 2 to model sequencing error, and refer to it as the conditional sequencing error model. The parameters θ in this model include the composition probability {αa} and the conditional sequencing error rates {w(b|a)}. The likelihood of the assembly and underlying sequence is given by

An external file that holds a picture, illustration, etc.
Object name is gkh850equ9.gif

Since {Sj} are missing, we estimate the parameters by the EM algorithm. The following form of log-likelihood is easy for imputing the sufficient statistics.

An external file that holds a picture, illustration, etc.
Object name is gkh850equ10.gif

Logistic model

From a regression perspective, we take Phred scores as a covariate. We denote

An external file that holds a picture, illustration, etc.
Object name is gkh850equ11.gif

We assume that base-calling error rates follow a logistic form:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ12.gif

where each covariate hl(q) is a function of quality score q and takes the form

An external file that holds a picture, illustration, etc.
Object name is gkh850equ13.gif

Notice that each function has a knot ol, where 0 ≤ o1 < o2, … < oL < Q. Thus each regressor is a piecewise linear function of the quality score, which allows us to approximate any potential non-linear effect. Equivalently, base-calling rates can be represented as:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ14.gif

Similar to Equation 3, this parameterization leads to the following form of log-likelihood function for the assembly and the underlying sequence, up to a term only relating to parameters.

An external file that holds a picture, illustration, etc.
Object name is gkh850equ15.gif

where θ represents all the unknown parameters.

Parameter estimation and EM training algorithm

In both the conditional sequencing error model and the logistic model, the underlying sequence {sj} is unknown. Its reconstruction relies on the estimates of parameters in the models. On the other hand, algorithms of estimating parameters are well established when {sj} are known. Thus we use the EM algorithm to train the model iteratively. In the E-step, we impute the sufficient statistic from observations at the current parameter value and in the M-step, we update the maximum likelihood estimate using the current imputed values of missing data. In the case of the conditional sequencing error model, the parameters are estimated by counting frequencies. In the case of the logistic model, the likelihood can be decomposed into five independent logistic regression models (6). Consequently, we run re-weighted least squares to estimate the parameters (7). The mathematical and technical details are rather lengthy and tedious, and will be published in our technical report (8).

Consensus and quality values

According to the logistic model, the distribution of nucleotides at each position is given by:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ16.gif

As shown in the formula, the issue of orientation can generally be dealt with by the orientation indicators {ri} and the complementary operator ~. In our convention, we observe

An external file that holds a picture, illustration, etc.
Object name is gkh850equ27.gif

directly when a fragment is in reverse orientation. After we plug in the estimated value of θ, we define the consensus at one position and its quality score by maximizing the above probability.

Parsimonious representation and model selection

Although we can include piecewise linear functions at all possible knots in the logistic regression model (Equation 4), we seek a parsimonious model for several purposes. First, we would like to avoid potential overfitting, especially when the size of assembly is not large. Second, a parsimonious model may give us insights into quality scores.

The selection of knots is nothing but a model selection problem. To compare different models, we need an evaluation criterion. Based on quality scores, each fitted model defines a set of error rates, which in turn can be used to construct a consensus. If the truth is known, we can calculate SBD for a model (9). SBD is thus one criterion for model comparison.

A practical solution to model selection ought to be self-evident from data. One such criterion is BIC (10). It is defined as

An external file that holds a picture, illustration, etc.
Object name is gkh850equ17.gif

That is, BIC penalizes log-likelihood by model complexity in terms of the number of parameters. For a logistic model with L knots, we have 20(L + 1) parameters. We calculate the BIC score for each model and choose the one that minimizes the quantity. The idea is to trade off goodness of fit and model complexity. Computationally, it is intensive to evaluate every model. We adopt the backward deletion strategy used in regression analysis to search for the optimal model (7).

RESULTS

Bias of quality scores

If we do not otherwise specify the data source, the results reported hereafter are based on the C.jejuni sequencing data explained earlier. In the conditional sequencing error model, quality scores are interpreted as error probabilities of base-calling. The model that defines the Phred scores is determined from a training data set (2,3). When the model is applied to sequencing traces obtained under different working conditions, scores may deviate from probabilities to some extent. We examine this issue on sequencing reads from one BAC. After alignment, we count incorrect base-calls for each value of quality scores—Phred scores take integer values from 0 to 64. The observed score for the predicted quality score q is calculated from the assembly by:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ18.gif

where Errq and Corrq are, respectively, the number of incorrect and correct base-calls at quality score q. In Figure Figure2,2, we plot the observed scores against predicted Phred scores. When scores are >55, essentially no error is observed. When scores are <20, the prediction is fairly consistent. When scores are between 20 and 55, Phred scores overestimate probabilities. Thus calibration is desired for the purpose of improving accuracy of base-calling.

Figure 2
Observed sequencing error rates versus predicted error rates by Phred quality score.

Next we apply the logistic model to the data. Let

An external file that holds a picture, illustration, etc.
Object name is gkh850equ19.gif

where θ represents all the parameters. Under the assumption of unbiased base-calling, we have:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ20.gif

Then we can assign a new quality score to each base-call xij:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ21.gif

The bias of this adjusted quality score can be examined by:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ22.gif

where Errq and Corrq are respectively the number of incorrect and correct base-calls at the adjusted quality score q′. We plot the observed against the corrected quality score in Figure Figure3.3. Compared with Figure Figure2,2, we see that the corrected quality score is more consistent with the observed quality score. After adjustment, no error occurs above score value 42.

Figure 3
Observed sequencing error rates versus corrected error rates by a logistic model.

The CEQ software that comes along with Beckman sequencers offers quality values similar to Phred scores (11). However, their scores are trained from a smaller data set when compared with Phred. As in Figure Figure2,2, we plot the observed scores against the predicted CEQ scores in Figure Figure4.4. The overestimate pattern is seen across almost the entire region. Then we apply the adjustment procedure to correct for the obvious bias. The training data set is ~500 kb. In Figure Figure55 we plot the observed against the corrected quality scores. This calibration can benefit our haplotype construction project at USC.

Figure 4
Observed sequencing error rates versus predicted error rates by CEQ Quality Score.
Figure 5
Observed sequencing error rates versus predicted error rates by a logistic model.

Score-dependent error patterns

In the conditional sequencing error model, we assume that the error patterns, or the conditional error rates, are constant regardless of quality scores. To check the assumption, we compare frequencies of each type of sequencing errors at each quality value ranging from 0 to 64. That is, given an assembly, we calculate the empirical conditional error rates as follows:

An external file that holds a picture, illustration, etc.
Object name is gkh850equ23.gif

When the true base is , we plot these conditional error rates against quality scores in Figure Figure6.6. It indicates that error patterns do depend on quality scores. After we fit a logistic model to the assembly, the conditional error probabilities as a function of quality scores are shown in Figure Figure7.7. When quality scores are >55, no sequencing error is observed. Thus conditional error patterns make sense only for scores <55. Many sequencing projects use the Q20 rule as a rough measure of the effective length of a DNA read (12). Scores <20 indicate low quality regions. As we can see, error patterns change significantly at scores ~20–24. Since we do not have many bases with high scores, the inference in the high quality range is less reliable than that in the low quality range. When quality scores are <20, C and G are similar to each other; when the scores are >24, a totally different error pattern is observed. This by-product of the parsimonious model offers another perspective of the Q20 rule.

Figure 6
Observed score-wise conditional error rates. The true base is A.
Figure 7
Conditional error rates predicted from a logistic model. The true base is A.

Comparison of different methods

We have introduced a conditional sequencing model and a logistic model. In the literature, two different methods exist to estimate sequencing error rates. On the one hand, the method proposed by Churchill and Waterman (4) relies only on assembly but not on quality scores, and we refer to it as the simple probability model. On the other hand, we can use Phred scores and simply assign equal error chances among bases. Hereafter, we refer to it as the simple quality score method. In Table Table1,1, we compare the performance of these methods using the C.jejuni data set. The majority rule defines the consensus by choosing the most frequent nucleotide at each position. Compared with the majority rule, the simple probability model reduces errors by one-quarter, not resorting to any other information other than the assembly itself. The simple quality score method cuts errors by more than half. The gain is from the training data set that defines Phred scores. The conditional sequencing error model reduces errors further. The best result, 346 SBDs, is achieved by the logistic model with five knots. BIC selects a logistic model with three knots that has 348 SBDs. The likelihood scores for these models are also shown in Table Table1.1. The likelihood of a model measures its goodness of fit to the data. For the same data set, we slightly perturb the Phred scores associated with the called bases, and errors resulting from the simple quality score method increase substantially from 367 to 517 while the performance of the logistic method remains almost the same.

Table 1.
Comparison of different methods

DISCUSSION

Alignment algorithm

We have observed that different alignment algorithms may produce slightly different assembly matrices. Our adjustment of scores is adaptive to alignment in the sense that it optimizes performance based on each assembly. When a new alignment procedure is used, adjustment may change correspondingly.

Phrap (see http://www.phrap.org) examines all individual sequences at a given position, and generally uses the highest quality sequence to build the consensus. Phrap also uses the quality information of individual sequences to estimate the quality of the consensus sequence. By comparison, our method can be used with any other assembly algorithms. The reconstruction of consensus and definition of quality value are based on a probabilistic model. It can adjust potential bias of quality scores in base-calling.

Computing complexity

In the logistic model, the inner loop computes the parameters in logistic regressions by either the Fisher scoring method or by the Newton–Raphson method. Both methods converge quadratically. The outer loop is an EM procedure, and in general an EM algorithm converges at a linear rate. Thus, the computing complexity hinges on the EM algorithm. Specifically, let κ, L, D1 and D2 be the coverage, number of knots, number of Newton iterations and number of EM iterations, respectively; then the complexity is about O[(nκ + L3D1)D2], where n is the size of the target DNA. Similarly, the complexity for the conditional sequencing error model is O(nκD), where D is the number of EM iterations.

Repeat patterns

We have checked the errors that are left uncorrected by the procedure described in this article. Almost all of them are from regions with repeat patterns. They can be single-, di- or tri-nucleotide repeats. Situations become even more subtle if two repeats are next to one another. In one example, an A is in the middle of four Cs and is missed. In these cases, it is not appropriate to assume that the sequencing error pattern is independent of local contexts. We are considering more sophisticated models to deal with regions with repeats. Li and Speed (13,14) proposed a parametric deconvolution procedure to improve accuracy of sequencing for regions with repeats.

Size of training data set

We have applied our method to data sets of different sizes. The larger the data set, the greater the number of knots selected in the optimal model. We can achieve satisfactory training with an assembly of size 30 kb and a coverage of six. The result is not sensitive to the ‘quality’ of the quality scores. In comparison, the training of Phred scores requires several hundred million base-calls, and it has been carried out on the sequencing traces generated from ABI sequencers. It is difficult to obtain reliable quality scores for other sequencers by the Phred training method if only limited base-calling data are available. In this situation, we can apply the method proposed in this article to adjust the preliminary quality scores obtained under roughly the same conditions and obtain probabilistically meaningful quality scores. Earlier, we reported one such example that calibrates Beckman CEQ quality scores using 500 kb from an Arabidopsis re-sequencing project.

ACKNOWLEDGEMENTS

We thank Prof. Michael Waterman for his various suggestions. This work is supported by a NIH CEGS grant to University of Southern California.

REFERENCES

1. Adams M.D., Fields,C. and Ventor,J.C. (eds). (1994) Automated DNA Sequencing and Analysis. Academic Press, London, San Diego.
2. Ewing B. and Green,P. (1998) Base-calling of automated sequencer traces using phred. 2. Error probabilities. Genome Res., 8, 186–194. [PubMed]
3. Ewing B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of automated sequencer traces using phred. 1. Accuracy assessment. Genome Res., 8, 175–185. [PubMed]
4. Churchill G.A. and Waterman,M.S. (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics, 14, 89–98. [PubMed]
5. Parkhill J., Wren,B.W., Mungall,K., Ketley,J.M., Churcher,C., Basham,D., Chillingworth,T., Davies,R.M., Feltwell,T., Holroyd,S., et al. (2000) The genome sequence of the food-borne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature, 403, 665–668. [PubMed]
6. McCullagh P. and Nelder,J.A. (1989) Generalized Linear Model, 2nd edn. Chapman and Hall, London.
7. Venables W.N. and Ripley,B.D. (1994) Modern Applied Statistics with S-plus, 2nd edn. Springer, New York.
8. Li M. and Li,L.M. (2004) Estimate Sequencing Error Patterns by Logistic Regression Models with Missing Responses. Technical report, Computational Biology, University of Southern California.
9. Felsenfeld A., Peterson,J., Schloss,J. and Guyer,M. (1999) Assessing the quality of the DNA sequence from the human genome project. Genome Res., 9, 1–4. [PubMed]
10. Schwarz G. (1978) Estimating the dimension of a model. Ann Stat, 6, 461–464.
11. Winer R., Yen,G. and Huang,J. (2002) Call Scores and Quality Values: Two Measures of Quality Produced by the CEQ® Genetic Analysis Systems. Beckman Coulter, Inc, Fullerton, CA.
12. Nelson D.O. and Fridlyand,J. (2003) Designing meaningful measures of real length for data produced by DNA sequencers. In Goldstein,D. (ed.), Science and Statistics: A Festschrift for Terry Speed, Vol. 40 of Lecture Notes—Monograph series. Institute of Mathematical Statistics, Beachwood, OH, pp. 295–306.
13. Li L.M. (2002) DNA sequencing and parametric deconvolution. Stat Sin, 12, 179–202.
14. Li L.M. and Speed,T.P. (2002) Parametric deconvolution of positive spike trains. Ann Stat, 28, 1279–1301.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...