# Adjust quality scores from alignment and improve sequencing accuracy

^{*}To whom correspondence should be addressed. Tel: +1 213 740 2407; Fax: +1 213 740 2437; Email: ude.csu@ielil

## Abstract

In shotgun sequencing, statistical reconstruction of a consensus from alignment requires a model of measurement error. Churchill and Waterman proposed one such model and an expectation–maximization (EM) algorithm to estimate sequencing error rates for each assembly matrix. Ewing and Green defined *Phred* quality scores for base-calling from sequencing traces by training a model on a large amount of data. However, sample preparations and sequencing machines may work under different conditions in practice and therefore quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete in the sense that they do not describe error patterns. We observe that each nucleotide base has its specific error pattern that varies across the range of quality values. We develop models of measurement error for shotgun sequencing by combining the two perspectives above. We propose a logistic model taking quality scores as covariates. The model is trained by a procedure combining an EM algorithm and model selection techniques. The training results in calibration of quality values and leads to a more accurate construction of consensus. Besides *Phred* scores obtained from ABI sequencers, we apply the same technique to calibrate quality values that come along with Beckman sequencers.

## INTRODUCTION

Shotgun sequencing is the standard methodology for genome sequencing (1). Starting with a whole genome, or a large genomic region, short random fragments are generated and sequenced. Enough fragments are sequenced so that almost all positions in the genome or region are covered multiple times just by chance. The standard sequencing procedure is exemplified by the commonly used *Phred*/*Phrap* suite of software (2,3). First, each fragment is base-called from its chromatogram, i.e. a vector times series of four fluorescence intensities, and a sequence of A, C, G and T, is inferred. The commercial producers of sequencing devices include Applied Biosystems, Inc., Beckman Coulter, Inc., etc. Typically each base is accompanied by a quality value that is meant to convey some idea of how likely the base-calling is correct. Second, the base-called sequences are assembled into a contig using an ad hoc alignment algorithm that compares both strands and overlap between fragments. Quality values are taken into account during the alignment to eliminate low quality reads. Third, a consensus sequence is constructed from this alignment by comparing different reads for each position.

The accuracy of the consensus sequence depends on the coverage (i.e. how many independent observations we have for each nucleotide base pair in the genome) and the performance of the base-calling algorithm. The quality values of base-calling play a crucial role in the construction of consensus. If they are misleading or interpreted incorrectly, the consensus sequence will be less reliable. The *Phred* quality scores for base-calling are defined from sequencing traces in such a way that they have a probabilistic interpretation. This is achieved by training a model on a large amount of data. However, in practice, sample preparations and sequencing machines may work under different conditions and therefore quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete in the sense that they do not describe error patterns. We observe that each nucleotide base has its specific error pattern that varies across the range of quality values.

Churchill and Waterman (4) proposed another model to define a consensus. It is based on an assembly without assuming the availability of quality values. The parameters in the model include composition probabilities and sequencing error rates and are estimated by an expectation–maximization (EM) algorithm based on the alignment. The consensus is defined by the probability of the target sequence conditional on observations. This offers an evaluation of reliability.

In this article, we combine quality scores of base-calling and the idea in Churchill and Waterman's model (4), to improve sequencing accuracy. Specifically, we start with assembled contigs and quality scores to build up complete probabilistic error models. One option is to represent the error pattern of each nucleotide by a multinomial model. Since the true sequence is unknown, we develop an EM algorithm to deal with missing data. In a more sophisticated logistic model, we take quality scores as covariates. To parsimoniously represent the non-linear effect of quality scores, we adopt simple piecewise linear functions in the regression model. The model is trained by a procedure combining an EM algorithm, the Bayesian information criterion (BIC) criterion and backward deletion. The training results in calibration of quality values and leads to a more accurate consensus construction.

## MATERIALS AND METHODS

### Sequencing data

The first source of data in this article comes from the *Campylobacter jejuni* whole-genome shotgun sequencing project (5). The raw data, generated on ABI 373 and 377 automated sequencers were downloaded from the Sanger Center (ftp.sanger.ac.uk/pub/pathogens/cj). The total length of the genome sequence is 1641481 bp. There are 33824 reads and the average coverage is 10-folds. The sequence assembly was obtained by *Phrap* (see http://www.phrap.org). We tested our methods on the first 100 kb of the reference sequence and the corresponding reads. The coverage of the *C.jejuni* sequencing project is unusually high, so we randomly removed some reads to decrease the average coverage from 10- to 6-fold. Since the reference sequence was obtained on reads of 10-fold, we will assume that it is close to the true sequence later when we calculate single base discrepancy (SBD).

To test our methods on data obtained using another sequencing technology, we analyzed data from an *Arabidopsis thaliana* re-sequencing project carried out at USC (http://walnut.usc.edu/2010) using Beckman Coulter CEQ automated sequencers. These data were obtained as part of a polymorphism survey and thus contain different haplotypes. Since we are interested only in sequencing error, in this paper, we selected ~500 kb of raw data from non-polymorphic regions.

### Setup

Throughout the article, we represent random variables by capital letters and their values by small letters. First, reads are aligned into an assembly matrix. We introduce two alphabets: = {*A*, *C*, *G*, *T*, −} and = {*A*, *C*, *G*, *T*, −, *N*, }, where − denotes an internal gap, *N* denotes any ambiguous determination of a base and the null symbol is for non-aligned positions beyond the ends of a fragment. Each fragment is either in direct or in reverse complemented orientation. To deal with the issue of orientation, we introduce a complementary operation ~ on the alphabet as follows:

,

and

. An illustrative example of assembly matrices is shown in Figure Figure11.

We denote the target sequence by *S* = *S*_{1}*S*_{2} ··· *S _{n}*, where

*S*takes any value from the alphabet . Random fragments generated from the template are aligned by an assembler. This results in an assembly matrix {

_{j}*X*}

_{ij}_{m×n}. The elements of the fragment assembly matrix, denoted by

*x*, take values from the alphabet . Each row in {

_{ij}*X*} contains the ordered sequence of bases and possible gaps in a particular fragment. The column index

_{ij}*j*= 1, …,

*n*runs from the leftmost base in the assembly to the rightmost. We represent the orientation of the

*i*th fragment in the assembly by

The observations *X _{ij}* are subject to measurement error. We denote the true base of fragment

*i*at position

*j*by

*Y*. Therefore

_{ij}We denote the compositional probability by α_{a} = **Pr**(*S _{j}* =

*a*),

*a*.

*Phred* quality scores

After appropriate preprocessing, each sequencing chromatogram contains a series of peaks of four colors. The rationale of base-calling is that each peak represents one base, and the order of peaks from the four channels is consistent with the order of nucleotide bases on the underlying DNA fragment. In addition to base-calling, *Phred* also assigns each base-call a quality score *q*, which takes integer values from 0 to Q (Q is 64 for *Phred* scores) (2). Quality scores are based on trace features such as peak spacing, uncalled/called peak ratio and peak resolution. The model that defines quality scores was so trained, on a large amount of sequencing traces, that the scores could be interpreted as probabilities. Mathematically, the score is defined by

where, ε_{ij} is the error probability of base-calling. We randomly select one position from an assembly and let *Y* and *X* be its true base and called base, respectively. Let denote the event that the base-calling is incorrect, namely, = {*X* ≠ *Y*}. Then the correct calling probability given base *a* is: 1 − ε = **Pr**(*Y* = *a*|*X* = *a*), where *a* . Notice that

If the assumption of unbiased base-calling is valid, namely, **Pr**(*X* = *a*) = **Pr**(*Y* = *a*), then we have **Pr**(*X* = *a*|*Y* = *a*) = **Pr**(*Y* = *a*|*X* = *a*) = 1 − ε. Consequently, we are able to interpret the *Phred* scores as probabilities by

Even though *Phred* scores are valuable information for the construction of consensus, they are not the complete picture of measurement error. In general, for *a* ≠ *b* , we have

We denote sequencing error rates, conditional on event , by *w*(*b*|*a*) = **Pr**(*X* = *b*|*Y* = *a*, ) for *a* ≠ *b*, and arrange them in the following table:

where {*w*(*b*|*a*)} satisfy

The sequencing error rates relate to the conditional probabilities as follows.

Since *Phred* scores provide only partial information about sequencing error rates, we need to estimate the rest. For the sake of simplicity, we skip the issue of fragment orientation when we describe the sequencing error models.

### Conditional sequencing error model

Our perspective is to incorporate *Phred* quality scores into the Churchill–Waterman model (4). We first adopt the parameterization in Equation 2 to model sequencing error, and refer to it as the conditional sequencing error model. The parameters θ in this model include the composition probability {α_{a}} and the conditional sequencing error rates {*w*(*b*|*a*)}. The likelihood of the assembly and underlying sequence is given by

Since {*S _{j}*} are missing, we estimate the parameters by the EM algorithm. The following form of log-likelihood is easy for imputing the sufficient statistics.

### Logistic model

From a regression perspective, we take *Phred* scores as a covariate. We denote

We assume that base-calling error rates follow a logistic form:

where each covariate *h _{l}*(

*q*) is a function of quality score

*q*and takes the form

Notice that each function has a knot *o _{l}*, where 0 ≤

*o*

_{1}<

*o*

_{2}, … <

*o*

_{L}<

*Q*. Thus each regressor is a piecewise linear function of the quality score, which allows us to approximate any potential non-linear effect. Equivalently, base-calling rates can be represented as:

Similar to Equation 3, this parameterization leads to the following form of log-likelihood function for the assembly and the underlying sequence, up to a term only relating to parameters.

where θ represents all the unknown parameters.

### Parameter estimation and EM training algorithm

In both the conditional sequencing error model and the logistic model, the underlying sequence {*s _{j}*} is unknown. Its reconstruction relies on the estimates of parameters in the models. On the other hand, algorithms of estimating parameters are well established when {

*s*} are known. Thus we use the EM algorithm to train the model iteratively. In the E-step, we impute the sufficient statistic from observations at the current parameter value and in the M-step, we update the maximum likelihood estimate using the current imputed values of missing data. In the case of the conditional sequencing error model, the parameters are estimated by counting frequencies. In the case of the logistic model, the likelihood can be decomposed into five independent logistic regression models (6). Consequently, we run re-weighted least squares to estimate the parameters (7). The mathematical and technical details are rather lengthy and tedious, and will be published in our technical report (8).

_{j}### Consensus and quality values

According to the logistic model, the distribution of nucleotides at each position is given by:

As shown in the formula, the issue of orientation can generally be dealt with by the orientation indicators {*r _{i}*} and the complementary operator ~. In our convention, we observe

directly when a fragment is in reverse orientation. After we plug in the estimated value of θ, we define the consensus at one position and its quality score by maximizing the above probability.

### Parsimonious representation and model selection

Although we can include piecewise linear functions at all possible knots in the logistic regression model (Equation 4), we seek a parsimonious model for several purposes. First, we would like to avoid potential overfitting, especially when the size of assembly is not large. Second, a parsimonious model may give us insights into quality scores.

The selection of knots is nothing but a model selection problem. To compare different models, we need an evaluation criterion. Based on quality scores, each fitted model defines a set of error rates, which in turn can be used to construct a consensus. If the truth is known, we can calculate SBD for a model (9). SBD is thus one criterion for model comparison.

A practical solution to model selection ought to be self-evident from data. One such criterion is BIC (10). It is defined as

That is, BIC penalizes log-likelihood by model complexity in terms of the number of parameters. For a logistic model with *L* knots, we have 20(*L* + 1) parameters. We calculate the BIC score for each model and choose the one that minimizes the quantity. The idea is to trade off goodness of fit and model complexity. Computationally, it is intensive to evaluate every model. We adopt the backward deletion strategy used in regression analysis to search for the optimal model (7).

## RESULTS

### Bias of quality scores

If we do not otherwise specify the data source, the results reported hereafter are based on the *C.jejuni* sequencing data explained earlier. In the conditional sequencing error model, quality scores are interpreted as error probabilities of base-calling. The model that defines the *Phred* scores is determined from a training data set (2,3). When the model is applied to sequencing traces obtained under different working conditions, scores may deviate from probabilities to some extent. We examine this issue on sequencing reads from one BAC. After alignment, we count incorrect base-calls for each value of quality scores—*Phred* scores take integer values from 0 to 64. The observed score for the predicted quality score *q* is calculated from the assembly by:

where Err_{q} and Corr_{q} are, respectively, the number of incorrect and correct base-calls at quality score *q*. In Figure Figure2,2, we plot the observed scores against predicted *Phred* scores. When scores are >55, essentially no error is observed. When scores are <20, the prediction is fairly consistent. When scores are between 20 and 55, *Phred* scores overestimate probabilities. Thus calibration is desired for the purpose of improving accuracy of base-calling.

Next we apply the logistic model to the data. Let

where θ represents all the parameters. Under the assumption of unbiased base-calling, we have:

Then we can assign a new quality score to each base-call *x _{ij}*:

The bias of this adjusted quality score can be examined by:

where Err_{q′} and Corr_{q′} are respectively the number of incorrect and correct base-calls at the adjusted quality score *q*′. We plot the observed against the corrected quality score in Figure Figure3.3. Compared with Figure Figure2,2, we see that the corrected quality score is more consistent with the observed quality score. After adjustment, no error occurs above score value 42.

The CEQ software that comes along with Beckman sequencers offers quality values similar to *Phred* scores (11). However, their scores are trained from a smaller data set when compared with *Phred*. As in Figure Figure2,2, we plot the observed scores against the predicted CEQ scores in Figure Figure4.4. The overestimate pattern is seen across almost the entire region. Then we apply the adjustment procedure to correct for the obvious bias. The training data set is ~500 kb. In Figure Figure55 we plot the observed against the corrected quality scores. This calibration can benefit our haplotype construction project at USC.

### Score-dependent error patterns

In the conditional sequencing error model, we assume that the error patterns, or the conditional error rates, are constant regardless of quality scores. To check the assumption, we compare frequencies of each type of sequencing errors at each quality value ranging from 0 to 64. That is, given an assembly, we calculate the empirical conditional error rates as follows:

When the true base is , we plot these conditional error rates against quality scores in Figure Figure6.6. It indicates that error patterns do depend on quality scores. After we fit a logistic model to the assembly, the conditional error probabilities as a function of quality scores are shown in Figure Figure7.7. When quality scores are >55, no sequencing error is observed. Thus conditional error patterns make sense only for scores <55. Many sequencing projects use the *Q*_{20} rule as a rough measure of the effective length of a DNA read (12). Scores <20 indicate low quality regions. As we can see, error patterns change significantly at scores ~20–24. Since we do not have many bases with high scores, the inference in the high quality range is less reliable than that in the low quality range. When quality scores are <20, C and G are similar to each other; when the scores are >24, a totally different error pattern is observed. This by-product of the parsimonious model offers another perspective of the *Q*_{20} rule.

### Comparison of different methods

We have introduced a conditional sequencing model and a logistic model. In the literature, two different methods exist to estimate sequencing error rates. On the one hand, the method proposed by Churchill and Waterman (4) relies only on assembly but not on quality scores, and we refer to it as the simple probability model. On the other hand, we can use *Phred* scores and simply assign equal error chances among bases. Hereafter, we refer to it as the simple quality score method. In Table Table1,1, we compare the performance of these methods using the *C.jejuni* data set. The majority rule defines the consensus by choosing the most frequent nucleotide at each position. Compared with the majority rule, the simple probability model reduces errors by one-quarter, not resorting to any other information other than the assembly itself. The simple quality score method cuts errors by more than half. The gain is from the training data set that defines *Phred* scores. The conditional sequencing error model reduces errors further. The best result, 346 SBDs, is achieved by the logistic model with five knots. BIC selects a logistic model with three knots that has 348 SBDs. The likelihood scores for these models are also shown in Table Table1.1. The likelihood of a model measures its goodness of fit to the data. For the same data set, we slightly perturb the *Phred* scores associated with the called bases, and errors resulting from the simple quality score method increase substantially from 367 to 517 while the performance of the logistic method remains almost the same.

## DISCUSSION

### Alignment algorithm

We have observed that different alignment algorithms may produce slightly different assembly matrices. Our adjustment of scores is adaptive to alignment in the sense that it optimizes performance based on each assembly. When a new alignment procedure is used, adjustment may change correspondingly.

*Phrap* (see http://www.phrap.org) examines all individual sequences at a given position, and generally uses the highest quality sequence to build the consensus. *Phrap* also uses the quality information of individual sequences to estimate the quality of the consensus sequence. By comparison, our method can be used with any other assembly algorithms. The reconstruction of consensus and definition of quality value are based on a probabilistic model. It can adjust potential bias of quality scores in base-calling.

### Computing complexity

In the logistic model, the inner loop computes the parameters in logistic regressions by either the Fisher scoring method or by the Newton–Raphson method. Both methods converge quadratically. The outer loop is an EM procedure, and in general an EM algorithm converges at a linear rate. Thus, the computing complexity hinges on the EM algorithm. Specifically, let κ, *L*, *D*_{1} and *D*_{2} be the coverage, number of knots, number of Newton iterations and number of EM iterations, respectively; then the complexity is about *O*[(*n*κ + *L*^{3}*D*_{1})*D*_{2}], where *n* is the size of the target DNA. Similarly, the complexity for the conditional sequencing error model is *O*(*n*κ*D*), where *D* is the number of EM iterations.

### Repeat patterns

We have checked the errors that are left uncorrected by the procedure described in this article. Almost all of them are from regions with repeat patterns. They can be single-, di- or tri-nucleotide repeats. Situations become even more subtle if two repeats are next to one another. In one example, an A is in the middle of four Cs and is missed. In these cases, it is not appropriate to assume that the sequencing error pattern is independent of local contexts. We are considering more sophisticated models to deal with regions with repeats. Li and Speed (13,14) proposed a parametric deconvolution procedure to improve accuracy of sequencing for regions with repeats.

### Size of training data set

We have applied our method to data sets of different sizes. The larger the data set, the greater the number of knots selected in the optimal model. We can achieve satisfactory training with an assembly of size 30 kb and a coverage of six. The result is not sensitive to the ‘quality’ of the quality scores. In comparison, the training of *Phred* scores requires several hundred million base-calls, and it has been carried out on the sequencing traces generated from ABI sequencers. It is difficult to obtain reliable quality scores for other sequencers by the *Phred* training method if only limited base-calling data are available. In this situation, we can apply the method proposed in this article to adjust the preliminary quality scores obtained under roughly the same conditions and obtain probabilistically meaningful quality scores. Earlier, we reported one such example that calibrates Beckman CEQ quality scores using 500 kb from an *Arabidopsis* re-sequencing project.

## ACKNOWLEDGEMENTS

We thank Prof. Michael Waterman for his various suggestions. This work is supported by a NIH CEGS grant to University of Southern California.

## REFERENCES

*phred*. 2. Error probabilities. Genome Res., 8, 186–194. [PubMed]

*phred*. 1. Accuracy assessment. Genome Res., 8, 175–185. [PubMed]

*et al*. (2000) The genome sequence of the food-borne pathogen

*Campylobacter jejuni*reveals hypervariable sequences. Nature, 403, 665–668. [PubMed]

*Estimate Sequencing Error Patterns by Logistic Regression Models with Missing Responses*. Technical report, Computational Biology, University of Southern California.

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (390K) |
- Citation

- Mapping short DNA sequencing reads and calling variants using mapping quality scores.[Genome Res. 2008]
*Li H, Ruan J, Durbin R.**Genome Res. 2008 Nov; 18(11):1851-8. Epub 2008 Aug 19.* - Base-calling of automated sequencer traces using phred. I. Accuracy assessment.[Genome Res. 1998]
*Ewing B, Hillier L, Wendl MC, Green P.**Genome Res. 1998 Mar; 8(3):175-85.* - Base-calling of automated sequencer traces using phred. II. Error probabilities.[Genome Res. 1998]
*Ewing B, Green P.**Genome Res. 1998 Mar; 8(3):186-94.* - Sense from sequence reads: methods for alignment and assembly.[Nat Methods. 2009]
*Flicek P, Birney E.**Nat Methods. 2009 Nov; 6(11 Suppl):S6-S12.* - Improving the quality of automatic DNA sequence assembly using fluorescent trace-data classifications.[Proc Int Conf Intell Syst Mol Biol. 1996]
*Allex CF, Baldwin SF, Shavlik JW, Blattner FR.**Proc Int Conf Intell Syst Mol Biol. 1996; 4:3-14.*

- SEME: A Fast Mapper of Illumina Sequencing Reads with Statistical Evaluation[Journal of Computational Biology. 2013]
*Chen S, Wang A, Li LM.**Journal of Computational Biology. 2013 Nov; 20(11)847-860* - A practical method to detect SNVs and indels from whole genome and exome sequencing data[Scientific Reports. ]
*Shigemizu D, Fujimoto A, Akiyama S, Abe T, Nakano K, Boroevich KA, Yamamoto Y, Furuta M, Kubo M, Nakagawa H, Tsunoda T.**Scientific Reports. 32161* - FadE: whole genome methylation analysis for multiple sequencing platforms[Nucleic Acids Research. 2013]
*Souaiaia T, Zhang Z, Chen T.**Nucleic Acids Research. 2013 Jan; 41(1)e14* - Next generation sequencing technologies and the changing landscape of phage genomics[Bacteriophage. 2012]
*Klumpp J, Fouts DE, Sozhamannan S.**Bacteriophage. 2012 Jul 1; 2(3)190-199* - Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows[Genes. ]
*Torri F, Dinov ID, Zamanyan A, Hobel S, Genco A, Petrosyan P, Clark AP, Liu Z, Eggert P, Pierce J, Knowles JA, Ames J, Kesselman C, Toga AW, Potkin SG, Vawter MP, Macciardi F.**Genes. 3(3)545-575*

- MedGenMedGenRelated information in MedGen
- PubMedPubMedPubMed citations for these articles
- TaxonomyTaxonomyRelated taxonomy entry
- Taxonomy TreeTaxonomy Tree

- Adjust quality scores from alignment and improve sequencing accuracyAdjust quality scores from alignment and improve sequencing accuracyNucleic Acids Research. 2004; 32(17)5183

Your browsing activity is empty.

Activity recording is turned off.

See more...