NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Cover of Sequence - Evolution - Function

Sequence - Evolution - Function: Computational Approaches in Comparative Genomics.

Show details

Chapter 4Principles and Methods of Sequence Analysis

We aligned sequences by eye.

Troy CS, MacHugh DE, Bailey JF, Magee DA, Loftus RT, Cunningham P, Chamberlain AT, Sykes BC, Bradley DG. Genetic evidence for Near-Eastern origins of European cattle. Nature, 2001, vol. 410, p. 1091

This chapter is the longest in the book as it deals with both general principles and practical aspects of sequence and, to a lesser degree, structure analysis. Although these methods are not, in themselves, part of genomics, no reasonable genome analysis and annotation would be possible without understanding how these methods work and having some practical experience with their use. Inappropriate use of sequence analysis procedures may result in numerous errors in genome annotation (we have already touched upon this subject in the previous chapter and further discuss it in Chapter 5). We attempted to strike a balance between generalities and specifics, aiming to give the reader a clear perspective of the computational approaches used in comparative and functional genomics, rather than discuss any one of these approaches in great detail. In particular, we refrained from any extensive discussion of the statistical basis and algorithmic aspects of sequence analysis because these can be found in several recent books on computational biology and bioinformatics (see 4.8), and no less importantly, because we cannot claim advanced expertise in this area. We also tried not to duplicate the “click here”-type tutorials, which are available on many web sites. However, we deemed it important to point out some difficult and confusing issues in sequence analysis and warn the readers against the most common pitfalls. This discussion is largely based on our personal, practical experience in analysis of protein families. We hope that this discussion might clarify some aspects of sequence analysis that “you always wanted to know about but were afraid to ask”. Still, we felt that it would be impossible (and unnecessary) to discuss all methods of sequence and structure analysis in one, even long, chapter and concentrated on those techniques that are central to comparative genomics. We hope that after working through this chapter, interested readers will be encouraged to continue their education in methods of sequence analysis using more specialized texts, including those in the ‘Further Reading’ list (see 4.8).

4.1. Identification of Genes in a Genomic DNA Sequence

4.1.1. Prediction of protein-coding genes

Archaeal and bacterial genes typically comprise uninterrupted stretches of DNA between a start codon (usually ATG, but in a minority of genes, GTG, TTG, or CTG) and a stop codon (TAA, TGA, or TAG; alternative genetic codes of certain bacteria, such as mycoplasmas, have only two stop codons). Rare exceptions to this rule involve important but rare mechanisms, such as programmed frameshifts. There seem to be no strict limits on the length of the genes. Indeed, the gene rpmJ encoding the ribosomal protein L36 (Figure 2.1) is only 111 bp long in most bacteria, whereas the gene for B. subtilis polyketide synthase PksK is 13,343 bp long. In practice, mRNAs shorter than 30 codons are poorly translated, so protein-coding genes in prokaryotes are usually at least 100 bases in length. In prokaryotic genome-sequencing projects, open reading frames (ORFs) shorter than 100 bases are rarely taken into consideration, which does not seem to result in substantial underprediction. In contrast, in multicellular eukaryotes, most genes are interrupted by introns. The mean length of an exon is ~50 codons, but some exons are much shorter; many of the introns are extremely long, resulting in genes occupying up to several megabases of genomic DNA. This makes prediction of eukaryotic genes a far more complex (and still unsolved) problem than prediction of prokaryotic genes. Prokaryotes

For most common purposes, a prokaryotic gene can be defined simply as the longest ORF for a given region of DNA. Translation of a DNA sequence in all six reading frames is a straightforward task, which can be performed on line using, for example, the Translate tool on the ExPASy server ( or the ORF Finder at NCBI (

Of course, this approach is oversimplified and may result in a certain number of incorrect gene predictions, although the error rate is rather low. Firstly, DNA sequencing errors may result in incorrectly assigned or missed start and/or stop codons, because of which a gene might be truncated, overextended, or missed altogether. Secondly, on rare occasions, among two overlapping ORFs (on the same or the opposite DNA strand), the shorter one might be the real gene. The existence of a long “shadow” ORF opposite a protein-coding sequence is more likely than in a random sequence because of the statistical properties of the coding regions. Indeed, consider the simple case where the first base in a codon is a purine and the third base is a pyrimidine (the RNY codon pattern). Obviously, the mirror frame in the complementary strand would follow the same pattern, resulting in a deficit of stop codons [235]. Figure 4.1 shows the ORFs of at least 100 bp located in a 10-kb fragment of the E. coli genome (from 3435250 to 3445250) that encodes potassium transport protein TrkA, mechanosensitive channel MscL, transcriptional regulator YhdM, RNA polymerase alpha subunit RpoA, preprotein translocase subunit SecY, and ribosomal proteins RplQ (L17), RpsD (S4), RpsK (S11), RpsM (S13), RpmJ (L36), RplO (L15), RpmD (L30), RpsE (S5), RplR (L18), RplF (L6), RpsH (S8), RpsN (S14), RplE (L5), and RplX (L24). Although the two ORFs in frame +1 (top line, on the right) are longer (207 aa and 185 aa) than the ORFs in frame −3 (bottom line, 117 aa, 177 aa, 130 aa, and 101 aa), it is the latter that encode real proteins, namely the ribosomal proteins RplR, RplF, RpsH, and RpsN.

Figure 4.1. Open reading frames of ≥100 bp encoded on a 10-kb fragment of the Escherichia coli K12 genome from 3435250 to 3445250.

Figure 4.1

Open reading frames of ≥100 bp encoded on a 10-kb fragment of the Escherichia coli K12 genome from 3435250 to 3445250. The figure was generated using the program ORF finder at the NCBI web site ( (more...)

Because of these complications, it is always desirable to have some additional evidence that a particular ORF actually encodes a protein. Such evidence comes along many different lines and can be obtained using various methods, e.g. the following ones:

  • The ORF in question encodes a protein that is similar to previously described ones (search the protein database for homologs of the given sequence).
  • The ORF has a typical GC content, codon frequency, or oligonucleotide composition (calculate the codon bias and/or other statistical features of the sequence, compare to those for known protein-coding genes from the same organism).
  • The ORF is preceded by a typical ribosome-binding site (search for a Shine-Dalgarno sequence in front of the predicted coding sequence).
  • The ORF is preceded by a typical promoter (if consensus promoter sequences for the given organism are known, check for the presence of a similar upstream region).

The most reliable of these approaches is a database search for homologs. In several useful tools, DNA translation is seamlessly bound to the database searches. In the ORF finder, for example, the user can submit the translated sequence for a BLASTP or TBLASTN (see 4.4) search against the NCBI sequence databases. In addition, there is an opportunity to compare the translated sequence to the COG database (see 3.4). A largely similar Analysis and Annotation Tool (, developed by Xiaoqiu Huang at Michigan Tech [361], also compares the translated protein sequences to nr and SWISS-PROT; in addition, it checks them against two cDNA databases, the dbEST at the NCBI and Human Gene Index at TIGR.

Other methods take advantage of the statistical properties of the coding sequences. For organisms with highly biased GC content, for example, the third position in each codon has a highly biased (very high or very low) frequency of G and C. FramePlot, a program that exploits this skew for gene recognition [380], is available at the Japanese Institute of Infectious Diseases ( and at the TIGR web site ( The most useful and popular gene prediction programs, such as GeneMark and Glimmer (see 3.1.2), build Markov models of the known coding regions for the given organism and then employ them to estimate the coding potential of uncharacterized ORFs.

Inferring genes based on the coding potential and on the similarity of the encoded protein sequences to those of other proteins represent the intrinsic and extrinsic approaches to gene prediction [110], which ideally should be combined. Two programs that implement such a combination, developed specifically for analysis of prokaryotic genomes, are ORPHEUS ( [249]) and CRITICA ([67], source code at Several other algorithms that incorporate both these approaches are aimed primarily at eukaryotic genomes and are discussed further in this section. Unicellular eukaryotes

Genomes of unicellular eukaryotes are extremely diverse in size, the proportion of the genome that is occupied by protein-encoding genes and the frequency of introns. Clearly, the smaller the intergenic regions and the fewer introns are there, the easier it is to identify genes. Fortunately, genomes of at least some simple eukaryotes are quite compact and contain very few introns. Thus, in yeast S. cerevisiae, at least 67% of the genome is protein-coding, and only 233 genes (less than 4% of the total) appear to have introns [660]. Although these include some biologically important and extensively studied genes, e.g. those for aminopeptidase APE2, ubiquitin-protein ligase UBC8, subunit 1 of the mitochondrial cytochrome oxidase COX1, and many ribosomal proteins, introns comprise less than 1% of the yeast genome. The tiny genome of the intracellular eukaryotic parasite Encephalitozoon cuniculi appears to contain introns in only 12 genes and is practically prokaryote-like in terms of the “wall-to-wall” gene arrangement [425]. Malaria parasite Plasmodium falciparum is a more complex case, with ~43% of the genes located on chromosome 2 containing one or more introns [272]. Protists with larger genomes often have fairly high intron density. In the slime mold Physarum polycephalum, for example, the average gene has 3.7 introns [851]. Given that the average exon size in this organism (165 ± 85 bp) is comparable to the length of an average intron (138 ± 103 bp), homology-based prediction of genes becomes increasingly complicated.

Because of this genome diversity, there is no single way to efficiently predict protein-coding genes in different unicellular eukaryotes. For some of them, such as yeast, gene prediction can be done by using more or less the same approaches that are routinely employed in prokaryotic genome analysis. For those with intron-rich genomes, the gene model has to include information on the intron splice sites, which can be gained from a comparison of the genomic sequence against a set of ESTs from the same organism. This necessitates creating a comprehensive library of ESTs that have to be sequenced in a separate project. Such dual EST/genomic sequencing projects are currently under way for several unicellular eukaryotes (see Appendix 2). Multicellular eukaryotes

In most multicellular eukaryotes, gene organization is so complex that gene identification poses a major problem. Indeed, eukaryotic genes are often separated by large intergenic regions, and the genes themselves contain numerous introns, many of them long. Figure 4.2 shows a typical distribution of exons and introns in a human gene, the X chromosome-located gene encoding iduronate 2-sulfatase (IDS_HUMAN), a lysosomal enzyme responsible for removing sulfate groups from heparan sulfate and dermatan sulfate. Mutations causing iduronate sulfatase deficiency result in the lysosomal accumulation of these glycosaminoglycans, clinically known as Hunter's syndrome or type II mucopolysaccharidosis (OMIM entry 309900) [896]. A number of clinical cases have been shown to result from aberrant alternative splicing of this gene's mRNA, which emphasizes the importance of reliable prediction of gene structure [631].

Figure 4.2. Organization of the human iduronate 2-sulfatase gene.

Figure 4.2

Organization of the human iduronate 2-sulfatase gene. This gene is located in positions 152960–177995 of human X chromosome and encodes a 550-aa precursor protein that contains a 25-aa N-terminal signal sequence, followed by (more...)

Obviously, the coding regions compose only a minor portion of the gene. In this case, positions of the exons could be unequivocally determined by mapping the cDNA sequence (i.e. iduronate sulfatase mRNA) back to the chromosomal DNA. Because of the clinical phenotype of the mutations in the iduronate sulfatase gene, we already know the “correct” mRNA sequence and can identify various alternatively spliced variants as mutations. However, for many, perhaps the majority of the human genes, multiple alternative forms are part of the regular expression pattern [118,576], and correct gene prediction ideally should identify all of these forms, which immensely complicates the task.

Ideally, gene prediction should identify all exons and introns, including those in the 5′-untranslated region (5′-UTR) and the 3′-UTR of the mRNA, in order to precisely reconstruct the predominant mRNA species. For practical purposes, however, it is useful to assemble at least the coding exons correctly because this allows one to deduce the protein sequence.

Figure 4.3. Sequence of the first two exons of human iduronate sulfatase gene.

Figure 4.3

Sequence of the first two exons of human iduronate sulfatase gene. The figure shows the DNA sequence of the positions 15391–15571 of human X chromosome. The iduronate sulfatase mRNA and its coding sequence are shown (more...)

Correct identification of the exon boundaries relies on the recognition of the splice sites, which is facilitated by the fact that the great majority of splice sites conform to consensus sequences that include two nearly invariant dinucleotides at the ends of each intron, a GT at the 5′ end and an AG at the 3′ end. Non-canonical splice signals are rare and come in several variants [329,582]. In the 5′ splice sites, the GC dinucleotide is sometimes found instead of GT. The second class of exceptions to the splice site consensus includes so-called “AT-AC” introns that have the highly conserved /(A,G)TATCCT(C,T) sequence at their 5′ sites. There are additional variants of non-canonical splice signals, which further complicate prediction of the gene structure.

The available assessments of the quality of eukaryotic gene prediction achieved by different programs show a rather gloomy picture of numerous errors in exon/intron recognition. Even the best tools correctly predict only ~40% of the genes [697]. The most serious errors come from genes with long introns, which may be predicted as intragenic sequences, resulting in erroneous gene fission, and pairs of genes with short intergenic regions, which may be predicted as introns, resulting in false gene fusion. Nevertheless, most of the popular gene prediction programs discussed in the next section show reasonable performance in predicting the coding regions in the sense that, even if a small exon is missed or overpredicted, the majority of exons are identified correctly.

Another important parameter that can affect ORF prediction is the fraction of sequencing errors in the analyzed sequence. Indeed, including frameshift corrections was found to substantially improve the overall quality of gene prediction [133]. Several algorithms were described that could detect frameshift errors based on the statistical properties of coding sequences [224]. On the other hand, error correction techniques should be used with caution because eukaryotic genomes contain numerous pseudogenes, and non-critical frameshift correction runs the risk of wrongly “rescuing” pseudogenes. The problem of discriminating between pseudogenes and frameshift errors is actually quite complex and will likely be solved only through whole-genome alignments of different species or, in certain cases, by direct experimentation, e.g., expression of the gene(s) in question.

4.1.2. Algorithms and software tools for gene identification

As discussed in the previous section, recognizing genes in the DNA sequences remains one of the most pressing problems in genome analysis. Several different approaches to gene prediction have been developed, and there are several popular programs that are most commonly used for this task (see Table 4.1). Some of these tools perform gene prediction ab initio, relying only on the statistical parameters in the DNA sequence for gene identification. In contrast, homology-based methods rely primarily on identifying homologous sequences in other genomes and/or in public databases using BLAST or Smith-Waterman algorithms. Many of the commonly used methods combine these two approaches.

Table 4.1. Software tools for ab initio gene prediction.

Table 4.1

Software tools for ab initio gene prediction.

The absence of introns and relatively high gene density in most genomes of prokaryotes and some unicellular eukaryotes provides for effective use of sequence similarity searches as the first step in genome annotation. Genes identified by homology can be used as the training set for one of the statistical methods for gene recognition, and the resulting statistical model can then be used for analyzing the remaining parts of the genome. In most eukaryotes, the abundance of introns and long intergenic regions makes it difficult to use homology-based methods as the first step unless, of course, one can rely on synteny between several closely related genomes (e.g. human, mouse, and rat). As a result, gene prediction for genome sequences of multicellular eukaryotes usually starts with ab initio methods, followed by similarity searches with the initial exon assemblies.

A detailed comparison of the algorithms and tools for gene prediction is beyond the scope of this book. We would like to emphasize only that each of these methods has its own advantages and limitations, and none of them is perfect. Therefore, it is advisable to use at least two different programs for gene prediction in a new DNA sequence, especially if it comes from a eukaryote or a poorly characterized prokaryote. A comparison of predictions generated by different programs reveals the cases where a given program performs the best and helps in achieving consistent quality of gene prediction. Such a comparison can be performed, for example, using the TIGR Combiner program (, which employs a voting scheme to combine predictions of different gene-finding programs, such as GeneMark, GlimmerM, GRAIL, GenScan, and Fgenes.

We describe only several computational tools that are most commonly used for gene prediction in large-scale genome annotation projects.


GeneMark (, mirrored at the EBI web site was developed by Mark Borodovsky and James McIninch in 1993 [108]. GeneMark was the first tool for finding prokaryotic genes that employed a non-homogeneous Markov model to classify DNA regions into protein-coding, non-coding, and non-coding but complementary to coding. It has been shown previously that, by multivariate codon usage analysis, the E. coli genes could be classified into so-called typical, highly typical, and atypical gene sets, with the latter two groups apparently corresponding to highly expressed genes and horizontally transferred genes [562]. Accordingly, more than one Markov model was required to adequately describe different groups of genes in the same genome [109,110].

Like other gene prediction programs (see below), GeneMark relies on organism-specific recognition parameters to partition the DNA sequence into coding and non-coding regions and thus requires a sufficiently large training set of known genes from a given organism for best performance. The program has been repeatedly updated and modified and now exists in separate variants for gene prediction in prokaryotic, eukaryotic, and viral DNA sequences [88,528,768].


Gene Locator and Interpolated Markov Modeler (GLIMMER,, developed by Steven Salzberg and colleagues at Johns Hopkins University and TIGR, is a system for finding genes in prokaryotic genomes. To identify coding regions and distinguish them from noncoding DNA, Glimmer uses interpolated Markov models, i.e. series of Markov models with the order of the model increasing at each step and the predictive power of each model separately evaluated [735]. Like GeneMark, Glimmer requires a training set, which is usually selected among known genes, genes coding for proteins with strong database hits, and/or simply long ORFs. Glimmer is used as the primary gene finder tool at TIGR, where it has been applied to the annotation of numerous microbial genomes [241,243,610,891].

Recently, Salzberg and coworkers developed GlimmerM, a modified version of Glimmer specifically designed for gene recognition in small eukaryotic genomes, such as the malaria parasite Plasmodium falciparum [736].


Gene Recognition and Assembly Internet Link (GRAIL,, developed by Ed Uberbacher and coworkers at the Oak Ridge National Laboratory, is a tool that identifies exons, polyA sites, promoters, CpG islands, repetitive elements, and frameshift errors in DNA sequences by comparing them to a database of known human and mouse sequence elements [858]. Exon and repetitive element prediction is also available for Arabidopsis and Drosophila sequences.

Grail has been recently incorporated into the Oak Ridge genome analysis pipeline (, which provides a unified web interface to a number of convenient analysis tools. For prokaryotes, it offers gene prediction using Glimmer (see above) and Generation programs, followed by BLASTP searches of predicted ORFs against SWISS-PROT and NR databases and a HMMer search against Pfam. There is also an option of BLASTN search of the submitted DNA sequence against a variety of nucleotide sequence databases.

For human and mouse sequences, the Oak Ridge pipeline offers gene prediction using GrailEXP and GenScan (see below), also followed by BLASTP searches of predicted ORFs against SWISS-PROT and NR databases and a HMMer search against Pfam. Again, the user can perform BLASTN search of the submitted DNA sequence against a variety of nucleotide sequence databases, as well as search for CpG islands, repeat fragments, tRNAs, and BAC-end pairs. As discussed above, the possibility to directly compare gene predictions made by two different programs is a valuable feature, which is available at the Oak Ridge web site.


GenScan ( was developed by Chris Burge and Samuel Karlin at Stanford University and is currently hosted in the Burge laboratory at the MIT Department of Biology. This program uses a complex probabilistic model of the gene structure that is based on actual biological information about the properties of transcriptional, translational, and splicing signals. In addition, it utilizes various statistical properties of coding and noncoding regions. To account for the heterogeneity of the human genome that affects gene structure and gene density, GenScan derives different sets of gene models for genome regions with different GC content [131,132]. Its high speed and accuracy make GenScan the method of choice for the initial analysis of large (in the megabase range) stretches of eukaryotic genomic DNA. GenScan has being used as the principal tool for gene prediction in the International Human Genome Project [488].


GeneBuilder ( performs ab initio gene prediction using numerous parameters, such as GC content, dicodon frequencies, splicing site data, CpG islands, repetitive elements, and others. It also utilizes a unique approach that is based on evaluating relative frequencies of synonymous and nonsynonymous substitutions to identify likely coding sequences. In addition, it performs BLAST searches of predicted genes against protein and EST databases, which helps to refine the boundaries of predicted exons using the BLAST hits as guides. The program allows the user to change certain parameters, which permits interactive gene structure prediction. As a result, GeneBuilder is sometimes able to predict the gene structure with a good accuracy, even when the similarity of the predicted ORF to a homologous protein sequence is low [569,708].

Splice site prediction

Programs for predicting intron splice sites, which are commonly used as subroutines in the gene prediction tools, can also be used as stand-alone programs to verify positions of splice sites or predict alternative splicing sites. Such programs (Table 4.2) can be particularly useful for predicting non-coding exons, which are commonly missed in gene prediction studies.

Table 4.2. Software tools for prediction of splicing sites.

Table 4.2

Software tools for prediction of splicing sites.

Recognition of the splice sites by these programs usually relies on statistical properties of exons and introns and on the consensus sequences of splicing signals. A detailed study of the performance of one such program, SpliceView, showed that, although the fraction of missed splicing signals was relatively low (~5%), the false-positive rate was quite high (typically, one potential splicing signal per 150–250 bases). One should note, however, that such false-positive signals might correspond to rare alternative splice forms or cryptic splice sites (splice sites that are not active in normal genes and become activated as a result of mutations in major splicing signals) [710].

Combining various gene prediction tools

While the first step of gene identification in long genomic sequences utilizes ab initio programs that can rapidly and with reasonable accuracy predict multiple genes, the next step validates these predictions through similarity searches. Predicted genes are compared to nucleotide sequence databases, including EST databases, and protein sequences encoded by these predicted genes are compared to protein sequence databases. These data are then combined with the information about repetitive elements, CpG islands, and transcription factor binding sites and used for further refinement of gene structure. Thus, homology information is ultimately incorporated into every gene prediction pipeline (see above). There are, however, several programs that primarily rely on similarity search for gene prediction (Table 4.3). Although differing in details, they all search for the best alignment of the given piece of DNA to the homologous nucleotide or protein sequences in the database.

Table 4.3. Software tools for homology-based gene prediction.

Table 4.3

Software tools for homology-based gene prediction.

4.2. Principles of Sequence Similarity Searches

As discussed in the previous section, initial characterization of any new DNA or protein sequence starts with a database search aimed at finding out whether homologs of this gene (protein) are already available, and if they are, what is known about them. Clearly, looking for exactly the same sequence is quite straightforward. One can just take the first letter of the query sequence, search for its first occurrence in the database, and then check if the second letter of the query is the same in the subject. If it is indeed the same, the program could check the third letter, then the fourth, and continue this comparison to the end of the query. If the second letter in the subject is different from the second letter in the query, the program should search for another occurrence of the first letter, and so on. This will identify all the sequences in the database that are identical to the query sequence (or include it). Of course, this approach is primitive computation-wise, and there are sophisticated algorithms for text matching that do it much more efficiently [92].

Image ch4e1.jpg

Note that, in the example above, we looked only for sequences that exactly match the query. The algorithm would not even find a sequence that is identical to the query with the exception of the first letter. To find such sequences, the same analysis should be conducted with the fragments starting from the second letter of the original query, then from the third one, and so on.

Image ch4e2.jpg

Such search quickly becomes time-consuming, and we are still dealing only with identical sequences. Finding close relatives would introduce additional conceptual and technical problems. Let us assume that sequences that are 99% identical are definitely homologous. What should one select as the threshold to consider sequences not to be homologous: 50% identity, 33%, or perhaps 25%? These are legitimate questions that need to be answered before one goes any further. The example of two lysozymes (see 2.1.2) shows that sequences with as low as 8% identity may belong to orthologous proteins and perform the same function.

As a matter of fact, when comparing nucleic acid sequences, there is very little one could do. All the four nucleotides, A, T, C, and G, are found in the database with approximately the same frequencies and have roughly the same probability of mutating one into another. As a result, DNA-DNA comparisons are largely based on straightforward text matching, which makes them fairly slow and not particularly sensitive, although a variety of heuristics have been developed to overcome this [571].

Amino acid sequence comparisons have several distinct advantages over nucleotide sequence comparisons, which, at least potentially, lead to a much greater sensitivity. Firstly, because there are 20 amino acids but only four bases, an amino acid match carries with it >4 bits of information as opposed to only two bits for a nucleotide match. Thus, statistical significance can be ascertained for much shorter sequences in protein comparisons than in nucleotide comparisons. Secondly, because of the redundancy of the genetic code, nearly one-third of the bases in coding regions are under a weak (if any) selective pressure and represent noise, which adversely affects the sensitivity of the searches. Thirdly, nucleotide sequence databases are much larger than protein databases because of the vast amounts of non-coding sequences coming out of eukaryotic genome projects, and this further lowers the search sensitivity. Finally, and probably most importantly, unlike in nucleotide sequence, the likelihoods of different amino acid substitutions occurring during evolution are substantially different, and taking this into account greatly improves the performance of database search methods as described below. Given all these advantages, comparisons of any coding sequences are typically carried out at the level of protein sequences; even when the goal is to produce a DNA-DNA alignment (e.g. for analysis of substitutions in silent codon positions), it is usually first done with protein sequences, which are then replaced by the corresponding coding sequences. Direct nucleotide sequence comparison is indispensable only when non-coding regions are analyzed.

4.2.1. Substitution scores and substitution matrices

The fact that each of the 20 standard protein amino acids has its own unique properties means that the likelihood of the substitution of each particular residue for another residue during evolution should be different. Generally, the more similar the physico-chemical properties of two residues, the greater the chance that the substitution will not have an adverse effect on the protein's function and, accordingly, on the organism's fitness. Hence, in sequence comparisons, such a substitution should be penalized less than a replacement of amino acid residue with one that has dramatically different properties. This is, of course, an oversimplification, because the effect of a substitution depends on the structural and functional environment where it occurs. For example, a cysteine-to-valine substitution in the catalytic site of an enzyme will certainly abolish the activity and, on many occasions, will have a drastic effect on the organism's fitness. In contrast, the same substitution within a β-strand may have little or no effect. Unfortunately, in general, we do not have a priori knowledge of the location of a particular residue in the protein structure, and even with such knowledge, incorporating it in a database search algorithm is an extremely complex task. Thus, a generalized measure of the likelihood of amino acid substitutions is required so that each substitution is given an appropriate score (weight) to be used in sequence comparisons. The score for a substitution between amino acids i and j always can be expressed by the following intuitively plausible formula, which shows how likely a particular substitution is given the frequencies of each the two residues in the analyzed database:

Image ch4e3.jpg

where k is a coefficient, q ij is the observed frequency of the given substitution, and p i , p j are the background frequencies of the respective residues. Obviously, here the product p i p j is the expected frequency of the substitution and, if q ij = p i p j (S ij = 0), the substitution occurs just as often as expected. The scores used in practice are scaled such that the expected score for aligning a random pair of amino acid sequences is negative (see below).

There are two fundamentally different ways to come up with a substitution score matrix, i.e. a triangular table containing 210 numerical score values for each pair of amino acids, including identities (diagonal elements of the matrix; Figures 4.4 and 4.5). As in many other situations in computational biology, the first approach works ab initio, whereas the second one is empirical. One ab initio approach calculates the score as the number of nucleotide substitutions that are required to transform a codon for one amino acid in a pair into a codon for the other. In this case, the matrix is obviously unique (as long as alternative genetic codes are not considered) and contains only four values, 0, 1, 2, or 3. Accordingly, this is a very coarse grain matrix that is unlikely to work well. The other ab initio approach assigns scores on the basis of similarities and differences in the physico-chemical properties of amino acids. Under this approach, the number of possible matrices is infinite, and they may have as fine a granularity as desirable, but a degree of arbitrariness is inevitable because our understanding of protein physics is insufficient to make informed decisions on what set of properties “correctly” reflects the relationships between amino acids.

Figure 4.4. The PAM30 substitution matrix.

Figure 4.4

The PAM30 substitution matrix. The numbers indicate the substitution scores for each replacement. The greater the number, the lesser the penalty for the given substitution. Note the high penalty for replacing Cys and aromatic amino (more...)

Figure 4.5. The BLOSUM 62 substitution matrix.

Figure 4.5

The BLOSUM 62 substitution matrix. The meaning of the numbers is the same as for PAM30. Note the relatively lower reward for conservation of Cys, Phe, Tyr, and Trp and lower penalties for replacing these amino acids than in the PAM30 (more...)

Empirical approaches, which historically came first, attempt to derive the characteristic frequencies of different amino acid substitutions from actual alignments of homologous protein families. In other words, these approaches strive to determine the actual likelihood of each substitution occurring during evolution. Obviously, the outcome of such efforts critically depends on the quantity and quality of the available alignments, and even now, any alignment database is far from being complete or perfectly correct. Furthermore, simple counting of different types of substitutions will not suffice if alignments of distantly related proteins are included because, in many cases, multiple substitutions might have occurred in the same position. Ideally, one should construct the phylogenetic tree for each family, infer the ancestral sequence for each internal node, and then count the substitutions exactly. This is not practicable in most cases, and various shortcuts need to be taken.

Several solutions to these problems have been proposed, each resulting in a different set of substitution scores. The first substitution matrix, constructed by Dayhoff and Eck in 1968 [172], was based on an alignment of closely related proteins, so that the ancestral sequence could be deduced and all the amino acid replacements could be considered occurring just once. This model was then extrapolated to account for more distant relationships (we will not discuss here the mathematics of this extrapolation and the underlying evolutionary model [174]), which resulted in the PAM series of substitution matrices (Figure 4.4). PAM (Accepted Point Mutation) is a unit of evolutionary divergence of protein sequences, corresponding to one amino acid change per 100 residues. Thus, for example, the PAM30 matrix is supposed to apply to proteins that differ, on average, by 0.3 change per aligned residue, whereas PAM250 should reflect evolution of sequences with an average of 2.5 substitutions per position. Accordingly, the former matrix should be employed for constructing alignments of closely related sequences, whereas the latter is useful in database searches aimed at detection of distant relationships. Using an approach similar to that of Dayhoff, combined with rapid algorithms for protein sequence clustering and alignment, Jones, Taylor, and Thornton produced the series of the so-called JTT matrices [403], which are essentially an update of the PAMs.

The PAM and JTT matrices, however, have obvious limitations because of the fact that they have been derived from alignments of closely related sequences and extrapolated to distantly related ones. This extrapolation may not be fully valid because the underlying evolutionary model might not be adequate, and the trends that determine sequence divergence of closely related sequences might not apply to the evolution at larger distances.

In 1992, Steven and Jorja Henikoff developed a different series of substitution matrices [342] using conserved ungapped alignments of related proteins from the BLOCKS database (see 3.2.1). The use of these alignments offered three important advantages over the alignments used for constructing the PAM matrices. First, the BLOCKS collection obviously included a much larger number and, more importantly, a much greater diversity of protein families than the collection that was available to Dayhoff and coworkers in the 1970's. Second, coming from rather distantly related proteins, BLOCKS alignments better reflected the amino acid changes that occur over large phylogenetic distances and thus produced substitution scores that represented sequence divergence in distant homologs directly, rather than through extrapolation. Third, in these distantly related proteins, BLOCKS included only the most confidently aligned regions, which are likely to best represent the prevailing evolutionary trends. These substitution matrices, named the BLOSUM (= BLOcks SUbstitution Matrix) series, were tailored to particular evolutionary distances by ignoring the sequences that had more than a certain percent identity. In the BLOSUM62 matrix, for example, the substitution scores were derived from the alignments of sequences that had no more than 62% identity; the substitution scores of the BLOSUM45 matrix were calculated from the alignments that contained sequences with no more than 45% identity. Accordingly, BLOSUM matrices with high numbers, such as BLOSUM80, are best suited for comparisons of closely related sequences (it is also advisable to use BLOSUM80 for database searches with short sequences, see 4.4.2), whereas low-number BLOSUM matrices, such as BLOSUM45, are better for distant relationships. In addition to the general-purpose PAM, JTT, and BLOSUM series, some specialized substitution matrices were developed, for example, for integral membrane proteins [404], but they never achieved comparable recognition.

Several early studies found the PAM matrices based on empirical data consistently resulted in greater search sensitivity than any of the ab initio matrices (see [186]). An extensive empirical comparison showed that: (i) BLOSUM matrices consistently outperformed PAMs in BLAST searches, and (ii) on average, BLOSUM62 (Fig 4.5) performed best in the series [343]; this matrix is currently used as the default in most sequence database searches. It is remarkable that so far, throughout the 30-plus-year history of amino acid substitution matrices, empirical matrices have consistently outperformed those based on theory, either physico-chemical or evolutionary. This is not to say, of course, that theory is powerless in this field, but to point out that we currently do not have a truly adequate theory to describe protein evolution. Clearly, the last word has not been said on amino acid substitution matrices, and one can expect that eventually the BLOSUM series will be replaced by new matrices based on greater amounts of higher quality alignment data and more realistic evolutionary models. A recently reported maximum-likelihood model for substitution frequency estimation has already been claimed to describe individual protein families better than the Dayhoff and JTT models [889]. It remains to be seen how this and other new matrices perform in large-scale computational experiments on real databases. AAindex (, see 3.6.3) lists 66 different substitution matrices, both ab initio and empirical, and there is no doubt that this list will continue to grow [429].

4.2.2. Statistics of protein sequence comparison

It is impossible to explain even the basic principles of statistical analysis of sequence similarities without invoking some mathematics. To introduce these concepts in the least painful way, let us consider the same protein sequence (E. coli RpsJ) as above

Image ch4e4.jpg

and check how many times segments of this sequence of different lengths are found in the database (we chose fragments starting from the second position in the sequence because nearly every protein in the database starts with a methionine). Not unexpectedly, we find that the larger the fragment, the smaller the number of exact matches in the database (Table 4.4).

Table 4.4. Dependence of the number of exact database matches on the length of the query word.

Table 4.4

Dependence of the number of exact database matches on the length of the query word.

Perhaps somewhat counterintuitively, a 9-mer is already unique. With the decrease in the number of database hits, the likelihood that these hits are biologically relevant, i.e. belong to homologs of the query protein, increases. Thus, 13 of the 23 occurrences of the string KVRASV and all 8 occurrences of the string KVRASVK are from RpsJ orthologs.

The number of occurrences of a given string in the database can be roughly estimated as follows. The probability of matching one amino acid residue is 1/20 (assuming equal frequencies of all 20 amino acids in the database; this not being the case, the probability is slightly greater). The probability of matching two residues in a row is then (1/20)2, and the probability of matching n residues is (1/20) n . Given that the protein database currently contains N ~2 × 108 letters, one should expect a string of n letters to match approximately N × (1/20) n times, which is fairly close to the numbers in Table 4.4.

Searching for perfect matches is the simplest and, in itself, obviously insufficient form of sequence database search, although, as we shall see below, it is important as one of the basic steps in currently used search algorithms. As repeatedly mentioned above, the goal of a search is finding homologs, which can have drastically different sequences such that, in distant homologs, only a small fraction of the amino acid residues are identical or even similar. Even in close homologs, a region of high similarity is usually flanked by dissimilar regions like in the following alignment of E. coli RpmJ with its ortholog from Vibrio cholerae:

Image ch4e5.jpg

In this example, the region of highest similarity is in the middle of the alignment, but including the less conserved regions on both sides improves the overall score (taking into account the special treatment of gaps, which is introduced below). Further along the alignment, the similarity almost disappears so that inclusion of additional letters into the alignment would not increase the overall score or would even decrease it. Such fragments of the alignment of two sequences whose similarity score cannot be improved by adding or trimming any letters, are referred to as high-scoring segment pairs (HSPs) . For this approach to work, the expectation of the score for random sequences must be negative, and the scoring matrices used in database searches are scaled accordingly (see Figures 4.4 and 4.5).

So, instead of looking for perfect matches, sequence comparisons programs actually search for HSPs. Once a set of HSPs is found, different methods, such as Smith-Waterman, FASTA, or BLAST, deal with them in different fashions (see below). However, the principal issue that any database search method needs to address is identifying those HSPs that are unlikely to occur by chance and, by inference, are likely to belong to homologs and to be biologically relevant. This problem has been solved by Samuel Karlin and Stephen Altschul, who showed that maximal HSP scores follow the extreme value distribution [421]. Accordingly, if the lengths of the query sequence (m) and the database (n) are sufficiently high, the expected number of HSPs with a score of at least S is given by the formula

Image ch4e6.jpg

Here, S is the so-called raw score calculated under a given scoring system, and K and λ are natural scaling parameters for the search space size and the scoring system, respectively. Normalizing the score according to the formula:

Image ch4e7.jpg

gives the bit score, which has a standard unit accepted in information theory and computer science. Then,

Image ch4e8.jpg

and, since it can be shown that the number of random HSPs with score ≥S’ is described by Poisson distribution, the probability of finding at least one HSP with bit score ≥S’ is

Image ch4e9.jpg

Equation 4.5 links two commonly used measures of sequence similarity, the probability (P-value) and expectation (E-value). For example, if the score S is such that three HSPs with this score (or greater) are expected to be found by chance, the probability of finding at least one such HSP is (1 − e−3), ~0.95. By definition, P-values vary from 0 to 1, whereas E-values can be much greater than 1. The BLAST programs (see below) report E-values, rather than P-values, because E-values of, for example, 5 and 10 are much easier to comprehend than P-values of 0.993 and 0.99995. However, for E < 0.01, P-value and E-value are nearly identical.

The product mn defines the search space , a critically important parameter of any database search. Equations 4.2 and 4.4 codify the intuitively obvious notion that the larger the search space, the higher the expectation of finding an HSP with a score greater than any given value. There are two corollaries of this that might take some getting used to: (i) the same HSP may come out statistically significant in a small database and not significant in a large database; with the natural growth of the database, any given alignment becomes less and less significant (but by no means less important because of that) and (ii) the same HSP may be statistically significant in a small protein (used as a query) and not significant in a large protein.

Clearly, one can easily decrease the E-value and the P-value associated with the alignment of the given two sequences by lowering n in equation 4.2, i.e. by searching a smaller database. However, the resulting increase in significance is false, although such a trick can be useful for detecting initial hints of subtle relationships that should be subsequently verified using other approaches. It is the experience of the authors that the simple notion of E(P)-value is often misunderstood and interpreted as if these values applied just to a single pairwise comparison (i.e., if an E-value of 0.001 for an HSP with score S is reported, then, in a database of just a few thousand sequences, one expects to find a score >S by chance). It is critical to realize that the size of the search space is already factored in these E-values, and the reported value corresponds to the database size at the time of search (thus, it is certainly necessary to indicate, in all reports of sequence analysis, which database was searched, and desirably, also on what exact date).

Speaking more philosophically (or futuristically), one could imagine that, should the genomes of all species that inhabit this planet be sequenced, it would become almost impossible to demonstrate statistical significance for any but very close homologs in standard database searches. Thus, other approaches to homology detection are required that counter the problems created by database growth by taking advantage of the simultaneously increasing sequence diversity, and as discussed below, some have already been developed.

The Karlin-Altschul statistics has been rigorously proved to apply only to sequence alignments that do not contain gaps, whereas statistical theory for the more realistic gapped alignments remains an open problem. However, extensive computer simulations have shown that these alignments also follow the extreme value distribution to a high precision; therefore, at least for all practical purposes, the same statistical formalism is applicable [19,22].

Those looking for a detailed mathematical description of the sequence comparison statistics can find it in the “Further Reading” list at the end of the chapter. A brief explanation of the statistical principles behind the BLAST program, written by Stephen Altschul, can be found online at

4.2.3. Protein sequence complexity: Compositional bias

The existence of a robust statistical theory of sequence comparison, in principle, should allow one to easily sort search results by statistical significance and accordingly assign a level of confidence to any homology identification. However, a major aspect of protein molecule organization substantially complicates database search interpretation and may lead to gross errors in sequence analysis. Many proteins, especially in eukaryotes, contain low (compositional) complexity regions , in which the distribution of amino acid residues is non-random, i.e. deviates from the standard statistical model [919,921]. In other words, these regions typically have biased amino acid composition, e.g. are rich in glycine or proline, or in acidic or basic amino acid residues. The ultimate form of low complexity is, of course, a homopolymer, such as a Q-linker [920]. Other low-complexity sequences have a certain amino acid periodicity, sometimes subtle, such as, for example, in coiled-coil and other non-globular proteins (e.g. collagen or keratin).

The notion of compositional complexity was encapsulated in the SEG algorithm and the corresponding program, which partitions protein sequences into segments of low and high (normal) complexity [921]. An important finding made by John Wootton is that low-complexity sequences correspond to non-globular portions of proteins [919]. In other words, a certain minimal level of complexity is required for a sequence to fold into a globular structure. Low-complexity regions in proteins, although devoid of enzymatic activity, have important biological functions, most often promoting protein-protein interactions or cellular adhesion to various surfaces and to each other.

In a detailed empirical study, a set of parameters of the SEG program was identified that allowed reasonably accurate partitioning of a protein sequence into predicted globular and non-globular parts. The mastermind protein of Drosophila is a component of the Notch-dependent signaling pathway and plays an important role in the development of the nervous system of the fruit fly [755,785]. In spite of this critical biological function, this protein consists mostly of stretches of only three amino acid residues, Gln, Asn, and Gly, and is predicted to have a predominantly non-globular structure (Figure 4.6). Recently discovered human homologs of mastermind are also involved in Notch-dependent transcriptional regulation and similarly appear to be almost entirely non-globular [442,923].

Figure 4.6. Sequence of the Drosophila mastermind protein: partitioning into predicted non-globular (left column) and globular (right column) regions.

Figure 4.6

Sequence of the Drosophila mastermind protein: partitioning into predicted non-globular (left column) and globular (right column) regions. The SEG program was run with the parameters optimized for detection of non-globular regions: (more...)

Low-complexity regions represent a major problem for database searches. Since the λ parameter of equation 4.2 is calculated for the entire database, Karlin-Altschul statistics breaks down when the composition of the query or a database sequence or both significantly deviates from the average composition of the database. The result is that low-complexity regions with similar composition (e.g. acidic or basic) often produce “statistically significant” alignments that have nothing to with homology and are completely irrelevant. The SEG program can be used to overcome this problem in a somewhat crude manner: the query sequence, the database, or both can be partitioned into normal complexity and low-complexity regions, and the latter are masked (i.e. amino acid symbols are replaced with the corresponding number of X's). For the purpose of a database search, such filtering is usually done using short windows so that only the segments with a strongly compositional bias are masked. Low-complexity filtering has been indispensable for making database search methods, in particular BLAST, into reliable tools [18]. Without masking low-complexity regions, false results would have been produced for a substantial fraction of proteins, especially eukaryotic ones (an early estimate held that low-complexity regions comprise ~15% of the protein sequences in the SWISS-PROT database [919]). These false results would have badly polluted any large-scale database search, and the respective proteins would have been refractory to any meaningful sequence analysis. For these reasons, for several years, SEG filtering had been used as the default for BLAST searches to mask low-complexity segments in the query sequence. However, this procedure is not without its drawbacks. Not all low-complexity sequences are captured, and false-positives still occur in database searches. The opposite problem also hampers database searches for some proteins: when short low-complexity sequences are parts of conserved regions, statistical significance of an alignment may be underestimated, sometimes grossly.

In a recent work of Alejandro Schäffer and colleagues, a different, less arbitrary approach for dealing with compositionally biased sequences was introduced [748]. This method, called composition-based statistics, recalculates the λ parameter and, accordingly, the E values [see equation 4.2] for each query and each database sequence, thus correcting the inordinately low (“significant”) E-values for sequences with similarly biased amino acid composition. This improves the accuracy of the reported E-values and eliminates most false-positives. Composition-based statistics is currently used as the default for the NCBI BLAST. In 4.4.2, we will discuss the effect of this procedure on database search outcome in greater detail and using specific examples.

4.3. Algorithms for Sequence Alignment and Similarity Search

4.3.1. The basic alignment concepts and principal algorithms

As discussed in the previous sections, similarity searches aim at identifying the homologs of the given query protein (or nucleotide) sequence among all the protein (or nucleotide) sequences in the database. Even in this general discussion, we repeatedly mentioned and, on some occasions, showed sequence alignments. An alignment of homologous protein sequences reveals their common features that are ostensibly important for the structure and function of each of these proteins; it also reveals poorly conserved regions that are less important for the common function but might define the specificity of each of the homologs. In principle, the only way to identify homologs is by aligning the query sequence against all the sequences in the database (below we will discuss some important heuristics that allow an algorithm to skip sequences that are obviously unrelated to the query), sorting these hits based on the degree of similarity, and assessing their statistical significance that is likely to be indicative of homology. Thus, before considering algorithms and programs used to search sequence databases, we must briefly discuss alignment methods themselves.

It is important to make a distinction between a global (i.e. full-length) alignment and a local alignment , which includes only parts of the analyzed sequences (subsequences). Although, in theory, a global alignment is best for describing relationships between sequences, in practice, local alignments are of more general use for two reasons. Firstly, it is common that only parts of compared proteins are homologous (e.g. they share one conserved domain, whereas other domains are unique). Secondly, on many occasions, only a portion of the sequence is conserved enough to carry a detectable signal, whereas the rest have diverged beyond recognition. Optimal global alignment of two sequences was first realized in the Needleman-Wunsch algorithm, which employs dynamic programming [606]. The notion of optimal local alignment (the best possible alignment of two subsequences from the compared sequences) and the corresponding dynamic programming algorithm were introduced by Smith and Waterman [784]. Both of these are O(n 2 ) algorithms, i.e. the time and memory required to generate an optimal alignment are proportional to the product of the lengths of the compared sequences (for convenience, the sequences are assumed to be of equal length n in this notation). Optimal alignment algorithms for multiple sequences have the O(n k ) complexity (where k is the number of compared sequences). Such algorithms for k > 3 are not feasible on any existing computers, therefore all available methods for multiple sequence alignments produce only approximations and do not guarantee the optimal alignment.

It might be useful, at this point, to clarify the notion of optimal alignment. Algorithms like Needleman-Wunsch and Smith-Waterman guarantee the optimal alignment (global and local, respectively) for any two compared sequences. It is important to keep in mind, however, that this optimality is a purely formal notion, which means that, given a scoring function, the algorithm outputs the alignment with the highest possible score. This has nothing to with statistical significance of the alignment, which has to be estimated separately (e.g. using the Karlin-Altschul statistics as outlined above), let alone the biological relevance of the alignment.

For better or worse, alignment algorithms treat protein or DNA as simple strings of letters without recourse to any specific properties of biological macromolecules. Therefore, it might be useful to illustrate the principles of local alignments using a text free of biological context as an example. Below is the text of stanzas I and IV of one of the most famous poems of all times; we shall compare them line by line, observing along the way various problems involved in sequence alignment (the alignable regions are shown in bold):

Image ch4e10.jpg

It is easy to see that, in the first two lines of the two stanzas, the longest common string consists of only five letters, with one mismatch:

Image ch4e11.jpg

The second lines align better, with two similar blocks separated by spacers of variable lengths, which requires gaps to be introduced, in order to combine them in one alignment:

Image ch4e12.jpg

In the third lines, there are common words of seven, four, and six letters, again separated by gaps:

Image ch4e13.jpg

The fourth lines align very well, with a long string of near identity at the end:

Image ch4e14.jpg

In contrast, there is no reasonable alignment between the fifth lines, except for the identical word ‘door’. Obviously, however, the fourth line of the second stanza may be aligned not only with the fourth (IV), but also with the fifth line of the first stanza:

Image ch4e15.jpg

Alignments (IV) and (IV′) can thus be combined to produce a multiple alignment:

Image ch4e16.jpg

Finally, sixth lines of the two stanzas could be aligned at their ends:

Image ch4e17.jpg

This simple example seems to capture several important issues that emerge in sequence alignment analysis. Firstly, remembering that an optimal alignment can be obtained for any two sequences, we should ask: Which alignments actually reflect homology of the respective lines? The alignments III, IV, IV′ (and the derivative IV′′), and V seem to be relevant beyond reasonable doubt. However, are they really correct? In particular, aligning en-ly/ently in III and ntly/ntly in IV require introducing gaps into both sequences. Is this justified? We cannot answer this simple question without a statistical theory for assessing the significance of an alignment, including a way to introduce some reasonable gap penalties.

The treatment of gaps is one of the hardest and still unsolved problems of alignment analysis. There is no theoretical basis for assigning gap penalties relative to substitution penalties (scores). Deriving these penalties empirically is a much more complicated task than deriving substitution penalties as in PAM and BLOSUM series because, unlike the alignment of residues in highly conserved blocks, the number and positions of gaps in alignments tend to be highly uncertain (see, for example alignment IV: Is it correct to place gaps both before and after ‘so’ in the second line?). Thus, gap penalties typically are assigned on the basis of two notions that stem both from the existing understanding of protein structure and from empirical examinations of protein family alignments: (i) deletion or insertion resulting in a gap is much less likely to occur than even the most radical amino acid substitution and should be heavily penalized, and (ii) once a deletion (insertion) has occurred in a given position, deletion or insertion of additional residues (gap extension) becomes much more likely. Therefore a linear function:

Image ch4e18.jpg

where a is the gap opening penalty, b is the gap extension penalty, and x is the length of the gap is used to deal with gaps in most alignment methods. Typically, a = 10 and b = 1 is a reasonable choice of gap penalties to be used in conjunction with the BLOSUM62 matrix. Using these values, the reader should be able to find out whether gaps should have been introduced in alignments III and IV above. In principle, objective gap penalties could be produced through analysis of distributions of gaps in structural alignments, and such a study suggested using convex functions for gap penalties [84]. However, this makes alignment algorithms much costlier computationally, and the practical advantages remain uncertain, so linear gap penalties are still universally employed.

The feasibility of alignments (IV) and (IV’) creates the problem of choice: Which of these is the correct alignment? Alignment (IV) wins because it clearly has a longer conserved region. What is, then, the origin of line 5 in the first stanza and, accordingly, of alignment (IV′)? It is not too difficult to figure out that this is a repeat, a result of duplication of line 4 (this is what we have to conclude given that line 4 is more similar to the homologous line in the second stanza). Such duplications are common in protein sequences, too, and often create major problems for alignment methods.

We concluded that lines 3, 4, and 6 in each stanza of “Raven” are homologous, i.e. evolved from common ancestors with some subsequent divergence. In this case, the conclusion is also corroborated by the fact that we recognize the English words in these lines and see that they are indeed nearly the same and convey similar meanings, albeit differing in nuances. What about alignments (I) and (II)? The content here tells us that no homology is involved, even though alignment (II) looks “believable”. However, it would not have been recognized as statistically significant in a search of any sizable database, such as, for example, the “Complete poems of Edgar Allan Poe” at the American Verse Project of the University of Michigan Humanities Text Initiative (

Is this similarity purely coincidental, then? Obviously, it is not. This is a case of convergence, a phenomenon whose role in molecular evolution we already had a chance to discuss in Chapter 2. Of course, in this case, the source of convergence is known: Edgar Allan Poe deliberately introduced these similarities for the sake of rhyme and alliteration. Such a force, to our knowledge, does not exist in molecular evolution, but analogous functional constraints act as its less efficient substitute.

Most of the existing alignment methods utilize modifications of the Smith-Waterman algorithm. Although it is not our goal here to discuss the latest developments in sequence alignment, the reader has to keep in mind that this remains an active research field, with a variety of algorithms and tools being developed, which at least claim improvements over the traditional ones appearing at a high rate. Just one recent example is BALSA, a Bayesian local alignment algorithm that explores series of substitution matrices and gap penalty values and assesses their posterior probabilities, thus overcoming some of the shortcomings of the Smith-Waterman algorithm [885].

Pairwise alignment methods are important largely in the context of a database search. For analysis of individual protein families, multiple alignment methods are critical. We believe that anyone routinely involved in protein family analysis would agree that, so far, no one has figured out the best way to do it. As indicated above, optimal alignment of more than three sequences is not feasible in the foreseeable future; so all the available methods are approximations. The main principle underlying popular algorithms is hierarchical clustering that roughly approximates the phylogenetic tree and guides the alignment (to our knowledge, this natural idea was first introduced by Feng and Doolittle [221]). The sequences are first compared using a fast method (e.g. FASTA, see below) and clustered by similarity scores to produce a guide tree. Sequences are then aligned step-by-step in a bottom-up succession, starting from terminal clusters in the tree and proceeding to the internal nodes until the root is reached. Once two sequences are aligned, their alignment is fixed and treated essentially as a single sequence with a modification of dynamic programming. Thus, the hierarchical algorithms essentially reduce the O(n k ) multiple alignment problem to a series of O(n 2 ) problems, which makes the algorithm feasible but potentially at the price of alignment quality. The hierarchical algorithms attempt to minimize this problem by starting with most similar sequences where the likelihood of incorrect alignment is minimal, in the hope that the increased weight of correctly aligned positions precludes errors even on the subsequent steps. The most commonly used method for hierarchical multiple alignment is Clustal, which is currently used in the ClustalW or ClustalX variants [392,843] (available, e.g. at,, and

Clustal is fast and tends to produce reasonable alignments, even for protein families with limited sequence conservation, provided the compared proteins do not differ in length too much. A combination of length differences and low sequence conservation tends to result in gross distortions of the alignment. The T-Coffee program is a recent modification of Clustal that incorporates heuristics partially solving these problems [623].

4.3.2. Sequence database search algorithms Smith-Waterman

Any pairwise sequence alignment method in principle can be used for database search in a straightforward manner. All that needs to be done is to construct alignments of the query with each sequence in the database, one by one, rank the results by sequence similarity, and estimate statistical significance.

The classic Smith-Waterman algorithm is a natural choice for such an application, and it has been implemented in several database search programs, the most popular one being SSEARCH written by William Pearson and distributed as part of the FASTA package [663]. It is currently available on numerous servers around the world. The major problem preventing SSEARCH and other implementations of the Smith-Waterman algorithm from becoming the standard choice for routine database searches is the computational cost, which is orders of magnitude greater than it is for the heuristic FASTA and BLAST methods (see below). Since extensive comparisons of the performance of these methods in detecting structurally relevant relationships between proteins failed to show a decisive advantage of SSEARCH [117], the fast heuristic methods dominate the field. Nevertheless, on a case-by-case basis, it is certainly advisable to revert to full Smith-Waterman search when other methods do not reveal a satisfactory picture of homologous relationship for a protein of interest. On a purely empirical and even personal note, the authors have not had much success with this, but undoubtedly, even rare findings may be important. A modified, much faster version of the Smith-Waterman algorithm has been implemented in the MPSRCH program, which is available at the EBI web site ( FASTA

FASTA, introduced in 1988 by William Pearson and David Lipman [664], was the first database search program that achieved search sensitivity comparable to that of Smith-Waterman but was much faster. FASTA looks for biologically relevant global alignments by first scanning the sequence for short exact matches called “words”; a word search is extremely fast. The idea is that almost any pair of homologous sequences is expected to have at least one short word in common. Under this assumption, the great majority of the sequences in the database that do not have common words with the query can be skipped without further examination with a minimal waste of computer time. The sensitivity and speed of the database search with FASTA are inversely related and depend on the “k-tuple” variable, which specifies the word size; typically, searches are run with k = 3, but, if high sensitivity at the expense of speed is desired, one may switch to k = 2.

Subsequently, Pearson introduced several improvements to the FASTA algorithm [662,664], which are implemented in the FASTA3 program available on the EBI server at A useful FASTA-based tool for comparing two sequences, LALIGN, is available at BLAST

Basic Local Alignment Search Tool (BLAST) is the most widely used method for sequence similarity search; it is also the fastest one and the only one that relies on a complete, rigorous statistical theory [1820,22].

Like FASTA and in contrast to the Smith-Waterman algorithm, BLAST employs the word search heuristics to quickly eliminate irrelevant sequences, which greatly reduces the search time. The program initially searches for a word of a given length W (usually 3 amino acids or 11 nucleotides, see 4.4.2) that scores at least T when compared to the query using a given substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of S. The W and T parameters dictate the speed and sensitivity of the search, which can thus be varied by the user.

The original version of BLAST (known as BLAST 1.4) produced only ungapped local alignments, for which rigorous statistical theory is available. Although this program performed well for many practical purposes, it repeatedly demonstrated lower sensitivity than the Smith-Waterman algorithm and the FASTA program, at least when run with the default parameters [662]. The new generation of BLAST makes alignments with gaps, for which extensive simulations have demonstrated the same statistical properties as proved for ungapped alignments (see above).

The BLAST suite of programs is available for searching at the NCBI web site ( and many other web sites around the world. It has three programs that work with nucleotide queries and two programs using protein queries (Table 4.5). The BLASTX, TBLASTN, and TBLASTX programs are used when either the query or the database or both are uncharacterized sequences and the location of protein-coding regions is not known. These programs translate the nucleotide sequence of the query in all six possible frames and run a protein sequence comparison analogous to that in BLASTP.

Table 4.5. Use of BLAST programs for database searches.

Table 4.5

Use of BLAST programs for database searches.

A version of gapped BLAST, known as WU-BLAST, with a slightly different statistical model, which, in some cases, may lead to a greater search sensitivity, is supported by Warren Gish at Washington University in St. Louis (

Recently, the BLAST suite was supplemented with BLAST2 sequences (, a tool for comparing just two nucleotide or protein sequences [830].

Because of its speed, high selectivity, and flexibility, BLAST is the first choice program in any situation when a sequence similarity search is required, and importantly, this method is used most often as the basis for genome annotation. Therefore, we consider the practical aspects of BLAST use in some detail in 4.4. Before that, however, we need to introduce some additional concepts that are critical for protein sequence analysis.

4.3.3. Motifs, domains, and profiles Protein sequence motifs and methods for motif detection

Let us ask a very general question: What distinguishes biologically important sequence similarities from spurious ones? By looking at just one alignment of the query and its database hit showing more or less scattered identical and similar residues as in this, already familiar alignment:

Image ch4e19.jpg

it might be hard to tell one from the other. However, as soon as we align more homologous sequence, particularly from distantly related organisms, as it is done for L36 in Figure 2.1, we will have a clue as to the nature of the distinction. Note two pairs of residues that are conserved in the great majority of L36 sequences: Cx(2)Cx(12)Cx(4–5)H [here x(n) indicates n residues whose identity does not concern us]. Those familiar with protein domains might have already noticed that this conserved pattern resembles the pattern of metal-coordinating residues in the so-called Zn-fingers and Zn-ribbons, extremely widespread metal-binding domains, which mediate protein-nucleic acid and protein-protein interactions. Indeed, L36 has been shown to bind Zn2+, and those very cysteines and histidines are involved [112,333]. Such constellation of conserved amino acid residues associated with a particular function is called a sequence motif (see also 3.2). Typically, motifs are confined to short stretches of protein sequences, usually spanning 10 to 30 amino acid residues. The notion of a motif, arguably one of the most important concepts in computational biology, was first explicitly introduced by Russell Doolittle in 1981 [185]. Fittingly and, to our knowledge, quite independently, the following year, John Walker and colleagues [880] described what is probably the most prominent sequence motif in the entire protein universe, the phosphate-binding site of a vast class of ATP/GTP-utilizing enzymes, which subsequently has been named P-loop [744]. Discovery of sequence motifs characteristic of a vast variety of enzymatic and binding activities of proteins proceeded first at an increasing and then, apparently, at a steady rate [103], and the motifs, in the form of amino acid patterns, were swiftly incorporated by Amos Bairoch in the PROSITE database (see 3.2.1).

The P-loop, which we already encountered in 3.2.1, is usually presented as the following pattern of amino acid residues:

Image ch4e20.jpg

Note that there are two strictly conserved residues in this pattern and two positions where one of two residues is allowed. By running this pattern against the entire protein sequence database using, for example, the FPAT program available through the ExPASy server program or any other pattern-matching program (even the UNIX ‘grep’ command will do), one immediately realizes just how general and how useful this pattern is. Indeed, such a search retrieves sequences of thousands of experimentally characterized ATPases and GTPases and their close homologs. However, only about one-half of the retrieved sequences are known or predicted NTPases of the P-loop class, whereas the rest are false-positives (E.V.K., unpublished). This is not surprising given the small number of residues in this pattern, which results in the probability of chance occurrence of about

Image ch4e21.jpg

(this is an approximate estimate because the actual amino acid frequencies are not taken into account, but it is close enough). With the current database size of about 3.2 × 108 residues, the expected number of matches is about 8,000!

This simple calculation shows that this and many other similar patterns, although they include the most conserved amino acid residues of important motifs, are insufficiently selective to be good diagnostic tools. The specificity of a pattern can be increased by taking into account adjacent residues that tend to have conserved properties. In particular, for the P-loop pattern, it can be required that there are at least three bulky, hydrophobic residues among the five residues upstream of the first glycine (structurally, this is a hydrophobic β-strand in ATPases and GTPases). This would greatly reduce the number of false-positives in a database search but would require a more sophisticated search method (as implemented, for example, in the GREF program of the SEALS package [878], see 5.1.2). Still, this does not solve the problem of motif identification. Figure 4.7 shows the alignment of a small set of selected P-loops that were chosen for their sequence diversity. Obviously, not even a single amino is conserved in all these sequences, although they all represent the same motif that has a conserved function and, in all likelihood, is monophyletic, i.e. evolved only once. Given this lack of strict conservation of amino acid residues in an enzymatic motif, this trend is even more pronounced in motifs associated with macromolecular interactions, in which invariant residues are the exception rather than the norm. Pattern search remains a useful first-approximation method for motif identification, especially because a rich pattern collection, PROSITE (see 3.2.1), can be searched using a rapid and straightforward program like SCANPROSITE ( However, by the very nature of the approach, patterns are either insufficiently selective or too specific and, accordingly, are not adequate descriptions of motifs.

Figure 4.7. Alignment of P-loops from diverse ATPases and GTPases.

Figure 4.7

Alignment of P-loops from diverse ATPases and GTPases. The most conserved residues are shown in bold.

The way to properly capture the information contained in sequence motifs is to represent them as amino acid frequency profiles , which incorporate the frequencies of each of the 20 amino acid residues in each position of the motif. Even in the absence of invariant residues, non-randomness of a motif may be quite obvious in a profile representation (Figure 4.9). Utilization of frequency profiles for database searches had a profound effect on the quality and depth of sequence and structure analysis. The principles and methods that made this possible are discussed in the next section.

Figure 4.9A. Conserved catalytic motifs in the caspase-like superfamily of proteases.

Figure 4.9A

Conserved catalytic motifs in the caspase-like superfamily of proteases. A. Multiple alignment of the catalytic motifs around the two active residues (His and Cys) of caspases (Csp, top group), paracaspases (PC, middle group), (more...) Protein domains, PSSMs, and advanced methods for database search

Sequence motifs are extremely convenient descriptors of conserved, functionally important short portions of proteins. However, motifs are not the natural units of protein structure and evolution. Such distinct units are protein domains . In structural biology, domains are defined as structurally compact, independently folding parts of protein molecules. In comparative genomics and sequence analysis in general, the central, “atomic” objects are parts of proteins that have distinct evolutionary trajectories, i.e. occur either as stand-alone proteins or as parts of variable domain architectures (we refer to the linear order of domains in protein sequences as domain or multidomain architecture), but are never split into parts. Very often, probably in the majority of cases, such units of protein evolution exactly correspond to structural domains. However, in some groups of proteins, an evolutionary unit may consist of two or more domains. For example, from a purely structural viewpoint, trypsin-like proteases have two domains. However, at least so far, separation of these domains has not been observed, and therefore, they should be treated as a single evolutionary unit. It might be desirable to propose a special name for these units of protein evolution, but to our knowledge, this has not been done, and in comparative-genomic literature, including this book, they are commonly referred to as domains. On rare occasions, a domain consists of a single motif, as in the case of AT-hooks shown in Figure 4.8. However, much more often, domains are relatively large, comprising 100 to 300 amino acid residues and including two or more distinct motifs. Motifs are highly conserved patches in multiple alignments of domains that tend to be separated by regions of less pronounced sequence conservation and often of variable length (Figure 4.9A); in other words, motifs may be conceptualized (and visualized) as peaks on sequence conservation profiles. In the 3D structure of most domains, the distinct motifs are juxtaposed and function together, which explains their correlated conservation. Figure 4.9B illustrates the juxtaposition of motifs that center around the two catalytic residues in the alignment of the catalytic domain of caspase-related proteases from Figure 4.9A.

Figure 4.8. Profile representation of a conserved sequence motif and the corresponding 3D structure of the DNA-binding AT-hook domain [48].

Figure 4.8

Profile representation of a conserved sequence motif and the corresponding 3D structure of the DNA-binding AT-hook domain [48]. The pictorial form of the profile was produced using the Sequence Logo method.

Figure 4.9B. Conserved catalytic motifs in the caspase-like superfamily of proteases.

Figure 4.9B

Conserved catalytic motifs in the caspase-like superfamily of proteases. B. Structure of the human caspase-7 (PDB entry 1I51). The figure is generated using Cn3D (see representation (more...)

The notion of protein motifs has been employed directly in algorithms that construct multiple sequence alignments as a chain of motifs separated by unaligned regions. The first of such methods, Multiple Alignment Construction and Analysis Workbench (MACAW), originally used a BLAST-like method for approximately delineating conserved sequence blocks (motifs) and then allowed the user to determine whether inclusion of additional alignment columns increased the significance of the block alignment [756].

MACAW is a very convenient, accurate, and flexible alignment tool; however, the algorithm is O(nk) and, accordingly, becomes prohibitively computationally expensive for a large number of sequences [756]. MACAW is an interactive tool that embodies the important notion that completely automatic methods are unlikely to capture all important motifs in cases of subtle sequence conservation, particularly in proteins that substantially differ in length. For many occasions, it remains the method of choice when careful alignment analysis is required, although, in the current situation of explosive growth of sequence data, the computational cost severely limits MACAW's utility. Subsequently, Charles Lawrence, Andrew Neuwald, and coworkers adapted the Gibbs sampling strategy for motif detection and developed the powerful (if not necessarily user-friendly) PROBE method that allows delineation of multiple, subtle motifs in large sets of sequences ( [493,614,615]). Importantly, Gibbs sampler is an O(n) algorithm, which allows analysis of large numbers of sequences. Gibbs sampling has been incorporated into MACAW as one of the methods for conserved block detection. In principle, this should enable MACAW to efficiently align numerous sequences. In practice, the authors find it problematic to identify relevant motifs among the numerous blocks detected by Gibbs sampler.

Arguably, the most important methodological advance based on the concepts of domains and motifs was the development of position-specific weight matrices (PSSMs) and their use in database searches as an incomparably more powerful substitute for regular matrices, such as BLOSUMs and PAMs. A PSSM is a rectangular table, which consists of n columns (n is the number of positions in the multiple alignment for which the PSSM is made) and 20 rows and contains, in each cell, the score (weight) for the given amino acid in the given position of the multiple alignment (see Figure 3.1). In the simplest case, this score can be the frequency of the amino acid in the given position. It is easy to realize, however, that, on most occasions, residue frequencies taken from any given alignment are unlikely to adequately describe the respective domain family. Firstly, we certainly never know the full range of family members, and moreover, there is no evidence that we have a representative set. Therefore, if a residue is missing in a particular alignment column, this does not justify a 0 score in a PSSM. In reality, a PSSM never includes a score of exactly 0, although scores for some residues might be extremely low, and rounding sometimes may result in 0 values. Instead, a finite score is assigned to the missing residue using so-called regularizers, i.e. various mathematical techniques that strive to derive the correct distribution of amino acids for a given position on the basis of a limited sample [422,826]. It is easy to realize that the score given to a missing residue depends on two factors: the distribution actually found in the sample of available superfamily members and the size of the sample. Clearly, if a set of 1,000 diverse sequences invariably contains, for example, a serine residue in a particular position, the probability of finding any other residue in this position is extremely low. Nevertheless, threonine, as a residue that is structurally close to serine and, according to substitution matrices like BLOSUMs and PAMs, is often exchangeable with serine in proteins, certainly should receive a higher score than, say, lysine. One, of course, could argue that an invariant serine is most likely to be part of a catalytic center of an enzyme and as such is more likely to be replaced by cysteine than by threonine (such replacements in enzymes, e.g. proteases and acyltransferases, are well documented, e.g. [295,338]). This level of sophistication seems to be beyond the capabilities of current automatic methods for PSSM generation, although, in principle, a PSSM for a particular domain could be tailored manually. Another aspect of PSSM construction that requires formal treatment beyond calculating and regularizing amino acid residue scores stems from the fact that many protein families available to us are enriched with closely related sequences (this might be the result of a genuine proliferation of a particular subset of a family or could be caused by sequencing bias). Obviously, an overrepresented subfamily will sway the entire PSSMs toward detection of additional closely related sequences and hamper the performance. To overcome this problem, different weighting schemes are applied to PSSMs to downweigh closely related sequences and increase the contribution of diverse ones. Optimal PSSM construction remains an important problem in sequence analysis, and even small improvements have the potential of significantly enhancing the power of database search methods. Some of the recent developments that we do not have the opportunity to discuss in detail here seem to hold considerable promise [650,691]. Once a PSSM is constructed, using it in a database search is straightforward and not particularly different from using a single query sequence combined with a regular substitution matrix, e.g. BLOSUM62. The common database search methods, such as BLAST, can work equally well with a PSSM, and the same statistics apply.

To our knowledge, the notion of a PSSM and its use for detecting weak sequence similarities was first introduced by Michael Gribskov, Andrew McLachlan, and David Eisenberg in 1987 [315]. However, their method was initially of limited utility because it depended on a pre-constructed multiple sequence alignment and consequently could not be used with the speed and ease comparable to those of using FASTA or BLAST. An important additional step was combining the use of PSSMs with iterative search strategy. To our knowledge, this was first done by Gribskov [314]. Under this approach, after the first run of a PSSM-based similarity search against a sequence database, newly detected sequences (with the similarity to PSSM above a certain cut-off) are added to the alignment, the PSSM is rebuilt, and the cycle is repeated until no new members of the family are detected. This approach was implemented in a completely automated fashion in the Motif Search Tool (MoST) program, which also included a rigorous statistical method for evaluating resulting similarities but only worked with ungapped alignment blocks [826].

A decisive breakthrough in the evolution of PSSM-based methods for database searching was the development of the Position-Specific Iterating (PSI)-BLAST program [22]. This program first performs a regular BLAST search of a protein query against a protein database. It then uses all the hits with scores greater than a certain cut-off to generate a multiple alignment and create a PSSM, which is used for the second search iteration. The search goes on until convergence or for a desired number of iterations. Obviously, the first PSI-BLAST iteration must employ a regular substitution matrix, such as BLOSUM62, to calculate HSP scores. For the subsequent iterations, the PSSM regularization procedure was designed in such a way that the contribution of the initial matrix to the position-specific scores decreases, whereas the contribution of the actual amino acid frequencies in the alignment increases with the growth of the number of retrieved sequences. PSI-BLAST also employs a simple sequence-weighting scheme [344], which is applied for PSSM construction at each iteration. Since its appearance in 1997, PSI-BLAST has become the most common method for in-depth protein sequence analysis. The method owes its success to its high speed (each iteration takes only slightly longer than a regular BLAST run), the ease of use (no additional steps are required, the search starts with a single sequence, and alignments and PSSMs are constructed automatically on the fly), and high reliability, especially when composition-based statistics are invoked. The practical aspects of using PSI-BLAST are considered at some length in 4.3.5.

Hidden Markov Models (HMMs) of multiple sequence alignments are a popular alternative to PSSMs [72,202,351,473]. HMMs can be trained on unaligned sequences or pre-constructed multiple alignments and, similarly to PSI-BLAST, can be iteratively run against a database in an automatic regime. A variety of HMM-based search programs are included in the HMMer2 package ( [208]; Sean Eddy's web site displays a recommendation to pronounce the name of this package "hammer" as in “a more precise mining tool than a BLAST”). HMM search is slower than PSI-BLAST, but there have been reports of greater sensitivity of HMMs (e.g. [652]). In the extensive albeit anecdotal experience of the authors, the results of protein superfamily analysis using PSI-BLAST (with a few “tricks” discussed in 4.4) and HMMer2 are remarkably similar.

The availability of techniques for constructing models of protein families and using them in database searches naturally leads to a vision of the future of protein sequence analysis. The methods discussed above, such as PSI-BLAST and HMMer, start with a protein sequence and gradually build a model that allows detection of homologs with low sequence similarity to the query. Clearly, this approach can be reversed such that a sequence query is run against a pre-made collection of protein family models. In principle, if models were developed for all protein families, the problem of classifying a new protein sequence would have been essentially solved. In addition to family classification, regular database searches like BLAST also provide information on the most closely related homologs of the query, thus giving an indication of its evolutionary affinity. In itself, a search of a library of family models does not yield such information, but an extension of this approach is easily imaginable whereby a protein sequence, after being assigned to a family through PSSM and HMM search, is then fit into a phylogenetic tree. Searching the COG database may be viewed as a rough prototype of this approach (see 3.3). Such a system seems to have the potential of largely replacing current methods with an approach that is both much faster and more informative. Given the explosive growth of sequence databases, transition to searching databases of protein family models as the primary sequence analysis approach seems inevitable in a relatively near future. Only for discovering new domains will it be necessary to revert to searching the entire database, and since the protein universe is finite (see 8.1), these occasions are expected to become increasingly rare.

Presently, sequence analysis has not reached such an advanced stage, but searches against large, albeit far from complete, databases of domain-specific PSSMs and HMMs have already become extremely useful approaches in sequence analysis. Pfam, SMART, and CDD, which were introduced in 3.2, are the principal tools of this type. Pfam and SMART perform searches against HMMs generated from curated alignments of a variety of proteins domains. The CDD server compares a query sequence to the PSSM collection in the CDD (see 3.2.3) using the Reversed Position-Specific (RPS)-BLAST program [545]. Algorithmically, RPS-BLAST is similar to BLAST, with minor modifications; Karlin-Altschul statistics applies to E-value calculation for this method. RPS-BLAST searches the library of PSSMs derived from CDD, finding single- or double-word hits and then performing ungapped extension on these candidate matches. If a sufficiently high-scoring ungapped alignment is produced, a gapped extension is done, and the alignments with E-values below the cut-off are reported. Since the search space is equal to nm where n is the length of the query and m is the total length of the PSSMs in the database (which, at the time of writing, contains ~5,000 PSSMs), RPS-BLAST is ~100 times faster than regular BLAST.

Pattern-Hit-Initiated BLAST (PHI-BLAST) is a variant of BLAST that searches for homologs of the query that contain a particular sequence pattern [942]. As discussed above, pattern search often is insufficiently selective. PHI-BLAST partially rectifies this by first selecting the subset of database sequences that contain the given pattern and then searching this limited database using the regular BLAST algorithm. Although the importance of this method is not comparable to that of PSI-BLAST, it can be useful for detecting homologs with a very low overall similarity to the query that nevertheless retain a specific pattern (see 4.5).

Stand-alone (non-web) BLAST . The previous discussion applied to the web version of BLAST, which is indeed most convenient for analysis of small numbers of sequences, and is, typically, the only form of database search used by experimental biologists. However, the web-based approach is not suitable for large-scale searches requiring extensive post-processing, which are common in genome analysis. For these tasks, one has to use the stand-alone version of BLAST, which can be obtained from NCBI via ftp and installed locally under the Unix or Windows operation systems. Although the stand-alone BLAST programs do not offer all the conveniences available on the web, they do provide some additional and useful opportunities. In particular, stand-alone PSI-BLAST can be automatically run for the specified number of iterations or until convergence.

With the help of simple additional scripts, the results of stand-alone BLAST can be put to much use beyond the straightforward database search. Searches with thousands of queries can be run automatically, followed with various post-processing steps; some of these will figure in the next chapter on genome annotation. The BLASTCLUST program (written by Ilya Dondoshansky in collaboration with Yuri Wolf and E.V.K.), which is also available from NCBI via ftp and works only with stand-alone BLAST, allows clustering sequences by similarity using the results of an all-against-all BLAST search within an analyzed set of sequences as the input. It identifies clusters using two criteria: (i) level of sequence similarity, which may be expressed either as percent identity or as score density (number of bits per aligned position), and (ii) the length of HSP relative to the length of the query and subject (e.g. one may require that, for the given two sequences to be clustered, the HSP(s) should cover at least 70% of each sequence). BLASTCLUST can be used, for example, to eliminate protein fragments from a database or to identify families of paralogs.

4.4. Practical Issues: How to Get the Most Out of BLAST

BLAST in all its different flavors remains by far the most widely used sequence comparison program. For this reason and also given extensive personal experience using BLAST for a variety of projects, we describe the practical aspects of using BLAST in considerable detail. Some of this information and additional hints on BLAST usage are available online at the NCBI web site as a BLAST tutorial ( A simple description of the statistical foundations of BLAST, written by Stephen Altschul, is available at the NCBI web site as BLAST course (

4.4.1. Setting up the BLAST search

The BLAST web site has been recently redesigned in order to simplify the selection of the appropriate BLAST program to perform the desired database search. The user only needs to select the type of the query (nucleotide or protein sequence) and the type of the database (protein or nucleotide). These selections automatically determine which of the BLAST programs is used for the given search. The discussion below deals primarily with BLASTP, but all the same considerations apply to BLASTX, TBLASTN, and TBLASTX (used only in exceptional cases).

Selecting a subsequence and a database

The default is to search with the entire query sequence. On many occasions, however, it is advantageous to use as the query only a portion of a protein sequence, e.g. one domain of a multidomain protein. To this end, one can indicate the range of the amino acid position to be used as the query or simply paste the fragment of interest into the query window.

NCBI, EBI, and other centers offer the users a wide variety of nucleotide and protein databases for various searches, for example, instead of the non-redundant (nr) database, which is used as default, Swissprot, Month (subset of the nr database contaning only the sequences added to GenPept in the last 30 days), or pdb (sequences of the proteins whose structures are present in the PDB (see 3.4). Furthermore, a recently implemented option of the NCBI BLAST allows one to limit the BLAST search by a taxonomy tree node (see 3.6.1) or any other properly formatted Entrez query. This option provides for setting up an infinite variety of special-purpose searches, saves a large amount of computer and human time, and leads to an increase of search sensitivity because of the reduced search space (this, of course has to be remembered when E-values are interpreted). For example, if one needs to compare a query protein only to the two-component signaling systems in Cyanobacteria (e.g. if one suspects that the protein in question is a cyanobacterial histidine kinase), the search can be easily adapted by indicating “histidine kinase” AND “Cyanobacteria [ORGN]” in the Entrez window. Of course, this needs to be done with caution because it is likely that not all histidine kinases that are actually encoded in the cyanobacterial genomes have been annotated as such. Therefore it might be a better idea to search all cyanobacterial proteins, which is still many times faster than searching the entire non-redundant database. We find the ability to limit the search space using taxonomic criteria to be an extremely useful feature of BLAST, especially given the current rapid growth of sequence databases. Another useful way to limit the search space is to include the expression ‘srcdb_refseq[PROP]’ in the ‘Limit results by Entrez query’ window. This will limit the search to NCBI's RefSeq database (see 3.4), thus preserving most of the sequence diversity while avoiding redundancy.

4.4.2. Choosing BLAST parameters

Composition-based statistics and filtering

As noted above, low-complexity sequences (e.g., acidic-, basic- or proline-rich regions) often produce spurious database hits in non-homologous proteins. Currently, this problem is addressed by using composition-based statistics (see 4.2.3) as the default for NCBI BLAST; filtering with SEG is available as an option but is turned off by default. As shown in large-scale tests [748] and confirmed by our own experience, composition-based statistics eliminates spurious hits for all but the most severe cases of low sequence complexity [878].

Table 4.6 shows an example of how filtering low-complexity regions in a protein affects the BLAST scores with a true homolog compared to a hit that is due solely to the low-complexity segment and how composition-based statistics solves this problem without filtering. Breast cancer type 1 susceptibility protein (BRCA1 or BRC1_HUMAN) is a 1863-aa protein, which is mutated in a significant fraction of breast and ovarian cancers (see OMIM entry 113705). This protein contains several low-complexity regions. Using the unfiltered BRCA1 sequence in a BLAST search retrieves a number of BRCA1 fragments and its variants from human, mouse, rat, and other vertebrates, which are not particularly helpful for predicting the function(s) of the query protein. The first significant hit beyond BRCA1 variants is dentin sialophosphoprotein, an Asp,Ser-rich protein found in mineralized dentin. There is no genuine homologous relationship between BRCA1 and dentin phosphoprotein, and this hit is entirely spurious. The default filtering with SEG decreases the similarity score reported by BLAST from 73 to 49 (and, accordingly, increases the E-value from 3e-11 to 5e-4), which still appears to be a statistically significant similarity. Only more stringent filtering, which increases the number of masked residues from 117 to 172, pushes the E-value of the dentin phosphoprotein hit into the unreliable territory. In contrast, filtering only slightly decreases the similarity scores between BRCA1 and an Arabidopsis protein that also contains a BRCT domain [454] or between human and oppossum BRCA1. Of course, when most of the protein is masked, the scores start to drop (and E-values go up). These data were generated using the standard (old) statistics option. The last row of the table shows that using composition-based statistics largely eliminates the problem by recognizing the compositional bias in the dentin phosphoprotein and accordingly decreasing the score for this hit. Thus, this new option makes explicit low complexity filtering largely (albeit not completely) irrelevant for the purposes of routine database searches, but not for exploration of protein structure. Methods like SEG remain important tools for delineating probable globular domains and in that capacity may still be useful for searches, e.g. when a predicted globular domain is used a query. Using composition-based statistics is the only feasible choice for any large-scale automated BLAST searches. However, the same tests have shown that, for some queries, this statistical procedure resulted in artificially high (not significant) E-values [748]. Therefore, for detailed exploration of certain, particularly short, proteins, it is advisable to also try a search with composition-based statistics turned off.

Table 4.6. Removing spurious database hits for the low sequence complexity protein BRCA1 by modifying SEG parameters a.

Table 4.6

Removing spurious database hits for the low sequence complexity protein BRCA1 by modifying SEG parameters a.

Expect value, word size, gap penalty, substitution matrix

Expect (E) value can be any positive number; the default value is 10. Obviously, it is the number of matches in the database that one should expect to find merely by chance. Typically, there is no reason to change this value. However, in cases when extremely low similarity needs to be analyzed, the threshold may be increased (e.g. to 100) and, conversely, when it is desirable to limit the size of the output, lower E-values may be used.

Word size (W) must be an integer; the default values are 3 for protein sequences and 11 for nucleotide sequences. This parameter determines the length of the initial seeds picked up by BLAST in search of HSPs. Currently supported values for the protein search are only 3 and 2. Changing word size to 2 increases sensitivity but considerably slows down the search. This is one of the last resorts for cases when no homologs are detected for a given query with regular search parameters.

We have already discussed the advantages of protein sequence searches over nucleotide sequence searches (see 4.2). The necessarily greater values of the word size parameter in nucleotide searches may lead to curious situations like the following one. Having sequenced the cysteine proteinase gene from alfalfa weevil Hypera postica (GenBank accession no. AF157961 [893]), the authors searched for similar sequences using both BLASTX and BLASTN. A BLASTX search with this gene showed that its product is very similar to cathepsin L-like proteinases from a variety of organisms, including the relatively well-characterized proteinase from Drosophila melanogaster. In contrast, a BLASTN search with this gene sequence, surprisingly, did not find any homologous genes. This seems surprising because, for example, the cathepsin L gene of the southern cattle tick Boophilus microplus (GenBank accession no. AF227957) contains a region with as much as 61% identity (98 identical nucleotides of 161) to the gene from Hypera postica:

Figure 4.10. The importance of the “word size” parameter.

Figure 4.10

The importance of the “word size” parameter. An alignment of segments of cysteine protease genes from alfalfa weevil (AF157961) and cattle tick (AF227957).

How come BLASTN failed to find this similarity? It happened because BLASTN has the default word size of 11, i.e. reports as an HSP only a run of 11 identical nucleotides. Even decreasing the word size to 7, the lowest word size currently allowed for BLASTN, would not change the result because the longest stretch of identical nucleotides in this alignment is only 6 bases long. This example not only shows once more why protein searches are superior to DNA-DNA searches. It also demonstrates that establishing that two given sequences are not homologous requires as much caution as proving that they are homologous. Accordingly, the statement that the reported sequence is “novel” and has no homologs in GenBank, often found in scientific literature, should always be treated with a healthy dose of skepticism.

As described above, different amino acid substitution matrices are tailored to detect similarities among sequences with different levels of divergence. However, a single matrix, BLOSUM62, is reasonably efficient over a broad range of evolutionary change, so that situations when a matrix change is called for are rare. For particularly long alignments with very low similarity, a switch to BLOSUM45 may be attempted, but one should be aware that this could also trigger an increase in the false-positive rate. In contrast, PAM30, PAM70, or BLOSUM80 matrices may be used for short queries. Each substitution matrix should be used with the corresponding set of gap penalties . Since there is no analytical theory for calculating E-values for gapped alignments, parameters of equation 4.2 had to be determined by extensive computer simulations separately for each combination of a matrix, gap opening penalty, and gap extension penalty. Therefore, only a limited set of combinations is available for use (Table 4.7). However, there is no indication that substantial changes in these parameters would have a positive effect on the search performance.

Table 4.7. Substitution matrices and gap penalties.

Table 4.7

Substitution matrices and gap penalties.

Additional aspects of the BLAST setup are discussed in the next subsection because they apply to the output of the search. A useful feature that has been recently added to NCBI BLAST is the ability to save and bookmark the URL with a particular BLAST setup using the ‘Get URL’ button at the bottom of the page. For a habitual BLAST user, it pays off to save several setups customized for different tasks.

4.4.3. Running BLAST and formatting the output

A BLAST search can be initiated with either a GI number or the sequence itself. In the current implementation at the NCBI web page (, the user can run a BLAST search and then try several different ways of formatting the output. The default option involves toggling between two windows, which may become confusing; it may be convenient to switch to a one-window format using the Layout toggle and save the setup as indicated above.

CDD search is run by default in conjunction with BLAST. As discussed above, this search is much faster than regular BLAST and is often more sensitive. The CDD search is normally completed long before the results of conventional BLAST become available. This allows the user to inspect the CDD search output and get an idea of the domain architecture of the query protein while waiting for the BLAST results. On many occasions, all one really needs from a database search is recognizing a particular protein through its characteristic domains architecture or making sure that a protein of interest does not contain a particular domain. In such situations, there may be no reason to even wait for the regular BLAST to finish. The CDD search may also be run as a stand-alone program from the main BLAST page. In this mode, it is possible to change the E-value threshold for reporting domain hits (default 0.01), which can be helpful for detecting subtle relationships and new versions of known domains.

The current BLAST setup includes limitation on the number of descriptions and the number of alignments included in the output; the current defaults are 250 and 100, respectively. With the rapidly growing database size, there is often need to increase these limits in order to investigate a particular protein family. Doing so, however, will likely result in large outputs that are hard to download and navigate. Limiting the search space as outlined above could be a viable and often preferable option.

The graphical overview option allows the user to select whether a pictorial representation of the database hits aligned to the query sequence is included in the output. Although it slows loading the page, this option is essential for quick examination of the output to get an idea of the domain architecture of the query. Each alignment in the graphical view window is color-coded to indicate its similarity to the query sequence.

The Alignments views menu allows the user to choose the mode of alignment presentation. The default Pairwise alignment is the standard BLAST alignment view of the pairs between the query sequence and each of the database hits. All other views are pseudo-multiple alignments produced by parsing the HSPs using the query as a template. Query-anchored with identities shows only residues that are different from the ones in the query; residues identical to the ones in the query are shown as dashes. Query-anchored without identities is the same view with all residues shown. Flat query-anchored with identities is a multiple alignment that allows gaps in the query sequence; residues that are identical to those in the query sequence are shown as dashes. Flat query-anchored without identities also allows gaps in the query sequence but shows all the residues. Pairwise alignment is definitely most convenient for inspection of sequence similarities, but the “flat query-anchored without identities” option allows one to generate multiple alignments of reasonable quality that can be saved for further analysis. This option is best used with the number of descriptions and alignments (see above) limited to a manageable number (typically, no more that 50).

The Taxonomy Reports option allows the user to produce a taxonomic breakdown of the BLAST output. Given that many BLAST outputs are quite large these days, this is extremely helpful, allowing one to promptly assess the phyletic distribution of the given protein family and identify homologs from distant taxa.

Formatting for PSI-BLAST

The output of BLAST can be used as the input for PSI-BLAST. The critical parameter that is typically set before starting the initial BLAST run is inclusion threshold ; the current default is E = 0.005. This parameter determines the E-value required to include a HSP into the multiple alignment that is used to construct the PSSM. Combined with composition-based statistics, the E-value of 0.005 is a relatively conservative cut-off. Spurious hits with lower E-values are uncommon: they are observed more or less as frequently as expected according to Karlin-Altschul statistics, i.e. approximately once in 200 searches. Therefore, carefully exploring the results with higher E-values set as the inclusion threshold often allows one to discover subtle relationships that are not detectable with the default cut-off. When studying new or poorly understood protein families, we routinely employ thresholds up to 0.1. In the version of PSI-BLAST, which is available on the web, each new iteration has to be launched by the user. New sequences detected in the last iteration with an E-value above the cut-off are highlighted in PSI-BLAST output. PSI-BLAST also has the extremely useful option of manually selecting or deselecting sequences for inclusion into the PSSM. Selecting “hopeful” sequences with E-values below the cut-off may help in a preliminary exploration of an emerging protein family; deselecting sequences that appear to be spurious despite E-values above the cut-off may prevent corruption of the PSSM. The PSSM produced by PSI-BLAST at any iteration can be saved and used for subsequent database searches.

We realize that the above recommendation to investigate results that are not reported as statistically significant is a call for controversy. However, we believe there are several arguments in favor of this approach. First, such analyses of subtle similarities have repeatedly proved useful, including the original tests of PSI-BLAST effectiveness [22]. Second, like in other types of research, what is really critical is the original discovery. Once one gets the first glimpse of what might be an important new relationship, statistical significance often can be demonstrated using a combination of additional methods (more on this in 4.4.4). Third, we certainly do not advocate lowering the statistical cut-off for any large-scale searches, let alone automated searches. This is safe only when applied in carefully controlled case studies.

4.4.4. Analysis and interpretation of BLAST results: separating the wheat from the chaff

In spite of the solid statistical foundation, including composition-based statistics, BLAST searches inevitably produce both false positives and false negatives. The main cause for the appearance of false positives, i.e. database hits that have “significant” E-values but, upon more detailed analysis, turn out not to reflect homology, seems to be subtle compositional bias missed by composition-based statistics or low-complexity filtering. The reason why false negatives are inevitable is, in a sense, more fundamental: in many cases, homologs really have low sequence similarity that is not easily captured in database searches and, even if reported, may not cross the threshold of statistical significance. In an iterative procedure like PSI-BLAST, both the opportunities to detect new and interesting relationships and the pitfalls are further exacerbated. Beyond the (conceptually) straightforward issues of selectivity and sensitivity, functional assignments based on database search results require careful interpretation if we want to extract the most out of this type of analysis while minimizing the chance of false predictions. Below we consider both the issues of search selectivity and sensitivity and functional interpretation.

No cut-off value is capable of accurately partitioning the database hits for a given query into relevant ones, indicative of homology, and spurious ones. By considering only database hits with very high statistical significance (e.g. E < 10−10) and applying composition-based statistics, false positives can be eliminated for the overwhelming majority of queries, but the price to pay is high: numerous homologs, often including those that are most important for functional interpretation, will be missed.

Now that we have come to the practical aspects of a database search, specific examples will work best. Let us see what BLAST analysis can tell us about the protein product of human tumor susceptibility gene 101 (TSG101). This protein appears to have important roles in a variety of human cancers and in budding of viruses, including HIV [139,275,514,515,527,721]. We first run the default BLAST search (again, this includes CDD search; composition-based statistics is turned on) using the TSG101 sequence (T101_HUMAN) as the query. Since CDD search is much faster than BLAST, we have the opportunity to examine the potential domain architecture of TSG101 while BLAST is still running. Distinct, statistically significant domain hits are reported for the N-terminal and C-terminal regions of the TSG101 sequence (Figure 4.11). There is also a low-complexity segment in the middle that probably corresponds to a non-globular domain.

Figure 4.11. Graphical output of the CD search for the TSG101 protein.

Figure 4.11

Graphical output of the CD search for the TSG101 protein.

The N-terminal part is similar to the UBCc domain, the catalytic domain of the E2 subunit of ubiquitin-conjugating enzymes. The statistical significance of this similarity is overwhelming, with a E < 10−13. The domain is unique (the only hit in this part of the TSG101 sequence), and UBCc is known to be a globular domain, so there is no reason to suspect a spurious hit due to compositional bias. Therefore we must conclude that the N-terminal portion of TSG101 is a homolog of the catalytic domains of E2 enzymes. Does this mean that this domain of TSG101 has the activity of ubiquitin ligase? The answer comes from a more careful examination of the alignment and is negative. We can easily see that the conserved catalytic Cys residue of E2, which conjugates ubiquitin and is essential for the enzymatic activity [156], is replaced by Tyr in TSG101 (Figure 4.12A). Soon after the discovery of TSG101, similar type of analysis (in the pre-CDD and pre-composition-based statistics days, this was much harder!) led to the conclusion that TSG101 was an inactivated homolog of E2 enzymes, probably a regulator of ubiquitination [453,680]. These predictions of TSG101 structure and function have been subsequently confirmed experimentally [516,682].

Figure 4.12. Results of the CD search for the TSG101 protein.

Figure 4.12

Results of the CD search for the TSG101 protein. A. Alignment of the TSG101 N-terminal domain with ubiquitin-conjugating enzymes. The conserved Cys (replaced by Tyr in TSG101) is shown in bold and indicated with an asterisk. B. Alignment (more...)

The domain hits of the C-terminal region of TSG101 are of a completely different nature. There are three overlapping domains with moderate similarity to the C-terminal portion of TSG101 (Figure 4.11). One of these belongs to an uncharacterized family of bacterial proteins, whereas two others are myosin and tropomyosin, proteins known to contain coiled-coil domains, a distinct non-globular α-helical superstructure. The telltale feature of coiled-coil domains is the periodic distribution of leucines, which tend to occur seven residues apart. A simple visual examination of the alignment of TSG101 with tropomyosin (Figure 4.12B) reveals this pattern of leucines, suggesting that TSG101 contains a C-terminal coiled-coil domain. This observation can be confirmed by using the COILS2 program (see How do we interpret the domain hits for the C-terminal part of TSG101? Coiled-coil domains occur in a huge variety of proteins, and it would be ridiculous to consider them all to be homologous. Thus, these hits are not indications of homology and, in all likelihood, do not point to any specific functional similarity between the respective proteins and TSG101. So, in the traditional sense, these domain hits are spurious, i.e. are caused by similar amino acid composition and structure of the query and certain domains present in the CDD collection, rather than by homology. However, these hits are not uninformative. They allow us to predict the existence of a coiled-coil domain in TSG101, and this domain is likely to mediate important protein-protein interactions. Indeed, it has been shown that the coiled-coil domain is responsible for the interaction of TSG101 with stathmin, a phosphoprotein implicated in tumorigenesis [514].

We have already learned a lot about the functions of TSG101 from the CD search alone. As mentioned above, in many situations, this information could be all a researcher needs from computational analysis of a sequence. However, for the purpose of this discussion, let us now turn to BLAST. The search returns 184 hits, and their distribution along the TSG101 sequence shows the presence of several full-length hits with highly significant similarity to TSG101 (we may immediately suspect these are orthologs) as well as a number of domain-specific hits. Examination of the full-length alignments confirms the conservation of domain architecture in probable TSG101 orthologs not only in distantly related animals (fruit fly and worm) but also in the plant Arabidopsis thaliana. Moreover, in all of these alignments, the tyrosine replacing the catalytic cysteine of E2 is conserved, indicating that all these TSG101 orthologs are enzymatically inactive. This leads us to the remarkable conclusion that inactivation of a E2 paralog and its fusion with a coiled-coil domain leading to the regulatory function of TSG101 have already occurred prior to the divergence of animals and plants. Thus, this regulatory mechanism involved both in cellular transformation and in viral budding from animal cells is at least 800 million years old! Let us note that this conclusion, which was easily reached through the analysis of BLAST output, was not immediately obvious from the CDD search result.

What about the domain-specific hits? There are only three of these for the N-terminal part of TSG101, and all appear to be inactivated homologs of E2 enzymes. This is, of course, something that we recognize in retrospect; note that there is no mention of E2 or ubiquitin or anything else in this BLAST output that would even hint at the homology of TSG101 and ubiquitin-conjugating enzymes. The rest of the hits, located in the central and C-terminal portions of the TSG101 sequence, are to low-complexity sequences, including those of coiled-coil domains, with some E-values below 10−4. Does it make sense to run a second PSI-BLAST iteration starting from this output? We think not (an unwieldy proliferation of low-complexity sequences in the output is fully expected), although the reader might want to attempt the experiment.

We will now try a different approach: low complexity filtering instead of composition-based statistics (the two cannot be used together because masking low-complexity regions precludes correct estimation of search parameters). The search with low complexity filtering turned on results in 637 hits. A region of 41 amino acid residues in the middle of the TSG101 sequence is masked, eliminating the low-complexity hits in this region. However, the C-terminal coiled-coil region is not masked, and an even greater number of hits with apparently significant E-values are reported. Thus, filtering does not solve the problem. It does suggest, however, a simple trick we will do next. It is common for low-complexity (non-globular) regions to separate distinct domains in proteins. Therefore, we will run BLAST using as the query only the segment of TSG101 preceding the masked region; to do so, we put 1 to 163 in the ‘Set subsequence windows’ on the BLAST page. The results are dramatically different. We get only 37 hits; all those with significant E-values are (as we know from the results of the CD search) inactivated E2 homologs, but below the threshold, we notice some proteins annotated as ubiquitin-conjugating enzymes. Now we can run PSI-BLAST. The second iteration brings in two more sequences of inactivated E2's; the best hit to ubiquitin-conjugating enzyme now has an E-value of 0.009, still below the cut-off. After the third iteration, we get the message “No new sequences were found above the 0.005 threshold!” – the search has converged. Formally, we have not observed a statistically significant sequence similarity between TSG101 and ubiquitin-conjugating enzymes. However, the E-value of 0.009 is suggestive (let us recall: this means that an alignment with this or greater score is expected to be observed in less than one database search in 100 for the given query), and the detected similarity is worth further investigation. Let us run the search using the same 1–161 fragment of TSG101 as the query but with the composition-based statistics turned off. Now, in the second iteration, we detect two ubiquitin-conjugating enzymes with E-values < 0.001. In this example, an equivalent result would have been achieved if a less restrictive cut-off (E = 0.01) was used with composition-based statistics. However, there are cases when a search with composition-based statistics turned off reveals relationships that are not detectable at all under the default BLAST parameters.

In the third and subsequent iterations, the E-values for ubiquitin-conjugating enzyme sequences become extremely low. Here, a clarifying note on E-values reported by PSI-BLAST is due [21]. When, in a PSI-BLAST search, a sequence crosses the inclusion threshold for the first time, the reported E-value is based on the score of the alignment of this sequence and the PSSM constructed in the previous PSI-BLAST iteration, which did not include the sequence in question. Therefore, this E-value accurately reflects the similarity between the given sequence and the rest of the protein family involved in PSSM construction. However, for the next iteration, the sequence(s) in question, e.g. those of E2 enzymes in the above example, becomes part of the alignment used for PSSM construction, and their E-values get unduly inflated. Thus, inferences of homology and functional predictions should take into account only the first E-value below the threshold reported for the sequence of interest.

In order to introduce another crucial aspect of PSI-BLAST analysis, we will now leave TSG101 and turn to other examples. In an ideal world, PSI-BLAST could be expected to be symmetrical with respect to protein family members. In other words, each member used as the query would retrieve from the database the entire family. In practice, the effect of query choice on the outcome of the search can be dramatic [43]. Consider how different queries fare in retrieving members of the ATP-grasp superfamily of enzymes, which includes primarily ATP-dependent carboligases (Table 3.2). The biotin carboxylase domain of human acetyl-CoA carboxylase initially retrieves other biotin carboxylases. After ~200 sequences of biotin carboxylases, PSI-BLAST starts finding carbamoyl phosphate synthetases, ~100 of them. Only after these, other members of the superfamily (D-alanine-D-alanine ligase, phosphoribosylamine-glycine ligase and others; Table 3.2) start appearing in the output. It takes four iterations and almost 1,000 database hits before this search finds synapsin, a diverged member of the ATP-grasp superfamily, which is involved in regulating exocytosis in synaptic vesicles and has been shown to contain the ATP-grasp domain by crystal structure analysis (PDB entry 1auxA [216]). Using synapsin as a query in PSI-BLAST search with default parameters fares even worse: only other members of the synapsin family itself are retrieved. Only changing the inclusion threshold to 0.01 results in retrieval of other members of this superfamily in the fourth and subsequent iterations. In contrast, using M. jannaschii protein MJ1001 as a query in PSI-BLAST search with default parameters results in retrieval of most members of the ATP-grasp superfamily, including the synapsins, in the second iteration. A regrettable personal anecdote belongs here: because of this remarkable asymmetry, we initially failed to predict the structure of synapsin! [262]. This and other examples (see [43]) show that, in order to characterize a protein (super)family, it is never sufficient to run a single PSI-BLAST search. Rather, either all known members or at least a representative subset should be used as queries. The search then needs to be repeated with newly detected members in “superiteration” mode. Just like PSI-BLAST, this process (run either automatically or manually) stops only when no new sequences with credible similarity to the analyzed family are detected. As a potentially useful tip for query choice, it is interesting to note that, in our experience, prokaryotic, and particularly archaeal, proteins are often better queries than their eukaryotic homologs. This conclusion has been reached in the early days of genomics [262,467] and still holds true. Among the obvious reasons for this trend are the generally shorter length and lower content of compositionally biased sequences in prokaryotic proteins.

The final crucial aspect of BLAST searches that we must mention here is the importance of protein domain architecture for inferring the function of the query protein from the functions of homologs. Assigning function on the basis of one homologous domain while overlooking the multidomain architecture of the query, the database hit, or both is one of the most common sources of regrettable errors in sequence analysis. We will discuss some specific examples in the next chapter when discussing genome annotation.

This brief discussion certainly cannot cover all “trade secrets” of sequence analysis; some more case studies are presented in the next section. However, the above seems to be sufficient to formulate a few rules of thumb that help a researcher to extract maximal amount of information from database searches while minimizing the likelihood of false “discoveries”.

  • Searching a domain library is often easier and more informative than searching the entire sequence database. However, the latter yields complementary information and should not be skipped if details are of interest.
  • Varying the search parameters, e.g. switching composition-based statistics on and off, can make a difference.
  • Using subsequences, preferably chosen according to objective criteria, e.g. separation from the rest of the protein by a low-complexity linker, may improve search performance.
  • Trying different queries is a must when analyzing protein (super)families.
  • Even hits below the threshold of statistical significance often are worth analyzing, albeit with extreme care.
  • Transferring functional information between homologs on the basis of a database description alone is dangerous. Conservation of domain architectures, active sites, and other features needs to be analyzed (hence automated identification of protein families is difficult and automated prediction of functions is extremely error-prone).

4.5. The Road to Discovery

The notion of punctuated equilibrium introduced by Niles Eldredge and Steven Jay Gould [304] applies not only to the evolution of life but also to the history of science: epochs of relative quiet are punctuated by bursts of discoveries. The event that sets off the avalanche is usually the development of a new method. This phenomenon reproduces itself with an uncanny regularity in all areas of science—and computational biology is no exception. The authors of this book have personally experienced the excitement of two such groundbreaking developments. The first one came in 1994 with the algorithm for iterative motif search implemented in the MoST program [826]. A variety of (relatively) subtle relationships between protein domains that previously were completely unknown suddenly were within our reach.

Arguably, the most remarkable of these new findings was the discovery of the BRCT domain, so named after BRCA1 (hereditary Breast Cancer Associated protein 1) C-terminus [454]. The initial discovery of this domain took something like a leap of faith: the first seed for MoST was derived from an alignment between the C-terminal part of BRCA1 and a p53-binding protein that was produced by BLAST and was not statistically significant at all. Subsequently, of course, some level of statistical significance and, more importantly, the uniqueness of the new domain and the coherence of its diagnostic pattern of amino acid residue conservation were established through MoST searches and multiple alignment analysis. Furthermore, the BRCT domain superfamily was greatly expanded by combining MoST with other profile search tools, Pfsearch and Gibbs sampler [100], and these results were simultaneously corroborated with a completely independent method [136]. In the years since, numerous experimental studies led to the characterization of the BRCT domain as one of the most important adaptors that mediate protein-protein interactions in eukaryotic cell cycle checkpoints (see [407] and references therein).

The BRCT domain also served as the training ground for the next-generation iterative search program, PSI-BLAST. The findings that were originally made quite painstakingly using MoST and other methods were reproduced using PSI-BLAST with minimal human intervention [22]. PSI-BLAST and HMM methods that came of age more or less simultaneously quickly enabled an unprecedented rise of new findings. The discovery of the 3′-5′ exonuclease domain in the Werner syndrome protein and the unification of histidine kinases, type II topoisomerases, HSP90 molecular chaperones and MutL repair proteins in a single distinct superfamily of ATPase domains deserve to be mentioned as two of the very first ones [593].

Much of the new knowledge obtained using these methods, along with additional structural and functional information, was soon encapsulated within the new generation of profile search tools combined with domain databases, of which Pfam, SMART, and CDD seem to be the most practically important ones (see above). These tools, which employ growing and, for the most part, carefully tested libraries of predefined PSSMs or HMMs, have dramatically simplified and, to a large extent, trivialized the process of identification of already known domains in protein sequences, even in cases of limited sequence conservation. This is, of course, a natural and welcome development: what starts as an exciting investigation of uncharted territories ends up being a routine methodology, and the sooner the better. Nevertheless, it is equally clear that fringe cases, in which, even with these new tools, expert exploration is required to validate and interpret domain finding, still abound, particularly with the rapid progress of genome sequencing. And, of course, many new domains that are still missing in the existing collections remain to be discovered, and for this, the libraries of already known domains are of little use.

Nothing helps capturing the flavor of a relatively complex research approach better than real-life examples, so let us explore some interesting proteins, for which, to our knowledge, there is no annotation in either published literature or available databases.

Phage λ gene Ea31

First, let us consider phage λ gene Ea31, which was briefly mentioned in Chapter 1. We start by using the sequence of this protein in a PSI-BLAST search, with the CDD search option turned on, but the composition-based statistics turned off (see above). The CDD search immediately results in a provocative finding: a hit to the so-called HNH nuclease domain. The similarity seems to be statistically significant, but the E-values are not particularly impressive, about 0.001. Can we conclude that Ea31 contains a nuclease domain of this particular family and predict that this phage protein is a nuclease? Let us first note that there are two distinct questions here because the presence of a conserved domain does not automatically imply the corresponding activity. There are numerous cases of the same domain present in both enzymatically active and inactivated proteins (as we have seen in the TSG101 case). Furthermore, without a more detailed investigation, even the conclusion on the presence of an HNH domain in Ea31 would be premature. Experience shows that spurious hits with a similar level of statistical significance are not particularly uncommon in CDD searches (in other words, they appear somewhat more often than once in 1000 searches, as suggested by the E-value), probably due to subtle low complexity present in some of the multiple alignments behind the PSSMs. So let us investigate farther. First of all, by clicking on the HNH domain icon in the CDD search, we obtain a multiple alignment of our query (Ea31) with a diverse set of 10 HNH domains (Figure 4.13).

Figure 4.13. Results of the CD search with the Ea31 protein.

Figure 4.13

Results of the CD search with the Ea31 protein.

We should note that nearly all of these proteins, including Ea31, share a conserved pattern of two pairs of cysteines and the aspartate-histidine (DH) doublet. Examination of the relevant literature immediately shows that the cysteines and the histidine in this pattern are the metal-coordinating residues that comprise a Zn-finger-like domain in the HNH family nucleases and are required for the nuclease activity [129]. The structural elements of the HNH nuclease domain known from the available 3D structure (PDB entry 1EMV) are also well conserved in Ea31. At this point, we are in a position to conclude that the CDD hits pointed us in the right direction and Ea31 indeed contains a HNH domain and is, in all likelihood, an active nuclease. As mentioned in Chapter 1, it seems likely that this predicted nuclease forms a two-subunit, ATP-dependent enzyme with the product of the adjacent (and probably coexpressed) gene Ea59. The HNH family includes several restriction enzymes, e.g. McrA, and other enzymes involved in defense function, such as colicins. Therefore it is possible that such an ATP-dependent nuclease functions as a phage-specific restriction-modification enzyme.

Our little discovery is made (or, more precisely, reproduced and expanded here because the HNH nuclease domain of Ea31 has already been described in the course of a comprehensive evolutionary analysis of Holliday junction resolvases [50]) and, hopefully, will be eventually tested experimentally. But let us not leave the HNH nuclease family just yet and see what happens if we do not rely on the CDD search but rather continue with the regular PSI-BLAST analysis. The original PSI-BLAST search with the default cut-off E-value of 0.005 converges after the first iteration. We will try to lower the cut-off and run PSI-BLAST, carefully monitoring the results. A search with E = 0.1 brings in numerous proteins in successive iterations, among which proteins annotated as putative HNH family endonucleases start appearing in the second iteration.

After running five PSI-BLAST iterations, we detect about 100 sequences with E-values above the chosen cut-off. Examination of all these sequences shows the conservation of the metal-binding motif, suggesting that even using such a liberal cut-off in this case does not result in spurious hits (let us note parenthetically that this very approach of lowering the cut-off was used for detecting BRCT domains in the original PSI-BLAST work).

Continuing running iterations one by one on the web becomes cumbersome with a large number of retrieved sequences, as well as time-consuming, so, at this point, we switch to a local search on a UNIX workstation, which we run to convergence. The search does indeed converge after the 16th iteration, and inspection of the results suggests at least two important conclusions. First, the great majority of the sequences detected in this search and containing the conserved metal-binding site are annotated in the database as “hypothetical proteins” (or, even worse, as “predicted transmembrane proteins”), and only a minority are labeled as (predicted) nucleases. Thus, although this family, its conserved motifs, structure, and enzymatic activity have been described in a considerable number of publications, its representation in current databases does not even approach completeness, and researchers have no direct way to become aware of its scope. Second, our search detected two members of the HNH family from Arabidopsis thaliana and one from rice, the latter one already annotated as a predicted nuclease. Therefore, plants also encode nucleases of this family, which so far has been considered to be exclusively prokaryotic.


In the above example, we identified a diverged version of an already well-known domain in a previously uncharacterized protein. This was a pretty straightforward observation. Discovering new domains and predicting their function(s) is a much more challenging task. However, we have an excellent source of such potential new domains: the functionally uncharacterized COGs or Pfam families (see 3.2.2). The COGs are most convenient because they allow selection of potential “targets” for discoveries using phyletic patterns (see 2.2.6). Figure 4.14 shows the conserved portion of the multiple alignment of proteins from COG1518, which is represented in many archaeal and several bacterial species, primarily thermophiles. These proteins show striking conservation among themselves, but extensive searches for possible homologs using PSI-BLAST, CD search, and even threading methods (see 4.6.3) fail to show any credible relationships (the readers may try these searches again: perhaps the databases change by the time this book is out). However, a little alignment gazing immediately tells us that these proteins have several charged residues (two glutamates, an aspartate and a histidine) that are conserved in all sequences of this family (Figure 4.14). This is a telltale sign of an enzyme: polar amino acids that are invariant in a diverse protein family typically are part of a catalytic center.

Figure 4.14. Alignment of a subset of COG1518 proteins: a predicted new enzyme.

Figure 4.14

Alignment of a subset of COG1518 proteins: a predicted new enzyme. Sequence conservation is indicated by shading and by the consensus pattern. Most conserved residues are shown in reverse colors. Conserved hydrophobic residues (more...)

We have repeatedly observed that, when a domain is capable of “merely” binding a protein, nucleic acid, or a small molecule, there would be no invariant polar residues. Strangely, we found it hard to cite any particular publication for this generalization. Apparently, formal rules for distinguishing enzymes from non-enzymes by analysis of multiple alignments have not been developed so far, although it seems that we more or less know an enzyme when we see the alignment of the respective protein family (provided the aligned sequences are diverse enough). Given the nature of the conserved residues, it appears likely that the activity of this putative new enzyme requires metal coordination. It seems to be dangerous to speculate further on the nature of this activity. We will leave it at that for now. In the next chapter, when discussing the use of genomic context for functional annotation, we will apply a different method of inference to arrive at specific hypotheses on the biological function and biochemical activity of COG1518 proteins.

RNA-dependent RNA polymerase

The final case in this section is a real wild goose chase. Our subject will be the RNA-dependent RNA polymerase (RdRp), an enzyme that attracted enormous attention in the last few years as the apparent amplifier of small RNAs involved in post-transcriptional gene silencing [6]. RdRp is a large protein that is highly conserved in most eukaryotes, including some invertebrate animals, fungi, plants, and importantly, the early-branching protozoon Giardia. It is missing in yeast S. cerevisiae, the microsporidion E. cuniculi, and interestingly, in insects and vertebrates, which is attributed to lineage-specific gene loss during eukaryotic evolution [55]. As noted in the literature and confirmed in our database searches, RdRp does not show significant similarity to any other proteins; in particular, there seems to be no relationship whatsoever with enzymes of RNA viruses that have the same activity. Thus, the origin of this unique enzyme at the dawn of eukaryotic evolution remains a mystery. However, inspection of the alignment of the RdRp's revealed a striking signature present in all these sequences: three invariant aspartates flanked by a few other conserved residues. Negatively charged residues coordinating metal ions are a fixture in all types of DNA and RNA polymerases [176]. A tantalizing hypothesis therefore emerged that this motif was likely to be a part of the active site of RdRp's. Clearly, any similarity between RdRp's and other polymerases, if it exists, must be subtle, otherwise it would have been picked by PSI-BLAST (if not regular BLAST). We ran a PHI-BLAST search using the most conserved segment of the RdRp from the slime mold Dictyostelium discoideum as the query and the following delimiting amino acid pattern:

Image ch4e22.jpg

This search retrieved from the database all available sequences of eukaryotic RdRp's and, in addition, another alignment that is shown in Figure 4.15A. Strikingly, although there is no statistical significance at all to this alignment (E-value of nearly 6), our signature fits the known active site located in the second largest subunit of DNA-dependent RNA polymerase (DdRp), the enzyme that catalyzes transcription in all cellular life forms [586,587]. Furthermore, there seems to be some similarity around this signature that might reflect conservation of the structural elements supporting the polymerase active site (Figure 4.15B). Given the lack of statistical support, these observations should be regarded with utmost caution and perhaps even outright skepticism.

Figure 4.15. Similarity between the putative catalytic site of RdRp and the known catalytic site of DdRp.

Figure 4.15

Similarity between the putative catalytic site of RdRp and the known catalytic site of DdRp. A. The PHI-BLAST hit between RdRp and DdRp. B. Multiple alignment of the conserved domains in RdRps and DdRps. The conserved motif containing (more...)

Nevertheless, it seems likely that the resemblance between the most conserved regions of the two classes of polymerases is not fortuitous. Rather, it might be explained either by divergence from a common ancestor or by convergence driven by similar functional requirements to the polymerase active sites. The latter alternative is not unrealistic: in Chapter 2, we discussed rather similar cases of apparent convergence, e.g. among the lysozymes. If the two polymerases are homologous, the evolutionary scenario becomes more or less obvious, with a duplication of the DdRp catalytic subunit at the onset of the evolution of eukaryotes, followed by extreme, rapid divergence, giving rise to the RdRp. We will know the answer once the 3D structure of RdRp is determined. Regardless of the outcome, there seems to be an excellent chance that the conserved region shown in Figure 4.15 is an important part of the catalytic site of RdRp.

We presented here three case studies of uncharacterized proteins that range from straightforward to openly speculative. We believe that these are illustrative of the way computational analysis of proteins works, by drawing from several lines of evidence to produce predictions whose levels of confidence vary to a great extent.

4.6. Protein Annotation in the Absence of Detectable Homologs

Analysis of various features of protein molecules, which helps predict the type of structure and cellular localization of a given protein or an entire protein family, is an indispensable complement to homology-based analysis. Furthermore, identification of certain structural features of proteins, such as signal peptides, transmembrane segments, or coiled-coil domains, may provide some functional clues even in the absence of detectable homologs. It is not our intention here to discuss these methods in detail, but a brief summary is necessary to develop a reasonably complete picture of computational approaches that are important in genomics.

4.6.1. Prediction of subcellular localization of the protein Signal peptides

The extensive studies of properties of signal peptides by Gunnar von Heijne and colleagues [149,873,874] resulted in the development of the SignalP program [617], which has become the de facto standard for signal peptide prediction and made their identification a relatively straightforward process. SignalP is a neural network method that has been trained separately on experimentally characterized sets of signal peptides from eukaryotes, Gram-positive bacteria, and Gram-negative bacteria. Thus, the appropriate version of the program needs to be selected according to the origin of the analyzed protein (the Gram-positive version can be used for other prokaryotes with single-membrane cells, i. e. all others than Proteobacteria). SignalP is available at the web site of the Technical University of Denmark at The only other widely used algorithm for prediction of signal peptides was developed by Kenta Nakai and Minoru Kanehisa [601] and is included in the PSORT ( suite of programs ([600], see below). Transmembrane segments

There is a variety of methods for predicting transmembrane segments or, more precisely, transmembrane α-helices in proteins (Table 4.8). All those programs rely to some degree on the hydrophobicity profiles of the polypeptide chains; several of them use experimentally determined transmembrane segments as training sets.

Table 4.8. Software tools for prediction of transmembrane segments.

Table 4.8

Software tools for prediction of transmembrane segments.

In addition to predicting the positions of transmembrane segments, some of these programs predict membrane topology of the protein, typically using the “positive-inside” rule [875]. While most of the programs listed above (Table 4.8) accept only single sequences, TMHMM, HMMTop, and SOSUI would also accept FASTA libraries. PHD additionally accepts multiple alignments, and TMAP works only with multiple alignments. Protein targeting

Prediction of mitochondrial and chloroplast targeting signals allows one to differentiate between cytosolic and organellar proteins (Table 4.9). The PSORT server (, developed and maintained by Kenta Nakai at the University of Tokyo, additionally predicts protein targeting to the nucleus, endoplasmic reticulum, lysosomes, vacuoles, Golgi complex, and peroxisomes and identifies probable GPI-anchored proteins.

Table 4.9. Software tools for prediction of protein targeting.

Table 4.9

Software tools for prediction of protein targeting.

4.6.2. Prediction of structural features of proteins

Prediction of structural features of a protein does not directly lead to functional prediction but is a prerequisite for most functional assignments. These methods contribute to protein analysis both in themselves and in conjunction with sequence-based and structure-based methods for homology detection. As discussed above, identification of low-complexity regions is the standard preliminary step in sequence similarity searches, whereas prediction of the secondary structure elements is a prerequisite for some methods of threading introduced below. Coiled-coil domains

Because of the critical importance of appropriate treatment of low-complexity sequences for similarity searches, these were already introduced and briefly discussed in 4.2.3. Coiled-coil domains comprise a distinct class of non-globular protein structures; coiled coils are α-superhelices that are characterized, at the sequence level, by a 7-mer periodicity of hydrophobic residues, usually leucines. Search for such periodicity is the basis of the coiled-coil recognition algorithm in the COILS program by Andrei Lupas ( [531–533]. Peter Kim and colleagues at MIT developed two more programs for coiled-coil prediction, Parcoil ( and Multicoil ( [908]. Secondary structure

Prediction of the secondary structure of a protein in itself gives little indication of its function(s) or homologous relationship but nevertheless is important in conjunction with results obtained by other methods. Protein evolution proceeds largely through insertions and deletions in unstructured parts of a domain (loops), whereas secondary structure elements tend to be conserved. Therefore, reliable prediction of these elements helps correctly align distantly related proteins and reveals subtle sequence similarities that otherwise might have been missed. Sequence-based secondary structure prediction is a well-developed area with a large number of competing methods (Table 4.10). The classic early methods, such as those of Chow-Fasman and Garnier and coworkers, used amino acid residue propensities calculated from 3D structures to predict the most likely structural state for each sequence segment. The modern methods, such as PHD, employ neural networks, which, again, train on the available database of protein structures. Most of the prediction methods partition the protein sequence into one of the three states, α-helix, β-sheet, and loop, and achieve a 70–75% accuracy. Several popular programs, such as PHD and PREDATOR, can accept a multiple alignment as the input, which facilitates identification of conserved structural motifs and notably increases the prediction accuracy. Other programs, such as PSIPRED and Jnet, would take a single sequence as an input, run PSI-BLAST with this sequence as query, and use the alignment generated after three iterations of PSI-BLAST for secondary structure prediction. In addition to generating the structural assignment (α-helix, β-sheet, or a loop) for each amino acid residue, some programs also provide numerical measures of the confidence of prediction. Different secondary structure prediction programs utilize different approaches and often generate conflicting predictions. Therefore, it is advisable to attempt prediction with two or three different tools and compare the results. To simplify this task, several servers (Table 4.10) offer simultaneous submission of the given protein sequence to several prediction programs.

Table 4.10. Portals for secondary structure prediction.

Table 4.10

Portals for secondary structure prediction.

The JPred server (, developed by Geoff Barton and colleagues at the EBI (currently at the University of Dundee), simultaneously runs PHD, Predator, DSC, NNSSP, ZPred, Mulpred, Jnet, COILS, and MultiCoil programs [166]. Another convenient site is the PredictProtein server (, maintained by Burkhard Rost at the Columbia University and mirrored around the world. In addition to running PHD, PredictProtein also submits the input sequence to Prof, PSIpred, PSSP, SAM-T99, and SSPro programs (see Table 4.11) for secondary structure prediction and to DAS, TMHMM, and TopPred programs (Table 4.8) for prediction of transmembrane segments.

Table 4.11. Software tools for secondary structure prediction.

Table 4.11

Software tools for secondary structure prediction.

While the Jpred server runs different programs locally and generates a consensus prediction, PredictProtein does not run other programs by itself. As a result, the user receives separate E-mail messages with outputs from each server used and has to combine their results. The NPS@ server at Pôle Bio-Informatique Lyonnais in Lyon, France ( provides access to PHD and Predator, as well as several other prediction tools, such as GORI, GORIII, GORIV, and SIMPA96 by Jean Garnier and colleagues [273,274,282,509,510], HNN and MLRC by Yann Guermeur and colleagues [321,322], and SOPM and SOPMA by Geourjon and Deleage [277,278]. Like Jpred, the NPS@ server runs all these programs on its own and allows generation of a single consensus prediction [155].

Despite their convenience, the “meta-servers” do not cover the entire diversity of the existing methods for secondary structure prediction. Servers that run individual prediction programs are listed in Table 4.11. Combination and hierarchy of prediction methods

Obviously, to characterize a protein, it is necessary to combine predictions of a variety of structural features for which purpose the methods outlined above are used. When applying them, one needs to take into account that different predictions may overlap, with a more specific one being subsumed and obscured by a more general one. Specifically, signal peptides may mask as transmembrane segments, whereas both transmembrane segments and coiled-coil domains may be given the inappropriately general label of low-complexity sequences. To avoid such clashes between different predictions, it is necessary to establish the hierarchy, in which they apply, with the more specific predictions given higher priority. The following priority order appears reasonable: signal peptide > transmembrane segment > coiled-coil > low complexity.

Secondary structure, which provides a complementary description of a protein sequence (a transmembrane segment is, at the same time, an α-helix), is usually predicted separately. This priority system has to be kept in mind when analysis is done manually, but ordering of prediction methods is also implemented (among software tools familiar to us) in the SMART server and in the UNIPRED program of the SEALS package.

Just like sequence similarity analysis methods, structural prediction needs to be scaled up for the purpose of genome analysis, and this requires local implementation. Most of the researchers who support the servers mentioned above will readily provide their code and/or executables.

4.6.3. Threading

Protein sequence-structure threading (usually simply referred to as threading) is a family of computational approaches that, given a protein sequence, attempt to select, among all known 3D structures, the structure that is best compatible with this sequence [535,775,783]. Metaphorically, the sequence is “threaded” through a variety of structures, and the method determines which one fits better than the others [125,405,575,649,651]. The underlying idea is already well familiar to us: protein structure is more conserved in evolution than sequence. Insertions and deletions that change the substrate specificity, thermal stability, and other properties of the protein mostly occur in the loop regions without changing the core set of α-helices and β-strands. Therefore, a comparison of the (predicted) secondary structure of the new protein against a library of known 3D structures could potentially identify distant homologs, even in the absence of statistically significant sequence similarity. Generally, threading methods involve calculating residue contact energy for the analyzed sequence superimposed over each structure in the database and ranking the structures by decreasing energy; the structure with minimal energy is the winner (e.g. [125,405,580,939]). Several statistical models to estimate the probability of “native” fold detection have been developed (e.g. [125,575,812]). It has been consistently reported that combining the traditional contact-potential-based threading with the use of sequence profiles and secondary structure alignment leads to a substantially greater success rate of fold recognition than either threading or profile searches or secondary structure comparisons alone [406,559,649,651].

Further discussion of threading is beyond the scope of this book; a detailed review of the physical theory behind threading methods has been published recently [574]. However, before ending this brief section with a list of threading software tools, which are available on the web, we must add a cautionary note based on our own research experience. Despite the well-documented success of threading approaches, using several different threading methods in the analysis of a variety of protein families, we usually failed to detect any demonstrably relevant relationships beyond what we could have identified by extensive PSI-BLAST-based sequence analysis. In contrast, we faced a considerable number of false leads that were associated with apparently statistically significant scores. We are relating this experience not to question the impressive performance of threading methods in fold recognition but to caution the reader that current threading approaches still might not be robust enough for routine use in large-scale genome analysis.

Even in case-by-case manual analysis, before gleaning any far-reaching conclusions from the threading results, one has to be aware of the complexities of the approach and its potential pitfalls. It is important to carefully analyze the outputs to make sure that the reported secondary structure similarity is genuine and is not caused, for example, by “fatal attraction” of long (>15 aa) helices in two proteins that are otherwise entirely different. It also helps to check the list of hits and make sure that they belong to the same SCOP and/or CATH fold or at least align well in VAST and/or DALI analysis. One should be extremely suspicious when the query protein produces high-scoring alignments with proteins known to have different folds.

Some of the popular threading tools accessible on the web are listed in Table 4.12. A convenient portal to most of these methods is maintained by Leszek Rychlewski and colleagues at BioInfoBank ( in Poznan, Poland [128]. In addition, this server offers a consensus prediction through the Pcons tool ( [530]).

Table 4.12. Software tools for protein threading and related methods for protein structure prediction.

Table 4.12

Software tools for protein threading and related methods for protein structure prediction.

4.7. Conclusions and Outlook

The methodological armory of computational biology has been evolving at a substantial rate for only 20 years but has already reached impressive diversity, such that it would be impossible to present it properly, even if we devoted the entire book to this subject alone. In this chapter, we attempted to briefly discuss only those methods that, in our understanding, are central to comparative genomics. These are: (i) gene prediction; (ii) sequence similarity analysis and, in particular, database search, including iterative methods based on PSSMs and HMMs; and (iii) prediction of protein structural features. We believe that, at the time of this writing (middle of 2002), sequence similarity analysis, along with comparison of 3D structures, remains the main source of biologically important discoveries and predictions made by computational biologists. We tried to show that some of these finding are distinctly nontrivial. In the next chapter, we discuss how these methods work when applied to genome comparison and also present some new approaches that more explicitly rely on the information that can be extracted only from (nearly) complete genome sequences.

4.8. Further Reading

Doolittle RF. 1986. Of Urfs and Orfs: A Primer on How to Analyze Derived Amino acid Sequences. University Science Books, San Diego.
Baxevanis AD and Ouellette BFF (eds). 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, New York.
Mount DW. 2000. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Chapter 1.
Durbin R, Eddy SR, Krogh A and Mitchison G. 1997. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.
Waterman MS. 1995. Introduction to Computational Biology: Maps, Sequences and Genomes. CRC Press, Boca Raton, FL.
Gusfield D. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK.
Copyright © 2003, Kluwer Academic.
Bookshelf ID: NBK20261
PubReader format: click here to try


Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...