Consensus Sequence Zen
National Cancer Institute at Frederick, Laboratory of Experimental and Computational Biology, P. O. Box B, Frederick, MD 21702-1201. (301) 846-5581 (-5532 for messages), fax: (301) 846-5598, email: toms/at/ncifcrf.gov. http://www.lecb.ncifcrf.gov/~toms/
Abstract
Consensus sequences are widely used in molecular biology but they have many flaws. As a result, binding sites of proteins and other molecules are missed during studies of genetic sequences and important biological effects cannot be seen. Information theory provides a mathematically robust way to avoid consensus sequences. Instead of using consensus sequences, sequence conservation can be quantitatively presented in bits of information by using sequence logo graphics to represent the average of a set of sites and sequence walker graphics to represent individual sites.
“All models are wrong but some are useful.” George E. P. Box (Box, 1979)
Keywords: consensus sequence, information theory, sequence logo, sequence walker, binding site, genetic control
How to be sure to make a mistake. Genes are controlled by proteins that bind to specific spots on the DNA sequence. Molecular biologists often represent the patterns at these spots by using a consensus sequence. For example, after aligning some binding sites so that they match each other, one position might contain 70% adenine, 10% cytosine, 10% guanine, and 10% thymine. The consensus is the most frequent base, ‘A’. This is the simplest (and possibly the most commonly applied) approach, but there are alternatives (Day & McMorris, 1992). Various kinds of consensus sequence commonly found in the molecular biology literature will be considered here, while the controversy over the use of consensus trees used in phylogenetic inference (Barrett et al., 1991; Nelson, 1993; Barrett et al., 1993; de Queiroz, 1993) will not be covered.
The main difficulty with using consensus sequences is that they present distorted pictures of binding sites. In order to locate new binding sites, consensus sequences are compared to various locations in a sequence and the number of matches is tallied. A difficulty arises because a position that is always an ‘A’ in the original set is treated the same as a position that is just 70% A. If we treat the 70% position as simply A, then when we use it to look for additional binding sites, we will find mismatches for 30% of the acceptable sequences. This problem is compounded across the entire binding site, which may be 20 or even 40 bases long (Schneider, 1996; Zheng et al., 1999). For example, a commonly cited consensus sequence is TATAAT (Lewin, 1997), which represents the -10 region of bacterial promoters originally discovered by David Pribnow (1975). The most prominent bases at the three middle positions (TAA) are only 49%, 58%, and 54%, respectively (Lisser & Margalit, 1993). If one demands that a site have all of the consensus bases, one finds only 14 TATAAT sequences out of 291 sequences in the database. To deal with this, people often count mismatches, but it is not obvious from the simple consensus which bases are allowed to be more variable. Sometimes variations such as allowing C or G are indicated but, again, the degree of allowed variation is lost. It is not surprising, then, that consensus sequences frequently fail to identify binding sites or that they predict sites where there are none.
Consensus sequences have other serious problems, many of which are revealed by using information theory to measure the amount of conservation in bits. In a set of aligned binding sites, a DNA position that is always an A stays that way during evolution because the molecule that binds to it always selects A from the four possible bases (Schneider, 2000).
Such a selection can be made with a minimum of two yes-no questions: ‘Is it in the set A or T?’ and ‘Is it in the set A or C?’, so the selection takes two bits of information, one to answer each question. Likewise, a position that is either A or T requires only one yes-no question (the other one being ignored), so it has one bit of sequence conservation. The late Claude Shannon figured out how to consistently measure the average information when the frequencies are not so simple (Shannon, 1948; Schneider et al., 1986; Schneider, 1995). One can plot the sequence conservation across all positions in the set of aligned binding sites. This continuous quantitative measure often follows a sine wave, reflecting the binding of a protein to one face of helically twisting B-form DNA (Papp et al., 1993; Schneider, 1996; Schneider, 2001). This subtle effect cannot be seen by using consensus sequences.
A Paradox: How can two things be the same but different? Intriguingly, the binding sites for human splice junction donor and acceptor sites have the same consensus sequence for a portion of each site around the junction. Yet, when we measured the sequence conservation in bits we found that the information curves are quite different (Stephens & Schneider, 1992). How could two sites have the same consensus sequence but be different? This conundrum led us to introduce a computer graphic, called a sequence logo, in order to understand the difference (Fig. 1).
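The yes-no-question argument can be checked numerically. The following is a minimal sketch, assuming that conservation at a position is 2 bits minus the Shannon uncertainty of the four base frequencies and that no small-sample correction is applied; the function name `conservation_bits` is illustrative and not taken from the article.

```python
import math

def conservation_bits(freqs):
    """Sequence conservation at one DNA position, in bits.

    freqs: frequencies of the four bases, summing to 1.
    Computed as 2 - H, where H is the Shannon uncertainty of the
    frequencies and 2 bits is the uncertainty of four equally likely bases.
    No small-sample correction is applied (an assumption of this sketch).
    """
    if len(freqs) != 4 or abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("expected four frequencies that sum to 1")
    # Bases with frequency 0 contribute nothing to the uncertainty.
    uncertainty = -sum(f * math.log2(f) for f in freqs if f > 0)
    return 2.0 - uncertainty

always_a = conservation_bits([1.0, 0.0, 0.0, 0.0])          # A every time
a_or_t = conservation_bits([0.5, 0.0, 0.0, 0.5])            # A or T equally often
no_pattern = conservation_bits([0.25, 0.25, 0.25, 0.25])    # all bases equally often
print(always_a, a_or_t, no_pattern)
```

Under these assumptions the three positions give 2, 1, and 0 bits, matching the two-question and one-question cases described above.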
Walking along the genome. One can depict individual sites using another graphic called a sequence walker, in which the height of a letter above or below zero shows how much that base contributes to the average sequence conservation of the entire collection of sites shown in the logo (Fig. 2).
Sequence walkers can be stepped along the sequence (hence the name) to discover positions that match a particular model, and one can predict whether or not a sequence change will destroy the site and cause a genetic disease (Rogan et al., 1998). In the case shown, splicing is normally accomplished using a 12.7 bit acceptor at position 5154. Nearby, however, is an 8.9 bit ‘cryptic’ acceptor that is not used, apparently because the strongest site in any local region normally wins the competition for splice factors. An A to G mutation at 5153 destroys the normal site, making it 4.5 bits while simultaneously raising the cryptic site to 16.5 bits. This results in a single base frame shift, the loss of the protein, and Hunter disease. Cases like this are difficult to understand using consensus sequences because sites are affected by all of their parts and quantitative differences are missed. Using information theory and sequence walkers we have interpreted about 100 mutations in two human workdays (the computer time is only a few seconds).
Statistical effects of making a consensus. The overall strength of a binding site is found by summing the individual bit contributions. A distribution of these strengths is roughly Gaussian and shows that most natural binding sites have much less information than the consensus sequence (Schneider, 1997a). The strict consensus (where only the most frequent base is used) is the strongest possible binding site and is on the far high end of the distribution. For example, only one in 270 acceptor sites matches the strict consensus. For this reason it is generally inappropriate to say that one has a consensus binding site at such-and-such a position on a sequence.
As mentioned earlier, using consensus sequences to find binding sites by counting mismatches can lead to errors. How does this compare to the information theory approach?
If matches to the consensus are assigned to have 1 unit and mismatches 0 units, then the total count is an integer. In contrast, the information theory weights are 2 + log2(base frequency) + (a small-sample correction), which includes the real numbers. Summing the information theory weights gives continuous results, while counting mismatches gives blocky results that will often be off the mark. The commonly used ‘percent identity’ between two sequences, such as proteins, is flawed for the same reasons.
Sometimes counting matches or mismatches can give results opposite to the information measure weights so that a base in a site could have a mismatch to the consensus and yet that base could contribute positive information. For example, for a position that has 60% A, 30% T, 5% G, and 5% C the consensus base is A by two-fold, and yet a T in an individual binding site would contribute 2 + log2(0.30) = 0.26 bits. Only by noting the total distribution can we learn that the T contributes positively to the information.
A related effect that is hidden by a consensus is that the diversity of the less frequent bases affects the total sequence conservation. For example, a position with 70% A, 30% T, 0% G, and 0% C has 1.12 bits of conservation, but a position with 70% A and 10% for each of C, G and T has only 0.64 bits. The consensus for both cases, A, does not distinguish between these.
When there are very few sequences, statistical artifacts crop up. Even if there’s no information in the set, it can look like there is. For example, if one has only 6 random sequences, one will frequently observe positions that have 50% or more of one base. If, as is commonly done, one uses 50% as the cutoff for writing the consensus base, then one can get the false impression that there is pretty good sequence conservation. In the example shown in Fig. 3
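The numerical examples in this section can be reproduced with a short sketch. It assumes that the weight of a base is 2 + log2(frequency) with the small-sample correction omitted, that the conservation of a position is the frequency-weighted average of those weights, and that "random" means each of the four bases is equally likely and independent. All function names are illustrative.

```python
import math
from itertools import product

def base_weight(freq):
    """Information contributed by one base, in bits: 2 + log2(frequency)."""
    if freq <= 0:
        return float("-inf")  # a base never observed in the aligned sites
    return 2.0 + math.log2(freq)

def position_conservation(freqs):
    """Frequency-weighted average of the base weights at one position."""
    return sum(f * base_weight(f) for f in freqs.values() if f > 0)

def mismatch_score(base, freqs):
    """Consensus-style scoring: 1 for the most frequent base, 0 otherwise."""
    consensus = max(freqs, key=freqs.get)
    return 1 if base == consensus else 0

def chance_of_majority(n_sites):
    """Exact probability that one base fills at least 50% of a column
    of n_sites random bases (enumerates all 4**n_sites columns)."""
    hits = 0
    for column in product("ACGT", repeat=n_sites):
        if 2 * max(column.count(b) for b in "ACGT") >= n_sites:
            hits += 1
    return hits / 4 ** n_sites

# Mismatch counting versus information weights at the 60/30/5/5 position.
position = {"A": 0.60, "T": 0.30, "G": 0.05, "C": 0.05}
for base in "ATGC":
    print(base, mismatch_score(base, position), round(base_weight(position[base]), 2))

# Same consensus base (A), different conservation.
two_bases = {"A": 0.70, "T": 0.30, "G": 0.0, "C": 0.0}
four_bases = {"A": 0.70, "T": 0.10, "G": 0.10, "C": 0.10}
print(round(position_conservation(two_bases), 2))
print(round(position_conservation(four_bases), 2))

# How often six random sequences show an apparent consensus base.
print(chance_of_majority(6))
```

Under these assumptions the mismatch count scores T, G, and C identically as 0, while the weights separate them: about 1.26 bits for A, 0.26 for T, and -2.32 for G or C. The two 70% A positions come out at 1.12 and 0.64 bits, and a column of six random bases reaches the 50% cutoff with probability 2656/4096, roughly 0.65.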
Missing the trees in the data forest. As a result of counting mismatches to a consensus it is possible to entirely miss a binding site. One of the most striking examples is a Fis site in the tgt/sec promoter of E. coli (Fig. 4).
The authors were aware that there was a second binding site, but placed its location somewhere in the 69 bases upstream of the ClaI site. Why did they miss the site? Two positions (indicated by arrows in the figure) did not match the ‘accepted’ consensus sequence (Hübner & Arber, 1989). The consensus method gave these positions far more weight than was appropriate. The information for the site at -73 is 10.8 bits, which is 2 bits more than the average. To determine if there is really a site there, we performed a gel shift experiment using a DNA containing only the proposed Fis site at -73 and showed that the sequence is indeed bound by Fis (Hengen et al., 1997). Because the consensus sequence failed to predict a site that had been documented experimentally, that site could not be seen, and to the scientists it did not exist (Kuhn, 1970).
A more critical example is in the hMSH2 gene, which is associated with familial nonpolyposis colon cancer (Rogan & Schneider, 1995). A ‘T’ to ‘C’ transition occurred at position -5 of an acceptor site and this change was proposed to be the cause of the disease (Fishel et al., 1993). Inspection of the logo in Fig. 1
Why did this potential ‘misdiagnosis’ happen? We suppose that T was taken to be the consensus sequence. Given this, one would interpret any change from that consensus to be detrimental. In this case the consensus sequence was so rigid that it could not handle a subtle change and a site disappeared from the scientist’s view even though it was still functional. As DNA sequencing technologies become widely available to doctors, this situation will come up repeatedly. Serious malpractice suits could occur as a result of using the consensus model.
Flipping the light on to see an unseen world. Two recently published examples demonstrate some of the interesting biology that one can miss by using a consensus sequence. The first example is the RepA binding site (Fig. 5).
RepA and other DNA binding proteins show sequence conservation up to 2 bits where they contact the major groove and only 1 bit where they face a minor groove (Papp et al., 1993; Schneider, 2001). The upper bound of 2 bits is achievable because all 4 bases can be distinguished using contacts in the major groove (Seeman et al., 1976). In contrast, the minor groove of B-form DNA is essentially symmetrical and can only provide up to 1 bit of sequence conservation. Intriguingly, as seen in Fig. 5
Base flipping was discovered by Rich Roberts in the co-crystal of the HhaI methyltransferase (Roberts, 1995; Roberts & Cheng, 1998). This solved a puzzle of how that enzyme functions, since the chemistry of methylation requires attack from above or below the plane of the base. Such an attack is not possible inside the DNA helix. The HhaI methyltransferase solves the problem by flipping the base out of the helix and into a pocket of the enzyme. Other DNA modification proteins also flip bases (Cheng et al., 1993; Klimasauskas et al., 1994; Verdine, 1994; Reinisch et al., 1995).
Why would RepA be flipping a base? RepA is used by the bacteriophage P1 plasmid for DNA replication (Abeles, 1986; Abeles et al., 1989). DNA replication requires that the helix be opened before synthesis can begin. The first step of this process would be the binding of RepA to the DNA. A very simple second step would be the flipping of a base out of the DNA, since DNA ‘breathing’ occurs naturally on a millisecond scale (Guéron et al., 1987; Leroy et al., 1988). If the thymine at +7 flips, is captured, and then held out of the DNA helix by RepA, weakened stacking could allow the remainder of the DNA to be more easily opened by a DNA helicase. Sequence logos of other DNA replication protein binding sites have similar anomalies (Schneider, 2001), suggesting that base flipping may be a general mechanism for the second step of DNA replication.
How is this related to consensus sequences?
The consensus sequence for RepA sites can be determined by reading the top letters of the sequence logo (Fig. 5).
A second example is the TATAAT sites mentioned earlier, for which the sequence logo is shown in Fig. 6.
Just say no! We can express a consensus sequence in bits and so quantify the effect of making one. Each unique base (A, C, G or T) of a consensus counts as 2 bits. When two variations such as C or G are allowed (e.g. Fig. 3
Models and Illusions. One sometimes reads about how a particular DNA sequence has a consensus sequence at such-and-such a position (Robberson et al., 1990). Thus using consensus sequences has led these biologists into a philosophical trap: confounding the model of reality (the consensus sequence) with reality (the binding sites). Even the original title of our paper on sequence logos reflects our initial confusion on this issue (Schneider & Stephens, 1990).
Logos and walkers let us see more deeply into the genetic structure, revealing the details of sites and how mutations work. But no matter how sophisticated we are in depicting the patterns at binding sites, all we have are models. Logos and walkers are clearly better than consensus sequences (and can replace them completely), but they are still only representations of the universe ‘out there’ (Box, 1979). It is surprising, then, that scientists forget this and treat the consensus as reality. The effect was understood more than 30 years ago by Thomas Kuhn: once a paradigm is formed it occludes other ways of thinking and molds the way scientists perceive the world (Kuhn, 1970).
Yet a consensus can no more be ‘in’ a DNA sequence than the meaning of these words is on the page. These words are interpretations in your mind; the page only has some disconnected black squiggles. One way to see this is to consider the perennial myth of a face on Mars that appears in American tabloid magazines. Whether or not there is a face on Mars, most of us have seen faces in clouds. Are there really faces there? Evidently not. Experiments with sheep and monkeys have identified neurons that become excited when a face is presented in the visual field (Kendrick & Baldwin, 1987). So faces in clouds, words, and consensus sequences are all constructs in our brains.
Stranger still, the words may not be there when you perceive them since neural impulses take 300 milliseconds to travel from your eye to your brain (Rager & Singer, 1998), where, after another 80 milliseconds, they are finally perceived (Eagleman & Sejnowski, 2000). All that we see, hear, feel, smell, touch, and taste is delayed, so the entire perceived world is a model in our minds. Optical illusions remind us, and Zen masters understood, that everything is illusion (Purves et al., 2002).
Acknowledgments. I thank Ilya Lyakhov, Becky Chasan, Peter K. Rogan, Zehua Chen, Krishnamachari Annangarachari, and Jeff Haemer for comments on the manuscript.