Copyright © 2007, Cold Spring Harbor Laboratory Press

The genetic code is nearly optimal for allowing additional information within protein-coding sequences

1 Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel; 2 Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel

3 Corresponding author. E-mail uri.alon/at/weizmann.ac.il; fax 972-8-934125.

Received September 22, 2006; Accepted November 29, 2006.

Freely available online through the Genome Research Open Access option.

Abstract

DNA sequences that code for proteins need to convey, in addition to the protein-coding information, several different signals at the same time. These “parallel codes” include binding sequences for regulatory and structural proteins, signals for splicing, and RNA secondary structure. Here, we show that the universal genetic code can efficiently carry arbitrary parallel codes much better than the vast majority of other possible genetic codes. This property is related to the identity of the stop codons. We find that the ability to support parallel codes is strongly tied to another useful property of the genetic code: minimization of the effects of frame-shift translation errors. Whereas many of the known regulatory codes reside in nontranslated regions of the genome, the present findings suggest that protein-coding regions can readily carry abundant additional information.

The genetic code is the mapping of 64 three-letter codons to 20 amino acids and a stop signal (Woese 1965; Crick 1968; Knight et al. 2001). The genetic code has been shown to be nonrandom in at least two ways. First, the assignment of amino acids to codons appears to be optimal for minimizing the effect of translational misread errors. This optimality is achieved by mapping close codons (codons that differ by one letter) to either the same amino acids or to chemically related ones (Woese 1965).
This feature has been attributed to an adaptive selection of a code, so that errors that misread a codon by one letter would result in minimal effects on the translated protein (Freeland and Hurst 1998; Freeland et al. 2000; Gilis et al. 2001; Wagner 2005b). Second, amino acids with simple chemical structure tend to have more codons assigned to them (Hasegawa and Miyata 1980; Dufton 1997; Di Giulio 2005). There exist a large number of alternative genetic codes that are equivalent to the real code in these two prominent features (Fig. 1).
We consider the ability of the genetic code to support, in addition to the protein-coding sequence, additional information that can carry biologically meaningful signals. These signals can include binding sequences of regulatory proteins that bind within coding regions (Robison et al. 1998; Stormo 2000; Lieb et al. 2001; Kellis et al. 2003). Such binding sites are typically sequences of length 6–20 bp. In addition to regulatory proteins, there are binding sites of structural proteins such as DNA- and mRNA-binding proteins (Draper 1999). Histones, for example, bind with a code that has a periodicity of about 10 bp over a site of about 150 bp (Satchwell et al. 1986; Trifonov 1989; Segal et al. 2006). Other codes include splicing signals (Cartegni et al. 2002) that include specific 6–8 bp sequences within coding regions and mRNA secondary structure signals (Zuker and Stiegler 1981; Shpaer 1985; Konecny et al. 2000; Katz and Burge 2003). The latter often correspond to sequences of several dozen base pairs or longer.

Since we do not know all of these additional codes, and different organisms can use a vast array of different codes, we tested the ability of the genetic code to support arbitrary sequences of any length in parallel to the protein-coding sequence.

We find that the universal genetic code can allow arbitrary sequences of nucleotides within coding regions much better than the vast majority of other possible genetic codes. We further find that the ability to support parallel codes is strongly correlated with an additional property: minimization of the effects of frame-shift translation errors. Selection for either or both of these traits may have helped to shape the universal genetic code.

Results

Ability to include additional sequences

We first considered the ability of the genetic code to support, in addition to the protein-coding sequence, additional sequences that can carry biological signals.
For this purpose, we studied the properties of all alternative genetic codes that share the known optimality features of the real code (Fig. 1). We tested the ability of the genetic codes to include arbitrary sequences, denoted n-mers, within protein-coding regions. As an example, consider the 5-mer “UGACA.” This sequence may be a protein-binding site that should appear within a protein-coding region. This 5-mer sequence can appear within a coding sequence in one of the three reading frames: UGA|CAN, NNU|GAC|ANN, or NUG|ACA, where N denotes any nucleotide and the vertical lines separate consecutive codons. To assess the probability that this 5-mer appears in a coding region, one needs to sum over the three possible reading frames (Fig. 2A).
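The reading-frame argument can be made concrete with a small sketch. The function name `blocked_frames` and the offset convention are illustrative assumptions, not part of the original analysis; the sketch only checks the complete codons that fall inside the n-mer and ignores the partial codons at its edges.

```python
# Stop codons of the standard (universal) genetic code.
STOPS = {"UAA", "UAG", "UGA"}

def blocked_frames(nmer, stops=STOPS):
    """Return the frame offsets (0, 1, 2) in which `nmer` holds an in-frame stop.

    Offset k (a convention assumed here) means that k unspecified
    nucleotides precede the n-mer inside its first codon, so the first
    complete codon inside the n-mer starts at index (3 - k) % 3.
    """
    blocked = set()
    for offset in range(3):
        first = (3 - offset) % 3
        # Complete codons that lie entirely inside the n-mer in this frame.
        codons = [nmer[i:i + 3] for i in range(first, len(nmer) - 2, 3)]
        if any(codon in stops for codon in codons):
            blocked.add(offset)
    return blocked

if __name__ == "__main__":
    # UGA|CAN is blocked; NUG|ACA and NNU|GAC|ANN are not.
    print(blocked_frames("UGACA"))
    # Alternative stop set quoted in the text for the 5-mer CCGGU.
    print(blocked_frames("CCGGU", stops={"CCA", "CCG", "CGG"}))
```

With the standard stop codons, only the frame UGA|CAN is excluded for UGACA, which is why the remaining two frames carry all of its probability.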
Each genetic code has n-mer sequences, such as the above-mentioned sequence UGACA in the real genetic code, which are difficult to include in coding regions: these “difficult” sequences contain stop codons, and thus cannot appear in at least one of the three frames, since protein-coding regions do not contain stop codons. We find that the real genetic code is able to include even the most difficult n-mers because it has a special property: its stop codons, when frame shifted, tend to form abundant codons. Hence, n-mers that cannot be included in one reading frame can be included with high probability in the other reading frames. To understand the relation between the stop codons and the ability of the genetic code to include arbitrary n-mers, consider the 5-mer S = AAAAA (Fig. 2C).
Another example is the 5-mer S = CCGGU. In an alternative code with stop codons CCA, CCG, and CGG, this n-mer can only appear in one of the three reading frames (Fig. 2D).
We calculated the probability of including all n-mer sequences for each alternative genetic code by summing up, for every n-mer sequence, the probabilities of all codon combinations that contain it (Fig. 2A). We find that the real code shows significantly higher probabilities to include arbitrary sequences. The average of the logarithm of all n-mer probabilities is significantly higher in the real code than in the vast majority of alternative codes (Table 2), with a P-value < 0.05 for n-mer sequences with n greater than seven. In addition, the real code shows significantly higher probabilities to include the most difficult sequences (n-mers with the lowest probability of appearing in a coding region) than the vast majority of alternative codes (Fig. 2E).
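A minimal sketch of this summation is given here, under stated assumptions. The amino acid frequencies are a uniform placeholder (the article uses proteome-derived frequencies that are not reproduced in this text), the helper names `codon_probs` and `inclusion_prob` are hypothetical, and consecutive codons are treated as independent. The three frames are simply summed, as the text describes, so the values are not normalized across frames.

```python
from itertools import product
from math import prod

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_probs(code, aa_freq=None):
    """Codon probability = amino acid frequency / number of its codons."""
    sense = {c: a for c, a in code.items() if a != "*"}
    aas = sorted(set(sense.values()))
    if aa_freq is None:
        # Placeholder assumption: every amino acid is equally frequent.
        aa_freq = {a: 1 / len(aas) for a in aas}
    n_codons = {a: sum(1 for x in sense.values() if x == a) for a in aas}
    return {c: aa_freq[a] / n_codons[a] for c, a in sense.items()}

def matches(slot, codon):
    """True if `codon` fits a 3-letter pattern in which 'N' is a wildcard."""
    return all(s == "N" or s == b for s, b in zip(slot, codon))

def inclusion_prob(nmer, probs):
    """Sum, over the three reading frames, the probability of the n-mer."""
    total = 0.0
    for offset in range(3):
        # Pad with wildcards so the n-mer sits on whole codons.
        padded = "N" * offset + nmer
        padded += "N" * (-len(padded) % 3)
        slots = [padded[i:i + 3] for i in range(0, len(padded), 3)]
        # Independent codons: multiply the per-codon matching probabilities.
        total += prod(
            sum(p for c, p in probs.items() if matches(slot, c))
            for slot in slots
        )
    return total

if __name__ == "__main__":
    probs = codon_probs(CODE)
    print(inclusion_prob("UGACA", probs))
    print(inclusion_prob("AAAAA", probs))
```

Passing a different `code` dictionary to `codon_probs` gives the corresponding quantity for an alternative code, and a measured `aa_freq` dictionary can replace the uniform placeholder.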
The optimality of the real genetic code relative to alternative codes seems to increase with the length of the n-mers (Fig. 2F).

Robustness to translational frame-shift errors

How did such near optimality for parallel codes evolve? One possibility is that the ability to include parallel codes within protein-coding sequences conferred a selection advantage during the early evolution of the genetic code. Alternatively, the genetic code might have been fixed in evolution before most parallel codes existed. We therefore sought a different selection pressure on the code, which could have existed in the early stages of the evolution of the genetic code. One such inherent feature of protein translation is frame-shift translation errors (Parker 1989; Farabaugh and Bjork 1999; Seligmann and Pollock 2004). In these errors, the ribosome shifts the reading frame, either forward or backward. This results in a nonsense translated peptide, and usually loss of protein function. These errors occur in ribosomes nearly as frequently as misread errors (3 × 10⁻⁵ per codon, compared with misread errors of 10⁻⁴ per codon [Parker 1989]). These errors have a relatively large effect on fitness because they result in a nonsense polypeptide. Frame-shift errors may thus pose a selectable constraint on the genetic code: Codes that are able to abort translation more rapidly following frame-shift errors have an advantage (Seligmann and Pollock 2004). To abort translation after a frame shift, the ribosome must encounter a stop codon in the shifted frame. It has been suggested that codon usage in some organisms may be biased toward codons that can form stop codons upon translational frame shift (Seligmann and Pollock 2004). Here, we consider whether robustness to translational frame-shift errors may be linked to the structure of the genetic code. We tested all alternative codes for the mean probability of encountering a stop in a frame-shifted protein-coding message.
We find that the real genetic code encounters a stop more rapidly on average than 99.3% of the alternative codes (Fig. 3). Interestingly, the ability to abort translation after frame shift is closely related to the ability to include arbitrary parallel codes (Fig. 4).
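The frame-shifted stop probability can be sketched along the lines given in the Methods: sum the probabilities of all sense di-codons that show a stop codon at positions 2–4 (+1 shift) or 3–5 (−1 shift). The function names are hypothetical, the amino acid frequencies are a uniform placeholder rather than the proteome-derived values used in the article, and the conversion to an expected number of codons assumes independent codons (a geometric waiting time), which is a simplification.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_probs(code):
    """Uniform amino acid frequencies (placeholder) and uniform codon usage."""
    sense = {c: a for c, a in code.items() if a != "*"}
    aas = set(sense.values())
    n_codons = {a: sum(1 for x in sense.values() if x == a) for a in aas}
    return {c: 1 / (len(aas) * n_codons[a]) for c, a in sense.items()}

def frameshift_stop_prob(code, probs, shift):
    """Probability that a di-codon shows a stop in the shifted frame.

    shift=+1 reads positions 2-4 of the di-codon (1-based);
    shift=-1 reads positions 3-5, following the description in the Methods.
    """
    stops = {c for c, a in code.items() if a == "*"}
    start = 1 if shift == +1 else 2  # 0-based start of the shifted codon
    return sum(
        p1 * p2
        for (c1, p1), (c2, p2) in product(probs.items(), repeat=2)
        if (c1 + c2)[start:start + 3] in stops
    )

def expected_codons_to_stop(p_stop):
    """Mean waiting time, assuming independent codons (geometric model)."""
    return 1 / p_stop

if __name__ == "__main__":
    probs = codon_probs(CODE)
    for shift in (+1, -1):
        p = frameshift_stop_prob(CODE, probs, shift)
        print(shift, p, expected_codons_to_stop(p))
```

Running the same function on each alternative code and ranking the real code among them would correspond to the comparison reported above; the numbers printed here depend on the placeholder frequencies and are not expected to match the article.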
The present optimality features are shared also by almost all of the nonuniversal codes such as those found in mitochondria (Osawa et al. 1992; Knight et al. 2001) (see Supplemental material). For example, the fraction of alternative genetic codes with higher probabilities for encountering frame-shifted stop codons is lower than 0.05 for all nonuniversal codes except for the flatworm mitochondrial code (see Supplemental Table 3). This optimality is also found for a range of different codon usages (Muto and Osawa 1987), specifically those that represent GC content of <70% (see Supplemental material). This range of GC contents is also the range that supports the optimality of previously known features such as robustness to misread errors (Archetti 2004).

Discussion

In summary, we found that the genetic code is nearly optimal for encoding additional information in parallel to its main function of encoding the amino acid sequence of proteins. This optimality is related to the identity of the stop codons in the universal code: when frame shifted, the stop codons overlap with codons of abundant amino acids. We showed that this optimality is strongly tied to a second useful property: minimization of the effect of translational frame-shift errors. Robustness to frame-shift errors may be a reasonable inherent constraint on the early genetic code. One may therefore propose that the ability to carry parallel codes may have emerged as a side effect that was later exploited to allow genes and mRNA molecules to support a wide range of signals to regulate and modify biological processes in cells (Kirschner et al. 2005). Alternatively, the ability to include arbitrary parallel sequences within coding regions may have contributed to the selection of the early genetic code.
For example, early RNA molecules that had the ability to both specify peptides and to include sequences that conferred useful RNA structure may have had an advantage over RNAs that were less effective at simultaneously fulfilling both objectives. Whereas many of the currently known regulatory codes reside in nontranslated regions of the genome (Robison et al. 1998; Lieb et al. 2001), the present findings support the view that protein-coding regions can carry abundant parallel codes. It would be interesting to use information-theoretical approaches (Gusev et al. 1999; Wan and Wootton 2000; Troyanskaya et al. 2002) to search for such codes in genomes.

Methods

Alternative genetic codes

The alternative genetic codes were obtained by independently permuting the nucleotides in the three codon positions while preserving the amino acid assignment (Fig. 1).

Inclusion of arbitrary sequences within protein-coding sequences

We calculated the probability of encountering every n-mer in a coding sequence for each alternative code for n = 4–25. This was done by scanning all codon combinations in all three possible frame shifts, which can include the n-mer sequence, and summing the probabilities of the codon combinations (Fig. 2A). Each codon $c_i$ was assigned the probability $P(c_i) = p(a(c_i))/N(a(c_i))$, where $p(a(c_i))$ is the average frequency within protein-coding sequences of the amino acid assigned in the real code to the codon $c_i$, taken as an average over the amino acid probabilities in the proteome of 134 organisms (Pe’er et al. 2004), and $N(a(c_i))$ is the number of codons assigned to that amino acid. The same results were found when using estimated amino acid frequencies for early genetic code development (Brooks et al. 2004), as well as when amino acid frequencies were varied around their mean frequency with a standard deviation of up to 70% of the mean.
Adding correlations between consecutive codons does not change the present results (Supplemental material). For each code we calculated the average logarithm of the probabilities of all n-mer sequences. To avoid singularities, a small number ε was added to all probability values before taking the logarithm (the results do not depend on ε). The P-value is the fraction of alternative codes for which the average logarithm of all n-mer probabilities is higher than in the real code (Table 2). Note that the average logarithm measure is appropriate to situations in which many n-mers need to be independently encoded, so that the product of their probabilities is the biologically significant parameter (e.g., distinct sequences within an RNA that affect stability typically have an approximately multiplicative effect on the total stability of the RNA [Zuker and Stiegler 1981]). In addition to an average of the logarithm of all n-mer probabilities, for each alternative genetic code we calculated the arithmetic average probability of obtaining the fraction x of n-mers, sorted from the most difficult to the easiest (lowest to highest probability). For every x, we assigned a P-value to the real code, which is the fraction of alternative codes for which the average probability of the x most difficult n-mers is equal to or higher than in the real code (Supplemental Fig. 4). Table 2 shows the P-value for the average probability of obtaining the x most difficult n-mers for different n-mer sizes, with x = 20%. The values of x for which small P-values are found increase with the size of the n-mers under consideration (see below). The FDR method was used to determine the range of difficult n-mers for which the average probability in the real code is significantly higher than in alternative codes, with a threshold that corresponds to a false discovery rate of 15% (Supplemental Fig. 4). For n > 8, the calculations were based on 10⁵ randomly sampled n-mers.
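The construction of alternative codes and the two summary statistics described above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the value of ε is arbitrary, and the permutation scheme takes the Methods description literally (any of the 4! nucleotide permutations at each of the three codon positions, which gives 24³ = 13,824 codes including the real one). The article may restrict this set further; the figure that details the construction is not reproduced in this text.

```python
from itertools import permutations, product
from math import log

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def alternative_codes(code=CODE):
    """Yield codes made by permuting nucleotides at each codon position.

    Each amino acid keeps its number of codons; only the nucleotide
    labels at positions 1, 2 and 3 are relabeled independently.
    """
    maps = [dict(zip(BASES, perm)) for perm in permutations(BASES)]
    for m1, m2, m3 in product(maps, repeat=3):
        yield {m1[c[0]] + m2[c[1]] + m3[c[2]]: aa for c, aa in code.items()}

def mean_log_prob(probs, eps=1e-12):
    """Average of log(p + eps); eps (an assumed value) avoids log(0)."""
    probs = list(probs)
    return sum(log(p + eps) for p in probs) / len(probs)

def p_value(real_score, alt_scores):
    """Fraction of alternative codes scoring higher than the real code."""
    alt_scores = list(alt_scores)
    return sum(s > real_score for s in alt_scores) / len(alt_scores)

if __name__ == "__main__":
    n_codes = sum(1 for _ in alternative_codes())
    print(n_codes)
    # Toy scores only, to show how the P-value is formed.
    print(p_value(0.5, [0.1, 0.6, 0.7, 0.2]))
```

In a full analysis, `mean_log_prob` would be applied to the n-mer probabilities of each code in turn, and `p_value` to the resulting list of scores.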
We find that in the real code, all sequences with n ≤ 10 can appear within protein-coding sequences, a feature shared by 37% of the alternative codes. For n > 10, some sequences cannot appear, since they contain nonoverlapping stop codons in each of the three reading frames (such as the 11-mer UAANUAANUAA).

The probability of encountering a frame-shifted stop

For each alternative code, we calculated the probability of encountering any one of the three stop codons following a frame-shift event. For this we examined all of the possible 61 × 61 di-codon combinations. Following a +1 translational frame shift, a stop codon can be encountered at positions 2–4 of a di-codon. Following a −1 translational frame shift, a stop codon can be encountered at positions 3–5 of a di-codon. The overall +1/−1 frame-shift stop probabilities were obtained for each code by summing the probabilities of all di-codons containing a stop signal at the appropriate position. Codon probabilities were calculated from average amino acid probabilities encountered in proteins of sequenced genomes (Pe’er et al. 2004) and uniform codon usage (Table 1). The present results also apply for a wide range of codon usages (Supplemental material).

Selection pressure of frame-shift errors

A translational frame-shift event is estimated to occur at a probability of about 1/30,000 codons (Parker 1989; Farabaugh and Bjork 1999). On average, an alternative genetic code encounters a stop signal 23 codons after a frame-shift event (Fig. 3A).

Acknowledgments

We thank James Shapiro for suggesting this problem, Orna Man for the amino acid probabilities data, and Tsvi Tlusti, Eran Segal, Liran Shlush, and all members of our lab for useful comments. We thank Minerva, HFSP and the Kahn Family Foundation for support. S.I. acknowledges support from the Horowitz Complexity Science Foundation.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5987307