Comparative context analysis of codon pairs on an ORFeome scale

Copyright © 2005 Moura et al.; licensee BioMed Central Ltd.

1 Centre for Cell Biology, Department of Biology, University of Aveiro, 3810-193 Aveiro, Portugal
2 Institute of Electronics and Telematics Engineering, University of Aveiro, 3810-193 Aveiro, Portugal
3 Department of Mathematics, University of Aveiro, 3810-193 Aveiro, Portugal

Gabriela Moura: gmoura/at/bio.ua.pt; Miguel Pinheiro: monsanto/at/ieeta.pt; Raquel Silva: rsilva/at/bio.ua.pt; Isabel Miranda: imiranda/at/bio.ua.pt; Vera Afreixo: vafreixo/at/mat.ua.pt; Gaspar Dias: gaspar/at/ieeta.pt; Adelaide Freitas: adelaide/at/mat.ua.pt; José L Oliveira: jlo/at/ieeta.pt; Manuel AS Santos: msantos/at/bio.ua.pt

Received September 24, 2004; Revised November 25, 2004; Accepted January 17, 2005.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Codon context is an important feature of gene primary structure that modulates mRNA decoding accuracy. We have developed an analytical software package and a graphical interface for comparative codon context analysis of all the open reading frames in a genome (the ORFeome). Using the complete ORFeome sequences of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Candida albicans and Escherichia coli, we show that this methodology permits large-scale codon context comparisons and provides new insight on the rules that govern the evolution of codon-pair context.

Background

The standard genetic code uses 64 codons for only 22 amino acids, including the amino acids selenocysteine and pyrrolysine whose incorporation into protein requires the reassignment of the UGA and UAG stop codons, respectively [1,2].
This degeneracy of the genetic code has important implications for gene primary structure evolution as it provides nature with a vast array of options for building open reading frame (ORF) sequences for any particular protein. However, the usage of synonymous codons for building ORFs is not random, suggesting the existence of mechanistic or evolutionary constraints that limit the degree of freedom for coding sequence building [3-6]. In other words, each organism uses a set of rules for building ORF sequences which restrict the total number of options provided by the degeneracy of the genetic code. These rules are only partly understood. Nevertheless, it is becoming increasingly clear that codon usage and context bias reflect the action of two main evolutionary forces: selection for mRNA decoding efficiency and mutational drift acting indiscriminately on coding and noncoding DNA [7-10]. Codon usage reflects selection for translational efficiency, as highly expressed genes tend to use codons that are decoded by abundant cognate tRNAs [11-13]. Similarly, the context of a sequential pair of codons (codon-pair) is biased, but this bias is apparently linked more to decoding accuracy than to translational speed [14-17]. This suggests that the translational machinery is sensitive to the nature of the codon-pair present in the ribosome A and P decoding sites [16,18-20], raising the possibility that, like codon usage, codon context may also be species specific. This is supported by the fact that tRNA populations diverge in the number and abundance of tRNA isoacceptors for each codon family and also in the pattern of modified nucleosides in the tRNAs, which also affects mRNA decoding accuracy. To shed new light on the overall pattern of codon context at the species level and evaluate how codon-pair context varies between species, we have developed software and statistical methodologies for codon-pair context analysis on all the ORFs in a genome as a whole (the ORFeome). 
Because our main interest is to evaluate the effect of codon context on mRNA decoding accuracy, this study focuses on the context of codon-pairs and not on long-range context effects. With a few exceptions, long-range context is not relevant for mRNA decoding by the ribosome. These new methodologies were tested using the complete ORFeome sequences of the eukaryotes Saccharomyces cerevisiae, Candida albicans and Schizosaccharomyces pombe and the bacterium Escherichia coli. The methodology developed provides robust and flexible tools for intra- and inter-ORFeome comparative codon-pair context analysis, permits identification of species-specific codon context fingerprints and provides new insight into the role of codon context on mRNA decoding accuracy and ultimately on the pressure imposed by the translational machinery on the evolution of the ORFeome. The software developed, called Anaconda, is available at [21].

Results

Global analysis of codon context in yeast

The Anaconda bioinformatics system developed in this study identifies the start codon of an ORF and reads it by moving a 'decoding window' three nucleotides at a time in the 3' direction until it encounters a stop codon. While doing so it fixes the middle codon of the reading window and memorizes its 5' and 3' neighbors. Anaconda creates a 64 × 64 table of codon frequencies that allows computation of the number of times each contiguous codon pair occurs in an ORF or in an ORFeome. The overall architecture of Anaconda is described in Figure 1.
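The window-based reading described above can be illustrated with a minimal Python sketch. It is not the Anaconda implementation: the function names `count_codon_pairs` and `pair_table` are hypothetical, the codon ordering of the table is arbitrary, and including the pair formed by the last sense codon and the stop codon is an assumption made here for illustration.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"UAA", "UAG", "UGA"}
# All 64 codons in a fixed (arbitrary) order; used to index the table.
CODONS = ["".join(c) for c in product("UCAG", repeat=3)]


def count_codon_pairs(orf):
    """Count contiguous codon pairs in one ORF, read in frame 5' -> 3'.

    Reading stops at the first in-frame stop codon. The pair
    (last sense codon, stop codon) is counted (assumption).
    """
    seq = orf.upper().replace("T", "U")  # accept DNA or RNA alphabets
    codons = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    # Each codon paired with its 3' neighbor.
    return Counter(zip(codons, codons[1:]))


def pair_table(orfs):
    """Build a 64 x 64 table: rows = 5' codon, columns = 3' codon."""
    total = Counter()
    for orf in orfs:
        total.update(count_codon_pairs(orf))
    return [[total[(a, b)] for b in CODONS] for a in CODONS]
```

Calling `pair_table` on a list of ORF sequences yields the kind of frequency table that the statistical analysis operates on; transposing it would give the 5' context view.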
The codon-pair context frequency table built by Anaconda allows the statistical analysis of contingency tables to be used to test whether the context is significantly biased [22-25]. These tables allow one to test for association between codon-pairs through the chi-square (χ²) test of independence, and to identify preferred and rejected pairs of codons in the ORFeome through the analysis of adjusted residuals for contingency tables (Table 1 and Figure 2).
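As a rough illustration of this test, the sketch below computes Pearson's chi-square statistic and adjusted residuals for a table of counts. It assumes the standard Haberman form: the expected count is (row total × column total) / N, the standardized residual is (observed − expected) / √expected, and the adjusted residual divides this by √[(1 − row total/N)(1 − column total/N)]. The function name `adjusted_residuals` is hypothetical and the code is not taken from Anaconda.

```python
import math


def adjusted_residuals(table):
    """Return (chi2, d) for an r x c table of observed counts.

    chi2 is Pearson's statistic; d[i][j] is the adjusted residual of
    cell (i, j). Cells in empty rows or columns are left at 0.0.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    d = [[0.0] * len(col_tot) for _ in row_tot]
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n
            if exp == 0:
                continue  # no information in this cell
            r = (obs - exp) / math.sqrt(exp)  # standardized residual
            chi2 += r * r
            # Estimated variance of r under independence.
            v = (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            d[i][j] = r / math.sqrt(v) if v > 0 else 0.0
    return chi2, d
```

A positive adjusted residual marks a codon pair observed more often than expected under independence (a preferred context), and a negative one marks a rejected context.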
Codon clustering unveils unique features of codon context

The codon-pair context maps shown in Figure 3a,b
To identify the codons responsible for defining the subgroups with high bias (red and green clusters) and evaluate whether these could define codon-pair context rules, one zooms in on the context subclusters. Three specific subclusters (one red and two green) were analyzed in this study (Figure 6a-c).
The above observations were confirmed by analyzing two green codon-pair context subclusters (good contexts). In these cases, two different clustering rules were identified, namely the XXC-AYY and the XXU-GYY (Figure 6b,c).

Comparative codon context analysis

Because the S. cerevisiae codon-pair context map produced a clear context pattern, we wondered whether this map could represent a species-specific fingerprint, as is the case for the codon-usage fingerprint. For this, maps for S. pombe, C. albicans and E. coli were also constructed, with the latter being used as an outgroup. Some similarities between the codon-pair context maps were immediately visible, namely a strong green diagonal line in the yeast maps (Figure 7).
An additional approach to identifying codon-pair context differences between S. cerevisiae, S. pombe and C. albicans was undertaken by overlapping the complete codon context maps displayed in Figure 7.
The DCMs also show that codon-pair context is more similar for the pair S. pombe-S. cerevisiae (data not shown) than for the other two yeast pairs, indicating that there are fewer differences between S. pombe and S. cerevisiae than between C. albicans and S. cerevisiae. This is surprising, considering that S. pombe diverged from S. cerevisiae 420 million years ago whereas C. albicans diverged from the latter only 170 million years ago [29]. The effect of the rather strong green diagonal (codon repeats) in the C. albicans maps is also visible in the DCMs (blue cells) of the C. albicans-S. cerevisiae pairs (Figure 8a).
Contribution of mutation bias to codon-pair context

An important feature of the codon-pair context map in the yeasts analyzed, but not in E. coli, is the presence of a diagonal green line (Figure 3).

The above observations prompted us to investigate whether mutational bias also played a part in codon-pair context bias and whether such bias could be extracted from the codon-pair context maps. For this, particular attention was given to GC content because it plays a major role in codon usage [31]. An algorithm was implemented in Anaconda for calculating %GC total, %GC at codon position 1 (GC1), %GC at codon position 2 (GC2) and %GC at codon position 3 (GC3). While scanning an ORFeome, Anaconda divides ORFs into GC-content subgroups and creates groups of ORFs with high and low GC content. It also determines the distribution of ORFs according to their GC total and GC3 (Figure 9a,c).
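The position-specific GC calculation described above can be sketched as follows. This is an illustration only: the function names `gc_profile` and `split_by_gc3` are hypothetical, and the single `threshold` used to form low and high GC3 groups is an assumption, since the text does not state how Anaconda defines its subgroups.

```python
def gc_profile(orf):
    """Return %GC overall and at codon positions 1, 2 and 3 of one ORF."""
    seq = orf.upper().replace("U", "T")
    seq = seq[:len(seq) - len(seq) % 3]  # keep complete codons only

    def pct(bases):
        if not bases:
            return 0.0
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    return {
        "GC": pct(seq),
        "GC1": pct(seq[0::3]),  # first codon position
        "GC2": pct(seq[1::3]),  # second codon position
        "GC3": pct(seq[2::3]),  # third codon position
    }


def split_by_gc3(orfs, threshold):
    """Split ORFs into (low, high) GC3 groups at a hypothetical threshold."""
    low = [o for o in orfs if gc_profile(o)["GC3"] < threshold]
    high = [o for o in orfs if gc_profile(o)["GC3"] >= threshold]
    return low, high
```

Codon-pair tables built separately for the two groups could then be compared to see which contexts shift with GC3.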
Because GC bias is better observed at the third codon position as a result of the degeneracy of the genetic code, GC3 was used to evaluate whether mutational bias contributed to the codon-pair context using the S. cerevisiae and E. coli ORFeomes as proof of principle. In the former, the ORF distribution varied from a minimum of 11.9% to a maximum of 76.7%; however, most ORFs fell within a narrow interval between 35% and 40% GC3 (Figure 9a). The differential display map for the low and high GC3 ORF subgroups of S. cerevisiae showed several differences, indicating that GC bias contributes to the codon-pair context. However, most of these differences corresponded to small deviations in the strength of the rejection or preference of the codon-pair contexts (Figure 9b).
Discussion

Codon context has been extensively studied in prokaryotic, eukaryotic, mitochondrial and viral genomes, and these studies unequivocally showed that codon-pair context is biased [9,10,32-35]. However, no tool has yet been developed to display codon context data and in particular codon-pair context (short-range context) in a way that would facilitate interpretation of the data and allow inter- or intra-genome context comparisons. This is essential if putative general rules that govern codon-pair context evolution are to be unraveled. The Anaconda bioinformatics system has been developed to address this problem. By using statistical methodologies based on contingency tables and residual analysis (see Materials and methods), specific codon-pair context patterns were unveiled and displayed using a color-coded ORFeome-context map. The data highlighted codon-pair context bias in yeasts and E. coli and some rules that define codon-pair context patterns in yeast.

Forces that shape codon-pair context

Studies carried out in the 1980s in E. coli demonstrated that codon-pair context influences mRNA decoding accuracy and efficiency, indicating that the translational machinery imposes significant constraints on codon-pair context [17,36,37]. For example, in starved E. coli cells, the asparagine AAU and AAC codons are misread as lysine at high frequency [16]. Quantification of the level of lysine misincorporation at those codons and determination of the effect of the 3' nucleotide context on lysine misincorporation showed that the AAU codon is misread up to nine times more frequently than the AAC codon, and that the 3' nucleotide context (III-I context) influenced the level of misreading by as much as twofold [16]. Additional studies carried out in vitro in E. coli have also shown that ribosomes discriminate C-ending Phe UUC and Leu CUC codons less well than the U-ending Phe UUU and Leu CUU, showing that synonymous codons differ in translational accuracy [38].
Therefore, a possible role for codon-pair context is minimization of decoding error, in particular in those codons that are poorly discriminated by the ribosome. In E. coli, over-represented codon-pairs are translated more slowly than under-represented codon-pairs, indicating that codon-pair context also influences translational speed [14]. This suggests that codon-pair context in E. coli is under strong selective constraints imposed by the translational machinery. Whether the context patterns now unveiled in yeast reflect similar selective constraints remains unclear. Nevertheless, the codon-pair context maps described here provide a good starting point to address this important biological question in vivo in yeast in a guided manner. Additional evidence for a role for selection on codon-pair context was highlighted by the negligible, or even zero, contribution of GC3 to the context bias in very frequent or very infrequent codon-pairs (strong contexts) in both S. cerevisiae and E. coli (Figure 9).

Despite these arguments, mutational bias does influence codon-pair context [7,39-41]. Observed mutational bias reflects mutational events that act indiscriminately on all DNA sequences (coding and noncoding DNA) and is consequently a property of the genome rather than the result of selection acting within ORFs [42-45]. The data presented here are in line with those observations. For example, context maps shown in this study indicate that several of the context clusters are formed on the basis of dinucleotide context rules (III-I rule), namely the XXU-AYY, XXC-AYY and XXU-GYY (Figure 6a-c).

Apart from those cases mentioned above, other species-specific genomic features also contribute to the codon-pair context bias highlighted by Anaconda.
For example, the yeast codon-pair context maps show a feature of eukaryotic genomes which is not related to mRNA translation: trinucleotide repeats, which are evident in the diagonal line present in Figure 3.

Finally, constraints imposed by protein sequences and mRNA secondary structure are also thought to influence codon context [48,49]. The context maps seem to exclude the former hypothesis because no cluster is formed as a result of selection or rejection of two adjacent amino acids. In regard to the latter constraint, the Anaconda algorithm was not designed to detect mRNA secondary structures and consequently this question cannot be addressed at this stage.

Conclusions

The Anaconda algorithm was developed with the aim of studying codon-pair context on an ORFeome scale, defining rules that govern codon-pair context, carrying out large-scale interspecies codon-pair context comparisons and clarifying the effect of selection and mutational drift on codon-pair context. The results provide important new insight on the role of codon-pair context on mRNA decoding accuracy and efficiency, and we expect that they will allow the development of reporter genes for in vivo and in vitro quantification of codon-decoding error and translational speed. Finally, Anaconda will be a valuable tool to redesign ORFs for efficient and accurate heterologous or homologous protein expression in yeast and, eventually, in other suitable host systems.

Materials and methods

Statistics

To study the association between contiguous codon-pairs, the coding sequences analyzed by Anaconda are processed in a 64 × 64 contingency table subdivided into mutually exclusive categories. If the 3' context is being analyzed, the rows of the table correspond to the codons in the P-site and the columns to the codons in the A-site of the ribosome. For the 5' context analysis the situation is inverted, and so the contingency table built is a transposed version of the one for the 3' analysis.
A number of different mathematical methodologies have already been used to study codon context bias (for example [9,50-52]). In this study, the analysis of contingency tables and residuals (Figure 3) was used.

For analysis of contingency tables and residuals [22-25], given an r × c contingency table where a multinomial distribution is assumed (Table 5), the hypothesis of independence between the variables A and B is tested using Pearson's statistic, given by:
$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} r_{ij}^2$$

where:

$$r_{ij} = \frac{n_{ij} - n_{i\cdot}\, n_{\cdot j} / N}{\sqrt{n_{i\cdot}\, n_{\cdot j} / N}}$$

with $n_{ij}$ the observed count in cell (i, j), $n_{i\cdot}$ and $n_{\cdot j}$ the totals of row i and column j, and N the grand total. It is known that Pearson's statistic has an asymptotic chi-square probability distribution with (r - 1)(c - 1) degrees of freedom. To identify cells in the table responsible for the eventual rejection of independence, the adjusted residuals $d_{ij}$ are calculated by:

$$d_{ij} = \frac{r_{ij}}{\sqrt{v_{ij}}}$$

where:

$$v_{ij} = \left(1 - \frac{n_{i\cdot}}{N}\right)\left(1 - \frac{n_{\cdot j}}{N}\right)$$

is the variance estimated for $r_{ij}$. Haberman [54] has shown that, under independence between A and B, the adjusted residuals $d_{ij}$ have a standardized normal probability distribution, and therefore $P(-3 < d_{ij} < 3) \approx 0.9973$ as N → +∞. This means that, for a 99.73% confidence level, the pair (Ai, Bj) is considered responsible for rejection of the hypothesis of independence if $|d_{ij}| \geq 3$. In practice, we consider that an adjusted residual is statistically significant if its absolute value is greater than 3.

Additionally, to find codon context patterns in the contingency table, lines and columns can be grouped using classifying methodologies such as cluster analysis. These patterns are determined by calculating similarities between two vectors of the contingency table using the centred Pearson correlation coefficient and applying single linkage. The single-linkage method produces groups with a 'chaining effect': that is, any element of a group is more 'similar' to an element of the same group than to any element of another group.

Software

The architecture of the Anaconda software is based on three main modules, namely data acquisition, processing and visualization (Figure 1). The acquisition and processing modules download raw data from genome databases, create a local database of usable ORFs and analyze the data using an algorithm that simulates the ribosome during mRNA decoding. Finally, a database containing the processed data is constructed. These data are then submitted to statistical analysis as described above.
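The clustering step named in the Statistics section (centred Pearson correlation combined with single linkage) can be sketched in plain Python as below. This is a naive illustration and not Anaconda's code: the function names are hypothetical, the distance is taken as 1 minus the correlation (an assumption), and stopping at a fixed number of clusters `n_clusters` is a simplification chosen for the example.

```python
import math


def pearson(x, y):
    """Centred Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0
    return sxy / math.sqrt(sxx * syy)


def single_linkage(vectors, n_clusters):
    """Agglomerative single-linkage clustering of rows of residuals.

    Distance between two vectors is 1 - Pearson correlation (assumption).
    Returns a list of clusters, each a sorted list of row indices.
    """
    n = len(vectors)
    dist = {(i, j): 1.0 - pearson(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance of the closest pair of members.
                d = min(dist[(min(i, j), max(i, j))]
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]
```

Applied to the rows (or columns) of a matrix of adjusted residuals, this groups codons whose context preferences are correlated. For a full 64 × 64 matrix a library routine would normally be preferable to this cubic-time loop.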
The visualization module allows the user to visualize the data matrices and gene sequences and to create filters that permit searching for specific sequence patterns defined by the user.

The data-acquisition module deals with genome input files, namely reading and interpreting FASTA sequences of complete or partial sets of ORFs from public or private genome databases. To ensure that the screened sequences have the best possible quality, and hence do not introduce background noise in the following analyses, several quality filters are applied to the reading process. When the filters are activated the data are classified according to the following criteria. Valid data consist of genes whose sequence length is a multiple of three, which start with an AUG codon and stop with a UAG, UAA or UGA codon, and which satisfy other user-defined requirements. Rejected data consist of genes whose sequence does not fulfill the above requirements. The result is the separation of valid from rejected ORFs. Other parameters needed by the application, such as reference relative synonymous codon usage (RSCU) values for codon adaptation index (CAI) calculation [55], are also uploaded by this module.

The processing module is the core of the application, where the codon context analysis is performed. After prescanning the files, the user can test the existence of significant bias in the codon context and use the residual values to further explore the matrices of residual values (see Statistics, above). The data generated are then converted into a contingency table that includes the corresponding observed values of Pearson's statistic, and the matrix of adjusted residuals [25].

After processing, the data become available to the visualization module. This module is the graphical interface. It follows the file manager paradigm in which information is presented in hierarchical views.
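The quality filters applied by the data-acquisition module can be approximated by the following sketch. It covers only the three criteria stated in the text (length a multiple of three, AUG start, UAG/UAA/UGA stop); the user-defined requirements are not modelled, and the names `parse_fasta`, `is_valid_orf` and `partition_orfs` are hypothetical rather than part of Anaconda.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the header text as the key; fall back to a counter.
            name = line[1:].strip() or "record_%d" % (len(records) + 1)
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def is_valid_orf(seq):
    """Apply the three stated criteria to one sequence (DNA or RNA)."""
    s = seq.upper().replace("T", "U")
    return (len(s) % 3 == 0
            and s.startswith("AUG")
            and s[-3:] in STOP_CODONS)


def partition_orfs(records):
    """Separate records into (valid, rejected) dictionaries."""
    valid, rejected = {}, {}
    for name, seq in records.items():
        (valid if is_valid_orf(seq) else rejected)[name] = seq
    return valid, rejected
```

Only the sequences in the `valid` dictionary would be passed on to codon-pair counting; keeping the rejected set makes it possible to report how many ORFs were excluded.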
This module offers a set of tools that enable several tasks to be carried out, namely to search for prespecified sequence patterns, to visualize data in histogram form, to cluster codon context data, and to export residual values. It is also possible to visualize other information at the gene level, such as rare codons and their distribution in the ORFs, to determine their ratio relative to the total number of codons, to determine the GC% at the first, second and third codon positions, and to determine the codon adaptation index (CAI) and the effective number of codons [55,56].

Acknowledgements

We thank FCT (Project: POCTI/BME/39030/2001), IEETA and the II-UA (CTS-12) for supporting the development of the Anaconda software. G.M. is funded by FCT grant SFRH/BPD/7195/2001 and M.P. by INFOGENMED (FP-V). M.S. is supported by an EMBO YIP Award.