NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Riddle DL, Blumenthal T, Meyer BJ, et al., editors. C. elegans II. 2nd edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 1997.


Appendix 3 Codon Usage in C. elegans


Synonymous codon usage in C. elegans was investigated by Stenico et al. (1994). The major conclusion of this analysis was that synonymous codon usage patterns vary among genes in a manner correlated with their expression level. Some genes have extremely biased codon usage: these genes appear to be expressed at higher levels, and it was inferred that natural selection has favored a limited number of translationally optimal codons. Other genes (apparently those expressed at low levels) have relatively unbiased codon usage, although there was some nonrandomness consistent with context-dependent mutational biases. These results echo those found in a number of unicellular eukaryotes and in Drosophila melanogaster (see Sharp et al. 1995). Here the analyses of codon usage in C. elegans have been updated using similar techniques, but with a much larger dataset.

All nuclear protein-coding sequences annotated as deriving from C. elegans were extracted from the GenBank/EMBL/DDBJ DNA sequence data library (GenBank release 92), using the ACNUC retrieval system (Gouy et al. 1985). Duplicate sequences, partial sequences, and sequences containing ambiguous codons or multiple stop codons were excluded, yielding a total dataset of 4027 open reading frames (ORFs). Although some of these sequences were determined by the “traditional” approach—i.e., the genes were identified and sequenced because of some known function or phenotype—many others were found within cosmids sequenced as part of the genome project and many of the genes thus identified remain putative. Therefore, gene sequences were first designated as (1) “genes” if the sequence was determined by the traditional approach, (2) “probable genes” if the cosmid-contained sequence exhibited significant similarity to a sequence from another species, or (3) “unidentified reading frames (URFs)” if identified only as an open reading frame within a cosmid sequence.

Overall codon usage in 312 genes (not shown) was found to be very similar to that in the 168 genes previously examined (Stenico et al. 1994). Likewise, codon usage in 2238 URFs was not very different from that in the 90 URFs previously examined (Stenico et al. 1994). As before, URF codon usage differed from that in genes in being generally much less biased. Codon usage in 1477 probable genes showed a pattern of bias intermediate between that in genes and that in URFs. These observations suggest that the dataset of genes identified by the traditional approach contains a disproportionate number of relatively highly expressed genes, whereas the majority of the URFs are expressed at low levels (and perhaps some are not in fact genes). Further analyses were restricted to the combined dataset of genes and probable genes (totaling 1789 sequences).

Codon usage in these genes was subjected to correspondence analysis, a statistical technique for characterizing major trends in multivariate data (see Stenico et al. 1994). As previously found, a single major trend among genes (“axis 1”) was identified. The nature of the trend in codon usage along axis 1 is shown in Table 1, which contains codon usage values for three subsets of this dataset: the 10% of genes from each extreme of the axis (one extreme exhibiting high bias and the other low bias) and the 10% of genes lying at the center of the axis. Each subset shows codon usage summed over 178 genes; the total numbers of codons are 73164 (High), 109120 (Middle), and 80249 (Low). The data are presented as raw codon usage values (N) and relative synonymous codon usage values (RSCU). RSCU is calculated as the observed value (N) divided by the value expected if all synonymous codons for an amino acid were used equally.
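The RSCU definition above can be illustrated with a minimal Python sketch. The function name `rscu`, the choice to skip stop codons, and the example counts are assumptions made for illustration; the counts are invented and are not taken from Table 1.

```python
from collections import defaultdict

# Standard genetic code, codons enumerated in UCAG order ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))


def rscu(counts):
    """Return RSCU per codon: observed N divided by the N expected
    if all synonymous codons for that amino acid were used equally."""
    # Group sense codons into synonymous families (stop codons are skipped here,
    # which is a simplifying assumption of this sketch).
    families = defaultdict(list)
    for codon, amino_acid in GENETIC_CODE.items():
        if amino_acid != "*":
            families[amino_acid].append(codon)

    values = {}
    for synonyms in families.values():
        total = sum(counts.get(codon, 0) for codon in synonyms)
        if total == 0:
            continue  # amino acid absent from the data: RSCU is undefined
        expected = total / len(synonyms)
        for codon in synonyms:
            values[codon] = counts.get(codon, 0) / expected
    return values


# Hypothetical counts for the two Phe codons (not values from Table 1).
example = rscu({"UUU": 30, "UUC": 10})
print(example["UUU"], example["UUC"])  # 1.5 0.5
```

With this definition, the RSCU values within a synonymous family sum to the number of codons in that family, so a value of 1.0 indicates no bias for that codon.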

Comparison of the RSCU values for the High (bias) and Low (bias) datasets illustrates the considerable variation in codon usage among C. elegans genes. For example, for Phe, UUC seems to be heavily favored over UUU in some genes (the High subset), whereas the opposite is true in others (the Low subset). The Middle subset of genes exhibits codon usage patterns intermediate between those of the High and Low subsets. These patterns are consistent with codon usage in any particular gene reflecting a balance between the population genetic processes of mutation, selection, and random genetic drift, with the point of balance depending on the strength of selection on that gene. Thus, for Phe, UUC seems to be the translationally optimal codon in a wide range of species, and its high frequency in the High subset is inferred to be due to strong natural selection. In contrast, in the Low subset, U-ending codons occur at high frequency (and C-ending codons at low frequency) for all four amino acids with U at the second codon position, consistent with neighboring nucleotide-dependent mutation biases. Codon usage in the Middle subset of genes is similar to the total codon usage of the 1789 genes taken as a whole. Thus, it is a guide to the “typical” codon usage of C. elegans, but the heterogeneity among genes reflected in the contrast between the High and Low subsets should always be borne in mind.

Twenty-two codons that occur at significantly (p < 0.01) higher frequencies in the High subset than in the Low subset (indicated by an asterisk in Table 1) are inferred to be those that are translationally optimal. These include 21 codons identified by Stenico et al. (1994), plus UCU (for Ser), which was only marginally significant (0.01 < p < 0.05) in the earlier analysis. Another Ser codon, AGC, has not been included in this set of optimal codons because its frequency in the High subset, although significantly higher than in the Low subset, is still lower than that of four other Ser codons. The degree of codon usage bias in any gene can then be quantified by Fop, the frequency of these optimal codons as a fraction of the total usage of codons for the 18 amino acids with more than one codon. These Fop values range between 0.17 and 0.89 and represent a succinct summary of the major trend in codon usage bias revealed by the correspondence analysis: Fop and position on axis 1 are extremely highly correlated (correlation coefficient 0.97).
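As a rough illustration of the Fop definition, the following Python sketch counts optimal codons as a fraction of all codons for the 18 amino acids with more than one codon. The function name `fop` and the example sequence are hypothetical, and `EXAMPLE_OPTIMAL` contains only the two optimal codons named in the text (UUC and UCU); it is deliberately incomplete and stands in for the full set of 22 marked in Table 1.

```python
from collections import Counter

# Standard genetic code, codons enumerated in UCAG order ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
GENETIC_CODE = dict(zip(CODONS, AMINO_ACIDS))

# Codons for amino acids with more than one codon (excludes Met, Trp, and stops).
FAMILY_SIZE = Counter(GENETIC_CODE.values())
DEGENERATE_CODONS = {
    codon
    for codon, amino_acid in GENETIC_CODE.items()
    if amino_acid != "*" and FAMILY_SIZE[amino_acid] > 1
}

# Assumption: an incomplete, illustrative set. Only UUC and UCU are named in
# the text; the remaining optimal codons would have to come from Table 1.
EXAMPLE_OPTIMAL = {"UUC", "UCU"}


def fop(sequence, optimal_codons):
    """Frequency of optimal codons among codons for the 18 degenerate amino acids."""
    seq = sequence.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counted = [codon for codon in codons if codon in DEGENERATE_CODONS]
    if not counted:
        raise ValueError("sequence contains no codons for degenerate amino acids")
    return sum(codon in optimal_codons for codon in counted) / len(counted)


# Hypothetical ORF: AUG UUC UUU UGG UCU UAA.
# AUG (Met), UGG (Trp), and UAA (stop) are excluded from the denominator.
print(fop("ATGTTCTTTTGGTCTTAA", EXAMPLE_OPTIMAL))  # 2 of 3 counted codons are optimal
```

Because the example set is incomplete, values computed with it are not comparable to the Fop range reported above; the sketch only shows how the numerator and denominator are formed.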

Strong bias in favor of translationally optimal codons has generally been interpreted as reflecting selection for efficiency of translation. Under this interpretation, the genes with the highest bias are expected to be those with the highest expression levels. This appears to be the case in C. elegans. The genes with the highest Fop values (lying at the high bias extreme of axis 1) include those encoding ribosomal proteins, translation elongation factors, actins, and histones, all of which are expressed at very high levels. The genes with the lowest Fop values (lying at the low bias extreme of axis 1) include those encoding regulatory proteins, which are generally expected to be expressed at very low levels. Furthermore, within gene families whose members are expressed at different levels, those expressed at higher levels have higher Fop values (Stenico et al. 1994). An alternative hypothesis, recently proposed for Drosophila genes, is that optimal codons are selected primarily for accuracy of translation (Akashi 1994). Since many abundant proteins (such as those listed above) are also highly conserved proteins, it can be difficult to disentangle the potential effects of selection for efficiency and/or accuracy; we are currently investigating this problem.

Table 1. Codon usage in C. elegans.


Copyright © 1997, Cold Spring Harbor Laboratory Press.
Bookshelf ID: NBK20194
