NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Riddle DL, Blumenthal T, Meyer BJ, et al., editors. C. elegans II. 2nd edition. Cold Spring Harbor (NY): Cold Spring Harbor Laboratory Press; 1997.

Section VII. Tentative Conclusions

Sufficient data are now available from genomic sequencing and other sources to provide some insight into the arrangement of sequence in the genome as a whole, although unfortunately at this time little sequence is available from the autosome arms. This section deals with some of the major conclusions so far.

A. Gene Number and Distribution

The reasonable reliability of the GENEFINDER predictions combined with the end sequencing of cDNAs has allowed an estimate to be made of the total protein-coding gene number. The ratio of predicted genes to cDNA sequences exactly matching them should be equal to the ratio of the total number of genes to total cDNAs; current data predict a total in the region of 14,000 genes. This number is much higher than expected, based on the estimations of 2000 to 4000 essential genes through genetic studies (Brenner 1974; Meneely and Herman 1979; Sigurdson et al. 1984). Presumably, this discrepancy means that a large number of C. elegans genes are dispensable, either because of redundancy or because under conditions tested, the loss of gene function does not result in a detectable mutant phenotype. A similar discrepancy between gene number revealed by sequence and essential gene number determined by genetics has been found for yeast, and it seems likely to be a general feature of eukaryotic genomes.
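The proportionality argument above can be written out as a short calculation. The following is a minimal sketch only: the function name and the sample counts are hypothetical (they are not the project's actual tallies), and "total cDNAs" is assumed to mean the number of distinct genes represented in the cDNA collection.

```python
def estimate_total_genes(predicted_genes, predicted_with_cdna_match, total_cdna_genes):
    """Extrapolate total gene number from a sequenced sample.

    Assumes: predicted_genes / predicted_with_cdna_match
             == total_genes / total_cdna_genes
    i.e. cDNA-tagged genes are spread evenly over sequenced and
    unsequenced parts of the genome.
    """
    if predicted_with_cdna_match <= 0:
        raise ValueError("need at least one predicted gene with a cDNA match")
    return predicted_genes * total_cdna_genes / predicted_with_cdna_match


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_total = estimate_total_genes(
    predicted_genes=1000,           # genes predicted in the sequenced sample
    predicted_with_cdna_match=350,  # of those, genes matching a cDNA tag
    total_cdna_genes=4900,          # distinct genes in the whole cDNA set
)
```

The estimate is only as good as the even-sampling assumption: if cDNA-matched genes are rarer in the unsequenced regions, the same calculation applied to the biased sample will misestimate the total.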

The accuracy of this gene number prediction is limited by several factors beyond random sampling errors. For example, the signals defining the ends of genes are at present poorly understood, leading at times to the probable fusion of genes in operons or in other sites where genes are closely spaced. In other areas where exons are more distantly spaced, a single gene may be split in two. These effects tend to balance each other, but if overprediction predominates, an exaggerated estimate of gene number will result. The calculation of total gene number also assumes that gene expression levels are not biased by location, an assumption that seems likely to be true for regions of the genome with similar organization, i.e., the central gene-rich clusters. It seems increasingly unlikely, however, that this assumption holds genome-wide.

These assumptions have come under increasing scrutiny as sequence has become available outside the autosomal gene clusters. At least two differences are apparent in preliminary analyses of available data. The density of predicted genes on the X chromosome in the regions sequenced to date is only 1 per 6.2 kb, with 20% of the total sequence coding, as compared to 1 per 4.6 kb, with 31% coding in the clusters of chromosomes II and III. The average predicted message length is about 13% smaller on the X than on the autosomes, whereas introns and intergenic regions are larger (40% and 60% greater, respectively). Another difference is that on the X chromosome, the fraction of genes that match a cDNA tag is lower than for the autosomal clusters (23% vs. 35%). Together, the difference in predicted gene density and the smaller fraction matching an EST approximately account for the difference in gene density observed with EST hybridization data for chromosome II and III clusters versus the X overall (Barnes et al. 1995).
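As a rough consistency check on these figures, the coding sequence per predicted gene can be approximated as the gene spacing multiplied by the coding fraction. The sketch below is illustrative arithmetic only; the function name is hypothetical, and treating coding sequence per gene as a proxy for message length is an assumption (it ignores untranslated regions).

```python
def coding_kb_per_gene(kb_per_gene, coding_fraction):
    """Approximate coding sequence per gene (kb) from gene spacing and coding fraction."""
    return kb_per_gene * coding_fraction


# Values quoted in the text above.
x_kb = coding_kb_per_gene(6.2, 0.20)         # X chromosome
autosome_kb = coding_kb_per_gene(4.6, 0.31)  # clusters of chromosomes II and III

# Fractional shortfall of the X relative to the autosomal clusters.
reduction = 1 - x_kb / autosome_kb
print(f"X: {x_kb:.2f} kb, autosomes: {autosome_kb:.2f} kb, reduction: {reduction:.1%}")
```

Under that assumption the X value comes out roughly 13% below the autosomal value, in line with the difference in predicted message length quoted above.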

Part of the difference between the autosomes and the X in the fraction of predicted genes matching a cDNA tag could result from overprediction of genes on the X. The lower coding density and larger introns could contribute to an artificial division of some genes with a correspondingly higher false-positive prediction rate on the X, as suggested by the shorter predicted message length on the X.

The remaining difference may reflect a lower average level of gene expression on the X. Dosage compensation, in which gene expression on the X is reduced in hermaphrodites (Villeneuve and Meyer 1990b), provides one plausible explanation as to why this might be so. Alternatively, the postulated difference in gene expression might reflect some other feature of chromosome structure or organization, such as those reflected in the distribution and density of certain repeated sequences and in the frequencies of recombination per unit of physical distance. Sequence from the autosome arms will provide additional comparisons between these parameters.

Whatever the explanation for the lower density of genes on the X and their lower match rate with ESTs, the practical effect at the moment is that predicted gene number has been rising as more sequence from the X is incorporated into the total.

Coding density through the gene clusters is fairly uniform at the sequence level and does not show the local fluctuations in gene density observed in the analysis of the EST map data (Barnes et al. 1995). However, those fluctuations probably represent simply a combination of statistical and experimental artifacts, resulting from the relatively few cDNAs positioned and the uncertainties inherent in positioning YACs by hybridization data. A reduced coding density is evident in a small contig near the end of III (20% exonic in this 300-kb region as compared to 32% for III overall) and at the left end of the sequenced region on II, as would be expected from the cDNA hybridization data (Waterston et al. 1992; Barnes et al. 1995). More data will be required for a more detailed comparison, but perhaps differences in gene expression levels will be a factor on the autosomal arms and on the X.

B. tRNA Genes

More than 240 tRNA genes have been predicted in the first 27 Mb sequenced in the genome project, for an average of about ten genes per megabase (S. Eddy and T. Lowe, pers. comm.). No data comparable to the EST matches exist to predict how many tRNA genes there will be in the total genome. However, if tRNA genes have a distribution similar to that of the protein-coding genes, the genome would be expected to contain about 700–800 tRNA genes, far more than the 300 expected from filter hybridization experiments (Sulston and Brenner 1974). This could reflect an overprediction of tRNA genes or more likely an underprediction by the hybridization methods.

tRNA genes for all 20 amino acids are represented. The tRNA gene copy number approximately correlates with the known codon bias, but full analysis must await the complete sequence. The tRNA genes often occur singly, but small clusters have been seen, with one cosmid having nine candidate tRNA genes and three cosmids having six. A few tRNA genes have been observed in the introns of protein-coding genes, on either strand relative to the protein coding strand. About 5% of the genes are interrupted by introns.

C. Repetitive DNA

Although little of the sequence is from the autosomal arms that are probably enriched in repetitive sequences, the central clusters of the autosomes and the X chromosome present the full spectrum of repetitive DNA typical of eukaryotes, including simple sequence, tandem, direct, inverted, and dispersed repeats. In the first 2.2 Mb of sequence, for example, more than 5% of the DNA was classified as repetitive.

Mononucleotide runs have been evaluated over the available portions of chromosomes II (3.12 Mb) and III (3.76 Mb) and the X (5.76 Mb) chromosomes (L. Hillier and R. Waterston, unpubl.; S. Jones and M. Berks, pers. comm.). The frequency of such runs by chance alone is determined by base composition. Overall, the GC content is quite constant (34–38% is the full range found in samples of up to 300 kb over regions of II, III, and X) and quite close to the fraction measured for the genome as a whole (36%). This contrasts with mammalian DNA, where large regional differences are found. Local variation does occur, however, with exons having a higher GC content than introns and intergenic regions.

The distribution of A and T run lengths does indeed follow a negative log linear plot, suggesting that their occurrence is largely explicable by chance alone. The slope of the line, however, would predict a higher effective AT content than is seen on average. This could reflect the local inhomogeneities in base composition discussed above, but we cannot rule out other contributing mechanisms.

In contrast, the distribution of G and C run lengths is distinctly different from that expected by chance. Although the number of runs of eight nucleotides is not far different from that predicted on the basis of sequence composition, the number of longer runs does not decrease but actually increases for lengths between 11 and 16. (With an average composition of 36% GC, fewer than one run of 11 bases or longer would be expected by chance alone in the whole C. elegans genome.) Not until runs of 20 or greater are reached does the frequency fall off sharply.
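The chance expectation used in this comparison can be sketched with a simple model in which each base is drawn independently with a fixed frequency p. Under that model the expected number of maximal runs of exactly k bases is about n(1 - p)^2 p^k, so the logarithm of the count falls linearly with k at slope log p, which is the negative log-linear behavior described for A and T runs. The code below is a minimal sketch, not the analysis actually performed; the function names are hypothetical, and the genome size of 100 Mb and the even split of GC content between G and C are assumptions.

```python
from collections import Counter
from itertools import groupby


def count_runs(seq, base):
    """Return a Counter mapping run length -> number of maximal runs of `base`."""
    return Counter(
        sum(1 for _ in group)
        for symbol, group in groupby(seq.upper())
        if symbol == base
    )


def expected_runs_at_least(n_bases, p, min_len):
    """Expected number of maximal runs of length >= min_len.

    Assumes independent bases with frequency p and ignores sequence ends:
    a run starts after a non-matching base (1 - p) and continues for at
    least min_len matching bases (p ** min_len).
    """
    return n_bases * (1 - p) * p ** min_len


# Observed runs in a toy sequence.
toy_counts = count_runs("AAGGGTGGA", "G")

# Chance expectation for G runs of 11 or more, assuming a 100-Mb genome
# and a G frequency of 0.18 (half of 36% GC); both values are assumptions.
expected_g_runs = expected_runs_at_least(100_000_000, 0.18, 11)
```

Comparing `count_runs` output on real sequence against the model expectation at each length would show where the observed distribution departs from chance.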

The G and C runs are also more frequent on the X than on the autosomes (about threefold higher overall), although the distribution of run lengths is similar on all three chromosomes. The density of runs does vary along the length of the chromosomes, particularly on the autosomes, where the frequency increases on the edges of the gene-rich clusters. The strong deviation from random and the differential distribution between and along chromosomes strongly suggest that there may be a biological role for these runs. The prevalence on X might possibly reflect a role in dosage compensation. However, as pointed out previously, the X chromosome also differs from the autosomal gene clusters (and autosomal arms) in other significant ways, including recombination frequency and repeat distribution. It may be that the G and C runs are correlated with one of these, and the hint of increased frequency of G and C runs at the edges of the autosomal clusters is at present suggestive of this alternative correlation.

The nematode genome also appears to be rich in inverted repeats, in which a segment of genomic sequence lies within a few hundred bases of an inverted copy of itself. In the first 2.2 Mb of sequence, inverted repeats accounted for more than 2.5% of the sequence, with an inverted repeat found on average every 5.5 kb (Wilson et al. 1994). Most were quite small, with an average segment length of 70 bp and an average loop size of 164 bp. Occasionally, much larger segments (>1 kb) are found with complete or nearly complete identity. A high proportion of these inverted repeats fall in introns (43%), which represent only 20–25% of the sequence. Many of the inverted repeats fall into families and may be remnants of mobile elements (see below).
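To make the definition concrete, the sketch below finds short inverted repeats by looking for a k-mer whose reverse complement occurs a limited distance downstream. It is a naive illustration, not the method of Wilson et al. (1994); the function names and the parameters `arm_len` and `max_loop` are hypothetical, and only exact matches of a fixed arm length are reported.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq, arm_len=8, max_loop=300):
    """Return (left_start, right_start, loop_size) for exact inverted repeats.

    A hit is an arm of `arm_len` bases at left_start whose reverse
    complement begins at right_start, separated by a loop of at most
    `max_loop` bases.
    """
    seq = seq.upper()
    # Index every k-mer by its start positions.
    positions = {}
    for i in range(len(seq) - arm_len + 1):
        positions.setdefault(seq[i:i + arm_len], []).append(i)

    hits = []
    for i in range(len(seq) - arm_len + 1):
        partner = reverse_complement(seq[i:i + arm_len])
        for j in positions.get(partner, []):
            loop = j - (i + arm_len)
            if 0 <= loop <= max_loop:
                hits.append((i, j, loop))
    return hits


# Toy example: an 8-base arm, a 6-base loop, and the arm's reverse complement.
toy_hits = find_inverted_repeats("AAACCCGG" + "TTTTTT" + "CCGGGTTT")
```

A practical search would also merge overlapping hits into longer arms and tolerate mismatches, since the text notes that larger segments are sometimes only nearly identical.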

Tandem repeats, in which a segment of genomic sequence lies adjacent to one or more copies of itself, accounted for 1.5% of the sequence and occurred on average every 10 kb in the first 2.2 Mb (Wilson et al. 1994). Most of these were small, with an average segment length of 17 bp and an average copy number of 14. In contrast to inverted repeats, only 17% of these fell in introns, 20% fell in exons, and 63% fell between genes. Only triplet repeats, which formed the most common category of tandem repeat, were found in exons with any frequency. Large and complicated tandem repeats have been found, e.g., approximately 100 copies of a 200-base segment.
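A tandem repeat in the sense used here can be located with a simple scan that, at each position, tests each candidate unit length and counts how many adjacent copies follow. The sketch below is a naive illustration under stated assumptions, not the procedure used in the cited analysis: the function name and the thresholds `min_unit`, `max_unit`, and `min_copies` are hypothetical, and only perfect copies are counted.

```python
def find_tandem_repeats(seq, min_unit=2, max_unit=20, min_copies=3):
    """Return (start, unit_length, copy_number) for perfect tandem repeats.

    At each position the unit length giving the longest repeated span is
    kept, and scanning resumes after that span.
    """
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for unit in range(min_unit, max_unit + 1):
            motif = seq[i:i + unit]
            if len(motif) < unit:
                break  # not enough sequence left for this unit length
            copies = 1
            while seq[i + copies * unit:i + (copies + 1) * unit] == motif:
                copies += 1
            if copies >= min_copies and (
                best is None or copies * unit > best[1] * best[2]
            ):
                best = (i, unit, copies)
        if best:
            hits.append(best)
            i += best[1] * best[2]  # skip past the repeat just recorded
        else:
            i += 1
    return hits


# Toy example: five copies of a triplet flanked by unique sequence.
toy_repeats = find_tandem_repeats("ACGT" + "CAG" * 5 + "TTGA")
```

Triplet units such as the one in the toy example are the category the text identifies as most common and as the only one found in exons with any frequency.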

In addition to local repeats, many dispersed families of sequences appear to share a common consensus and are therefore probably duplicated and diverged from common ancestral sequences. In a few cases, these turn out to be examples of transposons. In addition to examples of known Tc transposons, the genome sequence has revealed examples of mariner-type transposons distantly related to Tc elements, two families of non-LTR (long terminal repeat) transposons and one example of a gypsy-class LTR retrotransposon. Most or all of these elements are probably inactive, or they would have revealed themselves as insertional mutagens.

Most dispersed repetitive elements are small (50–150 bp or so), and their function (if any) and the mechanism of their propagation are unclear. In addition to the families of elements previously identified through hybridization (Felsenstein and Emmons 1988; Naclerio et al. 1992), new families are being systematically identified and classified by computer methods (R. Durbin; S. Lewis and S. Eddy; P. Agarwal and D. States; all unpubl.). C. elegans does not have any sequence as striking as the human Alu family, which makes up more than 10% of the human genome. The most numerous C. elegans interspersed repeat family, repA, has about a 98-bp core consensus, occurs in probably about 10,000 full-length and fragmentary copies, and accounts for 0.7% of the genome. Sixteen other dispersed repeat families have been identified and are routinely annotated in ACeDB. This number, however, is increasing, with 14 more likely to be added in the near future. In some cases, these dispersed families have short terminal inverted repeats similar or identical to the Tc2 or Tc3 terminal inverted repeats (Plasterk and van Leunen, this volume); it seems plausible that these families are “hitchhikers,” mobilized in trans by Tc transposases.

In addition to these short dispersed repeats, larger segments are repeated at great distances, with up to 98% similarity (Wilson et al. 1994). These apparent duplications can have a complex structure wherein segments from one region are repeated in a second location, but with different spacings and orientations. Some involve coding regions and could represent exon shuffling; more likely, one of the copies is not transcribed.

D. Homologies

Overall, approximately 48% of the 6157 predicted genes in the region sequenced to date have significant similarities to genes previously characterized in other organisms. These similarities have often suggested a function for those predicted genes and have been used to find candidate genes associated with certain mutants. In turn, scientists working on genes in other organisms are turning to C. elegans to learn more about their genes. An illustration of the potential of the latter approach is the fact that more than half of all positionally cloned human disease genes have similarities to C. elegans genes, and in some cases, the C. elegans gene is the only similar gene in all of the public databases.

With so many genes now found in the finished sequence (an estimated 50% of all C. elegans genes), it is not surprising that many of the predicted genes fall into gene families (Table 1), with the largest families being the G-protein-coupled receptors and the protein kinases. Some quite large families that had been classified originally as “C. elegans-specific,” based on lack of similarity to any known proteins, have recently been reclassified as probable G-protein-coupled receptors. Alignment and comparison of the sequences along with those from other organisms can help show which residues are important in protein function. The predominance of regulatory proteins (G-protein-coupled receptors, kinases, GTPases, homeobox proteins) in the list provides a glimpse of the complexity of regulatory phenomena in a metazoan.

Table 1. The current “top ten” protein families in the nematode.

Why do half of C. elegans genes not find similar sequences in other organisms? The facile answer is, of course, that the similar gene from other organisms has not yet been sequenced. But this appears unlikely. Green et al. (1993) used statistical methods to examine the question in detail, when the first nematode genome and EST data were becoming available, along with similar systematically obtained data from yeast, E. coli, and humans. The clear prediction from that analysis was that even with their full sequences known, organisms in different phyla would show significant similarities for only about half their genes, using the programs and criteria delineated in the study. This result was reaffirmed when the analysis was repeated with additional data (P. Green, pers. comm.).

Although one could postulate that these genes have arisen de novo, the more probable explanation is that they have simply diverged too far to allow recognition of their ancestral relationships. One means of establishing these relationships is to find homologous genes from more closely related species within the same phylum and use these to predict the ancestral gene, which in turn can be used to look more powerfully for related genes in other phyla. For the families of genes without known similarities alluded to above, the ancestral gene can be predicted from family members within C. elegans itself. Indeed, in one instance, it has proven possible to identify a mammalian relative for what had been a nematode-specific gene family. Other examples will undoubtedly follow. To what extent the function of such divergent genes has been conserved is unclear, but even conservation of general function may be useful in understanding how genotype leads to phenotype.

Copyright © 1997, Cold Spring Harbor Laboratory Press.
Bookshelf ID: NBK20106