Nucleic Acids Res. Feb 1, 2001; 29(3): 818–830.
PMCID: PMC30377

Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome

Abstract

Pseudogenes are non-functioning copies of genes in genomic DNA, which may result either from reverse transcription from an mRNA transcript (processed pseudogenes) or from gene duplication and subsequent disablement (non-processed pseudogenes). As pseudogenes are apparently ‘dead’, they usually have a variety of obvious disablements (e.g., insertions, deletions, frameshifts and truncations) relative to their functioning homologs. We have derived an initial estimate of the size, distribution and characteristics of the pseudogene population in the Caenorhabditis elegans genome, performing a survey in ‘molecular archaeology’. Corresponding to the 18 576 annotated proteins in the worm (i.e., in Wormpep18), we have found an estimated total of 2168 pseudogenes, about one for every eight genes. Few of these appear to be processed. Details of our pseudogene assignments are available from http://bioinfo.mbb.yale.edu/genome/worm/pseudogene. The population of pseudogenes differs significantly from that of genes in a number of respects: (i) pseudogenes are distributed unevenly across the genome relative to genes, with a disproportionate number on chromosome IV; (ii) the density of pseudogenes is higher on the arms of the chromosomes; (iii) the amino acid composition of pseudogenes is midway between that of genes and (translations of) random intergenic DNA, with enrichment of Phe, Ile, Leu and Lys, and depletion of Asp, Ala, Glu and Gly relative to the worm proteome; and (iv) the most common protein folds and families differ somewhat between genes and pseudogenes: the most common fold found in the worm proteome is the immunoglobulin fold, whereas the most common ‘pseudofold’ is the C-type lectin. In addition, the size of a gene family bears little overall relationship to the size of its corresponding pseudogene complement, indicating a highly dynamic genome. There are in fact a number of families associated with large populations of pseudogenes. For example, one family of seven-transmembrane receptors (represented by gene B0334.7) has one pseudogene for every four genes, and another uncharacterized family (represented by gene B0403.1) is approximately two-thirds pseudogenic. Furthermore, over a hundred apparent pseudogenic fragments do not have any obvious homologs in the worm.

INTRODUCTION

Over the course of evolution, genes duplicate in the genome, gradually accumulating mutations that may lead to the acquisition of new functions or to the modification of existing functions. However, some duplications of genes acquire deleterious mutations that disable them so that they can no longer be translated into a functioning protein. The disablement may occur at either or both the transcription and translation levels. These copies of genes are called non-processed pseudogenes. Pseudogenes may also arise by a process of retrotransposition, where an mRNA transcript is reverse transcribed and re-integrated into the genome (1–3). These are termed processed pseudogenes or retropseudogenes and occur in a variety of plants and animals.

Some pseudogenes are evidently transcribed. A possible case of a ‘functioning’ pseudogene transcript has been described recently for neural nitric oxide synthase in the snail Lymnaea stagnalis (4). Here, the pseudogene has a segment that is the inverse complement of the normal gene, and interferes through RNA duplex formation with the expression of nitric oxide synthase (4). Interestingly, the expression of pseudogene transcripts can vary markedly with respect to the expression of the transcripts of their homologous living genes. For example, for the 5-HT7 receptor, transcripts of a pseudogene can be detected in various tissues whereas transcripts for the corresponding functioning gene are absent (5). Pseudogene transcripts can have raised expression in tumor cells, e.g., in laryngeal squamous cell carcinoma (6) or in glioblastoma (7).

Pseudogenes are important in the study of molecular evolution, since they generally acquire mutations, insertions and deletions without any apparent evolutionary pressures. [However, in Drosophila, for example, many putative pseudogenes appear to have patterns of mutation that are inconsistent with a lack of functional constraints (8–10).] In evolutionary studies, pseudogenes have been used to derive underlying rates of nucleotide substitution (11–13) and rates of insertion and deletion in genomic DNA (14,15). In particular, Averof et al. (13) used η-globin pseudogenes to show that double-nucleotide substitutions occur more often than would be expected from independent single-nucleotide substitutions. Gu and Li (14) noted that the pattern of insertions and deletions in processed pseudogenes implies that a logarithmic gap penalty dependence on gap size in sequence alignment is more appropriate than the more commonly used linear dependence. Ophir and Graur (15) performed a survey of processed pseudogenes in human and mouse and found evidence for distinctly different mechanisms underlying gene truncations, insertions and deletions. Pseudogenes are also useful in determining rates of genomic DNA loss for an organism: a smaller complement of pseudogenes in a genome implies a greater net loss of genomic DNA (10,16). Petrov et al. (16) demonstrated experimentally, using dead copies of retrotransposons as ‘pseudogene surrogates’, that the rates of DNA loss in Drosophila and the cricket Laupala are key determinants of genome size. In certain circumstances, pseudogenes can be conserved by a process of gene conversion, such as for immunoglobulin VH pseudogenes in the chicken (17). Goncalves et al. (18) surveyed human retropseudogenes and found that genes with a high number of retropseudogene copies tend to be widely expressed, highly conserved and low in (G+C) content.

With the complete genomes of more than 30 prokaryotes and four eukaryotes [including the Caenorhabditis elegans genome (19)] now published, we have the opportunity to investigate pseudogenes on the genomic scale. Surveys have recently been performed on the genes and pseudogenes of families of G-protein-coupled receptors (20,21). We have conducted a global survey of the population of pseudogenes in the C.elegans genome. Our survey highlights some surprising characteristics of the pseudogene population, such as a markedly uneven chromosomal distribution. In a sense, our survey is a form of ‘molecular archaeology’, focussing on the characteristics of the ‘dead’ genes that can be uncovered in a genome. We see it as logically following up on a number of global surveys of the characteristics of the ‘living’ protein population in the newly sequenced genomes (22–24).

MATERIALS AND METHODS

General definitions: G, ΨG and related terms

Given the gene population of the worm genome, what is the size and distribution of the corresponding pseudogene population? To answer this question, we need to define several populations and subpopulations of genes and pseudogenes in the worm. These are described in detail in Table 1 and Figure 1A. We denote by G the total population of confirmed and predicted protein-encoding genes, which are taken from the Wormpep18 database. We denote by ΨG the estimated population of pseudogenes that correspond to G. In general, the symbol Ψ before any gene name or gene population name denotes the corresponding pseudogene or pseudogene population. The use of the term pseudogene here does not imply any attempt at parsing the exon structure, but refers loosely to any pseudogene or pseudogenic fragment readily detected by homology matching and the occurrence of a simple disablement (a premature stop codon or a frameshift). The total ΨG population is thus an initial estimate somewhat in the spirit of recent attempts to estimate the gene complement of the human genome (25,26).

Figure 1
(A) Schematic showing the derivation of the ΨG data set and its breakdown into subsets. The steps in the derivation of ΨG are summarized in Materials and Methods. The size of ΨG is indicated for the last two steps in ...
Table 1.
Overall statistics for ΨG

We have clustered all genes in G into paralog families. (Singleton genes are those genes that do not have an obvious paralog.) Each pseudogene is assigned to the paralog family of the gene with the closest homology to it. An example of a paralog family with its associated pseudogenes is illustrated (Fig. 1B).

As summarized in Figure 1A, we compiled various subsets of ΨG. ΨGR denotes a particularly ‘reliable’ subset of ΨG that is supported by a variety of information such as a complete cDNA match or a matching protein homology in another organism.

We also generated subsets of G and ΨG that relate to levels of gene expression. The set of genes with at least one verifying EST match was derived (GE). GE was expanded by including all of the paralogs of GE proteins to give the (GE)P set. A set of genes that were adjudged to be highly expressed was derived from microarray expression data (27) and denoted GM. The corresponding predicted pool of pseudogenes is denoted ΨGM.

Data files used and Pseudogene Annotation Pipeline

We downloaded the following data from the Sanger Sequencing Centre ftp site (ftp://ftp.sanger.ac.uk, versions present in December 1999): the complete sequences of the six worm chromosomes, the most current worm protein sequence database (Wormpep18) and GFF data files with annotations for genes and other genomic features that correspond to this Wormpep version. The C.elegans genome sequence data is constantly updated and certain regions will undoubtedly be revised in future versions; it should be stressed therefore that our survey results here are just an initial estimate of the pseudogene population. We have arranged our ΨG identification procedure in the form of a pipeline schematized in Figure 1A.

Pipeline Step 1: Sanger Centre pseudogene annotations. We started off with a list of 332 pseudogenes annotated by the Sanger Centre. This original list is small compared to the final size of ΨG, as the Sanger Centre annotators did not set out to find all of the pseudogenes in the worm genome (R.Durbin, personal communication). These pseudogenes are included in the clustering procedure for derivation of paralog families described below. Our pseudogene population was derived by looking for a simple disablement (a frameshift or premature stop codon, see below). We calculate that 6% of the Sanger Centre-annotated pseudogenes would not be detectable by looking for a simple disablement.

Pipeline Step 2: FASTA matching to find potential pseudogenes. After Wormpep18 was initially masked for low complexity regions with the program SEG (28), the sequence alignment programs TFASTX and TFASTY (version 3.1t13) (29) were used to compare the complete Wormpep18 against the worm genome (in six-frame translation). A list of representatives for the SCOP database (version 1.39) and for sequence clusters from PROTOMAP (version 1) (30) was also compared against the 99-Mb worm genomic DNA. (PROTOMAP is a database that comprises the whole of SWISS-PROT clustered into families.) Sequences were checked for an obvious (sequence-length dependent) coding disablement (i.e., either a frameshift or a premature stop codon) indicative of a pseudogene. The potential pseudogene matches were then further filtered and refined as described below.
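
As an illustration of what ‘six-frame translation’ and a premature-stop check involve, a minimal Python sketch follows. It is not how TFASTX or TFASTY are implemented; the function names are hypothetical, the standard genetic code is assumed, and frameshift detection (which requires the alignment itself) is not covered.

```python
from typing import List

# Standard genetic code, with codons enumerated in the base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate one reading frame; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> List[str]:
    """Return three forward frames followed by three reverse-complement frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (forward, reverse)
        for offset in range(3)
    ]


def has_premature_stop(aligned_translation: str) -> bool:
    """True if a stop occurs before the final position of the aligned region."""
    return "*" in aligned_translation[:-1]
```

In this sketch a stop at the very last aligned position is treated as a normal terminator rather than a disablement; that convention is an assumption made for the example.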

Pipeline Step 3: reduction for overlap on the genomic DNA. Initial significant matches of the protein sequences to the genomic DNA (with e-value ≤ 0.01) were reduced for redundancy where homologs match the same segment of DNA. A normal e-value of 0.01 was used at this stage as it is consistent with that used in previous genome analyses (22–24). First, matches were sorted in a list in decreasing order of significance. Then, if a match was selected, any matches extensively overlapping it were excluded from subsequent selection (allowing for a small margin of overlap of 30 nt). This (de)selection procedure was continued until the end of the list was reached.
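
The (de)selection procedure of Step 3 can be sketched as a greedy pass over the sorted match list. The sketch below is an illustration only: the `Match` record and function names are hypothetical, all matches are assumed to lie on one chromosome, and ‘extensively overlapping’ is interpreted as sharing more than the 30-nt margin.

```python
from typing import Iterable, List, NamedTuple


class Match(NamedTuple):
    protein: str    # identifier of the matching protein (hypothetical field)
    start: int      # first genomic position of the match (inclusive)
    end: int        # last genomic position of the match (inclusive)
    evalue: float   # significance of the match


def overlap(a: Match, b: Match) -> int:
    """Number of nucleotides shared by two matches (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def reduce_overlaps(matches: Iterable[Match],
                    max_evalue: float = 0.01,
                    margin: int = 30) -> List[Match]:
    """Keep the most significant match at each locus.

    Matches are visited in decreasing order of significance (increasing
    e-value); a match is kept only if it overlaps every already selected
    match by no more than `margin` nucleotides.
    """
    significant = [m for m in matches if m.evalue <= max_evalue]
    selected: List[Match] = []
    for m in sorted(significant, key=lambda m: m.evalue):
        if all(overlap(m, s) <= margin for s in selected):
            selected.append(m)
    return selected
```

For example, given three significant matches where the second-best overlaps the best by only 21 nt and the third overlaps it by 251 nt, the first two are kept and the third is discarded.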

Pipeline Step 4: prevention of over-counting for adjacent matches. Some of these initial matches may correspond to the same pseudogene. Therefore, to avoid over-counting for these worm protein matches, the initial matches were further aligned. The genomic DNA fragment f corresponding to each matching protein was extracted. The predicted genomic sequence g for each paralog of the initial matching worm protein in the Wormpep18 database was aligned against f. The length of genomic sequence (gtop) for the top-matching gene paralog for f gives an interval on the genomic DNA within which other less significant matches f can be discarded. This second alignment stage ensures that two or more initial consecutive matches of a Wormpep18 protein to genomic DNA are not counted as separate pseudogenes. The gene for gtop was also used as the final assignment as the closest homolog/paralog for a particular pseudogene.

Pipeline Step 5: masking against Sanger Centre annotation and a transposon library. The potential pseudogenes were then filtered for overlap with any other annotations in the Sanger Centre GFF files such as exons of genes, tandem or inverted repeats and transposable elements. We masked for further transposable elements and their associated repeats by comparing a library of sequences for reported (retro)transposons [including the Tc DNA transposons, the Rte-1 retrotransposon and LTR retrotransposons (31–33)] against the complete C.elegans genome sequence.

Pipeline Step 6: reduction for possible additional repeat elements. At this point in the pipeline, we have a set of 3814 pseudogenic fragments, which we denote ΨG1–5. To delete any possible unknown repeat elements from the total estimated ΨG, any matches to a Wormpep18 protein that recurred more than three times to the same exon (in the absence of additional supporting homology) were deleted.

Pipeline Step 7: increasing threshold stringency. At this point, we had a set of 2401 pseudogenes, denoted ΨG1–6. Next, we tightened the e-value match threshold for pseudogene matches to Wormpep18 from 0.01 to 0.001. However, matches supported by other evidence (such as cDNA or protein homology match) were allowed for e-values up to 0.01. This gave us a new total pseudogene population (denoted ΨG1–7 or simply ΨG throughout the paper). We had a number of rationales for doing this: (i) comparison of ΨG1–7 and ΨG1–6 gives some indication of the sensitivity of pseudogene annotation to the thresholds; (ii) it also potentially allows one to identify a set of more ancient pseudogenes (those in ΨG1–6 but not in ΨG1–7); (iii) a FASTA e-value cutoff of 0.01 is expected to give about one false positive per 100 searches, not a particularly high value, but one that would give a substantial number of false positives in the tens of thousands of comparisons that underlie our pseudogene identification.
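
The arithmetic behind rationale (iii) can be illustrated with a rough calculation. The sketch below assumes, for illustration only, that each of the 18 576 Wormpep18 proteins counts as one independent search and that the e-value cutoff bounds the expected number of chance matches per search; the function name is hypothetical.

```python
def expected_false_positives(n_searches: int, evalue_cutoff: float) -> float:
    """Rough expected number of chance matches accepted across all searches."""
    return n_searches * evalue_cutoff


N_QUERIES = 18_576  # number of Wormpep18 proteins, as stated in the text

for cutoff in (0.01, 0.001):
    estimate = expected_false_positives(N_QUERIES, cutoff)
    print(f"cutoff {cutoff}: about {estimate:.0f} expected chance matches")
```

Under these assumptions the estimate falls from roughly 186 to roughly 19 chance matches when the cutoff is tightened from 0.01 to 0.001.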

Processed pseudogenes

We developed a heuristic to assess whether a pseudogene was processed, based on looking for ‘exon seams’ in the DNA segment f that contains a homology match to a protein. An exon seam is a short stretch of coding sequence that spans an exon–exon junction and so would not be found uninterrupted in the genomic DNA without processing. We found that a suitable length for an exon seam was 10 amino acids. If all but one of the exon seams for any paralogous protein are found in the translation of f then the pseudogene is identified as a possible processed pseudogene. Processed pseudogenes also have a polyadenine tract 3′ to their protein homology segment (2). Polyadenosine tracts are added during mRNA processing and are usually between 50 and 200 nt long. Therefore, in addition, we analyzed a 50-nt stretch 3′ to the pseudogene fragments found in the genomic DNA for any evidence of an elevated adenine content relative to the overall distribution of adenine content for predicted genes in the same region.
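
The exon-seam test can be sketched as follows. This is a simplified illustration rather than the procedure actually used: it assumes that exons end on codon boundaries, that a seam is the 10 residues centred on each exon–exon junction, that seams are matched as exact substrings (a real comparison would have to tolerate mutations), and that at least one seam must be found. All function names are hypothetical.

```python
from typing import List


def exon_seams(exon_peptides: List[str], seam_len: int = 10) -> List[str]:
    """Return the peptide windows that straddle each exon-exon junction.

    `exon_peptides` holds the translation of each exon of a paralog, in order.
    """
    half = seam_len // 2
    protein = "".join(exon_peptides)
    seams = []
    junction = 0
    for peptide in exon_peptides[:-1]:
        junction += len(peptide)
        # Skip junctions too close to either end to give a full-length seam.
        if junction - half >= 0 and junction + half <= len(protein):
            seams.append(protein[junction - half:junction + half])
    return seams


def looks_processed(exon_peptides: List[str],
                    translated_fragment: str,
                    seam_len: int = 10) -> bool:
    """Flag a fragment as possibly processed if all but at most one seam occurs."""
    seams = exon_seams(exon_peptides, seam_len)
    if not seams:
        return False  # single-exon paralogs carry no seam evidence
    found = sum(seam in translated_fragment for seam in seams)
    return found >= max(1, len(seams) - 1)
```

A fragment whose translation reads straight through the junctions of a three-exon paralog is flagged, whereas a fragment matching a single exon is not.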

Clustering of Wormpep18 proteins

The 18 576 proteins in Wormpep18 were clustered using a modification of the algorithm of Hobohm et al. (34) for deriving representative lists of protein chains. Pairwise alignment using the FASTA (version 3.1t13) algorithm (29) was performed to compare proteins. Two proteins were judged similar if they had an e-value for alignment ≤ 0.01. Clusters are formed in increasing order of the number of relatives that a sequence has in order to minimize false linkage of multidomain proteins. These clusters are termed paralog families. Each cluster is named after its representative Wormpep18 protein. Genes with no close relatives according to this method are termed singleton genes.

Fold assignments

For the worm proteome, matches to SCOP (version 1.39) domains and to transmembrane proteins are extrapolated onto Wormpep18 from assignments made previously on Wormpep17 proteins (23). For the pseudogene complement, implied assignments to SCOP domains and transmembrane proteins are taken from the closest matching Wormpep18 protein for each individual pseudogene or pseudogene fragment.

In addition, we performed transmembrane helix prediction directly on six-frame translations of the raw genomic DNA using a hydropathy scale and 20-residue window as described previously (23,35). Based on an analysis of the distribution of length of interhelical segments in existing membrane protein structures, we joined two predicted transmembrane helices into the same ‘exon’ if they were separated by <40 amino acids. We only flagged the resulting assemblage as a pseudogene if it contained a single stop codon in one of the predicted transmembrane helices. These predicted transmembrane protein regions are masked for overlap with other described genomic features as for the pseudogene homology matching.
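
The joining and flagging steps described above can be sketched as below. The hydropathy-based helix prediction itself is not reproduced; the sketch assumes helices are already available as 0-based inclusive (start, end) residue positions in one translated frame, with ‘*’ marking stop codons. Function names are hypothetical, and ‘a single stop codon’ is read as exactly one stop within the helices of an assemblage.

```python
from typing import List, Tuple

Helix = Tuple[int, int]  # 0-based inclusive residue positions


def join_helices(helices: List[Helix], max_gap: int = 40) -> List[List[Helix]]:
    """Group predicted helices separated by fewer than `max_gap` residues."""
    groups: List[List[Helix]] = []
    for start, end in sorted(helices):
        if groups and start - groups[-1][-1][1] - 1 < max_gap:
            groups[-1].append((start, end))
        else:
            groups.append([(start, end)])
    return groups


def is_candidate_tm_pseudogene(translation: str, group: List[Helix]) -> bool:
    """True if exactly one stop codon falls inside the helices of the group."""
    stops = sum(translation[start:end + 1].count("*") for start, end in group)
    return stops == 1
```

Here two helices 20 residues apart are placed in the same assemblage, while a helix 80 residues further on starts a new one.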

Subsets of worm genes

Some of the gene sequences in the Sanger Centre worm genome data are noted as matched to ESTs or full-length cDNA. A further set of EST- and cDNA-confirmed worm gene structures is available from the Intronerator database (http://www.cse.ucsc.edu/~kent/intronerator; 36). We merged these two sets of notations and derived two sets of EST-verified genes. First, the set of genes with at least one verifying EST was compiled (GE). Secondly, GE was expanded by including all of the paralogs of GE proteins [(GE)P].

Microarray expression data at four time points in the development of the worm (from egg to adult) is available for a substantial cross-section of worm genes (27). The average of this expression level may be a rough indicator of whether a gene is highly expressed or not. (However, microarray data, unlike that from GeneChips or SAGE, gives only approximate qualitative indications of the degree to which various genes are differentially expressed. It is much more accurate in highlighting the genes that change considerably in expression.) A suitable threshold for this average expression was used to compile a data set that comprises about half of the ~18 500 worm genes (totaling 9991 more highly expressed genes, denoted GM). The corresponding data set of pseudogenes is ΨGM.

A subset of more ‘reliable’ pseudogenes (ΨGR) was compiled that are supported by a variety of evidence. They are pseudogenes that: (i) are verified by a full-length cDNA or have complete EST coverage; (ii) are noted as confirmed genes in the Wormbase database (http://www.wormbase.org), excluding those which upon inspection have obviously incorrect genomic structure; (iii) have been previously annotated by the Sanger Centre annotators as a pseudogene using a gene prediction algorithm; (iv) have a homology match to another non-worm protein over the length of the pseudogene homology match; or (v) have 50 or more matches to a worm coding sequence of substantial length (>400 nt). This last condition mainly applies to homologies to chemoreceptor genes and other G-protein-coupled receptors. The corresponding set of whole genes for these is denoted GR, but is not directly comparable as some of the conditions above do not relate to them.

Data on Web site

We have constructed a Web site (http://bioinfo.mbb.yale.edu/genome/worm/pseudogene) for browsing the pseudogene annotations, along with other genomic features downloaded from the Sanger Centre Web site. The ΨGR data can be viewed either by searching for a particular ORF or protein name, by viewing the region around an ORF or simply by viewing a specified range in the chromosome. The sense and alignment score of each pseudogene are displayed, and the genomic sequences of aligned segments (along with their amino acid translations) are viewable. We have also linked the results to a variety of available internal and external resources including online databases and structural annotations.

RESULTS AND DISCUSSION

Estimated size of pseudogene population

The pseudogene population (denoted ΨG) arising from the decay of protein-coding genes in the worm is estimated to comprise 2168 sequences, which is ~12% of the size of the total gene complement (G). This is only an initial estimate of the pseudogene population, which may be examined for broad trends and characteristics. One should keep in mind that there are a number of obvious factors that may affect the size of ΨG, causing over- or under-estimation.

(i) Dead copies of transposable elements would lead to an over-estimate of ΨG. However, these may be considered validly as pseudogenic fragments, and have been used as such in studies of DNA loss in Drosophila (10,16). Nonetheless, we do not find any abundant patterns of multiple protein-homology hits in the genomic DNA that would be indicative of a major unknown transposable element (see below). Only ~5% of our total potential pseudogene matches are deleted because of matches to known transposable element proteins (see Materials and Methods).

(ii) The size of ΨG here may be an underestimate as we do not include pseudogenes that only have the less obvious coding disablements, such as damaged splicing signals. However, our search for only frameshifts and premature stops is supported by the fact that only 6% of the Sanger Centre-annotated pseudogenes would be missed by this procedure.

(iii) Some annotated genes may in fact be pseudogenes, as the disablement (such as a disabled promoter) is undetectable by gene prediction procedures.

(iv) Conversely, some of our pseudogenes might be parts of real functioning genes that were not annotated in Wormpep. In particular, it is conceivable that some premature stops or frameshifts may indicate a shortened protein that lacks all or part of a domain. However, a search of the scientific literature has revealed that reported cases of this phenomenon are rare, and where they occur they may be pathogenic [and thus unlikely to be conserved, e.g., a germline mutation in the human prion protein gene in a single Japanese patient (37)].

(v) Some of the pseudogenes may arise because of sequencing errors (and so should be annotated as genes). However, the reported overall error rate in sequencing is low (<1 in 10 000 bases) (38).

(vi) Some of our pseudogene fragments that are extrapolated from Wormpep may comprise genomic-level repeats; however, we have taken measures to avoid this problem (see Materials and Methods).

(vii) Some pseudogene matches may be separate fragments of a single pseudogene, which would lead to over-counting; this problem is minimized in the present work by merging some pseudogene matches along the genomic DNA, with a procedure described in Materials and Methods.

Pseudogene subpopulations

Highly expressed genes appear to have fewer dead gene copies or fragments. When only EST-matched genes are considered, ΨGE corresponds to 5% of GE (363 predicted pseudogenes) (Table 1). (Intermediate between these, there are 1165 predicted pseudogenes that correspond to a gene with an EST match or that are paralogous to a gene with an EST match, Ψ(GE)P.) For pseudogenes related to genes that are highly expressed according to microarray data (i.e., those that comprise the GM data set), the corresponding pseudogene complement is ~7% of the size of GM. Interestingly, singleton genes (i.e., those with no close paralogs) have a smaller relative population of pseudogenes (corresponding to 11% of the total number of singleton genes) yet constitute 32% of the gene population. The most reliable subset of the pseudogene population (ΨGR) is about half of the total for ΨG (Table 1). The sizes of ΨGR and GR are not directly comparable, as ΨGR is compiled from a variety of sources (Table 1).

Intronic pseudogenes are pseudogenes that are contained completely within a single intron. A substantial fraction of ΨG is estimated to be intronic (39%) (Table 1). Interestingly, there is no preference for sense or antisense alignment for an intronic pseudogene relative to the exons of the surrounding gene (53% are antisense). This suggests that the existence of pseudogenes in an intron has no relation to the transcription and splicing of a gene.

A key consideration is the proportion of ΨG that are processed pseudogenes. Processed pseudogenes are derived originally from mRNA transcripts that have been reverse-transcribed and re-integrated into the genome. In a sense, these pseudogenes are not indications of ailing families of proteins, but rather the opposite; one might expect more processed pseudogenes for genes that are highly or widely expressed. They have the following features: (i) they lack the introns of the gene from which they are derived; (ii) they tend to have a characteristic polyadenine tail; (iii) they lack the promoter structure of the gene from which they are derived; and (iv) they have short direct repeats (~9–15 bp) at their 5′ and 3′ ends (2). We could not find any mention of processed pseudogenes in the worm in the scientific literature. We estimated the proportion of processed pseudogenes in ΨG using a simple heuristic that involved looking for stretches of coding sequence that could not be in the pseudogene without processing (which we have termed ‘exon seams’) and also for evidence of a polyadenine tail (see Materials and Methods). According to the exon seams identification, there appear to be few pseudogenes that result from processing in ΨG (totaling 208, 10%). We could not find any obvious subpopulation of pseudogenes with an elevated adenine content 3′ to their homology segment that would indicate a polyadenine tail. The size of the estimated population of processed pseudogenes here contrasts substantially with the human genome, where ~80% of the pseudogenes are predicted to be processed (39).

Chromosomal distribution of pseudogenes

We mapped the positions of pseudogenes and genes along each of the six worm chromosomes. Pseudogenes appear to be more abundant nearer the ends or ‘arms’ of the chromosomes (Fig. 2). When the distributions for the individual chromosomes are merged, we find that 53% of the pseudogenes are in the first and last 3 Mb of the chromosomes, compared to only 30% of the genes. It was previously noted (19) that the proportion of genes with similarities to other organisms tends to be lower on the chromosomal arms. The pseudogene distribution along the chromosomes correlates with this observation and supports the idea of more rapidly evolving genomic DNA towards the ends of the chromosomes (19). The same trend for increased occurrence of pseudogenes is observed for the various pseudogene subpopulations. In particular, for the GE subset, 50% of the pseudogenes are in the first and last 3 Mb of genomic DNA (Fig. 2). The analogous number for (GE)P is 53%, and one also gets similar results for the highly expressed subset GM. For the most reliable subset ΨGR, this proportion is lower (40% in the first and last 3 Mb). This may be related to the fact that genes with homology to proteins from other organisms are more prevalent towards the center of the chromosomes (19).
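
The ‘first and last 3 Mb’ statistic can be computed as in the following sketch. The data layout (a mapping from chromosome name to feature positions and a separate mapping of chromosome lengths) and the function name are assumptions made for illustration; the positions used in any call would come from the annotation files.

```python
from typing import Dict, List


def fraction_on_arms(positions_by_chrom: Dict[str, List[int]],
                     chrom_lengths: Dict[str, int],
                     arm_size: int = 3_000_000) -> float:
    """Fraction of features lying in the first or last `arm_size` bases.

    Positions are 1-based coordinates of each feature on its chromosome;
    the counts for all chromosomes are merged before the fraction is taken.
    """
    total = 0
    on_arms = 0
    for chrom, positions in positions_by_chrom.items():
        length = chrom_lengths[chrom]
        for pos in positions:
            total += 1
            if pos <= arm_size or pos > length - arm_size:
                on_arms += 1
    return on_arms / total if total else 0.0
```

Running the same function on gene positions and on pseudogene positions gives the two percentages being compared.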

Figure 2
The estimated chromosomal distribution of pseudogenes. Each panel depicts the distribution of genes (left) and pseudogenes (right) for the chromosomes I, II, III, IV, V, X. The EST-matched subsets for each chromosome are binned as a dark grey ...

As shown in Figure 2 (legend), the distribution of pseudogenes between the chromosomes is also uneven. For each chromosome, we calculated the proportion of ‘dead’ genes [equal to |ΨGn|/(|ΨGn| + |Gn|), where |Gn| is the size of the gene population Gn for chromosome n and |ΨGn| is the number of pseudogenes]. Chromosome IV appears to be the most ‘dead’, chromosome II the least. (The same trend is also found for the ΨGR subset, as noted.) This variation in the proportion of pseudogenes between chromosomes may be due to specific gene families, or perhaps recently defunct families of genes.
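
The proportion of ‘dead’ genes defined above is straightforward to compute; a short sketch follows. The per-chromosome counts in the example are hypothetical placeholders, not values from this survey, and the function name is an assumption.

```python
def dead_fraction(n_pseudogenes: int, n_genes: int) -> float:
    """Proportion of 'dead' genes: |PsiG_n| / (|PsiG_n| + |G_n|)."""
    total = n_pseudogenes + n_genes
    return n_pseudogenes / total if total else 0.0


# Hypothetical (pseudogene count, gene count) pairs for two chromosomes.
counts = {"chrA": (30, 270), "chrB": (10, 390)}
fractions = {chrom: dead_fraction(p, g) for chrom, (p, g) in counts.items()}
most_dead = max(fractions, key=fractions.get)
```

Note that this proportion has pseudogenes in the denominator as well, so applying it to the genome-wide totals (2168 pseudogenes and 18 576 genes) gives about 10.5%, slightly lower than the ~12% ratio of pseudogenes to genes.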

We looked for recurrent pairs of predicted pseudogenes distributed along the chromosomes that may perhaps indicate some undiscovered transposable element. The most frequent pair patterns are tabulated (Table 2). The most common pseudogene pair is for a seven-transmembrane gene family, represented by the gene B0334.7. None of the top recurrent pairs appears indicative of a transposable element.

Table 2.
Most common pair patterns for predicted pseudogenes along chromosomes

Disablements, length and composition of ΨG matches

The obvious disablements in the pseudogene population (i.e., frameshifts or premature stops) are tallied in Figure 3A. A high proportion of ΨG has only one disablement over the length of genomic sequence aligned (44%). This may indicate an evolutionarily young pseudogene population that is rapidly deleted from genomic DNA. (Alternatively, it may just reflect the fact that pseudogenes with more than one disablement tend to have less similarity to known proteins than those with a single disablement.) In general, non-coding frameshifts (of either one or two bases) and premature stop codons are approximately evenly represented in the pseudogene fragments detected (Fig. 3A). A similar trend for disablements is seen for ΨGR (data not shown).

Figure 3
Disablements, length and composition for ΨG. (A) Simple disablements. This data is only for the ΨG population directly derived from Wormpep18. (B) Length distribution of pseudogene matches. The distribution of pseudogene match ...

The length distribution for the homology matches for pseudogenes is shown in Figure 3B, compared to the length distribution for exons in known worm genes. The modes of these distributions are similar. The mean length of these matches is 338 nt, somewhat larger than the mean length of a worm exon (210 nt), because the distribution for the pseudogenes has a somewhat longer tail. Over medium-range lengths (300–500 nt) pseudogenic fragments are about twice as prevalent as exons. The long tail is probably due firstly to processed pseudogenes and secondly to matches against genomic DNA where a gap has been introduced over the length of a small intron. The maximum length is 3156 nt for a pseudogene that is most similar to the gene W08D2.5. This is probably a processed pseudogene as there is no evidence of the exon structure of one of its paralogs. In general, however, these matches will not correspond to the length of the whole pseudogene, and are only used to detect the presence of a pseudogene at a particular genomic locus (see Materials and Methods). We do not observe any preference in the pseudogenic homology matches for the N- or C-termini of the corresponding worm protein, for either the processed or unprocessed pseudogenes (50 and 57% of those estimated to be unprocessed and processed, respectively, tend toward the C-terminus).

As shown in Figure 3C, we measured the amino acid composition of G (the Wormpep18 protein complement) and the implied amino acid composition of both ΨG and random non-repetitive genomic sequence. The amino acid composition for the pseudogenes is generally intermediate between the composition of random genomic sequence and the composition of the Wormpep18 proteins (Fig. 3C), being closer to random than to Wormpep18 (14 out of 20 residues). One would expect older pseudogenes to be closer to random sequence than younger ones, so study of the amino acid composition in this way may indicate from genome to genome the overall age of the pseudogene population. (However, of course, the actual age of a pseudogene subpopulation will be dependent in a complex way on rates of genomic deletion/insertion and point mutation.)
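The comparison described here, counting the residues for which the pseudogene composition lies nearer to random sequence than to the proteome, can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names `composition` and `residues_closer_to_random` and the toy sequences are hypothetical, and the paper does not specify the exact distance measure used, so a simple absolute difference in frequency is assumed.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(protein_seqs):
    """Fractional amino acid composition over a collection of sequences.

    Characters outside the 20 standard residues (e.g. '*' for a stop
    in a translated pseudogene) are ignored.
    """
    counts = Counter()
    for seq in protein_seqs:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}

def residues_closer_to_random(gene, pseudo, rand):
    """Residues whose pseudogene frequency is nearer to random than to gene.

    Assumed measure: absolute difference in fractional composition.
    """
    return [
        aa for aa in AMINO_ACIDS
        if abs(pseudo[aa] - rand[aa]) < abs(pseudo[aa] - gene[aa])
    ]

# Hypothetical toy sequences standing in for G, ΨG and translated random DNA.
gene_comp = composition(["AAAA"])
pseudo_comp = composition(["AFFF*"])
random_comp = composition(["FFFF"])
closer = residues_closer_to_random(gene_comp, pseudo_comp, random_comp)
print(closer, len(closer))
```

Applied to the three real populations, the length of the returned list would play the role of the "14 out of 20 residues" count reported above.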

In our composition analysis, we find that the most enriched residues in ΨG relative to G are Phe, Ile, Leu and Lys, and the most depleted residues are Asp, Ala, Glu and Gly (Fig. 3C). The enrichment in Phe is particularly interesting as the number of codons for this residue is small (two, TTT and TTC). Moreover, the enrichment of Phe and Lys in the ΨG and random sequences relative to G is perhaps related to an underlying trend for local A/T mononucleotide repeats in the genome (data not shown). Also, Lys is preferred to the physico-chemically similar Arg in the C.elegans proteome even though the former has only two codons, compared with six for the latter (Fig. 3C; 40). Lys, in fact, has been found to be the amino acid that varies most in composition between various genomes (41). The amino acid composition of ΨGR was also derived and yields the same results as described above (data not shown).
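The codon counts quoted above follow directly from the standard genetic code, which the short sketch below rebuilds and queries. The helper name `codons_for` is hypothetical; the table itself is the standard code laid out in TCAG order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AA_STRING = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AA_STRING)
}

def codons_for(amino_acid):
    """Return the sorted list of codons encoding a one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)

for name, letter in [("Phe", "F"), ("Lys", "K"), ("Arg", "R")]:
    codons = codons_for(letter)
    print(name, len(codons), codons)
```

The Phe codon TTT and the Lys codon AAA are themselves mononucleotide runs (and reverse complements of one another), which fits the suggested link between A/T repeats and the enrichment of these two residues.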

Distribution in terms of gene paralog families

We clustered the genes and pseudogenes in the worm genome into paralog families. An example of a paralog family is illustrated in Figure 1A. For each family, as shown in Figure 4, we plotted the number of genes versus the number of pseudogenes. Clearly, the number of pseudogenes per family is not correlated with the number of genes. The large families that have an extensive graveyard of pseudogenes relative to their living population, or vice versa, are labeled with their family representatives. Some of these larger families are ‘outliers’ that deviate from the overall ratio, indicating a dynamic genome. The family represented by the gene B0403.1 is uncharacterized, but comprises twice as many pseudogenes as genes (31 and 16 in total, respectively).
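How far a family departs from the genome-wide ratio can be expressed numerically. The sketch below uses only figures stated in this paper (18 576 annotated proteins, 2168 estimated pseudogenes, and 16 genes versus 31 pseudogenes for the B0403.1 family); the function names and the choice of a simple ratio-of-ratios as the "enrichment" measure are assumptions made for illustration, not the analysis behind Figure 4.

```python
# Genome-wide totals as stated in the paper.
OVERALL_GENES = 18576
OVERALL_PSEUDOGENES = 2168

def pseudogene_fraction(n_genes, n_pseudogenes):
    """Fraction of a family's members (genes plus pseudogenes) that are dead."""
    total = n_genes + n_pseudogenes
    return n_pseudogenes / total if total else 0.0

def enrichment(n_genes, n_pseudogenes):
    """Family pseudogene-to-gene ratio divided by the genome-wide ratio.

    Values well above 1 mark families with an outsized 'graveyard'.
    """
    if n_genes <= 0:
        raise ValueError("a family needs at least one gene for this ratio")
    overall_ratio = OVERALL_PSEUDOGENES / OVERALL_GENES
    return (n_pseudogenes / n_genes) / overall_ratio

genes_per_pseudogene = OVERALL_GENES / OVERALL_PSEUDOGENES
b0403_fraction = pseudogene_fraction(16, 31)
b0403_enrichment = enrichment(16, 31)
print(round(genes_per_pseudogene, 1), round(b0403_fraction, 2), round(b0403_enrichment, 1))
```

On these numbers the genome as a whole has between eight and nine genes per pseudogene, whereas about two-thirds of the B0403.1 family is pseudogenic.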

Figure 4
Plot of the number of genes in a paralog family (Gfamily) versus the number of pseudogenes in a paralog family (ΨGfamily). The families from the GE set are marked as closed points, with the remainder as open points. The lines indicate ...

In Table 3, we list the largest sequence families in the worm, ranked by their number of genes and pseudogenes. They are named for their particular family representative. Four of the top 10 paralog gene families when ranked by number of pseudogenes are functionally uncharacterized. Moreover, three of the pseudogene top 10 are amongst the biggest families when we rank according to number of genes. These large, evolutionarily dynamic seven-transmembrane receptor families are represented by the genes B0334.7, B0213.7 and C03A7.3. The B0334.7 family is the largest and has about one pseudogene for every four genes, which is close to the overall ratio for genes and pseudogenes in the genome (Fig. 4). The occurrence of the reverse-transcriptase and the TcA transposase families in the top 10 list may indicate parts of an unknown transposable element that we failed to mask for.

Table 3.
Top paralog families for ΨG and Ga

The pseudogene family rankings are similar for the EST-matched genes (ΨGE) (Table 3). If the higher e-value threshold of 0.01 is used instead of 0.001 for worm protein homology matching, there is little change in the most prevalent families for pseudogenes (Table 3 footnote). This suggests a fundamental robustness to these rankings of gene paralog families. The additional pseudogenes pulled in by the less stringent e-value threshold (0.01) presumably represent more ancient pseudogenes. Thus, the fact that the rankings change little suggests that the older pseudogenes have the same distribution of families as more modern ones.

In addition, we found 150 pseudogenic fragments that were similar to representative sequences from the PROTOMAP database but did not have detectable homology to a worm protein (Table 4). These ‘PROTOMAP pseudogenes’ either result from horizontal transfer or have diverged too far for the homology to their parent worm protein to be detected. Or perhaps they are even remnants of gene families that have completely died out in the worm. We list the biggest families of PROTOMAP pseudogenes in Table 4. The top match is an uncharacterized ORF of yeast (yja7_yeast, yeast ORF name YJL007C), which has no other reported homologs, whereas the second and third are similar to mammalian proteins with known functions (Table 4).

Table 4.
Other pseudogenic homology fragments that match a PROTOMAP family representative but with no detected homology to WormPep

Protein ‘pseudofolds’ and transmembrane assignments

The proteins encoded by the worm genome have previously been assigned to globular protein domain folds from the SCOP (version 1.39) database and top 10 lists of the most common folds in the worm have been constructed (23,42). Here, we tried to perform the analogous procedure on the pseudogene population. Where possible, we assigned one of the known protein folds to each identified pseudogene based on standard approaches. In particular, for every pseudogene, the structural assignments of its closest gene homolog were considered as implied structural assignments (see Materials and Methods). Then we ranked the pseudogenes in terms of these implied structural assignments or pseudofolds (Fig. 5). Overall, there is a decrease in assignability to a SCOP domain for the pseudogene population (12% have an assignment) compared to the gene population (24%). This may be due to truncation or deletion of genomic DNA.

Figure 5
The folds and pseudofolds in the worm genome. (A) The SCOP domain matches are extrapolated onto Wormpep18 from assignments made previously on Wormpep17 proteins (23). (B) Pseudofold assignments are taken from the closest matching gene paralog ...

In Figure 5, we ranked the pseudogenes in terms of these implied structural assignments or pseudofolds. The prevalence of different globular folds is somewhat different for the gene and pseudogene populations, although six folds occur in both top 10 lists (Fig. 5). Examination of pseudofolds may give an indication of protein structures that have fallen out of favor evolutionarily. Two of the top 10 pseudofolds occur infrequently in the worm proteome and thus may be folds that have lost some utility for the worm: the DNaseI-like fold (α+β class) and the ovomucoid PCI-like inhibitor fold, which is small and disulphide-rich (Fig. 5). The immunoglobulin-like fold, which is in the all-β folding class, is the top fold in G, but is the second-ranking fold for ΨG. This fold is much more abundant in the worm than in any completely sequenced microbial organism (23). The most common pseudofold for ΨG is the C-type lectin fold, which has only been found in eukaryotes (43).

Previously, the worm gene population was surveyed for the presence of transmembrane segments (23). We tried to perform a similar survey here for the pseudogene population. The proportion of pseudogenes corresponding to a predicted transmembrane protein is the same in ΨG (22%) as in G (22%). In addition, outside of the homology-based pseudogene fragments, transmembrane helices were assigned on six-frame translations of the raw genomic sequence to locate other regions that are transmembrane-protein-like and pseudogenic (see Materials and Methods). There is a small number of such pseudogenic transmembrane segments with four or more predicted transmembrane helices (174 in total). These may be additional deceased transmembrane protein genes.
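The six-frame translation step mentioned above can be sketched as follows. This is an illustration only: the function names are hypothetical, ambiguous codons are arbitrarily rendered as `X`, and the transmembrane-helix prediction that was then run on the translations is not reproduced here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
AA_STRING = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AA_STRING)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from the start of `dna`; unknown codons give 'X'."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to conceptual translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames

frames = six_frame_translations("ATGAAATTT")
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, frames[label])
```

Each of the six strings could then be passed to a transmembrane-helix predictor, keeping regions with four or more predicted helices as in the survey above.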

CONCLUSIONS

Our goal in this study was to provide an initial estimate of the size, distribution and characteristics of the pseudogene population in a large metazoan genome, that of C.elegans, in the spirit of recent attempts to estimate the total number of genes in the human genome (25,26). We have found 2168 homology fragments in the worm genome (about one for every eight genes) that appear to be pseudogenic. About half of these (totalling 1100) form the most ‘reliable’ subset of the data. These figures for ΨG may be an over-estimate due to inclusion of dead copies of transposable elements, or of ‘unpredicted’ genes with disablements that are due to sequencing errors. Conversely, they may be an under-estimate due to disregard for pseudogenes with only the less likely disablements, such as a damaged splicing signal, or because some annotated genes are in fact pseudogenes.

We found few pseudogenes that are apparently due to processing in the worm genome. This is in marked contrast to the situation for the human genome, where 80% of the pseudogenes are thought to be processed (39).

The distribution of the proportion of pseudogenes relative to genes for different gene families is notably uneven, indicative of a highly dynamic genome. There are some examples of gene families with an extensive panel of dead fragments, most notably for families of chemoreceptors and other seven-transmembrane receptors (20,21). A future detailed study of the complete chemoreceptor worm ‘subgenome’ that includes these pseudogenes may shed light on the evolution of these largely worm-specific proteins. We also found one large functionally uncharacterized gene family in which about two-thirds of the members are dead genes. Such genes or gene families may be falling out of usage due to removal of the evolutionary pressure for their conservation, or due to recent functional redundancy with another gene family. This may partly explain why fewer pseudogenes occur for genes/gene families that are EST-matched.

There are more pseudogenes relative to genes on the arms of the chromosomes, suggesting that many duplications at the ends of the chromosomes tend to produce unusable genes. This may be because the arms of chromosomes undergo more recombination relative to the overall rate of genomic DNA loss. These areas may be thus more ‘unreliable’ for encoding genes and functions, but conversely are more likely to spawn new proteins. This may also explain the general depletion of genes homologous to other organisms on the arms of the chromosomes (38).

ACKNOWLEDGEMENTS

Thanks to Nathan Bowen (University of Atlanta, GA) for LTR retrotransposon data in the worm genome, to Richard Durbin (Sanger Centre, UK) for helpful advice, to Hedi Hegyi for worm protein assignment data and to Valerie Reinke (Stanford University) and Ronald Jansen for microarray expression data of genes in the worm and to the NIH Structural Genomics Initiative and the Keck Foundation for support.

References

1. Weiner A.M., Deininger,P.L. and Efstratiadis,A. (1986) Non-viral retroposons: genes, pseudogenes and transposable elements generated by the reverse flow of genetic information. Annu. Rev. Biochem., 55, 631–661. [PubMed]
2. Vanin E.F. (1985) Processed pseudogenes: characteristics and evolution. Annu. Rev. Genet., 19, 253–272. [PubMed]
3. Mighell A.J., Smith,N.R., Robinson,P.A. and Markham,A.F. (2000) Vertebrate pseudogenes. FEBS Lett., 468, 109–114. [PubMed]
4. Korneev S.A., Park,J.-H. and O’Shea,M. (1999) Neuronal expression of neural nitric oxide synthase (nNOS) protein is suppressed by an antisense RNA transcribed from an NOS pseudogene. J. Neurosci., 19, 7711–7720. [PubMed]
5. Olsen M.A. and Schechter,L.E. (1999) Cloning, mRNA localization and evolutionary conservation of a human 5HT7 receptor pseudogene. Gene, 227, 63–69. [PubMed]
6. Feenstra M., Bakema,J., Verdaasdonk,M., Rosemuller,E., van den Tweel,J., Slootweg,P., Weger,R. and Tilanus,M. (2000) Detection of a putative hla-a*31012 Processed pseudogene in a laryngeal squamous cell carcinoma. Genes Chromosom. Cancer, 27, 26–34. [PubMed]
7. Fujii G.H., Morimoto,A.M., Berson,A.E. and Bolen,J.B. (1999) Transcriptional analysis of the PTEN1/MMAC1 pseudogene, pisPTEN. Oncogene, 18, 1765–1769. [PubMed]
8. Currie P.D. and Sullivan,D.T. (1994) Structure, expression and duplication of genes which encode phosphoglycerate mutase of D. melanogaster. Genetics, 138, 353–363. [PMC free article] [PubMed]
9. Sullivan D.T., Starmer,W.T., Curtiss,S.W., Menotti,M. and Yum,J. (1994) Unusual molecular evolution of an Adh pseudogene in Drosophila. Mol. Biol. Evol., 11, 443–458. [PubMed]
10. Petrov D., Lozovskaya,E. and Hartl,D. (1996) High intrinsic rate of DNA loss in Drosophila. Nature, 384, 346–349. [PubMed]
11. Gojobori T., Li,W.H. and Graur,D. (1982) Patterns of nucleotide substitutions in pseudogenes and functional genes. J. Mol. Evol., 18, 360–369. [PubMed]
12. Li W.H., Wu,C.I. and Luo,C.C. (1984) Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J. Mol. Evol., 21, 58–71. [PubMed]
13. Averof M., Rokas,A., Wolfe,K.H. and Sharp,P.M. (1999) Evidence for a high frequency of simultaneous double-nucleotide substitutions. Science, 287, 1283–1285. [PubMed]
14. Gu X. and Li,W.-H. (1995) The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J. Mol. Evol., 40, 464–473. [PubMed]
15. Ophir R. and Graur,D. (1997) Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene, 205, 191–202. [PubMed]
16. Petrov D., Sangster,T.A., Johnston,J.S., Hartl,D.L. and Shaw,K.L. (2000) Evidence for DNA loss as a determinant of genome size. Science, 287, 1060–1062. [PubMed]
17. Ota T. and Nei,M. (1995) Evolution of immunoglobulin VH pseudogenes in chickens. Mol. Biol. Evol., 12, 94–102. [PubMed]
18. Goncalves I., Duret,L. and Mouchiroud,D. (2000) Nature and structure of human genes that generate retropseudogenes. Genome Res., 10, 672–678. [PMC free article] [PubMed]
19. The C.elegans Sequencing Consortium. (1998) Genome sequence of the nematode C. elegans: A platform for investigating biology. Science, 282, 2012–2018. [PubMed]
20. Robertson H.M. (1998) Two large families of chemoreceptor genes in the nematodes C. elegans and C. briggsae reveal extensive gene duplication, diversification, movement and intron loss. Genome Res., 8, 449–463. [PubMed]
21. Robertson H.M. (2000) The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses. Genome Res., 10, 192–203. [PubMed]
22. Gerstein M. (1997) A structural census of genomes: Comparing Bacterial, Eukaryotic and Archaeal Genomes in terms of Protein Structure. J. Mol. Biol., 274, 562–576. [PubMed]
23. Gerstein M., Lin,J. and Hegyi,H. (2000) Proteins Folds in the Worm Genome. Pac. Symp. Biocomput., 5, 30–42. [PubMed]
24. Jansen R. and Gerstein,M. (2000) Analysis of the yeast transcriptome with broad structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res., 28, 1481–1488. [PMC free article] [PubMed]
25. Ewing B. and Green,P. (2000) Analysis of expressed sequence tags indicates 35,000 human genes. Nature Genet., 25, 232–234. [PubMed]
26. Liang F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S. and Quackenbush,J. (2000) Gene index analysis of the human genome estimates approximately 120,000 genes. Nature Genet., 25, 239–240. [PubMed]
27. Reinke V., Smith,E., Nance,J., Wang,J., Van Doren,C., Begley,R., Jones,S.J., Davis,E.B., Scherer,S., Ward,S. and Kim,S.K. (2000) A global profile of germline expression in C. elegans. Mol. Cell, 6, 1–12. [PubMed]
28. Wootton J.C. and Federhen,S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol., 266, 554–571. [PubMed]
29. Pearson W.R., Wood,T., Zhang,Z. and Miller,W. (1997) Comparison of DNA sequences with protein sequences. Genomics, 46, 24–36. [PubMed]
30. Yona G., Linial,N. and Linial,M. (2000) ProtoMap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res., 28, 49–55. [PMC free article] [PubMed]
31. Youngman S., van Luenen,H.G.A.M. and Plasterk,R.H.A. (1996) Rte-1, a retrotransposon-like element in Caenorhabditis elegans. FEBS Lett., 380, 1–7. [PubMed]
32. Bigot Y., Auge-Gouillou,C. and Periquet,G. (1996) Computer analyses reveal a hobo-like element in the nematode C. elegans, which presents a conserved transposase domain common with the Tc1-Mariner transposon family. Gene, 174, 265–271. [PubMed]
33. Bowen N. and McDonald,J. (1999) Genomic analysis of C. elegans reveals ancient families of retro-viral-like elements. Genome Res., 9, 924–935. [PubMed]
34. Hobohm U., Scharf,M., Schneider,R. and Sander,C. (1992) Selection of representative protein data sets. Protein Sci., 1, 409–417. [PMC free article] [PubMed]
35. Gerstein M. (1998) Patterns of protein fold usage in eight microbial genomes: a comprehensive structural census. Proteins, 33, 518–534. [PubMed]
36. Kent W.J. and Zahler,A.M. (2000) The intronerator: exploring introns and alternative splicing in C.elegans. Nucleic Acids Res., 28, 91–93. [PMC free article] [PubMed]
37. Kitamoto T., Iuszuka,R. and Takeishi,I. (1993) An amber mutation of prion protein in Gerstmann-Straussler-Scheinker syndrome with mutant PrP plaques. Biochem. Biophys. Res. Commun., 191, 709–714. [PubMed]
38. Chervitz S.A., Aravind,L., Sherlock,G., Ball,K.A., Koonin,E.V., Dwight,S.S., Harris,M.A., Dolinski,K., Mohr,S., Smith,T., Weng,S., Cherry,J.M. and Botstein,D. (1998) Comparison of the complete protein sets of worm and yeast: orthology and divergence. Science, 282, 2016–2022. [PMC free article] [PubMed]
39. Dunham I., Shimizu,N., Roe,B.A., Chissoe,S., Hunt,A.R., Collins,J.E., Bruskiewich,R., Beare,D.M., Clamp,M., Smink,L.J. et al. (1999) The DNA sequence of human chromosome 22. Nature, 402, 489–495. [PubMed]
40. Nishizawa M. and Nishizawa,K. (1998) Biased usages of arginines and lysines in proteins are correlated with local-scale fluctuations of the G+C content of DNA sequences. J. Mol. Evol., 47, 385–393. [PubMed]
41. Gerstein M. (1998) How representative are the sequences in a genome? A comprehensive structural census. Fold Des., 3, 497–512. [PubMed]
42. Murzin A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540. [PubMed]
43. Gerstein M. and Levitt,M. (1997) A structural census of the current population of protein sequences. Proc. Natl Acad. Sci. USA, 94, 11911–11916. [PMC free article] [PubMed]
44. Altschul S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,J. and Lipman,D.J. (1997) Gapped BLAST and psi-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press