Proc Natl Acad Sci U S A. Sep 2, 1997; 94(18): 9746–9750.

A screen for fast evolving genes from Drosophila


In an attempt to quantify the rates of protein sequence divergence in Drosophila, we have devised a screen to differentiate between slow and fast evolving genes. We find that over one-third of randomly drawn cDNAs from a Drosophila melanogaster library do not cross-hybridize with Drosophila virilis DNA, indicating that they evolve with a very high rate. To determine the evolutionary characteristics of such protein sequences, we sequenced their homologs from a more closely related species (Drosophila yakuba). The amino acid substitution rates among these cDNAs are among the fastest known and several are only about 2-fold lower than the corresponding values for silent substitutions. An analysis of within-species polymorphisms for one of these sequences reveals an exceptionally high number of polymorphic amino acid positions, indicating that the protein is not under strong negative selection. We conclude that the Drosophila genome harbors a substantial proportion of genes with a very high divergence rate.

Keywords: protein evolution, neutral evolution, expressed sequence tag, genome evolution

Evolutionary novelties are usually thought to be brought about either by changes in the regulatory interactions of genes or by gene duplication with subsequent diversification (1–4). The latter assumption in particular has led to the notion that the large number of genes found in highly evolved organisms can potentially be derived from a limited number of functional protein modules. Estimates of the number of functional modules range between 1,000 and 7,000 (5, 6). However, in the current eukaryotic genome projects, a relatively high number of ORFs are identified that do not show a match with other known protein motifs in databases (7–9). Usually it is assumed that this means that their homologs and their respective modules are yet to be discovered. However, an alternative interpretation would be that some proteins evolve so fast that their homologs cannot be discovered over larger evolutionary distances. There are some indications that such fast evolving sequences do indeed exist. First, even in proteins harboring functionally highly conserved motifs, such as transcription factors, a substantial proportion of the protein can diverge so quickly that the alignment between homologs from different species may become impossible outside of the conserved DNA-binding domain. Second, for some of the genes known to be involved in early embryogenesis in Drosophila, it has proven to be difficult or even impossible to obtain homologs from outside the insects (10). Finally, a few genes from Drosophila have been identified that diverge so fast that they cannot even be cloned from distantly related Drosophilids (11–13). Such examples show that it may be worthwhile to systematically analyze how many fast evolving genes exist in a given organism. The results of such a screen would not only bear on the question of the number of ancient conserved protein modules, but also on the question of the origin and material basis of evolutionary novelties.

We have devised such a screen in Drosophila with the following rationale. Randomly picked clones from a Drosophila melanogaster cDNA library are hybridized against a panel of genomic DNA from different insect species with increasing evolutionary distance. The most closely related species in this panel is Drosophila virilis, which split from D. melanogaster between 40 and 60 million years ago. As it is known that the neutral evolution rate in Drosophilids lies between 1 and 2% per million years (14, 15), one should assume that sequences evolving with a neutral or close to neutral rate should not cross-hybridize between these two species under moderately stringent conditions. On the other hand, conserved protein sequences should readily cross-hybridize with D. virilis and also with some of the other species in the panel. After this initial screen, the non-cross-hybridizing genes are analyzed more closely. Their homologs are isolated from Drosophila yakuba, a species that split from D. melanogaster only about 10–15 million years ago, which should allow recovery of homologs of sequences evolving with a neutral rate. The respective cDNAs from both species are then fully sequenced and compared with each other to verify that they have a homologous ORF, which should be indicative of functional genes. Moreover, the sequence comparison of the coding region between these two species allows an estimate of the evolutionary rates of the respective protein. As a final test of whether truly fast evolving genes are recovered, one can determine the level of within-species polymorphism for a given gene. It is expected under a neutral model of molecular evolution that a gene that shows a high divergence rate between species should also be polymorphic within a species.
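The expectation behind this rationale can be checked with a back-of-the-envelope calculation. The sketch below is an illustration and not part of the original analysis. It assumes that the 1–2% per million years figure applies to the pairwise divergence between two species, that substitutions follow the Jukes–Cantor model (so that multiple hits at one site are accounted for), and that indels can be ignored. All function and variable names are hypothetical.

```python
import math

def expected_divergence(rate_per_my, split_my):
    """Substitutions per site between two species (rate assumed to be pairwise)."""
    return rate_per_my * split_my

def jukes_cantor_mismatch(d):
    """Expected fraction of mismatched sites for d substitutions per site
    under the Jukes-Cantor model (an assumption, corrects for multiple hits)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Rates and split times quoted in the text (low and high end of each range).
RATES = (0.01, 0.02)  # substitutions per site per million years
SPLITS = {"D. virilis": (40, 60), "D. yakuba": (10, 15)}  # million years

def mismatch_range(species):
    """Lowest and highest expected mismatch fraction for a species pair."""
    low_t, high_t = SPLITS[species]
    low = jukes_cantor_mismatch(expected_divergence(RATES[0], low_t))
    high = jukes_cantor_mismatch(expected_divergence(RATES[1], high_t))
    return low, high

for name in SPLITS:
    low, high = mismatch_range(name)
    print(f"{name}: expected mismatch {low:.0%} to {high:.0%}")
```

Under these assumptions, neutrally evolving sequences compared with D. yakuba stay below the roughly 35% mismatch tolerated by the hybridization conditions described in the methods. For D. virilis the expected mismatch lies close to that limit at the low end of the quoted ranges and far above it at the high end.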

We show that this strategy does indeed identify a large number of genes with very high evolutionary divergence rates. Only a few of them include a known protein module, whereas for most of them, homologs cannot be detected in the databases. These results support the notion that a whole class of genes exists in a typical eukaryotic genome that has not been systematically taken account of so far.


Screening of Clones.

An oligo(dT) primed and directionally cloned λ ZAPII cDNA library encompassing the first 2–14 hr of embryonic D. melanogaster development (Stratagene) was plated at low density. Five hundred and seventy-six single clones were picked and checked via cross hybridization for duplicates. One hundred and five nonduplicate cDNA clones were used for the further experiments. The plasmids were in vivo excised with the ExAssist helper phage (Stratagene). The 5′ end of each plasmid was cycle sequenced and run on an Applied Biosystems model 377 automated sequencer (Perkin–Elmer) resulting in 300–600 bases of sequence. The sequences were then used to search nonredundant databases with the blastx program via E-mail (matrix: BLOSUM 62). Scores were considered significant for P < 10⁻³, though most P values were much smaller (i.e., P < 10⁻⁷). The expressed sequence tag (EST) sequences were submitted to the EST division of GenBank with accession numbers AA433202–AA433290. For the cross-hybridization experiments, genomic DNA of the respective species was purified by CsCl centrifugation. The DNA was digested with EcoRI, and 2 μg of D. melanogaster, 4 μg of D. virilis, 8 μg of Musca domestica (housefly), and 10 μg of Tenebrio molitor (large flour beetle) were loaded per lane on a 0.8% agarose gel. The different amounts of DNA roughly reflect the different genome sizes. The gel was blotted and hybridized with the cDNA inserts (obtained by PCR with flanking primers and labeled with 32P by random priming) at 65°C in 5× standard saline citrate (SSC). The filters were washed three times for 15 min at 65°C with 2× SSC and exposed to x-ray film. These conditions represent a hybridization stringency that should allow up to 35% mismatch in a probe with balanced GC content.
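The stated tolerance of about 35% mismatch can be rationalized with a widely used rule of thumb for the melting temperature of long DNA duplexes, Tm = 81.5 + 16.6 log10[Na+] + 0.41(%GC) − 500/L, combined with the approximation that each 1% of mismatch lowers Tm by about 1°C. The sketch below is an illustration under those assumptions and is not taken from the original methods. The probe length of 500 bp, the GC content of 50%, and the sodium concentration of 0.195 M for 1× SSC are assumed values, and the empirical formula is being stretched at salt concentrations near 1 M.

```python
import math

# Assumed Na+ concentration of 1x SSC (0.15 M NaCl + 0.015 M trisodium citrate).
NA_PER_1X_SSC = 0.195  # mol/l

def duplex_tm(ssc_fold, gc_percent=50.0, length_bp=500):
    """Rule-of-thumb melting temperature (deg C) of a long DNA duplex."""
    na = NA_PER_1X_SSC * ssc_fold
    return 81.5 + 16.6 * math.log10(na) + 0.41 * gc_percent - 500.0 / length_bp

def tolerated_mismatch(ssc_fold, temp_c, gc_percent=50.0, length_bp=500):
    """Approximate percent mismatch tolerated at temp_c,
    assuming Tm drops by about 1 deg C per 1% mismatch."""
    return max(0.0, duplex_tm(ssc_fold, gc_percent, length_bp) - temp_c)

hybridization = tolerated_mismatch(5, 65)  # 5x SSC, 65 deg C
wash = tolerated_mismatch(2, 65)           # 2x SSC, 65 deg C
print(f"hybridization: ~{hybridization:.0f}% mismatch, wash: ~{wash:.0f}% mismatch")
```

With these inputs the hybridization step comes out close to the figure quoted in the text, while the 2× SSC wash is somewhat more stringent. The result shifts by several percentage points with the assumed probe length, GC content, and mismatch penalty, so it should be read as an order-of-magnitude check only.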

D. yakuba cDNA Clones.

To construct the D. yakuba embryonic cDNA library, poly(A+) RNA was isolated from total RNA of 0–14 hr of embryonic development with the PolyATract System (Promega), and the library was constructed with the λ ZAPII cDNA synthesis and library construction kit (Stratagene). The phages were plated and transferred onto nylon membranes. Filters were hybridized with a 32P-labeled cDNA fragment of D. melanogaster under the same conditions as above. cDNAs were completely sequenced using a shotgun procedure in combination with custom designed primers. Alignment was usually unequivocal, and the number of synonymous and nonsynonymous substitutions per site was estimated according to Comeron (16).
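To make the quantities Ka and Ks concrete, the following sketch counts synonymous and nonsynonymous differences between two aligned coding sequences. It is a deliberately simplified counting scheme in the style of Nei and Gojobori, not the method of Comeron (16) used in this study. It assumes gap-free, in-frame, uppercase sequences, skips codons that differ at more than one position, applies a Jukes–Cantor correction, and has no guard for saturated or empty comparisons. All function names are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def syn_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        for b in BASES:
            if b != codon[pos]:
                mutant = codon[:pos] + b + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    s += 1.0 / 3.0
    return s

def jc(p):
    """Jukes-Cantor correction of a proportion of differences (p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def ka_ks(seq1, seq2):
    """Return (Ka, Ks, Nd, Sd) for two aligned, gap-free coding sequences."""
    codons1 = [seq1[i:i + 3] for i in range(0, len(seq1) - 2, 3)]
    codons2 = [seq2[i:i + 3] for i in range(0, len(seq2) - 2, 3)]
    syn_total = sd = nd = 0.0
    used = 0
    for c1, c2 in zip(codons1, codons2):
        diffs = sum(a != b for a, b in zip(c1, c2))
        if diffs > 1:
            continue  # simplification: ignore codons with multiple differences
        used += 1
        syn_total += (syn_sites(c1) + syn_sites(c2)) / 2.0
        if diffs == 1:
            if CODE[c1] == CODE[c2]:
                sd += 1  # synonymous difference
            else:
                nd += 1  # nonsynonymous (replacement) difference
    nonsyn_total = 3 * used - syn_total
    return jc(nd / nonsyn_total), jc(sd / syn_total), nd, sd

# Toy example with made-up sequences (not data from this study).
ka, ks, nd, sd = ka_ks("TTTGGTAAA", "TTCGGTAGA")
print(f"Ka = {ka:.3f}, Ks = {ks:.3f}, Ka/Ks = {ka / ks:.2f}")
```

In the toy example one synonymous (TTT to TTC) and one replacement (AAA to AGA) difference are counted. Because there are far fewer synonymous than nonsynonymous sites, the same number of differences translates into a much higher Ks than Ka.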

Polymorphism Analysis.

Genomic DNA from single flies of a collection of isofemale lines from throughout the world (kindly provided by M. Kidwell, University of Arizona) was prepared using a proteinase K/SDS and phenol/chloroform purification procedure. An 868-bp segment of the reading frame of clone 1G5 was amplified with primers 1G5-PR3 [5′-AAGTATCTAGCCGA(CT)GAGGAC-3′] and 1G5-PR4 (5′-TACCCAGCTCTCATTCATCTC-3′) in a 20-μl reaction volume with 1 pmol of each primer, 200 pmol dNTP, 1× Taq buffer, and 0.5 unit of AmpliTaq DNA polymerase (Perkin–Elmer). The DNA was gel-purified and directly sequenced with the amplification primers and internal primers on an Applied Biosystems model 377 sequencer. Each base was sequenced from both directions. Accession numbers are AF005865–AF005881.
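Primer 1G5-PR3 contains one degenerate position, written as (CT), so the synthesized oligonucleotide is a mixture of two sequences. The short sketch below is an illustration only and shows how such a notation can be expanded and how the stated primer amount translates into a concentration. The helper names are hypothetical.

```python
import re
from itertools import product

def expand_degenerate(primer):
    """Expand parenthesized alternatives, e.g. 'GA(CT)G' -> ['GACG', 'GATG']."""
    # re.split with a capture group alternates: fixed text, alternatives, fixed text, ...
    parts = re.split(r"\(([ACGT]+)\)", primer)
    choices = [list(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]

def concentration_nM(amount_pmol, volume_ul):
    """Concentration in nmol/l from pmol and microliters (1 pmol/ul = 1000 nM)."""
    return amount_pmol / volume_ul * 1000.0

PRIMER_1G5_PR3 = "AAGTATCTAGCCGA(CT)GAGGAC"  # sequence as given in the text
variants = expand_degenerate(PRIMER_1G5_PR3)
for v in variants:
    print(v, len(v))
print(f"primer concentration: {concentration_nM(1, 20):.0f} nM")
```

The expansion yields two 21-mers that differ only at the degenerate position, and 1 pmol of primer in a 20-μl reaction corresponds to about 50 nM.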

In Situ Hybridization.

Digoxigenin-labeled DNA or RNA probes were produced by random priming or in vitro transcription from plasmids containing the cDNA insert according to manufacturer’s protocols (Boehringer Mannheim). In situ hybridization was done essentially as described (17).


Evolutionary Conservation of Random cDNA Clones.

One hundred and five nonredundant cDNA clones were randomly isolated from an embryonic D. melanogaster cDNA library and analyzed in two ways. First, 300–600 bp were sequenced from the 5′ end and the sequences were compared with sequence databases to allow the identification of already known genes and genes with clear homologs. Second, low-stringency hybridization to genomic DNA from a panel of insect species was employed to analyze the degree of conservation of the genes. The most closely related species in this panel was D. virilis (split around 40–60 million years ago), the next one the housefly M. domestica (split around 100–120 million years ago) and finally the beetle T. molitor (split around 250–270 million years ago). The results of these experiments are summarized in Table 1.

Table 1
Summary statistics for the characterization of the random cDNA clones from D. melanogaster (n = 105)

The database searches with the partial sequences revealed that 60% showed no matches with known genes, which is similar to the published results of large-scale expressed sequence tag sequencing projects with Caenorhabditis elegans, Arabidopsis, and humans (7–9). The cross-hybridization study showed that more than half of the clones (53%) hybridize with D. melanogaster DNA only, 31% hybridize in both Drosophilids, and only 10% in all four species. Among the latter are well-known conserved genes that also had matches in the databases, thus confirming the utility of the cross-hybridization approach.

The proportion of sequences that hybridize to D. melanogaster DNA, but not to D. virilis DNA, is surprisingly large. There are in fact potential sources of error in this approach, most notably the possibility of including incomplete cDNA clones containing only untranslated 3′ ends. We found, however, that only 4 of the 26 already known sequences from D. melanogaster represented 3′ ends only and more than half of them were full-length cDNAs. Thus, though some correction is necessary, we estimate that well over one-third of the randomly chosen cDNAs are derived from fast evolving genes according to this assay.
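The size of this correction can be illustrated with simple arithmetic. The sketch below is not the authors' calculation. It takes the two figures given in the text (53% of clones hybridizing to D. melanogaster only, and 4 of 26 known sequences being 3′ ends only) and applies two alternative, assumed ways of discounting the 3′-end-only clones. The function names are hypothetical.

```python
FRACTION_MEL_ONLY = 0.53   # clones hybridizing to D. melanogaster DNA only
THREE_PRIME_ONLY = 4 / 26  # known sequences that represented 3' ends only

def conservative_estimate(mel_only, utr_only):
    """Assume every 3'-end-only clone fell into the melanogaster-only class."""
    return mel_only - utr_only

def proportional_estimate(mel_only, utr_only):
    """Assume 3'-end-only clones are spread evenly over all classes."""
    return mel_only * (1.0 - utr_only)

low = conservative_estimate(FRACTION_MEL_ONLY, THREE_PRIME_ONLY)
high = proportional_estimate(FRACTION_MEL_ONLY, THREE_PRIME_ONLY)
print(f"fast evolving fraction: about {low:.0%} to {high:.0%}")
```

Both ways of discounting leave more than one-third of the clones in the fast evolving class, which agrees with the estimate given in the text.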

We have also analyzed the spatio-temporal expression patterns of all sequences in this study by whole-mount in situ hybridization. The results show that the fast evolving genes do not significantly differ in their expression characteristics from the slow evolving ones (Table 2), indicating that they constitute a representative sample of all genes.

Table 2
Comparison of expression patterns

Comparison of Rapidly Evolving cDNA Clones Between D. melanogaster and D. yakuba.

To analyze the fraction of fast evolving sequences in detail and to estimate their amino acid substitution rates, we compared them to their homologs from the closely related species, D. yakuba. The cDNAs of 10 fast evolving and 1 highly conserved clone were recovered from a D. yakuba library, and the clones from both species were sequenced completely. Nine pairs of clones, including the highly conserved one, contained a homologous ORF in both species and are thus likely to encode functional proteins. This also shows that the above described hybridization results are not simply due to the inclusion of a high proportion of noncoding cDNA fragments. However, in two pairs of clones no homologous ORF could be identified and they were omitted from the further analysis. In subsequent database searches with the complete cDNA sequences, the highly conserved clone and three of the fast evolving clones gave significant matches. The conserved clone 2A12 matches with kinesin-like proteins (best match with CHO1; GenBank accession no. X83575; blast score, 472; P = 2 × 10⁻⁶⁵), clone 1E9 with zinc-finger proteins (best match with Xenopus Znc6; Protein Identification Resource accession no. PC1144; blast score, 72; P = 8.8 × 10⁻⁹), and clone 2D9 with LIM domain proteins (best match with sunflower LIM domain protein SF3; Protein Identification Resource accession no. S37656; blast score, 110; P = 1.4 × 10⁻⁷), and 2A5 shows a weak similarity to the yeast ATP11 precursor protein (SWISS-PROT accession no. P32453; blast score, 89; P = 0.00015). The matches of 1E9 and 2D9 concern only the respective protein modules, but not sequences outside of them. The other six clones did not yield any significant matches in the database.

Table 3 lists the number of synonymous (Ks) and nonsynonymous (Ka) nucleotide substitutions per site for these cDNA clones and other genes of which complete sequences from D. melanogaster and D. yakuba are available. The highest proportions of replacement substitutions are found among four genes that were identified in our screen. In each case, the values are only about 2-fold lower than the corresponding numbers of synonymous substitutions of the same genes. Moreover, they are only about 3-fold lower than the value of 0.32 substitutions per site that was found for the neutrally evolving parts of the rDNA internal transcribed spacer regions from this species pair (15).

Table 3
Comparison of nonsynonymous (Ka) and synonymous (Ks) substitutions per site of different genes from D. melanogaster and D. yakuba

To ensure that the fastest evolving genes are not paralogs of duplicated gene pairs, we have done an additional Southern hybridization analysis with D. yakuba DNA. Only one hybridizing band was found in most cases, suggesting that the genes are indeed unique. Additional evidence that paralogs are not a problem comes from the fact that even though multiple cDNA clones were recovered for each of the genes from D. yakuba, they could all be allocated to the same gene by sequence comparison with the longest clone. Finally, the fact that the synonymous rates are nearly the same across all the genes compared between D. melanogaster and D. yakuba suggests also that these are not paralogs with a substantially longer evolutionary divergence time than the split of the two species.

Population Polymorphism.

Differences in the amino acid sequence of homologous proteins from different species may be due to fixation of neutral substitutions by random drift or to fixation of adaptive substitutions by natural selection. The McDonald–Kreitman approach (18) can be used to differentiate between the two processes. It is based on the prediction of the neutral evolutionary theory that genes with a high number of substitutions between species should also show a high degree of within-species polymorphism. We have tested this for the fastest evolving clone in our sample. Most of the coding region of clone 1G5 was amplified and sequenced from 13 strains of D. melanogaster and 4 strains of the sibling species Drosophila simulans. A large number of polymorphic sites were found in both species (Fig. 1), many more than in comparable studies for other genes (18–20). Two lines of evidence suggest that the polymorphisms found reflect a mutation-drift equilibrium in a neutrally or near neutrally evolving sequence. First, there are more polymorphic replacement sites than synonymous sites, in agreement with the fact that there are also more potential replacement sites in the region. Second, the fraction of polymorphic sites in the short intron is comparable to the fraction of polymorphic sites in the coding region (1.6% vs. 3.9%; G = 0.85, P > 0.3). This suggests that almost none of the amino acid positions may be under strong selective constraint. A comparison between fixed and polymorphic sites between the two species also shows no significant deviation from the assumption of neutral evolution in this region (Table 4). Thus, both the within- and between-species parameters indicate that this coding region evolves with a near neutral rate and under apparently neutral conditions.
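The comparisons in this paragraph rest on G tests of 2×2 contingency tables, as does the McDonald–Kreitman test (fixed vs. polymorphic differences against replacement vs. synonymous changes). The sketch below shows how such a test can be computed with the Python standard library. It is an illustration only: no continuity correction is applied (whether the original analysis used one is not stated here), and the counts in the usage example are hypothetical placeholders, not the data of Table 4.

```python
import math

def g_statistic(table):
    """G statistic for a 2x2 table [[a, b], [c, d]] of counts."""
    rows = [sum(r) for r in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(rows)
    g = 0.0
    for i in range(2):
        for j in range(2):
            obs = table[i][j]
            if obs > 0:  # zero cells contribute nothing to the sum
                g += obs * math.log(obs * n / (rows[i] * cols[j]))
    return 2.0 * g

def p_value_1df(g):
    """Upper-tail chi-square probability for 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g, 0.0) / 2.0))

# The G value reported in the text for intron vs. coding polymorphism.
print(f"G = 0.85 gives P = {p_value_1df(0.85):.2f}")

# Hypothetical McDonald-Kreitman table (placeholder counts, not from this study):
# rows = fixed, polymorphic; columns = replacement, synonymous.
mk_table = [[12, 8], [20, 15]]
g = g_statistic(mk_table)
print(f"G = {g:.2f}, P = {p_value_1df(g):.2f}")
```

Applied to the reported G = 0.85, the function returns a probability above 0.3, consistent with the statement in the text. A nonsignificant result in a McDonald–Kreitman table means that the ratio of replacement to synonymous changes does not differ detectably between fixed and polymorphic sites, which is the neutral expectation.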

Figure 1
Polymorphic sites of 1G5 in different populations of D. melanogaster and D. simulans. Dots represent identical nucleotides with respect to the top sequences. The region surveyed includes bases 58–922 of clone 1G5 (total length of coding region ...
Table 4
Polymorphisms found for sequence 1G5 in D. melanogaster and D. simulans


Our results have general implications for how one can envisage the evolution of genes and genomes. Some scenarios have suggested that there may only be a limited number of stable protein folds or domains that were already present early in the evolution of life and that were used over and over again in different combinations to make up most of the genes existing in today's species (4–6). Our results would suggest that there is a substantial fraction of coding sequences that does not follow such a pattern. The high amino acid replacement rates found in these sequences raise the question of whether these proteins might be able to assume many different stable protein folds that can easily be changed during evolution. On the other hand, it is well known that conservation of folding structure does not necessarily require the conservation of the primary sequence (21). These fast-evolving genes will therefore be a test case of whether fold structures are generally more conserved than primary sequences, or whether evolution is indeed able to explore many different new folds. In any case, our results suggest that the failure to find homologs for a substantial proportion of genes in genome comparisons may not imply that these code for novel proteins, but might simply indicate that these are fast evolving.

A previous study investigating the general conservation of cDNAs between Drosophilids by DNA–DNA hybridization has suggested that cDNAs are not evolving as fast as single-copy noncoding DNA and that there might even be differences between embryonic and adult messages in this respect (22). Furthermore, we note that most genes analyzed so far from Drosophila species tend to be more conserved than the fast-evolving genes characterized here. However, there are exceptions such as transformer (11), period (23) and Acp-26Aa (24), all three of which may be directly involved in sexual selection or species-specific differentiation. Another interesting case is the gene spalt adjacent. It shows a high evolutionary rate between closely related species (12) and does not seem to have a function, at least not under laboratory conditions. Neither an allele with a stop codon at the beginning of the reading frame nor artificial over-expression leads to any recognizable phenotype (13). Still, an ORF is retained between the different species analyzed (12), indicating that it is not a pseudogene. It remains to be shown whether the apparent lack of an overt function is typical for most fast evolving genes, but this would at least explain why this class of genes has not been found more often so far. Lack of a phenotype under laboratory conditions does not imply that a gene lacks a function in the wild. Genes providing partially redundant functions (25) or genes involved in local adaptation, parasite defense or speciation might have subtle adult phenotypic effects when mutant and would thus have escaped the genetic screening regimes that have been employed so far.


We thank Margaret Kidwell for providing wild-type strains, Charles N. David for enthusiastic encouragement, and Charles N. David, Michael Ashburner, Svante Pääbo, Loredana Nigro, and Markus Friedrich for comments on the manuscript. This work was supported by a Gerhard Hess Preis from the Deutsche Forschungsgemeinschaft to D.T. and by a Ph.D. studentship from the Graduiertenkolleg “Zelluläre und molekulare Aspekte der Entwicklung” to K.S.


1. King M C, Wilson A. Science. 1975;188:107–116. [PubMed]
2. Carroll S B. Nature (London) 1995;376:479–485. [PubMed]
3. Palopoli M F, Patel N. Curr Opin Genet Dev. 1996;6:502–508. [PubMed]
4. Ohno S. Evolution by Gene Duplication. Berlin: Springer; 1970.
5. Dorit R L, Schoenbach L, Gilbert W. Science. 1990;250:1377–1382. [PubMed]
6. Chotia C. Nature (London) 1992;357:543–544. [PubMed]
7. Waterston R, Martin C, Craxton M, Huynh C, Coulson A, Hillier L, Durbin R, Green P, Shownkeen R, Halloran N, Metzstein M, Hawkins T, Wilson R, Berks M, Du Z, Thomas K, Thierry-Mieg J, Sulston J. Nat Genet. 1992;1:114–123. [PubMed]
8. Adams M D, Kerlavage A R, Fleischmann R D, Fuldner R A, Bult C J, et al. Nature (London) 1995;377(Suppl.):3–174. [PubMed]
9. Dujon P. Trends Genet. 1996;12:263–270. [PubMed]
10. Akam M, Averof M, Castelli-Gair J, Dawes R, Falciani F, Ferrier D. Development (Cambridge, UK) Suppl. 1994;1994:209–215. [PubMed]
11. O’Neil M T, Belote J M. Genetics. 1992;131:113–128. [PMC free article] [PubMed]
12. Reuter D, Schuh R, Jäckle H. Proc Natl Acad Sci USA. 1989;86:5483–5486. [PMC free article] [PubMed]
13. Reuter D, Kühnlein R P, Frommer G, Barrio R, Kafatos F C, Jäckle H. Chromosoma. 1996;104:445–454. [PubMed]
14. Sharp P M, Li W-H. J Mol Evol. 1989;28:398–402. [PubMed]
15. Schlötterer C, Hauser M-T, von Haeseler A, Tautz D. Mol Biol Evol. 1994;11:513–522. [PubMed]
16. Comeron J M. J Mol Evol. 1995;41:1152–1159. [PubMed]
17. Tautz D, Pfeifle C. Chromosoma. 1989;98:81–85. [PubMed]
18. McDonald J H, Kreitman M. Nature (London) 1991;351:652–654. [PubMed]
19. Eanes W F, Kirchner M, Yoon J. Proc Natl Acad Sci USA. 1993;90:7475–7479. [PMC free article] [PubMed]
20. Hey J, Kliman R M. Mol Biol Evol. 1993;10:804–822. [PubMed]
21. Sander C, Schneider R. Proteins. 1991;9:56–68. [PubMed]
22. Powell J R, Caccone A, Gleason J M, Nigro L. Genetics. 1993;133:291–298. [PMC free article] [PubMed]
23. Thackeray J R, Kyriacou C. J Mol Evol. 1990;31:389–401. [PubMed]
24. Tsaur S-C, Wu C-I. Mol Biol Evol. 1997;14:544–549. [PubMed]
25. Tautz D. BioEssays. 1992;14:263–266. [PubMed]

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences