Proc Natl Acad Sci U S A. 1998 May 26; 95(11): 5849–5856.
Colloquium Paper

Measuring genome evolution


The determination of complete genome sequences provides us with an opportunity to describe and analyze evolution at the comprehensive level of genomes. Here we compare nine genomes with respect to their protein coding genes at two levels: (i) we compare genomes as “bags of genes” and measure the fraction of orthologs shared between genomes and (ii) we quantify correlations between genes with respect to their relative positions in genomes. Distances between the genomes are related to their divergence times, measured as the number of amino acid substitutions per site in a set of 34 orthologous genes that are shared among all the genomes compared. We establish a hierarchy of rates at which genomes have changed during evolution. Protein sequence identity is the most conserved, followed by the complement of genes within the genome. Next is the degree of conservation of the order of genes, whereas gene regulation appears to evolve at the highest rate. Finally, we show that some genomes are more highly organized than others: they show a higher degree of the clustering of genes that have orthologs in other genomes.

Keywords: ortholog, synteny, computer analysis, horizontal gene transfer

Molecular evolution usually is studied at the level of single genes. With the determination of genome sequences we have an opportunity to study it at a higher, comprehensive level, that of complete genomes. This leads to the pertinent question: how can genome sequences be used to obtain useful information concerning genome evolution? The goal of this paper is to create baseline expectations for measures of genome distances that are based on gene content. By describing some general patterns one also can identify the exceptions. Measuring evolution at the level of complete genomes is pertinent as it is, after all, the principal level for natural selection. Furthermore, it is intermediate to levels at which evolution has long been studied: namely, the molecular level in genes and genotypes, and the organismal level in the fossil record. The genome in principle contains all of the information necessary to bridge the gap between genotype and phenotype. For example, by knowing the functions of the genes in a genome of a species we can postulate a model for its complete metabolism. However, we have to be careful not to overstate our expectations. The situation might turn out to be analogous to that of proteins, for which, in principle, all information necessary to determine three-dimensional structures is known in the form of amino acid sequences, yet we remain unable to predict their tertiary structures.

Genomes can be analyzed and compared for various features: e.g., nucleotide content, compositional biases of leading and lagging strands in replication (e.g., in Escherichia coli) (1), dinucleotide frequencies (2), the occurrence of repeats (e.g., in virulence genes of Haemophilus influenzae; ref. 3), RNA structures, coding densities, protein coding genes, operons, the size distribution of gene families (4), etc. They also can be compared at a variety of levels: a first-order level where we regard the genome as a “bag of genes” without taking account of interactions between the various components, and a second-order level that considers whether properties of genomes are cross-correlated (e.g., the absence of certain polynucleotides together with the presence of restriction enzymes that specifically cut these polynucleotides; ref. 5). In this paper we focus on first- and second-order patterns in protein coding regions in genomes. Specifically we measure: (i) the fraction of orthologous sequences between genomes, (ii) the conservation of gene order between genomes, and (iii) the spatial clustering of genes in one genome that have an ortholog in another genome. We correlate these measures with the divergence time between the genomes compared. It is not our goal to define new distance measures to construct phylogenetic trees. Rather it is to analyze the conservation and differentiation of patterns between genomes, to show how we can extract useful information from these, and to analyze at what relative time scales they change. The analyses are done on the first nine sequenced Archaea and Bacteria that were publicly available: H. influenzae (6), Mycoplasma genitalium (7), Synechocystis sp. PCC 6803 (8), Methanococcus jannaschii (9), Mycoplasma pneumoniae (10), E. coli (1), Methanobacterium thermoautotrophicum (11), Helicobacter pylori (12), and Bacillus subtilis (13). 
Although the total number of publicly available genome sequences is growing rapidly, the trends that we observe should remain largely unchanged with the comparison of new species, given the diverse range of evolutionary distances of the species compared in this paper.

Methodological Issues in Comparisons of Genomes

Identification of Orthologous Genes.

Defining orthology. In comparing the genes of different genomes it is important that we avoid comparisons of “apples and pears”: i.e., that we are able to identify which genes correspond to each other in the various genomes. Fitch (14) introduced the term “orthologs” for genes whose independent evolution reflects a speciation event rather than a gene duplication event. “Where the homology is the result of gene duplication so that both copies have descended side by side during the history of an organism, (for example, alpha and beta hemoglobin) the genes should be called paralogous (para = in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example, alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho = exact)” (14). Note that orthology and paralogy are defined only with respect to the phylogeny of the genes and not with respect to function.

Identifying orthology by using relative levels of sequence identity.

Ideally one would expect that the orthologous genes of two genomes are those that have the highest pairwise identity, having bifurcated relatively recently compared with genes that duplicated before the speciation. The most straightforward approach to identifying orthologous genes is to compare all genes in genomes with each other, and then to select pairs of genes with significant pairwise similarities. A pair of sequences with the highest level of identity then is considered orthologous.

Auxiliary information for detection of orthology.

Auxiliary information that is useful to assess orthology is “synteny”: the presence in both genomes of neighboring sequences that are also orthologs of each other. As shown below, there is little conservation of the order of genes in genomes in evolution at a time when divergence of their orthologous genes reaches a level of 50% amino acid identity (see Fig. 3). Hence the potential for using synteny for identifying orthologs is limited mainly to genomes that have speciated only relatively recently. A second type of auxiliary information that can be used is the comparison of genes with those of a third genome. If two genes from different genomes have the highest level of identity both to each other and to a single gene from a third genome, then this is a strong indication that they are orthologs (see ref. 15 for a large-scale implementation of this idea). However, for a large fraction of genes, identifying orthologs by relative sequence identity is hampered by a variety of evolutionary processes. We describe these in the following sections.

Figure 3
Conservation of the order of genes within the genome. Shown are the number of genes that are orthologs in both genomes, and that have at least one neighboring gene that is the same ortholog in both genomes, divided by the total number of shared orthologs ...

Sequence divergence.

At large evolutionary distances, e.g., between Archaea and Bacteria, sequence similarities may be eroded to such an extent that the distance between orthologous sequences is similar to that between sequences that are merely part of the same gene family. More dramatically, homologous sequences can diverge “beyond recognition,” such that the similarity between two orthologs is not higher than the similarity between sequences that are not part of the same gene family and automatic procedures for the recognition of homology fail. A recent survey of genes in Drosophila shows that one-third of the cDNAs code for very fast evolving genes, for which the frequency of amino acid substituting mutations is only 2-fold lower than that of silent mutations, leading to a situation where homologous proteins are barely recognizable after 8,000 years of evolution (16).

Nonorthologous gene displacement.

A second event problematic to ortholog identification is nonorthologous gene displacement. This occurs when two nonorthologous genes that are unrelated or only remotely related perform the same function in two organisms (17). This occurs relatively frequently: a comparison of M. genitalium to H. influenzae revealed 12 clear-cut cases (17). As a consequence orthologs may not be detectable (or are classified as paralogs) in another organism even when the corresponding function is retained.

Gene duplication, gene loss, and horizontal gene transfer.

A third process that restricts the identification of orthologous genes is that of gene loss in combination with gene duplication. If two genomes lose different paralogs of an ancestral gene that was duplicated before the speciation event, the remaining genes have highest sequence identity even though they are not orthologs (18). One may test for such an event by checking whether the protein similarity falls into an expected range. This is done implicitly by including (presumably orthologous) sequences from other species in the phylogeny and checking whether the gene tree is in accordance with the species tree (18, 19). Inconsistencies between the species tree and the gene tree can indicate nonorthologous relationships between genes. However, they also can be caused by horizontal gene transfer, in which case the genes still could be orthologs. In general, the problems of identifying orthologous sequences, horizontal gene transfer, and ancient gene duplications cannot be separated from one another. Besides the construction of phylogenetic trees, an additional strategy for finding horizontal gene transfer is the comparison of nucleotide frequencies within a genome. Recently transferred genes often display nucleotide frequencies that deviate significantly from the rest of the genome (20, 21). A conservative estimate of the number of genes that recently have been transferred to E. coli, based on nucleotide frequencies and dinucleotide frequencies in genomes, is 10–15% of the E. coli genome (Phil Green, personal communication; ref. 21). A third strategy for finding horizontal gene transfer is synteny. Because gene order is rarely conserved in evolution, the presence in two distant evolutionary branches of the same order of genes, combined with the absence of this gene order in other more closely related branches, can point to horizontal gene transfer. This strategy has been used successfully to find the example of horizontal gene transfer described in Fig. 1.
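The nucleotide-frequency strategy can be illustrated with a deliberately simple stand-in. The sketch below flags genes whose G+C content deviates strongly from the genome-wide mean; it is not the method of refs. 20 and 21, which is not specified here. The function names, the use of G+C content alone, and the z-score cutoff of 2.0 are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else float("nan")

def atypical_genes(genes, z_cut=2.0):
    """Return names of genes whose G+C content lies more than z_cut
    standard deviations from the mean over all genes (hypothetical cutoff).
    `genes` maps a gene name to its nucleotide sequence."""
    gcs = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gcs.values())
    sd = pstdev(gcs.values())
    if sd == 0:
        return []  # all genes have identical composition
    return sorted(name for name, gc in gcs.items() if abs(gc - mu) / sd > z_cut)
```

A flagged gene is only a candidate: atypical composition can have causes other than recent transfer, and transferred genes gradually take on the composition of the host genome.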

Figure 1
An example of complexities in assigning orthology to multidomain proteins. The M. thermoautotrophicum genes MTH444 (a sensory transduction histidine kinase) and MTH445 (a sensory transduction regulatory protein) are orthologs of the Synechocystis ...

Orthology in multidomain proteins.

In multidomain proteins two levels of orthology can be distinguished: one is at the level of single domains, a second at the level of the whole protein. This may lead to situations where nonorthologous proteins possess orthologous domains. Modularity of genes in the sense that modules can have different positions, but the same function, in various proteins, is not well documented in Bacteria and Archaea. A first step toward modularity, the presence of “gene fusion” or “gene splitting,” however, does occur regularly. Comparative analysis of the genomes of H. influenzae and E. coli showed 10 clear-cut cases of genes that were separate in E. coli but part of a single gene in H. influenzae, and 24 clear-cut cases of genes that were separate in H. influenzae but part of a single gene in E. coli (unpublished data).

A much more complicated scenario, for which many of the factors described above (multidomain proteins, synteny, and horizontal gene transfer) are involved, is shown in Fig. 1. In general, a combination of the various evolutionary processes described above leads to a situation where, although orthology was defined originally as a one-to-one relationship between proteins, it must be considered a many-to-many relationship.

From homologs to orthologs.

The advent of powerful, easy-to-use tools, such as PSI-BLAST (22), to find homologous sequences is likely to shift the emphasis in sequence analysis from predicting homology to predicting orthology. It is clear that, at present, there is not a single, simple, and perfect solution to the question of orthology. Orthology is methodologically defined: depending on what is asked of the genomes that are compared, different methods to find orthologous genes are used. We use a minimal definition when we are interested only in the number of orthologs shared between genomes at various phylogenetic distances. Orthologs then are defined in the following manner: (i) they have the highest level of pairwise identity when compared with the identities of either gene to all other genes in the other’s genome; (ii) the pairwise identity is significant (E, the expected fraction of false positives, is smaller than 0.01); and (iii) the similarity extends to at least 60% of one of the genes. The region of similarity is not required to cover the majority of both genes, to include the possibility of gene fusion and gene splitting. In more detailed comparisons between a small number of genomes, auxiliary information was used to determine orthology, such as the order of genes and the comparison to genes from a third genome (see legend to Fig. 1).
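The minimal definition above can be sketched as a reciprocal best-hit filter. This is an illustration under stated assumptions, not the authors' implementation: the `Hit` record and its field names are hypothetical, the significance and coverage filters are applied before ranking by identity, and ties are broken by keeping the first hit seen.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    query: str        # gene in the query genome
    subject: str      # gene in the other genome
    identity: float   # fraction of identical residues in the alignment
    evalue: float     # E value of the hit
    aln_len: int      # number of aligned residues
    query_len: int    # length of the query protein
    subject_len: int  # length of the subject protein

def passes_filters(hit, max_evalue=0.01, min_cover=0.6):
    """Criteria (ii) and (iii): significant, and covering at least 60%
    of at least one of the two genes."""
    cover = max(hit.aln_len / hit.query_len, hit.aln_len / hit.subject_len)
    return hit.evalue < max_evalue and cover >= min_cover

def best_hits(hits):
    """For each query gene, the subject with the highest identity."""
    best = {}
    for hit in hits:
        if not passes_filters(hit):
            continue
        if hit.query not in best or hit.identity > best[hit.query].identity:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Criterion (i): pairs that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Taking the larger of the two coverage values mirrors criterion (iii), which requires the similarity to span 60% of only one gene so that fused and split genes are not excluded.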

Given all of these complications in the finding of orthologs and the oversimplified view of evolution that the term suggests, one could conclude that it is better not to use it at all, or only in those cases where one does not have conflicting information from various sources about the phylogeny of the genes. One also can argue that it is exactly these cases where there are conflicts in the information about orthology from different sources that evolution shows some of its most interesting aspects. Orthology is an important refinement over homology in describing the phylogenetic relations between genes, as long as one always keeps in mind the caveats described above and as long as the methods for determining orthology are well defined.

Timing Genome Divergence.

To compare the rates at which various properties of genomes change, a central reference for the divergence between genomes is required. Measurement of the divergence times between the three “domains” (Archaea, Bacteria, and Eukarya) on the basis of protein dissimilarities recently has gained considerable attention and has been the subject of some controversy (see ref. 23 and references therein). The estimates of the date of the last common ancestor vary from 2 billion (24) to 3–4 billion years ago (23). The major assumptions in estimating divergence times from distances between protein sequences are: (i) the proteins are of vertical descent; i.e., they have not been horizontally transferred into the genome following the speciation of the species compared; and (ii) the proteins act as a molecular clock, having rates of amino acid substitutions that do not vary over time and between the lineages. Here we use proteins to scale divergence between and within the Archaea and the Bacteria. It is not our intention to estimate absolute divergence times; rather, it is to compare the different relative rates at which genomes evolve. Thus we translate the protein dissimilarities between the species into amino acid substitutions per position per gene, using an equation derived by Grishin (25), which corrects for variations in substitution rates for both amino acids and sites: q = ln(1 + 2d)/(2d), where q is the fraction of identical amino acids between the proteins and d is the number of amino acid substitutions per site. Grishin’s equation recently was used by Doolittle et al. (23) and gives reasonable estimates for the divergence between Bacteria and Archaea.
Stringent criteria were used to select a set of genes that had orthologs in all of the nine genomes compared: (i) each gene had the highest level of identity to at least five of the other genes (relative to other genes in those five genomes, see our minimal definition of orthology above); and (ii) there were no conflicting hits: from each genome only one protein was selected. The resulting set of 34 proteins is surprisingly small. It contains 17 ribosomal proteins, five tRNA synthetases, two signal recognition particles, two proteins with unknown function, and eight metabolic enzymes. Interestingly, the set consists almost exclusively of proteins that interact with RNA or synthesize RNA. In estimating divergence times of the genomes of Archaea and Bacteria it could be useful to check whether the protein similarities follow the phylogenetic tree (23) given the previously recognized ancient horizontal transfer of metabolic enzymes from Bacteria to Archaea (26), and more recent occasions of horizontal gene transfer (Fig. 1). However, because Archaeal genomes are chimeric, they were treated as such by obtaining a central reference for the distance between genomes by averaging over the proteins’ distances, irrespective of their phylogenetic trees. As Grishin’s equation tends to overestimate the number of amino acid substitutions per position for low levels of identities between genes (27), the median of the estimates of the number of amino acid substitutions was used in preference to the mean. The results are used in the following sections.
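Grishin's equation gives q as a function of d, so obtaining d from an observed identity q requires a numerical inversion. A minimal sketch follows, assuming bisection as the root-finding method (the paper does not say how the equation was solved) and using hypothetical function names. The last function takes the median over a set of per-protein estimates, as described above.

```python
import math
from statistics import median

def grishin_identity(d):
    """q = ln(1 + 2d) / (2d); the limit for d -> 0 is 1."""
    if d == 0:
        return 1.0
    return math.log1p(2 * d) / (2 * d)

def grishin_distance(q, iterations=100):
    """Solve q = ln(1 + 2d) / (2d) for d by bisection.
    q is the fraction of identical residues, 0 < q <= 1."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    if q == 1.0:
        return 0.0
    lo, hi = 0.0, 1.0
    while grishin_identity(hi) > q:  # expand until the root is bracketed
        hi *= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if grishin_identity(mid) > q:  # identity too high: d is too small
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def genome_distance(identities):
    """Median of the per-protein distance estimates between two genomes."""
    return median(grishin_distance(q) for q in identities)
```

Because q decreases monotonically with d, bisection is sufficient here. An identity of 50% corresponds to roughly 1.26 substitutions per site under this equation.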

Comparing Genomes as “Bags of Genes”

Shared Orthologous Genes.

The decrease of the number of shared orthologs in time. A straightforward comparison between genomes simply considers genes, and not the correlation between genes: i.e., a genome is regarded as a “bag of genes.” Taking this a step further, we measure how the number of shared orthologs between two genomes decreases with their divergence time (Fig. 2). The results show that the fraction of shared orthologous sequences decreases rapidly in evolution, faster than the level of pairwise identity between the shared orthologs. Although the fraction of shared orthologs between Archaea and Bacteria is less than among the Bacteria, the most dramatic reduction in the fraction of shared orthologs takes place on shorter time scales within the Bacteria and Archaea, when protein identity levels between genomes are still above 50%.

Figure 2
The relationship between genome similarity, measured as the fraction of shared orthologs, and time, measured as the number of amino acid substitutions per protein per position in a set of 34 orthologs. + shows the fraction of sequences in a ...

Non-tree-like aspects of the evolution of gene content. Even over large evolutionary distances such as those between Archaea and Bacteria, different pairs of genomes share different orthologs. For example, M. genitalium shares different orthologs with M. jannaschii than with M. thermoautotrophicum (see legend to Fig. 2). This demonstrates a non-tree-like aspect of the evolution of the gene content of genomes: phylogenetically closely related species do not necessarily share the orthologous genes that either of them shares with a phylogenetically distant species.

Differential Genome Analysis.

Pairwise genome comparison. Instead of focusing on genomes’ similarities one can focus on their dissimilarities; i.e., “differential genome analysis” (28). Such analysis can be particularly revealing if the genomes are closely related but have different phenotypes, in which case one can identify the genetic basis for their differences. For example, of the genes in the pathogen H. influenzae that do not have a homolog in the relatively benign E. coli, a large fraction (60%) are potentially involved in H. influenzae’s pathogenesis (28). These genes encode proteins that are located on the surface of the cell, are involved in the production of toxins, are virulence factors, or are homologous to proteins present only in pathogenic species. By contrast, of the proteins in H. influenzae that do have an ortholog in E. coli only an estimated 12% can be considered host interaction factors.

Multiple genome comparison.

Differential genome analysis can be extended to multiple genomes. One then can analyze the correlation between shared gene content and shared phenotypic features of the species compared. This is demonstrated in a comparison of the two pathogens H. influenzae and H. pylori with E. coli. H. influenzae and H. pylori share 17 orthologs that do not have a homolog in E. coli. Of these, a large fraction (12) are related to pathogenicity (unpublished data). Differential genome analysis also can be used to select genes responsible for other differences in phenotypes, e.g., metabolism. The main requirement is that the genomes are sufficiently close in evolution that the identification of orthologs is reliable and that the differences in genome content reflect mainly the phenotypic feature that one is interested in.
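The multiple-genome comparison amounts to a set difference. As a minimal sketch, assume the ortholog pairs between the two pathogens and the set of genes with any homolog in the reference genome have already been computed; the function and argument names are hypothetical.

```python
def differential_orthologs(ortholog_pairs, has_homolog_in_reference):
    """Keep ortholog pairs (gene_a, gene_b) shared by two genomes for which
    neither member has a homolog in the reference genome.
    `has_homolog_in_reference` is a set of gene identifiers."""
    return [
        (a, b)
        for a, b in ortholog_pairs
        if a not in has_homolog_in_reference
        and b not in has_homolog_in_reference
    ]
```

The resulting candidates still need functional annotation, as in the example above, to judge whether they relate to the shared phenotype.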

Measuring Correlations Between Genes

Conservation of the Spatial Association of Genes.

Quantification of the differentiation of gene order. Synteny, the conservation of the order of genes, has been extensively studied already. Although some conservation of the order of genes in genomes has been reported (29, 30), the emphasis has been on the drastic rearrangement of gene order in evolution (31–33). The evolution of the spatial organization of the genome is being studied for three reasons: (i) To calibrate the rate at which it evolves. (ii) To study the genome organization of the last common ancestor (34). Shared gene order between the Archaea and the Bacteria is assumed to date back to their last common ancestor, with the exception of horizontal gene transfer (Fig. 1). (iii) To estimate the time scale at which gene regulation changes during evolution. The spatial association of genes is related to their regulation, e.g., in the case of operons.

The conservation of gene order was related to genome divergence time (Fig. 3). The results show a drastic rearrangement of genomes within the first time unit, during which protein identity levels remain above 50%, after which a saturation level is reached. Notice that the order of orthologous genes is less preserved than their presence (compare with Fig. 2). At the divergence time at which the saturation level is reached, the genes that are still paired are in general subunits of proteins, ribosomal proteins or proteins involved in ABC transport. A detailed examination (T. Dandekar, M.A.H., and P.B., unpublished data) of all conserved pairs of proteins in three Gram-negative bacteria (E. coli, H. influenzae, and H. pylori) and in three Archaea (M. thermoautotrophicum, M. jannaschii, and A. fulgidus) has shown that, for nearly all cases, there is experimental evidence for direct physical interaction between these proteins (see also ref. 31). As mentioned previously, this observation has implications for the study of horizontal gene transfer. Synteny between phylogenetically distant species of genes for proteins that do not show physical interaction indicates recent horizontal gene transfer events.
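The measure plotted in Fig. 3 can be sketched from its legend: the number of shared orthologs that have at least one neighbor whose ortholog is also a neighbor in the other genome, divided by the total number of shared orthologs. The code below is an illustration under assumptions that the text does not settle: "neighbor" means an immediately adjacent gene in the full gene order, the chromosome is treated as circular, strand is ignored, and the ortholog map is one-to-one.

```python
def neighbors(order, circular=True):
    """Map each gene to the set of genes immediately adjacent to it."""
    n = len(order)
    result = {}
    for i, gene in enumerate(order):
        adjacent = set()
        for j in (i - 1, i + 1):
            if circular:
                adjacent.add(order[j % n])
            elif 0 <= j < n:
                adjacent.add(order[j])
        adjacent.discard(gene)  # guards against single-gene genomes
        result[gene] = adjacent
    return result

def conserved_order_fraction(order_x, order_y, orthologs):
    """Fraction of shared orthologs with a conserved neighbor.
    `orthologs` maps genes of genome X to their orthologs in genome Y."""
    if not orthologs:
        return 0.0
    near_x = neighbors(order_x)
    near_y = neighbors(order_y)
    conserved = 0
    for gene_x, gene_y in orthologs.items():
        # Orthologs (in Y) of the genes that neighbor gene_x in X
        mapped = {orthologs[g] for g in near_x[gene_x] if g in orthologs}
        if mapped & near_y[gene_y]:
            conserved += 1
    return conserved / len(orthologs)
```

For example, with X ordered x1..x5, Y ordered y2, y1, y9, y4, y8, y3, and orthologs x1–y1, x2–y2, x3–y3, x4–y4, three of the four shared orthologs keep a neighbor, giving 0.75.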

Gene order and operons.

Given the widely accepted concept of the operon, it is perhaps surprising that there is so little conservation of gene order. Why the gene order that is conserved only concerns proteins that show physical interaction might be explained by Fisher’s model of gene clustering (35). Fisher argued that the linkage between genes of proteins that function well together will tend to increase, to prevent the separation of a co-adapted pair of alleles by recombination.

It is clear that operons do not consist only of genes for proteins that show physical interaction (reviewed in ref. 36). However, what is conserved of operons over large time scales seems indeed to concur with Fisher’s hypothesis. A theory that explains the rearrangement of operons has to include an explanation for the existence of operons. The overall rearrangement of operons does not support any theory that is based on functional relationships of the proteins coded by the genes in the operon, unless one specifically can show that functional relationships of the genes change over the time scales on which we observe the rearrangement of operons. The recently proposed theory of “selfish operons” proposes that operons exist because they increase the probability that genes that function together are transferred together in horizontal gene transfer (36). This model was based on the observation that operon structure is conserved between E. coli and Salmonella typhimurium. The model applies only to “nonessential” genes, genes that are relatively dispensable, which can be lost and then reintroduced into the genome through horizontal operon transfer. It does not, for example, apply to the ribosomal genes that are strongly clustered, are essential, and for which we have no evidence for horizontal gene transfer. It does, however, apply to pathogenicity islands and pathogenicity islets, clusters of genes that play a role in pathogenicity, and do indeed show evidence for horizontal gene transfer (37).

Regulatory Elements.

With the determination of orthologous genes and conservation of gene order one can begin to determine whether intergenic regions are conserved. The degree of conservation of intergenic regions is remarkably low and is diverging much faster than the gene order (Y. Diaz-Lazcoz, M.A.H. and P.B., unpublished results). The pattern in Fig. 4 can be regarded as an exception, demonstrating that at least in some cases gene regulation is preserved. At the 5′ end of the ribosomal genes rpl11 and rpl1 in E. coli lies an RNA secondary structure potentially involved in the regulation of expression of the rpl11 operon (38). The structure is conserved in all Bacterial genomes analyzed in this paper, with the notable exception of H. pylori.

Figure 4
Conservation of an RNA secondary structure at the 5′ end of rpl11 operon in Bacterial genomes. The order of the ribosomal protein genes rpl11 and rpl1 is conserved in all of the Bacteria analyzed. The gene nusG is a transcription antitermination ...

Co-Occurrence of Genes.

Some genomes are more organized than others. If neighboring genes tend to function together in one genome, as they do in the case of operons, then they should both occur in another genome, even if they are not neighbors or part of the same operon. We show (Fig. 5A) that this is indeed the case. If gene A has a neighboring gene B, then if the ortholog of B (B′) occurs in another genome the probability that the ortholog of A (A′) occurs in the other genome is increased (compare Fig. 2). In other words, orthologs shared between two genomes tend to be clustered in at least one of the genomes. Part of the results of Fig. 5A are caused by genes that occur as neighbors in both of the genomes compared. The analysis was repeated to include only genes that are separated in one genome (X), but neighbors in another genome (Y). The fraction of genes that are neighbors in Y was compared with the expected fraction, given a model of random shuffling of genes (see Fig. 5B for methods). Results show that genes from a genome Y that have an ortholog in genome X tend to cluster in Y. The trend is present in all genomes except M. genitalium, and is particularly pronounced in the genomes of E. coli and B. subtilis. This surprising result suggests that most genomes are organized, yet some genomes are more organized than others. We assume that the genes that occur in one genome and are neighbors in another genome are in some way or another related in function. One explanation for the high degree of clustering in E. coli and B. subtilis is that they consist to a large extent of recently horizontally transferred genes, which could increase the prevalence of polycistronic operons in their genomes.
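The comparison behind Fig. 5A can be sketched as two probabilities for one genome: the baseline probability that a gene has an ortholog in the other genome, and the same probability conditioned on a neighboring gene having one. The sketch assumes immediate adjacency on a circular chromosome and counts each adjacent pair in both directions; these choices, and the function name, are illustrative assumptions rather than the procedure of the figure legend.

```python
def ortholog_probabilities(order, has_ortholog, circular=True):
    """Return (baseline, conditional) for one genome.
    baseline:    P(gene has an ortholog in the other genome)
    conditional: P(gene has an ortholog | a neighboring gene has one)
    `order` is the gene order; `has_ortholog` is a set of gene names."""
    n = len(order)
    flags = [gene in has_ortholog for gene in order]
    baseline = sum(flags) / n
    pairs = hits = 0
    last = n if circular else n - 1
    for i in range(last):
        j = (i + 1) % n
        # Consider the adjacent pair in both directions: (gene, neighbor)
        for gene, neighbor in ((i, j), (j, i)):
            if flags[neighbor]:
                pairs += 1
                hits += flags[gene]
    conditional = hits / pairs if pairs else float("nan")
    return baseline, conditional
```

In a toy genome of eight genes where the first four form a block with orthologs, the baseline is 0.5 and the conditional probability is 0.75. A conditional value above the baseline indicates clustering of shared orthologs; a shuffling test of the kind mentioned above would be needed to judge its significance.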

Figure 5
(A) The probability that a gene in genome A has an ortholog in another genome B if a neighboring gene in A has an ortholog in genome B. The probabilities clearly increase, as compared with the average probability of having an ortholog in another genome ...

Co-occurrence of genes and the conservation of pathways.

Instead of analyzing spatial association of orthologs, one can analyze whether orthologs show “genome association”: i.e., they either occur together in a genome or are both absent from a genome. Such an analysis could, in principle, be used to reconstruct which genes are functionally related. The fact that orthologs that both occur in two genomes have a relatively high probability of spatial association in one of the genomes (Fig. 5A), even if they are separated in the other genome (Fig. 5B), in itself points to the usefulness of this idea. By analogy to approaches using the covariation of the nucleotide content of positions in RNA (39) to predict which positions interact with each other, one can use the covariation in the occurrence of proteins to create a model of which proteins depend for their function on each other. Such information could be used to reconstruct metabolic pathways or signaling pathways. The important assumption is that the structure of the pathway was constant throughout evolution. Nonorthologous gene displacement, where a gene assumes the functions of another in a pathway, suggests that pathways are more conserved than the presence of orthologous genes. Our observation of the co-occurrence of the genes dnaJ and dnaK in a small set of orthologs that are shared by M. genitalium and M. thermoautotrophicum, but not by M. jannaschii (see legend to Fig. 2), shows that the correlation of functionally related genes is present in phylogenetically distant species.

The existence of associated genes and the conservation of this association are important parameters in determining the degree of epistasis of genome evolution and determine the shape of the “adaptive landscape” (40) in which genome evolution operates. For an analysis of covariation in the occurrence of genes to be statistically meaningful, more genomes than the nine that were analyzed here are required. Furthermore, one needs to correct for the “baseline” probability that a gene from one genome has an ortholog in another genome, which depends on phylogenetic distance between the genomes (Fig. 2).
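The idea of “genome association” can be made concrete with presence/absence profiles. The sketch below scores two gene families by the fraction of genomes in which they are either both present or both absent. It is a toy measure with hypothetical names, and it does not apply the baseline correction for phylogenetic distance that the text calls for.

```python
def presence_profile(family, genome_contents):
    """Presence (True) or absence (False) of a gene family in each genome.
    `genome_contents` is a list of sets of family names, one per genome."""
    return tuple(family in content for content in genome_contents)

def agreement(profile_a, profile_b):
    """Fraction of genomes in which two families are both present
    or both absent."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    same = sum(a == b for a, b in zip(profile_a, profile_b))
    return same / len(profile_a)
```

With only a handful of genomes many unrelated families will agree by chance, which is why more genomes and a correction for the baseline probability are needed before such scores carry statistical weight.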

Comparing Rates of Genome Evolution

We have studied several indicators of genome evolution and followed their conservation over time (Fig. 6). The resulting calibration curves not only quantify the divergence of these indicators, but also have practical value as they show what information can be extracted from new microbial genomes given their phylogenetic position. The calibration curves will require refinement when more data become available, but they already provide levels of expectation, deviations from which are of potential interest (e.g., synteny of genes in distant species that cannot be found in other species is an indicator of horizontal gene transfer; Fig. 1). In particular, more relatively closely related genomes that have protein identity levels higher than 50% will be essential to provide more precise estimates of the rates at which genome organization and gene regulation evolve. The calibration curves also should influence the analysis strategy, e.g., if a closely related genome is available, orthologs are relatively easy to discriminate from other members of multigene families. By analogy to profile search techniques, it is helpful to include not too closely related but also not too divergent species into the first round of the analysis, where the closeness of the relationship depends on the features one wants to identify. For example, to study the evolution of gene regulation one needs to compare more closely related species than to study the evolution of gene order. To study the evolution of gene content, one needs to compare even less related species, whereas the study of the evolution of metabolism requires the comparison of the most distantly related species.

Figure 6
Relative rates of genome evolution. The curves were fitted from the fraction of shared orthologs (Fig. 2) and the conservation of the order of genes (Fig. 3), the curve that shows the relationship between protein identity ...

Current analysis of genomes is driven by the prediction of functional features at the molecular and cellular level; it is based on the presence and absence of certain genes in the context of phenotypic expectations. Expectations about horizontal gene transfers and about the loss, acquisition, or displacement of entire pathways (the entire metabolism in the case of the Archaea), together with the study of the correlations of gene occurrence, will enable us to identify functional cascades in greater detail. Identification of weak regulatory signals in the genomes requires a sensitive comparative analysis. The puzzling evolution of nonconserved but ever-present operons is only one indication that many genetic and evolutionary mechanisms are yet to be detected and quantified.


We are very grateful to Chris Ponting, Berend Snel, Yolande Diaz-Lazcoz, Thomas Dandekar, and Joerg Schultz for providing data and useful discussions. The work was supported by the Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie (Germany) and the Deutsche Forschungsgemeinschaft.


1. Blattner F R, Plunkett G, III, Bloch C A, Perna N T, Burland V, Riley M, Collado-Vides J, Glasner J D, Rode C K, Mayhew G F. Science. 1997;277:1453–1462.
2. Karlin S, Mrazek J, Campbell A. J Bacteriol. 1997;179:3899–3913.
3. Hood D W, Deadman M E, Jennings M P, Bisercic M, Fleischmann R D, Venter J C, Moxon E R. Proc Natl Acad Sci USA. 1996;93:11121–11125.
4. Huynen M A, van Nimwegen E. Mol Biol Evol. 1998, in press.
5. Gelfand M S, Koonin E V. Nucleic Acids Res. 1997;25:2430–2439.
6. Fleischmann R, Adams M, White O, Clayton R A, Kirkness E F, Kerlavage A R, Bult C J, Tomb J F, Dougherty B A, Merrick J M. Science. 1995;269:496–512.
7. Fraser C M, White O, Casjens S, Huang W M, Sutton G G, Clayton R, Lathigra R, Ketchum K A, Dodson R, Hickey E K. Science. 1995;270:397–403.
8. Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y, Miyajima N, Hirosawa M, Sugiura M, Sasamoto S. DNA Res. 1996;3:109–136.
9. Bult C J, White O, Olsen G J, Zhou L, Fleischmann R D, Sutton G G, Blake J A, FitzGerald L M, Clayton R A, Gocayne J D. Science. 1996;273:1058–1072.
10. Himmelreich R, Hilbert H, Plagens H, Pirkl E, Li B, Herrmann R. Nucleic Acids Res. 1996;24:4420–4449.
11. Smith D R, Doucette-Stamm L A, Deloughery C, Lee H, Dubois J, Aldredge T, Bashirzadeh R, Blakely D, Cook R, Gilbert K. J Bacteriol. 1997;179:7135–7155.
12. Tomb J-F, White O, Kerlavage A R, Clayton R A, Sutton G G, Fleischmann R D, Ketchum K A, Klenk H P, Gill S, Dougherty B A. Nature (London) 1997;388:539–547.
13. Kunst F, Ogasawara N, Moszer I, Albertini A M, Alloni G, Azevedo V, Bertero M G, Bessieres P, Bolotin A, Borchert S. Nature (London) 1997;390:249–256.
14. Fitch W M. Syst Zool. 1970;19:99–110.
15. Tatusov R L, Koonin E V, Lipman D J. Science. 1997;278:631–637.
16. Schmid K, Tautz D. Proc Natl Acad Sci USA. 1997;94:9746–9750.
17. Koonin E V, Mushegian A R, Bork P. Trends Genet. 1996;12:334–336.
18. Page R D M. Syst Biol. 1994;43:58–77.
19. Yuan Y P, Eulenstein O, Vingron M, Bork P. Bioinformatics. 1998, in press.
20. Medigue C, Rouxel Y, Vigier P, Henaut A, Danchin A. J Mol Biol. 1991;222:851–856.
21. Lawrence J G, Ochman H. J Mol Evol. 1997;44:383–397.
22. Altschul S F, Madden T L, Schaffer A A, Zhang J, Zhang Z, Miller W, Lipman D J. Nucleic Acids Res. 1997;25:3389–3402.
23. Feng D F, Cho G, Doolittle R F. Proc Natl Acad Sci USA. 1997;94:13028–13033.
24. Doolittle R F, Feng D F, Tsang S, Cho G, Little E. Science. 1996;271:470–477.
25. Grishin N V. J Mol Evol. 1995;41:675–679.
26. Koonin E V, Mushegian A R, Galperin M Y, Walker D R. Mol Microbiol. 1997;25:619–637.
27. Feng D-F, Doolittle R F. J Mol Evol. 1997;44:361–370.
28. Huynen M, Diaz-Lazcoz Y, Bork P. Trends Genet. 1997;13:389–390.
29. Tatusov R L, Mushegian A R, Bork P, Brown N P, Hayes W S, Borodovsky M, Rudd K, Koonin E V. Curr Biol. 1996;6:279–291.
30. Tamames J, Casari G, Ouzounis C, Valencia A. J Mol Evol. 1997;44:66–73.
31. Mushegian A R, Koonin E V. Trends Genet. 1996;12:289–290.
32. Watanabe H, Mori H, Itoh T, Gojobori T. J Mol Evol. 1997;44:57–64.
33. Kolsto A B. Mol Microbiol. 1997;24:241–248.
34. Siefert J L, Martijn K A, Abdi F, Widger W R, Fox G E. J Mol Evol. 1997;45:467–472.
35. Fisher R A. The Genetical Theory of Natural Selection. Oxford: Oxford Univ. Press; 1930.
36. Lawrence J G, Roth J R. Genetics. 1996;143:1843–1860.
37. Barinaga M. Science. 1996;272:1261–1263.
38. Branlant C, Krol A, Machatt A, Ebel J P. Nucleic Acids Res. 1981;9:293–307.
39. Gutell R R, Power A, Hertz G, Putz E, Stormo G. Nucleic Acids Res. 1993;20:5785–5795.
40. Wright S. In: Proceedings of the Sixth International Congress on Genetics. Jones D F, editor. Vol. 1. New York: Brooklyn Botanical Garden; 1932. pp. 356–366.
41. Yeh K C, Wu S H, Murphy J T, Lagarias J C. Science. 1997;277:1505–1508.
42. Klenk H P, Clayton R A, Tomb J F, White O, Nelson K E, Ketchum K A, Dodson R J, Gwinn M, Hickey E K, Peterson J D. Nature (London) 1997;390:364–370.
43. Aravind L, Ponting C P. Trends Biochem Sci. 1997;22:458–45.
44. Zhulin I B, Taylor B L, Dixon R. Trends Biochem Sci. 1997;22:331–333.
45. Ponting C P, Aravind L. Curr Biol. 1997;7:R674–R677.
46. Schultz J, Milpetz F, Bork P, Ponting C P. Proc Natl Acad Sci USA. 1998;95:5857–5864.
47. Smith T, Waterman M S. J Mol Biol. 1981;147:195–197.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences