NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Chapter 2. Evolutionary Concept in Genetics and Genomics

2.1. Similarity, Homology, Divergence and Convergence

2.1.1. The critical definitions

In times past, gathering information on a potential partner in marriage or business routinely started with the simplest question “What family does he or she come from?” Affiliation with a certain family immediately provided a starting point for further inquiries, a general idea of what might be expected from a certain individual. Of course, families are never uniform, and classic literature from Homer to Shakespeare to Tolstoy provides ample illustrations that any expectation based solely on family history should be taken with a grain of salt. Nevertheless, in the absence of other clues to the character of the subject in question, an educated guess could be made based on the family structure and the individual’s position within that structure.

Essentially the same approach is used in predicting potential functions for a newly sequenced gene and its protein product. Since it is technically impossible to experimentally test activity of the product of every single open reading frame in every organism, understanding their cellular roles routinely relies on family history.

So how can one decide what family a given protein belongs to? Sequence analysis aims at finding important sequence similarities that would allow one to infer homology. The latter term is extensively used in scientific literature, often without a clear understanding of its meaning, which is simply common origin. Since the mid-19th century, zoologists and botanists have learned to make a distinction between homologous organs (e.g. bat's wing and human's hand) and similar (analogous) organs (e.g. bat's wing and butterfly's wing). Homologous organs are not necessarily similar (at least the similarity may not be obvious); similar organs are not necessarily homologous. For some reason, this simple concept tends to get extremely muddled when applied to protein and DNA sequences [695]. Phrases like “sequence (structural) homology”, “high homology”, “significant homology”, or even “35% homology” are as common, even in top scientific journals, as they are absurd, considering the above definition. “Sequence homology” is particularly pervasive, having found its way even into the NLM’s Medical Subject Heading (MeSH) system. It has been assigned as a keyword to more than 80,000 papers in MEDLINE, including, to the embarrassment of the authors, most of their own. In all of the above cases, the term “homology” is used basically as a glorified substitute for “sequence (or structural) similarity”.

All this misuse of “homology”, in principle, could be dismissed as an inconsequential semantic problem. One could even suggest that, after all, since it so happened that in molecular biology literature “homology” has been often used to designate quantifiable similarity between sequences (or, less often, structures), the term should be redefined, legitimizing this usage. We believe, however, that the notion of homology is of major fundamental and practical importance and, on this occasion, semantics matters. In our opinion, misuse of the term ‘homology’ has the potential of washing out the meaning of the very concept of common evolutionary descent [695].

A conclusion that two (or more) genes or proteins are homologous is a conjecture, not an experimental fact. We would be able to know for a fact that genes are homologous only if we could directly explore their common ancestor and all intermediate forms. Since there is no fossil record of these extinct forms, a decision on homology between genes has to be made on the basis of the similarity between them, the only observable variable that can be expressed numerically and correlated with probability. The higher the similarity between two sequences, the lower the probability that they have originated independently of each other and became similar merely by chance (see 4.2). Indeed, if we take two sequences of 100 amino acid residues each that have, say, 80% identical residues, we can calculate the probability of this occurring by chance, find that it is so low that such an event is extremely unlikely to have happened in the last 5 billion years, and conclude that the sequences in question must be homologous (share a common ancestry). Even for proteins that share a much lesser degree of identity, alignment of counterparts from all walks of life is often straightforward, and there seems to be no reasonable doubt of homology. For example, although sequences of the ribosomal protein L36 from different species (Figure 2.1) exhibit considerable diversity and only a single amino acid residue is conserved in all the sequences, they align unequivocally and are indisputable homologs.
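The back-of-the-envelope calculation mentioned above can be sketched as follows. This is a minimal sketch under simplifying assumptions that are not taken from the text: the two sequences are compared in one fixed, ungapped alignment, positions are independent, and all 20 amino acids are equally frequent, so that any position matches by chance with probability 1/20. The function name and its parameters are illustrative only.

```python
from math import comb


def chance_identity_probability(length, min_identical, p_match=1 / 20):
    """Binomial tail: probability that at least `min_identical` of `length`
    independently aligned positions match by chance.

    Assumption (not from the text): each position matches with the same
    probability `p_match`, here 1/20 for uniformly distributed amino acids.
    """
    if not 0 <= min_identical <= length:
        raise ValueError("min_identical must lie between 0 and length")
    return sum(
        comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(min_identical, length + 1)
    )


# Two sequences of 100 residues with at least 80 identical positions.
p_chance = chance_identity_probability(100, 80)
print(f"P(>= 80 identical out of 100 by chance) = {p_chance:.2e}")
```

Under these assumptions the probability comes out on the order of 10^-84. Realistic amino acid frequencies and the freedom to shift or gap the alignment would change the number but not the conclusion.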

Figure 2.1. Multiple alignment of the ribosomal protein L36 sequences. Conserved amino acid residues are shown in bold and/or yellow. The following proteins are listed: A. aeolicus, aq_075; B. subtilis, RpmJ; C. jejuni, Cj1591; C. trachomatis, CT786; E. coli, RpmJ; ...

A real problem arises only when the similarity between two given sequences is much lower, so it is not immediately clear how to properly align them and how to calculate their degree of similarity. Even when one comes up with a figure (say, two protein sequences have 10% identical residues and an additional 8% similar amino acid residues, a total of 18% similarity), does this imply homology or not? The only reasonable answer is: it depends. This and lower levels of similarity might be indicative of homology provided that one or more of the following applies: (i) the similarity extends over a long stretch of sequence and is statistically significant by criteria known to be reliable (such as those applied in the BLAST algorithm and its derivatives); (ii) although the sequence similarity is low, the same pattern of identical and similar amino acid residues is seen in multiple sequences; or (iii) the pattern of sequence similarity reflects the similarity between experimentally determined structures of the respective proteins or at least corresponds to the known key elements of one such structure.
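To make figures such as "10% identical plus 8% similar" concrete, here is a minimal sketch of how percent identity and percent similarity can be computed from a pairwise alignment. The grouping of amino acids into "similar" classes is a hypothetical simplification introduced for illustration; real tools derive similarity from substitution matrices. The choice of denominator (positions where neither sequence has a gap) is also an assumption, since conventions differ between programs.

```python
# Hypothetical similarity classes, for illustration only; alignment programs
# normally use a substitution matrix rather than fixed groups.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (percent identical, percent similar-but-not-identical)
    over the alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if a == b:
            identical += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100 * identical / compared, 100 * similar / compared


# Toy alignment (made-up sequences).
pct_identity, pct_similar = identity_and_similarity("MKV-LD", "MRVALE")
print(pct_identity, pct_similar)
```

Neither number says anything about homology by itself; as the criteria above indicate, statistical significance and consistency across multiple sequences or structures are what matter.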

In the rest of this chapter and in the subsequent chapters as well, we will have multiple opportunities to examine each type of evidence. Right here and now, however, it is pertinent to ponder the question: Why is sequence and structural similarity considered to be evidence of homology (common origin) in the first place? Once we are confident that a particular similarity is not spurious, but rather, according to the above criteria, represents certain biological reality, is common ancestry the only explanation? The answer is: no, a logically consistent alternative does exist and involves convergence from unrelated sequences.

The functional convergence hypothesis would posit that sequence and structural similarities between proteins are observed because the shared features are strictly required for these proteins to perform their identical or similar functions. Functional convergence per se is an undeniable reality. In the broadest sense, convergence is observed, for example, between all proteins that contain disulfide bonds stabilizing their structure or between all enzymes that have the same catalytic residues (e.g. a constellation of histidines and aspartates). Even more prominent motifs associated with catalytic residues are found within different structural context and, in all likelihood, have evolved convergently [722,724]. In the case of disulfide-bonded domains, convergence can even fool sequence comparison programs, translating into statistically significant (albeit not overwhelming) sequence similarity. A rather dramatic manifestation of convergence is the recent description of a “homologous” disulfide-bonded domain in Wnt proteins and phospholipase A2 [699], which was later recognized as “mistaken identity”, on the grounds of structural implausibility [77]. The classic work of Alan Wilson and colleagues comparing lysozymes from ruminants, langur monkeys, and leaf-eating birds is a textbook case that reveals the nature and extent of convergence in enzymes [471,806,816]. These studies have shown beyond doubt that several amino acid residues required for functioning in the stomach have evolved independently (convergently) in different lineages of lysozymes. Importantly, however, this set of convergent positions consists of only seven amino acid residues, a small subset of the residues that comprises the lysozyme molecule.

A pan-adaptationist view of evolution would hold that functional convergence is the sole (or at least the principal) factor responsible for similarity between proteins. Formally disproving this paradigm might not be possible, but there seem to be at least two compelling arguments against it. The first one stems from the notion of a continuous gradient of similarity between proteins. The convergence explanation is implausible for closely related sequences, such as those of the same proteins (or, more precisely, orthologs; see below) from different mammalian species, which are usually 70–80% identical. For such sequences, the convergence hypothesis is equivalent to the statement that most, if not all, amino acid residues in a protein are fixed through positive selection. This runs against the neutral theory of molecular evolution, which has shown that, given the known parameters of animal populations, positive selection could not be responsible for the majority of amino acid substitutions, which are therefore effectively neutral [440]. Convergence could only be a realistic possibility for deep relationships between proteins, which involve limited similarities; indeed, the neutral theory does not preclude positive selection acting, say, on 10% of the positions in a protein. Then, the observed spectrum of similarities between proteins would have two distinct explanations: (i) divergence from common ancestors for tight families with high levels of sequence similarity, and (ii) convergence from independent ancestors for larger groups of related proteins (superfamilies), in which only limited similarity is observed. While not theoretically impossible, such an opposition of two vastly different modes of evolution, with a mysterious bottleneck separating the two phases, appears extremely unlikely. 
This view of evolution is clearly inferior to the alternative, whereby all significant similarities observed within a class of proteins are interpreted within a single theoretical framework of divergence from an ultimate common ancestor.

The second, probably most convincing, argument against convergence as the principal explanation for the observed similarities between proteins has to do with the nature of structural constraints associated with a particular function. A fundamental observation is that a single function, such as catalysis of a specific enzymatic reaction, is often performed by two or more proteins that have unrelated structures [187,271]. In 2.2.5, we discuss this phenomenon in some detail and present several specific examples. These observations indicate that the same function does not necessarily require significantly similar structures, which means that, as a rule, there is no basis for convergent evolution of extensive sequence and structural similarity between proteins. This is not to say that unrelated enzymes that catalyze the same reaction bear no structural resemblance whatsoever. Indeed, subtle similarities in the spatial configuration of amino acid residues in the active centers are likely to exist, and these are precisely the kind of similarity that is expected to emerge due to functional convergence. These similarities, however, do not translate into structural and sequence similarity detectable by existing methods for comparison of proteins (at least in the overwhelming majority of cases). By inference, we are justified in concluding that whenever statistically significant sequence or structural similarity between proteins or protein domains is observed, this is an indication of their divergent evolution from a common ancestor or, in other words, evidence of homology. We will revisit the issue of convergence versus divergence when discussing the deepest structural connections between proteins.

Now that we have established the connection between similarity and homology, it should be emphasized that demonstration of homology is central to the interpretation of similarities between proteins. The feasibility of this conclusion, which sometimes is reached on the basis of limited similarity, is what makes sequence and structure comparison the major staples of computational biology and inspires the development of increasingly sensitive methods for such comparisons. Indeed, under the notion of homology, a sequence or structural alignment becomes a powerful tool for evolutionary and functional inferences.

Once sequences are correctly aligned, homology implies that the corresponding residues in homologous proteins are also homologous, i.e. they are derived from the same ancestral residue and, typically, inherit its function. If the residue in question is the same in a set of homologous sequences, we say that it is (evolutionarily) conserved. Thus, homology lends legitimacy to the transfer of functional information from experimentally characterized proteins (or nucleic acids) to uncharacterized homologs, the single most common and practically important application of computational methods in molecular biology. Conversely, an alignment of non-homologous sequences is inherently meaningless and potentially misleading. Even if such an alignment attains a relatively high percentage of identity or similarity, no conclusions at all can be inferred from the (spurious, in this case) correspondence between aligned residues. This is why phrases like “significant homology” or “percent homology” are so ludicrous. Homology is a qualitative notion of common ancestry. As long as homology is established, 10% identical residues between two protein sequences could be highly meaningful and amenable to functional interpretation. In contrast, even 30% identity between two sequences that are not homologous in reality could be totally misleading.
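The notion of a conserved residue can be illustrated with a short sketch that scans the columns of a multiple alignment and reports those in which every sequence carries the same residue. The alignment below is a made-up toy example, and the function name is illustrative; the sketch assumes the sequences are already correctly aligned and padded with `-` for gaps.

```python
def conserved_columns(alignment, gap="-"):
    """Return 0-based indices of alignment columns in which all sequences
    have the same (non-gap) residue."""
    if not alignment:
        return []
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if column[0] != gap and len(set(column)) == 1
    ]


# Toy alignment of three hypothetical sequences.
toy_alignment = [
    "MKCG-R",
    "LKCGAR",
    "MRCG-K",
]
conserved = conserved_columns(toy_alignment)
print(conserved)  # columns where the same residue occurs in every sequence
```

Such a column is only meaningful if the sequences are homologous; in an alignment of unrelated sequences the same test would report coincidences.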

2.1.2. Conservation of protein sequence and structure in evolution

Protein structure is conserved during evolution much better than protein sequence. There are numerous examples of proteins that show little sequence similarity but still adopt similar structures, contain identical or related amino acid residues in their active sites, and have similar catalytic mechanisms. These shared features support the notion that, despite low sequence similarity, such proteins are homologous.

Consider, for example, the structure of lysozyme, the enzyme that hydrolyzes bacterial cell walls (formal name: 1,4-beta-N-acetylmuramidase). Different lysozymes are found in many organisms, from bacteriophages to mammals, and in general, they show little sequence similarity to each other. PDB, the database of protein structures (see 3.3), includes the lysozyme from goose (PDB code 153L), which consists of 185 amino acid residues (Figure 2.2). The sequence neighbors of this protein in the protein database (see 3.1.2) are lysozymes from black swan (same length, 96% identity), ostrich (same length, 83% identity), chicken (same length, 80% identity), as well as unannotated proteins from human (44% identity), mouse (43% identity), and B. subtilis bacteriophage SPBc2 (25% identity in 176-aa overlap). The vertebrate proteins in this list, including the uncharacterized ones, are obvious homologs of the goose lysozyme. The phage protein is more dissimilar and, in this case, the issue of homology is worth some investigation. However, the sequence similarity between lysozymes and this phage protein is statistically significant (as can be shown, for example, using PSI-BLAST, see 4.3.3), and their multiple alignment shows a consistent pattern of shared residues, thus establishing homology (Figure 2.2).

Figure 2.2. Multiple sequence alignment of goose lysozyme and its closest homologs. Absolutely conserved amino acid residues are shown in bold; conserved hydrophobic residues are yellow.

In contrast, the list of closest structural neighbors of goose lysozyme, according to the MMDB database (see 3.3), includes the classic chicken egg white lysozyme (e.g. PDB code 3LZT, 11% identity) and lysozymes from E. coli bacteriophages λ (PDB code 1AM7, 13% identity) and T4 (PDB code 149L, 11% identity). Nevertheless, a superposition of the three-dimensional structures of these three proteins clearly reveals the conserved structural core and many shared features (Figure 2.3).

Figure 2.3. Structural alignment of goose lysozyme (PDB code 153L), chicken egg white lysozyme (3LZT), and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92). Structures of the four different types of lysozyme were aligned using VAST ...

A different method of structural comparison, DALI, used in the FSSP database (see 3.3), also identifies them as the nearest structural neighbors. Importantly, structural and sequence comparisons are a two-way street: the structural alignment shown in Figure 2.3 can be transformed into a multiple sequence alignment (Figure 2.4) in which conserved positions, including the catalytic glutamate, can be readily identified [217].

Figure 2.4. Structure-based sequence alignment of goose lysozyme (153L), chicken egg white lysozyme (3LZT), and lysozymes from E. coli bacteriophages λ (1AM7) and T4 (1L92). Multiple alignment, generated by the DALI program [354], was extracted from the FSSP ...

This straightforward analysis leads us to conclude that all lysozymes are homologous, which, in this case, is easy to accept given their similar, if not identical, functions. Furthermore, this analysis can be extended to a broad group of other transglycosylases, which all turn out to share a conserved catalytic domain with lysozyme and comprise a superfamily of homologous proteins [594,863].

Does structural similarity always reflect homology? For reasons discussed in the previous section, structural similarity that spans at least one complete domain most likely does. It is this type of similarity that is sought by structure comparison methods, such as VAST and DALI (see 3.3). Thus, the general rule of structure-homology correspondence seems to be straightforward: protein domains that have the same fold according to structure classification systems, such as SCOP or CATH, are homologs.

In principle, however, it is difficult to rule out that some common folds are so advantageous thermodynamically that they have evolved several times independently (convergently). This possibility has been considered, for example, for the triose phosphate isomerase (TIM) barrel fold, given its high stability and symmetrical, quasi-periodical organization [157].

How far does the notion of divergent evolution go? The overreaching idea that all proteins evolved from a single primordial protein does not seem plausible. Indeed, there is no reason to believe that proteins of different structural classes, e.g. all-α (consisting exclusively of α-helices) and all-β (consisting exclusively of β-strands), have a common origin. However, certain topological changes in protein folds seem to occur during evolution [317], and the possibility of primordial common ancestry might become realistic if different folds within the same structural class are considered.

Interestingly, credible relationships between certain proteins that, according to SCOP, have different folds are detectable even through PSI-BLAST searches. For example, statistically significant similarities between NAD-dependent oxidoreductases and S-adenosylmethionine-dependent methyltransferases are regularly detected in iterative database searches, and the alignments produced are usually consistent with structural superpositions (N.V. Grishin and E.V.K., unpublished). Consequently, there is little doubt that these proteins, which formally have distinct folds, do share a common ancestry. At least in principle, such comparisons could be extended to all the numerous proteins whose structural core consists of parallel β-sheets, leading to the more or less radical proposal that they all have evolved from the same primordial “Rossmann-type” domain, which possibly possessed nucleotide-binding properties [37]. The notion of divergence can be similarly extended to unite other types of structurally similar domains (e.g. different all-α-helical folds) into broad monophyletic classes. We find such generalizations attractive and credible, but caution is due, and further elaboration of the methods for structure comparison, perhaps combined with theoretical analysis of evolutionary models, is required before more certainty is achieved on these potential distant evolutionary relationships. We will return to the discussion of the possible nature of primordial proteins when considering the early stages of biological evolution from a comparative-genomic perspective (see 6.4).

Coming back to earth, it is important to note that approximately the same level of sequence similarity that is seen between distantly related proteins whose homology is established via a combination of iterative sequence searches and structural comparisons (roughly, 8–15% identity with gaps) can be expected to exist between two randomly chosen protein sequences. We already listed above some criteria that allow one to distinguish between true evidence of homology and spurious similarities. More generally, it cannot be overemphasized that, when this level of similarity between proteins is involved, there is no substitute (at least as of this writing) for a careful analysis of each particular relationship. Such an analysis usually pays off, allowing one to avoid false ‘fundamental discoveries’ and sometimes opening up new avenues of investigation.

2.1.3. Homologs: orthologs and paralogs

As discussed above, one of the main objectives of DNA and protein sequence analysis is to identify homologous sequences and to employ sequence and structure conservation to predict common biochemical activities and biological functions of proteins and non-coding sequences. The second major goal of sequence analysis is evolutionary reconstruction per se. To address each of these goals, it is critical to distinguish between two principal types of homologous relationships, which differ in their evolutionary history and functional implications. The two categories of homologs are orthologs, defined as evolutionary counterparts derived from a single ancestral gene in the last common ancestor of the given two species, and paralogs, which are homologous genes evolved through duplication within the same (perhaps ancestral) genome. These definitions were first introduced by Walter Fitch in 1970 [228,229] and remained virtually unknown to molecular biologists until the advent of genomics, at which time it became clear that the distinction between the two types of homologs was crucial for understanding evolutionary relationships between genomes and gene functions. In evolutionary terms, robust identification of orthologs is essential because otherwise any evolutionary scenarios, for example, attempts to reconstruct the gene repertoire and gene order in ancestral genomes (see discussion below), are bound to be meaningless. With respect to functional analysis, orthologs typically retain the same, ancestral function, which makes transfer of functional information within a set of orthologs generally reliable. The evolutionary basis of such conservation of function among orthologs appears fairly obvious. Indeed, consider a gene (or, rather, its product) in an ancestral species that was responsible for carrying out some essential biological function. 
As long as the progeny of this ancestor carries a single copy of the gene in question and does not evolve or acquire an unrelated gene capable of providing the same function, it has to rely on the original gene to continue carrying out that function. This puts orthologs under strict evolutionary constraints and makes them perform the same function as long as this function remains essential for survival or at least confers a substantial selective advantage to its bearers.

In contrast, paralogs tend to evolve new functions, and the study of paralogous families may provide a means for understanding adaptation. As first detailed by Susumu Ohno in his classic 1970 book Evolution by Gene Duplication [627], once paralogs emerge as a result of a gene duplication, the pressure of purifying selection decreases for either one (in Ohno’s original model) or, under new, more elaborate models [448,534,877], both paralogs, which eventually enables evolution of new functions. In each sequenced genome, a substantial fraction (from 25 to 80% [374,408,484,506]) of genes belongs to families of paralogs, each of which reflects functional diversification via duplications that occurred at different stages of evolution. Classic examples include animal olfactory receptors or nuclear hormone receptors, vast families in which an astonishing repertoire of specificities evolved as the result of multiple duplications.

The interplay of speciation events, leading to the divergence of orthologs, and duplications, giving rise to paralogous families, results in complex evolutionary scenarios, which may be hard to resolve (Figure 2.5). When duplication precedes speciation, each of the paralogs gives rise to a distinct line of orthologous descent. Conversely, when duplication occurs after a particular speciation event in one lineage or in both lineages independently (this can be referred to as a lineage-specific duplication or lineage-specific expansion of a paralogous family), a situation ensues whereby a one-to-one orthologous relationship cannot be delineated in principle (Figure 2.5). Instead, all one can say is that the family AB in lineage 1 is orthologous to family A’B’C’ in lineage 2 or, in other words, that A and B are co-orthologs (a new term recently introduced to more accurately describe such relationships [700]) of A’, B’, and C’ (Figure 2.5). Clearly, in such a case, the functional correspondence between the two orthologous families of paralogs is less straightforward than it is between regular, one-to-one orthologs. The relationships between homologs could become particularly tricky if some genes in certain lineages have been lost during evolution (a phenomenon referred to as lineage-specific gene loss, see 2.2.3). In such cases, genes that, at face value, appear to be orthologous may actually be paralogs, whereas the genuine orthologs might have been lost. Once again, functional inferences made on the basis of this type of homologous relationship require particular caution.

Figure 2.5. Orthologous and paralogous genes in three lineages descending from a common ancestor. Gene sets I, II, and III should be considered co-orthologous.
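The definitions above translate into a simple rule when a gene tree with labeled events is available: two genes are orthologs if their last common ancestor node is a speciation, and paralogs if it is a duplication. The sketch below applies this rule to small hand-written trees. The nested-tuple tree representation, the event labels, and the gene names are all assumptions made for illustration; in practice the events themselves have to be inferred, which is the hard part.

```python
def leaves(node):
    """Collect the leaf (gene) names under a node.
    A node is either a gene name (str) or a tuple (event, left, right)."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def relationship(tree, gene_a, gene_b):
    """Classify two distinct genes by the event at their last common ancestor."""
    if isinstance(tree, str):
        raise ValueError("the two genes must be distinct leaves of the tree")
    event, left, right = tree
    for child in (left, right):
        under_child = leaves(child)
        if gene_a in under_child and gene_b in under_child:
            return relationship(child, gene_a, gene_b)  # ancestor lies deeper
    if not {gene_a, gene_b} <= leaves(tree):
        raise ValueError("both genes must be present in the tree")
    return "orthologs" if event == "speciation" else "paralogs"


# Duplication before speciation: two distinct lines of orthologous descent.
early_dup = ("duplication",
             ("speciation", "sp1:X1", "sp2:X1"),
             ("speciation", "sp1:X2", "sp2:X2"))

# Lineage-specific duplications after speciation: A, B vs A', B', C'.
late_dup = ("speciation",
            ("duplication", "sp1:A", "sp1:B"),
            ("duplication", "sp2:A'", ("duplication", "sp2:B'", "sp2:C'")))

print(relationship(early_dup, "sp1:X1", "sp2:X1"))
print(relationship(early_dup, "sp1:X1", "sp2:X2"))
print(relationship(late_dup, "sp1:A", "sp2:C'"))
```

In the second tree every gene of lineage 1 is classified as an ortholog of every gene of lineage 2, which is the co-orthology situation described above: no one-to-one pairing exists.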

Reliable identification of orthologs is only possible when complete sets of genes from two or more genomes are compared. Indeed, if one of the compared genomes is incomplete, a possibility always remains that the true ortholog of the given gene is “hiding” in the unsequenced part. Even with complete genomes, identification of orthologous gene sets is not a simple task because of the complex evolutionary scenarios, which involve multiple duplications, speciations, and most importantly, lineage-specific gene loss events. In principle, complete phylogenetic analysis of all groups of homologous genes is required to decipher true orthologous relationships. This is an extremely labor-intensive task; moreover, it is well known that not all phylogenetic trees provide the required resolution. “Shortcut” approaches have been developed to circumvent the need for comprehensive phylogenetic analyses, and some of these are discussed in subsequent chapters.
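The text does not name the shortcut approaches at this point. One widely used heuristic of this kind, shown here purely as an illustration and not necessarily the method the authors have in mind, is to treat bidirectional (reciprocal) best hits between two complete gene sets as candidate orthologs. The similarity scores below are invented numbers standing in for the output of an all-against-all sequence comparison.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.
    `scores` is a dict {(query, subject): similarity score}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _score) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted gene pairs (a, b) that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores for hypothetical genes a1..a3 (genome A) and b1..b2 (genome B).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90,
          ("a2", "b2"): 150, ("a3", "b2"): 160}
b_vs_a = {("b1", "a1"): 195, ("b2", "a3"): 158, ("b2", "a2"): 149}

candidate_orthologs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(candidate_orthologs)
```

The heuristic inherits the caveats discussed above: with lineage-specific duplications it reports only one pair from a co-orthologous group, and after gene loss a reciprocal best hit may in fact be a paralog.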

2.2. Patterns and Mechanisms in Genome Evolution

Although still a young discipline, comparative genomics has matured enough to allow delineation of the most common and important types of events that occur during genome evolution. These include different forms of genome rearrangement, gene duplication, and more specifically, lineage-specific expansion of gene families, lineage-specific gene loss, horizontal gene transfer, and non-orthologous gene displacement.

2.2.1. Evolution of gene order

Comparison of the first completely sequenced genomes promptly showed that gene order is much less conserved than protein sequences. Genomes of the closely related bacteria Mycoplasma genitalium and M. pneumoniae, for example, consist of six large segments with similar organization of genes, but the segments themselves are shifted relative to each other and partially scrambled in the two genomes [348]. Much greater differences were found between Haemophilus influenzae and E. coli, or even between E. coli K-12 and its pathogenic relative E. coli O157:H7 [669,829]. The gradient of gene order conservation is illustrated in Figure 2.6 (see color plates). In the chlamydial genomes, a genome-scale alignment is readily traceable along the main diagonal, although gaps in the alignment and two major inversions are equally obvious (Figure 2.6A). In contrast, the comparison of E. coli and P. aeruginosa looks completely disordered on the genome scale (Figure 2.6B).

Figure 2.6. Gene order comparison plots. (A) Chlamydia trachomatis (X axis) vs Chlamydophila pneumoniae (Y axis). (B) Escherichia coli (X axis) vs Pseudomonas aeruginosa (Y axis).
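A gene order comparison plot of this kind can be sketched in a few lines: each pair of corresponding genes becomes a point whose coordinates are the positions of the two genes in their respective genomes. Conserved order shows up as runs along the main diagonal, and inversions as runs along the anti-diagonal. The gene names, the two gene orders, and the assumption that corresponding genes have already been identified are all hypothetical; this is not the procedure used to produce the figure.

```python
def gene_order_points(order_x, order_y, gene_pairs):
    """Return sorted (position in genome X, position in genome Y) points
    for each pair of corresponding genes present in both gene orders."""
    pos_x = {gene: i for i, gene in enumerate(order_x)}
    pos_y = {gene: i for i, gene in enumerate(order_y)}
    return sorted(
        (pos_x[gx], pos_y[gy])
        for gx, gy in gene_pairs
        if gx in pos_x and gy in pos_y
    )


def collinear_runs(points):
    """Split sorted points into runs of adjacent genes.
    Each run is (list of points, direction): +1 same order, -1 inverted,
    0 for an isolated point."""
    if not points:
        return []
    runs, current, direction = [], [points[0]], 0
    for prev, cur in zip(points, points[1:]):
        dx, dy = cur[0] - prev[0], cur[1] - prev[1]
        if dx == 1 and dy in (1, -1) and direction in (0, dy):
            current.append(cur)
            direction = dy
        else:
            runs.append((current, direction))
            current, direction = [cur], 0
    runs.append((current, direction))
    return runs


# Hypothetical genomes: genes c, d, e are inverted in genome Y.
order_x = ["a", "b", "c", "d", "e", "f"]
order_y = ["a", "b", "e", "d", "c", "f"]
points = gene_order_points(order_x, order_y, [(g, g) for g in order_x])
runs = collinear_runs(points)
for run_points, run_direction in runs:
    print(run_direction, run_points)
```

For closely related genomes most points fall into a few long runs; for distant ones the runs shrink to short strings of neighboring genes.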

In fact, any such comparison between more or less distantly related prokaryotic genomes, e.g. bacteria or archaea from different genera, would look disordered at a scale where only conservation of about a dozen genes in a row is noticeable. On a smaller scale, however, there is important conservation of gene order within operons, the units of prokaryotic gene coregulation. Extensive genome comparisons showed that, in each genome, 5% to 25% of the genes belong to conserved (predicted) operons, i.e. strings of genes that are shared with at least one relatively distant genome [916]. As should be expected, this fraction gradually increases as new genomes are sequenced. A few operons that are conserved in distantly related prokaryotes consist of genes for ribosomal proteins and some other components of the translation machinery. Other conserved operons include those encoding subunits of the H+-ATPase and ABC-type transporter complexes [169,385,461,595].

2.2.2. Lineage-specific gene loss

A quick look at the genome sizes of the organisms with completely sequenced genomes (Table 1.4) shows that many pairs of closely related organisms have vastly different numbers of genes. Thus, E. coli K-12 has seven times more genes than the aphid symbiont Buchnera sp., which is located right next to E. coli in the 16S rRNA-based phylogenetic tree. Two more representatives of gamma-proteobacteria, H. influenzae and P. multocida, have 2.5 times fewer genes than E. coli. Substantial differences in the gene number can be found even within the same genus. The gene set of Mycoplasma pneumoniae, for example, includes all the 480 genes of M. genitalium, as well as 197 additional genes. Mycobacterium leprae is closely related to M. tuberculosis but has at least 1,200 fewer genes [153].

The same phenomenon is seen throughout eukaryotes. Baker’s yeast S. cerevisiae, for example, has about 6,000 genes, which is at least 2,000 genes fewer than in its relatives, multicellular ascomycetes such as Aspergillus. Furthermore, a eukaryotic intracellular parasite, microsporidian Encephalitozoon cuniculi, which has been identified as a derived fungus in several consistent phylogenetic studies, has only ~2,000 genes [425], which points to a truly dramatic scale of gene loss. About 300 genes were apparently lost by S. cerevisiae after its radiation from the common ancestor with fission yeast S. pombe, although the latter has even fewer genes than S. cerevisiae [55]. All these observations show that certain phylogenetic lineages experienced a significant gene loss, often linked to the adaptations to the parasitic lifestyle (H. influenzae, P. multocida, M. pneumoniae, M. genitalium, M. leprae), or intracellular symbiosis (Buchnera sp.), or just adaptation to a constant (narrow) range of environmental conditions. Indeed, parasites might not need a complicated web of metabolic pathways for the biosynthesis of amino acids, nucleotides, and cofactors as long as they can fetch those nutrients from their host.

In the same vein, the well-known absence of the biosynthetic pathways for 12 amino acids in humans and other vertebrates was probably made possible by the abundance of these amino acids in the food consumed by their common ancestor at the time of their divergence.

An analysis of gene loss in bacterial parasites showed that, in many cases, it led to the elimination of entire pathways, such as amino acid, nucleotide, and cofactor biosynthetic pathways (Chapter 7). For example, a number of parasitic bacteria lack pyrimidine biosynthesis genes that are present in their free-living relatives (Figure 2.7). This has, of course, a simple evolutionary explanation: if the necessary nutrient is available in the medium, the genes responsible for its synthesis become redundant and can be eliminated. Moreover, once at least one of these genes is lost, expression of the others would lead to the accumulation of metabolic intermediates that can be harmful for the cell. This would result in an evolutionary pressure toward coordinated loss of all the genes in a pathway [270]. A similar trend toward coelimination of functionally connected groups of proteins, such as the signalosome and the spliceosome components, has been detected in yeast [55].

Figure 2.7. Pyrimidine biosynthesis genes in organisms with completely sequenced genomes. Each rectangle signifies an enzyme of the pyrimidine biosynthesis pathway, indicated by its gene name and COG number. Alternative enzymes catalyzing the same reaction are shown ...

In a remarkable exception to the principle of coordinated gene loss, there are cases when only a certain (typically, upstream) part of the pathway is eliminated. Figure 2.7 shows that the complete pyrimidine biosynthesis pathway is missing in M. genitalium and M. pneumoniae, whereas H. influenzae lacks genes for the first three reactions of this pathway but has the complete set of genes for all the enzymes that catalyze the conversion of dihydroorotate into CTP. Thus, while H. influenzae is evidently incapable of de novo pyrimidine biosynthesis, it has preserved certain metabolic plasticity to accommodate whatever pyrimidine it can get from its host. The same trend is seen in the even smaller genomes of B. burgdorferi and C. trachomatis, which have lost most of the pyrimidine biosynthesis genes but still contain genes coding for the downstream steps of this pathway.
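The distinction between coordinated loss of a whole pathway and loss of only its upstream part can be sketched as a simple classification of presence/absence data. The step names and the four gene sets below are hypothetical placeholders, not the actual content of Figure 2.7, and the category labels are chosen here only for illustration.

```python
def classify_pathway(pathway_steps, genes_present):
    """Classify a linear pathway by which of its steps are encoded in a genome.

    pathway_steps: ordered list of step labels, upstream first (assumed linear).
    genes_present: set of step labels for which the genome encodes an enzyme.
    """
    present = [step in genes_present for step in pathway_steps]
    if all(present):
        return "complete"
    if not any(present):
        return "absent"
    first = present.index(True)
    # Every step from the first encoded one to the end is retained.
    if all(present[first:]):
        return "upstream part lost"
    return "patchy"


# Hypothetical six-step pathway and invented genomes (illustrative only).
PATHWAY = ["step1", "step2", "step3", "step4", "step5", "step6"]
genomes = {
    "free_living": set(PATHWAY),
    "parasite_A": set(),
    "parasite_B": {"step4", "step5", "step6"},
    "parasite_C": {"step1", "step3", "step6"},
}

summary = {name: classify_pathway(PATHWAY, genes) for name, genes in genomes.items()}
for name, status in summary.items():
    print(f"{name}: {status}")
```

Under these assumptions, a genome such as the invented "parasite_B" would correspond to the situation described for H. influenzae, where only the downstream steps are retained.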

2.2.3. Lineage-specific expansion of gene families

We have already mentioned the evolutionary importance of gene duplication leading to the emergence of paralogs, which may assume new functions, sometimes substantially different from those of the ancestral gene. Genome comparisons suggest that lineage-specific expansion of paralogous gene families, which in some cases account for a sizable fraction of a genome, is one of the major mechanisms of adaptation [408,506]. Analysis of lineage-specific gene expansions can provide useful clues to the evolution of each particular lineage. Table 2.1 shows that, indeed, in pathogens M. tuberculosis and H. pylori, the most conspicuous expansions are those of genes encoding factors involved in interactions with and survival within the host organisms. In contrast, in free-living autotrophs Synechocystis sp. and A. fulgidus, the largest expansion involves signal transduction proteins, sensor histidine kinases, and related ATPases.

Table 2.1. Lineage-specific expansions of paralogous families in prokaryotic genomes.

In eukaryotes, lineage-specific expansion of certain protein families is even more evident than in prokaryotes. A comparison of the genome counts of signaling domains in the nematode C. elegans against the corresponding numbers in the yeast S. cerevisiae and some free-living bacteria and archaea (Table 2.2) shows that certain domains are dramatically expanded in C. elegans, even when the greater number of genes in the worm is taken into account (see also the counts of ankyrin repeats in C. elegans in 3.2.2).

Table 2.2. Expansion of signaling domains in C. elegans.

2.2.4. Horizontal (lateral) gene transfer

Horizontal (lateral) gene transfer, as opposed to the standard (vertical) transfer from ancestors to progeny, refers to acquisition of genes from organisms that belong to other species, genera, or even higher taxa. Some mechanisms of lateral gene transfer between different strains of the same species, or between closely related species, are well established and include conjugation, acquisition of plasmids, and viral (phage) infection [134]. These events are common and do not stir much controversy. After all, it was the experiment on pneumococcal transformation by heterologous DNA by Avery, MacLeod, and McCarty that proved the role of DNA in heredity. However, in the pre-genomic era, long-range lateral gene transfer across taxa was considered to be extremely rare and more or less unimportant in the general scheme of evolution [782]. The only instance where the fact and impact of horizontal gene transfer had been clearly recognized was the apparent massive flow of genes from the genomes of endosymbiotic organelles, mitochondria in all eukaryotes and particularly chloroplasts in plants, to the eukaryotic nuclear genome [311,312].

As soon as the first comparisons of multiple, complete genome sequences representing diverse taxa had been performed, it became apparent that lateral gene transfer was too common to be dismissed as inconsequential [194]. First, horizontal gene flow between closely related species turned out to be much more pervasive than ever suspected before. Lawrence and Ochman estimate, for example, that as much as 25% of the E. coli genome consists of recently acquired “foreign” genes [497,625]. The actual rate of influx and loss of new genes is even faster: it appears that, in the ~100 million years since the split between Escherichia and Salmonella lineages, E. coli has picked up and lost as much DNA as it has now [496,497].

In addition, genome comparisons helped to uncover numerous cases of (predicted) horizontal gene transfer between organisms belonging to distinct phylogenetic lineages. Archaeal genomes presented a particularly striking picture, with some genes having close homologs only among eukaryotes and others being much more similar to their bacterial homologs than to those from eukaryotes, if eukaryotic homologs were detectable at all [466]. With some exceptions, the “bacterial” and “eukaryotic” proteins in archaea were divided along functional lines, with those involved in information processing (translation, transcription, and replication) showing the eukaryotic affinity, and metabolic enzymes, structural components, and a variety of uncharacterized proteins appearing “bacterial” [466,540]. Because the informational components generally appear to be less prone to horizontal gene transfer [703] and in accord with the “standard model” of early evolution whereby eukaryotes share a common ancestor with archaea [906], these observations could be explained by massive gene exchange between archaea and bacteria [466]. This hypothesis was further supported by the results of genome analysis of two hyperthermophilic bacteria, A. aeolicus and T. maritima. Each of these genomes contained a significantly greater proportion of “archaeal” genes than any of the other bacterial genomes, in an obvious correlation between the similarity in the life styles of evolutionarily very distant organisms (bacterial and archaeal hyperthermophiles) and the apparent rate of horizontal gene exchange between them [52,610]. Further analyses led to the discovery of genes of clear bacterial origin in the hyperthermophilic archaeon P. furiosus, which proved lateral gene transfer from bacteria to archaea [184].

We believe that the demonstration of the evolutionary prominence of lateral gene transfer can be considered the single greatest change in perspective in biology brought about by comparative genomics. A new round of controversy has been sparked by the discovery of genes of possible bacterial origin in the human genome [488]. In Chapter 6, we revisit this issue and discuss implications of large-scale lateral gene transfer for the “tree of life”.

2.2.5. Non-orthologous gene displacement and the minimal gene set concept

Proteins responsible for the same function in different organisms typically show significant sequence and structural conservation and can be inferred to be orthologs. However, there are exceptions to this rule. Examples of apparently unrelated enzymes with the same specificity were noted as early as 1943 when Warburg and Christian described two distinct forms of fructose-1,6-bisphosphate aldolase in yeast and rabbit muscle, respectively. These two enzymes, referred to as class I and class II aldolases, were later shown to be associated with different phylogenetic lineages and have different catalytic mechanisms and little structural similarity [95,549]. Unrelated enzymes that catalyze the same reaction have been referred to as analogous, as opposed to homologous, enzymes [228,271].

Comparative analysis of complete genomes shows that cases like this are common. Strikingly, only about 65 orthologous protein sets are universally represented in all sequenced genomes. While, in large part, this is due to lineage-specific gene loss, this number is much lower than the number of essential functions, indicating that other such functions are performed by unrelated (or at least non-orthologous) proteins in different life forms. This major evolutionary phenomenon, which came to light already in the first comparisons of sequenced genomes, was dubbed non-orthologous gene displacement [465]. The full range of mechanisms leading to non-orthologous gene displacement is not known. However, in cases when essential functions are involved, the main sequence of events appears to be clear. Since an organism cannot survive without a protein that performs an essential function, transient functional redundancy, when an organism has both forms of the respective protein, appears to be a pre-requisite of non-orthologous gene displacement [464]. Such redundancy might evolve via horizontal gene transfer or via recruitment of a protein whose original function was different from the given one (recruitment is likely to occur after gene duplication). The redundancy phase is followed by lineage-specific gene loss, resulting in non-orthologous gene displacement (Figure 2.8). In case of non-essential functions, the redundancy phase might be bypassed, with non-orthologous gene displacement evolving directly via horizontal gene transfer or recruitment.

Figure 2.8. A scenario for the evolution of non-orthologous gene displacement via an ancestral redundancy stage and lineage-specific gene loss.

Enzyme recruitment is a common evolutionary phenomenon leading to non-orthologous gene displacement. Typically, one of the two non-orthologous enzymes with the same catalytic activity belongs to a diverse family of enzymes and could have evolved by shifting the substrate specificity of a related but distinct enzyme [271]. A good example is the two unrelated forms of gluconate kinase. Gluconate kinases from E. coli, yeast, and S. pombe form a narrow conserved group. In contrast, the gluconate kinase of B. subtilis belongs to the so-called FGGY family of carbohydrate kinases, which also includes glycerol kinase (GlpK), D-xylulose kinase (XylB), L-fuculose kinase, and L-xylulose kinase (LyxK). The scenario of enzyme recruitment in this case seems straightforward: a duplication of the glpK or xylB gene in the Bacillus lineage produced a new paralog, which accumulated several mutations resulting in a shift of substrate specificity from glycerol (or xylulose) to gluconate.

Enzyme recruitment seems to be particularly common in organisms that have adapted to novel ecological niches by developing unusual, idiosyncratic metabolic pathways. For example, most of the enzymes that are responsible for the biosynthesis of polyketide antibiotics in actinomycetes appear to be recent recruits from the enzymes of fatty acid biosynthesis. Similarly, enzymes that hydrolyze man-made halogenated hydrocarbons have close relatives among regular metabolic enzymes and, in all likelihood, have been recruited from this source. Perhaps the most remarkable example is the evolution of apyrase (ATP-diphosphohydrolase), the enzyme secreted by blood-sucking insects into the blood of human or other mammalian victims in order to prevent or slow down blood clotting [862]. Because ADP in the blood can serve as a trigger of blood clotting, any enzyme capable of hydrolyzing it would give the hematophagous insect a substantial evolutionary advantage. As a result of this evolutionary pressure toward increasing salivary apyrase activity, insect apyrases are found in at least three different forms, which are homologous, respectively, to ATPases, 5’-nucleotidases, and inositoltriphosphate phosphatases [271,862].

It is worth noting that enzyme recruitment can be legitimately described as independent, convergent evolution of the same enzymatic activity. In Chapter 7, we look at the comparative genomics of central metabolic pathways and encounter numerous cases of non-orthologous gene displacement and, specifically, enzyme recruitment.

The idea of non-orthologous gene displacement was originally developed in conjunction with the concept of a minimal gene set for a living cell [596]. This was construed as the minimal set of genes that are essential for the functioning of a modern-type cell even under the most favorable environmental conditions, including abundance of nutrients and absence of competition. An attempt to explicitly derive a version of such a minimal gene set was undertaken by comparing the first two sequenced bacterial genomes, those of the parasites H. influenzae and M. genitalium. The straightforward logic of this reconstruction was that these two bacteria, which belong to distant phylogenetic lineages, have been independently losing genes during their adaptation to the parasitic lifestyle, and whichever common genes remain in both genomes were likely to belong to the minimal set of essential genes. It was noticed, however, that for certain essential functions (e.g. glycyl-tRNA synthetase), there was no orthologous pair of genes in the two bacteria, hence non-orthologous gene displacement had to be invoked.
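The logic of this reconstruction can be sketched with set operations. The sketch below is an illustration under stated assumptions, not the published procedure: each genome is represented as a mapping from a hypothetical ortholog-group identifier to a function label, and the list of essential functions is assumed to be supplied by the analyst. All identifiers and the tiny "genomes" are invented.

```python
def minimal_gene_set(genome_a, genome_b, essential_functions):
    """Return (shared ortholog groups, inferred non-orthologous displacements).

    genome_a, genome_b: dict mapping ortholog-group id -> function label
    (hypothetical representation).
    essential_functions: functions assumed to be indispensable for any cell.
    """
    shared = set(genome_a) & set(genome_b)
    covered = {genome_a[group] for group in shared}

    displaced = {}
    for func in essential_functions:
        if func in covered:
            continue  # an orthologous pair already performs this function
        in_a = sorted(g for g, f in genome_a.items() if f == func)
        in_b = sorted(g for g, f in genome_b.items() if f == func)
        if in_a and in_b:
            # Same function, different (non-orthologous) genes in the two genomes.
            displaced[func] = (in_a, in_b)
    return shared, displaced


# Invented ortholog groups (OG1..OG5) for two toy genomes.
genome_a = {
    "OG1": "ribosomal protein",
    "OG2": "RNA polymerase subunit",
    "OG3": "glycyl-tRNA synthetase",
    "OG5": "amino acid biosynthesis",
}
genome_b = {
    "OG1": "ribosomal protein",
    "OG2": "RNA polymerase subunit",
    "OG4": "glycyl-tRNA synthetase",
}
essential = ["ribosomal protein", "RNA polymerase subunit", "glycyl-tRNA synthetase"]

shared, displaced = minimal_gene_set(genome_a, genome_b, essential)
print(sorted(shared))  # ['OG1', 'OG2']
print(displaced)       # {'glycyl-tRNA synthetase': (['OG3'], ['OG4'])}
```

The shared groups plus one representative for each displaced function would then form the candidate minimal set in this toy setting.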

The original version of the minimal gene set included 256 genes, with 16 inferred non-orthologous gene displacement cases. (The magic of these numbers must not be lost on the reader: 16 is 2² to the power of 2; 256 = 16², and accordingly, 256 is 2² to the power of 2 to the power of 2. Thus, 256 is the only number that can be represented as such a succession of powers of 2 and, at the same time, can be a reasonable approximation of a minimal gene set: 16 is obviously too few and 256² = 65,536 is, in all likelihood, much greater than the number of genes in the human genome.)

A subsequent large-scale experimental study has shown that most of the genes included in this theoretical minimal gene set were, indeed, essential in M. genitalium, although a few, surprisingly, were not [364]. However, sequencing of additional genomes and the corresponding genome comparisons have clearly shown that this early reconstruction vastly underestimated the extent of non-orthologous gene displacement [452,591,674]. Indeed, as indicated above, only about 65 genes seem to be truly ubiquitous in cellular life forms, comprising perhaps 25% of the minimal set of essential functions. Therefore, it probably makes more sense to consider not so much a minimal gene set but rather a minimal set of functional requirements for cell survival. Comparative genomics shows that, for some of these requirements, a unique solution has evolved, but for the majority, evolution has come up with two or more unrelated or distantly related solutions. As discussed in 6.4, non-orthologous gene displacement is prominent even in the DNA replication machinery, the central functional system of all cells.

Figure 2.9. Distribution of different phylogenetic lineages in the COG database. The plot shows the number of protein families (COGs) in a release of the COG database (see 3.4), which included proteins from the given number of phylogenetic lineages of the total of ...

2.2.6. Phyletic patterns (profiles)

As a result of numerous lineage-specific gene losses, horizontal gene transfers and non-orthologous gene displacements, most protein families show a “patchy” distribution among the sequenced genomes. The data from the database of Clusters of Orthologous Groups of proteins (COGs, see 3.4) show that the majority of COGs are represented in only three or four phylogenetic lineages; universal or nearly universal COGs are much less common.
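A distribution of this kind can be tabulated in a few lines. The sketch below assumes a hypothetical input in which each COG is mapped to the set of phylogenetic lineages it is represented in; the COG names, lineage labels, and resulting counts are invented and do not reproduce the COG database.

```python
from collections import Counter


def lineage_distribution(cog_lineages):
    """Count how many COGs are represented in exactly k lineages, for each k."""
    return Counter(len(lineages) for lineages in cog_lineages.values())


# Invented example: four COGs across five lineages L1..L5.
cog_lineages = {
    "COG_a": {"L1", "L2", "L3"},
    "COG_b": {"L1", "L2", "L3", "L4", "L5"},
    "COG_c": {"L2", "L4", "L5"},
    "COG_d": {"L1", "L3", "L4", "L5"},
}

distribution = lineage_distribution(cog_lineages)
for k in sorted(distribution):
    print(f"{k} lineages: {distribution[k]} COG(s)")
```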

This distribution can be conveniently presented in the form of phyletic patterns (profiles), which show the presence or absence of a COG in each analyzed species. This approach, initially introduced as a feature of the COGs [828] and subsequently adapted, with various modifications, by several research groups [547,665,689], provides a convenient way to compare genomes and investigate the evolutionary history of individual cellular functions. For example, a quick examination of the phyletic patterns of the two distinct forms of phosphoglycerate mutase (the cofactor-dependent form GpmA and the cofactor-independent form GpmI [393]) immediately shows several interesting trends (the species symbols are the same as in Figure 2.7):

Image ch2e1.jpg

Firstly, the two forms have largely complementary phyletic patterns, a clear sign of non-orthologous gene displacement. Only E. coli encodes both forms of the enzyme (and hence shows apparent functional redundancy), whereas other organisms encode either one or the other. Secondly, several organisms do not encode either of the two forms of this enzyme. Assuming that glycolysis is an essential metabolic pathway, glycolytic enzymes should be encoded in every genome (we are aware of one exception, Rickettsia, which does not encode any glycolytic enzymes; see 7.1.1). Therefore, one might suggest that there should be an additional, third form of phosphoglycerate mutase, which is encoded in archaeal genomes and also in T. maritima, A. aeolicus, and D. radiodurans. Indeed, sequence analysis of those genomes shows that they all encode an uncharacterized enzyme, distantly related to alkaline phosphatase and cofactor-independent phosphoglycerate mutase. Based on the conservation of active site residues, this archaeal enzyme has been predicted to have a phosphoglycerate mutase activity [258,261]; this prediction has now been experimentally confirmed in two independent studies [308,866]. Remarkably, the phyletic pattern of the respective COG complements the union of the patterns for the two forms of phosphoglycerate mutase, which ensures the presence of at least one type of phosphoglycerate mutase in every species, except for Rickettsia:

Figure 2.10. Phyletic patterns of the three forms of phosphoglycerate mutase. The species symbols are as in Figure 2.7.

This summation also shows that there is no need for yet another form of phosphoglycerate mutase, which has been designated GpmB in E. coli but has never been experimentally demonstrated to have this activity:

Image ch2e2.jpg

Indeed, recent data show that this protein does not have a phosphoglycerate mutase activity, at least in B. subtilis. Instead, it appears to function as a non-specific sugar phosphatase [702]. This example shows the impressive power of the comparative-genomic approach for prediction of gene functions. This methodology is discussed in greater detail later in this book (see 5.2).
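The reasoning used in this example (complementary patterns, functional redundancy, and species left without any known form of an enzyme) can be sketched by treating each phyletic pattern as a set of species. The species symbols and the three patterns below are invented placeholders rather than the actual phosphoglycerate mutase patterns, and the function names are hypothetical.

```python
def pattern_string(species, pattern):
    """Render a pattern as '+'/'-' symbols in a fixed species order."""
    return "".join("+" if sp in pattern else "-" for sp in species)


def compare_patterns(species, pattern_a, pattern_b):
    """Summarize the overlap of two phyletic patterns (sets of species)."""
    a, b = set(pattern_a), set(pattern_b)
    return {
        "both": sorted(a & b),                      # apparent redundancy
        "neither": sorted(set(species) - (a | b)),  # candidates for a third form
        "exactly_one_fraction": len(a ^ b) / len(species),
    }


def uncovered(species, *patterns):
    """Species in which none of the given patterns is present."""
    covered = set().union(*patterns)
    return sorted(set(species) - covered)


# Invented species and patterns (illustrative only).
species = ["sp1", "sp2", "sp3", "sp4", "sp5", "sp6", "sp7", "sp8"]
form_a = {"sp1", "sp2", "sp3"}
form_b = {"sp3", "sp4", "sp5"}
form_c = {"sp6", "sp7"}

print(pattern_string(species, form_a))  # +++-----
comparison = compare_patterns(species, form_a, form_b)
missing_after_third = uncovered(species, form_a, form_b, form_c)
print(comparison["both"], comparison["neither"], missing_after_third)
```

In this toy data set, one species encodes both forms, three encode neither of the first two, and adding the third pattern leaves a single species without any form, which mirrors the shape of the argument made above.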

2.3. Conclusions and Outlook

In this chapter, we discussed some general principles of molecular evolution that are central to the comparative-genomic approaches and major evolutionary phenomena that became apparent as the result of genome comparison. The above discussion is obviously quite sketchy. However, this should be sufficient for understanding the principles underlying methods of computational genomics and the organization of various databases, which we discuss in the next two chapters. In Chapters 5 through 8, we return to problems of genome evolution at a new level and analyze some of the concepts outlined in greater depth.

Table 2.3. Examples of non-orthologous gene displacement between M. genitalium and H. influenzae.

2.4. Further Reading

Darwin C. 1859. The Origin of Species. Murray, London.
Kimura M. 1983. The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
Ohno S. 1970. Evolution by Gene Duplication. Springer, New York.
Graur D, Li W-H. 2000. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA.
Doolittle WF. 2000. Uprooting the tree of life. Scientific American 282:90–95. [PubMed: 10710791]
Koonin EV, Aravind L, Kondrashov AS. 2000. The impact of comparative genomics on our understanding of evolution. Cell 101:573–576. [PubMed: 10892642]
Copyright © 2003, Kluwer Academic.
Bookshelf ID: NBK20255

