2.1. Similarity, Homology, Divergence and Convergence
2.1.1. The critical definitions
In times past, gathering information on a potential partner in marriage or
business routinely started with the simplest question “What family
does he or she come from?” Affiliation with a certain family
immediately provided a starting point for further inquiries, a general idea of
what might be expected from a certain individual. Of course, families are never
uniform, and classic literature from Homer to Shakespeare to Tolstoy provides
ample illustrations that any expectation based solely on family history should
be taken with a grain of salt. Nevertheless, in the absence of other clues to
the character of the subject in question, an educated guess could be made based
on the family structure and the individual’s position within that
structure.
Essentially the same approach is used in predicting potential functions for a
newly sequenced gene and its protein product. Since it is technically impossible
to experimentally test activity of the product of every single open reading
frame in every organism, understanding their cellular roles routinely relies on
family history.
So how can one decide what family a given protein belongs to? Sequence analysis
aims at finding important sequence similarities that would allow one to infer
homology. The latter term is extensively used in scientific literature, often
without a clear understanding of its meaning, which is simply common origin.
Since the mid-19th century, zoologists and botanists have learned to
make a distinction between homologous organs (e.g. bat's wing and human's hand)
and similar (analogous) organs (e.g. bat's wing and butterfly's wing).
Homologous organs are not necessarily similar (at least the similarity may not
be obvious); similar organs are not necessarily homologous. For some reason,
this simple concept tends to get extremely muddled when applied to protein and
DNA sequences [695]. Phrases like
“sequence (structural) homology”, “high
homology”, “significant homology”, or even
“35% homology” are as common, even in top
scientific journals, as they are absurd, considering the above definition.
“Sequence homology” is particularly pervasive, having found
its way even into the NLM’s Medical Subject Heading (MeSH) system. It
has been assigned as a keyword to more than 80,000 papers in MEDLINE, including,
to the embarrassment of the authors, most of their own. In all of the above
cases, the term “homology” is used basically as a glorified
substitute for “sequence (or structural) similarity”.
All this misuse of “homology”, in principle, could be
dismissed as an inconsequential semantic problem. One could even suggest that,
after all, since it so happened that in molecular biology literature
“homology” has been often used to designate quantifiable
similarity between sequences (or, less often, structures), the term should be
redefined, legitimizing this usage. We believe, however, that the notion of
homology is of major fundamental and practical importance and, on this occasion,
semantics matters. In our opinion, misuse of the term
‘homology’ has the potential of washing out the meaning of
the very concept of common evolutionary descent [695].
Figure 2.1. Multiple alignment of the ribosomal protein L36 sequences
Conserved amino acid residues are shown in bold and/or yellow. The
following proteins are listed: A. aeolicus, aq_075;
B. subtilis, RpmJ; C. jejuni,
Cj1591; C. trachomatis, CT786; E.
coli, RpmJ; H. pylori, HP1297;
L. lactis, L153863; M. leprae,
ML1961; M. genitalium, MG174; R.
prowazekii, RP456; Synechocystis sp.,
sml0006; T. pallidum, TP0209; T.
maritima, TM1476; V. cholerae, VC2575;
X. fastidiosa, XF2440; Yeast, YPL183w.
Bacterial and yeast proteins are from COG0257; other proteins are
from GenPept and have the following gi numbers: rice O.
sativa, gi12020; fruit fly D.
melanogaster, CG18767; mouse, gi13559402; human,
gi7677060.
A conclusion that two (or more) genes or proteins are homologous is a conjecture,
not an experimental fact. We would be able to know for a fact that genes are
homologous only if we could directly explore their common ancestor and all
intermediate forms. Since there is no fossil record of these extinct forms, a
decision on homology between genes has to be made on the basis of the similarity
between them, the only observable variable that can be expressed numerically and
correlated with probability. The higher the similarity between two sequences,
the lower the probability that they have originated independently of each other
and became similar merely by chance (see
4.2). Indeed, if we take two sequences of 100 amino acid residues
each that have, say, 80% identical residues, we can calculate the
probability of this occurring by chance, find that it is so low that such an
event is extremely unlikely to have happened in the last 5 billion years, and
conclude that the sequences in question must be homologous (share a common
ancestry). Even for proteins that share a much lesser degree of identity,
alignment of counterparts from all walks of life is often straightforward, and
there seems to be no reasonable doubt of homology. For example, although
sequences of the ribosomal protein L36 from different species (Figure 2.1) exhibit considerable diversity
and only a single amino acid residue is conserved in all the sequences, they
align unequivocally and are indisputable homologs.
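The chance-identity argument above can be made concrete with a rough
calculation. The sketch below is only an illustration under simplifying
assumptions that are ours, not part of the statistical treatment referred to in
4.2: the two sequences are compared in a single fixed, ungapped alignment,
positions are independent, and all 20 amino acids are equally frequent, so that
a chance match at any position has probability 1/20. The function name and its
parameters are hypothetical.

```python
from math import comb


def chance_identity_probability(length: int, identical: int,
                                p_match: float = 0.05) -> float:
    """Binomial upper tail: probability of at least `identical` matching
    positions among `length` aligned positions, if each position matches
    independently with probability `p_match` (assumed 1/20 here)."""
    return sum(
        comb(length, k) * p_match ** k * (1.0 - p_match) ** (length - k)
        for k in range(identical, length + 1)
    )


# Two sequences of 100 residues with 80% identical residues.
p = chance_identity_probability(100, 80)
print(f"P(at least 80 identities in 100 positions by chance) = {p:.2e}")
```

Under these crude assumptions the probability is vanishingly small, which is
the point of the argument; real amino acid frequencies are not uniform and real
alignments contain gaps, so the number itself should not be taken literally.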
A real problem arises only when the similarity between two given sequences is
much lower, so it is not immediately clear how to properly align them and how to
calculate their degree of similarity. Even when one comes up with a figure
(say, two protein sequences have 10% identical residues and an additional
8% similar amino acid residues, for a total of 18% similarity), does this
imply homology or not? The only
reasonable answer is: it depends. This and lower levels of similarity might be
indicative of homology provided that one or more of the following applies: (i)
the similarity extends over a long stretch of sequence and is statistically
significant by criteria known to be reliable (such as those applied in the BLAST
algorithm and its derivatives); (ii) although the sequence similarity is low,
the same pattern of identical and similar amino acid residues is seen in
multiple sequences; or (iii) the pattern of sequence similarity reflects the
similarity between experimentally determined structures of the respective
proteins or at least corresponds to the known key elements of one such
structure.
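To show how figures such as "10% identical and 8% similar residues" are
obtained from a pairwise alignment, here is a minimal sketch. The grouping of
amino acids into "similar" classes is a hypothetical simplification chosen for
illustration (actual programs score similarity with substitution matrices), and
the choice to ignore columns that contain a gap is likewise an assumption;
different conventions for the denominator give different percentages.

```python
# Hypothetical similarity classes, for illustration only.
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "NQ", "ST", "AG"]


def _similar(a: str, b: str) -> bool:
    """True if both residues fall into the same illustrative class."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def identity_and_similarity(aligned_a: str, aligned_b: str):
    """Return (percent identical, percent similar but not identical)
    over the columns of a pairwise alignment that contain no gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped columns are skipped (an assumption)
        compared += 1
        if a == b:
            identical += 1
        elif _similar(a, b):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared


# Toy alignment (made-up sequences).
pct_identity, pct_similar = identity_and_similarity("ACDE", "ACEE")
print(pct_identity, pct_similar)
```

As the text stresses, such percentages alone do not decide the question of
homology; they only become meaningful together with the statistical, multiple
alignment, and structural criteria listed above.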
In the rest of this chapter and in the subsequent chapters as well, we will have
multiple opportunities to examine each type of evidence. Right here and now,
however, it is pertinent to ponder the question: Why is sequence and structural
similarity considered to be evidence of homology (common origin) in the first
place? Once we are confident that a particular similarity is not spurious, but
rather, according to the above criteria, represents certain biological reality,
is common ancestry the only explanation? The answer is: no, a logically
consistent alternative does exist and involves convergence from unrelated
sequences.
The functional convergence hypothesis would posit that sequence and structural
similarities between proteins are observed because the shared features are
strictly required for these proteins to perform their identical or similar
functions. Functional convergence per se is an undeniable reality. In the
broadest sense, convergence is observed, for example, between all proteins that
contain disulfide bonds stabilizing their structure or between all enzymes that
have the same catalytic residues (e.g. a constellation of histidines and
aspartates). Even more prominent motifs associated with catalytic residues are
found within different structural context and, in all likelihood, have evolved
convergently [722,724]. In the case of disulfide-bonded domains,
convergence can even fool sequence comparison programs, translating into
statistically significant (albeit not overwhelming) sequence similarity. A
rather dramatic manifestation of convergence is the recent description of a
“homologous” disulfide-bonded domain in Wnt proteins and
phospholipase A2 [699], which was later
recognized as “mistaken identity”, on the grounds of
structural implausibility [77]. The
classic work of Alan Wilson and colleagues comparing lysozymes from ruminants,
langur monkeys, and leaf-eating birds is a textbook case that reveals the nature
and extent of convergence in enzymes [471,806,816]. These studies have shown beyond
doubt that several amino acid residues required for functioning in the stomach
have evolved independently (convergently) in different lineages of lysozymes.
Importantly, however, this set of convergent positions consists of only seven
amino acid residues, a small subset of the residues that comprises the lysozyme
molecule.
A pan-adaptationist view of evolution would hold that functional convergence is
the sole (or at least the principal) factor responsible for similarity between
proteins. Formally disproving this paradigm might not be possible, but there
seem to be at least two compelling arguments against it. The first one stems
from the notion of a continuous gradient of similarity between proteins. The
convergence explanation is implausible for closely related sequences, such as
those of the same proteins (or, more precisely, orthologs; see below) from
different mammalian species, which are usually 70–80%
identical. For such sequences, the convergence hypothesis is equivalent to the
statement that most, if not all, amino acid residues in a protein are fixed
through positive selection. This runs against the neutral theory of molecular
evolution, which has shown that, given the known parameters of animal
populations, positive selection could not be responsible for the majority of
amino acid substitutions, which are therefore effectively neutral [440]. Convergence could only be a
realistic possibility for deep relationships between proteins, which involve
limited similarities; indeed, the neutral theory does not preclude positive
selection acting, say, on 10% of the positions in a protein. Then,
the observed spectrum of similarities between proteins would have two distinct
explanations: (i) divergence from common ancestors for tight families with high
levels of sequence similarity, and (ii) convergence from independent ancestors
for larger groups of related proteins (superfamilies), in which only limited
similarity is observed. While not theoretically impossible, such an opposition
of two vastly different modes of evolution, with a mysterious bottleneck
separating the two phases, appears extremely unlikely. This view of evolution is
clearly inferior to the alternative, whereby all significant similarities
observed within a class of proteins are interpreted within a single theoretical
framework of divergence from an ultimate common ancestor.
The second, probably most convincing, argument against convergence as the
principal explanation for the observed similarities between proteins has to do
with the nature of structural constraints associated with a particular function.
A fundamental observation is that a single function, such as catalysis of a
specific enzymatic reaction, is often performed by two or more proteins that
have unrelated structures [187,271]. In 2.2.5, we discuss this phenomenon in some detail and present several
specific examples. These observations indicate that the same function does not
necessarily require significantly similar structures, which means that, as a
rule, there is no basis for convergent evolution of extensive sequence and
structural similarity between proteins. This is not to say that unrelated
enzymes that catalyze the same reaction bear no structural resemblance
whatsoever. Indeed, subtle similarities in the spatial configuration of amino
acid residues in the active centers are likely to exist, and these are precisely
the kind of similarity that is expected to emerge due to functional convergence.
These similarities, however, do not translate into structural and sequence
similarity detectable by existing methods for comparison of proteins (at least
in the overwhelming majority of cases). By inference, we are justified to
conclude that whenever statistically significant sequence or structural similarity
between proteins or protein domains is observed, this is an indication
of their divergent evolution from a common ancestor or, in other words,
evidence of homology. We will revisit the issue of convergence versus divergence when
discussing the deepest structural connections between proteins.
Now that we have established the connection between similarity and homology, it
should be emphasized that demonstration of homology is central to the
interpretation of similarities between proteins. The feasibility of this
conclusion, which sometimes is reached on the basis of limited similarity, is
what makes sequence and structure comparison the major staples of computational
biology and inspires the development of increasingly sensitive methods for such
comparisons. Indeed, under the notion of homology, a sequence or structural
alignment becomes a powerful tool for evolutionary and functional
inferences.
Once sequences are correctly aligned, homology implies that the corresponding
residues in homologous proteins are also homologous, i.e. derived from the same
ancestral residue and, typically, inherit its function. If the residue in
question is the same in a set of homologous sequences, we say that it is
(evolutionarily) conserved. Thus, homology lends legitimacy to the transfer of functional
information from experimentally characterized proteins (or nucleic acids) to
uncharacterized homologs, the single most common and practically important
application of computational methods in molecular biology. Conversely, an
alignment of non-homologous sequences is inherently meaningless and potentially
misleading. Even if such an alignment attains a relatively high percentage of
identity or similarity, no conclusions at all can be inferred from the
(spurious, in this case) correspondence between aligned residues. This is why
phrases like “significant homology” or “percent
homology” are so ludicrous. Homology is a qualitative notion of common
ancestry. As long as homology is established, 10% identical residues
between two protein sequences could be highly meaningful and amenable to
functional interpretation. In contrast, even 30% identity between two
sequences that are not homologous in reality could be totally misleading.
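The notion of an (evolutionarily) conserved residue introduced above
corresponds to a simple operation on a multiple alignment: finding the columns
in which every sequence has the same residue. The following sketch assumes that
the alignment is supplied as equal-length strings with '-' for gaps; the
function name and the toy sequences are made up for illustration.

```python
def conserved_columns(alignment):
    """Return (column index, residue) for every column in which all
    aligned sequences carry the same residue and none has a gap."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(width):
        column = {seq[i].upper() for seq in alignment}
        if len(column) == 1 and "-" not in column:
            conserved.append((i, column.pop()))
    return conserved


# Toy alignment of three made-up sequences.
toy_alignment = ["MKVC", "MRVC", "M-VC"]
print(conserved_columns(toy_alignment))
```

Such a listing is only interpretable when the aligned sequences are homologous;
applied to an alignment of unrelated sequences, it would report coincidences.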
2.1.2. Conservation of protein sequence and structure in evolution
Protein structure is conserved during evolution much better than protein
sequence. There are numerous examples of proteins that show little sequence
similarity but still adopt similar structures, contain identical or related
amino acid residues in their active sites, and have similar catalytic
mechanisms. These shared features support the notion that, despite low sequence
similarity, such proteins are homologous.
Figure 2.2. Multiple sequence alignment of goose lysozyme and its closest homologs
Absolutely conserved amino acid residues are shown in bold; conserved
hydrophobic residues are yellow.
Consider, for example, the structure of lysozyme, the enzyme that hydrolyzes
bacterial cell walls (formal name: 1,4-beta-N-acetylmuramidase, EC 3.2.1.17).
Different lysozymes are found in many organisms, from bacteriophages to mammals,
and in general, they show little sequence similarity to each other. PDB, the
database of protein structures (see 3.3), includes the lysozyme from goose (PDB
code 153L), which consists of 185 amino acid residues. The sequence neighbors
of this protein in the protein database (see 3.1.2) are lysozymes from black
swan (same length, 96% identity), ostrich (same length, 83% identity), chicken
(same length, 80% identity), as well as unannotated proteins from human (44%
identity), mouse (43% identity), and B. subtilis bacteriophage SPBc2 (25%
identity in 176-aa overlap). The vertebrate proteins in this list, including
the uncharacterized ones, are obvious homologs of the goose lysozyme. The phage
protein is more dissimilar and, in this case, the issue of homology is worth
some investigation. However, the sequence similarity between lysozymes and this
phage protein is statistically significant (as can be shown, for example, using
PSI-BLAST, see 4.3.3), and their multiple alignment shows a consistent pattern
of shared residues, thus establishing homology (Figure 2.2).
In contrast, the list of closest structural neighbors of goose lysozyme,
according to the MMDB database (http://www.ncbi.nlm.nih.gov/Structure, see
3.3), includes the classic chicken egg white lysozyme (e.g. PDB code 3LZT, 11%
identity) and lysozymes from E. coli bacteriophages λ (PDB code 1AM7, 13%
identity) and T4 (PDB code 149L, 11% identity). Nevertheless, a
superposition of the three-dimensional structures of these three proteins
clearly reveals the conserved structural core and many shared features.
Figure 2.4. Structure-based sequence alignment of goose lysozyme (153L),
chicken egg white lysozyme (3LZT), and lysozymes from E. coli bacteriophages
λ (1AM7) and T4 (1L92)
Multiple alignment, generated by the DALI program [354], was extracted from the
FSSP database (http://www.ebi.ac.uk/dali/fssp/fssp.html). The
residues that are structurally equivalent with ones in 153L are
shown in uppercase; those that are not, in lowercase. Conserved
hydrophobic residues are highlighted in yellow. The active-site Glu
residue is shown in reverse bold.
A different method of structural comparison, DALI, used in the FSSP database
(see 3.3), also identifies them as the nearest structural neighbors.
Importantly, structural and sequence comparisons are a two-way street: the
structural alignment can be transformed into a multiple sequence alignment
(Figure 2.4) in which conserved positions, including the catalytic glutamate,
can be readily identified [217].
This straightforward analysis makes us conclude that all lysozymes are
homologous, which, in this case, is easy to accept given their similar, if not
identical, functions. Furthermore, this analysis can be extended to a broad
group of other transglycosylases, which all turn out to share a conserved
catalytic domain with lysozyme and comprise a superfamily of homologous proteins
[594,863].
Does structural similarity always reflect homology? For reasons discussed in the
previous section, structural similarity that spans at least one complete domain
most likely does. It is this type of similarity that is sought by structure
comparison methods, such as VAST and DALI (see 3.3). Thus, the general rule of structure-homology correspondence
seems to be straightforward: protein domains that have the same fold according to structure
classification systems, such as SCOP or CATH, are homologs.
In principle, however, it is difficult to rule out that some common folds are so
advantageous thermodynamically that they have evolved several times
independently (convergently). This possibility has been considered, for example,
for the triose phosphate isomerase (TIM) barrel fold, given its high stability
and symmetrical, quasi-periodical organization [157].
How far does the notion of divergent evolution go? The overreaching idea that all
proteins evolved from a single primordial protein does not seem plausible.
Indeed, there is no reason to believe that proteins of different structural
classes, e.g. all-α (consisting exclusively of α-helices) and
all-β (consisting exclusively of β-strands), have a common
origin. However, certain topological changes in protein folds seem to occur
during evolution [317], and the
possibility of primordial common ancestry might become realistic if different
folds within the same structural class are considered.
Interestingly, credible relationships between certain proteins that, according to
SCOP, have different folds are detectable even through PSI-BLAST searches. For
example, statistically significant similarities between NAD-dependent
oxidoreductases and S-adenosylmethionine-dependent methyltransferases are
regularly detected in iterative database searches, and the alignments produced
are usually consistent with structural superpositions (N.V. Grishin and E.V.K.,
unpublished). Consequently, there is little doubt that these proteins, which
formally have distinct folds, do share a common ancestry. At least in principle,
such comparisons could be extended to all the numerous proteins whose structural
core consists of parallel β-sheets, leading to the more or less radical
proposal that they all have evolved from the same primordial
“Rossmann-type” domain, which possibly possessed
nucleotide-binding properties [37]. The
notion of divergence can be similarly extended to unite other types of
structurally similar domains (e.g. different all-α-helical folds) into
broad monophyletic classes. We find such generalizations attractive and
credible, but caution is due, and further elaboration of the methods for
structure comparison, perhaps combined with theoretical analysis of evolutionary
models, is required before more certainty is achieved on these potential distant
evolutionary relationships. We will return to the discussion of the possible
nature of primordial proteins when considering the early stages of biological
evolution from a comparative-genomic perspective (see 6.4).
Coming back to earth, it is important to note that approximately the same level
of sequence similarity that is seen between distantly related proteins whose
homology is established via a combination of iterative sequence searches and
structural comparisons (roughly, 8–15% identity with gaps)
can be expected to exist between two randomly chosen protein sequences. We
already listed above some criteria that allow one to distinguish between true
evidence of homology and spurious similarities. More generally, it cannot be
overemphasized that, when this level of similarity between proteins is involved,
there is no substitute (at least as of this writing) for a careful analysis of
each particular relationship. Such an analysis usually pays off, allowing one to
avoid false ‘fundamental discoveries’ and sometimes opening
up new avenues of investigation.
2.1.3. Homologs: orthologs and paralogs
As discussed above, one of the main objectives of DNA and protein sequence
analysis is to identify homologous sequences and to employ sequence and
structure conservation to predict common biochemical activities and biological
functions of proteins and non-coding sequences. The second major goal of
sequence analysis is evolutionary reconstruction per se. To address each of
these goals, it is critical to distinguish between two principal types of
homologous relationships, which differ in their evolutionary history and
functional implications. The two categories of homologs are orthologs, defined as evolutionary counterparts derived from a single ancestral
gene in the last common ancestor of the given two species, and paralogs, which are homologous genes evolved through duplication within the same
(perhaps ancestral) genome. These definitions were first introduced by Walter
Fitch in 1970 [228,229] and remained virtually unknown to molecular
biologists until the advent of genomics, at which time it became clear that
the distinction between the two types of homologs was crucial for understanding
evolutionary relationships between genomes and gene functions. In evolutionary
terms, robust identification of orthologs is essential because otherwise any
evolutionary scenarios, for example, attempts to reconstruct the gene repertoire
and gene order in ancestral genomes (see discussion below), are bound to be
meaningless. With respect to functional analysis, orthologs typically retain the
same, ancestral function, which makes transfer of functional information within
a set of orthologs generally reliable. The evolutionary basis of such
conservation of function among orthologs appears fairly obvious. Indeed,
consider a gene (or, rather, its product) in an ancestral species that was
responsible for carrying out some essential biological function. As long as the
progeny of this ancestor carries a single copy of the gene in question and does
not evolve or acquire an unrelated gene capable of providing the same function,
it has to rely on the original gene to continue carrying out that function. This
puts orthologs under strict evolutionary constraints and makes them perform the
same function as long as this function remains essential for survival or at
least confers a substantial selective advantage to its bearers.
In contrast, paralogs tend to evolve new functions, and study of paralogous
families may provide means for understanding adaptation. As first detailed by
Susumu Ohno in his classic 1970 book Evolution by Gene
Duplication [627], once
paralogs emerge as a result of a gene duplication, the pressure of purifying
selection decreases for either one (in Ohno’s original model) or,
under new, more elaborate models [448,534,877] both paralogs, which eventually
enables evolution of new functions. In each sequenced genome, a substantial
fraction (from 25 to 80% [374,408,484,506]) of genes belongs to families of paralogs, each of which
reflects functional diversification via duplications that occurred at different
stages of evolution. Classic examples include animal olfactory receptors or
nuclear hormone receptors, vast families in which an astonishing repertoire of
specificities evolved as the result of multiple duplications.
Figure 2.5. Orthologous and paralogous genes in three lineages descending from
a common ancestor
Gene sets I, II, and III should be considered co-orthologous.
The interplay of speciation events, leading to the divergence of orthologs, and
duplications, giving rise to paralogous families, results in complex
evolutionary scenarios, which may be hard to resolve (Figure 2.5). When
duplication precedes speciation, each of the paralogs gives rise to a distinct
line of orthologous descent. Conversely, when duplication occurs after a
particular speciation event in one lineage or in both lineages independently
(this can be referred to as a lineage-specific duplication or lineage-specific
expansion of a paralogous family), a situation ensues whereby a one-to-one
orthologous relationship cannot be delineated in principle (Figure 2.5).
Instead, all one can say is that the family AB in lineage 1 is orthologous to
family A’B’C’ in lineage 2 or, in other words, that A and B are co-orthologs (a
new term recently introduced to more accurately describe such relationships
[700]) of A’, B’, and C’. Clearly, in such a case, the functional
correspondence between the
two orthologous families of paralogs is less straightforward than it is between
regular, one-to-one orthologs. The relationships between homologs could become
particularly tricky if some genes in certain lineages have been lost during
evolution (a phenomenon referred to as lineage-specific gene loss, see 2.2.3).
In such cases, genes
that, at face value, appear to be orthologous may actually be paralogs, whereas
the genuine orthologs might have been lost. Once again, functional inferences
made on the basis of this type of homologous relationship require particular
caution.
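The definitions of orthologs, paralogs, and co-orthologs given above translate
directly into a rule on a gene tree: two genes are orthologs if their last
common ancestor in the tree is a speciation event, and paralogs if it is a
duplication. The sketch below assumes that each internal node of the gene tree
has already been labeled as "speciation" or "duplication" (in practice, the
hard part); the Node class, the function names, and the toy tree for the AB
versus A'B'C' scenario are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    name: str = ""    # gene name (leaves only)
    event: str = ""   # "speciation" or "duplication" (internal nodes only)
    children: list = field(default_factory=list)


def leaves(node):
    """Names of all genes (leaves) below a node."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def classify_pairs(root):
    """Label every pair of genes by the event at their last common ancestor."""
    relations = {}

    def visit(node):
        if not node.children:
            return
        label = "orthologs" if node.event == "speciation" else "paralogs"
        subtrees = [leaves(child) for child in node.children]
        # Genes in different subtrees of this node have it as their
        # last common ancestor.
        for left, right in combinations(subtrees, 2):
            for a in left:
                for b in right:
                    relations[frozenset((a, b))] = label
        for child in node.children:
            visit(child)

    visit(root)
    return relations


# Toy scenario: speciation followed by independent duplications in each lineage.
tree = Node(event="speciation", children=[
    Node(event="duplication", children=[Node("A"), Node("B")]),
    Node(event="duplication", children=[
        Node("A'"),
        Node(event="duplication", children=[Node("B'"), Node("C'")]),
    ]),
])
relations = classify_pairs(tree)
print(relations[frozenset(("A", "B"))], relations[frozenset(("A", "C'"))])
```

In this toy tree, every gene of lineage 1 is labeled an ortholog of every gene
of lineage 2, which is exactly the many-to-many (co-orthologous) relationship
described in the text; no one-to-one pairing emerges.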
Reliable identification of orthologs is only possible when complete sets of genes
from two or more genomes are compared. Indeed, if one of the compared genomes is
incomplete, a possibility always remains that the true ortholog of the given
gene is “hiding” in the unsequenced part. Even with complete
genomes, identification of orthologous gene sets is not a simple task because of
the complex evolutionary scenarios, which involve multiple duplications,
speciations, and most importantly, lineage-specific gene loss events. In
principle, complete phylogenetic analysis of all groups of homologous genes is
required to decipher true orthologous relationships. This is an extremely
labor-intensive task; moreover, it is well known that not all phylogenetic trees
provide the required resolution. “Shortcut” approaches have
been developed to circumvent the need for comprehensive phylogenetic analyses,
and some of these are discussed in subsequent chapters.
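As an illustration of what such a shortcut may look like, the sketch below
implements reciprocal (bidirectional) best hits between two complete gene
sets. This is a widely used heuristic and is shown here on our own initiative;
it is not necessarily one of the approaches discussed in the subsequent
chapters. The input format (a dictionary of similarity scores keyed by gene
pairs), the function names, and the toy scores are assumptions.

```python
def best_hits(scores):
    """scores maps (query gene, subject gene) -> similarity score.
    Returns the highest-scoring subject for each query."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) such that b is the best hit of a in genome B
    and a is the best hit of b in genome A."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores for two made-up genomes.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b1"): 120.0}
b_vs_a = {("b1", "a1"): 245.0, ("b1", "a3"): 118.0,
          ("b2", "a2"): 175.0, ("b2", "a1"): 95.0}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Such a procedure is subject to the caveats raised above: with lineage-specific
duplications it picks at most one member of a co-orthologous family, and after
lineage-specific gene loss it may pair genes that are in fact paralogs.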
2.2. Patterns and Mechanisms in Genome Evolution
Although still a young discipline, comparative genomics has matured enough to allow
delineation of the most common and important types of events that occur during
genome evolution. These include different forms of genome rearrangement, gene
duplication, and more specifically, lineage-specific expansion of gene families,
lineage-specific gene loss, horizontal gene transfer, and non-orthologous gene
displacement.
2.2.1. Evolution of gene order
Figure 2.6. Gene order comparison plots
A. Chlamydia trachomatis (X axis) vs Chlamydophila pneumoniae (Y axis)
B. Escherichia coli (X axis) vs Pseudomonas aeruginosa (Y axis)
Each dot represents a pair of genes, with the level of similarity between the encoded protein sequences indicated by color: red, >1.3 bits/position; blue, from 0.8 to 1.3 bits/position; grey, from 0.3 to 0.8 bits/position; light blue, <0.3 bits/position. The similarity scores are expressed in bits/position, rather than in total scores per protein, to remove the bias caused by variation in protein length.
Comparison of the first completely sequenced genomes promptly showed that gene
order is much less conserved than protein sequences. Genomes of the closely
related bacteria Mycoplasma genitalium and M. pneumoniae, for example, consist
of six large segments with similar organization of genes, but the segments
themselves are shifted relative to each other and partially scrambled in the
two genomes [348]. Much greater differences were found between Haemophilus
influenzae and E. coli, or even between E. coli K-12 and its pathogenic
relative E. coli O157:H7 [669,829]. The gradient of gene order conservation is
illustrated in Figure 2.6 (see color plates). In the chlamydial genomes, a
genome-scale alignment is readily traceable along the main diagonal, although
gaps in the alignment and two major inversions are equally obvious (Figure
2.6A). In contrast, the comparison of E. coli and P. aeruginosa looks
completely disordered on the genome scale (Figure 2.6B).
In fact, any such comparison between more or less distantly related prokaryotic
genomes, e.g. bacteria or archaea from different genera, would look disordered
at a scale where only conservation of about a dozen genes in a row is
noticeable. On a smaller scale, however, there is important conservation of gene
order within operons, the units of prokaryotic gene coregulation. Extensive
genome comparisons showed that, in each genome, 5% to 25%
of the genes belong to conserved (predicted) operons, i.e. strings of genes that
are shared with at least one relatively distant genome [916]. As should be expected, this fraction gradually
increases as new genomes are sequenced. A few operons that are conserved in
distantly related prokaryotes consist of genes for ribosomal proteins and some
other components of the translation machinery. Other conserved operons include
those encoding subunits of the H+-ATPase and ABC-type transporter complexes
[169,385,461,595].
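The gene order comparison plots of Figure 2.6 rest on a simple normalization:
the similarity score of each gene pair is divided by the number of positions
compared, and the resulting bits/position value determines the color of the
dot. A minimal sketch of this bookkeeping follows. The function names are
hypothetical, the use of alignment length as the denominator is an assumption,
and the grey range is taken here as 0.3 to 0.8 bits/position, which is our
reading of the legend; the treatment of values falling exactly on a boundary is
also our choice.

```python
def bits_per_position(bit_score: float, positions: int) -> float:
    """Length-normalized similarity score."""
    if positions <= 0:
        raise ValueError("positions must be positive")
    return bit_score / positions


def similarity_class(bpp: float) -> str:
    """Color class of a dot, following the legend of Figure 2.6."""
    if bpp > 1.3:
        return "red"
    if bpp >= 0.8:
        return "blue"
    if bpp >= 0.3:   # assumed lower bound of the grey range
        return "grey"
    return "light blue"


def dot_plot_points(hits):
    """hits: iterable of (gene index in genome X, gene index in genome Y,
    bit score, number of aligned positions).
    Returns (x, y, color) triples ready for plotting."""
    return [(x, y, similarity_class(bits_per_position(score, n)))
            for x, y, score, n in hits]


# Toy hits between two made-up genomes.
points = dot_plot_points([(1, 1, 300.0, 200), (2, 2, 200.0, 200),
                          (3, 40, 100.0, 200), (4, 7, 40.0, 200)])
print(points)
```

Plotting the resulting points with X and Y as gene positions gives a diagonal
wherever gene order is conserved, as in the chlamydial comparison, and a
scatter where it is not.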
2.2.2. Lineage-specific gene loss
A quick look at the genome sizes of the organisms with completely sequenced
genomes (Table 1.4) shows that many
pairs of closely related organisms have vastly different numbers of genes. Thus,
E. coli K-12 has seven times more genes than the aphid
symbiont Buchnera sp., which is located right next to
E. coli in the 16S rRNA-based phylogenetic tree. Two more
representatives of gamma-proteobacteria, H. influenzae and
P. multocida, have 2.5 times fewer genes than E.
coli. Substantial differences in the gene number can be found even
within the same genus. The gene set of Mycoplasma pneumoniae,
for example, includes all the 480 genes of M. genitalium, as
well as 197 additional genes. Mycobacterium leprae is closely
related to M. tuberculosis but has at least 1,200 fewer genes
[153].
The same phenomenon is seen throughout eukaryotes. Baker’s yeast
S. cerevisiae, for example, has about 6,000 genes, which is
at least 2,000 genes fewer than in its relatives, multicellular ascomycetes such
as Aspergillus. Furthermore, a eukaryotic intracellular
parasite, microsporidian Encephalitozoon cuniculi, which has
been identified as a derived fungus in several consistent phylogenetic studies,
has only ~2,000 genes [425], which
points to a truly dramatic scale of gene loss. About 300 genes were apparently
lost by S. cerevisiae after its radiation from the common
ancestor with fission yeast S. pombe, although the latter has
even fewer genes than S. cerevisiae [55]. All these observations show that certain phylogenetic
lineages experienced a significant gene loss, often linked to the adaptations to
the parasitic lifestyle (H. influenzae, P.
multocida, M. pneumoniae, M.
genitalium, M. leprae), or intracellular symbiosis
(Buchnera sp.), or just adaptation to a constant (narrow)
range of environmental conditions. Indeed, parasites might not need a
complicated web of metabolic pathways for the biosynthesis of amino acids,
nucleotides, and cofactors as long as they can fetch those nutrients from their
host.
In the same vein, the well-known absence of the biosynthetic pathways for 12
amino acids in humans and other vertebrates was probably made possible by the
abundance of these amino acids in the food consumed by their common ancestor at
the time of their divergence.
Figure 2.7. Pyrimidine biosynthesis genes in organisms with completely sequenced genomes
Each rectangle signifies an enzyme of the pyrimidine biosynthesis
pathway, indicated by its gene name and COG number. Alternative
enzymes catalyzing the same reaction are shown side-by-side. Each
COG is accompanied by the list of organisms represented in it (the
phyletic pattern, see 2.2.6).
The species abbreviations and order are as follows: a, Archaeoglobus fulgidus;
o, Halobacterium sp.; m, methanogens (M. jannaschii and M. thermoautotrophicum);
p, thermoplasmas (T. acidophilum and T. volcanii); k, pyrococci (P. horikoshii
and P. abyssi); z, Aeropyrum pernix; y, yeast (Saccharomyces cerevisiae);
q, Aquifex aeolicus; v, Thermotoga maritima; d, Deinococcus radiodurans;
r, mycobacteria (M. tuberculosis and M. leprae); l, streptococci (Lactococcus
lactis and Streptococcus pyogenes); b, bacilli (B. subtilis and B. halodurans);
c, Synechocystis sp.; e, Escherichia coli; f, Pseudomonas aeruginosa;
g, Vibrio cholerae; h, Haemophilus influenzae; s, Xylella fastidiosa;
n, Neisseria meningitidis; u, Helicobacter pylori and Campylobacter jejuni;
j, Mesorhizobium loti and Caulobacter crescentus; x, Rickettsia prowazekii;
i, chlamydiae (C. trachomatis and C. pneumoniae); t, spirochetes (Borrelia
burgdorferi and Treponema pallidum); and w, mycoplasmas (M. genitalium,
M. pneumoniae, and Ureaplasma urealyticum).
An analysis of gene loss in bacterial parasites showed that, in many cases, it
led to the elimination of entire pathways, such as amino acid, nucleotide, and
cofactor biosynthetic pathways (Chapter 7). For example, a number of parasitic bacteria lack pyrimidine
biosynthesis genes that are present in their free-living relatives (Figure 2.7). This has, of course, a simple
evolutionary explanation: if the necessary nutrient is available in the medium,
the genes responsible for its synthesis become redundant and can be eliminated.
Moreover, once at least one of these genes is lost, expression of the others
would lead to the accumulation of metabolic intermediates that can be harmful
for the cell. This would result in an evolutionary pressure toward coordinated
loss of all the genes in a pathway [270]. A similar trend toward coelimination of functionally connected
groups of proteins, such as the signalosome and the spliceosome components, has
been detected in yeast [55].
In a remarkable exception to the principle of coordinated gene loss, there are
cases when only a certain (typically, upstream) part of the pathway is
eliminated. Figure 2.7 shows that the complete pyrimidine biosynthesis pathway
is missing in M. genitalium and M. pneumoniae, whereas H. influenzae lacks genes
for the first three reactions of this pathway but has the complete set of genes
for all the enzymes that catalyze the conversion of dihydroorotate into CTP.
Thus, while H. influenzae is evidently incapable of de novo pyrimidine
biosynthesis, it has preserved a certain metabolic plasticity to accommodate
whatever pyrimidine it can get from its host. The same trend is seen in the
even smaller genomes of B. burgdorferi and C. trachomatis, which have lost most
of the pyrimidine biosynthesis genes but still contain genes coding for the
downstream steps of this pathway.
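The distinction between coordinated loss of a whole pathway and loss of only its upstream part can be expressed as a simple classification. The sketch below is illustrative only: the six generic step labels and the presence sets are toy data (the real pathway and gene names are not reproduced here), and only the qualitative pattern follows the description above. The function name `classify_pathway` is hypothetical.

```python
def classify_pathway(pathway, present):
    """Classify a genome by which steps of an ordered pathway it encodes."""
    have = [step in present for step in pathway]
    if all(have):
        return "complete"
    if not any(have):
        return "absent"
    # If every step from the first encoded one onward is present,
    # only the upstream part of the pathway has been lost.
    first_present = have.index(True)
    if all(have[first_present:]):
        return "upstream part lost"
    return "patchy"


# Toy data (assumption): a six-step pathway with generic step labels.
PATHWAY = ["step1", "step2", "step3", "step4", "step5", "step6"]
GENOMES = {
    "free-living relative": set(PATHWAY),   # encodes every step
    "M. genitalium": set(),                 # whole pathway missing
    "H. influenzae": set(PATHWAY[3:]),      # first three steps missing
}

for name, present in GENOMES.items():
    print(name, "->", classify_pathway(PATHWAY, present))
```

A "patchy" result, with gaps in the middle of a pathway, would be a natural candidate for closer inspection, for example for a missed gene or a non-orthologous replacement.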
2.2.3. Lineage-specific expansion of gene families
Table 2.1. Lineage-specific expansions of paralogous families in prokaryotic genomes
| Species | Number of paralogs | Protein family | Function |
| M. tuberculosis | 90 | PPE | Surface antigen, interacts with host cells |
| M. tuberculosis | 67 | PE | Surface antigen, interacts with host cells |
| H. pylori | 34 | HOP | Surface antigen, interacts with host cells |
| Synechocystis sp. | 30 | His kinase | Sensing of environmental stimuli |
| M. pneumoniae | 25 | - | Unknown |
| M. tuberculosis | 24 | MCE1 | Entry and survival inside the macrophages |
| A. fulgidus | 24 | His kinase-type ATPase | Sensing of environmental stimuli |
| Synechocystis sp. | 22 | GGDEF domain | Signal transduction |
We have already mentioned the evolutionary importance of gene duplication leading
to the emergence of paralogs, which may assume new functions, sometimes
substantially different from those of the ancestral gene. Genome comparisons
suggest that lineage-specific expansion of paralogous gene families, which in
some cases account for a sizable fraction of a genome, is one of the major
mechanisms of adaptation [408,506]. Analysis of lineage-specific gene
expansions can provide useful clues to the evolution of each particular lineage.
Table 2.1 shows that, indeed, in the pathogens M. tuberculosis and H. pylori,
the most conspicuous expansions are those of genes encoding factors involved in
interactions with and survival within the host organisms. In contrast, in the
free-living autotrophs Synechocystis sp. and A. fulgidus, the largest expansion
involves signal transduction proteins, sensor histidine kinases, and related
ATPases.
Table 2.2. Expansion of signaling domains in C. elegans
| C. elegans | 19,100 | 435 | 112 | 26 | 58 | 65 | 127 |
| S. cerevisiae | 6,500 | 116 | 14 | 10 | 24 | 3 | 110 |
| E. coli | 4,289 | 3 | 1 | 1 | 1 | 4 | 0 |
| B. subtilis | 4,100 | 4 | 0 | 1 | 6 | 5 | 0 |
| M. tuberculosis | 3,918 | 13 | 1 | 1 | 0 | 4 | 4 |
| Synechocystis | 3,169 | 12 | 0 | 1 | 3 | 4 | 2 |
| A. fulgidus | 2,420 | 4 | 0 | 0 | 0 | 2 | 0 |
| M. thermoautotrophicum | 1,869 | 4 | 0 | 0 | 0 | 2 | 0 |
| M. jannaschii | 1,715 | 4 | 2 | 0 | 0 | 3 | 0 |
| A. aeolicus | 1,522 | 2 | 0 | 1 | 0 | 1 | 0 |
In eukaryotes, lineage-specific expansion of certain protein families is even
more evident than in prokaryotes. A comparison of the genome counts of signaling
domains in the nematode C. elegans against the corresponding numbers in the
yeast S. cerevisiae and some free-living bacteria and archaea (Table 2.2) shows
that certain domains are dramatically expanded in C. elegans, even when the
greater number of genes in the worm is taken into account (see also the counts
of ankyrin repeats in C. elegans in 3.2.2).
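Taking the number of genes into account amounts to normalizing each domain count by genome size. The following sketch does this for three rows of Table 2.2, expressing the counts per 1,000 genes. The domain columns are referred to by position (0 to 5) because their names are not given in the table as reproduced here. The function names are hypothetical, and the calculation is only a rough illustration, not the analysis behind the table.

```python
# Rows copied from Table 2.2: (number of genes, six domain counts by column).
TABLE_2_2 = {
    "C. elegans": (19100, [435, 112, 26, 58, 65, 127]),
    "S. cerevisiae": (6500, [116, 14, 10, 24, 3, 110]),
    "E. coli": (4289, [3, 1, 1, 1, 4, 0]),
}


def per_thousand_genes(total_genes, domain_counts):
    """Normalize raw domain counts to counts per 1,000 genes."""
    return [1000 * count / total_genes for count in domain_counts]


def enriched_columns(focal, reference):
    """Column indices whose normalized count is higher in the focal genome."""
    focal_norm = per_thousand_genes(*TABLE_2_2[focal])
    reference_norm = per_thousand_genes(*TABLE_2_2[reference])
    return [
        index
        for index, (f, r) in enumerate(zip(focal_norm, reference_norm))
        if f > r
    ]


print(enriched_columns("C. elegans", "S. cerevisiae"))
print(enriched_columns("C. elegans", "E. coli"))
```

With these numbers, some columns remain higher in the worm than in yeast after normalization while others do not, which is one simple way to separate genuine expansion from the effect of a larger gene complement.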
2.2.4. Horizontal (lateral) gene transfer
Horizontal (lateral) gene transfer, as opposed to the standard (vertical)
transfer from ancestors to progeny, refers to acquisition of genes from
organisms that belong to other species, genera, or even higher taxa. Some
mechanisms of lateral gene transfer between different strains of the same
species, or between closely related species, are well established and include
conjugation, acquisition of plasmids, and viral (phage) infection [134]. These events are common and do not
stir much controversy. After all, it was the experiment on pneumococcal
transformation by heterologous DNA by Avery, MacLeod, and McCarthy that proved
the role of DNA in heredity. However, in the pre-genomic era, long-range
lateral gene transfer across taxa was considered to be extremely rare and
more or less unimportant in the general scheme of evolution [782]. The only instance where the fact
and impact of horizontal gene transfer had been clearly recognized was the
apparent massive flow of genes from the genomes of endosymbiotic organelles,
mitochondria in all eukaryotes and particularly chloroplasts in plants, to the
eukaryotic nuclear genome [311,312].
As soon as the first comparisons of multiple complete genome sequences representing
diverse taxa had been performed, it became apparent that lateral gene transfer
was too common to be dismissed as inconsequential [194]. First, horizontal gene flow between closely related
species turned out to be much more pervasive than ever suspected before.
Lawrence and Ochman estimate, for example, that as much as 25% of the
E. coli genome consists of recently acquired
“foreign” genes [497,625]. The actual rate
of influx and loss of new genes is even faster: it appears that, in the ~100
million years since the split between Escherichia and
Salmonella lineages, E. coli has picked up
and lost as much DNA as it has now [496,497].
In addition, genome comparisons helped to uncover numerous cases of (predicted)
horizontal gene transfer between organisms belonging to distinct phylogenetic
lineages. Archaeal genomes presented a particularly striking picture, with some
genes having close homologs only among eukaryotes and others being much more
similar to their bacterial homologs than to those from eukaryotes, if eukaryotic
homologs were detectable at all [466].
With some exceptions, the “bacterial” and
“eukaryotic” proteins in archaea were divided along
functional lines, with those involved in information processing (translation,
transcription, and replication) showing the eukaryotic affinity, and metabolic
enzymes, structural components, and a variety of uncharacterized proteins
appearing “bacterial” [466,540]. Because the
informational components generally appear to be less prone to horizontal gene
transfer [703] and in accord with the
“standard model” of early evolution whereby eukaryotes share
a common ancestor with archaea [906],
these observations could be explained by massive gene exchange between archaea
and bacteria [466]. This hypothesis was
further supported by the results of genome analysis of two hyperthermophilic
bacteria, A. aeolicus and T. maritima. Each of
these genomes contained a significantly greater proportion of
“archaeal” genes than any of the other bacterial genomes, in
an obvious correlation between the similarity in the life styles of
evolutionarily very distant organisms (bacterial and archaeal hyperthermophiles)
and the apparent rate of horizontal gene exchange between them [52,610]. Further analyses led to the discovery of genes of clear
bacterial origin in the hyperthermophilic archaeon P. furiosus,
which proved lateral gene transfer from bacteria to archaea [184].
We believe that the demonstration of the evolutionary prominence of lateral gene
transfer can be considered the single greatest change in perspective in biology
brought about by comparative genomics. A new round of controversy has been
sparked by the discovery of genes of possible bacterial origin in the human
genome [488]. In Chapter 6, we revisit this issue and
discuss implications of large-scale lateral gene transfer for the
“tree of life”.
2.2.5. Non-orthologous gene displacement and the minimal gene set concept
Proteins responsible for the same function in different organisms typically show
significant sequence and structural conservation and can be inferred to be
orthologs. However, there are exceptions to this rule. Examples of apparently
unrelated enzymes with the same specificity were noted as early as 1943 when
Warburg and Christian described two distinct forms of fructose-1,6-bisphosphate
aldolase in yeast and rabbit muscle, respectively. These two enzymes, referred
to as class I and class II aldolases, were later shown to be associated with
different phylogenetic lineages and have different catalytic mechanisms and
little structural similarity [95,549]. Unrelated enzymes that catalyze the
same reaction have been referred to as analogous, as opposed to homologous,
enzymes [228,271].
Figure 2.8. A scenario for the evolution of non-orthologous gene displacement
via an ancestral redundancy stage and lineage-specific gene loss
Comparative analysis of complete genomes shows that cases like this are common.
Strikingly, only about 65 orthologous protein sets are universally represented
in all sequenced genomes. While, in large part, this is due to lineage-specific
gene loss, this number is much lower than the number of essential functions,
indicating that other such functions are performed by unrelated (or at least
non-orthologous) proteins in different life forms. This major evolutionary
phenomenon, which came to light already in the first comparisons of sequenced
genomes, was dubbed non-orthologous gene displacement [465]. The full range of
mechanisms leading to non-orthologous gene displacement is not known. However,
in cases when essential functions are involved, the main sequence of events
appears to be clear. Since an organism cannot survive without a protein that
performs an essential function, transient functional redundancy, when an
organism has both forms of the respective protein, appears to be a prerequisite
of non-orthologous gene displacement [464]. Such redundancy might evolve via horizontal gene transfer or
via recruitment of a protein whose original function was different from the
given one (recruitment is likely to occur after gene duplication). The
redundancy phase is followed by lineage-specific gene loss, resulting in
non-orthologous gene displacement (Figure 2.8). In the case of non-essential functions, the redundancy phase might
be bypassed, with non-orthologous gene displacement evolving directly via
horizontal gene transfer or recruitment.
Enzyme recruitment is a common evolutionary phenomenon leading to non-orthologous gene
displacement. Typically, one of the two non-orthologous enzymes with the same
catalytic activity belongs to a diverse family of enzymes and could have evolved
by shifting the substrate specificity of a related but distinct enzyme [271]. A good example is the two unrelated
forms of gluconate kinase. Gluconate kinases from E. coli,
yeast, and S. pombe form a narrow conserved group. In contrast,
the gluconate kinase of B. subtilis belongs to the so-called
FGGY family of carbohydrate kinases, which also includes glycerol kinase (GlpK),
D-xylulose kinase (XylB), L-fuculose kinase, and L-xylulose kinase (LyxK). The
scenario of enzyme recruitment in this case seems straightforward: a duplication
of the glpK or xylB gene in the
Bacillus lineage produced a new paralog, which accumulated
several mutations resulting in a shift of substrate specificity from glycerol
(or xylulose) to gluconate.
Enzyme recruitment seems to be particularly common in organisms that have adapted
to novel ecological niches by developing unusual, idiosyncratic metabolic
pathways. For example, most of the enzymes that are responsible for the
biosynthesis of polyketide antibiotics in actinomycetes appear to be recent
recruits from the enzymes of fatty acid biosynthesis. Similarly, enzymes that
hydrolyze man-made halogenated hydrocarbons have close relatives among regular
metabolic enzymes and, in all likelihood, have been recruited from this source.
Perhaps the most remarkable example is the evolution of apyrase
(ATP-diphosphohydrolases, EC 3.6.1.5), the enzyme secreted by blood-sucking
insects into the blood of human or other mammalian victims in order to prevent
or slow down blood clotting [862].
Because ADP in the blood can serve as a trigger of blood clotting, any enzyme
capable of hydrolyzing it would give the hematophagous insect a substantial
evolutionary advantage. As a result of this evolutionary pressure toward
increasing salivary apyrase activity, insect apyrases are found in at least
three different forms, which are homologous, respectively, to ATPases,
5’-nucleotidases, and inositoltriphosphate phosphatases [271,862].
It is worth noting that enzyme recruitment can be legitimately described as
independent, convergent evolution of the same enzymatic activity. In Chapter 7, we look at the comparative
genomics of central metabolic pathways and encounter numerous cases of
non-orthologous gene displacement and, specifically, enzyme recruitment.
The idea of non-orthologous gene displacement was originally developed in
conjunction with the concept of a minimal gene set for a living cell [596]. This
was construed as the minimal set of genes that are essential for the functioning
of a modern-type cell even under the most favorable environmental conditions,
including abundance of nutrients and absence of competition. An attempt to
explicitly derive a version of such a minimal gene set was undertaken by
comparing the first two sequenced bacterial genomes, those of the parasites
H. influenzae and M. genitalium. The
straightforward logic of this reconstruction was that these two bacteria, which
belong to distant phylogenetic lineages, have been independently losing genes
during their adaptation to the parasitic lifestyle, and whichever common genes
remain in both genomes were likely to belong to the minimal set of essential
genes. It was noticed, however, that for certain essential functions (e.g.
glycyl-tRNA synthetase), there was no orthologous pair of genes in the two
bacteria, hence non-orthologous gene displacement had to be invoked.
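The logic of this reconstruction can be sketched with set operations. The fragment below is a deliberately simplified model, not the procedure of the original study: it assumes that each genome is already annotated as a mapping from a cellular function to the protein family that performs it, whereas the actual analysis started from sequence comparisons. Apart from glycyl-tRNA synthetase, which is named in the text, all function labels, family identifiers, and names such as `compare_gene_sets` are hypothetical.

```python
def compare_gene_sets(genome_a, genome_b):
    """Split the functions shared by two genomes into those performed by the
    same protein family (orthologs) and those performed by different families
    (candidate non-orthologous gene displacements)."""
    common_functions = genome_a.keys() & genome_b.keys()
    shared_orthologs = {
        function
        for function in common_functions
        if genome_a[function] == genome_b[function]
    }
    displaced = common_functions - shared_orthologs
    return shared_orthologs, displaced


# Toy annotations (assumption): function -> protein family identifier.
genome_a = {
    "function_1": "family_A",
    "function_2": "family_B",
    "glycyl-tRNA synthetase": "family_C",
    "function_3": "family_E",   # present in only one genome
}
genome_b = {
    "function_1": "family_A",
    "function_2": "family_B",
    "glycyl-tRNA synthetase": "family_D",   # same function, unrelated family
}

shared, displaced = compare_gene_sets(genome_a, genome_b)
minimal_functions = shared | displaced
print(sorted(shared), sorted(displaced), len(minimal_functions))
```

Counting functions rather than genes, as in the last line, anticipates the shift from a minimal gene set to a minimal set of functional requirements.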
The original version of the minimal gene set included 256 genes, with 16 inferred
non-orthologous gene displacement cases. (The magic of these numbers must not be
lost on the reader: 16 is $2^2$ to the power of 2;
$256 = 16^2$, and accordingly, 256 is $2^2$ to the
power of 2 to the power of 2. Thus, 256 is the only number that can be
represented as such a succession of powers of 2 and, at the same time, can be a
reasonable approximation of a minimal gene set: 16 is obviously too few and
$256^2 = 65{,}536$ is, in all likelihood, much greater than
the number of genes in the human genome.)
A subsequent large-scale experimental study has shown that most of the genes
included in this theoretical minimal gene set were, indeed, essential in
M. genitalium, although a few, surprisingly, were not
[364]. However, sequencing of
additional genomes and the corresponding genome comparisons have clearly shown
that this early reconstruction vastly underestimated the extent of
non-orthologous gene displacement [452,591,674]. Indeed, as indicated above, only
about 65 genes seem to be truly ubiquitous in cellular life forms, comprising
perhaps 25% of the minimal set of essential functions. Therefore, it
probably makes more sense to consider not so much a minimal gene set but rather
a minimal set of functional requirements for cell survival. Comparative genomics
shows that, for some of these requirements, a unique solution has evolved, but
for the majority, evolution has come up with two or more unrelated or distantly
related solutions. As discussed in 6.4,
non-orthologous gene displacement is prominent even in the DNA replication
machinery, the central functional system of all cells.
Figure 2.9. Distribution of different phylogenetic lineages in the COG database
The plot shows the number of protein families (COGs) in a release of the
COG database (see 3.4) that include proteins from the given number of
phylogenetic lineages, out of a total of 26 lineages [827].
2.2.6. Phyletic patterns (profiles)
As a result of numerous lineage-specific gene losses, horizontal gene transfers
and non-orthologous gene displacements, most protein families show a
“patchy” distribution among the sequenced genomes. The data
from the database of Clusters of Orthologous Groups of proteins (COGs, see 3.4) show that the majority of COGs are
represented in only three or four phylogenetic lineages; universal or nearly
universal COGs are much less common.
This distribution can be conveniently presented in the form of phyletic patterns
(profiles), which show the presence or absence of a COG in each analyzed
species. This approach, initially introduced as a feature of the COGs [828]
and subsequently adapted, with various modifications, by several research
groups [547,665,689], provides a convenient way to
compare genomes and investigate the evolutionary history of individual cellular
functions. For example, a quick examination of the phyletic patterns of the two
distinct forms of phosphoglycerate mutase (the cofactor-dependent form GpmA and
the cofactor-independent form GpmI [393]) immediately shows several interesting
trends (the species symbols are the same as in Figure 2.7):

Firstly, the two forms have largely complementary phyletic patterns, a clear sign
of non-orthologous gene displacement. Only E. coli encodes both
forms of the enzyme (and hence shows apparent functional redundancy), whereas
other organisms encode either one or the other. Secondly, several organisms do
not encode either of the two forms of this enzyme. Assuming that glycolysis is
an essential metabolic pathway, glycolytic enzymes should be encoded in every
genome (we are aware of one exception, Rickettsia, which does
not encode any glycolytic enzymes; see 7.1.1). Therefore, one might suggest that there should be an
additional, third form of phosphoglycerate mutase, which is encoded in archaeal
genomes and also in T. maritima, A. aeolicus,
and D. radiodurans. Indeed, sequence analysis of those genomes
shows that they all encode an uncharacterized enzyme, distantly related to
alkaline phosphatase and cofactor-independent phosphoglycerate mutase. Based on
the conservation of active site residues, this archaeal enzyme has been
predicted to have a phosphoglycerate mutase activity [258,261]; this
prediction has now been experimentally confirmed in two independent studies
[308,866]. Remarkably, the phyletic pattern of the respective
COG complements the union of the patterns for the two forms of phosphoglycerate
mutase, which ensures the presence of at least one type of phosphoglycerate
mutase in every species, except for Rickettsia:
Figure 2.10. Phyletic patterns of the three forms of phosphoglycerate mutase
The species symbols are as in Figure 2.7.
This summation also shows that there is no need for yet another form of
phosphoglycerate mutase, which has been designated GpmB in E.
coli (see 3.2.1.3), but has
never been experimentally demonstrated to have this activity:
Indeed, recent data show that this protein does not have a phosphoglycerate
mutase activity, at least in B. subtilis. Instead, it appears
to function as a non-specific sugar phosphatase [702]. This example shows the impressive power of the
comparative-genomic approach for prediction of gene functions. This methodology
is discussed in greater detail later in this book (see 5.2).
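The reasoning applied to phosphoglycerate mutase can be sketched as operations on sets of species symbols. The example below uses a reduced, partly hypothetical species list. The symbols a, m, v, q, d, e, and x follow Figure 2.7 and the facts stated above (E. coli encodes both forms, Rickettsia encodes none, and the third form occurs in archaea, T. maritima, A. aeolicus, and D. radiodurans). The placeholder species "1" and "2" and all function names are assumptions made for illustration, and the actual patterns for the remaining species are not reproduced.

```python
# Species order used for printing patterns; "1" and "2" are placeholders.
SPECIES = list("amvqdex12")

# Toy phyletic patterns: the set of species that encode each enzyme form.
PATTERNS = {
    "GpmA": {"e", "1"},
    "GpmI": {"e", "2"},
    "third form": {"a", "m", "v", "q", "d"},
}


def pattern_string(present, species=SPECIES):
    """Render a pattern as one symbol per species, with '-' for absence."""
    return "".join(symbol if symbol in present else "-" for symbol in species)


def summarize(patterns, species=SPECIES):
    """Return species that encode more than one form and species with none."""
    redundant = {
        symbol
        for symbol in species
        if sum(symbol in present for present in patterns.values()) > 1
    }
    covered = set().union(*patterns.values())
    missing = set(species) - covered
    return redundant, missing


two_forms = {name: PATTERNS[name] for name in ("GpmA", "GpmI")}
print(summarize(two_forms))   # gaps left by the two known forms
print(summarize(PATTERNS))    # gaps after adding the third form
for name, present in PATTERNS.items():
    print(pattern_string(present), name)
```

Species left without any form after the first call are the ones for which a third, unrelated enzyme would be predicted, and a species still missing after the second call corresponds to a genuine absence of the function, as for Rickettsia in the text.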