NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Chapter 8. Genomes and the Protein Universe

We have now surveyed some of the principal methodological approaches of comparative genomics and the major evolutionary conclusions that can be inferred from genome comparisons. In this short chapter, we view genomes from a different vantage point. We briefly describe the current understanding of the organization of the protein universe and project it onto genomes to reveal common and unique patterns.

8.1. The Protein Universe Is Highly Structured and There Are Few Common Folds

The theoretical size of the sequence space, i.e. the total number of possible protein sequences, is, for all practical purposes, infinite. Assuming that an average protein is 200 amino acids long, there can be 20^200 different protein sequences, a number that is much greater than, for example, the number of protons in our Universe (not the protein universe often mentioned in this chapter, but the physical Universe around us). Our current theoretical understanding of protein folding is insufficient to estimate the total possible number of protein structures, but one suspects it is also vast. Obviously, only a minuscule fraction of the practically infinite sequence space is populated by real protein sequences. Still, the number of unique sequences encoded in real genes is likely to be substantial. For example, assuming there are 10^7–10^8 species on Earth and the genome of each species consists of 10^3–10^5 genes, there are 10^10–10^13 unique protein sequences, a speck compared to the vast sequence space, but still several orders of magnitude more than are contained in today's databases. A question of major fundamental and practical interest is how these sequences are distributed in the sequence and structure spaces. The discussion of numerous homologous relationships between proteins in the preceding chapters should make it obvious that the distribution cannot be random: there certainly are numerous, distinct clusters of homologs separated from other clusters. However, more precise, quantitative answers are necessary, and these are usually sought within the framework of hierarchical classification of proteins. Throughout this book, we have repeatedly referred to protein folds, superfamilies, and families, which are categories within this classification. However, before we proceed further with the discussion of the structure of the protein universe, it is useful to identify the entire hierarchy and to introduce the levels more precisely (Table 8.1).
This classification had been introduced largely through analysis of protein structures in the context of the SCOP database construction [590]. Similar categories are adopted in the CATH database [633]. The phylogenetic clusters, such as COGs, which we encountered throughout this book, lie directly below the lowest category in the structural classification, the family.
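
The order-of-magnitude arithmetic behind the sequence-space estimates in this section can be checked with a short sketch. The alphabet size, protein length, species counts and gene counts come from the text; the figure of roughly 10^80 protons in the observable Universe is a commonly quoted rough estimate that is assumed here and does not come from the text.

```python
import math

# Figures taken from the text
alphabet_size = 20          # amino acids
protein_length = 200        # assumed average protein length
species_range = (10**7, 10**8)
genes_per_genome_range = (10**3, 10**5)

# Size of the theoretical sequence space, expressed as a power of ten
log10_sequence_space = protein_length * math.log10(alphabet_size)

# Lower and upper bounds on the number of unique natural protein sequences
min_sequences = species_range[0] * genes_per_genome_range[0]
max_sequences = species_range[1] * genes_per_genome_range[1]

# Assumption (not from the text): rough, commonly quoted proton count ~10^80
LOG10_PROTONS = 80

print(f"sequence space: about 10^{log10_sequence_space:.1f}")
print(f"natural sequences: 10^{math.log10(min_sequences):.0f}"
      f" to 10^{math.log10(max_sequences):.0f}")
print(f"orders of magnitude above the proton count: "
      f"{log10_sequence_space - LOG10_PROTONS:.0f}")
```

Working with base-10 logarithms keeps the comparison readable: 20^200 is about 10^260, so even the upper bound of 10^13 natural sequences occupies a vanishingly small fraction of the space.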

Table 8.1. Hierarchical classification of proteins.

Perhaps the most important categories in this classification are Fold (near the top of the hierarchy, disregarding the less informative notion of a structural class) and COG, near the bottom, which represent two fundamental levels of evolutionary relationships. In particular, folds are, typically, the largest monophyletic classes of proteins, and the number of distinct folds may be considered the central aspect of the structure of the protein universe.

Of course, experimentally determined structures of proteins representing all existing folds are not yet available, so the number of folds needs to be estimated through extrapolation. A number of researchers have attempted to come up with such estimates. In the first such analyses, Zuckerkandl [945] and Barker and Dayhoff [75] examined “independent” sequence superfamilies (i.e. those for which similarity could not be detected with the methods available at the time) and converged on ~1,000 such families. The first estimate based on independent structures rather than sequences was produced in 1992 by Cyrus Chothia [147], who used a very straightforward extrapolation approach. He found that ~1/4 of the sequences encoded in each (then partially sequenced) genome showed significant similarity to sequences in the SWISS-PROT database, and ~1/3 of the sequences from SWISS-PROT were related to one of the 83 folds available at the time. From these data, an elementary extrapolation, 83 × 3 × 4 ≈ 1,000, gives the expected total number of folds, which agreed remarkably well with the early estimates despite the different approach. Subsequent, more sophisticated estimates based on theoretical analysis of the sampling of sequence families from the structure database or from genome-specific sequence sets put the total number of folds between 700 and 4,000 [306,912,938].
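
Chothia's extrapolation can be written out explicitly. The sketch below uses only the three numbers quoted in the text; the variable names are illustrative. The reasoning is that if only a fraction of all sequences can be assigned to the known folds, then the total is roughly the known count divided by that fraction.

```python
from fractions import Fraction

known_folds = 83                              # folds available in 1992
frac_genome_in_swissprot = Fraction(1, 4)     # genome sequences with a SWISS-PROT match
frac_swissprot_with_fold = Fraction(1, 3)     # SWISS-PROT sequences related to a known fold

# Fraction of all genome-encoded sequences that map to one of the known folds
coverage = frac_genome_in_swissprot * frac_swissprot_with_fold

# Scale the known folds up by the inverse of the coverage
estimated_total_folds = known_folds / coverage

print(f"coverage of sequences by known folds: {coverage}")
print(f"estimated total number of folds: {estimated_total_folds}")
```

The estimate implicitly assumes that sequences without a detectable match are spread over unseen folds in the same proportion as the matched ones, which is one reason later, more careful analyses arrived at a wider range.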

A recent study by Coulson and Moult explicitly divided protein folds into three categories (superfolds, which consist of numerous families; mesofolds, which include a limited number of families; and unifolds, which consist of only one compact family) and produced an even higher estimate of at least 10,000 unique folds [161]. The discrepancy between these estimates becomes less dramatic and easier to explain when one examines the distribution of folds by the number of protein families (Figure 8.1). This distribution shows that there is a small number of folds with a large number of families (mostly the well-known superfolds, such as P-loop NTPases, the Rossmann fold, or TIM-barrels) and an increasing number of folds that consist of a small number of families. By far the largest class is the “unifolds”. Thus, it seems certain that the great majority (>95%) of the protein families belong to ~1,000 common folds. What is still in dispute is the number of unifolds that encompass the rest of the proteins. Approximately one half of the more common folds are already represented by at least one experimentally determined structure, which means that at least a rough mapping of the protein universe is already at an advanced stage.

Figure 8.1. The distribution of folds by the number of families in the protein structural database (PDB). The families were obtained by clustering the sequences with the cut-off of 0.3 bit/position, which was shown to give the best fit to the data (see [912]).

The curve in Figure 8.1 shows that the protein universe is extremely well-structured: not only are most of the sequences clustered in a small number of densely populated areas (folds), but the distribution of sequences among the folds is highly non-random. Why are a few folds (superfolds) so common whereas the majority are rare? As on many occasions in biology, there are two types of explanation, a physical/functional one and an evolutionary one.

The first view maintains that the common folds are good for protein functions because they are particularly stable and/or because they are well suited to accommodate catalytic and binding sites (e.g. in the loops flanking the β-sheet in the P-loop or Rossmann fold). The evolutionary approach interprets the distribution from the point of view of the simple “the rich get richer” principle: an already common domain is more likely to be adapted for a new function via duplication with subsequent diversification simply because the chance of a duplication of such a domain is greater (in network analysis, this is called “preferential attachment”). Probably more realistically, the two views may be combined in “the fit get fitter” principle [74]: domains that are functionally versatile and stable are favored by selection, but once they become common, purely stochastic processes help them proliferate further.

Once we have obtained an approximate but apparently reasonable count of the folds in the entire protein universe, it is of interest to see how this set of folds projects onto genomes from different walks of life. An extrapolation from the number of folds actually detected in the protein sequences encoded in each genome indicates that unicellular organisms encode from ~30% to ~60% of the folds (Table 8.2); even these simple life forms extensively sample the protein universe.

Table 8.2. Predicted number of protein folds in complete genomes.

8.2. Counting the Beans: Structural Genomics, Distributions of Protein Folds and Superfamilies in Genomes and Some Models of Genome Evolution

Accessible structural classifications of proteins, such as SCOP and CATH, became available almost simultaneously with multiple genome sequences. So it was a natural idea to take a structural census of the genomes, i.e. to determine and compare the distributions of protein folds and superfamilies in them [280,281,367,833,911]. These “surveys of finite parts lists”, to use the felicitous phrase of Mark Gerstein [281], are vital to structural genomics, a rapidly growing research direction that we cannot cover in this book in any adequate depth but must mention at least in passing. In principle, the goal of structural genomics is, no more and no less, to determine all protein structures existing in nature. However, since this goal is unattainable for all practical purposes, structural genomics aims at determining a representative set of structures, which would allow the rest to be modeled on the basis of homology [145,733,872].

The clustered organization of the protein universe discussed in the previous section makes this feasible, provided that a strategy of target prioritization is well defined. This strategy is to ensure that newly determined structures are non-redundant, that is, each of them represents a family or at least a COG (see Table 8.1), which so far has been missing in the structural database. For this purpose, it is essential that the structural census of genomes is as complete and accurate as possible. A recent conservative estimate suggests that “it would take approximately 16,000 carefully selected structure determinations to construct useful atomic models for the vast majority of all proteins” [872], and there is little doubt that this research program will be carried out well within the first quarter of the 21st century. So far, few structures have been actually determined within the structural-genomic paradigm, but these indeed turned out to be novel and led to functional and evolutionary insights (e.g. [572,937]).

Beyond the indisputably important practical goals of structural genomics, the genome-specific catalogues of protein folds and superfamilies are of fundamental interest: their qualitative examination may highlight important differences in the lifestyles of different organisms, whereas mathematical analysis of the distributions has the potential of revealing hidden regularities in genome evolution. Table 8.3 shows the list of the top 10 protein folds for a number of bacterial, archaeal, and eukaryotic genomes. The counts of the folds were obtained by running domain-specific PSSMs (based on the SCOP classification) against the predicted protein set from each genome [911]. What immediately strikes one when perusing this list is, first, how similar the rankings are for organisms with very different genome sizes and lifestyles, and how similar the cumulative rankings are for the three domains of life. The second prominent feature of these distributions is the overwhelming domination of the P-loops, particularly in prokaryotes. The smaller the proteome, the greater the fraction of P-loops, which emphasizes the involvement of P-loop NTPases in vital, housekeeping functions (e.g. translation and replication). At least for the top 30 folds, the dependence of the fraction of a fold in genomes on its rank is fit very well by an exponential function, which is what should be expected if the probability of duplication is the same for all folds (Figure 8.2; [911]). However, the most abundant fold, the P-loop (rank 1), is much more abundant than predicted on the basis of this straightforward assumption (Figure 8.2). To paraphrase the famous quip of J.B.S. Haldane about beetles, “God seems to have an inordinate fondness for P-loops”. P-loop NTPases are the motors associated with so many diverse functions that, when a new function emerges, it is extremely likely that a duplication of a P-loop domain will provide the necessary engine. It is only logical to surmise that this domain was among the first, if not the very first, to take shape at the dawn of life; this is, of course, compatible with all reconstructions of ancestral gene repertoires (Chapter 6).

Table 8.3. Top 10 protein folds in complete genomes from the three domains of life.

Figure 8.2. Distribution of the top 30 protein folds in combined proteomes. The vertical axis gives the average fraction of a fold in all analyzed proteomes.

In 1998, Martijn Huynen and Erik van Nimwegen [374] made the seminal observation that the frequency distributions of paralogous protein families in genomes seemed to follow a negative power law, i.e. the dependence of the general form:

$$F(i) = c\,i^{-\gamma}$$

where F(i) is the frequency of a family with i members, and c and γ are coefficients. This observation is extremely interesting and provocative because power laws are found in an enormous variety of biological, physical, and other contexts, which seem to have little if anything in common. Examples of quantities that show power-law distributions include the number of acquaintances or sexual contacts people have, the number of links between documents on the Internet, or the number of species that become extinct within a year. The classic Pareto law in economics describing the distribution of people by their income and the even more famous Zipf law in linguistics describing the frequency distribution of words in texts belong in the same category. What all of these phenomena do have in common is that they are based on ensembles or networks with preferential attachment, which evolve according to “the rich get richer” or perhaps “the fit get fitter” principles already mentioned above [74]. More specifically, power laws can potentially be explained in terms of self-organized criticality phenomena, but they also allow simple explanations through gene (protein, domain) birth, death and “innovation” models (which we designate BDIMs for short) [685,726].
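
A power law is a straight line on a double logarithmic plot, because log F = log c − γ log i. The sketch below illustrates one simple (and rather crude) way to estimate c and γ by least squares on the log-transformed values. The function name `fit_power_law` and the synthetic data are assumptions made for illustration; they are not data or methods from the studies cited above.

```python
import math

def fit_power_law(sizes, freqs):
    """Least-squares fit of log F = log c - gamma * log i; returns (c, gamma)."""
    xs = [math.log(i) for i in sizes]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free example (hypothetical values): F(i) = 500 * i^-2
sizes = list(range(1, 21))
freqs = [500.0 * i ** -2.0 for i in sizes]

c_hat, gamma_hat = fit_power_law(sizes, freqs)
print(f"c = {c_hat:.3f}, gamma = {gamma_hat:.3f}")
```

On real family-size counts, sparse and noisy large-family bins can bias a plain log-log regression, so a fit of this kind is best treated as a first look rather than a rigorous estimate.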

Gene birth occurs via duplication; gene death is, of course, the loss process discussed in detail in Chapter 6, and innovation may involve gene acquisition via horizontal gene transfer, emergence of genes from non-coding sequences, or emergence of globular domains from non-globular sequences.

A more detailed analysis indicates that domain family distributions are actually best described by so-called generalized Pareto distributions, which include power laws as asymptotics (i.e. the distribution fits a power law for large families) [419,482]. Figure 8.3 shows the domain family size distribution for the plant A. thaliana with the corresponding fits. It has been shown that such distributions can be generated by a specific class of models (linear BDIMs), in which domain birth rate is equal to the death rate, and each of these rates depends on the family size in such a way that members of small families are somewhat more likely to either duplicate or die than members of large families. Furthermore, it can be proved that linear BDIMs rapidly reach equilibrium from any initial conditions [419]. Translating this into the language of genome evolution, although genomes are “in flux” (this is also supported by the BDIM analysis, which reveals fairly high innovation rates), the number of families of a given size remains constant over extended periods of evolution. This may be considered yet another manifestation of the genomic clock discussed in Chapter 6.
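
The behavior of a linear BDIM can be explored with a small stochastic simulation. The following is a minimal sketch under stated assumptions, not the model of [419] itself: a family with i members is assumed to gain a member at rate lam·(i + a) and lose one at rate lam·(i + b), so the birth and death rate constants are equal and the per-member rates are higher in small families, while new single-member families appear at a constant innovation rate nu. All parameter values and names are hypothetical.

```python
import random
from collections import Counter

def simulate_linear_bdim(steps=5000, lam=1.0, a=0.5, b=1.5, nu=100.0,
                         n_initial=50, seed=0):
    """Toy linear birth-death-innovation simulation of domain family sizes.

    Assumed rates for a family with `size` members:
      birth (duplication): lam * (size + a)
      death (loss):        lam * (size + b)
    New single-member families arise by innovation at rate nu.
    """
    rng = random.Random(seed)
    families = [1] * n_initial  # arbitrary starting state
    for _ in range(steps):
        # One weight per possible event: a birth and a death for each family,
        # plus a single innovation event at the end of the list.
        weights = []
        for size in families:
            weights.append(lam * (size + a))
            weights.append(lam * (size + b))
        weights.append(nu)
        event = rng.choices(range(len(weights)), weights=weights)[0]
        if event == len(weights) - 1:
            families.append(1)              # innovation: a new family appears
        else:
            index, is_death = divmod(event, 2)
            families[index] += -1 if is_death else 1
            if families[index] == 0:
                families.pop(index)         # the family has gone extinct
    return families

families = simulate_linear_bdim()
size_histogram = Counter(families)  # family size -> number of families
for size in sorted(size_histogram)[:10]:
    print(size, size_histogram[size])
```

Running the simulation for more steps and from different starting states is a simple way to see whether the family-size histogram settles into a stable shape, which is the property the text describes as rapid approach to equilibrium.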

Figure 8.3. Domain family size distribution for Arabidopsis thaliana. Horizontal axis, number of family members; vertical axis, number of families.

Admittedly, the above is “bean bag genomics”, which ignores the biological identity of protein families. It is notable, however, that linear BDIMs describe with considerable accuracy even the number of large, expanded families present in each genome (Figure 8.3 and [419]), although the proliferation of such families is usually regarded as adaptation.

The observations described here suggest a more nuanced view: natural selection “chooses” families that are allowed to proliferate, but the general dynamics of protein family evolution can be described as a purely stochastic process. It seems that even oversimplified models that disregard selection are starting to reveal fundamental aspects of genome evolution.

8.3. Evolutionary Dynamics of Multidomain Proteins and Domain Accretion

Protein domains do not exist in isolation. In the course of evolution, they often combine to form multidomain architectures. As we have seen, analysis of such architectures may be helpful for predicting functions of uncharacterized domains (the guilt by association approach, see Chapter 5). Multidomain proteins play critical roles in the cell by providing effective links between different functional systems. Because of this ability, complex multidomain architectures are particularly common in all kinds of signaling systems. In many orthologous sets of proteins, a distinct trend can be traced toward increased complexity of domain architectures in more complex organisms. This tendency, which was dubbed “domain accretion” [459], is illustrated in Figure 8.4 for a set of orthologous eukaryotic transcription factors. Various domains added to the conserved core, typically at the ends, provide tethering to other chromatin-associated proteins and, in the case of A. thaliana, apparently to the ubiquitin-dependent protein degradation machinery.

Figure 8.4. Domain accretion in transcription factor TAFii250. C1, C2, C3, uncharacterized conserved domains; Zk, Zn knuckle; Br, Bromo domain; Ub, ubiquitin.

Since proteins form complex networks (see below), even a modest increase in the number of domains in interacting partners may translate into numerous new interactions, which probably contributes to the solution of the ostensible paradox of “too few” genes in complex organisms [488].

Given the utility of multidomain proteins for a variety of cellular functions, one could think that natural selection would favor their formation to the extent that they would be overrepresented with respect to the single-domain proteins. Quantitative analysis does not seem to support this conclusion. The distribution of proteins by the number of different domains (repeats of the same domain were excluded from the analysis) shows an excellent fit to an exponential function ([911]; Y.I. Wolf and E.V.K., unpublished observations; Figure 8.5), which is compatible with a random recombination (joining and breaking) model of evolution of multidomain proteins. We should note, however, that the slopes of the curves in Figure 8.5 are markedly different for archaea, bacteria, and eukaryotes, indicating that the fraction of multidomain proteins or, in terms of the random model, the relative probability of domain joining increases in the order:

Archaea < Bacteria < Eukaryotes

Figure 8.5. Distributions of the number of proteins with different number of domains in bacteria, archaea, and eukaryotes. The plot is in double logarithmic scale.

The underrepresentation of multidomain proteins in archaea compared to the other two domains might be related to the low stability of large proteins in the hyperthermophilic habitats of these organisms. The excess of multidomain proteins in eukaryotes is not unexpected given the complexity considerations above. We should also note the deviation from the exponential fit in the right tail of the distribution, which is caused by the presence of proteins with a large number of domains (Figure 8.6).
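
Why random joining leads to an exponential distribution can be seen in a deliberately simplified, joining-only toy model (it ignores breaking and is not the analysis of [911]). If each protein starts with one domain and another domain is appended with a fixed probability p_join, repeatedly, the number of domains follows a geometric distribution, P(n) = (1 − p_join)·p_join^(n−1), which decays exponentially with slope ln(p_join). The function names and the values of p_join below are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_domain_counts(p_join, n_proteins, seed=0):
    """Simulate domain counts when each added domain joins with probability p_join."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_proteins):
        n_domains = 1
        while rng.random() < p_join:
            n_domains += 1
        counts[n_domains] += 1
    return counts

def expected_fraction(n_domains, p_join):
    """Geometric expectation: P(n) = (1 - p) * p**(n - 1)."""
    return (1 - p_join) * p_join ** (n_domains - 1)

# Two hypothetical joining probabilities; a larger value gives a shallower slope
for p_join in (0.2, 0.4):
    counts = simulate_domain_counts(p_join, n_proteins=20000, seed=1)
    print(f"p_join = {p_join}, slope of ln P(n) = {math.log(p_join):.2f}")
    for n in range(1, 5):
        observed = counts[n] / 20000
        print(f"  n = {n}: observed {observed:.3f}, "
              f"expected {expected_fraction(n, p_join):.3f}")
```

In this picture, a difference in slope between groups of organisms corresponds to a difference in the single parameter p_join, which is how the text interprets the ordering of the three domains of life.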

Figure 8.6. Distribution of protein domains by the number of links in multidomain proteins. The number of links is the number of different domains with which the given domain combines in multidomain proteins. The data from 13 analyzed bacterial and archaeal genomes...

The above analysis tells us nothing about the propensity of individual domains to form multidomain architectures, and these propensities differ widely. Perhaps by now we should not be surprised to learn that the distribution of the number of connections a domain has with other domains in multidomain proteins can be roughly approximated by a power law ([924]; Figure 8.6). This means that a small number of domains are hubs of multidomain connections that hold together cellular interaction networks. We already referred to these domains as “promiscuous” and mentioned some examples when discussing the “guilt by association” approach in Chapter 5. As in the case of the evolutionary dynamics of domain families discussed in section 8.2, although evolution of multidomain proteins seems to occur via random processes of joining and breaking (Figure 8.4), the fit (to form usable multidomain architectures) still “gets fitter and fitter”.
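
The number of links of a domain, defined as the number of different domains with which it combines in multidomain proteins, is straightforward to compute from a list of domain architectures. The sketch below uses a small set of made-up architectures purely for illustration; the function name `count_links` and the toy data are assumptions, not data from [924].

```python
from collections import defaultdict

def count_links(architectures):
    """Return, for each domain, the number of distinct partner domains."""
    partners = defaultdict(set)
    for architecture in architectures:
        distinct = set(architecture)        # repeats of a domain count once
        for domain in distinct:
            partners[domain] |= distinct - {domain}
    return {domain: len(others) for domain, others in partners.items()}

# Hypothetical architectures, one list of domain names per protein
toy_architectures = [
    ["Kinase", "SH2", "SH3"],
    ["SH3", "PDZ"],
    ["Kinase"],
    ["SH3", "SH2"],
    ["PDZ", "PDZ", "SH3"],
]

links = count_links(toy_architectures)
for domain, n_links in sorted(links.items(), key=lambda item: -item[1]):
    print(domain, n_links)
```

In this toy set, SH3 plays the role of a hub because it combines with every other domain, whereas PDZ has a single partner. Tabulating how many domains have each number of links gives the kind of distribution that is compared with a power law.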

8.4. Conclusions and Outlook

In this brief chapter, we could cast only a very superficial and formal glance at the protein universe. Even so, we could notice several major features of this world. Although the theoretical sequence and probably structure spaces are virtually infinite, the populated parts are definitely finite and, more importantly, extremely non-homogeneous. Because of this concentration of proteins in a relatively small number of hubs (folds and superfamilies), a properly equipped party of explorers, such as the structural genomics program, can visit at least all the major ones in a reasonable time.

We have further seen that superimposing the structure of the protein universe upon genomes and quantitative analysis of the results may give us unexpected insights into some general principles (may we tentatively say “laws”?) of genome evolution.

8.5. Further reading

Barabasi AL. Linked: The New Science of Networks. Cambridge, MA: Perseus Publishing; 2002.
Coulson AF, Moult J. A unifold, mesofold, and superfold model of protein fold use. Proteins. 2002;46:61–71. [PubMed: 11746703]
Huynen MA, van Nimwegen E. The frequency distribution of gene family sizes in complete genomes. Molecular Biology and Evolution. 1998;15:583–589. [PubMed: 9580988]
Koonin EV, Aravind L, Kondrashov AS. The impact of comparative genomics on our understanding of evolution. Cell. 2000;101:573–576. [PubMed: 10892642]
Qian J, Luscombe NM, Gerstein M. Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. Journal of Molecular Biology. 2001;313:673–681. [PubMed: 11697896]
Vitkup D, Melamud E, Moult J, Sander C. Completeness in structural genomics. Nature Structural Biology. 2001;8:559–566. [PubMed: 11373627]
Copyright © 2003, Kluwer Academic.
Bookshelf ID: NBK20267

