Proc Natl Acad Sci U S A. Sep 19, 2006; 103(38): 14056–14061.
Published online Sep 7, 2006. doi:  10.1073/pnas.0606239103
PMCID: PMC1560931

Evolution of protein structural classes and protein sequence families


In protein structure space, protein structures cluster into four elongated regions when mapped based solely on similarity among the 3D structures. These four regions correspond to the four major classes of present-day proteins defined by the contents of secondary structure types and their topological arrangement. Evolution of and restriction to these four classes suggest that, in most cases, the evolution of genes may have been constrained or selected to those genetic changes that result in structurally stable proteins occupying one of the four “allowed” regions of the protein structure space, “structural selection,” an important component of natural selection in gene evolution. Our studies on tracing the “common structural ancestor” for each protein sequence family of known structure suggest that: (i) recently emerged proteins belong mostly to three classes; (ii) the proteins that emerged earlier evolved to gain a new class; and (iii) the proteins that emerged earliest evolved to become the present-day proteins in the four major classes, with the fourth-class proteins becoming the most dominant population. Furthermore, our studies also show that not all present-day proteins evolved from one single set of proteins in the last common ancestral organism, but new common ancestral proteins were “born” at different evolutionary times, not traceable to one or two ancestral proteins: “the multiple birth model” for the evolution of protein sequence families.

Keywords: protein fold classes, common structural ancestor, evolutionary age, protein structure universe

The protein universe (1), the totality of all proteins in all organisms on Earth, is vast. However, an estimate of the order of magnitude can be made (Table 1): Although the currently known genome sizes range from 10⁶ to 10¹¹ DNA base pairs, the number of genes is estimated to range only from 10³ to <10⁵ per organism (www.ncbi.nlm.nih.gov/genomes). Taking into account the estimated 13.6 million species of living organisms on Earth (2), which is very likely an underestimate, there are >10¹⁰ to 10¹² different proteins in all organisms from the three domains of life (Eukarya, Bacteria, and Archaea) on Earth. However, this vast number of proteins is predicted to consist of only ≈10⁵ sequence domain families (3), the members of each family having similar amino acid sequences (4). The sizes of the sequence domain families have a power law distribution (Fig. 1): Most families have a small number of members, but some have a very large number of members. We expect a similar distribution for sequence families. Most of these ≈10⁵ sequence families are estimated to belong to ≈10⁴ structural families (5–7), because some sequence families turn out to have the same 3D structural fold. Some protein structures consist of more than one domain and, at present, ≈10³ structures of fold domains are known (8).
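The order-of-magnitude estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes one protein per gene and treats every gene in every species as a distinct protein, which is the simplest reading of the estimate rather than a stated method.

```python
import math

SPECIES = 13.6e6                   # estimated number of species on Earth (ref. 2)
GENES_PER_ORGANISM = (1e3, 1e5)    # estimated range of gene counts per organism


def order_of_magnitude(x):
    """Return the exponent of the largest power of ten not exceeding x (x > 0)."""
    return math.floor(math.log10(x))


# Assumption: one protein per gene, no proteins shared between species.
low = SPECIES * GENES_PER_ORGANISM[0]
high = SPECIES * GENES_PER_ORGANISM[1]

print(f"proteins: 10^{order_of_magnitude(low)} to 10^{order_of_magnitude(high)}")
```

The product of the species count and the per-organism gene range lands in the 10¹⁰ to 10¹² range quoted in the text.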

Fig. 1.
The family sizes of protein sequence domains in Pfam database (3) (release 16; 7,677 Pfam families) have a power law distribution. The Pfam families (x axis) were sorted by their family size. The number of members in a given Pfam family (y axis) was truncated ...
Table 1.
The estimated orders of magnitude of the total numbers in various categories for all proteins in all organisms on Earth

There has been a long history of attempts to classify known protein structures based on subjective analysis of the secondary structure contents of proteins and their topological arrangements in the structures (9–12) and on objective analysis of 3D coordinates of Cα atoms in protein structures (1, 13, 14). These attempts resulted in, among others, two excellent databases of protein structure classification, CATH (14) and SCOP (11). A recent study, based solely on objective similarity among the 3D structures represented by Cα atoms and using a much larger structure database and multidimensional scaling, revealed that all of the known protein folds (15) and protein structures (16) cluster into four elongated regions in the very sparsely populated protein structure space (Fig. 2). Interestingly, these four groups correspond approximately to the four classes defined by Levitt and Chothia (9) and used in SCOP, the Structural Classification of Proteins (11).

Fig. 2.
A global view of the protein structure space (16). The 1,898 nonredundant protein structures from Protein Data Bank are mapped in the 3D space to visualize the major feature of the map. The protein structure space is sparsely populated, and all of the ...

The fact that most proteins are structured and that the protein structure space is very sparsely populated and restricted mostly to the four elongated regions suggests that mutations in genes encoding proteins have been constrained to those resulting in a structurally viable protein occupying one of the four allowed regions of the protein structure space: structural selection or “designability” (17, 18).

To obtain information on the evolution of these structural classes, we present a simple way of estimating the evolutionary ages of the common structural ancestor (CSA) of each protein sequence family of known fold. Assigning the age of the CSA of a protein family represented by each representative protein in the protein structure space (16) makes it possible to embed the evolutionary information into the map of the protein structure universe. We assign the age of the CSA of a protein family to be the same as the age of the most recent common ancestral organism that presumably contained the CSA of the family. Finally, we convert the map of the protein structure universe into the map of the ages of CSAs. Based on the analysis of these maps of the protein structure universe and the evolutionary ages of the CSAs, we propose a model for the evolution of protein structural classes and a model for the evolution of protein sequence families.

We start with the following facts and assumptions:

  1. There is a key difference between the evolution of organisms vs. the evolution of proteins: the current model of evolution of organisms has the absolute requirement of reproduction of organisms and, thus, all present-day organisms ultimately come from one common ancestor organism. However, the evolution of proteins, therefore genes, does not need to follow the evolutionary path of organismic reproduction. Rather, the evolution of proteins is directly related to improved, unaltered, or diversified molecular functions, and the protein function is directly related to protein structure.
  2. Protein structures are more conserved than sequences in evolution, thus most proteins in a given sequence family have similar or related molecular structures.
  3. All information about protein structures is derived from the proteins of present-day organisms, and the protein universe of the present-day organisms represents a time-sliced view of all proteins at their various stages of evolution.


Evolutionary Age of CSAs.

Mapping the protein structure universe revealed four major clusters of protein structures (1, 15, 16). An examination of the map hinted at evolutionary time embedded in the map. To estimate the “age” of a protein structure in the map, we define the term CSA: For a given protein structure, all of its sequence homologues are retrieved from a sequence database, for example, from the Pfam database (3), and all of the organisms that contain the genes coding for the members of that sequence family are identified. We then find the most recent common ancestor (MRCA) node of these organisms in the phylogenetic tree of life constructed based on the small subunit rRNA gene as described in Materials and Methods. We assume that the CSA of the protein and its family members was present in the MRCA organism (Fig. 3a), and that the age of the CSA is represented by the phylogenetic distance between the MRCA and the reference node in the tree. The procedure is shown schematically in Fig. 3b.
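The MRCA lookup and the distance to a reference node can be sketched as follows. The tree, its node names, and its branch lengths are hypothetical, and the choice of the root as the reference node is an assumption made only for illustration; the text does not say which node serves as the reference.

```python
# Toy rooted tree stored as child -> (parent, branch length).
# All names and lengths are hypothetical illustration values.
TREE = {
    "A": ("n1", 0.2),
    "B": ("n1", 0.3),
    "C": ("n2", 0.5),
    "n1": ("n2", 0.4),
    "n2": ("root", 0.1),
    "D": ("root", 0.9),
}


def path_to_root(node, tree):
    """Return the nodes visited when climbing from `node` to the root, inclusive."""
    path = [node]
    while path[-1] in tree:
        path.append(tree[path[-1]][0])
    return path


def mrca(nodes, tree):
    """Most recent common ancestor of a non-empty collection of nodes."""
    nodes = list(nodes)
    common = set(path_to_root(nodes[0], tree))
    for node in nodes[1:]:
        common &= set(path_to_root(node, tree))
    # Climbing from any one node, the first shared ancestor met is the MRCA.
    for node in path_to_root(nodes[0], tree):
        if node in common:
            return node
    raise ValueError("nodes do not share a common ancestor")


def distance_to_ancestor(node, ancestor, tree):
    """Sum of branch lengths on the path from `node` up to `ancestor`."""
    total = 0.0
    while node != ancestor:
        parent, length = tree[node]
        total += length
        node = parent
    return total


# A family found in organisms A, B, and C maps to their MRCA node.
family_node = mrca(["A", "B", "C"], TREE)
# Assumption: the root is used as the common reference node.
print(family_node, distance_to_ancestor(family_node, "root", TREE))
```

In this toy tree, a family confined to A and B maps to a node farther from the root than a family that also occurs in C, which is the information the age assignment relies on.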

Fig. 3.
Schematic diagram for building a phylogenetic tree representing all of the organisms that contain the proteins of known structures or their sequence homologues (a) and assigning the age of the CSA of a protein family (b). The MRCA organism of the organisms ...

Evolution of the Relative Abundance of the Protein Structural Classes.

When each protein structure in the protein structure space (Fig. 2) is represented by the relative age of the CSA (Fig. 4a) of the protein family to which it belongs, we see a general trend: the proteins with young CSA age (blue) belong mostly to three classes (α, β, and α+β classes), those with middle age (green or yellow) belong to the same three classes plus α/β class, and, finally, the majority of the CSAs of old age (red) belong to α/β class. This observation suggests that recently born and still-evolving proteins belong to all-α or all-β class (as well as their random mixtures, α+β class), but the majority of the “mature” proteins belong to α/β class. The trends of the evolution of the protein structural classes are more easily visible in a distribution of structural classes across the evolutionary ages (Fig. 4b Upper) or the relative percent population of structural classes in a given evolutionary age (Fig. 4b Lower).

Fig. 4.
Evolution of the relative abundance of the protein structural classes. (a) The “age map” of CSAs. The color gradient, from blue (the youngest) to red (the oldest), represents the relative age of the CSAs of the protein families represented ...

We also notice that the protein chain lengths correlate significantly (Spearman's rank correlation coefficient r = 0.3098, P < 2 × 10⁻¹⁶) with the ages of CSAs (Fig. 4c). Combining these observations with the assumption that the present-day proteins represent the entire spectrum of proteins at different stages of evolution from their respective CSAs, we propose a scenario for the evolution of protein structural classes: small ancestral proteins with short secondary structures, primarily in three classes (α, β, and α+β classes), evolve to medium-sized proteins of four classes (α, β, α+β, and α/β classes) in roughly similar proportions, then to larger proteins with a preponderance in α/β class, as schematically shown in Fig. 5.

Fig. 5.
Proposed scenario for the evolution of protein structural classes extrapolated from the age map of common structural ancestors. The age map (Fig. 4a) is a snapshot at present time of the global evolutionary process of protein structural classes. The observation ...

Evolution of Protein Families: Multiple Birth Model.

We have expanded our approach to estimate the evolutionary ages of all curated protein sequence families in Pfam. As was evident from the ages of the CSAs of the proteins with known structural folds, not all present-day proteins evolved from the proteins of the last common ancestor; rather, new CSAs can be traced to various points throughout evolutionary time. Combining this information again with the assumption that the protein universe of the present-day organisms represents a time-sliced view of all proteins at their various stages of evolution, we propose a possible scenario for the evolution of protein families as illustrated in Fig. 6. We hypothesize that, although all present-day organisms may have evolved from the last common ancestral organisms by organismic replication from an ancestor to a descendant organism, most of the present-day protein families did not evolve from ancestor proteins that existed in the last common ancestral organism (Fig. 6a), as expected for “the single birth model” of protein family evolution (19), but new CSAs were born throughout evolutionary time (Fig. 6b, the multiple birth model of protein evolution), and they evolved to the present-day proteins or died out.

Fig. 6.
Model for the evolution of protein families: (a) Single birth model of protein families, where all of the present-day protein families evolved from the proteins that existed in the last common ancestral organism. Each colored circle represents a CSA of ...


We emphasize that our studies are aimed at gaining a coarse-grained global view and overall trends associated with the evolution of the protein structure classes and sequence families. In our multistep processes projecting the evolutionary ages onto the protein structural space map, many details are “smoothed out” to extract the major trends of evolution of the protein structure classes and sequence families, such as the effect of horizontal gene transfer and sampling of only those globular proteins for which the 3D structures are known. For example, we remove those proteins that may have entered an organism through horizontal gene transfer by the jackknife test as described in Materials and Methods. Some of our conclusions are consistent with those of others. For example, the α/β class proteins as the most ancient proteins also have been suggested by a parsimonious scenario of fold occurrence in genomes (20), and birth, death, and diversification of genes have been described in ref. 21.

There are several questions invoked by the features of the protein structural space and its evolutionary implications. Some of them are as follows:

  1. How is the gene for a new CSA born? Because the new CSA has no traceable single ancestral protein, we propose that the new gene for the CSA was constructed of multiple gene fragments, for example, by multiple recombination events mediated by phages, viruses, or other mechanisms.
  2. Is the protein structure space constantly expanding or has it reached an equilibrium state? One possible argument for the equilibrium state is that a newly born protein evolves into gradually larger-sized proteins of improved, neutral, or diversified functions until it reaches an equilibrium, at which point destabilizing effects of the large size (of the protein, thus, its gene) outweigh the additional changes in function or diversity.
  3. What is the implication of Fig. 4b, which reveals three evolutionary stages where the four major protein structure classes changed their ranking in relative abundance? One possible implication is that there were three evolutionary periods when the Earth environment changed dramatically.

Materials and Methods

Construction of a Phylogenetic Tree Representing All of the Organisms That Contain the Proteins of Known Structures or Their Sequence Homologues.

We used the 1,898 protein chains representing a nonredundant set of all of the known protein structures in Protein Data Bank (PDB) [PDB_select 25 data set (22) used by Hou et al. (16) for mapping the protein structure space] as a reference data set. For each chain, we identified the protein domain family in the Pfam database (Release 16.0) (3) to which it belongs and all of the organisms represented by the members of the family. To reconstruct the phylogenetic tree of organisms covering all members of the retrieved protein families, we combined the taxonomic sources of all members of the protein families and extracted nonredundant species (65,532 organisms).

To simplify the tree structure, we grouped the nonredundant organisms at a higher level of taxonomic classification. When the fourth level of taxonomic classification listed in the Pfam database was used for grouping, the final number of taxa was reduced to 468. Among them, the gene sequences of small subunit rRNA (16S rRNA of prokaryotes or 18S rRNA of eukaryotes) of 345 taxa were available in the European ribosomal RNA database (23) (Table 2, which is published as supporting information on the PNAS web site). For each of these 345 taxa, we chose the longest small subunit rRNA sequence (but not shorter than 1,200 bases). Using these prealigned rRNA sequences, a universal tree of life for the 345 taxa was constructed with the neighbor joining (NJ) and maximum likelihood (ML) methods in the PHYLIP package (24). For the NJ tree, we used 100 bootstrapped sequence replicates and obtained a consensus tree. Because the consensus tree does not produce branch lengths, the branch lengths of the consensus tree were recalculated by the maximum likelihood method while keeping the NJ tree topology. The ML tree was built under the assumption of a constant rate with the F84 model of sequence evolution. For both tree-building procedures, we used a bacterial taxon (Aquifex pyrophilus) as an outgroup. Because both trees were topologically similar and the evolutionary ages of protein structural classes calculated by the method (see below) showed no disparity in terms of shape of distribution and overall trend (Fig. 7, which is published as supporting information on the PNAS web site), we selected the ML tree as a reference tree for the purpose of obtaining a global view of the phylogenetic relationship among the organisms at a higher taxonomic level.
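The grouping and sequence-selection steps described above can be sketched as follows. The lineage format (a semicolon-separated string), the reading of "fourth level" as the first four ranks of the lineage, and all record names are assumptions made for illustration; only the 1,200-base minimum comes from the text.

```python
MIN_LENGTH = 1200  # minimum small subunit rRNA length stated in the text


def taxon_key(lineage, level=4):
    """Truncate a semicolon-separated lineage string to its first `level` ranks.

    Assumption: grouping at the "fourth level" means keeping the first four ranks.
    """
    ranks = [rank.strip() for rank in lineage.split(";") if rank.strip()]
    return "; ".join(ranks[:level])


def pick_longest_per_taxon(records, min_length=MIN_LENGTH):
    """Choose the longest qualifying sequence for each grouped taxon.

    records: iterable of (lineage, sequence_id, sequence) tuples.
    Returns {taxon: (sequence_id, sequence)}; taxa with no sequence of at
    least `min_length` bases are left out.
    """
    best = {}
    for lineage, seq_id, seq in records:
        if len(seq) < min_length:
            continue
        key = taxon_key(lineage)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (seq_id, seq)
    return best


# Hypothetical records; the sequences are placeholders of the stated lengths.
records = [
    ("Bacteria; Aquificae; Aquificae; Aquificales; Aquificaceae", "seq1", "A" * 1300),
    ("Bacteria; Aquificae; Aquificae; Aquificales; Hydrogenothermaceae", "seq2", "A" * 1500),
    ("Archaea; Euryarchaeota; Halobacteria; Halobacteriales", "seq3", "A" * 1000),
]
chosen = pick_longest_per_taxon(records)
for taxon, (seq_id, seq) in chosen.items():
    print(taxon, seq_id, len(seq))
```

Taxa without any sequence of the required length drop out of the result, mirroring the reduction from 468 grouped taxa to the 345 with usable rRNA sequences.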

Estimating the Relative Age of the CSAs.

We assume that the CSA of a given protein sequence family appeared most recently in an organism at the MRCA node and that the age of the CSA is represented by the branch length between the MRCA and a common reference node. To determine the MRCA node in the tree, we mapped onto the tree all members of the Pfam family to which a given protein of known structure belongs. To remove the effect of horizontal gene transfer on estimating the CSA ages, we first tested the congruency of multiple MRCA nodes of the organisms represented by a protein family by using a jackknife operation: for each member organism in turn, the MRCA node is identified for all member organisms minus that one, and the evolutionary age of the MRCA is estimated; the ages of the resulting MRCAs are examined, and those that are statistically outside of the monomodal distribution are removed; and the median value of the remaining ages is taken. The ages of CSAs were normalized to be in the range of 0 to 1. These relative ages are assigned to each of the 1,898 nonredundant protein structures in the protein structure space to visualize the major features of the distribution of the ages vs. protein structure classes (see below).
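A minimal sketch of the jackknife step and the normalization follows. The text does not state the statistical test used to reject ages outside the monomodal distribution, so the cutoff below (a multiple of the median absolute deviation) is an assumption, as are all function and parameter names; `age_of_mrca` stands for any routine that returns the MRCA age for a list of organisms.

```python
import statistics


def robust_median(ages, cutoff=3.0):
    """Drop ages far from the median, then return the median of the rest.

    ASSUMPTION: "far" means more than `cutoff` median absolute deviations
    from the median; the source does not specify its outlier criterion.
    """
    center = statistics.median(ages)
    mad = statistics.median(abs(a - center) for a in ages)
    if mad > 0:
        ages = [a for a in ages if abs(a - center) <= cutoff * mad]
    return statistics.median(ages)


def jackknife_age(organisms, age_of_mrca, cutoff=3.0):
    """Leave-one-out estimate of the CSA age of one family.

    organisms   : list of organisms that carry members of the family
    age_of_mrca : hypothetical callable, list of organisms -> age of their MRCA
    """
    if len(organisms) < 2:
        return age_of_mrca(organisms)
    ages = [
        age_of_mrca(organisms[:i] + organisms[i + 1:])
        for i in range(len(organisms))
    ]
    return robust_median(ages, cutoff)


def normalize(ages):
    """Rescale a list of ages linearly onto the range 0 to 1."""
    lo, hi = min(ages), max(ages)
    if hi == lo:
        return [0.0 for _ in ages]
    return [(a - lo) / (hi - lo) for a in ages]


# Hypothetical leave-one-out ages for one family; the last value stands apart.
print(robust_median([0.50, 0.52, 0.51, 0.49, 0.95]))
print(normalize([2.0, 3.0, 6.0]))
```

Taking the median after discarding outlying leave-one-out ages keeps a single deviating estimate from setting the age of the whole family.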

Mapping the Relative Ages of CSAs on the Protein Structure Space.

Mapping of the relative ages of proteins of known structure is done in two stages: First, the 1,898 nonredundant protein structures are positioned (mapped) in the protein structure space based on their all-to-all structural similarities as described in Hou et al. (15, 16) and briefly summarized below. As mentioned earlier, we used the PDB_select 25 data set, which contained 1,949 protein chains with <25% pairwise sequence identity. Of those, 51 chains were further removed because of low resolution or length requirements of the DaliLite (25) program that we used to calculate the similarity of protein structures. The remaining data set has 1,898 chains. The pairwise structural similarity for the 1,898 protein chains was measured by using the DaliLite program. The 1,898 × 1,898 similarity score matrix [sij] (where i = 1,…,1,898; j = 1,…,1,898) was converted to a dissimilarity matrix [dij], the “distance matrix,” by using

equation image

where s99.95 is the 99.95 percentile value of the maximum value among all off-diagonal sij's (i.e., i ≠ j). The dissimilarity matrix was then subjected to the classical multidimensional scaling (MDS) procedure (26) to find the positional coordinates in a multidimensional (1,898-dimensional) space of the protein structure universe. We used s99.95 to prevent a few extremely large similarity scores from dominating the distribution feature of the structural space map. To capture and visualize the major features of the high-dimensional space, we represent the protein structure space in three dimensions (Fig. 2) by using the three components with the highest eigenvalues, which are substantially greater than the rest.
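The two steps above, converting similarities to dissimilarities and applying classical MDS, can be sketched with NumPy. The conversion formula itself is not reproduced in this text, so the form used below, d_ij = max(s_cap − s_ij, 0) off the diagonal and 0 on it, with s_cap the 99.95th percentile of the off-diagonal scores, is an assumption consistent with the stated purpose of capping extreme scores. Function names are illustrative.

```python
import numpy as np


def similarity_to_dissimilarity(s, percentile=99.95):
    """Convert a similarity matrix to a dissimilarity matrix.

    ASSUMED form: d_ij = max(s_cap - s_ij, 0) for i != j and d_ii = 0,
    where s_cap is the given percentile of the off-diagonal scores.
    """
    s = np.asarray(s, dtype=float)
    off_diagonal = ~np.eye(len(s), dtype=bool)
    s_cap = np.percentile(s[off_diagonal], percentile)
    d = np.clip(s_cap - s, 0.0, None)   # scores above the cap become distance 0
    np.fill_diagonal(d, 0.0)
    return d


def classical_mds(d, n_components=3):
    """Classical (Torgerson) MDS: coordinates from a dissimilarity matrix.

    Returns (coordinates, eigenvalues) for the `n_components` largest eigenvalues.
    """
    d = np.asarray(d, dtype=float)
    n = len(d)
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering    # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(b)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale, eigvals[order]


# Hypothetical 4 x 4 similarity scores (symmetric, large on the diagonal).
scores = np.array([
    [30.0, 12.0, 3.0, 2.0],
    [12.0, 30.0, 4.0, 3.0],
    [3.0, 4.0, 30.0, 15.0],
    [2.0, 3.0, 15.0, 30.0],
])
coords, top_eigvals = classical_mds(similarity_to_dissimilarity(scores))
print(coords.shape, top_eigvals)
```

Comparing the leading eigenvalues with the remainder is what justifies keeping only three components for the map.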

Second, we represent the relative age of each of the nonredundant protein structures by the relative age of the CSA of the sequence family to which that particular protein belongs. Then, we apply population averaging, replacing the age of each CSA by the average age of its 22 nearest neighbors, weighted by their distances in the map. The number of nearest neighbors was chosen as the median of the number of statistically significant score pairs (DaliLite z score ≥2) of the 1,898 protein chains. The weighted population averaging serves to visualize the major trends of the “age map of CSAs” and to smooth out the “noise” due to factors such as horizontal transfer of genes, sparse sampling of protein families, and the members of each family (Fig. 4).
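The population averaging can be sketched as a distance-weighted mean over nearest neighbors. The text says only that the average is weighted on the distances, so the inverse-distance weights and the exclusion of the point itself from its own neighbor set are assumptions; the coordinates and ages in the example are hypothetical.

```python
import math


def smooth_ages(coords, ages, k=22, eps=1e-9):
    """Replace each age by a weighted mean over its k nearest neighbours.

    ASSUMPTIONS: weights are 1 / (distance + eps), and a point is not counted
    among its own neighbours. k=22 is the neighbour count given in the text.
    """
    smoothed = []
    for i, point in enumerate(coords):
        neighbours = sorted(
            (math.dist(point, other), ages[j])
            for j, other in enumerate(coords)
            if j != i
        )[:k]
        if not neighbours:              # a lone point keeps its own age
            smoothed.append(ages[i])
            continue
        weights = [1.0 / (dist + eps) for dist, _ in neighbours]
        weighted_sum = sum(w * age for w, (_, age) in zip(weights, neighbours))
        smoothed.append(weighted_sum / sum(weights))
    return smoothed


# Hypothetical 3D map positions and relative CSA ages.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
ages = [0.0, 1.0, 0.0]
print(smooth_ages(coords, ages, k=2))
```

Closer neighbors contribute more to the smoothed value, so local trends in the age map survive while isolated deviations are damped.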


We thank Drs. Jingtong Hou, Gregory Sims, and Se-Ran Jun for our weekly discussions on the subjects of this work as well as other related subjects and Drs. David Eisenberg, Norman Pace, and Yun S. Song, whose expertise helped us to improve our thoughts, for valuable comments. This work has been supported by National Institutes of Health Grant GM62412.


Abbreviations: CSA, common structural ancestor; MRCA, most recent common ancestor.


The authors declare no conflict of interest.


1. Holm L, Sander C. Science. 1996;273:595–603.
2. Hawksworth PM, Kalin-Arroyo MT. Global Biodiversity Assessment. Cambridge, UK: Cambridge Univ Press; 1995.
3. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, et al. Nucleic Acids Res. 2004;32:D138–D141.
4. Dayhoff MO. Fed Proc. 1976;35:2132–2138.
5. Wolf YI, Grishin NV, Koonin EV. J Mol Biol. 2000;299:897–905.
6. Denton M, Marshall C. Nature. 2001;410:417.
7. Coulson AF, Moult J. Proteins. 2002;46:61–71.
8. Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG. Nucleic Acids Res. 2004;32:D226–D229.
9. Levitt M, Chothia C. Nature. 1976;261:552–558.
10. Richardson JS. Nature. 1977;268:495–500.
11. Murzin AG, Brenner SE, Hubbard T, Chothia C. J Mol Biol. 1995;247:536–540.
12. Chothia C, Hubbard T, Brenner S, Barns H, Murzin A. Annu Rev Biophys Biomol Struct. 1997;26:597–627.
13. Michie AD, Orengo CA, Thornton JM. J Mol Biol. 1996;262:168–185.
14. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. Structure (London) 1997;5:1093–108.
15. Hou J, Sims GE, Zhang C, Kim SH. Proc Natl Acad Sci USA. 2003;100:2386–2390.
16. Hou J, Jun SR, Zhang C, Kim SH. Proc Natl Acad Sci USA. 2005;102:3651–3656.
17. Li H, Helling R, Tang C, Wingreen N. Science. 1996;273:666–669.
18. Tiana G, Shakhnovich BE, Dokholyan NV, Shakhnovich EI. Proc Natl Acad Sci USA. 2004;101:2846–2851.
19. Chothia C, Gough J, Vogel C, Teichmann SA. Science. 2003;300:1701–1703.
20. Winstanley HF, Abeln S, Deane CM. Bioinformatics. 2005;21(Suppl 1):i449–i458.
21. Koonin EV, Wolf YI, Karev GP. Nature. 2002;420:218–223.
22. Hobohm U, Sander C. Protein Sci. 1994;3:522–524.
23. Wuyts J, Perriere G, Van de Peer Y. Nucleic Acids Res. 2004;32:D101–D103.
24. Felsenstein J. Cladistics. 1989;5:164–166.
25. Holm L, Park J. Bioinformatics. 2000;16:566–567.
26. Havel TF, Kuntz ID, Crippen GM. J Theor Biol. 1983;104:359–381.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences