NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Alberts B, Johnson A, Lewis J, et al. Molecular Biology of the Cell. 4th edition. New York: Garland Science; 2002.

The Shape and Structure of Proteins

From a chemical point of view, proteins are by far the most structurally complex and functionally sophisticated molecules known. This is perhaps not surprising, once one realizes that the structure and chemistry of each protein has been developed and fine-tuned over billions of years of evolutionary history. We start this chapter by considering how the location of each amino acid in the long string of amino acids that forms a protein determines its three-dimensional shape. We will then use this understanding of protein structure at the atomic level to describe how the precise shape of each protein molecule determines its function in a cell.

The Shape of a Protein Is Specified by Its Amino Acid Sequence

Recall from Chapter 2 that there are 20 types of amino acids in proteins, each with different chemical properties. A protein molecule is made from a long chain of these amino acids, each linked to its neighbor through a covalent peptide bond (Figure 3-1). Proteins are therefore also known as polypeptides. Each type of protein has a unique sequence of amino acids, exactly the same from one molecule to the next. Many thousands of different proteins are known, each with its own particular amino acid sequence.

Figure 3-1. A peptide bond. This covalent bond forms when the carbon atom from the carboxyl group of one amino acid shares electrons with the nitrogen atom (blue) from the amino group of a second amino acid. As indicated, a molecule of water is lost in this condensation ...

The repeating sequence of atoms along the core of the polypeptide chain is referred to as the polypeptide backbone. Attached to this repetitive chain are those portions of the amino acids that are not involved in making a peptide bond and which give each amino acid its unique properties: the 20 different amino acid side chains (Figure 3-2). Some of these side chains are nonpolar and hydrophobic (“water-fearing”), others are negatively or positively charged, some are reactive, and so on. Their atomic structures are presented in Panel 3-1, and a brief list with abbreviations is provided in Figure 3-3.

Figure 3-2. The structural components of a protein. A protein consists of a polypeptide backbone with attached side chains. Each type of protein differs in its sequence and number of amino acids; therefore, it is the sequence of the chemically different side chains ...

Panel 3-1. The 20 Amino Acids Found in Proteins.

Figure 3-3. The 20 amino acids found in proteins. Both three-letter and one-letter abbreviations are listed. As shown, there are equal numbers of polar and nonpolar side chains. For their atomic structures, see Panel 3-1 (pp. 132–133).

As discussed in Chapter 2, atoms behave almost as if they were hard spheres with a definite radius (their van der Waals radius). The requirement that no two atoms overlap greatly limits the possible bond angles in a polypeptide chain (Figure 3-4). This constraint and other steric interactions severely restrict the variety of three-dimensional arrangements of atoms (or conformations) that are possible. Nevertheless, a long flexible chain, such as a protein, can still fold in an enormous number of ways.

Figure 3-4. Steric limitations on the bond angles in a polypeptide chain. (A) Each amino acid contributes three bonds (red) to the backbone of the chain. The peptide bond is planar (gray shading) and does not permit rotation. By contrast, rotation can occur about ...

The folding of a protein chain is, however, further constrained by many different sets of weak noncovalent bonds that form between one part of the chain and another. These involve atoms in the polypeptide backbone, as well as atoms in the amino acid side chains. The weak bonds are of three types: hydrogen bonds, ionic bonds, and van der Waals attractions, as explained in Chapter 2 (see p. 57). Individual noncovalent bonds are 30–300 times weaker than the typical covalent bonds that create biological molecules. But many weak bonds can act in parallel to hold two regions of a polypeptide chain tightly together. The stability of each folded shape is therefore determined by the combined strength of large numbers of such noncovalent bonds (Figure 3-5).

Figure 3-5. Three types of noncovalent bonds that help proteins fold. Although a single one of these bonds is quite weak, many of them often form together to create a strong bonding arrangement, as in the example shown. As in the previous figure, R is used as a general ...

A fourth weak force also has a central role in determining the shape of a protein. As described in Chapter 2, hydrophobic molecules, including the nonpolar side chains of particular amino acids, tend to be forced together in an aqueous environment in order to minimize their disruptive effect on the hydrogen-bonded network of water molecules (see p. 58 and Panel 2-2, pp. 112–113). Therefore, an important factor governing the folding of any protein is the distribution of its polar and nonpolar amino acids. The nonpolar (hydrophobic) side chains in a protein—belonging to such amino acids as phenylalanine, leucine, valine, and tryptophan—tend to cluster in the interior of the molecule (just as hydrophobic oil droplets coalesce in water to form one large droplet). This enables them to avoid contact with the water that surrounds them inside a cell. In contrast, polar side chains—such as those belonging to arginine, glutamine, and histidine—tend to arrange themselves near the outside of the molecule, where they can form hydrogen bonds with water and with other polar molecules (Figure 3-6). When polar amino acids are buried within the protein, they are usually hydrogen-bonded to other polar amino acids or to the polypeptide backbone (Figure 3-7).
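
The distribution of polar and nonpolar residues along a sequence can be summarized with a few lines of code. The sketch below is illustrative only: it assumes the ten-and-ten polar/nonpolar grouping of one-letter codes that Figure 3-3 describes, and the function names are hypothetical.

```python
# Grouping of one-letter codes assumed from Figure 3-3 (ten nonpolar, ten polar).
NONPOLAR = set("AGVLIPFMWC")
POLAR = set("DEKRHNQSTY")


def _validated(sequence: str) -> str:
    """Upper-case the sequence and reject empty input or unknown codes."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("sequence is empty")
    unknown = set(residues) - NONPOLAR - POLAR
    if unknown:
        raise ValueError(f"unrecognized residue codes: {sorted(unknown)}")
    return residues


def nonpolar_fraction(sequence: str) -> float:
    """Return the fraction of residues that have nonpolar side chains."""
    residues = _validated(sequence)
    return sum(r in NONPOLAR for r in residues) / len(residues)


def polarity_string(sequence: str) -> str:
    """Map each residue to 'n' (nonpolar) or 'p' (polar)."""
    return "".join("n" if r in NONPOLAR else "p" for r in _validated(sequence))
```

A pattern string such as the one returned by `polarity_string` shows only where nonpolar residues sit along the chain; it does not by itself predict which residues end up buried in the folded protein.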

Figure 3-6. How a protein folds into a compact conformation. The polar amino acid side chains tend to gather on the outside of the protein, where they can interact with water; the nonpolar amino acid side chains are buried on the inside to form a tightly packed hydrophobic ...

Figure 3-7. Hydrogen bonds in a protein molecule. Large numbers of hydrogen bonds form between adjacent regions of the folded polypeptide chain and help stabilize its three-dimensional shape. The protein depicted is a portion of the enzyme lysozyme, and the hydrogen ...

Proteins Fold into a Conformation of Lowest Energy

As a result of all of these interactions, each type of protein has a particular three-dimensional structure, which is determined by the order of the amino acids in its chain. The final folded structure, or conformation, adopted by any polypeptide chain is generally the one in which the free energy is minimized. Protein folding has been studied in a test tube by using highly purified proteins. A protein can be unfolded, or denatured, by treatment with certain solvents, which disrupt the noncovalent interactions holding the folded chain together. This treatment converts the protein into a flexible polypeptide chain that has lost its natural shape. When the denaturing solvent is removed, the protein often refolds spontaneously, or renatures, into its original conformation (Figure 3-8), indicating that all the information needed for specifying the three-dimensional shape of a protein is contained in its amino acid sequence.

Figure 3-8. The refolding of a denatured protein. (A) This experiment demonstrates that the conformation of a protein is determined solely by its amino acid sequence. (B) The structure of urea. Urea is very soluble in water and unfolds proteins at high concentrations, ...

Each protein normally folds up into a single stable conformation. However, the conformation often changes slightly when the protein interacts with other molecules in the cell. This change in shape is often crucial to the function of the protein, as we see later.

Although a protein chain can fold into its correct conformation without outside help, protein folding in a living cell is often assisted by special proteins called molecular chaperones. These proteins bind to partly folded polypeptide chains and help them progress along the most energetically favorable folding pathway. Chaperones are vital in the crowded conditions of the cytoplasm, since they prevent the temporarily exposed hydrophobic regions in newly synthesized protein chains from associating with each other to form protein aggregates (see p. 357). However, the final three-dimensional shape of the protein is still specified by its amino acid sequence: chaperones simply make the folding process more reliable.

Proteins come in a wide variety of shapes, and they are generally between 50 and 2000 amino acids long. Large proteins generally consist of several distinct protein domains—structural units that fold more or less independently of each other, as we discuss below. The detailed structure of any protein is complicated; for simplicity a protein’s structure can be depicted in several different ways, each emphasizing different features of the protein.

Panel 3-2 (pp. 138–139) presents four different depictions of a protein domain called SH2, which has important functions in eucaryotic cells. Constructed from a string of 100 amino acids, the structure is displayed as (A) a polypeptide backbone model, (B) a ribbon model, (C) a wire model that includes the amino acid side chains, and (D) a space-filling model. Each of the three horizontal rows shows the protein in a different orientation, and the image is colored in a way that allows the polypeptide chain to be followed from its N-terminus (purple) to its C-terminus (red).

Panel 3-2. Four Different Ways of Depicting a Small Protein Domain: the SH2 Domain. (Courtesy of David Lawson.)

Panel 3-2 shows that a protein’s conformation is amazingly complex, even for a structure as small as the SH2 domain. But the description of protein structures can be simplified by the recognition that they are built up from several common structural motifs, as we discuss next.

The α Helix and the β Sheet Are Common Folding Patterns

When the three-dimensional structures of many different protein molecules are compared, it becomes clear that, although the overall conformation of each protein is unique, two regular folding patterns are often found in parts of them. Both patterns were discovered about 50 years ago from studies of hair and silk. The first folding pattern to be discovered, called the α helix, was found in the protein α-keratin, which is abundant in skin and its derivatives—such as hair, nails, and horns. Within a year of the discovery of the α helix, a second folded structure, called a β sheet, was found in the protein fibroin, the major constituent of silk. These two patterns are particularly common because they result from hydrogen-bonding between the N–H and C=O groups in the polypeptide backbone, without involving the side chains of the amino acids. Thus, they can be formed by many different amino acid sequences. In each case, the protein chain adopts a regular, repeating conformation. These two conformations, as well as the abbreviations that are used to denote them in ribbon models of proteins, are shown in Figure 3-9.

Figure 3-9. The regular conformation of the polypeptide backbone observed in the α helix and the β sheet. (A, B, and C) The α helix. The N–H of every peptide bond is hydrogen-bonded to the C=O of a neighboring peptide bond located ...

The core of many proteins contains extensive regions of β sheet. As shown in Figure 3-10, these β sheets can form either from neighboring polypeptide chains that run in the same orientation (parallel chains) or from a polypeptide chain that folds back and forth upon itself, with each section of the chain running in the direction opposite to that of its immediate neighbors (antiparallel chains). Both types of β sheet produce a very rigid structure, held together by hydrogen bonds that connect the peptide bonds in neighboring chains (see Figure 3-9D).

Figure 3-10. Two types of β sheet structures. (A) An antiparallel β sheet (see Figure 3-9D). (B) A parallel β sheet. Both of these structures are common in proteins.

An α helix is generated when a single polypeptide chain twists around on itself to form a rigid cylinder. A hydrogen bond is made between every fourth peptide bond, linking the C=O of one peptide bond to the N–H of another (see Figure 3-9A). This gives rise to a regular helix with a complete turn every 3.6 amino acids. Note that the protein domain illustrated in Panel 3-2 contains two α helices, as well as β sheet structures.
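
The figure of 3.6 amino acids per turn translates directly into simple geometry: each residue is rotated by 360/3.6 = 100 degrees around the helix axis relative to the one before it. The sketch below works this out; the function names are hypothetical, and only the 3.6 value comes from the text.

```python
RESIDUES_PER_TURN = 3.6  # value given in the text
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # about 100 degrees


def wheel_angle(position: int) -> float:
    """Angle in degrees (0 to 360) of the residue at a 0-based position,
    measured around the helix axis relative to the first residue."""
    return (position * DEGREES_PER_RESIDUE) % 360.0


def turns_spanned(n_residues: int) -> float:
    """Number of complete helical turns covered by n_residues."""
    return n_residues / RESIDUES_PER_TURN
```

For example, `wheel_angle(7)` is about 340 degrees, so residues seven positions apart lie on nearly the same face of the helix.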

Short regions of α helix are especially abundant in proteins located in cell membranes, such as transport proteins and receptors. As we discuss in Chapter 10, those portions of a transmembrane protein that cross the lipid bilayer usually cross as an α helix composed largely of amino acids with nonpolar side chains. The polypeptide backbone, which is hydrophilic, is hydrogen-bonded to itself in the α helix and shielded from the hydrophobic lipid environment of the membrane by its protruding nonpolar side chains (see also Figure 3-77).

In other proteins, α helices wrap around each other to form a particularly stable structure, known as a coiled-coil. This structure can form when the two (or in some cases three) α helices have most of their nonpolar (hydrophobic) side chains on one side, so that they can twist around each other with these side chains facing inward (Figure 3-11). Long rodlike coiled-coils provide the structural framework for many elongated proteins. Examples are α-keratin, which forms the intracellular fibers that reinforce the outer layer of the skin and its appendages, and the myosin molecules responsible for muscle contraction.
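
Whether a helix has its nonpolar side chains concentrated on one side can be checked against the sevenfold "abcdefg" labeling used in Figure 3-11, in which positions "a" and "d" face the partner helix. The following is a minimal sketch under stated assumptions: the nonpolar set follows Figure 3-3, the heptad register (which label the first residue carries) is supplied by the caller, and the function names are hypothetical.

```python
NONPOLAR = set("AGVLIPFMWC")  # nonpolar one-letter codes, assumed from Figure 3-3
HEPTAD = "abcdefg"


def heptad_labels(sequence: str, start: str = "a") -> str:
    """Assign a repeating a-g label to each residue, beginning at `start`."""
    offset = HEPTAD.index(start)  # raises ValueError for labels outside a-g
    return "".join(HEPTAD[(offset + i) % 7] for i in range(len(sequence)))


def core_nonpolar_fraction(sequence: str, start: str = "a") -> float:
    """Fraction of residues at 'a' and 'd' positions that are nonpolar."""
    labels = heptad_labels(sequence, start)
    core = [res for res, lab in zip(sequence.upper(), labels) if lab in "ad"]
    if not core:
        raise ValueError("sequence has no 'a' or 'd' positions")
    return sum(res in NONPOLAR for res in core) / len(core)
```

With the toy sequence `"LEELKKK" * 2`, the fraction is 1.0 in register "a" and 0.0 in register "b", which illustrates how strongly the result depends on choosing the right register.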

Figure 3-11. The structure of a coiled-coil. (A) A single α helix, with successive amino acid side chains labeled in a sevenfold sequence, “abcdefg” (from bottom to top). Amino acids “a” and “d” in such a sequence ...

The Protein Domain Is a Fundamental Unit of Organization

Even a small protein molecule is built from thousands of atoms linked together by precisely oriented covalent and noncovalent bonds, and it is extremely difficult to visualize such a complicated structure without a three-dimensional display. For this reason, various graphic and computer-based aids are used. A CD-ROM produced to accompany this book contains computer-generated images of selected proteins, designed to be displayed and rotated on the screen in a variety of formats.

Biologists distinguish four levels of organization in the structure of a protein. The amino acid sequence is known as the primary structure of the protein. Stretches of polypeptide chain that form α helices and β sheets constitute the protein’s secondary structure. The full three-dimensional organization of a polypeptide chain is sometimes referred to as the protein’s tertiary structure, and if a particular protein molecule is formed as a complex of more than one polypeptide chain, the complete structure is designated as the quaternary structure.

Studies of the conformation, function, and evolution of proteins have also revealed the central importance of a unit of organization distinct from the four just described. This is the protein domain, a substructure produced by any part of a polypeptide chain that can fold independently into a compact, stable structure. A domain usually contains between 40 and 350 amino acids, and it is the modular unit from which many larger proteins are constructed. The different domains of a protein are often associated with different functions. Figure 3-12 shows an example—the Src protein kinase, which functions in signaling pathways inside vertebrate cells (Src is pronounced “sarc”). This protein has four domains: the SH2 and SH3 domains have regulatory roles, while the two remaining domains are responsible for the kinase catalytic activity. Later in the chapter, we shall return to this protein, in order to explain how proteins can form molecular switches that transmit information throughout cells.

Figure 3-12. A protein formed from four domains. In the Src protein shown, two of the domains form a protein kinase enzyme, while the SH2 and SH3 domains perform regulatory functions. (A) A ribbon model, with ATP substrate in red. (B) A space-filling model, with ...

The smallest protein molecules contain only a single domain, whereas larger proteins can contain as many as several dozen domains, usually connected to each other by short, relatively unstructured lengths of polypeptide chain. Figure 3-13 presents ribbon models of three differently organized protein domains. As these examples illustrate, the central core of a domain can be constructed from α helices, from β sheets, or from various combinations of these two fundamental folding elements. Each different combination is known as a protein fold. So far, about 1000 different protein folds have been identified among the ten thousand proteins whose detailed conformations are known.

Figure 3-13. Ribbon models of three different protein domains. (A) Cytochrome b562, a single-domain protein involved in electron transport in mitochondria. This protein is composed almost entirely of α helices. (B) The NAD-binding domain of the enzyme lactic ...

Few of the Many Possible Polypeptide Chains Will Be Useful

Since each of the 20 amino acids is chemically distinct and each can, in principle, occur at any position in a protein chain, there are 20 × 20 × 20 × 20 = 160,000 different possible polypeptide chains four amino acids long, or 20^n different possible polypeptide chains n amino acids long. For a typical protein length of about 300 amino acids, more than 10^390 (20^300) different polypeptide chains could theoretically be made. This is such an enormous number that to produce just one molecule of each kind would require many more atoms than exist in the universe.
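
The arithmetic in this paragraph is easy to verify with exact integer arithmetic. The sketch below is illustrative; the function names are hypothetical.

```python
AMINO_ACID_TYPES = 20


def chain_count(length: int) -> int:
    """Number of distinct polypeptide chains of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return AMINO_ACID_TYPES ** length


def order_of_magnitude(length: int) -> int:
    """Largest k such that 10**k <= chain_count(length), found by
    counting the decimal digits of the exact integer."""
    return len(str(chain_count(length))) - 1
```

Here `chain_count(4)` gives 160,000 and `order_of_magnitude(300)` gives 390, in agreement with the figures quoted above.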

Only a very small fraction of this vast set of conceivable polypeptide chains would adopt a single, stable three-dimensional conformation—by some estimates, less than one in a billion. The vast majority of possible protein molecules could adopt many conformations of roughly equal stability, each conformation having different chemical properties. And yet virtually all proteins present in cells adopt unique and stable conformations. How is this possible? The answer lies in natural selection. A protein with an unpredictably variable structure and biochemical activity is unlikely to help the survival of a cell that contains it. Such proteins would therefore have been eliminated by natural selection through the enormously long trial-and-error process that underlies biological evolution.

Because of natural selection, not only is the amino acid sequence of a present-day protein such that a single conformation is extremely stable, but this conformation has its chemical properties finely tuned to enable the protein to perform a particular catalytic or structural function in the cell. Proteins are so precisely built that the change of even a few atoms in one amino acid can sometimes disrupt the structure of the whole molecule so severely that all function is lost.

Proteins Can Be Classified into Many Families

Once a protein had evolved that folded up into a stable conformation with useful properties, its structure could be modified during evolution to enable it to perform new functions. This process has been greatly accelerated by genetic mechanisms that occasionally produce duplicate copies of genes, allowing one gene copy to evolve independently to perform a new function (discussed in Chapter 7). This type of event has occurred quite often in the past; as a result, many present-day proteins can be grouped into protein families, each family member having an amino acid sequence and a three-dimensional conformation that resemble those of the other family members.

Consider, for example, the serine proteases, a large family of protein-cleaving (proteolytic) enzymes that includes the digestive enzymes chymotrypsin, trypsin, and elastase, and several proteases involved in blood clotting. When the protease portions of any two of these enzymes are compared, parts of their amino acid sequences are found to match. The similarity of their three-dimensional conformations is even more striking: most of the detailed twists and turns in their polypeptide chains, which are several hundred amino acids long, are virtually identical (Figure 3-14). The many different serine proteases nevertheless have distinct enzymatic activities, each cleaving different proteins or the peptide bonds between different types of amino acids. Each therefore performs a distinct function in an organism.

Figure 3-14. The conformations of two serine proteases compared. The backbone conformations of elastase and chymotrypsin. Although only those amino acids in the polypeptide chain shaded in green are the same in the two proteins, the two conformations are very similar ...

The story we have told for the serine proteases could be repeated for hundreds of other protein families. In many cases the amino acid sequences have diverged much further than for the serine proteases, so that one cannot be sure of a family relationship between two proteins without determining their three-dimensional structures. The yeast α2 protein and the Drosophila engrailed protein, for example, are both gene regulatory proteins in the homeodomain family. Because they are identical in only 17 of their 60 amino acid residues, their relationship became certain only when their three-dimensional structures were compared (Figure 3-15).
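
The "17 of 60" comparison is a percent-identity calculation over two aligned sequences. A minimal sketch follows, assuming the two sequences have already been aligned to equal length without gaps; the function name is hypothetical, and no real homeodomain sequences are reproduced here.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of aligned positions at which the two sequences carry
    the same residue. Both inputs must have the same, nonzero length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("sequences are empty")
    matches = sum(x == y for x, y in zip(aligned_a.upper(), aligned_b.upper()))
    return 100.0 * matches / len(aligned_a)
```

Seventeen matches over 60 positions corresponds to about 28.3% identity.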

Figure 3-15. A comparison of a class of DNA-binding domains, called homeodomains, in a pair of proteins from two organisms separated by more than a billion years of evolution. (A) A ribbon model of the structure common to both proteins. (B) A trace of the α-carbon ...

The various members of a large protein family often have distinct functions. Some of the amino acid changes that make family members different were no doubt selected in the course of evolution because they resulted in useful changes in biological activity, giving the individual family members the different functional properties they have today. But many other amino acid changes are effectively “neutral,” having neither a beneficial nor a damaging effect on the basic structure and function of the protein. In addition, since mutation is a random process, there must also have been many deleterious changes that altered the three-dimensional structure of these proteins sufficiently to harm them. Such faulty proteins would have been lost whenever the individual organisms making them were at enough of a disadvantage to be eliminated by natural selection.

Protein families are readily recognized when the genome of any organism is sequenced; for example, the determination of the DNA sequence for the entire genome of the nematode Caenorhabditis elegans has revealed that this tiny worm contains more than 18,000 genes. Through sequence comparisons, the products of a large fraction of these genes can be seen to contain domains from one or another protein family; for example, there appear to be 388 genes containing protein kinase domains, 66 genes containing DNA and RNA helicase domains, 43 genes containing SH2 domains, 70 genes containing immunoglobulin domains, and 88 genes containing DNA-binding homeodomains in this genome of 97 million base pairs (Figure 3-16).

Figure 3-16. Percentage of total genes containing one or more copies of the indicated protein domain, as derived from complete genome sequences. Note that one of the three domains selected, the immunoglobulin domain, has been a relatively late addition, and its relative ...

Proteins Can Adopt a Limited Number of Different Protein Folds

It is astounding to consider the rapidity of the increase in our knowledge about cells. In 1950, we did not know the order of the amino acids in a single protein, and many even doubted that the amino acids in proteins are arranged in an exact sequence. In 1960, the first three-dimensional structure of a protein was determined by x-ray crystallography. Now that we have access to hundreds of thousands of protein sequences from sequencing the genes that encode them, what technical developments can we look forward to next?

It is no longer a big step to progress from a gene sequence to the production of large amounts of the pure protein encoded by that gene. Thanks to DNA cloning and genetic engineering techniques (discussed in Chapter 8), this step is often routine. But there is still nothing routine about determining the complete three-dimensional structure of a protein. The standard technique based on x-ray diffraction requires that the protein be subjected to conditions that cause the molecules to aggregate into a large, perfectly ordered crystalline array—that is, a protein crystal. Each protein behaves quite differently in this respect, and protein crystals can be generated only through exhaustive trial-and-error methods that often take many years to succeed—if they succeed at all.

Membrane proteins and large protein complexes with many moving parts have generally been the most difficult to crystallize, which is why only a few such protein structures are displayed in this book. Increasingly, therefore, large proteins have been analyzed through determination of the structures of their individual domains: either by crystallizing isolated domains and then bombarding the crystals with x-rays, or by studying the conformations of isolated domains in concentrated aqueous solutions with powerful nuclear magnetic resonance (NMR) techniques (discussed in Chapter 8). From a combination of x-ray and NMR studies, we now know the three-dimensional shapes, or conformations, of thousands of different proteins.

By carefully comparing the conformations of known proteins, structural biologists (that is, experts on the structure of biological molecules) have concluded that there are a limited number of ways in which protein domains fold up—maybe as few as 2000. As we saw, the structures for about 1000 of these protein folds have thus far been determined; we may, therefore, already know half of the total number of possible structures for a protein domain. A complete catalog of all of the protein folds that exist in living organisms would therefore seem to be within our reach.

Sequence Homology Searches Can Identify Close Relatives

The present database of known protein sequences contains more than 500,000 entries, and it is growing very rapidly as more and more genomes are sequenced—revealing huge numbers of new genes that encode proteins. Powerful computer search programs are available that allow one to compare each newly discovered protein with this entire database, looking for possible relatives. Homologous proteins are defined as those whose genes have evolved from a common ancestral gene, and these are identified by the discovery of statistically significant similarities in amino acid sequences.

With such a large number of proteins in the database, the search programs find many nonsignificant matches, resulting in a background noise level that makes it very difficult to pick out all but the closest relatives. Generally speaking, a 30% identity in the sequence of two proteins is needed to be certain that a match has been found. However, many short signature sequences (“fingerprints”) indicative of particular protein functions are known, and these are widely used to find more distant homologies (Figure 3-17).
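
Searching a database for a short signature sequence can be sketched as a pattern match over each entry. The example below uses Python's standard `re` module; the pattern and the toy sequences are invented for illustration and are not real SH2 signatures, and the function name is hypothetical.

```python
import re


def find_signature(pattern: str, sequences: dict[str, str]) -> dict[str, list[int]]:
    """Return the 0-based start positions of non-overlapping matches of
    `pattern` for every named sequence that contains at least one match."""
    compiled = re.compile(pattern)
    hits = {}
    for name, sequence in sequences.items():
        starts = [match.start() for match in compiled.finditer(sequence)]
        if starts:
            hits[name] = starts
    return hits


# Hypothetical signature: W, then F or Y, then any two residues, then K or R.
TOY_PATTERN = "W[FY]..[KR]"
TOY_DATABASE = {
    "toy1": "MAWFGGKLL",
    "toy2": "MAAAAGG",
    "toy3": "WYAARWFCCK",
}
```

Calling `find_signature(TOY_PATTERN, TOY_DATABASE)` reports one hit in `toy1`, two in `toy3`, and none in `toy2`. Real signature searches usually tolerate mismatches and score each hit statistically, which this exact-match sketch does not attempt.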

Figure 3-17. The use of short signature sequences to find homologous protein domains. The two short sequences of 15 and 9 amino acids shown (green) can be used to search large databases for a protein domain that is found in many proteins, the SH2 domain. Here, the ...

These protein comparisons are important because related structures often imply related functions. Many years of experimentation can be saved by discovering that a new protein has an amino acid sequence homology with a protein of known function. Such sequence homologies, for example, first indicated that certain genes that cause mammalian cells to become cancerous are protein kinases. In the same way, many of the proteins that control pattern formation during the embryonic development of the fruit fly Drosophila were quickly recognized to be gene regulatory proteins.

Computational Methods Allow Amino Acid Sequences to Be Threaded into Known Protein Folds

We know that there are an enormous number of ways to make proteins with the same three-dimensional structure, and that—over evolutionary time—random mutations can cause amino acid sequences to change without a major change in the conformation of a protein. For this reason, one current goal of structural biologists is to determine all the different protein folds that proteins have in nature, and to devise computer-based methods to test the amino acid sequence of a domain to identify which one of these previously determined conformations the domain is likely to adopt.

A computational technique called threading can be used to fit an amino acid sequence to a particular protein fold. For each possible fold known, the computer searches for the best fit of the particular amino acid sequence to that structure. Are the hydrophobic residues on the inside? Are the sequences with a strong propensity to form an α helix in an α helix? And so on. The best fit gets a numerical score reflecting the estimated stability of the structure.
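
The idea of scoring a sequence against each known fold can be illustrated with a deliberately simplified toy model. The sketch below asks only the first question posed above (are the hydrophobic residues on the inside?). It assumes each fold is described by a string marking positions as buried ("b") or exposed ("e"), that sequence and fold have the same length, and that the nonpolar set follows Figure 3-3. All names and fold descriptions are hypothetical, and real threading programs use far richer scoring and allow insertions and deletions.

```python
NONPOLAR = set("AGVLIPFMWC")  # nonpolar one-letter codes, assumed from Figure 3-3


def thread_score(sequence: str, burial: str) -> int:
    """Add 1 for each nonpolar residue at a buried position or polar residue
    at an exposed position; subtract 1 for each mismatch."""
    if len(sequence) != len(burial):
        raise ValueError("sequence and burial pattern must have equal length")
    if set(burial) - {"b", "e"}:
        raise ValueError("burial pattern may contain only 'b' and 'e'")
    score = 0
    for residue, environment in zip(sequence.upper(), burial):
        fits = (environment == "b") == (residue in NONPOLAR)
        score += 1 if fits else -1
    return score


def best_fold(sequence: str, folds: dict[str, str]) -> tuple[str, int]:
    """Return the name and score of the best-scoring fold of matching length."""
    scores = {
        name: thread_score(sequence, burial)
        for name, burial in folds.items()
        if len(burial) == len(sequence)
    }
    if not scores:
        raise ValueError("no fold matches the sequence length")
    winner = max(scores, key=scores.get)
    return winner, scores[winner]
```

For the toy sequence `"LKVKLKK"` and the folds `{"fold_A": "bebebee", "fold_B": "eeebbbe"}`, `fold_A` scores 7 and `fold_B` scores -1, so `fold_A` is reported as the best fit.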

In many cases, one particular three-dimensional structure will stand out as a good fit for the amino acid sequence, suggesting an approximate conformation for the protein domain. In other cases, none of the known folds will seem possible. By applying x-ray and NMR studies to the latter class of proteins, structural biologists hope to be able to expand the number of known folds rapidly, aiming for a database that contains the complete library of protein folds that exist in nature. With such a library, plus expected improvements in the computational methods used for threading, it may eventually become possible to obtain an approximate three-dimensional structure for a protein as soon as its amino acid sequence is known.

Some Protein Domains, Called Modules, Form Parts of Many Different Proteins

As previously stated, most proteins are composed of a series of protein domains, in which different regions of the polypeptide chain have folded independently to form compact structures. Such multidomain proteins are believed to have originated when the DNA sequences that encode each domain accidentally became joined, creating a new gene. Novel binding surfaces have often been created at the juxtaposition of domains, and many of the functional sites where proteins bind to small molecules are found to be located there (for an example see Figure 3-12). Many large proteins show clear signs of having evolved by the joining of preexisting domains in new combinations, an evolutionary process called domain shuffling (Figure 3-18).

Figure 3-18. Domain shuffling. An extensive shuffling of blocks of protein sequence (protein domains) has occurred during protein evolution. Those portions of a protein denoted by the same shape and color in this diagram are evolutionarily related. Serine proteases ...

A subset of protein domains have been especially mobile during evolution; these so-called protein modules are generally somewhat smaller (40–200 amino acids) than an average domain, and they seem to have particularly versatile structures. The structure of one such module, the SH2 domain, was illustrated in Panel 3-2 (pp. 138–139). The structures of some additional protein modules are illustrated in Figure 3-19.

Figure 3-19. The three-dimensional structures of some protein modules. In these ribbon diagrams, β-sheet strands are shown as arrows, and the N- and C-termini are indicated by red spheres. (Adapted from M. Baron, D.G. Norman, and I.D. Campbell, Trends Biochem. ...

Each of the modules shown has a stable core structure formed from strands of β sheet, from which less-ordered loops of polypeptide chain protrude (green). The loops are ideally situated to form binding sites for other molecules, as most flagrantly demonstrated for the immunoglobulin fold, which forms the basis for antibody molecules (see Figure 3-42). The evolutionary success of such β-sheet-based modules is likely to have been due to their providing a convenient framework for the generation of new binding sites for ligands through small changes to these protruding loops.

A second feature of protein modules that explains their utility is the ease with which they can be integrated into other proteins. Five of the six modules illustrated in Figure 3-19 have their N- and C-terminal ends at opposite poles of the module. This “in-line” arrangement means that when the DNA encoding such a module undergoes tandem duplication, which is not unusual in the evolution of genomes (discussed in Chapter 7), the duplicated modules can be readily linked in series to form extended structures—either with themselves or with other in-line modules (Figure 3-20). Stiff extended structures composed of a series of modules are especially common in extracellular matrix molecules and in the extracellular portions of cell-surface receptor proteins. Other modules, including the SH2 domain and the kringle module illustrated in Figure 3-19, are of a “plug-in” type. After genomic rearrangements, such modules are usually accommodated as an insertion into a loop region of a second protein.

Figure 3-20. An extended structure formed from a series of in-line protein modules. Four fibronectin type 3 modules (see Figure 3-19) from the extracellular matrix molecule fibronectin are illustrated in (A) ribbon and (B) space-filling models. (Adapted from D.J. ...

The Human Genome Encodes a Complex Set of Proteins, Revealing Much That Remains Unknown

The result of sequencing the human genome has been surprising, because it reveals that our chromosomes contain only 30,000 to 35,000 genes. With regard to gene number, we would appear to be no more than 1.4-fold more complex than the tiny mustard weed, Arabidopsis, and less than 2-fold more complex than a nematode worm. The genome sequences also reveal that vertebrates have inherited nearly all of their protein domains from invertebrates—with only 7 percent of identified human domains being vertebrate-specific.

Each of our proteins is on average more complicated, however. A process of domain shuffling during vertebrate evolution has given rise to many novel combinations of protein domains, with the result that there are nearly twice as many combinations of domains found in human proteins as in a worm or a fly. Thus, for example, the trypsinlike serine protease domain is linked to at least 18 other types of protein domains in human proteins, whereas it is found covalently joined to only 5 different domains in the worm. This extra variety in our proteins greatly increases the range of protein–protein interactions possible (see Figure 3-78), but how it contributes to making us human is not known.
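
The kind of count quoted above (how many other domain types a given domain is joined to) can be sketched as follows. The protein names and domain architectures in `toy_proteome` are hypothetical placeholders, not data from any genome; only the counting logic is the point.

```python
from collections import defaultdict


def partner_domains(proteins):
    """Map each domain type to the set of other domain types it shares a protein with."""
    partners = defaultdict(set)
    for domains in proteins.values():
        kinds = set(domains)
        for domain in kinds:
            partners[domain] |= kinds - {domain}
    return partners


# Hypothetical architectures: each protein is a list of its domains in order.
toy_proteome = {
    "protein_1": ["serine_protease", "kringle"],
    "protein_2": ["EGF", "EGF", "serine_protease"],
    "protein_3": ["kringle", "fibronectin_1", "serine_protease"],
    "protein_4": ["SH2", "kinase"],
}

partners = partner_domains(toy_proteome)
print(sorted(partners["serine_protease"]))
```

Running the same function on the architectures of two organisms and comparing the sizes of the resulting sets would give figures of the kind cited for the serine protease domain.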

The complexity of living organisms is staggering, and it is quite sobering to note that we currently lack even the tiniest hint of what the function might be for more than 10,000 of the proteins that have thus far been identified in the human genome. There are certainly enormous challenges ahead for the next generation of cell biologists, with no shortage of fascinating mysteries to solve.

Larger Protein Molecules Often Contain More Than One Polypeptide Chain

The same weak noncovalent bonds that enable a protein chain to fold into a specific conformation also allow proteins to bind to each other to produce larger structures in the cell. Any region of a protein’s surface that can interact with another molecule through sets of noncovalent bonds is called a binding site. A protein can contain binding sites for a variety of molecules, both large and small. If a binding site recognizes the surface of a second protein, the tight binding of two folded polypeptide chains at this site creates a larger protein molecule with a precisely defined geometry. Each polypeptide chain in such a protein is called a protein subunit.

In the simplest case, two identical folded polypeptide chains bind to each other in a “head-to-head” arrangement, forming a symmetric complex of two protein subunits (a dimer) held together by interactions between two identical binding sites. The Cro repressor protein—a gene regulatory protein that binds to DNA to turn genes off in a bacterial cell—provides an example (Figure 3-21). Many other types of symmetric protein complexes, formed from multiple copies of a single polypeptide chain, are commonly found in cells. The enzyme neuraminidase, for example, consists of four identical protein subunits, each bound to the next in a “head-to-tail” arrangement that forms a closed ring (Figure 3-22).

Figure 3-21. Two identical protein subunits binding together to form a symmetric protein dimer. The Cro repressor protein from bacteriophage lambda binds to DNA to turn off viral genes. Its two identical subunits bind head-to-head, held together by a combination of ...

Figure 3-22. A protein molecule containing multiple copies of a single protein subunit. The enzyme neuraminidase exists as a ring of four identical polypeptide chains. The small diagram shows how the repeated use of the same binding interaction forms the structure. ...

Many of the proteins in cells contain two or more types of polypeptide chains. Hemoglobin, the protein that carries oxygen in red blood cells, is a particularly well-studied example (Figure 3-23). It contains two identical α-globin subunits and two identical β-globin subunits, symmetrically arranged. Such multisubunit proteins are very common in cells, and they can be very large. Figure 3-24 provides a sampling of proteins whose exact structures are known, allowing the sizes and shapes of a few larger proteins to be compared with the relatively small proteins that we have thus far presented as models.

Figure 3-23. A protein formed as a symmetric assembly of two different subunits. Hemoglobin is an abundant protein in red blood cells that contains two copies of α globin and two copies of β globin. Each of these four polypeptide chains contains a ...

Figure 3-24. A collection of protein molecules, shown at the same scale. For comparison, a DNA molecule bound to a protein is also illustrated. These space-filling models represent a range of sizes and shapes. Hemoglobin, catalase, porin, alcohol dehydrogenase, and ...

Some Proteins Form Long Helical Filaments

Some protein molecules can assemble to form filaments that may span the entire length of a cell. Most simply, a long chain of identical protein molecules can be constructed if each molecule has a binding site complementary to another region of the surface of the same molecule (Figure 3-25). An actin filament, for example, is a long helical structure produced from many molecules of the protein actin (Figure 3-26). Actin is very abundant in eucaryotic cells, where it constitutes one of the major filament systems of the cytoskeleton (discussed in Chapter 16).

Figure 3-25. Protein assemblies. (A) A protein with just one binding site can form a dimer with another identical protein. (B) Identical proteins with two different binding sites often form a long helical filament. (C) If the two binding sites are disposed appropriately ...

Figure 3-26. Actin filaments. (A) Transmission electron micrographs of negatively stained actin filaments. (B) The helical arrangement of actin molecules in an actin filament. (A, courtesy of Roger Craig.)

Why is a helix such a common structure in biology? As we have seen, biological structures are often formed by linking subunits that are very similar to each other—such as amino acids or protein molecules—into long, repetitive chains. If all the subunits are identical, the neighboring subunits in the chain can often fit together in only one way, adjusting their relative positions to minimize the free energy of the contact between them. As a result, each subunit is positioned in exactly the same way in relation to the next, so that subunit 3 fits onto subunit 2 in the same way that subunit 2 fits onto subunit 1, and so on. Because it is very rare for subunits to join up in a straight line, this arrangement generally results in a helix—a regular structure that resembles a spiral staircase, as illustrated in Figure 3-27. Depending on the twist of the staircase, a helix is said to be either right-handed or left-handed (Figure 3-27E). Handedness is not affected by turning the helix upside down, but it is reversed if the helix is reflected in the mirror.
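
The geometric argument can be checked with a short Python sketch. It places each subunit by applying one and the same rotation-plus-translation to the previous one, then tests handedness with the signed volume spanned by three successive steps. The turn angle, rise, and radius are arbitrary illustrative values, and the function names are assumptions of this sketch.

```python
import math


def next_subunit(point, turn_deg, rise):
    """Apply the single repeated relationship: rotate about the z axis, then shift along it."""
    x, y, z = point
    a = math.radians(turn_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z + rise)


def build_helix(n_subunits, turn_deg, rise, start=(1.0, 0.0, 0.0)):
    """Position every subunit relative to the one before it in exactly the same way."""
    points = [start]
    for _ in range(n_subunits - 1):
        points.append(next_subunit(points[-1], turn_deg, rise))
    return points


def handedness(points):
    """Classify a helix from its first four subunits (needs at least four points)."""
    steps = [tuple(b[i] - a[i] for i in range(3))
             for a, b in zip(points, points[1:4])]
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = steps
    # Signed volume (determinant) of three successive step vectors.
    volume = (ax * (by * cz - bz * cy)
              - ay * (bx * cz - bz * cx)
              + az * (bx * cy - by * cx))
    return "right" if volume > 0 else "left"


helix = build_helix(8, turn_deg=100.0, rise=1.5)
upside_down = [(x, -y, -z) for x, y, z in helix]   # rotate 180 degrees about the x axis
mirrored = [(x, y, -z) for x, y, z in helix]       # reflect in the xy plane

print(handedness(helix), handedness(upside_down), handedness(mirrored))
```

Turning the helix upside down is a rotation, which leaves the signed volume unchanged, whereas the mirror reflection flips its sign. Setting `rise` to zero with a turn angle that divides 360 degrees evenly would instead close the chain into a flat ring.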

Figure 3-27. Some properties of a helix. (A–D) A helix forms when a series of subunits bind to each other in a regular way. At the bottom, the interaction between two subunits is shown; behind them are the helices that result. These helices have two (A), three ...

Helices occur commonly in biological structures, whether the subunits are small molecules linked together by covalent bonds (for example, the amino acids in an α helix) or large protein molecules that are linked by noncovalent forces (for example, the actin molecules in actin filaments). This is not surprising. A helix is an unexceptional structure, and it is generated simply by placing many similar subunits next to each other, each in the same strictly repeated relationship to the one before.

A Protein Molecule Can Have an Elongated, Fibrous Shape

Most of the proteins we have discussed so far are globular proteins, in which the polypeptide chain folds up into a compact shape like a ball with an irregular surface. Enzymes tend to be globular proteins: even though many are large and complicated, with multiple subunits, most have an overall rounded shape (see Figure 3-24). In contrast, other proteins have roles in the cell requiring each individual protein molecule to span a large distance. These proteins generally have a relatively simple, elongated three-dimensional structure and are commonly referred to as fibrous proteins.

One large family of intracellular fibrous proteins consists of α-keratin, introduced earlier, and its relatives. Keratin filaments are extremely stable and are the main component in long-lived structures such as hair, horn, and nails. An α-keratin molecule is a dimer of two identical subunits, with the long α helices of each subunit forming a coiled-coil (see Figure 3-11). The coiled-coil regions are capped at each end by globular domains containing binding sites. This enables this class of protein to assemble into ropelike intermediate filaments—an important component of the cytoskeleton that creates the cell’s internal structural scaffold (see Figure 16-16).

Fibrous proteins are especially abundant outside the cell, where they are a main component of the gel-like extracellular matrix that helps to bind collections of cells together to form tissues. Extracellular matrix proteins are secreted by the cells into their surroundings, where they often assemble into sheets or long fibrils. Collagen is the most abundant of these proteins in animal tissues. A collagen molecule consists of three long polypeptide chains, each containing the nonpolar amino acid glycine at every third position. This regular structure allows the chains to wind around one another to generate a long regular triple helix (Figure 3-28A). Many collagen molecules then bind to one another side-by-side and end-to-end to create long overlapping arrays—thereby generating the extremely tough collagen fibrils that give connective tissues their tensile strength, as described in Chapter 19.
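
The glycine-at-every-third-position rule is easy to test on a sequence written in one-letter amino acid code. The sketch below is a minimal check, and the example sequences are invented for illustration rather than taken from a real collagen.

```python
def glycine_every_third(sequence):
    """Return the frame (0, 1, or 2) in which every third residue is glycine, else None."""
    for frame in range(3):
        every_third = sequence[frame::3]
        if every_third and all(residue == "G" for residue in every_third):
            return frame
    return None


# Invented example sequences (one-letter code; G = glycine).
print(glycine_every_third("GPPGPAGPKGEP"))   # collagen-like repeat starting with G
print(glycine_every_third("PGAPGKPGE"))      # same repeat, shifted by one residue
print(glycine_every_third("MKTAYIAKQR"))     # no regular glycine repeat
```

The function only reports whether the repeat is present; it says nothing about whether a chain would actually form a triple helix.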

Figure 3-28. Collagen and elastin. (A) Collagen is a triple helix formed by three extended protein chains that wrap around one another (bottom). Many rodlike collagen molecules are cross-linked together in the extracellular space to form unextendable collagen fibrils ...

In complete contrast to collagen is another protein in the extracellular matrix, elastin. Elastin molecules are formed from relatively loose and unstructured polypeptide chains that are covalently cross-linked into a rubberlike elastic meshwork: unlike most proteins, they do not have a uniquely defined stable structure, but can be reversibly pulled from one conformation to another, as illustrated in Figure 3-28B. The resulting elastic fibers enable skin and other tissues, such as arteries and lungs, to stretch and recoil without tearing.

Extracellular Proteins Are Often Stabilized by Covalent Cross-Linkages

Many protein molecules are either attached to the outside of a cell’s plasma membrane or secreted as part of the extracellular matrix. All such proteins are directly exposed to extracellular conditions. To help maintain their structures, the polypeptide chains in such proteins are often stabilized by covalent cross-linkages. These linkages can either tie two amino acids in the same protein together, or connect different polypeptide chains in a multisubunit protein. The most common cross-linkages in proteins are covalent sulfur–sulfur bonds. These disulfide bonds (also called S–S bonds) form as proteins are being prepared for export from cells. As described in Chapter 12, their formation is catalyzed in the endoplasmic reticulum by an enzyme that links together two pairs of –SH groups of cysteine side chains that are adjacent in the folded protein (Figure 3-29). Disulfide bonds do not change the conformation of a protein but instead act as atomic staples to reinforce its most favored conformation. For example, lysozyme—an enzyme in tears that dissolves bacterial cell walls—retains its antibacterial activity for a long time because it is stabilized by such cross-linkages.

Figure 3-29. Disulfide bonds. This diagram illustrates how covalent disulfide bonds form between adjacent cysteine side chains. As indicated, these cross-linkages can join either two parts of the same polypeptide chain or two different polypeptide chains. Since the ...

Disulfide bonds generally fail to form in the cell cytosol, where a high concentration of reducing agents converts S–S bonds back to cysteine –SH groups. Apparently, proteins do not require this type of reinforcement in the relatively mild environment inside the cell.

Protein Molecules Often Serve as Subunits for the Assembly of Large Structures

The same principles that enable a protein molecule to associate with itself to form rings or filaments operate to generate much larger structures in the cell—supramolecular structures such as enzyme complexes, ribosomes, protein filaments, viruses, and membranes. These large objects are not made as single, giant, covalently linked molecules. Instead they are formed by the noncovalent assembly of many separately manufactured molecules, which serve as the subunits of the final structure.

The use of smaller subunits to build larger structures has several advantages:


1. A large structure built from one or a few repeating smaller subunits requires only a small amount of genetic information.

2. Both assembly and disassembly can be readily controlled, reversible processes, since the subunits associate through multiple bonds of relatively low energy.

3. Errors in the synthesis of the structure can be more easily avoided, since correction mechanisms can operate during the course of assembly to exclude malformed subunits.

Some protein subunits assemble into flat sheets in which the subunits are arranged in hexagonal patterns. Specialized membrane proteins are sometimes arranged this way in lipid bilayers. With a slight change in the geometry of the individual subunits, a hexagonal sheet can be converted into a tube (Figure 3-30) or, with more changes, into a hollow sphere. Protein tubes and spheres that bind specific RNA and DNA molecules form the coats of viruses.

Figure 3-30. An example of single protein subunit assembly requiring multiple protein–protein contacts. Hexagonally packed globular protein subunits can form either a flat sheet or a tube.

The formation of closed structures, such as rings, tubes, or spheres, provides additional stability because it increases the number of bonds between the protein subunits. Moreover, because such a structure is created by mutually dependent, cooperative interactions between subunits, it can be driven to assemble or disassemble by a relatively small change that affects each subunit individually. These principles are dramatically illustrated in the protein coat or capsid of many simple viruses, which takes the form of a hollow sphere (Figure 3-31). Capsids are often made of hundreds of identical protein subunits that enclose and protect the viral nucleic acid (Figure 3-32). The protein in such a capsid must have a particularly adaptable structure: it must not only make several different kinds of contacts to create the sphere, it must also change this arrangement to let the nucleic acid out to initiate viral replication once the virus has entered a cell.
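
The claim that closing a structure increases the number of bonds between subunits can be made concrete by counting contacts. The sketch below is a deliberate simplification: it treats each contact between neighboring subunits as one bond and uses a square lattice rather than the hexagonal packing described in the text.

```python
def bonds_in_chain(n):
    """Open chain of n subunits: one bond between each pair of neighbors."""
    return max(n - 1, 0)


def bonds_in_ring(n):
    """Closed ring: the two free ends of the chain are joined by one extra bond."""
    return n if n >= 3 else bonds_in_chain(n)


def bonds_in_sheet(rows, cols):
    """Flat square-lattice sheet: bonds along each row plus bonds along each column."""
    return rows * (cols - 1) + cols * (rows - 1)


def bonds_in_tube(rows, cols):
    """The same sheet rolled up so each row closes on itself (assumes cols >= 3)."""
    return rows * cols + cols * (rows - 1)


print(bonds_in_chain(4), bonds_in_ring(4))          # four subunits, open versus closed
print(bonds_in_sheet(4, 6), bonds_in_tube(4, 6))    # 24 subunits, flat versus rolled
```

In the ring every subunit has two neighbors, whereas the open chain leaves two end subunits with only one; rolling the sheet into a tube removes two of its free edges in the same way.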

Figure 3-31. The capsids of some viruses, all shown at the same scale. (A) Tomato bushy stunt virus; (B) poliovirus; (C) simian virus 40 (SV40); (D) satellite tobacco necrosis virus. The structures of all of these capsids have been determined by x-ray crystallography ...

Figure 3-32. The structure of a spherical virus. In many viruses, identical protein subunits pack together to create a spherical shell (a capsid) that encloses the viral genome, composed of either RNA or DNA (see also Figure 3-31). For geometric reasons, no more than ...

Many Structures in Cells Are Capable of Self-Assembly

The information for forming many of the complex assemblies of macromolecules in cells must be contained in the subunits themselves, because purified subunits can spontaneously assemble into the final structure under the appropriate conditions. The first large macromolecular aggregate shown to be capable of self-assembly from its component parts was tobacco mosaic virus (TMV). This virus is a long rod in which a cylinder of protein is arranged around a helical RNA core (Figure 3-33). If the dissociated RNA and protein subunits are mixed together in solution, they recombine to form fully active viral particles. The assembly process is unexpectedly complex and includes the formation of double rings of protein, which serve as intermediates that add to the growing viral coat.

Figure 3-33. The structure of tobacco mosaic virus (TMV). (A) An electron micrograph of the viral particle, which consists of a single long RNA molecule enclosed in a cylindrical protein coat composed of identical protein subunits. (B) A model showing part of the ...

Another complex macromolecular aggregate that can reassemble from its component parts is the bacterial ribosome. This structure is composed of about 55 different protein molecules and 3 different rRNA molecules. If the individual components are incubated under appropriate conditions in a test tube, they spontaneously re-form the original structure. Most importantly, such reconstituted ribosomes are able to perform protein synthesis. As might be expected, the reassembly of ribosomes follows a specific pathway: after certain proteins have bound to the RNA, this complex is then recognized by other proteins, and so on, until the structure is complete.
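
An ordered assembly pathway of this kind behaves like a dependency graph: a component can bind only once the components it recognizes are already in place. The sketch below models that logic. The component names and the `requirements` table are hypothetical placeholders and do not describe the actual binding order of any ribosomal protein.

```python
def assembly_order(requirements):
    """Return one order in which components can bind.

    `requirements` maps each component to the set of components that must
    already be bound before it can join. Raises ValueError if the remaining
    components can never bind.
    """
    bound = []
    remaining = dict(requirements)
    while remaining:
        in_place = set(bound)
        # Components whose prerequisites are all present can bind in this round.
        ready = sorted(c for c, needs in remaining.items() if needs <= in_place)
        if not ready:
            raise ValueError("cannot bind: " + ", ".join(sorted(remaining)))
        for component in ready:
            bound.append(component)
            del remaining[component]
    return bound


# Hypothetical pathway: two proteins bind the RNA directly, a third recognizes
# the complex they form, and a fourth recognizes the third.
requirements = {
    "rRNA": set(),
    "protein_A": {"rRNA"},
    "protein_B": {"rRNA"},
    "protein_C": {"protein_A", "protein_B"},
    "protein_D": {"protein_C"},
}

order = assembly_order(requirements)
print(order)
```

The sketch captures only the ordering constraint; it ignores binding strengths, conformational changes, and the incubation conditions that real reconstitution requires.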

It is still not clear how some of the more elaborate self-assembly processes are regulated. Many structures in the cell, for example, seem to have a precisely defined length that is many times greater than that of their component macromolecules. How such length determination is achieved is in many cases a mystery. Three possible mechanisms are illustrated in Figure 3-34. In the simplest case, a long core protein or other macromolecule provides a scaffold that determines the extent of the final assembly. This is the mechanism that determines the length of the TMV particle, where the RNA chain provides the core. Similarly, a core protein is thought to determine the length of the thin filaments in muscle, as well as the length of the long tails of some bacterial viruses (Figure 3-35).

Figure 3-34. Three mechanisms of length determination for large protein assemblies. (A) Coassembly along an elongated core protein or other macromolecule that acts as a measuring device. (B) Termination of assembly because of strain that accumulates in the polymeric ...

Figure 3-35. An electron micrograph of bacteriophage lambda. The tip of the virus tail attaches to a specific protein on the surface of a bacterial cell, after which the tightly packaged DNA in the head is injected through the tail into the cell. The tail has a precise ...

The Formation of Complex Biological Structures Is Often Aided by Assembly Factors

Not all cellular structures held together by noncovalent bonds are capable of self-assembly. A mitochondrion, a cilium, or a myofibril of a muscle cell, for example, cannot form spontaneously from a solution of its component macromolecules. In these cases, part of the assembly information is provided by special enzymes and other cellular proteins that perform the function of templates, guiding construction but taking no part in the final assembled structure.

Even relatively simple structures may lack some of the ingredients necessary for their own assembly. In the formation of certain bacterial viruses, for example, the head, which is composed of many copies of a single protein subunit, is assembled on a temporary scaffold composed of a second protein. Because the second protein is absent from the final viral particle, the head structure cannot spontaneously reassemble once it has been taken apart. Other examples are known in which proteolytic cleavage is an essential and irreversible step in the normal assembly process. This is even the case for some small protein assemblies, including the structural protein collagen and the hormone insulin (Figure 3-36). From these relatively simple examples, it seems very likely that the assembly of a structure as complex as a mitochondrion or a cilium will involve temporal and spatial ordering imparted by numerous other cell components.

Figure 3-36. Proteolytic cleavage in insulin assembly. The polypeptide hormone insulin cannot spontaneously re-form efficiently if its disulfide bonds are disrupted. It is synthesized as a larger protein (proinsulin) that is cleaved by a proteolytic enzyme after the ...


Summary

The three-dimensional conformation of a protein molecule is determined by its amino acid sequence. The folded structure is stabilized by noncovalent interactions between different parts of the polypeptide chain. The amino acids with hydrophobic side chains tend to cluster in the interior of the molecule, and local hydrogen-bond interactions between neighboring peptide bonds give rise to α helices and β sheets.

Globular regions, known as domains, are the modular units from which many proteins are constructed; such domains generally contain 40–350 amino acids. Small proteins typically consist of only a single domain, while large proteins are formed from several domains linked together by short lengths of polypeptide chain. As proteins have evolved, domains have been modified and combined with other domains to construct new proteins. Domains that participate in the formation of large numbers of proteins are known as protein modules. Thus far, about 1000 different ways of folding up a domain have been observed among the more than 10,000 known protein structures.

Proteins are brought together into larger structures by the same noncovalent forces that determine protein folding. Proteins with binding sites for their own surface can assemble into dimers, closed rings, spherical shells, or helical polymers. Although mixtures of proteins and nucleic acids can assemble spontaneously into complex structures in a test tube, many biological assembly processes involve irreversible steps. Consequently, not all structures in the cell are capable of spontaneous reassembly after they have been dissociated into their component parts.


Copyright © 2002, Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter; Copyright © 1983, 1989, 1994, Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson.
Bookshelf ID: NBK26830

