National Research Council (US) Committee on Research Opportunities in Biology. Opportunities in Biology. Washington (DC): National Academies Press (US); 1989.

3 Molecular Structure and Function

Biological Macromolecules are Machines

All biological functions depend on events that occur at the molecular level. These events are directed, modulated, or detected by complex biological machines, which are themselves large molecules or clusters of molecules. Included are proteins, nucleic acids, carbohydrates, lipids, and complexes of them. Many areas of biological science focus on the signals detected by these machines or the output from these machines. The field of structural biology is concerned with the properties and behavior of the machines themselves. The ultimate goals of this field are to be able to predict the structure, function, and behavior of the machines from their chemical formulas, through the use of basic principles of chemistry and physics and knowledge derived from studies of other machines. Although we are still a long way from these goals, enormous progress has been made during the past two decades. Because of recent advances, primarily in recombinant DNA technology, computer science, and biological instrumentation, we should begin to realize the goals of structural biology during the next two decades.

Much of biological research still begins as descriptive science. A curious phenomenon in some living organism sparks our interest, perhaps because it is reminiscent of some previously known phenomenon, perhaps because it is inexplicable in any terms currently available to us. The richness and diversity of biological phenomena have led to the danger of a biology overwhelmed with descriptions of phenomena and devoid of any unifying principles. Unlike the rest of biology, structural biology is in the unique position of having its unifying principles largely known. They derive from basic molecular physics and chemistry. Rigorous physical theory and powerful experimental techniques already provide a deep understanding of the properties of small molecules. The same principles, largely intact, must suffice to explain and predict the properties of the larger molecules. For example, proteins are composed of linear chains of amino acids, only 20 different types of which regularly occur in proteins. The properties of proteins must be determined by the amino acids they contain and the order in which they are linked. While these properties may become complex and far removed from any property inherent in single amino acids, the existence of a limited set of fundamental building blocks restricts the ultimate functional properties of proteins.

Nucleic acids are potentially simpler than proteins since they are composed of only four fundamental types of building blocks, called bases, linked to each other through a chain of sugars and phosphates. The sequence of these bases in the DNA of an organism constitutes its genetic information. This sequence determines all of the proteins an organism can produce, all of the chemical reactions it can carry out, and, ultimately, all of the behavior the organism can reveal in response to its environment.

Carbohydrates and lipids are intermediate in complexity between nucleic acids and proteins. We currently know less about them, but this deficit is rapidly being eliminated.

The central focus in structural biology at present is the three-dimensional arrangement of the atoms that constitute a large biological molecule. Two decades ago this information was available for only several proteins and one nucleic acid, and each three-dimensional structure determined was a landmark in biology. Today such structures are determined routinely, and we have begun to see structures of not just individual large molecules, but whole arrays of such molecules. The first three-dimensional structures were each consistent with our expectations based on fundamental physics and chemistry. Most of the structures determined subsequently, however, were completely unrelated, and a large body of descriptive structural data began to emerge as more and more structures were revealed by x-ray crystallography. From newer data, patterns of three-dimensional structures have begun to emerge; it is now clear that most if not all structures will eventually fit into rational categories.

The Main Theme of Structural Biology Is the Relation of Molecular Structure to Function

Since biologists are ultimately interested in function, structural biology is often a means toward an end. The role played by structural biology differs somewhat depending on our prior knowledge of the function of particular molecules under investigation. Where considerable knowledge about function already exists, the determination of three-dimensional structure has almost inevitably led to major additional insights into function. For example, the three-dimensional structure of hemoglobin, the protein that carries oxygen in our blood stream, has helped us understand how we adapt to changes in altitude, how fish control their depth, and how a large number of human mutant hemoglobins relate to particular disease symptoms.

Often knowledge about structure can provide dramatic advances in our understanding about function even when prior knowledge is sketchy. For example, early biological experiments had shown that DNA contained genetic information, but these experiments offered no real clues to how a molecule could store information or how that information could be passed from cell to cell or from generation to generation. The structure of DNA, with bases paired between two different chains, led immediately to the correct conclusions about the mechanism of information storage and transfer. The information resided in the sequence of the bases; the apparent redundancy of two strands with equivalent (complementary) information meant that each could serve to pass the information onto a daughter strand. Furthermore, the redundancy offered a natural defense against loss of information. Even if one strand is damaged (as by chemicals or radiation), in the vast majority of cases the information on the other strand can be used to recover the missing information. Indeed, cells have evolved truly elegant mechanisms to determine which strand contains the original undamaged information; such models could provide useful paradigms for the current human preoccupation with electronic information handling.
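The redundancy argument can be made concrete with a small sketch. It assumes only the standard base-pairing rules (A with T, G with C); the function names and the use of "N" to mark an unreadable base are illustrative choices, not something taken from the text, and real cellular repair is far more elaborate.

```python
# Watson-Crick pairing rules: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand: str) -> str:
    """Return the partner strand, read in the opposite (antiparallel) direction."""
    return "".join(PAIR[base] for base in reversed(strand))


def repair(damaged: str, partner: str) -> str:
    """Fill unreadable positions (marked 'N', an illustrative convention)
    in `damaged` by consulting its intact partner strand."""
    template = complement_strand(partner)  # what the damaged strand should read
    return "".join(t if d == "N" else d for d, t in zip(damaged, template))


original = "GATTACA"
partner = complement_strand(original)   # the redundant copy of the information
restored = repair("GANTNCA", partner)   # two bases lost on one strand
```

Because each strand fully determines the other, either one can serve as the template for a daughter strand or as the reference for restoring its damaged partner.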

The ultimate challenge for structural biology occurs when we have a structure but no clues at all about its function. Because of dramatic advances in our ability to determine structures, this challenge is likely to occur with increasing frequency. There have been a few remarkable cases in which limited structural information, such as a knowledge of the sequence of amino acid residues in a protein, without any three-dimensional structural information, has led to significant insights into function. In general, however, our current ability to predict function from structure in the absence of prior biological clues is limited, and one of our major needs is to improve our predictive abilities.

Biological Structure Is Organized Hierarchically

The structures of large biological molecules such as proteins and nucleic acids are complex. It is not practical or useful to describe these structures in words. In fact, highly specialized computer-driven graphics systems have been especially created to display molecular structures visually. An example of the output from one of these display systems is shown in Plates 1 and 2. Such devices are an invaluable aid to today's structural biologist, and future advances should make such devices cheaper, easier to use, and thus more readily available to all biologists.



Plates 1, 2 A repressor protein (from bacteriophage 434) is shown (left) approaching DNA and (right) bound to DNA in a crystal of the repressor-DNA complex. This binding turns off expression of a bacteriophage gene. [J. E. Anderson et al., Nature 326:846-852 (more...)

Because of the complexity of biological structures, it is frequently convenient to deal only with certain aspects of these structures. It is common practice to describe structure at a series of hierarchical levels, called primary, secondary, tertiary, and quaternary structure. This hierarchy reflects some of the types of information provided by particular experimental techniques used to determine the structures of biological molecules.

The primary structure is the covalent chemical structure, that is, a specification of the identity of all the atoms and the bonds that connect them. The major molecules with which we work (proteins, nucleic acids, and carbohydrates) usually consist of linear arrays of units, each of which has a similar overall structure; they differ only in certain details. The types of units are limited in numbers: 4 common ones in typical nucleic acids, roughly a dozen in typical carbohydrates, and 20 in proteins. Thus, the primary structure can be specified almost completely by naming the linear order, or sequence, of each type of unit of the chain. The primary structure is given by the sequence plus a description of any additional covalent modifications or crosslinks.

The sequence of proteins, nucleic acids, and carbohydrates is determined principally by chemical methods. This is understandable since it is, in fact, the chemical structure. These methods have advanced tremendously in the past decade, and the implications of these advances constitute the second section of this chapter.

The secondary structure refers to regular patterns of folding of adjacent residues. Most secondary structures are helices. Some of the most frequent and best-known helices are the alpha helices found in many proteins and double helices found in virtually all nucleic acids. Carbohydrates also form helices. Helices are convenient structural motifs: They are easy to recognize by inspection of a known three-dimensional structure, they are relatively easy to detect experimentally by physical techniques, and their appearance within many structures is relatively easy to predict just from a knowledge of the primary structure.

The tertiary structure is the complete three-dimensional structure of a single biological unit. Until recently the only available method for determining this structure was x-ray diffraction studies of a single crystal sample. Now electron and neutron diffraction have become available as tools for solid samples, and nuclear magnetic resonance spectroscopy has been developed to the point where it can be used to determine the tertiary structure of small proteins and nucleic acids in liquid solution, that is, close to the state in which they are usually found inside living cells. The tertiary structure usually provides the starting point for studies that attempt to correlate structure and function.

Quaternary structure describes the assembly of individual molecular units into more complex arrays. The simplest example of quaternary structure is a protein that consists of multiple subunits. The units may be identical or different. The arrangement of the subunits frequently has important functional implications. Some quaternary structures have been determined by experimental methods that reveal not only the arrangement of the subunits but also their individual tertiary structures. However, many quaternary structures are too complex to be addressed by existing techniques. Here a variety of methods ranging from electron microscopy to neutron scattering to chemical crosslinking can still provide information about the overall shape of the assembly and detailed arrangement of the components.

In the sections that follow, we will first explore the levels of biological structures; our concerns will be improved methods for revealing these structures and the application of the resulting information to solving biological problems. We will then consider the current and future prospects for predicting the higher order structure of biological macromolecules from more readily available information on lower order structure. Finally we will consider the power of our newfound abilities to alter macromolecular structure more or less at will.

Primary Structure

Nucleic Acid and Protein Sequence Data Are Accumulating Rapidly

The amount of available information on the primary structure of biological polymers is increasing at an astounding rate. Two decades ago we knew the nucleotide sequence of only a single small nucleic acid, the yeast alanine transfer RNA. We knew the amino acid sequence of fewer than 100 different types of proteins.

Today more than 18 million base pairs of DNA have been sequenced, and the data are accumulating at a rate exceeding several million bases a year. The first completed sequences were research landmarks. Now sequences are appearing so rapidly that many research journals refuse to publish such information unless it has some particular novel or utilitarian aspects. Indeed, sequence data are currently accumulating faster than we can analyze them, and even faster than we can enter them into the data bases by existing methods.

The longest block of continuous DNA sequence known is the entire primary structure of Epstein-Barr virus. This 172,282-base-pair genome is responsible for a number of human diseases including infectious mononucleosis, Burkitt's lymphoma, and nasopharyngeal carcinoma. Knowledge of the DNA sequence potentially unlocks for us all of the secrets of the virus. The challenge now is to use this sequence information to learn how to prevent or control the diseases caused by the virus. Other landmarks of recent DNA sequencing include the complete DNA sequence of the maize (corn) chloroplast DNA (about 130,000 base pairs) and the complete sequence of the gene for human factor VIII, one of the proteins involved in blood clotting, which is defective in certain hemophilias. We know the complete sequence of many other important proteins, RNAs, and viruses. Perhaps what is most important is that we have the technical ability to determine the sequence of virtually any piece of DNA, RNA, or protein.

Sequence Comparisons Lead to Structural, Functional, and Evolutionary Insights

Much valuable comparative sequence information awaits us as the data accumulate and as analytic methods become more reliable and informative. Already, one can do much using the data bases to help interpret any DNA sequence plucked more or less at random from a genome. The patterns of sequence in the regions that code for the amino acid chains of proteins differ enough from the noncoding regions that the former can usually be identified. For example, we know about types of sequences that are required for efficient synthesis of proteins in many different types of organisms. We know about some general types of control elements for certain genes important in developmental pattern formation or in an organism's response to environmental stress.
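One very simple way to flag a candidate protein-coding stretch is to look for an open reading frame: a start codon followed, in the same reading frame, by a run of codons and then a stop codon. The sketch below is a deliberately crude stand-in for the statistical pattern methods alluded to above. It assumes the standard genetic code's start codon (ATG) and stop codons (TAA, TAG, TGA), scans one strand only, and uses a hypothetical function name and length cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, frame) for each ATG...stop stretch on the given strand.

    `start` is inclusive; `end` is exclusive and includes the stop codon.
    `min_codons` (an arbitrary illustrative cutoff) counts the codons from
    the ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three reading frames per strand
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                     # close it and keep scanning
    return orfs


candidates = find_orfs("CCATGGCTGCATTTTAAGG")
```

In practice a long open reading frame is only a hint; distinguishing genuine coding regions also draws on codon usage and the control sequences mentioned above.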

When the protein sequence predicted from a gene is compared with all known protein sequences, there is about one chance in three that it will be similar enough to one or more of them to be recognized in a match. This provides an immediate clue to the function of the previously unknown protein. Perhaps the most spectacular example of such a match was the discovery that the product of the sis oncogene, a protein of unknown function that is associated with some cancers, was extremely similar to a blood protein that promotes normal growth, the platelet-derived growth factor. As the data base grows in size, and as our general knowledge about the function of its constituents does likewise, the probabilities of informative matches should rise steadily. One can anticipate the growth of a new speciality, molecular archeology, that resembles the field of archeology itself. Protein and gene sequences are old. They have been rearranged and altered much as the residual artifacts from a town or fortification have partially deteriorated and become dispersed. The components that remain, however, when properly viewed, provide clues to the function of the whole.

An archeologist might conclude that a room full of amphoras was likely to have been a storage room and not sleeping quarters. In the same way, we can already look at some protein sequences and gain clues about functions, even about functions that we have never observed in detail in the laboratory. Proteins that have transmembrane domains will sit in membranes, proteins with nucleic acid binding sequences will bind DNA or RNA; a protein with both would probably bring a nucleic acid into the vicinity of a membrane and keep it there. The analysis can be carried further because the details of the protein sequence can provide even greater clues, just as the details of the decoration on a piece of pottery or the shape of an arrowhead can identify the geographic origin of the people who produced it.
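A common heuristic for spotting a possible transmembrane domain from sequence alone is to average a per-residue hydropathy value over a sliding window and look for long hydrophobic stretches. The sketch below uses the Kyte-Doolittle hydropathy scale; the scale, the window of 19 residues, and the threshold of 1.6 are conventional choices brought in here as assumptions rather than anything prescribed in the text, and the function names are illustrative.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19):
    """Mean hydropathy of every window of `window` consecutive residues."""
    scores = [KD[residue] for residue in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_membrane_segments(seq: str, window: int = 19, threshold: float = 1.6):
    """Start positions of windows hydrophobic enough to suggest a membrane span."""
    profile = hydropathy_profile(seq, window)
    return [i for i, mean in enumerate(profile) if mean >= threshold]


# A made-up sequence: charged flanks around a 20-residue hydrophobic core.
toy = "DEKR" * 3 + "LIVAL" * 4 + "DEKR" * 3
hits = candidate_membrane_segments(toy)
```

A hit from such a scan is a clue of the archeological kind described here, to be weighed with other evidence, not a demonstration that the protein sits in a membrane.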

Many proteins with related functions have probably evolved from common ancestors. Thus receptors (proteins designed to sit at the cell surface and detect the environment) may represent one or more fundamental families of structures and sequences. For example, the sequence of the beta-adrenergic receptor, which binds the hormone adrenalin, and the sequence for rhodopsin, which detects light, are sufficiently similar that we can tell both were once related through a common progenitor. In the same way, proteases often resemble other proteases and structural proteins resemble other structural proteins.

Many proteins are modified chemically after they are synthesized. Proteases may remove one or both ends of the initial chain as well as make cleavages in the middle. Carbohydrates may be added to form glycoproteins. Some of these modifications occur as the protein travels from its initial site of synthesis to its final location in the cell. Other modifications, such as phosphate groups, are added and removed repeatedly as part of the functioning or regulation of the protein. The enzymes that perform these modifications frequently do so by recognizing particular signal sequences. Because we now know some of these signals, a search of protein sequences can frequently reveal potential modification sites and in turn provide additional clues to the function of the protein.
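As one illustration of such a search, the best-known signal for attaching a carbohydrate to asparagine is the three-residue pattern Asn-X-Ser or Asn-X-Thr, where X is any amino acid except proline. That pattern is not spelled out in the text and is supplied here as background; the function name is hypothetical. A minimal sketch of scanning a one-letter protein sequence for it:

```python
import re

# Asn, then any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping candidate sites all be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def potential_n_glycosylation_sites(protein: str):
    """Return 0-based positions of asparagines that begin an N-X-S/T pattern."""
    return [match.start() for match in SEQUON.finditer(protein.upper())]


# A made-up sequence: sites at 2, 12, and 13; the NPT at 7 is excluded.
sites = potential_n_glycosylation_sites("MKNGSAANPTLLNNTS")
```

A match marks only a potential site. Whether the asparagine is actually modified depends on where the protein travels in the cell and on its folded structure.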

Three-dimensional structure is better conserved in evolution than sequence is. Apparently there are severe constraints on folding a protein to make a compact three-dimensional array that is stable in the aqueous medium of a cell and resistant to proteases. Once we have mastered techniques for estimating possible folded structures from amino acid sequences, we will enhance our ability to explore molecular archeology. Mastery of these techniques itself will probably require an examination of many more three-dimensional structures by x-ray diffraction. What we still cannot do with much success is predict the function of an arbitrary protein without some molecular archeological clues.

When inspected by eye, a three-dimensional protein structure is complex and confusing; about the best most trained observers can do is identify potential binding sites as clefts or pockets and find potential sites of flexibility, such as connectors, between domains. Clues to functional regions can emerge from amino acids that are found in places other than their usual locations. For example, in typical soluble proteins, hydrophilic residues (which have an affinity for water), such as charged residues, reside on the surface, whereas hydrophobic residues (which avoid contact with water), such as those with hydrocarbon side chains, are found buried in the interior. A buried charged group, particularly if it is not paired with an opposite charge, can be a clue to a functional site. Similarly an exposed hydrophobic group may reveal a binding site for a hydrophobic small molecule. A whole set of such groups may indicate a surface of the protein that interacts with another protein or a membrane.

One can go only so far with visual inspection. Methods for systematic analysis of three-dimensional protein structures are needed that can extract, from the structures, as many clues as possible about protein function. Such procedures are still in their infancy; the next decade should see rapid growth in such techniques now that a sufficient library of known structures and functions exists on which to develop, test, and refine these methods.

The DNA Sequences of Entire Genomes of Some Simple Organisms Will

The explosion in sequence data has just begun. DNA sequencing is far easier than protein sequencing, and the tools already available for cloning and efficient sequencing of 500-base-pair blocks of DNA will ensure that the current stream of new sequence data will become a torrent.

The ultimate target would be to determine the sequence of all the DNA in an organism, that is, to sequence an entire genome. Genomes range in size from 750,000 base pairs (a mycoplasma) to more than 3 billion base pairs.

Such large-scale sequencing programs are feasible by today's technology, but they are expensive in both manpower and actual dollar cost. Automated DNA sequencing techniques have begun to be developed, which should markedly diminish manpower requirements and decrease costs. It now seems likely that in the next few decades we will determine the complete DNA sequence of the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the human genome, the fruitfly Drosophila, the mouse genome, the nematode Caenorhabditis elegans, and possibly even a number of plant and other bacterial and yeast genomes. The resulting information will stimulate future generations of biologists as they explore the functions of the tens of thousands of genes that will be revealed for the first time by such sequencing programs.

The major issue facing us today is how to stage the process of large-scale sequence determination. One set of concerns related to this issue deals with the optimal scientific strategy and the selection of targets for sequencing. Another set of concerns deals with attempts to organize and accelerate this work by mechanisms other than the types of investigator-initiated individual research projects typical in current biological science.

Most investigators favor making a physical map of a genome before commencing really large-scale sequencing. This physical map will consist of an ordered set of large DNA fragments that covers the entire genome. From each large DNA fragment, smaller pieces can be isolated (or cloned) and used as source material to perform the actual DNA sequencing. Some workers favor constructing the ordered set of fragments by isolating individual ones at random and then determining which fragments are neighbors in the genome. Others favor dividing the genomes into successively smaller pieces—first chromosomes, then chromosome fragments, then very large DNA pieces—until an ordered set of DNA fragments is created. At present there are good arguments for attempting both approaches simultaneously.

The first physical maps of genomes or segments of genomes are almost completed. In principle, one could use these and simply start large-scale sequencing now, with existing approaches. However, the likelihood of major improvements in automated technology over the next 5 to 10 years leads many people to favor concentrating current efforts on speeding the development of that technology and delaying most massive sequencing until the technology is available. As automated DNA sequencing machines become common, the rate-limiting step in obtaining data will shift to the production of the DNA needed for sequencing. Thus, we need to enhance our ability to prepare large numbers of discrete DNA fragments (preferably kept in linear order from some larger starting fragment). Robotics seems an attractive and useful new technology. At present, most sequencing methods are limited to a maximum of about 500 base pairs per DNA fragment. Every significant increase in the size of the fragment that can be sequenced will improve the overall efficiency of the process. Multiplex methods, in which numerous different fragments are handled in parallel or in series, offer another way to accelerate and expedite the entire process.
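The effect of fragment size on the scale of the task is simple arithmetic. The sketch below counts the minimum number of fragments needed to read a genome once, using the roughly 500-base-pair limit and the genome sizes quoted in this chapter. The `coverage` parameter (how many times, on average, each base is read) is an assumption added for illustration; real projects need overlap and redundancy, so the figure at a coverage of 1 is a floor.

```python
import math


def fragments_needed(genome_bp: int, fragment_bp: int = 500, coverage: float = 1.0) -> int:
    """Minimum number of fragments of `fragment_bp` bases required to read
    `genome_bp` bases `coverage` times over (coverage is an illustrative knob)."""
    return math.ceil(genome_bp * coverage / fragment_bp)


mycoplasma = fragments_needed(750_000)          # smallest genome cited in the text
epstein_barr = fragments_needed(172_282)        # longest continuous sequence cited
large_genome = fragments_needed(3_000_000_000)  # "more than 3 billion base pairs"
doubled_read = fragments_needed(3_000_000_000, fragment_bp=1_000)
```

At 500 base pairs per fragment, a 3-billion-base-pair genome requires at least 6 million separate fragments; doubling the readable length halves that number, which is why every increase in fragment size matters.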

A third set of scientific concerns deals with the choice of species, individual organisms, and genes to sequence. At one extreme are those who believe that current efforts should be restricted to sequences tied to existing biological problems. For example, in the pursuit of human disease genes, it might be far more important to determine the DNA sequence of the same gene in many individuals than to extend a given sequence into neighboring regions to see what is there. Similarly, comparisons of different species frequently provide biological insights that would not have been possible if studies had been restricted to a single organism.

The advantage of this traditional problem-oriented approach is obvious: The sequences obtained are more or less guaranteed to be interesting and useful. However, the disadvantage is also obvious: As interesting regions are sequenced, it will become more difficult to motivate people to risk explorations of regions of genomes for which little or no information is available. While explorations of these regions have the potential to make major advances through finding completely unexpected genes and functions, the work is also risky since some regions may yield no rewards at all. The realities of tight funding and frequently competitive review for renewed funding militate against such work; if it is to be encouraged, new support mechanisms may need to be created, with longer term commitments and rewards for more risk-taking.

The final set of concerns deals with whether genomic sequencing should be organized in ways similar to those in which "big science" has been dealt with in other disciplines. The actual process of sequence determination is boring. It seems to require more dedication and large-scale organization than most typical biology projects. The intellectual rewards of obtaining sequences of entire genomes are likely to be missed by most of those involved in the massive effort to accumulate the data. Much of the data may not result in publications in the primary scientific literature, and some publications that do result may have very large numbers of authors. Thus special efforts may be needed to maintain investigators' morale.

Structural and Computational Methods Need to Advance to Keep Pace with the Explosion in Sequence Data

As the acquisition of sequence data continues to accelerate over the next few years, the problem of managing these data will become increasingly severe. Considerable thought and resources will be needed to optimize the collection of data in consistent formats, the entry of data into computerized data bases accessible to all investigators, and the refinement of computer algorithms for all sorts of sequence and structure analysis. The anticipated size of the data base (100 million base pairs in the next few years, 10 to 100 billion base pairs eventually) is not staggering even by today's standards. However, the way the data are being accumulated, by efforts in hundreds of different laboratories, each with its own computer systems and idiosyncrasies, poses a serious problem. What would help is a relatively uniform system of data annotation and transmittal. If this can be done by translation programs that accept a wide variety of inputs and return them to the data base and to the investigator in the standard format, it would probably win broad acceptance by the community because each laboratory could then maintain its own style.
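A translation program of the kind proposed might look, in miniature, like the sketch below. Both input styles, the record type, and the standard output layout are hypothetical and invented purely for illustration; they do not describe the formats of any actual data base.

```python
from dataclasses import dataclass


@dataclass
class SequenceRecord:
    """The common internal form every input style is translated into."""
    name: str
    sequence: str


def from_header_style(text: str) -> SequenceRecord:
    """Hypothetical lab style A: a '>name' header line, then sequence lines."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    name = lines[0].lstrip(">").strip()
    return SequenceRecord(name, "".join(lines[1:]).upper())


def from_numbered_style(text: str) -> SequenceRecord:
    """Hypothetical lab style B: a 'NAME <name>' line, then lines that begin
    with a position number followed by blocks of bases."""
    name, blocks = "", []
    for line in text.strip().splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            name = " ".join(fields[1:])
        else:
            blocks.extend(f for f in fields if f.isalpha())  # drop position numbers
    return SequenceRecord(name, "".join(blocks).upper())


def to_standard(record: SequenceRecord, width: int = 60) -> str:
    """Emit the (hypothetical) standard format: header, then fixed-width lines."""
    seq = record.sequence
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{record.name}\n{body}\n"


a = from_header_style(">geneX\natgc\nggta\n")
b = from_numbered_style("NAME geneX\n1 atgc ggta\n")
```

Each laboratory keeps its own style on input, and only the small reader for that style has to know about it; everything downstream sees one record type and one output format.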

A second complexity of the existing data bases is that there are three independent repositories for nucleic acid sequence data: GenBank, operated by the Los Alamos National Laboratory; the EMBL data base, operated by the European Molecular Biology Laboratory in Heidelberg; and Protein Identification Resource, operated by the National Biomedical Research Foundation in Washington, D.C. The multiplicity of data bases poses severe problems for current and potential users. Ideally, the three should be combined into one.

Nomenclature is a major problem for all three data bases. The names of molecules, species, and genes are constantly changing, and the data bases also change the cryptic names that they use to identify entries. The various data bases are also not cross-indexed. A major research problem is to determine what data are common to two or all three data bases. Moreover, any such cross-comparison is outdated as soon as one of the data bases is updated. International responsibility for entering data into major data bases and greatly expanding the use of electronic communication for this purpose is badly needed. Someone will have to construct and maintain a cross-index of related biological data bases. Possibly a direct-access system will be set up. In any case, the availability of a periodically updated cross-index will allow other installations to provide an integrated retrieval system to their users more easily.
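One modest way to begin such a cross-index is to ignore names entirely and group entries by the sequence they contain. The sketch below does this for exact matches only; the data base labels, entry names, and function name are all hypothetical, and it would miss entries that differ by even a single base or that overlap only in part.

```python
from collections import defaultdict


def build_cross_index(databases):
    """Group entries from different data bases that hold the same sequence.

    `databases` maps a data base label to {entry name: sequence}.
    Returns a list of groups, each a sorted list of (label, entry name),
    keeping only sequences found in more than one data base.
    """
    by_sequence = defaultdict(list)
    for label, entries in databases.items():
        for entry_name, sequence in entries.items():
            key = "".join(sequence.split()).upper()   # ignore spacing and case
            by_sequence[key].append((label, entry_name))
    return [
        sorted(refs)
        for refs in by_sequence.values()
        if len({label for label, _ in refs}) > 1
    ]


# Hypothetical holdings: the same sequence filed under two different names.
holdings = {
    "db_a": {"A001": "atg cca"},
    "db_b": {"XYZ9": "ATGCCA", "Q7": "GGGTTT"},
}
shared = build_cross_index(holdings)
```

Because the index is derived from the data and not from the names, it can simply be rebuilt whenever one of the data bases is updated or renames its entries.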

The third major complexity of the sequence data base is the sophistication of many of the interrogations that will be made. Today each new sequence is almost automatically run through a comparison program to see what matches can be found with preexisting sequences. At the most trivial level, this procedure may reveal that the sequence has already been reported by someone else, and perhaps the same gene will have been reported under another name. At a more profound level, the comparison may reveal functional and structural insights. These comparisons can consume large amounts of computer time if they are carried out with algorithms that try to detect even very slight degrees of sequence similarity.
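To see why sensitive comparisons are costly, consider local alignment by dynamic programming in the manner of Smith and Waterman, one widely used approach that is not named in the text. The sketch below computes only the best local alignment score, and its scoring values (match +2, mismatch -1, gap -2) are arbitrary illustrative assumptions.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between sequences a and b.

    Fills the dynamic programming table one row at a time, so memory
    is proportional to len(b) while time is len(a) * len(b).
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            score = max(
                0,                      # start a fresh local alignment here
                diagonal,               # align char_a with char_b
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


identical = local_alignment_score("ACGT", "ACGT")
embedded = local_alignment_score("GGACGTCC", "TTACGTAA")
unrelated = local_alignment_score("AAAA", "TTTT")
```

Every cell of the table must be filled, so comparing one new sequence against a data base takes time proportional to the product of the query length and the total length of the data base. That product is what makes exhaustive searches for faint similarity so demanding.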

In the near future, we should begin to see many more attempts to use the data bases to refine methods for structure prediction, studies that will consume enormous amounts of computer time. It seems prudent to plan ahead and support research and development of better computer algorithms and better computer hardware to optimize the biologist's use of the DNA data base. Among the needs are database management systems designed to keep track of inquiries and results, so that insights gained by different inquiries can be synergistic and so that unwarranted duplicate inquiries can be short-circuited. Other needs are for improved analytical tools for predicting structure and for comparing structure and sequence. These tools may take the form of new chips, parallel processors, or more powerful algorithms.

Carbohydrate Research Is Gaining Momentum

In the past decade, structural studies on carbohydrates have begun to approach the capabilities of more developed areas of protein and nucleic acid structure. Techniques have been developed to deduce the complete structure of complex oligosaccharides, including oligosaccharides found in scarce glycoproteins, such as cell-surface molecules.

Glycoproteins are proteins containing covalently attached sugars, usually short carbohydrate polymers attached to the side chains of the amino acids asparagine, serine, or threonine. Glycoproteins are found throughout nature, from simple single-celled organisms to humans, and they play critical roles in these organisms. Glycoproteins are usually, but not exclusively, found on the surfaces of cells and in cellular secretions. For example, almost all of the human blood proteins and all of the well-characterized eukaryotic cell-surface macromolecules are glycoproteins. In addition, glycoproteins are key components in the outer coatings of a number of pathological agents, including viruses and parasites. Many of the molecules used by the immune system to combat these pathogens are also glycoproteins. Recently, important roles have been identified for some glycoproteins that remain in the cell's interior, such as the proteins that form the pores in the nuclear membrane.

The new techniques in carbohydrate research include nuclear magnetic resonance (NMR), fast-atom-bombardment mass spectrometry, and metabolic labeling with radioactive sugars combined with stepwise degradation with a battery of purified glycosidases. The consequence has been the elucidation of hundreds of oligosaccharide structures, which has enhanced our understanding of the actual repertoire of structures synthesized by various organisms and cell types. These data form the basis for studies of the biological role of oligosaccharides.

The techniques have also been invaluable for studies of the biosynthesis of complex oligosaccharides. One important discovery is that a lipid-linked oligosaccharide serves as the precursor of asparagine-linked oligosaccharides. It is noteworthy that this lipid-linked precursor oligosaccharide structure has been highly conserved; it is the same in yeast as it is in mammals, and presumably occurred in a common ancestor of fungi and animals.

Several Major Advances in Studying Carbohydrates Can Be Anticipated

New NMR methods will enable the determination of the three-dimensional structures of oligosaccharides. This information will greatly enhance studies of the interaction of oligosaccharides with carbohydrate-binding molecules, such as receptors and lectins, and with glycosyltransferases. In 1986 the first partial complementary DNA (cDNA) clone for a glycosyltransferase was obtained. That a number of laboratories are focusing on this area suggests that the genes for many of these enzymes will be cloned soon. This achievement will provide valuable information about the structures of the molecules that synthesize complex carbohydrates and whether or not they exist as gene families. The clones will also be used for studying the catalytic function of the enzymes and for manipulating their cellular levels.

The organic synthesis of complex oligosaccharides has lagged behind their biochemistry. Improvements in synthesis have been made, and it seems likely that progress will continue. The ability to synthesize large quantities of complex oligosaccharides of known structures will be of considerable value to the investigators. Many laboratories are trying to implicate oligosaccharides in a variety of biological interactions, ranging from adhesion between the cells of sponges to sperm-egg interactions to the inhibition of growth that occurs when certain cells contact each other. The availability of synthetic oligosaccharides, of inhibitors that alter oligosaccharide biosynthesis, and of endoglycosidases that remove oligosaccharides from glycoproteins will be useful in these studies.

Trying to understand the features of glycoproteins that are distinct from those of unglycosylated proteins is of current interest. Contributions of sugars to protein folding and macromolecular assembly might be fundamentally different from those of amino acids. The physical properties of sugars differ from those of amino acid side chains. They can confer special mechanical and other properties, such as those of fish-antifreeze glycoproteins, mucins, or connective tissue proteoglycans. The vast diversity in the types of sugars and the ways they are connected suggests that oligosaccharides on glycoproteins (and glycolipids) can potentially encode large amounts of information, which may play an important part in the development of multicellular organisms, recognition in immune systems, and information storage and processing in nervous systems.

The main outstanding question for the future is, What function, if any, does the carbohydrate play in determining glycoprotein function? With the exception of the specialized physical properties imparted to mucins and proteoglycans, we do not know why most glycoproteins are glycosylated. In some studies of the functions of oligosaccharides on glycoproteins, the proteins appear to function perfectly well without the carbohydrates, whereas in others glycosylation is essential. We do not understand the basis for these different results; no underlying rules have been uncovered.

The main obstacle to understanding function is the lack of proper assays. For instance, the finding that oligosaccharide structures are dramatically altered during embryonic development has led to the proposal that surface oligosaccharides function as recognition molecules during development. However, there is no direct evidence for this. What is lacking is the ability to manipulate the oligosaccharides in these systems and to devise assays to test for effects. The same type of problem faces investigators studying the potential roles of oligosaccharides in other systems. One way to improve this situation would be to develop more inhibitors that are specific for particular glycosyltransferases or processing enzymes. This would make it possible to alter the oligosaccharides that cells synthesize. Another approach is to transfect active genes for particular glycosyltransferases into cells that normally do not express these genes. The objective would be to see how cell behavior is affected when the cells synthesize oligosaccharides with altered structures.

Three-Dimensional Structure

The Three-Dimensional Structure of Biological Macromolecules Determines How They Function

It is the three-dimensional shape of proteins and nucleic acids that endows them with their biological activities. Structural molecular biology uses x-ray diffraction, nuclear magnetic resonance, and other techniques to determine the three-dimensional arrangement of the atoms in biological molecules. Studying these detailed atomic arrangements provides insight into how the molecules fold and how their folded surfaces can act as biological catalysts, as recognition and adhesion devices, as architectural elements in cells, and as the storage libraries of genetic information.

X-Ray Diffraction Provides Highly Revealing Images of Molecules

By recording the scattering pattern produced by shining x-rays on a crystal, it is possible to compute an atomically detailed image of the molecule forming the crystal. The x-ray diffraction method, pioneered in the early part of this century, can now be applied to crystals as simple as those of table salt or as complex as those formed from polio virus.

The particular power of x-ray diffraction is that, because x-rays are just a very short wavelength form of light, the method produces a three-dimensional image of molecules in the same sense that our eyes produce images from light. Thus we can literally see the arrangement of atoms and distribution of electrons in biological molecules, rather than have to infer such information from indirect experiments. The experimental difference between x-ray diffraction and seeing (for example, in a microscope) is that there exist no x-ray lenses. This lack has been overcome by exacting experimental methods involving binding heavy metal atoms to proteins as atomic markers and by using a computer in place of an x-ray lens to transform the raw x-ray scattering pattern into an image. The latter operation is possible because we know the mathematical equations governing the actions of lenses.
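The idea of a computer standing in for a lens can be illustrated with a toy one-dimensional calculation. The sketch below is an illustration under simplifying assumptions, not the procedure used in practice: the "crystal" is a hypothetical row of point-like atoms on a 16-point grid, and the function names are invented for this example. It shows that a Fourier synthesis of the complete scattered waves returns the atomic positions, whereas the amplitudes alone, which are all that a detector records, do not. That missing phase information is what markers such as bound heavy atoms help to supply.

```python
import cmath
import math

def structure_factors(density):
    """Forward transform: the wave scattered into each diffraction order h."""
    n = len(density)
    return [sum(density[x] * cmath.exp(2j * math.pi * h * x / n) for x in range(n))
            for h in range(n)]

def fourier_synthesis(factors):
    """Inverse transform: the 'lens' step that turns waves back into an image."""
    n = len(factors)
    return [(sum(factors[h] * cmath.exp(-2j * math.pi * h * x / n)
                 for h in range(n)) / n).real
            for x in range(n)]

# Hypothetical 1-D unit cell: a heavy atom at grid point 3, a light atom at 10.
cell = [0.0] * 16
cell[3] = 8.0
cell[10] = 1.0

factors = structure_factors(cell)

# With amplitudes and phases, the synthesis reproduces the original cell.
image = fourier_synthesis(factors)

# A detector records only intensities, so the phases are lost.
amplitudes_only = [abs(f) for f in factors]
image_without_phases = fourier_synthesis(amplitudes_only)
```

Comparing `image` with `image_without_phases` makes the point: the first matches `cell` to within rounding error, while the second places its largest peak at the origin rather than at either atom.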

The three-dimensional atomic structures of more than 300 proteins are now known. X-ray diffraction continues to unveil between one and two dozen new protein structures each year. The distribution of atoms in any given protein is bewilderingly complicated, difficult to describe, and satisfactorily revealed only in three-dimensional stereoscopic drawings. However, now that so many structures have been determined, structural themes are seen to recur, allowing recognition and categorization of structural motifs.

Structural Motifs are Repeatedly Used to Carry Out Similar Functions

A structural motif composed of three extended strands and two helical coils of protein, named the nucleotide-binding domain, is found as part of many enzymes that bind nucleotides, such as adenosine triphosphate (ATP). Recently, from the nucleotide sequence of the gene encoding a protein associated with bladder cancer in humans, the nucleotide-binding domain was correctly predicted to be part of this protein's three-dimensional structure.

Another structural motif has been seen in some of the proteins that recognize specific sequences of DNA and consequently regulate genes by turning them off or on. In this instance, two helical coils of protein connected by a short bend form a module that can plug into the major groove of a DNA double helix. The atomic surface of this recognition-helix motif is different in different proteins, imparting to them the ability to recognize and bind tightly to different specific sequences of DNA. As a result, one protein turns on one specific gene, whereas another might turn off another gene. Understanding how this structure functions has allowed scientists to synthesize novel regulators of genetic information.

Three-Dimensional Structures Give Insight Into How Proteins Fold and How Groups of Proteins Assemble

A number of metabolic enzymes have similar structures that look strikingly like a barrel. Each stave of the barrel resembles a hairpin composed of one extended strand folded back against one helix. Genetic studies have shown that each stave is encoded by a unit of genetic information—an exon—and that the barrels are formed by stringing together eight of these genetic units. Correlating the genetic map of the staves with about a dozen three-dimensional structures of various barrels has led to the suggestion that the staves may derive from independent folding units. The discovery of such substructures may help unlock the mystery of how proteins fold by letting us see the primitive folding units that first appeared as life evolved. Maybe we can learn how to fold proteins by seeing how nature learned to do it.

The Major Goal of Protein Crystallography Is to Show How Proteins Function

The majority of the proteins of known structure are enzymes. When the unique three-dimensional spatial information from the structure is combined with a vast array of observations on the properties of the protein, many of the secrets of how enzymes catalyze chemical reactions are revealed. For example, we know how enzymes can bind certain substrate molecules specifically, how certain amino acid side chains are positioned to act as catalysts, and how the enzyme can change its shape in response to binding the substrate or regulatory molecules. We know that sometimes these shape changes can be transmitted through the structure. We also know how the same structural motifs can be varied in different proteins to produce a series of enzymes with similar mechanism but different substrate specificities. For example, a whole set of proteases have extremely similar active sites, all of which contain an activated serine residue.

Photoreaction Center. In 1987, the structure of the first membrane assembly involved in photosynthesis, the photoreaction center, was determined. This complex of four proteins converts light energy into an electrical gradient across a membrane (Figure 3-1). The structure immediately showed the path through the protein that electrons traverse to cross photosynthetic membranes, a key initial step in the conversion of light into chemical energy.

Figure 3-1. Cartoon of the three-dimensional structure of photoreaction center. [L. Stryer, Biochemistry (Freeman, ed. 3, New York, 1988)]

This structure also gave major new insights into the exact nature of packing of alpha helices in a detergent micelle that is thought to mimic the lipid bilayer environment in a real membrane. It revealed a logical relationship between the length of particular hydrophobic sequences and the angle at which they cross the membrane. It also showed that other helices were actually located at the surface of the micelle and that these had arrangements of hydrophobic and hydrophilic residues that enabled them to interact favorably with both the lipid in the micelle and water outside it. Thus the structure of the photoreaction center provides the first understanding at high resolution of the interface between a protein and the lipid bilayer.

It is possible to diffuse small molecules (enzymatic substrates, inhibitors, reaction products, or drugs) into protein crystals because the spaces between proteins in protein crystals are large, water-filled channels. A herbicide was soaked into the crystals of the photoreaction center to see how it bound to the protein. The three-dimensional atomic image of a photosynthetic protein being inhibited by a herbicide provides a new level of understanding for developing agricultural products.

Antibodies. Antibodies are protein molecules that play a key role in an organism's defense against foreign molecules called antigens. The first step in this process is specific complex formation between an antibody molecule and a molecule of antigen. The structural basis of this antigen-antibody recognition is beginning to be elucidated. Structures of certain antigen-binding fragments of antibodies were determined some time ago. These structures demonstrated directly that certain regions of antibodies were extremely variable in conformation from one antibody to another. Six stretches of amino acids, containing these hypervariable sequences from the heavy and light chains (see Chapter 7) of the antibody, are adjacent in space. Together these describe a surface believed to be complementary to the recognized region (epitope) on the antigen used to raise the particular antibody. Binding studies with small-molecule fragments of antigens (haptens) revealed the importance of these complementarity-determining regions in forming complexes with small molecules and, by implication, with macromolecular antigens. This work provided an intellectual framework for understanding and investigating antigen-antibody interactions. In 1986, the first structures of complexes between protein antigens and antibodies of the immune system were determined. These direct pictures of how antibodies recognize an enzyme and a protein from the surface of influenza virus are providing fundamental information about how our immune systems recognize and destroy foreign molecules.

Future challenges will be to characterize the complex antigen-receptor interactions by which different types of cells participate in and regulate immune responses. The physiological-cellular description of the process is already giving way to a molecular characterization. Genetic engineering, cell culture, and hybridization techniques will allow production of molecules directly involved in the cell communication process and in the elaboration of the response to antigenic stimulation.

Knowledge of How Structures Function is Being Used to Design Drugs

The three-dimensional atomic structure was recently determined for a complex between a protein on the outside of the 1968 Hong Kong influenza virus and the receptor molecule on human cell surfaces to which the virus binds to initiate an infection. By observing the atomic details of how the virus binds to a cell, scientists hope to design and synthesize drug molecules that can mimic the cell receptor. Such inhibitors would prevent infection by binding to a virus, interfering with its ability to attach itself to and to infect a human cell.

Dihydrofolate reductase is an essential enzyme for cell growth. It is the target both for antibacterial drug design and for chemotherapeutic agents that arrest human cancers. So far, studies of the three-dimensional structure of the enzyme complexed with various inhibitors have resulted in the development of the new antibacterial drug trimethoprim.

The antihypertensive drug captopril lowers blood pressure by inhibiting angiotensin-converting enzyme, which normally produces a substance that constricts blood vessels and raises blood pressure. Captopril was designed by studying the three-dimensional structure of a digestive enzyme related in its chemical activity to the angiotensin-converting enzyme and by synthesizing a compound that would fit tightly to and block the active site of the converting enzyme.

Drugs that bind to the blood protein hemoglobin, preventing it from aggregating in individuals with the hereditary sickle-cell trait, are also being studied. Knowledge of the structure of β-lactamase, an enzyme that bacteria use to destroy penicillin and similar antibiotics, has opened the way for foiling the method by which bacteria resist drugs.

Designing drugs by understanding the atomic details of how inhibitors fit onto the surfaces of proteins and block normal activities is just beginning. Many believe we are entering a new era of drug discovery based on designing molecules for stereochemical fit to their targets by actually seeing the target molecules with substrates and the drugs that are bound to them.

DNA Is a Dynamic Molecule That Can Switch Between Different Structural States

When the double helix structure for DNA was first announced, it was an instant public success. It represented a neat solution to a number of chemical and biological problems, and it was easy to describe and to remember. The importance of pairing between bases on the two DNA strands and stacking of adjacent bases along each individual DNA strand is overwhelming in nucleic acid structures. In terms of relative importance to the overall structure, there are no counterparts in proteins. However, with time, the structure of DNA has been found to be much more complex than was originally thought, since there are a variety of different double helical structures. The diversity of such structures has dramatically altered our thinking about the DNA molecule.

To date, the folding of DNA has been largely thought of as the assembly of the double helix through formation of successive base pairs. The insertion and deletion of extra helical twists in circular DNA molecules has forced attention on the topology of these complex systems and has presented a massive mechanistic problem at the enzyme level. The bending of the double helix and its control by sequence variation is also under intensive investigation. The double helix was initially thought to be rigid. In view of the compact packing required in the nucleus of the cell, bending was obviously essential. However, the structural details of the contortions that the double helix can actually undergo have only recently been recognized. The structure of the nucleosome, now known at high resolution with its coiled double helix and protein core, is a beautiful example of the biological importance of bending. Other proteins that interact with DNA can also induce bending.

The dynamic aspects of the equilibrium structures of DNA have become clear with direct experimental measurement of the swinging in and out of individual bases to and from the axis of the helix. Larger scale motions on a much longer time scale are revealed by pulsed field gel electrophoresis, which separates molecules of enormous molecular weight.

RNA Structure Is a More Challenging Area of Research

The structure of RNA has been even more difficult to deal with than that of proteins and DNA. Only the smallest class, transfer RNA, has yielded any solved crystal structures. All the transfer RNAs turn out to be similar L-shaped molecules. This similarity is reflected in the cloverleaf model for secondary structure, originally derived by searching for similar base pairing possibilities within the single chains. Although no other RNA structures are yet available through diffraction procedures, the extensive use of sequence data and sequence homology has led to a large array of secondary structure predictions that will almost certainly be retained in the three-dimensional structures eventually determined. Nuclear magnetic resonance (NMR) is starting to provide a substantial amount of structural information on RNAs, but diffraction-quality crystals would be enormously useful.

One of the most unexpected discoveries of the past few years is the catalytic activity of certain RNAs in RNA splicing reactions (see Chapter 4). The enzyme RNA polymerase produces an RNA copy, called the transcript, from the gene in the chromosome. In eukaryotic cells, such a transcript frequently has several long stretches (introns) that interrupt the functional portion, which will form a messenger RNA (mRNA), a transfer RNA (tRNA), or a ribosomal RNA (rRNA). These introns are precisely cut out of the transcript and the functional structures (exons) are rejoined to yield the mature RNA in what is called an RNA splicing reaction. In some cases, this reaction can be carried out by the intron RNA itself without the help of any proteins. These RNA molecules are the first known examples of true, nonprotein biological catalysts. The details of the highly organized three-dimensional structures of these catalytic RNA molecules have not yet been unraveled.

Much Remains to Be Learned About the Structures of Carbohydrates

Significant by its absence in the above discussion is any mention of the three-dimensional structure of polysaccharides (carbohydrates made up of a large number of sugar molecules). As mentioned earlier, these substances have been particularly intransigent in yielding high-resolution structural data. Only the smallest compounds have provided truly crystalline material. Most studies have been chemical or spectroscopic. In view of their unquestioned biological importance, much greater effort on the three-dimensional structure of this class of polymers is warranted. We do not even know whether such molecules have unique three-dimensional structures.

A Technical Breakthrough Promises Information About Dynamic Processes in the Function of Proteins

The massive electron-storage rings that physicists use to probe the fundamental components of matter also emit x-ray beams high in power. These synchrotron x-ray sources have recently been used to study large biological molecules. The beams of x-rays are thousands of times as strong as those from conventional laboratory x-ray sources, reducing x-ray data-collection time from months to hours. An experimental breakthrough in the application of multiple-wavelength x-ray diffraction now provides exposure times of milliseconds. The biochemical events on the surface of a protein can therefore be studied by a series of snapshots of the structure every few milliseconds. This should allow the sequence of events that constitute a chemical reaction or protein conformational change to be understood in atomic detail. Examining the dynamics of fundamental biological reactions will deepen our understanding of how proteins work, provide insight into normal functions, and raise the possibility of understanding abnormal functioning in disease.

Crystallography Will Continue to Increase in Importance

The future for structural biology is particularly bright at present because two factors have coincided. First, the recent explosive growth in the power of molecular biology, as a result of gene cloning and recombinant DNA technology, suddenly provides a large amount of any given macromolecule and the ability to modify these at will, to test or alter their functions. This brings the fundamental molecules at the basis of almost every process in living systems into the range of structural study.

Second, as the discovery of new molecules has accelerated, the technology by which x-ray structures are determined has undergone a rapid evolution. New methods and algorithms have made determining x-ray structures easier, but most important, because x-ray crystallography is highly technical, it has benefited enormously from the recent leap in computational power and computer-controlled instrumentation.

Nuclear Magnetic Resonance is the Technique of Choice for Studying Molecular Structures in Solution

Recently, NMR, a structural and analytical tool used by chemists for many years (Chapter 2), has made rapid progress in providing information in structural biology. NMR is a spectroscopic technique in which the absorption of radio-frequency energy is measured for the nuclei of molecules placed in strong magnetic fields. Because the absorption frequency is related to the chemical environment of a nucleus, NMR measurements provide structural details such as inter-atomic distances and conformational angles from samples in aqueous solution.
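The link between magnetic field, radio frequency, and chemical environment can be made concrete with a short calculation. The sketch below relies on the standard relation that a nucleus absorbs at a frequency proportional to the applied field, and on the convention of quoting environment-dependent differences as chemical shifts in parts per million. The field strength and the 1.5 ppm difference are illustrative assumptions, and the function names are invented for this example.

```python
# Gyromagnetic ratio of the proton divided by 2*pi, in MHz per tesla.
PROTON_GAMMA_BAR_MHZ_PER_T = 42.577

def larmor_frequency_mhz(field_tesla, gamma_bar=PROTON_GAMMA_BAR_MHZ_PER_T):
    """Absorption frequency of a bare nucleus in the given magnetic field."""
    return gamma_bar * field_tesla

def shift_ppm_to_hz(shift_ppm, spectrometer_mhz):
    """Convert a chemical shift difference in ppm to a frequency difference.

    One part per million of a carrier at X MHz is X Hz.
    """
    return shift_ppm * spectrometer_mhz

# Assumed example: protons in an 11.74 tesla magnet.
proton_mhz = larmor_frequency_mhz(11.74)

# Two protons whose environments differ by an assumed 1.5 ppm.
separation_hz = shift_ppm_to_hz(1.5, proton_mhz)
```

Under these assumptions the protons absorb near 500 MHz, and the two environments are separated by roughly 750 Hz, a difference of about one part in a million that the spectrometer must resolve.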

The spectra are very complex. The recently developed procedures for spreading out the absorption peaks in two dimensions have dramatically improved both resolution and the ability to assign each peak to specific atoms in the macromolecule. Intricate patterns of radio-frequency pulses can be used to collect information on which atoms are close to which others. As with all spectroscopic techniques, each absorption process occurs on a different, characteristic, time scale, and it is sensitive to events that occur on that same time scale. Thus, the dynamic behavior of many parts of the molecule can be directly measured over a broad frequency range.

Nuclear Magnetic Resonance and X-Ray Diffraction Form a Strong Partnership

NMR and x-ray diffraction provide both overlapping and distinct information about molecules. Recently, the chain-folding of a small protein was determined from an analysis of interatomic distances provided by NMR. X-ray diffraction simultaneously verified the structure, confirming as a side benefit that the structure of a protein in solution as seen by NMR is the same as that in a crystal as seen by x-ray diffraction. We can now confidently predict that NMR will make it possible to determine a series of structures of small proteins in solution.

The two techniques are complementary in the nature of the information that they can best supply. X-ray crystallography can provide precise atomic positions for almost all the atoms in a macromolecule, and it can be applied to very large molecules with no size limit yet found. However, it requires crystals of excellent quality, poorly ordered regions cannot be defined, and no estimates of the rates of certain types of molecular motion can be inferred. NMR can provide structural information, but not yet with the precision obtainable from x-ray structures. So far, NMR studies have produced atomic level data for only very small proteins. However, the procedure uses solutions rather than crystals, it can provide information on flexible regions, and it can reveal the times required for many dynamic processes. Both techniques are rapidly improving their capabilities and are likely to continue to dominate structural biology in the near future.

Molecular Assemblies

The past decade has seen major advances in our ability to study the structure of molecular assemblies. These are aggregates of individual macromolecules, most frequently complexes between proteins or proteins and nucleic acids. Among the major triumphs have been solution by x-ray methods of large structures such as the nucleosome, the photosynthetic reaction center, an antibody-antigen complex, and spherical viruses. Another highlight was the solution of the structure of a protein within its native membrane by electron microscopy; this structure is still at relatively low resolution, but it should be possible to extend it to 3 angstroms.

Understanding of the molecular architecture and function of some key filamentous complexes and organelles—such as actin filaments decorated with the myosin head, the ribosome, and clathrin-coated vesicles—has significantly advanced through the use of three-dimensional electron image reconstruction combined with in vitro reassembly. Also important has been the use of deuterium labeling and neutron scattering to derive three-dimensional maps specifying the relative locations of macromolecular components in large complexes such as RNA polymerase and the E. coli ribosome. Crucial insight, paralleling physiological and biochemical data, has been gained into dynamic processes such as muscle contraction and microtubule assembly through the use of time-resolved synchrotron x-radiation.

New Technology Has Improved Our Ability to Determine Complex Structures

These successes have stemmed in the main from technological developments and breakthroughs. In x-ray crystallography, particularly of viruses, the progress can be traced to four advances: (1) the introduction of methods to acquire and process high-resolution data from very large repeating units (such as crystals with very large unit cells); (2) the development of methods that take advantage of the symmetry of many molecular assemblies in solving the structure; (3) the tremendous advances in the speed, size, and affordability of computers; and (4) the development of computer graphics that efficiently communicate the information obtained.

An additional major breakthrough was learning to crystallize membrane protein assemblies in the form of protein-detergent micelles. Crystallization has been achieved now for several such membrane proteins, including the photosynthetic reaction center, E. coli matrix porin, and bacteriorhodopsin. Also of particular significance for membrane structure has been the progress made in electron microscopy. Preparative methods that preserve specimens have been developed as have mathematical analyses of the image that can provide three-dimensional details. The advent of cryo-electron microscopy, in which wet specimens are prepared by being frozen so rapidly that the water remains amorphous rather than crystalline, has made it possible to see functional molecules in action for the first time.

Nucleosomes. The nucleosome is the fundamental repeating structural unit that makes up the chromosomes of all eukaryotic cells. The elucidation of the molecular details of the nucleosome necessitated the use of a wide range of newly developed physical and biochemical approaches. The existence of nucleosomes was first recognized by electron microscopy and from the finding of a unit of histone organization that could explain the nuclease cutting pattern of chromosomal DNA. The nucleosome was perceived and eventually proven to consist of a histone octamer about which is wrapped approximately 160 base pairs of DNA, with a single molecule of an additional histone bound on the outside. Electron and x-ray crystallographic analyses showed that the DNA is coiled in two left-handed turns around the central histone octamer. With the methods used to analyze nucleosome digestion, and with the use of cloned DNAs, the chromatin organization of many genes has been studied. It is now understood that transcriptionally inactive genes are packaged in nucleosomes, but the active genes are organized differently, with regulatory sequences in special exposed regions and with gene and flanking sequences in altered nucleosomes or in novel particulate structures. Challenges for the future are to determine how chains of nucleosomes are further coiled or folded in condensed states of chromosomes, to elucidate the mechanism of unfolding that accompanies gene activation, and to solve the structure of transcriptionally active genes.
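The figures quoted above, roughly 160 base pairs in two turns, imply a tight path for the DNA. The back-of-the-envelope sketch below estimates how tight. It assumes the standard B-form rise of about 0.34 nm per base pair and treats each turn as a flat circle, ignoring the pitch of the superhelix. The function names are invented for this example, and the result is an estimate rather than a measured dimension.

```python
import math

# Assumed axial rise per base pair for B-form DNA, in nanometres.
RISE_PER_BASE_PAIR_NM = 0.34

def wrapped_dna_length_nm(base_pairs, rise_nm=RISE_PER_BASE_PAIR_NM):
    """Contour length of the DNA wound around the histone core."""
    return base_pairs * rise_nm

def superhelix_radius_nm(base_pairs, turns, rise_nm=RISE_PER_BASE_PAIR_NM):
    """Radius of the DNA path, treating each turn as a flat circle."""
    length_per_turn = wrapped_dna_length_nm(base_pairs, rise_nm) / turns
    return length_per_turn / (2 * math.pi)

length_nm = wrapped_dna_length_nm(160)    # DNA wrapped per nucleosome
radius_nm = superhelix_radius_nm(160, 2)  # two turns, as described in the text
```

With these assumptions about 54 nm of DNA follows a path with a radius of a little over 4 nm, which shows how sharply the double helix must bend to be packaged.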

Membrane Proteins. Major insights into the three-dimensional structure of membranes came from the electron microscopic analysis of bacteriorhodopsin, a light-driven proton pump, in the purple membranes isolated from the cell membrane of the bacterium Halobacterium halobium. The structural maps provided the first case in which a membrane protein was shown to be composed of a bundle of alpha-helical segments. There were seven such segments, closely packed in a left-handed configuration extending roughly perpendicular to the plane of the membrane bilayer for most of its width (Figure 3-2). The amino acid sequences of a number of cell-surface receptors, such as rhodopsin, the beta-adrenergic receptor, and the muscarinic acetylcholine receptor, have recently been determined. Analysis of the patterns of secondary structure and the distribution of hydrophobic residues predicted from these sequences indicates that the 7-alpha-helical bundle is a recurring motif among cell-surface receptors. The quaternary structures of protein oligomers that form membrane channels, such as the nicotinic acetylcholine receptor and the gap-junction connection, have become better understood through the same approach.

Figure 3-2

Bacteriorhodopsin as it sits in a bilayer [R. Henderson and P. N. T. Unwin, Biophys. Struct. Mech. 3:121 (1977)]

Viruses. Viruses consist of nucleic acid (RNA or DNA) packaged in a multisubunit protein coat. The three-dimensional structures of a number of plant and animal viruses have been determined by x-ray diffraction; they show assemblies of as many as 180 proteins forming icosahedral shells that package the virus's genetic information (Figure 3-3). Each determination of a viral structure has been a triumph of both persistence and innovation, and we have learned much from every new step. Determining the structures of the simple plant viruses was a major advance. These structures were far larger than any previously determined by crystallography; they provided our first real insights into viral architecture at the molecular level. The viral images also led to insights into how viruses assemble themselves. One day we may learn to use this insight to design strategies to prevent viruses from assembling as part of an effort to control the infections they cause. Recently, the structures of two animal viruses (rhinovirus and poliovirus) and the structure of the adenovirus hexon have revealed that the molecular topology of their coat proteins is essentially the same as that of the simple plant viruses. Thus all such viruses may have a common evolutionary ancestor.

Figure 3-3

A schematic diagram of the location of the 180 protein subunits on the surface of the tomato bushy stunt virus, as determined by x-ray diffraction. [S.C. Harrison et al., Nature 276:368-373 (1978)]

The structures of the human rhinovirus (common cold virus) and poliovirus have especially important implications for medical research. From these structures it has been possible to ascertain the sites of attachment of various types of neutralizing antibodies and the sites of binding for a series of experimental antiviral drugs that are suspected of inhibiting virus replication by preventing the low-pH-mediated uncoating of the viral RNA. The structural results show that these antiviral compounds insert themselves into the hydrophobic interior of one subunit of the protein coat, suggesting how the drugs may inhibit the disassembly of the virus.

In the near future, this newly developed technology should allow a fairly rapid survey of the structures of numerous viral pathogens or their components and thereby provide a good deal of information about how they work. We should come to understand the mechanism of neutralization of viruses by antibodies by looking at the structure of neutralizing monoclonal antibodies and their complexes with the virus. Moreover, it should become possible to design more powerful antiviral drugs that interfere with attachment, penetration, uncoating, or assembly.

How Our Immune System Recognizes Viruses. In the past five years, the determinations of the crystal structures of poliovirus, rhinovirus, and both of the surface proteins of influenza virus have allowed us to visualize those parts of the viruses recognized by the human immune system. Monoclonal antibodies and the amino acid sequences of many strains of viruses have made it possible to map the regions of each virus that are attacked by human antibodies. Plate 3, for example, shows five sites (A-E) on an influenza virus that bind antibodies. How these antigenic sites vary in structure every few years, resulting in new epidemics of "the flu," is now under study.

Plate 3

Sites A-E on the surface protein of the flu virus are recognized by our immune system. Variation in the structure of these sites results in the recurrence of epidemics in the human population. [Based on work of D. Wiley, Harvard University, and J. Skehel, (more...)

Even Larger Structures Will Be Solved During the Next Decade

Research on molecular assemblies over the next decades is likely to involve extending x-ray analytical methods to still larger aggregates, such as ribosomal subunits, and to more difficult materials, such as membrane complexes and networks of peptides and sugars called proteoglycans, which are materials that play important roles in the interactions between cells. There will be greater use of clusters of heavy atoms (rather than simply individual atoms) to determine the structures. Area detectors and synchrotron radiation will be used to collect x-ray data. The increase in power and accessibility of computing resources will greatly benefit the data processing. With symmetrical assemblies, new methods may eventually allow the determination of new high-resolution structures without the need to resort to heavy-atom methods. Since progress on each of these points is already being made, one can anticipate that early in the next decade determining the structure of small viruses will be routine and that a number of larger viruses will also be solved.

Electron microscopy of rapidly frozen specimens, in the form of two-dimensional crystals or isolated molecules and in conjunction with heavy-atom cluster labels, should provide a wealth of new information on molecular assemblies trapped in different conformational states in nearly physiological conditions. Methods of three-dimensional image reconstruction from crystalline specimens will be further refined by computational procedures. Averaging methods to extract details from isolated molecules will become more powerful. With further improvements in methods for electron microscopy at very low temperatures, we can expect images of two-dimensional crystals in some instances to yield atomic resolution and, more generally, resolution at the secondary structure level. These images, when combined to derive three-dimensional maps, will provide essential frameworks upon which other diverse information can be added to build up detailed pictures of molecular structure and action. Oligomeric membrane proteins may become fairly well understood by such an approach, since the lipid bilayer imposes stringent constraints on possible transmembrane structures. The identification of residues exposed at one or the other surface of the bilayer can readily be accomplished by labeling, for example, with antibodies to specific strings of amino acids.

The ability to capture a particular conformation by rapid freezing should soon make it possible to visualize the configuration of the myosin head attached to actin at different stages in the contraction-relaxation cycle and to visualize membrane channels in their closed, open, and inactivated states. Closer ties with physiology can also be expected to emerge as a result of further development of microscopic and time-resolved techniques. Computer-aided light microscopy, for example, has facilitated the discovery of the mechanisms of microtubule-based motility and is beginning to reveal the fate of very small populations of molecules in cells. Similarly, time-resolved x-ray diffraction has been developed on the basis of very few model systems (muscle contraction, microtubule assembly), but the applicability of these methods goes far beyond the original aims.

Crystallization of the Sample Is Still a Major Hurdle

The major obstacle to the structural analysis of molecular assemblies has been and will continue to be in preparing suitable crystals. The growth of two-dimensional crystals has become a key aspect of electron microscopic structure determination, and new general approaches are urgently needed. Thus, the lipid-layer crystallization technique (in which macromolecules are bound to a lipid ligand in mono- or multilayers in order to facilitate high lateral concentration and oriented binding), if successfully developed, would play a critical role. The difficulties encountered in three-dimensional crystallization, as needed for high-resolution x-ray analysis, will depend on the type of assembly in question. With membrane complexes, the crystals must be grown from precise mixtures of detergents, amphiphiles (polar molecules that have an affinity for both aqueous and nonaqueous areas), protein, and lipid; the process of crystallization has an additional dimension compared with that of soluble proteins. A major difficulty at present, therefore, is in obtaining sufficient commitments of financing and time to support such crystallization efforts. The first such crystallizations were carried out only after many years of trials in Europe, where the support of science can maintain a constant effort in a high-risk, long-term endeavor. Because risks have been demonstrably reduced, considerable weight must be given to early successes in growing crystals of sufficient quality for high-resolution analyses.

The crystallization of viruses is often made difficult by the small supplies available for systematic experimentation. Thus, large cell-culture laboratories with suitable biohazard containment are required. New methods of crystallization may also be needed, particularly for lipid-enveloped viruses such as rubella or measles. The forming of homogeneous complexes of viruses with antibodies, drugs, and receptors will call for considerable effort. An alternative approach, which has proven successful in the past, consists of crystallization of components of the assembled structure, determining the three-dimensional structures of each of the components, and then using electron microscopy studies to provide the architectural details of how the components are arranged in the assembly. Complementary analyses of this nature will also, in many instances, provide the most appropriate pathway toward understanding the details and action of the large intracellular organelles, such as the nuclear pore and the various types of cytoskeletal filaments. An exciting result to be expected along these lines over the next 10 years is the merging of the available low-resolution picture of myosin-actin filament interaction with that of the high-resolution structures currently being determined for the actin monomer and the myosin head.

Complex Biological Structures Can Assemble Themselves

Researchers have begun to unravel how molecular assemblies are formed. In living cells, the production of the components destined to be assembled is often coordinated tightly both spatially and temporally. This coordination is revealed by studies with mutants, in which individual components are defective or are not synthesized in the proper amounts. Sometimes assembly occurs just by spontaneous association of individual proteins and nucleic acids, but steps in assembly are frequently accompanied by the covalent modification of key proteins and nucleic acids. Such modification can make the assembly irreversible, in essence locking the pieces into place. Other assembly mechanisms have been found to make use of scaffolding molecules. These molecules are present at intermediate stages in the assembly to help align critical components, but they disappear before the final structure is formed, just as a scaffold is taken down as a building is finished.

Studies that attempt to assemble biological structures in vitro have been particularly fruitful. These allow the timing of particular steps to be controlled at will, and particular components can be added sequentially or simultaneously, which in turn allows detailed study of assembly pathways and direct tests of the function of specific components of the assembly by single-component-omission experiments. Such studies are not always possible in vivo. For example, if a protein has functions critical to the cell, it will be difficult to see its effect on the structure or function of an assembly by simply preventing its being synthesized.

In vitro assembly is also useful for many structural studies. For example, neutron-scattering measurements usually require the creation of an assembly in which some components contain the normal isotope of hydrogen whereas in others the hydrogen is substituted with deuterium. Such manipulations can be carried out only by starting with isolated components in vitro. Complex biological structures successfully assembled in vitro include ribosomes, microtubules, nucleosomes, and even many viruses.

Directed Modification of Proteins

We Can Now Design and Construct New Molecular Machines

Until recently, the experimental strategies available to structural biology were largely limited to examining naturally occurring biological structures. Testing specific hypotheses by altering structures was limited to observing naturally occurring biological variants when they could be identified, as in the numerous mutant hemoglobins. This approach is limited in having no systematic way to search for a particular desired variant. Furthermore, one was restricted to those variants that had no lethal consequences for the organism and variants that had a significant chance of arising by natural biological mutation or evolution.

The development of recombinant DNA technology has dramatically altered our study of the structure and function of proteins. The major breakthrough lies in our new ability to modify or synthesize de novo genes (DNA) that, when introduced into cells, direct the synthesis of modified or new protein molecules. What was only a fantasy a few years ago is today a routine procedure: We can produce protein molecules of any desired sequence. We can produce altered proteins in bacteria, yeast, or plant or animal tissue-culture cells, which makes it possible to isolate large enough quantities for structural and functional studies. In addition we can produce the altered proteins in vivo in transgenic animals to gauge the effect of the altered protein on complex biological processes.

The Future Will See a Heightened Interdisciplinary Cooperation Between Structural and Molecular Biology

The techniques of traditional structural analysis and of recombinant DNA when combined increase the value of both. Such integrated approaches will allow more rapid and informative studies of the structures of proteins and how these structures determine function. The future will see ever-closer working relations among scientists expert in these different disciplines.

The potential to alter proteins at will is remarkable since it transforms structural biology from a science limited to strictly descriptive observations to an experimental science in which specific hypotheses can be tested with appropriate controls in specifically modified molecules. Our ability to do this is still in its infancy; much experience will be needed before the strategies in routine use approach optimal design. However, it is already clear that the ability to alter the sequence of proteins and nucleic acids systematically has revolutionary applications for structural biology. The importance of these new technologies is twofold, the first of which is widely appreciated, the second of which is perhaps less often noted.

First, by altering protein structure, or by creating new proteins, we are able to produce improved or even new proteins of value to human welfare, such as new pharmaceuticals. We are already using recombinant DNA technologies to produce human growth hormone, the blood-clotting factor VIII, the anticoagulant tissue plasminogen activator, interferons, and several lymphokines (which regulate development of various cells in the body). Variants of these proteins are being made and tested for improved properties such as increased heat stabilities, or increased lifetime in the blood. We are limited in these efforts by the fact that in most cases we do not yet sufficiently understand the structures of these proteins or the relations of structure to function to know what changes to make. As we learn more about the structures of these molecules from x-ray crystallography and other techniques, the successful production of useful variants will increase.

Second, determining protein structure will be enhanced by our ability to modify proteins. These modifications will result in proteins that crystallize better, that are designed for easy insertion of heavy metals needed in x-ray crystallography, and that have specific perturbations introduced to test hypotheses. In parallel, as the body of defined structures grows, we will be in a better position to design rationally modified proteins. We will know more about possible protein-folding motifs, domain attachments, and the effects of certain kinds of single-residue modifications. Today, in the absence of a known three-dimensional protein structure, the behavior of site-directed mutants can be unpredictable. The cellular location of the new product, its stability, and its properties can frequently differ from our naïve predictions. This situation should change markedly as we gain more experience with site-directed protein modification and its structural consequences.

Methods for Designing New Proteins. The first approach to protein design is site-directed mutagenesis. Here one usually alters a single amino acid by changing one or two nucleotides in the gene at the point coding for that amino acid. The result is a site mutant, which may resemble natural mutants, except that the experimenter can choose the site and the replacement.
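The single-codon replacement described above can be sketched in a few lines. This is only an illustration at the level of sequence strings: the toy gene and the function names `translate` and `mutate_codon` are invented for the example, and the laboratory procedure itself (synthetic oligonucleotides, cloning, expression) is not modeled. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding sequence codon by codon ('*' marks a stop)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def mutate_codon(dna, position, new_codon):
    """Replace the codon for the amino acid at a 1-based position."""
    start = 3 * (position - 1)
    if len(new_codon) != 3 or start < 0 or start + 3 > len(dna):
        raise ValueError("codon must have 3 bases and lie within the gene")
    return dna[:start] + new_codon + dna[start + 3:]


# Hypothetical toy gene encoding Met-Glu-Val-His-Cys-stop.
gene = "ATGGAAGTGCATTGCTAA"
# Change one nucleotide (GAA -> GTA) so that Glu at position 2 becomes Val.
mutant = mutate_codon(gene, 2, "GTA")

print(translate(gene), translate(mutant))
```

Here the experimenter chooses both the site (position 2) and the replacement (valine), which is the distinction from natural mutants drawn in the text.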

The second approach is to make larger alterations. Usually this involves interchanging segments of two or more different proteins. Domains are segments of sequence often associated with particular functions of a protein. Therefore, by appropriate switches in domains, one can rationally create proteins likely to have desired hybrid functions.

The creation of such chimeric proteins in the laboratory mirrors the events that seem to occur in protein evolution. The genes for proteins in most organisms occur in blocks of coding regions (exons) and blocks of noncoding regions (introns). The introns are removed from the message by RNA splicing before a final transcript is used to direct the synthesis of protein (See Chapter 4). The exons frequently appear to correspond to functional or structural motifs in the protein, and often they correspond to actual three-dimensional structural domains. Many new protein functions may have evolved by exon shuffling—rearrangements among pre-existing exons of proven functional capability. Such exon shuffling provides a rapid way to create proteins with new, hybrid functions. It also provides a rationale for the presence of interrupted genes. An organism that has such a pattern of genetic information is likely to be more able to cut and paste its genes in a meaningful way and thus should have a selective advantage.

Domains are also regions that appear to fold independently into three-dimensional structures. Switching pre-existing domains maximizes the likelihood that the new chimeric protein will still be able to achieve a stable, well-ordered, three-dimensional structure.

Genetically Engineered Proteins Reveal Much About How Proteins Function

The use of site-directed protein modification offers great promise for answering some of the fundamental questions in contemporary biology. For example, cell-surface receptors must migrate throughout the cell from one organelle to another, moving from the endoplasmic reticulum (site of synthesis) to the Golgi complex (site of carbohydrate modification) to the plasma membrane (site of clustering in specialized regions of the cell surface called coated pits). Once inside a coated pit, these proteins are taken inside the cell in a coated vesicle and then recycled back to the cell surface in a recycling vesicle. All of these movements seem to be dictated by signals contained within the structure of the protein itself. What are these targeting signals? Are they simply short, continuous stretches of amino acids or are they determined by the three-dimensional structure of the protein? Are protein modifications, such as phosphorylation or fatty acylation of the protein, required for any of these targeting signals?

The use of chimeric proteins has made it possible to define the functions of linear sequences responsible for protein translocation into the endoplasmic reticulum, mitochondria, and nucleus. However, signals that are defined by noncontinuous amino acid sequences are more difficult (if not impossible) to define functionally with chimeric proteins. Incorrect protein folding becomes a major obstacle when the function of an internal sequence or domain is examined by this approach.

Once a targeting signal for a given movement is identified, the scientist is in a superb position to study biochemically how the proteins of the cell interact with the cell-surface receptor to effect the desired targeting event. All the potential questions can now be answered in model systems, in which cloned genes for cell-surface receptors are transfected into cultured cells and then studied functionally and biochemically.

The targeting problem is related in a major way to another crucial problem in biology: protein folding. Many of the rules for protein folding can be derived from site-directed mutagenesis studies of proteins such as cell-surface receptors.

For example, the folding of a cell-surface receptor in the lumen of the endoplasmic reticulum depends in large part on the arrangement of cysteine residues and other amino acids in the primary structure of the polypeptide. By use of site-directed mutagenesis, one can begin to vary the position and number of cysteine residues to determine their effects on the interaction of the protein with the cellular machinery of the folding process.

Cell-surface receptors are key molecules that mediate a variety of physiologically important processes, ranging from the regulation of blood glucose and cholesterol levels to the control of body iron and vitamin B12 stores. Fundamental research on these molecules should shed light not only on basic science but on medicine as well.

For example, it has recently been possible to create functional, chimeric cell-surface receptors. In a chimera, the extracellular domain of epidermal growth factor receptor can stimulate the tyrosine kinase activity of an attached insulin receptor intracellular domain. This result shows that the insulin and epidermal growth factor receptors use a common mechanism for signal transduction across the plasma membrane (see Chapter 7). Future applications of this type of approach include studying the function of new receptors for unknown ligands by activating the new receptor's cytoplasmic domain with a heterologous ligand-binding domain derived from an already characterized receptor.

Some oncogenes represent naturally occurring receptor mutants. These will help us to understand normal mechanisms of receptor function and should lead to an understanding of how normal signaling processes are subverted to result in tumorigenesis.

Functional Protein Molecules Can Also Be Synthesized Chemically

For peptide chains of fewer than about 100 amino acids, chemical (as well as biological) synthesis is now possible and will frequently be the method of choice for shorter chains. By synthesizing chains in blocks, much longer chains will become practical synthetic goals. Chemical synthesis permits the insertion of isotopes, either stable or radioactive, at specific single sites in the chain. The peptide bond itself can be replaced in selected locations to render the product totally resistant to proteolytic degradation at that position. In general, such products are extremely difficult or impossible to prepare biologically. Such a chemical approach is particularly valuable in testing hypotheses related to small structural and functional domains and the possible refolding of these isolated units. Single- and multiple-site mutagenesis experiments are equally easy chemically since any amino acid can be selected for any position with the automatic equipment for synthesis that is now available. The simultaneous use of recombinant DNA techniques and sophisticated polypeptide chemical synthesis will create new approaches to both the understanding of protein structure and the development of specific reagents and functions.


Initial reaction to the appearance of the first high-resolution crystal structure of a protein was one of shock at the complexity and the total absence of obvious symmetry. This reaction was conditioned, in part, by the earlier appearance of the model for DNA, in which the base-paired double helix, once revealed, seemed elegant and simple. During the past two decades, the successful determination of many protein structures has led to the realization that there are, indeed, underlying substructural motifs in these molecules. These motifs are themselves complex and asymmetrical, but they are repeated in many structures.

The properties of the chemically bonded atoms in the peptide chain severely constrain the possible conformations the chain can assume, and only a small number of secondary structures are possible regardless of the sequence of amino acids. The structural motifs are made up of various combinations of these secondary units. These supersecondary units, in turn, are packed together into structural domains. A domain may be the whole molecule, but, in larger proteins, the native molecules are frequently composed of several domains.

We Do Not Yet Understand How Proteins Assume Their Intricate Three-Dimensional Forms

The biosynthetic machinery that synthesizes peptide chains is the same for all proteins. As far as is known, the peptide emerges as an essentially straight chain having no intrinsic biological activity. During, or shortly after, the completion of this synthesis, this chain folds up spontaneously to give the final unique, compact, biologically active structure, which is characteristic of the native protein. In the early 1960s this process was shown to occur in vitro. By now a large number of proteins have been shown to undergo this reaction without any apparent help other than the correct solvent environment. The mystery of how a biologically active protein is formed from an inert disordered chain is known as the folding problem. Research in this area, a major interface between the fields of biology and chemistry, has rapidly expanded during the past five years.

The basic question has always been whether the three-dimensional structure of a protein can be derived, from chemical theory, solely from its known amino acid sequence. Since folding appears to be a purely spontaneous process, prediction of the structure is a severe test of the level of our understanding of the chemistry of polypeptide chains. The attack on this problem is both experimental and theoretical. Its solution is not only essential as a basic underpinning for all of molecular biology, but would also be of great practical importance in the industrial application of genetic engineering.

Experimental Studies Search for Folding Intermediates

On the experimental side, much work has centered on the search for folding intermediates. Do the secondary structural elements in native proteins exist, as such, in small purified peptides apart from the rest of the structure? In the past it was thought that such structures would be so unstable that they would not be found, but recent evidence, based largely on optical spectroscopy, suggests a positive answer to the question, at least for certain sequences. Is there a definable folding pathway along which such structure intermediates can be found (Figure 3-4)? This question is controversial. Thermodynamic measurements can frequently be fitted to a two-state model reflecting only the native and the unfolded forms. Kinetic data, on the other hand, are often difficult or impossible to interpret without the assumption of one or more intermediate states. The methods are invariably indirect and the interpretation non-unique. Since crystals cannot be obtained for the unfolded state, x-ray diffraction data are not even potentially available. Detailed NMR studies on long peptides are still on the horizon, but may in the future provide direct structural information on these intermediate states.
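The two-state model mentioned above treats folding as an equilibrium between only the native and the unfolded forms, so that a single free energy of unfolding fixes the populations at each temperature. The following minimal sketch shows that relation. The enthalpy and entropy values are invented for illustration and are taken as temperature independent, which is a simplifying assumption; the function name is hypothetical.

```python
import math

R = 8.314e-3  # gas constant in kJ/(mol K)


def fraction_unfolded(temp_k, delta_h, delta_s):
    """Two-state model: native <-> unfolded, with no intermediates.

    delta_h (kJ/mol) and delta_s (kJ/(mol K)) are the enthalpy and
    entropy of unfolding, assumed here to be constant.
    """
    delta_g = delta_h - temp_k * delta_s          # free energy of unfolding
    k_unfold = math.exp(-delta_g / (R * temp_k))  # equilibrium constant
    return k_unfold / (1.0 + k_unfold)


# Hypothetical parameters, chosen only to give a sharp transition.
DELTA_H = 400.0   # kJ/mol
DELTA_S = 1.2     # kJ/(mol K)
T_MID = DELTA_H / DELTA_S   # temperature at which delta_g is zero

for temp in (300.0, T_MID, 360.0):
    print(round(temp, 1), fraction_unfolded(temp, DELTA_H, DELTA_S))
```

A measured unfolding curve that is well described by this single transition gives no sign of intermediates, which is why thermodynamic data alone often cannot settle the question raised by the kinetic measurements.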

Figure 3-4

Schematic diagram of the folding process for an all helical protein. The reaction starts with an extended chain containing no permanent intrachain interactions. This proceeds to a hypothetical intermediate with fluctuating helical segments that occasionally (more...)

A Diverse Range of Theoretical Studies Is in Progress

Theoretical approaches to folding have been proceeding along three lines. The first two are fundamental procedures for which, in principle, we do not need to know the final structure. The third is a collection of ad hoc procedures with which we try to produce useful generalizations by examining the known structures.

Energy Minimization. Minimization of the conformational energy of peptides is perhaps the oldest of the three procedures. The goal is to predict the native folded structure of the protein by assuming that it is actually the most stable structure. This requires computing a potential energy function with many terms representing possible conformational changes: bond stretching, angle bending, torsional rotations, van der Waals interactions, and various electrostatic terms. The minimization of this potential energy with respect to the locations of each of the atoms in the protein should lead to the observed native structure. The difficulties are formidable. The largest hurdle one must overcome is the problem of multiple local energy minima. The potential energy function is like the surface of the earth. Energy minimization corresponds to finding the lowest point on the surface. Wherever one starts on the surface, one can find the lowest point nearby. But how does one know if this is the lowest point on the whole planet? How can one tell if a locally stable protein structure is actually the most stable possible structure? Although intense efforts are under way, it is not yet possible to consistently derive the correct native structure by starting with an unfolded chain and attempting to minimize the potential energy.
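The problem of multiple local minima can be seen even in one dimension. The sketch below is a toy illustration, not a protein calculation: the "energy" is an arbitrary polynomial with two valleys, chosen only for the example, and the minimizer is plain steepest descent with a fixed step. Both starting points reach a stable minimum, but only one of them is the lowest point on the whole surface.

```python
def energy(x):
    """Toy one-dimensional energy surface with two valleys (invented)."""
    return x**4 - 3 * x**2 + x


def gradient(x):
    """Derivative of the toy energy with respect to x."""
    return 4 * x**3 - 6 * x + 1


def minimize(x, step=0.01, n_steps=5000):
    """Steepest descent: repeatedly move downhill along the gradient."""
    for _ in range(n_steps):
        x -= step * gradient(x)
    return x


# Two different starting "conformations" on either side of the barrier.
x_left = minimize(-2.0)
x_right = minimize(2.0)

print("start -2.0 ->", x_left, energy(x_left))
print("start +2.0 ->", x_right, energy(x_right))
```

Nothing in the second run signals that a deeper valley exists elsewhere. A protein has thousands of coordinates rather than one, so the number of such valleys is vastly greater.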

Molecular Dynamics. In the second theoretical approach, one actually simulates the motions of the atoms of a protein. The potential energy function (in principle, the same one used for energy minimization) can be used to obtain the forces on each atom. The movement of the atoms in response to these forces is then calculated. The applications of this powerful procedure are under intensive development. As with energy minimization, the fidelity of this approach depends on the quality of the potential energy function and on the proper modeling of the solvent. Molecular dynamics simulations have not yet successfully folded a protein in the absence of additional structural information. The latter can, in principle, be provided experimentally by NMR or optical spectroscopic procedures. When such data can be supplied, some recent tests have shown notable success in carrying out the folding simulations.
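The cycle described above (obtain forces from the potential energy function, then move the atoms in response) can be sketched for the simplest possible system: one particle on a harmonic spring in one dimension. The velocity Verlet integration scheme used here is an assumed choice; the text does not name an integrator. The potential, units, and parameter values are likewise invented for illustration.

```python
def potential(x, k=1.0):
    """Harmonic potential energy, standing in for a full force field."""
    return 0.5 * k * x * x


def force(x, k=1.0):
    """Force is the negative derivative of the potential energy."""
    return -k * x


def velocity_verlet(x, v, mass, dt, n_steps):
    """Propagate position and velocity; return the list of (x, v) states."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # forces at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


MASS = 1.0
trajectory = velocity_verlet(x=1.0, v=0.0, mass=MASS, dt=0.01, n_steps=1000)

# Total energy should stay nearly constant if the time step is small enough.
energies = [0.5 * MASS * v * v + potential(x) for x, v in trajectory]
print(min(energies), max(energies))
```

In a protein simulation the same loop runs over every atom and the force evaluation dominates the cost, which is one reason the quality of the potential energy function and of the solvent model matters so much.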

The Protein Data Bank Is a Rich Resource for Predicting Structure

In the ad hoc approaches, the Protein Data Bank is searched for patterns and statistical correlations. For example, probabilities based on the occurrence of each amino acid in various types of secondary structure differ and can, in turn, be used predictively to estimate probable regions of alpha helix, beta strand, and beta turn structures in any sequence. In parallel efforts, combinatorial algorithms aimed at packing assigned secondary structures into supersecondary and larger tertiary units have been developed. Most recently, combinations of such secondary and tertiary prediction schemes that show great promise in providing probable domain structures have been worked out. Whether the resulting models are close enough to converge to the native structure through molecular dynamics or energy minimization procedures is not yet known. Although all ad hoc approaches are implicitly based on the underlying chemistry through the use of known structures, only a few explicitly refer to these properties in the algorithm itself.
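The statistical idea behind such predictions can be sketched as follows: count how often each amino acid occurs in each type of secondary structure, compare that with its overall frequency, and use the resulting propensities to label a new sequence. The two annotated "structures" below are invented toy data, not entries from any data bank, and the per-residue assignment is a deliberate simplification; practical schemes average over neighboring residues and apply additional rules. The state labels H (helix), E (strand), and C (other) are assumptions of the example.

```python
from collections import Counter

# Invented training data: (sequence, secondary structure per residue).
EXAMPLES = [
    ("AEELLKKAGVIVTV", "HHHHHHHHCEEEEE"),
    ("GSVTVIYEELLKAA", "CCEEEEEHHHHHHH"),
]


def propensities(examples):
    """Propensity of residue a for state s:
    (fraction of state s made up of a) / (overall fraction of a)."""
    pair, residue, state = Counter(), Counter(), Counter()
    for seq, states in examples:
        for a, s in zip(seq, states):
            pair[(a, s)] += 1
            residue[a] += 1
            state[s] += 1
    total = sum(residue.values())
    return {
        a: {s: (pair[(a, s)] / state[s]) / (residue[a] / total) for s in state}
        for a in residue
    }


def predict(seq, table):
    """Label each residue with its highest-propensity state.
    Residues absent from the training data default to 'C'."""
    return "".join(
        max(table[a], key=table[a].get) if a in table else "C" for a in seq
    )


TABLE = propensities(EXAMPLES)
print(TABLE["L"]["H"], TABLE["V"]["E"])
print(predict("LLVV", TABLE))
```

A propensity above 1 means the residue is found in that structure more often than its overall abundance would suggest. With a data set this small the numbers are meaningless in themselves; the sketch shows only the form of the calculation.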

New Experimental Tools Will Aid Studies of Protein Folding

Genetic Approaches. The power of modern genetics is being brought to bear on the problems of folding. Some mutants seem to be clearly deficient in the folding process, and yet the final folded protein does not seem abnormal in any way. The discovery of other systems of this sort and their detailed analysis may provide a great deal of information on folding pathways that would not be found by other procedures, or even suspected.

Polypeptide Synthesis. The chemical and recombinant DNA approaches have been discussed in an earlier section. These complementary approaches to providing peptides of known sequence will play major roles in the future study of protein folding. At this time, the behavior of peptides at membrane interfaces has been studied in detail with chemically synthesized peptides; general specifications of such interactions are starting to appear, and marked improvement in our understanding of electrostatic interactions in alpha helices seems imminent.

Through recombinant DNA approaches, many different molecules appropriate for structure-function investigations are already being created, largely through single-site mutagenesis. Estimates of hydrogen bonding energies are being derived from comparisons between carefully planned and constructed mutants of proteins of known structure, and factors affecting protein stability are being outlined.

The Folding Problem Now Seems Ripe for Major Advances

The immediate future for the folding problem looks remarkably (and unexpectedly) bright. The development of both fundamental and ad hoc theoretical approaches is advancing rapidly. The correlation and interactions between theory and experiment will be much closer than has generally been true in the past. Combined approaches, with various levels of theory or theory and experiment, seem likely to be the most fruitful. The ability to easily synthesize specific polymers, themselves specifically designed to test theoretical predictions or to provide missing values for parameters, seems particularly promising.

Instrumentation. The solution of the structures of new proteins, and of mutant versions of older proteins, will continue to be of major importance. Thus the development and implementation of new and improved x-ray and neutron diffraction procedures is as important to the folding problem as it is to other areas in structural biology. Improvements in both solid-state and high-resolution NMR will be central to the specification of the unfolded state and the search for definable folding intermediates. Proteins that are isotopically labeled at specific sites will be essential in this process, and they will also permit the study by NMR of substantially larger proteins than can currently be tackled.

New Techniques and Instrumentation

Improvements in Analytical Techniques and Instrumentation Are Necessary

Better methods for automated x-ray diffraction are critical to our increased understanding of molecular structure and function. In addition, more general and effective methods are needed for direct analysis of x-ray data without the need for preparing many heavy metal derivatives. Many of the most interesting biological molecules have not been crystallized. Systematic studies are needed, aimed at producing crystals and other ordered arrays suitable for high-resolution structural determinations. Funding mechanisms must be adjusted to allow the long-term support of this speculative but extremely critical area.

Methods are also needed to extend two-dimensional NMR to larger structures and to automate its analysis. The development of instruments operating at higher magnetic fields will certainly play an important role in this work.

Advances in Computation Will Revolutionize the Study of Molecular Structure and Function

Improved methods are needed for collecting and transmitting DNA sequences, including a single, international data base. Improved methods are also needed for extracting more biological information directly from sequence data. We must ensure that continuing advances in computer science are made available, rapidly and broadly, to the field of structural biology.

More accurate protein folding calculations must be developed, including better methods for refining x-ray structures and improved semiempirical methods based on the ever-increasing data base of structures. In addition, uniform, inexpensive devices to display three-dimensional structures are needed so that, ultimately, every biologist can view any known structure directly and accurately.



Plates 4, 5 Space-filling representation of Fab (antibody fragment containing antigen-binding sites) of an anti-lysozyme antibody and lysozyme. The antibody heavy chain is shown in blue, the light chain in yellow, lysozyme in green, and glutamine 121 (more...)

Copyright © 1989 by the National Academy of Sciences.
Bookshelf ID: NBK217812

