Creating Databases

For most of the last century, the main problem facing biologists was gathering the information that would allow them to understand living things. Organisms gave up their secrets only grudgingly, and there were never enough data, never enough facts or details or clues to answer the questions being asked. Today, biologic researchers face an entirely different sort of problem: how to handle an unaccustomed embarrassment of riches.

“We have spent the last 100 years as hunter-gatherers, pulling in a little data here and there from the forests and the trees,” William Gelbart, professor of molecular and cellular biology at Harvard University, told the workshop audience. “Now we are at the point where agronomy is starting and we are harvesting crops that we sowed in an organized fashion. And we don't know very well how to do it.” “In other words,” Gelbart said, “with our new ways of harvesting data, we don't have to worry so much about how to capture the data. Instead we have to figure out what to do with them and how to learn something from them. This is a real challenge.”

It is difficult to convey to someone not in the field just how many data—and how many different kinds of data—biologists are reaping from the wealth of available technologies. Consider, for instance, the nervous system. As Stephen Koslow, director of the Office on Neuroinformatics at the National Institute of Mental Health, recounted, researchers who study the brain and nervous system are accumulating data at a prodigious rate, all of which need to be stored, catalogued, and integrated if they are to be of general use.

Some of the data come from the imaging techniques that help neuroscientists peer into the brain and observe its structure and function. Magnetic resonance imaging (MRI), computed tomography (CT), positron emission tomography (PET), and single-photon emission computed tomography (SPECT) each offer a unique way of seeing the brain and its components. Functional magnetic resonance imaging (fMRI) reveals which parts of a brain are working hardest during a mental activity, electroencephalography (EEG) tracks electric activity on the surface of the brain, and magnetoencephalography (MEG) traces deep electric activity. Cryosectioning creates two-dimensional images from a brain that has been frozen and carved into thin slices, and histology produces magnified images of a brain's microscopic structure. All of those different sorts of images are useful to scientists studying the brain and should be available in databases, Koslow said.

Furthermore, many of the images are most useful not as single shots but as series taken over some period. “The image data are dynamic data,” Koslow said. “They change from day to day, from moment to moment. Many events occur in a millisecond, others in minutes, hours, days, weeks, or longer.”

Besides images, neuroscientists need detailed information about the function of the brain. Each individual section of the brain, from the cerebral cortex to the hippocampus, has its own body of knowledge that researchers have accumulated over decades, Koslow noted. “And if you go into each of these specific regions, you will find even more specialization and detail—cells or groupings of cells that have specific functions. We have to understand each of these cell types and how they function and how they interact with other nerve cells.”

“In addition to knowing how these cells interact with each other at a local level, we need to know the composition of the cells. Technology that has recently become available allows us to study individual cells or individual clusters of similar cells to look at either the genes that are being expressed in the cells or the gene products. If you do this in any one cell, you can easily come up with thousands of data points.” A single brain cell, Koslow noted, may contain as many as 10,000 different proteins, and the concentration of each is a potentially valuable bit of information.

The brain's 100 billion cells include many types, each of which constitutes a separate area of study; and the cells are hooked together in a network of a million billion connections. “We don't really understand the mechanisms that regulate these cells or their total connectivity,” Koslow said; “this is what we are collecting data on at this moment.”

Neuroscientists describe their findings about the brain in thousands of scientific papers each year, which are published in hundreds of journals. “There are global journals that cover broad areas of neuroscience research,” Koslow said, “but there are also reductionist journals that go from specific areas—the cerebral cortex, the hippocampus—down to the neuron, the synapse, and the receptor.”

The result is a staggering amount of information. A single well-studied substance, the neurotransmitter serotonin, has been the subject of 60,000-70,000 papers since its discovery in 1948, Koslow said. “That is a lot of information to digest and try to synthesize and apply.” And it represents the current knowledge base on just one substance in the brain. There are hundreds of others, each of which is a candidate for the same sort of treatment.

“We put four kinds of things into our databases,” Gelbart said. “One is the biologic objects themselves”—such things as genetic sequences, proteins, cells, complete organisms, and whole populations. “Another is the relationships among those objects,” such as the physical relationship between genes on a chromosome or the metabolic pathways that various proteins have in common. “Then we also want classifiers to help us relate those objects to one another.” Every database needs a well-defined vocabulary that describes the objects in it in an unambiguous way, particularly because much of the work with databases is done by computers. Finally, a database generally contains metadata, or data about the data: descriptions of how, when, and by whom information was generated, where to go for more details, and so on. “To point users to places they can go for more information and to be able to resolve conflicts,” Gelbart explained, “we need to know where a piece of information came from.”
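
The four categories Gelbart lists can be pictured as a small data model. The sketch below is only an illustration under stated assumptions, not the schema of any database discussed at the workshop: every class name, field name, vocabulary term, and sample value in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Provenance:
    """Metadata: how, when, and by whom a piece of information was generated."""
    source: str       # where the information came from (made-up labels below)
    method: str       # how it was produced
    recorded_on: str  # date as an ISO string
    curator: str      # who entered it


@dataclass
class BioObject:
    """A biologic object: a gene, protein, cell, organism, or population."""
    object_id: str
    kind: str
    classifiers: Set[str] = field(default_factory=set)  # controlled-vocabulary terms
    provenance: Optional[Provenance] = None


@dataclass(frozen=True)
class Relationship:
    """A relationship between two stored objects, e.g. adjacency on a chromosome."""
    subject_id: str
    predicate: str
    object_id: str
    provenance: Optional[Provenance] = None


class MiniDatabase:
    """Toy container holding objects, relationships, a vocabulary, and metadata."""

    def __init__(self, vocabulary: Set[str]) -> None:
        self.vocabulary = set(vocabulary)
        self.objects: Dict[str, BioObject] = {}
        self.relationships: List[Relationship] = []

    def add_object(self, obj: BioObject) -> None:
        # Reject classifier terms that are not in the agreed vocabulary.
        unknown = obj.classifiers - self.vocabulary
        if unknown:
            raise ValueError(f"terms not in vocabulary: {sorted(unknown)}")
        self.objects[obj.object_id] = obj

    def relate(self, rel: Relationship) -> None:
        # Both ends of a relationship must already be stored.
        for oid in (rel.subject_id, rel.object_id):
            if oid not in self.objects:
                raise KeyError(f"unknown object: {oid}")
        self.relationships.append(rel)

    def sources_for(self, object_id: str) -> List[str]:
        """List the recorded sources behind an object and its relationships."""
        found = []
        obj = self.objects[object_id]
        if obj.provenance is not None:
            found.append(obj.provenance.source)
        for rel in self.relationships:
            if object_id in (rel.subject_id, rel.object_id) and rel.provenance:
                found.append(rel.provenance.source)
        return found


# Hypothetical sample data, invented purely for illustration.
seq = Provenance("made-up sequencing run", "sequencing", "2000-01-01", "curator-1")
mapping = Provenance("made-up mapping study", "genetic mapping", "2000-02-01", "curator-2")

db = MiniDatabase(vocabulary={"kinase", "membrane"})
db.add_object(BioObject("geneA", "gene", {"kinase"}, seq))
db.add_object(BioObject("geneB", "gene"))
db.relate(Relationship("geneA", "adjacent_to", "geneB", mapping))
```

In this sketch, `db.sources_for("geneA")` traces a record back to its origins, which is the role Gelbart assigns to metadata: pointing users to more information and helping to resolve conflicts.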

Creating such databases demands a tremendous amount of time and expertise, said Jim Garrels, president and CEO of Proteome, Inc., in Beverly, Massachusetts. Proteome has developed the Bioknowledge Library, a database that is designed to serve as a central clearinghouse for what researchers have learned about protein function. The database contains descriptions of protein function as reported in the scientific literature, information on gene sequences and protein structures, details about proteins' roles in the cell and their interactions with other proteins, and data on where and when various proteins are produced in the body.

It is a major challenge, Garrels said, simply to capture all that information and structure it in a way that makes it useful and easily accessible to researchers. Proteome uses a group of highly trained curators who read the scientific literature and enter important information into the database. Traditionally, many databases, such as those on DNA sequences, have relied on the researchers themselves to enter their results, but Garrels does not believe that would work well for a database like Proteome's. Much of the value of the database lies in its curation—in the descriptions and summaries of the research that are added to the basic experimental results. “Should authors curate their own papers and send us our annotation lines? I don't think so. We train our curators a lot, and to have 6,000 untrained curators all sending us data on yeast would not work.” Researchers, Garrels said, should deposit some of their results directly into databases—genetic sequences should go into sequence databases, for instance—but most of the work of curation should be left to specialists.

In addition to acquiring and arranging the data, curators must perform other tasks to create a workable database, said Michael Cherry, technical manager for Stanford University's Department of Genetics and one of the specialists who developed the Saccharomyces Genome Database and the Stanford Microarray Database. For example, curators must see that the data are standardized, but not too standardized. If computers are to be able to search a database and pick out the information relevant to a researcher's query, the information must be stored in a common format. But, Cherry said, standardization will sometimes “limit the fine detail of information that can be stored within the database.”
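
Cherry's trade-off can be seen in a toy example. Mapping free-text annotations onto a fixed set of terms makes them searchable by computer, but the standardized terms alone lose detail unless the original wording is stored alongside them. The synonym table, term names, and annotations below are invented for illustration and are not drawn from any real database.

```python
# Hypothetical synonym table: free-text phrase -> standardized term.
SYNONYMS = {
    "protein kinase": "kinase",
    "kinase activity": "kinase",
    "membrane-bound": "membrane",
    "integral membrane protein": "membrane",
}


def standardize(free_text):
    """Return standardized terms together with the untouched original text."""
    lowered = free_text.lower()
    terms = sorted({term for phrase, term in SYNONYMS.items() if phrase in lowered})
    return {"terms": terms, "original": free_text}


def search(records, term):
    """Find records by standardized term; this is what a common format enables."""
    return [record for record in records if term in record["terms"]]


# Made-up annotations of the kind a curator might summarize from a paper.
annotations = [
    "Protein kinase active only during early development",
    "Weak kinase activity observed at high temperature",
    "Integral membrane protein of unknown function",
]

records = [standardize(text) for text in annotations]
kinase_hits = search(records, "kinase")
```

Both kinase annotations are found by a single query, yet the term "kinase" by itself says nothing about developmental stage or temperature. Keeping the `original` field is one way, among others, to standardize "but not too much".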

Curators must also be involved in the design of databases, each of which is customized to its purpose and to the type of data; they are responsible for making a database accessible to the researchers who will be using it. “Genome databases are resources for tools, as well as resources for information,” Cherry said, in that the databases must include software tools that allow researchers to explore the data that are present.

In addition, he said, curators must work to develop connections between databases. “This is not just in the sense of hyperlinks and such things. It is also connections with collaborators, sharing of data, and sharing of software.”

Perhaps the most important and difficult challenge of curation is integrating the various sorts of data in a database. The goal is for the data to be not simply separate blocks of knowledge but parts of a whole that researchers can work with easily and efficiently, without worrying about where the data came from or in what form they were originally generated.

“What we want to be able to do,” Gelbart said, “is to take the structural information that is encapsulated in the genome—all the gene products that an organism encodes, and the instruction manual on how those gene products are deployed—and then turn that into useful information that tells us about the biologic process and about human disease. On one pathway, we are interested in how those gene products work—how they interact with one another, how they are expressed geographically, temporally, and so on. Along another path, we would like to study how, by perturbing the normal parts list or instruction manual, we create aberrations in how organisms look, behave, carry out metabolic pathways, and so on. We need databases that support these operations.”

The Need for Bioinformaticists

As the number and sophistication of databases grow rapidly, so does the need for competent people to run them. Unfortunately, supply does not seem to be keeping up with demand.

“We have a people problem in this field,” said Stanford's Gio Wiederhold. “The demand for people in bioinformatics is high at all levels, but there is a critical lack of training opportunities and also of available trainees.”

Wiederhold described several reasons for the shortage of bioinformatics specialists. People with a high level of computer skills are generally scarce, and “we are competing with the excitement that is generated by the Internet, by the World Wide Web, by electronic commerce.” Furthermore, biology departments in universities have traditionally paid their faculty less than computer-science or engineering departments. “That makes it harder for biologists and biology departments to attract the right kind of people.”

Complicating matters is the fact that bioinformatics specialists must be competent in a variety of disciplines—computer science, biology, mathematics, and statistics. As a result, students who want to enter the field often have to major in more than one subject. “We have to consider the load for students,” Wiederhold said. “We can't expect every student interested in bioinformatics to satisfy all the requirements of a computer-science degree and a biology degree. We have to find new programs that provide adequate training without making the load too high for the participants.”

Furthermore, even those with the background and knowledge to go into bioinformatics worry that they will find it difficult to advance in such a nontraditional specialty. “The field of bioinformatics is scary for many people,” Wiederhold said. “Because it is a multidisciplinary field, people are worried about where the positions are and how easily they will get tenure.” Until universities accept bioinformatics as a valuable discipline and encourage its practitioners in the same way as those in more traditional fields, the shortage of qualified people in the field will likely continue.

One stumbling block to integrating diverse biologic data in a database, Gelbart said, is that the best way to organize the data would be to reflect their connections in the body. But, he said, “we really don't understand the design principles, so we don't know the right way to do it.” It is a chicken-and-egg problem of the sort that faced Linnaeus: A better understanding of the natural world can be expected to flow from a well-organized collection of data, but organizing the data well demands a good understanding of that world. The solution is, as it was with Linnaeus, a bootstrap approach: Organize the data as well as you can, use them to gain more insights, use the new insights to further improve the organization, and so on.