
National Research Council (US) Committee on Applied and Theoretical Statistics. Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop. Washington (DC): National Academies Press (US); 2010.


2 The Current State of Data Integration in Science

The workshop opened with a series of presentations about data integration challenges and approaches in several areas of science.


Carl Kesselman of the University of Southern California presented some examples of how data integration provides value to biomedical research and shared his vision of important goals. He noted a broad shift in biomedical research, from advances based on new understanding of fundamental biological mechanisms to advances driven by patterns in data. That is, insights are arising from connections and correlations found among diverse types of data acquired from various modalities. An example is the use of biomarkers—the finding of connections between data that suggest predictors or indicators of various disease profiles, diagnostic procedures, and so on. Related to this is the carrying out of retrospective studies to discern patterns that suggest mechanisms worth investigating. These trends are driving the need for data integration. Dr. Kesselman gave several examples of research results that required identifying and retrieving data from distributed locations.

Alex Szalay of Johns Hopkins University observed that many fields of science are becoming data intensive, and thus reliant on cyberinfrastructure. An example is the use of virtual observatories in astronomy, in which the database serves as a sort of laboratory in which an astronomer can make “observations.” The Large Hadron Collider (LHC) and human genome research are other examples of data-intensive science. These efforts require sizeable investments in software. Dr. Szalay estimated that the Sloan Digital Sky Survey (SDSS) allots some 30 percent of its budget to software, and the Large Synoptic Survey Telescope (LSST) project is planning to allocate 50 percent to software. The LHC’s data-management elements constitute a major part of the overall operation. Tim Frazier of Lawrence Livermore National Laboratory added that a large amount of hardware is also required if one plans to move data into and out of a large repository: many network switches and high-speed networks. If computations can be performed within the repository, they can be carried out faster and more efficiently.

Michael Stonebraker of the Massachusetts Institute of Technology (MIT) observed that a virtual observatory requires a global schema,1 a concept that has not worked very well in most enterprises. There have been numerous efforts to develop global schemas, but anticipating the many questions that might be posed of the data constitutes a significant barrier. Orri Erling of OpenLink Software, Inc., suggested that there might be a need for a framework that enables a globally evolving schema. Dr. Kesselman said he has had a reasonable amount of success by aiming for a point somewhere between the notion of a global standard and total chaos. He saw something similar in the workshop’s presentation on data integration at the National Oceanic and Atmospheric Administration (NOAA), which suggested defining limited communities of interest in order to constrain the problems of data interoperability. He saw the challenge as providing the infrastructure to support an exchange of information within communities of practice that are connected but not global.

Clifford Lynch of the Coalition for Networked Information pointed out that there are two kinds of data reuse and felt the notion suggested by Kesselman is not a complete solution. One kind of reuse is reexamination of data for a compilation or a meta-analysis, in conjunction with similar data and carried out for purposes that are not too far afield from those that drove the collection originally. The other kind is reuse of data outside the disciplinary frameworks within which they were collected. Some examples of reuse are very interdisciplinary and jump the fences between science, social science, and humanities in unpredictable ways. These latter types of reuse make it difficult to know which kinds of life cycle should be assumed. No one has a good understanding of the kinds of metadata that facilitate reuse of data in a new context. In contrast, we have a much better understanding of the kinds of metadata that facilitate incremental or predictable kinds of reuse, such as meta-analyses and compilation. In addition, we normally conceive of metadata as documenting a data set in isolation. But if data sets are to be integrated, it might be important to include metadata that inform the integration process, such as metadata about commonalities or disparities that reduce or increase uncertainties when those sets are aggregated.

Dr. Szalay asked how the value of data is established, because that would guide planning for data reuse. Making data accessible for reuse requires resources, but in general there is no clear business model for who should pay and how much. Dr. Lynch pointed out that we do not generally understand the cost-benefit trade-offs of metadata: how much it costs to create metadata and how much they improve discovery and usability. By “metadata,” Dr. Lynch meant more than just the documentation purposely attached to a data set. He said that reuse could also require other documentation about, say, the technologies used for the data collection or generation, but in general we have a very imprecise understanding of what to retain. We also do not know when it is worthwhile to hold onto data with fairly deficient metadata in the hope that someone who cares enough will figure it out. In some cases, data with deficient documentation can be as useless as no data at all or as dangerous as corrupt data.

In a related question, Dr. Lynch asked who should take charge of these data for the longer term. These investments are a necessary part of scientific research, but they are not routinely accounted for in budgets and plans. He proposed that one of the most compelling problems is how to give concrete guidance to the research communities about what is good practice in handing off data at the end of a project so that they can be curated and made available for reuse.


The sheer amount of data available in many fields of science is well known. Dr. Szalay reported that astronomy is experiencing a doubling of data every year. The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) will soon contain more than a petabyte (PB) of data, and the Visible and Infrared Survey Telescope for Astronomy (VISTA) will reach the same level about a year later. The Large Synoptic Survey Telescope (LSST) project might accumulate hundreds of petabytes of data over the next decade, while the proposed Square Kilometer Array (SKA) would be receiving a terabit of data per second from its radio antennas. The Sloan Digital Sky Survey ended up with more than 18 TB of data. Dr. Frazier reported that the National Ignition Facility is producing some 5 PB a day. Dr. Kesselman added that genome-sequencing machines can produce 1 TB per week of information, and Dr. Marron observed that this rate will probably grow to terabytes per day soon. In addition, simulations produce enormous amounts of data and will soon also be operating at the petascale.

Keith Clarke of the University of California, Santa Barbara, observed that new data sources are dramatically improving the resolution of geo-spatial data. For example, it was common until recently to find databases of terrain elevations with 30-m spatial resolution, but now 10- or 3-m resolution is common. Similarly, elevations may now be measured more precisely: For example, Dr. Clarke is mapping his own campus at 2-cm resolution with terrestrial scanning lidar (light detection and ranging, a laser-based means of measuring distances). Many streams are coming directly from GPS devices that record 2 to 3 readings per second. When there are multiple paths, time and space stamps are recorded at submeter accuracy. These developments call for new methods when integrating geospatial data—when registering one data set with another, for example.

Disciplines such as astronomy and high-energy physics rely on a limited number of large-scale data sources. In that situation, plans and protocols can be put in place to manage the data and their reuse, and these steps facilitate the finding of data. More difficult are disciplines where enormous data sets can be produced by many laboratories and research groups. Dr. Marron pointed to genome sequencing as an example. With emerging capabilities, a biology laboratory will be able to produce over 100 billion base pairs a day, which begins to rival the 150 billion base pairs produced in all of 2007 by the Human Genome Institute. A laboratory that produces 100 billion base pairs per day will never be able to fully analyze those data, so researchers from the broader community will have to be called on. First, however, they will have to be able to find the data, then download them, and then have a mechanism for analyzing them, all of which are challenging.

Dr. Frazier pointed out that as data volumes grow, at least the first round of analysis is most efficient if it is done where the data reside rather than importing enormous portions of data into an analysis tool. He suggested that certain climate-research data sets clearly exceed the size at which downloading for local analysis is practical, and spoke of a colleague who works with a 350 PB database of climate information. At that scale, it is impractical to query anything out of the database without some initial analysis to cut the size of what is returned. The efficient approach—the only one for massive sets—is to ship analysis code to the database and begin the computation where the data reside.
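The idea of shipping the computation to the data can be illustrated in miniature with any database engine: pushing the reduction into the store returns a small summary instead of the raw rows. A minimal sketch using SQLite, with a hypothetical table of climate readings (the table and column names are illustrative, not from any real archive):

```python
import sqlite3

# Hypothetical miniature "climate archive": one table of station readings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, year INTEGER, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("A", 2000, 14.1), ("A", 2001, 14.3), ("B", 2000, 9.8), ("B", 2001, 10.1)],
)

# Naive approach: pull every raw row out of the store, then reduce locally.
rows = conn.execute("SELECT station, temp_c FROM readings").fetchall()

# "Ship the code to the data": the reduction runs inside the store, and only
# the per-station means travel back -- two rows instead of (at scale) billions.
means = conn.execute(
    "SELECT station, AVG(temp_c) FROM readings GROUP BY station ORDER BY station"
).fetchall()
print(means)
```

At petabyte scale the same pattern holds: the query (or analysis code) moves across the network, and only the reduced result moves back.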


Thomas Karl of NOAA’s National Climatic Data Center gave some indication of the complexity of climate data. Data are collected from a number of observing systems making atmospheric measurements from space, on land, and in the ocean. Different research communities collect data in the different domains, and the communities can be segregated even further by collection modality (such as radar or infrared). These communities tend to have their own practices for formatting data, so it is necessary to have software that can translate data sets into compatible formats. In NOAA alone, some 50 different formats have been identified. The subcommunities in climate research also have different ways of reporting uncertainty: Some include confidence intervals with the data, while others report best estimates. More generally, there are silos in modeling systems, measurement systems, and knowledge systems, and even different concepts of what to include in the metadata. Integrating these disparate data streams to create a whole-system view is far from trivial.

Dr. Szalay observed that the astronomy community and its data are similarly disaggregated. For example, most infrared data are stored in Pasadena, x-ray data in Boston, and optical ultraviolet data in Baltimore.

Dr. Seidel observed that, in addition to developing standards that would allow disparate data streams to interoperate, there is also a need to specify the properties of the algorithms that produced them, to observe how they interact with one another, and to understand how uncertainties and errors might propagate from one to the other. Dr. Szalay pointed out that, with the rapidly increasing abilities to collect data and produce simulations, algorithms have to stay bounded, scaling perhaps as O(n log n). Algorithms that scale as O(n²) or O(n³) are not useful in these data-intensive fields, or at least not for long. Dr. Clarke added that there are large inherent uncertainties within geospatial data more generally, arising in particular from measurement errors and from mismatches between data collected at different times or through different means. Research is just beginning to explore how to deal with these uncertainties visually and statistically.
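Dr. Szalay’s scaling argument can be made concrete with a back-of-the-envelope count: when n grows a thousand-fold, an O(n log n) workload grows by only about 1,500×, while an O(n²) workload grows by 1,000,000×. A small illustration (the workload sizes are arbitrary):

```python
import math

def nlogn(n):
    """Operation count for an O(n log n) algorithm (base-2 logarithm)."""
    return n * math.log2(n)

# Compare operation counts as n grows from a million to a billion records.
for n in (10**6, 10**9):
    print(f"n = {n:>13,}: n log n = {nlogn(n):.2e}, n^2 = {n**2:.2e}")

# Growth factors for a 1,000-fold increase in n:
growth_nlogn = nlogn(10**9) / nlogn(10**6)   # ~1,500x: still tractable
growth_n2 = (10**9) ** 2 // (10**6) ** 2     # 1,000,000x: quickly infeasible
print(round(growth_nlogn), growth_n2)
```

The asymmetry is why data-intensive fields cannot afford quadratic algorithms for long: doubling the data every year would double an O(n log n) cost (plus a sliver) but quadruple an O(n²) cost.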

Dr. Clarke also noted that, with geospatial data, the goal is often more complex than producing single map layers (showing, for example, political boundaries, geography, or roads), although that can be extremely valuable. Rather, one might produce multiple map layers and then coregister them to detect differences among them. These differences can occur on different timescales, from minutes (in the case of some military imaging) to years. Climate research, for instance, might need to examine changes that take place over decades or even centuries, such as fluctuations in glaciers or the urbanization of terrain.

Each geospatial data set is associated with one of the 39 or 40 reference systems that can be used for determining Earth’s average shape, depending on the accuracy desired. Some systems convert place names to coordinates and others deal with problems of tiling, mosaicing, registration, edge matching, conflation, temporal inconsistencies, and so on. Many choices are made in converting raw data to their final form, and these choices must be accounted for when data are integrated.
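One of the simplest conversions of this kind is a change of map projection. The sketch below implements the spherical Web Mercator formula (the basis of EPSG:3857) to turn latitude/longitude into planar meters; a real integration would use a geodesy library and the datum appropriate to each data set, since mixing reference systems silently is exactly the kind of error that corrupts coregistration:

```python
import math

R = 6378137.0  # WGS 84 equatorial radius in meters (spherical approximation)

def to_web_mercator(lon_deg, lat_deg):
    """Project geographic coordinates to spherical Web Mercator (x, y) in meters."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Approximate coordinates of the Royal Observatory, Greenwich.
x, y = to_web_mercator(0.0, 51.4779)
print(round(x), round(y))
```

Two data sets projected with different formulas (or different datums) will disagree by meters to hundreds of meters, which is why the choices made in converting raw coordinates must travel with the data.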

Dr. Kesselman offered an example of a data-integration challenge from the life sciences, where the data from one functional magnetic resonance imaging (fMRI) machine need to be calibrated with those from another before any integration can be performed. The correct metadata must be associated with each data set so that the integration can proceed properly. He pointed to a research study that examined whether positive symptoms in schizophrenia are associated with superior temporal gyrus dysfunction. Answering that question required integrating results from multiple data sources, multiple sites, and multiple imaging modalities: the fMRI data came from multisite studies distributed across 15 different scanners at 15 different sites, and all of those data sets had to be integrated in order to answer the question.

Dr. Kesselman said that biomedical research is increasingly dealing with new types of data, such as genomics, blood proteins, and new imaging modalities. Clinical observations and clinical data have become a critical part of biomedical research in many settings. The diversity of data types raises questions such as whether analytics or queries can span genomics, brain structure, and imaging, with research examining all of those things simultaneously. Lastly, data integration in the biomedical sciences is distinct in that it faces fairly severe privacy and data-anonymization issues, because the work often involves information about individuals and may include identifying information.

Dr. Lynch ended the discussion by saying that the semantic complexity of a data set is not closely correlated with its size. A truly massive data set is very likely to cause technical problems, but difficult challenges can arise with small sets, too: although they do not necessarily strain the computational, storage, or communications capacity required to work with them, they can be semantically complex, hard to characterize, and for that reason quite difficult to integrate or reuse.


Some areas of science tend naturally to have data sources that are distributed geographically. Biomedical research is one of these: Its research groups tend to be associated with universities or hospitals as opposed to being clustered at single large facilities. Thus, according to Dr. Kesselman, data integration is a common challenge in that field. Climate research, too, often involves data integration, Dr. Karl observed, not only because data collected or compiled at different sites and with different types of instruments need to be integrated, but also because the simulation centers distributed around the globe are increasingly central to progress. For example, the 2007 report of the Intergovernmental Panel on Climate Change required coordinating many different model simulations and developing an archive of their outputs. This was a major nontrivial advance: For the first time, a researcher could analyze 25 different models and attempt to develop error bars for their integrated projections.

As noted above, there are a limited number of large data repositories in astronomy, so researchers tend to know where to find the biggest data sets. But it is much more difficult to learn about the many smaller data sets, said Dr. Szalay, and they are just as important for many lines of investigation. The threshold for properly publishing smaller, more specialized sets of data and adding the necessary metadata might be too high for an individual or small group, so Dr. Szalay believed there would be value in a repository service that enables users everywhere to obtain a single view of collections all over the world.


Because information is gathered when a researcher is first granted access to the equipment at a controlled-access experimental facility, it is possible to capture at least some of the metadata automatically. Dr. Frazier noted that records of every experiment run on the National Ignition Facility show who worked on the machine, when they were working on it, and what they were doing. There is detail about precisely what was installed in the machine at the time the experiment was run, how it was calibrated, and what happened for each of those parts. Such records are not captured automatically in most experimental settings, however.

Laura Haas of the IBM Almaden Research Center noted that the database community has been capturing workflows more generally, to include not just metadata but also information about the subsequent analysis, such as which data were selected, which analyses were performed, and what methods and software were used, all of which could apply to situations such as the one mentioned by Dr. Frazier. Dr. Haas suggested that the VisTrails work by Juliana Freire at the University of Utah could be useful in that regard. VisTrails keeps track of the details of experimental setups and the resulting data provenance, all in an XML database.
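A minimal version of the provenance idea is to write, alongside each derived data set, a machine-readable record of the inputs, the software version, and the operations performed. A sketch of such a record (the field names and values here are hypothetical, not the actual VisTrails schema):

```python
import json
from datetime import datetime, timezone

# Hypothetical provenance record stored next to a derived data set.
provenance = {
    "derived_file": "means_by_station.csv",          # hypothetical output file
    "inputs": ["raw_readings_2009.csv"],             # hypothetical input files
    "software": {"name": "analyze.py", "version": "1.4.2"},  # hypothetical tool
    "steps": [
        {"op": "filter", "predicate": "quality_flag == 0"},
        {"op": "aggregate", "group_by": "station", "stat": "mean"},
    ],
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

# Serialize so the record can travel with the data through any future migration.
blob = json.dumps(provenance, indent=2)
print(blob)
```

Even this crude record answers the questions Dr. Haas lists: which data were selected (the filter step), which analyses were performed (the aggregate step), and what software was used.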

Unfortunately, the compilation of metadata is often given the lowest priority and assigned to the most junior people in a research group, in Dr. Karl’s experience. This can mean that metadata are suboptimal for some or many secondary applications of the data.

Philip Bernstein of Microsoft Research asked the physical and life scientists at the workshop whether there are any incentives for them to make their own data reusable. If, for example, it takes three times as much effort to make data reusable as to simply create them for one’s own use (which he thought was probably a good estimate), do scientists get rewarded for that effort, or is this just a labor of love? If, on the other hand, a researcher looking at an important question has the choice of either pursuing a large grant that will purchase a new instrument and support other people or reusing existing data, how will he or she decide? Are institutions and reward systems biased in favor of the former course of action?

Dr. Szalay noted that astronomy has seen a substantial sociological shift over the last 10 years in this regard. Before, people were not sharing data very much—they kept their tapes in their desks. Now the community has almost reached the point where a researcher would be questioned if they did not find and reuse existing data. It is key, here, that the researcher(s) who originally collected the data and made them available for reuse be acknowledged and receive scholarly credit akin to that received when one of their publications is subsequently cited.

Dr. Seidel suggested that data-management plans might be made a requirement for proposals, with reviewers being instructed to take that part of the proposal seriously. For example, since 2003 NIH has required that all proposals involving direct cost expenditures of more than $500,000 per year must include a data-sharing plan. Dr. Szalay added that it would also be useful to have supplemental funds available for data management just as supplemental funds sometimes are available for educational elements of a research proposal. If the reviewers conclude that the data-management plan is a good one, then the researcher would receive the supplement rather than having to pay for data management and scientific research from the same pool of funds.


Dr. Clarke pointed out that geospatial researchers have created the beginnings of a data-integration policy through the adoption of the National Spatial Data Infrastructure. Other countries have developed their own counterpart standards of practice. Early federal standards were top-down and not as successful as those that have emerged through consensus.

Dr. Kesselman gave an example of a tool, the Human Imaging Database (HID), which was developed in conjunction with the fMRI research mentioned above. Among the functions implemented in the HID are distributed database query and data integration. More generally, the biomedical field has shown a lot of interest in ontologies and in defining vocabularies and dictionaries. That interest has led others to explore the federation of the semantic descriptions of the data. Dr. Kesselman’s own group has been looking at whether there are reusable data-integration tools that can be applied in order to avoid creating data federations and data environments for specific uses.

Dr. Frazier said that for NIF data, which are intended to be held for 30 years, unique identifiers are generated. These are better than surrogate keys in databases, because their value is not lost if the data are at some point migrated into a new schema with new surrogate keys. These identifiers can be applicable throughout the long life of the data.
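The distinction Dr. Frazier draws can be sketched with standard UUIDs: a surrogate key is meaningful only inside one database, whereas a globally unique identifier is minted once, stored with the record, and survives any future schema migration. (The record fields below are hypothetical, not the actual NIF schema.)

```python
import uuid

# Surrogate key: an auto-increment integer, meaningful only within one schema.
# After a migration to a new database, "row 42" there may be a different row.
surrogate_key = 42

# Persistent identifier: globally unique, minted once, and carried unchanged
# with the record through every future migration over the data's 30-year life.
record = {
    "id": str(uuid.uuid4()),        # unique across databases and decades
    "experiment": "shot-2009-117",  # hypothetical experiment label
    "component": "calorimeter-3",   # hypothetical installed-part record
}

# Two independently minted identifiers never collide in practice.
assert record["id"] != str(uuid.uuid4())
print(record["id"])
```

The design choice is simply to make identity a property of the data rather than of any particular database that happens to hold it.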

He also observed that scientists are not generally afraid of new technologies; however, they certainly are afraid of interrupting their work while someone creates specialized software or when software has failure modes that the scientist cannot fix on the spot. Scientists also are unwilling to invest in new technologies unless they know that long-term support exists. Finally, most prefer to control their data and analyses personally, so they lean toward the use of methods and software that they understand and can run themselves. There can be resistance to methods and software that are less familiar or that require control to be transferred to another person.

In contrast, though, Dr. Szalay pointed to the increasing presence of federations of astronomy data. For example, the National Virtual Observatory was set up with the explicit understanding that it would own the massive data sets and manage them as they evolve. Dr. Lynch observed that this delegation of data stewardship is a general trend for other scientific projects that generate vast data streams. Such endeavors usually operate at a scale large enough that someone has the mandate to plan for information management. Usually, information management is factored into the budgets, and Dr. Lynch sees some willingness among funders to also provide some money for data stewardship.

However, with medium- to small-scale science, any kind of data management or information management is often an ad hoc process. Frequently it is relegated to graduate students because the enterprise is not large enough to support specialists in data management and information technology. Both the funders of scientific research and the researchers themselves have over the past decade come to recognize that data are a very important part of the output of research, one that deserves management in its own right. But data sharing and data stewardship fall into two different timescales. The former often happens on timescales that are similar to those of a research project, perhaps extending a few years longer. But data stewardship operates on timescales that are more familiar to data archivists and research librarians, which are longer than the active professional life and interests of many of the researchers involved in the project that produced the data (and certainly longer than the tenure of a graduate student). More important for the question of data integration, stewardship timescales may exceed the lifetimes of the experimental or computational environments that created the data, making it difficult to interpret the data because tacit knowledge erodes as the people involved move on and the corporate memory is lost.

Dr. Stonebraker asked how much of the data-integration problem would be solved if software were developed to address the technical challenges mentioned so far in this section. Dr. Szalay responded that such software would solve a lot of the issues, especially if it were scalable to the tens or hundreds of petabytes. But it would not solve the problem of very large, dispersed data, which requires figuring out what to do when a petabyte of data must be moved on demand across the Internet, or how to avoid that. Dr. Szalay said such movement is possible now for sets of up to tens of terabytes, but it will not be possible with petabytes of data for at least another 5 to 10 years.


Michael Brodie of Verizon Communications brought up some more general issues of standards across enterprises. Establishing and maintaining standards in a very large community, whether a scientific community or an enterprise, is difficult because there are few general principles to help one decide how well a given standard will suit a particular data set, particularly when that data set is innovative or might be subject to novel reuse sometime in the future. It is unclear how to assess whether an existing standard can be extended, or whether a new standard should be developed from scratch.

Alon Halevy of Google, Inc., suggested that improvements to search tools could be a productive way to improve data-integration capabilities. For certain types of science, the hardest part of data integration may be finding the few data sources that are relevant to the research task. The integration will often be ad hoc, done for one task and then finished, and that is fine. So the real bottleneck is the ability to find the necessary databases among the thousands or millions of data sets that might be relevant. In such cases, the key enabler might be including metadata that allow the data set to be uncovered by a search engine. Similarly, it would be very useful if search engines could find and index the many data transforms that various groups have developed for a wide range of integrations.

Dr. Marron raised another topic having to do with data sharing and access. He thought that several workshop speakers were suggesting that the solution was for funding agencies to just require everybody to share data. Certainly there have been instances where that has been done. At NIH this has been given serious consideration, and some programs have requirements for data sharing.

But the other aspect, motivation, also needs to be investigated. Dr. Marron suggested that more widespread use of registries would be very helpful here. Properly designed and managed registries not only facilitate the reuse of data, but they might also improve the incentives for sharing them. Registries could affect the incentives by designating the entry of data as equivalent to publication. Then, by tracking reuse, registries could provide tenure and review committees with credible statistics about how widely the data were used or the value of their publication, measures analogous to those associated with paper publications and citation indexes. In addition, registries could provide a means of enforcing certain rules about metadata, because to register the data one must include appropriate metadata. It is important, though, that the registries be supported over the long term, or else researchers will be wary of investing the time and effort to contribute to them.

Repositories are not the same as registries: New, large, centralized repositories are very unlikely to receive funding in Dr. Marron’s view because, once an agency gets involved in supporting a database, it is a never-ending process. NIH supports a number of large centralized databases, and their cost has increased dramatically. He thinks the better model is distributed databases and distributed costs of maintaining them.

David Maier of Portland State University raised the question of how to train people to work effectively with shared data. The best technology solutions require some domain knowledge along with knowledge from computer science. Dr. Maier is not sure if it is better to start with people from a domain of science and try to give them data management skills, or start with people from a computer science background and try to give them domain skills. In his personal experience, computer science students do not get nearly enough statistics training to do the kinds of analyses they might be called upon to do, and their training in databases rarely exposes them to the challenges of working with other people’s preexisting databases.

Dr. Stonebraker observed that there are two main ways in which scientific data today differ noticeably from scientific data of, say, a decade ago:

  • Scale. Data sets are rapidly becoming larger. For example, it used to be commonplace for satellite imagery to tile the Earth into 100-m squares. Now the technology supports 5-m squares. Satellite imagery data sets have thus become larger by a factor of 400. This increase in data set size is expected to continue for the foreseeable future.
  • Number and type. The numbers and types of scientific data sets appear to be increasing exponentially. For example, sensor tagging technology is making it possible to tag everything of value and have it report interesting data on a real-time basis. Temperature data are available not only from traditional sources but from cell phones, car navigation systems, portable GPS devices, and the like. These are just a few examples of the expanding range of disparate types of data of possible interest to the scientific community.
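The factor of 400 in the first bullet is simply the square of the linear resolution gain: shrinking the tile edge from 100 m to 5 m multiplies the number of tiles covering the same area by (100/5)². A one-line check:

```python
old_edge_m = 100  # former satellite-imagery tile edge length, in meters
new_edge_m = 5    # current tile edge length, in meters

# Data volume scales with the *square* of the linear resolution improvement,
# because the number of tiles covering a fixed area grows quadratically.
growth = (old_edge_m / new_edge_m) ** 2
print(growth)  # 400.0
```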

In Dr. Stonebraker’s view, this dramatic increase in data availability calls for the following four capabilities:

  • Locate data sets more effectively. Scientists must be able to discover data sets of interest much more easily than they can today.
  • Convert data sets easily to a usable format. It should be much easier for scientists to reformat data sets than is currently the norm.
  • Integrate multiple data sets. Since data on the same phenomenon often come from many sources, a scientist needs to readily discover the syntax and semantics of data sets and to convert them to be syntactically and semantically comparable.
  • Process larger data sets. As noted above, the scale of scientific data is increasing rapidly.

The benefits of integrating large volumes of data, multiple data sets from different sources, and multiple types of data are enormous, and this integration will enable science to advance more rapidly and in areas heretofore outside the realm of possibility.



1. A global schema is a single structure that can be used to organize all the data stored by a specified field.

Copyright © 2010, National Academy of Sciences.
Bookshelf ID: NBK45678

