NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

National Research Council (US) Committee on Applied and Theoretical Statistics. Steps Toward Large-Scale Data Integration in the Sciences: Summary of a Workshop. Washington (DC): National Academies Press (US); 2010.


5. Workshop Lessons

At the end of the workshop, Michael Stonebraker presented the following list of messages that he thought were brought out by the discussions:

  • Many research groups leave the task of developing data integration software to science postdoctoral students, which is wasteful of the students’ time and can lead to inadequate results. Good DBMSs are difficult to write and take many person-years of effort. A better idea is to apply computer science expertise early in the process. A partnership of equals between computer scientists and natural scientists can pay off admirably. The successful collaboration between Alex Szalay and Jim Gray is a prime example.
  • It is impossible to build a complete software stack quickly. The best way to progress is to specify modest short-term goals and get them accomplished. Once something is working, one can build the next phase. In other words, one should take “baby steps,” always going from something that works to something that continues to work. What often kills projects is the desire to take a giant leap in functionality, without having intervening milestones.
  • Funding agencies can help scientists establish the capability for data integration by encouraging (or, indeed, requiring) the researchers they support to publish and curate their data. Agencies should strengthen the incentives for scientists to preserve their data in reusable form, such as by giving special consideration to proposals that include plans for careful data publication.
  • Moreover, funding agencies can encourage the establishment and maintenance of data repositories and work to improve the tools available for data curation and sharing.
  • An open-source tool-kit to assist with data transformations would be of immense value. This is something that agencies can budget for, solicit proposals for, and fund.
  • An open-source science-oriented DBMS would also be of immense value. Again, this is something that agencies can budget for, solicit proposals for, and fund.
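A toolkit of the kind suggested above might, at its simplest, offer small composable field-level converters. The following is an illustrative sketch only; the function names and the particular unit conversion are assumptions, not features of any existing toolkit:

```python
def compose(*steps):
    """Chain transformation steps into a single record-level converter."""
    def run(record):
        for step in steps:
            record = step(record)
        return record
    return run

def rename(old, new):
    """Return a step that renames one field of a record."""
    def step(record):
        record = dict(record)       # copy so the input record is untouched
        record[new] = record.pop(old)
        return record
    return step

def scale(field, factor):
    """Return a step that rescales a numeric field (e.g., a unit conversion)."""
    def step(record):
        record = dict(record)
        record[field] = record[field] * factor
        return record
    return step

# Map one group's schema onto another's: rename a field, then convert km to m.
transform = compose(rename("dist_km", "distance_m"), scale("distance_m", 1000.0))
print(transform({"dist_km": 2.5}))  # -> {'distance_m': 2500.0}
```

The point of the sketch is that transformations written this way can be published, shared, and recombined, rather than rewritten for each pairwise exchange of data.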

Dr. Stonebraker offered his own thoughts on how to improve the software that enables data integration. Noting that scientists often build the entire software stack anew for each project, he pointed out how this limits, and can even preclude, the reuse of software modules and the leveraging of well-established tools. This build-from-scratch approach was taken by the Mission to Planet Earth project a decade ago and, more recently, by the Large Hadron Collider project. In contrast, the Sloan Digital Sky Survey (SDSS) made its data available in an SQL server database and allowed astronomers to run a collection of queries of interest.
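The SDSS model, in which the query travels to the data and only the reduced result travels back, can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name and columns here are hypothetical stand-ins, not the actual SDSS schema:

```python
import sqlite3

# Build a toy "server-side" catalog (hypothetical schema, not the real SDSS tables).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (obj_id INTEGER, ra REAL, dec REAL, magnitude REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [(1, 10.5, -3.2, 14.1), (2, 11.0, -3.5, 19.7), (3, 180.2, 45.0, 13.8)],
)

# The scientist sends a selective query; only the reduced result set is
# returned, rather than the entire catalog being shipped over the network.
bright = conn.execute(
    "SELECT obj_id, magnitude FROM objects WHERE magnitude < 15 ORDER BY magnitude"
).fetchall()
print(bright)  # -> [(3, 13.8), (1, 14.1)]
```

Here the reduction happens where the data live: the catalog could hold billions of rows, but the bandwidth consumed is proportional only to the answer.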

Dr. Stonebraker suggested a number of ways to improve the common state of practice:

  • Send the query to the data, not the other way around. Current publication schemes typically send data sets to scientists, who load these data into their favorite software system and then reduce them further to find the actual data of interest. This approach is an inefficient use of bandwidth, because large data sets are sent over networks only to be reduced by two or three orders of magnitude at the receiving end. It would be much more efficient to reduce the data upstream, in response to a request, and save the bandwidth.[1] An alternate approach for saving bandwidth, which is sometimes practiced today, is to store the data in a processed form so that their transmittal is easier. But this has the shortcoming that requesting scientists have different needs, so any given processing will not be optimal for everyone. To facilitate the flexibility that scientists need, one may have to make available the raw data and not just a highly processed derived data set.
  • Put the raw data in a DBMS and then run the processing inside the DBMS engine. The only feasible way to allow a scientist to insert his or her own components into the processing pipeline is to make the processing a collection of DBMS tasks. Otherwise, the complexity of altering the pipeline is just too daunting.
  • Record the provenance (lineage) of the data carefully, with an automated system. This is necessary for the raw data, of course, but it is also crucial to precisely record the semantics of any derived data, thus carefully maintaining the provenance of those data sets. This is not something that current application code or system software is good at. Also, anything that requires human effort is not going to be widely used, and so systems are needed that record provenance as a side effect of natural science inquiry and processing, not an additional step. One of the big advantages of a DBMS is that it can record provenance automatically by recording every query and update that has been run.
  • A better DBMS is obviously needed for science applications, one of the challenges called for in Chapter 2. Scientists who spoke at the workshop did not like current relational DBMSs, which were built for business data processing, because they do not work well, if at all, on science data. The six goals presented at the beginning of this chapter are unlikely to be achieved with current commercial DBMSs. Self-documenting data sets, via RDF with reference to code systems, will be needed, along with separation of the data from the application/analysis software.
  • At present, most fields of science do not have systematic means for a scientist to make data available. They do not have public repositories in which to insert data, standards for provenance to describe the exact meaning of data sets, or easy ways to search the Internet looking for data sets of interest. In addition to data repositories, repositories of standards and translators are also needed.

While there was some discussion of these ideas at the workshop, no attempt was made to capture the range of opinions, and the thoughts presented in this chapter do not necessarily represent a consensus of the workshop participants.



[1] An anonymous reviewer pointed out that, in general, this approach may not scale, as some centralized stores will have to support an ever increasing number of queries. A complementary approach is to have replication on demand, where subsets of the data are replicated to secondary sites based on local demand. A form of this approach was taken by the LHC with its predetermined tier structure.

Copyright © 2010, National Academy of Sciences.
Bookshelf ID: NBK45671


