BOX 5-1 The Metagenomic Data Deluge: Future Data Storage and Access Challenges

From the perspective of sequence data repositories, projected data storage needs for archiving Sanger-based capillary sequence data might not seem overly formidable. Every year disk space gets cheaper, with storage density increasing steadily. Hard drives have experienced a 50-million-fold increase in storage density since their invention. So, is there cause for concern for future metagenomic data storage and retrieval?

Future DNA sequence data storage challenges are more complex than a simple extrapolation from today’s Sanger-based capillary sequence production rates would suggest. There are three central reasons why data accumulation is expected to accelerate dramatically, and soon:

  1. Technology. New sequencing technologies (see Table 4-2) are poised to increase data throughput and density substantially and pose new platform-specific data storage challenges. Since these platforms also enable individual labs to produce as much sequence data as did large production-scale centers in the past, the data storage and dissemination needs are expected to become even more acute.
    The projected throughput of one newly emerging sequencing technology, Solexa, is as much as 1,000 Mb per run, compared to 0.07 Mb per run on a Sanger-based capillary machine. Each Solexa run produces 1 × 10^12 bytes of image data, which reduces to 1 × 10^9 base pairs of raw data per run. Estimates from some sequencing centers suggest that sequence data production and storage needs per annum will approach 1 tera base pair (Tb) of raw sequence data (1 × 10^12 base pairs). This estimate does not consider the need for associated metadata (see below), which would increase storage needs by orders of magnitude.
  2. Approach. Metagenomic survey approaches can now access vast amounts of biological “sequence space” for study, virtually instantaneously. The days of slower, methodical sequencing efforts, one organism at a time, are rapidly coming to an end. Metagenomic sequence datasets will soon dwarf all other sequence databases combined, even in the early stages of development. The metadata required for these data (see below) will add dramatically to the data storage requirements.
  3. Metadata density and complexity. The volume of metadata, and the associated storage needs, are greater for metagenomics datasets than for straightforward, single organism-based DNA sequencing efforts. Metadata are central and mandatory for metagenomics efforts, because they provide the context for data analyses and interpretation. Metadata are non-homogeneous and add complexity and density to the data storage and dissemination challenge. For example, a single organism’s genome requires 1 × 10^7 bytes for the raw DNA sequence storage, increasing to 1 × 10^10 bytes when sequence annotation is added. By contrast, 1 × 10^7 bytes of metagenomic sequence from a single sample with its associated metadata might require 1 × 10^12 bytes of storage. Simple data storage projections from DNA alone are deceptive, unless they take these annotation and metadata storage requirements into account.
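
The orders of magnitude quoted in the three points above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original report. It reads the box's figures as powers of ten (for example, 1 × 10^12 bytes of image data and 1 × 10^9 base pairs per Solexa run), and all constant and function names are hypothetical, chosen only for illustration.

```python
# Illustrative arithmetic only; names are hypothetical and the figures are
# the box's own estimates, read as powers of ten.

# Point 1: per-run output.
SANGER_BP_PER_RUN = 70_000            # 0.07 Mb per capillary run
SOLEXA_BP_PER_RUN = 10**9             # raw base pairs per Solexa run
SOLEXA_IMAGE_BYTES_PER_RUN = 10**12   # image data per Solexa run

# Point 3: storage for sequence alone versus sequence plus context.
GENOME_RAW_BYTES = 10**7                  # single genome, raw sequence
GENOME_ANNOTATED_BYTES = 10**10           # same genome with annotation
METAGENOME_RAW_BYTES = 10**7              # one metagenomic sample, raw sequence
METAGENOME_WITH_METADATA_BYTES = 10**12   # same sample with metadata


def fold_change(after: int, before: int) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


def summarize() -> dict:
    """Collect the ratios implied by the box's estimates."""
    return {
        # How much more sequence one Solexa run yields than one Sanger run.
        "solexa_vs_sanger_throughput": fold_change(
            SOLEXA_BP_PER_RUN, SANGER_BP_PER_RUN
        ),
        # Bytes of image data generated for each raw base pair.
        "image_bytes_per_base": fold_change(
            SOLEXA_IMAGE_BYTES_PER_RUN, SOLEXA_BP_PER_RUN
        ),
        # Storage growth when annotation is added to a single genome.
        "annotation_expansion": fold_change(
            GENOME_ANNOTATED_BYTES, GENOME_RAW_BYTES
        ),
        # Storage growth when metadata accompany a metagenomic sample.
        "metadata_expansion": fold_change(
            METAGENOME_WITH_METADATA_BYTES, METAGENOME_RAW_BYTES
        ),
    }


if __name__ == "__main__":
    for name, ratio in summarize().items():
        print(f"{name}: about {ratio:,.0f}-fold")
```

Under these figures, the same 1 × 10^7 bytes of raw sequence expands roughly 1,000-fold with annotation for a single genome but roughly 100,000-fold with metadata for a metagenomic sample, which is why projections based on DNA sequence alone understate storage needs.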

From: 5, Data Management and Bioinformatics Challenges of Metagenomics

The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet.
National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications.
Washington (DC): National Academies Press (US); 2007.
Copyright © 2007, National Academy of Sciences.
