BOX 3-1 Organizing Metagenomic Sequence Data

Clustering: An approach to data analysis in which a large dataset is divided into distinct subsets based on some specific measure. In analyzing DNA or protein sequences, clustering is used to identify groups of sequences that share an evolutionary origin (families), but it can also identify larger sets, such as genomes (see binning). Genome annotation can be viewed as a form of clustering in which individual genes are assigned to well-characterized (or at least previously known) gene families. In metagenomics, direct clustering of DNA sequences is likely to remain a primary annotation method, as most of these sequences will not be easily assigned to any known gene family. In direct clustering, the nucleotide (or predicted protein) sequence itself is the basis for grouping the sequences.
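As a rough illustration of direct clustering, the sketch below groups sequences by how many short subsequences (k-mers) they share. The function names, the k-mer length, the Jaccard similarity measure, and the 0.5 threshold are all assumptions chosen for illustration; they are not taken from the source, and real clustering tools typically rely on alignment-based measures of similarity.

```python
def kmer_set(seq, k=4):
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (shared items / all items)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs, k=4, threshold=0.5):
    """Assign each sequence to the first cluster whose representative
    (the cluster's first member) is similar enough; otherwise start a
    new cluster. Returns a list of lists of sequence indices."""
    clusters = []  # each entry: (representative k-mer set, member indices)
    for idx, seq in enumerate(seqs):
        kmers = kmer_set(seq, k)
        for rep, members in clusters:
            if jaccard(kmers, rep) >= threshold:
                members.append(idx)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append((kmers, [idx]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    reads = ["ATGGCGTACGTTAGC", "ATGGCGTACGTTAGG", "TTTTTTTTTTTTTTT"]
    print(greedy_cluster(reads))
```

The greedy, first-match strategy keeps the sketch short; its result depends on the order of the input sequences, which a more careful method would avoid.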

Binning: A clustering method that uses composition and/or other characteristics of DNA contigs (sequences assembled from overlapping individual reads) to divide them into groups (clusters) that belong to specific genomes or groups of genomes. Examples of characteristics that can be used for binning are GC content and codon usage. In metagenomic projects in which genome assembly is a goal, binning is used as a preliminary step.
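A minimal sketch of composition-based binning, using GC content alone, might look like the following. The function names, the choice of ten equal-width bins, and the example contigs are hypothetical; practical binning generally combines several characteristics rather than relying on a single one.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def bin_by_gc(contigs, n_bins=10):
    """Group contig names into equal-width GC-content bins.

    contigs maps a contig name to its sequence. Bin i holds contigs
    whose GC fraction lies in [i / n_bins, (i + 1) / n_bins); a GC
    fraction of exactly 1.0 goes into the last bin.
    """
    bins = defaultdict(list)
    for name, seq in contigs.items():
        seq = seq.upper()
        gc_count = seq.count("G") + seq.count("C")
        # Integer arithmetic avoids floating-point rounding at bin edges.
        index = min(gc_count * n_bins // len(seq), n_bins - 1) if seq else 0
        bins[index].append(name)
    return dict(bins)


if __name__ == "__main__":
    example = {
        "contig_1": "ATATATATAT",
        "contig_2": "GCGCGCGCAT",
        "contig_3": "GGCCGGCCAA",
    }
    print(bin_by_gc(example))
```

Here contig_2 and contig_3 land in the same bin because both are 80% GC, which is the sense in which similar composition is taken as a hint of shared genomic origin.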

Gene annotation: A process of classifying predicted genes into known and well-characterized gene families. In metagenomics, where a substantial percentage of sequences cannot be easily classified, annotations often remain at the preliminary stage of clustering the sequences into groups (families) that are otherwise uncharacterized.

Gene prediction: A process of analyzing genomic DNA sequences to predict which regions encode biological functions, such as proteins, structural and regulatory RNAs, and other regulatory elements. Gene prediction is important for determining the functional repertoire of a microbial community and for comparing the capabilities of different communities.
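The simplest form of protein-coding gene prediction is a scan for open reading frames (ORFs). The sketch below is a deliberately simplified assumption-laden example: it searches only the forward strand, treats ATG as the only start codon, and ignores RNA genes and regulatory elements entirely. The function name and the min_codons parameter are hypothetical, and actual gene finders use statistical models rather than this kind of scan.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Find forward-strand ORFs in all three reading frames.

    Returns (start, end) index pairs such that seq[start:end] begins
    with ATG and ends with a stop codon. min_codons counts every codon
    in the ORF, including the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and look for the next start
    return orfs


if __name__ == "__main__":
    dna = "CCATGAAATTTTAGGG"
    for begin, end in find_orfs(dna):
        print(begin, end, dna[begin:end])
```

A fuller version would also scan the reverse complement, since genes can lie on either strand.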

From: 3, From Genomics to Metagenomics: First Steps

The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet.
National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications.
Washington (DC): National Academies Press (US); 2007.
Copyright © 2007, National Academy of Sciences.
