PLoS One. 2016 Oct 19;11(10):e0165015. doi: 10.1371/journal.pone.0165015. eCollection 2016.

Filtration and Normalization of Sequencing Read Data in Whole-Metagenome Shotgun Samples.

Author information

Department of Pediatrics, Baylor College of Medicine, Houston, Texas 77030, United States of America.
Clinical Research Group, 'Molecular Pathology of Cystic Fibrosis and Pseudomonas Genomics', OE 6710, Hannover Medical School, Hannover D-30625, Germany.
Biomedical Research in Endstage and Obstructive Lung Disease (BREATH), German Center for Lung Research, Hannover, Germany.


The ever-increasing affordability of next-generation sequencing makes whole-metagenome sequencing an attractive alternative to traditional 16S rDNA, RFLP, or culturing approaches for the analysis of microbiome samples. The advantage of whole-metagenome sequencing is that it allows direct inference of the metabolic capacity and physiological features of the studied metagenome without relying on knowledge of the genotypes and phenotypes of the members of the bacterial community. It also makes it possible to overcome problems of 16S rDNA sequencing, such as the unknown copy number of the 16S gene and insufficient sequence similarity of the "universal" 16S primers to some of the target 16S genes. On the other hand, next-generation sequencing suffers from biases that result in non-uniform coverage of the sequenced genomes. To overcome this difficulty, we present a model of GC-bias in sequencing metagenomic samples as well as filtration and normalization techniques necessary for accurate quantification of microbial organisms. While there has been substantial research in normalization and filtration of read-count data in such techniques as RNA-seq or ChIP-seq, to our knowledge, this has not been the case for the field of whole-metagenome shotgun sequencing. The presented methods assume that complete genome references are available for most microorganisms of interest present in metagenomic samples. This is often a valid assumption in such fields as medical diagnostics of patient microbiota. Testing the model on two validation datasets showed a four-fold reduction in root-mean-square error compared to non-normalized data in both cases. The presented methods can be applied to any pipeline for whole-metagenome sequencing analysis that relies on complete microbial genome references. We demonstrate that such pre-processing reduces the number of false positive hits and increases the accuracy of abundance estimates.

[Indexed for MEDLINE]
Free PMC Article
