Grid-Assembly: An oligonucleotide composition-based partitioning strategy to aid metagenomic sequence assembly

J Bioinform Comput Biol. 2015 Jun;13(3):1541004. doi: 10.1142/S0219720015410048. Epub 2015 Feb 8.

Abstract

The metagenomics approach involves the extraction, sequencing and characterization of the genomic content of the entire community of microbes present in a given environment. Compared with the assembly of single-genome data, accurate assembly of metagenomic sequences is a challenging task. Given the huge volume and diverse taxonomic origin of metagenomic sequences, direct application of single-genome assembly methods to metagenomes is likely not only to increase computational infrastructure requirements immensely, but also to produce chimeric contigs. One strategy to address this challenge is to partition metagenomic sequence datasets into clusters and to assemble the sequences in each cluster separately using any single-genome assembly method. The current study presents such an approach, which uses tetranucleotide usage patterns to first represent sequences as points in a three-dimensional (3D) space. The 3D space is subsequently partitioned into "Grids". Sequences within overlapping grids are then progressively assembled using any available assembler. We demonstrate the applicability of the Grid-Assembly method using various categories of assemblers as well as different simulated metagenomic datasets. Validation results indicate that the Grid-Assembly approach improves the overall quality of assembly, in terms of the purity and volume of the assembled contigs.
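The partitioning idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how tetranucleotide usage is mapped to three dimensions, so the fixed random projection (`AXES`), the `cell_size` value and all function names below are assumptions chosen only for illustration. The sketch computes a 256-dimensional tetranucleotide frequency vector per sequence, projects it to a 3D point and bins the points into cubic grid cells.

```python
import itertools
import math
import random
from collections import defaultdict

# All 256 tetranucleotides over the DNA alphabet, with a lookup index.
TETRAMERS = ["".join(p) for p in itertools.product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_freq(seq):
    """Return the 256-dimensional tetranucleotide frequency vector of seq."""
    counts = [0] * 256
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # windows containing N etc. are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 256


# ASSUMPTION: a stand-in projection onto three fixed random axes (seeded so
# the mapping is reproducible). The published method may use something else.
_rng = random.Random(0)
AXES = [[_rng.gauss(0, 1) for _ in range(256)] for _ in range(3)]


def to_3d(freq):
    """Project a frequency vector onto the three assumed axes."""
    return tuple(sum(f * a for f, a in zip(freq, axis)) for axis in AXES)


def grid_cell(point, cell_size):
    """Return the integer index of the cubic grid cell containing point."""
    return tuple(math.floor(c / cell_size) for c in point)


def partition(seqs, cell_size=0.05):
    """Group sequence names by grid cell; seqs maps name -> DNA string."""
    grids = defaultdict(list)
    for name, seq in seqs.items():
        grids[grid_cell(to_3d(tetra_freq(seq)), cell_size)].append(name)
    return dict(grids)
```

Each value returned by `partition` is a list of sequence names that could be passed to an assembler as one batch. The overlapping of grids and the progressive assembly step mentioned in the abstract are not modelled here.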

Keywords: Metagenomics; sequence assembly; tetranucleotide frequency.

MeSH terms

  • Algorithms*
  • Computer Simulation
  • Databases, Genetic*
  • High-Throughput Nucleotide Sequencing
  • Metagenome
  • Metagenomics / methods*
  • Oligonucleotides
  • Reproducibility of Results
  • Sequence Analysis, DNA / methods

Substances

  • Oligonucleotides