Assemble CRISPRs from metagenomic sequencing data

Bioinformatics. 2016 Sep 1;32(17):i520-i528. doi: 10.1093/bioinformatics/btw456.

Abstract

Motivation: Clustered regularly interspaced short palindromic repeats and associated proteins (CRISPR-Cas) allow more specific and efficient gene editing than all previous genetic engineering systems. These discoveries stem from the finding that the CRISPR system is an adaptive immune system that protects prokaryotes against exogenous genetic elements such as phages. Despite these advances, almost all knowledge about CRISPRs is based only on microorganisms that can be isolated, cultured and sequenced in labs, yet about 95% of bacterial species cannot be cultured in labs. The fast accumulation of metagenomic data, which contain DNA sequences of microbial species from natural samples, provides a unique opportunity for CRISPR annotation in uncultivable microbial species. However, the large amount of data, heterogeneous coverage and shared leader sequences of some CRISPRs pose challenges for identifying CRISPRs efficiently in metagenomic data.

Results: In this study, we developed a CRISPR finding tool for metagenomic data that does not rely on generic assembly, which is error-prone and computationally expensive for complex data. Our tool can run on commonly available machines in small labs. It exploits properties of CRISPRs to decompose generic assembly into local assembly. We tested it on both mock and real metagenomic data and benchmarked its performance against state-of-the-art tools.
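
The idea of using CRISPR properties to replace one large assembly with many small ones can be illustrated with a minimal sketch. It is not the metaCRISPR implementation: the function names, the k-mer size and the spacing thresholds are all assumptions chosen for illustration. The sketch flags k-mers that recur within a read at a repeat-plus-spacer distance, then bins reads by those k-mers so that each bin could be handed to a separate local assembly.

```python
from collections import defaultdict


def repeat_kmers(read, k=8, min_gap=20, max_gap=60):
    """Return k-mers that recur in `read` at a CRISPR-like spacing.

    The gap is the distance between successive start positions of the same
    k-mer, i.e. roughly repeat length + spacer length. All thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    positions = defaultdict(list)
    for i in range(len(read) - k + 1):
        positions[read[i:i + k]].append(i)

    hits = set()
    for kmer, pos in positions.items():
        # Compare each occurrence with the next occurrence of the same k-mer.
        if any(min_gap <= b - a <= max_gap for a, b in zip(pos, pos[1:])):
            hits.add(kmer)
    return hits


def bin_reads_by_repeat(reads, k=8, min_gap=20, max_gap=60):
    """Group read indices by the candidate repeat k-mers they contain.

    Each bin is a small, independent set of reads, which is what makes a
    local (per-bin) assembly possible instead of a whole-sample assembly.
    """
    bins = defaultdict(list)
    for idx, read in enumerate(reads):
        for kmer in repeat_kmers(read, k, min_gap, max_gap):
            bins[kmer].append(idx)
    return dict(bins)


if __name__ == "__main__":
    # Hypothetical toy data: one 10-bp repeat separated by two 20-bp spacers.
    repeat = "GTTTCAGACC"
    crispr_read = (repeat + "ATGCCGTAAGCTTGACATCG"
                   + repeat + "TCCGGATTACGAGCTAGTCA" + repeat)
    other_read = "ACGTTGCATGCCGATAGCTAGGCT"
    for kmer, members in sorted(bin_reads_by_repeat([crispr_read, other_read]).items()):
        print(kmer, members)
```

A real tool would also have to tolerate sequencing errors, reverse-complement reads and repeats shared across arrays, none of which this sketch attempts.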

Availability and implementation: The source code and the documentation of metaCRISPR are available at https://github.com/hangelwen/metaCRISPR. Contact: yannisun@msu.edu.

MeSH terms

  • Bacteriophages*
  • Clustered Regularly Interspaced Short Palindromic Repeats*
  • Genetic Variation
  • Metagenomics*
  • Prokaryotic Cells