PLoS One. 2014 Aug 20;9(8):e105067. doi: 10.1371/journal.pone.0105067. eCollection 2014.

Profile hidden Markov models for the detection of viruses within metagenomic sequence data.

Author information

1. Biological and Medical Informatics Graduate Program, University of California San Francisco, San Francisco, California, United States of America; Departments of Medicine, Biochemistry and Biophysics, and Microbiology, University of California San Francisco, San Francisco, California, United States of America; Howard Hughes Medical Institute, Bethesda, Maryland, United States of America.
2. The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America.
3. The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America; Institute for Human Genetics & Division of Biostatistics, University of California San Francisco, San Francisco, California, United States of America.
4. Departments of Medicine, Biochemistry and Biophysics, and Microbiology, University of California San Francisco, San Francisco, California, United States of America; Howard Hughes Medical Institute, Bethesda, Maryland, United States of America.

Abstract

Rapid, sensitive, and specific virus detection is an important component of clinical diagnostics. Massively parallel sequencing enables new diagnostic opportunities that complement traditional serological and PCR-based techniques. While massively parallel sequencing promises the benefits of being more comprehensive and less biased than traditional approaches, it presents new analytical challenges, especially with respect to detection of pathogen sequences in metagenomic contexts. To a first approximation, the initial detection of viruses can be achieved simply through alignment of sequence reads or assembled contigs to a reference database of pathogen genomes with tools such as BLAST. However, recognition of highly divergent viral sequences is problematic, and may be further complicated by the inherently high mutation rates of some viral types, especially RNA viruses. In these cases, increased sensitivity may be achieved by leveraging position-specific information during the alignment process. Here, we constructed HMMER3-compatible profile hidden Markov models (profile HMMs) from all the virally annotated proteins in RefSeq in an automated fashion using a custom-built bioinformatic pipeline. We then tested the ability of these viral profile HMMs ("vFams") to accurately classify sequences as viral or non-viral. Cross-validation experiments with full-length gene sequences showed that the vFams were able to recall 91% of left-out viral test sequences without erroneously classifying any non-viral sequences into viral protein clusters. Thorough reanalysis of previously published metagenomic datasets with a set of the best-performing vFams showed that they were more sensitive than BLAST for detecting sequences originating from more distant relatives of known viruses. To facilitate the use of the vFams for rapid detection of remote viral homologs in metagenomic data, we provide two sets of vFams, comprising more than 4,000 vFams each, in the HMMER3 format. We also provide the software necessary to build custom profile HMMs or update the vFams as more viruses are discovered (http://derisilab.ucsf.edu/software/vFam).
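
As an illustration of how profile HMM search results might be turned into a viral/non-viral call, the following minimal Python sketch parses the tabular per-sequence output that HMMER3's hmmsearch writes with its --tblout option and keeps the best-scoring profile per sequence. It is not the authors' pipeline: the function name, the E-value cutoff of 1e-5, and the sequence and profile names in the usage note are assumptions chosen for illustration only.

```python
from typing import Dict, Iterable, Tuple


def best_vfam_hits(
    tblout_lines: Iterable[str],
    max_evalue: float = 1e-5,  # assumed cutoff, not taken from the paper
) -> Dict[str, Tuple[str, float]]:
    """Return {sequence_id: (profile_name, evalue)} for the best hit per sequence.

    Expects lines in HMMER3 --tblout format as written by hmmsearch, where
    whitespace-separated column 1 is the target (sequence) name, column 3 is
    the query (profile HMM) name, and column 5 is the full-sequence E-value.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in tblout_lines:
        # Skip blank lines and the '#' comment/header lines HMMER emits.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 6:
            continue  # malformed row; ignore rather than guess
        seq_id, profile = fields[0], fields[2]
        evalue = float(fields[4])
        if evalue > max_evalue:
            continue  # too weak to count as a putative viral hit
        # Keep only the lowest E-value profile for each sequence.
        if seq_id not in best or evalue < best[seq_id][1]:
            best[seq_id] = (profile, evalue)
    return best
```

With a hypothetical table in which read_1 hits two profiles and read_2 has only a weak hit, the function reports read_1 with its stronger profile and omits read_2. Any sequence absent from the returned mapping has no hit passing the cutoff under these assumptions; an appropriate threshold would need to be calibrated for the data at hand.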

PMID: 25140992
PMCID: PMC4139300
DOI: 10.1371/journal.pone.0105067
[Indexed for MEDLINE]
Free PMC Article
