Bioinformatics. 2016 Dec 1;32(23):3535-3542. Epub 2016 Aug 11.

LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes.

Author information

1. Graduate Program in Bioinformatics, University of British Columbia, Vancouver, Canada.
2. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA.
3. Department of Microbiology and Immunology, University of British Columbia, Vancouver, Canada.
4. ECOSCOPE Training Program, University of British Columbia, Vancouver, British Columbia, Canada.
5. Peter Wall Institute for Advanced Studies, University of British Columbia.

Abstract

MOTIVATION:

A perennial problem in the analysis of environmental sequence information is the assignment of reads or assembled sequences, e.g. contigs or scaffolds, to discrete taxonomic bins. In the absence of reference genomes for most environmental microorganisms, the use of intrinsic nucleotide patterns and phylogenetic anchors can improve assembly-dependent binning needed for more accurate taxonomic and functional annotation in communities of microorganisms, and assist in identifying mobile genetic elements or lateral gene transfer events.

RESULTS:

Here, we present a statistic called LCA*, inspired by Information and Voting theories, that uses the NCBI Taxonomic Database hierarchy to assign taxonomy to contigs assembled from environmental sequence information. The LCA* algorithm identifies a sufficiently strong majority on the hierarchy while minimizing entropy changes to the observed taxonomic distribution, resulting in improved statistical properties. Moreover, we apply results from the order-statistic literature to formulate a likelihood-ratio hypothesis test and P-value for testing the supremacy of the assigned LCA* taxonomy. Using simulated and real-world datasets, we empirically demonstrate that the voting-based methods (majority vote and LCA*), in the presence of known reference annotations, are consistently more accurate in identifying contig taxonomy than the lowest common ancestor algorithm popularized by MEGAN, and that the LCA* taxonomy strikes a balance between specificity and confidence, providing an estimate appropriate to the information available in the data.
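To make the contrast between the three assignment strategies concrete, the following is a minimal Python sketch on a toy taxonomy. It is not the authors' implementation and does not use the LCAStar library: the `PARENT` table, the function names, and the `alpha` threshold are all hypothetical, and `sufficient_majority` is a simplified stand-in that picks the deepest node supported by more than a fraction `alpha` of annotations. It does not perform the entropy-minimizing step or the likelihood-ratio test described in the paper. The `shannon_entropy` helper only reports the entropy of the observed annotation distribution.

```python
from collections import Counter
from math import log2

# Hypothetical toy taxonomy (child -> parent), for illustration only.
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Escherichia": "Gammaproteobacteria",
    "Pseudomonas": "Gammaproteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus": "Firmicutes",
}


def lineage(taxon):
    """Return the path from `taxon` up to the root, taxon first."""
    path = [taxon]
    while path[-1] != "root":
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Deepest node shared by the lineages of all annotations."""
    shared = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(lineage(taxon))
    # Shared nodes lie on one root path; the deepest has the longest lineage.
    return max(shared, key=lambda node: len(lineage(node)))


def majority_vote(taxa):
    """Most frequent annotation, ignoring the hierarchy."""
    return Counter(taxa).most_common(1)[0][0]


def sufficient_majority(taxa, alpha=0.5):
    """Deepest node whose subtree holds more than `alpha` of the annotations.

    Simplified assumption, not the published LCA* procedure.
    """
    support = Counter()
    for taxon in taxa:
        support.update(lineage(taxon))  # an annotation supports all its ancestors
    threshold = alpha * len(taxa)
    candidates = [node for node, count in support.items() if count > threshold]
    return max(candidates, key=lambda node: len(lineage(node)))


def shannon_entropy(taxa):
    """Shannon entropy (bits) of the observed annotation distribution."""
    n = len(taxa)
    return -sum((c / n) * log2(c / n) for c in Counter(taxa).values())


# Hypothetical per-ORF annotations on a single contig.
annotations = ["Escherichia"] * 4 + ["Pseudomonas"] * 3 + ["Bacillus"]

print(lowest_common_ancestor(annotations))  # Bacteria
print(majority_vote(annotations))           # Escherichia
print(sufficient_majority(annotations))     # Gammaproteobacteria
print(shannon_entropy(annotations))
```

In this toy case a single outlying annotation pushes the lowest common ancestor up to Bacteria, a plain majority vote commits to a genus backed by only half of the annotations, and the majority-on-the-hierarchy rule settles on an intermediate rank, which mirrors the balance between specificity and confidence that the abstract attributes to LCA*.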

AVAILABILITY AND IMPLEMENTATION:

LCA* has been implemented as a stand-alone Python library compatible with the MetaPathways pipeline. Both are available on GitHub with installation instructions and use-cases (http://www.github.com/hallamlab/LCAStar/).

CONTACT:

shallam@mail.ubc.ca

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.

PMID: 27515739
PMCID: PMC5181528
DOI: 10.1093/bioinformatics/btw400
[Indexed for MEDLINE]
