Genome Biol. 2018 Oct 30;19(1):165. doi: 10.1186/s13059-018-1554-6.

RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification.

Author information

1. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA.
2. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA.
3. Department of Computer Science, Rice University, Houston, TX, USA. treangen@rice.edu.

Abstract

In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
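
The second finding can be illustrated with a toy example. The sketch below is not the authors' code and not the scoring scheme of any particular classifier: the taxonomy, the sequences, and the helper names (`PARENT`, `build_db`, `classify`) are all hypothetical, and reads are simply assigned to the lowest common ancestor (LCA) of every k-mer hit. It shows how adding a second species from the same genus to the database can move a read's assignment from species level up to genus level.

```python
# Hypothetical toy taxonomy (child -> parent); not taken from RefSeq.
PARENT = {
    "Genus_A": "root",
    "species_A1": "Genus_A",
    "species_A2": "Genus_A",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = [taxon]
    while taxon != "root":
        taxon = PARENT[taxon]
        path.append(taxon)
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    return "root"


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_db(genomes, k):
    """Map each k-mer to the LCA of all genomes that contain it."""
    db = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            db[kmer] = lca(db[kmer], taxon) if kmer in db else taxon
    return db


def classify(read, db, k):
    """Simplified rule (an assumption): LCA of all k-mer hits, None if no hits."""
    hits = [db[kmer] for kmer in kmers(read, k) if kmer in db]
    if not hits:
        return None
    result = hits[0]
    for taxon in hits[1:]:
        result = lca(result, taxon)
    return result


# Made-up sequences: the two species share their first seven bases.
old_db = build_db({"species_A1": "ACGTACGGTT"}, k=4)
new_db = build_db(
    {"species_A1": "ACGTACGGTT", "species_A2": "ACGTACGCAA"}, k=4
)

read = "ACGTACG"
print(classify(read, old_db, 4))  # species_A1
print(classify(read, new_db, 4))  # Genus_A
```

With the smaller database, every k-mer in the read maps to a single species. Once a close relative is added, the shared k-mers are stored at the genus node, so the same read can no longer be resolved to species.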

KEYWORDS:

Comparative analysis; LCA; Metagenomics; Microbiome; Reference database; Taxonomic classification; k-mer

PMID: 30373669
PMCID: PMC6206640
DOI: 10.1186/s13059-018-1554-6
[Indexed for MEDLINE]
Free PMC Article
