Database (Oxford). 2014 Jun 12;2014. pii: bau058. doi: 10.1093/database/bau058. Print 2014.

Curation accuracy of model organism databases.

Author information

  • 1 Bioinformatics Research Group, Artificial Intelligence Center, SRI International, CA, USA, Department of Genetics, Stanford University, CA 94305, USA, Department of Bacteriology, University of Wisconsin, WI 53706-1521, USA, Department of Cellular and Molecular Pharmacology, University of California at San Francisco, CA 94158-2140, USA, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, WI 53726, USA and Department of Medical Microbiology and Immunology, University of Wisconsin, WI 53706-1521, USA.
  • 2 Bioinformatics Research Group, Artificial Intelligence Center, SRI International, CA, USA, Department of Genetics, Stanford University, CA 94305, USA, Department of Bacteriology, University of Wisconsin, WI 53706-1521, USA, Department of Cellular and Molecular Pharmacology, University of California at San Francisco, CA 94158-2140, USA, DOE Great Lakes Bioenergy Research Center, Wisconsin Energy Institute, WI 53726, USA and Department of Medical Microbiology and Immunology, University of Wisconsin, WI 53706-1521, USA. Email: pkarp@ai.sri.com

Abstract

Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers as online encyclopedias, as aids in the interpretation of new experimental data and as gold standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and then checking whether the data in the referenced publications supported those assertions. A database assertion was considered to be in error if it could not be found in the publication cited for it. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by PhD-level scientists is highly accurate.

Database URL: http://ecocyc.org/, http://www.candidagenome.org/
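The overall error rate quoted above is simple arithmetic: 10 errors divided by 633 validated facts. The sketch below reproduces that figure. The function names are hypothetical, and the Wilson score interval is an addition that the abstract does not report; it is included only to illustrate how much sampling uncertainty a rate based on 633 facts might carry. The per-database counts are not given in the abstract, so only the pooled numbers are used.

```python
import math


def error_rate(errors: int, total: int) -> float:
    """Fraction of validated assertions found to be in error."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: not part of the abstract. z = 1.96 corresponds to a
    roughly 95% interval under the usual normal approximation.
    """
    p = error_rate(errors, total)
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Pooled counts stated in the abstract: 10 errors in 633 facts.
    rate = error_rate(10, 633)
    print(f"overall error rate: {rate * 100:.2f}%")  # prints 1.58%
    low, high = wilson_interval(10, 633)
    print(f"approx. 95% interval: {low * 100:.2f}% to {high * 100:.2f}%")
```

Under these assumptions the interval is only a rough guide, since it treats each validated fact as an independent draw, which the sampling of assertions within gene pages may not strictly satisfy.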

© The Author(s) 2014. Published by Oxford University Press.

PMID: 24923819 [PubMed - indexed for MEDLINE]
PMCID: PMC4207230 (free PMC article)