Our Data Model
What is NCBI Virus?
NCBI Virus is an integrative, value-added resource designed to support retrieval, display and analysis of a curated collection of virus sequences and large sequence datasets. We are a community portal for viral sequence data, and our goal is to increase the usability of data archived in GenBank and other NCBI repositories.
Our mission is to enable researchers:
1) to find sequences and sequence datasets of interest more easily by filtering data on normalized metadata, and
2) to use virus sequence data more effectively by creating custom data reports and exporting those reports in various formats for use outside of NCBI Virus.
This is a work in progress, and we welcome your feedback! Please contact us using this contact form or use the Feedback link to send in your comments and suggestions.
Data Model, Type of Data and Dataflow
NCBI Virus uses machine processing of all records from the International Nucleotide Sequence Database Collaboration (INSDC) databases, as well as human curation, to provide high-quality virus sequences, with standardized metadata for a subset of these sequences (read more in our publication).
We use manual and machine curation to validate viral sequence data and normalize sequence and sample attributes (metadata). This data is then made available through a custom search interface that supports selection of data based on a variety of properties.
When a sequence is submitted to GenBank or another INSDC database, the authors provide a description of the sample it was isolated from - for example, the collection date and country, the host, and the isolation source. We have standardized this metadata, so instead of searching for all similar terms (and their misspellings), you can easily filter by a single term. You can read more at Search for sequences by virus name or taxonomy group.
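To illustrate the idea, here is a minimal Python sketch of filtering on a standardized host term. The synonym table, the `normalize_host` and `filter_by_host` helpers, and the `host` key are hypothetical and used only for illustration; they do not describe how NCBI Virus actually stores or curates its metadata.

```python
# Hypothetical synonym table; the real controlled vocabulary used by
# NCBI Virus is not shown here.
HOST_SYNONYMS = {
    "homo sapiens": "Homo sapiens",
    "human": "Homo sapiens",
    "humans": "Homo sapiens",
    "homo sapeins": "Homo sapiens",  # a plausible misspelling
}


def normalize_host(raw):
    """Map a submitter-provided host string to a single standard term.

    Returns None when the value is not in the table, so that unknown
    values can be flagged for review rather than guessed.
    """
    # Lowercase and collapse whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return HOST_SYNONYMS.get(key)


def filter_by_host(records, standard_host):
    """Keep records whose normalized host equals the requested term."""
    return [
        r for r in records
        if normalize_host(r.get("host", "")) == standard_host
    ]


records = [
    {"accession": "A", "host": "human"},
    {"accession": "B", "host": "Homo sapeins"},
    {"accession": "C", "host": "Sus scrofa"},
]
human_records = filter_by_host(records, "Homo sapiens")
```

With normalization done once at curation time, a single filter term is enough to retrieve records A and B, even though their submitters described the host differently.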
Currently, the NCBI Virus database includes data from the following sequence groups:
RefSeqs - reference sequence records from one or more complete genome sequences for each viral species. Where available, RefSeqs are created based on "exemplar" isolates for each recognized species identified by the International Committee on Taxonomy of Viruses (ICTV). The majority of RefSeqs are complete genomes, but because some ICTV "exemplars" are not complete sequences, there are also incomplete RefSeqs in the database. A separate RefSeq record is created for each segment in segmented viral genomes.
Complete nucleotide sequences - all NCBI viral nucleotide sequences for which the GenBank ASN.1 record contains the descriptor descr/molinfo/completeness=complete or the word 'complete' appears in the record’s definition line (defline). This group also includes complete reference records (RefSeqs).
Partial nucleotide sequences - nucleotide sequences that are not complete according to the definition above.
Proviral sequences - sequences that have the "/proviral" source qualifier in the GenBank record.
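The group definitions above can be sketched as a small Python function. This is only an illustration under stated assumptions: the record is a plain dict, and the keys `is_refseq`, `completeness`, `defline` and `source_qualifiers` are hypothetical names, not an NCBI data structure. Matching 'complete' as a whole word, ignoring case, and letting one record belong to several groups are also assumptions of this sketch.

```python
import re


def is_complete(record):
    """Apply the completeness rule described above to one record.

    `record` is a dict with hypothetical keys:
      - "completeness": value of descr/molinfo/completeness, if any
      - "defline": the record's definition line
    """
    if record.get("completeness") == "complete":
        return True
    # Match 'complete' as a whole word so that 'incomplete' does not
    # qualify (an assumption of this sketch, as is ignoring case).
    defline = record.get("defline", "")
    return re.search(r"\bcomplete\b", defline, re.IGNORECASE) is not None


def sequence_groups(record):
    """Return the set of sequence groups a record falls into."""
    groups = set()
    if record.get("is_refseq"):
        groups.add("RefSeq")
    # Every record is either complete or partial under the rule above.
    groups.add("Complete" if is_complete(record) else "Partial")
    if "proviral" in record.get("source_qualifiers", ()):
        groups.add("Proviral")
    return groups


example = {
    "defline": "Example virus proviral DNA, partial sequence",
    "source_qualifiers": ["proviral"],
}
example_groups = sequence_groups(example)  # Partial and Proviral
```

Returning a set rather than a single label reflects the text: a complete RefSeq is counted among both the RefSeqs and the complete nucleotide sequences.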
Find more about NCBI Virus functionalities and how to get started at our Help page.
Meet the Team
The NCBI Virus Team is part of the National Center for Biotechnology Information (NCBI). We focus on developing free virus-related resources, work that includes computational development and design as well as curation of virus sequences and large sequence datasets.
When citing NCBI Virus, please refer to:
|Virus Variation Resource - improved response to emergent viral outbreaks.||Hatcher EL, Zhdanov SA, Bao Y, Blinkova O, Nawrocki EP, Ostapchuck Y, Schaffer AA, Brister JR.||Nucleic Acids Res. 2017 Jan 4;45(D1):D482-D490. doi: 10.1093/nar/gkw1065. Epub 2016 Nov 28.|
Other related publications by the NCBI Virus team
|Minimum Information about an Uncultivated Virus Genome (MIUViG).||Roux S, Adriaenssens EM, Dutilh BE, Koonin EV, Kropinski AM, et al.||Nat Biotechnol. 2018 Dec 17. doi: 10.1038/nbt.4306.|
|Overlapping genes and the proteins they encode differ significantly in their sequence composition from non-overlapping genes.||Pavesi A, Vianelli A, Chirico N, Bao Y, Blinkova O, et al.||PLoS One. 2018 Oct 19;13(10):e0202513.|
|How to Name and Classify Your Phage: An Informal Guide.||Adriaenssens E, Brister JR.||Viruses. 2017 Apr 3;9(4). pii: E70.|
|Consensus statement: Virus taxonomy in the age of metagenomics.||Simmonds P, Adams MJ, Benkő M, Breitbart M, Brister JR, et al.||Nat Rev Microbiol. 2017 Mar;15(3):161-168.|
|NCBI will no longer make taxonomy identifiers for individual influenza strains on January 15, 2018.||Hatcher E, Bao Y, Amedeo P, Blinkova O, Cochrane G, et al.||PeerJ Preprints.|
|NCBI viral genomes resource.||Brister JR, Ako-Adjei D, Bao Y, Blinkova O.||Nucleic Acids Res. 2015 Jan;43(Database issue):D571-7.|
|HIV-1, human interaction database: current status and new features.||Ako-Adjei D, Fu W, Wallin C, Katz KS, Song G, Darji D, et al.||Nucleic Acids Res. 2015 Jan;43(Database issue):D566-70.|
|The Influenza Virus Resource at the National Center for Biotechnology Information.||Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, et al.||J. Virol. 2008 Jan;82(2):596-601.|
|Visualization of large influenza virus sequence datasets using adaptively aggregated trees with sampling-based subscale representation.||Zaslavsky L, Bao Y, Tatusova TA.||BMC Bioinformatics, 2008; 9:237.|
|Accelerating the neighbor-joining algorithm using the adaptive bucket data structure.||Zaslavsky L, Tatusova T.||Bioinformatics Research and Applications. Lecture Notes in Computer Science, Springer-Verlag, 2008; 4983:122-133.|
|Multiresolution approaches to representation and visualization of large influenza virus sequence datasets.||Zaslavsky L, Bao Y, Tatusova T.||IEEE International Conference on Bioinformatics and Biomedicine. 2007.|
|FLAN: a web server for influenza virus genome annotation.||Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T.||Nucleic Acids Research. 2007 Jul 1; 35 (Web Server issue): W280-4.|
|An Adaptive Resolution Tree Visualization of Large Influenza Virus Sequence Datasets.||Zaslavsky L, Bao Y, Tatusova T.||Bioinformatics Research and Applications. Lecture Notes in Computer Science, Springer-Verlag, 2007;4463:192-202.|
|Large-scale sequencing of human influenza reveals the dynamic nature of viral genome evolution.||Ghedin E, Sengamalay NA, Shumway M, Zaborsky J, Feldblyum T, et al.||Nature. 2005 Oct 20; 437(7062): 1162-6.|
|Whole-genome analysis of human influenza A virus reveals multiple persistent lineages and reassortment among recent H3N2 viruses.||Holmes EC, Ghedin E, Miller N, Taylor J, Bao Y, et al.||PLoS Biol. 2005 Sep; 3(9): e300.|
|Virus Variation Resource - recent updates and future directions.||Brister JR, Bao Y, Zhdanov SA, Ostapchuck Y, Chetvernin V, et al.||Nucleic Acids Res. 2014 Jan; 42(Database issue):D660-5. doi: 10.1093/nar/gkt1268. Epub 2013 Dec 4.|
|The virus variation resources at the National Center for Biotechnology Information: dengue virus.||Resch W, Zaslavsky L, Kiryutin B, Rozanov M, Bao Y, Tatusova TA.||BMC Microbiol. 2009 Apr 2;9:65.|