   About Viral Genomes

This collection of virus genomic sequences is a part of Entrez Genome that provides curated sequence data and related information for the community. A detailed description of the NCBI Viral Genomes project has been published in Journal of Virology 2004 Jul;78(14):7291-7298.

Reference Sequence

Retrieving useful information from public sequence databases is usually complicated by their high level of redundancy. In the case of viral sequences, the situation is further complicated by the large number of strains, isolates, and mutants, which are represented by both complete genomic sequences and partial sequences. All available sequences are compared, and one well-studied and/or best-annotated full-length genomic sequence (or complete set of sequences) for each virus becomes the "reference genome". Preference is given to strains or isolates of practical importance and/or to sequences that were determined and deposited in a public database earlier than others. Each genomic segment (the viral equivalent of a chromosome) is represented by one "reference sequence". A reference genome may therefore consist of one or more reference sequences, depending on the number of genomic segments in a particular virus. For example, the monopartite genome of poliovirus is represented by a single record, NC_002058, which contains the sequence of the notorious pathogenic strain Mahoney (GenBank accession V01149), published by Vincent Racaniello and David Baltimore in 1981. There are a few exceptions to the rule of one reference genome per species; e.g., the records NC_001347 and NC_006273 contain complete genomes of the well-studied laboratory strain AD169 and the wild-type strain Merlin, respectively, for the same species, Human herpesvirus 5. Viral reference sequences are also part of the NCBI RefSeq collection and can be downloaded from there.

Reviewed, Validated, and Provisional Records

RefSeq records are typically created from the original GenBank/EMBL/DDBJ records containing the full-length sequences of genomic segments. Only a few of them, such as the Murine hepatitis virus JHM record NC_006852, have been assembled from overlapping incomplete sequences found in public records. If a RefSeq record is further curated by our staff, it is marked as "reviewed". The improvements may include the incorporation of relevant biological information from the literature or other sequence records, as well as corrected taxonomy names and lineages. For example, in the poliovirus genome record, NC_002058, we have updated the information on polyprotein processing based on more recent and advanced annotations available for the enteroviruses. Records that have undergone an initial review are "validated" records. Other records remain in "provisional" status until additional analysis is performed. Viral RefSeq records bear accession numbers starting with the characters NC_, with the exception of the AC_ series, which represents additional complete sequences reannotated and submitted to NCBI Genomes by outside experts.
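
The accession prefixes described above can be checked programmatically. The following is a minimal sketch, not an NCBI tool: the function name, the returned labels, and the exact regular expression (two-letter prefix, underscore, digits, optional ".version" suffix) are assumptions made for illustration.

```python
import re

# Assumed accession shape: NC_ or AC_ prefix, digits, optional ".version".
_VIRAL_REFSEQ = re.compile(r"^(NC|AC)_(\d+)(?:\.(\d+))?$")


def classify_viral_refseq(accession):
    """Return a label for a viral RefSeq accession, or None if it is not one.

    The labels "reference" and "additional" are hypothetical names for the
    NC_ and AC_ series described in the text.
    """
    match = _VIRAL_REFSEQ.match(accession.strip())
    if match is None:
        return None  # e.g. a GenBank accession such as V01149
    return "reference" if match.group(1) == "NC" else "additional"


if __name__ == "__main__":
    for acc in ("NC_002058", "V01149"):
        print(acc, classify_viral_refseq(acc))
```

A GenBank accession such as V01149 carries no RefSeq prefix, so the sketch returns None for it rather than guessing a category.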

Multicomponent (Segmented) Viruses

Each segment of a multicomponent genome is annotated in a separate record. We assemble each complete multicomponent genome manually by matching strain/isolate information for its potential components (i.e. complete sequences with the same taxonomy identification numbers, tax_ids).
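
The matching step can be illustrated with a short sketch. The assembly itself is done manually, so the code below only shows how candidate sets might be proposed for a curator to review. The record fields (tax_id, strain, segment, accession), the function names, and all sample values are hypothetical placeholders, not real database entries.

```python
from collections import defaultdict


def group_segments(records):
    """Group segment records by (tax_id, normalized strain/isolate name)."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["tax_id"], rec["strain"].strip().lower())
        groups[key].append(rec)
    return dict(groups)


def candidate_genomes(records, expected_segments):
    """Return groups that contain exactly the expected set of segments."""
    complete = {}
    for key, recs in group_segments(records).items():
        by_segment = {r["segment"]: r["accession"] for r in recs}
        if set(by_segment) == set(expected_segments):
            complete[key] = by_segment
    return complete


if __name__ == "__main__":
    # Placeholder records; identifiers are invented for illustration only.
    records = [
        {"tax_id": 1, "strain": "Alpha", "segment": "A", "accession": "X1"},
        {"tax_id": 1, "strain": "alpha ", "segment": "B", "accession": "X2"},
        {"tax_id": 1, "strain": "Beta", "segment": "A", "accession": "X3"},
    ]
    print(candidate_genomes(records, ["A", "B"]))
```

Strain names are compared after trimming and lowercasing, which is a simplification; real strain and isolate annotations vary far more, which is one reason manual review remains necessary.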

Genome Neighbors (other complete sequences for the species)

An additional sequence that belongs to the same species as a reference sequence becomes a genome "neighbor" for this reference sequence, provided that it matches all of the criteria that were used to select complete genomic sequences. A reference sequence may be replaced by a better annotated sequence and become a genome neighbor for the new reference sequence. The genome neighbors are DDBJ/EMBL/GenBank records accessible from the lists of reference sequences or via the Entrez Genome link "Other genomes for species" (see FAQs).

How the Viral Genomes Are Shown

The viral genomes are first consolidated into broad categories based on the type and structure of the nucleic acid (such as double-stranded DNA viruses, double-stranded RNA viruses, negative-strand ssRNA viruses, positive-strand ssRNA viruses, etc.). From the initial groupings, one can link to the alphabetical list of individual genomic records. A roster of all virus families and a comprehensive list of all virus records are accessible from the home page. In addition, the search interface, located on the home and help pages, allows users to retrieve lists of genomes grouped by taxonomy categories. Nucleotide or protein sequences of all viral reference genomes can be retrieved from the corresponding Entrez database by clicking the Entrez Nucleotide or Entrez Protein hyperlinks under "All Viral Genomes" in the left sidebar.

In addition to being a source of complete viral genomic sequences, this site also presents tools that can be used to analyze these sequences and their products.

NCBI Viral Genomes home page

The NCBI Viral Genomes home page contains a short introduction to viruses, a scheme of Influenza A virus replication, a search text box, and links to available viral genomes listed alphabetically or grouped by families.

VOG (Viral COG) - Clusters/Groups of Related Viral Proteins

To reveal and visualize both close and remote similarities between viruses and to facilitate navigation through viral proteins, protein sequences from viral RefSeq have been grouped by sequence similarity using BLAST-based approaches. For the DNA virus and phage protein clusters, the Clusters of Orthologous Groups (COG) or single-linkage (BLASTCLUST) approaches were used. Clusters/groups for RNA viruses have been constructed by a new method that takes into account the positions of BLAST hits on each protein. Each group contains proteins that bear related sequences regardless of their total lengths or the presence of other domains. Therefore, multidomain proteins usually belong to more than one group. The updated Web displays include summary listings by taxonomy groups and group/cluster views that now contain schematic representations of proteins, on which regions of similarity with other members of a particular cluster/group are highlighted. A user can obtain multiple alignments of selected proteins or of only those portions that are relevant to a particular cluster.
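
The single-linkage idea mentioned above can be sketched as follows: any two proteins connected by a chain of pairwise similarities end up in the same cluster. This is a generic union-find illustration, not the BLASTCLUST implementation. The function name and the input format (a list of protein identifiers plus pairs already judged similar, e.g. by a BLAST score threshold chosen by the user) are assumptions.

```python
def single_linkage_clusters(ids, similar_pairs):
    """Cluster ids so that every similar pair shares a cluster (single linkage)."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), set()).add(i)
    # Return clusters as sorted lists, ordered by their first member.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


if __name__ == "__main__":
    proteins = ["p1", "p2", "p3", "p4"]
    hits = [("p1", "p2"), ("p2", "p3")]  # placeholder similarity pairs
    print(single_linkage_clusters(proteins, hits))
```

Note that this sketch assigns each protein to exactly one cluster. It therefore does not reproduce the position-aware method described for RNA viruses, in which a multidomain protein can belong to several groups.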


Revised: June 8, 2006