The NCBI Handbook, 1st edition
Jo McEntyre and Jim Ostell, editors
National Center for Biotechnology Information (NCBI), National Library of Medicine, National Institutes of Health, Bethesda, MD 20892-6510

Chapter 18: The Reference Sequence (RefSeq) Project

Kim Pruitt, Tatiana Tatusova, and Donna Maglott
Created: October 9, 2002.
Last Update: January 3, 2007.
Summary

The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences from diverse taxa. The collection includes sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. It should be noted, though, that RefSeq has been built using data from public archival databases only.

RefSeq biological sequences (also known as RefSeqs) are derived from GenBank records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article in the literature, a RefSeq represents the consolidation of information by a particular group at a particular time. RefSeqs are available without restriction and can be retrieved in several ways: by searching NCBI's databases, including Nucleotide, Protein, Gene, and Map Viewer; by searching with a sequence via BLAST; by FTP download; or through links from other NCBI resources, including Gene, Map Viewer, and PubMed.

Introduction

The goal of the NCBI RefSeq project is to provide an accurate, non-redundant, and comprehensive collection of naturally occurring DNA, RNA, and protein molecules for major organisms. The collection explicitly connects nucleotide and protein sequences that are related. Ideally, all molecule types will be available for each well-studied organism; however, because the state of public sequence data varies from genome to genome, the level of available information for different organisms at any given time also varies. Intermediate records are provided for some organisms when the genomic sequence is not complete. Although rearrangements and mutations do occur naturally, the goal is to represent sequence standards that are considered to be the predominant "normal" version of the sequence.

RefSeq provides a non-redundant framework of information to facilitate database searches, whether they are searched via genomic location, sequence, or text annotation. RefSeq represents an objective and experimentally verifiable definition of non-redundancy by providing one example of each natural biological molecule per organism. The collection may include alternatively spliced transcripts that share some exons, identical proteins expressed from alternatively spliced transcripts, paralogs, and homologs. Additional records are provided to represent alternate haplotypes or strains for some organisms.

RefSeq is unique in providing a large, multi-species, curated sequence database that explicitly links chromosome, transcript, and protein information. This establishes a baseline for integrating a large body of diverse data, including sequence, genetic, expression, and functional information, into a single, consistent framework with a uniform set of conventions and standards. RefSeq is substantially based on sequence records submitted to public archival databases. Hereafter, the shorter term "GenBank" (Chapter 1) is used to indicate the full set of archival sequence data that is submitted to, and redistributed by, the three collaborating databases: the European Molecular Biology Laboratory (EMBL), the DNA Data Bank of Japan (DDBJ), and GenBank. Note that although based upon GenBank, RefSeq is distinct from GenBank and is not included in the GenBank database. GenBank is an archive of sequences and annotations supplied by original authors and cannot be altered by others. RefSeq differs from GenBank in the same way that a review article differs from a related collection of primary research articles on the same subject. Each RefSeq represents a synthesis by a person or group of the primary information that was generated and submitted by others. Other organizing principles or standards of judgment are possible, which is why the work is attributed to the synthesizing "editors". The RefSeq dataset is curated on an ongoing basis by collaborating groups and by NCBI staff. Sequence records are presented in a standard format and subjected to computational validation. The GenBank source of the RefSeq record, curation status, and attribution to the curation group are also indicated.

RefSeq standards support genome annotation, gene characterization, mutation analysis, expression studies, and polymorphism discovery. The RefSeq collection supports the following:

Database Content: Background

Table 1. The RefSeq accession number format and molecule types

Accession prefix   Molecule type   Comment
AC_                Genomic         Complete genomic molecule, alternate assembly
NC_                Genomic         Complete genomic molecule, reference assembly
NG_                Genomic         Incomplete genomic region
NT_                Genomic         Contig or scaffold, clone-based or WGS (a)
NW_                Genomic         Contig or scaffold, primarily WGS (a)
NS_                Genomic         Environmental sequence
NZ_ (b)            Genomic         Unfinished WGS
NM_                mRNA
NR_                RNA
XM_ (c)            mRNA            Predicted model
XR_ (c)            RNA             Predicted model
AP_                Protein         Annotated on AC_ alternate assembly
NP_                Protein
YP_ (c)            Protein
XP_ (c)            Protein         Predicted model
ZP_ (c)            Protein         Predicted model, annotated on NZ_ genomic records

(a) Whole Genome Shotgun sequence data.

(b) An ordered collection of WGS for a genome.

(c) Computed.

The September 2006 RefSeq collection includes sequences from more than 3,700 distinct taxonomic identifiers, ranging from viruses to bacteria to eukaryotes. It represents chromosomes, organelles, plasmids, viruses, transcripts, and more than 2,800,000 proteins. Every sequence has a stable accession number, a version number, and an integer identifier (gi) assigned to it. Outdated versions are always available if a sequence is updated. RefSeq records can be distinguished from GenBank records by the inclusion of an underscore (“_”) in the accession identifier. The RefSeq accession prefix has an implied meaning in terms of the type of molecule it represents. Table 1 indicates the types of sequence molecules and the corresponding RefSeq accession number formats. See also the RefSeq website.
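The underscore rule and the prefix meanings in Table 1 can be sketched as a small lookup. This is only an illustrative sketch: the function names are hypothetical, and the lookup assumes the accession begins with one of the three-character prefixes listed in Table 1.

```python
# Prefix table transcribed from Table 1: prefix -> (molecule type, comment).
REFSEQ_PREFIXES = {
    "AC_": ("Genomic", "Complete genomic molecule, alternate assembly"),
    "NC_": ("Genomic", "Complete genomic molecule, reference assembly"),
    "NG_": ("Genomic", "Incomplete genomic region"),
    "NT_": ("Genomic", "Contig or scaffold, clone-based or WGS"),
    "NW_": ("Genomic", "Contig or scaffold, primarily WGS"),
    "NS_": ("Genomic", "Environmental sequence"),
    "NZ_": ("Genomic", "Unfinished WGS"),
    "NM_": ("mRNA", ""),
    "NR_": ("RNA", ""),
    "XM_": ("mRNA", "Predicted model"),
    "XR_": ("RNA", "Predicted model"),
    "AP_": ("Protein", "Annotated on AC_ alternate assembly"),
    "NP_": ("Protein", ""),
    "YP_": ("Protein", ""),
    "XP_": ("Protein", "Predicted model"),
    "ZP_": ("Protein", "Predicted model, annotated on NZ_ genomic records"),
}


def is_refseq(accession: str) -> bool:
    """RefSeq accessions contain an underscore; GenBank accessions do not."""
    return "_" in accession


def molecule_type(accession: str):
    """Return the Table 1 molecule type for an accession, or None if the
    first three characters are not a prefix listed in Table 1."""
    entry = REFSEQ_PREFIXES.get(accession[:3].upper())
    return entry[0] if entry else None
```

For example, `molecule_type("NC_003197")` returns `"Genomic"`, whereas an accession without an underscore fails `is_refseq` and yields `None` from `molecule_type`.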

Updates

RefSeq updates are provided daily. These include records added to the collection and records updated to reflect sequence or annotation changes, including complete re-annotation of a genome. New and updated records are made available in Entrez and BLAST databases as soon as possible. The FTP site also provides daily update information (see below).

Flat File Format and Annotated Features

Figure 1. Features of a RefSeq record. Dialog balloons indicate distinguishing features including the ACCESSION number format, the COMMENT text that indicates the status and source of the sequence information, the (optional) gene description (Summary text), the (optional) description of transcript variants, and links to additional information. Links are provided to other sources of information as relevant (e.g., CCDS, Gene Ontology, OMIM, the ExPASy Enzyme Commission website), as well as to other NCBI pages (e.g., the protein record and Entrez Gene). Note: This image has been adapted for didactic purposes. The information presented here does not correspond to the complete information currently available for this record.

RefSeq records appear similar in format to the GenBank records from which they are derived. Attributes novel to RefSeqs include a unique accession prefix followed by an underscore (Table 1), and a COMMENT field that indicates the RefSeq status and the source of the sequence information (Figure 1). Some RefSeq records may include feature annotations or database cross-references (dbxref) that are not seen in the underlying GenBank record. New annotation is provided by computation and by manual curation. For example, nucleotide variation, STS, and tRNA features are computed for a subset of RefSeq entries using the data available in dbSNP (Chapter 5), UniSTS, and through tRNA-scan prediction (Lowe et al.). RefSeq proteins also report on conserved domains computed by NCBI's Conserved Domain Database (Chapter 3) and protein modification sites that were identified by the Human Protein Reference Database (HPRD). Other nucleotide and protein features, publications, and comments may be added by collaborating groups or NCBI staff (see Box 2).

Assembling and Maintaining the RefSeq Collection

Summary

Table 2. RefSeq status codes

Code                Description
GENOME ANNOTATION   The RefSeq record is provided via automated processing and is not subject to individual review or revision between builds.
INFERRED            The RefSeq record has been predicted by genome sequence analysis, but it is not yet supported by experimental evidence. The record may be partially supported by homology data.
PREDICTED           The RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted.
PROVISIONAL         The RefSeq record has not yet been subject to individual review. The initial sequence-to-gene name associations have been established by outside collaborators or NCBI staff.
REVIEWED            The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information.
VALIDATED           The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided.
WGS                 The RefSeq record is provided to represent a collection of whole genome shotgun sequences. These records are not subject to individual review or revisions between genome updates.

The RefSeq database is the result of data extraction from GenBank, curation, and computation, combined with extensive collaboration with authoritative groups. Each molecule is annotated as accurately as possible with the organism name, strain (or breed, ecotype, cultivar, or isolate), gene symbol for that organism, and informative protein name. Collaborations with authoritative groups outside of NCBI provide a variety of information, including curated sequence data, nomenclature, feature annotations, and links to external organism-specific resources. When no collaboration has been established, the NCBI staff assembles the data from GenBank. Each record has a comment indicating the level of curation that it has received (Table 2) and an attribution to the collaborating group. Thus, a RefSeq record can include either an essentially unchanged, validated copy of the original GenBank record or corrected or additional information that has been added by collaborators or experts at NCBI.
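As a rough illustration of how the status codes in Table 2 might be used when filtering records, the sketch below groups them by whether the record has received some individual review. The grouping and the function name are simplifications introduced here, not part of the RefSeq specification.

```python
# Status codes from Table 2.
ALL_STATUS_CODES = {
    "GENOME ANNOTATION", "INFERRED", "PREDICTED", "PROVISIONAL",
    "REVIEWED", "VALIDATED", "WGS",
}

# Simplifying assumption: only REVIEWED and VALIDATED records are described
# in Table 2 as having undergone some individual review.
INDIVIDUALLY_REVIEWED = {"REVIEWED", "VALIDATED"}


def is_individually_reviewed(status: str) -> bool:
    """Return True if the status code implies some individual review."""
    code = status.strip().upper()
    if code not in ALL_STATUS_CODES:
        raise ValueError(f"unknown RefSeq status code: {status!r}")
    return code in INDIVIDUALLY_REVIEWED
```

A caller that wants only curated records could keep those for which `is_individually_reviewed(status)` is true and treat the rest as automated or preliminary.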

When a molecule is represented by multiple sequences for an organism in GenBank, NCBI staff make an effort to select the "best" sequence to be presented as a RefSeq. The goal is to avoid known mutations, sequencing errors, cloning artifacts, and erroneous annotation. RefSeqs identified with a problem of this type are corrected. Sequences are validated to confirm that the genomic sequence corresponding to an annotated mRNA feature matches the mRNA sequence record, and that coding region features can be translated into the corresponding protein sequence.
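The translation check described above can be sketched as follows. This is a minimal illustration, not NCBI's validation code: it assumes the standard genetic code (organelles and some organisms use other codes), 0-based half-open coordinates for the coding region, and hypothetical function names.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon.
    Codons containing ambiguous bases are translated as 'X'."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    residues = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


def cds_matches_protein(mrna: str, cds_start: int, cds_end: int, protein: str) -> bool:
    """Check that the annotated coding region of an mRNA translates to the
    given protein. Coordinates are 0-based, end-exclusive (an assumption)."""
    return translate(mrna[cds_start:cds_end]) == protein.upper()
```

A record whose coding region does not reproduce the associated protein would be flagged for review rather than silently accepted.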

Working groups using distinct process pipelines compile the RefSeq collection for different organisms. RefSeq records are provided via several distinct approaches including:

  • collaboration

  • computational genome annotation pipeline

  • curation by NCBI staff

  • extraction from GenBank

Collaboration

Table 3. Selected examples of collaborator-contributed RefSeq records

Organism                    Collaborator
Saccharomyces cerevisiae    Saccharomyces Genome Database (SGD)
Arabidopsis thaliana        The Arabidopsis Information Resource (TAIR)
Pseudomonas aeruginosa      Pseudomonas aeruginosa Community Annotation Project (PseudoCAP)
Caenorhabditis elegans      WormBase
Drosophila melanogaster     FlyBase

RefSeq welcomes collaborations with authoritative groups outside NCBI that are willing to provide sequences, nomenclature, annotation, or links to phenotypic or organism-specific resources. The RefSeq feedback form can be used to provide corrections or to initiate collaboration. For some species, the RefSeq collection is curated entirely by a collaborating authoritative group that provides both the sequences and annotation. For others, most notably the human and mouse RefSeq collections, there have been numerous collaborations with individual scientists, for either specific genes or complete gene families. Other collaborations may cover a set of organisms. For example, a Viral Genome Advisory group has been established to support curation of the viral RefSeq collection. Thus, RefSeq records may contain information provided by an external authoritative source and/or analyses and curation at NCBI. The collaborating group is identified on RefSeq records. Table 3 lists some examples of annotated genomes provided by this method. Also see the RefSeq website.

If RefSeq records are supplied by a collaborating group, processing may be automated in that data are periodically downloaded, validated to detect errors, and modified to add such annotation as cross-references to Entrez Gene. The validation process checks for logical conflicts in the annotation and makes small changes to format the submission as a RefSeq record. In such cases, NCBI does not directly curate the annotation data or make sequence changes; conflicts or problems that are identified either by the validation process or received by email from researchers are reported back to the submitting group. As the collaborating group supplies updates, changes are reflected in the RefSeq collection for that organism.

Computational Genome Annotation Pipeline

NCBI computes annotation of genomic sequence data for some genomes including some microbial species, human, mouse, rat, cow, chimpanzee, dog, zebrafish, and honey bee. The annotation pipeline is automated and yields genomic, transcript, and protein RefSeq records. Names annotated on the transcript and protein products are based on sequence similarity. Annotation data are refreshed periodically, and records that are generated from this process flow are not subject to individual incremental updates or manual curation (see Table 1; see Chapter 14 for more information on the eukaryotic genome annotation pipeline). For some species (including human), records may be provided by a mixture of methods. In other words, there may be a set of curated transcript and protein records in addition to a set of records that were generated computationally. RefSeq records that are processed by NCBI's pipeline are displayed in the NCBI Map Viewer (Chapter 20), included in Entrez Gene, and are also available in the main Entrez sequence databases.

Curation by NCBI Staff

A portion of the RefSeq dataset is supported by NCBI curation staff. This subset includes viral, vertebrate, and some invertebrate organisms. Most bacterial, plant, and fungal records are provided either by collaboration or by processing the annotated genome data submitted to GenBank; however, a small number of bacterial genomes are annotated and curated by NCBI staff. Viral genomes are curated in consultation with a viral board of advisors.

Curation of Viral, Mitochondrial, and Bacterial Records

Table 5. Selected examples of REVIEWED viral, microbial, and small genomes

RefSeq       Organism
NC_003197    Salmonella typhimurium LT2
NC_003277    Salmonella typhimurium LT2 plasmid pSLT
NC_001802    Human immunodeficiency virus 1
NC_000907    Haemophilus influenzae Rd KW20
NC_004718    SARS coronavirus

The RefSeq records curated manually at NCBI are annotated with a status of REVIEWED or VALIDATED in the RefSeq comment block (see Table 5). For example, viral genomes are re-annotated using GeneMarkS in collaboration with Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology. The GeneMarkS results are then manually reviewed by NCBI staff. Viral annotation also relies on an established Viral Genome Advisors group and other experts. For example, the HIV-1 RefSeq was curated by NCBI staff in collaboration with the authors of the book Retroviruses. The NCBI curators expanded the mature peptide annotation for several viruses, including the poliovirus and hepatitis C, based on observations reported in the literature that had not been included in a GenBank submission. Metazoan mitochondrial records are curated in accordance with established protein and gene nomenclature.

Curation of Vertebrate and Invertebrate Records

Curation of higher eukaryotic organisms is primarily focused on the mammalian genomes, especially the human and mouse. The RefSeq process flow for these organisms provides transcripts, proteins, and some genomic regions. Records of some genomic regions are provided to facilitate genome-wide annotation and may represent gene clusters or single genes or pseudogenes. Processing integrates the official nomenclature and other information including alternate names, Gene Ontology (GO) terms, and GeneRIFs available in Entrez Gene. Multiple collaborations support the collection of this descriptive information (Box 1; see also Chapter 19).

Figure 2. RefSeq Processing Pipelines. Once sequence data are deposited in the public archival databases, they are available for RefSeq processing. Processing pipelines include the vertebrate curation pipeline, the computational genome annotation pipeline, and extraction from GenBank. These pipelines generate new and updated RefSeq records that become publicly available in Entrez Nucleotide, Protein, and Gene databases. (A) Once a gene is defined and associated with sufficient sequence information in an internal curation database, it can be pushed into the RefSeq pipeline. The RefSeq process is initiated by an automated BLAST step, which uses the stored sequence data as a query against GenBank to identify the longest mRNA for each locus. This initial RefSeq record has a status of provisional, predicted, or inferred. Subsequent curation may result in a sequence or annotation update (as described in Box 2) and a status of validated or reviewed. Records are updated if the underlying GenBank accession number is updated or if other associated data are updated, including nomenclature, publications, or map location. (B) Available RefSeq and GenBank data are aligned to an assembled genome, ab initio gene prediction is carried out using the alignment data, and an analysis program integrates all available data to define the annotation models. New "model" RefSeq records are generated by this pipeline. (C) When a complete, annotated genome becomes available in GenBank, a set of corresponding RefSeq records is generated by duplicating the GenBank submission, followed by validation and addition of cross-references to Entrez Gene (via a dbXref citing the GeneID) and, in some cases, more informative and standardized protein names.

Sequences enter the RefSeq curation process flow by a combination of computational analysis, collaboration, and in-house curation. As illustrated in Figure 2, generation of the initial RefSeq record is dependent on identifying a representative sequence for a gene. New genes and sequence data are added to the collection by collaborators and NCBI-based analyses to mine information from UniGene, cDNA alignments, and GenBank (Box 1). Quality assessment (QA) processes are executed regularly to flag questionable data for review. This QA checks for errors and conflicts in nomenclature, sequence similarity and genomic placement, and potential cloning errors (e.g., chimeras) and leverages data from other NCBI resources including HomoloGene, Map Viewer, and GenBank related sequences.

Once sequence data are unambiguously associated with a GeneID, the data may be propagated into a RefSeq record. The completeness of the sequence information and the category of the gene (e.g., protein coding, pseudogene) determine whether a RefSeq will be made, and if so, of what type (DNA, RNA, mRNA plus protein). Reviewed RefSeq records are not made for transposable elements or those loci for which the product type is uncertain (e.g., protein coding or not). Also, a RefSeq may not be provided when the associated data are known to be incomplete. For example, if accessions for a protein-coding locus are annotated to indicate that the protein is a partial sequence, then automatic processing to provide the RefSeq transcript and protein does not occur. It should be noted, however, that the RefSeq collection does include partial transcripts and proteins that are provided by collaborating groups or when the RefSeq is based on annotated whole-genome submissions in GenBank.

RefSeq processing for non-protein-coding RNA loci uses the longest defining transcript record that has been associated with the GeneID. For non-transcribed loci (such as non-transcribed pseudogenes), the RefSeq record is often derived from a defined region of a large genomic sequence with no indication of exon substructure. Curation of these types of records is minimal because the current focus is on curation of protein-coding loci; however, these records provide an important reagent for the computational annotation pipeline and support annotation of non-protein-coding genes that might otherwise be missed or misrepresented as a predicted protein-coding gene. Other records are provided to represent larger genomic regions including gene clusters, genes requiring rearrangement to express a product (immunoglobulins and T-cell receptors), and haplotypes with known differences in gene content. These genomic region records are annotated by the curation staff, often in collaboration with scientific experts, and are not provided by automatic processing.

For protein-coding loci, an initial "seed" sequence is selected from the set of accessions associated with a given GeneID based on protein and transcript lengths. Sequences that are annotated as partial proteins are not considered. Because we have only a subset of sequence data stored for a GeneID, an automated process checks for additional sequence data that may be a more complete representative of the transcript; the selected seed sequence is used as a query sequence for automated BLAST analysis, and if a longer mRNA is identified (with an identical coding sequence), then that sequence is used to provide the RefSeq record. This stage of the process associates additional accessions with the GeneID and also includes detection of conflicts and problems, including sequence-to-locus association ambiguity and vector contamination.
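The seed selection and extension steps described above can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` type and function names are hypothetical, the text does not say how protein and transcript lengths are weighed (here coding-sequence length is compared first, then transcript length), and the automated BLAST search is replaced by an in-memory list of hits.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A GenBank accession associated with a GeneID (hypothetical structure)."""
    accession: str
    mrna: str              # full transcript sequence
    cds: str               # annotated coding sequence
    partial: bool = False  # True if annotated as a partial protein


def pick_seed(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Choose the seed: skip partial proteins, then prefer the longest coding
    sequence and, as a tie-breaker, the longest transcript (an assumption)."""
    complete = [c for c in candidates if not c.partial]
    if not complete:
        return None
    return max(complete, key=lambda c: (len(c.cds), len(c.mrna)))


def extend_seed(seed: Candidate, hits: Iterable[Candidate]) -> Candidate:
    """Replace the seed with a longer mRNA only if its coding sequence is
    identical. `hits` stands in for the results of the BLAST step."""
    best = seed
    for hit in hits:
        if not hit.partial and hit.cds == seed.cds and len(hit.mrna) > len(best.mrna):
            best = hit
    return best
```

Requiring an identical coding sequence in `extend_seed` mirrors the condition in the text: a longer transcript is adopted only when it encodes the same protein, so extension changes the untranslated regions but not the product.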

If data conflicts are identified for the GenBank accession used to generate the RefSeq record, then they are resolved before the RefSeq can become public. The RefSeq record is generated using the sequence data from the GenBank accession and the annotation data from the in-house version of the Entrez Gene database. In addition, RefSeq records are subject to programmatic validation to identify annotation format errors and to provide annotation in a more consistent format. Entrez Gene information including the GeneID, cross-references to other databases, official nomenclature, alias symbols, alternate descriptive names, map location, and additional citations, including those submitted as GeneRIFs, is applied to the records. Records at this stage have a PROVISIONAL, PREDICTED, or INFERRED status.

Figure 3. How to recognize suppressed and redundant RefSeq records. (a) A standard text statement is included on the Entrez document summary for suppressed RefSeq records (red arrow). (b) If redundant RefSeq records are merged, then both accession numbers appear on the flat file ACCESSION line (green arrow). The first ACCESSION number listed is the primary identifier, and all other listed accessions are "secondary" accession numbers.

RefSeq staff prioritizes reviewing problems submitted by email or identified by in-house QA analysis. For example, analysis is carried out to identify sequences that include repeats, have poor-quality splice sites, align poorly to a high-quality genome, have no similar proteins, are very short or very long, or are extremely similar to sequences associated with a different GeneID. Review of problem sets may result in updating a RefSeq record, providing new RefSeq records, modifying sequence-to-gene associations, merging Entrez Gene records, or discontinuing a RefSeq, GeneID, or both. A RefSeq is suppressed if it is found to represent a transcribed repeat element, to be derived from the wrong organism (i.e., the GenBank sequence it was based on does not have an accurate organism annotation), or to not represent a "gene". Records that have been determined to represent an incomplete sequence, such as a partial protein sequence or incompletely spliced transcript, are temporarily suppressed until more complete sequence data can be represented in the RefSeq record. An Entrez query will still retrieve a suppressed record, with a disclaimer appearing on the query result document summary (Figure 3a), but the suppressed record is not included in the BLAST databases, in the calculation of related sequences, in the BLink display (BLink presents pre-computed protein BLAST results), or in RefSeq FTP releases. If a RefSeq is found to be redundant with another public RefSeq, then one is retained and the other becomes a secondary accession number (Figure 3b). If the sequences were associated with two different GeneIDs, then the GeneIDs are merged so that in Entrez Gene, a query with either of the original GeneIDs will retrieve the remaining single record.
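The primary and secondary accession convention for merged records (Figure 3b) can be read off the flat file ACCESSION line. The sketch below is a minimal illustration with a hypothetical function name; it assumes the whole ACCESSION field sits on a single whitespace-separated line and does not handle continuation lines.

```python
from typing import List, Tuple


def parse_accession_line(line: str) -> Tuple[str, List[str]]:
    """Split a flat file ACCESSION line into (primary, secondary accessions).
    The first accession listed is the primary identifier; any others are
    secondary accessions left over from merged records."""
    fields = line.split()
    if not fields or fields[0] != "ACCESSION":
        raise ValueError("not an ACCESSION line")
    accessions = fields[1:]
    if not accessions:
        raise ValueError("ACCESSION line lists no accession numbers")
    return accessions[0], accessions[1:]
```

For a line listing two accessions, the function returns the first as the primary identifier and a one-element list holding the second; a record that has never been merged yields an empty list of secondary accessions.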

Once problems are resolved, the curators review sequence alignments, the published literature, and internal and external databases, with the aim of finding the best representative nucleotide and protein sequence and annotation available at that time. The resulting information is propagated to both the RefSeq sequence record and to the Entrez Gene database. Box 2 lists additional detailed information concerning the type of errors corrected and the information added by the manual curation process. Review of individual transcript and protein records is carried out primarily by NCBI staff, but some sequences and annotations are provided by collaboration. The curation process also provides additional sequence records to represent splice variants when sufficient information about their full-length composition is available. Records that have undergone manual curation, either by NCBI staff or a collaborating group, have a validated or reviewed status. Note that for many genes, intermediate levels of manual curation may correct sequence misassociations, base the RefSeq on a more suitable GenBank record, and provide additional data to Entrez Gene before full review of a RefSeq sequence record.

We welcome input from the research community to improve the quality of RefSeqs. Interested parties are invited to contact us by sending an email to the NCBI Help Desk or by using our feedback form. See also the RefSeq website.

Extraction from GenBank

RefSeq representation of complete genome data for prokaryotes, organelles, and some eukaryotic genomes is provided by propagating whole-genome sequence data and annotation available in GenBank to a RefSeq record.

In general, these RefSeq records undergo an initial automated validation step before being released. The resulting record is a copy of a GenBank sequence, but processing may result in some corrections and more consistent feature annotation. The RefSeq record could then differ from the original GenBank record in details of the feature annotation format, names, publications, and cross-references to other databases including Entrez Gene. Of particular note is that the transcript is often provided as a separate RefSeq record for eukaryotic organisms. This is not done for GenBank genome submissions, which instantiate only the protein separately.

This process flow is supported by the Entrez Genome and Genome Project databases. The Entrez Genome Project database tracks whole-genome sequencing projects and other types of large-scale projects, and provides an overview of the organism as well as links to data and other resources. As new genome sequences are submitted to GenBank, the general status of that project is tracked in the Genome Project database. When the sequence becomes public for organelle or prokaryotic projects, the corresponding RefSeq record is generated automatically. Processing of eukaryotic genomes is more complex, largely because the volume of data is significant, and so these are processed by programs that are run manually.

The resulting genomic RefSeq data are represented in the Entrez Genome database, which holds the RefSeq collection of complete, or nearly complete, genomes and chromosomes that are generated at NCBI by GenBank extraction, collaboration, or the computational annotation pipeline. The Genome database is divided into several major taxonomic groups including: Archaea, Bacteria, Eukaryota, Viroids, Viruses, and Plasmids. The Entrez Genome website includes custom displays, analysis, and tools for prokaryotic and some eukaryotic genomes (see Table 4).

RefSeq record processing of GenBank genomic data falls into four primary categories: chromosomes, microbial genomes, small complete genomes, and viruses.

Chromosomes

RefSeq records in this category are usually submitted directly to Entrez Genomes as a complete chromosome sequence representing an assembly of individual clones that are themselves available in GenBank. For some genomes, such as Drosophila melanogaster, the RefSeq representation uses a unit of interest to the research community and limits size to a chromosome arm rather than the complete chromosome. RefSeq records may also be available for some genomes that are not yet fully sequenced but for which complete sequence is available for individual chromosomes. These records may be annotated by the NCBI computational annotation pipeline, or they may be curated by an organism-specific collaborating group and undergo NCBI validation before being released.

Microbial Genomes

Similar to chromosomes, complete microbial genomes are submitted to GenBank and are then automatically processed to create a RefSeq record. Microbial RefSeq records are not curated on an organism-by-organism basis but are subject to additional automatic validation and computational analysis. The vast increase in microbial genomes over the last few years has enabled the RefSeq group to build a system for the curation of protein clusters to unify protein and gene nomenclature across genomes. The manually curated annotation for the cluster of proteins is then applied to all complete microbial genomes that contain the protein of interest. Additional tools are used for the prediction and analysis of both coding regions and other genes such as tRNAs (tRNAscan-SE).

Small Complete Genomes

Smaller complete genomic sequences, including organelles, plasmids, and viruses, are based on single GenBank records. Automatic processing scans GenBank daily for complete genome updates and new submissions; the records it identifies are candidates for a complete genome RefSeq. These candidates are manually evaluated to make the final decision; if more than one genomic sequence is available for the genome, only one is selected to become the RefSeq standard. This selection takes into account various factors, including the level of annotation, strain information, and community input.

Access and Retrieval

RefSeq records can be accessed by direct query, BLAST, FTP download, or indirectly through links provided from several NCBI resources including Gene, Genome, Genome Project, and Map Viewer. In addition, RefSeq records are included in some computed resources, and therefore links may be found from those pages to individual RefSeq records. For example, the RefSeq collection is included in HomoloGene, UniGene, Clusters of Orthologous Groups of proteins (COG; Chapter 22) analysis, and in Conserved Domain Database (CDD; Chapter 3) analysis to identify proteins with similar domain architecture. Some links from Entrez databases to RefSeq records are based on Entrez Gene associations (e.g., links from OMIM; Chapter 7), whereas others are based on sequence similarity or RefSeq annotation content including links from PubMed. Links to RefSeq records may be found in the following resources:

The distinct accession number format used for RefSeq records (Table 1) makes it easy to spot links to RefSeq records from these and other NCBI resources. Several approaches to access and retrieve RefSeq records are described below.

Entrez Query Access

RefSeq records can be retrieved by querying various databases in the Entrez system (Chapter 15). All RefSeqs can be found in the Entrez Nucleotide or Protein databases, whereas queries in the Genome database retrieve the subset of the RefSeq collection that comprises complete genomic molecules. Together, Genome and Gene represent the entire RefSeq collection. See the RefSeq website for examples of Entrez queries. A subset is represented in Map Viewer. The genomic RefSeq is reported in the map, and the annotations are viewed in the Genes and Transcript maps.

Searching Nucleotide or Protein


Figure 4. Using Entrez Limits to restrict a query to RefSeq. Use the highlighted menu boxes to restrict the query to a genomic or mRNA sequence or to restrict the query to the RefSeq collection.

Table 6. Entrez queries to retrieve sets of RefSeq records

Query                           Accession prefix                   RefSeq category retrieved
srcdb_refseq[prop]              All RefSeq accessions              All
srcdb_refseq_known[prop]        NC_, AC_, NG_, NM_, NR_, NP_, AP_  REVIEWED, PROVISIONAL, PREDICTED, INFERRED, and VALIDATED
srcdb_refseq_reviewed[prop]     NC_, AC_, NG_, NM_, NR_, NP_, AP_  REVIEWED records
srcdb_refseq_validated[prop]    NC_, NM_, NR_, NP_                 VALIDATED records
srcdb_refseq_provisional[prop]  NC_, AC_, NG_, NM_, NR_, NP_, AP_  PROVISIONAL records
srcdb_refseq_predicted[prop]    NM_, NR_, NP_                      PREDICTED records
srcdb_refseq_inferred[prop]     AC_, AP_, NM_, NR_, NP_            INFERRED records
srcdb_refseq_model[prop] (a)    NT_, NW_, XM_, XR_, XP_, ZP_       Genome annotation model records

(a) Retrieves those NT_ and NW_ records that have gene annotation.
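
The relationship between accession prefixes and query properties in Table 6 can be written down as a small lookup. The following Python sketch is illustrative only: the function name candidate_properties and the dictionary layout are hypothetical, the prefix sets are transcribed from Table 6, and the regular expression assumes an accession of the form two uppercase letters, an underscore, digits, and an optional version suffix. Because status is a property of the individual record, the function can only report which properties might retrieve a given accession, not which one actually does.

```python
import re

# Accession prefixes retrieved by each status-specific property (transcribed from Table 6).
PREFIXES_BY_PROPERTY = {
    "srcdb_refseq_known[prop]": {"NC_", "AC_", "NG_", "NM_", "NR_", "NP_", "AP_"},
    "srcdb_refseq_reviewed[prop]": {"NC_", "AC_", "NG_", "NM_", "NR_", "NP_", "AP_"},
    "srcdb_refseq_validated[prop]": {"NC_", "NM_", "NR_", "NP_"},
    "srcdb_refseq_provisional[prop]": {"NC_", "AC_", "NG_", "NM_", "NR_", "NP_", "AP_"},
    "srcdb_refseq_predicted[prop]": {"NM_", "NR_", "NP_"},
    "srcdb_refseq_inferred[prop]": {"AC_", "AP_", "NM_", "NR_", "NP_"},
    "srcdb_refseq_model[prop]": {"NT_", "NW_", "XM_", "XR_", "XP_", "ZP_"},
}

# Assumed accession shape: two letters, underscore, digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2}_)\d+(?:\.\d+)?$")


def candidate_properties(accession):
    """Return the Table 6 properties whose prefix list includes this accession's prefix."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return []  # not in the assumed RefSeq accession format
    prefix = match.group(1)
    return [prop for prop, prefixes in PREFIXES_BY_PROPERTY.items() if prefix in prefixes]
```

An accession with an XP_ prefix, for instance, maps only to srcdb_refseq_model[prop], whereas an NM_ accession could be retrieved by any of the six curated-status properties.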

General queries in Entrez Nucleotide or Protein databases may retrieve a mixture of GenBank and RefSeq records. The query result display includes tabs to access specific result sets including, by default, a tab to access RefSeq-specific results. Details about the subsets to report via folder Tabs can be configured using the My NCBI interface. Alternatively, queries can be restricted to the RefSeq collection by using the Limited to: settings (Figure 4) or by entering a query restriction using the property fields listed in Table 6. In addition, both the Limited to: settings and property field terms can also be used to restrict the query by type of molecule (e.g., DNA versus mRNA) and other parameters. See Entrez Help and the RefSeq Query help page for further details.
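
As a rough illustration of the property-field approach, the Python sketch below assembles a query string restricted to the RefSeq collection. The helper name restrict_to_refseq is hypothetical; the status names come from the property terms in Table 6. The search term BRCA1[gene] AND human[orgn] and the molecule-type term biomol_mrna[prop] are assumed examples and should be checked against Entrez Help before use.

```python
# Status suffixes taken from the property names listed in Table 6.
REFSEQ_STATUSES = {
    "known", "reviewed", "validated", "provisional",
    "predicted", "inferred", "model",
}


def restrict_to_refseq(term, status=None, extra_terms=()):
    """Combine a search term with a RefSeq property restriction.

    status=None uses srcdb_refseq[prop] (all RefSeq records); otherwise the
    status-specific property srcdb_refseq_<status>[prop] is used.
    """
    if status is None:
        prop = "srcdb_refseq[prop]"
    elif status in REFSEQ_STATUSES:
        prop = f"srcdb_refseq_{status}[prop]"
    else:
        raise ValueError(f"unknown RefSeq status: {status!r}")
    # Parenthesize the user's term so an OR inside it is not split by the added ANDs.
    return " AND ".join([f"({term})", prop, *extra_terms])


# Assumed example: reviewed RefSeq mRNA records for one human gene.
query = restrict_to_refseq(
    "BRCA1[gene] AND human[orgn]",
    status="reviewed",
    extra_terms=["biomol_mrna[prop]"],  # assumed molecule-type term
)
```

The resulting string can be pasted into the Entrez Nucleotide search box in place of using the Limits settings.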

Searching Genome and the Genome Project

RefSeq records in the Genome or Genome Project databases can be retrieved using an accession number for a complete genomic molecule (NC_ accession prefix) or organism name. The Genome Project database can also be queried using the property restriction “srcdb_refseq[prop]”.

Searching Gene

The majority of the RefSeq collection is represented in Entrez Gene, a gene-centered database (Chapter 19); RefSeq records representing assembled environmental samples (with an NS_ accession prefix) are not included in Gene but can be found in the Genome and Nucleotide databases. Additional organisms, records, and associated data continue to be added to Entrez Gene and RefSeq over time as new data become available.

Genes with specific RefSeq accessions can be retrieved by querying with the RefSeq accession number. A more general query to retrieve Genes with associated RefSeq records can be carried out by using the property restriction "srcdb_refseq[prop]". For example, a query such as “abcc*[sym] AND srcdb_refseq_known[prop]” finds members of a gene family that share a common name root and for which there are RefSeq records. RefSeq to Gene connections are also provided by direct links; RefSeq records include a link to the Entrez Gene report page via the GeneID dbXref link on the gene and CDS features (Figure 1). Gene reports the RefSeq accession numbers in the RefSeq section of the report, with links to the Nucleotide or Protein records. Gene reports may also include a graphical depiction of genome annotation data, as represented in the Map Viewer resource, in the Genomic regions, transcripts, and products section, with links to Nucleotide and Protein displays. When this graphical section is provided, an additional report is available with details about exon and intron boundaries and lengths. To access this report, change the display format from Full Report to Gene Table.
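
The same Gene query can also be issued programmatically. The sketch below is an assumption rather than something described in this chapter: it builds a request URL for the NCBI E-utilities esearch service, which is a separate interface to Entrez, and the function name gene_esearch_url and the retmax default are illustrative choices. The code only constructs the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# E-utilities esearch endpoint (assumed here; not covered in this chapter).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_esearch_url(term, retmax=20):
    """Build an esearch URL that runs `term` against the Gene database."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    # urlencode percent-escapes the brackets and asterisk and turns spaces into "+".
    return f"{ESEARCH_URL}?{urlencode(params)}"


# The gene-family example from the text: symbols starting with "abcc"
# that have associated RefSeq records.
url = gene_esearch_url("abcc*[sym] AND srcdb_refseq_known[prop]")
```

Fetching the URL returns a list of matching GeneIDs, which can then be used to open the corresponding Gene reports.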

Entrez Gene query results and gene reports indicate when a RefSeq is available, with links provided to the nucleotide and protein sequences and to related resources, including the Map Viewer, BLink (pre-computed protein alignments), and the Conserved Domain Database (CDD). The process of RefSeq curation also expands the data available in Entrez Gene by providing a range of information, including:

  • alternate names

  • Enzyme Committee numbers

  • gene summaries

  • publications

  • related GenBank accessions

  • transcript variant descriptions

BLAST


Figure 5. (a) RefSeq records are included in NCBI BLAST databases. In a BLAST summary list of results, the abbreviation ref identifies records that are provided by the RefSeq collection. Accessions that are included in the NCBI resources UniGene, GEO, and Gene are linked to those resources via the colored icons for U, E, and G, respectively. (b) The Genome View button is provided when BLAST results can be viewed in context of the graphical Map Viewer display.

RefSeq transcript and protein records are included in the non-redundant nucleotide and protein BLAST databases, and genomic sequences are included in the "chromosome" database; therefore, when a query sequence matches a RefSeq record, the hit is included in the BLAST results (see Figure 5). Accessions in the results set, whether RefSeq or GenBank, that are associated with GeneIDs are indicated by a small blue G icon that is linked to the Gene report. Additional organism-specific BLAST pages provide access to custom databases for querying against the assembled genome or other datasets. The set of supported custom databases varies by organism. These custom BLAST pages can be accessed via the Map Viewer, Genome Project reports, or the Genomic Biology webpage. For example, several species-specific genome BLAST pages provide access to query the genome assembly, transcripts, or proteins and may include options to query against additional custom databases such as sequence data from the Trace archive, clones, or ab initio predictions. As illustrated in Figure 5b, BLAST results for queries against assembled genome sequence data that are available in the Map Viewer include a button called Genome View that provides access to a custom view in the Map Viewer, where BLAST hits are displayed in the context of the genome.
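
When BLAST results are processed as text, the ref abbreviation can be used to pick out RefSeq hits. The Python sketch below assumes that sequence identifiers are written as pipe-delimited strings in which a database tag is followed by the accession (for example, gi|...|ref|...|); that layout is an assumption about the output format, not something specified in this chapter, and the function name and sample identifiers are made up for illustration.

```python
def refseq_accession(seq_id):
    """Return the accession that follows a 'ref' tag, or None if there is none.

    Assumes a pipe-delimited identifier such as 'gi|12345|ref|NM_000001.1|'.
    """
    fields = seq_id.strip().lstrip(">").split("|")
    # Walk over consecutive (tag, value) pairs and stop at the first 'ref' tag.
    for tag, value in zip(fields, fields[1:]):
        if tag == "ref" and value:
            return value
    return None


# Made-up identifiers: one RefSeq hit and one GenBank ('gb') hit.
hits = ["gi|12345|ref|NM_000001.1|", "gi|67890|gb|U00001.1|"]
refseq_hits = [acc for acc in map(refseq_accession, hits) if acc is not None]
```

Filtering a hit list this way separates RefSeq records from GenBank records that match the same query sequence.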

FTP

RefSeq data are available in three FTP areas. Configured RefSeq BLAST databases are available for download from the BLAST FTP site; separate databases are provided for genomic, transcript, and protein records. Organism-specific subsets are provided in the genomes FTP site. This area includes RefSeq records that are generated by or used in Map Viewer and Entrez Genomes processing. The full RefSeq collection is available in the RefSeq FTP site, with the exception of the NS_ accession series representing environmental sample records. The RefSeq collection is provided as comprehensive bi-monthly releases, supplemented by daily updates for records that are new or updated between RefSeq release cycles. The comprehensive release provides data in multiple file formats, including flat file and FASTA, and organizes the data into primary taxonomic groups in addition to the complete dataset. In addition, a small number of subdirectories provide weekly comprehensive releases of the transcript and protein RefSeq data for organisms of high interest that have frequent updates of curated records, such as human, mouse, and rat. Information about each RefSeq release is documented in the release-notes subdirectory of the RefSeq FTP site; the availability of new releases is announced on the RefSeq website and to subscribers of the refseq-announce email list.
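
Once a FASTA file from a release has been downloaded, a quick tally by accession prefix gives a rough check of what it contains. The sketch below is a minimal example under stated assumptions: it takes any iterable of text lines (such as an open file handle), assumes that the first whitespace-delimited token of each definition line is the accession, and uses made-up sample records. The function name count_by_prefix is hypothetical.

```python
from collections import Counter


def count_by_prefix(fasta_lines):
    """Count FASTA records by accession prefix (the text up to and including '_')."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a definition line
        tokens = line[1:].split()
        if not tokens:
            continue  # empty definition line
        prefix, sep, _ = tokens[0].partition("_")
        # Identifiers without an underscore are grouped under "other".
        counts[prefix + sep if sep else "other"] += 1
    return counts


# Made-up records for illustration only.
sample = [
    ">NM_000001.1 example transcript",
    "ACGTACGT",
    ">NP_000001.1 example protein",
    "MKVL",
    ">NM_000002.2 another transcript",
    "ACGT",
]
counts = count_by_prefix(sample)
```

For a real file, the same function can be given an open text-mode file handle instead of a list.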

Related Reading
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607-2618; 2001. [PubMed].
Blake JA, Eppig JT, Richardson JE, Davisson MT. The Mouse Genome Database (MGD): expanding genetic and genomic resources for the laboratory mouse. The Mouse Genome Database Group. Nucleic Acids Res 28:108-111; 2000. [PubMed].
Boguski MS, Schuler GD. ESTablishing a human transcript map. Nat Genet 10:369-371; 1995. [PubMed].
Coffin JM, Hughes SH, Varmus HE. Retroviruses. Plainview (NY): Cold Spring Harbor Laboratory Press; 1997.
FlyBase Consortium. The FlyBase database of the Drosophila Genome Projects and community literature. The FlyBase Consortium. Nucleic Acids Res 27:85-88; 1999. [PubMed].
Hamosh A, Scott AF, Amberger J, Valle D, McKusick VA. Online Mendelian Inheritance in Man (OMIM). Hum Mutat 15:57-61; 2000. [PubMed].
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25:955-964; 1997. [PubMed].
Lukashin A, Borodovsky M. GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26:1107-1115; 1998. [PubMed].
Marchler-Bauer A, Panchenko AR, Shoemaker BA, Thiessen PA, Geer LY, Bryant SH. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res 30:281-283; 2002. [PubMed].
Pruitt KD, Katz KS, Sicotte H, Maglott DR. Introducing RefSeq and Entrez Gene: curated human genome resources at the NCBI. Trends Genet 16:44-47; 2000. [PubMed].
Tatusova TA, Karsch-Mizrachi I, Ostell JA. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics 15:536-543; 1999. [PubMed].
Twigger S, Lu J, Shimoyama M, Chen D, Pasko D, Long H, Ginster J, Chen CF, Nigam R, Kwitek A, Eppig J, Maltais L, Maglott D, Schuler G, Jacob H, Tonellato PJ. Rat Genome Database (RGD): mapping disease into the genome. Nucleic Acids Res 30:125-128; 2002. [PubMed].
Westerfield M, Doerry E, Kirkpatrick AE, Douglas SA. Zebrafish informatics and the ZFIN database. Methods Cell Biol 60:339-355; 1999. [PubMed].
White JA, McAlpine PJ, Antonarakis S, Cann H, Eppig JT, Frazer K, Frezal J, Lancet D, Nahmias J, Pearson P, Peters J, Scott A, Scott H, Spurr N, Talbot C Jr, Povey S. Guidelines for human gene nomenclature (1997). HUGO Nomenclature Committee. Genomics 45:468-471; 1997. [PubMed].