Introduction
The goal of the NCBI RefSeq project is to provide an accurate, non-redundant, and comprehensive collection of naturally occurring DNA, RNA, and protein molecules for major organisms. The collection explicitly connects nucleotide and protein sequences that are related. Ideally, all molecule types will be available for each well-studied organism; however, because the state of public sequence data varies from genome to genome, the level of available information for different organisms at any given time also varies. Intermediate records are provided for some organisms when the genomic sequence is not complete. Although rearrangements and mutations do occur naturally, the goal is to represent sequence standards that are considered to be the predominant "normal" version of the sequence.
RefSeq provides a non-redundant framework of information to facilitate database searches, whether they are searched via genomic location, sequence, or text annotation. RefSeq represents an objective and experimentally verifiable definition of non-redundancy by providing one example of each natural biological molecule per organism. The collection may include alternatively spliced transcripts that share some exons, identical proteins expressed from alternatively spliced transcripts, paralogs, and homologs. Additional records are provided to represent alternate haplotypes or strains for some organisms.
RefSeq is unique in providing a large, multi-species, curated sequence database that explicitly links chromosome, transcript, and protein information. It establishes a baseline for integrating a large body of diverse data, including sequence, genetic, expression, and functional information, into a single, consistent framework with a uniform set of conventions and standards. RefSeq is substantially based on sequence records submitted to public archival databases. Hereafter, the shorter term "GenBank" (Chapter 1) is used to indicate the full set of archival sequence data that is submitted to, and redistributed by, the three collaborating databases: the European Molecular Biology Laboratory (EMBL), the DNA Data Bank of Japan (DDBJ), and GenBank. Note that although based upon GenBank, RefSeq is distinct from GenBank and is not included in the GenBank database. GenBank is an archive of sequences and annotations supplied by original authors and cannot be altered by others. RefSeq differs from GenBank in the same way that a review article differs from a related collection of primary research articles on the same subject. Each RefSeq represents a synthesis by a person or group of the primary information that was generated and submitted by others. Other organizing principles or standards of judgment are possible, which is why the work is attributed to the synthesizing "editors". The RefSeq dataset is curated on an ongoing basis by collaborating groups and by NCBI staff. Sequence records are presented in a standard format and subjected to computational validation. The GenBank source of the RefSeq record, curation status, and attribution to the curation group are also indicated.
RefSeq standards support genome annotation, gene characterization, mutation analysis, expression studies, and polymorphism discovery. The RefSeq collection supports the following:
- easy identification of sequence standards for genomes, transcripts, or proteins
- genome annotation
- comparative genomics
- reduction of redundancy in clustering approaches
- a foundation for unambiguous association of functional information (supports navigation)
Database Content: Background
Table 1. The RefSeq accession number format and molecule types
| Accession prefix | Molecule type | Comment |
| --- | --- | --- |
| AC_ | Genomic | Complete genomic molecule, alternate assembly |
| NC_ | Genomic | Complete genomic molecule, reference assembly |
| NG_ | Genomic | Incomplete genomic region |
| NT_ | Genomic | Contig or scaffold, clone-based or WGS |
| NW_ | Genomic | Contig or scaffold, primarily WGS |
| NS_ | Genomic | Environmental sequence |
| NZ_ | Genomic | Unfinished WGS |
| NM_ | mRNA | |
| NR_ | RNA | |
| XM_ | mRNA | Predicted model |
| XR_ | RNA | Predicted model |
| AP_ | Protein | Annotated on AC_ alternate assembly |
| NP_ | Protein | |
| YP_ | Protein | |
| XP_ | Protein | Predicted model |
| ZP_ | Protein | Predicted model, annotated on NZ_ genomic records |
The September 2006 RefSeq collection includes sequences from more than 3,700 distinct taxonomic identifiers, ranging from viruses to bacteria to eukaryotes. It represents chromosomes, organelles, plasmids, viruses, transcripts, and more than 2,800,000 proteins. Every sequence has a stable accession number, a version number, and an integer identifier (gi) assigned to it. Outdated versions are always available if a sequence is updated.
RefSeq records can be distinguished from GenBank records by the inclusion of an underscore ("_") in the accession identifier. The RefSeq accession prefix has an implied meaning in terms of the type of molecule it represents. Table 1 indicates the types of sequence molecules and the corresponding RefSeq accession number formats. See also the RefSeq website.
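The prefix convention in Table 1 lends itself to a simple lookup. The following is a minimal sketch, not an NCBI tool: the function name `classify_accession` and the regular expression are assumptions made for illustration, and the accessions used with it below are examples only. It treats any identifier without a recognized two-letter prefix and underscore as a non-RefSeq accession.

```python
import re

# Prefix -> molecule type, transcribed from Table 1.
REFSEQ_PREFIXES = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic", "NT": "Genomic",
    "NW": "Genomic", "NS": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "YP": "Protein",
    "XP": "Protein", "ZP": "Protein",
}

# Assumed layout: two-letter prefix, underscore, identifier, optional ".version".
_ACCESSION = re.compile(r"^([A-Z]{2})_([A-Z0-9]+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return (prefix, molecule_type, version) or None if not RefSeq-like."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    prefix, _identifier, version = match.groups()
    if prefix not in REFSEQ_PREFIXES:
        return None
    return prefix, REFSEQ_PREFIXES[prefix], int(version) if version else None
```

For instance, `classify_accession("NC_003197")` reports a genomic record with no version given, while an identifier without an underscore returns `None`.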
Updates
RefSeq updates are provided daily. These include records added to the collection and records updated to reflect sequence or annotation changes, including complete re-annotation of a genome. New and updated records are made available in Entrez and BLAST databases as soon as possible. The FTP site also provides daily update information (see below).
Assembling and Maintaining the RefSeq Collection
Summary
Table 2. RefSeq status codes
| Status code | Description |
| --- | --- |
| GENOME ANNOTATION | The RefSeq record is provided via automated processing and is not subject to individual review or revision between builds. |
| INFERRED | The RefSeq record has been predicted by genome sequence analysis, but it is not yet supported by experimental evidence. The record may be partially supported by homology data. |
| PREDICTED | The RefSeq record has not yet been subject to individual review, and some aspect of the RefSeq record is predicted. |
| PROVISIONAL | The RefSeq record has not yet been subject to individual review. The initial sequence-to-gene name associations have been established by outside collaborators or NCBI staff. |
| REVIEWED | The RefSeq record has been reviewed by NCBI staff or by a collaborator. The NCBI review process includes assessing available sequence data and the literature. Some RefSeq records may incorporate expanded sequence and annotation information. |
| VALIDATED | The RefSeq record has undergone an initial review to provide the preferred sequence standard. The record has not yet been subject to final review, at which time additional functional information may be provided. |
| WGS | The RefSeq record is provided to represent a collection of whole genome shotgun sequences. These records are not subject to individual review or revisions between genome updates. |
The RefSeq database is the result of data extraction from GenBank, curation, and computation, combined with extensive collaboration with authoritative groups. Each molecule is annotated as accurately as possible with the organism name, strain (or breed, ecotype, cultivar, or isolate), gene symbol for that organism, and informative protein name. Collaborations with authoritative groups outside of NCBI provide a variety of information, ranging from curated sequence data, nomenclature, and feature annotations to links to external organism-specific resources. When no collaboration has been established, the NCBI staff assembles the data from GenBank. Each record has a comment indicating the level of curation that it has received (Table 2) and attribution of the collaborating group. Thus, a RefSeq record can include either an essentially unchanged, validated copy of the original GenBank record or corrected or additional information that has been added by collaborators or experts at NCBI.
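Because each record carries a comment stating its curation level, the status code can be read programmatically. The sketch below assumes that the comment text begins with the status code followed by the word "REFSEQ" (for example, "REVIEWED REFSEQ: ..."); that layout is an assumption about the flat file and is not specified in this chapter. The function name `refseq_status` is hypothetical.

```python
import re

# Status codes as listed in Table 2.
STATUS_CODES = (
    "GENOME ANNOTATION", "INFERRED", "PREDICTED", "PROVISIONAL",
    "REVIEWED", "VALIDATED", "WGS",
)

# Assumed comment layout: "<STATUS> REFSEQ: free text ...".
_STATUS = re.compile(
    r"^\s*(" + "|".join(STATUS_CODES) + r")\s+REFSEQ\b", re.IGNORECASE
)


def refseq_status(comment):
    """Return the Table 2 status code found at the start of a comment, or None."""
    match = _STATUS.match(comment)
    return match.group(1).upper() if match else None
```

A comment that does not start with a recognized status code yields `None`, so callers can tell unlabeled records apart from labeled ones.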
When a molecule is represented by multiple sequences for an organism in GenBank, NCBI staff make an effort to select the "best" sequence to be presented as a RefSeq. The goal is to avoid known mutations, sequencing errors, cloning artifacts, and erroneous annotation. RefSeqs identified with a problem of this type are corrected. Sequences are validated to confirm that the genomic sequence corresponding to an annotated mRNA feature matches the mRNA sequence record, and that coding region features can be translated into the corresponding protein sequence.
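The translation check described above can be illustrated with a small, self-contained sketch. It is not the NCBI validation code: it assumes the standard genetic code, a single uninterrupted coding region given as 1-based inclusive coordinates, and hypothetical function names (`translate`, `cds_matches_protein`). The sequences in the usage note are invented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(cds):
    """Translate a coding sequence; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3:
        raise ValueError("CDS length is not a multiple of three")
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )


def cds_matches_protein(mrna, start, end, protein):
    """Check that mrna[start..end] (1-based, inclusive) encodes the protein."""
    translated = translate(mrna[start - 1:end])
    if translated.endswith("*"):
        translated = translated[:-1]  # drop the terminal stop codon
    # An internal stop codon means the feature cannot encode the protein.
    return "*" not in translated and translated == protein.upper()
```

With the invented transcript `GGATGGCCTAAGG` and a coding region at positions 3 to 11, the check passes for the protein `MA` and fails for any other sequence.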
Working groups using distinct process pipelines compile the RefSeq collection for different organisms. RefSeq records are provided via several distinct approaches including:
Collaboration
Table 3. Selected examples of collaborator-contributed RefSeq records
| Organism | Collaborating group |
| --- | --- |
| Saccharomyces cerevisiae | Saccharomyces Genome Database (SGD) |
| Arabidopsis thaliana | The Arabidopsis Information Resource (TAIR) |
| Pseudomonas aeruginosa | Pseudomonas aeruginosa Community Annotation Project (PseudoCAP) |
| Caenorhabditis elegans | WormBase |
| Drosophila melanogaster | FlyBase |
RefSeq welcomes collaborations with authoritative groups outside NCBI that are willing to provide sequences, nomenclature, annotation, or links to phenotypic or organism-specific resources. The RefSeq feedback form can be used to provide corrections or to initiate collaboration. For some species, the RefSeq collection is curated entirely by a collaborating authoritative group that provides both the sequences and annotation. For others, most notably the human and mouse RefSeq collections, there have been numerous collaborations with individual scientists, for either specific genes or complete gene families. Other collaborations may cover a set of organisms. For example, a Viral Genome Advisory group has been established to support curation of the viral RefSeq collection. Thus, RefSeq records may contain information provided by an external authoritative source and/or analyses and curation at NCBI. The collaborating group is identified on RefSeq records. Table 3 lists some examples of annotated genomes provided by this method. Also see the RefSeq website.
If RefSeq records are supplied by a collaborating group, processing may be automated in that data are periodically downloaded, validated to detect errors, and modified to add such annotation as cross-references to Entrez Gene. The validation process checks for logical conflicts in the annotation and makes small changes to format the submission as a RefSeq record. In such cases, NCBI does not directly curate the annotation data or make sequence changes; conflicts or problems that are identified either by the validation process or received by email from researchers are reported back to the submitting group. As the collaborating group supplies updates, changes are reflected in the RefSeq collection for that organism.
Computational Genome Annotation Pipeline
NCBI computes annotation of genomic sequence data for some genomes, including some microbial species, human, mouse, rat, cow, chimpanzee, dog, zebrafish, and honey bee. The annotation pipeline is automated and yields genomic, transcript, and protein RefSeq records. Names annotated on the transcript and protein products are based on sequence similarity. Annotation data are refreshed periodically, and records that are generated from this process flow are not subject to individual incremental updates or manual curation (see Table 1; see Chapter 14 for more information on the eukaryotic genome annotation pipeline). For some species (including human), records may be provided by a mixture of methods. In other words, there may be a set of curated transcript and protein records in addition to a set of records that were generated computationally.
RefSeq records that are processed by NCBI's pipeline are displayed in the NCBI Map Viewer (Chapter 20), included in Entrez Gene, and are also available in the main Entrez sequence databases.
Curation by NCBI Staff
A portion of the RefSeq dataset is supported by NCBI curation staff. This subset includes viral, vertebrate, and some invertebrate organisms. Most bacterial, plant, and fungal records are provided either by collaboration or by processing the annotated genome data submitted to GenBank; however, a small number of bacterial genomes are annotated and curated by NCBI staff. Viral genomes are curated in consultation with a viral board of advisors.
Curation of Viral, Mitochondrial, and Bacterial Records
Table 5. Selected examples of REVIEWED viral, microbial, and small genomes
| Accession | Genome |
| --- | --- |
| NC_003197 | Salmonella typhimurium LT2 |
| NC_003277 | Salmonella typhimurium LT2 plasmid pSLT |
| NC_001802 | Human immunodeficiency virus 1 |
| NC_000907 | Haemophilus influenzae Rd KW20 |
| NC_004718 | SARS coronavirus |
The RefSeq records curated manually at NCBI are annotated with a status of REVIEWED or VALIDATED in the RefSeq comment block (see Table 5). For example, viral genomes are re-annotated using GeneMarkS in collaboration with Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology. The GeneMarkS results are then manually reviewed by NCBI staff. Viral annotation also relies on an established Viral Genome Advisors group and other experts. For example, the HIV-1 RefSeq was curated by NCBI staff in collaboration with the authors of the book Retroviruses. The NCBI curators expanded the mature peptide annotation for several viruses, including the poliovirus and hepatitis C, based on observations reported in the literature that had not been included in a GenBank submission. Metazoan mitochondrial records are curated in accordance with established protein and gene nomenclature.
Curation of Vertebrate and Invertebrate Records
Curation of higher eukaryotic organisms is primarily focused on the mammalian genomes, especially the human and mouse. The RefSeq process flow for these organisms provides transcripts, proteins, and some genomic regions. Records of some genomic regions are provided to facilitate genome-wide annotation and may represent gene clusters, single genes, or pseudogenes. Processing integrates the official nomenclature and other information including alternate names, Gene Ontology (GO) terms, and GeneRIFs available in Entrez Gene. Multiple collaborations support the collection of this descriptive information (Box 1; see also Chapter 19).
Figure 2. RefSeq Processing Pipelines. Once sequence data are deposited in the public archival databases, they are available for RefSeq processing. Processing pipelines include the vertebrate curation pipeline, the computational genome annotation pipeline, and extraction from GenBank. These pipelines generate new and updated RefSeq records that become publicly available in the Entrez Nucleotide, Protein, and Gene databases. (A) Once a gene is defined and associated with sufficient sequence information in an internal curation database, it can be pushed into the RefSeq pipeline. The RefSeq process is initiated by an automated BLAST step, which uses the stored sequence data as a query against GenBank to identify the longest mRNA for each locus. This initial RefSeq record has a status of provisional, predicted, or inferred. Subsequent curation may result in a sequence or annotation update (as described in Box 2) and a status of validated or reviewed. Records are updated if the underlying GenBank accession number is updated or if other associated data are updated, including nomenclature, publications, or map location. (B) Available RefSeq and GenBank data are aligned to an assembled genome, ab initio gene prediction that uses the alignment data is carried out, and an analysis program integrates all available data to define the annotation models. New "model" RefSeq records are generated by this pipeline. (C) When a complete, annotated genome becomes available in GenBank, a set of corresponding RefSeq records is generated by duplicating the GenBank submission, followed by validation and addition of cross-references to Entrez Gene (via a dbXref citing the GeneID) and, in some cases, more informative and standardized protein names.
Sequences enter the RefSeq curation process flow by a combination of computational analysis, collaboration, and in-house curation. As illustrated in Figure 2, generation of the initial RefSeq record is dependent on identifying a representative sequence for a gene. New genes and sequence data are added to the collection by collaborators and by NCBI-based analyses that mine information from UniGene, cDNA alignments, and GenBank (Box 1). Quality assessment (QA) processes are executed regularly to flag questionable data for review. This QA checks for errors and conflicts in nomenclature, sequence similarity and genomic placement, and potential cloning errors (e.g., chimeras) and leverages data from other NCBI resources including HomoloGene, Map Viewer, and GenBank related sequences.
Once sequence data are unambiguously associated with a GeneID, the data may be propagated into a RefSeq record. The completeness of the sequence information and the category of the gene (e.g., protein coding, pseudogene) determine whether a RefSeq will be made, and if so, of what type (DNA, RNA, mRNA plus protein). Reviewed RefSeq records are not made for transposable elements or those loci for which the product type is uncertain (e.g., protein coding or not). Also, a RefSeq may not be provided when the associated data are known to be incomplete. For example, if accessions for a protein-coding locus are annotated to indicate that the protein is a partial sequence, then automatic processing to provide the RefSeq transcript and protein does not occur. It should be noted, however, that the RefSeq collection does include partial transcripts and proteins that are provided by collaborating groups or when the RefSeq is based on annotated whole-genome submissions in GenBank.
RefSeq processing for non-protein-coding RNA loci uses the longest defining transcript record that has been associated with the GeneID. For non-transcribed loci (such as non-transcribed pseudogenes), the RefSeq record is often derived from a defined region of a large genomic sequence with no indication of exon substructure. Curation of these types of records is minimal because the current focus is on curation of protein-coding loci; however, these records provide an important reagent for the computational annotation pipeline and support annotation of non-protein-coding genes that might otherwise be missed or misrepresented as a predicted protein-coding gene. Other records are provided to represent larger genomic regions including gene clusters, genes requiring rearrangement to express a product (immunoglobulins and T-cell receptors), and haplotypes with known differences in gene content. These genomic region records are annotated by the curation staff, often in collaboration with scientific experts, and are not provided by automatic processing.
For protein-coding loci, an initial "seed" sequence is selected from the set of accessions associated with a given GeneID based on protein and transcript lengths. Sequences that are annotated as partial proteins are not considered. Because we have only a subset of sequence data stored for a GeneID, an automated process checks for additional sequence data that may be a more complete representative of the transcript; the selected seed sequence is used as a query sequence for automated BLAST analysis, and if a longer mRNA is identified (with an identical coding sequence), then that sequence is used to provide the RefSeq record. This stage of the process associates additional accessions with the GeneID and also includes detection of conflicts and problems, including sequence-to-locus association ambiguity and vector contamination.
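The seed selection and extension steps above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: the `Transcript` class and both function names are hypothetical, coding-sequence length stands in for protein length, and the `hits` argument stands in for the results of the automated BLAST step, which is not run here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    accession: str         # hypothetical accession label
    mrna: str              # full transcript sequence
    cds: str               # coding sequence contained in the transcript
    partial: bool = False  # True when annotated as a partial protein


def pick_seed(candidates):
    """Choose the seed: ignore partial records, prefer longer CDS, then mRNA."""
    complete = [t for t in candidates if not t.partial]
    if not complete:
        return None
    return max(complete, key=lambda t: (len(t.cds), len(t.mrna)))


def extend_seed(seed, hits):
    """Replace the seed with a longer mRNA only if its CDS is identical."""
    best = seed
    for hit in hits:
        same_cds = hit.cds == seed.cds
        if not hit.partial and same_cds and len(hit.mrna) > len(best.mrna):
            best = hit
    return best
```

In this sketch a longer transcript with a different coding sequence never displaces the seed, which mirrors the requirement that the replacement carry an identical coding sequence.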
If data conflicts are identified for the GenBank accession used to generate the RefSeq record, then they are resolved before the RefSeq can become public. The RefSeq record is generated using the sequence data from the GenBank accession and the annotation data from the in-house version of the Entrez Gene database. In addition, RefSeq records are subject to programmatic validation to identify annotation format errors and to provide annotation in a more consistent format. Entrez Gene information including the GeneID, cross-references to other databases, official nomenclature, alias symbols, alternate descriptive names, map location, and additional citations, including those submitted as GeneRIFs, is applied to the records. Records at this stage have a PROVISIONAL, PREDICTED, or INFERRED status.
Figure 3. How to recognize suppressed and redundant RefSeq records. (a) A standard text statement is included on the Entrez document summary for suppressed RefSeq records (red arrow). (b) If redundant RefSeq records are merged, then both accession numbers appear on the flat file ACCESSION line (green arrow). The first ACCESSION number listed is the primary identifier, and all other listed accessions are "secondary" accession numbers.
RefSeq staff prioritizes reviewing problems submitted by email or identified by in-house QA analysis. For example, analysis is carried out to identify sequences that include repeats, have poor-quality splice sites, align poorly to a high-quality genome, have no similar proteins, are very short or very long, or that are extremely similar to sequences associated with a different GeneID. Review of problem sets may result in updating a RefSeq record, providing new RefSeq records, modifying sequence-to-gene associations, merging Entrez Gene records, or discontinuing a RefSeq, GeneID, or both. A RefSeq is suppressed if it is found to represent a transcribed repeat element, to be derived from the wrong organism (i.e., the GenBank sequence it was based on does not have an accurate organism annotation), or to not represent a "gene". Records that have been determined to represent an incomplete sequence, such as a partial protein sequence or incompletely spliced transcript, are temporarily suppressed until more complete sequence data can be represented in the RefSeq record. An Entrez query will still retrieve a suppressed record, with a disclaimer appearing on the query result document summary (Figure 3a), but the suppressed record is not included in the BLAST databases, in the calculation of related sequences, in the BLink display (BLink provides pre-computed protein BLAST results), or in RefSeq FTP releases. If a RefSeq is found to be redundant with another public RefSeq, then one is retained and the other becomes a secondary accession number (Figure 3b). If the sequences were associated with two different GeneIDs, then the GeneIDs are merged so that in Entrez Gene, a query with either of the original GeneIDs will retrieve the remaining single record.
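The primary and secondary accessions of a merged record can be separated by reading the flat file ACCESSION line, where the first accession listed is the primary identifier. The sketch below assumes a single-line, whitespace-separated ACCESSION field; the function name `parse_accession_line` and the accession numbers in the usage note are hypothetical.

```python
def parse_accession_line(line):
    """Split a flat file ACCESSION line into (primary, [secondary, ...])."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "ACCESSION":
        raise ValueError("not an ACCESSION line: %r" % line)
    # The first accession is the primary identifier; the rest are secondary.
    return fields[1], fields[2:]
```

For example, `parse_accession_line("ACCESSION   NM_000001 NM_000002")` yields `NM_000001` as the primary accession and `NM_000002` as its only secondary accession.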
Once problems are resolved, the curators review sequence alignments, the published literature, and internal and external databases, with the aim of finding the best representative nucleotide and protein sequence and annotation available at that time. The resulting information is propagated to both the RefSeq sequence record and the Entrez Gene database. Box 2 lists additional detailed information concerning the type of errors corrected and the information added by the manual curation process. Review of individual transcript and protein records is carried out primarily by NCBI staff, but some sequences and annotations are provided by collaboration. The curation process also provides additional sequence records to represent splice variants when sufficient information about their full-length composition is available. Records that have undergone manual curation, either by NCBI staff or a collaborating group, have a validated or reviewed status. Note that for many genes, intermediate levels of manual curation may correct sequence misassociations, base the RefSeq on a more suitable GenBank record, and provide additional data to Entrez Gene before full review of a RefSeq sequence record.
We welcome input from the research community to improve the quality of RefSeqs. Interested parties are invited to contact us by sending an email to the NCBI Help Desk (info@ncbi.nlm.nih.gov) or by using our feedback form. See also the RefSeq website.
Access and Retrieval
RefSeq records can be accessed by direct query, BLAST, FTP download, or indirectly through links provided from several NCBI resources including Gene, Genome, Genome Project, and Map Viewer. In addition, RefSeq records are included in some computed resources, and therefore links may be found from those pages to individual RefSeq records. For example, the RefSeq collection is included in HomoloGene, UniGene, Clusters of Orthologous Groups of proteins (COG; Chapter 22) analysis, and in Conserved Domain Database (CDD; Chapter 3) analysis to identify proteins with similar domain architecture. Some links from Entrez databases to RefSeq records are based on Entrez Gene associations (e.g., links from OMIM; Chapter 7), whereas others are based on sequence similarity or RefSeq annotation content including links from PubMed. Links to RefSeq records may be found in the following resources:
The distinct accession number format used for RefSeq records (Table 1) makes it easy to spot links to RefSeq records from these and other NCBI resources. Several approaches to access and retrieve RefSeq records are described below.
Entrez Query Access
RefSeq records can be retrieved by querying various databases in the Entrez system (Chapter 15). All RefSeqs can be found in the Entrez Nucleotide or Protein databases, whereas queries in the Genome database retrieve the subset of the RefSeq collection that comprises complete genomic molecules. Together, Genome and Gene represent the entire RefSeq collection. See the RefSeq website for examples of Entrez queries. A subset is represented in Map Viewer. The genomic RefSeq is reported in the map, and the annotations are viewed in the Genes and Transcript maps.
Searching Nucleotide or Protein
Figure 4. Using Entrez Limits to restrict a query to RefSeq. Use the highlighted menu boxes to restrict the query to a genomic or mRNA sequence or to restrict the query to the RefSeq collection.
Table 6. Entrez queries to retrieve sets of RefSeq records
| Query | Accession prefixes | RefSeq status |
| --- | --- | --- |
| srcdb_refseq[prop] | All RefSeq accessions | All |
| srcdb_refseq_known[prop] | NC_, AC_, NG_, NM_, NR_, NP_, AP_ | REVIEWED, PROVISIONAL, PREDICTED, INFERRED, and VALIDATED |
| srcdb_refseq_reviewed[prop] | NC_, AC_, NG_, NM_, NR_, NP_, AP_ | REVIEWED records |
| srcdb_refseq_validated[prop] | NC_, NM_, NR_, NP_ | VALIDATED records |
| srcdb_refseq_provisional[prop] | NC_, AC_, NG_, NM_, NR_, NP_, AP_ | PROVISIONAL records |
| srcdb_refseq_predicted[prop] | NM_, NR_, NP_ | PREDICTED records |
| srcdb_refseq_inferred[prop] | AC_, AP_, NM_, NR_, NP_ | INFERRED records |
| srcdb_refseq_model[prop] | NT_, NW_, XM_, XR_, XP_, ZP_ | Genome annotation model records |
General queries in Entrez Nucleotide or Protein databases may retrieve a mixture of GenBank and RefSeq records. The query result display includes tabs to access specific result sets including, by default, a tab to access RefSeq-specific results. The subsets reported via folder tabs can be configured using the My NCBI interface. Alternatively, queries can be restricted to the RefSeq collection by using the Limited to: settings (Figure 4) or by entering a query restriction using the property fields listed in Table 6. In addition, both the Limited to: settings and property field terms can be used to restrict the query by type of molecule (e.g., DNA versus mRNA) and other parameters. See Entrez Help and the RefSeq Query help page for further details.
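The property restrictions in Table 6 can also be composed programmatically. The sketch below only builds query strings and a request URL; it performs no network access. Using the E-utilities ESearch endpoint is an assumption about programmatic access that this chapter does not describe, and the function names `refseq_query` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for programmatic Entrez searches (not described in the text).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Property names as listed in Table 6.
REFSEQ_PROPERTIES = {
    "srcdb_refseq", "srcdb_refseq_known", "srcdb_refseq_reviewed",
    "srcdb_refseq_validated", "srcdb_refseq_provisional",
    "srcdb_refseq_predicted", "srcdb_refseq_inferred", "srcdb_refseq_model",
}


def refseq_query(terms, subset="srcdb_refseq"):
    """Restrict an Entrez query to a RefSeq subset using a [prop] term."""
    if subset not in REFSEQ_PROPERTIES:
        raise ValueError("unknown RefSeq property: %s" % subset)
    return "(%s) AND %s[prop]" % (terms, subset)


def esearch_url(db, query, retmax=20):
    """Build (but do not send) a search URL for the given Entrez database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "retmax": retmax})
```

For example, `refseq_query("abcc*[sym]", "srcdb_refseq_known")` produces `(abcc*[sym]) AND srcdb_refseq_known[prop]`, which can be pasted into the Entrez search box or passed to `esearch_url`.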
Searching Genome and the Genome Project
RefSeq records in the Genome or Genome Project databases can be retrieved using an accession number for a complete genomic molecule (NC_ accession prefix) or organism name. The Genome Project database can also be queried using the property restriction “srcdb_refseq[prop]”.
Searching Gene
The majority of the RefSeq collection is represented in Entrez Gene, a gene-centered database (Chapter 19); RefSeq records representing assembled environmental samples (with an NS_ accession prefix) are not included in Gene but can be found in the Genome and Nucleotide databases. Additional organisms, records, and associated data continue to be added to Entrez Gene and RefSeq over time as new data become available.
Genes with specific RefSeq accessions can be retrieved by querying with the RefSeq accession number. A more general query to retrieve Genes with associated RefSeq records can be carried out by using the property "srcdb_refseq". For example, a query can be formed to find members of a gene family that share a common name root for which there are RefSeq records (for example, "abcc*[sym] AND srcdb_refseq_known[prop]"). RefSeq to Gene connections are also provided by direct links; RefSeq records include a link to the Entrez Gene report page via the GeneID dbXref link on the gene and CDS features. Gene reports the RefSeq accession numbers in the RefSeq section of the report, with links to the Nucleotide or Protein records. Gene reports may also include a graphical depiction of genome annotation data as represented in the Map Viewer resource in the Genomic regions, transcripts, and products section, with links to Nucleotide and Protein displays. When this graphical section is provided, an additional report is available with details about exon and intron boundaries and length. You can change the display format from Full Report to Gene Table to access this report.
Entrez Gene query results and gene reports indicate when a RefSeq is available, with links provided to the nucleotide and protein sequences and to related resources, including the Map Viewer and BLink (pre-computed protein alignments) and Conserved Domain Database (CDD). The process of RefSeq curation also expands the data available in Entrez Gene by providing a range of information including:
BLAST
Figure 5. (a) RefSeq records are included in NCBI BLAST databases. In a BLAST summary list of results, the abbreviation ref identifies records that are provided by the RefSeq collection. Accessions that are included in the NCBI resources UniGene, GEO, and Gene are linked to those resources via the colored icons for U, E, and G, respectively. (b) The Genome View button is provided when BLAST results can be viewed in context of the graphical Map Viewer display.
RefSeq transcript and protein records are included in the non-redundant nucleotide and protein BLAST databases, and genomic sequences are included in the "chromosome" database; therefore, when a query sequence matches a RefSeq record, the hit is included in the BLAST results (see Figure 5a). Accessions included in the results set, either RefSeq or GenBank, that are associated with GeneIDs are indicated by a small blue G icon that is linked to the Gene report. Additional organism-specific BLAST pages provide access to specific custom databases to query against the assembled genome or other databases. The set of supported custom databases varies by organism. These custom BLAST pages can be accessed via the Map Viewer, Genome Project reports, or through the Genomic Biology webpage. For example, several species-specific genome BLAST pages provide access to query the genome assembly, transcripts, or proteins and may include options to query against additional custom databases such as sequence data from the Trace archive, clones, or ab initio predictions. As illustrated in Figure 5b, BLAST results for queries against assembled genome sequence data that are available in the Map Viewer include a button called Genome View that provides access to a custom view in the Map Viewer, where BLAST hits are displayed in the context of the genome.
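The "ref" abbreviation that marks RefSeq hits can be picked out of a BLAST result list with a few lines of code. The sketch below assumes a pipe-delimited definition line of the form `gi|number|ref|accession|`, which is an assumption about the output layout and is not specified in this chapter. The function name `refseq_accession` and the identifiers in the usage note are hypothetical.

```python
def refseq_accession(defline):
    """Return the accession tagged 'ref' in a pipe-delimited defline, or None."""
    parts = defline.lstrip(">").split(None, 1)
    if not parts:
        return None
    fields = parts[0].split("|")
    # Walk adjacent (tag, value) pairs and look for the RefSeq tag.
    for tag, value in zip(fields, fields[1:]):
        if tag == "ref" and value:
            return value
    return None
```

A definition line such as `>gi|1|ref|NM_000001.1| example` returns `NM_000001.1`, whereas a line carrying only a `gb` tag returns `None`.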
FTP
RefSeq data are available in three FTP areas. Configured RefSeq BLAST databases are available for download from the BLAST FTP site; separate databases are provided for genomic, transcript, and protein records. Organism-specific subsets are provided in the genomes FTP site. This area includes RefSeq records that are generated by or used in Map Viewer and Entrez Genomes processing. The full RefSeq collection is available in the RefSeq FTP site, with the exception of the NS_ accession series representing environmental sample records. The RefSeq collection is provided as comprehensive bi-monthly releases in addition to daily updates for records that are new or updated between RefSeq release cycles. The comprehensive release provides data in multiple file formats, including flat file and fasta, as well as providing the data organized into primary taxonomic groups in addition to the complete dataset. In addition, a small number of subdirectories are available that provide weekly comprehensive releases of the transcript and protein RefSeq data for organisms of high interest that have frequent updates of curated records, such as human, mouse, and rat. Information about the RefSeq release is documented in the RefSeq FTP site in the release-notes subdirectory; the availability of new releases is announced on the RefSeq website and to subscribers of the refseq-announce email list.