Decade of Data
RefSeq: A Database of Reference Sequences
RefSeq is a new database, distinct from GenBank, which currently comprises a non-redundant set of human reference sequences for mRNAs and proteins. RefSeq is designed to simplify the analysis of a burgeoning sequence databank by providing reference sequence standards representing the transcripts and proteins encoded by loci. RefSeq records incorporate data gleaned from multiple GenBank records and literature sources. Although currently limited to human mRNA and protein sequences, RefSeq will eventually expand to include sequences from other organisms represented in GenBank. Additional sequence types, such as constructed genomic contigs and entire chromosomal reference sequences, will also be included. Accession numbers of five types, distinguished by the initial two characters, are issued to RefSeq records, depending on the type of sequence involved:
RefSeq Records May Be Either Provisional or Reviewed

Sequence records are incorporated into RefSeq in two stages, Provisional and Reviewed. In creating a Provisional record, the first step involves associating sequence data with named genes to select an initial input sequence. The input sequence is then used to locate a source sequence for the Provisional RefSeq entry. The source sequence is usually the longest mRNA sequence in GenBank that both contains the input sequence and is annotated with a complete coding region. This source sequence is then fed into an automatic process that generates the Provisional RefSeq records. A Provisional record is generated from the source GenBank record, with the addition of gene names and aliases, a stable LocusID number, the MIM number for the gene, and a statement in the Comment field that the entry is Provisional. The date of deposition of candidate records is not considered when selecting the source sequence. Hence, the selection of a particular GenBank record as the basis of a RefSeq record does not imply primacy of publication.

In stage two, a Provisional RefSeq record is reviewed by NCBI staff or outside experts to produce a Reviewed record. During this stage, the Provisional entry may be modified and augmented considerably, incorporating data from other sequence records or from the scientific literature, in order to reflect the current state of knowledge of the locus in question. References to the literature are added along with a brief summary of the locus. The RefSeq mRNA sequence may also be extended using data from other genomic or mRNA GenBank records; however, because of their error-prone nature, EST data are not incorporated into RefSeq records.

If there is strong evidence that a gene produces multiple biologically important transcripts and proteins, then individual RefSeq records are created to represent each.
Such transcript variant RefSeq records are constructed after a careful review of the literature, or in collaboration with experts.

RefSeq Records Can Be Found Using Entrez and LocusLink

RefSeq records are included in the Entrez nucleotide database and may be searched in the same way as GenBank records. For example, a simple search using a RefSeq accession number, such as NM_000642, will return the corresponding record. LocusLink can be used to retrieve RefSeq records on the basis of a gene name, LocusID, or chromosomal location.

RefSeq Records Are BLASTable

RefSeq records are included in the BLAST nr database. The first portion of the one-line description in the BLAST output will resemble the following:
This line gives the unique gi number 4557284, accession number NM_000646, and LOCUS name AGLf assigned to the Reference sequence. For more information, click on the Reference Sequences link on the NCBI home page.

DM, KP, DW
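
Because RefSeq accession types are distinguished by their initial two characters, a script that handles mixed lists of identifiers may want to separate that prefix from the rest of the accession. The following is a minimal sketch, not an NCBI tool: the function name `split_refseq_accession` is hypothetical, and the pattern (two uppercase letters, an underscore, then digits) is an assumption based only on the accessions quoted in this article, NM_000642 and NM_000646.

```python
import re

# Assumed shape, inferred only from the accessions quoted in this article:
# two uppercase letters, an underscore, then one or more digits.
_ACCESSION_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)$")


def split_refseq_accession(accession):
    """Return (prefix, serial) for a RefSeq-style accession, or None.

    Hypothetical helper for illustration; it does not interpret what
    a given prefix means, it only separates the two parts.
    """
    match = _ACCESSION_PATTERN.match(accession.strip())
    if match is None:
        # Does not look like the assumed RefSeq accession shape.
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for candidate in ["NM_000642", "NM_000646", "U12345"]:
        print(candidate, split_refseq_accession(candidate))
```

Under these assumptions, the two accessions quoted above split into the prefix `NM` and a numeric part, while an identifier without the two-letter prefix and underscore is rejected rather than guessed at.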