
Entrez Gene: Integrated Access to Genes of Genomes in the Reference Sequence Collection

Introduction

With the increasing sequencing and annotation of key genomes, having a gene-based view of the resultant information is useful. Entrez Gene has therefore been implemented to supply key connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are tracked, and information is added when available. Entrez Gene can be considered as the successor to LocusLink, with the major differences being in greater scope (more of the genomes represented by NCBI Reference Sequences or RefSeqs) and in being integrated for indexing and query in NCBI's Entrez system.

To help you use Entrez Gene effectively, this document is divided into the following sections:

  1. Maintaining the Data
  2. Interpreting the Display
  3. Query Tips with Examples
  4. Tips for the Programmer
  5. Connecting users of Entrez Gene to your web site (LinkOut)

Previous users of LocusLink may find the subsections titled Tips for Previous Users of LocusLink helpful.

Maintaining the Data

New Records

Records are added to Gene if any of the following conditions are met:

Gene records are not created for genomes incompletely represented by whole genome shotgun (WGS) assemblies, which are provided in terms of RefSeqs by the accessions of the pattern NZ_ABCD12345678.
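As a rough illustration, the WGS accession pattern mentioned above can be checked with a regular expression. The sketch below assumes, based only on the single example NZ_ABCD12345678, that such accessions consist of the prefix NZ_, four uppercase letters, and eight digits, optionally followed by a version suffix. The function name is hypothetical, and real accessions may not always match this shape.

```python
import re

# Assumed shape, inferred from the example NZ_ABCD12345678:
# "NZ_" + 4 uppercase letters + 8 digits, with an optional ".version" suffix.
WGS_REFSEQ = re.compile(r"NZ_[A-Z]{4}\d{8}(\.\d+)?")


def looks_like_wgs_refseq(accession: str) -> bool:
    """Return True if the accession matches the assumed WGS RefSeq pattern."""
    return WGS_REFSEQ.fullmatch(accession.strip()) is not None
```

A caller could use such a check to skip accessions for which no Gene record is expected.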

The minimum set of data necessary for a gene record, therefore, is a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any sequence information, map information, or official nomenclature from an authority list.
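The minimum data set described above can be sketched as a small data structure. This is only an illustration: the class and field names are hypothetical and do not reflect how NCBI stores Gene records. It checks that a record has a GeneID and a preferred symbol, plus at least one of sequence information, map information, or official nomenclature.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinimalGeneRecord:
    """Hypothetical container for the minimum data of a gene record."""
    gene_id: int
    symbol: str
    sequence_accession: Optional[str] = None
    map_location: Optional[str] = None
    official_nomenclature: Optional[str] = None

    def is_sufficient(self) -> bool:
        """True if an identifier, a symbol, and one supporting datum exist."""
        has_identity = self.gene_id > 0 and bool(self.symbol.strip())
        has_support = any(
            (self.sequence_accession, self.map_location, self.official_nomenclature)
        )
        return has_identity and has_support
```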

You may construct a URL to identify records created within a certain number of days before the current date by using the following:

  • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?filters=on&orig_db=gene&cmd=search&DB=gene&term=[your query]&pmfilter_CDatLimit=[number of days]

    such as the following, to find human records created within the last 31 days:
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?filters=on&orig_db=gene&cmd=search&DB=gene&term=human[organism]&pmfilter_CDatLimit=31

    Updated Records

    Existing records are updated when new information is received. For some genomes, this may occur when a genome is reannotated and corresponding RefSeqs are updated. For other genomes, this may occur when any information attached to a gene record is altered. At present, updates are processed twice weekly, with refreshed data appearing Tuesdays and Thursdays.

    GeneRIFs, although displayed as part of a Gene record, are processed independently of the Gene record itself. Most GeneRIFs were provided by the staff of the National Library of Medicine's Index Section, and are integrated weekly. Those are available with the first update to Gene within a week.

    The modification date, therefore, will be the later of any update to Gene or the GeneRIF supplement.
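The rule above amounts to taking the later of two dates. A minimal sketch, with a hypothetical function name and the assumption that a record may have no GeneRIF update at all:

```python
from datetime import date
from typing import Optional


def modification_date(gene_update: date, generif_update: Optional[date] = None) -> date:
    """Return the later of the Gene update and the GeneRIF supplement update."""
    if generif_update is None:
        # Assumption: with no GeneRIF update, the Gene update date applies.
        return gene_update
    return max(gene_update, generif_update)
```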

    You may construct a URL to identify records updated within a certain number of days before the current date by using the following:

  • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?filters=on&orig_db=gene&cmd=search&DB=gene&term=escherichia coli[organism]&pmfilter_MDatLimit=[number of days]

    such as the following, to find records for Escherichia coli modified in the last 31 days:
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?filters=on&orig_db=gene&cmd=search&DB=gene&term=escherichia coli[organism]&pmfilter_MDatLimit=31
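The two URL patterns in this section (pmfilter_CDatLimit for creation date and pmfilter_MDatLimit for modification date) can be assembled programmatically. The sketch below uses only the parameters shown in the URLs above; the function name and the "created"/"modified" keywords are hypothetical. Note that the standard library percent-encodes the square brackets and encodes spaces as "+", which is assumed to be acceptable to the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.ncbi.nlm.nih.gov/entrez/query.fcgi"

# Date-limit parameter names as they appear in the URLs in this section.
DATE_LIMITS = {"created": "pmfilter_CDatLimit", "modified": "pmfilter_MDatLimit"}


def recent_gene_url(term: str, days: int, kind: str = "created") -> str:
    """Build a Gene query URL limited to records created or modified recently."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    if kind not in DATE_LIMITS:
        raise ValueError("kind must be 'created' or 'modified'")
    params = [
        ("filters", "on"),
        ("orig_db", "gene"),
        ("cmd", "search"),
        ("DB", "gene"),
        ("term", term),
        (DATE_LIMITS[kind], str(days)),
    ]
    return BASE_URL + "?" + urlencode(params)
```

For example, recent_gene_url("human[organism]", 31) corresponds to the human example given earlier, and recent_gene_url("escherichia coli[organism]", 31, kind="modified") to the Escherichia coli example.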


    Suppressed Records

    All previous records will be searchable in Gene by the GeneID only. If it is known that the suppressed record has been replaced with another, the URL to the current record is provided. This is of particular benefit to those interested in retrieving information about NCBI Gene predictions that have not been retained.

    Tips for Previous Users of LocusLink

    Display of all discontinued records, replaced or not, is an added function compared to LocusLink. In LocusLink, only replaced records, and a subset of discontinued records, could be retrieved.

    Interpreting the Display

    This section is divided according to the Display options provided by Entrez:

    Summary Display

    When you process a query, the result is displayed in the Summary format. You can see this by noting the word Summary in the Display box.

    The summary display (Figure 1) provides a check box for selecting records to display, the preferred symbol, the complete full name if available, the binomial for the species (in square brackets), genomic location, and the GeneID. If the gene is on a named plasmid, then the plasmid name will be given as the location. The Links, provided at the right, support ready display of related records in Entrez, as documented in Table 1. To navigate to more information about specific records, check the records to display (the default being all) and select the display option in the gray query bar.

    Figure 1. Sample Summary Display.

    A summary display (note that the display option is set to Summary) for the query homeo box homolog. When available, Gene uses the title for a gene supplied by the appropriate nomenclature committee. In this example, note that some nomenclature committees include, as part of their official name, information about the species that was used to define the name of the homolog. That part of the title is usually given in parentheses, and should not be confused with the binomial for the source genome, which is italicized and in square brackets.

    Tips for Previous Users of LocusLink
    The summary display format corresponds most closely to what was termed the Brief Display option in LocusLink, with the LocusID, symbol, species, full description, location, and array of colored boxes indicating available links. In Gene, there is no link from the location information to Map Viewer; that is provided in the Links menu instead. The array of colored boxes is also replaced by the Links menu.

    The Summary Display option of LocusLink, which reported all alternate symbols and sequence accessions associated with a gene record, is not currently provided from Gene.

    Brief Display

    The functions available from the Brief display are similar to those of the Summary display described in the previous section. The purpose of the Brief option is to support a more compact result set, while providing enough information (the preferred symbol, 20 characters of the full name, and the GeneID) to allow you to select records to display in full. The Links provided at the right of the display support ready display of related records in Entrez, including: Nucleotide sequences, Protein sequences, Books, GEO, HomoloGene, OMIM, SNP, UniGene, UniSTS, and Taxonomy. To see more information, check the records to display (the default being all), and select the display option in the gray query bar.

    Tips for Previous Users of LocusLink

    There is no exact precedent in LocusLink. The links in the Links menu at the right correspond to the color-coded links in LocusLink (Table 1), but there are many more in Entrez Gene than LocusLink supplied.

    Table 1. Correspondence between LocusLink letters and Entrez Links

    Letter in LocusLink | Link name | Scope
    P | PubMed | All citations, including those established via GeneRIFs.
    O | OMIM | The OMIM records based on the gene or the phenotype (human only).
    R | (none) | Not implemented separately; included in Nucleotide.
    G | Nucleotide | A subset of nucleotide links. ESTs are excluded unless the gene is defined by an EST; unannotated high throughput genomic (HTG) sequences are also excluded unless used as a source for a RefSeq.
    P | Protein | A subset of protein sequences encoded by the gene.
    H | HomoloGene | HomoloGene links based on shared GeneID.
    U | UniGene | UniGene links based on shared nucleotide sequences.
    V | SNP | Links to dbSNP for all variations related to the GeneID, variant by variant.
    (none) | GeneView in dbSNP | A display of all variations related to the GeneID, by placement in the gene. This is the view directly corresponding to the V link.

    Note: Gene supports many more links from the menu at the right of the Summary and Brief views than the color-coded letter icons LocusLink provided. For Gene, all that is required is a connection between Gene and another Entrez database. Thus there may be links to Books, GEO, UniSTS, Taxonomy, etc.

    Full Report (default)

    The Full Report includes diagrams and text to represent some of what is known about a gene. Access to additional information maintained in other resources within NCBI or external to NCBI is provided by the Links menu at the top right and by other HTML anchors within the page.

    This section of the Help document is divided into the following subsections:

    If multiple records are selected for display, the start of each record is indicated by the numbered open box at the left and by the Links menu at the right.

    Title

    The first line of this section provides the preferred symbol and descriptive name in bold font, followed by the italicized binomial in square brackets. If there is a recognized authority for the gene nomenclature of a species, that is the source for these values.

    The NCBI GeneID and the Locus tag for the record are in the next line. Locus tag corresponds to the systematic feature qualifier used by the international sequence collaboration (DDBJ/EMBL/GenBank), and can be assigned by sequence submitters as a unique, systematic gene descriptor. When such a value is not available from submitted sequence, the identifier from a collaborating model organism database is used. Locus tag is often used to anchor a link to a database other than Entrez Gene.

    The second line also includes the last date a record was changed in the format day-month-year. Change is defined as any modification to the content of the record, including ancillary changes such as the URL for a displayed link.

    Navigation Menu

    The menu at the right of the Entrez Gene report supports navigation to multiple sites of interest. In some display formats, the menu can be expanded and compressed by clicking on the + or -, respectively. More details about each submenu follow.

  • Table of Contents lists the subcategories of information available for a gene. Clicking on the name of the subcategory takes you to that portion of the gene record. The arrow pointing up on the bar separating subcategories will return you to the top of the page should you want to make a different selection from the menu.
  • Links contains standard links seen in all Entrez records, followed by links to sites of interest not in the Entrez system. The Linkout option is always last.
  • Entrez Gene Info enumerates resources that may help you find and understand the information in Gene. The Help link goes to the top level of the help documentation. This is the same document accessed by the question marks in the horizontal section separators.
  • Feedback enumerates several sites where you can comment on or add data to Gene.
  • Subscriptions provides links to forms where you can subscribe to mailing lists to receive announcements about updates to Gene, Map Viewer, and RefSeq.

    Summary

    The next section includes several categories of information for each gene as available. These include:

    1. Nomenclature. Symbols and names provided by the named external authority.
    2. Gene type. Possible values are tRNA, rRNA, snRNA, scRNA, snoRNA, miscRNA, protein-coding, pseudo, other, and unknown.
      These are indexed as properties of a gene via the terms:

      More details about Gene Types are in the related subsection describing what is indexed by Gene as a property.

    3. Gene or RNA name. The name assigned to the gene or its primary RNA product.
    4. RefSeq status. Any of the set of status descriptions defined by RefSeq.
    5. Organism. The binomial, and strain when appropriate, with a link to the NCBI Taxonomy database.
    6. Lineage. Binomial and lineage from the Taxonomy database.
    7. Gene Aliases. Unofficial symbols and descriptions that have been used for this gene and its products. If there is no official symbol, the symbol at the top of the display is repeated in this section.
    8. Summary. Descriptive text about the gene, its cellular localization, its function, and its effect on phenotype.
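The gene type values listed above form a small controlled vocabulary. As a sketch, a program handling Gene data might validate against it as follows; the function name is hypothetical, and the case-insensitive matching is an assumption made for convenience rather than documented behavior.

```python
# Gene type values as listed in this section.
GENE_TYPES = frozenset({
    "tRNA", "rRNA", "snRNA", "scRNA", "snoRNA", "miscRNA",
    "protein-coding", "pseudo", "other", "unknown",
})


def normalize_gene_type(value: str) -> str:
    """Return the listed spelling of a gene type, matching case-insensitively."""
    lookup = {gene_type.lower(): gene_type for gene_type in GENE_TYPES}
    key = value.strip().lower()
    if key not in lookup:
        raise ValueError(f"not a listed gene type: {value!r}")
    return lookup[key]
```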

    Genomic regions, transcripts and products

    This section is provided when a gene has been annotated on a genomic RefSeq, in other words when the position of a pseudogene, or the intron/exon/coding region information, is available in some genomic coordinate system. You can use this section to:

    Each position of a gene product, when represented by a RefSeq RNA (accession NM_000000 or XM_000000 for mRNA, NR_000000 or XR_000000 for non-protein coding RNAs) and/or protein (NP_000000 or XP_000000), is provided relative to the genomic accession on which it is annotated. Each accession is an anchor to a menu allowing display of the sequence in several formats. Protein accessions also facilitate retrieval of specific BLink, CDD, or COG displays. Using the following figure as an example (interactive view):

    Accessing the Genomic Sequence

    NC_000010 is the accession of the genomic contig that contains the gene. The region being shown in this accession is indicated by the integers to the left and right of the accession. The rightward pointing arrow indicates that these mRNAs are annotated in the same orientation as the NC accession (since this is an example of a human chromosome, pter->qter). If an mRNA is annotated on the reverse complement of the genomic sequence, then the phrase (shown on reverse complement genome) is printed. RefSeq below provides a link to the RefSeq section where there may be more information about these sequences.

    Clicking on the accession NC_000010 brings up a menu that facilitates navigation to:

    When you have used any of the above links to display the genomic sequence of the gene displayed on the record (in this case human PAX2), you can use standard Entrez functions to:

    For more details about the use of Entrez features, please refer to the general Entrez Help Documentation.


    Display of Products and Accessing their Sequences

    Under the line representing the genomic sequence is a description of the products of the gene. If the gene is a non-transcribed pseudogene, then only the diagram of the genomic sequence is shown.

    If there are accessions listed at the left of the diagram (e.g. NM_003987, NM_003989, etc.), these represent RefSeq RNAs. Each accession anchors a link to retrieving sequence, formatted as described (FASTA, GENBANK, and GRAPHIC) for the genomic RefSeq. If there are accessions listed at the right, these are for encoded proteins. In addition to supporting links to sequence, the protein accessions facilitate connections to protein-specific tools or displays, such as

    Each protein accession may also have a distinguishing label. In this example, these are the isoform designations, isoforms a, b, c, and d.

    The labels of the RNA and protein accessions are color coded to coincide with the diagram of the placement of the exons. Blue indicates RNA, and in protein-coding genes, the untranslated region (UTR) of any exon. Maroon indicates protein, and therefore is used to represent the coding region of an exon. If an intron is flanked by coding sequence, then the line connecting the exons is maroon; otherwise, it is blue.
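The RefSeq accession prefixes used in this section (NC_, NT_, or NW_ for genomic sequence; NM_ or XM_ for mRNA; NR_ or XR_ for non-protein-coding RNA; NP_ or XP_ for protein) lend themselves to a simple lookup. The following is a minimal sketch with hypothetical names. It covers only the prefixes mentioned in this document and reports anything else as "unknown".

```python
# Prefix-to-molecule mapping, limited to the prefixes named in this document.
REFSEQ_PREFIXES = {
    "NC_": "genomic", "NT_": "genomic", "NW_": "genomic",
    "NM_": "mRNA", "XM_": "mRNA",
    "NR_": "non-coding RNA", "XR_": "non-coding RNA",
    "NP_": "protein", "XP_": "protein",
}


def classify_refseq(accession: str) -> str:
    """Classify an accession by its three-character RefSeq prefix."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, "unknown")
```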

    Genomic Context

    The genomic context section reports the location of the gene on the chromosome in non-sequence coordinates and the strain and genotype information of the source sequence. The title bar of this section includes a link to Map Viewer, resulting in the same display as that generated from the Map Viewer link in the Links menu.

    If the gene has been included in a genomic annotation, the section also diagrams neighboring genes and indicates their orientations. If the name of a gene is too long to use for a label, truncation is indicated by an ellipsis (...). The gene being shown on the diagram is in maroon. All other diagrams and labels anchor links to specific gene pages, supporting quick navigation to review neighboring genes by clicking in the area of the symbol/arrow.

    If a gene has not been included in the current version of the annotated genome provided in NCBI RefSeqs, the Genomic Context section will not include a diagram, but will report the map location. If the gene is annotated on more than one genomic RefSeq, only one will be used for the graphic display, but the location information will be provided in the ASN.1 of the record.

    Bibliography

    The Bibliography section provides a link to PubMed in the same format as that generated from the PubMed option in the pull-down Links menu at the upper right portion of the page.

    If GeneRIFs have been submitted, they are included in this section. The majority of these annotations have been provided via a collaboration between the NLM's Index Section and NCBI. The GeneRIF home page provides more information about the project, including how general users can make submissions.

    Interactions

    There are two major subcategories of information reported as 'Interactions'.

    HIV-1 Interactions

    The HIV-1, Human Protein Interaction Database is funded by the Division of Acquired Immunodeficiency Syndrome (DAIDS) of the National Institute of Allergy and Infectious Diseases (NIAID) (Home). As the title indicates, this project focuses on the human proteins that have been shown to interact with proteins from HIV-1. The format of this section is different on the human and HIV-1 gene reports. On human, the display consists of:

    On HIV-1, the display is subdivided by peptide name, and includes:

    Please note that there are separate reports from this section that are available for download, both from the HIVInteractions home page and the GeneRIF subdirectory of Gene's ftp site.

    General Interactions

    Interactions in this general section are reported as pairs. The report will always include the product of the gene that is part of the interaction in the first column. Depending on the type of interaction, the rest of the display may report:

    Alleles

    This section reports the general characteristics of alleles that have been described for a gene, and provides links to more detailed information. This function is being phased in gradually; the current set is for mouse and is being developed from information supplied by Mouse Genome Informatics.

    General Gene Information

    This section includes several subcategories of information, including:

    1. GeneOntology (GO). The specific GO terms are listed by source of the information, category, term, evidence information, and links to supporting publications. Each GO term supports a link to the AmiGO browser. This section of the Entrez Gene report includes abbreviations indicating the level of support for assigning a GO term to a gene. Explanations for these codes are provided from the Gene Ontology site.

      Entrez Gene does not alter the associations provided by a model organism database, nor does Entrez Gene recapitulate the directed acyclic graph structure provided from GO. Thus Entrez Gene does not support retrieval of all genes associated with a specific GO term based on that term's parent.

    2. Homology. A partial listing, with links, of orthologs in other species. Other views of homology data are available from TaxPlot and the HomoloGene link in the Links menu.
    3. Phenotypes. A description of the effect of the gene on phenotype, especially disease. Links to more information are provided as available, as in the case of human disease and links to OMIM.
    4. Markers. An enumeration of the markers that are related to this gene. The relationship is reported based either on direct reports, e-PCR using mRNA templates, or e-PCR-based localization on the genome within a region beginning 2 kb upstream of the gene and ending 0.5 kb downstream. Links are provided to the NCBI UniSTS database.
    5. Pathways. A description of pathways that include this gene with links to more information about that pathway.
    6. Relationships. At present, used for gene models to describe some of the related sequences used to support the model transcript.
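The genomic window used for marker localization in the Markers item above (2 kb upstream of the gene to 0.5 kb downstream) can be worked out as in the sketch below. The function name is hypothetical, and the sketch assumes 1-based inclusive coordinates, that "upstream" is relative to the gene's strand, and that the window is clamped at position 1. None of these details are stated in this document.

```python
from typing import Tuple


def marker_search_region(
    gene_start: int,
    gene_end: int,
    strand: str,
    upstream: int = 2000,
    downstream: int = 500,
) -> Tuple[int, int]:
    """Return (low, high) genomic coordinates of the marker search window."""
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    if strand == "+":
        # Upstream lies at lower coordinates on the plus strand.
        low, high = gene_start - upstream, gene_end + downstream
    elif strand == "-":
        # Upstream lies at higher coordinates on the minus strand.
        low, high = gene_start - downstream, gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, low), high
```

With these assumptions, a plus-strand gene at 10000 to 20000 gives the window 8000 to 20500, and the same gene on the minus strand gives 9500 to 22000.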

    General Protein Information

    This section applies only to genes that encode proteins. It reports the name or names that have been assigned to proteins encoded by the gene, and other descriptive text. The names are as annotated on the RefSeq protein, when that protein is available. The sources of these names include model organism databases, annotation on public sequence databases, and curation by RefSeq staff.

    NCBI Reference Sequences (RefSeqs)

    This section describes the gene-specific NCBI reference sequences (RefSeqs) that have been established for this gene. In addition to enumerating the accessions and providing links to the appropriate Entrez sequence database, this section may also include descriptions of each transcript variant, accessions of the public sequences used to support any transcript, and a listing of computed domains in an encoded protein. The text provided in this section therefore supports retrieving gene records based on descriptions of conserved domains.

    Related Sequences

    This section lists nucleotide and protein accessions that are related to the gene, and provides links to the appropriate sequence record in Entrez Nucleotide or Entrez Protein. It is not intended to be a comprehensive list of all sequences related to any gene; such sequences can more explicitly be found by using BLAST to query sequence databases, or using pre-calculated reports of related sequences via Entrez Nucleotide, Entrez Protein, or BLink. The sequence accessions in this section are provided in tab-delimited format in the gene2accession.gz file in the DATA directory of Gene's ftp site.
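Because gene2accession.gz is described as a gzip-compressed, tab-delimited file, it can be read with standard tools once downloaded. The sketch below is a generic reader with a hypothetical function name. It makes no assumption about which columns the file contains, and it assumes that lines beginning with "#" are comments or headers to be skipped, which should be checked against the actual file.

```python
import csv
import gzip
from typing import Iterator, List


def read_tab_delimited_gz(path: str) -> Iterator[List[str]]:
    """Yield each data row of a gzip-compressed, tab-delimited file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Assumption: blank lines and lines starting with "#" carry no data.
            if not row or row[0].startswith("#"):
                continue
            yield row
```

Iterating lazily in this way avoids loading the whole file into memory, which matters for a file that covers every gene record.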

    Gene knowingly lists protein accessions on records being represented as not protein-coding. The intent is to make the connection between sequence annotation and Gene's current representation of the type of gene. Users with evidence indicating that the Gene record should be reviewed are encouraged to contact the RefSeq staff.

    The display: If the protein sequence record is not part of a set of a nucleotide record and the protein it encodes, the word 'none' is printed in the Nucleotide column. The type of nucleotide record is printed before the nucleotide accession, and the strain is printed after the protein accession, as applicable.

    Additional Links

    This section provides a printable view of a subset of links to information both within and external to NCBI. Some of these links overlap those included in the Links menu. The intent of this section is to provide a printable report of, for example, MIM numbers, UniGene cluster numbers, and family-specific Web sites.

    Tips for Previous Users of LocusLink

    Table 2. Where to find the content of a LocusLink report page in Entrez Gene

    LocusLink | Gene | Comments
    Table of Contents | Not retained |
    Alphabetic lists | Not retained |
    Gene diagram | Transcripts and Products | Gene adds the function of Genomic context, to allow a quick view of nearby genes and links to their report pages.
    Link to Evidence Viewer from Gene diagram | Evidence Viewer link in Links menu | The option to first see only the diagram of the alignment is not retained.
    Button Links | Links menu | On the Gene Graphic/Default display, the number of links may be greater than in LocusLink.
    Title bar with links to nomenclature source. | Initial text, with link to nomenclature source via LocusTag. | Links from LocusTag values may connect to an external database where official nomenclature has not been assigned.

    OVERVIEW SECTION
    RefSeq Summary | Summary | not changed
    Locus type | Combination of Gene type and evidence type (under development) | The text values are not equivalent. LocusLink's Locus type values are being subdivided into Gene type and Evidence type categories.
    Protein names | General protein information | not changed
    Alternate symbols | Gene aliases | not changed

    RELATIONSHIPS SECTION
    Homology data | Links menu; HomoloGene | What is printed in LocusLink is still printed in Gene.
    Related models | Related | not changed; limited to genomes being annotated by NCBI's pipeline.

    FUNCTION SECTION
    GeneRIFs | GeneRIFs | Not in a function section; indented under Bibliography
    GO annotation | General gene information: GeneOntology | organization changed, but not content
    Phenotype | General gene information: Phenotypes | organization changed, but not content

    MAP SECTION
    Chromosome | Genomic context | not changed
    Associated markers | General gene information: Sequence Tagged Site (Markers) | Entrez Gene added display of alternate marker names.

    SEQUENCE SECTION
    RefSeq Subsection
    Category | RefSeq status | not changed
    GenBank source | Source sequence | not changed
    Domain matches | Domains | CDD link also attached to the protein accession in Transcripts and Products and in Links menu.
    BL (BLink) | BLink link attached to the protein accession in the Transcripts and Products section | The function was not changed, but the placement and visibility are different.
    Variant name | After the protein accession in the Transcripts and Products section | content not changed

    Annotation Subsection
    Genomic contig | Transcripts and Products | function retained
    gb: Link to gene-specific subsequence | GENBANK view from source NC, NT, or NW accession | function retained as GENBANK option from the genomic accession-based menu.
    sv: Link to graphic display of gene-specific subsequence | GRAPHICS view from source NC, NT, or NW accession | function retained as GRAPHIC option from the genomic accession-based menu.
    mv: Map Viewer | Map Viewer in Links menu | function retained
    ev: Evidence viewer | Evidence Viewer in Links menu | function retained
    mm: ModelMaker | ModelMaker in Links menu | function retained
    strain or haplotype | not retained |

    Related Sequences Subsection
    Accessions, type, and strain data | Related Sequences | content not changed.
    BL (BLink) from protein accessions | not retained |
    Additional Links | Additional Links |

    ASN.1 Display

    The ASN.1 display provides Gene records structured according to the Entrezgene specification. An XML transformation of the ASN.1 is also provided by use of Display XML. Detailed information about the specification is provided in the Tips for the Programmer section.

    Tips for Previous Users of LocusLink

    There is no direct precedent in LocusLink. The nearest equivalent is the Save All Loci function that provides the record as text.

    Display XML

    Any record or selected set of records can be displayed in XML format. The XML is generated automatically from the ASN.1 record that is used to support the display, with the names of the tags defined by the ASN.1 specification. Detailed information about the specification is provided in the Tips for the Programmer section.
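Since the XML tag names follow the ASN.1 specification, records saved in XML format can be processed with a standard XML parser. The sketch below extracts GeneIDs from a hand-written fragment. The tag names (Entrezgene-Set, Entrezgene, Entrezgene_track-info, Gene-track, Gene-track_geneid) are assumptions about the specification and should be verified against it; the function name and the GeneID value 12345 are placeholders for illustration only.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written fragment for illustration; not an actual Gene record.
SAMPLE_XML = """<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>12345</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>"""


def gene_ids_from_xml(xml_text: str, id_tag: str = "Gene-track_geneid") -> List[int]:
    """Collect the integer content of every element named id_tag."""
    root = ET.fromstring(xml_text)
    return [int(element.text) for element in root.iter(id_tag) if element.text]
```

Passing the tag name as a parameter keeps the sketch usable if the element is named differently in the specification.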

    Tips for Previous Users of LocusLink

    There is no direct precedent in LocusLink. The nearest equivalent is the Save All Loci function that provides the record as text.

    Display Gene Table

    The Gene Table display represents the gene structure as annotated on the current genomic RefSeq representing the reference genome. It is not updated between NCBI builds. The table reports the intron/exon organization of each transcript, and, if an mRNA, the region of each exon that contains coding sequence. (Sample) It does this in two ways:

    1. graphically, by repeating the display included in the Full Report
    2. in a table, by reporting the position of any exon, intron, or coding region.

    In addition to allowing you to browse the structure of any gene and its products, this view also makes it easier to view and download the gene-related sequence as summarized in the following table. Please note:

    Navigation If there are multiple gene products (RNAs/proteins) annotated for a gene, navigation within the page is facilitated by the following:


    Table 3. Access to Sequence Information from the Gene Table Display Option

    Scope | Link to use
    Complete gene feature | Options from the menu opened by clicking on the genomic RefSeq accession (format NC_, NW_, NT_) at the top of the Graphic display.
    Complete RNA | (1) Options from the menu opened by clicking on the RNA RefSeq accession (format NM_, NR_, XM_, XR_) at the left of the Graphic display. (2) Display in GenBank format from the RNA RefSeq accession at the top of the accession-specific subsection of the table.
    Complete protein | (1) Options from the menu opened by clicking on the Protein RefSeq accession (format NP_, XP_) at the left of the Graphic display. (2) Display in GenBank format from the protein RefSeq accession at the top of the accession-specific subsection of the table.
    Single intron, exon, or coding sequence (CDS) | Display in FASTA format from the range shown in the corresponding table column.

    Tips for Previous Users of LocusLink

    There is no direct precedent in LocusLink.

    Display UI List

    This display is essentially the same as that of the Brief format, with the addition of the unique identifiers for a Gene record on the second line. The difference between the Brief and UI List displays is more apparent, however, when selecting the Send to Text option. For UI List, only the Gene identifiers are reported.

    Tips for Previous Users of LocusLink

    There is no direct precedent in LocusLink. The nearest equivalent is the Save All Loci function that provides the record as text.

    Display LinkOut

    The goal of the LinkOut feature is to facilitate access to relevant online resources beyond the Entrez system to extend, clarify, or supplement information found in the Entrez databases. Resources external to the Entrez system can submit their links as described here. Display LinkOut thus functions to show which records have supplemental information available from LinkOut providers and support connections to those providers.

    Tips for Previous Users of LocusLink

    There is no direct precedent in LocusLink. The nearest equivalents are the links integrated into the LocusLink record.

    Display Book Links

    An increasing number of Gene records are described in books and monographs provided in the Bookshelf. For sample records in this category, try the query "gene books"[filter] in the Gene query bar.

    Tips for Previous Users of LocusLink

    The Book Links correspond to the Books button displayed at the top of the LocusLink report page for records known to be linked to information in books in the Bookshelf.

    Display GEO Links

    When expression data are reported for sequences associated with a Gene record, a link to the Gene Expression Omnibus is provided. For sample records in this category, try the query "gene geo"[filter] in the Gene query bar.

    Tips for Previous Users of LocusLink

    There are no connections to GEO in LocusLink.

    Display HomoloGene Links

    These links are maintained by the HomoloGene staff.

    Tips for Previous Users of LocusLink

    These links are the equivalent of the [H]omology button displayed from the Brief or Summary displays of query results and for the HOMOL button on the Locus Report page.

    Display Nucleotide Links back to top

    The nucleotide links display option supports rapid display of the nucleotide sequence records associated with all or selected Gene records retrieved by a query. The results are displayed from Entrez Nucleotide, making it easy to format in your favorite sequence display mode.

    Tips for Previous Users of LocusLink

    There is no direct precedent in LocusLink. The nearest equivalents are the [G]enBank and [R]efSeq buttons displayed from the Brief or Summary displays of query results for individual records.

    Display OMIM Links back to top

    The OMIM links display option supports rapid display of all or selected OMIM records associated with the Gene records retrieved by a query. The results are displayed from Entrez OMIM, making it easy to navigate to information related to diseases and disorders maintained by OMIM staff.

    Tips for Previous Users of LocusLink

    These links are the equivalent of the [O]MIM button displayed from the Brief or Summary displays of query results for individual records and the OMIM button on the Locus Report page.

    Display Protein Links back to top

    The protein links display option supports rapid display of all or selected protein sequence records associated with the Gene records retrieved by a query. The results are displayed from Entrez Protein, making it easy to format in your favorite sequence display mode.

    Tips for Previous Users of LocusLink

    There is no direct precedent in LocusLink. The nearest equivalent is the [P]rotein button displayed from the Brief or Summary displays of query results for individual records.

    Display PubMed Links back to top

    The PubMed Links display option supports rapid display of all or selected PubMed records associated with the Gene records retrieved by a query. The results are displayed from PubMed, making it easy to process citation information according to your favorite method.

    Tips for Previous Users of LocusLink

    These links are the equivalent of the [P]ubMed button displayed from the Brief or Summary displays of query results for individual records and the PUB button on the Locus Report page.

    Display SNP Links back to top

    The SNP links display option supports navigation to the Summary report of variation in dbSNP for the selected gene or genes.

    Tips for Previous Users of LocusLink

    These links are related to the one provided by the [V]ariation button displayed from the Brief or Summary displays of query results for individual records and the VAR button on the Locus Report page. There is, however, a significant difference. The report is not the summary of all variants in a gene context that is used by the link from LocusLink. Rather, it is the link to all reference SNP (rs) records found in a gene. There are several ways to generate a gene-specific display. One is to click on the LocusLink connection seen to the left of every Links menu, go to LocusLink, and then use the VAR button. Another is to open one of the rs records, scroll down to the gene-specific section, currently called LocusLink Analysis, and click on the all link.

    Display Taxonomy Links back to top

    The Taxonomy links display option supports navigation to the Taxonomy database.

    Tips for Previous Users of LocusLink

    There was no link to Taxonomy from LocusLink.

    Display UniGene Links back to top

    The UniGene links display option supports navigation to UniGene for the selected gene or genes. This link is calculated from nucleotide accessions that are associated with both resources. The UniGene cluster is also included in Gene records for vertebrate genomes, making it possible to query Gene directly by cluster.

    Tips for Previous Users of LocusLink

    These links are related to the one provided by the [U]niGene button displayed from the Brief or Summary displays of query results for individual records and the UniGene button on the Locus Report page. Entrez Gene adds a new opportunity to retrieve UniGene links for more than one gene record at a time.

    Display UniSTS Links back to top

    The UniSTS links display option supports navigation to UniSTS for the markers associated with the selected gene or genes. This gene-to-marker association is calculated from reports from external databases, or by e-PCR-based matches to cDNAs or annotated genes.

    Tips for Previous Users of LocusLink

    These links are related to individual links to UniSTS anchored by marker name on the Locus Report page. Entrez Gene adds a new opportunity to retrieve UniSTS links for more than one gene record and more than one marker at a time.

    Query tips back to top

    All functions of the Entrez indexing and query engine are used by Gene. This section will therefore summarize only how to use the tools in the context of the Gene database. General instructions for all Entrez databases are available here.

    Limits back to top

    The Limits page allows you to set the context for making queries to Entrez Gene. It is accessed from the Limits tab at the lower left in the grey query bar.

    If, for example, you want to retrieve only genes encoded on plastids, you would check 'Plastids' in the Include Only section, enter your additional query terms in the Search for box, and press Go. The Limits page is designed to make it easier to do the following types of queries by checking boxes, rather than writing out the text of an Entrez query. Note that Limits persist through multiple queries unless you deselect them. You can see whether Limits are set on your current query result from the explanatory text in the yellow banner toward the top of the results page. You can turn off Limits at any time by removing the check to the left of the word Limits in the query bar of a result page.

    Examples using limits:

    Table 4 summarizes the Fields, filters, and properties that are used to categorize information in Gene records. The table also provides examples of how to use these entities effectively to retrieve records. The table is alphabetized by the values in the Field Name menu.

    Table 4. Fields, Filters, and Properties in Entrez Gene

    Field name | Definition [including field abbreviations] | Examples

    Name subcategory
    Disease name or phenotype of mutants | Disease or phenotype associated with the record. [DIS] | Find the genes that contribute to SCID: SCID[dis]
    Gene name | A symbol for the gene. Includes preferred symbols, aliases, and locus tags. [SYM][SYMB][GN][GENE NAME] | Genes with a symbol starting with smt: smt*[sym]
    Gene/protein name | The short or full name of the gene or any of its protein products (when applicable). | Find genes that have the word kinase in GO annotation but do not have the word kinase in the name: kinase[gene ontology] NOT kinase[gene/protein name]

    Location subcategory
    Chromosome | Chromosome location of the gene. The value used is according to the convention of the source genome. In other words, if III is used, III but not 3 will be indexed in this field. [CHRM][CHR][CHROMOSOME] | Retrieve records containing the word kinase where the gene is located on chromosome III: kinase AND III[chr]; retrieve records containing the words zinc and finger that are of human origin but not on chromosome 19: zinc finger NOT 19[chr] AND "Homo sapiens"[orgn]
    Default map location | A map location in the units standard for the genome. For example, for human it is cytogenetic band, for mouse it is the MGI map (centiMorgans). This is processed as a text field, so range queries are not implemented. For range queries, use Map Viewer. | Rat genes mapped to 18q: rat[orgn] AND 18q[default map location]

    Sequence subcategory (In Gene, this means searching by sequence identifier, not by the sequence itself, which is managed by BLAST.)
    Nucleotide accession | An accession for a nucleotide sequence. [NACC] | There are instances where the same accession is applied to both nucleotide and protein sequences. To restrict an accession to nucleotide, use this field. (Accessions beginning with BC are not in this category.) BC052629[NACC]
    Nucleotide UID | The gi of a nucleotide sequence. [NUID][NUCL_UID][NUCLEOTIDE_UID] | Many integer identifiers have overlapping number spaces. To find the gene record that corresponds to a given nucleotide gi from Gene, use this field: 27363473[NUID]
    Protein accession | An accession for a protein sequence. [PACC][PROT_ACCN] | There are instances where the same accession is applied to both nucleotide and protein sequences. To restrict an accession to protein, use this field. (Accessions beginning with three letters are not in this category.) AAH52629[PACC]
    Protein UID | The gi of a protein sequence. [PUID][PROT_UID][PROTEIN UID] | Many integer identifiers have overlapping number spaces. To find the gene record that corresponds to a given protein gi from Gene, use this field: 27363473[PUID]
    Nucleotide or Protein accession | A sequence accession of any type. [ACCN] | Find all the genes encoded in accession AE003828: AE003828

    Miscellaneous subcategory (alphabetical)
    Creation date | Date the record was created. [cd][cdat][creation date] | Records containing the word xenopus created between February 5, 2004 and February 12, 2004: 2004/2/5:2004/2/12[cd] AND xenopus[orgn]
    EC/RN number | Enzyme commission identifier for a product of the gene. Indexed without the EC prefix. [ECNO][EC] | Retrieve records where proteins have an E.C. number of 1.9.3.1: 1.9.3.1[ECNO]
    Filter | Find records with a relationship to other data in Gene. For more examples of use of filters, see the preview/index section. | Retrieve records of mouse kinase genes with expression data stored in GEO: mouse[orgn] AND gene_geo[filter] AND kinase
    Gene Ontology | GO terms applied to this gene and the GO identifier as the integer. The terms include the component, function, and process categories. [GO][GENE ONTOLOGY] | Rat genes with GO terms starting with 'kinase signaling': kinase signaling*[gene ontology] rat[orgn]; any gene with the GO id of GO:0004872: 4872[GO]
    LocusLink ID | The gene identifier from LocusLink. [LID][LOCUS_ID] | Retrieve the record where LocusID = 2: 2[LID]
    MIM | Identifier assigned to human genes and phenotypes by OMIM. [MIM] | Retrieve records that contain the MIM number 181510: 181510[MIM]
    Modification date | Last date the record was modified. [MODDATE][MDAT][LMOD][DATE][UPDATED][MD] | Retrieve records for genes from eubacterial genomes last modified after March 10, 2004: eubacteria[orgn] AND 2004/3/10:2010/1/1[md]; retrieve records from sea urchins modified in the last 30 days: echinoidea[orgn]+AND+"last 30 days"[mdat]
    Property | An attribute of a Gene record based on its content. [prop][property] | Mouse records with transcript variants: mouse[orgn] AND "has transcript variants"[property]
    PubMed UID | PubMed ID. [PMID] | Many integer identifiers have overlapping number spaces. To find the gene record(s) that correspond to a paper in PubMed from Gene, use this field: 12477932[PMID]
    Taxonomy ID | Identifier for the species or strain in the NCBI taxonomy database. [TAXID][TID] HINT: txid{value} also works, e.g. txid9606 | Find all records in Entrez Gene for the pig: 9823[taxid]; alternatively, txid9823
    Text Word | Any word in the record. [TEXT][WORD][AB][TXT] | Retrieve records that contain '32' in a record that also contains threonine, serine and kinase: serine AND threonine AND kinase AND 32[TEXT]
    UniGene cluster number | UniGene cluster including the text prefix. [UNIGENE], [UGEN] | Hs.2[UNIGENE]
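
    The field-restricted queries in Table 4 all follow the pattern value[field], combined with the boolean operators AND and NOT. The following is a minimal sketch of how such query strings might be assembled in a script. The helper names (field_term, date_range, all_of, but_not) are hypothetical and are not part of any NCBI library; the sketch only builds text and does not contact Entrez.

```python
def field_term(value, field=None):
    """Return one query term, optionally restricted to an Entrez field."""
    value = str(value)
    # Quote multi-word values so the field tag covers the whole phrase,
    # as in "Homo sapiens"[orgn].
    if " " in value and not value.startswith('"'):
        value = f'"{value}"'
    return f"{value}[{field}]" if field else value


def date_range(start, end, field="cd"):
    """Return a date-range term such as 2004/2/5:2004/2/12[cd]."""
    return f"{start}:{end}[{field}]"


def all_of(*terms):
    """Join terms with the boolean AND operator."""
    return " AND ".join(terms)


def but_not(wanted, unwanted):
    """Exclude records matching `unwanted` from those matching `wanted`."""
    return f"{wanted} NOT {unwanted}"


# Rebuild three of the example queries from Table 4.
print(all_of(field_term("kinase"), field_term("III", "chr")))
print(all_of(date_range("2004/2/5", "2004/2/12"), field_term("xenopus", "orgn")))
print(but_not(field_term("kinase", "gene ontology"),
              field_term("kinase", "gene/protein name")))
```

    Keeping each term in its own helper call makes it harder to attach a field tag to the wrong word when a query grows to several clauses.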

    Preview/Index back to top

    The Preview/Index page on any Entrez database is a powerful resource to construct useful queries and to view terms that have been indexed under any field name. Table 4 in the previous section described the fields used in indexing the records and provided some representative queries using those fields. This section will:

    1. Describe filters in general and how they can be used to find records of interest in Gene.
    2. Describe the properties assigned to gene records and provide examples of how to use them.


    The term filter is used to describe categories of records grouped based on their relationship either to other Entrez databases or to external resources that have submitted LinkOut connections. If the former, the filter is named according to the pattern 'gene other_Entrez_database', such as 'gene protein'. If the latter, the first two letters of the filter's name are 'lo', for linkout. For a comprehensive listing of filters valid for the Gene database, and the number of records in each, follow these steps:

    1. click on Preview/Index under the query bar.
    2. use the pull-down menu named 'All Fields' and select Filter
    3. click on index button to the right of Preview to see the filter names and the number of instances of each

    Filters in the Gene database are powerful tools to retrieve records of interest. For example, to retrieve all records for human genes that are associated with OMIM (i.e. have connections to OMIM) and have been annotated on the genome, use the 'AND' operator with both 'gene omim' and 'gene nucleotide pos'. The current set of filters is:

    Filter name | Definition
    all | total records, current or not
    gene all | all current records
    gene books | Gene records with explicit links to Entrez Books
    gene gensat | Gene records with explicit links to Entrez GenSAT
    gene geo | Gene records with explicit links to Entrez GEO
    gene homologene | Gene records with explicit links to Entrez HomoloGene
    gene nucleotide | Gene records with explicit links to Entrez Nucleotide, excluding RefSeq chromosome or contig accessions
    gene nucleotide pos | Gene records with explicit links to Entrez Nucleotide, limited to those of RefSeq chromosome or contig accessions, and thus including position data
    gene omim | Gene records with explicit links to Entrez OMIM, and thus includes links to both disease and 'gene' records in OMIM
    gene protein | Gene records with explicit links to Entrez Protein, and thus includes links to GenPept and SwissProt accessions
    gene pubmed | Gene records with explicit links to Entrez PubMed
    gene snp | Gene records with explicit links to Entrez dbSNP, and thus supports finding genes with variation information available in dbSNP
    gene taxonomy | Gene records with explicit links to Entrez Taxonomy
    gene unigene | Gene records with explicit links to Entrez UniGene
    gene unists | Gene records with explicit links to Entrez UniSTS (marker data)
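
    The naming pattern for filters and the OMIM example described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the treatment of 'all' and 'gene all' as a separate "status" group is an assumption based on their definitions in the list, and "loexample" is a made-up LinkOut filter name. The quoted form "gene omim"[filter] follows the sample queries shown earlier in this document.

```python
# Filters whose definitions describe record status rather than links
# (an assumption drawn from the definitions in the filter list).
STATUS_FILTERS = {"all", "gene all"}


def filter_kind(name):
    """Classify a filter name using the naming pattern described in the text."""
    if name in STATUS_FILTERS:
        return "status"
    if name.startswith("lo"):
        return "linkout"   # submitted by an external LinkOut provider
    if name.startswith("gene "):
        return "entrez"    # relationship to another Entrez database
    return "other"


def with_filters(query, *filter_names):
    """AND a query with one or more filters, quoting each filter name."""
    parts = [query] + [f'"{name}"[filter]' for name in filter_names]
    return " AND ".join(parts)


print(filter_kind("gene omim"))
print(filter_kind("loexample"))  # hypothetical LinkOut filter name
print(with_filters("human[orgn]", "gene omim", "gene nucleotide pos"))
```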


    Properties are assigned to gene records based on content, rather than relationship to other database records, which is the role of filters (see previous section). At present, the properties assigned to Gene records fall into these major categories:

    1. type of gene: property named as genetype name_of_type
    2. source of the gene: property named as source name_of_source
    3. type of RefSeq provided for the gene: property named as srcdb refseq type_of_refseq
    4. other

    The genetype options should be self-explanatory, except perhaps for miscrna, other and unknown. Names for the types of molecules encoded by genes follow the conventions of the collaborating sequence databases (DDBJ/EMBL/GenBank); thus miscrna (misc_rna, miscellaneous RNA) is assigned to any gene that encodes an RNA product not included in the other specific categories. The genetype other property is applied to loci of known type for which a specific category has not yet been applied in the Entrezgene data model (e.g., named fragile sites). The genetype unknown property is applied to probable genes for which the type is still under review. This category is frequently used when the defining sequence has uncertain coding propensity. NOTE: At the time of this writing, the assignment of gene records to the latter two categories continues to be refined. We appreciate your suggestions for any improvements.

    The source options should be self-explanatory, with source other being used where a specific category has not yet been applied in the Entrezgene data model.

    The srcdb refseq categories are as enumerated by RefSeq and will not be duplicated here.

    The other properties (i.e., #4 above) are explained in the following table.

    Property name | Explanation
    alive | the record is current and primary (i.e. not secondary or discontinued). The term secondary is applied to any record that has been merged into another. This occurs most often when multiple genes are defined based on incomplete data, and these are later discovered to be parts of the same gene. One gene record then becomes secondary to the other.
    GeneRIF | a record having one or more GeneRIF annotations attached
    has ccds | a gene that encodes a protein sequence that is a member of a Consensus CDS (CCDS). See http://www.ncbi.nlm.nih.gov/projects/CCDS/
    has transcript variants | a record having two or more associated RefSeq transcripts, i.e. splice variants. NOTE: this is limited to RefSeq annotation, and should NOT be used to identify all genes exhibiting alternative splicing, promoter usage, and/or polyadenylation signals.
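
    The four property categories listed above are distinguished by the leading words of the property name. The following sketch, with hypothetical function names, sorts a property name into its category by prefix and formats it as a query term in the quoted style used in Table 4. It assumes the prefixes are exactly 'genetype', 'source', and 'srcdb refseq' as stated in the list.

```python
# Prefix -> category, following the numbered list of property categories.
PROPERTY_PREFIXES = (
    ("genetype ", "type of gene"),
    ("srcdb refseq ", "type of RefSeq"),
    ("source ", "source of the gene"),
)


def property_category(name):
    """Return the category of a property name, or 'other' if no prefix matches."""
    for prefix, category in PROPERTY_PREFIXES:
        if name.startswith(prefix):
            return category
    return "other"


def property_term(name):
    """Format a property name as a quoted, field-restricted query term."""
    return f'"{name}"[property]'


for name in ("genetype unknown", "source other",
             "srcdb refseq known", "has transcript variants", "alive"):
    print(f"{name}: {property_category(name)}")

print("mouse[orgn] AND " + property_term("has transcript variants"))
```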

    History back to top

    Use of History in Entrez Gene is consistent with all other Entrez databases. You may refer to the History section of the Entrez help documentation for more information.

    Clipboard back to top

    Use of Clipboard in Entrez Gene is consistent with all other Entrez databases. You may refer to the Clipboard section of the Entrez help documentation for more information.

    Details back to top

    Use of Details in Entrez Gene is consistent with all other Entrez databases. You may refer to the Details Button section of the Entrez help documentation for more information.

    Examples back to top

    Constructing queries based on free text, filters, and properties can be quite powerful in retrieving records of interest from Gene. The following table summarizes some of these approaches by describing the scope of each query, the query itself, and notes on how it is constructed.

    Although these examples use field restriction (see Table 4 for the comprehensive list of fields, filters, and properties in Entrez Gene), free text can also be submitted. Entrez Gene then weights the retrievals based on the field in which a result was found. For example, if your query matches a gene symbol in one record and arbitrary text in another, the record where the match is on the symbol will be displayed before the other in the results.

    Scope: find genes mapped to Arabidopsis thaliana chromosome 3 that have orthologs reported in HomoloGene
    Query: arabidopsis thaliana[orgn] AND 3[chr] AND gene_homologene[filter]
    Notes:
    • [orgn] is used to restrict 'Arabidopsis thaliana' to the organism field. That restriction could also be set by checking Arabidopsis thaliana on the Limits form.
    • [chr] is used to restrict '3' to the chromosome field.
    • gene homologene[filter] is used to restrict records to those processed by HomoloGene.
    • This query cannot currently be processed by Map Viewer, because the relationship to HomoloGene is not processed for indexing at present, nor by HomoloGene, because the chromosome data are not captured in HomoloGene.

    Scope: find genes also being processed by OMIM but for which there is not currently a RefSeq of the type 'known'
    Query: gene omim[filter] NOT srcdb refseq known[prop]
    Notes:
    • gene omim[filter] is used to find all Gene records with relationships to OMIM.
    • srcdb refseq known[prop] is used (as the boolean NOT) to find all such records that do not have RefSeqs of the accession format NM_000000, NG_000000, or NR_000000.

    Scope: find genes from genomes other than mammals that are classified by the GO consortium to have some relationship to the cytoskeleton
    Query: cytoskeleton[go] NOT mammalia[orgn]
    Notes:
    • [go] is used to restrict to the field 'Gene Ontology'.
    • [orgn] is used to restrict (as the boolean NOT) to species not classified as mammals.
    • Queries based on GO terms are not supported in either Map Viewer or HomoloGene. Please note that Gene does NOT recapitulate tree-based searching for GO annotation; this retrieval is based solely on the existence of the word in any GO category. Links are provided to the GO Web site to support more specific searches.
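
    Queries such as those above can also be submitted from a script. The sketch below only constructs a request URL; it assumes NCBI's E-utilities ESearch service (esearch.fcgi with the db, term, and retmax parameters), which is not described in this section, so check the E-utilities documentation before relying on it. The function name esearch_url is hypothetical, and no network request is made.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint: the E-utilities ESearch service (not covered in this section).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(query, retmax=20):
    """Build an ESearch URL that runs `query` against the Gene database."""
    params = {"db": "gene", "term": query, "retmax": retmax}
    # urlencode escapes the brackets and turns spaces into '+'.
    return f"{ESEARCH}?{urlencode(params)}"


url = esearch_url("cytoskeleton[go] NOT mammalia[orgn]")
print(url)

# Decoding the URL recovers the original query text unchanged.
print(parse_qs(urlsplit(url).query)["term"][0])
```

    Letting urlencode handle the escaping avoids hand-writing the '+' separators seen in the "last 30 days" example in Table 4.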

    Tips for the programmer back to top

    Contents:

    The Gene Data Model back to top

    The data model for Entrez Gene is documented in the Entrezgene specification. It combines several definitions used by other NCBI databases, such as seqfeat, but also establishes definitions specific to Entrezgene. Of special note is the Gene-commentary, which is used to represent many descriptors of genes. Each Gene-commentary is defined by type, and supports specific representation of such elements as sequence database accessions (accession, version), citations (refs), external or internal resources defining the data (source), and position information. Heading, label, and text are used for general data, with the choice influenced by display in the Entrez Gene viewers.

    The DTD for Gene is available from NCBI's DTD directory: /dtd/
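
    Because a Gene-commentary can contain further Gene-commentaries (under products and comment), code that reads Gene records usually needs to walk a tree. The following sketch mirrors a small subset of the elements named above in a hypothetical Python class; it is not the Entrezgene specification and omits refs, source, and position information. The accessions and versions are taken from the RefSeq example later in this section.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class GeneCommentary:
    """Simplified, hypothetical mirror of a Gene-commentary (subset of elements)."""
    type: str
    heading: Optional[str] = None
    label: Optional[str] = None
    text: Optional[str] = None
    accession: Optional[str] = None
    version: Optional[int] = None
    products: List["GeneCommentary"] = field(default_factory=list)
    comment: List["GeneCommentary"] = field(default_factory=list)


def accessions(node: GeneCommentary) -> Iterator[str]:
    """Yield accession[.version] for a node and everything nested beneath it."""
    if node.accession is not None:
        if node.version is not None:
            yield f"{node.accession}.{node.version}"
        else:
            yield node.accession
    for child in node.products + node.comment:
        yield from accessions(child)


# An mRNA RefSeq with its protein product nested under it.
refseq = GeneCommentary(
    type="comment",
    heading="NCBI Reference Sequences (RefSeq)",
    products=[
        GeneCommentary(
            type="mRNA", heading="mRNA Sequence",
            accession="NM_000014", version=3,
            products=[
                GeneCommentary(type="peptide", heading="Product",
                               accession="NP_000005", version=3),
            ],
        ),
    ],
)

print(list(accessions(refseq)))
```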

    Converting from Use of LL_tmpl back to top

    LL_tmpl description Entrezgene equivalent
    >>[numeric] record separator; the number equals the LocusID
    NOTE: LocusID = geneid for those records also public in LocusLink
    Record set closed by bracket matching
    Entrezgene ::= {
    
    LOCUSID [numeric] [unique] [required] the unique integer id for a locus
    NOTE: LocusID = geneid for those records also public in LocusLink
    geneid
    Entrezgene ::= {
      track-info {
        geneid 2,
    
    CURRENT_LOCUSID: [numeric] [unique] [optional] If a LocusID has been merged with another, the current LOCUSID corresponding to the value on the previous LOCUSID line, is provided here. current-id
    In this example, geneid 217346 is secondary to geneid 193217.
    Entrezgene ::= {
      track-info {
        geneid 217346,
        status secondary,
        current-id {
          {
            db "LocusID",
            tag id 193217
          },
          {
            db "GeneID",
            tag id 193217
          }
        },
    
    LOCUS_CONFIRMED: [alphanumeric][yes|no] The LOCUSID has been assigned to a confirmed locus and can be treated as an identifier that will be tracked. No direct equivalent at present
    LOCUS_TYPE: [alphanumeric] description of the type of locus type
    type protein-coding,
    

    and
    evidence (under development)
    ORGANISM: [alphanumeric] [unique] [required]source species (Homo sapiens, Rattus norvegicus, etc.), based on NCBI's Taxonomy source->taxname
    source {
        genome genomic,
        origin natural,
        org {
          taxname "Homo sapiens",
          common "human",
          db {
            {
              db "taxon",
              tag id 9606
            }
          },
          syn {
            "man"
          },
          orgname {
            name binomial {
              genus "Homo",
              species "sapiens"
            },
            lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
     Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo",
            gcode 1,
            mgcode 2,
            div "PRI"
          }
        },
    
    RELL: [set][optional][alphanumeric][multiple] description|id|id type|print representation[/set] brief text summarizing the relationship, the other id, the type of id, and the display for that second id. At present these id types are of 2 classes: l for locus_id, n for nucleotide accession official/default symbol for the other locus being described

    The function of reporting the other GeneID is not retained. Consider HomoloGene's ftp site for these data.

    Gene/Gene relationship of Homology
    homology {
        {
          type comment,
          heading "Mouse",
          label "A2m",
          text "6 62.00 cM",
          source {
            {
              src {
                db "HomoloGene",
                tag str "A2m"
              },
              anchor "A2m",
              url "/Homology/view.cgi?map=ncbi_mgd&tax_id=10090&chr=6&symbol=A2m"
            }
          }
        }
      },
    
    STATUS: [alphanumeric] [optional] (only if a reference sequence exists) [REVIEWED|PROVISIONAL|PREDICTED|MODEL|INFERRED] type of reference sequence record Gene-commentary of type comment with heading ="RefSeq Status" and label of the appropriate value
        {
          type comment,
          heading "RefSeq Status",
          label "REVIEWED"
        },
    
    
    NC: the accession for chromosome RefSeq records [alphanumeric] [optional] (only if a reference sequence exists) the RefSeq accession for a genomic record, followed by the gi and strain, if applicable. locus->accession
     locus {
        {
          type genomic,
          label "tat",
          accession "NC_001802",
          seqs {
            int {
              from 5376,
              to 7969,
              strand plus,
              id gi 9629357
            }
          },
    
    NR: the RefSeq accession for a non-messenger RNA. May be in two places:
    • if annotated on a genomic RefSeq, then in locus, products, type="...", accession
    • always in comments, type comment, products, type ..., heading "...Sequence", accession
            type comment,
            heading "NCBI Reference Sequences (RefSeq)",
            products {
              {
                type mRNA,
                heading "mRNA Sequence",
                accession "NM_000014",
                version 3,
                source {
                  {
                    src {
                      db "Nucleotide",
                      tag str "6226959"
                    },
                    anchor "NM_000014"
                  }
                },
                seqs {
                  whole gi 6226959
                },
                products {
                  {
                    type peptide,
                    heading "Product",
                    accession "NP_000005",
                    version 3,
                    source {
                      {
                        src {
                          db "Protein",
                          tag id 4557225
                        },
                        anchor "NP_000005",
                        post-text "alpha 2 macroglobulin precursor"
                      }
                    },
                    seqs {
                      whole gi 4557225
                    },
      
    NM: the RefSeq accession for a mRNA record [alphanumeric] [optional] (only if a mRNA reference sequence exists) the accession for the mRNA, followed by the gi and the strain, if applicable
    NP: the RefSeq accession for a protein record [alphanumeric] [optional] (only if a reference sequence exists) the RefSeq accession number for a protein record, followed by the PID for that protein
    XR: [alphanumeric][optional] (only if a model exists) the RefSeq accession of a model RNA, not associated with a protein product
    XM: [alphanumeric] [optional] (only if a model exists) the accession for the mRNA, followed by the gi and the strain, if applicable
    XP: the RefSeq accession for a model protein record [alphanumeric] [optional] (only if an XM exists) the RefSeq accession of a model protein, followed by the PID for that protein
    NG: the RefSeq accession for genomic region (nucleotide) records Gene-commentary only, heading "NCBI Reference Sequences (RefSeq)"
       {
          type comment,
          heading "NCBI Reference Sequences (RefSeq)",
          comment {
            {
              type genomic,
              heading "Reference",
              accession "NG_002315",
              version 1,
              source {
                {
                  src {
                    db "Nucleotide",
                    tag id 24047158
                  },
                  anchor "NG_002315"
                }
              },
              seqs {
                int {
                  from 1,
                  to 652,
                  strand plus,
                  id gi 24047158
                }
              }
            }
          }
        },
    
    PRODUCT: [alphanumeric] [optional] (only if a reference sequence exists) the name of the product of this transcript Provided as post-text in the Gene-commentary for the protein accession
             products {
                {
                  type peptide,
                  heading "Product",
                  accession "NP_000005",
                  version 3,
                  source {
                    {
                      src {
                        db "Protein",
                        tag id 4557225
                      },
                      anchor "NP_000005",
                      post-text "alpha 2 macroglobulin precursor"
                    }
                  },
    
    
    TRANSVAR: [alphanumeric] [optional] (only if a reference sequence exists) a variant-specific description Gene-commentary, of type comment, where heading "Transcriptional Variant". Within the Reference Sequences Gene-commentary, indented under the RNA product.
             comment {
                {
                  type comment,
                  heading "Transcriptional Variant",
                  comment {
                    {
                      type comment,
                      text "Transcript Variant: This variant (PAX3A) includes an
     alternate segment in the coding region, which causes a frameshift, and lacks
     several segments in the 3' coding region, compared to variant PAX3. The
     resulting protein (isoform PAX3a) has a shorter and distinct C-terminus,
     compared to isoform PAX3. Isoform PAX3a lacks the paired-type homeodomain."
                    }
                  }
                },
    
    ASSEMBLY: [alphanumeric] [optional][multiple] (only if a reference sequence exists)[/SET] Gene-commentary, of type other, where heading "Source Sequence". Within the Reference Sequences Gene-commentary, indented under the RNA product.
              {
                  type other,
                  heading "Source Sequence",
                  source {
                    {
                      src {
                        db "Nucleotide",
                        tag str "AJ007392,S69369"
                      },
                      anchor "AJ007392,S69369"
                    }
                  },
    
    
    CONTIG: [SET][alphanumeric][optional][multiple] the accession.version of the RefSeq contig, the nucleotide gi, the strain, the position of the gene (from|to|orientation), the chromosome, and an indicator of whether this is on the reference assembly or a strain|haplotype  
    XG: [alphanumeric][optional] (only if an NG accession was used in the annotation process to define position of features on the contig) NG accession, nucleotide gi, strain [SET] The function of indicating whether an NG accession was used in NCBI's annotation process is not currently retained. The NG accessions are, however, included in the Reference Sequence section.
    EVID: [alphanumeric] [optional] (only if a model exists) text summary of the evidence for this model The function reporting the evidence supporting an annotated gene or RNA feature is not currently retained.
    CDD: [alphanumeric][multiple][optional] name|key|score|e_value|bit_score [/SET] [/SET] Gene reports domain content; position of these domains is part of the annotation of the RefSeq protein. The domain information is included as a gene-commentary of type other, with the heading Domains on a gene-commentary of type peptide. The e-value is not reported.
    comment {
                    {
                      type other,
                      heading "Domains",
                      comment {
                        {
                          type other,
                          source {
                            {
                              src {
                                db "CDD",
                                tag id 5952
                              },
                              anchor "pfam00207: Alpha-2-macroglobulin family",
                              post-text "score:2365"
                            }
                          }
                        },
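As a minimal sketch of how a consumer might read such a domain entry, the following assumes the ASN.1 has already been converted into nested Python dictionaries whose keys mirror the ASN.1 field names. That representation, and the function name parse_domain, are assumptions made for illustration and are not part of Gene. The split of anchor into an accession and a name at ": " is inferred from the single example above.

```python
def parse_domain(source):
    """Split one CDD source entry into id, accession, name, and score."""
    # The anchor in the example reads "<accession>: <domain name>".
    accession, _, name = source["anchor"].partition(": ")
    # post-text in the example reads "score:<integer>"; the e-value is not reported.
    label, _, value = source.get("post-text", "").partition(":")
    score = int(value) if label == "score" and value.isdigit() else None
    return {
        "cdd_id": source["src"]["tag"]["id"],
        "accession": accession,
        "name": name,
        "score": score,
    }


# Values copied from the ASN.1 excerpt above.
example_domain = {
    "src": {"db": "CDD", "tag": {"id": 5952}},
    "anchor": "pfam00207: Alpha-2-macroglobulin family",
    "post-text": "score:2365",
}
domain = parse_domain(example_domain)
```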
    
    ACCNUM: GenBank nucleotide accession related to the RefSeq record [SET][alphanumeric] [optional] [multiple] nucleotide sequence accession number (no version), nucleotide gi, strain (if applicable), 5' end of the gene in the sequence, 3' end of the gene in the sequence; one accession number per line. The data previously reported as ACCNUM, TYPE, and PROT are now reported as a set of gene-commentaries starting with one of type comment with the heading Related Sequences. The nucleotide sequence information is reported under products, as a gene-commentary of type mRNA. If that nucleotide sequence has an associated accession for one or more protein products, those data are reported under products as type peptide. Accession, version, and gi are provided. The function of reporting the position coordinates if there is no protein product is not currently retained.
    Example of an mRNA, its encoded protein, and strain of origin
         type comment,
          heading "Related Sequences",
          products {
            {
              type mRNA,
              heading "mRNA",
              accession "AY185125",
              version 1,
              source {
                {
                  src {
                    db "Nucleotide",
                    tag id 27966960
                  },
                  anchor "AY185125"
                }
              },
              seqs {
                whole gi 27966960
              },
              products {
                {
                  type peptide,
                  accession "AAO25741",
                  version 1,
                  source {
                    {
                      src {
                        db "Protein",
                        tag id 27966961
                      },
                      anchor "AAO25741"
                    }
                  },
                  seqs {
                    whole gi 27966961
                  }
                }
              },
              comment {
                {
                  type other,
                  label "Strain",
                  text "C57BL/6"
                }
              }
            },
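The nesting of a nucleotide sequence and its protein products can be walked as in the sketch below. It assumes the Related Sequences gene-commentary has been converted into nested Python dictionaries with keys matching the ASN.1 field names; that conversion and the helper names versioned and related_sequences are hypothetical and not part of Gene.

```python
def versioned(item):
    """Join accession and version in the usual accession.version form."""
    return f'{item["accession"]}.{item["version"]}'


def related_sequences(commentary):
    """List each nucleotide product with the proteins nested beneath it."""
    rows = []
    for nucleotide in commentary.get("products", []):
        proteins = [
            versioned(product)
            for product in nucleotide.get("products", [])
            if product.get("type") == "peptide"
        ]
        rows.append((nucleotide["type"], versioned(nucleotide), proteins))
    return rows


# Values copied from the example above (gi and strain fields omitted for brevity).
example_related = {
    "type": "comment",
    "heading": "Related Sequences",
    "products": [
        {
            "type": "mRNA",
            "accession": "AY185125",
            "version": 1,
            "products": [
                {"type": "peptide", "accession": "AAO25741", "version": 1}
            ],
        }
    ],
}
rows = related_sequences(example_related)
```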
    
    TYPE: [e|m|g] refers to type of nucleotide sequence: e=EST, m=mRNA, g=genomic
    PROT: [SET][multiple][optional] A potentially repeating set of two values: accession and identifier (PID value) for the coding region or regions annotated on the associated nucleotide record, one line for each accession. If no data are available, na is supplied. The delimiter is |. [/SET][/SET]
    [OFFICIAL|PREFERRED]_SYMBOL: [alphanumeric] [unique] [required] the symbol used for gene reports. OFFICIAL: validated by the appropriate nomenclature committee. PREFERRED: interim option selected for display. na is used for models without evidence. The preferred symbol and preferred name are reported as gene->locus and gene->desc, respectively. Whether or not these are official is not explicitly represented. If there is a value for locus-tag, the resource associated with that locus-tag should be used to determine if the names are official or interim. If locus is not supplied, however, it indicates no official symbol has been identified.
    A record with an official symbol and name.
    
    gene {
        locus "A2m",
        desc "alpha-2-macroglobulin",
        ...
    

    A record with no identified official symbol or name.
     gene {
        desc "spongiotrophoblast specific protein",
        maploc "17p14",
        db {
          {
            db "LocusID",
            tag id 64509
          }
        },
        syn {
          "Tpbp"
        },
        locus-tag "RGD:621454"
      },
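The rules above for interpreting locus and locus-tag can be sketched as a small decision function. The dictionary form of the gene block, the function name symbol_status, and the returned labels are assumptions made for illustration. Treating the text before the colon in locus-tag as the name of the nomenclature resource is inferred from the single example "RGD:621454".

```python
def symbol_status(gene):
    """Classify nomenclature status from the gene block of a record.

    Returns (status, resource), where status is one of:
      "none"    - locus is absent, so no official symbol has been identified
      "consult" - locus-tag is present; ask the named resource whether the
                  names are official or interim
      "unknown" - the record alone does not say whether the symbol is official
    """
    if gene.get("locus") is None:
        return ("none", None)
    locus_tag = gene.get("locus-tag")
    if locus_tag is not None:
        # Assumption: the prefix before ":" identifies the resource.
        return ("consult", locus_tag.split(":", 1)[0])
    return ("unknown", None)


# The two records shown above, reduced to the fields that matter here.
with_symbol = {"locus": "A2m", "desc": "alpha-2-macroglobulin"}
without_symbol = {
    "desc": "spongiotrophoblast specific protein",
    "locus-tag": "RGD:621454",
}
```

Checking locus first reflects the statement that a missing locus means no official symbol has been identified, regardless of whether a locus-tag is present.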
    
    [OFFICIAL|PREFERRED]_GENE_NAME: [alphanumeric] [unique] [required (but may be null)] the gene description used for gene reports. OFFICIAL: validated by the appropriate nomenclature committee. PREFERRED: interim option selected for display. [NOTES--If the symbol is official, the gene_name will be official. No record will have both official AND interim nomenclature.]
    PREFERRED_PRODUCT: [alphanumeric] [unique] [optional] the name of the product used in the RefSeq record. The name of any RefSeq protein product is reported as part of the protein's gene-commentary, as post-text.
    type peptide,
                  heading "Product",
                  accession "NP_057236",
                  version 2,
                  source {
                    {
                      src {
                        db "Protein",
                        tag id 7706625
                      },
                      anchor "NP_057236",
                      post-text "retinoic acid receptor, beta isoform 2"
                    }
                  },
                  seqs {
                    whole gi 7706625
                  },
    
    
    ALIAS_SYMBOL: [alphanumeric][multiple] other symbols associated with this gene. All aliases are listed as synonyms in gene->syn.
      syn {
          "HAP",
          "RRB2",
          "NR1B2"
        },
    
    ALIAS_PROT: [alphanumeric][multiple] other protein names associated with this gene. All protein names are enumerated as prot->name.
      prot {
        name {
          "retinoic acid receptor, beta",
          "RAR-epsilon",
          "RAR, beta form",
          "HBV-activated protein",
          "retinoic acid receptor beta 2",
          "retinoic acid receptor beta 4",
          "hepatitis B virus activated protein",
          "retinoic acid receptor, beta polypeptide"
        }
      },
    
    REL2: [set][optional][alphanumeric][multiple] LocusID of the interacting protein| RefSeq accession of the interacting protein| name of the interacting protein| keyword for the type of interaction| accession of the RefSeq protein associated with this locus| name of the RefSeq protein at this locus| a description of the interaction| PubMed id(s) describing the interaction [/set] Not yet implemented.
    PHENOTYPE: [SET][alphanumeric][multiple] a phenotype associated with a mutation in this gene. Descriptions of phenotypes associated with a gene are reported in gene-commentaries of type comment with the heading Phenotypes. The name of any phenotype is provided as text, and the source of that name and its identifier in that source are reported as database cross-references under source.
          type comment,
          heading "Phenotypes",
          comment {
            {
              type comment,
              text "Cystic fibrosis",
              source {
                {
                  src {
                    db "MIM",
                    tag id 219700
                  },
                  anchor "MIM: 219700"
                }
              }
            },
            {
              type comment,
              text "Pancreatitis, idiopathic",
              source {
                {
                  src {
                    db "MIM",
                    tag id 602421
                  },
                  anchor "MIM: 602421"
                }
              }
            },
            {
              type comment,
              text "Sweat chloride elevation without CF",
              source {
                {
                  src {
                    db "MIM",
                    tag id 602421
                  },
                  anchor "MIM: 602421"
                }
              }
            }
          }
        },
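To illustrate how the phenotype name and its cross-reference relate, the sketch below flattens a Phenotypes gene-commentary into (name, database, identifier) rows. It assumes the ASN.1 has been converted into nested Python dictionaries keyed by the ASN.1 field names; the conversion and the function name phenotype_rows are hypothetical.

```python
def phenotype_rows(commentary):
    """Return one (phenotype name, source db, source id) row per cross-reference."""
    rows = []
    for entry in commentary.get("comment", []):
        for source in entry.get("source", []):
            src = source["src"]
            rows.append((entry["text"], src["db"], src["tag"]["id"]))
    return rows


# Two of the entries from the example above.
example_phenotypes = {
    "type": "comment",
    "heading": "Phenotypes",
    "comment": [
        {
            "type": "comment",
            "text": "Cystic fibrosis",
            "source": [
                {"src": {"db": "MIM", "tag": {"id": 219700}},
                 "anchor": "MIM: 219700"}
            ],
        },
        {
            "type": "comment",
            "text": "Pancreatitis, idiopathic",
            "source": [
                {"src": {"db": "MIM", "tag": {"id": 602421}},
                 "anchor": "MIM: 602421"}
            ],
        },
    ],
}
phenotypes = phenotype_rows(example_phenotypes)
```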
    
    
    PHENOTYPE_ID: [/SET] an ID used for this phenotype. For humans, this is the MIM number.
    SUMMARY: [alphanumeric][optional] a summary description of the gene, its products, its significance, and mutant phenotypes. This is optional text, represented in the ASN.1 as 'summary', after the 'gene' and 'prot' text and before 'location'.
    UNIGENE: [alphanumeric][multiple] UniGene cluster id(s) associated with this gene. UniGene cluster designations are reported as a gene-commentary of type comment and text UniGene within a gene-commentary of type comment and heading Additional Links. The cluster designation is provided both as a db_xref and as an anchor.
    {
          type comment,
          heading "Additional Links",
          comment {
            {
              type comment,
              text "UniGene",
              source {
                {
                  src {
                    db "UniGene",
                    tag str "Hs.411882"
                  },
                  anchor "Hs.411882",
                  url "/UniGene/clust.cgi?ORG=Hs&CID=411882"
                }
              }
            },
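The relationship between the cluster designation and the relative url in this example can be sketched as follows. The pattern (organism prefix as ORG, numeric part as CID) is inferred from this one example only, and the function name unigene_url is hypothetical; it is not a documented NCBI interface.

```python
from urllib.parse import urlencode


def unigene_url(cluster):
    """Build the relative URL for a cluster designation such as 'Hs.411882'."""
    org, separator, cluster_id = cluster.partition(".")
    if not separator or not org or not cluster_id.isdigit():
        raise ValueError(f"unexpected cluster designation: {cluster!r}")
    # Assumption: ORG carries the organism prefix and CID the numeric id.
    return "/UniGene/clust.cgi?" + urlencode({"ORG": org, "CID": cluster_id})


url = unigene_url("Hs.411882")
```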
    
    
    OMIM: [numeric][optional][multiple] MIM number. MIM numbers are reported as a gene-commentary of type comment and text MIM within a gene-commentary of type comment and heading Additional Links. The MIM number is provided both as a db_xref and as an anchor.
    {
          type comment,
          heading "Additional Links",
          ...
          comment {
            {
              type comment,
              text "MIM",
              source {
                {
                  src {
                    db "MIM",
                    tag str "602421"
                  },
                  anchor "602421"
                }
              }
            },
    
    
    CHR: [alphanumeric][optional][multiple] the chromosome assignment. Chromosome is represented according to the NCBI-BioSource standard, namely as source->subtype.
      
    source {
        genome genomic,
        origin natural,
        org {
          taxname "Homo sapiens",
          common "human",
        ...
        },
        subtype {
          {
            subtype chromosome,
            name "7"
        ...
    
    MAP: [alphanumeric][optional][multiple] One line, consisting of a repeating set of 3 data elements, each element separated by |. The first element is the location; the second is the source (as a URL when appropriate); and the third is the type of map information (G = genetic, C = cytogenetic). Map data are stored under location, with the units of the location being reported as method map-type.
    An example of a cytogenetic map location.
      location {
        {
          display-str "7q31.2",
          method map-type cyto
        }
      },
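For readers converting older LL_tmpl data, the repeating three-element MAP line described above can be split as in this sketch. The function name parse_map_line and the sample line are invented for illustration. The sketch also assumes that no element contains the | delimiter itself.

```python
# Type codes as defined in the MAP description above.
MAP_TYPES = {"G": "genetic", "C": "cytogenetic"}


def parse_map_line(value):
    """Split a MAP value into a list of location/source/type entries."""
    fields = value.split("|")
    if len(fields) % 3 != 0:
        raise ValueError("MAP value is not a repeating set of 3 elements")
    entries = []
    for start in range(0, len(fields), 3):
        location, source, code = fields[start:start + 3]
        entries.append({
            "location": location,
            "source": source,
            # Unknown codes are passed through unchanged.
            "type": MAP_TYPES.get(code, code),
        })
    return entries


# Hypothetical sample line; the source label "RefSeq" is made up.
entries = parse_map_line("7q31.2|RefSeq|C")
```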
    
    STS: set of STS markers [SET][alphanumeric][optional][multiple] multiline set, one marker per line: marker name|chromosome|sts_id|D segment|seq_known|evidence[/SET]. Evidence types are currently either epcr or PubMed id(s). Markers are reported as gene-commentaries of type comment under the heading Sequence Tagged Site (Markers). The UniSTS id is the value of tag id, the preferred name is anchor, and evidence is post-text. The function of reporting the chromosome to which the marker has been mapped is not retained. The function of enumerating all marker aliases has, however, been added.
     {
          type comment,
          heading "Sequence Tagged Site (Markers)",
          comment {
            {
              type comment,
              source {
                {
                  src {
                    db "UniSTS",
                    tag id 12967
                  },
                  anchor "D7S2742",
                  post-text "(e-PCR)"
                }
              },
              comment {
                {
                  type other,
                  label "Alternate name",
                  text "G00-674-897"
                },
                {
                  type other,
                  label "Alternate name",
                  text "G11318"
                },
                {
                  type other,
                  label "Alternate name",
                  text "G13271"
                },
         ...
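The mapping of UniSTS id, preferred name, evidence, and aliases described above can be sketched as a small extraction routine. It assumes one marker entry has been converted into a nested Python dictionary keyed by the ASN.1 field names; that representation and the function name marker_summary are hypothetical.

```python
def marker_summary(entry):
    """Collect id, preferred name, evidence, and aliases for one marker."""
    source = entry["source"][0]
    aliases = [
        note["text"]
        for note in entry.get("comment", [])
        if note.get("label") == "Alternate name"
    ]
    return {
        "unists_id": source["src"]["tag"]["id"],  # value of tag id
        "name": source["anchor"],                 # preferred name
        "evidence": source.get("post-text"),      # e.g. "(e-PCR)"
        "aliases": aliases,
    }


# Values copied from the example above (first two aliases only).
example_marker = {
    "type": "comment",
    "source": [
        {"src": {"db": "UniSTS", "tag": {"id": 12967}},
         "anchor": "D7S2742",
         "post-text": "(e-PCR)"}
    ],
    "comment": [
        {"type": "other", "label": "Alternate name", "text": "G00-674-897"},
        {"type": "other", "label": "Alternate name", "text": "G11318"},
    ],
}
marker = marker_summary(example_marker)
```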
    
    COMP: set of comparative map links [alphanumeric][optional][multiple] c_tax_id|c_symbol|c_chromosome|c_position|c_locus_id| q_chromosome|symbol of the current gene|map_name[/SET] the tax_id of the homolog, the symbol of the homolog, the homologous chromosome, the homologous position, the locus_id of the homolog, the chromosome of the source record, the map name. This function is not likely to be retained.
    ECNUM: [alphanumeric][optional][multiple] Enzyme Commission numbers (EC) are reported as gene-commentaries of type property and label EC. The EC number is reporte