Entrez Gene is one of several gene-centered resources at NCBI. Others include the Gene Expression Nervous System Atlas (GENSAT), the Gene Expression Omnibus (GEO), HomoloGene, Online Mendelian Inheritance in Man (OMIM), and UniGene. The taxonomic scope of these resources differs. For example, UniGene has clustered transcript information for some species that Entrez Gene does not, and Entrez Gene has records not cross-referenced in UniGene. Entrez Gene is solely responsible for providing the unique GeneID that is used to identify information for genes and other types of loci.
On a regular basis, model organism databases and other contributing groups are checked for novel information. If the record already exists in Entrez Gene, new information is added and outdated information is corrected. Otherwise, a new record is created.
Entrez Gene can be considered curated because many of the contributing databases are curated. Additionally, records in Entrez Gene may be reviewed by NCBI staff. However, Entrez Gene does not always attempt to reconcile genes defined by various annotation pipelines that may differ in levels of curatorial review and rules about what constitutes a gene.
Entrez Gene serves as a hub of information for databases both within and external to NCBI. Records are processed either gene-by-gene or as part of the submission of an annotated genome or chromosome. Entrez Gene identifiers, and associated names and sequence accessions, provide a common frame of reference for many databases.
For some genomes (e.g. human, mouse, rat, chicken, dog), Entrez Gene records are updated continuously. For other genomes, updates to Entrez Gene depend on the re-submission of genomic sequence annotation from an external group.
Entrez Gene includes records for confirmed genes and for genes predicted by annotation processes. The evidence for a gene can be inferred from the status of the RefSeq that defines it (information on status definitions can be found at http://www.ncbi.nlm.nih.gov/RefSeq/key.html#status). For example, RefSeqs with a status of predicted or model have less supporting evidence than those in the validated, provisional, or reviewed categories. However, new sequence information is submitted to the public databases daily, and the status of a gene may not reflect current knowledge. New information on related sequences can be checked from Entrez Gene through the links to Entrez Nucleotide, Entrez Protein, and BLAST Link (BLink).
Entrez Gene does not claim to be comprehensive; rather, it serves as a guide to additional information in other databases. For example, a gene can be represented by multiple sequences, but not all are reported explicitly from Entrez Gene. Instead, connections are supplied from Entrez Gene to Entrez Nucleotide, Entrez Protein, and BLink, where more sequences with significant similarity can be retrieved. In addition to the multiple links to NCBI databases, LinkOuts submitted to Entrez Gene from external databases support ready navigation to more gene-specific information. The central functions of Entrez Gene are to establish unique identifiers for genes that can be tracked and, in so doing, support accurate connections with the defining sequences, nomenclature and other descriptors. With this infrastructure, it is possible to:
support the NCBI annotation pipeline based on placement of sequences with known GeneIDs
provide a species-independent frame of reference for genes and all their attributes
support identification of the genes represented by sequences in the public databases
Records are added to Entrez Gene if any of the following conditions is met:
A RefSeq is created for a genome that has been completely sequenced and that record contains annotated genes. In the case of RNA viruses with polyprotein precursors, annotated proteins are treated as equivalent to a gene.
A recognized, genome-specific database provides information about genes (preferably with defining sequence), mapped phenotypes, or sequences that are treated as markers for incompletely characterized genes (e.g. expressed sequence tags and gene traps).
The NCBI annotation pipeline identifies potential genes (models).
A sequence submitted to public databases defines a new gene. For some genomes, the processing in Entrez Gene depends on UniGene's clustering process to identify a single representative sequence.
The minimum set of data necessary for a record in Entrez Gene, therefore, is a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any of sequence information, map information, or nomenclature from a recognized authority.
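The minimum-data rule can be written down as a small check. The following Python sketch is only an illustration of the rule as stated above; the class name, field names, and the sample values are hypothetical and do not reflect the actual Entrez Gene data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneRecord:
    """Hypothetical, simplified stand-in for an Entrez Gene record."""
    gene_id: int                      # unique identifier assigned by NCBI
    symbol: str                       # preferred symbol
    sequence_accessions: List[str] = field(default_factory=list)
    map_location: Optional[str] = None
    nomenclature_authority: Optional[str] = None


def meets_minimum(record: GeneRecord) -> bool:
    """True if the record has a GeneID, a symbol, and at least one of
    sequence information, map information, or authority nomenclature."""
    has_identity = record.gene_id > 0 and bool(record.symbol)
    has_support = bool(
        record.sequence_accessions
        or record.map_location
        or record.nomenclature_authority
    )
    return has_identity and has_support


# Made-up values, used only to exercise the check.
mapped_only = GeneRecord(gene_id=12345, symbol="LOC12345", map_location="1q21")
bare = GeneRecord(gene_id=12346, symbol="LOC12346")
print(meets_minimum(mapped_only), meets_minimum(bare))
```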
Existing records are updated when new information is received. The staff of Entrez Gene collaborates with curators of organism-specific databases, nomenclature authorities, international annotation groups, other groups in NCBI, and other valued contributors to resolve discrepancies and improve the data. When a record is updated, its modification date changes. For some genomes, this may occur when the genome is re-annotated and converted into an updated RefSeq. For others, it may occur when any information attached to a gene record is altered. Other changes include adding, updating, or deleting sequence information, GeneRIFs, nomenclature, publications, and key identifiers such as numbers assigned to records in Mendelian Inheritance in Man (MIM numbers) and IDs from model organism databases.
From time to time it is necessary to combine Entrez Gene records or suppress ones created in error. Current or previous records can be retrieved from Entrez Gene by the GeneID. When a secondary GeneID has been replaced with another, a URL to the current record is provided.
Much of the power of querying Entrez Gene comes from mining its connections with other databases. Changes in these relationships are not captured in the modification date on the Entrez Gene record. For example, if information about new single nucleotide polymorphisms (SNPs) in a gene is submitted to the Single Nucleotide Polymorphism database and this information is now connected to Entrez Gene, that change is not reflected in the modification date of the record in Entrez Gene. In other words, a query to Entrez Gene based on records that have connections to dbSNP (using filters, as described below in “How to query Entrez Gene”) will return a different set of records, although there is no change in the modification date in any of the Entrez Gene records.
Databases external to NCBI's Entrez system can submit and update links at any time. Users logged into My NCBI may elect to display any LinkOut with a standard icon. Changes in these connections will not be reflected in the modification date on the record in Entrez Gene.
Note: Database providers are encouraged to review the documentation about supplying LinkOuts (for more information see http://www.ncbi.nlm.nih.gov/entrez/linkout/doc/nonbiblinkout.html). This is a powerful method to attract users of Entrez Gene to your own database.
As with all databases accessed via Entrez, records can be retrieved from Entrez Gene based on:
information anywhere in the record
information in specified fields (Box 1)
information on properties of the record (Box 2)
the relationship of any record to other records in the Entrez system or on providers of external links (filters, Box 3)
Queries can be as simple as a single word or as complex as a combination of terms qualified by boolean operators using field restriction, properties, and filters. Several functions standard to Entrez are available to help users query Entrez Gene efficiently. Descriptions of these functions are below:
Limits supports restricting results by combinations of species, by a value in one field, and by the modification date on the record.
Preview/Index provides a comprehensive list of fields, filters, and properties currently used by Entrez Gene. It also reports the number of occurrences and values stored in each field, filter, and property, and it allows you to combine any term by boolean operators with existing queries. This is a key interface to test robust query strategies.
History offers a review of recent queries and menus that can be used to combine these queries to selected sets of interest.
Clipboard holds records of interest for up to 8 hours.
Details shows how a query was processed. A query can then be refined and resubmitted.
My NCBI allows users to save searches, customize filters, and schedule document delivery.
Entrez Utilities allows users to retrieve records in other programs based on the same queries used interactively.
More details on using these functions are in the Entrez help document and FAQ pages.
To clarify these standards, consider the following examples:
Example 1: Find human and mouse genes not annotated on the genome but having reviewed RefSeq records. First, you have to know that if a gene is annotated on the most recent genomic annotation, the filter “gene nucleotide pos” is set. Then you need to restrict your query by species and by the type of RefSeq.
If you typed this interactively, the query would be:
(Human[organism] OR mouse[organism]) AND "srcdb refseq reviewed"[Properties] NOT "gene nucleotide pos"[Filter]
A much simpler approach is to use Limits to set the species, Preview/Index to find the appropriate property (reviewed RefSeqs, a characteristic of multiple Gene records), and a filter to find those not annotated on a genome (based on lack of links to contig or chromosome-based RefSeqs).
The steps you might follow are:
Click on Limits and check both human and mouse in the mammals section.
Click on Preview/Index, select properties, click on Index, scroll until you see “srcdb refseq reviewed”, select it, and click on AND.
Still in Preview/Index, select filters, click on Index, scroll until you see “gene nucleotide pos”, select it, and click on NOT.
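If you assemble queries in a script rather than interactively, the same three pieces (species, property, filter) can be combined as strings. The Python sketch below is one possible way to do so; the helper names are hypothetical, and the field, property, and filter values are simply those quoted in Example 1.

```python
def any_of(terms, field):
    """OR together several values of one field, e.g. two organisms."""
    joined = " OR ".join(f"{term}[{field}]" for term in terms)
    # Parenthesize only when there is more than one alternative.
    return f"({joined})" if len(terms) > 1 else joined


def quoted(term, field):
    """Quote a multi-word value and restrict it to one field."""
    return f'"{term}"[{field}]'


def example_1_query():
    species = any_of(["Human", "mouse"], "organism")
    reviewed = quoted("srcdb refseq reviewed", "Properties")
    annotated = quoted("gene nucleotide pos", "Filter")
    # Reviewed RefSeqs in either species, minus genes placed on the genome.
    return f"{species} AND {reviewed} NOT {annotated}"


print(example_1_query())
```

The printed string matches the query typed interactively in Example 1, so either route should retrieve the same set of records.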
Example 2: Find all Gene records from fungi that have expression data in UniGene or GEO.
If you typed this interactively, the query would be:
fungi[organism] AND ( "gene unigene"[filter] OR "gene geo"[filter])
A much simpler approach is to use Limits to set the taxonomic group and Preview/Index to find the appropriate filters and combine them correctly.
The steps you might follow are:
Click on Limits and check fungi.
Click on Preview/Index, select filters, click on Index, scroll until you see “gene unigene”, select it, and click on AND.
Still in Preview/Index, select filters, click on Index, scroll until you see “gene geo”, select it, and click on OR, and then click on GO.
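The same query can also be sent through Entrez Utilities from another program. The Python sketch below only builds the request URL for the ESearch utility and does not contact NCBI. The base URL and the db, term, and retmax parameters belong to E-utilities and are not described in this chapter, so check them against the E-utilities documentation; the filter names are those quoted in Example 2 and may not match the current index.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# ESearch endpoint of the NCBI E-utilities (not taken from this chapter).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="gene", retmax=20):
    """Return an ESearch URL for `term`; urlencode escapes quotes,
    brackets, and spaces in the query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


EXAMPLE_2 = 'fungi[organism] AND ("gene unigene"[filter] OR "gene geo"[filter])'
url = esearch_url(EXAMPLE_2)
print(url)

# Decoding the URL recovers the query exactly as typed interactively.
print(parse_qs(urlparse(url).query)["term"][0])
```

Keeping URL construction separate from the network call makes the query easy to inspect, in the same spirit as the Details function, before anything is submitted.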
More sample queries are provided in the Entrez Gene help documents.
Entrez Gene provides several displays differing in content and format to help you find and report the information you want. There are two default displays: the summary HTML page returned in response to a query, and the complete (Graphic) HTML display returned after a single record is selected. All HTML displays include the Links function that indicates what other resources contain additional information. Some of these links are based on information managed directly from Entrez Gene. For example, links to Entrez Nucleotide, Entrez Protein, PubMed, and OMIM are based on the sequences, citations, and MIM numbers contained in a record. Other links are managed from databases other than Entrez Gene or from information shared by other databases. For example, links to dbSNP, GENSAT, GEO, HomoloGene, UniGene, and UniSTS are based on shared nucleotide sequence data. Links to CDD are based on shared protein sequence. Links to Map Viewer indicate that information about the position of the gene is available.
Another useful display format is the Gene Table. If a gene has been annotated on any genomic RefSeq, the intron/exon organization of each transcript is summarized. In the case of an mRNA, the translated region of each exon is summarized. Gene Table facilitates access to other gene-related sequences, such as the complete RNA, protein, specific exons, introns, or coding regions. Other display formats include XML and ASN.1; specifications for each can be found in the Entrez Gene help document.
Detailed information about each display format is available in the "Interpreting the Display" section of the Entrez Gene help document.
The content of an Entrez Gene record fits into several sub-categories. Those listed here correspond roughly to what is seen in the default full (Graphic) display.
Entrez Gene uses official symbols and full names and reports the nomenclature authority when available. Otherwise, symbols and names are selected from the defining sequence record. For example, if sequence and positional homology (synteny) suggest that a nameless locus in one species is orthologous to a named gene in another, the symbol from the ortholog may be used. If no symbol is identified, and the genome is processed gene-by-gene rather than as a complete re-annotation, the letters LOC are prepended to the GeneID. Once a meaningful symbol is identified, the contrived "LOC" symbol is removed (because the record will still be searchable and identified by the GeneID itself).
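The fallback to a "LOC" symbol can be sketched in a few lines of Python. This is an illustration only: the function and parameter names are hypothetical, the order of preference between a symbol from the defining sequence record and one borrowed from an ortholog is an assumption, and the sketch applies only to genomes processed gene-by-gene.

```python
from typing import Optional


def choose_symbol(
    gene_id: int,
    official: Optional[str] = None,
    from_sequence: Optional[str] = None,
    from_ortholog: Optional[str] = None,
) -> str:
    """Pick a display symbol, falling back to LOC + GeneID."""
    # Assumed precedence: official symbol, then the defining sequence
    # record, then a symbol borrowed from a named ortholog.
    for candidate in (official, from_sequence, from_ortholog):
        if candidate:
            return candidate
    # No meaningful symbol is known, so build the placeholder.
    return f"LOC{gene_id}"


# Made-up GeneID and symbol, for illustration only.
print(choose_symbol(123456))
print(choose_symbol(123456, from_ortholog="ABC1"))
```

Because the placeholder is derived from the GeneID, replacing it later with a meaningful symbol loses nothing: the record remains retrievable by the GeneID itself.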
In addition to official symbols and full names, Entrez Gene provides others seen in publications and sequence records. These alternative names are not meant to be comprehensive and often are identified only when the RefSeq is being reviewed.
Several NCBI databases use the nomenclature maintained by Entrez Gene. These names are incorporated based primarily on the name-GeneID-sequence relationships that Entrez Gene reports. These data are reported in several files on Entrez Gene's FTP site, including DATA/gene_info.gz and DATA/gene2accession.gz.
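A program that needs these name-GeneID relationships can read the files directly. The Python sketch below assumes that gene_info is tab-delimited, that comment lines start with "#", and that the first three columns are the taxonomy ID, the GeneID, and the symbol; confirm the layout against the README on the FTP site before relying on it. The sample rows are made up.

```python
import gzip
import io


def read_symbols(handle):
    """Map GeneID -> (tax_id, symbol) from an open gene_info text stream."""
    table = {}
    for line in handle:
        # Skip the header/comment line and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        tax_id, gene_id, symbol = cols[0], cols[1], cols[2]
        table[int(gene_id)] = (int(tax_id), symbol)
    return table


def read_symbols_from_gz(path):
    """Same as read_symbols, for a downloaded file such as gene_info.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return read_symbols(handle)


# Made-up rows in the assumed layout, so the sketch runs without a download.
sample = "#tax_id\tGeneID\tSymbol\n9606\t111\tABC1\n10090\t222\tLOC222\n"
symbols = read_symbols(io.StringIO(sample))
print(symbols)
```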
Some of the components of the Gene record describe key characteristics of the gene, its function, and its products. The Summary, written by RefSeq staff and/or by external contributors such as OMIM or the Rat Genome Database (RGD), provides a quick synopsis of what is known about the gene, the function of its encoded protein or RNA products, disease associations, spatial and temporal distribution, and so on. The gene type is assigned from a list of options defined in the Entrez Gene data model.
The value of RefSeqStatus indicates the maximum level of review that has been provided to the set of gene-specific accessions.
Several types of map information may be included in an Entrez Gene record. One type is the description of location in units commonly used for a given genome. Genetic and physical map positions are incorporated from the published maps used in Map Viewer. Rather than report all position data for any gene in any coordinate system, this information can be obtained through links to Map Viewer. Information can also be accessed through marker names, which are linked to the UniSTS record.
When no independent map data are available and the gene has been placed on a genomic assembly, map position may be inferred by a calculated correspondence between sequence and other map units, such as cytogenetic bands. One example is the calculation of cytogenetic position according to the algorithm developed by Furey and Haussler (7). With each re-assembly of a genome, genes might be moved to other chromosomes with which better alignments are identified. If marker and other data are consistent with but distinct from the published map location, then the Entrez Gene record is modified to be consistent with current information.
Markers are reported in Entrez Gene either as a gene or as a marker that has a calculated or curated relationship with a gene. Entrez Gene does not store all of the markers available for a genome; that is the function of UniSTS. The marker data in Entrez Gene come from any of the following: a report from a genome-specific database; calculations based on e-PCR that indicates that an mRNA is associated with the gene; and e-PCR based localization on the genome within a region beginning 2 kb upstream of the gene and ending 0.5 kb downstream. In queries initiated from Entrez Gene, genes that have PCR-based markers can be identified by the query "gene unists"[filter].
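The positional rule for e-PCR markers, from 2 kb upstream of the gene to 0.5 kb downstream, amounts to simple interval arithmetic once the strand is taken into account. The Python sketch below is an illustration under stated assumptions: coordinates are 1-based and inclusive, gene_start is less than or equal to gene_end regardless of strand, and the function names are hypothetical rather than part of any NCBI tool.

```python
UPSTREAM_BP = 2000    # 2 kb upstream of the gene
DOWNSTREAM_BP = 500   # 0.5 kb downstream of the gene


def marker_window(gene_start, gene_end, strand):
    """Return (low, high) genomic bounds of the marker search region."""
    if strand == "+":
        low, high = gene_start - UPSTREAM_BP, gene_end + DOWNSTREAM_BP
    elif strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        low, high = gene_start - DOWNSTREAM_BP, gene_end + UPSTREAM_BP
    else:
        raise ValueError("strand must be '+' or '-'")
    # Do not extend past the start of the sequence.
    return max(1, low), high


def marker_in_window(marker_pos, gene_start, gene_end, strand):
    """True if a marker position falls inside the search region."""
    low, high = marker_window(gene_start, gene_end, strand)
    return low <= marker_pos <= high


# Made-up coordinates: the same marker qualifies on one strand only.
print(marker_window(10000, 20000, "+"))
print(marker_window(10000, 20000, "-"))
print(marker_in_window(8500, 10000, 20000, "+"))
print(marker_in_window(8500, 10000, 20000, "-"))
```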
When a gene has been annotated on a genomic RefSeq, map information is also presented by the graphic display of neighboring genes. An arrow indicates the direction of transcription. If the name of a gene is too long to be used as a label, truncation is indicated by an ellipsis (...). The gene specific to the displayed record is highlighted. The arrows and labels anchor links to the records for those genes, supporting quick navigation. If a gene is annotated on more than one genomic RefSeq, only one is used for the graphic display. The location data for each RefSeq are provided in the ASN.1 of the full Entrez Gene record.
Map data are also supported by named links to Map Viewer in the Links menu. Because links are provided by the Map Viewer database, changes in these links are not reflected in the modification date on the record. For genomes where comparative maps are available in Map Viewer, links to Map Viewer are also provided for those views.
Sequence information is presented in multiple forms in Gene:
graphical displays of the intron/exon organization of splice variants
reports of intron/exon organization of each variant in the Gene Table display
reports of accessions from DDBJ, EMBL, GenBank and Swiss-Prot
links to sequence, in standard formats, for the genomic region of the gene, individual introns or exons, and the transcripts (Gene Table display; see http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genehelp.html#display_table)
links to related records via the Conserved Domain database
links to the BLink viewer of protein neighbors
Sequence information (accessions and links) is distributed throughout the Entrez Gene record. For example, the Transcripts and Products diagram is provided when a gene has been annotated on a genomic RefSeq, in other words when the intron/exon/coding region information is available in genomic coordinates. Each position of a gene product, when represented by a RefSeq RNA and/or protein, is provided relative to the genomic DNA. Each RefSeq Accession number (genomic, mRNA and protein) anchors a link to different formats of the sequence in Entrez Nucleotide or Entrez Protein (the link can be found over the diagram). The link from the Accession number for the genomic sequence displays only the gene-specific region. The anchor on the protein accessions also facilitates retrieval of specific BLink, CDD, or COG displays.
The NCBI Reference Sequences (RefSeqs) section lists nucleotide and protein accessions that are related to the gene and provides links to the appropriate sequence record in Entrez Nucleotide or Entrez Protein. Conserved domains are reported by name, location on the sequence, and the BLAST score substantiating the assignment.
“Related sequences” lists nucleotide and protein accessions that are related to the gene and provides links to the appropriate sequence record in Entrez Nucleotide or Protein. If the protein sequence record is not part of a set of a nucleotide record and the protein it encodes, the word 'none' is printed in the nucleotide column. The type of nucleotide record is printed before the nucleotide accession, and the strain is printed after the protein accession, as applicable.
Gene uses several approaches to describe the function of a gene and its encoded products. These include:
explicit descriptive statements (RefSeq Summary and GeneRIF)
names of genes, products, and pathways
associated ontologies (GO)
reports of interactions
Enzyme Commission (EC) numbers
inferences from domain content
descriptions of diseases or allele-specific phenotypes
Many of these categories include links to additional information in other databases. Links to the data sources are provided. We appreciate the cooperation of the resources that have made their data freely available.
Except for indicating the availability of comparative maps (limited at the time of this writing to Entrez Gene records from human, mouse, and rat), Entrez Gene provides information about homology only by displaying links to HomoloGene and/or COG. It also provides links to resources that display pre-computed sequence relationships such as BLink.
The qualitative assessment of whether a gene is expressed is captured in the Gene type and in the types of sequence accessions associated with the Gene record. The quantitative and spatio-temporal aspects of expression are stored in other databases, including GEO, GENSAT, and UniGene at NCBI.
Entrez Gene provides information about other sites of interest both within a record and via the LinkOut mechanism. As more data providers submit their LinkOuts to Entrez Gene, the second method will be increasingly powerful. Users can take advantage of the LinkOut connections, and other filters, by registering for My NCBI and customizing the display.