
Entrez Gene Frequently Asked Questions

  1. For General Users
  2. For Programmers and Database Developers
  3. Archives (e.g. LocusLink transition)
*General Questions
  1. I am trying to search for a gene with the symbol zgc:56666, but I get only some odd message about range operations. What do I do?
  2. Nomenclature. How and when are gene symbols and names assigned?
  3. How can I obtain the genomic sequence for a gene?
  4. Notification of changes in Gene.
  5. Differing Representations of RefSeqs
  6. Entrez Gene and OMIM
  7. How does Gene maintain certain types of information?
Queries with words containing : (colon)

The Entrez query interface uses the colon (:) to indicate a range search. In Nucleotide, for example, a range finds sequences within specified length boundaries: 1:5000[sequence length] finds records no more than 5 kb in length. When a word contains an embedded colon, the query interpreter therefore tries to process it as a range. To bypass this problem, surround the word with double quotes.

Typing zgc:56666 in the query box results in the message "Range operation not supported".
Typing "zgc:56666" in the query box retrieves any record that contains that explicit series of characters.
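If you build query strings in a script, the same quoting rule can be applied before the query is sent. The following is a minimal sketch in Python; the function name quote_if_colon is hypothetical and is not part of any NCBI tool.

```python
def quote_if_colon(word: str) -> str:
    """Return the word wrapped in double quotes when it contains a colon.

    Words that are already quoted are returned unchanged.
    """
    already_quoted = word.startswith('"') and word.endswith('"')
    if ":" in word and not already_quoted:
        return f'"{word}"'
    return word


print(quote_if_colon("zgc:56666"))  # prints "zgc:56666" (with the quotes)
print(quote_if_colon("BRCA1"))      # prints BRCA1
```

Apply it only to words such as symbols, not to intended range searches like 1:5000[sequence length].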

Nomenclature

Sources

The names (symbols) and full descriptions used in Gene come from 4 major sources:

  1. Species-specific nomenclature committees, with great appreciation, as enumerated here. Note also
  2. The gene name (symbol) and protein names provided in submissions used as sources for RefSeq records.
  3. Symbols and full descriptions submitted by contributors of information about loci not defined by sequence.
  4. NCBI's annotation pipeline

Updates and access

Entrez Gene attempts to maintain current nomenclature, but updates to names in Entrez Gene are not propagated immediately to all other NCBI resources. You may notice, for example, that symbols in genomic RefSeq annotation, Map Viewer, HomoloGene or UniGene, and their respective ftp sites, are not the same as those you see in Entrez Gene. RefSeq, for instance, does not resubmit the full annotation of a reference contig accession to the nucleotide database each time a symbol changes. The symbols seen in Map Viewer and in RefSeqs for contigs and chromosomes, however, should agree with each other, because all are updated only with each major re-annotation of a genome.

It may help to remember that the Entrez Gene GeneID is unique across all taxa. You can therefore convert any GeneID into its current names by using the definitions provided in the file ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz. For example, if you transferred the gene_info.gz file to a Unix file system, the command

gzcat gene_info.gz | cut -f2,3,5,9,13

will give you

  1. the GeneID
  2. the current official symbol or database identifier if no official symbol is available
  3. a pipe-delimited set of aliases
  4. the full name
  5. the nomenclature status of the name, where

If a GeneID is no longer current, it will not be reported in the file gene_info.gz. The file gene_history.gz in the same ftp directory can be used to determine if there is a replacement GeneID, for which the current names can then be determined as above.
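The same extraction can be done in a script. The sketch below is a minimal Python version of the cut command above, using the same five columns (2, 3, 5, 9 and 13). It assumes the file is tab-delimited (the default delimiter for cut), that header lines start with '#', and that '-' marks an empty value; the function name load_gene_names is hypothetical.

```python
import gzip

# 1-based columns 2, 3, 5, 9, 13 from the cut command, as 0-based indexes.
COLUMNS = {"gene_id": 1, "symbol": 2, "aliases": 4, "full_name": 8, "status": 12}


def load_gene_names(path):
    """Map each GeneID in a gene_info.gz file to its current names."""
    names = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):  # assumed header/comment marker
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= COLUMNS["status"]:
                continue  # skip short or malformed lines
            record = {key: fields[idx] for key, idx in COLUMNS.items()}
            # Aliases are pipe-delimited; "-" as an empty marker is an assumption.
            aliases = record["aliases"]
            record["aliases"] = [] if aliases in ("", "-") else aliases.split("|")
            names[record["gene_id"]] = record
    return names
```

A GeneID that is missing from the returned mapping is no longer current, and gene_history.gz is then the place to look for a replacement.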

Symbols beginning with LOC

When a published symbol is not available, and orthologs have not yet been determined, Entrez Gene provides a symbol that is constructed as 'LOC' + the GeneID. This symbol is not retained once a replacement symbol has been identified.

Names beginning with 'similar to'

When NCBI automatically annotates a genome, it predicts both mRNAs and the proteins they encode. The protein sequences are compared to public protein sequence records from several model organisms. If a significant match is found, and the name is informative, then the automatic annotation process constructs the name of the model by combining 'similar to ' and the name of the matching protein. Because the sequences represented by NCBI's predictions are provided in accessions beginning with XM_, XP_ or XR_, you might assume that all accessions with that format have names beginning with 'similar to '. This is not necessarily the case, because NCBI also generates XM_, XP_ or XR_ accessions for genes that were identified outside of the annotation pipeline but for which curated (NM_, NP_ or NR_) accessions are not available. These genes, and the RefSeq accessions that represent them, will not have names beginning with 'similar to '.

NOTE: To the greatest extent possible, each protein-coding gene in mitochondria has been assigned the same name (symbol) and full description. In some instances, this is at variance with the symbol assigned by species-specific nomenclature committees. In those cases, the species-specific nomenclature is provided as alternates.

Obtaining genomic sequence

From Entrez Gene's gene diagram (note: position values are one-offset)
When a gene is annotated on a RefSeq for a chromosome or a genomic contig, the RefSeq accession over the diagram anchors a link to Entrez Nucleotide for the range corresponding exactly to the gene feature on the sequence record. Steps like these will therefore allow you to extract and save the genomic sequence of interest.

  1. Click on the Nucleotide accession centered over the gene diagram in the Transcripts and products section of Entrez Gene Graphic (full) display.
  2. Select either FASTA or GENBANK
  3. Adjust the Range: from and to values if you want to capture upstream or downstream sequence, and click on Display.
  4. If you selected the FASTA option, save your sequence by copying and pasting or by using the Send all to file option.
  5. If you selected the GENBANK option, confirm you have adjusted the sequence correctly by looking at the position of the gene feature in the range you have defined. When it is adjusted to your satisfaction, display the record as FASTA and then follow step 4 above.

From Entrez Gene's Gene Table (note: position values are one-offset)
The Transcripts and products display in the Gene Table view supports the same functionality listed above. It also allows you to click on any of the sub-features (introns, exons, CDS portions of an exon) corresponding to each annotated transcript to extract that genomic subsequence in FASTA format. You can display the subsequence in GenBank format if you want to see other annotation in the region.

From Map Viewer
From Entrez Gene, you can navigate to Map Viewer to use the download functions there.

  1. Select Map Viewer from the Links menu at the upper right of the Entrez Gene record.
  2. Click on Download/View Sequence/Evidence in the upper right of Map Viewer display, or click on dl in the label for the gene.
  3. Adjust the range and strand if you like and press enter or Change Region/Strand.
  4. Select a format (FASTA is the default).
  5. Save

From Entrez Nucleotide (note: position values are one-offset)
Within Entrez Nucleotide, feature names anchor URLs. Clicking on 'gene' results in a display (in GenBank format) of that subsequence. To save the sequence, change the display format to FASTA and save as described above.

Notification of changes in Entrez Gene

Gene maintains a list serve (gene-announce) that is used to notify subscribers of current or future changes in Entrez Gene and any of its reports. If this is of interest to you, please subscribe.

Differing representations of RefSeqs

Display of RefSeqs in Transcripts and Products vs. in the Reference Sequences (RefSeq) section

The diagram of the placement of RefSeq transcripts in the Transcripts and Products Section is based on the annotation of the positions of exons and coding sequences in the current genomic RefSeq. For some genomes, the genomic RefSeqs are updated independently of the annotated product RNAs, with the latter being updated more frequently. This means that several kinds of discrepancies between the diagram and the current RefSeq RNAs may result.


The Gene Table display vs. Entrez Nucleotide.

RefSeq RNA records are often based on cDNA sequences submitted to GenBank. They can therefore differ from the reference genomic sequence, either for biological reasons (variation or RNA editing) or because of some unresolved sequence discrepancy. The report of intron/exon organization in the Gene Table display is based on the placement of exons and CDS on the genomic sequence. If the independently determined RefSeq mRNA cannot be aligned perfectly to the genome, the lengths given in the Gene Table display may differ from those of the mRNA sequence itself. As discussed in the section above, it is also possible that the sequence of the RefSeq RNA was updated after it was aligned to, and used to annotate, the reference sequence. This too might result in discrepancies between the annotation on the genomic sequence and the current RefSeq RNA.

Representation of nucleotide positions

NCBI uses two conventions to represent the position of features in a sequence: zero-offset and one-offset.

The names are self-explanatory. In the sequence AAAATGCCC, the position of the start codon ATG is 3 in zero-offset and 4 in one-offset. If you find a difference in position information that is 'off-by-one', please review the conventions used in each file.

The zero-offset convention is used in the ASN.1 representation of sequence databases. The ASN.1 of Entrez Gene, and the derivative tab-delimited files gene2refseq.gz and gene2accession.gz in the DATA subdirectory of Gene's ftp site, also use the zero-offset convention.

Reports designed for browsing use the convention of one-offset. Thus the position data seen in default HTML views of Entrez Gene (and Nucleotide) are always one greater than that reported in the ASN.1 display.

NOTE: The files in the Map Viewer subdirectories in the genomes path that give position information for genes (seq_gene.md.gz) and other features are one-based (one-offset). Please be aware of this when processing these files.
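The two conventions differ by exactly one, so converting between them is a single addition or subtraction. A minimal Python sketch using the AAAATGCCC example from above; the helper names are hypothetical.

```python
def to_one_offset(zero_offset_position: int) -> int:
    """Convert a zero-offset position (ASN.1, gene2refseq.gz) to one-offset."""
    return zero_offset_position + 1


def to_zero_offset(one_offset_position: int) -> int:
    """Convert a one-offset position (HTML views, seq_gene.md.gz) to zero-offset."""
    return one_offset_position - 1


sequence = "AAAATGCCC"
start_zero = sequence.find("ATG")      # Python string indexes are zero-offset
start_one = to_one_offset(start_zero)
print(start_zero, start_one)           # prints: 3 4
```

When combining files from different sources, it is usually safest to convert everything to one convention at load time.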

Entrez Gene and OMIM

Entrez Gene integrates information from OMIM, and creates links to OMIM, at two levels:

  1. the gene
  2. associated disorders or phenotypes

Links provided from the Links menu in the upper right-hand part of the Gene record are based on both types of MIM numbers. Within the body of the record, the MIM number associated with the gene is reported in the Additional links section; any MIM number associated with a disease is reported in the Phenotypes section, along with the name of the disease. Symbols used by OMIM for genes and diseases are intermingled in Gene's Gene aliases section.

The gene_info.gz file provided from the Gene ftp site includes the MIM number associated with the gene. If that gene is associated with Mendelian disorders that have a different MIM number, that MIM number will not be provided in gene_info.gz.

All MIM numbers associated with Entrez Gene records are reported in the ftp file mim2gene. The value in the third column in that file indicates whether the MIM number is only for the disease ('phenotype') or for the gene ('gene').
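A short script can separate gene-level from phenotype-level MIM numbers. The sketch below is in Python and rests on an assumed layout: only the third column is described above, so treating column one as the MIM number and column two as the GeneID is an assumption, as are the tab delimiter and the '#' header marker. The function name read_mim2gene is hypothetical.

```python
from collections import defaultdict


def read_mim2gene(lines):
    """Group MIM numbers by GeneID, split by the type given in column three.

    Assumed layout: MIM number, GeneID, type ('gene' or 'phenotype'),
    tab-delimited. Only the meaning of column three comes from the text.
    """
    by_gene = defaultdict(lambda: {"gene": [], "phenotype": []})
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 3:
            continue  # skip header and malformed lines
        mim_number, gene_id, mim_type = fields[0], fields[1], fields[2]
        if mim_type in ("gene", "phenotype"):
            by_gene[gene_id][mim_type].append(mim_number)
    return dict(by_gene)
```

Usage would look like `with open("mim2gene") as handle: table = read_mim2gene(handle)`, after which `table[gene_id]["phenotype"]` lists the disorder MIM numbers that gene_info.gz does not carry.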

How Entrez Gene maintains certain types of information

Conserved Domains

As sequence records are added to or updated in the Protein database, they are compared to records in the Conserved Domain Database (CDD) to identify likely domain content. The results of these analyses for RefSeq proteins are indexed for retrieval in Gene, are displayed when a Gene record is retrieved from Entrez, and are integrated into the ASN.1 that is provided for ftp transfer. The sequence of events is therefore:

Thus it may require a few days for a new RefSeq accession to display domain information in Gene.

To extract domain information directly for any protein sequence, consider using E-utilities. The URL to fetch domain data based on a protein gi follows the pattern:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=gnl|ANNOT:CDD|[put the gi here]&retmode=xml

Example URL for efetch for CDD:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=gnl|ANNOT:CDD|6978425&retmode=xml
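To generate such URLs for many proteins, the pattern can be wrapped in a small function. This is a minimal Python sketch that only builds the URL string and does not contact the server; the function name cdd_efetch_url is hypothetical.

```python
from urllib.parse import urlencode

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def cdd_efetch_url(gi: int) -> str:
    """Build the EFetch URL for the CDD annotation of one protein gi."""
    params = {"db": "protein", "id": f"gnl|ANNOT:CDD|{gi}", "retmode": "xml"}
    # safe="|:" leaves the pipe and colon unescaped, matching the pattern above.
    return f"{EFETCH}?{urlencode(params, safe='|:')}"


print(cdd_efetch_url(6978425))
```

The printed URL is the same as the example URL given above.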


GeneRIFs

GeneRIFs are established by two primary methods.

In the former case, the records are updated weekly. In the latter case, RefSeq staff reviews the submission before release, and contacts the submitter if questions arise. These data should be public within a week.


GO terms

NCBI reports GO terms appropriate for a GeneID by integrating information from the following sources:

For all genomes but human, a species-specific gene identifier (FBgn id, MGI id, RGD ID) is converted to the GeneID. For human, the connection is made from common protein accessions. Most current gaps in the human set, therefore, result from lags in matching protein accessions to GeneIDs. According to Gene's current data flow, any association of a protein accession with more than one gene record must be reviewed by a curator. Such multiple associations can be frequent in gene families where multiple genes encode the same protein sequence.

Entrez Gene currently reports, and uses for indexed queries, only the explicit GO term or terms assigned to any gene. It does not support querying at any node of the GO graph, nor retrieving all genes that match terms at more specific nodes based on a query at a higher node.


Interactions

Entrez Gene represents interaction data as pairs. Entrez Gene staff does not curate these data, but does validate identifiers supplied with the source files.

*For Programmers and Database Developers

  1. How to connect your database to Entrez Gene--Using LinkOut
  2. How to construct URLs to connect to Gene
  3. Relationship of LocusID to GeneID
  4. The Gene ftp site
  5. Gene-related ftp sites
  6. Extracting Gene in XML format
Using LinkOut

Because Gene is an Entrez database, database providers can now use the LinkOut mechanism to direct users of Entrez Gene to related sites providing more information about a particular record. The benefits to data providers are several:

This area of LinkOut's documentation provides instructions geared more to the non-bibliographic data providers.

How to construct URLs to link to Gene

Because Gene is now an Entrez database, URLs can be constructed using standard Entrez methods. Use db=gene to define the database. Information on the gene-specific properties and filters is maintained in the Preview/Index section of the help documentation. Note that the Properties and Filters used by Gene and the counts for each in the current database can be viewed at any time by following these steps:

  1. On any Gene query bar, click on Preview/Index.
  2. In the All fields menu, select Properties or Filter
  3. Click on Index and navigate through the options.

Entrez Gene's home page also provides examples of some URLs that allow common displays.

If you have stored LocusIDs locally, and have used the LocusID to connect to LocusLink, you can use the same integer value to connect to Gene, based on this pattern, where ID is the LocusID/GeneID.

  • /entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=ID
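
If you generate these links from a local table of LocusIDs or GeneIDs, the pattern can be filled in programmatically. A minimal Python sketch follows; it returns only the site-relative path shown above, and the function name gene_record_path is hypothetical.

```python
from urllib.parse import urlencode


def gene_record_path(gene_id: int) -> str:
    """Return the site-relative path that retrieves one Gene record."""
    params = {
        "db": "gene",
        "cmd": "Retrieve",
        "dopt": "Graphics",
        "list_uids": gene_id,
    }
    return "/entrez/query.fcgi?" + urlencode(params)


print(gene_record_path(2))
# prints: /entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=2
```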

    Relationship of LocusID to GeneID

    The method described in the previous section to construct URLs that connect to Gene by using what you may think of as a LocusID, is based on the premise that the GeneID integer is the same as the LocusID seen in LocusLink. We plan to provide the GeneID equal to the LocusID as long as LocusLink continues to be offered. Drosophila melanogaster and Caenorhabditis elegans are the only genomes where exceptions may occur, because the database underlying LocusLink is no longer directly involved in tracking identifiers assigned to genes when annotations are resubmitted for these genomes.

    Gene ftp site

    The Gene ftp site retains the general functions of the LocusLink ftp site, namely a mixture of

    The README file in the gene directory provides more detailed information. See also the Gene-OMIM faq above for more information about MIM numbers provided in gene_info.gz and mim2gene.

    The comprehensive extraction is provided in ASN.1 in the subdirectory ASN. In addition to the comprehensive file All_data.gz, there are subdirectories divided by taxonomic nodes. Each of these subdirectories contains a comprehensive extraction for that node, but may also contain some species-specific files. For example, Mammalia contains these files:

    File name               Content
    All_Mammalia.gz         Gene records for mammalian species, including mitochondria.
    Bos_taurus.gz           Gene records for Bos taurus, including mitochondria.
    Canis_familiaris.gz     Gene records for Canis familiaris, including mitochondria.
    Homo_sapiens.gz         Gene records for Homo sapiens, including mitochondria.
    Mus_musculus.gz         Gene records for Mus musculus, including mitochondria.
    Pan_troglodytes.gz      Gene records for Pan troglodytes, including mitochondria.
    Rattus_norvegicus.gz    Gene records for Rattus norvegicus, including mitochondria.
    Sus_scrofa.gz           Gene records for Sus scrofa, including mitochondria.

    GeneRIFs
    The portion of LL_tmpl that contained GeneRIF data was indicated by the header GRIF and contained the PubMed ID and the text of the GeneRIF. Those data are now available from the GeneRIF ftp site as the file generif_basic.gz.

    Gene-related ftp sites

    There are other ftp sites at NCBI that contain gene-related information. These include:

    1. Map Viewer
      Within a genome-specific directory in the path ftp://ftp.ncbi.nih.gov/genomes/, click on maps, then mapview, then the folder for the current build. In that directory you should find the file seq_gene.md.gz. The gene lines in this file give the ranges for the gene in chromosome (as applicable) and contig coordinates. For example, the command

      gzcat seq_gene.md.gz | egrep "GENE.*reference"

      will extract the 'GENE' lines for the reference assembly.
      • The first line in the file names the columns.
      • chrStart, chrEnd and orientation refer to the chromosome.
      • cnt_start, cnt_stop, cnt_orient refer to the contig
    2. UniGene
    3. UniSTS
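
The Map Viewer extraction in item 1 can also be scripted. The Python sketch below applies the same GENE.*reference pattern as the egrep command and uses the first line of the file to name the columns. It assumes the file is gzip-compressed and tab-delimited and that the header line may begin with '#'; the function name reference_gene_rows is hypothetical.

```python
import gzip
import re

# Same pattern as the egrep command: GENE followed later by reference.
GENE_REFERENCE = re.compile(r"GENE.*reference")


def reference_gene_rows(path):
    """Return the GENE lines for the reference assembly as dictionaries.

    Keys are taken from the first line of the file, which names the columns.
    """
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        header = handle.readline().lstrip("#").rstrip("\n").split("\t")
        for line in handle:
            if GENE_REFERENCE.search(line):
                values = line.rstrip("\n").split("\t")
                rows.append(dict(zip(header, values)))
    return rows
```

Because positions in this file are one-based, subtract one from start positions before comparing them with zero-offset files such as gene2refseq.gz.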

    Extracting Gene in XML format

    If you prefer to use reports formatted in XML rather than ASN.1, you have several options:

    1. E-utilities
    2. gene2xml
    3. Web Entrez

    Try the robust functions provided via E-utilities. A common approach is to combine use of ESearch to obtain a set of GeneIDs of interest, with EFetch which retrieves records by GeneID. The document EFetch for Sequence and other Molecular Biology Databases provides more information about how to set the parameters for extracting information from Entrez databases. It is as simple as:

    Example, showing how you can display on the web what will be retrieved for the GeneID you submit to EFetch:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=xml

    Example, showing how you can display on the web what will be retrieved for a list of GeneIDs submitted to ESummary:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=19,11303,313210,373945,378973,464631&retmode=xml

    A representative Perl script using both ESearch and retrieval from ESummary is provided from the ftp site as taxidToGeneNames.pl. It uses NCBI's Taxonomy database identifier to support species-specific extraction of information incorporated in the Entrez Gene Summary display format.
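
The ESearch-then-ESummary combination can be sketched in Python as well. This is not a port of taxidToGeneNames.pl. It is a minimal illustration that builds the two URLs and reads the GeneIDs out of an ESearch XML response, without making any network request. The function names are hypothetical, and the sample response is a hand-written stub containing only the IdList element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 20) -> str:
    """URL asking ESearch for the GeneIDs that match an Entrez query."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_id_list(esearch_xml: str) -> list:
    """Extract the GeneIDs from the IdList of an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


def esummary_url(gene_ids) -> str:
    """URL asking ESummary for the given GeneIDs, in the form shown above."""
    ids = ",".join(str(gene_id) for gene_id in gene_ids)
    params = {"db": "gene", "id": ids, "retmode": "xml"}
    return f"{EUTILS}/esummary.fcgi?{urlencode(params, safe=',')}"


# Hand-written stub standing in for a real ESearch response.
sample = "<eSearchResult><IdList><Id>19</Id><Id>11303</Id></IdList></eSearchResult>"
print(esummary_url(parse_id_list(sample)))
# prints: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=19,11303&retmode=xml
```

In a real script, the text returned by fetching esearch_url(...) would be passed to parse_id_list in place of the stub.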

    Examples:


  • gene2xml

  • The tool gene2xml, described here, converts the ASN.1 provided in binary set format (in the ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/ directory) into XML. It also converts ASN.1 in the binary format into concatenated text.


  • Entrez Gene

  • Entrez supports reporting any record or set of records in XML format. After you have retrieved the record(s) of interest, select XML from the display menu and the result will be displayed according to Entrez Gene's DTD. You can then send that result to a file.

    Note: to convert multiple records to XML via the Entrez interface, check the boxes to left of the gene symbol in the query result view.

    *Archives (e.g. LocusLink transition)
    1. What is the relationship between Gene and LocusLink?
    2. When and how will LocusLink be discontinued?
    Gene and LocusLink

    Entrez Gene integrates information from LocusLink and from genes annotated on Reference Sequences from completely sequenced genomes. It thus provides a unified look and feel for gene-specific information independent of the species of origin. It also provides a foundation for other functions that to this point were available only for genomes in LocusLink, namely GeneRIFs and linkouts from BLAST results.

    Discontinuing LocusLink

    LocusLink was discontinued March 1, 2005. URLs directed to LocusLink/list.cgi and LocusLink/LocRpt.cgi are being redirected to Entrez Gene. The last update in the LocusLink format was generated June 1, 2005. The history of the transition is maintained here.

    GeneRIFs

    Beginning June 8, 2004, LocusLink no longer displayed GeneRIF information.


    CDD

    LocusLink stopped reporting information about conserved domains that may be found in RefSeq proteins in September, 2004.

    The last version of the LL_tmpl file that contained CDD information is available from the LocusLink ftp site in the file ARCHIVE/LL_tmpl_040903.gz. To extract domain information directly for any protein sequence, consider using E-utilities. The URL to fetch domain data based on a protein gi follows the pattern:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=gnl|ANNOT:CDD|[put the gi here]&retmode=xml

    Example URL for efetch for CDD:

    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=gnl|ANNOT:CDD|6978425&retmode=xml