For Programmers and Database Developers
Archives (e.g. LocusLink transition)
Nomenclature. How and when are gene symbols and names assigned?
Why can I sometimes display a record, but then cannot retrieve it by a query?
A feature of the Entrez query interface is to use the colon (:) to indicate a range search. For example, in nucleotide this is used to find sequences within specified length boundaries (1:5000[sequence length] finds records no more than 5 Kb in length.) Thus when a word has a ‘:’ embedded, the query interpreter tries to process this as a range. To bypass this problem, surround the word with double quotes.
Typing zgc:173705 in the query box will result in the message "Error in query. See Details".
Typing "zgc:173705" in the query box will retrieve any record that contains that explicit series of characters.
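If you assemble queries in a script, a small helper can apply the double-quote workaround for you. The following is only a sketch: the function name quote_term is hypothetical, and it should be applied to single words only, never to a term in which the colon is meant to express a range.

```python
def quote_term(word: str) -> str:
    """Wrap a word in double quotes if it contains a colon and is not already quoted."""
    already_quoted = len(word) >= 2 and word.startswith('"') and word.endswith('"')
    if ":" in word and not already_quoted:
        return f'"{word}"'
    return word


# The identifier from the example above, quoted so that it is not parsed as a range.
print(quote_term("zgc:173705"))
```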
This section includes more details about sources, updates, and conventions for genes of uncertain function.
The names (symbols) and full descriptions used in Gene come from 5 major sources:
Species-specific nomenclature committees, with great appreciation, as enumerated here and here. Note also
The gene name (symbol) and protein names provided in submissions used as sources for RefSeq records.
Symbols and full descriptions submitted by contributors of information about loci not defined by sequence.
Curation by NCBI staff
NCBI's annotation pipeline
If there is a nomenclature committee for a species, those names have precedence.
Entrez Gene attempts to maintain current nomenclature. Updates to names in Entrez Gene are not propagated immediately to all other resources in NCBI. You may notice, for example, that symbols in genomic RefSeq annotation, Map Viewer, HomoloGene or UniGene, and their respective ftp sites, are not the same as those you see in Entrez Gene. RefSeq, for example, does not resubmit the full annotation of a reference contig accession to the nucleotide database each time a symbol changes. The symbols seen in Map Viewer and RefSeqs for contigs, scaffolds, and chromosomes, however, should be the same, because all are updated only with each major re-annotation of a genome. It may help to consider that the Entrez Gene GeneID is unique across all taxa. You can therefore convert any GeneID into its current names by using the definitions provided in the file available as ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz. For example, if you transferred the gene_info.gz file to a unix or linux file system, the command
gzcat gene_info.gz | cut -f2,3,5,9,13
will give you
the GeneID
the current official symbol or database identifier if no official symbol is available
a pipe-delimited set of aliases
the full name
the nomenclature status of the name, where
O = official from a nomenclature committee,
I = interim from a nomenclature committee,
- = NCBI-supplied.
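If you would rather process the file in a script than on the command line, the same five columns can be read with a few lines of Python. This is a minimal sketch, not an NCBI-supplied tool: the function name read_gene_names is hypothetical, lines starting with '#' are assumed to be header lines, a lone '-' in the alias column is assumed to mean "no aliases", and the status code for official names is assumed to be the letter O.

```python
import gzip

# Zero-based indices of the columns selected by `cut -f2,3,5,9,13`.
FIELDS = (1, 2, 4, 8, 12)

# Nomenclature status codes as described above (assumed: letter O, letter I, hyphen).
STATUS = {"O": "official", "I": "interim", "-": "NCBI-supplied"}


def read_gene_names(source):
    """Yield (GeneID, symbol, aliases, full name, status) from gzipped gene_info data.

    `source` may be a path such as "gene_info.gz" or an open binary file object.
    """
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):  # assumed header/comment line
                continue
            cols = line.rstrip("\n").split("\t")
            gene_id, symbol, synonyms, full_name, status = (cols[i] for i in FIELDS)
            aliases = [] if synonyms == "-" else synonyms.split("|")
            yield gene_id, symbol, aliases, full_name, STATUS.get(status, status)
```

For example, `for row in read_gene_names("gene_info.gz"): print(row)` would print one tuple per GeneID. Unknown status codes are passed through unchanged rather than rejected.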
If a GeneID is no longer current, it will not be reported in the file gene_info.gz. The file gene_history.gz in the same ftp directory can be used to determine if there is a replacement GeneID, for which the current names can then be determined as above.
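A lookup of replacement GeneIDs could be sketched as follows. The column layout used here is an assumption to check against the README (tax id, current GeneID, discontinued GeneID, discontinued symbol, date, with '-' in the second column when there is no replacement), and the function names are hypothetical.

```python
import gzip


def load_replacements(source):
    """Map each discontinued GeneID to its replacement, or to None if there is none."""
    replacements = {}
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):  # assumed header/comment line
                continue
            cols = line.rstrip("\n").split("\t")
            current, discontinued = cols[1], cols[2]  # assumed column order
            replacements[discontinued] = None if current == "-" else current
    return replacements


def resolve(gene_id, replacements):
    """Follow replacements until a GeneID that is not discontinued is reached.

    Returns None if the GeneID was discontinued without a replacement.
    """
    seen = set()
    while gene_id in replacements and gene_id not in seen:
        seen.add(gene_id)  # guard against accidental cycles
        gene_id = replacements[gene_id]
        if gene_id is None:
            return None
    return gene_id
```

The GeneID returned by resolve can then be looked up in gene_info.gz to obtain the current names.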
Symbols beginning with LOC. When a published symbol is not available, and orthologs have not yet been determined, Entrez Gene will provide a symbol that is constructed as 'LOC' + the GeneID. This is not retained when a replacement symbol has been identified, although queries by the LOC term are still supported.
Names beginning with 'similar to'. When NCBI automatically annotates a genome, it predicts both mRNAs and the proteins they encode. The protein sequences are compared to public protein sequence records from several model organisms. If a significant match is found, and the name is informative, then the automatic annotation process constructs the name of the model by combining 'similar to ' and the name of the matching protein. Because the sequences represented by NCBI's predictions are provided in accessions beginning with XM_ or XP_ or XR_, you might assume that all accessions with that format would have names beginning with 'similar to '. This is not necessarily the case, because NCBI will generate XM_ and XP_ or XR_ accessions for genes identified outside of the annotation pipeline but for which curated (NM and NP or NR) accessions are not available. These genes, and the RefSeq accessions that represent them, will not have names beginning with similar to.
Other cases of uncertainty. When the name that should be assigned to the gene or protein is uncertain, sources use different conventions. The terms that are used {‘hypothetical’ (often from RefSeq), ‘similar to ‘ (from NCBI’s annotation pipeline), ‘putative’, ‘unknown’, ‘novel’ (from original submitters)} should not be construed to indicate different types of uncertainty. The terms can be considered equivalent, and reflect primarily the source of the naming. Entrez Gene and RefSeq encourage all data submitters to conform to the suggestions from major sequence databases.
NOTE: To the greatest extent possible, each protein-coding gene in mitochondria has been assigned the same name (symbol) and full description. In some instances, this is at variance with the symbol assigned by species-specific nomenclature committees. In those cases, the species-specific nomenclature is provided as alternates.
When a gene is annotated on a RefSeq for a chromosome or a genomic contig, the RefSeq accession over the diagram anchors a link to Entrez Nucleotide for the range corresponding exactly to the gene feature on the sequence record. Steps like these will therefore allow you to extract and save the genomic sequence of interest.
Click on the Nucleotide accession centered over the gene diagram in the Transcripts and products section of Entrez Gene Graphic (full) display.
Select either FASTA or GENBANK
Adjust the Range: from and to values if you want to capture upstream or downstream sequence, and click on Display
If you selected the FASTA option, save your sequence by copying and pasting or by using the Send all to file option.
If you selected the GENBANK option, confirm you have adjusted the sequence correctly by looking at the position of the gene feature in the range you have defined. When adjusted to your satisfaction, display the record as FASTA and then follow step 4 above.
The Transcripts and products display in the Gene Table view supports the same functionality listed above. It also allows you to click on any of the sub-features (introns, exons, CDS portions of an exon) corresponding to each annotated transcript to extract that genomic subsequence in FASTA format. You can display the subsequence in GenBank format if you want to see other annotation in the region.
From Entrez Gene, you can navigate to Map Viewer to use the download functions there.
Select Map Viewer from the Links menu at the upper right of the Entrez Gene record.
Click on Download/View Sequence/Evidence in the upper right of Map Viewer display, or click on dl in the label for the gene.
Adjust the range and strand if you like and press enter or Change Region/Strand.
Select a format (FASTA is the default).
Save
Within Entrez Nucleotide, feature names anchor URLs. Clicking on 'gene' results in a display (in GenBank format) of that subsequence. To save the sequence, change the display format to FASTA and save as described above.
For a limited number of genes in the human genome, gene-specific genomic RefSeqs, termed RefSeqGene, have been created. These have a RefSeq accession beginning with NG_ and can be retrieved from the nucleotide database using the query refseqgene[keyword].
Gene maintains a list serve (gene-announce) that is used to notify subscribers of current or future changes in Entrez Gene and any of its reports. If this is of interest to you, please subscribe.
The diagram of the placement of RefSeq transcripts in the Transcripts and Products Section is based on the annotation of the positions of exons and coding sequences in the current genomic RefSeq. For some genomes, the genomic RefSeqs are updated independently of the annotated product RNAs, with the latter being updated more frequently. This means that several kinds of discrepancies between the diagram and the current RefSeq RNAs may result.
The diagram may be labeled with an mRNA accession (for a predicted transcript) of the format XM_123456, yet clicking on that accession results in an entry in Entrez Nucleotide that indicates this accession is no longer primary. That means that a curated mRNA (accession of the format NM_123456 or NM_123456789) has been generated to replace the previous model accession. This new "NM" accession will be reported in the Reference Sequences section, in the subsection entitled RefSeqs maintained independently of Annotated Genomes.
The diagram may be labeled with curated RNA accessions (of the format NM_123456 or NM_123456789 or NR_123456) different from those listed in the RefSeq section. This will result if curation after the submission of the annotated genome identified more transcript variants, which therefore are listed only in the Reference Sequence section but not in the diagram. It will also result if curation after submission of the annotated genome identified an error in the annotated product, and the accession for that product was suppressed. In that case, the Transcripts and Products section will indicate a transcript not listed in the RefSeq section of the Entrez Gene report. A comment explaining why the record was suppressed is also provided.
The diagram may be labeled with a version of an mRNA or protein accession (for example, NM_123456.1) different from that listed in the RefSeq section (for example, NM_123456.2). This will result if the sequence has been changed in any way, such as extending the 5' or 3' ends, or removing mismatches between the cDNA sequence and the reference assembly.
RefSeq RNA records are often based on cDNA sequences submitted to GenBank. They therefore can differ from the reference genomic sequence, either for biological reasons (variation or RNA editing) or because of some unresolved sequence discrepancy. The report of intron/exon organization in the Gene Table display is based on the placement of exons and CDS on the genomic sequence. If the independently determined RefSeq mRNA cannot be aligned perfectly to the genome, the lengths given in the Gene Table display may differ from those of the mRNA sequence itself. As discussed in the section above, it is also possible that the sequence of the RefSeq RNA was updated after it was aligned to, and used to annotate, the reference sequence. This also might result in discrepancies between the annotation on the genomic sequence and the current RefSeq RNA.
At times, one gene record may be merged into another gene record. If genes are merged after an annotation is released, there may be more than one location reported on a genomic sequence per GeneID in the Summary report, each resulting from the annotation before the merge.
NCBI uses two conventions to represent the position of features in a sequence.
offset 0 or 0-based or zero-offset
offset 1 or 1-based or one-offset
The names are self-explanatory. In the sequence AAAATGCCC, the position of the start codon ATG is 3 in zero-offset and 4 in one-offset. If you find a difference in position information that is 'off-by-one', please review the conventions used in each file.
The zero-offset convention is used in the ASN.1 representation of sequence databases. The ASN.1 of Entrez Gene, and the derivative tab-delimited files gene2refseq.gz and gene2accession.gz in the DATA subdirectory of Gene's ftp site also use the convention of 0 offset.
Reports designed for browsing use the convention of one-offset. Thus the position data seen in default HTML views of Entrez Gene (and Nucleotide) are always one greater than that reported in the ASN.1 display.
NOTE: The files in the Map Viewer subdirectories in the genomes path that give position information for genes (seq_gene.md.gz) and other features are one-based. Please be aware of this when processing these files.
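When combining position data from files that use different conventions, it can help to make the conversion explicit. The sketch below reuses the AAAATGCCC example; the helper names are hypothetical.

```python
def to_one_based(position: int) -> int:
    """Convert a zero-offset position (ASN.1, gene2refseq.gz) to one-offset."""
    if position < 0:
        raise ValueError("zero-offset positions cannot be negative")
    return position + 1


def to_zero_based(position: int) -> int:
    """Convert a one-offset position (HTML views, seq_gene.md.gz) to zero-offset."""
    if position < 1:
        raise ValueError("one-offset positions start at 1")
    return position - 1


sequence = "AAAATGCCC"
start_zero = sequence.find("ATG")  # Python string indices are zero-offset
print(start_zero, to_one_based(start_zero))  # prints "3 4"
```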
Entrez Gene integrates information from OMIM, and creates links to OMIM, at two levels:
the gene
associated disorders or phenotypes
Links provided from the Links menu in the upper right-hand part of the Gene record are based on both types of MIM numbers. Within the body of the record, the MIM number associated with the gene is reported in the See Related and Additional links sections; any MIM number associated with a disease is reported in the Phenotypes section, along with the name of the disease. Symbols used by OMIM for genes and diseases are intermingled in Gene's Gene aliases section.
The gene_info.gz file provided from the Gene ftp site includes the MIM number associated with the gene. If that gene is associated with Mendelian disorders that have a different MIM number, that MIM number will not be provided in gene_info.gz.
All MIM numbers associated with Entrez Gene records are reported in the ftp file mim2gene. The value in the third column in that file indicates whether the MIM number is only for the disease ('phenotype') or for the gene ('gene').
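To separate gene MIM numbers from phenotype MIM numbers in a script, the third column can be used as a key. This sketch assumes a tab-delimited layout of MIM number, GeneID, type, with header lines starting with '#' and '-' where no GeneID is assigned; please confirm the layout against the file itself. The function name mims_by_gene is hypothetical.

```python
from collections import defaultdict


def mims_by_gene(lines):
    """Group MIM numbers by GeneID and by type ('gene' or 'phenotype')."""
    result = defaultdict(lambda: {"gene": [], "phenotype": []})
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        mim, gene_id, kind = line.rstrip("\n").split("\t")[:3]  # assumed column order
        if gene_id == "-":  # assumed marker for "no GeneID"
            continue
        result[gene_id].setdefault(kind, []).append(mim)
    return dict(result)
```

For example, `mims_by_gene(open("mim2gene"))` would return a dictionary keyed by GeneID, from which the phenotype MIM numbers that are not reported in gene_info.gz can be read.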
As sequence records are added to or updated in the Protein database, they are compared to records in the Conserved Domain Database (CDD) to identify likely domain content. The results of these analyses for RefSeq proteins are indexed for retrieval in Gene, are displayed when a Gene record is retrieved from Entrez, and are integrated into the ASN.1 that is provided for ftp transfer. The sequence of events is therefore:
new sequence added to the protein database
analyzed by the CDD group
Entrez Gene re-indexed
Thus it may require a few days for a new RefSeq accession to display domain information in Gene.
To extract domain information directly for any protein sequence, consider using E-utilities. The url to fetch domain data based on a protein gi follows the pattern:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=gnl|ANNOT:CDD|[put the gi here]&retmode=xml.
GeneRIFs are established by three primary methods.
Extraction from the published literature by staff of the National Library of Medicine.
Summary reports from HuGE Navigator
User submissions from an Entrez Gene record.
In the first case, the records are updated weekly. In the second case, Entrez Gene processes information about how a citation in PubMed is related to a GeneID, and converts that to a standard text. In the last case, RefSeq staff reviews the submission before release, and contacts the submitter if questions arise. User-submitted data should be public within a week.
GeneRIFs are reported from the full report in the Bibliography section. A scrolling window provides unique text of a GeneRIF; the citation or citations that support that statement are available by clicking on the document icon at the left of the GeneRIF. Because the text of GeneRIFs submitted from HuGE Navigator is computed, it is likely that more than one citation will be displayed in PubMed to support that text. Please be certain to note the report of the number of records returned by the query, and scroll through the web page to review all the citations.
GeneRIFs are reported from this subdirectory: ftp://ftp.ncbi.nih.gov/gene/GeneRIF/. In these files, each GeneRIF is reported separately. If there are multiple records for the same gene with the same text, each will be reported from one line in the file. If there are multiple records for the same gene with different text but the same PubMed id, each will be reported from one line in the file.
NCBI reports GO terms appropriate for a GeneID by integrating information from the following sources:
The ftp site of the GO consortium here
For human only, the GOA ftp site here
Data provided in sequence submissions.
For all genomes but human, a species-specific gene-identifier (FBgn id, MGI id, RGD ID) is converted to the GeneID. For human, the connection is made from common protein accessions. Most current gaps in the human set, therefore, result from lags in matching protein accessions to GeneIDs. According to Gene's current data flow, any association of a protein accession with more than one gene record must be reviewed by a curator. This multiplicity can occur frequently with gene families where multiple genes encode the same protein sequence.
Entrez Gene currently reports, and uses for indexed queries, only the explicit GO term or terms assigned to any gene. It does not support querying at any node of the GO graph, nor retrieving all genes that match terms at more specific nodes based on a query at a higher node.
Entrez Gene represents interaction data as pairs. Entrez Gene staff does not curate these data, but does validate identifiers supplied with the source files.
There are two methods by which a gene record can be accessed:
Directly by a public GeneID
A query via the Entrez indexing system which returns the list of GeneIDs that satisfy your query.
For recent records, it is possible that the record itself is public, but the indexing of that record is not yet complete so retrieval by Entrez search returns no results. Because Gene re-indexes daily, this discrepancy should last no more than 24 hours.
There are several qualifiers that you might consider using to determine if the function is known or not known. Gene is currently allowing the user to decide which criteria to use, rather than making that decision unilaterally.
Does the gene encode a protein with a conserved domain?
Use gene_cdd[filter] to identify those that do or do not.
Has a GeneRIF been submitted for the gene?
Use generif[prop] to identify those that do or do not
If human, is the gene also discussed in the OMIM database?
Use gene_omim[filter] to identify records also described (or not) in OMIM
How is the gene named?
If the full name starts with ‘hypothetical’, no group has decided how to name this. If the preferred symbol starts with NCRNA, nomenclature groups believe this gene produces a non-coding RNA of unknown function.
Hypothetical*[title]
Ncrna*[preferred symbol]
To find mouse protein-coding genes of unknown function. This query uses the first part of the title of the gene (predicted* or hypothetical*), and excludes those that have a GeneRIF submitted.
mouse[orgn] AND "genetype protein coding"[Properties] AND (hypothetical*[title] OR predicted*[title]) AND alive[prop] NOT generif[prop]
To find protein-coding genes from Drosophila melanogaster that do not have a product with a conserved domain in NCBI’s conserved domain database:
"drosophila melanogaster"[orgn] AND "genetype protein coding"[Properties] NOT gene_cdd[filter] AND alive[prop]
To find non-coding RNAs of unknown function
ncrna*[Preferred Symbol] AND alive[prop]
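Queries like those above can also be submitted through the E-utilities ESearch service rather than typed into the query box. The following sketch only builds the URL and does not contact the server. The host is the one used in the E-utilities examples elsewhere in this document; the function name esearch_url and the choice of retmax are illustrative assumptions.

```python
from urllib.parse import urlencode

ESEARCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The mouse query from the example above.
MOUSE_UNKNOWN_FUNCTION = (
    'mouse[orgn] AND "genetype protein coding"[Properties] '
    "AND (hypothetical*[title] OR predicted*[title]) "
    "AND alive[prop] NOT generif[prop]"
)


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL against db=gene; urlencode escapes quotes, brackets and spaces."""
    return ESEARCH + "?" + urlencode({"db": "gene", "term": term, "retmax": retmax})


print(esearch_url(MOUSE_UNKNOWN_FUNCTION))
```

Letting urlencode do the escaping avoids hand-editing the double quotes and square brackets that these queries rely on.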
Because Gene is an Entrez database, database providers can now use the LinkOut mechanism to direct users of Entrez Gene to related sites providing more information about a particular record. The benefits to data providers are several:
The provider controls making and removing connections between Entrez Gene and the provider's web site.
The provider's web site may receive additional traffic because of links from users of Entrez Gene.
This area of LinkOut's documentation provides instructions geared more to the non-bibliographic data providers.
Because Gene is now an Entrez database, URLs can be constructed using standard Entrez methods. Use db=gene to define the database. Information on the gene-specific properties and filters is maintained in the Preview/Index section of the help documentation. Note that the Properties and Filters used by Gene and the counts for each in the current database can be viewed at any time by following these steps:
On any Gene query bar, click on Preview/Index.
In the All fields menu, select Properties or Filter
Click on Index and navigate through the options.
Entrez Gene's home page also provides examples of some URLs that allow common displays.
If you have stored LocusIDs locally, and have used the LocusID to connect to LocusLink, you can use the same integer value to connect to Gene, based on this pattern, where ID is the LocusID/GeneID.
/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=ID
The method described in the previous section to construct URLs that connect to Gene by using what you may think of as a LocusID, is based on the premise that the GeneID integer is the same as the LocusID last seen in LocusLink. Drosophila melanogaster and Caenorhabditis elegans are the only genomes where exceptions may occur, because the database underlying LocusLink is no longer directly involved in tracking identifiers assigned to genes when annotations are resubmitted for these genomes.
The Gene ftp site provides two major types of reports:
tab-delimited files matching GeneIDs to citation, accession, and name information
a comprehensive extraction
The README file in the gene directory provides more detailed information. See also the Gene-OMIM faq above for more information about MIM numbers provided in gene_info.gz and mim2gene.
The comprehensive extraction is provided in ASN.1 in the subdirectory ASN. In addition to the comprehensive file All_data.gz, there are subdirectories divided by taxonomic nodes. Each of these sub-directories contains a comprehensive extraction for that node, but may also contain some species-specific files. For example, Mammalia contains these files:
| File name | Content |
| --- | --- |
| All_Mammalia.gz | Gene records for mammalian species, including mitochondria. |
| Bos_taurus.gz | Gene records for Bos taurus, including mitochondria. |
| Canis_familiaris.gz | Gene records for Canis familiaris, including mitochondria. |
| Homo_sapiens.gz | Gene records for Homo sapiens, including mitochondria. |
| Mus_musculus.gz | Gene records for Mus musculus, including mitochondria. |
| Pan_troglodytes.gz | Gene records for Pan troglodytes, including mitochondria. |
| Rattus_norvegicus.gz | Gene records for Rattus norvegicus, including mitochondria. |
| Sus_scrofa.gz | Gene records for Sus scrofa, including mitochondria. |
GeneRIFs
GeneRIF data are now available from the GeneRIF ftp site as the file generif_basic.gz.
If you prefer to use reports formatted in XML rather than ASN.1, you have several options:
Try the robust functions provided via E-utilities. A common approach is to combine use of ESearch to obtain a set of GeneIDs of interest, with EFetch which retrieves records by GeneID. The document EFetch for Sequence and other Molecular Biology Databases provides more information about how to set the parameters for extracting information from Entrez databases. It is as simple as:
defining db as gene
defining retmode as xml
defining id as the GeneID of interest
Example, showing how you can display on the web what will be retrieved for the GeneID you submit to EFetch.
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=xml
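The three parameters listed above (db, retmode, id) can be assembled in a script as in this sketch, which reproduces the example URL without contacting the server. The function name efetch_gene_url is hypothetical, and joining several GeneIDs with commas is an assumption to check against the EFetch documentation.

```python
from urllib.parse import urlencode

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_gene_url(gene_ids) -> str:
    """Build an EFetch URL that requests Gene records as XML."""
    ids = ",".join(str(gene_id) for gene_id in gene_ids)  # assumed list syntax
    return EFETCH + "?" + urlencode({"db": "gene", "id": ids, "retmode": "xml"})


print(efetch_gene_url([2]))
```

The resulting URL can be opened in a browser or passed to any HTTP client, for instance urllib.request.urlopen.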
Example, showing how you can display on the web, what will be retrieved by a list of GeneIDs submitted to ESummary.
A representative perl script using both ESearch and retrieval from ESummary is provided from the ftp site as taxidToGeneNames.pl. It uses NCBI's Taxonomy database identifier to support species-specific extraction of information incorporated in the Entrez Gene Summary display format.
Examples:
taxidToGeneNames.pl -t 9606 -o xml --reports data from the summary for human genes with output as XML
taxidToGeneNames.pl -t 10090 -o tab --reports GeneID, symbol, full name from the summary for mouse in tab-delimited output
The tool gene2xml, described here, converts the ASN.1 provided in binary set format (in the ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/ directory) into XML. It also converts ASN.1 in the binary format into concatenated text.
Entrez supports reporting any record or set of records in XML format. After you have retrieved record(s) of interest, select XML from the display menu and the result will be displayed according to Entrez Gene's DTD. You can then send that result to a file.
Note: to convert multiple records to XML via the Entrez interface, check the boxes to left of the gene symbol in the query result view.
The ftp files in the ASN_BINARY subdirectory of Gene's ftp site are binary concatenated gzip files. This type of content is defined in the specification RFC-1952:
“2.2. File format
A gzip file consists of a series of "members" (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them.”
This specification can be found at the Internet Engineering Task Force web site at http://www.ietf.org/rfc/rfc1952.txt.
If you are developing applications to decompress Gene’s ASN.1 binary format ftp files, be sure that any compression library that you are using supports this standard. For example, there is a known issue with the compression library in Microsoft ® .NET Framework 3.5 which does not support decompressing this type of content. For further information about this issue, see http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=357758
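One way to check whether a library copes with concatenated gzip members is to build a two-member file in memory and decompress it. The sketch below uses Python's standard zlib and gzip modules as an illustration; the function name decompress_members is hypothetical, and the same test can be adapted to whichever library you actually use.

```python
import gzip
import zlib


def decompress_members(data: bytes) -> list:
    """Decompress every gzip member in `data`, returning one bytes object per member."""
    members = []
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        members.append(decompressor.decompress(data) + decompressor.flush())
        # Whatever follows the end of this member is the start of the next one.
        data = decompressor.unused_data
    return members


# Two members that "simply appear one after another", as RFC 1952 describes.
concatenated = gzip.compress(b"first record\n") + gzip.compress(b"second record\n")
print(len(decompress_members(concatenated)))
```

A library that stops after the first member would return only part of the data, which is the symptom to look for when processing files from the ASN_BINARY directory.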
Entrez Gene integrates information from LocusLink and from genes annotated on Reference Sequences from completely sequenced genomes. It thus provides a unified look and feel for gene-specific information independent of the species of origin. It also provides a foundation for other functions that to this point were available only for genomes in LocusLink, namely GeneRIFs and linkouts from BLAST results.
LocusLink was discontinued March 1, 2005. URLs directed to LocusLink/list.cgi and LocusLink/LocRpt.cgi are being re-directed to Entrez Gene. The last update in the LocusLink format was generated June 1, 2005. The history of the transition is maintained here.
GeneRIFs Beginning June 8, 2004, LocusLink no longer displayed GeneRIF information.
CDD LocusLink stopped reporting information about conserved domains that may be found in RefSeq proteins in September, 2004.
The last version of the LL_tmpl file that contained CDD information is available from the LocusLink ftp site in the file ARCHIVE/LL_tmpl_040903.gz. To extract domain information directly for any protein sequence, consider using E-utilities. The url to fetch domain data based on a protein gi follows the pattern:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=gnl|ANNOT:CDD|[put the gi here]&retmode=xml.