| Queries with words containing : (colon) |
|
One feature of the Entrez query interface is the use of the colon (:) to indicate a range search. For example, in Nucleotide it is used to find sequences within specified length boundaries (1:5000[sequence length] finds records no more than 5 kb in length). Consequently, when a word has an embedded colon, the query interpreter tries to process it as a range. To bypass this problem, surround the word with double quotes.
Typing zgc:56666 in the query box will result in the message Range operation not supported.
Typing "zgc:56666" in the query box will retrieve any record that contains that explicit series of characters.
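If queries are assembled by a program rather than typed, the quoting rule can be applied automatically. The following is a minimal sketch; the function name `quote_literal_term` is hypothetical, and the caller is assumed to pass only terms meant literally, not intentional range expressions such as 1:5000[sequence length].

```python
def quote_literal_term(term: str) -> str:
    """Wrap a term in double quotes if it contains a colon.

    Terms that are already quoted, or that contain no colon, are returned
    unchanged. Do not pass intentional range searches to this function.
    """
    already_quoted = len(term) >= 2 and term.startswith('"') and term.endswith('"')
    if ":" in term and not already_quoted:
        return f'"{term}"'
    return term


if __name__ == "__main__":
    print(quote_literal_term("zgc:56666"))
```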
| Nomenclature |
|
Sources
The names (symbols) and full descriptions used in Gene come from 4 major sources:
Updates and access
Entrez Gene attempts to maintain current nomenclature. Updates to names in Entrez Gene are not propagated immediately to all other resources in NCBI. You may notice, for example, that symbols in genomic RefSeq annotation, Map Viewer, HomoloGene or UniGene, and their respective ftp sites, are not the same as those you see in Entrez Gene. RefSeq, for example, does not resubmit the full annotation of a reference contig accession to the nucleotide database each time a symbol changes. The symbols seen in Map Viewer and RefSeqs for contigs and chromosomes, however, should be the same, because all are updated only with each major re-annotation of a genome. It may help to consider that the Entrez Gene GeneID is unique across all taxa. You can therefore convert any GeneID into its current names by using the definitions provided in the file available as ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz. For example, if you transferred the gene_info.gz file to a unix file system, the command
gzcat gene_info.gz | cut -f2,3,5,9,13
will give you
If a GeneID is no longer current, it will not be reported in the file gene_info.gz. The file gene_history.gz in the same ftp directory can be used to determine if there is a replacement GeneID, for which the current names can then be determined as above.
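The same extraction can be scripted where gzcat and cut are not available. The sketch below mirrors the column selection of the command above and adds a lookup in gene_history.gz. The function names are hypothetical, and the column layout assumed for gene_history.gz (tax_id, replacement GeneID or '-', discontinued GeneID, ...) is an assumption that should be checked against the README on the ftp site.

```python
import gzip
from typing import Iterable, Iterator, List, Optional

# 1-based column numbers, mirroring `cut -f2,3,5,9,13` in the text.
NAME_COLUMNS = (2, 3, 5, 9, 13)


def read_gz_lines(path: str) -> Iterator[str]:
    """Yield text lines from a gzip-compressed file such as gene_info.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from handle


def select_columns(lines: Iterable[str], columns=NAME_COLUMNS) -> List[List[str]]:
    """Keep the requested tab-delimited columns from each data line."""
    rows = []
    for line in lines:
        if line.startswith("#"):  # skip a header or comment line, if present
            continue
        fields = line.rstrip("\n").split("\t")
        rows.append([fields[c - 1] for c in columns if c <= len(fields)])
    return rows


def replacement_gene_id(history_lines: Iterable[str], old_id: str) -> Optional[str]:
    """Return the replacement for a discontinued GeneID, or None.

    ASSUMPTION: column 2 holds the replacement GeneID ('-' if there is
    none) and column 3 holds the discontinued GeneID.
    """
    for line in history_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == old_id:
            return None if fields[1] == "-" else fields[1]
    return None
```

A typical call would be `select_columns(read_gz_lines("gene_info.gz"))`; if a GeneID is missing from the result, `replacement_gene_id(read_gz_lines("gene_history.gz"), gene_id)` may supply the identifier to look up instead.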
Symbols beginning with LOC
When a published symbol is not available, and orthologs have not yet been determined, Entrez Gene will provide a symbol that is constructed as 'LOC' + the GeneID. This symbol is not retained once a replacement symbol has been identified.
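Because the placeholder is built mechanically from the GeneID, a program can recognize it and avoid treating it as a stable name. A minimal sketch follows; the function names and the GeneID value used in the example are made up for illustration.

```python
def placeholder_symbol(gene_id: int) -> str:
    """Build the provisional symbol: 'LOC' followed by the GeneID."""
    return f"LOC{gene_id}"


def is_placeholder_symbol(symbol: str, gene_id: int) -> bool:
    """True if the symbol is exactly the 'LOC' + GeneID placeholder."""
    return symbol == placeholder_symbol(gene_id)


if __name__ == "__main__":
    # 123456 is an arbitrary GeneID chosen only for the example.
    print(placeholder_symbol(123456))
```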
Names beginning with 'similar to'
When NCBI automatically annotates a genome, it predicts both mRNAs and the proteins they encode. The protein sequences are compared to public protein sequence records from several model organisms. If a significant match is found, and the name is informative, then the automatic annotation process constructs the name of the model by combining 'similar to ' and the name of the matching protein. Because the sequences represented by NCBI's predictions are provided in accessions beginning with XM_ or XP_ or XR_, you might assume that all accessions with that format would have names beginning with 'similar to '. This is not necessarily the case, because NCBI will generate XM_ and XP_ or XR_ accessions for genes identified outside of the annotation pipeline but for which curated (NM_ and NP_ or NR_) accessions are not available. These genes, and the RefSeq accessions that represent them, will not have names beginning with 'similar to '.
NOTE: To the greatest extent possible, each protein-coding gene in mitochondria has been assigned the same name (symbol) and full description. In some instances, this is at variance with the symbol assigned by species-specific nomenclature committees. In those cases, the species-specific nomenclature is provided as alternates.
| Obtaining genomic sequence |
|
From Entrez Gene's gene diagram (note: position values are one-offset)
When a gene is annotated on a RefSeq for a chromosome or a genomic contig, the RefSeq accession over the diagram anchors a link to Entrez Nucleotide for the range corresponding exactly to the gene feature on the sequence record. Steps like these will therefore allow you to extract and save the genomic sequence of interest.
From Entrez Gene's Gene Table (note: position values are one-offset)
The Transcripts and products display in the Gene Table view supports the same functionality listed above. It also allows you to click on any of the sub-features (introns, exons, CDS portions of an exon) corresponding to each annotated transcript to extract that genomic subsequence in FASTA format. You can display the subsequence in GenBank format if you want to see other annotation in the region.
From Map Viewer
From Entrez Gene, you can navigate to Map Viewer to use the download functions there.
From Entrez Nucleotide (note: position values are one-offset)
Within Entrez Nucleotide, feature names anchor URLs. Clicking on 'gene' results in a display (in GenBank format) of that subsequence. To save the sequence, change the display format to FASTA and save as described above.
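As an alternative to the click-through steps described in this section (not a method the section itself documents), a subsequence can also be requested through E-utilities. The sketch below builds an EFetch URL using the seq_start and seq_stop parameters, which take one-offset positions. The function name and the accession string are placeholders, and the base URL is copied from the E-utilities examples elsewhere in this document, so adjust the scheme or host if the service has moved.

```python
from urllib.parse import urlencode

# Base URL as written in this document's E-utilities examples.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def subsequence_fasta_url(accession: str, start: int, stop: int) -> str:
    """Build a URL for a FASTA subsequence.

    start and stop are one-offset and inclusive, matching the positions
    shown in the browser views described above.
    """
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
    }
    return f"{EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    # "ACCESSION.1" is a placeholder, not a real record.
    print(subsequence_fasta_url("ACCESSION.1", 101, 200))
```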
| Notification of changes in Entrez Gene |
|
Gene maintains a list serve (gene-announce) that is used to notify subscribers of current or future changes in Entrez Gene and any of its reports. If this is of interest to you, please subscribe.
| Differing representations of RefSeqs |
|
Display of RefSeqs in Transcripts and Products vs. in the Reference Sequences (RefSeq) section
The diagram of the placement of RefSeq transcripts in the Transcripts and Products Section is based on the annotation of the positions of exons and coding sequences in the current genomic RefSeq. For some genomes, the genomic RefSeqs are updated independently of the annotated product RNAs, with the latter being updated more frequently. This means that several kinds of discrepancies between the diagram and the current RefSeq RNAs may result.
RefSeq RNA records are often based on cDNA sequences submitted to GenBank. They therefore can differ from the reference genomic sequence, either for biological reasons (variation or RNA editing) or because of some unresolved sequence discrepancy. The report of intron/exon organization in the Gene Table display is based on the placement of exons and CDS on the genomic sequence. If the independently determined RefSeq mRNA cannot be aligned perfectly to the genome, the lengths given in the Gene Table display may differ from those of the mRNA sequence itself. As discussed in the section above, it is also possible that the sequence of the RefSeq RNA was updated after it was aligned to, and used to annotate, the reference sequence. This also might result in discrepancies between the annotation on the genomic sequence and the current RefSeq RNA.
Representation of nucleotide positions
NCBI uses two conventions to represent the position of features in a sequence.
The zero-offset convention is used in the ASN.1 representation of sequence databases. The ASN.1 of Entrez Gene, and the derivative tab-delimited files gene2refseq.gz and gene2accession.gz in the DATA subdirectory of Gene's ftp site also use the convention of 0 offset.
Reports designed for browsing use the convention of one-offset. Thus the position data seen in default HTML views of Entrez Gene (and Nucleotide) are always one greater than that reported in the ASN.1 display.
NOTE: The files in the Map Viewer subdirectories in the genomes path that give position information for genes (seq_gene.md.gz) and other features are one-based. Please be aware of this when processing these files.
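When combining positions from zero-offset files (such as gene2refseq.gz) with one-offset sources (HTML views, seq_gene.md.gz), it helps to convert explicitly at a single point in the code. A minimal sketch, with hypothetical function names:

```python
def zero_to_one_offset(position: int) -> int:
    """Convert a zero-offset position (ASN.1, gene2refseq.gz) to one-offset."""
    if position < 0:
        raise ValueError("zero-offset positions cannot be negative")
    return position + 1


def one_to_zero_offset(position: int) -> int:
    """Convert a one-offset position (HTML views) to zero-offset."""
    if position < 1:
        raise ValueError("one-offset positions start at 1")
    return position - 1


def feature_length(start: int, end: int) -> int:
    """Length of a feature with inclusive start and end.

    Because both ends shift by one between the conventions, the result is
    the same whichever convention the two positions share.
    """
    return end - start + 1
```

For example, a feature reported as 99..199 in a zero-offset file corresponds to 100..200 in the browser view, and its length is 101 bases in both cases.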
| Entrez Gene and OMIM |
|
Entrez Gene integrates information from OMIM, and creates links to OMIM, at two levels:
Links provided from the Links menu in the upper right-hand part of the Gene record are based on both types of MIM numbers. Within the body of the record, the MIM number associated with the gene is reported in the Additional links section; any MIM number associated with a disease is reported in the Phenotypes section, along with the name of the disease. Symbols used by OMIM for genes and diseases are intermingled in Gene's Gene aliases section.
The gene_info.gz file provided from the Gene ftp site includes the MIM number associated with the gene. If that gene is associated with Mendelian disorders that have a different MIM number, that MIM number will not be provided in gene_info.gz.
All MIM numbers associated with Entrez Gene records are reported in the ftp file mim2gene. The value in the third column in that file indicates whether the MIM number is only for the disease ('phenotype') or for the gene ('gene').
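The distinction between gene and phenotype MIM numbers can be recovered from mim2gene with a few lines of code. The sketch below groups MIM numbers per GeneID using the type in the third column. It assumes, without confirmation from this document, that the first column is the MIM number and the second the GeneID; the function name and the sample values in the usage lines are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def mim_numbers_by_type(lines: Iterable[str]) -> Dict[str, Dict[str, List[str]]]:
    """Group MIM numbers per GeneID by the type given in the third column.

    ASSUMPTION: column 1 is the MIM number and column 2 is the GeneID.
    Only the types 'gene' and 'phenotype' named in the text are kept.
    """
    grouped = defaultdict(lambda: {"gene": [], "phenotype": []})
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        mim, gene_id, mim_type = fields[0], fields[1], fields[2]
        if mim_type in ("gene", "phenotype"):
            grouped[gene_id][mim_type].append(mim)
    return dict(grouped)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = ["100001\t11\tgene\n", "100002\t11\tphenotype\n"]
    print(mim_numbers_by_type(rows))
```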
| How Entrez Gene maintains certain types of information |
|
Conserved Domains
As sequence records are added to or updated in the Protein database, they are compared to records in the Conserved Domain Database (CDD) to identify likely domain content. The results of these analyses for RefSeq proteins are indexed for retrieval in Gene, are displayed when a Gene record is retrieved from Entrez, and are integrated into the ASN.1 that is provided for ftp transfer. The sequence of events is therefore:
To extract domain information directly for any protein sequence, consider using E-utilities. The url to fetch domain data based on a protein gi follows the pattern:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=gnl|ANNOT:CDD|[put the gi here]&retmode=xml.
Example URL for efetch for CDD:
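The URL pattern given above can be filled in programmatically. This is a minimal sketch: the function name is hypothetical, the gi in the usage line is an arbitrary placeholder rather than a real record, and the base URL is copied verbatim from the pattern, so it may need adjusting if the service has changed.

```python
# Base URL and id prefix copied from the pattern in the text.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def cdd_efetch_url(gi: int) -> str:
    """Insert a protein gi into the documented CDD efetch pattern."""
    if gi <= 0:
        raise ValueError("gi must be a positive integer")
    return f"{EFETCH}?db=protein&id=gnl|ANNOT:CDD|{gi}&retmode=xml"


if __name__ == "__main__":
    # 123456 is a placeholder gi used only to show the URL shape.
    print(cdd_efetch_url(123456))
```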
GeneRIFs
GeneRIFs are established by two primary methods.
GO terms
NCBI reports GO terms appropriate for a GeneID by integrating information from the following sources:
Entrez Gene currently reports, and uses for indexed queries, only the explicit GO term or terms assigned to any gene. It does not support querying at any node of the GO graph, nor retrieving all genes that match terms at more specific nodes based on a query at a higher node.
Interactions
Entrez Gene represents interaction data as pairs (more...). Entrez Gene staff does not curate these data, but does validate identifiers supplied with the source files.
| Using LinkOut |
|
Because Gene is an Entrez database, database providers can now use the LinkOut mechanism to direct users of Entrez Gene to related sites providing more information about a particular record. The benefits to data providers are several:
This area of LinkOut's documentation provides instructions geared more to the non-bibliographic data providers.
| How to construct URLs to link to Gene |
|
Because Gene is now an Entrez database, URLs can be constructed using standard Entrez methods. Use db=gene to define the database. Information on the gene-specific properties and filters is maintained in the Preview/Index section of the help documentation. Note that the Properties and Filters used by Gene and the counts for each in the current database can be viewed at any time by following these steps:
Entrez Gene's home page also provides examples of some URLs that allow common displays.
If you have stored LocusIDs locally, and have used the LocusID to connect to LocusLink, you can use the same integer value to connect to Gene, based on this pattern, where ID is the LocusID/GeneID.
| Relationship of LocusID to GeneID |
|
The method described in the previous section to construct URLs that connect to Gene by using what you may think of as a LocusID, is based on the premise that the GeneID integer is the same as the LocusID seen in LocusLink. We plan to provide the GeneID equal to the LocusID as long as LocusLink continues to be offered. Drosophila melanogaster and Caenorhabditis elegans are the only genomes where exceptions may occur, because the database underlying LocusLink is no longer directly involved in tracking identifiers assigned to genes when annotations are resubmitted for these genomes.
| Gene ftp site |
|
The Gene ftp site retains the general functions of the LocusLink ftp site, namely a mixture of
The README file in the gene directory provides more detailed information. See also the Gene-OMIM faq above for more information about MIM numbers provided in gene_info.gz and mim2gene.
The comprehensive extraction is provided in ASN.1 in the subdirectory ASN. In addition to the comprehensive file All_data.gz, there are subdirectories divided by taxonomic nodes. Each of these subdirectories contains a comprehensive extraction for that node, but may also contain some species-specific files. For example, Mammalia contains these files:
| File name | Content |
|---|---|
| All_Mammalia.gz | Gene records for mammalian species, including mitochondria. |
| Bos_taurus.gz | Gene records for Bos taurus, including mitochondria. |
| Canis_familiaris.gz | Gene records for Canis familiaris, including mitochondria. |
| Homo_sapiens.gz | Gene records for Homo sapiens, including mitochondria. |
| Mus_musculus.gz | Gene records for Mus musculus, including mitochondria. |
| Pan_troglodytes.gz | Gene records for Pan troglodytes, including mitochondria. |
| Rattus_norvegicus.gz | Gene records for Rattus norvegicus, including mitochondria. |
| Sus_scrofa.gz | Gene records for Sus scrofa, including mitochondria. |
GeneRIFs
The portion of LL_tmpl that contained GeneRIF data was indicated by the header GRIF and contained the PubMed ID and the text of the GeneRIF. Those data are now available from the GeneRIF ftp site as the file generif_basic.gz.
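For anyone migrating code that read the GRIF portion of LL_tmpl, the sketch below shows one way to pull the GeneID, PubMed IDs, and GeneRIF text out of a tab-delimited file such as generif_basic.gz. The column positions (passed as parameters) and the comma-separated PubMed ID list are assumptions, not something this document specifies; check the file's header line before relying on them.

```python
from typing import Iterable, Iterator, List, NamedTuple


class GeneRIF(NamedTuple):
    gene_id: str
    pubmed_ids: List[str]
    text: str


def parse_generif_lines(
    lines: Iterable[str],
    gene_col: int = 1,   # ASSUMED 0-based index of the GeneID column
    pmid_col: int = 2,   # ASSUMED 0-based index of the PubMed ID column
    text_col: int = 4,   # ASSUMED 0-based index of the GeneRIF text column
) -> Iterator[GeneRIF]:
    """Yield one GeneRIF per tab-delimited data line."""
    needed = max(gene_col, pmid_col, text_col) + 1
    for line in lines:
        if line.startswith("#"):
            continue  # skip the header line, if present
        fields = line.rstrip("\n").split("\t")
        if len(fields) < needed:
            continue  # ignore lines that do not match the assumed layout
        pmids = [p for p in fields[pmid_col].split(",") if p]
        yield GeneRIF(fields[gene_col], pmids, fields[text_col])
```

The lines can be supplied from `gzip.open("generif_basic.gz", "rt")` once the file has been downloaded.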
| Gene-related ftp sites |
|
There are other ftp sites at NCBI that contain gene-related information. These include:
| Extracting Gene in XML format |
|
If you prefer to use reports formatted in XML rather than ASN.1, you have several options:
Try the robust functions provided via E-utilities. A common approach is to combine use of ESearch to obtain a set of GeneIDs of interest, with EFetch which retrieves records by GeneID. The document EFetch for Sequence and other Molecular Biology Databases provides more information about how to set the parameters for extracting information from Entrez databases. It is as simple as:
Example showing how you can display on the web what will be retrieved for the GeneID you submit to EFetch:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=xml
Example showing how you can display on the web what will be retrieved for a list of GeneIDs submitted to ESummary:
A representative perl script using both ESearch and retrieval from ESummary is provided from the ftp site as taxidToGeneNames.pl. It uses NCBI's Taxonomy database identifier to support species-specific extraction of information incorporated in the Entrez Gene Summary display format.
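The ESearch-then-EFetch approach can be sketched in Python as well. The block below only builds the two URLs and extracts GeneIDs from an ESearch XML response; fetching the URLs is left to the caller. The function names are hypothetical, the base URL is copied from the examples in this document, and the retmax parameter and the IdList/Id layout of the ESearch response are stated from general knowledge of E-utilities rather than from this page.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List
from urllib.parse import urlencode

# Base URL as written in this document's examples.
EUTILS = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 20) -> str:
    """URL that asks ESearch for GeneIDs matching a query term."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def ids_from_esearch_xml(xml_text: str) -> List[str]:
    """Extract the identifiers listed under IdList in an ESearch reply."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


def efetch_url(gene_ids: Iterable[object]) -> str:
    """URL that asks EFetch for the given GeneIDs in XML."""
    id_list = ",".join(str(g) for g in gene_ids)
    params = {"db": "gene", "id": id_list, "retmode": "xml"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url('"zgc:56666"'))
    print(efetch_url([2]))
```

Note that the search term is passed already quoted, so a symbol containing a colon is not interpreted as a range.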
Examples:
The tool gene2xml, described here, converts the ASN.1 provided in binary set format (in the ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ASN_BINARY/ directory) into XML. It also converts ASN.1 in the binary format into concatenated text.
Entrez supports reporting any record or set of records in XML format. After you have retrieved the record(s) of interest, select XML from the display menu and the result will be displayed according to Entrez Gene's DTD. You can then send that result to a file.
Note: to convert multiple records to XML via the Entrez interface, check the boxes to the left of the gene symbols in the query result view.
| Gene and LocusLink |
|
Entrez Gene integrates information from LocusLink and from genes annotated on Reference Sequences from completely sequenced genomes. It thus provides a unified look and feel for gene-specific information independent of the species of origin. It also provides a foundation for other functions that to this point were available only for genomes in LocusLink, namely GeneRIFs and linkouts from BLAST results.
| Discontinuing LocusLink |
|
LocusLink was discontinued March 1, 2005. URLs directed to LocusLink/list.cgi and LocusLink/LocRpt.cgi are being redirected to Entrez Gene. The last update in the LocusLink format was generated June 1, 2005. The history of the transition is maintained here.
GeneRIFs
Beginning June 8, 2004, LocusLink no longer displayed GeneRIF information.
CDD
LocusLink stopped reporting information about conserved domains that may be found in RefSeq proteins in September 2004.
The last version of the LL_tmpl file that contained CDD information is available from the LocusLink ftp site in the file ARCHIVE/LL_tmpl_040903.gz. To extract domain information directly for any protein sequence, consider using E-utilities with the URL pattern given under Conserved Domains above.