Entrez Gene News

This page summarizes news and announcements related to Entrez Gene.

The history of what has been sent to subscribers to the Gene announcements mailing list is also available.


Links Between RefSeqs and Ensembl.   (December 3, 2009)

Entrez Gene now calculates matches between NCBI and Ensembl annotation by comparing RNA and protein features.

For organisms that are represented in the Consensus Coding Sequence (CCDS) project (i.e., human and mouse), the set of matches includes all protein sequences in CCDS and their corresponding mRNAs.

For all other organisms, matches are collected as follows. For a protein to be identified as a match between RefSeq and Ensembl, there must be at least 80% overlap between the two. Furthermore, splice site matches must meet certain conditions: either 60% or more of the splice sites must match, or there may be at most one splice site mismatch.

For RNA features, the matching criteria are the same as for proteins.
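The matching rule for organisms outside CCDS can be sketched in Python. This is a minimal illustration, not NCBI's implementation: the text does not say how overlap or splice-site agreement is measured, so the sketch assumes that overlap is the shared span divided by the length of the longer feature, and that splice sites are compared as sets of genomic positions. All function names and coordinates are hypothetical.

```python
def overlap_fraction(span_a, span_b):
    """Shared bases divided by the longer span (an assumed definition)."""
    start = max(span_a[0], span_b[0])
    end = min(span_a[1], span_b[1])
    shared = max(0, end - start + 1)  # coordinates are treated as inclusive
    longest = max(span_a[1] - span_a[0] + 1, span_b[1] - span_b[0] + 1)
    return shared / longest


def splice_sites_agree(sites_a, sites_b, min_fraction=0.60):
    """True if at least 60% of splice sites match or at most one mismatches."""
    a, b = set(sites_a), set(sites_b)
    all_sites = a | b
    if not all_sites:
        return True  # no splice sites to compare (assumed to pass)
    matched = len(a & b)
    mismatched = len(all_sites) - matched
    return matched / len(all_sites) >= min_fraction or mismatched <= 1


def is_match(span_a, span_b, sites_a, sites_b, min_overlap=0.80):
    """Apply both conditions: sufficient overlap and splice-site agreement."""
    return (overlap_fraction(span_a, span_b) >= min_overlap
            and splice_sites_agree(sites_a, sites_b))
```

Under these assumptions, two features spanning 100-1099 and 150-1099 with identical splice sites would count as a match, while features spanning 100-1099 and 600-1599 would not, because only half of the longer span is shared.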

This data can be accessed in a number of different ways. First, in the Full Report view in Entrez Gene, matching Ensembl transcripts and proteins are listed with the RefSeqs in the NCBI Reference Sequences section next to the label Related Ensembl, with links to the Ensembl web site for the associated transcripts and proteins. Links to Ensembl genes will continue to be reported in the Summary section at the top of the Full Report.

Second, genes with matching Ensembl annotation can be found using a new property named "matches Ensembl". For example, to find all genes with Ensembl matches, use:

  • matches Ensembl [properties]

Third, Ensembl matches are provided on our FTP site in a new file named gene2ensembl.gz. This file is described in ftp://ftp.ncbi.nih.gov/gene/README.
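A downloaded copy of gene2ensembl.gz could be read along these lines. This is a sketch only: it assumes the file is tab-delimited and that its first line is a header beginning with "#", and it takes the column names from that header rather than hard-coding them. The README remains the authoritative description of the columns; the function names here are hypothetical.

```python
import csv
import gzip


def read_gene2ensembl(lines):
    """Yield one dict per data row, keyed by the names in the header line."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # header is assumed to start with '#'
    for row in reader:
        if row:
            yield dict(zip(header, row))


def read_gene2ensembl_file(path="gene2ensembl.gz"):
    """Open a local copy of the gzipped file in text mode and parse it."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from read_gene2ensembl(handle)
```

Reading the header at run time keeps the sketch usable even if columns are added to the file later.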


New Properties for rnatype.   (December 3, 2009)

For some time now, the type of gene has been indexed with a genetype property such as "genetype protein coding [properties]". Entrez Gene is now indexing RNA types as well, so you can find genes by the types of RNAs that are represented on them, such as "rnatype mRNA [properties]". The current list of rnatype properties is:

  • rnatype trna (transfer RNA, tRNA)
  • rnatype rrna (ribosomal RNA, rRNA)
  • rnatype snrna (small nuclear RNA, snRNA)
  • rnatype scrna (small cytoplasmic RNA)
  • rnatype snorna (small nucleolar RNA)
  • rnatype miscrna (miscellaneous RNA)
  • rnatype ncrna (non-coding RNA)
  • rnatype mrna (messenger RNA)
  • rnatype mirna (micro RNA)
  • rnatype other genetic
  • rnatype other

Gene Groups (relationships)   (April 2, 2009)

Gene is now reporting different types of gene-to-gene relationships. The first type is a relationship between a pseudogene and its related functional gene. In the General gene information section of the Full Report in Entrez Gene, a Related functional gene heading will appear for relevant genes, with a link to the related functional gene. Because these relationships are bi-directional, the functional gene will have a link to its related pseudogenes in its General gene information section, with the heading Related pseudogene(s).

The data are not complete, so please be aware that the lack of a report should not be interpreted to indicate that a gene does not have any pseudogenes.

Additional types of relationships will be added in the future.


Sort by Chromosome   (December 17, 2008)

An option to sort by chromosome has been added. Choosing this option causes the records to be sorted in this order:

  1. Alphabetically by organism name
  2. Numerically by chromosome
  3. Numerically by the start position on the chromosome.

For example, suppose that the search results include genes for Homo sapiens (human) and Mus musculus (mouse). The human genes will all appear before those for mouse. Within the set of human genes in the results, those that are placed on chromosome 1 will appear first, followed by those placed on chromosome 2, and so on. Finally, within a chromosome, genes will be sorted according to their start positions on the chromosome.

Genes that are not placed on a chromosome will appear at the end of the results. Genes that are placed on multiple chromosomes will be sorted according to the first such chromosome.
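The ordering described above can be expressed as a sort key. The following Python sketch is illustrative only: the record layout is hypothetical, and because the text does not say where named chromosomes such as X fall, the sketch assumes they follow the numbered ones.

```python
def chromosome_sort_key(record):
    """Order by organism, then chromosome, then start; unplaced genes last."""
    chrom = record.get("chromosome")
    if chrom is None:
        # Unplaced genes go to the end of the whole result set.
        return (1, record["organism"], 0, 0, "", 0)
    if chrom.isdigit():
        chrom_key = (0, int(chrom), "")  # numeric order, so 2 precedes 10
    else:
        chrom_key = (1, 0, chrom)  # assumed: named chromosomes follow numbered
    return (0, record["organism"]) + chrom_key + (record["start"],)


# Hypothetical records; names and positions are made up for illustration.
records = [
    {"name": "a", "organism": "Mus musculus", "chromosome": "1", "start": 10},
    {"name": "b", "organism": "Homo sapiens", "chromosome": "2", "start": 500},
    {"name": "c", "organism": "Homo sapiens", "chromosome": None, "start": None},
    {"name": "d", "organism": "Homo sapiens", "chromosome": "1", "start": 900},
    {"name": "e", "organism": "Homo sapiens", "chromosome": "10", "start": 5},
    {"name": "f", "organism": "Homo sapiens", "chromosome": "1", "start": 100},
]
ordered = [r["name"] for r in sorted(records, key=chromosome_sort_key)]
print(ordered)  # ['f', 'd', 'b', 'e', 'a', 'c']
```

For a gene placed on multiple chromosomes, the record would carry only the first such chromosome and its start position.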

In conjunction with the new sort option, two new fields have been added to the DocSum (Document Summary) for Gene. The ChrSort field contains a sortable version of the first chromosome, if any, on which this gene is placed. The ChrStart field contains the start position for the first such chromosome.


Search by Preferred Symbol   (December 17, 2008)

The new [Preferred symbol] index field contains the preferred symbol for the gene, as compared to the [sym] field which is indexed for preferred symbols, aliases and locus tags. There is often a lot of overlap among preferred symbols and aliases, and this new field allows you to restrict the search to those genes with the specified preferred symbol, while excluding those that would match only on an alias name or locus tag.

For example, a query of set1[sym] would return 13 results, while a query of set1[preferred symbol] would return only the 11 results with set1 as the preferred symbol.

The alias for this field is [PREF].


Property "Officially Named"   (December 17, 2008)

This new property is set for all genes with official nomenclature.

Example of usage:

  • officially named [prop]


Search by GeneID Range.   (October 27, 2008)

You can now search for a range of GeneID values. For example:

  • 1:1000[GeneID]

will find all GeneIDs between 1 and 1000. This may be helpful for users of E-utilities.
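For E-utilities users, a range query like this one can be passed to ESearch as the term parameter. The sketch below only builds the request URL and does not contact NCBI. The base URL is the ESearch endpoint as commonly documented, which is an assumption here since this page does not give it, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed ESearch endpoint; check the E-utilities documentation before use.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_esearch_url(term, retmax=20):
    """Return an ESearch URL that runs `term` against the gene database."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = gene_esearch_url("1:1000[GeneID]")
print(url)
```

urlencode percent-encodes the colon and square brackets, so the query syntax survives transport unchanged.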


New Search Options.   (October 27, 2008)

Two new fields have been added to facilitate searching:

  1. Exon count
  2. Gene length

Exon Count

This field contains the number of distinct, non-overlapping RefSeq exons annotated for all RNA products of a gene interval, based on annotation in this priority: reference assembly first, alternate assembly second.

This field can be queried by either a single integer value or a range. For example, to retrieve all human records with one exon, use:

  • human[orgn] AND 1[exoncount]

To retrieve all records with a range of exons:

  • 10:20[exoncount]

The aliases for this field are [XC] and [NUMEXONS].

Gene Length

This field contains the gene length based on annotation in this priority: reference assembly first, alternate assembly second. If a gene has multiple placements and all of them are on non-reference assemblies, the longest value among those placements is used.

This field can be queried by either a single integer value or a range. For example, to retrieve all records with a gene span less than or equal to 5kb, use:

  • 1:5000[genelength]

The aliases for this field are [GL] and [GENELEN].
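Single-value and range queries for both fields share the same syntax, so a small helper can generate them. This Python sketch is a convenience for composing query strings, not part of Entrez Gene; the function name is hypothetical.

```python
def range_term(field, low, high=None):
    """Format a single-value or low:high query term for a numeric field."""
    if high is None:
        return f"{int(low)}[{field}]"
    if int(low) > int(high):
        raise ValueError("low must not exceed high")
    return f"{int(low)}:{int(high)}[{field}]"


one_exon_human = "human[orgn] AND " + range_term("exoncount", 1)
exon_range = range_term("exoncount", 10, 20)
short_genes = range_term("genelength", 1, 5000)
print(one_exon_human)  # human[orgn] AND 1[exoncount]
print(exon_range)      # 10:20[exoncount]
print(short_genes)     # 1:5000[genelength]
```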


New hiv1interactions Property.   (October 27, 2008)

This property is set for all genes with curated HIV-1:human protein interaction data.

Example of usage:

  • hiv1interactions[prop]

See also http://www.ncbi.nlm.nih.gov/RefSeq/HIVInteractions/index.html


New Sort Options.   (March 17, 2008)

An option has been added to the gray query bar to allow you to re-sort the results of your query. The options currently are:

  • Sort by Relevance
  • Sort by Gene Weight
  • Sort by Gene Symbol

These options are defined as follows.

Sort by Relevance (the current default)

Relevance is calculated from Gene's assessment of what fields are the most important by which to find search results. For example, Gene assigns more value to results if they match a term in the 'Gene Name' (symbol) field than to a match in free text such as the RefSeq or GeneRIF summary. Thus if your query is the single term 'cat', then records with symbols of 'cat' will be sorted ahead of records with the term 'cat' only elsewhere in the record.

Sort by Gene Weight

Gene Weight is calculated from multiple lines of evidence geared toward evaluating how well a gene has been characterized. These lines include:

  1. Informative Gene-PubMed links. Informativeness is inversely proportional to the number of Genes connected to a PubMed record.
  2. Informative symbols or full names. A gene with a LOC+GeneID is weighted less, for example, than a gene with the symbol 'ABCA1'. A gene with a description that starts with the word 'hypothetical' is weighted less than one with a description that starts with 'cystic fibrosis'.
  3. Inclusion in HomoloGene or Protein Clusters. Genes (or their products) that are known to be conserved are weighted more highly.
  4. Inclusion in OMIM or Books.

Sort by Gene Symbol

This sort orders records by the preferred symbol assigned to the gene.


Limit by Chromosomal Region.   (March 11, 2008)

The Limits page, accessed by clicking on Limits in the Query bar, now has a function to facilitate retrieving Gene records by chromosomal location. The section supporting this function is titled Limit by Chromosomal Region.

You must first select the organism against which you want to do a search. A scrollable menu is provided, but you can jump to the region of interest by beginning to type a trivial name or the binomial (e.g. human, Homo sapiens, rat, Rattus norvegicus). For example, if you want to find genes on mouse chromosome 5, you do not have to scroll all the way down to Mus musculus, but can type Mus and then scroll to find and select Mus musculus.

When you have selected a species, a new menu is offered with the chromosomes appropriate to the species. For most genomes represented in Gene, the only choice will be the mitochondrion or a plastid. For Drosophila melanogaster, the choices are the arms for chromosomes 2 and 3 rather than the complete chromosome.

After selecting the chromosome (or arm) you can enter the integers representing the lower and upper boundaries between which you want to find genes. These values will be used as additional query elements.

For example, if you wanted to identify all zinc finger genes on human chromosome 19 between 40,000,000 and 50,000,000 bp, you could follow these steps:

  1. enter the words zinc finger in the query box
  2. click on the Limits tab
  3. start typing Homo sapiens in the organism box and select Homo sapiens
  4. the chromosome selection menu will appear
  5. select 19
  6. enter 40000000 in the From box (no punctuation)
  7. enter 50000000 in the To box (no punctuation)
  8. press Go

Your result should be equivalent to entering this in Gene's query box:

zinc finger[All Fields] AND (NC_000019[nucl_accn] AND 40000000[CHRPOS]:50000000[CHRPOS])

Please note "Limits: Homo sapiens Chr.19 From 40000000 to 50000000" is displayed in yellow on the results page, and will remain set until you remove the check in the Limits tab or return to the Limits page to refine your query.
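The equivalent query string can also be assembled programmatically. The Python sketch below reproduces the zinc finger example; the function name is hypothetical, and the chromosome accession must be supplied by the caller because the mapping from chromosome number to accession is not covered here.

```python
def region_query(text, chrom_accession, start, end):
    """Build a free-text query limited to a chromosomal region."""
    def as_int(value):
        # Positions must be entered without punctuation, so drop any commas.
        return int(str(value).replace(",", ""))

    low, high = as_int(start), as_int(end)
    if low > high:
        raise ValueError("start must not exceed end")
    return (f"{text}[All Fields] AND ({chrom_accession}[nucl_accn] "
            f"AND {low}[CHRPOS]:{high}[CHRPOS])")


query = region_query("zinc finger", "NC_000019", "40,000,000", "50,000,000")
print(query)
```

The printed string matches the query shown in the example above.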

Please see the news item "Chromosome base position available for query and in document summary" for additional information.


Reporting of annotation information.   (September 20, 2007)

For genomes that NCBI annotates, Entrez Gene will represent information about the annotation of each current GeneID. Text phrases will be attached to the gene data if the gene is not annotated well, or if annotation has changed in a complex way. Text phrases will also be attached if there is no defining cDNA or genomic sequence for the gene, or if the GeneID was created after the most recent genome annotation. The goal is to facilitate retrieval of Gene records where the annotation on the RefSeq genomic records, if it exists, should be interpreted with caution. Thus, records that are not known to have annotation issues can be retrieved by appending this clause to your query:

  • NOT "Annotation Information" [Text]

Specific sub-categories of annotation information are:

  "partial on reference assembly"   The annotated gene, as suggested by the defining cDNA, is not complete.
  "spans an assembly gap"   There is a gap in the assembly where the defining cDNA should align.
  "suggests misassembly"   There are order/orientation issues suggested by the cDNA alignment.
  "not annotated on reference assembly"   This gene is not annotated on the reference assembly.

Please note that the double quotes are included in the text phrases shown here because they are mandatory when performing a text phrase search.
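When queries are composed in a script, the quoting rule is easy to get wrong, so a helper that always adds the double quotes may be useful. This Python sketch is illustrative; the function names are hypothetical, and wrapping the original query in parentheses is an assumption made to keep its grouping intact.

```python
def text_phrase(phrase):
    """Quote a phrase for a [Text] search; the double quotes are mandatory."""
    return f'"{phrase}" [Text]'


def without_annotation_issues(query):
    """Append the clause that excludes records with annotation information."""
    return f'({query}) NOT {text_phrase("Annotation Information")}'


print(without_annotation_issues("human[orgn]"))
print(text_phrase("spans an assembly gap"))
```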


Chromosome base position available for query and in document summary.   (May 2, 2007)

The location of a gene's annotation on a reference chromosome is now reported in the document summary. Thus if a gene is annotated on an unplaced scaffold/contig, or on a genome without chromosomes, placements will not be reported. The report includes the accession and version of the RefSeq accession for the chromosome and the position of the gene.

The report is provided only for genomes where chromosome coordinates are defined, and only for the reference assembly.

You may query by a range of chromosome base positions, subject to the limitations indicated above. Your query should include the Chromosome and either the Organism or the Taxonomy ID, and in general, you should specify a range of at least 100 kb. The results of the query will include all genes that lie either partly or completely within the range specified.

For example, the query:

  • 9606[taxid] AND 12[chromosome] AND 9100000:9200000[chrpos]

will find genes C12orf33, PZP and A2M.
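The rule that a gene is returned if it lies partly or completely within the range amounts to an interval-overlap test. A minimal Python sketch, with hypothetical coordinates and assuming inclusive start and end positions:

```python
def overlaps(gene_start, gene_end, range_start, range_end):
    """True if the gene lies partly or completely within the query range."""
    return gene_start <= range_end and gene_end >= range_start


# Hypothetical gene positions tested against the range 9100000:9200000.
print(overlaps(9_150_000, 9_160_000, 9_100_000, 9_200_000))  # completely inside
print(overlaps(9_090_000, 9_110_000, 9_100_000, 9_200_000))  # partly inside
print(overlaps(9_300_000, 9_400_000, 9_100_000, 9_200_000))  # outside
```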


Query by accession with version number.   (March 12, 2007)

A query for an accession string can now include the version number.

For example:

  • NP_000005.2 [Protein Accession]

An accession can be queried without the version number as well.
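Scripts that accept either form may want to separate the accession from its optional version before building a query. A small Python sketch, with hypothetical function names:

```python
def split_accession(accession):
    """Split 'NP_000005.2' into ('NP_000005', 2); version is None if absent."""
    cleaned = accession.strip()
    base, dot, version = cleaned.rpartition(".")
    if dot and version.isdigit():
        return base, int(version)
    return cleaned, None


def protein_accession_term(accession):
    """Format a protein accession, with or without version, as a query term."""
    return f"{accession.strip()} [Protein Accession]"


print(split_accession("NP_000005.2"))         # ('NP_000005', 2)
print(split_accession("NP_000005"))           # ('NP_000005', None)
print(protein_accession_term("NP_000005.2"))  # NP_000005.2 [Protein Accession]
```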


Addition of "has ccds" property.   (October 30, 2006)

A "has ccds" property has been added to Entrez Gene. This property identifies genes that encode a protein sequence that is a member of a Consensus CDS (CCDS).

For example, to find all human gene records that are in CCDS, use:

  • has ccds[prop] AND 9606[Taxonomy ID]

For information about CCDS, see http://www.ncbi.nlm.nih.gov/projects/CCDS/ .


Implementation of augmented RefSeq section.   (October 9, 2006)

The Reference Sequences section of Entrez Gene's full report option now has additional subsections to support the display of the position of a gene on multiple assemblies, links to the genomic sequence within that range, and the accessions of the RNA and protein sequences specific to those assemblies. RefSeqs for which annotation or sequence may be updated without requiring a complete re-annotation of the genome are now labeled as such.

The technical description of this change was announced here.


Modification to Full Report display   (September 26, 2006)

Entrez Gene's display was restructured to facilitate browsing. Among the changes you may notice are the scrolling windows for display of GeneRIFs, Interactions, and Markers.


Automatic spelling suggestions for interactive queries   (April 6, 2006)

Gene was added to the set of databases using NCBI's spelling suggestion tool (also available via E-utilities).

Try these examples:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=gene&term=eschrichia
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=gene&term=influnza
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=search&db=gene&term=myesthenia

Alternate words or spellings are suggested only if your original query term is at least five letters long.
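For E-utilities users, the sketch below builds a request URL for the spelling tool and applies the five-letter rule before doing so. It does not contact NCBI. The ESpell base URL is the commonly documented endpoint and is an assumption here, since this page gives only the interactive links; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed ESpell endpoint; check the E-utilities documentation before use.
ESPELL_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi"


def eligible_for_suggestion(term):
    """Suggestions are offered only for terms at least five letters long."""
    return len(term) >= 5


def gene_espell_url(term):
    """Return an ESpell URL that checks `term` against the gene database."""
    return ESPELL_URL + "?" + urlencode({"db": "gene", "term": term})


for term in ("influnza", "gene"):
    if eligible_for_suggestion(term):
        print(gene_espell_url(term))
    else:
        print(f"{term!r} is too short for a spelling suggestion")
```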


Revised December 9, 2009
