  Frequently Asked Questions

How often is UniGene updated?

The time needed to update UniGene with new sequences varies. Generally, this takes more than 1 week but less than 1 month.

Why are some UniGene clusters retired?

UniGene clusters can eventually be "retired" for various reasons, such as:

  • the sequences in a cluster might be retracted by the submitters because they are found to have contaminants.
  • two clusters can be joined, in which case one of the original cluster IDs would be retired.
  • a cluster can be split into two or more clusters in such a way that none of the smaller clusters can be recognized to be "the same as" the original cluster.

Cluster IDs are not reused after being retired, and specific information about why a particular cluster was retired is not available from the UniGene Web pages.

Searching for a retired cluster number (Hs.######) in UniGene's search tool returns a page with links to the current clusters that contain its sequences.

UniGene clusters often have an expression such as "ESTs, highly similar to ACTIN 1" or "weakly similar to..." How are the degrees of similarity defined?

   There are three degrees of similarity:

   1. "Highly similar to" means >90% similar in the aligned region.

   2. "Moderately similar to" means 70-90% similar in the aligned region.

   3. "Weakly similar to" means <70% similar in the aligned region.
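
The three thresholds can be expressed as a small lookup. The following is a minimal sketch, not UniGene's own code: the function name `similarity_label` is hypothetical, and it assumes that a value of exactly 90% or exactly 70% falls in the "moderately similar" range.

```python
def similarity_label(percent: float) -> str:
    """Map a percent similarity in the aligned region to a title phrase.

    Assumption: exactly 90% and exactly 70% count as "moderately similar",
    following the ">90%", "70-90%", "<70%" wording of the text.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent > 90.0:
        return "highly similar to"
    if percent >= 70.0:
        return "moderately similar to"
    return "weakly similar to"
```

For example, `similarity_label(95.0)` gives the "highly similar to" phrase, while `similarity_label(65.0)` gives "weakly similar to".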

How are the protein similarities in the PROTSIM field of UniGene records calculated?

For each nucleotide sequence in UniGene, a search is made for sequence similarity to known proteins from eight organisms. This is done using BLASTX, which compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. BLASTX produces 'in-frame' gapped alignments and uses sum statistics to link alignments from different frames.

The peptide databases used by UniGene are those representing Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Escherichia coli, and Arabidopsis thaliana. The eight protein databases exclude mitochondrial proteins and are screened for redundancy.

The nucleotide sequence is considered to match the protein sequence if the BLASTX E value is less than 1e-6. For each of the eight databases, the best hit is the alignment with the lowest E value. The protein in that alignment is used as one of the eight prot_sim features of the nucleotide sequence.
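
The selection step can be illustrated with a short sketch. It assumes that the BLASTX hits have already been parsed into simple records and that the best hit per database is the one with the lowest E value. The `Hit` type, its field names, and the `best_hits` function are hypothetical and are not part of any UniGene or BLAST interface.

```python
from typing import Dict, Iterable, NamedTuple

E_VALUE_CUTOFF = 1e-6  # match threshold stated in the text


class Hit(NamedTuple):
    database: str   # hypothetical label for one of the eight protein databases
    protein: str    # hypothetical protein identifier
    e_value: float  # E value reported for the alignment


def best_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Return, per database, the qualifying hit with the lowest E value.

    Assumption: "best" means lowest E value among hits below the cutoff.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.e_value >= E_VALUE_CUTOFF:
            continue  # not considered a match
        current = best.get(hit.database)
        if current is None or hit.e_value < current.e_value:
            best[hit.database] = hit
    return best
```

A database with no hit below the cutoff is simply absent from the result, which corresponds to a sequence having fewer than eight prot_sim features.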

The proteins assigned to a UniGene cluster are chosen among the prot_sim proteins assigned to the cluster's component sequences. The exact algorithm used to select the representative protein is currently being revised.

Is there an easy way to get the best contig/alignment for all the EST data in a cluster?

Although no assembly or contig is available, the longest sequence in each cluster has been identified. Automatic assembly of EST sequences is periodically re-evaluated. To date, we have found that accurate assemblies require curation, and we have chosen not to create an inaccurate dataset.

There is a file on the NCBI anonymous FTP site called Hs.seq.uniq.Z. It contains one sequence from each UniGene cluster: the one with the longest region of high-quality sequence data. Where mRNA or annotated genomic sequence is available and there are no alternative splice forms, all other sequences in the cluster will be subsequences of this one. You can find this file at ftp://ftp.ncbi.nih.gov/repository/UniGene/.

For example, a user interested in cluster Hs.10920 might search the file using a text editor or the command "grep" on UNIX for the string "Hs.10920". This procedure would identify the section of the file that contains the best sequence.
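
The same search can be scripted. The sketch below assumes the file has been uncompressed to plain text and is FASTA-like, with each record starting at a ">" line that mentions the cluster ID (for example as "/ug=Hs.10920"); that layout is an assumption and should be checked against the actual file. The function name `find_cluster_records` is hypothetical. Unlike a plain substring search, the pattern avoids matching longer IDs such as Hs.109201.

```python
import re
from typing import Iterable, Iterator, List, Optional, Tuple


def find_cluster_records(lines: Iterable[str], cluster_id: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for records whose header names cluster_id.

    Assumption: records are FASTA-like and the ">" header line contains
    the cluster ID as a separate token.
    """
    # Reject matches that are part of a longer identifier (Hs.10920 vs Hs.109201).
    pattern = re.compile(r"(?<![A-Za-z0-9])" + re.escape(cluster_id) + r"(?![0-9])")
    header: Optional[str] = None
    parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if header is not None and pattern.search(header):
                yield header, "".join(parts)
            header, parts = line, []
        elif header is not None:
            parts.append(line.strip())
    # Emit the final record if it matches.
    if header is not None and pattern.search(header):
        yield header, "".join(parts)


# Usage with a local, uncompressed copy (hypothetical file name):
#     with open("Hs.seq.uniq") as handle:
#         for header, sequence in find_cluster_records(handle, "Hs.10920"):
#             print(header, len(sequence))
```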

Can you give me more details on the construction of a particular library, such as the tissue source, cloning strategy, or vector used?

Information concerning library origin is only provided by the submitter of the sequences, and NCBI does not have any additional information other than what is shown in the record. You can contact the submitter using the contact information found in the dbEST records or GenBank records for ESTs derived from the library of interest.

How are the cluster titles assigned?

There are several possible sources for the title, in order of preference:

  • For human, the LocusLink title (MGD data are used in an analogous way for mouse)
  • name of product
  • defline of mRNA record
  • defline of genomic record
  • ESTs, similar to something
  • ESTs


The mRNA or genomic sequence chosen is arbitrary from the end-user's perspective; there is no easy way for the user to look at the sequences in a cluster and reproduce the algorithm that chooses which one gives the title.

How do you define a polyadenylation signal in the UniGene sequences?

Some sequences have an obvious polyA tail. However, we allow only a finite amount of sequence beyond the end of the polyA tail. For this reason, tails on low-quality reads are sometimes missed. Other sequences are also considered to end with a polyA tail if they contain a polyA signal. The signal is the sequence ATTAAA or AATAAA located 10-35 nucleotides from either the polyA tail or the sequence end. All sequences are searched in both orientations.
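
The signal rule can be sketched as follows. This is an illustration only, not the UniGene implementation: it covers just the "sequence end" case (no polyA tail detection or trimming), and it assumes the 10-35 nucleotide distance is measured from the last base of the hexamer to the end of the sequence. All function names are hypothetical.

```python
SIGNALS = ("AATAAA", "ATTAAA")  # hexamers named in the text
MIN_DIST, MAX_DIST = 10, 35     # allowed distance range from the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def has_signal_near_end(seq: str) -> bool:
    """True if a signal hexamer ends 10-35 nt before the end of seq.

    Assumption: distance is counted from the hexamer's last base
    to the last base of the sequence.
    """
    seq = seq.upper()
    for signal in SIGNALS:
        start = seq.find(signal)
        while start != -1:
            distance = len(seq) - (start + len(signal))
            if MIN_DIST <= distance <= MAX_DIST:
                return True
            start = seq.find(signal, start + 1)
    return False


def has_polya_signal(seq: str) -> bool:
    """Check the sequence in both orientations."""
    return has_signal_near_end(seq) or has_signal_near_end(reverse_complement(seq))
```

Checking the reverse complement covers reads that were sequenced from the opposite strand, where the signal would otherwise appear as TTTATT or TTTAAT near the start of the read.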

How can I link my page to UniGene?

Several possibilities exist for making links to UniGene's pages.

Each requires the organism of interest be specified, using the following abbreviations:

Aae Aedes aegypti
Aga Anopheles gambiae
Ame Apis mellifera
Afp Aquilegia formosa x Aquilegia pubescens
At Arabidopsis thaliana
Bmo Bombyx mori
Bt Bos taurus
Bfl Branchiostoma floridae
Bna Brassica napus
Cel Caenorhabditis elegans
Cfa Canis familiaris
Cre Chlamydomonas reinhardtii
Cin Ciona intestinalis
Csa Ciona savignyi
Csi Citrus sinensis
Cpo Coccidioides posadasii
Dr Danio rerio
Ddi Dictyostelium discoideum
Dm Drosophila melanogaster
Fne Filobasidiella neoformans
Fhe Fundulus heteroclitus
Gga Gallus gallus
Gac Gasterosteus aculeatus
Gmo Gibberella moniliformis
Gma Glycine max
Ghi Gossypium hirsutum
Gra Gossypium raimondii
Han Helianthus annuus
Hs Homo sapiens
Hv Hordeum vulgare
Hma Hydra magnipapillata
Lsa Lactuca sativa
Lco Lotus corniculatus
Les Lycopersicon esculentum
Mfa Macaca fascicularis
Mmu Macaca mulatta
Mgr Magnaporthe grisea
Mdo Malus x domestica
Mtr Medicago truncatula
Mte Molgula tectiformis
Mm Mus musculus
Ncr Neurospora crassa
Omy Oncorhynchus mykiss
Ocu Oryctolagus cuniculus
Os Oryza sativa
Ola Oryzias latipes
Oar Ovis aries
Ppa Physcomitrella patens
Pin Phytophthora infestans
Pgl Picea glauca
Psi Picea sitchensis
Ppr Pimephales promelas
Pta Pinus taeda
Pba Populus balsamifera
Ptp Populus tremula x Populus tremuloides
Rn Rattus norvegicus
Sof Saccharum officinarum
Ssa Salmo salar
Sja Schistosoma japonicum
Sma Schistosoma mansoni
Stu Solanum tuberosum
Sbi Sorghum bicolor
Spu Strongylocentrotus purpuratus
Ssc Sus scrofa
Tru Takifugu rubripes
Tgo Toxoplasma gondii
Tca Tribolium castaneum
Ta Triticum aestivum
Vvi Vitis vinifera
Xl Xenopus laevis
Str Xenopus tropicalis
Zm Zea mays


Creating a link to a specific UniGene cluster ID requires that the cluster ID number be specified using CID=, such as in the following format:
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=3686

A link can also be made using the GenBank accession of a member sequence or the gi of that sequence in the following formats:
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ACC=R14038
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?GI=767114

Creating a link to a specific UniGene sequence ID requires that the UniGene sequence ID be specified using SID=, such as in the following format:
http://www.ncbi.nlm.nih.gov/UniGene/seq.cgi?ORG=Hs&SID=1000

Creating a link to a specific dbEST library ID requires the dbEST library ID be specified using LID=, such as in the following format:
http://www.ncbi.nlm.nih.gov/UniGene/library.cgi?ORG=Hs&LID=1460
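
For pages that generate many links, the formats above can be wrapped in small helpers. The sketch below only reproduces the URL patterns shown in this FAQ; the function names are hypothetical, and it does not check that the organism abbreviation or the IDs are valid.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.ncbi.nlm.nih.gov/UniGene"  # base of the URLs shown above


def cluster_url(org: str, cid: int) -> str:
    """Link to a cluster by organism abbreviation and cluster ID number."""
    return f"{BASE_URL}/clust.cgi?{urlencode({'ORG': org, 'CID': cid})}"


def cluster_url_by_accession(accession: str) -> str:
    """Link to a cluster by the GenBank accession of a member sequence."""
    return f"{BASE_URL}/clust.cgi?{urlencode({'ACC': accession})}"


def cluster_url_by_gi(gi: int) -> str:
    """Link to a cluster by the gi of a member sequence."""
    return f"{BASE_URL}/clust.cgi?{urlencode({'GI': gi})}"


def sequence_url(org: str, sid: int) -> str:
    """Link to a UniGene sequence by organism abbreviation and sequence ID."""
    return f"{BASE_URL}/seq.cgi?{urlencode({'ORG': org, 'SID': sid})}"


def library_url(org: str, lid: int) -> str:
    """Link to a dbEST library by organism abbreviation and library ID."""
    return f"{BASE_URL}/library.cgi?{urlencode({'ORG': org, 'LID': lid})}"
```

For example, `cluster_url("Hs", 3686)` reproduces the first link format shown above.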

 



 

 

 


 
