Question: How often is UniGene updated?
The time needed to update UniGene with new sequences varies. Generally,
this takes more than 1 week but less than 1 month.
Why are some UniGene clusters retired?
UniGene clusters can eventually be "retired" for various reasons, such as:
- the sequences in a cluster might be retracted by the submitters because
  they are found to have contaminants.
- two clusters can be joined, in which case one of the original cluster IDs
  would be retired.
- a cluster can be split into two or more clusters in such a way that none
  of the smaller clusters can be recognized to be "the same as" the original
  cluster.
Cluster IDs are not reused after being retired, and specific information
about why a particular cluster was retired is not available from the UniGene
Web pages.
Using a retired cluster number (Hs.######) in UniGene's search tool will
generate a page with links to the current clusters for the sequences.
UniGene clusters often have an expression such as "ESTs, highly similar to
ACTIN 1" or "weakly similar to...". How are the degrees of similarity
defined?
There are three degrees of similarity:
1. "Highly similar to" means >90% similar in the aligned region.
2. "Moderately similar to" means 70-90% similar in the aligned region.
3. "Weakly similar to" means <70% similar in the aligned region.
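The thresholds above can be sketched as a small lookup. This is only an illustration: the function name is hypothetical, and treating exactly 70% and exactly 90% as "moderately similar" is an assumption, since the text gives the middle range as 70-90% without saying which side the endpoints fall on.

```python
def similarity_label(percent_similarity: float) -> str:
    """Map percent similarity in the aligned region to a title phrase.

    Assumption: the boundary values 70 and 90 fall in the middle band.
    """
    if not 0.0 <= percent_similarity <= 100.0:
        raise ValueError("percent similarity must be between 0 and 100")
    if percent_similarity > 90.0:
        return "highly similar to"
    if percent_similarity >= 70.0:
        return "moderately similar to"
    return "weakly similar to"
```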
How are the protein similarities in the PROTSIM field of UniGene records
calculated?
For each nucleotide sequence in UniGene, a
search is made for sequence similarity to known proteins from eight organisms.
This is done using Blastx, which compares the six-frame conceptual
translation products of a nucleotide query sequence (both strands) against a
protein sequence database. Blastx performs 'in-frame' gapped alignments and
uses sum statistics to link alignments from different frames.
The peptide databases used by UniGene are those representing Homo sapiens,
Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis
elegans, Saccharomyces cerevisiae, Escherichia coli, and Arabidopsis
thaliana.
The eight protein databases exclude mitochondrial proteins and are screened
for redundancy.
The nucleotide sequence is considered to match the protein sequence if the
BLASTx E value is less than 1e-6. For each of the eight databases, the best
hit is the alignment with the lowest E value. The protein from that alignment
is used as one of the eight prot_sim features of the nucleotide sequence.
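As a rough sketch of this selection step, the code below keeps, for each protein database, one hit among those that pass the 1e-6 cutoff. The `Hit` record and function name are hypothetical, and the best hit is taken here to be the passing alignment with the lowest E value, which is an assumption of this sketch rather than a description of the UniGene pipeline.

```python
from typing import NamedTuple

E_VALUE_CUTOFF = 1e-6  # match threshold stated in the text


class Hit(NamedTuple):
    database: str   # organism database, e.g. "Homo sapiens"
    protein: str    # hypothetical protein identifier
    e_value: float  # E value reported for the alignment


def best_hits(hits: list[Hit]) -> dict[str, Hit]:
    """Return at most one hit per database: the lowest E value under the cutoff."""
    best: dict[str, Hit] = {}
    for hit in hits:
        if hit.e_value >= E_VALUE_CUTOFF:
            continue  # not considered a match
        current = best.get(hit.database)
        if current is None or hit.e_value < current.e_value:
            best[hit.database] = hit
    return best
```

With eight databases, the returned dictionary holds up to eight entries, one per organism that produced a match.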
The proteins assigned to a UniGene cluster are
chosen among the prot_sim proteins assigned to the cluster's component
sequences. The exact algorithm used to select the representative protein is
currently being revised.
Is there an easy way to get the best contig/alignment
for all the EST data in a cluster?
Although no assembly or contig is available, the longest sequence in each
cluster has been identified. Automatic assembly of EST sequences is
periodically re-evaluated. To date, we have found that accurate assemblies
require curation, and we have chosen not to create an inaccurate dataset.
There is a file on the NCBI anonymous FTP site called Hs.seq.uniq.Z. It
contains one sequence from each UniGene cluster: the sequence with the
longest region of high-quality sequence data. Where mRNA or annotated genomic
sequence is available and there are no alternative splice forms, all other
sequences in the cluster will be subsequences of this one. You can find this
file at ftp://ftp.ncbi.nih.gov/repository/UniGene/.
For example, a user interested in cluster Hs.10920 might
search the file using a text editor or the command "grep" on UNIX for the
string "Hs.10920". This procedure would identify the section of the file that
contains the best sequence.
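For readers who prefer a script to a text editor, the lookup can be sketched as follows. This assumes the file has already been uncompressed and is FASTA-like, with each record starting at a ">" header line that mentions the cluster ID; the function name is hypothetical. Note that a plain substring search for "Hs.10920" would also match longer IDs such as "Hs.109201", so the sketch rejects matches followed by another digit.

```python
import re
from typing import Iterable, Optional


def find_cluster_record(lines: Iterable[str], cluster_id: str) -> Optional[str]:
    """Return the first record whose header mentions cluster_id, or None."""
    # Do not accept a longer ID that merely starts with the requested one.
    pattern = re.compile(re.escape(cluster_id) + r"(?![0-9])")
    record: list[str] = []
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if keep:
                break  # the next record begins; the wanted one is complete
            keep = bool(pattern.search(line))
            record = [line] if keep else []
        elif keep:
            record.append(line)
    return "\n".join(record) if keep else None
```

A call might look like `find_cluster_record(open("Hs.seq.uniq"), "Hs.10920")`, where the local file name is whatever the uncompressed download was saved as.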
Can you give me more details on the
construction of a particular library, such as the tissue source, cloning
strategy, or vector used?
Information concerning library origin is only provided by the submitter of
the sequences, and NCBI does not have any additional information other than
what is shown in the record. You can contact the submitter using the contact
information found in the dbEST records or GenBank records for ESTs derived
from the library of interest.
How are the cluster titles assigned?
There are several possible sources for the title, in order of preference:
- For human, LocusLink title. MGD data are used in an analogous way for mouse.
- name of product
- defline of mRNA record
- defline of genomic record
- ESTs, similar to something
- ESTs
The mRNA or genomic sequence chosen is arbitrary from the end-user's
perspective; there is no easy way for the user to look at the sequences in a
cluster and reproduce the algorithm that chooses which one gives the title.
How do you define a polyadenylation signal in
the UniGene sequences?
Some sequences have an obvious polyA tail. However, we allow only a finite
amount of sequence beyond the end of the polyA tail. For this reason, tails
on low-quality reads are sometimes missed. Other sequences are also
considered as ending with a polyA tail if there is a polyA signal. The signal
is a sequence of ATTAAA or AATAAA that is 10-35 nucleotides from either the
polyA tail or sequence end. All sequences are searched in both orientations.
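The signal rule can be illustrated with a short sketch. Several details are assumptions made only for this example: a tail is taken to be a run of at least 8 A's at the very end of the sequence, and the 10-35 nucleotide distance is measured from the end of the hexamer to the start of the tail (or to the sequence end when there is no tail). The actual UniGene criteria, including how much sequence is tolerated beyond the tail, are not specified here.

```python
import re

SIGNALS = ("AATAAA", "ATTAAA")
MIN_GAP, MAX_GAP = 10, 35  # distance range stated in the text
MIN_TAIL = 8               # assumed minimum run of A's counted as a tail

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement an upper-case DNA string (other symbols pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def _signal_on_strand(seq: str) -> bool:
    # Anchor at the start of a trailing A run if present, else at the sequence end.
    tail = re.search(r"A{%d,}$" % MIN_TAIL, seq)
    anchor = tail.start() if tail else len(seq)
    for signal in SIGNALS:
        pos = seq.find(signal)
        while pos != -1:
            gap = anchor - (pos + len(signal))
            if MIN_GAP <= gap <= MAX_GAP:
                return True
            pos = seq.find(signal, pos + 1)
    return False


def has_polya_signal(seq: str) -> bool:
    """Check both orientations, as the text describes."""
    seq = seq.upper()
    return _signal_on_strand(seq) or _signal_on_strand(reverse_complement(seq))
```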
How can I link my page to UniGene?
Several possibilities exist for making links to UniGene's pages.
Each requires that the organism of interest be specified, using the
following abbreviations:
| Aae | Aedes aegypti |
| Aga | Anopheles gambiae |
| Ame | Apis mellifera |
| Afp | Aquilegia formosa x Aquilegia pubescens |
| At | Arabidopsis thaliana |
| Bmo | Bombyx mori |
| Bt | Bos taurus |
| Bfl | Branchiostoma floridae |
| Bna | Brassica napus |
| Cel | Caenorhabditis elegans |
| Cfa | Canis familiaris |
| Cre | Chlamydomonas reinhardtii |
| Cin | Ciona intestinalis |
| Csa | Ciona savignyi |
| Csi | Citrus sinensis |
| Cpo | Coccidioides posadasii |
| Dr | Danio rerio |
| Ddi | Dictyostelium discoideum |
| Dm | Drosophila melanogaster |
| Fne | Filobasidiella neoformans |
| Fhe | Fundulus heteroclitus |
| Gga | Gallus gallus |
| Gac | Gasterosteus aculeatus |
| Gmo | Gibberella moniliformis |
| Gma | Glycine max |
| Ghi | Gossypium hirsutum |
| Gra | Gossypium raimondii |
| Han | Helianthus annuus |
| Hs | Homo sapiens |
| Hv | Hordeum vulgare |
| Hma | Hydra magnipapillata |
| Lsa | Lactuca sativa |
| Lco | Lotus corniculatus |
| Les | Lycopersicon esculentum |
| Mfa | Macaca fascicularis |
| Mmu | Macaca mulatta |
| Mgr | Magnaporthe grisea |
| Mdo | Malus x domestica |
| Mtr | Medicago truncatula |
| Mte | Molgula tectiformis |
| Mm | Mus musculus |
| Ncr | Neurospora crassa |
| Omy | Oncorhynchus mykiss |
| Ocu | Oryctolagus cuniculus |
| Os | Oryza sativa |
| Ola | Oryzias latipes |
| Oar | Ovis aries |
| Ppa | Physcomitrella patens |
| Pin | Phytophthora infestans |
| Pgl | Picea glauca |
| Psi | Picea sitchensis |
| Ppr | Pimephales promelas |
| Pta | Pinus taeda |
| Pba | Populus balsamifera |
| Ptp | Populus tremula x Populus tremuloides |
| Rn | Rattus norvegicus |
| Sof | Saccharum officinarum |
| Ssa | Salmo salar |
| Sja | Schistosoma japonicum |
| Sma | Schistosoma mansoni |
| Stu | Solanum tuberosum |
| Sbi | Sorghum bicolor |
| Spu | Strongylocentrotus purpuratus |
| Ssc | Sus scrofa |
| Tru | Takifugu rubripes |
| Tgo | Toxoplasma gondii |
| Tca | Tribolium castaneum |
| Ta | Triticum aestivum |
| Vvi | Vitis vinifera |
| Xl | Xenopus laevis |
| Str | Xenopus tropicalis |
| Zm | Zea mays |
Creating a link to a specific UniGene cluster ID requires that the cluster
ID number be specified using CID=, such as in the following format:
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Hs&CID=3686
A link can also be made using the GenBank accession of a member sequence
or the gi of that sequence in the following formats:
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ACC=R14038
http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?GI=767114
Creating a link to a specific UniGene sequence ID requires that the
UniGene sequence ID be specified using SID=, such as in the following format:
http://www.ncbi.nlm.nih.gov/UniGene/seq.cgi?ORG=Hs&SID=1000
Creating a link to a specific dbEST library ID requires the dbEST library
ID be specified using LID=, such as in the following format:
http://www.ncbi.nlm.nih.gov/UniGene/library.cgi?ORG=Hs&LID=1460
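When many links need to be generated, the URL formats above can be assembled programmatically. The helper names below are hypothetical; only the script names (clust.cgi, seq.cgi, library.cgi) and the parameters ORG, CID, ACC, GI, SID, and LID come from the formats shown above.

```python
from urllib.parse import urlencode

BASE = "http://www.ncbi.nlm.nih.gov/UniGene"


def cluster_url(org: str, cid: int) -> str:
    """Link to a cluster by organism abbreviation and cluster ID number."""
    return f"{BASE}/clust.cgi?" + urlencode({"ORG": org, "CID": cid})


def cluster_url_by_accession(acc: str) -> str:
    """Link to a cluster through the GenBank accession of a member sequence."""
    return f"{BASE}/clust.cgi?" + urlencode({"ACC": acc})


def cluster_url_by_gi(gi: int) -> str:
    """Link to a cluster through the gi of a member sequence."""
    return f"{BASE}/clust.cgi?" + urlencode({"GI": gi})


def sequence_url(org: str, sid: int) -> str:
    """Link to a UniGene sequence ID."""
    return f"{BASE}/seq.cgi?" + urlencode({"ORG": org, "SID": sid})


def library_url(org: str, lid: int) -> str:
    """Link to a dbEST library ID."""
    return f"{BASE}/library.cgi?" + urlencode({"ORG": org, "LID": lid})
```

Using `urlencode` rather than string concatenation keeps the query string correctly escaped if an identifier ever contains unusual characters.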