1. How often is UniGene updated?
2. Why are some UniGene clusters retired?
3. UniGene clusters often have an expression such as "ESTs, highly similar to ACTIN 1" or "weakly similar to..." How are the degrees of similarity defined?
4. How are the protein similarities in the PROTSIM field of UniGene records calculated?
5. Is there an easy way to get the best contig/alignment for all the EST data in a cluster?
6. Can you give me more details on the construction of a particular library, such as the tissue source, cloning strategy, or vector used?
7. How are the cluster titles assigned?
8. How does UniGene calculate links to Entrez Gene?
9. I have seen cases where multiple genes, as identified in Entrez Gene, are linked to a UniGene cluster. Can you explain why this happens?
10. When there are multiple links to Entrez Gene, how does UniGene pick one gene when naming the title?
11. How do you define a polyadenylation signal in the UniGene sequences?
12. How can I link my page to UniGene?
The time needed to update UniGene with new sequences varies. Generally, this takes more than 1 week but less than 1 month.
UniGene clusters can eventually be "retired" for various reasons, such as:
Cluster IDs are not reused after being retired, and specific information about why a particular cluster was retired is not available from the UniGene Web pages.
Using a retired cluster number (Hs.######) in UniGene's search tool will generate a page with links to the current clusters for the sequences.
There are three degrees of similarity:
1. "Highly similar to" means >90% similar in the aligned region.
2. "Moderately similar to" means 70-90% similar in the aligned region.
3. "Weakly similar to" means <70% similar in the aligned region.
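As an illustration only, the three thresholds can be written as a small Python function. The function name is hypothetical, and treating exactly 70% and exactly 90% as "moderately similar" is an assumption read from the "70-90%" range above, not a statement of how UniGene handles the boundaries.

```python
def similarity_label(percent_similarity: float) -> str:
    """Map percent similarity in the aligned region to a title phrase."""
    if not 0.0 <= percent_similarity <= 100.0:
        raise ValueError("percent similarity must be between 0 and 100")
    if percent_similarity > 90.0:
        return "highly similar to"
    # Assumption: both ends of the 70-90% range count as "moderately".
    if percent_similarity >= 70.0:
        return "moderately similar to"
    return "weakly similar to"
```

For example, `similarity_label(82.5)` returns "moderately similar to".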
Each nucleotide sequence in UniGene is searched with BLASTX for similarity to known proteins from eight organisms. BLASTX compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. BLASTX performs 'in-frame' gapped alignments and uses sum statistics to link alignments from different frames.
The peptide databases used by UniGene are those representing Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Escherichia coli, and Arabidopsis thaliana. The eight protein databases exclude mitochondrial proteins and are screened for redundancy.
The nucleotide sequence is considered to match the protein sequence if the BLASTX E value is less than 1e-6. For each of the eight databases, the best hit is the alignment with the lowest E value. This protein is used as one of the eight prot_sim features of the nucleotide sequence.
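A minimal Python sketch of this selection step follows. It assumes the BLASTX results have already been parsed into simple records; the `Hit` type, its field names, and the function name are hypothetical, and the best hit is taken to be the qualifying alignment with the lowest E value in each database.

```python
from typing import Dict, Iterable, NamedTuple

E_VALUE_CUTOFF = 1e-6  # match threshold stated in the text


class Hit(NamedTuple):
    database: str   # hypothetical label for one of the eight protein databases
    protein: str    # hypothetical protein identifier
    e_value: float


def best_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep, per database, the lowest-E-value hit that is below the cutoff."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.e_value >= E_VALUE_CUTOFF:
            continue  # not considered a match
        current = best.get(hit.database)
        if current is None or hit.e_value < current.e_value:
            best[hit.database] = hit
    return best
```

A database with no hit below the cutoff simply has no entry in the result, so a sequence can carry fewer than eight prot_sim features.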
The proteins assigned to a UniGene cluster are chosen among the prot_sim proteins assigned to the cluster's component sequences. The exact algorithm used to select the representative protein is currently being revised.
Although no assembly or contig is available, the longest sequence in each cluster has been identified. Automatic assembly of EST sequences is periodically re-evaluated. To date, we have found that accurate assemblies require curation, and we have chosen not to create an inaccurate dataset.
There is a file on the NCBI anonymous FTP site called Hs.seq.uniq.Z. It contains one sequence from each UniGene cluster: the one with the longest region of high-quality sequence data.
Where mRNA or annotated genomic sequence is available and there are no alternative splice forms, all other sequences in the cluster will be subsequences of this one. You can find this file at ftp://ftp.ncbi.nih.gov/repository/UniGene/.
For example, a user interested in cluster Hs.10920 might search the file using a text editor or the command "grep" on UNIX for the string "Hs.10920". This procedure would identify the section of the file that contains the best sequence.
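The same lookup can be sketched in Python. This is an illustration under assumptions: the file is taken to have been decompressed to plain text first, and each record is assumed to be FASTA-style, starting with a ">" header line that mentions the cluster ID. The function name is hypothetical.

```python
import re
from typing import Iterable, Optional


def find_cluster_record(lines: Iterable[str], cluster_id: str) -> Optional[str]:
    """Return the first record whose header mentions cluster_id, or None."""
    # Do not match longer IDs that share a prefix (Hs.109201 for Hs.10920).
    pattern = re.compile(re.escape(cluster_id) + r"(?!\d)")
    record = []
    keep = False
    for line in lines:
        if line.startswith(">"):
            if keep:
                break  # the next record begins, so the wanted one is complete
            keep = bool(pattern.search(line))
        if keep:
            record.append(line.rstrip("\n"))
    return "\n".join(record) if record else None
```

With a decompressed copy of the file, usage might look like `with open("Hs.seq.uniq") as fh: print(find_cluster_record(fh, "Hs.10920"))`. Unlike a plain `grep`, this returns the sequence lines together with the header.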
Information concerning library origin is only provided by the submitter of the sequences, and NCBI does not have any additional information other than what is shown in the record. You can contact the submitter using the contact information found in the dbEST records or GenBank records for ESTs derived from the library of interest.
There are several possible sources for the title, in order of preference:
The mRNA or genomic sequence chosen is arbitrary from the end-user's perspective; there is no easy way for the user to look at the sequences in a cluster and reproduce the algorithm that chooses which one gives the title.
UniGene calculates links to Entrez Nucleotide based on transcripts in each cluster. Entrez Gene calculates links to Entrez Nucleotide based on transcript and genomic sequences that are assigned to each Gene. UniGene to Entrez Gene links are calculated based on common nucleotide sequences shared by records in UniGene and Entrez Gene.
This can happen for several different reasons. When UniGene generates clusters based on alignment to the sequence of a genome (genome-based build), transcripts identified in Entrez Gene as being in different genes may all align to the same location (share intron-exon boundaries with the annotated gene). These co-placement data are used by RefSeq staff to review the curated GeneID/transcript relationship. Multiple GeneIDs can also be associated with a UniGene cluster when UniGene uses a transcript-based build and some transcript sequences in the UniGene cluster are re-assigned to a different gene by Entrez Gene after the data freeze for the build. See also 8. How does UniGene calculate links to Entrez Gene?
When sequences from multiple genes, as identified by Entrez Gene, contribute to a UniGene cluster, one gene is selected for assigning the name to the cluster. This selection is based on the number of contributing sequences from each gene, as well as the strength of evidence for the gene (e.g. publications).
Some sequences have an obvious polyA tail. However, we allow only a finite amount of sequence beyond the end of the polyA tail. For this reason, tails on low-quality reads are sometimes missed. Other sequences are also considered as ending with a polyA tail if there is a polyA signal. The signal is a sequence of ATTAAA or AATAAA that is 10-35 nucleotides from either the polyA tail or sequence end. All sequences are searched in both orientations.
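The signal rule can be sketched in Python as follows. This is not the UniGene implementation. It covers only the signal test, not the allowance for sequence beyond the tail, and it rests on two assumptions that the text does not specify: the 10-35 nucleotide distance is measured from the end of the hexamer to the start of the tail (or to the sequence end), and a terminal run of at least 10 A's counts as a tail. All names are hypothetical.

```python
import re

SIGNALS = ("AATAAA", "ATTAAA")
MIN_GAP, MAX_GAP = 10, 35
MIN_TAIL = 10  # assumption: shortest terminal run of A's treated as a tail


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_polya_signal(seq: str) -> bool:
    """Check one orientation for a signal 10-35 nt before the tail or end."""
    seq = seq.upper()
    anchors = {len(seq)}  # the sequence end
    tail = re.search(r"A{%d,}$" % MIN_TAIL, seq)
    if tail:
        anchors.add(tail.start())  # the start of the polyA tail
    for signal in SIGNALS:
        pos = seq.find(signal)
        while pos != -1:
            signal_end = pos + len(signal)
            if any(MIN_GAP <= a - signal_end <= MAX_GAP for a in anchors):
                return True
            pos = seq.find(signal, pos + 1)
    return False


def has_polya_signal_either_strand(seq: str) -> bool:
    """Search both orientations, as described in the text."""
    return has_polya_signal(seq) or has_polya_signal(reverse_complement(seq))
```

Searching the reverse complement with the same function is equivalent to looking for TTTATT or TTTAAT near the start of the original sequence.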
Several possibilities exist for making links to UniGene's pages. Each requires that the organism of interest be specified using the organism abbreviation, which is the part of the cluster ID preceding the numerical part. For example, the organism abbreviation for human UniGene cluster Hs.374043 is "Hs".
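A short Python sketch of splitting a cluster ID into its two parts is given here for anyone generating links programmatically. The type and function names are hypothetical, and the pattern (letters, a period, then digits) is an assumption based on IDs such as Hs.374043.

```python
import re
from typing import NamedTuple


class ClusterId(NamedTuple):
    organism: str  # organism abbreviation, e.g. "Hs"
    number: int    # numerical cluster id


def parse_cluster_id(cluster_id: str) -> ClusterId:
    """Split an ID such as 'Hs.374043' into abbreviation and number."""
    match = re.fullmatch(r"([A-Za-z]+)\.(\d+)", cluster_id)
    if match is None:
        raise ValueError(f"not a UniGene cluster ID: {cluster_id!r}")
    return ClusterId(match.group(1), int(match.group(2)))
```

Here `parse_cluster_id("Hs.374043")` yields the organism abbreviation "Hs" and the number 374043.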
Creating a link to a specific UniGene cluster ID requires that the cluster ID number be specified using CID=, such as in the following format:
A link can also be made using the GenBank accession of a member sequence or the gi of that sequence in the following formats:
Creating a link to a specific UniGene sequence ID requires that the UniGene sequence ID be specified using SID=, such as in the following format:
Creating a link to a specific dbEST library ID requires the dbEST library ID be specified using LID=, such as in the following format: