UniGene Build Procedure - Transcriptome Based
Clustering is the process of finding subsets of
sequences that belong together within a larger set. This is done by
converting discrete similarity scores to Boolean links between sequences.
That is, two sequences are considered linked if their similarity exceeds a
threshold. UniGene clustering proceeds in several stages, with each stage
adding less reliable data to the results of the preceding stage. This staged
clustering affords greater control than a more egalitarian treatment of all
links between sequences.
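The idea of turning scores into Boolean links and then grouping linked sequences can be sketched as follows. This is a minimal illustration, not the UniGene implementation: the function name, the score scale, and the threshold value are all assumptions, and the grouping uses a generic union-find structure.

```python
from collections import defaultdict

def cluster_by_threshold(scores, threshold):
    """Group sequence IDs into clusters from pairwise similarity scores.

    scores: dict mapping (id_a, id_b) -> similarity score (hypothetical scale).
    Two sequences are linked only if their score exceeds the threshold.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)  # registers both sequences
        if score > threshold and root_a != root_b:
            parent[root_a] = root_b  # Boolean link: merge the two groups

    clusters = defaultdict(set)
    for seq in list(parent):
        clusters[find(seq)].add(seq)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical scores: A-B and B-C exceed the threshold, C-D does not.
example = {("A", "B"): 0.97, ("B", "C"): 0.96, ("C", "D"): 0.50}
print(cluster_by_threshold(example, threshold=0.95))
```

Note that A and C end up together without a direct link between them, which is why the reliability of each link matters and why less reliable links are added only in later stages.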
The Stages
Screening for contaminants, repeats, and low-complexity
sequence is performed. Low-complexity screening is performed using NCBI's
Dust program. Mitochondrial and ribosomal sequences are screened for, as are
vector contaminants and repetitive elements. After screening, a sequence must
contain at least 100 informative bp to be a candidate for entry into UniGene.
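The 100 bp rule above can be sketched as a simple filter. The sketch assumes that screening has already replaced masked positions with N or with lowercase letters; that representation and the function names are assumptions for illustration, not something the text specifies.

```python
MIN_INFORMATIVE_BP = 100  # threshold stated in the text

def informative_length(masked_seq):
    """Count unmasked bases; masked positions are assumed to be 'N' or lowercase."""
    return sum(1 for base in masked_seq if base in "ACGT")

def is_candidate(masked_seq, min_bp=MIN_INFORMATIVE_BP):
    """Return True if enough informative sequence survives screening."""
    return informative_length(masked_seq) >= min_bp


print(is_candidate("ACGT" * 25))            # 100 unmasked bases
print(is_candidate("ACGT" * 24 + "NNNN"))   # only 96 unmasked bases
```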
Builds are either genome based or transcript based, as described here.
Gene links are found. The set of mRNA sequences is compared with itself.
Sequence pairs that are sufficiently similar are linked together to form
initial clusters.
Links between ESTs and mRNA are added to these clusters. The set of ESTs is
compared with sequences from the set of initial clusters using megaBLAST, and
sufficiently similar sequence pairs are added to the clusters. Links that
would join the initial mRNA-based clusters are discarded. EST to EST links
are also generated and used to extend the initial clusters and to generate
clusters composed solely of ESTs.
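One way to picture this stage is a union-find pass in which each initial mRNA cluster carries a seed label, and any link that would unite two differently seeded groups is rejected. The sketch below is illustrative only: the function name and data layout are assumptions, links are taken as already judged sufficiently similar, and processing them greedily in input order is a simplification.

```python
def add_est_links(initial_clusters, links):
    """Extend mRNA-based clusters with EST links.

    initial_clusters: list of sets of mRNA IDs (the initial clusters).
    links: iterable of (id_a, id_b) pairs, EST-mRNA or EST-EST.
    Returns (clusters, discarded_links).
    """
    parent = {}
    seed = {}  # root -> index of its initial mRNA cluster, or None

    def find(x):
        if x not in parent:
            parent[x] = x
            seed[x] = None
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for idx, members in enumerate(initial_clusters):
        members = sorted(members)
        if not members:
            continue
        root = find(members[0])
        seed[root] = idx
        for m in members[1:]:
            parent[find(m)] = root

    discarded = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue
        if seed[root_a] is not None and seed[root_b] is not None:
            discarded.append((a, b))  # would join two mRNA-based clusters
            continue
        parent[root_a] = root_b
        if seed[root_b] is None:
            seed[root_b] = seed[root_a]  # keep the seed label on the new root

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values()), discarded


clusters, discarded = add_est_links(
    [{"M1"}, {"M2"}],
    [("E1", "M1"), ("E1", "M2"), ("E2", "E3")],
)
print(clusters)
print(discarded)
```

In this example E1 joins the M1 cluster, its link to M2 is discarded, and E2 and E3 form a cluster composed solely of ESTs.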
Clone-based edges are added; these allow
non-overlapping 5' and 3' ESTs to be assigned to the same cluster. Because of
imperfect clone labeling, a single clone-ID based edge is insufficient to
merge two clusters. Clone IDs that link at least two 5' ends from one cluster
with at least two 3' ends from another cluster are found, and the two
clusters are merged.
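The clone-based merging rule can be read as requiring at least two independent clone IDs that each place a 5' EST in one cluster and a 3' EST in another. The sketch below follows that reading, which is an interpretation of the text rather than a documented algorithm; the tuple layout and the function name are hypothetical.

```python
from collections import defaultdict

def clone_supported_merges(ests, min_support=2):
    """Find cluster pairs that clone IDs suggest should be merged.

    ests: iterable of (clone_id, end, cluster_id), with end "5" or "3".
    Returns the set of (cluster_of_5p_reads, cluster_of_3p_reads) pairs
    supported by at least min_support distinct clone IDs.
    """
    five = defaultdict(set)   # clone_id -> clusters holding its 5' ESTs
    three = defaultdict(set)  # clone_id -> clusters holding its 3' ESTs
    for clone_id, end, cluster_id in ests:
        if end == "5":
            five[clone_id].add(cluster_id)
        elif end == "3":
            three[clone_id].add(cluster_id)

    support = defaultdict(set)  # (c5, c3) -> clone IDs backing that pair
    for clone_id, clusters_5p in five.items():
        for c5 in clusters_5p:
            for c3 in three.get(clone_id, ()):
                if c5 != c3:
                    support[(c5, c3)].add(clone_id)

    return {pair for pair, clones in support.items() if len(clones) >= min_support}


ests = [
    ("c1", "5", "A"), ("c1", "3", "B"),
    ("c2", "5", "A"), ("c2", "3", "B"),
    ("c3", "5", "A"), ("c3", "3", "C"),  # single clone: not enough support
]
print(clone_supported_merges(ests))
```

Requiring two supporting clones means that one mislabeled clone cannot merge two unrelated clusters on its own.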
Any resulting cluster that does not contain a sequence with a polyadenylation
signal or tail is discarded. The clusters that are retained are called
anchored clusters, because their 3' ends are presumed to be known.
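The anchoring filter can be sketched as below. How a polyadenylation signal or tail is actually detected is not stated in the text, so the detection heuristic here is an assumption: a run of at least 10 trailing A bases counts as a tail, and the canonical AATAAA hexamer within the last 50 bases counts as a signal. Both cutoffs and the function names are hypothetical.

```python
def has_polya_evidence(seq, min_tail=10, signal_window=50):
    """Heuristic check for a poly(A) tail or an AATAAA signal near the 3' end."""
    seq = seq.upper()
    tail_len = len(seq) - len(seq.rstrip("A"))  # length of the trailing A run
    if tail_len >= min_tail:
        return True
    return "AATAAA" in seq[-signal_window:]

def anchored_clusters(clusters):
    """Keep only clusters with at least one sequence showing poly(A) evidence.

    clusters: dict mapping cluster ID -> list of sequences (strings).
    """
    return {
        cid: seqs
        for cid, seqs in clusters.items()
        if any(has_polya_evidence(s) for s in seqs)
    }


clusters = {
    "c1": ["ACGTACGT" + "A" * 12],   # poly(A) tail
    "c2": ["ACGTACGTACGT"],          # no evidence: would be discarded
    "c3": ["GGGAATAAAGGCC"],         # AATAAA signal
}
print(sorted(anchored_clusters(clusters)))
```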
ESTs that do not belong to an anchored cluster are rechecked at a lower level
of stringency than in the preceding passes. An EST that passes this less
stringent test is then added to the cluster that contains the sequence that
is the best match to the EST; such an EST is a guest member of that cluster.
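Guest assignment amounts to a best-hit lookup under a relaxed cutoff. The following sketch assumes the hits for one unanchored EST are available as (sequence ID, score) pairs; the relaxed threshold value, the score scale, and all names are hypothetical.

```python
def assign_guest(hits, cluster_of, relaxed_threshold):
    """Pick the cluster for one unanchored EST, or None if nothing qualifies.

    hits: list of (seq_id, score) matches for the EST.
    cluster_of: dict mapping seq_id -> anchored cluster ID.
    """
    eligible = [
        (score, seq_id)
        for seq_id, score in hits
        if score >= relaxed_threshold and seq_id in cluster_of
    ]
    if not eligible:
        return None
    _best_score, best_id = max(eligible)  # highest-scoring match wins
    return cluster_of[best_id]


cluster_of = {"M1": "clusterA", "M2": "clusterB"}
hits = [("M1", 0.88), ("M2", 0.91), ("M9", 0.99)]  # M9 is not in any anchored cluster
print(assign_guest(hits, cluster_of, relaxed_threshold=0.85))
```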
Clusters of size 1 (that is, clusters that seem to identify infrequently
expressed genes) are compared against the rest of the sequences in UniGene at
a lower level of stringency and merged with the cluster containing the most
similar sequence.
The resulting clusters are compared with the preceding week's build and
renumbered in an attempt to maintain continuity. Because the sequences that
make up a cluster may change from week to week and because the cluster
identifier may disappear (typically when two clusters merge), using the
cluster identifier as a reference is ill advised. Using the GenBank accession
numbers of the sequences that make up the cluster is a safe alternative.
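A reference stored as accession numbers can be resolved against whichever build is current. The sketch below assumes a build is available as a mapping from cluster identifier to member accessions; the layout, the function name, and the identifiers and accessions shown are placeholders, not real records.

```python
def clusters_for_accessions(build, accessions):
    """Locate the clusters in one build that contain the given accessions.

    build: dict mapping cluster ID -> set of member accession numbers.
    Returns dict mapping cluster ID -> sorted list of the accessions found there.
    """
    index = {acc: cid for cid, members in build.items() for acc in members}
    found = {}
    for acc in accessions:
        cid = index.get(acc)
        if cid is not None:
            found.setdefault(cid, []).append(acc)
    return {cid: sorted(accs) for cid, accs in found.items()}


# Placeholder data: two clusters in one build merge in the next one.
previous_build = {"Hs.1": {"ACC001", "ACC002"}, "Hs.2": {"ACC003"}}
current_build = {"Hs.1": {"ACC001", "ACC002", "ACC003"}}

print(clusters_for_accessions(previous_build, ["ACC003"]))
print(clusters_for_accessions(current_build, ["ACC003"]))
```

The stored accession still resolves after the merge, whereas a stored reference to the second cluster identifier would no longer point anywhere.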