The
availability of genomic sequence is helpful for identifying sets of
transcript sequences that correspond to distinct transcription loci or to
annotated genes, which is the goal of UniGene. The procedure used for
genome-based clustering of transcript sequences is described here.
Several types of evidence are used to identify the boundaries of a
transcription locus and to identify which transcripts represent the locus. Although
determining gross structure and transcript representation of many genes is
not sensitive to the details of transcript mapping, there are cases where the
details are important: overlapping genes on opposite strands, or genes
located within introns of other genes, for example. To accurately resolve
these and similar cases, we identify genes by incorporating evidence in order
of confidence, beginning with the strongest data.
The annotation of characterized genes on the genomic sequence is
recorded. Annotated genes include those supported by experimentally confirmed
RefSeqs as well as transcription loci that are predicted to encode a protein
and are represented by a gene model. These annotated exon boundaries, and the
association of exons with genes, form a skeleton that subsequent analysis can
extend but cannot contradict.
Transcribed sequences that can be stringently aligned to genomic sequence
with a requirement of splice site consensus are used to enumerate additional
exon-intron boundaries. Not all sequence alignments satisfy this stringent
requirement. Any sequences sharing an exon-intron boundary that can be
identified with only one gene are grouped together.
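
The grouping rule for spliced sequences can be sketched as follows. This is a minimal illustration, not the UniGene implementation: the function name, the (chromosome, strand, donor, acceptor) junction key, and the choice to leave a transcript unassigned when its junctions point to more than one gene are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_unique_junction(junction_genes, transcript_junctions):
    """Group transcripts by gene, using only junctions owned by exactly one gene.

    junction_genes: dict mapping a junction key to the set of genes that use it
                    (hypothetical structure, built from the annotation skeleton).
    transcript_junctions: dict mapping a transcript id to its list of junction keys.
    """
    groups = defaultdict(set)
    for transcript_id, junctions in transcript_junctions.items():
        genes = set()
        for junction in junctions:
            owners = junction_genes.get(junction, set())
            # A junction shared by several genes carries no grouping evidence.
            if len(owners) == 1:
                genes |= owners
        # Assumption: conflicting evidence (two different genes) leaves the
        # transcript unassigned at this step.
        if len(genes) == 1:
            groups[next(iter(genes))].add(transcript_id)
    return dict(groups)
```

A transcript whose only junctions are shared between genes stays ungrouped here and would fall through to the later, weaker evidence steps.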
Unspliced sequences, as well as sequences for which the splicing location or
orientation is uncertain, are associated with an overlapping exon if one
exists, or placed against the genome if not. Where the gene orientation is
potentially ambiguous, sequence orientation is used to resolve it.
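
A sketch of the exon-overlap test for unspliced sequences is shown below. It assumes half-open coordinates on a single chromosome and a strand value of None when the sequence orientation is unknown; the function name and tuple layouts are hypothetical.

```python
def overlapping_genes(alignment, exons):
    """Return the sorted gene names whose exons overlap the alignment.

    alignment: (start, end, strand), with strand "+", "-", or None if unknown.
    exons: iterable of (start, end, strand, gene) tuples.
    An empty result means the sequence is placed against the genome on its own.
    """
    start, end, strand = alignment
    hits = set()
    for exon_start, exon_end, exon_strand, gene in exons:
        overlaps = start < exon_end and exon_start < end
        # A known orientation restricts candidates to exons on the same strand.
        strand_ok = strand is None or strand == exon_strand
        if overlaps and strand_ok:
            hits.add(gene)
    return sorted(hits)
```

With exons from two genes overlapping on opposite strands, a sequence of known orientation resolves to one gene, while a sequence of unknown orientation returns both candidates.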
Sequences that do not align to genomic sequence are grouped together, as are
transcribed sequences that share a clone of origin and lie within an interval
smaller than 3000 nt.
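
The clone-of-origin rule might look like the sketch below. It reads the 3000 nt limit as the total genomic span covered by all reads from one clone on one chromosome, which is one interpretation of the text; the record layout and function name are assumptions for the example.

```python
from collections import defaultdict

MAX_SPAN = 3000  # nt, the limit stated in the text


def group_by_clone(alignments, max_span=MAX_SPAN):
    """Group sequences from the same clone when they span less than max_span.

    alignments: iterable of (seq_id, clone, chrom, start, end) records.
    Returns a sorted list of groups, each a sorted list of sequence ids.
    """
    by_clone = defaultdict(list)
    for seq_id, clone, chrom, start, end in alignments:
        by_clone[(clone, chrom)].append((start, end, seq_id))

    groups = []
    for reads in by_clone.values():
        if len(reads) < 2:
            continue  # a single read gives no clone-based link
        span = max(end for _, end, _ in reads) - min(start for start, _, _ in reads)
        if span < max_span:
            groups.append(sorted(seq_id for _, _, seq_id in reads))
    return sorted(groups)
```

For example, a 5' read and a 3' read from the same clone that together cover 2600 nt are grouped, while a pair from another clone covering 8400 nt is not.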
Clusters that do not correspond to an annotated gene and are less than 500
bases 3' of another cluster are likely alternative 3' termini, and are merged
into the upstream cluster. This merging is not transitive.
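
The non-transitive 3' merge can be illustrated with the following sketch. It assumes half-open coordinates on one chromosome, and it interprets "not transitive" to mean that a cluster which is itself absorbed cannot serve as a merge target for a third cluster. The Cluster type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    start: int
    end: int
    strand: str
    annotated: bool = False


def gap_3prime(upstream, downstream):
    """Distance from the 3' end of upstream to the 5' end of downstream.

    Returns None if the clusters are on different strands or if downstream
    does not lie 3' of upstream.
    """
    if upstream.strand != downstream.strand:
        return None
    if upstream.strand == "+":
        gap = downstream.start - upstream.end
    else:
        gap = upstream.start - downstream.end
    return gap if gap >= 0 else None


def merge_3prime(clusters, max_gap=500):
    """Return {absorbed_cluster: upstream_cluster} for unannotated clusters."""
    target = {}
    for cluster in clusters:
        if cluster.annotated:
            continue  # clusters matching an annotated gene are never absorbed
        best = None
        for other in clusters:
            if other is cluster:
                continue
            gap = gap_3prime(other, cluster)
            if gap is not None and gap < max_gap:
                if best is None or gap < best[0]:
                    best = (gap, other.name)
        if best is not None:
            target[cluster.name] = best[1]
    # Non-transitive: drop merges whose target is itself being absorbed.
    return {child: parent for child, parent in target.items() if parent not in target}
```

Given clusters A, B, and C spaced 300 bases apart on the same strand, B merges into A, but C stays separate because it lies within 500 bases of B only, not of A.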