UniGene Build Procedure - Genome Based
 

The availability of genomic sequence is helpful for identifying sets of transcript sequences that correspond to distinct transcription loci or to annotated genes, which is the goal of UniGene. The procedure used for genome-based clustering of transcript sequences is described here.

Several types of evidence are used to identify a transcription locus' boundaries and to identify which transcripts represent the locus. Although determining gross structure and transcript representation of many genes is not sensitive to the details of transcript mapping, there are cases where the details are important: overlapping genes on opposite strands, or genes located within introns of other genes, for example. To accurately resolve these and similar cases, we identify genes by incorporating evidence in order of confidence, beginning with the strongest data.

Genes already annotated on the genomic sequence are recorded first. Annotated genes include those supported by experimentally confirmed RefSeqs as well as transcription loci that are predicted to encode a protein realized in a gene model. The annotated exon boundaries, together with the association of exons with genes, form a skeleton that later analysis can extend but cannot contradict.


Transcribed sequences that can be stringently aligned to genomic sequence with a requirement of splice site consensus are used to enumerate additional exon-intron boundaries. Not all sequence alignments satisfy this stringent requirement. Any sequences sharing an exon-intron boundary that can be identified with only one gene are grouped together.
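The grouping rule in this step can be sketched as follows. This is a minimal illustration, not the UniGene implementation: the function name `group_by_boundary`, the `(chromosome, strand, position)` representation of a boundary, and the two input dictionaries are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_boundary(transcript_boundaries, boundary_genes):
    """Group transcripts that share a boundary tied to exactly one gene.

    transcript_boundaries: dict of transcript id -> set of boundaries, each a
        (chrom, strand, position) tuple (hypothetical representation).
    boundary_genes: dict of boundary -> set of gene ids using that boundary.
    Returns a dict of gene id -> set of transcript ids.
    """
    groups = defaultdict(set)
    for transcript, boundaries in transcript_boundaries.items():
        for boundary in boundaries:
            genes = boundary_genes.get(boundary, set())
            # A boundary is usable evidence only if it identifies one gene.
            if len(genes) == 1:
                (gene,) = genes
                groups[gene].add(transcript)
    return dict(groups)


# Hypothetical data: EST3 only touches a boundary shared by two genes,
# so it is not grouped at this step.
transcripts = {
    "EST1": {("chr1", "+", 1000)},
    "EST2": {("chr1", "+", 1000), ("chr1", "+", 2000)},
    "EST3": {("chr1", "+", 5000)},
}
boundary_to_genes = {
    ("chr1", "+", 1000): {"geneA"},
    ("chr1", "+", 5000): {"geneA", "geneB"},
}
boundary_groups = group_by_boundary(transcripts, boundary_to_genes)
```

Sequences left ungrouped here, such as `EST3` in the example, are the ones handled by the later, lower-confidence steps.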

Unspliced sequences, as well as sequences for which the splicing location or orientation is uncertain, are associated with an overlapping exon if one exists, or placed against the genome if not. Sequence orientation is used where gene orientation may be ambiguous.
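One way to picture the overlap test is the sketch below. It assumes half-open coordinates, a simple tuple layout for alignments and exons, and a hypothetical function name `overlapping_genes`; none of these come from the UniGene procedure itself.

```python
def overlapping_genes(alignment, exons):
    """Return the genes whose exons overlap an unspliced alignment.

    alignment: (chrom, start, end, strand), with strand None when unknown.
    exons: list of (gene, chrom, start, end, strand) tuples.
    Coordinates are treated as half-open intervals [start, end).
    An empty result means the sequence is placed against the genome only.
    """
    chrom, start, end, strand = alignment
    candidates = set()
    for gene, ex_chrom, ex_start, ex_end, ex_strand in exons:
        if ex_chrom != chrom:
            continue
        if start < ex_end and ex_start < end:
            # Orientation, when known, filters out opposite-strand genes.
            if strand is None or strand == ex_strand:
                candidates.add(gene)
    return candidates


# Hypothetical overlapping genes on opposite strands.
example_exons = [
    ("geneA", "chr1", 100, 200, "+"),
    ("geneB", "chr1", 150, 300, "-"),
]
unoriented = overlapping_genes(("chr1", 160, 260, None), example_exons)
oriented = overlapping_genes(("chr1", 160, 260, "-"), example_exons)
no_exon = overlapping_genes(("chr1", 500, 600, None), example_exons)
```

In the example, the unoriented sequence overlaps exons of both genes, while the known orientation in the second call narrows the candidates to the gene on the minus strand.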

Sequences that do not align to genomic sequence are grouped together, and transcribed sequences within an interval smaller than 3000 nt that have a common clone of origin are grouped together.
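The clone-based part of this step might look like the sketch below. It rests on one reading of the text, namely that sequences from the same clone are grouped when their alignments together span less than 3000 nt on one chromosome. The function name `group_by_clone` and the tuple layout are assumptions for illustration.

```python
from collections import defaultdict

MAX_INTERVAL_NT = 3000  # threshold stated in the text


def group_by_clone(alignments, max_interval=MAX_INTERVAL_NT):
    """Group sequences from one clone that fall within a short interval.

    alignments: list of (seq_id, clone_id, chrom, start, end) tuples.
    Returns a list of sets of sequence ids.
    """
    by_clone = defaultdict(list)
    for seq_id, clone_id, chrom, start, end in alignments:
        by_clone[(clone_id, chrom)].append((seq_id, start, end))

    groups = []
    for members in by_clone.values():
        if len(members) < 2:
            continue  # nothing to group with
        lowest = min(start for _, start, _ in members)
        highest = max(end for _, _, end in members)
        if highest - lowest < max_interval:
            groups.append({seq_id for seq_id, _, _ in members})
    return groups


# Hypothetical 5' and 3' reads: cloneX spans 2600 nt, cloneY spans 9400 nt.
example_alignments = [
    ("est5p", "cloneX", "chr2", 100, 600),
    ("est3p", "cloneX", "chr2", 2200, 2700),
    ("readA", "cloneY", "chr2", 100, 500),
    ("readB", "cloneY", "chr2", 9000, 9500),
]
clone_groups = group_by_clone(example_alignments)
```

Only the `cloneX` reads are grouped in this example, because the `cloneY` reads lie too far apart to satisfy the interval limit.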

Clusters that do not correspond to an annotated gene and are less than 500 bases 3' of another cluster are likely alternative 3' termini, and are merged into the upstream cluster. This merging is not transitive.
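The non-transitive merge can be illustrated with the sketch below. It makes several simplifying assumptions that are not in the source: all clusters lie on the plus strand of one chromosome, only the immediately upstream neighbor is considered, and "not transitive" is read as meaning that a cluster which has itself been merged cannot act as a merge target. The `Cluster` class and the function name are hypothetical.

```python
from dataclasses import dataclass

MAX_GAP_BASES = 500  # threshold stated in the text


@dataclass
class Cluster:
    name: str
    start: int
    end: int
    annotated: bool = False  # True if the cluster corresponds to an annotated gene


def merge_three_prime_clusters(clusters, max_gap=MAX_GAP_BASES):
    """Return a dict mapping each merged cluster name to its upstream cluster.

    Assumes plus-strand coordinates, so "3' of" means a larger start position.
    """
    ordered = sorted(clusters, key=lambda c: c.start)
    merged_into = {}
    for upstream, candidate in zip(ordered, ordered[1:]):
        if candidate.annotated:
            continue  # annotated genes are never absorbed
        if upstream.name in merged_into:
            continue  # non-transitive: a merged cluster is not a target
        gap = candidate.start - upstream.end
        if 0 <= gap < max_gap:
            merged_into[candidate.name] = upstream.name
    return merged_into


# Hypothetical chain: B is 200 bases 3' of A, and C is 200 bases 3' of B.
example_clusters = [
    Cluster("A", 0, 1000, annotated=True),
    Cluster("B", 1200, 1500),
    Cluster("C", 1700, 2000),
]
merges = merge_three_prime_clusters(example_clusters)
```

In the example, `B` is merged into `A`, but `C` is not pulled into `A` through `B`, which is the behavior the non-transitivity rule is meant to guarantee under this reading.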


National Center for Biotechnology Information
U.S. National Library of Medicine
National Institutes of Health