Gnomon description

Gnomon uses a set of heuristics to find the maximal self-consistent set of corresponding transcript and protein alignment data, which then sets the constraints for a Hidden Markov Model (HMM)-based gene prediction. The goal is to ensure that a biological expert presented with the same data could not produce an obviously improved gene model. Using these heuristics, Gnomon predicts the gene structure in genomic DNA sequences in a multistep fashion.

The program evaluates the coding propensity of the available transcript alignments and determines their most probable coding regions. Where alignments overlap, a single set of non-overlapping transcript alignments with the better coding propensity is chosen. Then the best matching proteins for these transcript alignments are aligned back on the genomic DNA sequence.
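
The selection of non-overlapping transcript alignments can be illustrated with a minimal sketch. It assumes a simple greedy rule (keep the highest-scoring alignment, discard anything that overlaps it, and repeat). The text does not say which selection procedure Gnomon actually uses, so the `Alignment` record, the `coding_score` field, and the function names below are all hypothetical.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one transcript alignment on the genome."""
    name: str
    start: int           # inclusive genomic start
    end: int             # inclusive genomic end
    coding_score: float  # assumed stand-in for "coding propensity"


def overlaps(a: Alignment, b: Alignment) -> bool:
    """True if the two inclusive genomic ranges share at least one base."""
    return a.start <= b.end and b.start <= a.end


def pick_non_overlapping(alignments: List[Alignment]) -> List[Alignment]:
    """Greedy sketch: visit alignments from best to worst score and keep
    each one that does not overlap an alignment already kept."""
    chosen: List[Alignment] = []
    for aln in sorted(alignments, key=lambda a: a.coding_score, reverse=True):
        if not any(overlaps(aln, kept) for kept in chosen):
            chosen.append(aln)
    # Report the kept alignments in genomic order.
    return sorted(chosen, key=lambda a: a.start)
```

A greedy pass is not guaranteed to maximize the total score; it is used here only because it is the shortest way to show the idea of trading overlap against coding propensity.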

Gnomon makes the first pass of the prediction using the above transcript and protein alignments as the constraints. For the transcript alignments, the program makes sure that the chosen coding region is a part of a putative mRNA that can be extended on both sides of the predicted coding region. For the protein alignments, Gnomon checks that the predicted gene has every exon in the right frame as suggested by the protein alignment, although in this case, the program is free to choose the splice sites and to introduce other exons between parts of the protein alignment.
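
The frame constraint for protein alignments can be sketched as follows. This is an illustrative check, not Gnomon's code. It assumes the plus strand, inclusive coordinates, predicted coding exons listed in genomic order, and that each aligned protein segment begins on the first base of a codon. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive genomic coordinates


def coding_offset(cds_exons: List[Interval], pos: int) -> Optional[int]:
    """Number of coding bases that precede `pos` in the predicted CDS,
    or None if `pos` is not inside any predicted coding exon."""
    offset = 0
    for start, end in cds_exons:
        if start <= pos <= end:
            return offset + (pos - start)
        offset += end - start + 1
    return None


def frame_consistent(cds_exons: List[Interval],
                     protein_segments: List[Interval]) -> bool:
    """True if every aligned segment lies inside a single predicted exon
    and starts at codon position 0 of the predicted CDS."""
    for seg_start, seg_end in protein_segments:
        first = coding_offset(cds_exons, seg_start)
        last = coding_offset(cds_exons, seg_end)
        if first is None or last is None:
            return False  # segment end point is not coding in the model
        if last - first != seg_end - seg_start:
            return False  # segment spans more than one predicted exon
        if first % 3 != 0:
            return False  # segment is read in the wrong frame
    return True
```

Under these assumptions, an extra predicted exon placed between two aligned segments passes the check only when its length keeps the downstream segment in frame, which mirrors the freedom described above.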

The genes that were built using the alignments from the above step are included in the final output. For the rest of the gene models, the best matching proteins are found and then aligned back on the genomic DNA sequence. These protein alignments are used in the second pass of the prediction for refining the models.

While doing the alignment of the best matching proteins, Gnomon finds all cases where two exons of the protein alignment are within 50 bp of each other and have different frames. Because the probability of such a short intron is extremely low, in all these cases the program introduces a frame shift in the genomic sequence, allowing the exons to be combined into a single one. In some cases, protein alignments include a stop codon in the middle of the alignment. These stop codons are disregarded during the prediction and appear as premature stops in the model. Both the models with frame shifts and the models with premature stops are annotated as possible pseudogenes in the Gnomon output.
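
The search for frame shift candidates can be sketched in a few lines. The 50 bp threshold comes from the text. Treating it as an inclusive limit on the gap between exons, and representing each exon's frame as an integer 0, 1, or 2, are assumptions made for illustration. The `Exon` record and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

MAX_GAP_BP = 50  # threshold from the text; inclusive comparison is an assumption


@dataclass(frozen=True)
class Exon:
    """Hypothetical record for one exon of a protein alignment."""
    start: int  # inclusive genomic start
    end: int    # inclusive genomic end
    frame: int  # reading frame: 0, 1, or 2


def find_frameshift_candidates(
    exons: List[Exon], max_gap: int = MAX_GAP_BP
) -> List[Tuple[Exon, Exon]]:
    """Return neighboring exon pairs that are at most `max_gap` bases apart
    and have different frames."""
    ordered = sorted(exons, key=lambda e: e.start)
    candidates = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1  # bases strictly between the exons
        if 0 <= gap <= max_gap and left.frame != right.frame:
            candidates.append((left, right))
    return candidates
```

Each returned pair marks a place where, following the description above, the two exons would be merged and the resulting model flagged as a possible pseudogene.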


Updated: September 29, 2003.
