|
Gnomon uses a set of heuristics to find the maximal self-consistent set of
corresponding transcript and protein alignment data, which sets the constraints
for a Hidden Markov Model (HMM)-based gene prediction. The goal is to ensure
that if a biological expert were presented with the same data, they could not
produce an obviously improved gene model. Using this set of heuristics, Gnomon
predicts the gene structure in genomic DNA sequences in a multistep fashion.
The program evaluates the coding propensity of the available transcript
alignments and determines their most probable coding regions. A single set of
non-overlapping transcript alignments with the best coding propensity is
chosen. Then the best matching proteins for these transcript alignments are
aligned back to the genomic DNA sequence.
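
The text does not say how the non-overlapping set is selected. As a minimal sketch of one way to frame the problem, the following treats it as weighted interval scheduling: each alignment carries a score, and the subset of mutually non-overlapping alignments with the highest total score is kept. The `Alignment` type, the `score` field, and `select_non_overlapping` are hypothetical names introduced for illustration only; they are not Gnomon's data structures or its actual selection procedure.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Alignment(NamedTuple):
    name: str
    start: int    # inclusive genomic coordinate
    end: int      # inclusive genomic coordinate
    score: float  # hypothetical coding-propensity score (higher is better)


def select_non_overlapping(alignments: List[Alignment]) -> List[Alignment]:
    """Return the non-overlapping subset with the highest total score."""
    items = sorted(alignments, key=lambda a: a.end)
    ends = [a.end for a in items]
    n = len(items)
    best = [0.0] * (n + 1)  # best[i]: best total score using the first i items
    take = [False] * n      # take[i]: item i is part of the best solution so far
    prev = [0] * n          # prev[i]: number of items ending before item i starts

    for i, a in enumerate(items):
        # Items compatible with `a` are those whose end is strictly before a.start.
        p = bisect_left(ends, a.start, 0, i)
        prev[i] = p
        with_a = best[p] + a.score
        if with_a > best[i]:
            best[i + 1] = with_a
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen alignments.
    chosen = []
    i = n
    while i > 0:
        if take[i - 1]:
            chosen.append(items[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    chosen.reverse()
    return chosen


demo = [
    Alignment("A", 100, 500, 5.0),
    Alignment("B", 400, 900, 8.0),
    Alignment("C", 950, 1200, 3.0),
    Alignment("D", 501, 940, 2.0),
]
print([a.name for a in select_non_overlapping(demo)])
```

In this toy input, `B` overlaps both `A` and `D` but outscores them combined, so it is kept together with the compatible `C`.
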
Gnomon makes the first pass of the prediction using the above transcript and
protein alignments as the constraints. For the transcript alignments, the
program makes sure that the chosen coding region is part of a putative mRNA
that can be extended on both sides of the predicted coding region. For the
protein alignments, Gnomon checks that the predicted gene has every exon in the
right frame as suggested by the protein alignment, although in this case the
program is free to choose the splice sites and to introduce other exons
between parts of the protein alignment.
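
The frame check for protein alignments can be illustrated with a small sketch. It assumes a plus-strand gene, inclusive coordinates, and a frame defined as the genomic coordinate of codon starts modulo 3; it also assumes that every protein-alignment block must overlap at least one predicted exon. All names here (`exon_frames`, `frames_agree`, the tuple layouts) are hypothetical and are not taken from Gnomon.

```python
from typing import List, Tuple

Exon = Tuple[int, int]               # (start, end), inclusive, plus strand
ProteinBlock = Tuple[int, int, int]  # (start, end, frame implied by the protein)


def exon_frames(cds_exons: List[Exon]) -> List[int]:
    """Frame of each coding exon: genomic position of its codon starts, mod 3."""
    frames = []
    offset = 0  # number of coding bases upstream of the current exon
    for start, end in cds_exons:
        frames.append((start - offset) % 3)
        offset += end - start + 1
    return frames


def frames_agree(cds_exons: List[Exon], blocks: List[ProteinBlock]) -> bool:
    """True if every exon overlapping a protein block is in that block's frame."""
    frames = exon_frames(cds_exons)
    for b_start, b_end, b_frame in blocks:
        overlapping = [
            frame
            for (start, end), frame in zip(cds_exons, frames)
            if start <= b_end and b_start <= end
        ]
        # Assumption: a block with no overlapping exon counts as a violation.
        if not overlapping or any(frame != b_frame for frame in overlapping):
            return False
    return True


model = [(100, 198), (300, 401)]  # first exon holds 99 coding bases
print(frames_agree(model, [(103, 195, 1), (303, 390, 0)]))
print(frames_agree(model, [(103, 195, 1), (303, 390, 1)]))
```

Only the frame is compared, not the exon boundaries, which mirrors the statement that the program remains free to choose splice sites and to add exons between the aligned parts.
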
The genes that were built using the alignments from the above step are
included in the final output. For the rest of the gene models, the best
matching proteins are found and then aligned back on the genomic DNA sequence.
These protein alignments are used in the second pass of the prediction for
refining the models.
While aligning the best matching proteins, Gnomon finds all cases where two
exons of the protein alignment are within 50 bp of each other and have
different frames. Because the probability of such a short intron is extremely
low, in all these cases the program introduces a frame shift in the genomic
sequence so that the exons can be combined into a single exon. In some cases,
protein alignments include a stop codon in the middle of the alignment. These
stop codons are disregarded during the prediction and appear as premature
stops in the model. Both the models with frame shifts and the models with
premature stops are annotated as possible pseudogenes in the Gnomon output.
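
The two pseudogene signals described above can be sketched as follows. This is only an illustration under stated assumptions: "within 50 bp" is read as a gap of at most 50 bases between adjacent aligned exons, stop codons are those of the standard genetic code, and `AlignedExon`, `frameshift_candidates`, `premature_stops`, and `possible_pseudogene` are hypothetical names, not Gnomon functions.

```python
from typing import List, NamedTuple, Tuple

MAX_GAP = 50                         # bp threshold stated in the text
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


class AlignedExon(NamedTuple):
    start: int  # inclusive genomic coordinate
    end: int    # inclusive genomic coordinate
    frame: int  # reading frame (0, 1, or 2) implied by the protein alignment


def frameshift_candidates(
    exons: List[AlignedExon], max_gap: int = MAX_GAP
) -> List[Tuple[AlignedExon, AlignedExon]]:
    """Adjacent exon pairs that are close together but in different frames."""
    ordered = sorted(exons)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1  # bases between the two exons
        if gap <= max_gap and left.frame != right.frame:
            pairs.append((left, right))
    return pairs


def premature_stops(cds: str) -> List[int]:
    """0-based indices of in-frame stop codons before the final codon."""
    codons = [cds[i:i + 3].upper() for i in range(0, len(cds) - 2, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def possible_pseudogene(exons: List[AlignedExon], cds: str) -> bool:
    """Flag a model that shows either signal."""
    return bool(frameshift_candidates(exons)) or bool(premature_stops(cds))


exons = [
    AlignedExon(100, 200, 0),
    AlignedExon(231, 300, 1),   # 30 bp after the previous exon, frame differs
    AlignedExon(900, 1000, 2),  # far downstream, an ordinary intron
]
print(frameshift_candidates(exons))
print(premature_stops("ATGTAAGGCTGA"))
```

In the example, only the first pair of exons qualifies as a frame-shift candidate, and the coding sequence has one in-frame stop at codon index 1.
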
|