Types of alignments.
Assumptions:
1. I assume that the positions of all neighbors are detected.
2. Neighbors are defined as the closest aligned pairs of nucleotides on the flanking sequence. It is not yet decided whether neighbors must have matching nucleotides.
3. The left neighbor position is always less than the corresponding right one.
4. The SNP position on the flanking sequence is always between the neighbors. The SNP position on the subject sequence is always between the left and right neighbors.
5. The allele is the part of the sequence between the neighbors, excluding the neighbors themselves.
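The assumptions above can be illustrated with a minimal Python sketch. The names Neighbors and allele are hypothetical, and the code assumes 1-based positions, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Neighbors:
    """Left and right neighbor positions on one sequence (1-based, assumed)."""
    left: int
    right: int

    def __post_init__(self):
        # Assumption 3: the left neighbor is always less than the right one.
        if self.left >= self.right:
            raise ValueError("left neighbor must be less than right neighbor")

    def distance(self) -> int:
        """Number of positions strictly between the neighbors: R - L - 1."""
        return self.right - self.left - 1

    def contains(self, position: int) -> bool:
        """Assumption 4: the SNP position lies strictly between the neighbors."""
        return self.left < position < self.right


def allele(sequence: str, nb: Neighbors) -> str:
    """Assumption 5: the allele is the sequence between the neighbors,
    excluding the neighbors themselves."""
    # 1-based positions left+1 .. right-1 map to the 0-based slice [left:right-1].
    return sequence[nb.left:nb.right - 1]
```

For example, with the sequence ACGTACGT and neighbors at positions 2 and 6, the allele covers positions 3 to 5.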
Loctype 1: Insertion on the subject sequence.
In this case a single nucleotide on the flanking sequence is replaced with a sequence of more than one nucleotide on the subject sequence. The following relations always hold:
Rs - Ls - 1 > 1
Rf - Lf - 1 = 1

| | Left neighbor | SNP position | Right neighbor | Distance between neighbors |
| Flanking sequence (Q) | Lf = 199 | Sf = 200 | Rf = 201 | 201-199-1 = 1 |
| Subject sequence (S) | Lc = 2034 | Sc = 0 | Rc = 2062 | 2062-2034-1 = 27 |
Loctype 2: True SNP.
This is a restricted variant of Loctype 1. In this case exactly one
nucleotide on the flanking sequence is replaced with exactly one nucleotide on the
subject sequence. For this case the following relations always hold:
Rs - Ls - 1 = 1
Rf - Lf - 1 = 1
| | Left neighbor | SNP position | Right neighbor | Distance between neighbors |
| Flanking sequence (Q) | Lf = 199 | Sf = 200 | Rf = 201 | 201-199-1 = 1 |
| Subject sequence (S) | Lc = 2051 | Sc = 2052 | Rc = 2053 | 2053-2051-1 = 1 |
Loctype 3: Deletion on the subject sequence.
In this situation the part of the flanking sequence that includes the SNP site is deleted from the contig. In this case the following relations always hold:
Rs - Ls - 1 = 0
Rf - Lf - 1 >= 1
| | Left neighbor | SNP position | Right neighbor | Distance between neighbors |
| Flanking sequence (Q) | Lf = 163 | Sf = 200 | Rf = 201 | 201-163-1 = 37 |
| Subject sequence (S) | Lc = 2034 | Sc = 0 | Rc = 2035 | 2035-2034-1 = 0 |
Loctype 4: Range insertion.
In this case a shorter part of the flanking sequence, containing the SNP site, is
replaced with a longer subsequence on the subject. Neighbors are not adjacent to
the declared SNP site.
Rc - Lc - 1 > 1
Rf - Lf - 1 > 1
Rc - Lc > Rf - Lf
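| | Left neighbor | SNP position | Right neighbor | Distance between neighbors |
| Flanking sequence (Q) | Lf = 197 | Sf = 200 | Rf = 202 | 202-197-1 = 4 |
| Subject sequence (S) | Lc = 2033 | Sc = 2036 | Rc = 2038 | 2038-2033-1 = 4 |
Note that both distances in this example equal 4, which matches the equal-length case (Loctype 5) rather than a longer subsequence on the subject.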
Loctype 5: Range substitution.
Here a part of the flanking sequence is replaced with another sequence
of nucleotides of exactly the same length. The left and right flanking
neighbors are not adjacent to the declared SNP site.
Rc - Lc - 1 >= 1
Rf - Lf - 1 >= 1
Rc - Lc = Rf - Lf
| | Left neighbor | SNP position | Right neighbor | Distance between neighbors |
| Flanking sequence (Q) | Lf = 197 | Sf = 200 | Rf = 202 | 202-197-1 = 4 |
| Subject sequence (S) | Lc = 2033 | Sc = 2036 | Rc = 2038 | 2038-2033-1 = 4 |
Loctype 6: Range deletion.
A part of the flanking sequence containing the SNP site is replaced with a shorter
subsequence on the subject. Neighbors are not adjacent to the declared SNP
site.
Rc - Lc - 1 > 1
Rf - Lf - 1 > 1
Rc - Lc < Rf - Lf
| | Left neighbor | SNP position | Right neighbor | Distance between neighbors |
| Flanking sequence (Q) | Lf = 183 | Sf = 200 | Rf = 202 | 202-183-1 = 18 |
| Subject sequence (S) | Lc = 2033 | Sc = 0 | Rc = 2044 | 2044-2033-1 = 10 |
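The six loctype rules can be combined into one classification routine. The following Python sketch is only an illustration: the function names are hypothetical, and it makes three assumptions that the rules themselves do not settle. First, a range insertion is taken to mean that the subject distance is larger than the flanking distance, and a range deletion that it is smaller, following the prose descriptions. Second, for Loctype 3 the subject neighbors are taken to be adjacent (distance 0), as in the example table. Third, Loctype 2 is tested before Loctype 5, because a true SNP also satisfies the equal-length condition.

```python
from typing import Optional


def gap(left: int, right: int) -> int:
    """Number of positions strictly between two neighbors: R - L - 1."""
    return right - left - 1


def classify_loctype(lf: int, rf: int, ls: int, rs: int) -> Optional[int]:
    """Return the loctype (1-6) for flanking neighbors (lf, rf) and subject
    neighbors (ls, rs), or None when none of the rules applies."""
    if lf >= rf or ls >= rs:
        raise ValueError("left neighbor must be less than right neighbor")
    f = gap(lf, rf)  # distance between neighbors on the flanking sequence
    s = gap(ls, rs)  # distance between neighbors on the subject sequence

    if f == 1 and s == 1:
        return 2  # true SNP (assumed to take priority over Loctype 5)
    if f == 1 and s > 1:
        return 1  # insertion on the subject sequence
    if f >= 1 and s == 0:
        return 3  # deletion on the subject (assumed: subject neighbors adjacent)
    if f > 1 and s > 1:
        if s > f:
            return 4  # range insertion (assumed: subject part is longer)
        if s < f:
            return 6  # range deletion (assumed: subject part is shorter)
        return 5      # range substitution: equal lengths
    return None       # combination not covered by the rules above


# Neighbor positions taken from the Loctype 1 and Loctype 6 example tables.
print(classify_loctype(199, 201, 2034, 2062))
print(classify_loctype(183, 202, 2033, 2044))
```

A combination such as a flanking distance of 4 with a subject distance of 1 falls outside all six rules, so the sketch returns None for it instead of guessing.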
Alignment profiling. Profiling function.
Mismatch weights should not be equal along the flanking sequence; each weight
should be assigned according to a profiling function. Because of the nature of
the sequencing process, errors normally concentrate in the tails of the flanking
sequence. We have to take this into account in order to keep alignments that
would otherwise be discarded because of errors in the tails of the query
sequence. Let us assume that, starting from some point of the flank, the
distribution of errors follows a normal distribution. This can be approximated
with the function F(x):


The quality of the alignment can be calculated using the following equation:

Having:

The optimistic identity rate can be calculated by this function. I call it
optimistic because it does not include mismatches:

A mismatch affects the numerator of the function. The function describing
mismatches has non-removable discontinuities. Strictly speaking, we would have
to integrate this function to get the effect of the mismatches, but because the
alignment is discrete we can replace the integral with a sum of elementary
functions, where m is the vector of mismatch positions.
Thus, the final function:
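The profiling idea can be sketched in Python as follows. This is not the function from this note: the exact form of F(x) is not given above, so the sketch assumes a weight of 1.0 in the core of the flank with a Gaussian falloff inside the tails, and assumes that the quality is the weighted aligned length minus the weights at the mismatch positions, divided by the total weight of the flank. All names and parameters (profile_weights, weighted_identity, tail, sigma) are hypothetical.

```python
import math
from typing import List, Sequence


def profile_weights(length: int, tail: int, sigma: float) -> List[float]:
    """Assumed profile F(x) for x = 1..length: weight 1.0 in the core and a
    Gaussian falloff inside the first and last `tail` positions."""
    if length <= 0 or tail < 0 or sigma <= 0:
        raise ValueError("length and sigma must be positive, tail non-negative")
    weights = []
    for x in range(1, length + 1):
        # Depth of x inside either tail region; 0 inside the core.
        d = max(tail + 1 - x, x - (length - tail), 0)
        weights.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
    return weights


def weighted_identity(weights: Sequence[float], start: int, end: int,
                      mismatches: Sequence[int]) -> float:
    """Assumed quality measure: weight of the aligned flank positions
    start..end (1-based, inclusive) minus the weights at the mismatch
    positions m, divided by the total weight of the flank."""
    if not 1 <= start <= end <= len(weights):
        raise ValueError("aligned range must lie inside the flank")
    if any(m < start or m > end for m in mismatches):
        raise ValueError("mismatch positions must lie inside the aligned range")
    aligned = sum(weights[start - 1:end])
    # The integral over the mismatch function is replaced by a discrete sum.
    penalty = sum(weights[m - 1] for m in mismatches)
    return (aligned - penalty) / sum(weights)
```

With such a profile, a mismatch in the tail lowers the score less than a mismatch in the core, which is the behavior this section asks for. The tail length and sigma would have to be tuned to the actual error distribution of the reads.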
