Types of alignments.

 

Assumptions:

1. I assume that the positions of all neighbors have been detected.

2. Neighbors are defined as the closest aligned pairs of nucleotides on either side of the SNP site on the flanking sequence. It has not yet been decided whether neighbors must have matching nucleotides.

3. The position of a left neighbor is always less than the position of the corresponding right neighbor.

4. The SNP position on the flanking sequence is always between the neighbors. The SNP position on the subject sequence is always between the left and right neighbors.

5. The allele is the part of the sequence between the neighbors, excluding the neighbors themselves.
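Assumptions 3 and 5 can be illustrated with a short Python sketch. It assumes 1-based positions, which the text does not state explicitly, and the function names are hypothetical rather than part of any existing tool.

```python
def allele_length(left, right):
    """Number of positions strictly between two neighbors (R - L - 1)."""
    if left >= right:
        raise ValueError("left neighbor must be less than right neighbor")
    return right - left - 1


def allele_between(sequence, left, right):
    """Return the allele: the bases strictly between the neighbors.

    Positions are assumed to be 1-based, so the allele occupies
    positions left+1 .. right-1, which is the Python slice [left:right-1].
    """
    allele_length(left, right)  # validates left < right
    return sequence[left:right - 1]


# Toy sequence: neighbors at positions 2 and 5 enclose positions 3 and 4.
print(allele_between("ACGTACGT", 2, 5))
print(allele_length(199, 201))
```

The quantity returned by `allele_length` is the "distance between neighbors" used in every loctype definition.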

 

Loctype 1: Insertion on the subject sequence.

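In this case a single nucleotide on the flanking sequence is replaced by a sequence of more than one nucleotide on the subject sequence. With L and R denoting the left and right neighbor positions, and the subscripts f and c denoting the flanking and subject sequences, the following relations always hold:

Rc – Lc – 1 > 1

Rf – Lf – 1 = 1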

 

 

 

                        Left neighbor   SNP position   Right neighbor   Distance between neighbors
Flanking sequence (Q)   Lf = 199        Sf = 200       Rf = 201         201 – 199 – 1 = 1
Subject sequence (S)    Lc = 2034       Sc = 0         Rc = 2062        2062 – 2034 – 1 = 27

 

 


 

Loctype 2: True SNP.

This is a restricted case of Loctype 1: exactly one nucleotide on the flanking sequence is replaced by exactly one nucleotide on the subject sequence. The following relations always hold:

Rc – Lc – 1 = 1

Rf – Lf – 1 = 1

 

 

 

                        Left neighbor   SNP position   Right neighbor   Distance between neighbors
Flanking sequence (Q)   Lf = 199        Sf = 200       Rf = 201         201 – 199 – 1 = 1
Subject sequence (S)    Lc = 2051       Sc = 2052      Rc = 2053        2053 – 2051 – 1 = 1

 

Loctype 3: Deletion on the subject sequence.

In this case the part of the flanking sequence that includes the SNP site is deleted from the subject sequence (the contig), so the subject neighbors are adjacent to each other. The following relations always hold:

Rc – Lc – 1 = 0

Rf – Lf – 1 >= 1

 

 

 

                        Left neighbor   SNP position   Right neighbor   Distance between neighbors
Flanking sequence (Q)   Lf = 163        Sf = 200       Rf = 201         201 – 163 – 1 = 37
Subject sequence (S)    Lc = 2034       Sc = 0         Rc = 2035        2035 – 2034 – 1 = 0

 

Loctype 4: Range insertion.

In this case a shorter part of the flanking sequence, containing the SNP site, is replaced by a longer subsequence on the subject. The neighbors are not adjacent to the declared SNP site.

Rc – Lc – 1 > 1

Rf – Lf – 1 > 1

Rc – Lc > Rf – Lf

 

 

 

 

 

                        Left neighbor   SNP position   Right neighbor   Distance between neighbors
Flanking sequence (Q)   Lf = 197        Sf = 200       Rf = 202         202 – 197 – 1 = 4
Subject sequence (S)    Lc = 2033       Sc = 2036      Rc = 2038        2038 – 2033 – 1 = 4

 


 

Loctype 5: Range substitution.

Here a part of the flanking sequence is replaced by another sequence of nucleotides of exactly the same length. The left and right flanking neighbors are not adjacent to the declared SNP site.

Rc – Lc – 1 >= 1

Rf – Lf – 1 >= 1

Rc – Lc = Rf – Lf

 

 

 

                        Left neighbor   SNP position   Right neighbor   Distance between neighbors
Flanking sequence (Q)   Lf = 197        Sf = 200       Rf = 202         202 – 197 – 1 = 4
Subject sequence (S)    Lc = 2033       Sc = 2036      Rc = 2038        2038 – 2033 – 1 = 4

 


 

Loctype 6: Range deletion.

A part of the flanking sequence containing the SNP site is replaced by a shorter subsequence on the subject. The neighbors are not adjacent to the declared SNP site.

Rc – Lc – 1 > 1

Rf – Lf – 1 > 1

Rc – Lc < Rf – Lf

 

 

 

 

 

                        Left neighbor   SNP position   Right neighbor   Distance between neighbors
Flanking sequence (Q)   Lf = 183        Sf = 200       Rf = 202         202 – 183 – 1 = 18
Subject sequence (S)    Lc = 2033       Sc = 0         Rc = 2044        2044 – 2033 – 1 = 10
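The six definitions can be collected into one decision procedure. The following Python sketch is an illustration and not an existing implementation: the function names are hypothetical, and it takes "range insertion" to mean that the subject span is the longer one and "range deletion" that it is the shorter one, as the prose descriptions state. Combinations that none of the six definitions cover (for example, several flanking nucleotides replaced by exactly one subject nucleotide) return None.

```python
def neighbor_distance(left, right):
    """Distance between neighbors, R - L - 1 (requires L < R)."""
    if left >= right:
        raise ValueError("left neighbor must be less than right neighbor")
    return right - left - 1


def classify_loctype(lf, rf, lc, rc):
    """Return the loctype (1..6) for the given neighbor positions.

    lf, rf: left/right neighbors on the flanking sequence.
    lc, rc: left/right neighbors on the subject sequence.
    Returns None when no definition applies.
    """
    df = neighbor_distance(lf, rf)  # flanking distance
    dc = neighbor_distance(lc, rc)  # subject distance

    if dc == 0 and df >= 1:
        return 3  # deletion on the subject sequence
    if df == 1 and dc > 1:
        return 1  # insertion on the subject sequence
    if df == 1 and dc == 1:
        return 2  # true SNP (checked before the range substitution)
    if df > 1 and dc > 1:
        if dc > df:
            return 4  # range insertion: subject span is longer
        if dc < df:
            return 6  # range deletion: subject span is shorter
    if df >= 1 and dc == df:
        return 5  # range substitution: equal lengths
    return None


# Neighbor positions taken from the Loctype 1, 2, 3, 5 and 6 examples.
for name, args in [
    ("Loctype 1 example", (199, 201, 2034, 2062)),
    ("Loctype 2 example", (199, 201, 2051, 2053)),
    ("Loctype 3 example", (163, 201, 2034, 2035)),
    ("Loctype 5 example", (197, 202, 2033, 2038)),
    ("Loctype 6 example", (183, 202, 2033, 2044)),
]:
    print(name, "->", classify_loctype(*args))
```

The true SNP test comes before the range substitution test because a true SNP also satisfies the equal-length condition of Loctype 5.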

 


 


 Alignment Profiling: Profiling function.

 

Mismatches should not be weighted equally along the flanking sequence; the weight should be assigned by a profiling function. Because of the nature of the sequencing process, errors normally concentrate in the tails of the flanking sequence. We have to take this into account in order to keep alignments that would otherwise be discarded because of errors in the tails of the query sequence. Let us assume that, starting from some point on the flank, the distribution of errors follows a normal distribution. This can be approximated with the function F(x):

 

 

 

 

 

 

Quality of the alignment can be calculated using the following equation:

 

 


 

Having:

 

 

The optimistic identity rate can be calculated with the following function. I call it “optimistic” because it does not include mismatches:

 

 

A mismatch affects the numerator of the function. The function that describes mismatches has non-removable discontinuities. Strictly speaking, we would have to integrate this function to get the effect of the mismatches, but because an alignment is discrete we can replace the integral with a sum of elementary functions.

 

where m is the vector of mismatch positions.
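The idea of replacing the integral with a sum over mismatch positions can be sketched in Python. The sketch does not reproduce the equations of this section. Everything in it is an assumption made for illustration: the weight profile (1 in the core of the flank, Gaussian fall-off in the tails), the parameter names, and the form of the score (total weight minus the weight at mismatch positions, divided by total weight).

```python
import math


def tail_profile(x, core_start, core_end, sigma):
    """Hypothetical weight of position x: 1.0 inside the core,
    decaying like a Gaussian with distance from the core in the tails."""
    if x < core_start:
        d = core_start - x
    elif x > core_end:
        d = x - core_end
    else:
        return 1.0
    return math.exp(-(d * d) / (2.0 * sigma * sigma))


def weighted_identity(length, mismatches, weight):
    """Assumed score: (total weight - weight at mismatches) / total weight.

    length:     number of aligned positions, numbered 1..length
    mismatches: the vector m of mismatch positions
    weight:     callable mapping a position to its weight
    """
    total = sum(weight(x) for x in range(1, length + 1))
    penalty = sum(weight(m) for m in mismatches)  # discrete sum, not an integral
    return (total - penalty) / total


# Hypothetical 200-position flank with a core spanning positions 50..150.
def w(x):
    return tail_profile(x, core_start=50, core_end=150, sigma=20.0)


print(weighted_identity(200, [100], w))  # one mismatch in the core
print(weighted_identity(200, [1], w))    # one mismatch at the very end of a tail
```

Under these assumptions a mismatch at the end of a tail lowers the score far less than a mismatch in the core, which is the behavior the profiling function is meant to provide.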

 

 

 

Thus, the final function: