Proteome Sci. 2006; 4: 18.
Published online Sep 23, 2006. doi:  10.1186/1477-5956-4-18
PMCID: PMC1617084

Analytical model of peptide mass cluster centres with applications

Abstract

Background

The elemental composition of peptides results in formation of distinct, equidistantly spaced clusters across the mass range. The property of peptide mass clustering is used to calibrate peptide mass lists, to identify and remove non-peptide peaks and for data reduction.

Results

We developed an analytical model of the peptide mass cluster centres. Inputs to the model included the amino acid frequencies in the sequence database, the average length of the proteins in the database, the cleavage specificity of the proteolytic enzyme used and the cleavage probability. We examined the accuracy of our model by comparing it with the model based on an in silico sequence database digest. To identify the crucial parameters we analysed how the cluster centre location depends on the inputs. The distance to the nearest cluster was used to calibrate mass spectrometric peptide peak-lists and to identify non-peptide peaks.

Conclusion

The model introduced here enables us to predict the location of the peptide mass cluster centres. It explains how the location of the cluster centres depends on the input parameters. Fast and efficient calibration and filtering of non-peptide peaks is achieved by a distance measure suggested by Wool and Smilansky.

Background

The mass spectrometric (MS) technique is widely used to identify proteins in biological samples [1-4]. The proteins are cleaved into peptides by a residue specific protease, e.g. trypsin. The resulting cleavage products can then be analysed by Peptide Mass Fingerprinting (PMF) [5] or subjected to MS/MS fragment ion analysis [6,7], which both rely on the comparison of peptide or peptide fragment ion spectra with spectra simulated from protein sequence databases [8].

The sensitivity and specificity of the peptide identification can be increased by various post-processing methods, for example calibration [9-12] and identification of non-peptide peaks [10,13,14]. Some of these methods exploit the fact that peptide masses are not uniformly distributed across the mass range but form equidistantly spaced clusters [15]. Depending on the atomic composition of the peptide, the monoisotopic mass lies below (e.g. cysteine-rich peptides) or above (e.g. lysine-rich peptides) the cluster centre. The deviation from the cluster centre is a result of the mass defect, which is the difference between the nominal mass and the monoisotopic mass (Table 1). The mass defect is a result of atom fusion [16,17].

Table 1
Masses of Atoms

Calibration

Mass spectrometric peptide peak-lists of peptide mass fingerprint experiments [18] can be calibrated by comparing the location of measured peptide masses with the location of the peptide mass cluster centres. Gras et al. [19] suggested the use of maximum likelihood methods in order to determine the calibration coefficients a and b. They defined the likelihood function by:

$$\sum_i P(a\,m_i + b,\ \Delta m), \tag{1}$$

where $m_i$ is the i-th mass in the peak-list, and Δm is a search window. P(m, Δm) is the probability of finding a mass in [m, m + Δm] given the theoretical distribution of peptide masses. The parameters a and b that maximise $\sum_i P(a\,m_i + b, \Delta m)$ can then be used to calibrate the peak-lists. The authors, however, do not state whether P(m, Δm) was determined from the exact distribution of the peptide masses or from a model approximating the distribution. They also do not mention which algorithm was used to maximise the likelihood. They reported that a mass measurement accuracy of 0.2 Da or better was obtained after calibration.
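The maximisation described above can be illustrated with a simple grid search. The following Python sketch rests on several assumptions: because it is not stated how P(m, Δm) was obtained, `hypothetical_P` is an invented stand-in that peaks at multiples of the wavelength 1.000495 reported by Wool and Smilansky [10], and the grid ranges, the width `sigma` and the synthetic peak-list are chosen only for illustration. It is not the procedure of Gras et al.

```python
import numpy as np

LAMBDA = 1.000495  # cluster spacing reported by Wool and Smilansky [10]

def hypothetical_P(m, sigma=0.05):
    """Assumed stand-in for P(m, dm): largest at multiples of LAMBDA."""
    d = m - LAMBDA * np.round(m / LAMBDA)  # signed distance to nearest centre
    return np.exp(-0.5 * (d / sigma) ** 2)

def calibrate(masses, a_grid, b_grid):
    """Return the grid point (a, b) that maximises sum_i P(a*m_i + b)."""
    best_score, best_a, best_b = -np.inf, None, None
    for a in a_grid:
        for b in b_grid:
            score = hypothetical_P(a * masses + b).sum()
            if score > best_score:
                best_score, best_a, best_b = score, a, b
    return best_a, best_b

# Synthetic peak-list: exact cluster centres distorted by a known affine error.
true_masses = LAMBDA * np.arange(800, 2500, 50)
a0, b0 = 1.0001, 0.05
measured = (true_masses - b0) / a0

a_hat, b_hat = calibrate(measured,
                         np.linspace(0.9998, 1.0002, 41),
                         np.linspace(-0.2, 0.2, 41))
```

A grid search is used here only because it is easy to follow; any numerical optimiser could take its place, and the score is periodic in b, so the search range for b should stay well below one wavelength.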

Wool and Smilansky [10] have used Discrete Fourier Transformation (DFT) to determine the frequency λ and phase ϕ of a peak-list or mass spectrum. By comparing the experimental λ and ϕ with the theoretical λ = 1.000495 and ϕ = 0, they determined the slope and intercept of the calibration function. The authors reported a 40 – 60% reduction of the mass measurement error. Furthermore, they presented a scoring scheme for sequence database searches. This scoring scheme approximates the probability P(m, Δm) to observe a peptide peak of mass m with given measurement error Δm.

Matrix noise filtration

The most widely used MALDI matrices for the analysis of peptides are 3,5-Dimethoxy-4-hydroxycinnamic acid (sinapic acid), alpha-Cyano-4-hydroxycinnamic acid (alpha cyano) [20] and 2,5-dihydroxybenzoic acid (DHB) [21]. Unfortunately, clusters of matrix molecules can be ionised and cause peaks in the same mass range where peptide peaks are measured. Matrix aggregate formation can be minimised but not eliminated by adding ammonium acetate [21].

Some of the database search scoring schemes incorporate the number of signals (peaks) not assigned to a protein when computing the identification scores [22]. Therefore, the presence of matrix signals in MS spectra decreases the sensitivity of the MS spectra interpretation. Hence, the removal of peaks strongly deviating from the cluster centres is applied [21,23]. The measure of deviation from cluster centres introduced here provides a simple tool to filter non-peptide peaks.

Data reduction

A further application which employs the property of peptide mass clustering is the binning of the mass measurement range. By applying this technique the amount of data is reduced, thus increasing the speed with which the pairwise comparison of spectra can be made [24,25].

All these applications require us to know the exact location of, or the distance between, the peptide mass cluster centres. The distance between the cluster centres, which we will henceforth call wavelength λ, is commonly computed by first generating an in silico digest of the database. Afterwards, the linear dependence of the fractional part of the peptide masses on their integer part is determined by regression analysis, for a relatively small mass range of 500 to 1000 Da [23]. Various authors report different values of the distance between clusters: Wool and Smilansky reported 1.000495 [10], Gay et al. 1.000455 [15], while Tabb et al. used a wavelength of 1.00057 [24].

In this work we present an analytical model allowing us to predict the mass of the peptide cluster centres. The parameters of the model include the frequencies of the amino acids in the sequence database [26], the average length of the proteins in the database, the cleavage sites of the proteolytic enzyme and the cleavage probability. Based on this model we introduced a measure of deviation of peptide masses from the nearest cluster centre, which is a refinement of a measure proposed by Wool and Smilansky [10]. Using this distance measure, we developed a calibration procedure which employs least squares linear regression in order to determine the affine model of the mass measurement error and subsequently to calibrate the spectra. Using this method we reached a higher calibration accuracy than reported by Wool and Smilansky [10] and Gras et al. [19]. We used the same distance measure to identify and remove non-peptide peaks prior to database searches performed by the Mascot search engine [22].

Results and discussion

A simple way to predict the peptide mass cluster centres of a protein database

Figure 1 shows the mass defect, the difference between the monoisotopic (m(M)) and nominal (m(N)) masses of peptides from a sequence specific in silico protein sequence database digest [27], as a function of m(N). The peptides were produced with the restriction that no missed cleavages were allowed. A strong linear dependence of the mass defect on m(N) can be observed.

Figure 1
The peptide mass rule. Panel A: Scatterplot of m(M) - m(N) against the m(N) mass (m(M) - monoisotopic mass, m(N) - nominal mass). Red dashed line: the model determined by linear regression with intercept fixed at 0. The magenta line represents ...

The first model of this dependence which we examined was m(M) - m(N) = c1·m(N). We fixed the intercept at 0, because a hypothetical peptide with a nominal mass of 0 must have a monoisotopic mass equal to 0. The slope coefficient c1, determined by linear regression (cf. Methods), equalled 4.98·10⁻⁴ (Figure 1, Panel A, red dashed line), which is similar to the value 4.95·10⁻⁴ reported by Wool and Smilansky [10].
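A zero-intercept fit of this kind has a closed-form least-squares solution, c1 = Σ x·y / Σ x². The Python sketch below illustrates it on synthetic data; the random nominal masses, the noise level and the function name `slope_through_origin` are assumptions made for the example and do not reproduce the database digest used in this work.

```python
import numpy as np

def slope_through_origin(x, y):
    """Least-squares slope c1 of y = c1 * x with the intercept fixed at 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / np.dot(x, x))

# Synthetic stand-in for a digest: nominal masses and noisy mass defects.
rng = np.random.default_rng(0)
nominal = rng.integers(600, 2500, size=5000).astype(float)
defect = 4.98e-4 * nominal + rng.normal(0.0, 0.05, size=nominal.size)

c1 = slope_through_origin(nominal, defect)
```

Fixing the intercept removes one degree of freedom, so the slope is determined entirely by the ratio of the two sums.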

We were interested in determining the dependence between monoisotopic and nominal mass analytically.

For example, the monoisotopic mass $m_i^{(M)}$ of a hypothetical peptide built from only one amino acid i can be predicted from its nominal mass $m_i^{(N)}$ by $m_i^{(M)} = \lambda_i m_i^{(N)}$, where $\lambda_i = m_i^{(M)} / m_i^{(N)}$. For peptides generated by random cleavage of protein sequences from a protein database this dependence is approximated by:

$$\lambda_{DB} = \frac{\sum_{i \in AA} f_i\, m_i^{(M)}}{\sum_{i \in AA} f_i\, m_i^{(N)}}, \tag{2}$$

where $f_i$ is the frequency of the amino acid i in the database.
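Equation 2 is straightforward to evaluate once residue masses and frequencies are available. The Python sketch below uses the standard monoisotopic residue masses of the 20 common amino acids and takes the nominal residue mass as the rounded monoisotopic mass, which holds for these single residues. The uniform frequencies are an assumption standing in for real database frequencies, so the resulting value is illustrative and not the database value discussed in the text.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def lambda_db(freq):
    """Equation 2: frequency-weighted monoisotopic over nominal mass."""
    mono_sum = sum(f * MONO[aa] for aa, f in freq.items())
    nominal_sum = sum(f * round(MONO[aa]) for aa, f in freq.items())
    return mono_sum / nominal_sum

# Assumption: every amino acid equally frequent (not a real database).
uniform = {aa: 1.0 / len(MONO) for aa in MONO}
lam_uniform = lambda_db(uniform)

# Single-residue wavelengths lambda_i for cysteine and lysine.
lam_cys = lambda_db({"C": 1.0})
lam_lys = lambda_db({"K": 1.0})
```

Passing a dictionary with a single amino acid returns that residue's own wavelength; the cysteine value falls below and the lysine value above the frequency-weighted one, in line with the earlier remark on cysteine-rich and lysine-rich peptides.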

Now write $m_i^{(M)} = \lambda_{DB}\, m_i^{(N)} + \varepsilon_i$. Substituting this into (2), it follows that $\sum_{i \in AA} f_i \varepsilon_i = 0$. Therefore, for an amino acid randomly selected from the database, with frequencies $f_i$, the expectation of $\varepsilon_i$ is zero. Now consider a peptide made of a random selection of J amino acids, i(1),...,i(J). The ratio of monoisotopic to nominal mass for this peptide would be:

$$\lambda_p = \frac{\sum_{j=1}^{J} m_{i(j)}^{(M)}}{\sum_{j=1}^{J} m_{i(j)}^{(N)}} = \frac{\lambda_{DB} \sum_{j=1}^{J} m_{i(j)}^{(N)} + \sum_{j=1}^{J} \varepsilon_{i(j)}}{\sum_{j=1}^{J} m_{i(j)}^{(N)}}.$$

If $\sum_j \varepsilon_{i(j)}$ were uncorrelated with $\left(\sum_j m_{i(j)}^{(N)}\right)^{-1}$ for a random selection of amino acids, then $\lambda_p$ would have expectation $\lambda_{DB}$. Of course, there may be a relationship between $\varepsilon_i$ and $m_i^{(N)}$, and we would wish to use any such relationship to improve prediction of $m_i^{(M)}$.
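The step from Equation 2 to the zero-mean property of the residuals can be written out explicitly. The following lines are a worked sketch using only the symbols defined above:

```latex
\begin{align*}
\sum_{i \in AA} f_i\, m_i^{(M)}
  &= \sum_{i \in AA} f_i \left( \lambda_{DB}\, m_i^{(N)} + \varepsilon_i \right)
   = \lambda_{DB} \sum_{i \in AA} f_i\, m_i^{(N)} + \sum_{i \in AA} f_i\, \varepsilon_i, \\
\sum_{i \in AA} f_i\, m_i^{(M)}
  &= \lambda_{DB} \sum_{i \in AA} f_i\, m_i^{(N)}
  \qquad \text{(Equation 2 rearranged)}, \\
\Rightarrow \quad \sum_{i \in AA} f_i\, \varepsilon_i &= 0 .
\end{align*}
```

Dividing the last expression for $\lambda_p$ term by term gives $\lambda_p = \lambda_{DB} + \sum_j \varepsilon_{i(j)} / \sum_j m_{i(j)}^{(N)}$, which makes clear why the correlation between the two sums decides whether $\lambda_p$ is centred on $\lambda_{DB}$.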

Figure 2 visualises the frequencies $f_i$ of all amino acids in the Uniprot database [27] with their respective $\lambda_i$ plotted on the abscissa. The position of the red vertical line on the abscissa denotes $\lambda_{DB}$ (Equation 2), which equals 1.000511. The dotted, dashed and dot-dashed lines indicate the wavelength λ of DHB, alpha-cyano and sinapic acid mass spectrometric matrix clusters, respectively.

Figure 2
Bar-plot of the Amino Acid frequencies. The bars are drawn on the position of $\lambda_i = m_i^{(M)}$ ...

When testing for the significance of the intercept coefficient in the regression model m(M) ∝ λ·m(N) of a sequence specific (tryptic) in silico database digest, we found that the intercept coefficient must be included in the model. Therefore, the extended model of the monoisotopic peptide mass cluster centres was:

m(M) = c1·m(N) + c0.     (3)

Subtracting m(N) from each side of Equation 3 we obtained Δ = m(M) - m(N) = (c1 - 1)·m(N) + c0. The coefficients of the affine linear model of the cluster centres, determined using regression analysis of Δ = m(M) - m(N) on m(N), were c0 = 0.029 and (c1 - 1) = 4.85·10⁻⁴.

The maximal difference between the prediction of m(M) using m(M) = 1.000499·m(N) and m(M) = 1.000485·m(N) + 0.029 is 0.022 Dalton for m(N) ∈ [600, 2500] Dalton.
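The comparison of the two models can be reproduced directly from the coefficients quoted above. The Python sketch below evaluates both over integer nominal masses from 600 to 2500; the function names are illustrative. With these rounded coefficients the largest gap is about 0.021 Da and occurs at the lower end of the range, close to the 0.022 Dalton stated in the text; the small difference presumably stems from rounding of the printed coefficients.

```python
import numpy as np

def model_origin(m_nominal):
    """Zero-intercept model with the slope quoted in the text."""
    return 1.000499 * m_nominal

def model_affine(m_nominal):
    """Affine model (Equation 3) with the fitted slope and intercept."""
    return 1.000485 * m_nominal + 0.029

m = np.arange(600, 2501, dtype=float)
diff = model_origin(m) - model_affine(m)

max_abs_diff = float(np.max(np.abs(diff)))     # largest gap in Da
m_at_max = float(m[np.argmax(np.abs(diff))])   # nominal mass where it occurs
```

Because the difference is linear in m(N), its extreme values always lie at the ends of the interval, so checking the two end points would suffice.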

The influence of the digestion enzyme on the wavelength of peptide mass clusters

In the case of complete sequence specific cleavage of proteins, $C_P + 1$ peptides are generated, where $C_P$ is the number of cleavage sites per protein. The peptides generated from the terminus of the protein (further called terminal) will not bear a cleavage site residue $R_C$ at their end. All the other peptides, which we call internal, will have such a residue at their end. The fraction of the internal peptides $f_{c,n}$ is given by

$$f_{c,n} = \frac{C_P - n}{C_P + 1 - n}, \tag{4}$$

where n is the number of missed cleavages per protein. We approximate $C_P$, for a sequence database, by:

$$C_P = |P| \cdot \left( \sum f_{R_C} \right), \tag{5}$$

where $f_{R_C}$ are the relative frequencies of the cleavage sites and |P| is the average protein length in the database. The fraction of the terminal peptides in the case of n missed cleavages is given by $1 - f_{c,n}$. The fraction of cleavage site residues $R_C$ in an internal peptide of mass $m_{\mathrm{pep}}$ with n missed cleavage sites is denoted $f_{m,n}$ and approximated by:

$$f_{m,n} = (n + 1) \cdot \frac{\bar{m}}{m_{\mathrm{pep}}}, \tag{6}$$

where $\bar{m}$ is the average mass of an amino acid residue. A more accurate model of $f_{m,n}$ is provided in the Appendix. In the case of terminal peptides the fraction of cleavage site residues $R_C$ equals $f_{m,n-1}$. The fraction of all the other amino acid residues $R \setminus R_C$ equals $1 - f_{m,n}$ or $1 - f_{m,n-1}$, respectively. Table 2 summarises these results.

Table 2
Frequencies of cleavage site residues, and all other residues, in peptides of mass m and of terminal, and internal, peptides.
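Equations 4 to 6 translate into three small functions. The Python sketch below is a minimal illustration; the function names and all numeric inputs (average protein length 400, cleavage-site frequencies 0.06 and 0.05 as might apply to lysine and arginine under tryptic cleavage, average residue mass 110 Da, peptide mass 1100 Da) are hypothetical values chosen for readability, not values taken from the database.

```python
def cleavage_sites_per_protein(avg_length, site_freqs):
    """Equation 5: C_P = |P| * sum of cleavage-site residue frequencies."""
    return avg_length * sum(site_freqs)

def fraction_internal(c_p, n):
    """Equation 4: fraction of internal peptides for n missed cleavages."""
    return (c_p - n) / (c_p + 1 - n)

def fraction_site_residues(n, avg_residue_mass, peptide_mass):
    """Equation 6: fraction of cleavage-site residues in an internal peptide."""
    return (n + 1) * avg_residue_mass / peptide_mass

# Hypothetical inputs, for illustration only.
c_p = cleavage_sites_per_protein(400, [0.06, 0.05])
f_c0 = fraction_internal(c_p, 0)                      # internal, no missed cleavage
f_m0 = fraction_site_residues(0, 110.0, 1100.0)       # internal peptide
f_m0_terminal = fraction_site_residues(-1, 110.0, 1100.0)  # terminal: index n - 1
```

Calling `fraction_site_residues` with n - 1 gives the terminal-peptide case; for n = 0 it returns zero, which matches the statement that a terminal peptide without missed cleavages carries no cleavage site residue.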

In the case of internal peptides, the average contribution of the amino acid residues to the peptide mass is the weighted sum:

\begin{align}
m_{R_C,n}^{(*)} &= (1 - f_{m,n}) \cdot m_{\mathrm{none}} + f_{m,n} \cdot m_{R_C} \tag{7} \\
&= m_{\mathrm{none}} + f_{m,n} \cdot \left( m_{R_C} - m_{\mathrm{none}} \right), \tag{8}
\end{align}

where

$$m_{\mathrm{none}} = \sum_{i \in R \setminus R_C} f_i \cdot m_i, \tag{9}$$

is the average mass of non cleavage residues, and:

$$m_{R_C} = \sum_{i \in R_C} f_i \cdot m_i \tag{10}$$

is the average mass of the cleavage site residues $R_C$. Finally, the wavelength of internal peptides is presented as:

$$\lambda_{R_C,n}^{m} = \frac{m_{R_C,n}^{(M)}}{m_{R_C,n}^{(N)}} \tag{11}$$

The wavelength of terminal peptides was determined by $\lambda_{R_C,n-1}^{m} = m_{R_C,n-1}^{(M)} / m_{R_C,n-1}^{(N)}$.

The wavelength λ of all peptides at a mass m with exactly n missed cleavages is given by:

$$\lambda_{R_C,n}^{m,*} = \frac{m_{R_C,n}^{(M),*}}{m_{R_C,n}^{(N),*}}, \tag{12}$$

where

\begin{align}
m_{R_C,n}^{[MN],*} &= f_{c,n} \cdot m_{R_C,n}^{[MN]} + (1 - f_{c,n}) \cdot m_{R_C,n-1}^{[MN]} \tag{13} \\
&= m_{\mathrm{none}} + \left( m_{R_C} - m_{\mathrm{none}} \right) \cdot \left( f_{c,n} f_{m,n} + f_{m,(n-1)} - f_{c,n} f_{m,(n-1)} \right) \tag{14} \\
&\underset{\text{with Equation 6}}{=} m_{\mathrm{none}} + \frac{\bar{m}}{m} \left( f_{c,n} + n \right) \left( m_{R_C} - m_{\mathrm{none}} \right) \tag{15} \\
&\underset{\text{with Equation 4}}{=} m_{\mathrm{none}} + \left( \frac{C_P - n}{C_P + 1 - n} + n \right) \cdot \frac{\bar{m}}{m} \cdot \left( m_{R_C} - m_{\mathrm{none}} \right) \tag{16}
\end{align}

is the weighted sum of the mass of the terminal peptides (with frequency $1 - f_{c,n}$) and the internal peptides (with frequency $f_{c,n}$).
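The simplification from Equation 14 to Equation 15 can be checked numerically. The Python sketch below evaluates the bracketed weight in both forms; the function names and the example values (44 cleavage sites per protein, average residue mass 110 Da, peptide mass 1100 Da, residue masses 110 and 142 Da) are hypothetical and serve only as a sanity check of the algebra.

```python
def f_c(c_p, n):
    """Equation 4: fraction of internal peptides."""
    return (c_p - n) / (c_p + 1 - n)

def f_m(n, m_bar, m):
    """Equation 6: fraction of cleavage-site residues."""
    return (n + 1) * m_bar / m

def weight_eq14(c_p, n, m_bar, m):
    """Bracketed factor of Equation 14."""
    fc = f_c(c_p, n)
    return fc * f_m(n, m_bar, m) + f_m(n - 1, m_bar, m) - fc * f_m(n - 1, m_bar, m)

def weight_eq15(c_p, n, m_bar, m):
    """Same factor after inserting Equation 6 (Equation 15)."""
    return (m_bar / m) * (f_c(c_p, n) + n)

def mean_residue_mass(c_p, n, m_bar, m, m_none, m_rc):
    """Equation 15: residue mass averaged over internal and terminal peptides."""
    return m_none + weight_eq15(c_p, n, m_bar, m) * (m_rc - m_none)

# Hypothetical example values.
gaps = [abs(weight_eq14(44.0, n, 110.0, 1100.0) - weight_eq15(44.0, n, 110.0, 1100.0))
        for n in range(4)]
m_star = mean_residue_mass(44.0, 0, 110.0, 1100.0, m_none=110.0, m_rc=142.0)
```

Both forms agree to floating-point precision, and the averaged mass lies between the two residue masses as long as the weight stays between 0 and 1.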

Cleavage probability $p_c$

In practice, the cleavage probability will depend on various factors, for example on the incubation time and the efficiency of the protease used. The probability of generating a peptide with n ∈ {0, 1, 2, ...} missed cleavage sites, given the cleavage probability $p_c$, can be modelled using the geometric distribution:

$$P(n, p_c) = (1 - p_c)^n \cdot p_c \tag{17}$$

Furthermore,

$$\sum_{n=0}^{\infty} (1 - p_c)^n \cdot p_c = 1 \tag{18}$$

holds. Hence, given the cleavage probability $p_c$ and the cleavage residues $R_C$, we express the peptide mass by:

$$m_{R_C,p_c}^{*} = m_{\mathrm{none}} + \sum_{n=0}^{\infty} (1 - p_c)^n \cdot p_c \cdot \left( m_{R_C} - m_{\mathrm{none}} \right) \cdot S_n, \tag{19}$$

where

$$S_n = \left( f_{c,n} f_{m,n} + f_{m,(n-1)} - f_{c,n} f_{m,(n-1)} \right). \tag{20}$$

Therefore, the wavelength λ of peptides, if the cleavage probability is $p_c$, is given by:

$$\lambda_{R_C,p_c}^{m,*} = \frac{m_{R_C,p_c}^{(M),*}}{m_{R_C,p_c}^{(N),*}} \tag{21}$$
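Equations 17 to 21 can be combined into a short numerical routine. The Python sketch below truncates the infinite sum at `n_max`, which is an approximation introduced here and should stay below the number of cleavage sites so that Equation 4 remains positive. All names and the example numbers (cleavage probability 0.9, 44 cleavage sites, residue masses of 110.05/142.10 Da monoisotopic and 110/142 Da nominal) are hypothetical.

```python
def geometric_weight(n, p_c):
    """Equation 17: probability of n missed cleavages."""
    return (1.0 - p_c) ** n * p_c

def s_n(n, c_p, m_bar, m):
    """Equation 20, with f_c from Equation 4 and f_m from Equation 6."""
    f_c = (c_p - n) / (c_p + 1 - n)
    f_m_n = (n + 1) * m_bar / m
    f_m_prev = n * m_bar / m
    return f_c * f_m_n + f_m_prev - f_c * f_m_prev

def mean_mass(p_c, c_p, m_bar, m, m_none, m_rc, n_max=20):
    """Equation 19 with the sum truncated at n_max (an approximation)."""
    total = sum(geometric_weight(n, p_c) * s_n(n, c_p, m_bar, m)
                for n in range(n_max + 1))
    return m_none + (m_rc - m_none) * total

def wavelength(p_c, c_p, m_bar, m, mono, nominal, n_max=20):
    """Equation 21; mono and nominal are (m_none, m_rc) pairs."""
    return (mean_mass(p_c, c_p, m_bar, m, *mono, n_max=n_max)
            / mean_mass(p_c, c_p, m_bar, m, *nominal, n_max=n_max))

# Hypothetical example values.
weights_sum = sum(geometric_weight(n, 0.8) for n in range(60))
lam_example = wavelength(0.9, 44.0, 110.0, 1100.0,
                         mono=(110.05, 142.10), nominal=(110.0, 142.0))
lam_complete = wavelength(1.0, 44.0, 110.0, 1100.0,
                          mono=(110.05, 142.10), nominal=(110.0, 142.0))
```

For $p_c = 1$ only the n = 0 term survives, so the routine reduces to the complete-cleavage case of Equation 12; for smaller $p_c$ the geometric weights decay quickly, which is why a modest truncation is usually adequate.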

The monoisotopic mass as a function of the nominal mass can be expressed by:

\begin{align}
m^{(M)} &= \lambda_{R_C,p_c}^{m,*} \cdot m^{(N)} \tag{22} \\
&= \frac{m_{R_C,p_c}^{(M),*} \cdot m^{(N)}}{m_{R_C,p_c}^{(N),*}} \tag{23} \\
&\underset{\text{with Eq. 20 and 4}}{=} \frac{m_{\mathrm{none}}^{(M)} \cdot m^{(N)} + \sum_{n=0}^{\infty} (1 - p_c)^n \cdot p_c \cdot \left( m_{R_C}^{(M)} - m_{\mathrm{none}}^{(M)} \right) \cdot \bar{m} \left( f_{c,n} + n \right)}{m_{\mathrm{none}}^{(N)} + \sum_{n=0}^{\infty} (1 - p_c)^n \cdot p_c \cdot \left( m_{R_C}^{(N)} - m_{\mathrm{none}}^{(N)} \right) \cdot \frac{\bar{m}}{m^{(N)}} \left( f_{c,n} + n \right)} \tag{24} \\
&\underset{\text{for } m^{(N)} \gg \bar{m}}{\approx} \frac{m_{\mathrm{none}}^{(M)} \cdot m^{(N)}}{m_{\mathrm{none}}^{(N)}} + \frac{\sum_{n=0}^{\infty} (1 - p_c)^n \cdot p_c \cdot \left( m_{R_C}^{(M)} - m_{\mathrm{none}}^{(M)} \right) \cdot \bar{m} \left( f_{c,n} + n \right)}{m_{\mathrm{none}}^{(N)}} \tag{25}
\end{align}
aaaakeaadaqadiqaaiabikdaYiabiwda1aGaayjkaiaawMcaaaaaaaa@8A55@

This equation represents our final model of the peptide mass cluster centres. To illustrate the accuracy of the prediction we computed the residuals Δ between the monoisotopic masses of the in silico database digest and the cluster centres predicted by Equation 24. Figure 3 shows the relative residuals Δppm(m) = Δ(m)/m·10^6, in parts per million. The grey line shows the moving average of the residuals Δppm(m) computed for a window of 15 Da.

Figure 3
Deviation Δppm of peptide masses from mass cluster centres predicted using the Equation 24 in parts per million [ppm]. Gray line – moving average of Δppm. Orange lines – Standard deviation of Δppm, Green lines – ...

Figure 4, panel A, shows the difference between nominal and monoisotopic mass (m(M) - m(N)), where m(M) was predicted using the model of Equation 24. We observed that m(M) - m(N) ∝ m(N) is approximately a straight line for the mass range greater than 500 Da. By using the predicted monoisotopic mass m(M) at m(N) = 500 and at m(N) = 3000 we determined the slope:

Figure 4
The monoisotopic mass as a function of the nominal mass. Left panel: m(M) - m(N) = ($\lambda^{(m),*}_{R_C,p_c}$ ...

$$c_1 = \frac{3000 \cdot \lambda^{(3000),*}_{R_C,p_c} - 500 \cdot \lambda^{(500),*}_{R_C,p_c}}{3000 - 500} = 1.000482, \tag{26}$$

and intercept coefficient

$$c_0 = 500 \cdot \left(\lambda^{(500),*}_{R_C,p_c} - 1\right) - c_1 \cdot 500 = 0.029. \tag{27}$$

These coefficients are in good agreement with the slope and intercept determined by linear regression for the in silico sequence database digest (Figure 1).

Furthermore, we observed that the intercept c0 is positive if $m_{R_C} > m_{\mathrm{none}}$, and zero or negative otherwise. The slope c1 equals $\lambda_{\mathrm{none}} = m^{(M)}_{\mathrm{none}} / m^{(N)}_{\mathrm{none}}$ for large m(N), because the frequency of the cleavage site residues RC decreases with increasing peptide length:

$$\lim_{|Pep| \to \infty} f_{m,n} \propto \lim_{m_{pep} \to \infty} \frac{(n+1)\,\bar{m}}{m^{(N)}} = 0.$$

Figure 4, panel B, displays the difference between the line (c1 + 1)·m(M) + c0 and the prediction made using Equation 3. For the mass range m ∈ (500, 4000), where peptide masses for peptide mass fingerprinting are acquired, this difference is minimal.

The coefficients c0 and c1 do not depend on the mass of the peptides. Because of this, we use the affine model c1·m(N) + c0 to predict the peptide mass cluster centres in the applications discussed later. This simplified model also agrees with the affine model (Equation 3), which was fitted by linear regression to the in silico database digest in order to explain the dependency of the peptide mass cluster centres on the nominal mass.
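As a minimal sketch of how the affine model can be evaluated, the snippet below computes the predicted cluster centre for a given nominal mass. The constants are the values reported in Equations 26 and 27; the function name `cluster_centre` is an illustrative choice and is not taken from the original work.

```python
# Slope and intercept as reported in Equations 26 and 27.
C1 = 1.000482
C0 = 0.029


def cluster_centre(nominal_mass: int) -> float:
    """Predicted monoisotopic cluster centre c1*m(N) + c0 (hypothetical helper)."""
    return C1 * nominal_mass + C0


if __name__ == "__main__":
    # Print the predicted centre for a few nominal masses in the PMF range.
    for n in (500, 1000, 3000):
        print(n, round(cluster_centre(n), 3))
```

Because the coefficients are mass independent, neighbouring cluster centres are always c1 apart under this model.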

Error of the model

Combinatorial restrictions may cause significant differences between the linear prediction of the model introduced (Equation 24) and the actual location of the cluster centre. To assess this error we first computed the location of the cluster centres (average of all monoisotopic masses in a cluster) of the in silico database digest, and afterwards determined the difference to the cluster centre location predicted by the model of Equation 24. This difference Δ̄(cluster) is shown in Figure 5.

Figure 5
Difference between cluster centre computed for the in silico database digest and the cluster centre location predicted by the model (Equation 24). Orange lines – minimum and maximum, red lines – first and third quartile, green – ...

For a moving window of 100 Da we computed the maximum and minimum (orange), third and first quartile (red), median (blue) and mean (green) of Δ̄(cluster). The effect of the combinatorial restriction decreases with increasing mass, and for peptide masses greater than 1000 Da it is negligible. However, Δ̄(cluster) increases again for masses greater than 2500 Da, because peptide masses may deviate more strongly from the cluster centres and, furthermore, far fewer long peptides are generated.

The type of distribution around the cluster centres

In order to remove non-peptide peaks prior to database search, filtering thresholds have to be chosen. In Figure 3 the orange lines visualise the standard deviation, while the green lines show the 1% and 99% quantiles of Δppm(m) = Δ(m)/m·10^6 computed for a mass window of 15 Da. In addition, the dotted, dashed and dot-dashed lines show the deviation Δppm(m) at which clusters of mass spectrometric matrices are expected.

The standard deviation of Δppm(m) is symmetric and does not change for m > 1500. We were interested in determining the distribution of Δppm around the peptide mass cluster centres. To determine the type of distribution we used qqplots [28], shown in Figure 6. We compared the distribution of the residuals Δppm(m), observed for four different mass windows (m ∈ (500 – 530), m ∈ (1000 – 1110), m ∈ (2000 – 2200) and m ∈ (3400 – 3700)), with the normal distribution and with t-distributions with various degrees of freedom. The t-distribution with degrees of freedom μ ∈ (15, 25) is a good approximation of the empirical distribution of Δppm for masses > 2000.

Figure 6
qqplot – of Δppm = mm - c1·mN - c0 versus the t-distribution with 19 degrees of freedom for four mass ranges m ∈ (500 – 530), m ∈ (1000 – 1110), m ∈ (2000 – 2200) and m ∈ (3400 ...

Sensitivity analysis

The input parameters to the model of the peptide mass cluster centres included:

• fi – frequencies of the amino acids

• RC – cleavage specificity of the protease

• |P| – protein length

• pc – cleavage probability

To examine how the output of the model is influenced by these factors we varied the protein length |P| in steps of 100 from 300 to 800 amino acids per protein. We determined the amino acid frequencies fi for 9 sequence databases (cf. Methods) and used them as inputs to the model. Furthermore, six cleavage specificities (shown in Table 3) were examined and the cleavage probability pc was changed from 0.4 to 1 in increments of 0.2.

Table 3
Cleavage sites of proteolytic enzymes [36]

The box plots of Figure 7, Panel A, demonstrate that the values of the intercept coefficient c0 (Equation 27) mainly depend on the cleavage probability pc and on the cleavage specificity of the proteolytic enzyme. The relatively small height of the boxes indicates that the differences in amino acid frequencies fi for the databases examined, and the average protein length |P|, have a negligible effect on the intercept coefficient. The slope coefficient c1 (see Equation 26) depends only on the cleavage site specificities of the proteolytic enzyme and the amino acid frequencies fi. The box plots in Figure 7, Panel B, show that the model output is highly sensitive to the cleavage specificity of the proteolytic enzyme.

Figure 7
Panel A – Box plots of the intercept coefficient c0 (Equation 27) itemised according to the cleavage specificity and cleavage probability. Panel B – Box plots of the slope coefficient c1 (Equation 26) itemised according to the cleavage specificity. ...

A measure of distance to cluster centres

Given an experimentally determined mM we were interested to estimate the deviation Δ from the closest predicted cluster centre. The model of the monoisotopic mass is:

c0 + c1·mN + Δ = mM,     (28)

where c0, c1 can be obtained using the Equations 27 and 26, mN is the nominal mass (an integer).

Therefore, for a given mM, c0 and c1 we can determine the deviation Δ from the closest cluster centre of smaller mass by using the modulo operator as suggested by Wool and Smilansky [10]:

(mM - c0) (mod c1) = (c1·mN + Δ) (mod c1) = Δ.     (29)

However, in order to determine the distance to the closest cluster centre we considered two cases:

$$\Delta_\lambda(m_i, 0) = \begin{cases} (m_i - c_0) \pmod{c_1} & \text{if } (m_i - c_0) \pmod{\lambda_{\mathrm{none}}} < 0.5 \\ -1 + (m_i - c_0) \pmod{c_1} & \text{otherwise.} \end{cases} \tag{30}$$

The units of Δλ(mi, 0) are [m/z]. The magenta dot-dashed curves in Figure 3 indicate the maximum detectable distance from cluster centres in ppm (±0.5 Da/m·10^6 [ppm]). Deviations from the cluster centres outside the range enclosed by these two curves are assigned to the wrong cluster. In the case of theoretical peptide masses and experimental masses calibrated to high precision, such distances are observed only for masses greater than 2500 Da. Fortunately, the majority of tryptic peptide masses detected in a mass spectrometric peptide fingerprint experiment are below this mass.
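The distance measure of Equation 30 can be sketched in a few lines. The snippet below is an illustration only: it uses the coefficients reported in Equations 26 and 27, and it assumes that c1 can stand in for λnone in the case distinction (the text states that the two agree for large masses). The function names are hypothetical.

```python
C1 = 1.000482  # slope, Equation 26
C0 = 0.029     # intercept, Equation 27


def distance_to_centre(mass: float, c0: float = C0, c1: float = C1) -> float:
    """Signed distance (in m/z) to the nearest predicted cluster centre.

    Follows Equation 30 as printed: remainders of 0.5 or more are shifted
    by -1. Assumption: c1 is used in place of lambda_none in the condition.
    """
    remainder = (mass - c0) % c1  # Python's % returns a value in [0, c1)
    return remainder if remainder < 0.5 else remainder - 1.0


def distance_ppm(mass: float) -> float:
    """The same distance expressed relative to the mass, in ppm."""
    return distance_to_centre(mass) * 1e6 / mass


if __name__ == "__main__":
    centre = C0 + C1 * 1000          # predicted centre at nominal mass 1000
    print(distance_to_centre(centre + 0.1))
    print(distance_to_centre(centre - 0.1))
```

Since Equation 30 subtracts 1 rather than c1, negative distances computed this way differ from the exact offset by about 0.0005 m/z, which is small compared with the deviations of interest.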

Applications

Linear regression on peptide mass rule LR/PR

The limitations of calibration methods based on the property of peptide mass clustering are a mass accuracy of only 0.2Da, its sensitivity to non-peptide peaks in the spectra, and that it completely fails if the number of peptide peaks in the peak list is small [10,14,19]. Hence, in practice, the method is used to confirm the results of internal calibration only [14,29]. However, the advantage of the calibration methods based on the property of peptide mass clustering, over other calibration methods [12], is that no internal or external calibrants are required in order to calibrate the peptide mass lists.

We propose here a novel method for the calibration of PMF data, based on robust linear regression and the distance measure introduced in Equation 30. To determine the slope of the mass measurement error we computed the deviation from the peptide mass rule for every pair of peak masses (mi, mj) within a peak-list, employing the following equation:

$$\Delta_\lambda(m_i, m_j) = \begin{cases} |m_i - m_j| \pmod{\lambda_{\mathrm{none}}} & \text{if } |m_i - m_j| \pmod{\lambda_{\mathrm{none}}} < 0.5 \\ -1 + \left(|m_i - m_j| \pmod{\lambda_{\mathrm{none}}}\right) & \text{otherwise.} \end{cases} \tag{31}$$

Figure 8, top left panel, shows the distance Δλ(mi, mj) (Equation 31) as a function of Δd = |mi - mj|, computed for all pairs (mi, mj) ∈ peak-list that adhere to the additional constraint Δd = |mi - mj| < mmax. This constraint is necessary because the measure Δλ is only able to assign deviations smaller than 0.5 Da to the correct cluster centre. For large values of Δd, Δλ increases if c1 ≠ 0, and assignments to wrong clusters may occur. A systematic dependence of Δλ on Δd indicates a mass measurement error. We determined the slope ĉ1 using robust linear regression [30] with the intercept fixed at 0. To correct the peak-list masses we applied

Figure 8
Principle and results of linear regression on peptide rule LR/PR calibration. Panel A: Scatter-plot of ΔPR (mi, mj) (Equation 31) in dependence of Δd = |mi - mj|. The slope, obtained by robust regression, is shown by the red line. Panel ...

mcorrected = mexperimental·(1 - ĉ1)

To determine the intercept coefficient of the mass measurement error we subsequently computed Δλ(mcorrected, 0) (using Equation 30) for all peak-list masses. Figure 8, Panel B, shows the distribution of Δλ(mi, 0) before correcting for the slope error (gray histogram) and afterwards (black histogram). The red vertical line indicates the mean Δ̄λ(mi, 0), computed for the corrected data, which we used to approximate the intercept ĉ0 of the mass measurement error.
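The two calibration steps described above (slope from pairwise distances, then intercept from the distances to the cluster centres) can be sketched as follows. This is not the original implementation. In particular, the robust regression of reference [30] is replaced here by a simple stand-in, the median of the ratios Δλ/Δd, which also estimates a slope through the origin; c1 is assumed to approximate λnone; and subtracting the estimated intercept from the slope-corrected masses is an assumption about how ĉ0 is applied. All function names and the default `m_max` are hypothetical.

```python
from itertools import combinations
from statistics import mean, median

C0, C1 = 0.029, 1.000482  # intercept and slope from Equations 27 and 26


def wrap(remainder: float) -> float:
    """Map a remainder to a signed distance, as printed in Equations 30 and 31."""
    return remainder if remainder < 0.5 else remainder - 1.0


def pair_distance(mi: float, mj: float, period: float = C1) -> float:
    """Equation 31, with c1 standing in for lambda_none (assumption)."""
    return wrap(abs(mi - mj) % period)


def centre_distance(m: float, c0: float = C0, c1: float = C1) -> float:
    """Equation 30, with c1 standing in for lambda_none (assumption)."""
    return wrap((m - c0) % c1)


def lrpr_calibrate(masses, m_max: float = 1000.0):
    """Return (slope estimate, intercept estimate, calibrated masses)."""
    # Step 1: slope of the error from all pairs with 0 < |mi - mj| < m_max.
    ratios = [pair_distance(a, b) / abs(a - b)
              for a, b in combinations(masses, 2)
              if 0 < abs(a - b) < m_max]
    slope = median(ratios)  # stand-in for robust regression through 0
    corrected = [m * (1.0 - slope) for m in masses]
    # Step 2: intercept as the mean distance to the predicted centres.
    offset = mean(centre_distance(m) for m in corrected)
    return slope, offset, [m - offset for m in corrected]


if __name__ == "__main__":
    # Synthetic peak-list: exact cluster centres with an affine error applied.
    true = [C0 + C1 * n for n in (800, 950, 1100, 1300, 1500, 1700)]
    measured = [m * (1 + 2e-4) + 0.05 for m in true]
    slope, offset, calibrated = lrpr_calibrate(measured)
    print(slope, offset)
```

On real peak-lists the pairwise distances scatter around the trend, which is why a robust estimator is needed; the median is only one possible choice.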

The strip charts (Figure 8, Panels C and D) visualise the experimental masses of two trypsin peptides, 842.508 Da and 2211.100 Da, observed in most of the samples of the dataset with 380 peak-lists. The result of LR/PR calibration (red circles) is compared with the raw masses (gray triangles) and the output of the Wool and Smilansky calibration method (blue crosses). The LR/PR method is able to calibrate mass spectrometric peak-lists to an accuracy of 0.1 Da. This measurement accuracy surpasses that of the other published calibration methods [10,19] at least two-fold.

Filtering of non-peptide peaks using the peptide mass rule

Non-peptide peaks can be recognised according to their deviation from the cluster centres. The amino acids that have the most extreme λ values are I, L and K (because of their large fraction of hydrogen atoms, H (1.007825)) and C (cysteine, because of the heavy sulfur atom, S (31.97207)). If we plot the position after the decimal point given by n·(λi - 1) (mod 1) with n ∈ ℕ, for i = L and i = C, and connect the points by a line for readability (the red and green lines in Figure 9, respectively), we obtain the range enclosing any possible decimal part a theoretical peptide mass can have. If a mass with a decimal part lying in the dashed region is detected, it cannot be a peptide peak. For peptide peaks, the following inequalities hold:

Figure 9
Schema of non-peptide mass filtering. Abscissae – peptide mass, ordinate – m mod 1, dashed region – non-peptide masses. Green line – decimal part of poly-(L(lys), I(ile)) peptide masses as a function of their mass. Red ...

-413 [ppm] = (λC - λDB)·10^6 < Δλ(m, 0)·10^6/m = Δλppm(m, 0) < (λL - λDB)·10^6 = 241 [ppm],     (32)

where λDB = 1.000511 (Equation 2). We used the relative deviation of Δppm from the cluster centre in parts per million instead of using absolute values.

Figure 3 shows that only very short peptides approach the lower bound of -413 ppm. This is due to the low frequency of cysteine (C). The high frequencies of K, L and I (whose λ ≈ 1.00074) mean that the theoretical upper bound of 241 ppm can indeed be reached by some peptides with a mass of ≈ 1000 Da. Peptides of higher mass never approach the upper and lower theoretical bounds, due to the rapidly decreasing probability that they consist of K, L or I only, or of C only. The lines for the standard deviation of SN (orange lines) and for the 1% and 99% quantiles (green lines) in Figure 3 indicate that it is an exceedingly rare event to encounter a peptide mass for which Δλppm(m, 0) deviates by more than 200 ppm from the peptide cluster centre predicted by our model. Therefore, we use 200 ppm as a filtering threshold. An essential requirement for applying this filtering method successfully is that the peak-list be calibrated to high precision [12].
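A minimal sketch of the 200 ppm filter, assuming a peak-list that has already been calibrated and the coefficients of Equations 26 and 27. The function names are illustrative, c1 is assumed to stand in for λnone in Equation 30, and the threshold is applied to the absolute deviation, which is our reading of "deviate more than 200 ppm".

```python
C1 = 1.000482          # slope, Equation 26
C0 = 0.029             # intercept, Equation 27
THRESHOLD_PPM = 200.0  # filtering threshold chosen in the text


def deviation_ppm(mass: float) -> float:
    """Relative distance to the nearest predicted cluster centre, in ppm."""
    remainder = (mass - C0) % C1
    distance = remainder if remainder < 0.5 else remainder - 1.0  # Equation 30
    return distance * 1e6 / mass


def filter_peptide_peaks(masses, threshold_ppm: float = THRESHOLD_PPM):
    """Keep only peaks whose absolute deviation is within the threshold."""
    return [m for m in masses if abs(deviation_ppm(m)) <= threshold_ppm]


if __name__ == "__main__":
    peaks = [1000.511, 1000.90, 1500.60]  # made-up masses for illustration
    print(filter_peptide_peaks(peaks))
```

In this made-up list the second mass lies roughly 0.39 m/z above its nearest predicted centre and is therefore discarded.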

Figure 10 visualizes the result of non-peptide peak filtering for a dataset of 380 calibrated peak-lists. Peaks removed by applying the filtering criterion Δλppm(m, 0) > 200 are shown in green. Peptide masses removed due to filtering of abundant masses [12] are shown in red.

Figure 10
Scatter plot: abscissa – peptide mass mi, ordinate – mi mod λ with λ = 1.000495. Peaks removed from the dataset because of their high frequencies are highlighted in red. In green, peaks removed due to the strong deviation ...

We studied how the non-peptide peak filtering influences the Probability Based Mascot Score (PBMS) [22]. In theory, a single cysteine-rich peptide that deviates strongly from the peptide mass rule and has a unique mass in the database digest is, if properly assigned, sufficient to identify the protein unambiguously [10]. In the case of PBMS, which requires multiple matches to peptide masses, a single match of a unique peptide mass, even if properly assigned, will not give a score indicating reliable identification of the protein. Furthermore, this scoring scheme takes into account the number of non-matching peaks. If many unassigned peaks are observed, the score is decreased and the assignment is interpreted as insignificant. Therefore, the removal of non-peptide peaks should increase the identification sensitivity. Table 4 demonstrates that an increase of 2.5% in the number of identified samples can be obtained by removing all peaks with a distance Δλppm(m, 0) > 200 ppm from the peptide peak-lists. Row 8 of Table 4 shows that non-peptide peak filtering increases the PBMS score in 30 – 55% of cases. Removal of peptide peaks due to filtering caused a decrease of the PBMS score in less than 1% of samples.

Table 4
Results for filtering of non-peptide masses.

We concluded that non-peptide peak filtering increases the sensitivity of protein identification when the PBMS scoring scheme is used. However, the extent to which these results can be reproduced depends on the database search algorithm used.

Conclusion

We introduced here a simple model to predict the cluster centres of peptide masses. The input parameters of the model can be easily determined for the sequence databases. We studied how these parameters influence the location of cluster centres, concluding that the cleavage specificity of the enzyme used for protein digestion and the cleavage probability are the main factors. The change of the cluster centre location due to changes in average protein length or due to variability of amino acid frequencies among the databases is relatively small. However, our analysis also illustrates that, due to combinatorial constraints, the location of the cluster centres for masses smaller than 1000 Da can differ from the average location. Based on the model of the peptide mass cluster centres we derived a measure to determine the deviation of an experimental peptide mass from the nearest cluster centre. We used this distance measure to calibrate the peptide peak-lists and to recognise non-peptide peaks. The calibration method, linear regression on peptide rule, is a robust and accurate method to calibrate single peak-lists without resorting to internal calibrants. With this method, higher calibration precision was obtained than with other calibration methods that also employ the property of peptide mass clustering.

The same distance measure was used to recognise non-peptide peaks and to remove them from the peak-lists. Due to their removal, an increase of the identification rate of up to 2.5% for the PBMS scoring scheme was observed.

Methods

Data sets

In this study, we used three data sets generated in different proteome analyses:

1. A bacterial proteome of Rhodopirellula baltica (unpublished data) (1,193 spectra) measured on a Reflex III [31] MALDI-TOF instrument.

2. A mammalian proteome of Mus musculus (1,882 spectra) measured on an Ultraflex [31] MALDI-TOF instrument.

3. A plant proteome of Arabidopsis thaliana [32] measured on an Autoflex [31] MALDI-TOF instrument.

All PMF MS spectra derive from tryptic protein digests of individually excised protein spots. For this purpose, the whole tissue/cell protein extracts of the aforementioned organisms were separated by two-dimensional (2D) gel electrophoresis [33] and visualised with MS compatible Coomassie brilliant blue G250 [32]. The MALDI-TOF MS analysis was performed using delayed ion extraction and by employing the MALDI AnchorChip™ targets (Bruker Daltonics, Bremen, Germany). Positively charged ions in the m/z range of 700 – 4,500 were recorded. Subsequently, the SNAP algorithm of the XTOF spectrum analysis software (Bruker Daltonics, Bremen, Germany) detected the monoisotopic masses of the measured peptides. The set of detected monoisotopic masses constitutes the raw peak-list.

Calibration

In order to perform filtering of non-peptide peaks the dataset must be calibrated to high mass measurement accuracy. To align the dataset we used a calibration sequence [12] consisting of several calibration procedures.

First, calibration using external calibration samples was performed in order to remove higher order terms of the mass measurement error [11]. Next, the affine mass measurement error of all samples on the sample support was determined by linear regression on the peptide mass rule introduced here. Subsequently, thin plate splines were used to model the mass measurement error as a function of the sample support positions in order to calibrate the spectra. Finally, the spectra were aligned using a modified spanning tree algorithm [12].

Mascot database search

Processed peak-lists were then used for the protein database searches with the Mascot search software (Version 1.8.1) [22], employing a mass accuracy of ± 0.1Da. Methionine oxidation was set as a variable and carbamidomethylation of cysteine residues as fixed modification. We allowed only one missed proteolytic cleavage site in the analysis.

Sequence databases

We determined the amino acid frequencies of the nine protein sequence databases listed in Table 5. Seven of these databases are organism specific subsets of the NCBI non-redundant protein database [34].

Table 5
Protein lengths and amino acid frequencies (one letter code) for the nine databases; length – average protein length in database, reference – database reference, fi – amino acid frequencies

In silico protein digestion

The theoretical digestion of the protein databases was done with ProtDigest [35], a command line program taking a protein sequence database file in fasta format and cleavage specificities as input. Other optional input parameters included fixed as well as variable modifications and number of missed cleavages. The output file contains all theoretically resulting peptides with their corresponding masses.

Regression analysis

The complete tryptic in silico digest of the SwissProt [27] database generated more than 7 million peptides. In order to compute the slope coefficient we drew 500 samples of 10,000 monoisotopic and corresponding nominal masses. For each sample we fitted the affine linear model with and without fixed intercept using linear regression. The slope and intercept coefficients in Figure 1 are the medians of these 500 samples.
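The resampling scheme described above can be sketched as follows. This is an illustration under stated assumptions: ordinary least squares is written out by hand, samples are drawn without replacement (the text does not specify this), and the function names, the seed and the synthetic input are hypothetical.

```python
import random
from statistics import median


def fit_through_origin(x, y):
    """Least-squares slope of y = b*x (intercept fixed at 0)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_affine(x, y):
    """Least-squares slope and intercept of y = b*x + a."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def median_coefficients(nominal, mono, rounds=500, size=10000, seed=0):
    """Median slope (fixed intercept), median slope and median intercept."""
    rng = random.Random(seed)
    size = min(size, len(nominal))
    fixed, slopes, intercepts = [], [], []
    for _ in range(rounds):
        idx = rng.sample(range(len(nominal)), size)  # without replacement
        x = [nominal[i] for i in idx]
        y = [mono[i] for i in idx]
        fixed.append(fit_through_origin(x, y))
        slope, intercept = fit_affine(x, y)
        slopes.append(slope)
        intercepts.append(intercept)
    return median(fixed), median(slopes), median(intercepts)


if __name__ == "__main__":
    # Synthetic, exactly affine data standing in for a database digest.
    nominal = list(range(500, 3000))
    mono = [1.000482 * n + 0.029 for n in nominal]
    print(median_coefficients(nominal, mono, rounds=20, size=200))
```

Taking the median over many samples keeps the cost of each fit small while limiting the influence of any single unrepresentative sample.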

Appendix

Wool and Smilansky's algorithm

Wool and Smilansky [10] use a Discrete Fourier Transform (DFT) to determine the calibration coefficients. The wavelength λ of a peptide peak-list can be determined by convolution. The "time domain" is the peak-list X with masses xi. We computed the amplitude A (Equation 36) for a small range of frequencies (ω ~ f = 1/λ) around λtheo. We scanned the range λ ∈ λtheo ± 0.0005 in steps of 5·10⁻⁷, computing for each λ the real part (Equation 35), the imaginary part (Equation 34) and the amplitude A(ω) (Equation 36):

$$f = \frac{1}{\lambda}, \qquad \omega = 2\pi f, \tag{33}$$

$$\Im(\omega) = \sum_i \sin(\omega x_i), \tag{34}$$

$$\Re(\omega) = \sum_i \cos(\omega x_i), \tag{35}$$

$$A(\omega) = \sqrt{\Im(\omega)^2 + \Re(\omega)^2}. \tag{36}$$
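The scan over λ and the selection of the λ with the largest amplitude can be sketched as below. This is a minimal illustration and not the authors' implementation: the function name `scan_wavelength` is hypothetical, the value used for λtheo is an assumed placeholder, and the peak-list is synthetic.

```python
import numpy as np

def scan_wavelength(masses, lambda_theo, half_width=0.0005, step=5e-7):
    """Scan lambda around lambda_theo and return the lambda with maximal A.

    Returns (best lambda, corresponding omega, amplitude at the maximum).
    """
    masses = np.asarray(masses, dtype=float)
    n_steps = int(round(2 * half_width / step)) + 1
    lambdas = lambda_theo - half_width + step * np.arange(n_steps)
    omegas = 2.0 * np.pi / lambdas                # omega = 2*pi*f, f = 1/lambda
    phases = np.outer(omegas, masses)             # shape (n_lambda, n_peaks)
    imag = np.sin(phases).sum(axis=1)             # Equation 34
    real = np.cos(phases).sum(axis=1)             # Equation 35
    amplitude = np.hypot(real, imag)              # Equation 36
    best = int(np.argmax(amplitude))
    return lambdas[best], omegas[best], amplitude[best]

# Synthetic peak-list: masses on a lattice with a slightly wrong wavelength.
LAMBDA_THEO = 1.000495                            # assumed placeholder value
rng = np.random.default_rng(0)
n = rng.integers(800, 3000, size=200)
lam_true = LAMBDA_THEO + 0.0002
peaks = lam_true * n + 0.1 + rng.normal(0.0, 0.005, size=n.size)

lam_hat, omega0, amp = scan_wavelength(peaks, LAMBDA_THEO)
```

When all masses lie close to a common lattice, the sine and cosine terms add up coherently at the matching ω, so the amplitude approaches the number of peaks.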

The wavelength of the masses in the peak-list is the λ at the maximum of A(ω). The phase at this maximum, ω0 = argmax A(ω), can be determined by:

$$\phi_0 = \phi\left(\omega_{\max A(\omega)}\right) = \arctan\left(\frac{\Im(\omega_0)^2}{\Re(\omega_0)^2}\right). \tag{37}$$

The peak centres are at the line:

$$\tilde{M} = \frac{2\pi}{\omega_0} \cdot N + \frac{\phi_0}{\omega_0} \quad \text{where} \quad N = 1, 2, \ldots, n. \tag{38}$$

But they should be on the line:

$$M = \lambda_{theo} \cdot N. \tag{39}$$

Solving Equation 38 for N and substituting N into Equation 39 yields:

$$M = \frac{\lambda_{theo} \cdot \omega_0}{2\pi}\left(\tilde{M} - \frac{\phi_0}{\omega_0}\right). \tag{40}$$

Setting

$$\alpha = \frac{\lambda_{theo} \cdot \omega_0}{2\pi} \quad \text{and} \quad \beta = \frac{\phi_0}{\omega_0} \tag{41}$$

gives

$$m_{corr} = \alpha\,(m_{exp} - \beta) = \alpha\, m_{exp} - \alpha\beta, \tag{42}$$

which can be used to correct the masses. This is an affine linear model with two coefficients α and αβ.
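The correction step of Equations 37 to 42 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `affine_correction` is hypothetical, the value of λtheo is an assumed placeholder, the test data are synthetic and noise-free, and the phase is computed with the standard two-argument arctangent of the imaginary and real parts, which is a choice made for this sketch and not a literal transcription of Equation 37.

```python
import numpy as np

def affine_correction(masses, omega0, lambda_theo):
    """Return (alpha, beta, corrected masses) for a peak-list.

    omega0 is the frequency at which the amplitude A(omega) is maximal.
    """
    masses = np.asarray(masses, dtype=float)
    imag = np.sin(omega0 * masses).sum()
    real = np.cos(omega0 * masses).sum()
    # Assumption: standard phase angle of the complex sum.
    phi0 = np.arctan2(imag, real)
    alpha = lambda_theo * omega0 / (2.0 * np.pi)   # Equation 41
    beta = phi0 / omega0                           # Equation 41
    return alpha, beta, alpha * (masses - beta)    # Equation 42

# Synthetic example: true masses on the lattice lambda_theo * N,
# measured masses stretched to lam_meas and shifted by 0.12 Da.
LAMBDA_THEO = 1.000495                             # assumed placeholder value
n = np.arange(800, 3000, 7)
true_masses = LAMBDA_THEO * n
lam_meas = 1.0007
measured = lam_meas * n + 0.12

alpha, beta, corrected = affine_correction(
    measured, 2.0 * np.pi / lam_meas, LAMBDA_THEO
)
```

In this idealised case β recovers the 0.12 Da shift and α recovers the ratio λtheo/λ of the theoretical to the measured wavelength, so the corrected masses coincide with the true ones.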

Abbreviations

• PBMS – Probability based Mascot score

• DFT – Discrete Fourier Transform

• m/z – mass over charge

Authors' contributions

WEW developed and implemented the methods described, carried out the analysis and visualised the results.

WEW, MF, ML and AKE wrote the manuscript.

AKE implemented the sequence digester.

All authors contributed to the final version of the manuscript.

Acknowledgements

I would like to thank the members of the Algorithmic Bioinformatics group of Prof. Knut Reinert at the FU-Berlin for valuable discussion, especially Andreas Döring and Dr. Clemens Gröpl. Many thanks also to Dr. Johan Gobom and Dr. Patrick Giavalisco for providing the PMF-MS data and for valuable discussion. This project was partially funded by the National Genome Research Network (NGFN) of the German Ministry for Education and Research (BMBF).

References

1. Fenyo D. Identifying the proteome: software tools. Current Opinion in Biotechnology. 2000;11:391–395. doi: 10.1016/S0958-1669(00)00115-4.
2. Griffin TJ, Aebersold R. Advances in proteome analysis by mass spectrometry. J Biol Chem. 2001;276:45497–500. doi: 10.1074/jbc.R100014200.
3. Patterson SD. Data analysis – the Achilles heel of proteomics. Nat Biotechnol. 2003;21:221–2. doi: 10.1038/nbt0303-221.
4. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003;422:198–207. doi: 10.1038/nature01511.
5. Mann M, Hojrup P, Roepstorff P. Use of mass spectrometric molecular weight information to identify proteins in sequence databases. Biol Mass Spectrom. 1993;22:338–345. doi: 10.1002/bms.1200220605.
6. Johnson R, Martin S, Biemann K, Stults J, Watson J. Novel Fragmentation Process of Peptides by Collision-Induced Decomposition in a Tandem Mass Spectrometer: Differentiation of Leucine and Isoleucine. Analytical Chemistry. 1987;59:2621–2625. doi: 10.1021/ac00148a019.
7. Smith RD, Loo JA, Edmonds CG, Barinaga CJ, Udseth HR. New developments in biochemical mass spectrometry: electrospray ionization. Anal Chem. 1990;62:882–99. doi: 10.1021/ac00208a002.
8. Apweiler R, Bairoch A, Wu CH. Protein sequence databases. Curr Opin Chem Biol. 2004;8:76–80. doi: 10.1016/j.cbpa.2003.12.004.
9. Gentzel M, Kocher T, Ponnusamy S, Wilm M. Preprocessing of tandem mass spectrometric data to support automatic protein identification. Proteomics. 2003;3:1597–610. doi: 10.1002/pmic.200300486.
10. Wool A, Smilansky Z. Precalibration of matrix-assisted laser desorption/ionization-time of flight spectra for peptide mass fingerprinting. Proteomics. 2002;2:1365–1373. doi: 10.1002/1615-9861(200210)2:10<1365::AID-PROT1365>3.0.CO;2-9.
11. Gobom J, Mueller M, Egelhofer V, Theiss D, Lehrach H, Nordhoff E. A calibration method that simplifies and improves accurate determination of peptide molecular masses by MALDI-TOF MS. Anal Chem. 2002;74:3915–3923. doi: 10.1021/ac011203o.
12. Wolski WE, Lalowski M, Jungblut P, Reinert K. Calibration of mass spectrometric peptide mass fingerprint data without specific external or internal calibrants. BMC Bioinformatics. 2005;6:203. doi: 10.1186/1471-2105-6-203. http://www.biomedcentral.com/1471-2105/6/203
13. Levander F, Rognvaldsson T, Samuelsson J, James P. Automated methods for improved protein identification by peptide mass fingerprinting. Proteomics. 2004;4:2594–601. doi: 10.1002/pmic.200300804.
14. Chamrad DC, Koerting G, Gobom J, Thiele H, Klose J, Meyer HE, Blueggel M. Interpretation of mass spectrometry data for high-throughput proteomics. Anal Bioanal Chem. 2003;376:1014–22. doi: 10.1007/s00216-003-1995-x.
15. Gay S, Binz PA, Hochstrasser DF, Appel RD. Modeling peptide mass fingerprinting data using the atomic composition of peptides. Electrophoresis. 1999;20:3527–3534. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3527::AID-ELPS3527>3.0.CO;2-9.
16. Wikipedia is a Web-based, free-content encyclopedia. 2004. http://www.wikipedia.org
17. Giles J. Internet encyclopaedias go head to head. Nature. 2005;438:900–1. doi: 10.1038/438900a.
18. Pappin DJC, Hojrup P, Bleasby AJ. Rapid identification of proteins by peptide-mass fingerprinting. Curr Biol. 1993;3:327–332. doi: 10.1016/0960-9822(93)90195-T.
19. Gras R, Muller M, Gasteiger E, Gay S, Binz PA, Bienvenut W, Hoogland C, Sanchez JC, Bairoch A, Hochstrasser DF, Appel RD. Improving protein identification from peptide mass fingerprinting through a parameterized multi-level scoring algorithm and an optimized peak detection. Electrophoresis. 1999;20:3535–3550. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3535::AID-ELPS3535>3.0.CO;2-J.
20. Gobom J, Schürenberg M, Mueller M, Theiss D, Lehrach H, Nordhoff E. Alpha-cyano-4-hydroxycinnamic acid affinity sample preparation. A protocol for MALDI-MS peptide analysis in proteomics. Analytical Chemistry. 2001;73:434–438. doi: 10.1021/ac001241s.
21. Zhen Y, Xu N, Richardson B, Becklin R, Savage JR, Blake K, Peltier JM. Development of an LC-MALDI method for the analysis of protein complexes. J Am Soc Mass Spectrom. 2004;15:803–22. doi: 10.1016/j.jasms.2004.02.004.
22. Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999;20:3551–3567. doi: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2.
23. Schmidt F, Schmid M, Jungblut PR, Mattow J, Facius A, Pleissner KP. Iterative data analysis is the key for exhaustive analysis of peptide mass fingerprints from proteins separated by two-dimensional electrophoresis. J Am Soc Mass Spectrom. 2003;14:943–56. doi: 10.1016/S1044-0305(03)00345-3.
24. Tabb DL, MacCoss MJ, Wu CC, Anderson SD, Yates JR 3rd. Similarity among tandem mass spectra from proteomic experiments: detection, significance, and utility. Anal Chem. 2003;75:2470–7. doi: 10.1021/ac026424o.
25. Wolski WE, Lalowski M, Martus P, Herwig R, Giavalisco P, Gobom J, Sickmann A, Lehrach H, Reinert K. Transformation and other factors of the peptide mass spectrometry pairwise peak-list comparison process. BMC Bioinformatics. 2005;6:285. doi: 10.1186/1471-2105-6-285.
26. Cagney G, Amiri S, Premawaradena T, Lindo M, Emili A. In silico proteome analysis to facilitate proteomics experiments using mass spectrometry. Proteome Science. 2003;1:5. doi: 10.1186/1477-5956-1-5. http://www.Proteomesci.com/content/1/1/5
27. Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LSL. The Universal Protein Resource (UniProt). Nucleic Acids Res. 2005:D154–9.
28. Becker RA, Chambers JM, Wilks AR. The New S Language. London: Chapman & Hall; 1988.
29. Samuelsson J, Dalevi D, Levander F, Rognvaldsson T. Modular, Scriptable, and Automated Analysis Tools for High-Throughput Peptide Mass Fingerprinting. Bioinformatics. 2004.
30. Venables WN, Ripley BD. Modern Applied Statistics with S. Fourth. Springer; 2002. http://www.stats.ox.ac.uk/pub/MASS4/ [ISBN 0-387-95457-0]
31. Bruker Daltonics – enabling life science tools based on mass spectrometry. 2004. http://www.bdal.com
32. Giavalisco P, Nordhoff E, Kreitler T, Kloeppel KD, Lehrach H, Klose J, Gobom J. Proteome Analysis of Arabidopsis Thaliana by 2-D Electrophoresis and Matrix Assisted Laser Desorption/Ionization Time of Flight Mass Spectrometry. [To appear in Proteomics]
33. Klose J, Kobalz U. Two-dimensional electrophoresis of proteins: an updated protocol and implications for a functional analysis of the genome. Electrophoresis. 1995;16:1034–59. doi: 10.1002/elps.11501601175.
34. Pruitt KD, Tatusova T, Maglott DR. NCBI Reference Sequence project: update and current status. Nucleic Acids Res. 2003;31:34–7. doi: 10.1093/nar/gkg111.
35. Emde AK. Protein Sequence Digester. 2004. http://www.inf.fu-berlin.de/~emde/
36. Mascot. 2005. http://www.matrixscience.com
37. Glockner FO, Kube M, Bauer M, Teeling H, Lombardot T, Ludwig W, Gade D, Beck A, Borzym K, Heitmann K, Rabus R, Schlesner H, Amann R, Reinhardt R. Complete genome sequence of the marine planctomycete Pirellula sp. strain 1. Proc Natl Acad Sci USA. 2003;100:8298–303. doi: 10.1073/pnas.1431443100.

Articles from Proteome Science are provided here courtesy of BioMed Central