
# A general-purpose baseline estimation algorithm for spectroscopic data

^{a}Children’s Oncology Group, 440 E. Huntington Drive Suite 402, Arcadia, CA, 91006, U.S.A

^{b}Division of Biostatistics, School of Medicine, University of California Davis, CA, 95616, U.S.A

^{}Corresponding author.

## Abstract

A common feature of many modern technologies used in proteomics—including nuclear magnetic resonance imaging and mass spectrometry—is the generation of large amounts of data for each subject in an experiment. Extracting the signal from the background noise, however, poses significant challenges. One important part of signal extraction is the correct identification of the baseline level of the data. In this article, we propose a new algorithm (the “BXR algorithm”) for baseline estimation that can be directly applied to different types of spectroscopic data, but also can be specifically tailored to different technologies. We then show how to adapt the algorithm to a particular technology—matrix-assisted laser desorption/ionization Fourier transform ion cyclotron resonance mass spectrometry—which is rapidly gaining popularity as an analytic tool in proteomics. Finally, we compare the performance of our algorithm to that of existing algorithms for baseline estimation.

The BXR algorithm is computationally efficient, robust to the type of one-sided signal that occurs in many modern applications (including NMR and mass spectrometry), and improves on existing baseline-estimation algorithms. It is implemented as the function baseline in the *R* package FTICRMS, available either from the Comprehensive *R* Archive Network (http://www.r-project.org/) or from the first author.

**Keywords:** Baseline estimation, Fourier transform ion cyclotron resonance, Matrix-assisted laser desorption/ionization, spectroscopy

## 1 Introduction

A common feature of many modern technologies used in proteomics—including nuclear magnetic resonance imaging and mass spectrometry—is the generation of large amounts of data for each subject in an experiment. Extracting the signal from the background noise, however, poses significant challenges. One important part of signal extraction is the correct identification of the baseline level of the data. In this article, we first generalize an algorithm of Xi and Rocke [1] which was developed for NMR baseline correction and show how it can be applied to data from generic spectroscopic technologies. We also indicate how it can be adapted to the unique qualities of different technologies and illustrate this by adapting it to a specific technology: matrix-assisted laser desorption/ionization Fourier transform ion cyclotron resonance mass spectrometry (MALDI FT-ICR MS). Finally, we compare the performance of our algorithm to that of existing algorithms for baseline estimation.

## 2 Methods

### 2.1 What is a baseline?

There are different possible interpretations of what exactly a “baseline” is in spectroscopic analysis. If it is assumed that the signal is all positive standing out from a (theoretically) zero baseline level, then some kind of (smoothed) running minimum would be an appropriate baseline. This is the approach taken in software packages such as Cromwell [2], LCMS-2D [3], LIMPIC [4], PrepMS [5], and PROcess [6]. Alternatively, if the noise is assumed to fluctuate about a baseline level (like in an independent, identically-distributed (iid) normal case), then some measure of center (median, mean, etc.) is more appropriate. This is the approach taken in software packages such as msInspect [7]. Software packages such as LMS [8] have options to compute either of these types of baselines. (A third common type of analysis is continuous wavelet analysis, which does not have a separate baseline correction step as such; the baseline is automatically removed as part of the wavelet transformation. This is the approach taken in software packages such as MassSpecWavelet [9] and OpenMS [10].)

The Xi-Rocke algorithm uses the second interpretation of baseline (measure of center of the noise), and as explained in Section 3, this is the appropriate way to analyze (in particular) MALDI FT-ICR MS spectra, and is arguably appropriate in other applications. In the remainder of this article we will concentrate on this type of baseline and compare our algorithm to LMS. (msInspect was designed for liquid chromatography mass spectrometry and is not directly comparable to our current algorithm.)

### 2.2 The BXR algorithm

Suppose that the data have the form $(x_t, y_t)$ for $t = 1, \ldots, n$. Xi and Rocke [1] proposed using the score function in Equation (1) to estimate the baseline for NMR data.

$$F(\{b_t\}) = \sum_{t=1}^{n} b_t \;-\; A_1 \sum_{t=2}^{n-1} (b_{t-1} - 2b_t + b_{t+1})^2 \;-\; A_2 \sum_{t=1}^{n} \left[(b_t - y_t)_+\right]^2 \tag{1}$$

Here, $z_+ = \max\{z, 0\}$, $b_t$ represents the value of the baseline at the $t$-th data point, and $A_1$ and $A_2$ are positive constants to be determined. We maximize this score function over all possible values of $\{b_t\}$ to find the baseline^{1}. The first term in $F$ represents the overall height of the baseline. The last term is negative only when the baseline is above the data points, so it penalizes baseline values that lie too far above the data and helps ensure that the estimated baseline will go through the middle of the data. The middle term is a measure of the curvature of the baseline, so maximizing $F$ will prevent the estimated baseline from curving too sharply.
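As a minimal sketch of how the three terms trade off, the snippet below evaluates a score of the form described above (height, minus a curvature penalty on squared second differences, minus a penalty on the squared positive part of $b_t - y_t$) for two candidate baselines. The function name `score`, the toy data, and the constant weights are illustrative assumptions, not part of the FTICRMS package.

```python
def score(b, y, a1, a2):
    """Evaluate the score F for a candidate baseline b against data y.

    a1 weights the curvature penalty and a2 the above-the-data penalty;
    both are treated as positive scalars here (constant-coefficient form).
    """
    n = len(b)
    height = sum(b)
    # Squared second differences b[t-1] - 2*b[t] + b[t+1] over interior points.
    curvature = sum((b[t - 1] - 2 * b[t] + b[t + 1]) ** 2 for t in range(1, n - 1))
    # (b_t - y_t)_+ is nonzero only where the baseline lies above the data.
    overshoot = sum(max(bt - yt, 0.0) ** 2 for bt, yt in zip(b, y))
    return height - a1 * curvature - a2 * overshoot


# Toy data (an assumption for illustration): noise near 1 with one large peak.
y = [1.0, 1.2, 0.9, 5.0, 1.1, 1.0]
flat = [1.0] * len(y)                    # flat candidate baseline
bent = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0]    # candidate that chases the peak

print(score(flat, y, a1=1.0, a2=1.0))
print(score(bent, y, a1=1.0, a2=1.0))
```

The flat candidate scores higher: chasing the peak gains a little height but pays heavily in the curvature term, while the peak itself contributes nothing to the last term because it lies above both candidates.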

To make the analysis easier, we change notation. Let $\boldsymbol{b} = (b_1, \ldots, b_n)'$ (where the prime symbol represents the transpose of a vector or matrix) be a column vector containing the values of the baseline, and similarly let $\boldsymbol{y} = (y_1, \ldots, y_n)'$ contain the measured values of the spectrum. Let $\mathbb{I}(S)$ be the indicator function for the set $S$ and let $\mathbf{1}$ be an $n \times 1$ column vector of ones. Finally, it will be useful to allow $A_1$ and $A_2$ to vary with $t$, taking values $\{A_{1,t}\}_{t=2}^{n-1}$ and $\{A_{2,t}\}_{t=1}^{n}$, respectively. We can then rewrite Equation (1) in vector/matrix notation as

$$F(\boldsymbol{b}) = \boldsymbol{b}'\mathbf{1} - \boldsymbol{b}'\mathbf{\Delta}_2\boldsymbol{b} - (\boldsymbol{b} - \boldsymbol{y})'\boldsymbol{N}(\boldsymbol{b} - \boldsymbol{y}) \tag{2}$$

where $\boldsymbol{N}$ is an $n \times n$ diagonal matrix with entries $A_{2,t}\,\mathbb{I}(b_t > y_t)$, and $\mathbf{\Delta}_2 = \boldsymbol{M}_2'\boldsymbol{A}_1\boldsymbol{M}_2$, where

$$\boldsymbol{M}_2 = \begin{pmatrix} 1 & -2 & 1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 & 0 & 0 \\ \vdots & & & & \ddots & & & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -2 & 1 \end{pmatrix}$$

is an $(n-2) \times n$ matrix and $\boldsymbol{A}_1$ is an $(n-2) \times (n-2)$ diagonal matrix with entries $A_{1,t}$. We will refer to the process of maximizing this modified score function with respect to the baseline $\boldsymbol{b}$ as the Barkauskas-Xi-Rocke (BXR) algorithm.

Note that the only change that the BXR algorithm makes over Xi and Rocke’s original algorithm is allowing $A_1$ and $A_2$ to vary with $t$. This seemingly minor change has a profound impact on the effectiveness of the algorithm in the analysis of real-world data, however, as we show in Sections 2.3 and 3.

To maximize the function in Equation (2), we calculate the gradient and Hessian:

$$\nabla F = \mathbf{1} - 2\mathbf{\Delta}_2\boldsymbol{b} - 2\boldsymbol{N}(\boldsymbol{b} - \boldsymbol{y}), \qquad H(F) = -2(\mathbf{\Delta}_2 + \boldsymbol{N}).$$

Note that $\nabla F$ is continuous everywhere, and $H(F)$ is continuous except for jump discontinuities where $b_t = y_t$ for some $t$. Also, $\mathbf{\Delta}_2$ and $\boldsymbol{N}$ are both positive semidefinite (because $\boldsymbol{b}'\mathbf{\Delta}_2\boldsymbol{b}$ is a sum of squares, and $\boldsymbol{N}$ is diagonal with nonnegative entries). Thus, $\mathbf{\Delta}_2 + \boldsymbol{N}$ is positive semidefinite and will be positive definite unless $\mathbf{\Delta}_2$ and $\boldsymbol{N}$ have a common null vector. But from the form $\mathbf{\Delta}_2 = \boldsymbol{M}_2'\boldsymbol{A}_1\boldsymbol{M}_2 = (\boldsymbol{A}_1^{1/2}\boldsymbol{M}_2)'\boldsymbol{A}_1^{1/2}\boldsymbol{M}_2$, we see that $\text{rank}(\mathbf{\Delta}_2) = \text{rank}(\boldsymbol{A}_1^{1/2}\boldsymbol{M}_2) = n - 2$, and that $\boldsymbol{x}'\mathbf{\Delta}_2\boldsymbol{x} = 0$ exactly when $\boldsymbol{x}$ is a linear combination of $\mathbf{1}$ and $\boldsymbol{n} = (1, \ldots, n)'$. Furthermore, the only way that $\boldsymbol{x}'\boldsymbol{N}\boldsymbol{x} = 0$ is if $x_t = 0$ whenever $b_t > y_t$. But a nontrivial linear combination of $\mathbf{1}$ and $\boldsymbol{n}$ can have at most one zero entry, so in order for the two matrices to have a common null vector, the baseline would have to be below all but at most one point of the data. This will clearly not happen in any reasonable data set, so we see that $H(F)$ is −2 times the sum of two positive semidefinite matrices that have no common nullspace near any potential maximum. Thus, we see that in any reasonable data set there will be a unique maximum, since $H(F)$ is negative semidefinite overall and negative definite near any (reasonable) potential maximum. We can thus find the maximum by using Newton’s method using a reasonable starting point (e.g., $\text{median}(\boldsymbol{y}) \cdot \mathbf{1}$), and the BXR algorithm is virtually guaranteed to converge to the global maximum. (Technically, in most applications it will be a quasi-Newton’s method, since the matrices $\mathbf{\Delta}_2$ and $\boldsymbol{N}$ will depend non-trivially on $\boldsymbol{b}$, the quantity we are trying to estimate, and at each iteration we will be using the currently-estimated baseline to approximate $\boldsymbol{b}$. Thus, at each step we will only be approximating the gradient and Hessian.)
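The Newton iteration can be sketched in a few lines. Assuming the score has the form $F(\boldsymbol{b}) = \boldsymbol{b}'\mathbf{1} - \boldsymbol{b}'\mathbf{\Delta}_2\boldsymbol{b} - (\boldsymbol{b}-\boldsymbol{y})'\boldsymbol{N}(\boldsymbol{b}-\boldsymbol{y})$, so that $\nabla F = \mathbf{1} - 2\mathbf{\Delta}_2\boldsymbol{b} - 2\boldsymbol{N}(\boldsymbol{b}-\boldsymbol{y})$ and $H(F) = -2(\mathbf{\Delta}_2 + \boldsymbol{N})$, a full Newton step from $\boldsymbol{b}$ simplifies to solving $(\mathbf{\Delta}_2 + \boldsymbol{N})\,\boldsymbol{b}_{\text{new}} = \tfrac{1}{2}\mathbf{1} + \boldsymbol{N}\boldsymbol{y}$ with $\boldsymbol{N}$ evaluated at the current baseline. The function names, the dense Gaussian elimination (a banded solver would be the natural choice at realistic sizes), and the toy data are assumptions made for illustration; this is not the FTICRMS implementation.

```python
def second_difference_penalty(a1):
    """Return Delta_2 = M_2' A_1 M_2 as a dense list of lists.

    a1 holds the n - 2 curvature weights (one per interior point).
    """
    n = len(a1) + 2
    delta = [[0.0] * n for _ in range(n)]
    stencil = (1.0, -2.0, 1.0)           # one row of M_2
    for r, w in enumerate(a1):           # row r of M_2 covers columns r..r+2
        for i, si in enumerate(stencil):
            for j, sj in enumerate(stencil):
                delta[r + i][r + j] += w * si * sj
    return delta


def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [v] for row, v in zip(a, rhs)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def bxr_baseline(y, a1, a2, iterations=50):
    """Newton-type maximization of the score, restarting N at each step."""
    n = len(y)
    delta = second_difference_penalty(a1)
    b = [sorted(y)[n // 2]] * n          # start from (roughly) median(y) * 1
    for _ in range(iterations):
        # Diagonal of N: a2[t] where the current baseline is above the data.
        nd = [a2[t] if b[t] > y[t] else 0.0 for t in range(n)]
        lhs = [[delta[i][j] + (nd[i] if i == j else 0.0) for j in range(n)]
               for i in range(n)]
        rhs = [0.5 + nd[t] * y[t] for t in range(n)]
        new_b = solve(lhs, rhs)          # full Newton step from b
        if max(abs(u - v) for u, v in zip(new_b, b)) < 1e-10:
            return new_b
        b = new_b
    return b


# Toy data (an assumption): values alternating between 1 and 3, then the same
# series with one large "signal" point added.
y = [1.0 if t % 2 == 0 else 3.0 for t in range(20)]
y_peak = y[:]
y_peak[9] = 50.0
a1 = [1000.0] * 18
a2 = [1.0] * 20
base = bxr_baseline(y, a1, a2)
base_peak = bxr_baseline(y_peak, a1, a2)
print(max(abs(u - v) for u, v in zip(base, base_peak)))
```

In this toy case the large point never lies below the baseline, so it never enters $\boldsymbol{N}$ and the two fits coincide, which is the one-sided robustness the score function is designed to provide.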

### 2.3 Calculating $A_{1,t}$ and $A_{2,t}$

Since the estimated baseline should be linear in the data (i.e., for any constants $m$ and $c$, if $\boldsymbol{b}$ corresponds to $\boldsymbol{y}$, then $m\boldsymbol{b} + c$ should correspond to $m\boldsymbol{y} + c$) and should be invariant under sampling more or fewer points in the spectrum, Xi and Rocke argue that in their original algorithm, $A_1$ should have the form $A_1 = n^4 A_1^{\ast}/\sigma$, where $\sigma$ is a normalizing constant based on $\boldsymbol{y}$. (Xi and Rocke use an estimate of the noise standard deviation; hence the use of $\sigma$ to denote the constant.) In this article, we will allow the normalizing constant $\sigma$ to vary with $t$ as $\sigma_t$ but leave the smoothing parameter $A_1^{\ast}$ constant, giving us the form $A_{1,t} = n^4 A_1^{\ast}/\sigma_t$. To decide on a reasonable value of $A_1^{\ast}$, we use a result from Barkauskas *et al*. [12] that the autocorrelation function (ACF) of a (non-stationary) time series with $\mathbb{E}\{(Y_{t+k} - \mathbb{E}Y_{t+k})(Y_t - \mathbb{E}Y_t)\} \approx 0$ for sufficiently large $k$ eventually oscillates around a value that is approximately equal to

$$\frac{\text{Var}(\{\mathbb{E}Y_t\})}{\text{Var}(\{Y_t\})}. \tag{3}$$

The optimal value of $A_1^{\ast}$ can then be estimated by calculating the baseline $\boldsymbol{b}$ using different choices for $A_1^{\ast}$ and seeing which one gives the best match to the ACF of the noise portion of the spectrum when substituted into Equation (3). (See Figure 3 in Section 3 for an example of this applied to a MALDI FT-ICR spectrum.)
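The ACF criterion can be illustrated on simulated data. The sketch below assumes that the plateau value is the ratio of the sample variance of the mean level to the sample variance of the data (the quantity Var(**b**)/Var(**y**) that Section 3 compares against the ACF), and checks that the sample ACF of a series with a slowly varying mean and independent noise sits near that ratio at moderate lags. The test series, seed, and helper names are hypothetical.

```python
import math
import random


def sample_variance(x):
    """Population-style sample variance (divides by n)."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def sample_acf(x, lag):
    """Sample autocorrelation of x at a single positive lag."""
    n = len(x)
    mean = sum(x) / n
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n
    return cov / sample_variance(x)


random.seed(0)
n = 4000
# Hypothetical series: slowly varying mean plus independent unit-variance noise.
mean_level = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
y = [m + random.gauss(0.0, 1.0) for m in mean_level]

plateau = sample_variance(mean_level) / sample_variance(y)
print(plateau, [round(sample_acf(y, k), 3) for k in (5, 10, 20)])
```

In practice the true mean level is unknown, so a candidate baseline computed with a given $A_1^{\ast}$ would take the place of `mean_level`, and the $A_1^{\ast}$ whose ratio best matches the observed ACF plateau would be retained.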

In order to determine $A_{2,t}$, we set the $t$-th coordinate of $\nabla F$ equal to zero and assume that the baseline is flat, so that the middle term drops out. Let $Y_t$ be the underlying random variable whose realization is given by $y_t$. We want to choose $A_{2,t}$ so that the function is maximized at $b_t = g(Y_t)$ for some function $g$. Thus, we want

$$A_{2,t} = \frac{1}{2\,\mathbb{E}\{(g(Y_t) - Y_t)_+\}}.$$

(Of course, if $g(Y_t)$ were known, there would be no need to run the algorithm.) One obvious choice for $g(Y_t)$ is the expected value $\mathbb{E}Y_t$. For this choice of $g(Y_t)$ in the case that the data are assumed to be iid normal with variance $\sigma^2$, then we might choose

$$A_{2,t} = \frac{1}{2\,\mathbb{E}\{(\mathbb{E}Y_t - Y_t)_+\}} = \sigma^{-1}\sqrt{\pi/2},$$

which recovers the result in Xi and Rocke [1].

We observe that for an arbitrary random variable $Y$, we have $\mathbb{E}\{(\mathbb{E}Y - Y)_+\} = \mathbb{E}|Y - \mathbb{E}Y|/2$. Thus, if there is no information about the distribution of the random variables $\{Y_t\}$, then a reasonable choice might be $A_{2,t} = \mathbb{I}(\hat{b}_t > y_t)/(\hat{b}_t - y_t)$, where $\hat{b}_t$ is the current estimate of the baseline (i.e., the current estimate of $\mathbb{E}Y_t$).

Figure 1 shows the effect of these choices on the estimated baseline. We ran two simulations of 973,720 observations (the number of observations in the “noise” spectrum analyzed in Section 3), each with $x$-coordinates equally spaced between zero and three and with baseline given by $y = \sin(2\pi x)$. In the first simulation we used iid $\mathcal{N}(0, 1)$ noise added to the baseline, and in the second we used independent $\mathcal{N}(0, \sigma_x^2)$ noise, where $\sigma_x = 1 + 0.5\cos(4\pi x/3)$. We used $A_1^{\ast} = 10^{-11}$ and a constant value for $\sigma_t$ estimated by dividing the spectrum into 1024 (roughly) equal-sized sets of points, calculating the standard deviation of each set of points, then finding the average standard deviation using the estimate of center from Tukey’s biweight with $K = 9$. We ran the BXR algorithm twice on each set of simulated data, once with each choice of $A_{2,t}$ above.
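The constant estimate of $\sigma_t$ described above (block standard deviations combined by a Tukey biweight center) can be sketched as follows. The exact biweight variant is not specified in the text, so this sketch assumes an iteratively reweighted location estimate whose scale is $K$ times the median absolute deviation, and it uses the sample standard deviation within each block; both choices, along with the function names and test data, are assumptions.

```python
import random
import statistics


def tukey_biweight_center(x, k=9.0, iterations=20):
    """Iteratively reweighted Tukey biweight estimate of center.

    Scale is k times the median absolute deviation about the current
    center; points farther away than that receive zero weight.
    """
    center = statistics.median(x)
    for _ in range(iterations):
        mad = statistics.median([abs(v - center) for v in x])
        if mad == 0.0:
            return center
        weights = [max(1.0 - ((v - center) / (k * mad)) ** 2, 0.0) ** 2 for v in x]
        center = sum(w * v for w, v in zip(weights, x)) / sum(weights)
    return center


def constant_sigma(y, n_blocks=1024, k=9.0):
    """Biweight center of the standard deviations of contiguous blocks of y."""
    n = len(y)
    edges = [round(i * n / n_blocks) for i in range(n_blocks + 1)]
    sds = [statistics.stdev(y[lo:hi])
           for lo, hi in zip(edges, edges[1:]) if hi - lo > 1]
    return tukey_biweight_center(sds, k)


random.seed(1)
y = [random.gauss(0.0, 1.0) for _ in range(4096)]
y[100] = 500.0                      # one extreme point inflates a single block
print(constant_sigma(y, n_blocks=64))
```

Because the block containing the extreme point has a standard deviation far outside $K$ median absolute deviations, it receives zero weight and the estimate stays close to the noise standard deviation.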

For the data generated with homoscedastic noise, both versions of the BXR algorithm perform well, with only a small amount of bias near the extreme values of the baseline (Figure 1, top). The major advantage here is that the algorithm that assumes homoscedasticity runs much faster (using roughly 10–20% of the computing time, depending on the exact convergence criterion chosen). However, if the noise is actually heteroscedastic, then assuming homoscedasticity causes the BXR algorithm to badly mis-estimate the baseline, underestimating it when the variance is above average and overestimating it when the variance is below average (Figure 1, bottom). The distribution-free version of the BXR algorithm, however, still produces a result that is almost indistinguishable from the true baseline. Thus, if the noise can reasonably be assumed to be iid normal, then $A_{2,t} = \sigma^{-1}\sqrt{\pi/2}$ is a good choice, but if the noise is heteroscedastic with unknown distribution, then $A_{2,t} = \mathbb{I}(\hat{b}_t > y_t)/(\hat{b}_t - y_t)$, where $\hat{b}_t$ is the current estimate of the baseline, should be preferred.

Of course, if information on the distribution of the noise for a particular technology is available, it would be advantageous to explore whether distribution-specific choices for $A_{1,t}$ and $A_{2,t}$ would work better than the distribution-free choices. In the next section, we will show how to do this for the particular case of MALDI FT-ICR MS data.

## 3 Application to MALDI FT-ICR MS data

Matrix-assisted laser desorption/ionization Fourier transform ion cyclotron resonance mass spectrometry (MALDI FT-ICR MS) is a technique for high mass-resolution analysis of substances that is rapidly gaining popularity as an analytic tool in proteomics. Typically in MALDI FT-ICR MS, a sample (the *analyte*) is mixed with a chemical that absorbs light at the wavelength of the laser (the *matrix*) in a solution of organic solvent and water. The resulting solution is then spotted on a MALDI plate and the solvent is allowed to evaporate, leaving behind the matrix and the analyte. A laser is fired at the MALDI plate and is absorbed by the matrix. The matrix breaks apart and transfers a charge to the analyte, creating the ions of interest (with fewer fragments than would be created by direct ablation of the analyte with a laser). The ions are guided with a quadrupole ion guide into the ICR cell where the ions cyclotron in a magnetic field. While in the cell, the ions are excited and ion cyclotron frequencies are measured. The angular velocity, and therefore the frequency, of a charged particle is determined solely by its mass-to-charge ratio. Using Fourier analysis, the frequencies can be resolved into a sum of pure sinusoidal curves with given frequencies and amplitudes. The frequencies correspond to the mass-to-charge ratios and the amplitudes correspond to the concentrations of the compounds in the analyte. FT-ICR MS is known for high mass resolution, with separation thresholds on the order of 10^{−3} Daltons (Da) or better [13, 14].

As an application of the methods developed in Section 2, we apply them to two MALDI FT-ICR spectra: a “noise” spectrum (one created with no analyte or matrix, pictured in Figure 2) and a spectrum prepared for a cancer study [15] with human blood serum as the analyte. The spectra analyzed in this article were recorded in the Lebrilla lab in the Chemistry Department at the University of California at Davis on an external source MALDI FT-ICR instrument (HiResMALDI, IonSpec Corporation, Irvine, CA) equipped with a 7.0 T superconducting magnet and a pulsed Nd:YAG laser (355 nm). The serum sample was collected at the University of California at Davis Cancer Center; the patient gave written informed consent under an IRB-approved protocol.

The BXR algorithm is especially suited to analyzing MALDI FT-ICR spectra because of the following property, first observed in Barkauskas *et al*. [12]: the data obtained by dividing the noise portion of a MALDI FT-ICR spectrum by the expected value at each point can be closely modeled by a causal and invertible autoregressive, moving-average time series with generalized gamma innovations. Thus, identifying the mean level of the noise determines the entire distribution of the noise, which leads to a nice method for identifying peaks in a MALDI FT-ICR spectrum as either noise or signal.

It follows that if we consider the random variables $Y_t^{\prime} = Y_t/\mathbb{E}Y_t$, then $\{Y_t^{\prime}\}$ should be identically distributed with mean 1. (Note that $\{Y_t^{\prime}\}$ are not independent; the autocorrelation function is non-trivial.) Thus, we get

$$A_{2,t} = \frac{1}{2\,\mathbb{E}\{(\mathbb{E}Y_t - Y_t)_+\}} = \frac{1}{2\,\mathbb{E}\{(1 - Y_t^{\prime})_+\}\,\mathbb{E}Y_t}.$$

Using the spectrum in Figure 2 as $\{Y_t\}$ (and running means to estimate $\{\mathbb{E}Y_t\}$) gives us $\mathbb{E}\{(1 - Y_t^{\prime})_+\} = 0.2100706$. For each iteration we can use the currently estimated value of the baseline $\hat{b}_t$ as an estimate for $\mathbb{E}Y_t$, so we see that for this spectrum we should use $A_{2,t} = 1/(0.4201412\,\hat{b}_t)$. Similarly, to obtain an appropriate value of $\sigma_t$, we observe that since the standard deviation of $Y_t^{\prime}$ estimated from the spectrum is 0.522659, we can use $\sigma_t = 0.522659\,\hat{b}_t$.

To choose an appropriate value of $A_1^{\ast}$, we tried values of $10^{-j}$ for $j = 10, \ldots, 13$ and found that the value of $\text{Var}(\boldsymbol{b})/\text{Var}(\boldsymbol{y})$ was closest to the eventual value of the ACF of the noise spectrum when $-\log_{10} A_1^{\ast} \approx 10.855$ (see Figure 3).

For each of the two spectra, we calculated the baseline using four methods: a running Tukey’s biweight with $K = 9$ and bandwidth 8001; and the BXR algorithm with $A_1^{\ast} = 10^{-10.855}$ and $A_{2,t}$ chosen to be one of the three choices $\sigma^{-1}\sqrt{\pi/2}$ (the “iid normal method”, which is just Xi and Rocke’s original algorithm), or $\mathbb{I}(\hat{b}_t > y_t)/(\hat{b}_t - y_t)$ (the “distribution-free method”), or $1/(0.4201412\,\hat{b}_t)$ (the “distribution-specific method”), where $\hat{b}_t$ is the current estimate of the baseline at point $t$. For the first two choices of $A_{2,t}$ we used $\sigma_t$ calculated in the same way as for the simulated data in Section 2.3; for the last choice of $A_{2,t}$, we used $\sigma_t = 0.522659\,\hat{b}_t$. We then computed the ratio of each of the estimated baselines to a running means estimate.

For the noise spectrum, we used running means with bandwidth 8001. The noise spectrum has two spikes at frequencies of 41.21 kHz and 42.21 kHz which extend upward to intensities of approximately 222.7 and 95.4, respectively, and are apparently instrumental noise (they have no isotope peaks; if they were real compounds, the isotope peaks should be easily large enough to show above the noise). In the calculation of the running means, we set the values of the spectrum at frequencies corresponding to these two peaks to be missing.

For the serum spectrum, the presence of multiple large peaks would badly skew the running means. To get a reasonable estimate of the running means of the noise portion of the spectrum, we used a baseline calculated using the BXR algorithm with parameters $A_{1,t}$ and $A_{2,t}$ as for the noise spectrum and set the values of the spectrum at frequencies corresponding to any peak that reaches at least 3.7996 times higher than that baseline to be missing. (From simulations of noise spectra, this is approximately equivalent to taking 4.5 standard deviations above the mean for iid normal data.) We then used running means with bandwidth 8001.
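A rough sketch of the masking step and of a running mean that tolerates missing values is given below. It simplifies the procedure in one respect that should be kept in mind: it masks only the individual points that exceed the threshold, whereas the text masks all frequencies belonging to a peak. The function names are hypothetical, and the quadratic-time window loop is chosen for clarity rather than speed.

```python
def mask_peaks(y, baseline, ratio=3.7996):
    """Replace points at least `ratio` times the baseline with None (missing)."""
    return [None if v >= ratio * b else v for v, b in zip(y, baseline)]


def running_mean(y, bandwidth):
    """Centered running mean that skips missing (None) values.

    bandwidth is the (odd) window width; windows are truncated at the ends.
    """
    half = bandwidth // 2
    out = []
    for t in range(len(y)):
        window = [v for v in y[max(0, t - half):t + half + 1] if v is not None]
        out.append(sum(window) / len(window) if window else None)
    return out


y = [1.0, 1.0, 10.0, 1.0, 1.0]          # toy spectrum with one large point
masked = mask_peaks(y, [1.0] * 5)
print(masked)
print(running_mean(masked, bandwidth=3))
```

With the large point masked, the running mean of the remaining values is unaffected by it; this extra bookkeeping is exactly the work that the one-sided penalty in the score function makes unnecessary.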

The results for the noise spectrum are displayed in Figure 4, and the results for the serum spectrum are displayed in Figure 5. Note that each algorithm performs similarly for both spectra. Specifically, the iid normal method underestimates the baseline for small frequencies (where the noise variance is large) and overestimates the baseline for large frequencies (where the noise variance is small), as expected. The distribution-free method does much better, but is still consistently underestimating the baseline. Furthermore, in the noise spectrum the bias increases in absolute value as the frequency decreases. The running Tukey’s biweight underestimates the running means by a fairly consistent amount (although by less than the distribution-free method), which is not surprising, since a simple calculation shows that the distribution of the noise is right-skewed. Finally, we see that the baseline estimated using the distribution-specific parameters is (on average) unbiased.

Thus, applying the distribution-specific BXR algorithm to a MALDI FT-ICR spectrum is roughly equivalent to simply calculating running means for the noise portion of that spectrum. However, the BXR algorithm has two main advantages over running means. The first is speed: the BXR algorithm uses roughly half the computing time of the running means. (Of course, optimizing each algorithm could change this. Also, as noted in Yang *et al*. [16], even aside from issues of algorithm optimization, running times are only really comparable for programs in the same language. Thus, comparing run times of various algorithms should only be considered as a rough guideline.) More importantly, the negativity penalty $A_{2,t}$ in the BXR algorithm only comes into play when the baseline is above the data. If the data are above the baseline, it does not matter by how much. Thus, the extremely large values in a spectrum which constitute the signal are automatically ignored by the BXR algorithm, while extra work is needed to ignore the signal when calculating the running means (as we had to do above in the estimation of the running means for the serum spectrum).

This is even more clearly illustrated by a comparison of baselines computed by the BXR algorithm and the LMS algorithm (Figures 6 and 7). Note that for the noise spectrum, the two baselines are extremely close to each other, except for an apparent edge effect at low frequencies for the LMS algorithm. In fact, except near the peak and at the low frequency edge, the estimates never differ by more than ±5%. However, in the areas of the serum spectrum that have signal, the estimate from the LMS algorithm is pulled up toward the signal drastically, reaching up to more than three times as high as the BXR estimate. In the areas with little or no signal (frequencies greater than 100 kHz), the LMS and BXR estimates are still within ±5% of each other. While there is a mild inflation of the baseline in the presence of signal in the BXR algorithm, it is only inflated up to 13% larger than the estimated baseline for the noise spectrum. Thus, it is clear that the BXR algorithm is far less sensitive to the presence of signal than the LMS algorithm.

## 4 Future Directions

One obvious question is how many of the results in this article are due to the particular experimental setup used to generate the spectra analyzed here and how many can be generalized. One encouraging sign is that the coefficients obtained in Section 3 are consistent in replicates; for a set of 56 noise spectra similar to the one displayed in Figure 2, the estimated values of
$\mathbb{E}\{{(1-{Y}_{t}^{\prime})}_{+}\}$ had a mean of 0.210011 and a standard deviation of 2.46 × 10^{−4}, while the estimated values for the standard deviation of
${Y}_{t}^{\prime}$ had a mean of 0.522584 and a standard deviation of 6.36 × 10^{−4}. Thus, it would seem to be justified to use the mean values of the two parameters for analyses on any spectrum generated on the same MALDI FT-ICR machine rather than having to calculate them individually for each spectrum. Whether or not these same numbers would apply to other MALDI FT-ICR machines is unknown, but it seems likely that at the very worst, each experimenter could use the techniques described in this article to determine the appropriate numbers for his or her experimental setup and use those.

Additionally, we have concentrated on the case that the estimated quantity is $\mathbb{E}Y_t$ because that is the key quantity in MALDI FT-ICR MS. However, any measure of center $g(Y_t)$ which is homogeneous of degree 1 (i.e., $g(cY_t) = c \cdot g(Y_t)$) can be used instead. For example, replacing $g(Y_t) = \mathbb{E}Y_t$ with $g(Y_t) = \text{median}(Y_t)$ in the calculations for the noise spectrum from Figure 2 gives us $\sigma_t = 0.5570115\,\hat{b}_t$ and $A_{2,t} = 1/(0.3796328\,\hat{b}_t)$. Plotting the ratio of the result of running the BXR algorithm with these parameters to the running median with bandwidth 8001 gives a picture that is virtually identical to the distribution-specific panel of Figure 4.

Several variants of the BXR algorithm could be useful. One possibility is to try penalizing different order derivatives rather than the second. This would involve changing $\mathbf{\Delta}_2$ by changing $\boldsymbol{M}_2$. For example, if we wanted to penalize large values of the fourth derivative, we could use $\mathbf{\Delta}_4 = \boldsymbol{M}_4'\boldsymbol{A}_1\boldsymbol{M}_4$, where

$$\boldsymbol{M}_4 = \begin{pmatrix} 1 & -4 & 6 & -4 & 1 & 0 & \cdots & 0 \\ 0 & 1 & -4 & 6 & -4 & 1 & \cdots & 0 \\ \vdots & & \ddots & & & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -4 & 6 & -4 & 1 \end{pmatrix}$$

is $(n-4) \times n$ (and $\boldsymbol{A}_1$ would then be $(n-4) \times (n-4)$). As in Section 2.2, the resulting Hessian is almost certainly negative definite (unless the baseline is below all but at most three points of the spectrum). However, it appears to be difficult to adequately smooth the spectrum in this way, since using a large enough $A_1^{\ast}$ to get a reasonably smooth estimate causes the Hessian to be computationally singular.
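The difference matrices for any order share one construction: each row carries the alternating-sign binomial coefficients of that order, shifted one column per row. The sketch below builds such a matrix as a list of lists; the function name is hypothetical.

```python
import math


def difference_matrix(n, order):
    """Return the (n - order) x n matrix of order-th differences."""
    # Row stencil: alternating-sign binomial coefficients,
    # e.g. (1, -2, 1) for order 2 or (1, -4, 6, -4, 1) for order 4.
    stencil = [(-1) ** k * math.comb(order, k) for k in range(order + 1)]
    rows = []
    for r in range(n - order):
        row = [0] * n
        row[r:r + order + 1] = stencil
        rows.append(row)
    return rows


for row in difference_matrix(5, 2):
    print(row)
print(difference_matrix(6, 4)[0])
```

A fourth-order difference annihilates every cubic sequence, so the null space of the corresponding penalty has dimension four; a nontrivial cubic has at most three zeros, which is where the "all but at most three points" condition comes from.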

Another variant would be to allow
${A}_{1}^{\ast}$ to depend on *t*. Especially with MALDI FT-ICR spectra—which each show a large spike in baseline and variance near 53.75 kHz—it might be useful to incorporate
${A}_{1,t}^{\ast}$ into the formula, with an appropriate adjustment to **Δ**_{2}.

A third possibility would be to allow the matrix ** N** from Equation (2) to be non-diagonal. This is an especially attractive idea in light of the non-trivial autocorrelation of MALDI FT-ICR spectra.

A fourth possibility is to extend the score function to cases where the masses are not equally spaced. Although in MALDI FT-ICR MS we can use the frequencies as equally-spaced data, it is certainly possible that there are (or will be) technologies which will not generate equally-spaced data.

We also observe that in deriving the appropriate values for $A_{1,t}$ and $A_{2,t}$ to calculate a baseline, it was assumed that a baseline had already been estimated, which is obviously problematic in applications. This suggests an iterative process, where an initial baseline estimate is found (for example, by using the distribution-free BXR algorithm), then $A_{1,t}$ and $A_{2,t}$ are estimated using this baseline. The new values of $A_{1,t}$ and $A_{2,t}$ can then be used to re-estimate the baseline using the distribution-specific BXR algorithm, which will lead to new parameters, etc. In simulations, it appears this process does, in fact, converge to a stable result.

Finally, we note that although the BXR algorithm has been developed for one-dimensional data, the same principles should be applicable to higher-dimensional data, such as data generated by liquid chromatography mass spectrometry.

## Acknowledgments

**Funding**

This work was supported by the National Human Genome Research Institute (R01-HG003352); National Institute of Environmental Health Sciences Superfund (P42-ES04699); National Institutes of Health Training Program in Biomolecular Technology (2-T32-GM08799 to DAB); and the Ovarian Cancer Research Fund.

The authors would like to thank Scott Kronewitter and Carlito Lebrilla (University of California at Davis, Department of Chemistry) for providing the MALDI FT-ICR spectra used in Section 3 and Ralph de Vere White (University of California at Davis Cancer Center, Division of Urology) for providing the serum sample used to generate the serum spectrum.

## Footnotes

^{1}Note that the values $\{x_t\}$ do not appear in $F$; the score function assumes equally-spaced data. Masses in MALDI FT-ICR spectra are *not* equally spaced, but the masses are not directly measured. Instead, they are derived from measured frequencies via one of several non-linear transformations [11], and the frequencies *are* equally spaced. Thus, it will be appropriate to use our generalization of Xi and Rocke’s score function without modification in Section 3.

