Logo of bmcbioiBioMed Centralsearchsubmit a manuscriptregisterthis articleBMC Bioinformatics
BMC Bioinformatics. 2008; 9: 324.
Published online Jul 29, 2008. doi:  10.1186/1471-2105-9-324
PMCID: PMC2516527

Baseline Correction for NMR Spectroscopic Metabolomics Data Analysis

Abstract

Background

We propose a statistically principled baseline correction method, derived from a parametric smoothing model. It uses a score function to describe the key features of baseline distortion and constructs an optimal baseline curve to maximize it. The parameters are determined automatically by using LOWESS (locally weighted scatterplot smoothing) regression to estimate the noise variance.

Results

We tested this method on 1D NMR spectra with different forms of baseline distortions, and demonstrated that it is effective for both regular 1D NMR spectra and metabolomics spectra with over-crowded peaks.

Conclusion

Compared with the automatic baseline correction function in XWINNMR 3.5, the penalized smoothing method provides more accurate baseline correction for high-signal density metabolomics spectra.

Background

Baseline distortions in 1D NMR spectra are mainly caused by the corruption of the first few data points in FID (free induction decay). These corrupted data points add low frequency modulations in the Fourier-transformed spectrum, and thus formed the distorted baseline. Correction of these distortions is a necessary step in NMR spectra data processing because they offset the intensity values and result in inaccuracy in peak assignment and quantification. These errors could be critical in the study of metabolomics, which involves many small but statistically significant peaks that are sensitive to baseline distortions. Incorrect quantification of these peaks may result in failures in detection of important metabolites or identifying potential biomarkers.

Existing automatic baseline distortion correction methods fall into two categories: time domain correction and frequency domain correction [1-9]. Time domain correction methods reconstruct the corrupted data points in FID to reduce the low frequency modulation [6-8]. Frequency domain correction methods construct baseline curves in the spectra directly, and subtract these baseline curves to remove the distortion [1-5]. These methods have been implemented in commercial software and hand-written programs for NMR data processing. Usually a combination of both time and frequency domain methods are applied to achieve better correction. In comparison, time domain methods provide general correction for FID and frequency domain methods provide more detailed correction on the spectrum itself. For processing a specific kind of NMR spectra, such as metabolomics spectra, frequency domain methods are easier to tailor for different needs.

In this article, we present a novel frequency domain baseline correction method for processing 1D spectra for metabolomics profiling. 1D proton NMR spectroscopy has been widely applied in metabolomics profiling because it can simultaneously measure many metabolites that lie above the detection limit. These metabolomics spectra usually have many crowded peaks such that noise regions are broken into smaller pieces and are more difficult to identify accurately. The term noise regions refer to the regions in the spectrum that only contain noise. In some part of the spectra, severely overlapping peaks form long gaps between noise regions and are a cause of numerical sensitivity in baseline construction. Traditional frequency domain methods detect noise regions and construct the baseline curve by interpolating among identified noise regions. These methods rely heavily on robust noise region identification, and therefore may not achieve optimal baseline correction in metabolomics spectra. For example, Golotvin et al [5] proposed identifying noise points by comparing the intensity range of a small neighborhood with the standard deviation of noise regions, which is estimated by dividing the spectrum into 32 sections and taking the minimum value of standard deviations of these sections. We observed that this method occasionally identifies the low signal points in metabolomics spectra as noise because they may overlap with each other and have reduced standard deviation, and as a result these signal points will be offset to zero after baseline correction. Noise standard deviation estimation is also theoretically biased to be smaller than the true value in a statistical view, and leads to additional inaccuracy in detection of noise data points.

As an alternative to the existing noise detection and interpolation approaches, we developed a new baseline correction method based on a penalized parametric smoothing model. This method fits a curve following the bottom envelope of the spectrum and doesn't need explicit identification of the noise data points. The primary motivation is that we model the baseline as a smooth curve of arbitrary form that goes through the noise region instead of linked pieces of selected noise points. We describe key features of this model by a score function and construct the optimal baseline curve corresponding to the function maximum. In addition, we present a more accurate estimation of noise variance by LOWESS (locally weighted scatterplot smoothing) regression and use it to determine the model parameters.

Methods

Suppose a 1D NMR spectrum is represented by the set of ordered pairs {xi, yi}, where xi is the abscissa in ppm units and yi is the ordinate, representing the height of the spectrum as a Fourier-transformed RF decay curve. The fundamental model behind our method is that the spectrum can be represented as

yi=bi+μieηi+εi,
(1)

where bi is the baseline, μi is the true signal, and ηi and εi are random errors normally distributed with mean 0 and variance 1, generally autocorrelated, a type of model that fits a wide variety of measurement data (Rocke and Lorenzato 1995 [10]; Rocke and Durbin 2001 [11]). An estimated baseline should be 1) smooth, but not necessarily flat; and 2) run through the middle of the data in segments where there is no signal. Based the on these features, we construct the following score function:

F(b)=ibiAi(bi+1+bi12bi)2Bi(biyi)2g(biyi)
(2)

And g(biyi)={1,biyi>00,biyi0 is the Heaviside step function.

b = {bi} is a set of points that represents a certain baseline. The optimal baseline curve b0 should maximize the score function F(b).

b0 = arg max F(b)
(3)

F(b) has three components. The first term ibi is the sum of all baseline points. We want to maximize it subject to the smoothness penalty Ai(bi+1+bi12bi)2 and the negativity penalty Bi(biyi)2g(biyi). The smoothness penalty is a discrete form of integral of squares of second-order derivatives, which is small for linear segments and large for small curvature radius. The negativity penalty is designed to be nonzero only when the baseline point is above the data point, by using the Heaviside step function g(bi - yi). It counteracts the uptrend of the first term and force the baseline to run through the middle of the data. By maximizing this function the baseline is pushed up to the spectrum but not exceeding the zero-signal level, and forced to be as smooth as possible to link peak regions.

The negative penalty parameter B is determined by the condition that the baseline should run through the center of the noise region. Take the simplest case of a spectrum with only normally distributed noise with variance σ2 and mean 0. The baseline should also be a horizontal line at y = b, so the summation term in the score function i=1nbi=nb. The smoothing term Ai(bi+1+bi12bi)2 = 0 because the baseline should be horizontal and has no curvature. The expectation value of negativity term could be calculated based on the probability density function (PDF) of the noise P(y).

<Bi(biyi)2g(biyi)>=nB(by)2g(by)P(y)dy=nBb(by)2P(y)dy
(4)

Where P(y)=12πσ2exp(y22σ2) is the PDF of normal distribution with variance σ2 and mean 0. The boundary of the integral (-∞, b) in equation (4) is determined by explicitly plug in the step function g(by)={1,y<b0,yb.

Hence the expected value of the score function <F(b) > becomes:

<F(b)>=nbnBb(by)212πσ2exp(y22σ2)dy
(5)

The estimated baseline should be at the zero intensity level, which means the score function reaches its maximum at b = 0.

<F>b|b=0=n2nσB2π=0,

This gives us the theoretical value of B

B=2π2σ1.25σ.
(6)

So the negativity penalty parameter B is determined by the noise standard deviation σ. We define the constant B=2π21.25 so that B is in the form of B=Bσ. By dividing by σ, the negativity penalty Bi(biyi)2g(biyi) will have the same order of the intensity as in the first linear summation term ibi in the score function, which guarantees that the maximal point of the score function remain invariant if the spectrum is multiplied by a scalar, so that the corresponding baseline curve will not be affected. For the same reason, the smoothing penalty parameter, denoted by A in the score function (2), should also take the form of A=CAσ to guarantee invariance in spectrum scaling, where A* is a constant and C is a coefficient related with the resolution of the spectra and will be discussed later.

For example, if we multiply the spectrum {xi, yi} by a scalar k, we get a new spectrum {xi, kyi}. The noise standard deviation of the scaled spectrum, denoted by σ', is also k times the original noise standard deviation σ: σ' = .

The score function for the original spectrum is:

F(b)=ibiCAσi(bi+1+bi12bi)2Bσi(biyi)2g(biyi)
(7)

And the score function for the scaled spectrum is:

F(b')=ibi'CAkσi(bi+1'+bi1'2bi')2Bkσi(bi'kyi)2g(bi'kyi)
(8)

The estimated baseline for scaled spectrum should also be also k times of the original spectrum baseline, thus b' = kb, or equivalently, bi' = kbi. Substitute into equation (8), we have

F(b')=ikbiCAkσi(kbi+1+kbi12kbi)2Bkσi(kbikyi)2g(kbikyi)=kF(b)

So F(b') = F(kb) = kF(b), which means the score function is in an invariant form for scaling. Multiplying the spectrum by a constant does not affect finding the optimal baseline by maximizing this score function.

In addition, the smoothness penalty should be robust to the abscissa resolution. For example, if we take half the data points (with odd indices) of the original spectrum so that the chemical shift interval is doubled, the baseline curve should not be affected. This suggests the coefficient C in A=CAσ is related with the abscissa resolution dx.

Recall the smoothing term is the sum of squared second order derivatives of {bi}, rewrite it in the generic form of discrete representation of second order derivative:

Ai(bi+1+bi12bi)2=Aσi(bi+1+bi12bidx2)2=Aσdx4i(bi+1+bi12bi)2

Therefore C has an inverse quadruple relation with the resolution dx of the abscissa. For a given spectrum, dx is inverse proportional to the number of data points n: dx1n, which means C could take the value of n4.

So A=CAσ=n4Aσ,
(9)

where A* is a constant that is independent of the spectrum. By comparing different spectra, we choose an empirically reasonable value of A* to be A* = 5 × 10-9.

Based on the above analysis, the score function takes the follow form

F(b)=ibin4Aσi(bi+1+bi12bi)2Bσi(biyi)2g(biyi)
(10)

where A* = 5 × 10-9, B* ≈ 1.25, σ is the standard deviation of noise and n is the total number of data points. The baseline curve is insensitive to small changes of A and B, unless the orders of magnitude are changed.

The estimation of the noise standard deviation σ is based on the model in equation (1), The variance of a certain part of spectrum is derived as the following

Var(y)=σε2+μ2S2,whereS2=exp(ση2)[exp(ση2)1]
(11)

It indicates that the signal variance Var(y) increases with the mean value of signal intensity μ. We can estimate the noise variance σε2 by fitting equation (11) on the signal variances and mean intensities sampled from the spectrum. We divide the spectrum into small regions and compute the variance and mean intensity within each regions. Figure Figure11 plots the variances versus mean values with region size of 32 data points, corresponding to 0.012 ppm in chemical shifts. We use LOWESS (locally weighted scatterplot smoothing) regression to fit equation (11). The red line in Figure Figure11 represents the fitted regression line. It has a quadratic form as expressed in equation (11).

Figure 1
LOWESS smoothing for variances estimation. Variances versus mean intensities sampled bins of 1D NMR metabolomics spectra. The bin size was set to 32 data points, corresponding to 0.012 ppm in chemical shift. The fitted LOWESS regression curve was plotted ...

Since Var(y)=σε2+μ2S2σε2 when μ ≈ 0, the noise variance σε2 is approximately equal to the signal variance Var(y) for small mean values. We take the predicted value of Var(y) at zero mean intensity in the LOWESS regression to be our estimate of σε2, and the standard deviation of the noise σ is the square root of the noise variance

σ=σε2

After determining the parameters, we maximize the function F(b) to find the baseline b0, according to equation (3). Mathematically, we solve F(b)b=0 to find the maximum of F(b). This partial derivative equation expands as a linear system with the solution to be b0. The numerical implementation of solving this linear system is attached in the appendix.

Results and discussion

Based on this penalized smoothing model, we test the baseline correction method on simple 1D NMR spectra and complex metabolomics spectra.

Figure Figure22 demonstrates this method corrects the baseline distortion of a simple 1D 1H NMR reference spectrum of DSS (2,2-Dimethyl-2-silapentane-5-sulfonic acid) with n = 65536 data points. The estimated noise standard deviation is σ = 8335.9. According to previous discussion of parameter determination, the parameters are set to A=n4Aσ=1.1×107 and B=Bσ=1.5×104. Figure Figure2A2A shows the original spectrum with apparent baseline distortions. This distorted baseline is detected by the penalized smoothing method in Figure Figure2B.2B. In Figure Figure2C,2C, this baseline curve is subtracted from the spectrum and the distortion is corrected. The optimal baseline found by our baseline model fits well with the distortion curve. The small peak at 2ppm in the spectrum is correctly presented after baseline subtraction.

Figure 2
Baseline correction by penalized parametric smoothing method. (A) Original 1D proton NMR spectrum of DSS reference with distorted baseline. (B) Detected baseline curve by penalized parametric smoothing method. (C) Corrected spectrum after baseline subtraction. ...

We test this method in more complicated metabolomics spectra collected from tissue samples of red abalone. The data are from a study of environmental stresses on the development of a bacterial infection among red abalones (Haliotis rufescens) [12,13]. The dataset include 65 1D proton NMR spectra with 65536 data points in each spectrum. In our test the penalized smoothing method correctly detected and removed the distorted baseline for all 65 spectra. Figure Figure33 shows the baseline correction result on one example of testing spectra using the penalized smoothing method. In Figure Figure3A,3A, the peaks of metabolites aggregate together and form continuous peak regions. Lack of noise points in these regions generates big gaps in baseline construction. As demonstrated in Figure Figure3B3B and Figure Figure3C,3C, the baseline distortion is correctly detected and removed. In these gaps, the baseline curve is constructed following the smoothing constraints in the penalized smoothing model. The penalized smoothing method does not require explicit identification of noise region identification and therefore avoids constructing the baseline by interpolation, which is sensitive to the identification of noise points, especially in the region with high peak densities where noise points is difficult to detect accurately.

Figure 3
Baseline correction for metabolomics spectrum by penalized parametric smoothing method. (A) Original spectrum with distorted baseline. (B) Detected baseline curve by penalized parametric smoothing method. (C) Corrected spectrum after baseline subtraction. ...

We compared the penalized smoothing method with commercial software in Figure Figure44 and Figure Figure5.5. We applied the penalized smoothing method and the automatic baseline correction function in XWIN-NMR software (version 3.5) carried by the Bruker AVANCE 600 NMR facility to the same spectrum with baseline distortion, and plot the baseline corrected spectrum in Figure Figure4.4. The red lines represent the ideal horizontal baseline at zero intensity level. Both methods are capable of removing large distortions of the baseline and setting the corrected baseline to near the zero intensity level. The corrected baseline by the penalized smoothing method (Figure (Figure4A)4A) fits well with the ideal horizontal baseline in red. In Figure Figure4B,4B, the corrected baseline by XWINNMR has apparently larger deviation from the zero intensity level.

Figure 4
Comparison of penalized smoothing method and XWIN-NMR 3.5. (A) Spectrum corrected by penalized parametric smoothing method. (B) Spectrum corrected by automatic baseline correction function in XWIN-NMR 3.5. The red lines represent the zero intensity level. ...
Figure 5
Comparison of baseline distortion detection by penalized smoothing methods and XWIN-NMR 3.5. This is an expanded region of the spectrum in Figure 4 before baseline correction. Baseline curves are detected by penalized parametric method (red) and XWIN ...

In Figure Figure55 we show a region from 6.5 ppm to 9.5 ppm of original spectrum to illustrate the difference of baselines detected by penalized smoothing methods (in red) and XWINNMR 3.5 (in black). The penalized smoothing baseline is closer to the center of noise regions and therefore detects the offsets of signals more accurately. This leads to a more accurate identification and quantification of signals in the corrected spectrum, especially for the small peaks at 7.08 ppm, 8.35 ppm and 8.94 ppm in Figure Figure5.5. For all 65 testing spectra, the penalized smoothing method shows significant improvement in presenting low intensity peaks in 47 spectra, corresponding to a percentage of 69%. To achieve comparable baseline correction quality in these spectra, XWINNMR 3.5 needs manually marking of the noise points or regions for baseline interpolation, which is time consuming for batch processing and is dependent on the experience of the software users. The penalized smoothing method is fully automatic and user independent.

Conclusion

We propose an alternative baseline correction method based on a penalized smoothing model. This method constructs the baseline by maximizing a score function (eq 2) that evaluates how well the baseline fits the spectrum. The model parameters A and B in score function are determined by the noise variance of the spectrum, σ2, which can be automatically estimated by LOWESS regression. This method does not require explicit identification of noise data points for baseline interpolation, or assumption of fixed forms of baseline curves, i.e. polynomials etc. Instead, it uses a general smoothing term to fit flexible forms of baseline distortion.

We applied this method to 1D NMR spectra with baseline distortion, and demonstrate it is effective for both regular 1D NMR spectra and metabolomics spectra with over-crowded peaks. The numerical implementation is fast and stable on common personal computers. This method provides an alternative to the existing noise detection and interpolation approaches in baseline correction, especially for spectra with many crowded peaks, such as in NMR metabolomics profiling, where noise points are more difficult to identify accurately. Compared with the widely used XWINNMR software, the method provided more accurate baseline correction on 47 out of 65 of our testing metabolomics spectra. With a few modifications, this penalized smoothing baseline correction approach is also applicable to 2D NMR spectra. The numerical implementation and optimization for 2D baseline correction could be one topic in further study.

Appendix

We solve Fb=0 to find the maximum of score function F(b), which leads to linear system of order n. (n is the number of data points in b).

Db = m

D is the Hessian matrix of F(b), i.e.Dij=2Fbibj, and m is a vector. In detail, the entries of D and m have the following entries:

mi={2Byi1,1bi>yibiyi;Dii=2Fbi2={12A+2B,12A,bi>yibiyi;Di,i±1=Di±1,i=2Fbibi±1=8A;Di,i±2=Di±2,i=2Fbibi±2=2A.

The boundary points may lack the neighbor points in the above formulas, and need to be treated separately. The non-existent terms in these equations are omitted:

D11=Dnn={2A+2B,2A,bi>yibiyi;D22=Dn1,n1={10A+2B,10A,bi>yibiyi;D12=D21=Dn,n1=Dn1,n=4A.

We use an iterated procedure to solve Db = m because the entries of D and m are dependent on b, and need to be updated in each iteration until it converges. This procedure is described in the following steps:

1) Set the initial value of baseline points b to be zeros.

2) determine D and m based on current values of b

3) solve Db = m for b

4) stop if the relative change of b is smaller than a threshold, otherwise go to step 2)

The Hessian matrix D is highly sparse, with all the non-zero elements in a symmetric band along the main diagonal. This kind of matrix structure can be solved very efficiently. In addition, the matrix is positive semi-definite, which guarantees the convergence of the iteration, so it is not sensitive to the initial value of b. We use MATLAB to implement the above algorithm and the code is available upon request.

Acknowledgements

The research reported in this paper was supported by grants from National Institute of Environmental Health Sciences, National Institutes of Health (P42-ES04699), from the Ecotoxicology Lead Campus Program (ELCP) of Toxic Substances Research and Teaching Program of the University of California., from the National Cancer Institute (P30-CA093373-04), from the UC Davis Health System, from the National Human Genome Research Institute (R01-HG003352), from the Air Force Office of Scientific Research (FA9550-07-1-0146) and from the Department of Energy (DE-FG02-07ER64341).

References

  • Bartels C, Güntert P, Wüthrich K. A new automatic baseline correction method for multidimensional NMR spectra with strong solvent signals. J Magn Reson Ser A. 1995;117:330–333. doi: 10.1006/jmra.1995.0780. [Cross Ref]
  • Brown D. Improved Baseline Recognition and Modeling of FT NMR Spectra. J Magn Reson Ser A. 1995;114:268–270. doi: 10.1006/jmra.1995.1138. [PubMed] [Cross Ref]
  • Saffrich R, Beneicke W, Neidig K, Kalbitzer H. Baseline correction in n-dimensional NMR spectra by sectionally linear interpolation. J Magn Reson Ser B. 1993;101:304–308. doi: 10.1006/jmrb.1993.1047. [Cross Ref]
  • Rouh A, Delsuc M, Bertrand G, Lallemand J. Baseline correction of FT NMR Spectra: An Approach in terms of classification. J Magn Reson Ser A. 1993;102:357–359. doi: 10.1006/jmra.1993.1117. [Cross Ref]
  • Golotvin S, Williams A. Improved Baseline Recognition and Modeling of FT NMR Spectra. J Magn Reson. 2000;146:122–125. doi: 10.1006/jmre.2000.2121. [PubMed] [Cross Ref]
  • Halamek J, Vondra V, Kasal M. The elimination of baseline distortions induced by audio filters. J Magn Reson Ser A. 1994;110:194–197. doi: 10.1006/jmra.1994.1204. [Cross Ref]
  • Marion D, Bax A. Baseline distortion in real-fourier-transform NMR-spectra. J Magn Reson. 1988;79:352–356.
  • Heuer A, Haeberlen U. A new method for suppressing baseline distortions in FT NMR. J Magn Reson. 1989;85:79–94.
  • Lunga G, Pogni R, Basosi R. A Simple Method for Baseline Correction in EPR Spectroscopy. J Magn Reson Ser A. 1994;108:65–70. doi: 10.1006/jmra.1994.1089. [Cross Ref]
  • Rocke D, Lorenzato S. A Two-Component Model for Measurement Error in Analytical Chemistry. Technometrics. 1995;37:176–184. doi: 10.2307/1269619. [Cross Ref]
  • Durbin B, Hardin J, Hawkins D, Rocke D. A Variance-Stabilizing Transformation for Gene Expression Microarray Data. Bioinformatics. 2002;18 Suppl 1:S105–110. [PubMed]
  • Rosenblum E, Tjeerdema R, Viant M. Effects of Temperature on Host-Pathogen-Drug Interactions in Red Abalone, Haliotis rufescens, Determined by 1H NMR Metabolomics. Env Sci Technol. 2006;40:7077–7084. doi: 10.1021/es061354e. [PubMed] [Cross Ref]
  • Rosenblum E, Viant M, Braid B, Moore J, Friedman C, Tjeerdema R. Characterizing the Metabolic Actions of Natural Stresses in the California Red Abalone, Haliotis rufescens using 1H NMR Metabolomics. Metabolomics. 2005;1:199–209. doi: 10.1007/s11306-005-4428-3. [Cross Ref]

Articles from BMC Bioinformatics are provided here courtesy of BioMed Central
PubReader format: click here to try

Formats:

Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...

Links

Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...