# SIRIUS: decomposing isotope patterns for metabolite identification^{†}

^{1}Lehrstuhl für Bioinformatik, Friedrich-Schiller-Universität Jena, 07743 Jena,

^{2}Organische Chemie I, Fakultät für Chemie, Universität Bielefeld, 33501 Bielefeld and

^{3}AG Genominformatik, Technische Fakultät, Universität Bielefeld, 33501 Bielefeld, Germany

^{†}A preliminary version of this article appeared under the title ‘Decomposing metabolomic isotope patterns’ in the Proceedings of the 6th Workshop on Algorithms in Bioinformatics, WABI 2006, in LNCS, Vol. 4175, Springer, pp. 12–23.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

**Motivation:** High-resolution mass spectrometry (MS) is among the most widely used technologies in metabolomics. Metabolites participate in almost all cellular processes, but most metabolites still remain uncharacterized. Determination of the sum formula is a crucial step in the identification of an unknown metabolite, as it reduces its possible structures to a hopefully manageable set.

**Results:** We present a method for determining the sum formula of a metabolite solely from its mass and the natural distribution of its isotopes. Our input is a measured isotope pattern from a high resolution mass spectrometer, and we want to find those molecules that best match this pattern. Our method is computationally efficient, and results on experimental data are very promising: for orthogonal time-of-flight mass spectrometry, we correctly identify sum formulas for >90% of the molecules, ranging in mass up to 1000 Da.

**Availability:** SIRIUS is available under the LGPL license at http://bio.informatik.uni-jena.de/sirius/

**Contact:** anton.pervukhin@minet.uni-jena.de

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

## 1 INTRODUCTION

High-resolution mass spectrometry (MS) allows determining the mass of sample molecules with very high accuracy (1–5 p.p.m.), and has become a preferred method of analyzing metabolites. The output of a mass spectrometer, after preprocessing, consists of peaks that ideally correspond to the masses of the sample molecules and their abundance. This brings into play the natural isotopic distributions of the elements: several peaks in the output correspond to the same type of sample molecule, reflecting its isotope pattern. In this article, we use this isotope pattern to identify the sample molecule by determining its molecular formula or sum formula, i.e. the number of atoms of each element.

The term “metabolite” is usually restricted to small molecules that are intermediates and products of the metabolism. These small molecules participate in almost all cellular processes such as signal transduction, stress response, catabolism, or anabolism. Today, databases mostly contain primary metabolites directly relevant for growth, development, and reproduction of a cell or an organism. In contrast, most of the metabolites not directly involved in the aforementioned functions are yet uncharacterized. The majority of known metabolites have mass <1000 Da: 96.5% of sum formulas in the KEGG COMPOUND database (Kanehisa *et al*., 2006) fall into this mass range. To identify a sample metabolite, its mass spectrum is compared to spectra in a reference database. This method is limited to identifying metabolites and chemical compounds that have been included in some reference mass spectra library. Hence, *de novo* interpretation of metabolite mass spectra is highly sought.

Our input is a list of masses *M*_{0},…,*M*_{K} with intensities *f*_{0},…,*f*_{K}, normalized such that ∑_{i} *f*_{i} = 1. We extract this data from a high-resolution mass spectrum in a preprocessing step, and assume that it corresponds to the isotope pattern of a sample molecule. Note that for molecular mixtures, separating isotopic peaks that belong to different molecules is mostly trivial in this case. Our goal is to find the molecule, or rather its sum formula, whose isotope pattern best matches the input. In the following, we use ‘molecule’ and ‘sum formula’ interchangeably. We stress that this method cannot be used as-is to identify peptides or amino acid compositions, because certain sum formulas correspond to multiple peptides.

To resolve the sum formula, Kind and Fiehn (2006) suggest proceeding as follows: first, compute all sum formulas that share the monoisotopic mass of the input mass spectrum. For every such candidate molecule, simulate its isotope pattern, and match and rank it against the input isotope pattern. Several experimental studies using this setup have been reported in the literature, the most prominent by Iijima *et al*. (2008) who claim to have discovered almost 500 novel metabolites in tomato (*Solanum lycopersicum*). Zhang *et al*. (2005) address a related problem for analyzing peptides, but this approach heavily builds on several *ad hoc* rules regarding admissible sum formulas and uses a heuristic search. Böcker and Rasche (2008) present a method for the automated interpretation of metabolite tandem mass spectra but ignore isotope patterns.

In this article, we present efficient algorithms for all steps of the analysis pipeline suggested above. We limit our presentation to the elements most abundant in living beings, but note that our methods also work for other sets of elements. We first show how to use integer decomposition techniques (Böcker and Lipták, 2007) for decomposing real-valued masses, with large improvements over naïve approaches. Second, we present a method for rapid computation of isotope distributions and mean masses of isotope peaks, improving on previously best known results (Rockwood *et al*., 2004). Fast simulation of isotope patterns is vital due to the large search space. Third, we show how to rapidly match and rank such simulated spectra against the measured spectrum. We then report on the application of our method to high-resolution mass spectra. Finally, we present the software tool SIRIUS (Sum formula Identification by Ranking Isotope patterns Using mass Spectrometry) that implements all of the above algorithms and combines them with an easy-to-use graphical user interface.

## 2 ISOTOPES AND ISOTOPE PATTERNS

The elements most abundant in living beings are hydrogen (H), carbon (C), nitrogen (N), oxygen (O), phosphorus (P) and sulfur (S). Atoms with the same number of protons but different number of neutrons are called *isotopes* of the element. Each of these isotopes occurs in nature with a certain abundance, and we limit our attention to these naturally occurring isotopes. The superscript preceding a symbol denotes the *mass number* of the atom: the number of protons plus the number of neutrons. The *mass* of an atom is measured in daltons (Da). An atom's mass is roughly but not exactly equal to its mass number, the difference being due to the binding energy in the atom's nucleus. The masses of the different isotopes and their abundance are known up to very high precision (Audi *et al*., 2003) (see Table 1 for the six elements described above). Note that unlike their mass, natural abundances of isotopes are not physical constants. Values may slightly vary depending on, say, the continent where a sample was taken.

The *nominal mass* (also called *nucleon number*) of a molecule is the sum of protons and neutrons of the constituting atoms. The *mass* of the molecule is the sum of masses of these atoms. Clearly, nominal mass and mass depend on the isotopes the molecule consists of, thus on the *isotope species* of the molecule. The isotope species where each atom is the isotope with the lowest nominal mass is called *monoisotopic*. Likewise, the mass of the monoisotopic species is called the *monoisotopic mass* of the molecule. Mass defects and, hence, differences from the ideal mass depend on the elemental composition of a molecule. For example, sucrose (C_{12}H_{22}O_{11}) has monoisotopic mass 342.116215 Da, whereas the short peptide Leu-Asn-Pro (C_{15}H_{26}N_{4}O_{5}) has monoisotopic mass 342.190321 Da; both molecules have monoisotopic nominal mass 342.
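The sucrose and Leu-Asn-Pro example can be checked with a few lines of Python. This is only a sketch: the element masses below are commonly tabulated monoisotopic masses assumed for illustration (they are not copied from Table 1), and the function names are hypothetical.

```python
# Monoisotopic masses (Da) of the lightest natural isotope of each element.
# Assumed, commonly tabulated values; not taken from the article's Table 1.
MONO_MASS = {"H": 1.00782503207, "C": 12.0, "N": 14.0030740048,
             "O": 15.99491461956, "P": 30.97376163, "S": 31.97207100}
# Mass numbers (protons + neutrons) of those lightest isotopes.
NOMINAL = {"H": 1, "C": 12, "N": 14, "O": 16, "P": 31, "S": 32}

def monoisotopic_mass(formula):
    """Sum of monoisotopic atom masses; formula maps element -> atom count."""
    return sum(MONO_MASS[e] * n for e, n in formula.items())

def nominal_mass(formula):
    """Monoisotopic nominal mass: sum of mass numbers of the lightest isotopes."""
    return sum(NOMINAL[e] * n for e, n in formula.items())

sucrose = {"C": 12, "H": 22, "O": 11}
leu_asn_pro = {"C": 15, "H": 26, "N": 4, "O": 5}

for name, f in (("sucrose", sucrose), ("Leu-Asn-Pro", leu_asn_pro)):
    print(name, nominal_mass(f), round(monoisotopic_mass(f), 6))
```

Both formulas share the nominal mass 342 but differ by about 0.074 Da in monoisotopic mass, which is the kind of difference that a high-resolution instrument can resolve.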

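The number of distinct isotope species for a molecule with *i*_{H} hydrogen, *i*_{C} carbon, *i*_{N} nitrogen, *i*_{O} oxygen, *i*_{P} phosphorus and *i*_{S} sulfur atoms is

$$(i_{\mathrm{H}}+1)\,(i_{\mathrm{C}}+1)\,(i_{\mathrm{N}}+1)\,\binom{i_{\mathrm{O}}+2}{2}\,\binom{i_{\mathrm{S}}+3}{3}.$$

This follows because for an element *E* with *r* isotope types, a molecule *E*_{l} consisting of *l* atoms of the element has $\binom{l+r-1}{r-1}$ different isotope species (hydrogen, carbon and nitrogen have two natural isotopes each, oxygen three, phosphorus one and sulfur four). The probability that a certain isotope species occurs can be computed by multiplying the probabilities of the underlying isotopes.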
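To get a feeling for how quickly the number of isotope species grows, the counting argument can be sketched in Python. The table of isotope counts (2, 2, 2, 3, 1, 4 natural isotopes for H, C, N, O, P, S) and the function name are assumptions made for this illustration.

```python
from math import comb

# Number of natural isotopes per element (assumed for this sketch).
ISOTOPE_COUNT = {"H": 2, "C": 2, "N": 2, "O": 3, "P": 1, "S": 4}

def species_count(formula):
    """Number of distinct isotope species of a sum formula.

    For l atoms of an element with r isotopes there are C(l + r - 1, r - 1)
    ways to distribute the atoms over the isotopes; elements are independent,
    so the counts multiply.
    """
    total = 1
    for element, l in formula.items():
        r = ISOTOPE_COUNT[element]
        total *= comb(l + r - 1, r - 1)
    return total

print(species_count({"C": 12, "H": 22, "O": 11}))  # sucrose
```

Even a small molecule such as sucrose has tens of thousands of isotope species, which is why summing over all of them is avoided later on.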

Even with high-resolution MS, it is often impossible to resolve isotope species with identical nominal mass. Instead, these isotope species appear as one single peak in the MS output. For this reason, we merge isotope species with identical nominal mass; we refer to the resulting distribution as the molecule's *isotopic distribution*.

For each element *E* we define two discrete random variables, denoted *X*_{E} and *Y*_{E}, representing the mass and the mass number, respectively. For example, *X*_{C} with state space {12, 13.003355}, *Y*_{C} with state space {12, 13} and ℙ(*X*_{C} = 12) = ℙ(*Y*_{C} = 12) = 0.98890, ℙ(*X*_{C} = 13.003355) = ℙ(*Y*_{C} = 13) = 0.01110 are the random variables of carbon. Given a molecule consisting of *l* atoms, we assign to the *i*-th atom, *i* = 1,…,*l*, two random variables *X*_{i} and *Y*_{i}, where *X*_{i} ∼ *X*_{E} and *Y*_{i} ∼ *Y*_{E}, with *E* being the corresponding element. Now we can represent the molecule's *mass distribution* by the random variable *X* := *X*_{1} + … + *X*_{l}, and its nominal mass distribution, or *isotopic distribution*, by *Y* := *Y*_{1} + … + *Y*_{l}. In an ideal mass spectrum, normalized peak intensities correspond to the isotopic distribution of the molecule. Note that *X* and *Y* are correlated, since *X*_{E} can be viewed as a function of *Y*_{E} and *E*.

We refer to the peak at the monoisotopic mass as the monoisotopic peak, which is followed by the +1, +2, … peaks. What is the mass of the +k peak, which is a superposition of several isotope species? It is reasonable to assume that its mass is the mean mass of all isotope species that add to its intensity (Rockwood *et al*., 2004): for a molecule with monoisotopic nominal mass *N*, let *X* = *X*_{1} + ··· + *X*_{l} be the mass distribution and *Y* = *Y*_{1} + ··· + *Y*_{l} be the isotopic distribution. The mean peak mass of the +k peak is then *m*_{k} = 𝔼(*X* | *Y* = *N*+*k*). We refer to the isotopic distribution together with the mean peak masses as the molecule's *isotope pattern*.

## 3 METHODS AND ALGORITHMS

### 3.1 Real-valued decompositions

We first concentrate on the problem of decomposing the monoisotopic mass *M*_{0}. We want to find all molecules with monoisotopic mass in the interval [*l*,*u*] ⊆ ℝ where *l* := *M*_{0}−ε and *u* := *M*_{0}+ε for some measurement inaccuracy ε. Formally, we search for all solutions of the integer knapsack equation (Kellerer *et al*., 2004)

$$l \le \sum_{j=1}^{n} a_j c_j \le u \qquad (1)$$

where *a*_{j}, *j* = 1,…,*n*, are real-valued monoisotopic masses of elements satisfying *a*_{j} ≧ 0. We search for all solution vectors *c* = (*c*_{1},…,*c*_{n}) such that all *c*_{j} are non-negative integers. We may assume *a*_{1} < *a*_{2} < … < *a*_{n}.

A straightforward solution is to enumerate all vectors *c* with *c*_{1} = 0 and ∑_{j} *a*_{j}*c*_{j} ≤ *u* and next to test if there is some *c*_{1} ≧ 0 such that ∑_{j} *a*_{j}*c*_{j} ∈ [*l*,*u*]. This results in Θ(*M*_{0}^{n−1}) running time, for constant element masses. Alternatively, we can compute all potential decompositions up to some upper bound *U* during preprocessing, sort them with respect to mass and use binary search; this results in Θ(*U*^{n}) space requirement. These approaches are unfavorable in theoretical complexity as well as in practice: for the elements CHNOPS there exist more than 7 × 10^{9} molecular formulas with mass ≤1500 Da.

In the remainder of this section, we transform the integer knapsack problem with real-valued coefficients into a problem instance with *integer* coefficients. We will show in the next section how to efficiently solve such instances. Choosing a *blowup factor b* ∈ ℝ, corresponding to precision 1/*b*, we can round coefficients by ϕ(*x*) := ⌈*bx*⌉, so *a*′_{j} := ϕ(*a*_{j}) and *l*′ := ϕ(*l*), *u*′ := ϕ(*u*) form an integer knapsack. Precision 1/*b* is merely a parameter of the decomposition algorithm and in principle independent of the measurement mass accuracy ε. To avoid rounding error accumulation, precision is usually set one to two orders of magnitude smaller than the measurement accuracy. Now, certain solutions *c* of the integer coefficient knapsack are no solutions of the real-valued coefficient knapsack and vice versa. We can easily sort out false positive solutions by checking (1), resulting in additional running time. We now concentrate on the more intriguing problem of false negative solutions that are missed by the integer coefficient knapsack.

Clearly, ∑_{j} *a*_{j}*c*_{j} ≧ *l* implies ∑_{j} *a*′_{j}*c*_{j} ≧ *l*′ since all *a*′_{j}*c*_{j} are integers. We have to increase the upper bound *u*′ to guarantee that all solutions of (1) are generated. We define relative rounding errors

$$\Delta_j = \Delta_j(b) := \frac{\lceil b a_j \rceil - b a_j}{a_j} = \frac{a'_j - b a_j}{a_j}, \qquad j = 1,\dots,n,$$

and note that $0 \le \Delta_j < 1/a_j$. Let Δ = Δ(*b*) := max {Δ_{j}}. If *c* satisfies ∑_{j} *a*_{j}*c*_{j} ≤ *u* then $\sum_j a'_j c_j \le bu + \Delta u$: clearly, $a'_j = b a_j + \Delta_j a_j$, and our claim follows from

$$\sum_j a'_j c_j = b \sum_j a_j c_j + \sum_j \Delta_j a_j c_j \le bu + \Delta \sum_j a_j c_j \le bu + \Delta u.$$

One can easily check that this bound is tight. So, we re-define the integer interval by *u*′ := ⌊*bu* + Δ *u* ⌋. Without rounding correction we have to decompose (*u*−*l*) *b* integers, but rounding correction forces us to decompose an additional Δ *u* integers, independent of the interval size *u*−*l*. As an example, consider the elements CHNOPS and blowup factor *b* = 10^{5}, then Δ(*b*) = Δ_{H} (*b*) = 0.492936. So for *M*_{0} = 1000, we have to decompose an additional 492 integers. Clearly, increasing *b* usually decreases Δ(*b*). We stress that the running time of this approach is dominated by the number of *decompositions* of these integers, and not by the number of integers itself.
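The rounding correction can be reproduced numerically. The following Python sketch computes the relative rounding errors for the elements CHNOPS and the corrected integer interval; the element masses are assumed, commonly tabulated monoisotopic masses, and the function names are hypothetical.

```python
from math import ceil, floor

# Assumed monoisotopic element masses in Da (not copied from Table 1).
MONO_MASS = {"H": 1.00782503207, "C": 12.0, "N": 14.0030740048,
             "O": 15.99491461956, "P": 30.97376163, "S": 31.97207100}

def rounding_errors(b, masses=MONO_MASS):
    """Relative rounding error (ceil(b*a) - b*a) / a for every element mass a."""
    return {e: (ceil(b * a) - b * a) / a for e, a in masses.items()}

def integer_interval(lower, upper, b, masses=MONO_MASS):
    """Integer bounds (l', u') that lose no solution of the real-valued problem."""
    delta = max(rounding_errors(b, masses).values())
    return ceil(b * lower), floor(b * upper + delta * upper)

errors = rounding_errors(1e5)
worst = max(errors, key=errors.get)
print(worst, errors[worst])                  # element with the largest error
print(floor(errors[worst] * 1000))           # extra integers for M0 = 1000
print(integer_interval(999.995, 1000.005, 1e5))
```

With blowup 10^5 the largest error is attained for hydrogen, in line with the value of about 0.4929 quoted in the text. Candidates found in the widened integer interval still have to be checked against the real-valued bounds to discard false positives.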

### 3.2 Integer decompositions

Assume that both the element masses *a*_{1},…,*a*_{n} and the query mass *m* are positive integers. We are looking for all non-negative integer vectors (*c*_{1},…,*c*_{n}) satisfying (1) (with *l* = *u* = *m*). This is a well-studied problem, referred to in its different variants as Coin Change Problem, Change Making Problem or Money Changing Problem, and can be solved with a simple dynamic programming algorithm in pseudo-polynomial time (Martello and Toth, 1990). The main disadvantage of this approach is its rather large memory requirement, which again depends on the maximal mass *U* we want to decompose.

Böcker and Lipták (2007) present an algorithm for determining all such decompositions with running time *O*(*na*_{1} · γ(*m*)) and space *O*(*na*_{1}), where *a*_{1} is the smallest mass and γ(*m*) the number of decompositions of mass *m*. We briefly sketch the algorithm. Given integer masses *a*_{1} ≤ … ≤ *a*_{n}, a data structure of size *na*_{1}, referred to as the *Extended Residue Table* (ER table), is computed in a preprocessing step. Entry ER(*r*,*i*), for *r* = 0,…,*a*_{1}−1 and *i* = 1,…,*n*, is the smallest number congruent to *r* modulo *a*_{1} which is decomposable over *a*_{1},…,*a*_{i}. Thus, the last column ER(·,*n*) of the table gives, for each residue *r*, the smallest number congruent to *r* modulo *a*_{1} that is decomposable over *a*_{1},…,*a*_{n}. Computation time is *O*(*na*_{1}). All decompositions of the query *m* are then recursively generated, limiting the number of unsuccessful paths by using information from the ER table. As a result, the running time of the algorithm is proportional only to the size of the table *na*_{1} and the number of decompositions γ(*m*), and does not depend directly on the input *m*. For decomposing molecule masses, this decomposition technique has several advantages over classical dynamic programming, such as improved running times and preprocessing independent of the largest mass we want to decompose in the future. In this application, the approach uses only one fifteenth of the memory and shows better running times.
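The role of the ER table can be illustrated with a small Python sketch. It fills the table column by column with a round-robin pass over the residues and then enumerates decompositions recursively, using the table only to prune branches that cannot succeed. This is a simplified variant written for illustration, not the implementation of Böcker and Lipták (2007); it does not reach their running time bound, and all names are hypothetical.

```python
from math import gcd, inf

def extended_residue_table(masses):
    """table[i][r] = smallest integer congruent to r modulo masses[0] that is
    decomposable over masses[0..i], or inf if no such integer exists."""
    a1 = masses[0]
    column = [inf] * a1
    column[0] = 0                      # 0 is decomposable over a1 alone
    table = [column[:]]
    for ai in masses[1:]:
        d = gcd(a1, ai)
        for p in range(d):             # adding ai never leaves the class p mod d
            n = min(column[q] for q in range(p, a1, d))
            if n == inf:
                continue
            for _ in range(a1 // d - 1):
                n += ai                # try one more copy of ai
                r = n % a1
                n = min(n, column[r])  # keep the smaller witness
                column[r] = n
        table.append(column[:])
    return table

def decompositions(m, masses, table=None):
    """Yield all count vectors c with sum(c[i] * masses[i]) == m."""
    if table is None:
        table = extended_residue_table(masses)
    a1 = masses[0]

    def rec(rest, i):
        if i == 0:
            if rest % a1 == 0:
                yield [rest // a1]
            return
        ai = masses[i]
        for ci in range(rest // ai + 1):
            sub = rest - ci * ai
            # sub is decomposable over masses[0..i-1] iff it is at least
            # as large as the table entry for its residue
            if sub >= table[i - 1][sub % a1]:
                for head in rec(sub, i - 1):
                    yield head + [ci]

    if m >= table[-1][m % a1]:
        yield from rec(m, len(masses) - 1)

print(extended_residue_table([3, 5, 7])[-1])
print(sorted(decompositions(10, [3, 5, 7])))
```

Because the table depends only on the element masses, it can be computed once and reused for every query mass.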

The number of decompositions γ(*m*) for an integer mass *m* over coprime integers *a*_{1},…,*a*_{n} asymptotically behaves like a polynomial of degree *n*−1 in *m* (Wilf, 1990). Following Beck *et al*. (2001), we can approximate the number of molecules over the elements CHNOPS with *real mass* in the interval [*M*,*M*+ϵ] by

We can also approximate the number of molecules with mass *up to M* by integrating (2). In Figure 1, we plot the number of decompositions for masses of up to 1500 Da over the elements CHNOPS.

### 3.3 Simulating isotope patterns

We first observe that for elements CHNOPS, all molecules have isotopic distributions that decrease rapidly with increasing mass. In particular, we can restrict ourselves to computing the first *K* non-zero values of the distribution, for rather small *K* such as *K*=10. For example, amongst 11 479 entries in the KEGG COMPOUND database with mass ≤3000 Da, no molecule has intensity of the +10 peak larger than 0.00007.

The elements hydrogen, carbon and nitrogen have only two natural isotopes. Thus, the isotopic distribution of a molecule *E*_{l} consisting of *l* identical atoms of type *E* with *E* ∈ {H,C,N} follows a binomial distribution: let *q*_{k} denote the probability that *E*_{l} has nominal mass *N*+*k*, where *N* is the monoisotopic nominal mass of *E*_{l}. Then, $q_k = \binom{l}{k} p^{\,l-k} (1-p)^k$, where *p* is the probability of the monoisotopic isotope. The values of the *q*_{k} can be computed iteratively, since $q_{k+1} = \frac{l-k}{k+1} \cdot \frac{1-p}{p} \cdot q_k$ for *k* ≧ 0, thus computation time is *O*(*K*).
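For a two-isotope element the iteration takes only a few lines. The sketch below assumes the carbon abundance 0.98890 from the example in Section 2; the function name and the default *K* = 10 are choices made for illustration.

```python
def binomial_isotopic_distribution(l, p, K=10):
    """First K values q_0..q_{K-1} of the isotopic distribution of l identical
    atoms of a two-isotope element, where p is the monoisotopic abundance.

    Uses q_0 = p**l and q_{k+1} = q_k * (l - k) / (k + 1) * (1 - p) / p.
    """
    q = [0.0] * K
    q[0] = p ** l
    for k in range(min(K - 1, l)):     # q_k is zero for k > l
        q[k + 1] = q[k] * (l - k) / (k + 1) * (1 - p) / p
    return q

# Twelve carbon atoms, as in sucrose.
print(binomial_isotopic_distribution(12, 0.98890, K=4))
```

Each value is obtained from its predecessor with a constant number of operations, so the first *K* values cost *O*(*K*) time regardless of the number of atoms.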

Where an element *E* has *r* > 2 isotopes (such as oxygen and sulfur), the isotopic distribution of *E*_{l} can in theory be computed as follows: let *p*_{i} for *i* = 0,…,*r*−1 denote the probability of occurrence of the *i*-th isotope. Then, the probability that *E*_{l} has nominal mass *N*+*k* is $\sum \binom{l}{l_0,\dots,l_{r-1}} \prod_{i=0}^{r-1} p_i^{\,l_i}$, where the sum runs over all *l*_{0},…,*l*_{r−1} ≧ 0 satisfying $\sum_i l_i = l$ and $\sum_i i \cdot l_i = k$ (Hsu, 1984). However, this computation is infeasible in practice.

Given two discrete random variables *Y* and *Y*′ with state spaces Ω,Ω′ ⊆ ℕ, we can compute the distribution of the random variable *Z* := *Y*+*Y*′ by folding the distributions,

$$\mathbb{P}(Z = n) = \sum_{k=0}^{n} \mathbb{P}(Y = k)\,\mathbb{P}(Y' = n-k).$$

If we restrict ourselves to the first *K* values of this distribution, we can compute it in time *O*(*K*^{2}). Kubinyi (1991) suggests computing the isotopic distributions of oxygen O_{l} and sulfur S_{l} by successive folding of the respective distribution: using a Russian multiplication scheme for the folding, this results in an algorithm with running time *O*(*K*^{2} log *l*). In applications, we do not compute these distributions on the fly but during preprocessing, for all *l* ≤ *L* fixed. This results in *O*(*KL*) memory for every such element, where *L* is small: 128 oxygen atoms already have a mass of about 2048 Da, exceeding the relevant mass range. For molecules consisting of different elements, we first compute or look up the isotopic distributions of the individual elements, and then combine these distributions by folding in *O*(*n* · *K*^{2}) time.
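Truncated folding and the Russian multiplication scheme (repeated squaring) can be sketched as follows. The oxygen abundances are assumed, commonly tabulated values rather than the article's Table 1, and the function names are hypothetical.

```python
def fold(p, q, K=10):
    """First K values of the distribution of a sum of two independent
    variables, indexed by offset from the monoisotopic nominal mass."""
    out = [0.0] * K
    for i, pi in enumerate(p[:K]):
        for j, qj in enumerate(q[:K - i]):
            out[i + j] += pi * qj
    return out

def power_distribution(single, l, K=10):
    """Isotopic distribution of l identical atoms by repeated squaring,
    which needs O(log l) folds instead of l - 1."""
    result = [1.0] + [0.0] * (K - 1)             # zero atoms: all mass at offset 0
    base = (list(single) + [0.0] * K)[:K]
    while l > 0:
        if l & 1:
            result = fold(result, base, K)
        base = fold(base, base, K)
        l >>= 1
    return result

# Assumed natural abundances of 16O, 17O and 18O.
OXYGEN = [0.99757, 0.00038, 0.00205]
print(power_distribution(OXYGEN, 11, K=5))       # e.g. the O11 part of sucrose
```

Truncating after *K* values is safe here because the first *K* values of a folded distribution depend only on the first *K* values of its two inputs.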

Using Fourier transforms of isotope distributions, we can multiply Fourier transforms instead of folding these distributions (Rockwood and Van Orden, 1996). Doing so we can eventually replace the *K*^{2} factor in the algorithm's running time by a *K*log*K* factor. As we limit our attention to small *K* such as *K*=10, this will not result in a speedup of the algorithm in practice. Also, this approach may face the problem of numerical errors.

We now come to the more challenging problem of efficiently computing the mean peak masses of a distribution. Doing so using the definition *m*_{k} = 𝔼(*X* | *Y* = *N*+*k*) is highly inefficient, because we have to sum up over all isotope species. Pruning strategies have been developed to speed up computation (Yergey, 1983), but pruning leads to a loss of accuracy (Rockwood *et al*., 2004). We now present a simple recurrence for computing these masses analogous to the folding of distributions: let *Y* = *Y*_{1} + ··· + *Y*_{l} and *Y*′ = *Y*′_{1} + ··· + *Y*′_{l′} be isotopic distributions of two molecules with monoisotopic nominal masses *N* and *N*′, respectively. Let *p*_{k} := ℙ(*Y* = *N*+*k*) and *q*_{k} := ℙ(*Y*′ = *N*′+*k*) denote the corresponding probabilities, *m*_{k} and *m*′_{k} the mean peak masses of the +k peaks. Consider the random variable *Z* = *Y* + *Y*′ with monoisotopic nominal mass Ñ = *N* + *N*′.

#### Theorem 1. —

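*The mean peak mass of the +k peak of the random variable Z = Y + Y′ can be computed as:*

$$\tilde{m}_k = \frac{1}{\sum_{j=0}^{k} p_j\, q_{k-j}} \sum_{j=0}^{k} p_j\, q_{k-j}\, \bigl(m_j + m'_{k-j}\bigr),$$

*where* $\tilde{m}_k$ *denotes this mean peak mass and* $m'_j$ *the mean peak mass of the +j peak of Y′.*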

Note that $m_k = \mathbb{E}(X \mid Y = N+k) = \frac{1}{p_k} \sum_x x \cdot \mathbb{P}(X = x,\, Y = N+k)$. Since by independence, ℙ(*Y*_{1}=*N*_{1},…,*Y*_{l}=*N*_{l}) = ∏_{i} ℙ(*Y*_{i}=*N*_{i}), the theorem follows by rearranging summands. We omit the formal proof.

The theorem allows us to ‘fold’ mean peak masses of two distributions to compute the mean peak masses of their sum. This implies that we can compute mean peak masses as efficiently as the distribution itself. This improves on the previously best known method (Rockwood *et al*., 2004), replacing the linear running time dependence on the number of atoms by its logarithm.
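Folding of mean peak masses can be sketched in the same style as folding of probabilities: the +k peak of the combined molecule collects all pairs of a +j peak and a +(k−j) peak, each pair contributing the sum of the two mean masses weighted by the product of the two probabilities. The Python function below is an illustrative sketch with hypothetical names; the carbon and hydrogen values in the usage example are assumed table values.

```python
def fold_pattern(p, m, q, mq, K=10):
    """Fold two isotope patterns.

    p, q   -- peak probabilities, indexed by offset from the monoisotopic peak
    m, mq  -- mean peak masses of the corresponding peaks
    Returns (probabilities, mean peak masses) of the first K peaks of the sum.
    """
    probs, masses = [], []
    for k in range(K):
        weight = 0.0
        weighted_mass = 0.0
        for j in range(k + 1):
            if j < len(p) and k - j < len(q):
                w = p[j] * q[k - j]
                weight += w
                weighted_mass += w * (m[j] + mq[k - j])
        probs.append(weight)
        # a peak with zero probability has no defined mass; 0.0 is a placeholder
        masses.append(weighted_mass / weight if weight > 0 else 0.0)
    return probs, masses

# Assumed single-atom patterns of carbon and hydrogen.
carbon = ([0.98890, 0.01110], [12.0, 13.003355])
hydrogen = ([0.999885, 0.000115], [1.00782503, 2.01410178])
print(fold_pattern(*carbon, *hydrogen, K=3))
```

Applying the function repeatedly, for instance along the repeated-squaring scheme used for the probabilities, yields the complete isotope pattern of a molecule without enumerating its isotope species.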

### 3.4 Scoring candidate molecules

We want to discriminate between (tens of thousands of) candidate molecules generated by decomposing the monoisotopic mass. To this end, we compare the simulated isotopic distribution with the measured peaks. Matching peak pairs between the spectra is trivial for this application.

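Zhang and Chait (2000) and Zhang *et al*. (2002) suggest using Bayesian statistics to evaluate mass spectra matches:

$$\mathbb{P}(\mathcal{M}_j \mid \mathcal{D}, \mathcal{B}) = \frac{\mathbb{P}(\mathcal{M}_j \mid \mathcal{B})\; \mathbb{P}(\mathcal{D} \mid \mathcal{M}_j, \mathcal{B})}{\sum_i \mathbb{P}(\mathcal{M}_i \mid \mathcal{B})\; \mathbb{P}(\mathcal{D} \mid \mathcal{M}_i, \mathcal{B})}$$

where 𝒟 is the data (the measured spectrum), ℳ_{i} are the models (the candidate molecules) and ℬ stands for any prior background information.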

Regarding this background information, we set the prior probability ℙ(ℳ_{j} | ℬ) to zero for all molecules but the decompositions of the monoisotopic mass. We also assign prior probability zero to molecular formulas that cannot correspond to a molecule, because of chemical considerations: Senior's third theorem states that the sum of valences has to be greater than or equal to twice the number of atoms minus one (Senior, 1951). Molecules violating Senior's third theorem are rare, particularly for natural compounds: in the KEGG COMPOUND database, <0.16% of substances violate this rule. We also filter out radicals with odd sum of valences. We refrain from using further priors such as the hetero-to-carbon ratio (Kind and Fiehn, 2007) because this might rather reproduce what is already known.
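The two chemical filters can be sketched as a simple predicate on a sum formula. Two assumptions are made here and are not taken from the article: the fixed valences in the table below (phosphorus and sulfur can also take higher valences), and the reading of Senior's third theorem as "sum of valences ≥ 2 · (number of atoms − 1)", which is the condition for a connected molecular graph. The function name is hypothetical.

```python
# Assumed fixed valences; P and S may also occur with higher valences.
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2}

def passes_chemical_filter(formula):
    """True if the sum formula is neither a radical (odd sum of valences)
    nor in violation of Senior's third theorem, read as
    sum of valences >= 2 * (number of atoms - 1)."""
    atoms = sum(formula.values())
    valence_sum = sum(VALENCE[e] * n for e, n in formula.items())
    if valence_sum % 2 == 1:
        return False                      # odd sum of valences: radical
    return valence_sum >= 2 * (atoms - 1)

print(passes_chemical_filter({"C": 12, "H": 22, "O": 11}))  # sucrose
print(passes_chemical_filter({"C": 1, "H": 6}))             # too many hydrogens
```

In a pipeline, such a predicate would be applied to every decomposition of the monoisotopic mass before any isotope pattern is simulated, since it is much cheaper than the simulation.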

Next, we assign probabilities to the observed masses and intensities. Assuming independence (in particular from background information) we calculate

$$\mathbb{P}(\mathcal{D} \mid \mathcal{M}_i, \mathcal{B}) = \mathbb{P}(\mathcal{D} \mid \mathcal{M}_i) = \prod_{j} \mathbb{P}(M_j \mid m_j) \cdot \prod_{j} \mathbb{P}(f_j \mid p_j).$$

Here, ℙ(*M*_{j} | *m*_{j}) is the probability to observe peak *j* at mass *M*_{j} when its true mass is *m*_{j}, and ℙ(*f*_{j} | *p*_{j}) is the probability to observe peak *j* with intensity *f*_{j} when its true intensity is *p*_{j}. Clearly, the independence of peak intensities is violated because these intensities sum to one, but this product can be seen as a rough estimate of the true probability.

Mass spectrometrists assume that the mass error of a device is roughly normally distributed with mean zero. If the mass accuracy α of the measurement (in p.p.m.) is given, then we can set the standard deviation $\sigma_j := \tfrac{1}{3}\,\alpha \cdot 10^{-6} \cdot M_j$ for peak *j*, assuming that >99.7% of measurements fall into the specified mass range. But we also observe that peaks of low intensity show less mass accuracy than those with high intensity, which can be attributed to the difficulties of separating a peak of low intensity from the background noise. Our data indicate a roughly linear dependence between peak intensity and mass accuracy. To this end, two mass accuracies α_{1} (at full intensity) and α_{0} (at minimal intensity) are provided by the user, and the mass accuracy used for peak *j* is interpolated linearly between them according to the intensity of the peak.

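We want to estimate the probability that, given a peak with true mass *m*_{j}, a peak at mass *M*_{j} is observed in the measured spectrum: more precisely, the probability of observing a mass difference of |*M*_{j} − *m*_{j}| or larger. We can compute this probability using the complementary error function ‘erfc’:

$$\mathbb{P}(M_j \mid m_j) := \operatorname{erfc}(z_j) = \frac{2}{\sqrt{\pi}} \int_{z_j}^{\infty} e^{-t^2}\,dt \qquad (6)$$

with $z_j := |M_j - m_j| \,/\, (\sqrt{2}\,\sigma_j)$, where σ_{j} is the standard deviation of the mass error of peak *j*.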
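The mass score can be sketched with the complementary error function from Python's standard library. The sketch assumes a single accuracy value per peak and derives the standard deviation from the three-sigma reading of the accuracy (one third of the accuracy in p.p.m. times the measured mass); how the accuracy is interpolated between α_{0} and α_{1} is left out, and the function names are hypothetical.

```python
from math import erfc, sqrt

def mass_sigma(measured_mass, alpha_ppm):
    """Standard deviation of the mass error, assuming that the accuracy
    alpha (in p.p.m.) corresponds to three standard deviations."""
    return measured_mass * alpha_ppm * 1e-6 / 3.0

def mass_probability(measured_mass, predicted_mass, alpha_ppm):
    """Probability of a mass deviation at least as large as the observed one
    under a zero-mean normal error model (two-sided tail)."""
    sigma = mass_sigma(measured_mass, alpha_ppm)
    z = abs(measured_mass - predicted_mass) / (sqrt(2.0) * sigma)
    return erfc(z)

# A deviation of 2 p.p.m. at an assumed accuracy of 5 p.p.m.
measured = 342.116215
print(mass_probability(measured, measured * (1 - 2e-6), alpha_ppm=5.0))
```

A perfect match scores 1, and a deviation equal to the full accuracy scores about 0.0027, matching the 99.7% interpretation of the accuracy parameter.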

Even for high-resolution MS, spectra show a systematic mass shift due to calibration inaccuracies. We can easily eliminate this shift for all masses but the monoisotopic mass: we do not compare masses of the +1, … peaks directly but instead the difference to the monoisotopic peak, *M*_{j} − *M*_{0} versus *m*_{j} − *m*_{0} for *j* ≧ 1.

Regarding peak intensities, we have to cope with a systematic error in the measured spectra: we observe in our data that peaks of low intensity are under-estimated in the measured spectrum, whereas peaks of high intensity are over-estimated (Supplementary Fig. 1). We ascribe this problem to inaccurate peak intensity determination: vendor software estimates peak intensities as signal-to-noise ratio or height above some baseline. The baseline, in turn, is determined using several *ad hoc* rules, and its estimate can be inaccurate. Unfortunately, such inaccuracies have unequal effects on peaks of different intensities. We correct this error by adding some user-defined parameter *off* to the measured intensities *f*_{i}, and by subsequent re-normalization. We found that for both of our datasets, the same parameter *off* = +0.02 leads to excellent results.
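The intensity correction amounts to a shift followed by re-normalization, as in this minimal Python sketch. The function name and the example intensities are made up for illustration; only the default offset 0.02 comes from the text.

```python
def correct_intensities(intensities, off=0.02):
    """Add the offset to every normalized intensity and re-normalize to sum 1."""
    shifted = [f + off for f in intensities]
    total = sum(shifted)
    return [f / total for f in shifted]

measured = [0.80, 0.15, 0.05]          # hypothetical normalized intensities
print(correct_intensities(measured))
```

After correction the weak peaks carry a larger share of the total intensity and the strong peak a smaller one, which counteracts the systematic error described above.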

Computation of ℙ(*f*_{j} | *p*_{j}) is done analogously to that of ℙ(*M*_{j} | *m*_{j}). Our data indicate that after correction, log ratios between measured and predicted peak intensity log (*f*_{j}/*p*_{j}) roughly follow a normal distribution. Again, precision parameters β_{1} (at full intensity) and β_{0} (at minimal intensity) are provided by the user (in percent). From these we compute, again depending on the intensity of the peak, our precision interval, such that >99.7% of log intensity ratios log (*f*_{j}/*p*_{j}) fall into the resulting range. Now, ℙ(*f*_{j} | *p*_{j}) can be estimated analogously to (6).

## 4 EXPERIMENTAL RESULTS

### 4.1 Datasets

To evaluate our method we used two datasets measured on two instruments. Mass spectra with single charge were measured from several organic (macro)molecules, composed of the elements CHNOPS. For every such spectrum, the sum formula of the sample molecule is known. The spectra were acquired over a period of 2 years; the molecules range in mass from 117 Da to ∼1000 Da. Peak detection and estimation of peak masses and intensities were conducted using vendor software.

The first dataset consists of 153 mass spectra. Electrospray ionization (ESI) experiments were performed using the Fourier Transform Ion Cyclotron Resonance (FT-ICR) mass spectrometer APEX III (Bruker Daltonik GmbH, Bremen, Germany). The FT-MS was equipped with a 7.0 T, 160 mm bore superconducting magnet, infinity cell and interfaced to an external (nano)ESI ion source. All mass spectra were externally mass calibrated. The five analysis parameters were chosen as α_{1} = 3, α_{0} = 6, β_{1} = 10, β_{0} = 90 and *off* = +0.02.

The second dataset consists of 86 mass spectra. ESI experiments were performed using the oa-TOF mass spectrometer MicrOTOF (Bruker Daltonik GmbH, Bremen, Germany). Quasi-internal mass calibration was used, by measurement of an infused calibrant prior to the compound of interest. For the oa-TOF analysis, we set α_{1} = 5, α_{0} = 6.5, β_{1} = 10, β_{0} = 90 and again *off* = +0.02.

### 4.2 Identification accuracy

Every input ‘mass spectrum’ consists of masses *M*_{0},…,*M*_{k} and intensities *f*_{0},…,*f*_{k}. For every such spectrum, we computed all molecules such that the monoisotopic mass *M*_{0} has relative mass difference of at most α_{1} p.p.m. Next, we discarded molecules violating Senior's third theorem and radicals with odd sum of valences. For each remaining molecule, we computed its theoretical isotopic distribution (with *K* = 10) and compared it to the measured isotopic distribution. We ranked the molecules according to resulting probabilities. We did not use any other background information to identify the molecule.

For the 153 mass spectra in our FT-ICR dataset, 89 resulted in a correct identification; in 86% of the mass spectra, the correct interpretation was found in the TOP 10 explanations. There is a clear correlation between mass and identification accuracy (see Table 2). For mass spectra ≤700 Da, the true interpretation was always found in the TOP 10 explanations, except in one case where it had rank 13. For the 86 mass spectra in the oa-TOF dataset, the correct sum formula was found in the TOP 10 interpretations in all but two cases. Moreover, 79 out of 86 compounds were correctly identified, which corresponds to an identification rate of >90% (Table 2). The better identification results on the oa-TOF dataset, despite its lower mass accuracy, show the crucial importance of including intensity measurements in the candidate evaluation. We note that the intensity accuracy of the oa-TOF instrument is significantly higher than that of the FT-ICR. We have also tested the variation of identification rates with different scoring parameters: identification results are relatively stable for small disturbances of parameter values, see supplementary material. Parameter estimation could be automated using a small training set. We are planning to include this feature in future implementations.

### 4.3 Running times

We analyzed all 239 mass spectra on a Pentium M 1.5 GHz processor with blowup *b* = 5 × 10^{4}, using only a few megabytes of memory. This results in running times of <1.3 s for the complete analysis of one mass spectrum. Clearly, running times depend on molecule masses, see again Table 2. Increasing the blowup beyond 5 × 10^{4} increased running times, presumably because the smaller table can be kept in the processor cache.

## 5 IMPLEMENTATION

We have developed a Java-based graphical tool called SIRIUS. At the SIRIUS core lie efficient algorithms for generating all elemental compositions for a given mass and error, calculating isotope patterns for all chemically relevant compositions, and matching and ranking the candidate molecules against the input spectrum. SIRIUS combines these algorithms with a powerful graphical user interface. An extensive management system allows simplified data handling and offers an easy way to integrate new algorithms and data structures into the framework. Through a user-friendly interface, SIRIUS allows the user to import datasets in most common mass spectrometry file formats. It supports automatic recognition of molecular ion adducts present in the input spectrum, handy visualization of identified sum formulas and their isotope patterns and customizable export of identification results to common human-readable file formats. Finally, the software provides basic functionality to search for sum formulas identified by the algorithm in the NCBI PubChem database.^{1}

Preparation of a new analysis run can be divided into the following steps: initializing input data and instrument parameters, setting up algorithm parameters and extracting isotope patterns from the input peaklist. Input peaklist and machine settings can be reused for multiple analyses on the same data. SIRIUS provides the user with reasonable default values for algorithm parameters. The program also offers to save all algorithm and mass spectrometer settings. To this end, SIRIUS creates a persistent workspace that can be used to store local settings and to automatically reload them on request.

We use the ProteomeCommons.org IO Framework (Falkner *et al*., 2007) to import mass spectrometry data, which allows reading most MS data formats including mzData and mzXML. We parse the peaklist and divide it into signal groups related to different compounds. A peaklist can also contain several signal groups belonging to the same compound, modified by different molecular ion adducts. Identifying modifications is done by calculating mass differences between monoisotopic peak masses. In view of the small number of adducts, we apply a simple exhaustive search to find all matching mass differences. If there is no prior knowledge on the source of modification, the user can choose one or more adduct types for an isotope pattern.

The output of the algorithm is a list of candidate sum formulas for each compound. Sum formulas are listed in the summary table, sorted by decreasing likelihood. To examine an entry in more detail, the user can select it and compare the theoretical and measured isotope patterns visually (Fig. 2). Analysis results can be exported to the application workspace and opened for further evaluation. Export file formats include plain text, PDF and XML documents.
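To make the comparison of theoretical and measured patterns concrete, the sketch below simulates a pattern at nominal-mass resolution by repeated convolution of per-element isotope distributions and ranks candidates by a squared-error distance. Both choices are stand-ins: the distance is not the likelihood score used by SIRIUS, and the function names, the rounded natural abundances and the truncation to a few peaks are assumptions made for illustration.

```python
# Approximate natural isotope abundances at nominal mass offsets +0, +1, +2
# (rounded illustrative values).
ISOTOPES = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
}


def convolve(a, b, size):
    """Convolve two intensity lists, keeping only the first `size` peaks."""
    out = [0.0] * size
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < size:
                out[i + j] += x * y
    return out


def isotope_pattern(formula, size=4):
    """Normalized intensities of the first `size` isotope peaks of `formula`."""
    pattern = [1.0] + [0.0] * (size - 1)
    for element, count in formula.items():
        for _ in range(count):
            pattern = convolve(pattern, ISOTOPES[element], size)
    total = sum(pattern)  # renormalize after truncation
    return [p / total for p in pattern]


def rank_candidates(measured, candidates):
    """Sort candidates by squared distance to the measured pattern (best first)."""
    total = sum(measured)
    measured = [m / total for m in measured]

    def distance(formula):
        theo = isotope_pattern(formula, len(measured))
        return sum((t - m) ** 2 for t, m in zip(theo, measured))

    return sorted(candidates, key=distance)


if __name__ == "__main__":
    glucose = {"C": 6, "H": 12, "N": 0, "O": 6}
    other = {"C": 10, "H": 12, "N": 4, "O": 0}
    measured = isotope_pattern(glucose)
    print(rank_candidates(measured, [other, glucose])[0])
```

A real scoring function would also account for mass deviations and measurement noise in the peak intensities; the squared distance here only conveys the idea of ordering candidates by agreement with the measured pattern.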

## 6 CONCLUSION

We presented an approach to determine the sum formula of an unknown metabolite solely from its high-resolution isotope pattern. Our approach allows us to reduce the number of potential sum formulas to only a few candidates; in many cases we were able to determine the correct molecular formula. The approach is time- and memory-efficient and can be executed on a regular desktop PC. We further presented methods for the efficient simulation of isotope patterns. This is vital for larger molecules where the search space increases rapidly.

Results on experimental data clearly show the potential of our approach, in particular for oa-TOF data. In our evaluation, we have deliberately ignored some information such as the prior probability of the elements or the hetero-to-carbon ratio (Kind and Fiehn, 2007). We believe that such information should be used by an expert in a ‘post-processing’ step, rather than to filter out certain sum formulas automatically a priori. Finally, we introduced a user-friendly software tool called SIRIUS, which implements all of the methods presented.

## ACKNOWLEDGEMENTS

Additional programming by Martin Engler. We thank Dr H. Luftmann, Universität Münster, Organisch-Chemisches Institut, for making available the oa-TOF dataset and an anonymous referee for helpful comments.

*Funding*: Deutsche Forschungsgemeinschaft (BO 1910/1 to A.P.); Alexander von Humboldt Foundation and the Bundesministerium für Bildung und Forschung, within the group ‘Combinatorial Search Algorithms in Bioinformatics’ (to Z.L.).

*Conflict of Interest*: none declared.

## Footnotes

## REFERENCES

- Audi G, et al. The AME2003 atomic mass evaluation (II): Tables, graphs, and references. Nucl. Phys. A. 2003;729:129–336.
- Beck M, et al. The polynomial part of a restricted partition function related to the Frobenius problem. Electron. J. Comb. 2001;8:N7.
- Böcker S, Lipták Z. A fast and simple algorithm for the Money Changing Problem. Algorithmica. 2007;48:413–432.
- Böcker S, Rasche F. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinformatics. 2008;24:i49–i55.
- Falkner JA, et al. ProteomeCommons.org IO Framework: reading and writing multiple proteomics data formats. Bioinformatics. 2007;23:262–263.
- Hsu CS. Diophantine approach to isotopic abundance calculations. Anal. Chem. 1984;56:1356–1361.
- Iijima Y, et al. Metabolite annotations based on the integration of mass spectral information. Plant J. 2008;54:949–962.
- Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucl. Acids Res. 2006;34:D354–D357.
- Kellerer H, et al. Knapsack Problems. Berlin, Heidelberg: Springer; 2004.
- Kind T, Fiehn O. Metabolomic database annotations via query of elemental compositions: mass accuracy is insufficient even at less than 1 ppm. BMC Bioinformatics. 2006;7:234.
- Kind T, Fiehn O. Seven golden rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinformatics. 2007;8:105.
- Kubinyi H. Calculation of isotope distributions in mass spectrometry: a trivial solution for a non-trivial problem. Anal. Chim. Acta. 1991;247:107–119.
- Martello S, Toth P. Knapsack Problems: Algorithms and Computer Implementations. Chichester: John Wiley & Sons; 1990.
- Rockwood AL, Van Orden SL. Ultrahigh-speed calculation of isotope distributions. Anal. Chem. 1996;68:2027–2030.
- Rockwood AL, et al. Isotopic compositions and accurate masses of single isotopic peaks. J. Am. Soc. Mass Spectr. 2004;15:12–21.
- Senior J. Partitions and their representative graphs. Am. J. Math. 1951;73:663–689.
- Wilf H. Generatingfunctionology. New York: Academic Press; 1990.
- Yergey JA. A general approach to calculating isotopic distributions for mass spectrometry. Int. J. Mass Spectrom. Ion Phys. 1983;52:337–349.
- Zhang J, et al. Predicting molecular formulas of fragment ions with isotope patterns in tandem mass spectra. IEEE/ACM Trans. Comput. Biol. Bioinform. 2005;2:217–230.
- Zhang N, et al. ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics. 2002;2:1406–1412.
- Zhang W, Chait BT. ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal. Chem. 2000;72:2482–2489.

**Oxford University Press**

SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics. 2009 Jan 15; 25(2):218.