Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm

University of California Davis, Genome Center, 451 E. Health Sci Dr., Davis, CA 95616, USA
Corresponding author. Tobias Kind: tkind/at/ucdavis.edu; Oliver Fiehn: ofiehn/at/ucdavis.edu
Received December 22, 2005; Accepted April 28, 2006.
Copyright © 2006 Kind and Fiehn; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background
Metabolomic studies are targeted at identifying and quantifying all metabolites in a given biological context. Among the tools used for metabolomic research, mass spectrometry is one of the most powerful. However, metabolomics by mass spectrometry always reveals a high number of unknown compounds, which complicates in-depth mechanistic or biochemical understanding. In principle, mass spectrometry can be utilized within strategies of de novo structure elucidation of small molecules, starting with the computation of the elemental composition of an unknown metabolite using accurate masses with errors <5 ppm (parts per million). However, even with very high mass accuracy (<1 ppm) many chemically possible formulae are obtained in higher mass regions. In automatic routines an additional orthogonal filter therefore needs to be applied in order to reduce the number of potential elemental compositions. This report demonstrates the necessity of isotope abundance information by mathematical confirmation of the concept.

Results
High mass accuracy (<1 ppm) alone is not sufficient to exclude enough candidates with complex elemental compositions (C, H, N, S, O, P, and potentially F, Cl, Br and Si). Use of isotopic abundance patterns as a single further constraint removes >95% of false candidates. This orthogonal filter can condense several thousand candidates down to only a small number of molecular formulas. Example calculations for 10, 5, 3, 1 and 0.1 ppm mass accuracy are given. Corresponding software scripts can be downloaded from http://fiehnlab.ucdavis.edu. A comparison of eight chemical databases revealed that PubChem and the Dictionary of Natural Products can be recommended for automatic queries using molecular formulae.

Conclusion
More than 1.6 million molecular formulae in the range 0–500 Da were generated in an exhaustive manner under strict observation of mathematical and chemical rules. Assuming that ion species are fully resolved (either by chromatography or by high resolution mass spectrometry), we conclude that a mass spectrometer capable of 3 ppm mass accuracy and 2% error for isotopic abundance patterns outperforms mass spectrometers with less than 1 ppm mass accuracy or even hypothetical mass spectrometers with 0.1 ppm mass accuracy that do not include isotope information in the calculation of molecular formulae.

Background
Metabolomics seeks to identify and quantify all metabolites in a given biological context [1]. In this respect its aim is different from metabolic fingerprinting or metabonomic approaches, which utilize high dimensional unannotated variables and multivariate statistics to find biomarkers that may or may not be structurally identified in subsequent steps. Therefore, an important task in metabolomics is to identify or structurally annotate compounds in a high throughput manner. Mass spectrometry is one of the most powerful tools for unbiased analysis of small molecules in life sciences. Hundreds to thousands of metabolites can be detected when suitable sample preparation methods [2] and mass spectrometric techniques are used [3].
However, most of the metabolites in complex biological materials like plant tissues are non-annotated, unidentified metabolites [5] due to the lack of experimental databases and the chemical complexity and changing nature of an organism's metabolome. Metabolites cannot be sequenced like proteins or polynucleotides. Instead, each individual compound needs to undergo structural elucidation, starting from the elemental composition. In addition to detection and quantification of metabolites, mass spectra can further be exploited for structural elucidation of compounds [4]. In order to reduce the number of de novo elucidations for metabolomic studies, a reasonable strategy could start with tentatively annotating metabolomic mass spectra with a list of compounds that match the elemental composition of small molecules found in publicly available databases. For numerical reasons the list of potential metabolic candidates will vary with the size and the quality of the queried database, but in principle, even structures with uncommon chemical conformations like ladderanes [6] (Figure 1)
The mass of chemical elements is based on the conventional scale that defines carbon 12C = 12.000 u. Chemical elements are composed of different numbers of neutrons, protons and electrons, so that the combined mass for each element (other than 12C) is a rational (non-integer) number: 1H = 1.007825 u, 14N = 14.003070 u, 16O = 15.994910 u [9]. Consequently, for any given metabolite, the accurate mass deviates from the nominal mass. This feature can be exploited for recursively calculating the elemental composition from an unknown metabolite mass spectrum within the range of the measurement error. Mass spectrometers today can measure mass/charge ratios with high (<5 ppm error; parts per million) or very high mass accuracy (<1 ppm) [11] and can be purchased with implemented software algorithms that derive a list of possible elemental compositions from the measured monoisotopic mass. Using the accurate mass, one can either solve the diophantine equation [15] or use a brute force approach [16] and calculate all possible elemental combinations in a certain range. An important prerequisite for this approach is not only accurate mass measurement but also a high resolving power of the mass spectrometer. As the output of a mass spectrum is represented as a Gaussian- or Lorentzian-like peak shape, neighbouring peaks can overlap on devices with low resolving power. Resolving power (m/Δm) at a certain m/z value can be calculated at the full width at half maximum (FWHM) of the peak. Quadrupole mass spectrometers usually can reach 3000 [11], TOF analyzers up to 10,000, and Fourier Transform Ion Cyclotron Resonance (FT-ICR) mass spectrometers can have a resolving power up to 1,000,000 or larger [35]. Isobaric masses (for example C37H31N8P7S3, MW = 899.999692, and C20H43N2O19P3S6, MW = 899.999678) cannot be resolved by mass spectrometry alone. In this case chromatography helps to separate these overlapping components.
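The brute force approach and the resolving-power argument above can be illustrated with a minimal sketch. It is not the software used in this report: the function names, the restriction to C, H, N and O, and the upper atom counts are assumptions chosen for illustration, and no chemical rules are applied to the hits.

```python
import math

# Monoisotopic masses in u. C and H follow the text; N and O are rounded
# standard values (an assumption) that differ from the text in the last digits.
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}


def formula_mass(counts):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO[el] * n for el, n in counts.items())


def ppm_window(mass, ppm):
    """Return the (low, high) mass window for a +/- ppm tolerance."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta


def brute_force_formulas(target, ppm, max_c=40, max_n=10, max_o=20, max_h=80):
    """List all (c, h, n, o) count tuples whose mass lies inside the window.

    Purely mathematical: no valence or other chemical rules are applied.
    """
    lo, hi = ppm_window(target, ppm)
    hits = []
    for c in range(1, max_c + 1):
        m_c = c * MONO["C"]
        if m_c > hi:
            break
        for n in range(max_n + 1):
            m_cn = m_c + n * MONO["N"]
            if m_cn > hi:
                break
            for o in range(max_o + 1):
                m_cno = m_cn + o * MONO["O"]
                if m_cno > hi:
                    break
                # Solve directly for the hydrogen counts that fit the window.
                h_min = max(0, math.ceil((lo - m_cno) / MONO["H"]))
                h_max = min(max_h, math.floor((hi - m_cno) / MONO["H"]))
                for h in range(h_min, h_max + 1):
                    hits.append((c, h, n, o))
    return hits


def required_resolving_power(m1, m2):
    """Approximate m/dm needed to separate two neighbouring masses."""
    return ((m1 + m2) / 2) / abs(m1 - m2)


if __name__ == "__main__":
    glucose = formula_mass({"C": 6, "H": 12, "O": 6})
    for ppm in (10, 5, 3, 1, 0.1):
        print(ppm, "ppm:", len(brute_force_formulas(glucose, ppm)), "candidates")
    # The isobaric pair quoted in the text differs by only 0.000014 u.
    print(round(required_resolving_power(899.999692, 899.999678)))
```

Solving for the hydrogen count instead of looping over it keeps the search to three nested loops. For the isobaric pair quoted above, the estimated resolving power is far above the 1,000,000 given for FT-ICR instruments, which is consistent with the statement that such ions need chromatographic separation.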
For the case of peptides it was claimed that accurate mass measurements of 1 ppm error would be sufficient to derive a single solution in most cases [38]. However, this is not applicable for small molecules, because they are not only derived from combinations of certain amino acids. We demonstrate in this report that even a hypothetical instrument capable of accurate mass measurements of 0.1 ppm error would not fulfill this premise when matching against a comprehensive list of chemically possible elemental compositions. Additional information is required that can readily be gathered from mass spectra: the abundance of natural isotopes in metabolites, which refers to the percentages in which the isotopes of an element are found in natural sources on earth. The isotopic abundance pattern of a metabolite's mass spectrum can serve as a powerful additional constraint for removing wrong elemental composition candidates. Isotope ratio mass spectrometers [14] determine isotope abundances exactly, but only after combustion of the original molecule into CO2 or other gases; they are therefore irrelevant for the calculation of elemental compositions of unidentified metabolites. In general, the theoretical isotopic abundance pattern of a molecular formula can be calculated using different approaches [12], either solving polynomial equations or using fast Fourier transformations [13]. An isotope abundance filter can be used for any mass spectrometer which can provide very low root mean square (RMS) errors for isotopic patterns, especially if the contribution of further metabolites can be ignored by coupling compound separation to mass spectrometric detection using liquid or gas chromatography (LC/MS and GC/MS). Mass spectra may include fragmentations, rearrangements, and adducts [10].
For the sake of clarity, mathematical and chemical considerations reported here are constrained to pseudo-molecular ions that are completely resolved from interfering compounds, assuming the utilization of efficient chromatography or high resolution mass spectrometry [18], or a combination of both.

Results and discussion

Database queries of elemental compositions
Assuming that a unique elemental composition could be derived from a mass spectrum, this molecular information can be furnished for metabolite annotation in either of two distinct ways: an exhaustive computation of all chemically possible isomeric structures or a query of databases for known (bio)chemical compounds. Exhaustive methods (Figure 2)
For automatic annotation of metabolites in metabolomic screens, it seems more reasonable today to first search against existing chemical structures or even to limit searches to known natural product databases. A randomly chosen molecular formula of C15H12O7 (304.0583 u) was taken as a test case for query results, which should comprise structures like the naturally occurring pentahydroxyflavone (Figure 3).
A range of conclusions can be derived from this exercise (Table 1). Due to its limited size and its focus on consensus biochemical pathways, the KEGG database returned far fewer hits compared to more comprehensive repositories like CAS or DNP. It is therefore important to note that automatic annotations of mass spectra must not be limited to KEGG searches alone. Conversely, however, any hit retrieved from KEGG queries might receive a higher likelihood of truly representing an identifiable metabolite due to the focus on (conserved) biochemical pathways represented in KEGG. In contrast to the small KEGG (Ligand) DB, the CAS database represents the largest database available for small molecules, containing ~20 million organic chemicals. However, CAS cannot serve as a suitable database for routine metabolite queries. On the one hand, CAS contains many compounds that have been artificially synthesized and reported by chemists, and thus are often unlikely to be present as natural compounds. On the other hand, the CAS SciFinder front end enables only a very limited and slow formula search, allowing queries of one formula at a time but not batches or series of queries. For these two reasons, CAS queries can be excluded from automated annotation efforts of complex metabolomic surveys; however, for identification purposes of selected unknown compounds in biomarker studies, the CAS database still provides the most comprehensive overview. It is interesting to note that DNP with only 170,000 entries retrieves 129 different isomeric structures of C15H12O7 (among them many stereoisomers) whereas the far larger PubChem database resulted in only 19 hits. The PubChem database is a fast growing database. At the time of search it had only 800,000 entries; now it has more than 5 million. PubChem is a freely accessible database that includes KEGG, ChemIDplus, NCBI and several other databases and should therefore be included in automatic metabolite annotations.
An in-depth molecular diversity calculation could reveal any overlap [22]. For an automated approach, the DNP database in SD file format (*.sdf) could be used, whereas only semi-automatic procedures would be possible for the Beilstein database. Consequently, for identification routines of unknown metabolites starting from elemental compositions, DNP and PubChem search results should be combined.

Calculating elemental compositions: construction of an exhaustive test data set
The inputs into metabolomic queries are elemental compositions which are calculated from experimental mass spectra. Often, the performance of mass spectrometers and of the underlying software algorithms that calculate such molecular formulae is presented on test cases. However, molecular formulae are not uniformly distributed across the mass range. In order to exhaustively test the performance and power of algorithms calculating elemental compositions, a data set containing all chemically possible molecular formulae between a molecular mass of 20–500 u (using the most common elements C, H, N, O, P and S) was constructed. It is wrongly assumed by researchers outside the mass spectrometry community that within that mass range, high mass accuracy calculations of <1 ppm would result in unambiguous calculation of unique elemental compositions. We therefore have applied a number of chemical constraints to reduce the number of potential elemental compositions in the exhaustive data set to only those combinations that are allowed by chemical bonding rules. Applying constraints is the most crucial step during the whole process of formula finding and structure elucidation. Consequently, we have used the molecular weight calculator MWTWIN with a variety of restrictions: the "smart H atoms" option was used to avoid the calculation of an unreasonably high number of hydrogen atoms. This excludes species like C26H2 which are chemically possible but not relevant for metabolomics.
In extremely rare cases this can lead to an exclusion of certain formulas with multicenter bonds (C10H25NO4). Secondly, metals have been excluded in our test data set because most metabolites do not contain coordinating metal atoms (although certainly a number of naturally occurring metabolites do, such as chlorophylls). However, in case trimethylsilylation was used for derivatization, search queries in GC/MS metabolite profiling data must obviously include Si, which was left aside for this test data set. A third important constraint is the application of valence rules, for which LEWIS and SENIOR rules were applied. These rules were found to serve as an important constraint that helped reduce an initial number of 3.5 million combinations of elemental compositions to 1.6 million for the mass range of 20–500 u (C, H, N, S, O and P). Surprisingly, a number of both commercial and non-commercial formula generators are based purely on mathematical rules but do not obey the LEWIS and SENIOR chemical rules. As a result, for a mass of 129.034 u such formula generators would calculate species like C9H5O, which do not exist as natural compounds (although they might exist as charged or radical species in the gas phase). Briefly, the LEWIS rule expects each compound to account for an even number of electrons with atoms that all obey the octet rule. SENIOR's theorem [25,26] requires three essential conditions for the existence of molecular graphs: A) the sum of valences is an even number, or the total number of atoms having odd valences is even; B) the sum of valences is greater than or equal to twice the maximum valence; C) the sum of valences is greater than or equal to twice the number of atoms minus 1. We have written scripts that include these rules in order to reduce the number of generated formulae that are exported from current commercial or non-commercial software products.
The second rule was not included because it only proves the non-existence of very small molecules like CH2 [26]. The current script only allows atom numbers less than 100. We have not put in a further constraint that would account for the number of rings and double bonds (RDBE [32]) or double bond equivalents (DBE), because for complex molecules with more than five atom types the calculation gets quite complicated. For example, nitrogen and phosphorus can have 3 or 5 valences, and sulphur atoms may have 2, 4 or 6 valences. For molecules that contain these three atoms in different valence states, no single solution for RDBE can be calculated; instead, an RDBE range would result. An in-depth mathematical discussion of this problem can be found in [37]. Applying the LEWIS and SENIOR check is thus much more reliable and straightforward. Our current software script obeys standard valences (ground state chemistry [17]) in a conservative effort to produce an exhaustive number of formulas for ground state chemistry. A plot of all elemental compositions between 200–300 u is given in Figure 4.
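The three SENIOR conditions listed above can be sketched in a few lines. This is an illustrative sketch, not the Visual Basic scripts described in the Methods: the function name and the valence table are assumptions (ground state valences, N and P = 3, S = 2), condition C is read here as 2 × (number of atoms − 1), and the octet part of the LEWIS rule is not checked.

```python
# Ground-state valences assumed for this sketch (N and P = 3, S = 2).
VALENCE = {"C": 4, "H": 1, "N": 3, "O": 2, "P": 3, "S": 2}


def senior_check(counts, use_rule_b=True):
    """Return True if a formula {element: count} satisfies the SENIOR conditions.

    A) the sum of valences is even;
    B) the sum of valences is >= twice the maximum valence (optional);
    C) the sum of valences is >= 2 * (number of atoms - 1), which is one
       reading of "twice the number of atoms minus 1".
    """
    present = {el: n for el, n in counts.items() if n > 0}
    if not present:
        return False
    n_atoms = sum(present.values())
    valence_sum = sum(VALENCE[el] * n for el, n in present.items())
    max_valence = max(VALENCE[el] for el in present)

    rule_a = valence_sum % 2 == 0
    rule_b = valence_sum >= 2 * max_valence
    rule_c = valence_sum >= 2 * (n_atoms - 1)
    return rule_a and rule_c and (rule_b or not use_rule_b)


if __name__ == "__main__":
    print(senior_check({"C": 9, "H": 5, "O": 1}))   # odd valence sum
    print(senior_check({"C": 15, "H": 12, "O": 7}))
    print(senior_check({"C": 1, "H": 2}, use_rule_b=False))
```

With these assumptions, C9H5O is rejected by condition A because its valence sum is odd, and CH2 is rejected only when condition B is switched on, in line with the remark above that the second rule matters only for very small molecules.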
Limits for unique molecular formula assignment
The generation of a comprehensive data set of all chemically possible molecular formulae between 20–500 u enables the prediction of the upper ppm limit for unique molecular formula assignment (see Table 2). Querying masses and formulae at peak frequency distributions from Figure 4,
Accurate isotope abundance complements accurate mass measurements
Natural compounds on earth (such as metabolites from biological specimens) reflect the natural abundance of stable elemental isotopes, such as 13C (which is found at approx. 1.11% of the most frequent isotope 12C), 18O (0.2% of 16O), 15N (0.367% of 14N), 2H (D; 0.015% of 1H) and 33S and 34S (0.79 and 4.43% of 32S). The actual ratios of these stable isotopes differ slightly for each element within narrow ranges [9]. Consequently, each monoisotopic pseudomolecular ion (M0) that is used for accurate mass determinations is always accompanied by additional isotope ions. The abundance of the isotope ions (M+1, M+2, M+3) depends on the actual elemental composition and can therefore serve as a powerful filter in calculating unique elemental compositions from mass spectral data. In Table 3 the number of calculated elemental compositions for 150.000 to 900.000 u is given at mass accuracy levels of 10–0.1 ppm without and with additional isotope abundance information. Conventional calculations do not include isotope information. It is clearly seen that above approx. 200 u, mass accuracies of 3–5 ppm (an error level that is usually achieved by time-of-flight mass spectrometers, TOF [11]) lead to multiple chemically possible formulae, and to dozens of elemental compositions at masses above 400 u. It has therefore been argued to utilize the high resolving power and mass accuracy of Fourier transform ion cyclotron resonance mass spectrometers that achieve around 1 ppm average error in daily routines (in unattended mode sometimes worse [18]). However, even at 1 ppm error, ambiguity of chemical formulae increases sharply above 400 u, a range in which many secondary metabolites are detected.
Use of a hypothetical mass spectrometer with only 0.1 ppm error would still not result in unique solutions above 500 u, which leads to the conclusion that improving mass accuracy is not the solution for automatic assignments of elemental compositions. In contrast, applying isotope pattern recognition greatly reduces the search space for possible elemental compositions. Today, TOF mass spectrometers are available that specify 2% absolute isotope abundance accuracy at 3 ppm mass accuracy level with a resolving power of 10,000 [29]. Table 3 demonstrates that such instruments may clearly outperform the 5-fold more expensive ion cyclotron resonance mass spectrometers with respect to calculation of molecular formulae. Up to 400 u, unique solutions are achieved, and between 400–800 u only 2–13 possible elemental compositions are reported. A direct comparison of the list of retrieved hits at the 3 ppm level with and without exploiting the isotope abundance information confirms that applying such an orthogonal filter above 500 u always removes more than 95% of the potential formulae. It has been argued that the chemical intuition and experience of analytical chemists would sort out unlikely chemical compositions; however, such routines cannot be implemented into query algorithms and are hard to conceive even at the 1 ppm level, when hundreds of possible hits are returned at searches between 700–900 Da, the mass range of membrane lipids. The principal idea of using a combined analysis of mass spectra and isotopic distributions has been known for several decades [33,31,34]. There is a further approach called MPPSIRD (mass peak profiling from selected ion recording data) [19] in which molecular formulas with non-matching ion abundances are excluded. Another approach suggested using the isotopic pattern to "virtually" enhance the resolving power of a magnetic sector instrument from 30,000 to 90,000 or that of an FT-MS from 500,000 to 1,500,000 [24].
It has also been argued that complementary information may be garnered from mass spectral fragmentation, sometimes including accurate mass data in an intelligent basket method [28]. However such an approach is not universally applicable, and even more importantly, the interdependency of accurate mass and accurate isotope analysis for automated calculation of elemental compositions has not yet been demonstrated on a comprehensive data set of chemically possible formulae.
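The effect of an isotope abundance filter can be illustrated with a rough first-order estimate of the M+1 and M+2 intensities. This sketch is not the Mercury-based calculation described in the Methods: it uses only the relative abundances quoted in the text (17O and higher-order terms other than two 13C atoms are ignored), and the function names, the example formulas and the absolute-difference criterion are assumptions for illustration.

```python
# Abundances in percent of the lightest isotope, as quoted in the text.
M1_PER_ATOM = {"C": 1.11, "H": 0.015, "N": 0.367, "S": 0.79}  # 13C, 2H, 15N, 33S
M2_PER_ATOM = {"O": 0.2, "S": 4.43}                            # 18O, 34S


def approx_pattern(counts):
    """First-order estimate of (M+1, M+2) in percent, with M0 = 100."""
    m1 = sum(M1_PER_ATOM.get(el, 0.0) * n for el, n in counts.items())
    m2 = sum(M2_PER_ATOM.get(el, 0.0) * n for el, n in counts.items())
    # Two 13C atoms in the same molecule also contribute to M+2.
    carbon_m1 = M1_PER_ATOM["C"] * counts.get("C", 0)
    m2 += carbon_m1 ** 2 / 200.0
    return m1, m2


def isotope_filter(candidates, measured, tolerance):
    """Keep candidates whose M+1 and M+2 both lie within +/- tolerance
    (absolute percent) of the measured pattern."""
    kept = []
    for counts in candidates:
        calc = approx_pattern(counts)
        if all(abs(c - m) <= tolerance for c, m in zip(calc, measured)):
            kept.append(counts)
    return kept


if __name__ == "__main__":
    glucose = {"C": 6, "H": 12, "O": 6}
    thio = {"C": 5, "H": 8, "O": 5, "S": 1}  # hypothetical sulfur-containing candidate
    measured = approx_pattern(glucose)       # stands in for a measured pattern
    print(isotope_filter([glucose, thio], measured, tolerance=2.0))
```

In this toy case the sulfur-containing candidate is removed because 34S raises its estimated M+2 intensity by more than the assumed 2% tolerance, which is the kind of orthogonal information that accurate mass alone does not provide.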
A further example supports this notion of a high impact of an orthogonal isotopic pattern filter. Actual measurement data were taken from analysis of trimethylsilylated (TMS) sorbitol, which was calculated as a pseudomolecular ion with a mass/charge of 615.324 u at 5 ppm error under chemical ionization using a gas chromatography – time of flight mass spectrometer (GC-TOF, [36]). In Figure 5
Conclusion
Based on exhaustive generation of 1.6 million molecular formulas, it has been shown that high mass accuracy (1 ppm) and high resolving power alone are not sufficient for obtaining a low number of molecular formulas for further structure elucidation. This is especially true for molecular masses above 300 Da containing the most common elements C, H, N, S, O and P. Only an orthogonal isotopic abundance pattern filter was able to strongly reduce the number of molecular formula candidates. This of course requires mass spectrometers with a very low error for isotopic abundance distributions (RMS 1–5%). A mass spectrometer capable of 3 ppm mass accuracy but 2% isotopic pattern accuracy usually removes more than 95% of false candidates and outperforms even a (non-existing) mass spectrometer capable of 0.1 ppm mass accuracy but no isotopic pattern accuracy. Mass spectrometer manufacturers should be required to provide the isotopic abundance errors in their documentation. Software producers should be required to use such an approach in their formula generation software for mass spectrometers.

Methods

Generation of molecular formulas
Exhaustive calculation of formulae from 20–500 u using C, H, N, O, S and P was performed using the Molecular Weight Calculator MWTWIN [23] on a 1.7 GHz Pentium M with 1 GByte RAM. Calculation time and data cleaning with Textpad [40] was about 24 h. As valence values and molecular masses for each of the elements are constant, the resulting patterns of these calculations are also applicable to higher mass ranges. It is feasible to calculate molecular formulas in a much higher range using CHEFOG [30]. LEWIS and SENIOR rules were checked using self-written scripts in Visual Basic which were implemented into Statistica Dataminer v7 [23] and Microsoft Excel 2003. A demo version of Molgen 3.5 [42] was downloaded and used for the calculation of the number of structural isomers of some formulae given as examples.
Isotopic pattern filter
Isotopic patterns were calculated with a modified Mercury6 version [13]. This version takes the molecular formula as input and writes the isotopic abundances with the corresponding masses to a log file. It can process 1 million formulas in 3 hours on a Pentium M 1.7 GHz. The resulting formulae and isotopic patterns of a single example were transferred to an MS Excel sheet where a simple matching function was implemented. Isotopic abundances are normalized to 100. The root mean square error (RMS) of the isotopic abundances is given in percent. This Excel function adds the differences between the calculated and target intensities for each of the M+1, M+2 and M+3 peaks and matches the sum of these differences against the target intensities. Furthermore, an MS Excel array formula was implemented to report the number of remaining formulae when manually entering the isotope abundance accuracy in percent (according to the mass spectrometer specifications). Mass spectrometry always reports charged species. For the correct use of the software, the neutral form of the molecule is required. In this case the charge of the molecular ion can be removed and hydrogen is added or subtracted to retrieve the neutral form of the molecule (mass of proton and electron = 1.007825 u). Any other adduct must be removed in the same manner. In Table 3, isotope abundance examples were taken from individual compounds that were randomly selected from 48,000 example formulae in the range of 150–900 u, each of which had to pass LEWIS and SENIOR checks and had to include C and H out of the list of C, H, N, S, O and P. Accordingly, selection of another compound for each mass example would change the single result given in the 'isotope abundance accuracy' columns, but not the overall conclusions. For all cases, the MWTWIN smart H option was applied, excluding potential formulae with unusual element ratios (e.g. C26H4) [39] that do not occur in metabolome compositions.
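The conversion from a measured singly charged ion to the neutral molecule described above can be written out as a short sketch. The function names are hypothetical, only [M+H]+ and [M-H]- ions are covered, and the electron mass is a rounded standard value that is not taken from the text.

```python
M_H = 1.007825         # hydrogen atom (proton + electron) in u, as given in the text
M_ELECTRON = 0.000549  # electron mass in u (rounded standard value, an assumption)


def neutral_from_protonated(mz):
    """[M+H]+ -> M: add back the missing electron, then subtract one hydrogen atom."""
    return mz + M_ELECTRON - M_H


def neutral_from_deprotonated(mz):
    """[M-H]- -> M: remove the extra electron, then add one hydrogen atom."""
    return mz - M_ELECTRON + M_H


if __name__ == "__main__":
    # Hypothetical [M+H]+ ion of a compound with neutral mass 180.06339 u.
    print(round(neutral_from_protonated(181.070666), 5))
```

Neglecting the electron mass shifts the result by about 0.0005 u, which corresponds to roughly 0.9 ppm at m/z 615, so the correction is not negligible at the accuracy levels discussed here.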
A complete matrix containing all results for 10, 5, 3, 1 and 0.1 ppm and 20, 10, 5, 2 and 1% isotopic abundance accuracy for 150–900 u can be found at http://fiehnlab.ucdavis.edu.

Authors' contributions
Both authors contributed equally to the work.

Acknowledgements
We thank Matthew Monroe for making MWTWIN [23] available as freeware. We thank Alan L. Rockwood (ARUP-Lab) and Steve Van Orden (Bruker Daltonics) for providing the Mercury6 source code for accurate isotopic pattern calculation for free. We thank Ernst Schumacher (University of Bern) for providing the chemical formula generator "CHEFOG" to the public domain.