Copyright © 2009 Schmidt et al

Assembling proteomics data as a prerequisite for the analysis of large scale experiments

1 Max Planck Institute for Infection Biology, Core Facility Protein Analysis, Berlin, Germany
2 The Biotechnology Centre of Oslo, University of Oslo, Oslo, Norway
3 Max Planck Institute for Infection Biology, Core Facility Bioinformatics, Berlin, Germany
4 Interfaculty Institute for Genetics and Functional Genomics, University of Greifswald, Greifswald, Germany

Corresponding author.
Frank Schmidt: Frank.Schmidt/at/uni-greifswald.de; Monika Schmid: Schmid/at/mpiib-berlin.mpg.de; Bernd Thiede: Bernd.Thiede/at/biotek.uio.no; Klaus-Peter Pleißner: pleissner/at/mpiib-berlin.mpg.de; Martina Böhme: Boehme/at/mpiib-berlin.mpg.de; Peter R Jungblut: jungblut/at/mpiib-berlin.mpg.de

Received October 10, 2008; Accepted January 23, 2009.

Abstract

Background
Despite the complete determination of the genome sequences of a large number of bacteria, their proteomes remain relatively poorly defined. Besides new methods to increase the number of identified proteins, new database applications are necessary to store and present the results of large-scale proteomics experiments.

Results
In the present study, a database concept has been developed to address these issues and to offer complete information via a web interface. In our concept, the Oracle-based data repository system SQL-LIMS plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as the 20S proteasome. Technical operations of our proteomics labs were used as the standard for SQL-LIMS template creation. By means of a Java-based data parser, post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS.
A minimum set of the proteomics data was transferred into our public 2D-PAGE database using a Java-based interface (Data Transfer Tool, DTT) in accordance with the requirements of the PEDRo standardization. Furthermore, the stored proteomics data were extractable from SQL-LIMS via XML.

Conclusion
The Oracle-based data repository system SQL-LIMS played the central role in the proteomics workflow concept. Technical operations of our proteomics labs were used as standards for SQL-LIMS templates. Using a Java-based parser, post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS, 1-DE and 2-DE, were stored in SQL-LIMS. Thus, the unique data formats of different instruments were unified and stored in SQL-LIMS tables. Moreover, a unique submission identifier allowed fast access to all experimental data. This was the main advantage compared to multi-software solutions, especially when personnel turnover is high. In addition, large-scale and high-throughput experiments must be managed in a comprehensive repository system such as SQL-LIMS in order to query results in a systematic manner. On the other hand, these database systems are expensive and require at least one full-time administrator and a specialized lab manager. Furthermore, the rapid technical development in proteomics may cause problems in adapting to new data formats. To summarize, SQL-LIMS met the requirements of proteomics data handling, especially in skill-intensive processes such as gel electrophoresis or mass spectrometry, and fulfilled the PSI standardization criteria. The data transfer into the public domain via DTT facilitated the validation of proteomics data. Additionally, the evaluation of mass spectra by post-processing with MS-Screener improved the reliability of mass analysis and prevented the storage of junk data.

Background
A major goal of proteomics is the large-scale study of proteins, particularly their structures and functions, including the global qualitative and quantitative analysis of proteins in defined biological systems.
The term proteomics was chosen by analogy with genomics, but proteomics is significantly more complex. As a result of alternative splicing, point mutations, degradation and co- and post-translational modifications, the number of protein species [1] of a proteome by far exceeds the number of protein-coding genes of the corresponding genome. In the past, qualitative proteome profiling has overcome limitations in protein identification owing to remarkable developments in mass spectrometry. Increased sensitivity and mass accuracy in conjunction with comprehensive database annotations allow the high-throughput identification of proteins. On the other hand, quantitative profiling, an essential part of proteomics, requires technologies that accurately, reproducibly, and comprehensively quantify proteins. In recent years, novel mass spectrometry-based methods such as ICAT [2], SILAC [3] and iTRAQ [4] were developed for relative quantification.

The amount of identification and quantification data has increased dramatically in recent years and has resulted in the accumulation of "metadata", that is, data about data. The manufacturers of ESI-MS and MALDI-MS instruments and of image analysis software have endeavored to close the gap between the increased amount of information and its interpretation. However, this mostly resulted in individual solutions for each company, which hampered the exchange of experimental data. Besides commercial solutions, some open LIMS systems such as PROTEIOS [5] or the open source laboratory information management system for 2-D gel electrophoresis-based proteomics workflows [6] are available free of charge, and some of them were compared in more detail by Piggee et al. [7]. The representation of protein data must be standardized to compare proteomics results worldwide. For this purpose, some solutions were proposed, such as the Proteomics Standards Initiative (PSI) [8,9] and PEDRo [10].
This led to the adoption of XML or of the specialized formats mzXML [11] and mzML [12], which are open file formats for data exchange.

In our concept, the Oracle-based data repository system SQL-LIMS™ (Applied Biosystems, Foster City, USA) plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as the 20S proteasome. Technical operations of our proteomics workflow were used as the standard for SQL-LIMS™ template creation. Post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis, were stored in SQL-LIMS™ by using a Java-based data parser. A minimum set of the proteomics data was transferred into the web-accessible Proteome Database System for Microbial Research http://www.mpiib-berlin.mpg.de/2D-PAGE/ [13] using a Java-based interface (Data Transfer Tool) in accordance with the requirements of the PEDRo standardization. Furthermore, the stored proteomics data were extractable from SQL-LIMS™ as XML documents.

Results and discussion

Concept for integration of proteomics data
We applied a variety of 2-DE and LC-based approaches for the comprehensive proteome analysis of microorganisms and other protein complexes. These technologies included 2-DE/MS coupled with image analysis, 1-DE/MS, ICAT/1-DE/MS, ICAT/2-DE/MS, LC/MS and ICAT/LC/MS (Figure 1).
However, there is no doubt that the administration of programs such as SQL*LIMS™ is time-consuming due to difficulties in template and interface programming. Thus, SQL*LIMS™ needed to be maintained by at least one full-time administrator and a specialized lab manager. To avoid extensive training in SQL*LIMS™ and to make proteomics data available, we have developed a data transfer tool (DTT), as shown in Figure 1.

Data storing in SQL*LIMS™
The requirements for data storage in SQL*LIMS™ depend on the experimental workflow. As a result, the data management system must contain specifically designed features (Figure 2).
Transfer of SQL*LIMS™ data into the Intranet/Internet database via DTT
In order to share the experimental results easily with other laboratories, the DTT was designed to facilitate the transfer of data out of SQL*LIMS™ into the proteome database system (Figure 3).
Pre- and post-processing LC/ESI-MS/MS data
Tandem mass spectrometry has been particularly useful for determining the protein components of complex mixtures. The following strategy was applied to evaluate LC/ESI-MS/MS peak list data: MS/MS spectra were automatically transformed into peak lists (.dta files) by SEQUEST and subsequently imported into MS-Screener to generate data matrices. The binary matrices were subjected to hierarchical agglomerative cluster analyses performed by means of the hclust function within R. An example of the cluster analyses is illustrated in Figure 4.
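The step from peak lists to a binary matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MS-Screener implementation: the fixed bin width, the Jaccard distance and all function and spectrum names are hypothetical choices, since the text specifies neither how masses were matched across spectra nor which dissimilarity measure was passed to hclust. The sketch shows how presence/absence rows and pairwise distances could be derived before a hierarchical clustering step.

```python
from itertools import combinations


def bin_mass(mass, bin_width=1.0):
    """Map a peak mass to an integer bin index (bin_width is an assumed parameter)."""
    return int(mass // bin_width)


def binary_matrix(peak_lists, bin_width=1.0):
    """Build a presence/absence matrix: one row per spectrum, one column per mass bin."""
    binned = {name: {bin_mass(m, bin_width) for m in masses}
              for name, masses in peak_lists.items()}
    columns = sorted(set().union(*binned.values()))
    rows = {name: [1 if c in bins else 0 for c in columns]
            for name, bins in binned.items()}
    return columns, rows


def jaccard_distance(row_a, row_b):
    """Return 1 - |intersection| / |union| for two binary rows (0.0 if both are empty)."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def distance_table(rows):
    """Compute the pairwise distance for every unordered pair of spectra."""
    return {(a, b): jaccard_distance(rows[a], rows[b])
            for a, b in combinations(sorted(rows), 2)}


# Hypothetical peak lists (spectrum name -> peak masses), for illustration only.
peak_lists = {
    "spec_A": [842.51, 1045.56, 2211.10],
    "spec_B": [842.49, 1045.60, 1475.78],
    "spec_C": [1179.60, 1475.75],
}
columns, rows = binary_matrix(peak_lists)
distances = distance_table(rows)
print(columns)    # mass bins used as matrix columns
print(distances)  # pairwise dissimilarities between spectra
```

A table of pairwise dissimilarities like this one is the kind of input that agglomerative clustering routines such as hclust in R operate on. Fixed-width binning is the simplest matching scheme and can split two nearly identical masses that fall on either side of a bin boundary, so a tolerance-based matching would be a reasonable alternative.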
Pre- and post-processing MALDI-MS data
Proteins separated by 2-DE were identified by peptide mass fingerprinting (PMF) after in-gel digestion. A Voyager Elite MALDI-TOF mass spectrometer and/or a 4700 Proteomics Analyzer MALDI-TOF/TOF instrument was used for this purpose. MS peak lists were generated by the program GRAMS or the Peak-to-Mascot script of the program 4700 Explorer™. In addition, the peak lists were evaluated by the program MS-Screener. Experimentally derived contaminant masses, e.g., masses matching matrix, keratins, autolysis products of trypsin or dye, were detected and deleted from the spectra [14]. The simplified peak lists were analyzed by PMF using search algorithms such as Mascot or MS-Fit. Subsequently, the modified peak lists and the identification results were parsed and stored in SQL*LIMS™.

Experimental

Two-dimensional electrophoresis (2-DE)

Automated 2-DE spot processing
High-throughput MALDI-MS PMF was performed as follows: spots of interest were excised from 2-DE gels, transferred into 96-well microtiter plates, and digested with trypsin using a spot cutter (Proteome Works, Bio-Rad, Hercules, CA, USA). Subsequently, equal volumes of the resulting peptides and α-cyano-4-hydroxycinnamic acid (CHCA) were mixed and spotted onto MALDI templates by the Ettan spot-handling workstation (Amersham Biosciences, Uppsala, Sweden). MALDI spectra were then internally calibrated and the resulting peak lists exported using the "Peak-to-Mascot" script of the 4700 Explorer software (Version 2.0) (Applied Biosystems, Foster City, USA). The parameters applied for this process (signal-to-noise ratio, mass range, peak density, etc.) were optimized. Afterwards, the MS-Screener program was used to determine and remove common contaminant masses.

Data analysis by MS-Screener
The program MS-Screener (Version 1.0.1) was applied to evaluate large datasets of peak lists.
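The detection and removal of common contaminant masses described above can be illustrated with a short Python sketch. It is not the MS-Screener algorithm: the rounding to one decimal place, the `min_fraction` threshold, the 0.1 tolerance, the function names and the example peak lists are all assumptions made for illustration. The idea shown is that masses recurring across most peak lists of a dataset are flagged as suspected contaminants and then deleted from each list.

```python
from collections import Counter


def common_masses(peak_lists, min_fraction=0.5, decimals=1):
    """Return masses (rounded to `decimals`) found in at least min_fraction of the peak lists."""
    counts = Counter()
    for peaks in peak_lists:
        # Use a set so that a mass is counted at most once per peak list.
        counts.update({round(mz, decimals) for mz in peaks})
    threshold = min_fraction * len(peak_lists)
    return sorted(m for m, n in counts.items() if n >= threshold)


def remove_contaminants(peaks, contaminants, tolerance=0.1):
    """Keep only the peaks that are farther than `tolerance` from every contaminant mass."""
    return [mz for mz in peaks
            if all(abs(mz - c) > tolerance for c in contaminants)]


# Hypothetical peak lists; 842.5 and 2211.1 stand in for recurring background peaks.
peak_lists = [
    [842.51, 1045.56, 1320.70, 2211.10],
    [842.50, 1179.60, 1475.78, 2211.11],
    [842.52, 1638.86, 2211.09],
]
suspected = common_masses(peak_lists, min_fraction=1.0)
cleaned = [remove_contaminants(peaks, suspected) for peaks in peak_lists]
print(suspected)  # masses flagged as common to all peak lists
print(cleaned)    # peak lists after removal of the flagged masses
```

In practice the threshold would need care, because a mass that recurs across many spots may also belong to an abundant protein present in several spots rather than to a contaminant.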
MS-Screener comprised 162 Java classes and was developed for the Java 2 Runtime Environment (Version J2RE 1.4.1; http://java.sun.com/). MS-Screener offers multi-platform support for Linux, Solaris and Microsoft Windows, including a graphical user interface (GUI). Graphical representations of peak lists as plot views have been integrated using the JFreeChart class library (Version 0.9.13) http://www.jfree.org/jfreechart/index.html, published under the GNU Lesser General Public License. MS-Screener facilitates the import and export of ASCII files (.pkm (GRAMS, Applied Biosystems, Framingham, USA), .pkt (Data Explorer, Applied Biosystems, Framingham, USA), .txt (Peak-to-Mascot, 4700 Explorer, Applied Biosystems, Framingham, USA) and .dta (SEQUEST, Thermo Finnigan, San Jose, USA)) and data exchange via other interfaces. MS-Screener was used for many tasks, e.g. the detection of common mass peaks, the elimination of contaminant masses, and the calculation of the half decimal places rule [14]. Furthermore, it was used to generate peak list matrices as a prerequisite for cluster analyses using R. Moreover, the recalibration of binary peak lists and a peak pair comparison tool to determine ICAT ratios were applied. The MS-Screener results were transformed into tab-separated files (.txt) to transfer the data into SQL*LIMS™.

Mass spectrometry and protein identification/quantification
For protein identification, 2-DE spots were analyzed by MALDI-MS or MS/MS or ESI-MS/MS [16,18-20]. In most cases, spots to be identified were digested with trypsin prior to MS analysis [21]. MALDI-MS was carried out using a Voyager Elite MALDI-TOF mass spectrometer or a 4700 Proteomics Analyzer MALDI-TOF/TOF (both from Applied Biosystems, Framingham, USA). Protein identifications were achieved by database comparisons using search algorithms such as Mascot [22] or MS-FIT http://prospector.ucsf.edu, whereby Mascot was available as an in-house version.
Searches were carried out either individually or in batch mode (analysis of large datasets). In the latter case, Mascot-Daemon http://www.matrixscience.com was used as the batch interface. Individual searches were performed via the Mascot web front end or the SQL-LIMS™ clients, both of which were connected to the in-house Mascot server. The search parameters applied have been described previously [21].

Moreover, proteins were separated and identified by large-scale on-line LC/ESI-MS/MS. The protein samples were prepared as described [23] and measured with an LCQ ion trap mass spectrometer (Thermo Finnigan, San Jose, USA). For peptide identification, the generated MS/MS spectra were evaluated using the SEQUEST analysis program and/or Mascot. In order to quantify differences between 20S proteasome subtypes [15,24] and between the proteomes of M. tuberculosis and M. bovis BCG [23], proteins were labelled with the ICAT reagent and analyzed by LC/ESI-MS/MS. To calculate the relative ratios, MS spectra were evaluated by the program Xpress. Furthermore, a complementary approach combining ICAT and 2-DE was used to detect differences in protein abundance, which were quantified by the program MS-Screener [24]. An iterative search procedure was applied for the in-depth analysis of large 2-DE/MALDI-MS datasets [14].

SQL-LIMS™ Proteomics Solution
The workflow described above requires a suitable system for the integration and management of raw and processed experimental data. These issues were addressed by the Laboratory Information Management System (LIMS) in combination with an implemented SQL*LIMS™ Proteomics Solution, customized for our proteomics research laboratory. The implemented solution was based on the Applied Biosystems™ product suite for life science, including a core application (SQL*LIMS™). The latter was designed for analytical laboratories, pharma R&D and manufacturing environments.
Furthermore, components specifically designed for the data management of microtiter plates (SQL*GT™) and proteomics (Proteomics Solution) were implemented. The operating flexibility and extensibility of this solution have minimized the requirement for code customization. SQL*LIMS™ users can enter new workflows or amend existing ones, and have access to open interfaces that provide an add-on and built-in mechanism for the integration of MS instruments and third-party tools. A highly integrated environment was regarded from the very beginning as a key factor to enhance productivity by streamlining time-consuming operations such as MS data exchange (work list uploading and peak list downloading) or the querying of protein search engines.

Data transfer tool Java interface (DTT)
The data transfer tool was designed to facilitate the data transfer from SQL*LIMS™ into the public 2-DE database, which is the essential part of our Proteome Database System http://www.mpiib-berlin.mpg.de/2D-PAGE/. The DTT has been developed in Java using J2SE 1.4 http://java.sun.com/j2se/1.4 and Eclipse http://www.eclipse.org. The program comprised a graphical user interface (GUI) to enable the selection of the datasets to be transferred. For security reasons, data transfers out of SQL-LIMS™ were password-protected.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
FS and MS carried out the proteomics studies, participated in the database structure and the template creation, and prepared the manuscript. KPP contributed to the concept and the realization of the 2D-PAGE and the SQL-LIMS database. MB participated in the DTT tool development. BT participated in the realization of the manuscript. PRJ coordinated and conceived of the study, and participated in its design. All authors read and approved the final manuscript.

Acknowledgements
The authors thank Luigi Colombo from ABI for the support and the BMBF (031U107A/207A) for funding.
References