Copyright © 2008 The Author(s)
MAPU 2.0: high-accuracy proteomes mapped to genomes
1Department of Proteomics and Signal Transduction, Max-Planck Institute for Biochemistry, Am Klopferspitz 18, D-82152 Martinsried, Germany and 2European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
*To whom correspondence should be addressed. Tel: +49 89 8578 2557; Fax: +49 89 8578 2219; Email: mmann/at/biochem.mpg.de
Received September 15, 2008; Accepted October 7, 2008.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
The MAPU 2.0 database contains proteomes of organelles, tissues and cell types measured by mass spectrometry (MS)-based proteomics. In contrast to other databases, it is meant to contain a limited number of experiments and only those with very high-resolution and high-accuracy data. MAPU 2.0 displays the proteomes of organelles, tissues and body fluids or, conversely, displays the occurrence of proteins of interest in all these proteomes. The new release addresses MS-specific problems including ambiguous peptide-to-protein assignments, and it provides insight into general functional features on the protein level, ranging from gene ontology classification to comprehensive SwissProt annotation. Moreover, the derived proteomic data are used to annotate the genomes using the Distributed Annotation Service (DAS) via EnsEMBL services. MAPU 2.0 is a model for a database specifically designed for high-accuracy proteomics and a member of the ProteomExchange Consortium. It is available online at http://www.mapuproteome.com.
INTRODUCTION
Mass spectrometry (MS)-based proteomics has progressed dramatically in throughput, accuracy and sensitivity and this trend shows no sign of abating (1,2). In order to make the data useful to the larger biomedical community, they have to be easily accessible via the web. Existing proteome databases have focused on different aspects of proteomic data capture and data mining. The PRIDE database, for example, was primarily developed as a vehicle to share proteomic experiments (3). The PeptideAtlas project, on the other hand, is mainly interested in collecting data on peptide fragmentation patterns as a tool for improving future proteomics experiments (4,5). In a similar vein, the Global Proteome Machine Database (GPMDB) collects tandem mass spectra with a view to improving peptide identification (6,7). These and other proteome databases are beginning to be connected in the ProteomExchange Consortium (8).
In contrast to the above efforts, our group has focused on the development of a proteome database specifically for very high-resolution and high-accuracy data. Such data could previously only be generated by a few specialized laboratories, but the required instrumentation has now spread to hundreds of sites. By only admitting high-resolution data, we avoid a problem endemic to databases that aggregate a wide variety of heterogeneous data, namely the control of overall false positive rates for protein identification. The Max-Planck Unified (MAPU) proteome database contains data from large-scale projects on the mapping of body fluids, tissues and cell lines (9). Its new version, MAPU 2.0, provides a comprehensive proteome information system that integrates the data of combined large-scale proteomic projects and includes protein annotations from standard protein databases, such as UniProt (10).
To allow the peptide-based retrieval of high-accuracy proteomic data across projects in a scalable way, we changed the basic concept of the MAPU database completely. The main modifications are the combination of various proteomic sub-databases, the use of a modern programming environment (C# and .NET) allowing a rich graphical user experience, solutions to MS-specific problems such as peptide-to-protein assignments, the inclusion of additional large-scale proteomic datasets, detailed cross-references to SwissProt annotations and a two-way connection to EnsEMBL using Distributed Annotation Service (DAS) technology. The last point is particularly pertinent because, as the number of sequenced genomes increases rapidly (11), the annotation of these sequences with biological information becomes increasingly important. Mapping large-scale data derived from MS-based proteomics to the genome sequence is one valuable annotation because it verifies genes and gene models for part of the genome. The EnsEMBL project provides an excellent system to integrate any kind of data that contributes to the annotation of the genome (12,13). In MAPU 2.0, we map high-accuracy proteomic data to the genome in a two-way fashion and use the DAS source system to illustrate certain features, including the presence of the protein in specific cell types for each identified gene transcript.
METHODS AND MATERIALS
General concept of MAPU 2.0
The initial content and format of MAPU have been described in Zhang et al. (9). The basic schema of the database has changed dramatically, and the new database version unifies all sub-databases by reassigning the measured peptides, along with their corresponding data from each experiment, to protein entries of an updated database version. The new architecture is based in part on concepts developed for the Phosphorylation Site Database [PHOSIDA; www.phosida.com (14,15)]. It allows the organism-specific retrieval of various cell-type and organelle associated proteomic data.
The user can query the database organism-specifically by protein name, protein description, gene symbol, accession of the database used for identification [such as the International Protein Index (IPI) (16)], SwissProt accession identifier, protein sequence or peptide sequence (Figure 1).
The peptide-to-protein assignment presents one of the main problems in ‘shotgun’ MS, where proteins are first digested to peptides, since a given peptide might occur in several proteins (17). Multiple occurrences of a certain peptide sequence can cause ambiguous protein assignments. In accordance with Occam's razor, we assign a given peptide sequence to the candidate protein with the highest number of peptides within one project. The user is alerted to this problem by color highlighting of the listed peptides: green indicates that the selected protein of interest has the maximum number of peptides in comparison to all other proteins that contain the same peptide, whereas blue indicates that there is another protein entry that contains the peptide and shows the same number of identified peptides in total. Red points to the occurrence of another protein that shows a higher number of detected peptides in total and is thus the more likely associated protein. When pointing the mouse to one of the corresponding ‘occurrences’ buttons, a blue pop-up box lists all protein entries that contain the given peptide, along with the total number of peptides identified for each entry (Figure 2).
In addition to the illustration of associated cell types and organelles along with the measured peptides, general information about the protein is provided: besides protein descriptions and full protein sequences, the corresponding Gene Ontology identifiers (18) are listed, linking to the Gene Ontology web site, which reports full descriptions of the selected annotation. Furthermore, the annotations to each instance include PubMed references and general features such as active sites, motifs, domains or signaling sites derived from SwissProt (Figure 3).
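The color-highlighting rule for ambiguous peptides described above can be illustrated with a minimal Python sketch. It is not the MAPU 2.0 implementation (which is written in C#); the function name, the data layout and the protein and peptide labels are hypothetical, and treating a peptide that occurs in only one protein as green is an assumption.

```python
def highlight_color(peptide, selected, protein_peptides):
    """Return the highlight color for `peptide` when viewing `selected`.

    `protein_peptides` maps each protein identifier to the set of peptide
    sequences identified for it within one project (hypothetical layout).
    """
    if peptide not in protein_peptides[selected]:
        raise ValueError(f"{peptide!r} was not identified for {selected!r}")

    own_count = len(protein_peptides[selected])
    # Peptide counts of all *other* proteins that contain the same peptide.
    rival_counts = [
        len(peptides)
        for protein, peptides in protein_peptides.items()
        if protein != selected and peptide in peptides
    ]

    # Assumption: a peptide found in no other protein counts as green.
    if not rival_counts or max(rival_counts) < own_count:
        return "green"  # selected protein has the most peptides
    if max(rival_counts) == own_count:
        return "blue"   # another protein ties with the selected one
    return "red"        # another protein has more peptides


# Hypothetical project: four proteins sharing some peptides.
project = {
    "P1": {"AAK", "CCR", "DDK"},
    "P2": {"AAK", "EEK"},
    "P3": {"CCR", "FFK", "GGR"},
    "P4": {"DDK", "HHK", "IIR", "KKR"},
}

for pep in sorted(project["P1"]):
    print(pep, highlight_color(pep, "P1", project))
```

Viewed from the hypothetical protein P1, the three peptides cover the three cases: one shared with a protein that has fewer peptides, one shared with a protein that has the same number, and one shared with a protein that has more.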
Proteomics datasets containing quantitative information in the form of isotope ratios are becoming the norm rather than the exception (19). In this case, the median of the quantitative values of all measured and assigned peptides is taken to quantify the protein.
Additionally, each displayed web page includes a question mark button that directs to the help section of MAPU 2.0, which describes the format of the current page or exemplifies the web application guideline. These help sections are also available via the ‘background’ section of MAPU 2.0, which also contains general descriptions of the experimental designs of various projects. To allow the retrieval of legacy sub-databases that could not be included in the new concept, a link to the old database version is provided. This is the case for the organellar database (20) as well as the red blood cell database (21), as both datasets are exclusively protein-based and therefore cannot be mapped to MAPU 2.0 due to the lack of peptide information.
MAPU 2.0 is based on a modern and scalable software architecture, namely C# and the ASP.NET technology. This allows MAPU 2.0 to share class libraries with PHOSIDA (15). The concepts and web applications of MAPU 2.0 and PHOSIDA are very similar and show that very distinct proteomic databases can be built using shared components.
Genome Annotation
We spent particular efforts on precisely mapping our high-resolution proteomic data to the genome. For this purpose, we extracted the measured peptides of each proteomic dataset and reassigned the peptide sequences to genes annotated in the EnsEMBL database (11). If a specified peptide matches sequences of more than one translated gene, we assigned the peptide to the gene transcript that shows the highest number of matching peptides in total within the associated project. Therefore, the peptide-to-gene transcript assignment results in one-to-one relationships, reducing potential redundancy.
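The peptide-to-transcript assignment just described can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the function name, the input layout and the transcript labels are hypothetical, and because the text does not say how ties are resolved, the sketch breaks ties by choosing the alphabetically first transcript identifier.

```python
def assign_peptides(transcript_peptides):
    """Assign each peptide to exactly one transcript within one project.

    `transcript_peptides` maps a transcript identifier to the set of
    peptide sequences matching its translation (hypothetical layout).
    Returns a dict mapping peptide -> transcript.
    """
    # Total number of matching peptides per transcript.
    counts = {t: len(set(peps)) for t, peps in transcript_peptides.items()}

    # Collect every transcript that contains each peptide.
    candidates = {}
    for transcript, peptides in transcript_peptides.items():
        for peptide in set(peptides):
            candidates.setdefault(peptide, []).append(transcript)

    # Pick the transcript with the most matching peptides.
    # Assumption: ties go to the alphabetically first identifier.
    return {
        peptide: min(transcripts, key=lambda t: (-counts[t], t))
        for peptide, transcripts in candidates.items()
    }


# Hypothetical transcripts and peptides.
transcripts = {
    "T1": {"AAK", "CCR", "DDK"},
    "T2": {"AAK", "EEK"},
    "T3": {"EEK"},
}

assignment = assign_peptides(transcripts)
for peptide in sorted(assignment):
    print(peptide, "->", assignment[peptide])
```

Because every peptide appears once as a key in the result, the mapping is one-to-one from the peptide's point of view, which is what removes the redundancy of listing a shared peptide under several transcripts.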
The genome annotation section is accessible via the ‘notepad’ button located next to the main ‘web book’ of MAPU 2.0 (Figure 4).
Furthermore, each chromosome is divided into 93 bins: on the left-hand side, the number of transcripts annotated in EnsEMBL is displayed. Selecting one of the bin boxes pops up the EnsEMBL web page, showing a detailed view of the selected chromosome region. On the right-hand side, the number of transcripts that have been detected in any of the uploaded projects is illustrated for each bin. Clicking on one of these bin buttons results in the listing of all identified gene transcripts along with the descriptions of the corresponding genes and exact localizations on the chromosome. Moreover, a link is provided for each gene transcript that connects to the EnsEMBL homepage displaying the full annotation of that transcript. In addition to the general annotation of the given gene transcript, the pop-up EnsEMBL page will show all peptides that have been identified via the MAPU 2.0 DAS source (12). Thus, whenever a web user requests the information provided by MAPU 2.0 on EnsEMBL, the data included in the MAPU 2.0 database are illustrated via the DAS/Proserver system (13). Clicking on one of the illustrated peptides yields a report of all the cell types that contain the peptides for this gene transcript. In addition to the MAPU 2.0 DAS source, we have also established a PHOSIDA DAS source providing all phosphorylation sites that have been unambiguously identified (Class 1 sites) (14), but also phosphosites that lack precise identification within the phosphorylated peptide sequence due to insufficient fragment information in MS/MS (ambiguous sites).
SUMMARY AND PERSPECTIVES
MAPU 2.0 is a database specifically created for high-resolution, high-accuracy proteomic data. It provides a user-friendly environment and several of its concepts are innovative and could be transferred to proteomic databases of a more general nature.
We addressed MS-specific problems including ambiguous peptide-to-protein assignments by straightforward approaches such as color highlighting of given peptide sequences. In addition, we used the proteomic data that are integrated in MAPU 2.0 to annotate the genome via the DAS technology provided by the EnsEMBL project. MAPU 2.0 is becoming a member of the ProteomExchange Consortium, allowing its data to be exchanged with other databases. Mass spectrometric data is becoming much more accurate and faster to produce, paralleling in some ways the advent of next generation sequencing technology. This will also bring particular opportunities and challenges. For instance, we believe that proteomic data is now so readily produced in high quality that it does not make sense to store all proteomics results accompanying publications in central databases, particularly if the data was generated in ‘one-off’ projects and with low resolution technology. Instead, reference proteomes should be measured with extremely high accuracy and in dedicated state-of-the-art facilities. We intend to further develop MAPU with a view to serving as a model database for such high-accuracy reference proteomes.
FUNDING
Marie Curie Fellowship (to F.G.). Part of this work was supported by ‘Interaction Proteome’, a 6th Framework program by the EU directorate, which also funded the open access charge.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENT
We thank Andrew Jenkinson and Eugene Kulesha for helping to establish the DAS system for MAPU 2.0, Phani Garapati and Markus Fritz for testing the web application and Michael Schuster for suggestions. We also thank all of the users of our website and especially those who have provided feedback.
REFERENCES
1. Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature. 2003;422:198–207.
2. Mann M, Kelleher NL. Precision proteomics: the case for high resolution and high mass accuracy. PNAS. 2008. doi:10.1073/pnas.0800788105.
3. Jones P, Cote RG, Martens L, Quinn AF, Taylor CF, Derache W, Hermjakob H, Apweiler R. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 2006;34:D659–D663.
4. Desiere F, Deutsch EW, King NL, Nesvizhskii AI, Mallick P, Eng J, Chen S, Eddes J, Loevenich SN, Aebersold R. The PeptideAtlas project. Nucleic Acids Res. 2006;34:D655–D658.
5. Deutsch EW, Lam H, Aebersold R. PeptideAtlas: a resource for target selection for emerging targeted proteomics workflows. EMBO Rep. 2008;9:429–434.
6. Craig R, Cortens JP, Beavis RC. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004;3:1234–1242.
7. Craig R, Cortens JC, Fenyo D, Beavis RC. Using annotated peptide mass spectrum libraries for protein identification. J. Proteome Res. 2006;5:1843–1849.
8. Hermjakob H, Apweiler R. The Proteomics Identifications Database (PRIDE) and the ProteomExchange Consortium: making proteomics data accessible. Expert Rev. Proteomics. 2006;3:1–3.
9. Zhang Y, Zhang Y, Adachi J, Olsen JV, Shi R, de Souza G, Pasini E, Foster LJ, Macek B, Zougman A, et al. MAPU: Max-Planck Unified database of organellar, cellular, tissue and body fluid proteomes. Nucleic Acids Res. 2007;35:D771–D779.
10. The UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2008;36:D190–D195.
11. Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al. Ensembl 2008. Nucleic Acids Res. 2008;36:D707–D714.
12. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. The distributed annotation system. BMC Bioinformatics. 2001;2:7.
13. Finn RD, Stalker JW, Jackson DK, Kulesha E, Clements J, Pettett R. ProServer: a simple, extensible Perl DAS server. Bioinformatics. 2007;23:1568–1570.
14. Olsen JV, Blagoev B, Gnad F, Macek B, Kumar C, Mortensen P, Mann M. Global, in vivo, and site-specific phosphorylation dynamics in signaling networks. Cell. 2006;127:635–648.
15. Gnad F, Ren S, Cox J, Olsen JV, Macek B, Oroshi M, Mann M. PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites. Genome Biol. 2007;8:R250.
16. Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R. The International Protein Index: an integrated database for proteomics experiments. Proteomics. 2004;4:1985–1988.
17. Nesvizhskii AI, Aebersold R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell Proteomics. 2005;4:1419–1440.
18. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. Gene ontology: tool for the unification of biology. Nat. Genet. 2000;25:25–29.
19. Ong SE, Mann M. Mass spectrometry-based proteomics turns quantitative. Nat. Chem. Biol. 2005;1:252–262.
20. Foster LJ, de Hoog CL, Zhang Y, Xie X, Mootha VK, Mann M. A mammalian organelle map by protein correlation profiling. Cell. 2006;125:187–199.
21. Pasini EM, Kirkegaard M, Mortensen P, Lutz HU, Thomas AW, Mann M. In-depth analysis of the membrane and cytosolic proteome of red blood cells. Blood. 2006;108:791–801.