Copyright © The Author 2006. Published by Oxford University Press. All rights reserved

The PeptideAtlas project

1 Institute for Systems Biology, Seattle, WA, USA
2 Cedars-Sinai Medical Center, Los Angeles, CA, USA
3 UCLA Department of Chemistry and Biochemistry, Los Angeles, CA, USA
4 Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
5 Institute of Zoology, University of Zurich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
6 Institute of Molecular Systems Biology, Swiss Federal Institute of Technology, ETH Hönggerberg, Zürich, Switzerland
7 Nestlé Research Center, Vers-chez-les-Blanc, 1026 Lausanne, Switzerland

*To whom correspondence should be addressed. Email: fdesiere/at/yahoo.com; Email: frank.desiere/at/rdls.nestle.com

The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.

Received August 4, 2005; Revised October 3, 2005; Accepted October 3, 2005.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated.
For commercial re-use, please contact journals.permissions/at/oxfordjournals.org

Abstract

The completion of the sequencing of the human genome and the concurrent, rapid development of high-throughput proteomic methods have resulted in an increasing need for automated approaches to archive proteomic data in a repository that enables the exchange of data among researchers and also accurate integration with genomic data. PeptideAtlas (http://www.peptideatlas.org/) addresses these needs by identifying peptides by tandem mass spectrometry (MS/MS), statistically validating those identifications and then mapping identified sequences to the genomes of eukaryotic organisms. A meaningful comparison of data across different experiments generated by different groups using different types of instruments is enabled by the implementation of a uniform analytic process. This uniform statistical validation ensures a consistent and high-quality set of peptide and protein identifications. The raw data from many diverse proteomic experiments are made available in the associated PeptideAtlas repository in several formats. Here we present a summary of our process and details about the Human, Drosophila and Yeast PeptideAtlas builds.

INTRODUCTION

PeptideAtlas was initially designed to annotate eukaryotic genomes with peptide sequences obtained from mass spectrometry (MS) experiments. These peptide sequences can be collected using the procedure summarized in Figure 1.
The identified peptide sequences are then mapped onto their respective genome sequence. First, a modified BLAST (3) algorithm is used (with the parameters adapted for searching small peptides: -E 1 -W 2 -M PAM30 -G 9 -e 10 -K 50 -b 50 -F F) to exactly match each peptide to an organism's reference protein database; next, exact and complete matches are used to infer a peptide's chromosomal coordinates; finally, the results are loaded into the PeptideAtlas relational database. We denote each execution of the mapping process as a ‘build’. The database schema (available at the project website, http://www.peptideatlas.org) can accommodate multiple PeptideAtlas builds; that is, it can handle a variety of organisms and a variety of reference protein sequence sets. Visualization of the results is then achieved using the Distributed Annotation System (DAS) (4) in conjunction with the Ensembl genome browser (5).

Owing to growing interest from the scientific community, other PeptideAtlas builds were added to the initial human PeptideAtlas (6), including a build for human plasma/serum (7) and builds for other species such as Drosophila melanogaster and the yeast Saccharomyces cerevisiae. A summary of the currently available builds is shown in Figure 2.
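
The two mapping steps described above (exact matching of a peptide against reference proteins, then inference of chromosomal coordinates) can be illustrated with a minimal sketch. This is not the PeptideAtlas implementation: plain substring search stands in for the modified BLAST step, and the `Exon` type, function names, protein sequences and coordinates are all hypothetical. It further assumes a forward-strand gene whose exons are listed in order and contain only coding bases starting in phase 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    """Hypothetical coding exon on the forward strand (1-based, inclusive)."""
    start: int
    end: int


def find_exact_matches(peptide, proteins):
    """Return (protein_id, 0-based residue offset) for every exact occurrence.

    Substring search is used here in place of the BLAST search in the text.
    """
    hits = []
    for protein_id, sequence in proteins.items():
        pos = sequence.find(peptide)
        while pos != -1:
            hits.append((protein_id, pos))
            pos = sequence.find(peptide, pos + 1)  # allow overlapping hits
    return hits


def peptide_to_genome(offset, length, exons):
    """Map a peptide (residue offset and length) onto genomic blocks.

    Returns a list of (start, end) genomic intervals, one per exon that the
    peptide's codons overlap; a peptide spanning a splice junction yields
    more than one block.
    """
    cds_start = offset * 3            # first coding base, 0-based
    cds_end = cds_start + length * 3  # one past the last coding base
    blocks = []
    consumed = 0                      # coding bases covered by earlier exons
    for exon in exons:
        exon_len = exon.end - exon.start + 1
        lo = max(cds_start, consumed)
        hi = min(cds_end, consumed + exon_len)
        if lo < hi:
            blocks.append((exon.start + lo - consumed,
                           exon.start + hi - consumed - 1))
        consumed += exon_len
    return blocks


if __name__ == "__main__":
    # Hypothetical reference proteins and exon structure.
    proteins = {"P1": "MKTAYIAKQR", "P2": "AAYIAKAYIAK"}
    exons = [Exon(1000, 1011), Exon(2000, 2029)]
    for protein_id, offset in find_exact_matches("AYIAK", proteins):
        print(protein_id, offset)
    print(peptide_to_genome(3, 5, exons))
```

In this toy example the peptide starts at residue offset 3, so its 15 coding bases begin in the last codon of the first exon and continue into the second, which is why two genomic blocks are returned.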
FEATURES OF PeptideAtlas

The current Human build of PeptideAtlas (April 2005) contains peptide sequences identified in 90 proteomic experiments, in which proteins were extracted from various cell and tissue types. This number of experiments represents a considerable increase compared with the original build (6) and comprises published as well as a large number of as yet unpublished human datasets from various cell types, such as T cells, B cells, lymphocytes, lymphoblasts, hepatocytes, intestinal cells, hepatoma cells and others. A full listing of all the experiments and samples currently in PeptideAtlas can be found at the project website. The raw data for all published or released datasets are also provided in a repository there. In the April 2005 build, 3.3 million MS/MS spectra were searched and yielded 35 391 distinct peptides with PeptideProphet probability P ≥ 0.9 that were mapped onto 11 115 of the human Ensembl proteins (version 30.35c; March 22, 2005). These proteins represent unique proteins or splice forms from 30% of human genes in Ensembl.

Figure 2

Repository function.

It is our intent to make publicly available for download as much as possible of the raw data that we use to build the PeptideAtlas. This includes datasets that have been previously published or otherwise released by the data producers. Unpublished raw datasets that were used to build PeptideAtlas are kept private until publication or release by the authors. There are currently ~75 human and yeast experiments available for download in the repository, totaling over 85 GB of downloads. We provide for download the raw MS/MS files, mzXML format MS/MS files (8), full SEQUEST search results, PeptideProphet results as well as the final ProteinProphet (9) output file. Simple sample descriptions and links to related publications are also provided.

Database function.
The results from the PeptideAtlas builds can be downloaded or browsed by users via the PeptideAtlas web interface, as depicted in Figure 3.

Users can view the most current ISB Human PeptideAtlas tracks in the Ensembl genome browser by following the instructions on the website. Once the PeptideAtlas tracks have been defined, the peptide coordinate URLs in our web database interfaces link to a view of the peptide in the Ensembl genome browser. From that genome view, one can also link back to the PeptideAtlas database by clicking on the peptide link.

STATISTICS FOR PeptideAtlas

Reliably estimating the false positive error rates in an automated fashion is critical: large-scale datasets generated by high-throughput methods inherently contain a large number of false identifications (10). PeptideAtlas uses a program called PeptideProphet (9) to remove the majority of false positive identifications. PeptideProphet computes a probability that an assignment of an MS/MS spectrum to a peptide sequence is correct based on the database search scores, the difference between the measured and theoretical peptide masses, the expected and found number of termini for the type of enzymatic cleavage used and a variety of other factors. Probabilities computed by PeptideProphet have been shown to be accurate over the entire probability range and, therefore, can be used to filter out identifications whose probabilities fall below a chosen threshold (2). We provide options for users to browse or download versions of the PeptideAtlas built with several P thresholds.

FUTURE DIRECTIONS

PeptideAtlas is a first step toward the goal of fully annotating and validating eukaryotic genomes by using experimentally observed protein products. PeptideAtlas provides a process and a framework to accommodate proteome information generated by high-throughput proteomics technologies and is able to efficiently disseminate experimental data in the public domain. Its significance continues to grow as more data are submitted from diverse experiments, using different cellular compartments and enrichment methods.
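
The probability-threshold filtering described under STATISTICS FOR PeptideAtlas can be illustrated with a short sketch. It is not PeptideProphet itself: it only shows how precomputed probabilities might be used. The function names and the example identifications are hypothetical, and the error estimate relies on the assumption stated in the text that the probabilities are accurate, in which case the expected fraction of incorrect assignments among the accepted ones is the mean of (1 − P).

```python
def filter_by_probability(identifications, threshold=0.9):
    """Keep (peptide, probability) pairs with probability >= threshold."""
    return [(pep, p) for pep, p in identifications if p >= threshold]


def expected_false_positive_rate(accepted):
    """Estimate the fraction of incorrect assignments among accepted ones.

    Assumes each probability is an accurate chance of being correct, so the
    expected number of errors is the sum of (1 - p).
    """
    if not accepted:
        return 0.0
    return sum(1.0 - p for _, p in accepted) / len(accepted)


if __name__ == "__main__":
    # Hypothetical peptide identifications with made-up probabilities.
    identifications = [
        ("AYIAK", 0.99),
        ("GLSDGEWQLVLK", 0.95),
        ("TTTTK", 0.42),
        ("VEADIAGHGQEVLIR", 0.90),
    ]
    for threshold in (0.5, 0.9, 0.99):
        accepted = filter_by_probability(identifications, threshold)
        rate = expected_false_positive_rate(accepted)
        print(f"P >= {threshold}: {len(accepted)} accepted, "
              f"estimated error rate {rate:.3f}")
```

Raising the threshold trades sensitivity for a lower estimated error rate, which is one reason a user might prefer one of the several P-threshold versions over another.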
PeptideAtlas also provides a resource for the development of new avenues of research. The datasets will provide a rich source of data for computational scientists to develop and test new algorithms for proteomic analysis, gene discovery and splice variant prediction. The need for public proteomics data repositories is recognized (11) and we intend PeptideAtlas to continue to grow as a public database and resource. We strongly encourage researchers to contribute their own MS/MS data to the PeptideAtlas project. In the near future, we will make builds for organisms such as mouse, Arabidopsis thaliana and Halobacterium sp. NRC-1, and continue to make subsets such as the Human Plasma PeptideAtlas (7). Also in the near future we hope to provide an interface to access representative spectra of peptides, and will provide a way to retrieve information on peptide modifications (such as phosphorylation).

Acknowledgments

This project has been funded in part with funds from the National Heart, Lung, and Blood Institute; National Institutes of Health under contract no. N01-HV-28179. Funding to pay the Open Access publication charges for this article was provided by Nestlé.

Conflict of interest statement. None declared.

REFERENCES

1. Eng J., McCormack A.L., Yates J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994;5:976–989.
2. Keller A., Nesvizhskii A.I., Kolker E., Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002;74:5383–5392.
3. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
4. Dowell R.D., Jokerst R.M., Day A., Eddy S.R., Stein L. The distributed annotation system. BMC Bioinformatics. 2001;2:7.
5. Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., et al. Ensembl 2005. Nucleic Acids Res. 2005;33:D447–D453.
6. Desiere F., Deutsch E.W., Nesvizhskii A.I., Mallick P., King N.L., Eng J.K., Aderem A., Boyle R., Brunner E., Donohoe S., et al. Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biol. 2005;6:R9.
7. Deutsch E.W., Eng J.K., Zhang H., King N.L., Nesvizhskii A.I., Lin B., Lee H., Yi E.C., Ossola R., Aebersold R. Human Plasma PeptideAtlas. Proteomics. 2005;5:3497–3500.
8. Pedrioli P.G., Eng J.K., Hubley R., Vogelzang M., Deutsch E.W., Raught B., Pratt B., Nilsson E., Angeletti R.H., Apweiler R., et al. A common open representation of mass spectrometry data and its application to proteomics research. Nat. Biotechnol. 2004;22:1459–1466.
9. Nesvizhskii A.I., Keller A., Kolker E., Aebersold R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003;75:4646–4658.
10. Nesvizhskii A.I., Aebersold R. Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discov. Today. 2004;9:173–181.
11. Prince J.T., Carlson M.W., Wang R., Lu P., Marcotte E.M. The need for a public proteomics repository. Nat. Biotechnol. 2004;22:471–472.