Copyright © 2007, Cold Spring Harbor Laboratory Press

Whole proteome analysis of post-translational modifications: Applications of mass-spectrometry for proteogenomic annotation

1 Bioinformatics Program, University of California San Diego, La Jolla, California 92093, USA; 2 Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington 99352, USA; 3 Burnham Institute for Medical Research, La Jolla, California 92037, USA; 4 Fellowship for Interpretation of Genomes, Burr Ridge, Illinois 60527, USA; 5 San Diego State University, San Diego, California 92182, USA; 6 Department of Computer Science and Engineering, University of California San Diego, La Jolla, California 92093, USA

7 These authors contributed equally to this work.
8 Corresponding author. E-mail ngupta/at/ucsd.edu; fax (858) 534-7029.

Received February 22, 2007; Accepted June 12, 2007.

Abstract

While bacterial genome annotations have significantly improved in recent years, techniques for bacterial proteome annotation (including post-translational chemical modifications, signal peptides, proteolytic events, etc.) are still in their infancy. At the same time, the number of sequenced bacterial genomes is rising sharply, far outpacing our ability to validate the predicted genes, let alone annotate bacterial proteomes. In this study, we use tandem mass spectrometry (MS/MS) to annotate the proteome of Shewanella oneidensis MR-1, an important microbe for bioremediation. In particular, we provide the first comprehensive map of post-translational modifications in a bacterial genome, including a large number of chemical modifications, signal peptide cleavages, and cleavages of N-terminal methionine residues. We also detect multiple genes that were missed or assigned incorrect start positions by gene prediction programs, and suggest corrections to improve the gene annotation.
This study demonstrates that complementing every genome sequencing project with an MS/MS project would significantly improve both genome and proteome annotations for a reasonable cost.

The number of sequenced bacterial genomes has been increasing rapidly, with 70 out of 250 sequenced bacterial genomes finished in the last year alone (Benson et al. 2006). Annotation of these genomes continues to be a challenging task, requiring both automated analysis and manual curation. This challenge is greatly magnified in recent meta-genomic projects, which seek to sample DNA from the environment (Venter et al. 2004; http://www.sorcerer2expedition.org). In this article, we demonstrate the use of liquid chromatography–coupled mass spectrometry (LC-MS/MS) for both proteomic and genomic annotations of bacteria.

While gene finding in prokaryotes has significantly improved, the prediction of short genes, the annotation of genes with unusual codon usage, and the accurate prediction of start codons remain challenging. Moreover, biologically important problems of annotating post-translational modifications (chemical modifications, signal peptides, proteolytic events), unusual stop codons, programmed frameshifts, quantifying protein expression, etc., still cannot be solved with genomic techniques.

MS/MS is a key technology for proteomic analysis (Aebersold and Mann 2003; Jensen 2006) that fragments individual peptides and uses the resulting tandem mass-spectra as a “fingerprint” to identify the protein of origin. Chemical modifications at specific residues are detectable as a change in the fragmentation pattern (e.g., shifts in the masses of fragments containing the modification). These modifications are often critical to protein function, for example by regulating binding partners, subcellular localization, or the three-dimensional structure of the protein, or by modifying the activity of a catalytic site.
Protein modifications in prokaryotes are of great biological interest but are not yet well understood.

The idea of querying an MS/MS data set against a genome to identify protein-coding genes has been used earlier in different settings (Yates et al. 1995; Kuster et al. 2001; Oshiro et al. 2002; Fermin et al. 2006). Bacterial genomes, with a simple gene structure, are a particularly attractive target for such methods. The identified peptides validate the predicted genes, correct erroneous gene annotations, and reveal some completely missed genes. Church and colleagues used proteomic data for genome analysis of a relatively small bacterium, Mycoplasma pneumoniae (Jaffe et al. 2004a), and later of the newly sequenced Mycoplasma mobile, in which 26 genes were predicted exclusively based on proteomic data (Jaffe et al. 2004b). Similar efforts have been made for other bacterial genomes (Kalume et al. 2005; Wang et al. 2005).

Nevertheless, many significant technological challenges remain in using MS/MS for gene annotation, post-translational (proteolytic) processing, and modifications. For example, only one post-translational event was discovered in the whole proteome of M. pneumoniae (Jaffe et al. 2004a). In Jaffe et al. (2004b), the investigators expressed disappointment at not being able to find any post-translational modification, in spite of their ability to detect expression of 88% of all proteins with high residue coverage. This exemplifies the difficulty of studying post-translational events in large-scale studies. This study further develops the methods of proteogenomic annotation from Yates et al. (1995), Kuster et al. (2001), Oshiro et al. (2002), Jaffe et al. (2004a, b), Kalume et al. (2005), Wang et al. (2005), and Fermin et al. (2006) and provides the first comprehensive analysis of post-translational modifications in complete genomes.
Until very recently, such analysis was not feasible since the whole genome search for mutated and modified peptides was prohibitively time-consuming. The search becomes particularly time-consuming in the case of nontryptic peptides that enable identification of proteolytic events (see below). Moreover, our analysis revealed at least 25 modification types present in the sample, a significantly larger number than the existing database search tools can practically handle under realistic parameters. We capitalize on the recently developed database filtration tool Inspect (Tanner et al. 2005) and the blind database search tool MS-Alignment (Tsur et al. 2005) to overcome these difficulties. Here we use the Gram-negative bacterium Shewanella oneidensis MR-1 as a test case for our LC-MS/MS–based approach to proteome annotation. It is an aero-tolerant anaerobe able to reduce heavy metal ions and remove them from solution, making it a potential agent for bioremediation (Nealson et al. 2002). The genome of this strain was sequenced in 2002 (Heidelberg et al. 2002) with 4931 predicted protein-coding genes, of which only a few have been experimentally verified. The revised number of genes now stands at 4928 according to the TIGR Comprehensive Microbial Resource (Peterson et al. 2001). We will refer to these as TIGR genes in this article. The organism has been extensively studied by the Shewanella Federation (http://shewanella.org/). The difficulties in gene prediction are illustrated by the fact that just a year after S. oneidensis was first annotated, it was reannotated by Daraselia et al. (2003). Some recent studies have attempted to use LC-MS/MS and LC-MS to further improve the annotations of S. oneidensis (Romine et al. 2004; Elias et al. 2005, 2006). Recently, Kolker et al. (2005) used microarray and MS data to analyze the expression of predicted genes in S. oneidensis. 
We remark that even if bacterial gene predictions were 100% accurate, they would still provide incorrect protein sequences for ~50% of bacterial genes! It is estimated that a single post-translational modification (N-terminal methionine cleavage, or NME) alters roughly half of proteins in Escherichia coli. NME is the process of cleaving N-terminal methionine residue by methionyl amino peptidase (MAP) or amino peptidase P (AmpP) from a number of cytosolic proteins. NME has important implications for protein half-life (Tobias et al. 1991), and knowledge of NME is crucial for many applications in food safety, infection diagnostics, and counter-terrorism (it improves the quality of MS-based microorganism detection by an order of magnitude; Fenselau and Demirev 2001). The role of NME remains poorly understood, but the process is recognized to be the major source of N-terminal amino acid diversity. The recognition rules for NME remain elusive, resulting in a number of conflicting studies (Link et al. 1997; Wasinger and Humphery-Smith 1998; Fenselau and Demirev 2001; Frottin et al. 2006). Frottin et al. (2006) recently estimated that the existing ambiguities in NME recognition rules make reliable proteome annotation difficult for ~30% of bacterial proteins. This renders the production of recombinant proteins of therapeutic interest risky, given the high anti-genicity of the N terminus if incorrectly processed, the problem originally encountered in the production of human hemoglobin (Olson et al. 1981; Ben-Bassat et al. 1987). We therefore argue that MS studies are necessary to complement the existing gene prediction tools by accurate annotations of NME and many other post-translational modifications. In this work we exploit a data set containing 14.5 million tandem mass spectra for S. oneidensis. These data include samples from 17 cell culture conditions and comprise the largest LC-MS/MS data set ever reported for a bacterium. 
Using very conservative cutoffs for peptide identifications, we confirm the protein expression of 1992 out of 4928 predicted TIGR genes. We correct or redefine gene boundaries of 38 genes, eight of which were not included in the TIGR predictions, and provide evidence for expression of 13 genes previously annotated as pseudogenes. The peptides identified by MS/MS give important insights into post-translational processing, including proteolytic events. For example, cleavage of the initial methionine was observed with specificity similar to that observed in vitro with bacterial enzymes. Our analysis provides significantly refined annotations for secreted proteins by confirming signal peptide processing for 94 proteins while rejecting imprecise or incorrect predictions by existing software for 119 proteins. We further performed a blind search for modifications, resulting in the first comprehensive map of modifications for a bacterial proteome. We argue that complementing a sequencing project with an LC-MS/MS project critically improves both genome and proteome annotations. The data used in this study are available at http://peptide.ucsd.edu/ShewanellaOneidensis/ and http://ober-proteomics.pnl.gov/data, and the software tools, including Inspect, MS-Alignment and other scripts, are available at http://peptide.ucsd.edu/.

Results

Peptide identification

As described in the Methods section, we searched the spectra against the six-frame translation of the Shewanella genome. Some previous proteogenomic studies (Jaffe et al. 2004a, b) did not attempt to measure the rate of false peptide identification, thus raising doubts about the reliability of new gene annotations. To quantify the false discovery rate of peptide identifications, we searched all spectra against a reversed sequence database.
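The six-frame translation mentioned above can be illustrated with a short sketch. This is not the code used in the study; the function names are hypothetical, and the codon table is the standard one (bacterial translation table 11 assigns the same residues to codons and differs only in the permitted start codons). Stop codons are written as `*` and left in place, so splitting a frame on `*` would give the open stretches that spectra can be searched against.

```python
from itertools import product

# Standard codon-to-residue mapping, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGGCCTAA").items():
        print(label, protein)
```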
We selected a match score cutoff that limits the number of peptides annotated in the reverse database to 5% of the number of peptides identified in the valid database (1417 distinct peptide identifications in the reversed database, compared with 28,377 peptide identifications in the valid database). Peptide matches of length less than eight amino acids were discarded to minimize the number of false positive predictions. In total, 1.4 million spectra were annotated while searching 14.5 million spectra against the forward database. In contrast, only 14,523 spectra had scores above the threshold in the reversed database, giving a spectrum-level false discovery rate below 1% on the forward database search.

Peptide coverage

Some peptides are encoded at multiple locations within the six-frame translation. Therefore, the identified peptide list was filtered to 27,946 peptides that map to unique locations in the translated genome, removing this possible source of ambiguity. We compare our findings with genes predicted de novo by GeneMark (Besemer and Borodovsky 2005) and with the curated TIGR annotations (Peterson et al. 2001). We note that the fully automated gene predictions from GeneMark include only 4692 genes. Table 1 analyzes the locations of the identified peptides relative to the positions of the TIGR genes. Of the 331 peptides not covered by a TIGR gene, 126 were covered by GeneMark predictions. This demonstrates the utility of LC-MS/MS data in resolving discrepancies between various gene prediction tools. For each TIGR gene, we looked at all identified peptides within the gene and determined the coverage (fraction of protein residues covered by the identified peptides).

Figure 1
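The reversed-database cutoff described under Peptide identification can be sketched as follows. This is a minimal illustration and not the scoring procedure of Inspect; the function name, the input format (one best score per distinct peptide), and the choice of the lowest qualifying cutoff are assumptions made for the example.

```python
def choose_score_cutoff(target_scores, decoy_scores, max_ratio=0.05):
    """Pick the lowest cutoff at which decoy hits are at most max_ratio of target hits.

    target_scores: scores of peptides matched in the forward (valid) database.
    decoy_scores:  scores of peptides matched in the reversed database.
    Returns (cutoff, n_target, n_decoy), or None if no cutoff qualifies.
    Assumption: higher scores are better and hits with score >= cutoff are kept.
    """
    for cutoff in sorted(set(target_scores) | set(decoy_scores)):
        n_target = sum(s >= cutoff for s in target_scores)
        n_decoy = sum(s >= cutoff for s in decoy_scores)
        if n_target > 0 and n_decoy / n_target <= max_ratio:
            return cutoff, n_target, n_decoy
    return None


if __name__ == "__main__":
    # Synthetic scores, for illustration only.
    targets = list(range(1, 101))
    decoys = [2, 3, 5, 8, 30, 40]
    print(choose_score_cutoff(targets, decoys))
```

With the counts reported in the text, the peptide-level ratio is 1417 / 28,377, or about 0.0499, which matches the stated 5% limit.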
Figure 2
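Coverage, defined above as the fraction of protein residues covered by the identified peptides, can be computed with a simple residue mask. The sketch below assumes that each peptide has already been mapped to a 0-based, end-exclusive span on its protein; the function name and input format are hypothetical.

```python
def protein_coverage(protein_length, peptide_spans):
    """Return the fraction of residues covered by at least one peptide.

    peptide_spans: iterable of (start, end) positions on the protein,
    0-based with end exclusive; overlapping peptides are counted once.
    """
    if protein_length <= 0:
        return 0.0
    covered = [False] * protein_length
    for start, end in peptide_spans:
        for i in range(max(start, 0), min(end, protein_length)):
            covered[i] = True
    return sum(covered) / protein_length


if __name__ == "__main__":
    # Two overlapping peptides on a 10-residue protein cover residues 0-5.
    print(protein_coverage(10, [(0, 4), (2, 6)]))
```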
The distribution of the observed coverage of individual proteins in the expressed proteome by MS-detected peptides could be perceived as somewhat random, with most biases coming from technical factors such as the efficiency of extraction and the existence of poorly detectable peptides (Tang et al. 2006). On the other hand, one may expect this distribution to be substantially affected by the relative protein abundance reflecting cell physiology under given experimental conditions (Lu et al. 2006). The latter hypothesis, if confirmed, would open new opportunities for exploring cellular pathways and networks. To validate this hypothesis, we used several indirect tests, as proteome-scale data on protein abundances are not readily available. We examined whether there is any correlation between coverage and (1) conservation of orthologs in a range of sequenced microbial genomes, (2) essentiality of such orthologs in a model system of E. coli (Baba et al. 2006), and (3) functional categories inferred by gene annotations and pathway reconstruction in the public genomic resources TIGR (http://cmr.tigr.org/) and SEED (http://theseed.uchicago.edu/FIG/index.cgi). Whereas none of these three features should correlate with technological constraints, one may expect them to reflect relative protein abundances. The latter expectation is quite straightforward for functional categories or pathways and may be less obvious for conservation and essentiality. Conservation and essentiality are known to correlate with each other (for discussion, see Gerdes et al. 2003; Koonin 2003), implicating proteins that constitute a Core of Life. It is plausible that a substantial fraction of these proteins would be expressed at appreciable (and even high) levels in a variety of growth conditions. Particularly in the context of a global survey of proteome data acquired under different experimental conditions, conserved and essential proteins may prevail due to the universal nature of their expression.
For the purpose of this analysis, the S. oneidensis proteome was split into five arbitrary coverage groups. The first four groups (A–D) are of similar size (605, 612, 632, and 549 proteins, respectively), with protein coverage ranges of 50%–100%, 27%–49%, 11%–26%, and 1%–10% (for details, see Supplemental Table S1A). The last group, E, contained proteins not covered by any peptide and represented more than half of all predicted proteins. The results of this analysis are summarized in Figure 3.
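The grouping above amounts to a threshold lookup on coverage. The sketch below is only an illustration: the function name is hypothetical, and the handling of values that fall between the stated whole-percent ranges (for example 49.5%) is an assumption, here resolved by assigning them to the lower group.

```python
def coverage_group(coverage_fraction):
    """Map a coverage fraction (0.0-1.0) to the coverage groups A-E.

    Assumption: boundaries are applied to the unrounded fraction, so a value
    between two stated ranges falls into the lower-coverage group.
    """
    if coverage_fraction <= 0:
        return "E"  # not covered by any peptide
    if coverage_fraction >= 0.50:
        return "A"
    if coverage_fraction >= 0.27:
        return "B"
    if coverage_fraction >= 0.11:
        return "C"
    return "D"


if __name__ == "__main__":
    for value in (0.75, 0.30, 0.11, 0.05, 0.0):
        print(value, coverage_group(value))
```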
To roughly assess a functional content within the same coverage groups we used two types of functional categories: (1) assigned in a TIGR database (main categories), and (2) deduced from a categorized collection of annotated subsystems (pathways) provided by SEED genomic resource (Overbeek et al. 2005). While differing in details of classification and coverage (TIGR nonhypothetical categories cover a larger fraction of proteome, whereas SEED collection reflects a more detailed reconstruction of the major pathways), both graphs in Figure 3B Some of the individual functional categories (e.g., those related to metabolism of proteins [pale blue], amino acids [green], and nucleotides [brown]) display consistent downward trends when compared between all coverage groups in both graphs in Figure 3B. Overall, this preliminary analysis provides us with substantial evidence confirming the hypothesis that protein coverage by MS-detected peptides may provide a reasonable approximation of relative abundance of proteins at the whole proteome scale. Further studies aimed to validate this hypothesis and to apply this principle to assess differential expression of genes and pathways as a function of growth conditions are currently in progress. One should keep in mind that pathways may be regulated at various levels, and different genes may have different levels of correlation with the pathway expression. It must be noted that while sequence coverage correlates with protein abundance in general, the sequence coverage of any specific protein may also be significantly affected by its physicochemical properties. For example, we looked carefully at various transporters and found that usually the soluble components and outer membrane proteins are observed, while the integral membrane proteins are underrepresented. Developing models to take the effect of these physicochemical properties into consideration is expected to be a part of future studies in this direction. 
Improving genome annotation

Detection and mapping of multiple peptides to a single gene is evidence that it encodes a protein that is expressed. Similarly, multiple matches to a genomic region outside the boundary of genes can be used to detect new genes missed during genome annotation or to suggest that gene boundaries should be expanded. To detect such cases, we examined the identified peptides falling outside the TIGR genes. Such peptides are combined into putative coding segments if they are located within 200 nucleotides of each other and have compatible reading frames. These coding segments point to new genes and extensions of the TIGR genes. By analyzing the location of these segments, we identified eight new genes and extended the 5′ boundaries for 30 genes (see Supplemental Table S2A,B). Of the eight new genes, all but two had been previously suggested in a second version of the annotation of the MR-1 genome (Daraselia et al. 2003). The remaining two, which we have designated SO_4799 and SO_A0180, were similar to proteins found in other bacteria. Furthermore, our data provide the information required to validate that these genes, six of which would otherwise have been annotated as hypothetical, encode proteins. Based on comparative protein sequence alignment, extensions of the 5′ ends of all 30 genes were consistent with predicted N termini of proteins identified in other bacteria, including other species of Shewanella.

Figure 4
Figure 5
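The rule for combining peptides into putative coding segments (within 200 nucleotides of each other, with compatible reading frame) can be sketched as a single pass over sorted peptide hits. This is an illustration under stated assumptions, not the authors' script: hits are given as genomic coordinates covering whole codons, "compatible reading frame" is taken to mean same strand and same start position modulo 3, and intervening stop codons are not checked. The function name and tuple layout are hypothetical.

```python
from collections import defaultdict


def build_coding_segments(peptide_hits, max_gap=200):
    """Merge peptide hits into putative coding segments.

    peptide_hits: iterable of (strand, start, end) genomic coordinates,
    0-based with end exclusive, each spanning whole codons.
    Returns a sorted list of (strand, frame, seg_start, seg_end, n_peptides).
    """
    by_frame = defaultdict(list)
    for strand, start, end in peptide_hits:
        by_frame[(strand, start % 3)].append((start, end))

    segments = []
    for (strand, frame), spans in by_frame.items():
        spans.sort()
        seg_start, seg_end, count = spans[0][0], spans[0][1], 1
        for start, end in spans[1:]:
            if start - seg_end <= max_gap:
                # Close enough to the current segment: extend it.
                seg_end = max(seg_end, end)
                count += 1
            else:
                segments.append((strand, frame, seg_start, seg_end, count))
                seg_start, seg_end, count = start, end, 1
        segments.append((strand, frame, seg_start, seg_end, count))
    return sorted(segments)


if __name__ == "__main__":
    hits = [("+", 300, 330), ("+", 450, 480), ("+", 900, 930), ("-", 300, 330)]
    for segment in build_coding_segments(hits):
        print(segment)
```

Segments supported by several peptides would then be candidates for new genes or 5′ extensions, depending on where they fall relative to annotated genes.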
To our surprise, we also detected peptides that mapped to 13 genes originally annotated as pseudogenes (Supplemental Table S2C). SO_0991 encodes peptide chain release factor 2, PrfB. The orthologous gene from other bacteria has been shown to undergo programmed frameshifting (Baranov et al. 2002). The occurrence of the expected recoding signals in the MR-1 prfB gene, as described in the RECODE database (Baranov et al. 2003), suggests that the seven peptides that we detected after the frameshift position are a consequence of programmed frameshifting of SO_0991. Alignments of the deduced products for two genes (SO_4538 and SO_4809) suggest that their original annotation as pseudogenes was inaccurate, since numerous orthologs of similar size were found. Alignments of proteins deduced from the remaining 10 genes were in agreement with the original suggestion that they were encoded by pseudogenes. However, examination of the trace archives available at NCBI for the MR-1 genome revealed mistakes in base calling that, when repaired, yield genes with an open reading frame of a significant length. Representative examples of the latter two cases are illustrated in Figure 6.
Proteolytic sites

Proteolytic cleavage through cellular proteases is extremely important for many biological functions. While such cleavage is often specific and tightly regulated, protease activity in cells is relatively unexplored, primarily due to the lack of effective high-throughput technology to detect proteolytic events. Large-scale efforts are currently underway to address this technological bottleneck (e.g., at the NIH Center for Proteolytic Pathways at Burnham Institute for Medical Research). However, these approaches mainly rely on labeling protocols and thus face a number of chemical and computational challenges. We show that at least some proteolytic events can be reliably identified by label-free analysis, as illustrated below by our analysis of high-throughput MS/MS data.

Large MS/MS data sets offer an unprecedented opportunity to study in vivo cleavage specificity by looking at over-represented nontryptic peptides that may be manifestations of proteolytic events. Since all protein samples were digested with trypsin, we expect the majority of the peptide endpoints to correspond to tryptic cleavage sites (after arginine or lysine, but not before proline). Given trypsin’s high specificity (Olsen et al. 2004), it is natural to consider that nontryptic endpoints may reveal proteolytic events (Mann and Pandey 2001). We also consider the natural N and C termini of any protein as fully consistent with conventional tryptic cleavage. A peptide endpoint that is not tryptic according to either of the above definitions is considered “nontryptic.” Nontryptic endpoints suggest the possibility of a proteolytic event, either in vivo or in vitro. Our peptide identifications include 21,297 tryptic peptides (75%), 6670 peptides with one nontryptic endpoint (24%), and 409 peptides with two nontryptic endpoints (1%). However, caution is needed while analyzing peptides with nontryptic endpoints, since they can also reflect post-digestion trimming of tryptic peptides.
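The endpoint definitions above translate directly into a small classifier: a cleavage site is tryptic if it follows arginine or lysine and does not precede proline, or if it coincides with the natural N or C terminus of the protein. The sketch below assumes the peptide has been located on its protein as a 0-based, end-exclusive span; the function names are hypothetical.

```python
def is_tryptic_site(protein, pos):
    """True if a cut between protein[pos - 1] and protein[pos] counts as tryptic.

    The natural protein termini (pos == 0 or pos == len(protein)) are treated
    as tryptic, following the definition in the text.
    """
    if pos == 0 or pos == len(protein):
        return True
    return protein[pos - 1] in "KR" and protein[pos] != "P"


def count_nontryptic_endpoints(protein, start, end):
    """Return 0, 1, or 2 nontryptic endpoints for the peptide protein[start:end]."""
    return int(not is_tryptic_site(protein, start)) + int(
        not is_tryptic_site(protein, end)
    )


if __name__ == "__main__":
    protein = "MKTAYRPLGKDE"  # made-up sequence for illustration
    for start, end in [(2, 10), (6, 10), (3, 8)]:
        print(protein[start:end], count_nontryptic_endpoints(protein, start, end))
```

Under this definition, a peptide that begins at the second residue of a protein after removal of the initial methionine has a nontryptic N terminus unless the first residue is K or R, which is what makes such events visible in the analysis that follows.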
Figure 7
We note that 96% of nontryptic peptides fall within confirmed TIGR proteins, similar to tryptic peptides (97%). Since confirmed TIGR proteins make up only 7% of the whole genome database size (translated in six frames), incorrect identifications (which are randomly distributed) are unlikely to fall within confirmed proteins. Thus, we argue that our false discovery rate for nontryptic peptides is not significantly larger than that for tryptic peptides. In this section, we consider only those peptides contained within confirmed proteins.

Detecting proteolytic events via MS/MS analysis

Nontryptic peptides may arise from post-digestion breakup, due to hydrolysis (driven by endogenous or exogenous peptidases or by harsh chemical conditions in the course of the sample preparation) or in-source decay (Olsen et al. 2004). Of the 7079 nontryptic peptides, 5474 (77%) are properly contained in a longer observed tryptic peptide, and 1605 (23%) are not. It is likely that the majority of nontryptic peptides contained within other observed peptides result from post-digestion breakup, particularly when the longer peptide is more abundant (as estimated by spectrum count), although some of them may result from partial proteolytic processing in vivo. To examine this in further detail, we measure the distance from these nontryptic termini to the tryptic endpoint of the containing peptide (Fig. 8).
Surprisingly, the distance from a nontryptic C terminus to the containing peptide’s tryptic endpoint is often precisely two amino acids (31% of peptides). Moreover, the plot has peaks at even distances (two, four, and six amino acids), suggesting an increased propensity for deleting two amino acids at a time. This observation may be explained by the action of a peptidyl dipeptidase (such as dcp in E. coli) (Yaron 1976), which cleaves off dipeptides from the C termini of oligopeptides. S. oneidensis indeed has two dcp orthologs: SO_3142, dcp-1 (46% identity with E. coli dcp) and SO_3564, dcp-2 (47% identity with E. coli dcp). In addition to carboxypeptidase (and dipeptidase), the observed trimming patterns point to a likely presence of aminopeptidase activity (Fig. 7).

We now focus on the 1605 nontryptic peptides that are not contained within tryptic peptides. We further reduce this set to 1372 peptides that are not contained within any other peptide and that are located within confirmed TIGR genes (such peptides are called “noncovered peptides”). While the 688 proteins containing noncovered peptides may be potential proteolytic targets, one can argue that these peptides may also represent (1) erroneous peptide identifications or (2) instances where the containing tryptic peptide does not generate any observed MS/MS spectra, perhaps due to extreme length (Tang et al. 2006). In the following section, we demonstrate that while this is a valid concern, many noncovered peptides indeed correspond to proteolytic events. To show that this is the case, we point to the extremely nonuniform distribution of starting positions of these peptides along the protein (Fig. 9).
To focus on these two phenomena, N-terminal methionine and signal peptide cleavages, we limit our attention to proteins in which the leftmost identified peptide is a noncovered peptide, i.e., proteins with no peptide coverage upstream of the first noncovered peptide. A total of 366 N-terminal peptide endpoints are obtained accordingly.

N-terminal methionine cleavage

The N-terminal methionine residue is cleaved by MAP or AmpP from a number of cytosolic proteins. Methionine, which is important during translation, may not be required (or may actually be detrimental) for the function of the protein. An earlier study (Hirel et al. 1989) measured the efficiency of cleavage between the initial methionine and various second residues in vitro. This study genetically engineered and expressed 20 mutants of a gene in E. coli differing only at the second position. The expressed proteins were purified and subjected to MAP activity. Using Edman degradation to analyze the N-terminal sequences of the products, they showed that methionine cleavage is more efficient if the second residue has a smaller side chain.

In our analysis of noncovered peptides, we observed many peptides starting at the second residue of a protein. These peptides confirm cleavage of N-terminal methionine in 218 proteins (Supplemental Table S3A contains the list). To check whether the effect of the second residue on cleavage (Hirel et al. 1989) is also seen in our data, we computed a cleavage efficiency factor for each of the 20 amino acids. For a given amino acid, we identify all peptides that contain the second residues of proteins having the particular amino acid at that position. If X is the number of such peptides that begin at residue 1 of a protein (indicating no cleavage) and Y is the number of peptides beginning at residue 2 (indicating a cleavage), the cleavage efficiency for that amino acid is defined as Y/(X + Y).
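The cleavage efficiency Y/(X + Y) defined above can be tabulated per second residue with two counters. The sketch below is a minimal illustration; the function name and the input format (one record per peptide that contains the second residue of a protein) are assumptions, and the example records are synthetic.

```python
from collections import Counter


def cleavage_efficiency(observations):
    """Compute Y / (X + Y) for each amino acid found at protein position 2.

    observations: iterable of (second_residue, peptide_start), where
    peptide_start is the 1-based protein residue at which the peptide begins:
    1 means the initial methionine is retained (X), 2 means it was cleaved (Y).
    """
    uncleaved = Counter()  # X per second residue
    cleaved = Counter()    # Y per second residue
    for residue, peptide_start in observations:
        if peptide_start == 1:
            uncleaved[residue] += 1
        elif peptide_start == 2:
            cleaved[residue] += 1
    return {
        aa: cleaved[aa] / (cleaved[aa] + uncleaved[aa])
        for aa in set(cleaved) | set(uncleaved)
    }


if __name__ == "__main__":
    # Synthetic records for illustration only.
    records = [("A", 2), ("A", 2), ("A", 2), ("A", 1), ("L", 1), ("L", 1)]
    print(cleavage_efficiency(records))
```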
The relative levels of these efficiencies of different amino acids are similar to the results observed in vitro for E. coli (Fig. 10).
We observed several cases of an apparent cleavage before the second methionine in proteins starting with a double methionine. A comparative genomics analysis of other Shewanella strains revealed that a large portion of them (e.g., SO_4343 and SO_2364) have orthologous proteins with a single methionine, rather than a double methionine. Therefore, we speculate that many proteins starting with a double methionine in TIGR may represent a misannotation of the translation start site. We note that aside from the noncovered peptides, we observed 32 nontryptic peptides starting at residue 2 where a peptide containing the N-terminal methionine was also observed. These peptides may be explained by partial processing by MAP activity or, more likely, by N-terminal trimming in the course of sample preparation.

Signal peptides

A signal peptide is a short N-terminal region of a protein that targets a protein for secretion or for transportation to a desired cellular location. Signal peptides are cleaved and quickly degraded to produce the mature protein sequence. The average length of a signal targeting proteins to the Sec pathway in Gram-negative bacteria is estimated to be 25 amino acids, with most signal peptides in the range of 20–30 amino acids (Nielsen et al. 1997; Paetzel et al. 2002). The distribution of starting positions of noncovered peptides has a pronounced peak at 20 amino acids, with a longer tail on the right side (greater lengths), as shown in Figure 9. While knowledge of signal peptides is important for understanding protein function, they are difficult to confirm experimentally, and computational predictions are used to fill the gap. SignalP (Bendtsen et al. 2004) is a popular signal peptide prediction program that uses neural network (NN) and hidden Markov model (HMM) models of known signal peptides. Another such program, PrediSi (Hiller et al. 2004), uses a position weight matrix (PWM) to predict signal peptides.
It is important to note that the bulk of protein secretion in Shewanella, as well as in other Gram-negative bacteria, occurs to the periplasmic space; therefore, the corresponding processed proteins can be experimentally observed in the whole-cell extract. There have been some concerns regarding the quality of signal peptide predictions (Antelmann 2001), since these methods consider a generalized signal motif for all proteins and may not identify interesting cases that are limited to a few proteins. Also, these tools make predictions based on a rather small signal peptide database, since experimental data about signal peptides are limited. For example, SignalP v3 made predictions for Shewanella based on a data set of only 334 experimentally confirmed signal peptides in all Gram-negative bacteria. The number of experimentally confirmed signal peptides in Gram-positive bacteria is half as large (Bendtsen et al. 2004). It is clear that LC-MS/MS evidence can greatly increase the number of experimentally confirmed signal peptides and the confidence of signal peptide predictions.

We analyzed our peptide annotations in order to confirm or refute signal predictions, and possibly to discover new signal cleavage sites. We examined peptides from the confirmed proteins with nontryptic N termini. From this set, we selected peptides with no upstream coverage (see Methods). The list of 117 predictions is provided in Supplemental Table S3B. A clear sequence motif (Crooks et al. 2004) emerges when we examine the sequence immediately upstream of these 117 putative signal peptides predicted by MS/MS analysis (Fig. 11).
SignalP and PrediSi predict 370 and 403 proteins with signal peptides, respectively. However, there is a substantial discrepancy between these tools: only 211 signals are predicted by both. LC-MS/MS evidence makes it possible to resolve the discrepancies between SignalP and PrediSi as well as to identify signal peptides missed by both tools.

Figure 12A
On 119 of the confirmed proteins, the MS/MS results include peptides upstream of the cleavage site predicted by SignalP/PrediSi and thus represent evidence against SignalP/PrediSi predictions (Supplemental Table S3C). We call these the refuted sites. We refute 89 sites predicted by SignalP, and 38 sites predicted by PrediSi (with eight refuted sites predicted by both tools; Fig. 12B).

If the predicted start site of a gene in the TIGR annotation falls before (toward 5′) the actual start site (i.e., the predicted TIGR protein is longer than the actual protein at the N terminus), a peptide covering the N terminus of this misannotated protein may also appear as a nontryptic peptide with no coverage in the upstream region. Thus it might be falsely predicted as a signal peptide, and it is likely that some of the cases where our predictions do not match SignalP or PrediSi belong to this category. Similarly, cases where the N-terminal-most peptide is nontryptic and within 15 residues of the start position (too short for a signal peptide) might represent misannotated translational start sites. To investigate this further, we looked at the codon usage at the site of the first observed residue (position 0) or the one just before it (position −1) to account for N-terminal methionine cleavage. If some of these cases are indeed late translational start sites, we would expect a higher frequency of start codons at these positions. We analyzed 36 proteins in all where the N-terminal-most peptide was nontryptic, mapped to the protein at a distance of two to 30 amino acids from the annotated start position, and did not conform to SignalP or PrediSi signal predictions. For example, while ATG is expected to appear 36/61 = 0.59 times at position −1, we observe it five times, an increase in frequency of close to an order of magnitude (about 8.5-fold). Further comparative genomics analysis of these five proteins demonstrates that at least four of them are likely to be misannotated.
All other codons appear zero, one, or two times in the sample, and conspicuously, the relatively rare start codons GTG, TTG, and ATT are also over-represented (each appears two times in position −1). While the size of the sample is too small to claim that this threefold over-representation reveals the new start codons, the comparative genomics analysis implies that it is likely to be the case. Figure 13
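The expected counts and fold changes quoted for position −1 can be reproduced with a few lines of arithmetic. The sketch below assumes, as the 36/61 figure implies, that all 61 sense codons are equally likely at that position. The binomial tail probability is an added illustration under that same simplifying assumption and is not a statistic reported in the study; all function names are hypothetical.

```python
from math import comb

SENSE_CODONS = 61  # 64 codons minus the three stop codons


def expected_count(n_proteins: int, n_codons: int = SENSE_CODONS) -> float:
    """Expected occurrences of one codon at a fixed position under uniform usage."""
    return n_proteins / n_codons


def fold_enrichment(observed: int, n_proteins: int) -> float:
    """Observed count divided by the uniform-usage expectation."""
    return observed / expected_count(n_proteins)


def binomial_tail(observed: int, n: int, p: float) -> float:
    """P(X >= observed) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(observed, n + 1))


n = 36  # proteins analyzed
print(f"expected per codon: {expected_count(n):.2f}")
print(f"ATG (observed 5):   {fold_enrichment(5, n):.1f}-fold")
print(f"GTG/TTG/ATT (2):    {fold_enrichment(2, n):.1f}-fold")
print(f"P(>=5 ATG by chance, uniform model): {binomial_tail(5, n, 1 / SENSE_CODONS):.1e}")
```

Real codon usage is not uniform, so these values are only a rough guide; a background frequency measured from the genome could be substituted for `1 / SENSE_CODONS`.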
Chemical modifications

The current understanding of post-translational chemical modifications (PTMs) in bacteria is very limited even for well-studied organisms like E. coli, let alone for Shewanella. Any information that could be obtained about PTMs from large-scale MS/MS studies will prove to be very important toward gaining an understanding of the molecular biology of bacterial genomes. Note that while proteolytic events are also included in the term post-translational modifications, we will use the acronym PTM to specifically refer to in vivo chemical modifications of specific residues.

We analyzed the mass spectra using MS-Alignment (Tsur et al. 2005), which allows for the discovery and statistical validation of unanticipated modifications (without distinguishing between PTMs and in vitro chemical adducts). We considered all spectral annotations with one modification permitted for all possible mass-shifts of up to 250 Da. Since MS/MS annotations of modified peptides typically have high error rates, we applied a careful scoring procedure to highlight the valid modifications (see Methods). A total of 10,758 modification sites were seen with a false discovery rate of 5% (Supplemental Table S4).

Some modification types were observed on many different sites. Table 2 presents 24 common modification types, each observed on five or more distinct sites, with their likely chemical explanations. Since the false positive rate is low, it is extremely unlikely that any of these modification types represent a computational artifact. Moreover, all but two are known modification types, further reinforcing the conclusion that they are not artifacts. We remark that the number of such modification types is rather large, significantly larger than the usual limit imposed by the popular restrictive PTM search tools like Mascot, SEQUEST, and X!Tandem.
Many of these modifications appear to represent chemical events with low site specificity, which can result from chemical damage in vitro (Hunyadi-Gulyas and Medzihradszky 2004). Wherever possible, we cross-reference these modifications with known modifications from the UNIMOD (Creasy and Cottrell 2004) or RESID (Garavelli 2004) databases.
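Cross-referencing an observed mass shift against known modifications amounts to a tolerance lookup. The sketch below is a minimal illustration, not the procedure used in this study: the small table holds standard monoisotopic mass differences for a handful of well-known modifications (it is not Table 2 and is not read from UNIMOD or RESID), and the 0.5-Da tolerance and the name `explain_shift` are assumptions.

```python
# Standard monoisotopic mass differences (Da) for a few well-known modifications.
# Illustrative subset only; a real analysis would load UNIMOD or RESID entries.
KNOWN_SHIFTS = {
    "methylation": 14.01565,
    "oxidation/hydroxylation": 15.99491,
    "formylation": 27.99491,
    "dimethylation": 28.03130,
    "acetylation": 42.01057,
    "phosphorylation": 79.96633,
}


def explain_shift(observed: float, tol: float = 0.5) -> list:
    """Return the names of known modifications within `tol` Da of `observed`."""
    return sorted(
        name for name, mass in KNOWN_SHIFTS.items() if abs(mass - observed) <= tol
    )


for shift in (14, 16, 28, 46):
    print(shift, explain_shift(shift) or "unexplained by this table")
```

At roughly integer mass resolution a shift of 28 Da matches both formylation and dimethylation, so the modified residue and its position are needed to choose between them; a shift of 46 Da finds no match in this small table.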
After filtering out the modifications that can be explained by these common modifications, we retain 4037 modification sites, corresponding to 390 distinct modification mass-shifts in 1673 proteins. While these numbers appear surprisingly large, a similar diversity of PTMs has also been found in another recent study (Nielsen et al. 2006). There are no methods presently available to identify in vivo modifications from the list of all modifications, and one has to rely on comparisons with previous literature and databases. Below, we highlight several modifications of particular biological interest.

We anticipate that many biologically important PTMs in Shewanella and E. coli will be located on aligned positions in orthologous proteins. We highlight several modifications (Table 3) that are similar to those previously reported on orthologous positions in E. coli. For instance, Kowalak and Walsh (1996) reported the occurrence of β-methylthio-aspartic acid in E. coli, which is a modification of mass 46 at D88 of ribosomal protein S12p (Swiss-Prot ID: RS12_ECOLI). This is a well-conserved protein, and its ortholog in S. oneidensis, SO_0226, has an almost identical amino acid sequence. This modification may be important for stabilizing the ribosome structure. We observed the same modification at D89 of the S. oneidensis protein, the position homologous to D88 of the E. coli protein, as shown below (the modified aspartate residues are shown in bold):
Sometimes, the modifications we find in Shewanella differ somewhat from what has been previously reported for E. coli. For example, we observed both methylation and dimethylation at residue K83 of RplL. Single methylation of this protein was previously observed (Chang 1981) and localized to the orthologous residue in E. coli (Arnold and Reilly 1999). We observed an apparent hydroxylation (net mass shift 16 Da) at residue 54 in translation elongation factor Tu (TufB). Methylation of a nearby lysine has been reported in E. coli (L’Italien and Laursen 1979). It is possible that the hydroxylation plays a similar regulatory or structural role. Interestingly, formylation was observed on only one N-terminal methionine residue, although translation in bacteria begins with a formylmethionine residue. This indicates that peptide deformylases (such as Def-1, Def-2, and Def-3) operate with high efficiency.

Some modifications correspond to amino acid substitutions, either due to polymorphisms or due to errors in the genomic sequence. We validate such cases by considering the sequences of other Shewanella strains. For example, a modified peptide K.QQIG+14ENPIIVYMK.G was identified from glutaredoxin domain protein (SO_2880). This 14-Da offset is readily explained by a glycine-to-alanine substitution at position 12 of the protein. Indeed, analysis of the raw sequence trace TI|202899805 from S. oneidensis available in the NCBI trace archives revealed a mistake in the genome sequence at the corresponding locus, resulting in a GGT (Gly) rather than the correct GCT (Ala). Similarly, we find that Ser177 (AGU) in SO_1637 should be corrected to Gly (GGU), and Cys33 (TGT) in SO_2880 should be changed to Ser (TCT).

Labeled images of example spectra for modifications reported in Table 2 and Table 3 are presented in Supplemental Data S5. The names of these images represent the modification; for example, D46-SO_0226.png represents the +46 modification on D in SO_0226.
The corresponding spectra are also provided in DTA format (named like D46-SO_0226.dta) and contain the mass and charge of the precursor and the list of fragment ions (precursor masses are listed in the spreadsheet precursorMass_modification Examples.xls included in Supplemental Data S5). While we cannot validate all the predicted modifications at this time due to the lack of available experimental data about PTMs for Shewanella, future research on this organism may confirm many of these putative modifications and begin to uncover their biological function.

Discussion

With recent improvements in sequencing technology, the number of sequenced bacterial genomes has been rapidly increasing (Benson et al. 2006). There have been significant improvements in algorithms for analyzing these sequences in the last two decades, especially in the area of gene prediction. However, there is a limit to what we can learn about the biology of an organism just from its DNA sequence. For example, it is very difficult to predict the post-translational modifications from protein sequence. What is needed, besides the primary sequence, are experimental data about the expressed proteins, and MS/MS has emerged as the preferred high-throughput technology in this field.

MS has been extensively used for studying individual proteins; however, its application to whole genomes became feasible only recently with the arrival of efficient database search tools. Some groups have been successful in using MS/MS for improving gene predictions in newly sequenced organisms (Jaffe et al. 2004a, b; Kalume et al. 2005; Wang et al. 2005), but there are no previous proteomic studies to obtain information about post-translational processing (proteolysis, chemical modification) at whole proteome scale. Studies like those by Jaffe et al. (2004a, b) that attempted to find PTMs at genome level had little or no success.
In this study, for the first time, we have provided a whole-bacterial proteome map of post-translational modifications for S. oneidensis MR-1. With effective control on the false discovery rate, we exploit nontryptic peptides to detect proteolytic events. In particular, we confirm 94 signal peptide predictions from SignalP or PrediSi and provide evidence for refuting 119 of their predictions. We also detect N-terminal methionine cleavage in 218 proteins. Using the recently developed MS-Alignment algorithm (Tsur et al. 2005), we find a large number of PTMs in the MS/MS data set, some of which represent in vivo modifications. We also improved the genome annotation with 30 N-terminal corrections, eight new genes, and validation of 13 expressed genes that were misannotated as pseudogenes.

MS takes a snapshot of the expressed proteins in a cell under specific conditions. Samples were collected across multiple experimental conditions in order to sample the complete proteome. We reliably identified >40% of all predicted genes (4928) as expressed proteins. This coverage is lower than the 81% and 88% coverage reported in some Mycoplasma strains (Jaffe et al. 2004a, b). One reason for this difference in coverage, besides our more stringent control over peptide and protein selection, is the fact that Mycoplasma is a simpler organism with no transcriptional control (Jaffe et al. 2004a). On the other hand, Shewanella, with a genome seven times larger than Mycoplasma, does have an expression control mechanism, and some proteins may be expressed only under very specific conditions that might not have been captured by our experiments. DNA microarrays are a complementary technology that might be helpful in verifying expressed genes, and may be used in conjunction with MS (Kolker et al. 2005). However, our primary focus in this study is post-translational modifications that cannot be observed at the RNA level.
In a large-scale computational study of this kind, it is critical to keep the false positive rates under control when making predictions about gene corrections or post-translational modifications. We have used rigorous statistical measures to quantify the error rates, keep them below 5%, and obtain reliable predictions. At the same time, experimental validation of these results by complementary approaches will be of great importance, and we anticipate that the results presented in this article will encourage such experiments by other research groups in the field.

Application of MS for whole genome studies is a relatively unexplored territory, especially in the area of post-translational modifications. A number of interesting challenges remain to be addressed. For example, we have demonstrated an approach for confidently detecting signal peptides and N-terminal methionine cleavages in the proteome with the assumption that the N-terminal fragment is degraded after proteolysis. However, a solution to the general proteolysis detection problem where both fractions of the protein may remain functional is not yet available, and we are only beginning to make attempts toward solving it. Similarly, even though we use the current state-of-the-art methods for detection of chemical modifications, it remains a challenge to distinguish between in vivo and in vitro modifications. We are considering strategies of improving the scoring and post-processing of MS-Alignment results to address this challenge. It has been shown in this and other proteogenomic studies how one can extend N-terminal boundaries of genes by observing upstream peptides. However, there is no good solution yet to make a case for shortening the N terminus of a gene.

Modifying the experimental set-up is a possible approach to these and many other unsolved problems in this field, but it is likely that some problems can be solved by developing novel algorithms for interpreting the same data sets.
For example, in this study we were able to detect many signal peptides and N-terminal methionine cleavages from MS/MS data without any labeling or other special treatment of the samples. It remains to be seen if we can further extend these computational methods to address other challenges as mentioned above, and apply them successfully to other organisms. Colloquially speaking, we are only beginning to scratch the surface of what can be learnt from whole genome MS/MS data sets.

Methods

MS/MS data

The majority of the 14.5 million spectra in our data set came from ion-trap MS/MS experiments, as described in further detail in Supplemental Data S6. Some (~2 million) came from FT-ICR instruments. In this study, we treat FT data the same as ion-trap data, with the exception of PTM analysis. The whole data set involves a large number of experiments under different conditions including aerobic (mid log), aerobic (steady state), suboxic, and anaerobic conditions with different additives. This data set is expected to represent most of the proteins expressed in Shewanella under these different conditions.

Whole genome search

The S. oneidensis genome consists of a circular chromosome (4,969,803 bp) and a plasmid (161,613 bp). We searched the MS/MS data set against a six-frame translation of the entire genome. The size of the translated genome (10,262,824 amino acids) was almost seven times the total size of the TIGR genes (1,432,446 amino acids). Sequences of common contaminants, including porcine trypsin and human keratins, were also added to the query database. A database of reversed sequences was created and searched as a negative control. Translation was made according to standard codon usage. Two exceptions to this are the TGA and TAG stop codons, which sometimes code for the rare amino acids selenocysteine and pyrrolysine, respectively (Stadtman 1996; Hao et al. 2002).
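A six-frame translation with read-through of TGA and TAG can be sketched as follows. This is a minimal illustration and not the code used to build the search database: it assumes the standard genetic code, emits the one-letter symbols U (selenocysteine) for TGA and O (pyrrolysine) for TAG, keeps TAA as the only stop (`*`), and treats the sequence as linear, ignoring wrap-around on the circular chromosome. All function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Read-through assumption: only TAA remains a true stop codon.
CODON_TABLE["TGA"] = "U"  # selenocysteine
CODON_TABLE["TAG"] = "O"  # pyrrolysine

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna: str) -> list:
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement(dna))
        for offset in range(3)
    ]


frames = six_frame("ATGTGATAGTAA")
print(frames)
```

Splitting each frame on `*` would then give the candidate open reading frames to search; a peptide identification spanning a U or O would flag a possible read-through locus.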
We translated all TGA and TAG codons into these amino acids, while keeping the third stop codon, TAA, as the true stop codon. It is assumed that most TAG and TGA codons are authentic stop codons, and so will not be covered by any identified peptide. The advantage of this approach is that it can allow discovery of loci where read-through of TAG and TGA codons occurs (like in methylamine methyltransferase genes of Methanosarcina barkeri; Hao et al. 2002), without predicting those sites in advance. However, we did not observe such read-through codons in Shewanella.

The restrictive database search was done using Inspect (Tanner et al. 2005), with two fixed modifications, M+16 and N-terminal Q-17, that are common chemical artifacts (see below for a description of the unrestricted PTM search). For each spectrum, the peptide with the top hit (lowest P-value) was selected. It took 2 d to search all spectra against this database on a 70-node grid.

Non-tryptic peptides

We considered all non-tryptic peptide annotations from confirmed proteins. Often such a peptide is properly contained in a longer tryptic peptide observed in the sample. These “covered peptides” are assumed to arise primarily from post-digestion breakage of tryptic peptides and not from cleavage in vivo. We focused our attention on noncovered peptides with no upstream peptide identifications. A total of 457 peptides (corresponding to 366 distinct N-terminal endpoints) satisfy these criteria. These peptides exhibit a strong sequence motif that closely matches the known motif characteristic for signal peptides in Gram-negative bacteria. We note also that a similar filter applied to C-terminal endpoints selects only 97 endpoints. These C-terminal endpoints do not correspond to a strong sequence motif, and have no clear position bias.

Signal peptides

We ran both SignalP and PrediSi against the TIGR genes to predict signal peptides, retaining all predictions made by these tools with scores ≥0.5.
We analyzed all peptides from confirmed proteins that have a non-tryptic N terminus, and are not contained in an observed tryptic peptide. We discarded those that begin at position 2 (these correspond to N-terminal methionine cleavage, which is handled separately). We also restricted our attention to peptides with no upstream coverage at all. This provides a list of 164 predictions (94 of which match predictions made by SignalP and/or PrediSi). However, since no predicted signal peptides were confirmed outside the protein residue range 17–55, we further filtered the list to only the 117 predictions within this residue range. Some nontryptic peptides outside this range may also be generated by proteolytic cleavage. For example, a cut before residue 9 of SO_4743 may be a valid (but unusually short) signal peptide, as its upstream sequence matches the consensus AXA motif.

Signal peptides are rapidly degraded after cleavage. Therefore, if a peptide was identified upstream of a predicted signal cleavage site, then we consider the SignalP/PrediSi prediction to be refuted.

PTMs

Recently, we developed the MS-Alignment unrestrictive database search algorithm for finding unanticipated modifications (Tsur et al. 2005). We applied MS-Alignment to our data set, allowing for an arbitrary single modification with mass shift up to 250 Da per peptide. We constructed a database consisting of all confirmed Shewanella proteins found by the initial search, together with a shuffled sequence corresponding to each protein in this database. In a search of this database with MS-Alignment, spurious modifications are expected to be distributed randomly throughout the database, including the shuffled sequences. After selecting a score cutoff, we obtained an empirical false discovery rate by simply counting the number of peptides from spurious proteins. A score cutoff providing a spectrum-level false discovery rate of 5% was used.
Modification sites were selected from these results using the Inspect analysis scripts (Tanner et al. 2006). The search required ~1 mo on a 64-processor cluster. Modifications with small mass shift (e.g., mass shift 1) may result from low parent mass accuracy or the presence of isotopic peaks; therefore, only modifications with mass shift 3 Da or more were considered in subsequent analysis.

Although the spectrum-level false discovery rate is 5%, the error rate at the level of modification sites may be higher. This phenomenon is similar to the increase in error rate that occurs as one moves from spectrum-level to peptide-level or protein-level analysis. Therefore, we scored and ranked the modification sites, and computed the false discovery rate at the level of modification sites. For each modification site, we generated a consensus spectrum by averaging together the peaks from all spectra carrying the modification. This consensus spectrum eliminates much of the noise from individual spectra, and provides a more accurate measurement of modification masses. We then scored modification sites by computing several features and then computing a weighted sum of these features. Various features were initially considered, and a collection of features was selected by iteratively selecting the feature providing the best marginal improvement in accuracy. The features used in computing the score are as follows:
After sorting modification sites by these scores, we obtained 10,758 modification sites with a false discovery rate of 5%. Supplemental Table 4 summarizes these modifications. For each row of this spreadsheet, a representative spectrum (DTA format) and the corresponding image with b and y ions labeled are available at http://peptide.ucsd.edu/ShewanellaOneidensis/ (the files are named based on their row number in Supplemental Table 4, and precursor masses are in the file precursorMass_allPTMSpectra.xls).

These modified annotations include delta-correct annotations (see Tsur et al. 2005) where the modification is localized to an incorrect (typically adjacent) residue or has the wrong mass (due to limited parent mass accuracy). We use an automated analysis procedure to suggest alternatives for each modification. It considers edits, such as shifting the modification to the adjacent residue, which have minor effects on the theoretical fragmentation pattern. If these edits generate a candidate peptide that contains only common modifications (such as oxidized methionine) and whose score is comparable to that of the initial annotation, then we consider the modification to be satisfactorily explained. For example, the putative annotation “S.F+115SVEAPKT.K” from rplD was replaced with “E.S+28FSVEAPKT.K,” since the latter annotation invokes a common chemical modification (formylation of the N terminus) and explains the spectrum just as well. Modification types were added to the collection of “common” modifications after successful manual validation of five or more sites carrying the same modification mass. After these automated procedures were run, modification sites were curated, and a putative annotation was assigned to each site.

Acknowledgments

We thank Mark Borodovsky and Gary Olsen for useful comments and help with gene annotations. S.T. is supported by NSF IGERT training grant DGE0504645.
This research was supported in part by the UCSD FWGrid Project, NSF Research Infrastructure grant number EIA-0303622. Part of this investigation was supported using the computing facility made possible by the Research Facilities Improvement Program grant no. C06 RR017588 awarded to the Whitaker Biomedical Engineering Institute, and the Biomedical Technology Resource Centers Program grant no. P41 RR08605 awarded to the National Biomedical Computation Resource, UCSD, from the National Center for Research Resources, National Institutes of Health. This project was supported by U.S. National Institutes of Health grant NIGMS 1-R01-RR16522. Shewanella genome sequences were kindly provided by the Joint Genome Institute. Part of this research at Pacific Northwest National Laboratory was supported by the Genomics:GtL Program, Office of Biological and Environmental Research, U.S. Department of Energy. Pacific Northwest National Laboratory is operated for the DOE by Battelle Memorial Institute under Contract DE-AC06-76RLO 1830.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6427907

References