Nucleic Acids Res. Jan 1, 2006; 34(Database issue): D622–D627.
Published online Dec 28, 2005. doi:  10.1093/nar/gkj083
PMCID: PMC1347446

dbPTM: an information repository of protein post-translational modification

Abstract

dbPTM is a database that compiles information on protein post-translational modifications (PTMs), such as the catalytic sites, solvent accessibility of amino acid residues, protein secondary and tertiary structures, protein domains and protein variations. The database includes all of the experimentally validated PTM sites from Swiss-Prot, Phospho.ELM and O-GLYCBASE. Only a small fraction of Swiss-Prot proteins are annotated with experimentally verified PTMs. Although Swiss-Prot provides rich information about PTMs, other structural properties and functional information of proteins are also essential for elucidating protein mechanisms. dbPTM systematically identifies sites of three major types of protein PTM (phosphorylation, glycosylation and sulfation) in Swiss-Prot proteins by refining our previously developed prediction tool, KinasePhos (http://kinasephos.mbc.nctu.edu.tw/). Solvent accessibility and secondary structure of residues are also computationally predicted and mapped to the PTM sites. The resource is now freely available at http://dbPTM.mbc.nctu.edu.tw/.

INTRODUCTION

Protein post-translational modification (PTM) is an extremely important cellular control mechanism because it may alter proteins' physical and chemical properties, folding, conformation, distribution, stability, activity and consequently, their functions (1). Examples of the biological effects of protein modifications include phosphorylation for signal transduction, attachment of fatty acids for membrane anchoring and association, and glycosylation for changing protein half-life, targeting substrates, and promoting cell-cell and cell-matrix interactions. With the accelerating progress in proteomics, biological knowledgebases containing a wealth of information, in particular protein modifications, are playing crucial roles in cell regulation research (2).

The Swiss-Prot knowledge base (3) includes as much modification information as is available with consistency and structure, allowing easy retrieval by biologists. Phospho.ELM (1), which was developed as part of the ELM (Eukaryotic Linear Motif) resource, is a new resource containing experimentally verified phosphorylation sites that were manually curated from the literature. O-GLYCBASE (4) is a database of glycoproteins, most of which include experimentally verified O-linked glycosylation sites. The RESID protein modification database is a comprehensive collection of annotations and structures for protein modifications and cross-links including pre-, co- and post-translational modifications (5). The RESID database provides modification information, literature citations, Gene Ontology (GO) cross-references, protein sequence database feature table annotations, structure diagrams and molecular models. Each RESID entry presents a protein with a chemically unique modification and indicates how the modification is currently annotated in the Swiss-Prot (6).

In this study, we collect the known PTM information from external biological data sources. Since only a small fraction of Swiss-Prot proteins are annotated with experimentally verified PTMs, we also developed computational tools to comprehensively identify phosphorylation, glycosylation and sulfation sites in the Swiss-Prot proteins. Protein structural properties and functional information, such as the solvent accessibility of residues, protein variations, non-synonymous single nucleotide polymorphisms (SNPs), protein tertiary structures and protein functional domains, are provided for researchers who are investigating protein PTM mechanisms. A web query interface and graphical visualization were designed and implemented to facilitate access to the database content.

DATA GENERATION

The data generation flow of dbPTM is depicted in Figure 1. It comprises three major components: integration of external known PTM databases, learning and prediction of PTM sites, and structural or functional annotations. The experimentally validated PTM data were extracted from Swiss-Prot (3), Phospho.ELM (1) and O-GLYCBASE (4). The experimentally verified PTM sites were used to train computational models that identify further putative PTM sites in the Swiss-Prot proteins. Additional structural properties and functional information, such as protein tertiary structures, protein secondary structures, solvent accessibility of residues, protein functional domains, protein variations and non-synonymous SNPs, are also annotated on the Swiss-Prot proteins. The detailed data generation flow is described below.

Figure 1
The data generation flow of the dbPTM database.

Integration of external known PTM databases

Three external biological databases related to protein PTM information, Swiss-Prot (3), Phospho.ELM (1) and O-GLYCBASE (4), are integrated into the proposed resource. Both the experimentally validated PTM sites and the putative PTM sites, which are annotated as ‘by similarity’, ‘potential’ or ‘probable’ in the ‘MOD_RES’ fields, have been extracted from the Swiss-Prot database (3). As summarized in Table 1, release 46.0 of Swiss-Prot contributes 11 025 experimentally validated PTM sites within 4921 proteins, and 72 308 putative PTM sites within 31 026 proteins. The Phospho.ELM entries store information about substrate proteins for which the exact positions of the residues phosphorylated by cellular kinases are known. A total of 1703 experimentally verified phosphorylation sites within 556 proteins were obtained from Phospho.ELM version 2 (1). O-GLYCBASE (4) version 6.00 provides 242 glycoproteins containing 2765 experimentally verified O-linked, N-linked and C-linked glycosylation sites. Of these, 185 glycoproteins in O-GLYCBASE correspond to Swiss-Prot proteins, which have 2353 experimentally verified glycosylation sites.
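
The split between experimentally validated and putative sites described above can be sketched as follows. This is a minimal illustration, not the authors' extraction code: the single-line layout of the feature-table record ("FT", the key MOD_RES, start, end, description) is an assumption about the flat-file format, and the function name parse_mod_res is hypothetical.

```python
import re

# Qualifiers that the text above treats as marking a putative (non-experimental) site.
PUTATIVE_QUALIFIERS = ("by similarity", "potential", "probable")

# Assumed line layout: "FT   MOD_RES   <from>   <to>   <description>"
FT_MOD_RES = re.compile(r"^FT\s+MOD_RES\s+(\d+)\s+(\d+)\s+(.*)$")

def parse_mod_res(lines):
    """Return one dict per MOD_RES line, tagged as experimental or putative."""
    sites = []
    for line in lines:
        match = FT_MOD_RES.match(line.rstrip())
        if match is None:
            continue  # not a MOD_RES feature line
        description = match.group(3)
        lowered = description.lower()
        putative = any(q in lowered for q in PUTATIVE_QUALIFIERS)
        sites.append({
            "start": int(match.group(1)),
            "end": int(match.group(2)),
            "description": description,
            "evidence": "putative" if putative else "experimental",
        })
    return sites

example = [
    "ID   EXAMPLE_HUMAN",
    "FT   MOD_RES      15     15       Phosphoserine.",
    "FT   MOD_RES      20     20       Phosphothreonine (By similarity).",
]
parsed = parse_mod_res(example)
```

A real parser would also need to handle descriptions that continue over several lines, which this sketch ignores.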

Table 1
The list of the integrated external data sources

Learning and prediction of PTM sites

To provide PTM information for Swiss-Prot proteins that lack PTM annotations, we integrated several computational tools for identifying PTM sites in the Swiss-Prot proteins. Our previous work, KinasePhos (7), incorporated profile hidden Markov models (HMMs) to identify kinase-specific phosphorylation sites with ~87% prediction accuracy (8), and was compared with several phosphorylation prediction tools, such as NetPhos (9), DISPHOS (10) and rBPNN (11) (see Supplementary Table S1). KinasePhos is integrated and used to detect kinase-specific phosphorylation sites across the Swiss-Prot proteins. To reduce the number of false positive predictions by KinasePhos, we set the predictive parameters to the values at which the prediction specificity is 100% (8).
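
One way to read "the values at which the prediction specificity is 100%" is as a score cut-off chosen so that no known negative example is called a site. The sketch below illustrates that idea only; the scores are invented, the function names are hypothetical, and it is not the parameter optimization actually performed in KinasePhos.

```python
def full_specificity_threshold(negative_scores):
    """Return the cut-off above which no known negative is predicted as a site."""
    if not negative_scores:
        raise ValueError("need at least one negative example")
    return max(negative_scores)

def call_sites(scores, threshold):
    """Keep only candidates scoring strictly above the cut-off."""
    return [s for s in scores if s > threshold]

# Hypothetical model scores for known non-sites and known sites.
negatives = [-3.0, 0.5, 1.2]
positives = [0.9, 2.0, 3.1]

cutoff = full_specificity_threshold(negatives)
false_positives = call_sites(negatives, cutoff)  # empty by construction
true_positives = call_sites(positives, cutoff)   # sites that survive the cut-off
```

The trade-off is visible in the example: the strict cut-off removes every false positive among the known negatives but also discards the true site that scores 0.9.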

As depicted in Supplementary Figure S1, a KinasePhos-like method, analogous to the KinasePhos approach for phosphorylation sites, was designed and implemented to learn models for the prediction of sulfation sites, N-linked glycosylation sites and C-linked glycosylation sites (Table 2). We used the 144 known sulfation sites of tyrosine, 1790 N-linked glycosylation sites of asparagine and 49 C-linked glycosylation sites of tryptophan to evaluate the performance of the KinasePhos-like prediction tools. The results suggest that all three KinasePhos-like tools exhibit high prediction precision, sensitivity and specificity (Supplementary Table S2).
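
For reference, the three measures named above follow the standard definitions from a confusion matrix. The sketch below uses made-up counts, not the values reported in Supplementary Table S2, and the function name is hypothetical.

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def evaluate(tp, fp, tn, fn):
    """Precision, sensitivity and specificity from confusion-matrix counts."""
    return {
        "precision": _ratio(tp, tp + fp),    # predicted sites that are real
        "sensitivity": _ratio(tp, tp + fn),  # real sites that are recovered
        "specificity": _ratio(tn, tn + fp),  # non-sites correctly rejected
    }

# Hypothetical counts for one predictor.
metrics = evaluate(tp=90, fp=10, tn=190, fn=30)
```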

Table 2
The list of the integrated annotated tools

Structural and functional annotations

In order to provide more effective information about protein structural and functional annotations relevant to protein PTM, a variety of biological databases, such as Swiss-Prot (12), Ensembl (13), InterPro (14), PDB (15) and RESID (5), are integrated.

Protein variation is the change of amino acids in polypeptides. As summarized in Table 1, Swiss-Prot contributes 32 101 protein variants corresponding to 6115 proteins, where 47 variant residues are located at PTM sites and 267 variant residues are located in the regions surrounding 236 PTM sites (positions −4 to +4). Furthermore, a single amino acid polymorphism (SAP) is the amino acid variation that corresponds to a non-synonymous SNP in the genomic sequence. Amino acid variants may have an impact on protein folding, active sites, or the overall solubility and stability of a protein. SAP is the type of variation most frequently related to human diseases (12). Therefore, when amino acid variations occur at PTM sites or in the surrounding residues, they may affect the recognition of PTM sites by catalytic kinases. A total of 23 378 human non-synonymous SNPs located in 7230 Swiss-Prot human proteins were obtained from the variation part of the Ensembl database (13).
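
The distinction between variants at a PTM site and variants in the surrounding −4 to +4 window can be sketched as below. Positions are 1-based residue numbers within one protein; the function name and the example coordinates are hypothetical.

```python
def classify_variants(ptm_sites, variant_positions, window=4):
    """Split variants into those at a PTM site and those within +/- window of one."""
    at_site, nearby = [], []
    if not ptm_sites:
        return at_site, nearby
    for variant in variant_positions:
        distance = min(abs(variant - site) for site in ptm_sites)
        if distance == 0:
            at_site.append(variant)
        elif distance <= window:
            nearby.append(variant)
    return at_site, nearby

# Hypothetical protein with PTM sites at residues 15 and 40.
at_site, nearby = classify_variants([15, 40], [15, 12, 44, 45, 100])
```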

InterPro provides 1 113 928 entries corresponding to 161 988 Swiss-Prot proteins. We found that about 65% of Swiss-Prot annotated PTM sites are located at InterPro annotated protein domains. The RESID (5) protein modifications database is integrated into dbPTM to provide PTM related information, such as mass difference, chemical formula, enzymatic activities, literature citations, GO cross-references, structure diagrams and molecular models.
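
A figure such as the share of PTM sites that fall inside annotated domains reduces to an interval-containment check per protein. The sketch below assumes domains are given as inclusive (start, end) residue ranges; the names and coordinates are illustrative and not taken from InterPro.

```python
def fraction_in_domains(ptm_sites, domains):
    """Fraction of PTM positions lying inside at least one (start, end) domain."""
    if not ptm_sites:
        return 0.0
    inside = sum(
        1 for site in ptm_sites
        if any(start <= site <= end for start, end in domains)
    )
    return inside / len(ptm_sites)

# Hypothetical protein: two domains, four PTM sites.
share = fraction_in_domains([10, 50, 120, 200], [(5, 60), (150, 250)])
```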

The latest version of PDB contains 31 721 tertiary structures corresponding to 6806 Swiss-Prot protein entries (Table 1). For the proteins with known tertiary structures, the DSSP (16) program was used to extract the true secondary structure and solvent accessibility for those 6808 Swiss-Prot proteins. Solvent accessibility of amino acid residues is important for both the structure and function of proteins, especially for the PTMs studied in this investigation. Protein secondary structure is the regular arrangement of amino acid residues in a segment of a polypeptide chain, where each amino acid is assigned a structure state: helix (H), strand (E) or coil (C). A total of 1124 experimentally verified PTMs have true secondary structure and solvent accessibility assignments.
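
DSSP itself reports eight secondary-structure classes, so obtaining the three states H, E and C requires a reduction. The paper does not say which reduction was used; the sketch below assumes one common convention (H, G, I to helix; E, B to strand; everything else to coil).

```python
# Assumed 8-to-3 state reduction; other conventions exist.
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H", "I": "H",  # alpha-, 3-10 and pi-helix
    "E": "E", "B": "E",            # extended strand and isolated bridge
}

def reduce_dssp(states):
    """Map a string of DSSP codes to helix (H), strand (E) or coil (C)."""
    return "".join(DSSP_TO_THREE_STATE.get(code, "C") for code in states)

three_state = reduce_dssp("HHGGEEB TS")
```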

However, only ~4% of Swiss-Prot proteins have known tertiary structures. For proteins without known tertiary structures, two previously published tools, RVP-net (17) and PSIPRED (18), were applied to predict the solvent accessibility and the secondary structure, respectively (see Table 2). RVP-net (17) is a feed-forward neural network that predicts a real value, ranging from 0 to 100%, for the accessible surface area (ASA) of each amino acid residue, based on its neighborhood information. We applied the RVP-net program (17) to predict the real-valued ASA for the amino acid residues of all Swiss-Prot proteins. Using a suggested threshold (17) of 25%, residues with larger ASA values are viewed as surface residues.
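
The surface/buried decision described above is a simple threshold on the predicted ASA. In the sketch below the ASA values are invented rather than RVP-net output, and the function name is hypothetical.

```python
def surface_residues(asa_percent, threshold=25.0):
    """Return 1-based positions whose predicted ASA exceeds the threshold."""
    return [
        position
        for position, asa in enumerate(asa_percent, start=1)
        if asa > threshold
    ]

# Hypothetical per-residue ASA predictions (percent) for a four-residue peptide.
exposed = surface_residues([10.0, 25.0, 60.5, 30.0])
```

A residue exactly at 25% is treated as buried here, following the ">25%" wording used for the site statistics.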

DATA STATISTICS

The statistics of the experimentally verified PTMs and the putative PTMs compiled in the dbPTM resource are shown in Table 3. For instance, dbPTM contains 14 057 known PTM sites and 772 154 putative PTM sites. The parameters of the predictive tools KinasePhos, KinasePhos-like Sulfation and KinasePhos-like Glycosylation (for the prediction of phosphorylation sites, sulfation sites and glycosylation sites, respectively) are set to the values at which the predictive specificity is 100% during the parameter optimization of the trained models (8). The numbers of putative phosphorylation and sulfation sites at which the ASA of the substrate residue is >25% (that is, the residue is considered to lie at the protein surface) are 652 756 and 13 315, respectively. There are a total of 33 887 predicted N-linked glycosylation sites of asparagine and C-linked glycosylation sites of tryptophan.

Table 3
The data statistics of the dbPTM database

INTERFACE

To facilitate the use of the dbPTM resource, we developed a website for users to browse and search the content. As depicted in Supplementary Figure S2, the user can select a particular type of PTM to browse. Clicking on a PTM entry opens a window showing the solvent accessibility of the residues, the secondary structures and the flanking sequence of the PTM site.
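
Extracting the flanking sequence shown for a PTM site amounts to taking a fixed window around the modified residue. The width of the window displayed by the website is not stated, so the flank of 4 used below is an assumption, as are the function name and the padding character.

```python
def flanking_sequence(sequence, position, flank=4, pad="-"):
    """Return the residues around a 1-based position, padded at the termini."""
    index = position - 1
    if not 0 <= index < len(sequence):
        raise ValueError("position lies outside the sequence")
    left = sequence[max(0, index - flank):index]
    right = sequence[index + 1:index + 1 + flank]
    return left.rjust(flank, pad) + sequence[index] + right.ljust(flank, pad)

# Hypothetical ten-residue sequence.
middle = flanking_sequence("MKTAYSAKQR", 6)
near_start = flanking_sequence("MKTAYSAKQR", 2)
```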

The search pages allow users to query the database using the Swiss-Prot ID and protein name. The interface also presents structural properties and functional information corresponding to the resulting proteins, such as the solvent accessibility of residues, non-synonymous variations, protein domains and protein secondary structures. Furthermore, the positional relationships among the PTMs, protein structural properties and protein functional information are graphically displayed (Figure 2).

Figure 2
The graphical interface reveals the PTMs, the solvent accessibility of the residues, protein variations, protein secondary structures and protein functional domains.

Generally, a 3D presentation is an effective way to reveal the PTM information corresponding to the protein tertiary structures. For this purpose, we developed a protein structure viewer for the visualization of protein tertiary structures and especially of the post-translational modification residues. As shown in Supplementary Figure S3, the visualization tool provides a comprehensive view of the whole protein structure and marks residues that are annotated as PTM sites. This visualization tool is implemented as a client-side tool based on OpenGL's pipeline.

The visualization of the protein structures and the annotated residues is provided in two different ways, depending on the user's platform. Users of MS Windows can download the installable package of Silver. After Silver is installed, the protein tertiary structures and the PTM sites can be displayed graphically and directly, as shown in Supplementary Figure S3. Alternatively, users of other platforms such as Mac OS X, Linux and Solaris can download the PDB structure and the Rasmol (http://www.umass.edu/microbio/rasmol/) scripts for labeling the PTM sites.
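
To give a sense of what a Rasmol labeling script might contain, the sketch below generates one from a list of PTM residue numbers using the Rasmol commands select, spacefill, color and label. It is an illustration only and not the script distributed by dbPTM; the function name is hypothetical, and chain identifiers are ignored for brevity.

```python
def rasmol_script(ptm_positions, colour="red"):
    """Build Rasmol commands that highlight and label the given residue numbers."""
    lines = []
    for position in sorted(set(ptm_positions)):
        lines.append(f"select {position}")            # all atoms of the residue
        lines.append("spacefill")                     # draw them as spheres
        lines.append(f"color {colour}")
        lines.append(f"select {position} and *.CA")   # label once, on the alpha carbon
        lines.append("label %n%r")                    # residue name and number
    lines.append("select all")
    return "\n".join(lines) + "\n"

# Hypothetical PTM sites at residues 40 and 15.
script = rasmol_script([40, 15])
```

The resulting text would be saved to a file and run from within Rasmol after the corresponding PDB structure has been loaded.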

CONCLUSIONS

The proposed resource not only integrates the experimentally validated PTM information, but it also computationally annotates the Swiss-Prot proteins for putative phosphorylation, glycosylation and sulfation sites. Furthermore, the PTM related protein structural properties and functional information, such as solvent accessibility of amino acid residues, protein variations, protein secondary structures, protein tertiary structures and protein domains, are provided to facilitate the research of protein PTMs.

One of the prospective goals for dbPTM is to integrate more efficient prediction tools for other types of PTM in addition to phosphorylation, sulfation and N- and C-linked glycosylation. Protein sequence databases other than Swiss-Prot can also be considered and annotated for post-translational modifications by the proposed resource.

AVAILABILITY

The dbPTM resource will be regularly maintained and updated. The resource is now freely available at http://dbPTM.mbc.nctu.edu.tw/.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

Acknowledgments

The authors would like to thank Shih-Yee M. Wang (University of Illinois at Chicago) for English editing, the National Science Council of the Republic of China for financially supporting this research under Contract No. NSC 94-2213-E-009-025 (to H.-D.H.), and Chang Gung Memorial Hospital for the Research Grant CTRP1006 (to T.-H.W.). Funding to pay the Open Access publication charges for this article was provided by Chang Gung Memorial Hospital.

Conflict of interest statement. None declared.

REFERENCES

1. Diella F., Cameron S., Gemund C., Linding R., Via A., Kuster B., Sicheritz-Ponten T., Blom N., Gibson T.J. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics. 2004;5:79.
2. Farriol-Mathis N., Garavelli J.S., Boeckmann B., Duvaud S., Gasteiger E., Gateau A., Veuthey A.L., Bairoch A. Annotation of post-translational modifications in the Swiss-Prot knowledge base. Proteomics. 2004;4:1537–1550.
3. Boeckmann B., Bairoch A., Apweiler R., Blatter M.C., Estreicher A., Gasteiger E., Martin M.J., Michoud K., O'Donovan C., Phan I., et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003;31:365–370.
4. Gupta R., Birch H., Rapacki K., Brunak S., Hansen J.E. O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins. Nucleic Acids Res. 1999;27:370–372.
5. Garavelli J.S. The RESID database of protein modifications as a resource and annotation tool. Proteomics. 2004;4:1527–1533.
6. Wu C.H., Yeh L.S., Huang H., Arminski L., Castro-Alvear J., Chen Y., Hu Z., Kourtesis P., Ledley R.S., Suzek B.E., et al. The Protein Information Resource. Nucleic Acids Res. 2003;31:345–347.
7. Huang H.D., Lee T.Y., Tzeng S.W., Horng J.T. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res. 2005;33:W226–W229.
8. Huang H.D., Lee T.Y., Tzeng S.W., Wu L.C., Horng J.T., Tsou A.P., Huang K.T. Incorporating hidden Markov models for identifying protein kinase-specific phosphorylation sites. J. Comput. Chem. 2005;26:1032–1041.
9. Blom N., Gammeltoft S., Brunak S. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J. Mol. Biol. 1999;294:1351–1362.
10. Iakoucheva L.M., Radivojac P., Brown C.J., O'Connor T.R., Sikes J.G., Obradovic Z., Dunker A.K. The importance of intrinsic disorder for protein phosphorylation. Nucleic Acids Res. 2004;32:1037–1049.
11. Berry E.A., Dalby A.R., Yang Z.R. Reduced bio basis function neural network for identification of protein phosphorylation sites: comparison with pattern recognition algorithms. Comput. Biol. Chem. 2004;28:75–85.
12. Yip Y.L., Scheib H., Diemand A.V., Gattiker A., Famiglietti L.M., Gasteiger E., Bairoch A. The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum. Mutat. 2004;23:464–470.
13. Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., et al. Ensembl 2005. Nucleic Acids Res. 2005;33:D447–D453.
14. Mulder N.J., Apweiler R., Attwood T.K., Bairoch A., Bateman A., Binns D., Biswas M., Bradley P., Bork P., Bucher P., et al. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief Bioinform. 2002;3:225–235.
15. Deshpande N., Addess K.J., Bluhm W.F., Merino-Ott J.C., Townsend-Merino W., Zhang Q., Knezevich C., Xie L., Chen L., Feng Z., et al. The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res. 2005;33:D233–D237.
16. Kabsch W., Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637.
17. Ahmad S., Gromiha M.M., Sarai A. RVP-net: online prediction of real valued accessible surface area of proteins from single sequences. Bioinformatics. 2003;19:1849–1851.
18. McGuffin L.J., Bryson K., Jones D.T. The PSIPRED protein structure prediction server. Bioinformatics. 2000;16:404–405.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press