NIH Public Access; Author Manuscript; Accepted for publication in peer reviewed journal
Bioinformatics. Author manuscript; available in PMC Oct 14, 2008.
PMCID: PMC2567111
NIHMSID: NIHMS60482

EPO-KB: a searchable knowledge base of biomarker to protein links

Summary

The knowledge base EPO-KB (Empirical Proteomic Ontology Knowledge Base) is based on an OWL ontology that represents current knowledge linking mass-to-charge (m/z) ratios to proteins on multiple platforms, including Matrix-Assisted Laser Desorption/Ionization (MALDI) and Surface-Enhanced Laser Desorption/Ionization (SELDI) Time-of-Flight (TOF) mass spectrometry. At present, it contains m/z ratio to protein links extracted from 120 published research papers. Its web interface allows researchers to query and retrieve putative proteins that correspond to a user-specified m/z ratio. EPO-KB also allows automated entry of additional m/z ratio to protein links and can be extended to include gene to protein and protein to disease links.

1 INTRODUCTION

There are an estimated 100 000 or more proteins (Leonov et al., 2003) in the human body. Mass spectrometry is used extensively by researchers to identify proteins within different human substrates such as plasma, cerebrospinal fluid and tissue lysates (Ranganathan et al., 2005; Soltys et al., 2004; Szalowska et al., 2007). However, no mechanism currently exists for easy look-up of previously identified biomarker [mass to charge (m/z) ratio] to protein links. A researcher who wants to identify possible proteins corresponding to an m/z ratio has to manually search the literature, use the m/z ratio to search a database of theoretical weights, or perform further analysis to identify the protein.

The current literature on m/z ratios has significant limitations. Authors who focus on proteomic pattern recognition and classification using specific biomarkers (Zhukov et al., 2003) do not link m/z ratios to the identified proteins. Some authors defer protein identification to a future study. Other authors, who have invested much time and effort in biomarker and protein discovery studies, sometimes report the m/z ratio as a rounded number, e.g. 3.0 × 10^4 m/z, or omit the biomarker entirely and report only the identified protein.

Some work has been done on compiling proteomics data and making them available to researchers. Though many different proteomes, such as plasma, serum and cerebrospinal fluid (States et al., 2006), have been characterized, the m/z ratio to protein links cannot be identified quickly and easily. UniProt (The UniProt Consortium, 2007), KEGG (Kanehisa et al., 2006) and ExPASy (Gasteiger et al., 2003) all allow molecular weight look-ups, but in our experience the m/z ratios observed in the acquired spectra are often shifted from the theoretical molecular weight because of post-translational modifications, alternative splicing and the physicochemical characteristics of the protein. These characteristics often alter the acquired m/z ratio from experiment to experiment even when the protocol is unchanged. Consequently, a look-up tool such as the one proposed here can serve only as an aid in the identification process.

To address this need, we created a web-accessible knowledge base called the Empirical Proteomic Ontology Knowledge Base (EPO-KB) to assist with the look-up and coordination of knowledge about validated biomarkers and their links to proteins. The knowledge base can be searched on several parameters that help the researcher narrow down the candidate proteins during the identification process. We have designed the ontology to allow for easy revision and expansion.

2 METHODS

We developed a basic ontology that includes data elements such as laser power, type of mass spectrometer, ionization method, matrix type and the biological sample type (bio-specimen), as well as the experts’ determination that an observed peak is a singly, doubly or triply charged species of the protein. To populate the knowledge base, we selected papers from the literature that included all the data elements listed above, either in the main text or in the Supplementary Material. Locating papers that had complete information proved more difficult than we anticipated. We identified 300 published papers, of which 120 contained information on all the data elements; these were used to populate the knowledge base. Papers that did not have all the data elements were set aside for future curation and addition to the knowledge base.
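The completeness criterion can be pictured as a record with one field per data element, where a link extracted from a paper is accepted only if no field is missing. The sketch below is illustrative only: the class name MzProteinLink, its field names and the placeholder values are assumptions made for this example and are not the classes of the actual OWL ontology.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class MzProteinLink:
    """Hypothetical record holding the data elements named in the text."""
    mz: Optional[float] = None
    protein: Optional[str] = None
    charge: Optional[int] = None            # 1, 2 or 3 (singly, doubly, triply charged)
    laser_power: Optional[str] = None
    spectrometer_type: Optional[str] = None
    ionization_method: Optional[str] = None
    matrix_type: Optional[str] = None
    bio_specimen: Optional[str] = None

    def missing_elements(self) -> List[str]:
        """Return the names of data elements that were not reported."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_complete(self) -> bool:
        return not self.missing_elements()


# Placeholder values only; they are not taken from any curated paper.
complete = MzProteinLink(13875.0, "Transthyretin", 1, "placeholder-power",
                         "TOF", "SELDI", "placeholder-matrix", "placeholder-specimen")
partial = MzProteinLink(mz=13875.0, protein="Transthyretin", charge=1)

accepted = [link for link in (complete, partial) if link.is_complete()]
set_aside = [link for link in (complete, partial) if not link.is_complete()]
```

Under this view, the papers set aside for future curation are simply those whose records report at least one missing element.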

One challenge that we faced was how to combine m/z ratios that were linked to the same protein. We considered m/z ratios for pooling if they were linked to the same protein (with the same post-translational modification), were obtained using the same acquisition protocol (i.e. same platform, substrate, chip and bio-specimen), and had the same charge (e.g. singly charged or doubly charged). The arithmetic mean of the m/z ratios was then computed to create what we call a pooled biomarker. A new m/z ratio that satisfied the above criteria was added to an existing pooled biomarker in the database if its value was within 1% of the mean, and the mean was then recalculated to include the new m/z ratio. This heuristic seems to work well, and we have yet to encounter two biomarkers linked to the same protein under these criteria whose m/z ratios differ by more than 1%. As an example, three m/z ratios, namely 13 875, 13 887 and 13 923, that were all linked to the protein Transthyretin (with the same post-translational modification and acquired using the same acquisition protocol from the same bio-specimen) were pooled to create a single pooled biomarker with a mean m/z ratio of 13 895 in the EPO-KB. Subsequently, two more m/z ratios linked to Transthyretin, namely 13 884 and 6880, were found in the literature. The first m/z ratio was pooled with the existing Transthyretin pooled biomarker and the mean was recalculated. The second m/z ratio was not pooled with the existing Transthyretin pooled biomarker because it has a different charge, and a new pooled biomarker was created to represent the doubly charged species of Transthyretin.
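The pooling heuristic can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the EPO-KB implementation: the names PooledBiomarker and add_mz are hypothetical, and the pooling criteria (protein, post-translational modification, acquisition protocol, charge) are collapsed into a single key tuple with placeholder strings.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PooledBiomarker:
    """Hypothetical pooled biomarker: a key plus the m/z ratios pooled under it."""
    key: Tuple                       # (protein, modification, protocol, charge)
    values: List[float] = field(default_factory=list)

    @property
    def mean(self) -> float:
        return sum(self.values) / len(self.values)


def add_mz(pools, key, mz, tolerance=0.01):
    """Add mz to a pool with the same key if it lies within `tolerance`
    (1% by default) of that pool's current mean; otherwise start a new pool."""
    for pool in pools:
        if pool.key == key and abs(mz - pool.mean) <= tolerance * pool.mean:
            pool.values.append(mz)   # the mean is recomputed on next access
            return pool
    pool = PooledBiomarker(key, [mz])
    pools.append(pool)
    return pool


# The Transthyretin example from the text; modification and protocol are placeholders.
pools = []
singly = ("Transthyretin", "same-modification", "same-protocol", 1)
doubly = ("Transthyretin", "same-modification", "same-protocol", 2)

for mz in (13875, 13887, 13923):
    add_mz(pools, singly, mz)
mean_after_three = pools[0].mean     # 13895.0, as reported in the text

add_mz(pools, singly, 13884)         # within 1% of 13895, so it is pooled
add_mz(pools, doubly, 6880)          # different charge, so a new pool is created
```

After the last two additions the singly charged pool holds four values with a recalculated mean of 13 892.25, and the doubly charged species sits in its own pool.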

The query engine retrieves proteins from the knowledge base that are close (based on a distance score) to the user-specified m/z ratio. For example, if the query is 13 750 m/z with a 0.5% window parameter (which represents the confidence in the biomarker value, e.g. the mass accuracy), the closest protein in the knowledge base is Transthyretin (singly charged) with a distance score of 3.5. The smaller the distance score of a candidate protein, the better its match to the queried m/z ratio and the higher its rank in the output.
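The ranking step can be illustrated with a short sketch. The formula for the distance score is not given here, so the code substitutes an assumed stand-in: the absolute difference between the stored and queried m/z ratios, expressed in units of the window half-width. Because of that assumption it does not reproduce the score of 3.5 quoted above. The function name rank_candidates and the tuple layout are hypothetical; the two entries are the Transthyretin pooled biomarkers from the pooling example.

```python
def rank_candidates(entries, query_mz, window_pct):
    """Rank (protein, charge, pooled_mz) entries by an ASSUMED distance score.

    The score is |pooled_mz - query_mz| divided by the window half-width
    (query_mz * window_pct / 100). Smaller scores rank first.
    """
    half_width = query_mz * window_pct / 100.0
    scored = [
        (abs(mz - query_mz) / half_width, protein, charge, mz)
        for protein, charge, mz in entries
    ]
    return sorted(scored)            # ascending: best match first


entries = [
    ("Transthyretin", 2, 6880.0),     # doubly charged pooled biomarker
    ("Transthyretin", 1, 13892.25),   # singly charged pooled biomarker
]

ranked = rank_candidates(entries, query_mz=13750, window_pct=0.5)
best_score, best_protein, best_charge, best_mz = ranked[0]
```

With the query of 13 750 m/z, the singly charged Transthyretin entry ranks first under this stand-in score, which agrees with the ordering described in the text even though the score value itself differs.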

3 CURRENT STATE AND PLANNED IMPROVEMENTS

Currently, the knowledge base contains over 150 different validated m/z ratios representing over 75 commonly analyzed proteins. The knowledge base can be queried with a user-specified m/z ratio and a window around that ratio; the output lists protein matches ranked by the distance score, which is a measure of the quality of the match. In addition to the query engine, the website also contains the ontology of the knowledge base. We hope that this resource will improve the speed and ease of peak identification in biomarker studies. Instead of isolating the peak, performing a tryptic digestion and doing sequence identification, which is costly and time consuming, researchers can use the EPO-KB to suggest putative proteins that can then be analyzed by immunoprecipitation studies for definitive identification.

We plan to improve the EPO-KB in several directions, including support for reverse look-ups (i.e. retrieving the set of m/z ratios that correspond to a specified protein) and incorporation of information from other databases and ontologies to link validated biomarkers to diseases. We plan to incorporate more biomarkers from the literature beyond the current corpus of 120 papers used to populate the database. We also plan to contact the authors of the papers that we were unable to add because of incomplete information, in order to obtain the additional details needed to include them in the knowledge base.

We also hope that with the release of the EPO-KB, biologists and bioinformaticians will contribute newly identified m/z ratio to protein links through the online submission on the website. Such contributions will be sequestered and reviewed before addition to the knowledge base.

4 CONCLUSION

We have compiled a new web-accessible resource called EPO-KB for biologists and bioinformaticians, who utilize mass spectrometry to analyze biofluids and tissue lysates for putative biomarkers, to assist with biomarker identification. We developed a structured ontology knowledge base and populated it with validated m/z ratio to protein links from 120 published research papers. The EPO-KB has a front end query engine that is publicly accessible on the web for researchers to utilize when performing preliminary analysis of biomarkers (http://www.dbmi.pitt.edu/EPO-KB). We believe that this novel resource will speed up the identification and coordination of biomarker analysis.

ACKNOWLEDGEMENTS

We thank Gary Garvin for his assistance with the programming, the website with the query engine, and Shyam Visweswaran for his help in editing and assistance with formatting. We also thank the Department of Biomedical Informatics at the University of Pittsburgh (http://www.dbmi.pitt.edu/) for providing space for the EPO-KB web-server.

Funding: This research is funded by the NIH, grant # T15 LM007059 NLM.

Footnotes

Conflict of Interest: none declared.

REFERENCES

  • Gasteiger E, et al. ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 2003;31:3784–3788.
  • Leonov H, Mitchell JSB, Arkin IT. Monte Carlo estimation of the number of possible protein folds: Effects of sampling bias and folds distributions. Prot. Struct. Funct. Genet. 2003;51:352–359.
  • Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34
  • Ranganathan S, et al. Proteomic profiling of cerebrospinal fluid identifies biomarkers for amyotrophic lateral sclerosis. Neurochemistry. 2005;95:1461.
  • Soltys SG, et al. The use of plasma surface-enhanced laser desorption/ionization time-of-flight mass spectrometry proteomic patterns for detection of head and neck squamous cell cancers. Clin. Cancer Res. 2004;10:4806–4812.
  • States DJ, et al. Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat. Biotechnol. 2006;24:333–338.
  • Szalowska E, et al. Fractional factorial design for optimization of the SELDI protocol for human adipose tissue culture media. Biotechnol. Prog. 2007;23:217–224.
  • The UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2007;35:D193–D197.
  • Zhukov TA, et al. Discovery of distinct protein profiles specific for lung tumors and pre-malignant lung lesions by SELDI mass spectrometry. Lung Cancer. 2003;40:267–279.