The accurate prediction of protein family from amino acid sequence by measuring features of sequence fragments

Huixiao Hong; Qilong Hong; Roger Perkins; Leming Shi; Hong Fang; Zhenqiang Su; Yvonne Dragan; James C Fuscoe; Weida Tong

doi:10.1089/cmb.2008.0115

J Comput Biol. 2009 Dec;16(12):1671-88. doi: 10.1089/cmb.2008.0115.

Authors

Huixiao Hong¹, Qilong Hong, Roger Perkins, Leming Shi, Hong Fang, Zhenqiang Su, Yvonne Dragan, James C Fuscoe, Weida Tong

Affiliation

¹ Division of Systems Toxicology, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, Arkansas 72079, USA. Huixiao.Hong@fda.hhs.gov

PMID: 20047490

Abstract

Rapid advances in proteomic analyses, coupled with the completion of multiple genomes, have increased the demand for determining protein functions. The first step is to classify proteins into families or to predict the family to which a protein belongs. A method was developed to predict protein family from protein sequence alone using support vector machine (SVM) models. In these models, amino acids were classified into three categories (apolar, polar, and charged). Consecutive sequence fragments one to five residues long were annotated by amino acid category to define the features of each protein. SVM models were constructed from the features of a training set of proteins and then evaluated on an independent set of proteins. The approach was tested on 20 protein families from the iProClass database of the Protein Information Resource (PIR). Two-class SVM models achieved an average prediction accuracy of 0.9985, and multi-class SVM models achieved an accuracy of 0.9941. This study demonstrates that SVM-based methods can accurately recognize and predict the protein family to which a sequence belongs based solely on its primary amino acid sequence.
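The fragment-based features described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not give them: the assignment of each amino acid to the apolar, polar, or charged category, the use of overlapping fragments, the normalization of counts to frequencies, and the skipping of non-standard residues. The names `CLASS_OF`, `FRAGMENTS`, and `fragment_features` are hypothetical.

```python
from collections import Counter
from itertools import product

MAX_LEN = 5  # fragment lengths 1..5, as stated in the abstract

# ASSUMPTION: the abstract does not list category membership; this is one
# common three-way grouping. A = apolar, P = polar, C = charged.
CLASS_OF = {
    aa: label
    for label, group in (("A", "GAVLIPFMW"), ("P", "STCYNQ"), ("C", "DEKRH"))
    for aa in group
}

# Every possible category fragment of length 1..MAX_LEN, in a fixed order:
# 3 + 9 + 27 + 81 + 243 = 363 features.
FRAGMENTS = [
    "".join(p) for k in range(1, MAX_LEN + 1) for p in product("APC", repeat=k)
]


def fragment_features(sequence):
    """Return one frequency per category fragment for a protein sequence."""
    # Re-encode the sequence in the three-letter category alphabet.
    # ASSUMPTION: residues outside the 20 standard amino acids are skipped.
    encoded = "".join(CLASS_OF[aa] for aa in sequence.upper() if aa in CLASS_OF)

    counts = Counter()
    totals = {}
    for k in range(1, MAX_LEN + 1):
        n = max(len(encoded) - k + 1, 0)  # number of overlapping k-fragments
        totals[k] = n
        for i in range(n):
            counts[encoded[i:i + k]] += 1

    # ASSUMPTION: normalize each count by the number of fragments of that
    # length so that proteins of different sizes are comparable.
    return [
        counts[f] / totals[len(f)] if totals[len(f)] else 0.0 for f in FRAGMENTS
    ]


vector = fragment_features("MKTAYIAK")
print(len(vector))  # 363
```

Each protein becomes a fixed-length vector of 363 values regardless of its sequence length, which is the form of input an SVM library expects for training and prediction.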

MeSH terms

Algorithms
Amino Acid Sequence
Databases, Protein
Models, Theoretical
Multigene Family*
Proteins / chemistry*
Proteins / classification*
Sequence Alignment
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid

Substances

Proteins