The accurate prediction of protein family from amino acid sequence by measuring features of sequence fragments

J Comput Biol. 2009 Dec;16(12):1671-88. doi: 10.1089/cmb.2008.0115.

Abstract

The rapid advances in proteomic analyses, coupled with the completion of multiple genomes, have led to an increased demand for determining protein functions. The first step is to classify proteins into families. A method was developed for predicting protein family from protein sequence alone using support vector machine (SVM) models. In these models, the amino acids were classified into three categories (apolar, polar, and charged). Consecutive sequence fragments one to five residues long were annotated by amino acid type to define the features of each protein. SVM models were constructed from the features of a training set of proteins and then evaluated on an independent set of proteins. The approach was tested on 20 protein families from the iProClass database of the Protein Information Resource (PIR). Two-class SVM models achieved an average prediction accuracy of 0.9985, and multi-class SVM models achieved an accuracy of 0.9941. This study demonstrates that SVM-based methods can accurately recognize and predict the protein family to which a sequence belongs based solely on its primary amino acid sequence.
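The feature construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say which residues belong to each category, so the grouping in `RESIDUE_CLASS` is an assumption, as are the names `reduce_sequence` and `fragment_features`, the use of raw counts rather than frequencies, and the choice to skip non-standard residue codes.

```python
from itertools import product

# ASSUMPTION: the abstract names the three categories but not their members.
# This grouping is one common convention, chosen only for illustration.
RESIDUE_CLASS = {}
for residues, label in (("AVLIMFWPG", "a"),   # apolar
                        ("STNQYCH", "p"),     # polar
                        ("DEKR", "c")):       # charged
    for residue in residues:
        RESIDUE_CLASS[residue] = label

MAX_FRAGMENT_LEN = 5  # fragments of one to five residues, as in the abstract

# Every possible fragment over the three-letter alphabet:
# 3 + 9 + 27 + 81 + 243 = 363 features in a fixed order.
FEATURES = ["".join(p)
            for k in range(1, MAX_FRAGMENT_LEN + 1)
            for p in product("apc", repeat=k)]


def reduce_sequence(seq):
    """Map a protein sequence to the apolar/polar/charged alphabet.

    Residue codes outside the 20 standard amino acids are skipped,
    which is a simplification made for this sketch.
    """
    return "".join(RESIDUE_CLASS[r] for r in seq.upper() if r in RESIDUE_CLASS)


def fragment_features(seq):
    """Count every consecutive fragment of length 1..5 in the reduced sequence.

    Returns a list of raw counts aligned with FEATURES.
    """
    reduced = reduce_sequence(seq)
    counts = dict.fromkeys(FEATURES, 0)
    for k in range(1, MAX_FRAGMENT_LEN + 1):
        for i in range(len(reduced) - k + 1):
            counts[reduced[i:i + k]] += 1
    return [counts[f] for f in FEATURES]


if __name__ == "__main__":
    vector = fragment_features("MKTAYIAKQR")
    print(len(vector), "features;", sum(vector), "fragments counted")
```

Each protein becomes a fixed-length vector regardless of sequence length, which is the property an SVM needs from its input. Such vectors could then be passed to any SVM implementation for two-class or multi-class training.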

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Databases, Protein
  • Models, Theoretical
  • Multigene Family*
  • Proteins / chemistry*
  • Proteins / classification*
  • Sequence Alignment
  • Sequence Analysis, Protein / methods*
  • Sequence Homology, Amino Acid

Substances

  • Proteins