Pac Symp Biocomput. 2004:399-410.

Clustering protein sequence and structure space with infinite Gaussian mixture models.

Author information

Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711, USA.


We describe a novel approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies, etc., based on the theory of infinite Gaussian mixture models. This method allows the data itself to dictate how many mixture components are required to model it, and provides a measure of the probability that two proteins belong to the same cluster. We illustrate our methods with application to three data sets: globin sequences, globin sequences with known three-dimensional structures, and G-protein coupled receptor sequences. The consistency of the clusters indicates that our method produces biologically meaningful results, which provide a very good indication of the underlying families and subfamilies. With the inclusion of secondary structure and residue solvent accessibility information, we obtain a classification of sequences of known structure which both reflects and extends their SCOP classifications. A supplementary web site containing larger versions of the figures is available at wid/PSB04/index.html.
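
The pairwise same-cluster probability mentioned in the abstract can be illustrated with a minimal sketch. It assumes (the abstract does not state this) that cluster assignments are available as a collection of posterior samples, for example from a Gibbs sampler over the mixture, and it estimates the probability as the fraction of samples in which two proteins carry the same label. The function name `coclustering_probabilities` and the toy label vectors are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def coclustering_probabilities(samples):
    """Estimate P(items i and j share a cluster) from sampled label vectors.

    `samples` is a list of label vectors, one per sampler sweep (an assumed
    input format). Entry [i][j] of the result is the fraction of samples in
    which items i and j have the same cluster label.
    """
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples[0])
    counts = [[0] * n for _ in range(n)]
    for labels in samples:
        if len(labels) != n:
            raise ValueError("all samples must label the same number of items")
        # Every item trivially shares a cluster with itself.
        for i in range(n):
            counts[i][i] += 1
        # Count each unordered pair once and mirror it to keep symmetry.
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    total = len(samples)
    return [[c / total for c in row] for row in counts]


# Hypothetical assignments for four proteins over three sampler sweeps.
samples = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [2, 2, 5, 7],
]
probs = coclustering_probabilities(samples)
```

Only equality of labels within a single sample is used, so the label values need not be consistent from one sample to the next, and the number of distinct clusters is free to vary between samples.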

[Indexed for MEDLINE]