NCBI seminar
Wednesday, 18 February 2009, 11:00 am, B2 library
Rob Finn, Pfam Project Leader at The Wellcome Sanger Institute
Pfam – A quest for complete and accurate classification of protein space

Pfam is a protein family database, with each family represented by sequence alignments and profile hidden Markov models (HMMs). Over the past few years, the development of sequencing techniques and their application to novel fields (such as metagenomics) have resulted in a substantial increase in the number of protein sequences deposited in public databases. With the number of sequences in these databases now exceeding tens of millions and the rate of growth showing no signs of abating, I will outline how Pfam has changed, and is still changing, in order to cope with the increased volume of data. We have significantly changed many aspects of the database, including the introduction of a hierarchical classification (termed Clans) and changes to the methodology used to create the automatically generated supplement to the database (Pfam-B). In addition to new features, we have continued to build new families, adding over 1000 new entries in 2008. The current release of Pfam, version 23.0, contains 10,340 families that match ~75% of sequences in UniProt. I will also offer some conjecture on how complete (or incomplete) our classification of protein space is and how many families may be required in order to represent all non-singleton sequences.

In the final part of my talk I will present some preliminary analysis of our application of the recently released alpha version of the HMMER software package, HMMER3, to Pfam. As HMMER underpins Pfam for building and searching the HMMs, HMMER3's increased speed and sensitivity present an exciting prospect for fighting back against the deluge of sequence data. I will describe some of our experiences, explain how we expect to be early adopters of the new software, and outline some of the future prospects that HMMER3 will facilitate.