The Pfam database is one of the most important collections of information in the world for classifying proteins. The database categorises 75 per cent of known proteins to form a library of protein families - a 'periodic table' of biology. The open access resource was established at the Wellcome Trust Sanger Institute in 1998. Its vision is to provide a tool which allows experimental, computational and evolutionary biologists to classify protein sequences and answer questions about what they do and how they have evolved. The Pfam project is led by Dr Alex Bateman at the Sanger Institute.


Proteins are the fundamental building blocks of all life - understanding and classifying these molecules is one of the crucial steps in extracting the benefits to human health that are encoded in genome information. Each entry in the Pfam database includes a protein sequence alignment as well as an accompanying statistical model, called a hidden Markov model.
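To illustrate how a statistical model can be derived from a sequence alignment, here is a minimal sketch in Python. It is a deliberate simplification, not Pfam's method: it builds a position-specific log-odds profile with no insert or delete states, whereas a profile hidden Markov model also models gaps. The toy alignment, the uniform background frequency, the pseudocount and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy ungapped alignment (hypothetical data, not taken from Pfam).
ALIGNMENT = ["ACDE", "ACDE", "ACEE", "GCDE"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column."""
    background = 1.0 / len(ALPHABET)  # uniform background: an assumption
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        # Pseudocounts keep unseen residues from getting zero probability.
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum per-column log-odds for an ungapped sequence of profile length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


profile = build_profile(ALIGNMENT)
family_like = score(profile, "ACDE")  # resembles the aligned sequences
unrelated = score(profile, "WWWW")    # shares no residues with them
```

A sequence that resembles the alignment scores above zero and an unrelated one scores below it, which is the basic idea behind using a family's model to recognise new members.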

Proteins are built from a number of regions, called domains, which in different combinations can determine the protein's function. Pfam allows users to analyse sequence data and search for related proteins in the database. The tool also lets users see the structure and domain architecture of any of the proteins stored, examine which species a protein is found in and look at multiple alignments. In addition, Pfam stores and gives access to information on higher level groupings of related protein families, known as clans, whose members are related by similarity of sequence or structure, or by a statistical comparison of their associated hidden Markov models.

The database comprises two main collections of information. Pfam-A contains high-quality entries that have been curated manually. To extend the sequence coverage of Pfam, an additional area of the database - Pfam-B - contains automatically generated entries that are of lower quality but add valuable coverage for regions not yet curated and stored in Pfam-A.

The latest version of the Pfam database contains nearly 12,000 curated protein families, but the aim of the project is to develop a comprehensive classification of all known protein sequences. On its way to achieving this ambitious goal, the open access resource will speed scientific discovery by continuing to share all new information as it is added to the database.

Selected Publications

  • The Pfam protein families database.

    Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR and Bateman A

    Nucleic acids research 2010;38;Database issue;D211-22

  • Pfam 10 years on: 10,000 families and still growing.

    Sammut SJ, Finn RD and Bateman A

    Briefings in bioinformatics 2008;9;3;210-9
