Archive page: Classification of proteins and RNAs

The Classification of proteins and RNAs group moved to EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) in November 2012. The team continues to work under Alex Bateman, who now leads the EBI's Protein Services. We are maintaining this page as a historical record of the group's activities at the Sanger Institute. For the latest information about the group's research, please visit EMBL-EBI.

The Bateman group sets out to classify proteins and certain RNAs into functional families with a view to producing a 'periodic table' of these molecules.

These classifications allow researchers to rapidly understand the properties and functions of these molecules and thus better interpret their experimental results. The molecules are grouped based on their sequence, structure and function. This group, under the direction of Alex Bateman, has set up a range of different databases that collect and interpret information from researchers around the world. Sophisticated computer programs are applied to sequence information to assist in the classifications. The Pfam and Rfam databases are the most important collections of information for classifying proteins and RNAs, and the MEROPS database provides the worldwide standard nomenclature for peptidase proteins. Alex Bateman also helped initiate the Wikipedia WikiRNA Project. The information acquired is used with the overall view of contributing to the growing understanding of the functions encoded by proteins and RNAs.




Proteins are the workhorse molecules in a cell. They are built from molecular building blocks called amino acids, of which there are 20 different types. The structure of a protein molecule depends upon the order in which the amino acids are linked together. The order of the amino acids depends upon the sequence of the bases in the RNA molecule that codes for it, and this in turn depends upon the sequence of the DNA in a cell.

Proteins normally fold into one or more three-dimensional units each one of which has its own function. These units are called protein domains. The functions of domains are mediated through interaction with other domains or molecules. Different combinations of functional domains create the diverse range of proteins found in nature. The identification of domains in newly discovered proteins can, therefore, provide insights into how that protein is likely to function and hence reveal the function of the whole protein sequence.

The Bateman group bases its classifications mainly on the sequence of amino acids in a given protein since this is what determines a protein's function. There is considerable redundancy in RNA coding sequences and, therefore, RNA sequences from different organisms can vary quite considerably whilst still coding for proteins that have the same or similar function and hence structure. Thus the group focuses on protein sequences rather than on DNA sequences.
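The redundancy described above can be illustrated with a minimal sketch. The codon table below is only a small subset of the standard genetic code, and the two RNA sequences are made-up examples rather than real genes; the point is simply that sequences differing at several bases can still encode the same protein.

```python
# Small subset of the standard genetic code (RNA codon -> one-letter amino acid).
# "*" marks a stop codon. Only the codons needed for this example are listed.
CODON_TABLE = {
    "AUG": "M",
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "CUU": "L", "CUC": "L", "CUA": "L", "CUG": "L", "UUA": "L", "UUG": "L",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def translate(rna):
    """Translate an RNA coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Two hypothetical coding sequences that use different synonymous codons.
rna_a = "AUGGCUAAACUUUAA"
rna_b = "AUGGCGAAGUUGUGA"

# Count the positions at which the two RNA sequences differ.
differences = sum(1 for a, b in zip(rna_a, rna_b) if a != b)

print(translate(rna_a), translate(rna_b), differences)
```

Both sequences translate to the same four-residue peptide even though they differ at five of fifteen bases, which is why comparing protein sequences tends to be more informative than comparing the underlying nucleotide sequences.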

Non-coding RNAs

In addition to proteins, the group also has an interest in RNA molecules that do not appear to have a role in coding for proteins, known as non-coding RNAs. Many of these RNAs are well-structured, and some have catalytic activities and play crucial roles in the lifecycle of the cell. An example of a non-coding RNA is found in the ribosome. This molecular complex is composed of two very long sequences of RNA together with various proteins. The RNA molecules fold in a particular fashion and have a catalytic function, in that they bring together two substrates (separate amino acids) and facilitate the chemical reaction that joins them together into a chain. Because so many non-coding RNAs have such fundamental functions involved in the control of how proteins are made, it is believed that they may be more ancient in evolutionary terms than proteins.


The overall aim of the Bateman group is to contribute to the understanding of the function and evolution of proteins. Specifically, its aims are to enable its own and other research scientists to:

  • group all known proteins and non-coding RNA molecules into families to help understand their functions;
  • identify new families of proteins and non-coding RNAs that are important in health and disease.

Contributing to the Interactome

Scanning the Pfam and Online Mendelian Inheritance in Man (OMIM) databases for mutations that affect protein interactions has thrown light on the molecular mechanisms underlying a variety of inherited diseases, and has revealed that around 4 per cent of disease-causing mutations disrupt the interaction interface in proteins. Benjamin Schuster-Böckler and Alex Bateman at the Sanger Institute created a computer program (Schuster-Böckler 2008) that combines protein structure and protein interaction information to predict interaction hotspots, and they confirmed their method using all the mutations found in the OMIM database. The team identified 1,428 mutations that were likely to affect the interaction interface in proteins, and went on to examine disease cases reported in the literature in which disruption of protein interactions as a result of mutations was believed to be the cause.

Although it is known that disease-causing mutations disrupt protein structure, there has been little evidence that they directly affect the interface that interacts with other proteins. The team's literature survey revealed 119 cases of disruption of protein interaction in 65 different inherited diseases, including well-known cases such as sickle-cell anaemia, which can be caused by an aberrant aggregation of haemoglobin proteins, similar to the pathological aggregation of proteins in Alzheimer's and Creutzfeldt-Jakob diseases.

The team has compiled details of the molecular basis behind many inherited diseases. For example, in Griscelli Syndrome, which is a fatal disease that features abnormal skin and hair pigmentation and sometimes immunodeficiency, the team found that a Trp73Gly mutation in the protein Rab-27A affects a residue that is both highly conserved and in the centre of the interaction interface. There is strong evidence that Rab-27A interacts with melanophilin, and hence the Trp73Gly mutation seems likely to affect vesicle transport by reducing the affinity of Rab-27A for melanophilin.
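As a rough illustration of the idea of checking whether a mutation falls within an interaction interface, the sketch below parses mutation labels such as Trp73Gly and tests the residue number against a set of interface positions. This is not the published method: the function names, the interface residue numbers and the second mutation label are all hypothetical, chosen only to make the example runnable.

```python
import re

# Matches three-letter-code substitutions such as "Trp73Gly".
MUTATION_PATTERN = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_mutation(label):
    """Split a label like 'Trp73Gly' into (wild type, position, mutant)."""
    match = MUTATION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognised mutation label: {label!r}")
    wild_type, position, mutant = match.groups()
    return wild_type, int(position), mutant


def interface_mutations(mutations, interface_positions):
    """Return the mutations whose residue number lies in the interface set."""
    return [m for m in mutations if parse_mutation(m)[1] in interface_positions]


# Hypothetical interface residue numbers, for illustration only.
interface_positions = {70, 73, 77}

# Trp73Gly is taken from the text above; Ala10Val is a made-up control.
mutations = ["Trp73Gly", "Ala10Val"]

hits = interface_mutations(mutations, interface_positions)
print(hits)
```

In a real analysis the interface positions would come from a solved structure of the interacting proteins rather than a hand-written set.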

The team has made available all the information derived from their study, and this will contribute both to the understanding of the underlying molecular mechanisms behind certain inherited diseases, and to the growing 'interactomic' information in man.

The Pfam database (Finn 2008)

The Pfam database organises proteins into a library of protein families, providing a 'periodic table' of biology. The database consists of a large collection, currently amounting to nearly 12,000 families, that matches 75 per cent of known proteins. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of Pfam entries that are related by similarity of sequence, structure or profile hidden Markov model (profile-HMM), a statistical model of a family's sequence alignment.
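To give a feel for how a statistical model of a family can recognise new members, here is a deliberately simplified sketch: a gapless position-specific scoring profile built from a toy alignment. It is not a profile-HMM (there are no insert or delete states) and it is not how Pfam or HMMER is implemented; the alignment, the uniform background frequencies and the pseudocount are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequency


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length aligned sequences."""
    length = len(alignment[0])
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(length):
        counts = Counter(sequence[column] for sequence in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, sequence):
    """Sum the log-odds scores of a sequence against the profile."""
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Hypothetical toy alignment standing in for a family's seed alignment.
family = ["MKVLA", "MKILA", "MRVLA", "MKVIA"]
profile = build_profile(family)

member_score = score(profile, "MKVLA")     # resembles the family
unrelated_score = score(profile, "GWPDC")  # shares no residues with it

print(round(member_score, 2), round(unrelated_score, 2))
```

A sequence resembling the family scores above zero and an unrelated one below it; a full profile-HMM extends this idea with states that model insertions and deletions.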

The MEROPS database (Rawlings 2008)

The MEROPS database focuses on the classification of a subset of proteins called peptidases (also termed proteases, proteinases or proteolytic enzymes) and provides the worldwide standard nomenclature for these proteins. Because MEROPS covers a more specialised set of proteins, it can collect data at a greater depth than Pfam, at the level of individual proteins as well as at family and clan level.

The Rfam database (Gardner 2009)

We have created the Rfam database, the first collection of non-coding RNA (ncRNA) families. Rfam is a joint project involving researchers based at the Wellcome Trust Sanger Institute and at Janelia Farm, Ashburn, VA, USA. Rfam makes use of the large amount of available nucleotide sequence data to identify sequence relatives for the many hundreds of known ncRNA families. The database has allowed, for the first time, the routine annotation of ncRNAs in genomes. The database is also widely used as a training set for RNA software development.

Wikipedia: WikiRNA Project (Daub 2008)

The online encyclopedia Wikipedia has become one of the most important online references in the world and has a substantial and growing scientific content. We have formed the RNA WikiProject as part of the larger Molecular and Cellular Biology WikiProject. We have created over 600 new Wikipedia articles describing families of noncoding RNAs based on the Rfam database, and invite the community to update, edit, and correct these articles. The Rfam database now redistributes this Wikipedia content as the primary textual annotation of its RNA families. Users can, for the first time, directly edit the content of one of the major RNA databases. We believe that this Wikipedia/Rfam link acts as a functioning model for incorporating community annotation into molecular biology databases. This project has received a lot of media attention, including an appearance in NatureNews and WikiNews.

Discovery of novel protein families

Figure 1: stereoview of the PASTA domain binding to a beta-lactam antibiotic.



The classification of novel protein families continues to be a key method for transferring experimental results onto new genomic data. Our team has published on many novel domains such as the G5 domain (5). The discovery of the PAZ domain allowed us to predict that the Dicer protein would be the dsRNA nuclease involved in RNAi some months before this was experimentally demonstrated (6). We also discovered a novel beta-lactam binding module called the PASTA domain, found in bacterial cell surface receptors and penicillin binding proteins (7). Most recently we identified that the enigmatic scramblase proteins are related to Tubby, an important protein involved in regulating weight, suggesting these two have a common role in gene regulation (8).

Research and database maintenance are supported by grants from the Wellcome Trust, the Medical Research Council (MRC) and the Biotechnology and Biological Sciences Research Council (BBSRC).


  • MEROPS - Provides the internationally recognised classification of peptidases and their inhibitors.
  • Pfam - Provides a classification of proteins into families and domains using hidden Markov models.
  • Rfam - Provides a classification of RNA families using covariance models.

Selected Publications

  • Phospholipid scramblases and Tubby-like proteins belong to a new superfamily of membrane tethered transcription factors.

    Bateman A, Finn RD, Sims PJ, Wiedmer T, Biegert A and Söding J

    Bioinformatics (Oxford, England) 2009;25;2;159-62

  • Rfam: updates to the RNA families database.

    Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR and Bateman A

    Nucleic acids research 2009;37;Database issue;D136-40

  • The RNA WikiProject: community annotation of RNA families.

    Daub J, Gardner PP, Tate J, Ramsköld D, Manske M, Scott WG, Weinberg Z, Griffiths-Jones S and Bateman A

    RNA (New York, N.Y.) 2008;14;12;2462-4

  • The Pfam protein families database.

    Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL and Bateman A

    Nucleic acids research 2008;36;Database issue;D281-8

  • MEROPS: the peptidase database.

    Rawlings ND, Morton FR, Kok CY, Kong J and Barrett AJ

    Nucleic acids research 2008;36;Database issue;D320-5

  • Protein interactions in human genetic diseases.

    Schuster-Böckler B and Bateman A

    Genome biology 2008;9;1;R9

  • The G5 domain: a potential N-acetylglucosamine recognition domain involved in biofilm formation.

    Bateman A, Holden MT and Yeats C

    Bioinformatics (Oxford, England) 2005;21;8;1301-3

  • The PASTA domain: a beta-lactam-binding domain.

    Yeats C, Finn RD and Bateman A

    Trends in biochemical sciences 2002;27;9;438

  • Domains in gene silencing and cell differentiation proteins: the novel PAZ domain and redefinition of the Piwi domain.

    Cerutti L, Mian N and Bateman A

    Trends in biochemical sciences 2000;25;10;481-2


Team members

Lars Barquist

I completed a bachelor's degree in Biomathematics at Rutgers University in New Brunswick, New Jersey, USA. Following this, I spent two years in Ian Holmes' lab in the Department of Bioengineering at the University of California, Berkeley working on problems in biological sequence analysis. I am currently pursuing a PhD in Alex Bateman's group on a four-year Wellcome Trust Sanger Institute studentship.


My work focuses on bacterial small non-coding RNAs, a diverse class of regulatory molecules. I am using computational, experimental, and high-throughput techniques to discover and characterize these elements.


  • HandAlign: Bayesian multiple sequence alignment, phylogeny and ancestral reconstruction.

    Westesson O, Barquist L and Holmes I

    Department of Bioengineering, University of California Berkeley, CA 94720, USA.

    We describe handalign, a software package for Bayesian reconstruction of phylogenetic history. The underlying model of sequence evolution describes indels and substitutions. Alignments, trees and model parameters are all treated as jointly dependent random variables and sampled via Metropolis-Hastings Markov chain Monte Carlo (MCMC), enabling systematic statistical parameter inference and hypothesis testing. handalign implements several different MCMC proposal kernels, allows sampling from arbitrary target distributions via Hastings ratios, and uses standard file formats for trees, alignments and models.

    Installation and usage instructions are at

    Funded by: NIGMS NIH HHS: R01-GM076705

    Bioinformatics (Oxford, England) 2012;28;8;1170-1

  • RNIE: genome-wide prediction of bacterial intrinsic terminators.

    Gardner PP, Barquist L, Bateman A, Nawrocki EP and Weinberg Z

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    Bacterial Rho-independent terminators (RITs) are important genomic landmarks involved in gene regulation and terminating gene expression. In this investigation we present RNIE, a probabilistic approach for predicting RITs. The method is based upon covariance models which have been known for many years to be the most accurate computational tools for predicting homology in structural non-coding RNAs. We show that RNIE has superior performance in model species from a spectrum of bacterial phyla. Further analysis of species where a low number of RITs were predicted revealed a highly conserved structural sequence motif enriched near the genic termini of the pathogenic Actinobacteria, Mycobacterium tuberculosis. This motif, together with classical RITs, account for up to 90% of all the significantly structured regions from the termini of M. tuberculosis genic elements. The software, predictions and alignments described below are available from

    Funded by: Howard Hughes Medical Institute

    Nucleic acids research 2011;39;14;5845-52

  • Evolutionary modeling and prediction of non-coding RNAs in Drosophila.

    Bradley RK, Uzilov AV, Skinner ME, Bendaña YR, Barquist L and Holmes I

    Biophysics Graduate Group, University of California, Berkeley, CA, USA.

    We performed benchmarks of phylogenetic grammar-based ncRNA gene prediction, experimenting with eight different models of structural evolution and two different programs for genome alignment. We evaluated our models using alignments of twelve Drosophila genomes. We find that ncRNA prediction performance can vary greatly between different gene predictors and subfamilies of ncRNA gene. Our estimates for false positive rates are based on simulations which preserve local islands of conservation; using these simulations, we predict a higher rate of false positives than previous computational ncRNA screens have reported. Using one of the tested prediction grammars, we provide an updated set of ncRNA predictions for D. melanogaster and compare them to previously-published predictions and experimental data. Many of our predictions show correlations with protein-coding genes. We found significant depletion of intergenic predictions near the 3' end of coding regions and furthermore depletion of predictions in the first intron of protein-coding genes. Some of our predictions are colocated with larger putative unannotated genes: for example, 17 of our predictions showing homology to the RFAM family snoR28 appear in a tandem array on the X chromosome; the 4.5 Kbp spanned by the predicted tandem array is contained within a FlyBase-annotated cDNA.

    Funded by: NIGMS NIH HHS: GM076705, R01 GM076705

    PloS one 2009;4;8;e6478

  • xREI: a phylo-grammar visualization webserver.

    Barquist L and Holmes I

    Department of Bioengineering, University of California, Berkeley, USA.

    Phylo-grammars, probabilistic models combining Markov chain substitution models with stochastic grammars, are powerful models for annotating structured features in multiple sequence alignments and analyzing the evolution of those features. In the past, these methods have been cumbersome to implement and modify. xrate provides means for the rapid development of phylo-grammars (using a simple file format) and automated parameterization of those grammars from training data (via the Expectation Maximization algorithm). xREI (pron. 'X-ray') is an intuitive, flexible AJAX (Asynchronous Javascript And XML) web interface to xrate providing grammar visualization tools as well as access to xrate's training and annotation functionality. It is hoped that this application will serve as a valuable tool to those developing phylo-grammars, and as a means for the exploration and dissemination of such models. xREI is available at

    Funded by: NIGMS NIH HHS: 1R01GM076705-01

    Nucleic acids research 2008;36;Web Server issue;W65-9

Penny Coggill - Pfam annotator

I began at the Sanger Centre in 1999 in the Sequencing Projects team, when the first ABI 3700 sequencing machines arrived - exciting times. Our task was to optimise the Sanger-sequencing reaction to make maximally efficient use of the machines to churn out the human sequence.

After several years, I moved to Stephan Beck's Immunogenetics team to run the Human MHC Sequencing Consortium project. The aim of this project was to sequence fully the MHC region from 8 individuals. Day-to-day this meant mapping and fingerprinting and gap-closure of BAC clones, and supervising their passage through the Sanger sequencing pipeline.


On the disbandment of Stephan's team, I moved to work with Alex as a Pfam annotator. My job is to build family models from proteins that are not covered by earlier models and to ascertain their activity if possible. I also manage the help-desk system for user queries.

Occasionally I can delve more deeply into the nature of a protein or PDB structure, and such efforts may warrant publication.

I have contributed to the production of two online user manuals for the Pfam database, the first for Current Protocols in Bioinformatics, the second for the NCI-Nature Protocols Databases journal.


  • DUFs: families in search of function.

    Bateman A, Coggill P and Finn RD

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, England.

    Domains of unknown function (DUFs) are a large set of uncharacterized protein families that are found in the Pfam database. Here, the scale and growth of functionally uncharacterized families in biological databases are surveyed and the prospects for discovering their function are examined. In particular, the important role that structural genomics can play in identifying potential function is evaluated.

    Funded by: Wellcome Trust: 087656, WT077044/Z/05/Z

    Acta crystallographica. Section F, Structural biology and crystallization communications 2010;66;Pt 10;1148-52

  • The Pfam protein families database.

    Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR and Bateman A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK, the USA and Sweden.

    Funded by: Howard Hughes Medical Institute; Medical Research Council: MC_U137761446; Wellcome Trust: 087656, WT077044/Z/05/Z

    Nucleic acids research 2010;38;Database issue;D211-22

  • The structure of pyogenecin immunity protein, a novel bacteriocin-like immunity protein from Streptococcus pyogenes.

    Chang C, Coggill P, Bateman A, Finn RD, Cymborowski M, Otwinowski Z, Minor W, Volkart L and Joachimiak A

    Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439, USA.

    Background: Many Gram-positive lactic acid bacteria (LAB) produce anti-bacterial peptides and small proteins called bacteriocins, which enable them to compete against other bacteria in the environment. These peptides fall structurally into three different classes, I, II, III, with class IIa being pediocin-like single entities and class IIb being two-peptide bacteriocins. Self-protective cognate immunity proteins are usually co-transcribed with these toxins. Several examples of cognates for IIa have already been solved structurally. Streptococcus pyogenes, closely related to LAB, is one of the most common human pathogens, so knowledge of how it competes against other LAB species is likely to prove invaluable.

    Results: We have solved the crystal structure of the gene-product of locus Spy_2152 from S. pyogenes, (PDB:2fu2), and found it to comprise an anti-parallel four-helix bundle that is structurally similar to other bacteriocin immunity proteins. Sequence analyses indicate this protein to be a possible immunity protein protective against class IIa or IIb bacteriocins. However, given that S. pyogenes appears to lack any IIa pediocin-like proteins but does possess class IIb bacteriocins, we suggest this protein confers immunity to IIb-like peptides.

    Conclusions: Combined structural, genomic and proteomic analyses have allowed the identification and in silico characterization of a new putative immunity protein from S. pyogenes, possibly the first structure of an immunity protein protective against potential class IIb two-peptide bacteriocins. We have named the two pairs of putative bacteriocins found in S. pyogenes pyogenecin 1, 2, 3 and 4.

    Funded by: Wellcome Trust: 087656, WT077044/Z/05/Z

    BMC structural biology 2009;9;75

  • Identifying protein domains with the Pfam database.

    Coggill P, Finn RD and Bateman A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.

    Pfam is a database of protein domain families, with each family represented by multiple sequence alignments and profile hidden Markov models (HMMs). In addition, each family has associated annotation, literature references, and links to other databases. The entries in Pfam are available via the World Wide Web and in flatfile format. This unit contains detailed information on how to access and utilize the information present in the Pfam database, namely the families, multiple alignments, and annotation. Details on running Pfam, both remotely and locally are presented.

    Funded by: Wellcome Trust: 087656

    Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] 2008;Chapter 2;Unit 2.5

  • Variation analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype Project.

    Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ, Almeida J, Forbes S, Gilbert JG, Halls K, Harrow JL, Hart E, Howe K, Jackson DK, Palmer S, Roberts AN, Sims S, Stewart CA, Traherne JA, Trevanion S, Wilming L, Rogers J, de Jong PJ, Elliott JF, Sawcer S, Todd JA, Trowsdale J and Beck S

    Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.

    The human major histocompatibility complex (MHC) is contained within about 4 Mb on the short arm of chromosome 6 and is recognised as the most variable region in the human genome. The primary aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Comparison of the haplotype sequences, including four haplotypes not previously analysed, resulted in the identification of >44,000 variations, both substitutions and indels (insertions and deletions), which have been submitted to the dbSNP database. The gene annotation uncovered haplotype-specific differences and confirmed the presence of more than 300 loci, including over 160 protein-coding genes. Combined analysis of the variation and annotation datasets revealed 122 gene loci with coding substitutions of which 97 were non-synonymous. The haplotype (A3-B7-DR15; PGF cell line) designated as the new MHC reference sequence, has been incorporated into the human genome assembly (NCBI35 and subsequent builds), and constitutes the largest single-haplotype sequence of the human genome to date. The extensive variation and annotation data derived from the analysis of seven further haplotypes have been made publicly available and provide a framework and resource for future association studies of all MHC-associated diseases and transplant medicine.

    Funded by: NHGRI NIH HHS: U54 HG004555-01; Wellcome Trust: 048880, 062023, 077198

    Immunogenetics 2008;60;1;1-18

  • Genetic analysis of completely sequenced disease-associated MHC haplotypes identifies shuffling of segments in recent human history.

    Traherne JA, Horton R, Roberts AN, Miretti MM, Hurles ME, Stewart CA, Ashurst JL, Atrazhev AM, Coggill P, Palmer S, Almeida J, Sims S, Wilming LG, Rogers J, de Jong PJ, Carrington M, Elliott JF, Sawcer S, Todd JA, Trowsdale J and Beck S

    Department of Pathology, Immunology Division, University of Cambridge, Cambridge, United Kingdom.

    The major histocompatibility complex (MHC) is recognised as one of the most important genetic regions in relation to common human disease. Advancement in identification of MHC genes that confer susceptibility to disease requires greater knowledge of sequence variation across the complex. Highly duplicated and polymorphic regions of the human genome such as the MHC are, however, somewhat refractory to some whole-genome analysis methods. To address this issue, we are employing a bacterial artificial chromosome (BAC) cloning strategy to sequence entire MHC haplotypes from consanguineous cell lines as part of the MHC Haplotype Project. Here we present 4.25 Mb of the human haplotype QBL (HLA-A26-B18-Cw5-DR3-DQ2) and compare it with the MHC reference haplotype and with a second haplotype, COX (HLA-A1-B8-Cw7-DR3-DQ2), that shares the same HLA-DRB1, -DQA1, and -DQB1 alleles. We have defined the complete gene, splice variant, and sequence variation contents of all three haplotypes, comprising over 259 annotated loci and over 20,000 single nucleotide polymorphisms (SNPs). Certain coding sequences vary significantly between different haplotypes, making them candidates for functional and disease-association studies. Analysis of the two DR3 haplotypes allowed delineation of the shared sequence between two HLA class II-related haplotypes differing in disease associations and the identification of at least one of the sites that mediated the original recombination event. The levels of variation across the MHC were similar to those seen for other HLA-disparate haplotypes, except for a 158-kb segment that contained the HLA-DRB1, -DQA1, and -DQB1 genes and showed very limited polymorphism compatible with identity-by-descent and relatively recent common ancestry (<3,400 generations). These results indicate that the differential disease associations of these two DR3 haplotypes are due to sequence variation outside this central 158-kb segment, and that shuffling of ancestral blocks via recombination is a potential mechanism whereby certain DR-DQ allelic combinations, which presumably have favoured immunological functions, can spread across haplotypes and populations.

    Funded by: NCI NIH HHS: N01-CO-12400; Wellcome Trust: 048880

    PLoS genetics 2006;2;1;e9

Marco Punta - Project Leader (Pfam)

I earned my first (master level) degree in Physics at the University "La Sapienza" of Rome, Italy, in 1998 and my Ph.D. in Biophysics at SISSA in Trieste, Italy, in 2002, working on computational studies of membrane proteins. From 2002 to 2009 I was at Columbia University in the city of New York, NY, USA, and later at the TUM Institute for Advanced Study in Munich, Germany, as part of the group of Burkhard Rost. During this time I developed methods for protein structure and function prediction and worked on target selection and data analysis for structural genomics.


Given my background, my initial objective at Pfam is to make more systematic use of available high-resolution structural information in protein family building. Also, together with the Pfam team, I am in the process of addressing and revisiting some important aspects of the database, such as family-specific gathering thresholds, clan definition and family functional annotation. More generally, Pfam constitutes a great platform from which to continue my work on the study of protein sequence-structure-function relationships.


  • The Pfam protein families database.

    Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer EL, Eddy SR, Bateman A and Finn RD

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK, the USA and Sweden. Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.

    Funded by: Biotechnology and Biological Sciences Research Council: BB/F010435/1; Howard Hughes Medical Institute; Wellcome Trust: WT077044/Z/05/Z

    Nucleic acids research 2012;40;Database issue;D290-301

  • Characterization of metalloproteins by high-throughput X-ray absorption spectroscopy.

    Shi W, Punta M, Bohon J, Sauder JM, D'Mello R, Sullivan M, Toomey J, Abel D, Lippi M, Passerini A, Frasconi P, Burley SK, Rost B and Chance MR

    New York SGX Research Center for Structural Genomics (NYSGXRC), Case Western Reserve University, Center for Proteomics and Bioinformatics, Case Center for Synchrotron Biosciences, Upton, New York 11973, USA.

    High-throughput X-ray absorption spectroscopy was used to measure transition metal content based on quantitative detection of X-ray fluorescence signals for 3879 purified proteins from several hundred different protein families generated by the New York SGX Research Center for Structural Genomics. Approximately 9% of the proteins analyzed showed the presence of transition metal atoms (Zn, Cu, Ni, Co, Fe, or Mn) in stoichiometric amounts. The method is highly automated and highly reliable based on comparison of the results to crystal structure data derived from the same protein set. To leverage the experimental metalloprotein annotations, we used a sequence-based de novo prediction method, MetalDetector, to identify Cys and His residues that bind to transition metals for the redundancy-reduced subset of 2411 sequences sharing <70% sequence identity and having at least one His or Cys. As the HT-XAS identifies metal type and protein binding, while the bioinformatics analysis identifies metal-binding residues, the results were combined to identify putative metal-binding sites in the proteins and their associated families. We explored the combination of this data with homology models to generate detailed structure models of metal-binding sites for representative proteins. Finally, we used extended X-ray absorption fine structure data from two of the purified Zn metalloproteins to validate predicted metalloprotein binding site structures. This combination of experimental and bioinformatics approaches provides comprehensive active site analysis on the genome scale for metalloproteins as a class, revealing new insights into metalloprotein structure and function.

    Funded by: NIBIB NIH HHS: P30 EB009998, P30-EB-09998; NIGMS NIH HHS: U54GM074945

    Genome research 2011;21;6;898-907

  • Homologue structure of the SLAC1 anion channel for closing stomata in leaves.

    Chen YH, Hu L, Punta M, Bruni R, Hillerich B, Kloss B, Rost B, Love J, Siegelbaum SA and Hendrickson WA

    Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA.

    The plant SLAC1 anion channel controls turgor pressure in the aperture-defining guard cells of plant stomata, thereby regulating the exchange of water vapour and photosynthetic gases in response to environmental signals such as drought or high levels of carbon dioxide. Here we determine the crystal structure of a bacterial homologue (Haemophilus influenzae) of SLAC1 at 1.20 Å resolution, and use structure-inspired mutagenesis to analyse the conductance properties of SLAC1 channels. SLAC1 is a symmetrical trimer composed from quasi-symmetrical subunits, each having ten transmembrane helices arranged from helical hairpin pairs to form a central five-helix transmembrane pore that is gated by an extremely conserved phenylalanine residue. Conformational features indicate a mechanism for control of gating by kinase activation, and electrostatic features of the pore coupled with electrophysiological characteristics indicate that selectivity among different anions is largely a function of the energetic cost of ion dehydration.

    Funded by: NIGMS NIH HHS: R01 GM034102, U54 GM075026, U54 GM095315

    Nature 2010;467;7319;1074-80

  • Structural genomics target selection for the New York consortium on membrane protein structure.

    Punta M, Love J, Handelman S, Hunt JF, Shapiro L, Hendrickson WA and Rost B

    Department of Biochemistry and Molecular Biophysics, Columbia University, 630 West 168th Street, New York, NY, 10032, USA.

    The New York Consortium on Membrane Protein Structure (NYCOMPS), a part of the Protein Structure Initiative (PSI) in the USA, has as its mission to establish a high-throughput pipeline for determination of novel integral membrane protein structures. Here we describe our current target selection protocol, which applies structural genomics approaches informed by the collective experience of our team of investigators. We first extract all annotated proteins from our reagent genomes, i.e. the 96 fully sequenced prokaryotic genomes from which we clone DNA. We filter this initial pool of sequences and obtain a list of valid targets. NYCOMPS defines valid targets as those that, among other features, have at least two predicted transmembrane helices, no predicted long disordered regions and, except for community nominated targets, no significant sequence similarity in the predicted transmembrane region to any known protein structure. Proteins that feed our experimental pipeline are selected by defining a protein seed and searching the set of all valid targets for proteins that are likely to have a transmembrane region structurally similar to that of the seed. We require sequence similarity aligning at least half of the predicted transmembrane region of seed and target. Seeds are selected according to their feasibility and/or biological interest, and they include both centrally selected targets and community nominated targets. As of December 2008, over 6,000 targets have been selected and are currently being processed by the experimental pipeline. We discuss how our target list may impact structural coverage of the membrane protein space.

    Funded by: NIGMS NIH HHS: U54-GM75026-01

    Journal of structural and functional genomics 2009;10;4;255-68

  • Natively unstructured regions in proteins identified from contact predictions.

    Schlessinger A, Punta M and Rost B

    Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY, USA.

    Motivation: Natively unstructured (also dubbed intrinsically disordered) regions in proteins lack a defined 3D structure under physiological conditions and often adopt regular structures under particular conditions. Proteins with such regions are overly abundant in eukaryotes, they may increase functional complexity of organisms and they usually evade structure determination in the unbound form. Low propensity for the formation of internal residue contacts has been previously used to predict natively unstructured regions.

    Results: We combined PROFcon predictions for protein-specific contacts with a generic pairwise potential to predict unstructured regions. This novel method, Ucon, outperformed the best available methods in predicting proteins with long unstructured regions. Furthermore, Ucon correctly identified cases missed by other methods. By computing the difference between predictions based on specific contacts (approach introduced here) and those based on generic potentials (realized in other methods), we might identify unstructured regions that are involved in protein-protein binding. We discussed one example to illustrate this ambitious aim. Overall, Ucon added quality and an orthogonal aspect that may help in the experimental study of unstructured regions in network hubs.


    Supplementary data are available at Bioinformatics online.

    Funded by: NIGMS NIH HHS: U54-GM072980, U54-GM074958; NLM NIH HHS: R01-LM07329

    Bioinformatics (Oxford, England) 2007;23;18;2376-84

  • Membrane protein prediction methods.

    Punta M, Forrest LR, Bigelow H, Kernytsky A, Liu J and Rost B

    Department of Biochemistry and Molecular Biophysics, Columbia University, 1130 St. Nicholas Ave., New York, NY 10032, USA.

    We survey computational approaches that tackle membrane protein structure and function prediction. While describing the main ideas that have led to the development of the most relevant and novel methods, we also discuss pitfalls, provide practical hints and highlight the challenges that remain. The methods covered include: sequence alignment, motif search, functional residue identification, transmembrane segment and protein topology predictions, homology and ab initio modeling. In general, predictions of functional and structural features of membrane proteins are improving, although progress is hampered by the limited amount of high-resolution experimental information available. While predictions of transmembrane segments and protein topology rank among the most accurate methods in computational biology, more attention and effort will be required in the future to ameliorate database search, homology and ab initio modeling.

    Funded by: NIGMS NIH HHS: R01-GM64633-01, U54-GM75026-01; NLM NIH HHS: R01 LM007329-01A1, R01-LM07329-01

    Methods (San Diego, Calif.) 2007;41;4;460-74

  • Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks.

    Passerini A, Punta M, Ceroni A, Rost B and Frasconi P

    Università degli Studi di Firenze, Dipartimento di Sistemi e Informatica Via di Santa Marta 3, 50139 Firenze, Italy.

    Accurate predictions of metal-binding sites in proteins by using sequence as the only source of information can significantly help in the prediction of protein structure and function, genome annotation, and in the experimental determination of protein structure. Here, we introduce a method for identifying histidines and cysteines that participate in binding of several transition metals and iron complexes. The method predicts histidines as being in either of two states (free or metal bound) and cysteines in either of three states (free, metal bound, or in disulfide bridges). The method uses only sequence information by utilizing position-specific evolutionary profiles as well as more global descriptors such as protein length and amino acid composition. Our solution is based on a two-stage machine-learning approach. The first stage consists of a support vector machine trained to locally classify the binding state of single histidines and cysteines. The second stage consists of a bidirectional recurrent neural network trained to refine local predictions by taking into account dependencies among residues within the same protein. A simple finite state automaton is employed as a postprocessing in the second stage in order to enforce an even number of disulfide-bonded cysteines. We predict histidines and cysteines in transition-metal-binding sites at 73% precision and 61% recall. We observe significant differences in performance depending on the ligand (histidine or cysteine) and on the metal bound. We also predict cysteines participating in disulfide bridges at 86% precision and 87% recall. Results are compared to those that would be obtained by using expert information as represented by PROSITE motifs and, for disulfide bonds, to state-of-the-art methods.

    Funded by: NIGMS NIH HHS: R01-GM64633-01

    Proteins 2006;65;2;305-16

  • PROFcon: novel prediction of long-range contacts.

    Punta M and Rost B

    CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University 650 West 168th Street BB217, New York, NY 10032, USA.

    Motivation: Despite the continuing advance in the experimental determination of protein structures, the gap between the number of known protein sequences and structures continues to increase. Prediction methods can bridge this sequence-structure gap only partially. Better predictions of non-local contacts between residues could improve comparative modeling, fold recognition and could assist in the experimental structure determination.

    Results: Here, we introduced PROFcon, a novel contact prediction method that combines information from alignments, from predictions of secondary structure and solvent accessibility, from the region between two residues and from the average properties of the entire protein. In contrast to some other methods, PROFcon predicted short and long proteins at similar levels of accuracy. As expected, PROFcon was clearly less accurate when tested on sparse evolutionary profiles, that is, on families with few homologs. Prediction accuracy was highest for proteins belonging to the SCOP alpha/beta class. PROFcon compared favorably with state-of-the-art prediction methods at the CASP6 meeting. While the performance may still be perceived as low, our method clearly pushed the mark higher. Furthermore, predictions are already accurate enough to seed predictions of global features of protein structure.

    Funded by: NIGMS NIH HHS: R01-GM64633-01

    Bioinformatics (Oxford, England) 2005;21;13;2960-8

  • Protein folding rates estimated from contact predictions.

    Punta M and Rost B

    CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, 650 West 168th Street BB217, New York, NY 10032, USA.

    Folding rates of small single-domain proteins that fold through simple two-state kinetics can be estimated from details of the three-dimensional protein structure. Previously, predictions of secondary structure had been exploited to predict folding rates from sequence. Here, we estimate two-state folding rates from predictions of internal residue-residue contacts in proteins of unknown structure. Our estimate is based on the correlation between the folding rate and the number of predicted long-range contacts normalized by the square of the protein length. It is well known that long-range order derived from known structures correlates with folding rates. The surprise was that estimates based on very noisy contact predictions were almost as accurate as the estimates based on known contacts. On average, our estimates were similar to those previously published from secondary structure predictions. The combination of these methods that exploit different sources of information improved performance. It appeared that the combined method reliably distinguished fast from slow two-state folders.

    Funded by: NIGMS NIH HHS: R01-GM64633-01; NLM NIH HHS: R01-LM07329-01

    Journal of molecular biology 2005;348;3;507-12

  • CASP6 assessment of contact prediction.

    Graña O, Baker D, MacCallum RM, Meiler J, Punta M, Rost B, Tress ML and Valencia A

    Protein Design Group, Centro Nacional de Biotecnologia (CNB-CSIC), C/Darwin 3, Cantoblanco, Madrid, Spain.

    Here we present the evaluation results of the Critical Assessment of Protein Structure Prediction (CASP6) contact prediction category. Contact prediction was assessed with standard measures well known in the field and the performance of specialist groups was evaluated alongside groups that submitted models with 3D coordinates. The evaluation was mainly focused on long range contact predictions for the set of new fold targets, although we analyzed predictions for all targets. Three groups with similar levels of accuracy and coverage performed a little better than the others. Comparisons of the predictions of the three best methods with those of CASP5/CAFASP3 suggested some improvement, although there were not enough targets in the comparisons to make this statistically significant.

    Funded by: NIGMS NIH HHS: R01-GM64633; NLM NIH HHS: R01-LM07329

    Proteins 2005;61 Suppl 7;214-24
