Classification of proteins and RNAs
This page is maintained as a historical record and is no longer being updated.
The Classification of proteins and RNAs group moved to EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) in November 2012. The team continues to work under Alex Bateman, who now leads the EBI’s Protein Services. We are maintaining this page as a historical record of the group’s activities at the Sanger Institute. For the latest information about the group’s research, please visit the EMBL-EBI website: http://www.ebi.ac.uk/.
The Bateman group sets out to classify proteins and certain RNAs into functional families with a view to producing a 'periodic table' of these molecules.
These classifications allow researchers to rapidly understand the properties and functions of these molecules and thus better interpret their experimental results. The molecules are grouped based on their sequence, structure and function. This group, under the direction of Alex Bateman, has set up a range of different databases that collect and interpret information from researchers around the world. Sophisticated computer programs are applied to sequence information to assist in the classifications. The Pfam and Rfam databases are the most important collections of information for classifying proteins and RNAs, and the MEROPS database provides the worldwide standard nomenclature for peptidase proteins. Alex Bateman also helped initiate the Wikipedia WikiRNA Project. The information acquired is used with the overall view of contributing to the growing understanding of the functions encoded by proteins and RNAs.
Proteins are the workhorse molecules in a cell. They are built from molecular building blocks called amino acids, of which there are 20 different types. The structure of a protein molecule depends upon the order in which the amino acids are linked together. The order of the amino acids depends upon the sequence of the bases in the RNA molecule that codes for it, and this in turn depends upon the sequence of the DNA in a cell.
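The dependence of amino-acid order on base sequence can be illustrated with a short Python sketch. It is a minimal illustration only: it uses the standard genetic code, the function names `transcribe` and `translate` are chosen here for the example, and the input is assumed to be an in-frame coding strand with no introns or ambiguous bases.

```python
from itertools import product

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA that corresponds to a DNA coding strand (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGGTAA")
    print(mrna, "->", translate(mrna))
```

Each amino acid is written as its one-letter code, so the output is a protein sequence in the same notation that sequence databases use.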
Proteins normally fold into one or more three-dimensional units, each of which has its own function. These units are called protein domains. The functions of domains are mediated through interaction with other domains or molecules. Different combinations of functional domains create the diverse range of proteins found in nature. The identification of domains in a newly discovered protein can, therefore, provide insights into how each part of that protein is likely to work and hence suggest the function of the protein as a whole.
The Bateman group bases its classifications mainly on the sequence of amino acids in a given protein, since this is what determines a protein’s function. There is considerable redundancy in the genetic code: most amino acids can be specified by more than one three-base codon. As a result, coding sequences from different organisms can vary quite considerably whilst still coding for proteins that have the same or similar function and hence structure. Thus the group focuses on protein sequences rather than on the underlying nucleotide sequences.
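The extent of this redundancy can be made concrete with a small Python sketch. It assumes the standard genetic code; the two RNA sequences and the helper names (`count_encodings`, `differing_positions`) are invented for this illustration and are not taken from the group's work.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, listed in U, C, A, G order for each codon position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many different codons specify each amino acid ("*" counts stop codons).
DEGENERACY = Counter(CODON_TABLE.values())


def translate(rna: str) -> str:
    """Translate every complete codon in the sequence (no stop handling)."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def count_encodings(protein: str) -> int:
    """Number of distinct RNA sequences that encode the given protein."""
    return prod(DEGENERACY[aa] for aa in protein)


def differing_positions(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    seq_a = "CUUUCUCGU"  # illustrative sequence, not from a real gene
    seq_b = "UUGAGCAGA"  # differs from seq_a at most positions
    print(translate(seq_a), translate(seq_b))
    print(differing_positions(seq_a, seq_b), "of", len(seq_a), "bases differ")
    print(count_encodings("LSR"), "possible RNA sequences encode LSR")
```

Two sequences that share only a couple of bases can still encode an identical peptide, which is why comparison at the protein level detects relationships that nucleotide comparison can miss.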
In addition to proteins, the group also has an interest in RNA molecules that do not appear to have a role in coding for proteins, known as non-coding RNAs. Many of these RNAs are well structured, and some have catalytic activities and play crucial roles in the life cycle of the cell. An example of a non-coding RNA is found in the ribosome. This molecular complex is composed of two very long sequences of RNA together with various proteins. The RNA molecules fold in a particular fashion and have a catalytic function, in that they bring together two substrates (separate amino acids) and facilitate the chemical reaction that joins them together into a chain. Because so many non-coding RNAs have such fundamental functions in the control of how proteins are made, it is believed that they may be more ancient in evolutionary terms than proteins.
The overall aim of the Bateman group is to contribute to the understanding of the function and evolution of proteins. Specifically, its aims are to enable its own and other research scientists to:
- group all known proteins and non-coding RNA molecules into families to help understand their functions;
- identify new families of proteins and non-coding RNAs that are important in health and disease.
Contributing to the Interactome
Scanning the Pfam and Online Mendelian Inheritance in Man (OMIM) databases for mutations that affect protein interactions has thrown light on the molecular mechanisms underlying a variety of inherited diseases, and has revealed that around 4 per cent of disease-causing mutations disrupt the interaction interface in proteins. Benjamin Schuster-Böckler and Alex Bateman at the Sanger Institute created a computer program (Schuster-Böckler 2009) that combines protein structure and protein interaction information to predict interaction hotspots, and they confirmed their method using all the mutations found in the OMIM database. The team identified 1,428 mutations that were likely to affect the interaction interface in proteins, and went on to examine disease cases reported in the literature in which disruption of protein interactions as a result of mutations was believed to be the cause.
Although it is known that disease-causing mutations do disrupt protein structure, there has been little evidence that they directly affect the interface through which a protein interacts with other proteins. The team’s literature survey revealed 119 cases of disruption of protein interaction in 65 different inherited diseases, including well-known cases such as sickle-cell anaemia, which can be caused by an aberrant aggregation of haemoglobin proteins, similar to the pathological aggregation of proteins in Alzheimer’s and Creutzfeldt-Jakob diseases.
The team has compiled details of the molecular basis behind many inherited diseases. For example, in Griscelli syndrome, which is a fatal disease that features abnormal skin and hair pigmentation and sometimes immunodeficiency, the team found that a Trp73Gly mutation in the protein Rab-27A affects a residue that is both highly conserved and in the centre of the interaction interface. There is strong evidence that Rab-27A interacts with melanophilin, and hence the Trp73Gly mutation seems likely to affect vesicle transport by reducing the affinity of Rab-27A for melanophilin.
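The basic test behind such findings, whether a mutated residue falls inside a known interaction interface, can be sketched in a few lines of Python. This is not the team's program: the `Mutation` class, the helper names and the set of interface positions are hypothetical and exist only to illustrate the idea. Position 73 is included because the text above places it in the interface; the other positions are invented placeholders.

```python
import re
from dataclasses import dataclass

# Matches three-letter substitution labels such as "Trp73Gly".
MUTATION_PATTERN = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


@dataclass(frozen=True)
class Mutation:
    wild_type: str  # residue in the normal protein, e.g. "Trp"
    position: int   # 1-based position in the protein sequence
    mutant: str     # residue in the mutated protein, e.g. "Gly"


def parse_mutation(label: str) -> Mutation:
    """Parse a label of the form <wild type><position><mutant>."""
    match = MUTATION_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {label!r}")
    return Mutation(match.group(1), int(match.group(2)), match.group(3))


def in_interface(mutation: Mutation, interface_positions: frozenset) -> bool:
    """Return True if the mutated residue lies within the given interface."""
    return mutation.position in interface_positions


# HYPOTHETICAL interface positions, for illustration only.
EXAMPLE_INTERFACE = frozenset({70, 73, 77})

if __name__ == "__main__":
    for label in ("Trp73Gly", "Ala10Val"):
        mutation = parse_mutation(label)
        print(label, "in interface:", in_interface(mutation, EXAMPLE_INTERFACE))
```

In practice the interface positions would have to come from structural data on the interacting proteins, and a realistic predictor would also weigh factors such as residue conservation, which this sketch ignores.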
The team has made available all the information derived from their study, and this will contribute both to the understanding of the underlying molecular mechanisms behind certain inherited diseases, and to the growing ‘interactomic’ information in man.
The Pfam database (Finn 2008)
The Pfam database organises proteins into a library of protein families, providing a ‘periodic table’ of biology. The database consists of a large collection of families, currently amounting to nearly 12,000, that match 75 per cent of known proteins. Pfam also generates higher-level groupings of related families, known as clans. A clan is a collection of entries that are related by similarity of sequence, of structure, or of their profile hidden Markov models (profile-HMMs), the statistical models that describe the aligned sequences of each family.
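The idea of describing a family statistically can be illustrated with a deliberately simplified Python sketch. It builds a position-specific log-odds profile from an ungapped toy alignment and scores sequences against it. This is not a profile-HMM and not how Pfam is built: a real profile-HMM also models insertions and deletions, whereas this sketch assumes equal-length sequences, a uniform background and a simple pseudocount. The alignment and function names are invented for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # assumed uniform background frequency
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for column in range(length):
        counts = Counter(seq[column] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, sequence):
    """Sum the per-position log-odds scores of a sequence against the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


if __name__ == "__main__":
    toy_family = ["GSTG", "GSSG", "GTTG"]  # invented alignment
    profile = build_profile(toy_family)
    for candidate in ("GSTG", "WWWW"):
        print(candidate, round(score(profile, candidate), 2))
```

A sequence resembling the family scores above zero and an unrelated one scores below it, which is the basic principle by which a family model decides whether a new protein belongs.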
The MEROPS database (Rawlings 2008)
The MEROPS database focuses on the classification of a subset of proteins called peptidases (also termed proteases, proteinases or proteolytic enzymes) and provides the worldwide standard nomenclature for these proteins. Because MEROPS covers a more specialised set of proteins, it can collect data at a greater depth than Pfam, at the level of individual proteins as well as at family and clan level.
The Rfam database (Griffiths-Jones 2009)
We have created the Rfam database, the first collection of non-coding RNA (ncRNA) families. Rfam is a joint project involving researchers based at the Wellcome Trust Sanger Institute and at Janelia Farm, Ashburn, VA, USA. Rfam makes use of the large amount of available nucleotide sequence data to identify sequence relatives for the many hundreds of known ncRNA families. The database has allowed, for the first time, the routine annotation of ncRNAs in genomes. The database is also widely used as a training set for RNA software development.
Wikipedia: WikiRNA Project (Daub 2008)
The online encyclopedia Wikipedia has become one of the most important online references in the world and has a substantial and growing scientific content. We have formed the RNA WikiProject (http://en.wikipedia.org/wiki/Wikipedia:WikiProject_RNA) as part of the larger Molecular and Cellular Biology WikiProject. We have created over 600 new Wikipedia articles describing families of non-coding RNAs based on the Rfam database, and invite the community to update, edit and correct these articles. The Rfam database now redistributes this Wikipedia content as the primary textual annotation of its RNA families. Users can, for the first time, directly edit the content of one of the major RNA databases. We believe that this Wikipedia/Rfam link acts as a functioning model for incorporating community annotation into molecular biology databases. This project has received a lot of media attention, including coverage in NatureNews (http://www.nature.com/news/2008/081216/full/news.2008.1312.html) and WikiNews (http://en.wikinews.org/wiki/RNA_journal_submits_articles_to_Wikipedia?curid=118352).
Discovery of novel protein families
The classification of novel protein families continues to be a key method for transferring experimental results onto new genomic data. Our team has published on many novel domains such as the G5 domain. The discovery of the PAZ domain allowed us to predict that the Dicer protein would be the dsRNA nuclease involved in RNAi some months before this was experimentally demonstrated. We also discovered a novel beta-lactam binding module called the PASTA domain, found in bacterial cell surface receptors and penicillin binding proteins. Most recently we identified that the enigmatic scramblase proteins are related to Tubby, an important protein involved in regulating weight, suggesting these two have a common role in gene regulation.
Research and database maintenance are supported by grants from the Wellcome Trust, the Medical Research Council (MRC) and the Biotechnology and Biological Sciences Research Council (BBSRC).