Hubbard Research Group
The main focus of Tim Hubbard's research group is the development of algorithms and infrastructure for use in vertebrate genome annotation. The group interacts closely with the Vertebrate Genome Analysis Project, of which Tim is PI.

Sequence Motifs from Tiffin Database.
The process of vertebrate gene annotation is dominated by the use of experimentally determined transcription evidence, processed by integrated approaches combining both automatic algorithms and manual curation. However, annotating gene regulatory regions such as promoters or enhancers, in particular the short motif sequences that function as transcription factor binding sites (TFBS) for DNA binding proteins, is still a hard problem both computationally and experimentally. In the group, a number of machine learning algorithms have been developed to address both specific annotation problems, such as the prediction of transcription start sites (TSS), and more general problems, such as the large scale detection of short DNA motif sequences. The Eponine system developed by Thomas Down was used to predict exact start points of transcription (Eponine-TSS) from vertebrate genomic sequence alone (Down and Hubbard, 2002). Eponine has been applied to a number of other problems, such as identifying candidate DNA motifs within human protein-coding exons (Down et al., 2006). While Eponine can be used to model the general properties of regulatory features within genome sequence such as TSS, modelling the multiple distinct motif sequences within these features requires a different approach. The NestedMICA algorithm (NMICA) was developed by Thomas Down with large scale genome wide motif detection in mind (Down and Hubbard, 2005). It has been used to simultaneously discover large numbers of candidate TFBS from Drosophila (Down et al., 2007). Assessment of these predicted motifs using independent data suggests that the majority represent functional DNA elements. Matias Piipari has been generating even larger motif sets in a range of organisms such as yeast, with the aim of generating complete motif dictionaries. The motifs and predicted functional annotation for a subset are contained in the Tiffin database.
NMICA has been extended for use with protein sequences by Mutlu Dogruel (Dogruel et al., 2008) and used in subcellular localization prediction. The group has a long-standing connection with subcellular localization prediction algorithms, dating back to the work of Astrid Reinhardt using neural networks (Reinhardt and Hubbard, 1998).
In developing these ab initio algorithms, an overall philosophy has been to build predictive models of biological processes that do not rely on information inaccessible to the biological machinery, i.e. not to "cheat". For example, many algorithms make predictions by combining ab initio models with external experimental data, such as that resulting from evolutionary comparison. However, RNA polymerase, when it transcribes a section of genome sequence, arguably knows nothing about the degree of conservation of the sequence in other species, nor about its protein coding potential. While ignoring such additional information makes prediction harder, it results in algorithms better able to distinguish between functional and mutated sequences. Analysis of the ever increasing volumes of experimental data frequently overtakes purely computational approaches, as it has done in vertebrate protein coding gene annotation. However, if we are to model the consequences of rare or unique mutations in the genome sequences of individuals, we will also need such purely computational algorithms.
Beyond ab initio motif prediction algorithms, there are a number of genome annotation related problems where group members have applied machine learning and other genome analysis approaches. In the field of cancer, Thomas Down contributed to the analysis of the Human Cancer Gene Census generated by the institute's Cancer Genome Project (CGP) of Mick Stratton and Andy Futreal (Futreal et al., 2004). In collaboration with the Experimental Cancer Genetics team of David Adams and CGP, Jenny Mattison analysed data from human and mouse cancers to look for evidence of new cancer genes (Uren et al., 2008). In the field of genome annotation, while the transcriptome of organisms is increasingly accessible, particularly through next generation sequencing, it is hard to identify whether an open reading frame is really translated into protein or is untranslated, for example due to nonsense mediated decay. Mass spectrometry (MS) is one approach, and Markus Brosch has developed improved data processing algorithms (Brosch et al., 2008) to provide a reliable set of genome wide vertebrate peptide fragments in collaboration with the WTSI MS team of Jyoti Choudhary. In the interpretation of DNA methylation data, Thomas Down developed a Bayesian tool (Batman) (Down et al., 2008), which has been applied to genome wide MeDIP datasets (Rakyan et al., 2008).
Components of widely used open source software (OSS) bioinformatics infrastructure have also been developed and contributed to by members of the group. The popular BioJava framework for processing biological data in Java was started in the group in 1997 by Matt Pocock and Thomas Down while they were studying for their PhDs (Pocock et al., 2000; Holland et al., 2008). Much of the machine learning algorithm development work described above uses BioJava, which is the main reason that Java is the dominant programming language of the group. The group has also made substantial contributions to the development of the Distributed Annotation System (DAS). Tim has always been a strong advocate of the DAS concept (Dowell et al., 2001), seeing it as a way to create a level playing field between researchers, allowing them to share annotation without each having to host a genome browser. In 1999 Thomas Down developed Dazzle, the first Java implementation of a DAS server, just in time for the client support of DAS in Ensembl. DAS incorporates the same concept of separating annotation and data that Tim implemented to allow protein annotation from the SCOP database to be layered onto protein 3D structures (Hubbard et al., 1997). Andreas Prlic in the group provided DAS components for two large consortia to enable integration and sharing of annotation. He extended the DAS specification to support protein structures (Prlic et al., 2007) and developed the SPICE DAS client (Prlic et al., 2005) to support the efamily project to integrate protein family annotation from SCOP, CATH, Pfam, InterPro and MSD. He went on to extend SPICE to view protein structural alignments from the SISYPHUS database (Andreeva et al., 2007) and between predictions from the CASP structure prediction competition (Kryshtafovych et al., 2007), all using DAS. For the biosapiens consortium, which adopted DAS as its main data integration framework, he developed the DAS registry (Prlic et al., 2007).
As the number of registered DAS sources now exceeds 400, it has been necessary to develop an ontology to allow users and clients to organise the many annotations (Reeves et al., 2008). Andreas started a series of annual DAS workshops in 2007. The 3rd DAS workshop is being organised by Jonathan Warren in 2009.
Prior to joining WTSI, Tim's research group worked mainly on proteins. Tim remains connected with SCOP (the Structural Classification of Proteins database) and its use in the calibration of sequence alignment methods (Brenner et al., 1998), and has been involved in developments such as SISYPHUS (Andreeva et al., 2007). During his period as organiser of CASP (Critical Assessment of Structure Prediction) (CASP2-CASP7; 1996-2007) he developed new methods of assessment (Hubbard, 1999) which are still in use. Tim's involvement in evaluation exercises continues through EGASP (the ENCODE Genome Annotation Assessment Project) (Guigo et al., 2006).
Tim welcomes enquiries from prospective postdocs, particularly those with strong coding and mathematical skills and experience in machine learning. Those interested in PhD positions need to apply to the WTSI PhD programme, since the programme is co-ordinated centrally and applications cannot generally be made to individual research groups.
Publications
-
The Protein Feature Ontology: a tool for the unification of protein feature annotations.
Bioinformatics (Oxford, England) 2008;24;23;2767-72
PUBMED: 18936051; PMC: 2912506; DOI: 10.1093/bioinformatics/btn528
-
BioJava: an open-source framework for bioinformatics.
Bioinformatics (Oxford, England) 2008;24;18;2096-7
PUBMED: 18689808; PMC: 2530884; DOI: 10.1093/bioinformatics/btn397
-
An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs).
Genome research 2008;18;9;1518-29
PUBMED: 18577705; PMC: 2527707; DOI: 10.1101/gr.077479.108
-
A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis.
Nature biotechnology 2008;26;7;779-85
PUBMED: 18612301; PMC: 2644410; DOI: 10.1038/nbt1414
-
Large-scale mutagenesis in p19(ARF)- and p53-deficient mice identifies cancer genes and their collaborative networks.
Cell 2008;133;4;727-41
PUBMED: 18485879; PMC: 2405818; DOI: 10.1016/j.cell.2008.03.021
-
Comparison of Mascot and X!Tandem performance for low and high accuracy mass spectrometry and the development of an adjusted Mascot threshold.
Molecular & cellular proteomics : MCP 2008;7;5;962-70
PUBMED: 18216375; PMC: 2656932; DOI: 10.1074/mcp.M700293-MCP200
-
Integrating biological data--the Distributed Annotation System.
BMC bioinformatics 2008;9 Suppl 8;S3
PUBMED: 18673527; PMC: 2500094; DOI: 10.1186/1471-2105-9-S8-S3
-
NestedMICA as an ab initio protein motif discovery tool.
BMC bioinformatics 2008;9;19
PUBMED: 18194537; PMC: 2267705; DOI: 10.1186/1471-2105-9-19
-
Large-scale discovery of promoter motifs in Drosophila melanogaster.
PLoS computational biology 2007;3;1;e7
PUBMED: 17238282; PMC: 1779301; DOI: 10.1371/journal.pcbi.0030007
-
SISYPHUS--structural alignments for proteins with non-trivial relationships.
Nucleic acids research 2007;35;Database issue;D253-9
PUBMED: 17068077; PMC: 1635320; DOI: 10.1093/nar/gkl746
-
Integrating sequence and structural biology with DAS.
BMC bioinformatics 2007;8;333
PUBMED: 17850653; PMC: 2031907; DOI: 10.1186/1471-2105-8-333
-
New tools and expanded data analysis capabilities at the Protein Structure Prediction Center.
Proteins 2007;69 Suppl 8;19-26
PUBMED: 17705273; PMC: 2656758; DOI: 10.1002/prot.21653
-
DNA methylation profiling of human chromosomes 6, 20 and 22.
Nature genetics 2006;38;12;1378-85
PUBMED: 17072317; PMC: 3082778; DOI: 10.1038/ng1909
-
A machine learning strategy to identify candidate binding sites in human protein-coding sequence.
BMC bioinformatics 2006;7;419
PUBMED: 17002805; PMC: 1592515; DOI: 10.1186/1471-2105-7-419
-
NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence.
Nucleic acids research 2005;33;5;1445-53
PUBMED: 15760844; PMC: 1064142; DOI: 10.1093/nar/gki282
-
What can we learn from noncoding regions of similarity between genomes?
BMC bioinformatics 2004;5;131
PUBMED: 15369604; PMC: 523850; DOI: 10.1186/1471-2105-5-131
-
Domain insertions in protein structures.
Journal of molecular biology 2004;338;4;633-41
PUBMED: 15099733; PMC: 2665287; DOI: 10.1016/j.jmb.2004.03.039
-
A census of human cancer genes.
Nature reviews. Cancer 2004;4;3;177-83
PUBMED: 14993899; PMC: 2665285; DOI: 10.1038/nrc1299
-
ddbRNA: detection of conserved secondary structures in multiple alignments.
Bioinformatics (Oxford, England) 2003;19;13;1606-11
PUBMED: 12967955
-
Computational detection and location of transcription start sites in mammalian genomic DNA.
Genome research 2002;12;3;458-61
PUBMED: 11875034; PMC: 155284; DOI: 10.1101/gr.216102
-
MaxBench: evaluation of sequence and structure comparison methods.
Bioinformatics (Oxford, England) 2002;18;3;494-5
PUBMED: 11934754
-
A browser for expression data.
Bioinformatics (Oxford, England) 2000;16;4;402-3
PUBMED: 10869040
-
SCOP: a Structural Classification of Proteins database.
Nucleic acids research 1999;27;1;254-6
-
Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction.
Proteins 1999;Suppl 3;149-70
PUBMED: 10526364
-
Critical assessment of methods of protein structure prediction (CASP): round III.
Proteins 1999;Suppl 3;2-6
PUBMED: 10526346
-
RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions.
Proteins 1999;Suppl 3;15-21
PUBMED: 10526348
-
Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.
Journal of molecular biology 1998;284;4;1201-10
PUBMED: 9837738; DOI: 10.1006/jmbi.1998.2221
-
SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data.
Acta crystallographica. Section D, Biological crystallography 1998;54;Pt 6 Pt 1;1147-54
PUBMED: 10089491
-
Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.
Proceedings of the National Academy of Sciences of the United States of America 1998;95;11;6073-8
-
Using neural networks for prediction of the subcellular location of proteins.
Nucleic acids research 1998;26;9;2230-6
-
GLASS: a tool to visualize protein structure prediction data in three dimensions and evaluate their consistency.
Proteins 1998;30;4;339-51
PUBMED: 9533618
-
SPEM: a parser for EMBL style flat file database entries.
Bioinformatics (Oxford, England) 1998;14;9;823-4
PUBMED: 9918956
-
Intermediate sequences increase the detection of homology between sequences.
Journal of molecular biology 1997;273;1;349-54
PUBMED: 9367767; DOI: 10.1006/jmbi.1997.1288
-
SCOP: a structural classification of proteins database.
Nucleic acids research 1997;25;1;236-9
Related publications
-
EGASP: the human ENCODE Genome Annotation Assessment Project.
Genome biology 2006;7 Suppl 1;S2.1-31
PUBMED: 16925836; PMC: 1810551; DOI: 10.1186/gb-2006-7-s1-s2
-
The distributed annotation system.
BMC bioinformatics 2001;2;7



