Archive Page - Hubbard Research Group

While Tim was a member of Faculty at the Sanger Institute from 1996 to 2013, the main focus of his research group was the development of algorithms and infrastructure for use in vertebrate genome annotation. The group also interacted closely with the Vertebrate Genome Analysis Project, of which Tim was Principal Investigator.

Sequence Motifs from Tiffin Database.

The process of vertebrate gene annotation is dominated by the use of experimentally determined transcription evidence, processed by integrated approaches combining automatic algorithms and manual curation. However, annotating gene regulatory regions such as promoters or enhancers, and in particular the short motif sequences that function as transcription factor binding sites (TFBS) for DNA binding proteins, is still a hard problem both computationally and experimentally. Within the group, a number of machine learning algorithms have been developed to address both specific annotation problems, such as the prediction of transcription start sites (TSS), and more general problems, such as the large scale detection of short DNA motif sequences.

The Eponine system developed by Thomas Down was used to predict exact start points of transcription (Eponine-TSS) from vertebrate genomic sequence alone (Down and Hubbard, 2002). Eponine has also been applied to a number of other problems, such as identifying candidate DNA motifs within human protein-coding exons (Down et al., 2006). While Eponine can be used to model the general properties of regulatory features within genome sequence such as TSS, modelling the multiple distinct motif sequences within these features requires a different approach. The NestedMICA algorithm (NMICA) was developed by Thomas Down with large scale genome wide motif detection in mind (Down and Hubbard, 2005). It has been used to discover large numbers of candidate TFBS simultaneously from Drosophila (Down et al., 2007). Assessment of these predicted motifs using independent data suggests that the majority represent functional DNA elements. Matias Piipari has been generating even larger motif sets in a range of organisms such as yeast, with the aim of generating complete motif dictionaries. The motifs, and predicted functional annotation for a subset of them, are contained in the Tiffin database.

NMICA has been extended for use with protein sequences by Mutlu Dogruel (Dogruel et al., 2008) and used in subcellular localization prediction. The group has a long standing connection with subcellular localization prediction algorithms, dating back to the work of Astrid Reinhardt using neural networks (Reinhardt and Hubbard, 1998).
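
To make the idea of a short sequence motif concrete, the sketch below scores every window of a DNA sequence against a small position weight matrix and reports the windows that exceed a threshold. It is not the NestedMICA or Eponine algorithm, which learn motifs from data; it only illustrates how an already known motif could be used to flag candidate binding sites. The matrix values, the uniform background, the threshold and the function names are all assumptions made for illustration, and the input is assumed to contain only uppercase A, C, G and T.

```python
import math

# Hypothetical 4-column motif: per-position base probabilities.
# The values are illustrative only and do not come from any real database.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background frequency for each base


def log_odds(window, motif=MOTIF, background=BACKGROUND):
    """Sum of per-position log2(p_motif / p_background) for one window."""
    return sum(math.log2(col[base] / background) for col, base in zip(motif, window))


def scan(sequence, motif=MOTIF, threshold=4.0):
    """Return (offset, window, score) for each window scoring at or above threshold."""
    width = len(motif)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = log_odds(window, motif)
        if score >= threshold:
            hits.append((i, window, round(score, 2)))
    return hits


# Scan a short made-up sequence for windows resembling the motif.
hits = scan("TTAGCTGGAGCTAA")
```

With these illustrative numbers only exact matches to the consensus clear the threshold; lowering the threshold admits windows with a mismatch, which is the usual trade-off between sensitivity and false positives when scanning with a motif model.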

SPICE DAS Client.

In developing these ab initio algorithms, an overall philosophy has been to build predictive models of biological processes that do not rely on information inaccessible to the biological machinery, i.e. not to "cheat". For example, many algorithms make predictions by combining ab initio models with external data, such as that resulting from evolutionary comparison. However, RNA polymerase arguably knows nothing, when it transcribes a section of genome sequence, about the degree of conservation of that sequence in other species, nor about its protein coding potential. While ignoring such additional information makes prediction harder, it results in algorithms better able to distinguish between functional and mutated sequences. Analysis of the ever increasing volumes of experimental data frequently overtakes purely computational approaches, as it has done in vertebrate protein coding gene annotation. However, if we are to model the consequences of rare or unique mutations in the genome sequences of individuals, we will also need such purely computational algorithms.

Beyond ab initio motif prediction algorithms, there are a number of genome annotation related problems where group members have applied machine learning and other genome analysis approaches. In the field of cancer, Thomas Down contributed to the analysis of the Human Cancer Gene Census generated by the institute's Cancer Genome Project (CGP) of Mick Stratton and Andy Futreal (Futreal et al., 2004). In collaboration with the Experimental Cancer Genetics team of David Adams and CGP, Jenny Mattison analysed data from human and mouse cancers to look for evidence of new cancer genes (Uren et al., 2008). In the field of genome annotation, while the transcriptome of organisms is increasingly accessible, particularly through next generation sequencing, it is hard to identify whether an open reading frame is really translated into protein, or is untranslated, for example due to nonsense mediated decay. Mass spectrometry (MS) is one approach, and Markus Brosch has developed improved data processing algorithms (Brosch et al., 2008) to provide a reliable set of genome wide vertebrate peptide fragments, in collaboration with the WTSI MS team of Jyoti Choudhary. In the interpretation of DNA methylation data, Thomas Down developed a Bayesian tool (Batman) (Down et al., 2008) which has been applied to genome wide MeDIP datasets (Rakyan et al., 2008).

Components of widely used open source software (OSS) bioinformatics infrastructure have also been developed and contributed to by members of the group. The popular BioJava Java framework for processing biological data was started in the group in 1997 by Matt Pocock and Thomas Down while studying for their PhDs (Pocock et al., 2000, Holland et al., 2008). Much of the machine learning algorithm development work described above uses BioJava, which is the main reason that Java is the dominant programming language of the group.

The group has also made substantial contributions to the development of the Distributed Annotation System (DAS). Tim has always been a strong advocate of the DAS concept (Dowell et al., 2001), seeing it as a way to create a level playing field between researchers by allowing them to share annotation without each having to host a genome browser. In 1999 Thomas Down developed Dazzle, the first Java implementation of a DAS server, just in time for the client support of DAS in Ensembl. DAS incorporates the same concept of separating annotation and data that Tim implemented to allow protein annotation from the SCOP database to be layered onto protein 3D structures (Hubbard et al., 1997). Andreas Prlic in the group provided DAS components for two large consortia to enable integration and sharing of annotation. He extended the DAS specification to support protein structures (Prlic et al., 2007) and developed the SPICE DAS client (Prlic et al., 2005) to support the efamily project to integrate protein family annotation from SCOP, CATH, Pfam, InterPro and MSD. He went on to extend SPICE to view protein structural alignments from the SISYPHUS database (Andreeva et al., 2007) and between predictions from the CASP structure prediction competition (Kryshtafovych et al., 2007), all using DAS. For the biosapiens consortium, which adopted DAS as its main data integration framework, he developed the DAS registry (Prlic et al., 2007).

As the number of registered DAS sources now exceeds 400, it has been necessary to develop an ontology to allow users and clients to organise the many annotations (Reeves et al., 2008). Andreas started a series of annual DAS workshops in 2007. The 3rd DAS workshop is being organised by Jonathan Warren in 2009.
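
The separation of annotation from reference data that underlies DAS can be sketched in a few lines: independent sources each describe features on the same reference coordinates, and a client layers them into one view. The example below is a minimal illustration, not Dazzle or SPICE code. The server address, source names and feature values are hypothetical, the request URL follows the general shape of a DAS 1.x `features` command as an assumption, and the server responses are assumed to have already been parsed into dictionaries.

```python
from urllib.parse import quote


def features_url(server, source, segment, start, stop):
    """Build a DAS 1.x style 'features' request for one reference segment (assumed URL shape)."""
    return f"{server}/das/{quote(source)}/features?segment={quote(str(segment))}:{start},{stop}"


def layer(*sources):
    """Merge features from independent sources, sorted by position, keeping their provenance."""
    merged = [
        {**feature, "source": name}
        for name, features in sources
        for feature in features
    ]
    return sorted(merged, key=lambda f: (f["start"], f["end"]))


# Hypothetical annotation from two independent sources on the same reference segment.
genes = ("genes", [{"start": 100, "end": 900, "label": "gene_1"}])
motifs = ("motifs", [
    {"start": 40, "end": 52, "label": "motif_7"},
    {"start": 880, "end": 890, "label": "motif_9"},
])

# das.example.org is a placeholder host, not a real DAS server.
url = features_url("http://das.example.org", "motifs", "1", 1, 1000)
track = layer(genes, motifs)
```

Because each feature carries only coordinates on a shared reference plus the name of its source, neither provider needs to host the sequence or a browser, which is the level playing field described above.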

Prior to joining WTSI, Tim's research group worked mainly on proteins. Tim remains connected with SCOP (the Structural Classification of Proteins database) and its use in the calibration of sequence alignment methods (Brenner et al., 1998), and has been involved in related developments such as SISYPHUS (Andreeva et al., 2007). During his period as organiser of CASP (Critical Assessment of Structure Prediction) (CASP2-CASP7; 1996-2007) he developed new methods of assessment (Hubbard, 1999) which are still in use. Tim's involvement in evaluation exercises continues through EGASP (the ENCODE Genome Annotation Assessment Project) (Guigo et al., 2006).

Publications

  • The Protein Feature Ontology: a tool for the unification of protein feature annotations.

    Reeves GA, Eilbeck K, Magrane M, O'Donovan C, Montecchi-Palazzi L, Harris MA, Orchard S, Jimenez RC, Prlić A, Hubbard TJ, Hermjakob H and Thornton JM

    Bioinformatics (Oxford, England) 2008;24;23;2767-72

  • BioJava: an open-source framework for bioinformatics.

    Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M and Schreiber MJ

    Bioinformatics (Oxford, England) 2008;24;18;2096-7

  • An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs).

    Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Gräf S, Tomazou EM, Bäckdahl L, Johnson N, Herberth M, Howe KL, Jackson DK, Miretti MM, Fiegler H, Marioni JC, Birney E, Hubbard TJ, Carter NP, Tavaré S and Beck S

    Genome research 2008;18;9;1518-29

  • A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis.

    Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Gräf S, Johnson N, Herrero J, Tomazou EM, Thorne NP, Bäckdahl L, Herberth M, Howe KL, Jackson DK, Miretti MM, Marioni JC, Birney E, Hubbard TJ, Durbin R, Tavaré S and Beck S

    Nature biotechnology 2008;26;7;779-85

  • Large-scale mutagenesis in p19(ARF)- and p53-deficient mice identifies cancer genes and their collaborative networks.

    Uren AG, Kool J, Matentzoglu K, de Ridder J, Mattison J, van Uitert M, Lagcher W, Sie D, Tanger E, Cox T, Reinders M, Hubbard TJ, Rogers J, Jonkers J, Wessels L, Adams DJ, van Lohuizen M and Berns A

    Cell 2008;133;4;727-41

  • Comparison of Mascot and X!Tandem performance for low and high accuracy mass spectrometry and the development of an adjusted Mascot threshold.

    Brosch M, Swamy S, Hubbard T and Choudhary J

    Molecular & cellular proteomics : MCP 2008;7;5;962-70

  • Integrating biological data--the Distributed Annotation System.

    Jenkinson AM, Albrecht M, Birney E, Blankenburg H, Down T, Finn RD, Hermjakob H, Hubbard TJ, Jimenez RC, Jones P, Kähäri A, Kulesha E, Macías JR, Reeves GA and Prlić A

    BMC bioinformatics 2008;9 Suppl 8;S3

  • NestedMICA as an ab initio protein motif discovery tool.

    Doğruel M, Down TA and Hubbard TJ

    BMC bioinformatics 2008;9;19

  • Large-scale discovery of promoter motifs in Drosophila melanogaster.

    Down TA, Bergman CM, Su J and Hubbard TJ

    PLoS computational biology 2007;3;1;e7

  • SISYPHUS--structural alignments for proteins with non-trivial relationships.

    Andreeva A, Prlić A, Hubbard TJ and Murzin AG

    Nucleic acids research 2007;35;Database issue;D253-9

  • Integrating sequence and structural biology with DAS.

    Prlić A, Down TA, Kulesha E, Finn RD, Kähäri A and Hubbard TJ

    BMC bioinformatics 2007;8;333

  • New tools and expanded data analysis capabilities at the Protein Structure Prediction Center.

    Kryshtafovych A, Prlić A, Dmytriv Z, Daniluk P, Milostan M, Eyrich V, Hubbard T and Fidelis K

    Proteins 2007;69 Suppl 8;19-26

  • DNA methylation profiling of human chromosomes 6, 20 and 22.

    Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, Haefliger C, Horton R, Howe K, Jackson DK, Kunde J, Koenig C, Liddle J, Niblett D, Otto T, Pettett R, Seemann S, Thompson C, West T, Rogers J, Olek A, Berlin K and Beck S

    Nature genetics 2006;38;12;1378-85

  • A machine learning strategy to identify candidate binding sites in human protein-coding sequence.

    Down T, Leong B and Hubbard TJ

    BMC bioinformatics 2006;7;419

  • NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence.

    Down TA and Hubbard TJ

    Nucleic acids research 2005;33;5;1445-53

  • What can we learn from noncoding regions of similarity between genomes?

    Down TA and Hubbard TJ

    BMC bioinformatics 2004;5;131

  • Domain insertions in protein structures.

    Aroul-Selvam R, Hubbard T and Sasidharan R

    Journal of molecular biology 2004;338;4;633-41

  • A census of human cancer genes.

    Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N and Stratton MR

    Nature reviews. Cancer 2004;4;3;177-83

  • ddbRNA: detection of conserved secondary structures in multiple alignments.

    di Bernardo D, Down T and Hubbard T

    Bioinformatics (Oxford, England) 2003;19;13;1606-11

  • Computational detection and location of transcription start sites in mammalian genomic DNA.

    Down TA and Hubbard TJ

    Genome research 2002;12;3;458-61

  • MaxBench: evaluation of sequence and structure comparison methods.

    Leplae R and Hubbard TJ

    Bioinformatics (Oxford, England) 2002;18;3;494-5

  • A browser for expression data.

    Pocock MR and Hubbard TJ

    Bioinformatics (Oxford, England) 2000;16;4;402-3

  • SCOP: a Structural Classification of Proteins database.

    Hubbard TJ, Ailey B, Brenner SE, Murzin AG and Chothia C

    Nucleic acids research 1999;27;1;254-6

  • Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction.

    Orengo CA, Bray JE, Hubbard T, LoConte L and Sillitoe I

    Proteins 1999;Suppl 3;149-70

  • Critical assessment of methods of protein structure prediction (CASP): round III.

    Moult J, Hubbard T, Fidelis K and Pedersen JT

    Proteins 1999;Suppl 3;2-6

  • RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions.

    Hubbard TJ

    Proteins 1999;Suppl 3;15-21

  • Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

    Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T and Chothia C

    Journal of molecular biology 1998;284;4;1201-10

  • SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data.

    Hubbard TJ, Ailey B, Brenner SE, Murzin AG and Chothia C

    Acta crystallographica. Section D, Biological crystallography 1998;54;Pt 6 Pt 1;1147-54

  • Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.

    Brenner SE, Chothia C and Hubbard TJ

    Proceedings of the National Academy of Sciences of the United States of America 1998;95;11;6073-8

  • Using neural networks for prediction of the subcellular location of proteins.

    Reinhardt A and Hubbard T

    Nucleic acids research 1998;26;9;2230-6

  • GLASS: a tool to visualize protein structure prediction data in three dimensions and evaluate their consistency.

    Leplae R, Hubbard T and Tramontano A

    Proteins 1998;30;4;339-51

  • SPEM: a parser for EMBL style flat file database entries.

    Pocock MR, Hubbard T and Birney E

    Bioinformatics (Oxford, England) 1998;14;9;823-4

  • Intermediate sequences increase the detection of homology between sequences.

    Park J, Teichmann SA, Hubbard T and Chothia C

    Journal of molecular biology 1997;273;1;349-54

  • SCOP: a structural classification of proteins database.

    Hubbard TJ, Murzin AG, Brenner SE and Chothia C

    Nucleic acids research 1997;25;1;236-9

Related publications

  • EGASP: the human ENCODE Genome Annotation Assessment Project.

    Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE and Reese MG

    Genome biology 2006;7 Suppl 1;S2.1-31

  • The distributed annotation system.

    Dowell RD, Jokerst RM, Day A, Eddy SR and Stein L

    BMC bioinformatics 2001;2;7
