Honorary Faculty - Dr Tim Hubbard

Tim is Professor of Bioinformatics,‭ ‬Head of Department of Medical and Molecular Genetics at King's College London and overall Director of Bioinformatics for King's Health Partners/King's College London.‭ ‬Tim also has a part-time appointment as‭ ‬Head of Bioinformatics at Genomics England,‭ ‬the company set up by the Government to help deliver the‭ ‬100k Genome Project.

From‭ ‬1996‭ ‬to‭ ‬2013,‭ ‬Professor Hubbard worked at the Sanger Institute,‭ ‬first on the Human Genome Project and then leading the Vertebrate Genome Analysis Project,‭ ‬which generates and presents core vertebrate genome annotation and maintains the reference genome sequences.

Tim graduated with a BA in Biochemistry from University of Cambridge in‭ ‬1985‭ ‬and a PhD in Protein Design from the Department of Crystallography,‭ ‬Birkbeck College,‭ ‬London in‭ ‬1988.

Following a postdoctoral fellowship at the Protein Engineering Research Institute in Osaka under the EU scientific training program in Japan‭ (‬1989-90‭) ‬he returned to Cambridge becoming a Zeneca Fellow at the Medical Research Council‭ (‬MRC‭)‬,‭ ‬Centre for Protein Engineering.‭ ‬In‭ ‬1997‭ ‬he joined the Wellcome Trust Sanger Institute to become Head of Human Genome Analysis.‭ ‬Tim was Head of Informatics from‭ ‬2007‭ ‬to‭ ‬2013,‭ ‬when he was seconded part time as specialist advisor to NHS England,‭ ‬involved in delivering genomics for health programmes.

Before joining the Sanger Institute Tim worked mainly on protein folding,‭ ‬classification and design.‭ ‬In‭ ‬1994‭ ‬he co-founded SCOP‭ (‬the Structural Classification of Proteins database‭)‬,‭ ‬thought to be the first biological database designed from the start to take advantage of the then new World Wide Web.‭ ‬He developed algorithms to make protein structure predictions and assess their accuracy and to calibrate the reliability of sequence alignment methods.‭ ‬He was one of the most successful participants in the first CASP‭ (‬Critical Assessment of Structure Prediction‭) ‬competition in‭ ‬1994‭ ‬and a co-organiser of subsequent CASP competitions‭ (‬CASP2-CASP7‭) ‬until‭ ‬2007.

At the Sanger Institute Tim was a member of the strategy group that organised the sequencing of the human genome as part of the international public consortium.‭ ‬During this time he built up groups to analyse and annotate the sequences of vertebrate genomes.‭ ‬In‭ ‬1999‭ ‬he developed an automatic annotation system and starting applying it in real time to sequence output of the human genome project.‭ ‬This evolved into the Ensembl project with Michele Clamp and Ewan Birney from the European Bioinformatics Institute‭ (‬EBI‭)‬,‭ ‬which now provides annotation to more than‭ ‬40‭ ‬genomes.‭ ‬In parallel,‭ ‬the HAVANA group have carried out large scale manual annotation of human,‭ ‬mouse and zebrafish genomes.‭ ‬Since‭ ‬2007‭ ‬Tim has been the principal investigator of GENCODE,‭ ‬a scale up programme of the ENCODE project,‭ ‬which brings together HAVANA,‭ ‬Ensembl and seven external groups to generate the reference geneset for the human genome.‭ ‬He‭ ‬was also the Sanger Institute principal investigator of the Genome Reference Consortium,‭ ‬which is responsible for reference genome sequences of human,‭ ‬mouse and zebrafish.

Tim also ran a research group that used machine learning approaches to develop ab initio algorithms for vertebrate genome annotation such as large scale motif discovery.‭ ‬The group also developed components of bioinformatics infrastructure,‭ ‬including the popular biojava open source Java framework for processing biological data was‭ (‬started in the group in‭ ‬1997‭) ‬and components of the Distributed Annotation System‭ (‬DAS‭) ‬including the SPICE DAS client,‭ ‬the Dazzle DAS Server and the DAS registry.

Tim has served on many national and international advisory boards including the advisory council of the RIKEN Genome Science Centre,‭ ‬Japan as chairman‭ (‬2005-2007‭)‬,‭ ‬the advisory board of ukPMC‭ (‬UK PubMedCentral‭) ‬as deputy chair‭ (‬2007-‭)‬,‭ ‬the E-Health Records Research Board of the UK Government Office for Strategic Coordination of Health Research‭ (‬OSCHR‭) (‬2007-2009‭) ‬which was the external reference group for NHS Research Capability Programme ‭(‬RCP‭) ‬and the Bioinformatics and Education,‭ ‬Engagement‭ & ‬Training Working Groups of the UK Department of Health Human Genomics Strategy Group ‭(‬HGSG‭)‬ ‭(‬2010-‭) ‬setup in response to the 2009‭ ‬House of Lords Science and Technology Committee Report on Genomic Medicine.‭ ‬In his private capacity he is also significantly involved in international policy discussions around innovation,‭ ‬intellectual property and public health.

Publications

Selected Publications

  • Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

    ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SC, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, NISC Comparative Sequencing Program, Baylor College of Medicine Human Genome Sequencing Center, Washington University Genome Sequencing Center, Broad Institute, Children's Hospital Oakland Research Institute, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CW, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JN, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PI, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrímsdóttir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VV, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B and de Jong PJ

    We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

    Funded by: NCI NIH HHS: F32 CA108313; NHGRI NIH HHS: K22 HG003169, K22 HG003169-01A1, P41 HG002371, P41 HG002371-03S1, R01 HG002238, R01 HG002238-15, R01 HG003110, R01 HG003110-03, R01 HG003129-03, R01 HG003143, R01 HG003143-04, R01 HG003521, R01 HG003521-01, R01 HG003532, R01 HG003532-01, R01 HG003541, R01 HG003541-03, U01 HG002523, U01 HG002523-01, U01 HG003147, U01 HG003147-02, U01 HG003150-03, U01 HG003151, U01 HG003151-03, U01 HG003156, U01 HG003156-03, U01 HG003157, U01 HG003157-03, U01 HG003161, U01 HG003161-03, U01 HG003162, U01 HG003162-03, U01 HG003168-02, U54 HG003067, U54 HG003067-01, U54 HG003079, U54 HG003079-01, U54 HG003273, U54 HG003273-01; Wellcome Trust: 062023, 077198

    Nature 2007;447;7146;799-816

  • Large-scale discovery of promoter motifs in Drosophila melanogaster.

    Down TA, Bergman CM, Su J and Hubbard TJ

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom. td2@sanger.ac.uk

    A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.

    Funded by: Wellcome Trust: 077198

    PLoS computational biology 2007;3;1;e7

  • Integrating sequence and structural biology with DAS.

    Prlić A, Down TA, Kulesha E, Finn RD, Kähäri A and Hubbard TJ

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK. ap3@sanger.ac.uk

    Background: The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequence.

    Results: Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments, three dimensional molecular structure data, add the possibility for registration and discovery of DAS servers, and provide a convention how to provide different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources.

    Conclusion: Our DAS extensions are essential for the management of the growing number of services and exchange of diverse biological data sets. In addition the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at http://www.dasregistry.org.

    Funded by: Wellcome Trust: 062023, 077198

    BMC bioinformatics 2007;8;333

  • GENCODE: producing a reference annotation for ENCODE.

    Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, Antonarakis SE and Guigo R

    Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, UK. jla1@sanger.ac.uk

    Background: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.

    Results: The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.

    Conclusion: In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of GENCODE annotation with RefSeq and ENSEMBL show only 40% of GENCODE exons are contained within the two sets, which is a reflection of the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP and the GENCODE collaboration is continuing to refine its annotation of 1% human genome with the aid of experimental validation.

    Genome biology 2006;7 Suppl 1;S4.1-9

  • Initial sequencing and comparative analysis of the mouse genome.

    Mouse Genome Sequencing Consortium, Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigó R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC and Lander ES

    Genome Sequencing Center, Washington University School of Medicine, Campus Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA. waterston@gs.washington.edu

    The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.

    Nature 2002;420;6915;520-62

  • The Ensembl genome database project.

    Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I and Clamp M

    The Wellcome Trust Sanger Institute and European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of the human genome sequence, with confirmed gene predictions that have been integrated with external data sources, and is available as either an interactive web site or as flat files. It is also an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements from sequence analysis to data storage and visualisation. The Ensembl site is one of the leading sources of human genome sequence annotation and provided much of the analysis for publication by the international human genome project of the draft genome. The Ensembl system is being installed around the world in both companies and academic sites on machines ranging from supercomputers to laptops.

    Nucleic acids research 2002;30;1;38-41

  • Initial sequencing and analysis of the human genome.

    Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J and International Human Genome Sequencing Consortium

    Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, Massachusetts 02142, USA. lander@genome.wi.mit.edu

    The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

    Nature 2001;409;6822;860-921

  • RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions.

    Hubbard TJ

    Sanger Centre, Hinxton, Cambridgeshire, United Kingdom. th@sanger.ac.uk

    Evaluating a set of protein structure predictions is difficult as each prediction may omit different residues and different parts of the structure may have different accuracies. A method is described that captures the best results from a large number of alternative sequence-dependent structural superpositions between a prediction and the experimental structure and represents them as a single line on a graph. Applied to CASP2 and CASP3 data the best predictions stand out visually in most cases, as judged by manual inspection. The results from this method applied to CASP data are available from the URLs http:/(/)PredictionCenter. llnl.gov/casp3/results/th/ and http:/(/)www.sanger.ac.uk/ approximately th/casp/.

    Funded by: Wellcome Trust

    Proteins 1999;Suppl 3;15-21

  • Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.

    Brenner SE, Chothia C and Hubbard TJ

    MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom. brenner@hyper.stanford.edu

    Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.

    Funded by: Wellcome Trust

    Proceedings of the National Academy of Sciences of the United States of America 1998;95;11;6073-8

  • SCOP: a structural classification of proteins database for the investigation of sequences and structures.

    Murzin AG, Brenner SE, Hubbard T and Chothia C

    MRC Laboratory of Molecular Biology and Centre for Protein Engineering, Cambridge, England.

    To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http: parallel scop.mrc-lmb.cam.ac.uk magnitude of scop.

    Journal of molecular biology 1995;247;4;536-40

2011 Publications

  • The origins, evolution, and functional potential of alternative splicing in vertebrates.

    Mudge JM, Frankish A, Fernandez-Banet J, Alioto T, Derrien T, Howald C, Reymond A, Guigó R, Hubbard T and Harrow J

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. jm12@sanger.ac.uk

    Alternative splicing (AS) has the potential to greatly expand the functional repertoire of mammalian transcriptomes. However, few variant transcripts have been characterized functionally, making it difficult to assess the contribution of AS to the generation of phenotypic complexity and to study the evolution of splicing patterns. We have compared the AS of 309 protein-coding genes in the human ENCODE pilot regions against their mouse orthologs in unprecedented detail, utilizing traditional transcriptomic and RNAseq data. The conservation status of every transcript has been investigated, and each functionally categorized as coding (separated into coding sequence [CDS] or nonsense-mediated decay [NMD] linked) or noncoding. In total, 36.7% of human and 19.3% of mouse coding transcripts are species specific, and we observe a 3.6 times excess of human NMD transcripts compared with mouse; in contrast to previous studies, the majority of species-specific AS is unlinked to transposable elements. We observe one conserved CDS variant and one conserved NMD variant per 2.3 and 11.4 genes, respectively. Subsequently, we identify and characterize equivalent AS patterns for 22.9% of these CDS or NMD-linked events in nonmammalian vertebrate genomes, and our data indicate that functional NMD-linked AS is more widespread and ancient than previously thought. Furthermore, although we observe an association between conserved AS and elevated sequence conservation, as previously reported, we emphasize that 30% of conserved AS exons display sequence conservation below the average score for constitutive exons. In conclusion, we demonstrate the value of detailed comparative annotation in generating a comprehensive set of AS transcripts, increasing our understanding of AS evolution in vertebrates. Our data supports a model whereby the acquisition of functional AS has occurred throughout vertebrate evolution and is considered alongside amino acid change as a key mechanism in gene evolution.

    Funded by: NHGRI NIH HHS: 5U54HG004555; Wellcome Trust: 077198, WT077198/Z/05/Z

    Molecular biology and evolution 2011;28;10;2949-59

  • The GENCODE exome: sequencing the complete human exome.

    Coffey AJ, Kokocinski F, Calafato MS, Scott CE, Palta P, Drury E, Joyce CJ, Leproust EM, Harrow J, Hunt S, Lehesjoki AE, Turner DJ, Hubbard TJ and Palotie A

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK.

    Sequencing the coding regions, the exome, of the human genome is one of the major current strategies to identify low frequency and rare variants associated with human disease traits. So far, the most widely used commercial exome capture reagents have mainly targeted the consensus coding sequence (CCDS) database. We report the design of an extended set of targets for capturing the complete human exome, based on annotation from the GENCODE consortium. The extended set covers an additional 5594 genes and 10.3 Mb compared with the current CCDS-based sets. The additional regions include potential disease genes previously inaccessible to exome resequencing studies, such as 43 genes linked to ion channel activity and 70 genes linked to protein kinase activity. In total, the new GENCODE exome set developed here covers 47.9 Mb and performed well in sequence capture experiments. In the sample set used in this study, we identified over 5000 SNP variants more in the GENCODE exome target (24%) than in the CCDS-based exome sequencing.

    Funded by: NHGRI NIH HHS: 5U54HG004555; Wellcome Trust: 077198, WT062023, WT077198, WT089062

    European journal of human genetics : EJHG 2011;19;7;827-31

  • Modernizing reference genome assemblies.

    Church DM, Schneider VA, Graves T, Auger K, Cunningham F, Bouk N, Chen HC, Agarwala R, McLaren WM, Ritchie GR, Albracht D, Kremitzki M, Rock S, Kotkiewicz H, Kremitzki C, Wollam A, Trani L, Fulton L, Fulton R, Matthews L, Whitehead S, Chow W, Torrance J, Dunn M, Harden G, Threadgold G, Wood J, Collins J, Heath P, Griffiths G, Pelan S, Grafham D, Eichler EE, Weinstock G, Mardis ER, Wilson RK, Howe K, Flicek P and Hubbard T

    National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America. church@ncbi.nlm.nih.gov

    Funded by: Wellcome Trust: 077198, 095908

    PLoS biology 2011;9;7;e1001091

  • Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome.

    Brosch M, Saunders GI, Frankish A, Collins MO, Yu L, Wright J, Verstraten R, Adams DJ, Harrow J, Choudhary JS and Hubbard T

    The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Recent advances in proteomic mass spectrometry (MS) offer the chance to marry high-throughput peptide sequencing to transcript models, allowing the validation, refinement, and identification of new protein-coding loci. We present a novel pipeline that integrates highly sensitive and statistically robust peptide spectrum matching with genome-wide protein-coding predictions to perform large-scale gene validation and discovery in the mouse genome for the first time. In searching an excess of 10 million spectra, we have been able to validate 32%, 17%, and 7% of all protein-coding genes, exons, and splice boundaries, respectively. Moreover, we present strong evidence for the identification of multiple alternatively spliced translations from 53 genes and have uncovered 10 entirely novel protein-coding genes, which are not covered in any mouse annotation data sources. One such novel protein-coding gene is a fusion protein that spans the Ins2 and Igf2 loci to produce a transcript encoding the insulin II and the insulin-like growth factor 2-derived peptides. We also report nine processed pseudogenes that have unique peptide hits, demonstrating, for the first time, that they are not just transcribed but are translated and are therefore resurrected into new coding loci. This work not only highlights an important utility for MS data in genome annotation but also provides unique insights into the gene structure and propagation in the mouse genome. All these data have been subsequently used to improve the publicly available mouse annotation available in both the Vega and Ensembl genome browsers (http://vega.sanger.ac.uk).

    Funded by: Cancer Research UK; Wellcome Trust: 077198

    Genome research 2011;21;5;756-67

  • A user's guide to the encyclopedia of DNA elements (ENCODE).

    ENCODE Project Consortium

    HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America. rmyers@hudsonalpha.org

    The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

    Funded by: NHGRI NIH HHS: R01 HG003143, R01 HG004037, RC2 HG005573; NIDDK NIH HHS: R01 DK054369, R01 DK065806; Wellcome Trust: 095908

    PLoS biology 2011;9;4;e1001046

  • Dalliance: interactive genome viewing on the web.

    Down TA, Piipari M and Hubbard TJ

    Wellcome Trust/CRUK Gurdon Institute, Cambridge CB2 1QN, UK. thomas@biodalliance.org

    Summary: Dalliance is a new genome viewer which offers a high level of interactivity while running within a web browser. All data is fetched using the established distributed annotation system (DAS) protocol, making it easy to customize the browser and add extra data.

    Dalliance runs entirely within your web browser, and relies on existing DAS server infrastructure. Browsers for several mammalian genomes are available at http://www.biodalliance.org/, and the use of DAS means you can add your own data to these browsers. In addition, the source code (Javascript) is available under the BSD license, and is straightforward to install on your own web server and embed within other documents.

    Funded by: Wellcome Trust: 077198, 083563

    Bioinformatics (Oxford, England) 2011;27;6;889-90

  • Ensembl 2011.

    Flicek P, Amode MR, Barrell D, Beal K, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Gordon L, Hendrix M, Hourlier T, Johnson N, Kähäri A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Larsson P, Longden I, McLaren W, Overduin B, Pritchard B, Riat HS, Rios D, Ritchie GR, Ruffier M, Schuster M, Sobral D, Spudich G, Tang YA, Trevanion S, Vandrovcova J, Vilella AJ, White S, Wilder SP, Zadissa A, Zamora J, Aken BL, Birney E, Cunningham F, Dunham I, Durbin R, Fernández-Suarez XM, Herrero J, Hubbard TJ, Parker A, Proctor G, Vogel J and Searle SM

    European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. flicek@ebi.ac.uk

    The Ensembl project (http://www.ensembl.org) seeks to enable genomic science by providing high quality, integrated annotation on chordate and selected eukaryotic genomes within a consistent and accessible infrastructure. All supported species include comprehensive, evidence-based gene annotations and a selected set of genomes includes additional data focused on variation, comparative, evolutionary, functional and regulatory annotation. The most advanced resources are provided for key species including human, mouse, rat and zebrafish reflecting the popularity and importance of these species in biomedical research. As of Ensembl release 59 (August 2010), 56 species are supported of which 5 have been added in the past year. Since our previous report, we have substantially improved the presentation and integration of both data of disease relevance and the regulatory state of different cell types.

    Funded by: Biotechnology and Biological Sciences Research Council; Wellcome Trust: 062023, 077198

    Nucleic acids research 2011;39;Database issue;D800-6

  • Developing and implementing an institute-wide data sharing policy.

    Dyke SO and Hubbard TJ

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. th@sanger.ac.uk.

    The Wellcome Trust Sanger Institute has a strong reputation for prepublication data sharing as a result of its policy of rapid release of genome sequence data and particularly through its contribution to the Human Genome Project. The practicalities of broad data sharing remain largely uncharted, especially to cover the wide range of data types currently produced by genomic studies and to adequately address ethical issues. This paper describes the processes and challenges involved in implementing a data sharing policy on an institute-wide scale. This includes questions of governance, practical aspects of applying principles to diverse experimental contexts, building enabling systems and infrastructure, incentives and collaborative issues.

    Genome medicine 2011;3;9;60

2010 Publications

  • International network of cancer genome projects.

    International Cancer Genome Consortium, Hudson TJ, Anderson W, Artez A, Barker AD, Bell C, Bernabé RR, Bhan MK, Calvo F, Eerola I, Gerhard DS, Guttmacher A, Guyer M, Hemsley FM, Jennings JL, Kerr D, Klatt P, Kolar P, Kusada J, Lane DP, Laplace F, Youyong L, Nettekoven G, Ozenberger B, Peterson J, Rao TS, Remacle J, Schafer AJ, Shibata T, Stratton MR, Vockley JG, Watanabe K, Yang H, Yuen MM, Knoppers BM, Bobrow M, Cambon-Thomsen A, Dressler LG, Dyke SO, Joly Y, Kato K, Kennedy KL, Nicolás P, Parker MJ, Rial-Sebbag E, Romeo-Casabona CM, Shaw KM, Wallace S, Wiesner GL, Zeps N, Lichter P, Biankin AV, Chabannon C, Chin L, Clément B, de Alava E, Degos F, Ferguson ML, Geary P, Hayes DN, Hudson TJ, Johns AL, Kasprzyk A, Nakagawa H, Penny R, Piris MA, Sarin R, Scarpa A, Shibata T, van de Vijver M, Futreal PA, Aburatani H, Bayés M, Botwell DD, Campbell PJ, Estivill X, Gerhard DS, Grimmond SM, Gut I, Hirst M, López-Otín C, Majumder P, Marra M, McPherson JD, Nakagawa H, Ning Z, Puente XS, Ruan Y, Shibata T, Stratton MR, Stunnenberg HG, Swerdlow H, Velculescu VE, Wilson RK, Xue HH, Yang L, Spellman PT, Bader GD, Boutros PC, Campbell PJ, Flicek P, Getz G, Guigó R, Guo G, Haussler D, Heath S, Hubbard TJ, Jiang T, Jones SM, Li Q, López-Bigas N, Luo R, Muthuswamy L, Ouellette BF, Pearson JV, Puente XS, Quesada V, Raphael BJ, Sander C, Shibata T, Speed TP, Stein LD, Stuart JM, Teague JW, Totoki Y, Tsunoda T, Valencia A, Wheeler DA, Wu H, Zhao S, Zhou G, Stein LD, Guigó R, Hubbard TJ, Joly Y, Jones SM, Kasprzyk A, Lathrop M, López-Bigas N, Ouellette BF, Spellman PT, Teague JW, Thomas G, Valencia A, Yoshida T, Kennedy KL, Axton M, Dyke SO, Futreal PA, Gerhard DS, Gunter C, Guyer M, Hudson TJ, McPherson JD, Miller LJ, Ozenberger B, Shaw KM, Kasprzyk A, Stein LD, Zhang J, Haider SA, Wang J, Yung CK, Cros A, Cross A, Liang Y, Gnaneshan S, Guberman J, Hsu J, Bobrow M, Chalmers DR, Hasel KW, Joly Y, Kaan TS, Kennedy KL, Knoppers BM, Lowrance WW, Masui T, Nicolás P, Rial-Sebbag E, Rodriguez LL, Vergely C, Yoshida T, Grimmond SM, Biankin AV, Bowtell DD, Cloonan N, deFazio A, Eshleman JR, Etemadmoghadam D, Gardiner BB, Gardiner BA, Kench JG, Scarpa A, Sutherland RL, Tempero MA, Waddell NJ, Wilson PJ, McPherson JD, Gallinger S, Tsao MS, Shaw PA, Petersen GM, Mukhopadhyay D, Chin L, DePinho RA, Thayer S, Muthuswamy L, Shazand K, Beck T, Sam M, Timms L, Ballin V, Lu Y, Ji J, Zhang X, Chen F, Hu X, Zhou G, Yang Q, Tian G, Zhang L, Xing X, Li X, Zhu Z, Yu Y, Yu J, Yang H, Lathrop M, Tost J, Brennan P, Holcatova I, Zaridze D, Brazma A, Egevard L, Prokhortchouk E, Banks RE, Uhlén M, Cambon-Thomsen A, Viksna J, Ponten F, Skryabin K, Stratton MR, Futreal PA, Birney E, Borg A, Børresen-Dale AL, Caldas C, Foekens JA, Martin S, Reis-Filho JS, Richardson AL, Sotiriou C, Stunnenberg HG, Thoms G, van de Vijver M, van't Veer L, Calvo F, Birnbaum D, Blanche H, Boucher P, Boyault S, Chabannon C, Gut I, Masson-Jacquemier JD, Lathrop M, Pauporté I, Pivot X, Vincent-Salomon A, Tabone E, Theillet C, Thomas G, Tost J, Treilleux I, Calvo F, Bioulac-Sage P, Clément B, Decaens T, Degos F, Franco D, Gut I, Gut M, Heath S, Lathrop M, Samuel D, Thomas G, Zucman-Rossi J, Lichter P, Eils R, Brors B, Korbel JO, Korshunov A, Landgraf P, Lehrach H, Pfister S, Radlwimmer B, Reifenberger G, Taylor MD, von Kalle C, Majumder PP, Sarin R, Rao TS, Bhan MK, Scarpa A, Pederzoli P, Lawlor RA, Delledonne M, Bardelli A, Biankin AV, Grimmond SM, Gress T, Klimstra D, Zamboni G, Shibata T, Nakamura Y, Nakagawa H, Kusada J, Tsunoda T, Miyano S, Aburatani H, Kato K, Fujimoto A, Yoshida T, Campo E, López-Otín C, Estivill X, Guigó R, de Sanjosé S, Piris MA, Montserrat E, González-Díaz M, Puente XS, Jares P, Valencia A, Himmelbauer H, Himmelbaue H, Quesada V, Bea S, Stratton MR, Futreal PA, Campbell PJ, Vincent-Salomon A, Richardson AL, Reis-Filho JS, van de Vijver M, Thomas G, Masson-Jacquemier JD, Aparicio S, Borg A, Børresen-Dale AL, Caldas C, Foekens JA, Stunnenberg HG, van't Veer L, Easton DF, Spellman PT, Martin S, Barker AD, Chin L, Collins FS, Compton CC, Ferguson ML, Gerhard DS, Getz G, Gunter C, Guttmacher A, Guyer M, Hayes DN, Lander ES, Ozenberger B, Penny R, Peterson J, Sander C, Shaw KM, Speed TP, Spellman PT, Vockley JG, Wheeler DA, Wilson RK, Hudson TJ, Chin L, Knoppers BM, Lander ES, Lichter P, Stein LD, Stratton MR, Anderson W, Barker AD, Bell C, Bobrow M, Burke W, Collins FS, Compton CC, DePinho RA, Easton DF, Futreal PA, Gerhard DS, Green AR, Guyer M, Hamilton SR, Hubbard TJ, Kallioniemi OP, Kennedy KL, Ley TJ, Liu ET, Lu Y, Majumder P, Marra M, Ozenberger B, Peterson J, Schafer AJ, Spellman PT, Stunnenberg HG, Wainwright BJ, Wilson RK and Yang H

    The International Cancer Genome Consortium (ICGC) was launched to coordinate large-scale cancer genome studies in tumours from 50 different cancer types and/or subtypes that are of clinical and societal importance across the globe. Systematic studies of more than 25,000 cancer genomes at the genomic, epigenomic and transcriptomic levels will reveal the repertoire of oncogenic mutations, uncover traces of the mutagenic influences, define clinically relevant subtypes for prognosis and therapeutic management, and enable the development of new cancer therapies.

    Funded by: Cancer Research UK: 6613; NCI NIH HHS: P01 CA117969, P01 CA117969-04S1, P01 CA117969-05, P50 CA102701, P50 CA102701-08, P50 CA127003, P50 CA127003-04, P50 CA127003-05; NHGRI NIH HHS: R01 HG001806-02; NIDDK NIH HHS: K08 DK071329, K08 DK071329-04, K08 DK071329-05; Wellcome Trust: 077198, 088340, 093867

    Nature 2010;464;7291;993-8

  • iMotifs: an integrated sequence motif visualization and analysis environment.

    Piipari M, Down TA, Saini H, Enright A and Hubbard TJ

    Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK. matias.piipari@gmail.com

    Motivation: Short sequence motifs are an important class of models in molecular biology, used most commonly for describing transcription factor binding site specificity patterns. High-throughput methods have been recently developed for detecting regulatory factor binding sites in vivo and in vitro and consequently high-quality binding site motif data are becoming available for increasing number of organisms and regulatory factors. Development of intuitive tools for the study of sequence motifs is therefore important. iMotifs is a graphical motif analysis environment that allows visualization of annotated sequence motifs and scored motif hits in sequences. It also offers motif inference with the sensitive NestedMICA algorithm, as well as overrepresentation and pairwise motif matching capabilities. All of the analysis functionality is provided without the need to convert between file formats or learn different command line interfaces. The application includes a bundled and graphically integrated version of the NestedMICA motif inference suite that has no outside dependencies. Problems associated with local deployment of software are therefore avoided.

    Availability: iMotifs is licensed with the GNU Lesser General Public License v2.0 (LGPL 2.0). The software and its source is available at http://wiki.github.com/mz2/imotifs and can be run on Mac OS X Leopard (Intel/PowerPC). We also provide a cross-platform (Linux, OS X, Windows) LGPL 2.0 licensed library libxms for the Perl, Ruby, R and Objective-C programming languages for input and output of XMS formatted annotated sequence motif set files.

    Contact: matias.piipari@gmail.com; imotifs@googlegroups.com.

    Funded by: Wellcome Trust: 077198, 077198/Z/05/Z

    Bioinformatics (Oxford, England) 2010;26;6;843-4

  • Novel candidate cancer genes identified by a large-scale cross-species comparative oncogenomics approach.

    Mattison J, Kool J, Uren AG, de Ridder J, Wessels L, Jonkers J, Bignell GR, Butler A, Rust AG, Brosch M, Wilson CH, van der Weyden L, Largaespada DA, Stratton MR, Futreal PA, van Lohuizen M, Berns A, Collier LS, Hubbard T and Adams DJ

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

    Comparative genomic hybridization (CGH) can reveal important disease genes but the large regions identified could sometimes contain hundreds of genes. Here we combine high-resolution CGH analysis of 598 human cancer cell lines with insertion sites isolated from 1,005 mouse tumors induced with the murine leukemia virus (MuLV). This cross-species oncogenomic analysis revealed candidate tumor suppressor genes and oncogenes mutated in both human and mouse tumors, making them strong candidates for novel cancer genes. A significant number of these genes contained binding sites for the stem cell transcription factors Oct4 and Nanog. Notably, mice carrying tumors with insertions in or near stem cell module genes, which are thought to participate in cell self-renewal, died significantly faster than mice without these insertions. A comparison of the profile we identified to that induced with the Sleeping Beauty (SB) transposon system revealed significant differences in the profile of recurrently mutated genes. Collectively, this work provides a rich catalogue of new candidate cancer genes for functional analysis.

    Funded by: Cancer Research UK: A6997, A8784; NCI NIH HHS: K01 CA122183, K01CA122183, R01 CA113636, R01 CA134759; Wellcome Trust: 077198, 082356

    Cancer research 2010;70;3;883-95

  • Ensembl's 10th year.

    Flicek P, Aken BL, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Gräf S, Haider S, Hammond M, Howe K, Jenkinson A, Johnson N, Kähäri A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Koscielny G, Kulesha E, Lawson D, Longden I, Massingham T, McLaren W, Megy K, Overduin B, Pritchard B, Rios D, Ruffier M, Schuster M, Slater G, Smedley D, Spudich G, Tang YA, Trevanion S, Vilella A, Vogel J, White S, Wilder SP, Zadissa A, Birney E, Cunningham F, Dunham I, Durbin R, Fernández-Suarez XM, Herrero J, Hubbard TJ, Parker A, Proctor G, Smith J and Searle SM

    European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. flicek@ebi.ac.uk

    Ensembl (http://www.ensembl.org) integrates genomic information for a comprehensive set of chordate genomes with a particular focus on resources for human, mouse, rat, zebrafish and other high-value sequenced genomes. We provide complete gene annotations for all supported species in addition to specific resources that target genome variation, function and evolution. Ensembl data is accessible in a variety of formats including via our genome browser, API and BioMart. This year marks the tenth anniversary of Ensembl and in that time the project has grown with advances in genome technology. As of release 56 (September 2009), Ensembl supports 51 species including marmoset, pig, zebra finch, lizard, gorilla and wallaby, which were added in the past year. Major additions and improvements to Ensembl since our previous report include the incorporation of the human GRCh37 assembly, enhanced visualisation and data-mining options for the Ensembl regulatory features and continued development of our software infrastructure.

    Funded by: Biotechnology and Biological Sciences Research Council: BBE0116401; Wellcome Trust: 062023, 077198

    Nucleic acids research 2010;38;Database issue;D557-62

  • AnnoTrack--a tracking system for genome annotation.

    Kokocinski F, Harrow J and Hubbard T

    Vertebrate Genome Analysis, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB101HH, UK. fsk@sanger.ac.uk

    Background: As genome sequences are determined for increasing numbers of model organisms, demand has grown for better tools to facilitate unified genome annotation efforts by communities of biologists. Typically this process involves numerous experts from the field and the use of data from dispersed sources as evidence. This kind of collaborative annotation project requires specialized software solutions for efficient data tracking and processing.

    Results: As part of the scale-up phase of the ENCODE project (Encyclopedia of DNA Elements), the aim of the GENCODE project is to produce a highly accurate evidence-based reference gene annotation for the human genome. The AnnoTrack software system was developed to aid this effort. It integrates data from multiple distributed sources, highlights conflicts and facilitates the quick identification, prioritisation and resolution of problems during the process of genome annotation.

    Conclusions: AnnoTrack has been in use for the last year and has proven a very valuable tool for large-scale genome annotation. Designed to interface with standard bioinformatics components, such as DAS servers and Ensembl databases, it is easy to setup and configure for different genome projects. The source code is available at http://annotrack.sanger.ac.uk.

    Funded by: NHGRI NIH HHS: 5U54HG004555; Wellcome Trust: 077198, WT077198/Z/05/Z

    BMC genomics 2010;11;538

  • Metamotifs--a generative model for building families of nucleotide position weight matrices.

    Piipari M, Down TA and Hubbard TJ

    Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, UK. matias.piipari@gmail.com

    Background: Development of high-throughput methods for measuring DNA interactions of transcription factors together with computational advances in short motif inference algorithms is expanding our understanding of transcription factor binding site motifs. The consequential growth of sequence motif data sets makes it important to systematically group and categorise regulatory motifs. It has been shown that there are familial tendencies in DNA sequence motifs that are predictive of the family of factors that binds them. Further development of methods that detect and describe familial motif trends has the potential to help in measuring the similarity of novel computational motif predictions to previously known data and sensitively detecting regulatory motifs similar to previously known ones from novel sequence.

    Results: We propose a probabilistic model for position weight matrix (PWM) sequence motif families. The model, which we call the 'metamotif' describes recurring familial patterns in a set of motifs. The metamotif framework models variation within a family of sequence motifs. It allows for simultaneous estimation of a series of independent metamotifs from input position weight matrix (PWM) motif data and does not assume that all input motif columns contribute to a familial pattern. We describe an algorithm for inferring metamotifs from weight matrix data. We then demonstrate the use of the model in two practical tasks: in the Bayesian NestedMICA model inference algorithm as a PWM prior to enhance motif inference sensitivity, and in a motif classification task where motifs are labelled according to their interacting DNA binding domain.

    Conclusions: We show that metamotifs can be used as PWM priors in the NestedMICA motif inference algorithm to dramatically increase the sensitivity to infer motifs. Metamotifs were also successfully applied to a motif classification problem where sequence motif features were used to predict the family of protein DNA binding domains that would interact with it. The metamotif based classifier is shown to compare favourably to previous related methods. The metamotif has great potential for further use in machine learning tasks related to especially de novo computational sequence motif inference. The metamotif methods presented have been incorporated into the NestedMICA suite.

    Funded by: Wellcome Trust: 077198, 077198/Z/05/Z

    BMC bioinformatics 2010;11;348

2009 Publications

  • Genome-wide end-sequenced BAC resources for the NOD/MrkTac() and NOD/ShiLtJ() mouse genomes.

    Steward CA, Humphray S, Plumb B, Jones MC, Quail MA, Rice S, Cox T, Davies R, Bonfield J, Keane TM, Nefedov M, de Jong PJ, Lyons P, Wicker L, Todd J, Hayashizaki Y, Gulban O, Danska J, Harrow J, Hubbard T, Rogers J and Adams DJ

    The Wellcome Trust Sanger Institute, Hinxton, UK. cas@sanger.ac.uk

    Non-obese diabetic (NOD) mice spontaneously develop type 1 diabetes (T1D) due to the progressive loss of insulin-secreting beta-cells by an autoimmune driven process. NOD mice represent a valuable tool for studying the genetics of T1D and for evaluating therapeutic interventions. Here we describe the development and characterization by end-sequencing of bacterial artificial chromosome (BAC) libraries derived from NOD/MrkTac (DIL NOD) and NOD/ShiLtJ (CHORI-29), two commonly used NOD substrains. The DIL NOD library is composed of 196,032 BACs and the CHORI-29 library is composed of 110,976 BACs. The average depth of genome coverage of the DIL NOD library, estimated from mapping the BAC end-sequences to the reference mouse genome sequence, was 7.1-fold across the autosomes and 6.6-fold across the X chromosome. Clones from this library have an average insert size of 150 kb and map to over 95.6% of the reference mouse genome assembly (NCBIm37), covering 98.8% of Ensembl mouse genes. By the same metric, the CHORI-29 library has an average depth over the autosomes of 5.0-fold and 2.8-fold coverage of the X chromosome, the reduced X chromosome coverage being due to the use of a male donor for this library. Clones from this library have an average insert size of 205 kb and map to 93.9% of the reference mouse genome assembly, covering 95.7% of Ensembl genes. We have identified and validated 191,841 single nucleotide polymorphisms (SNPs) for DIL NOD and 114,380 SNPs for CHORI-29. In total we generated 229,736,133 bp of sequence for the DIL NOD and 121,963,211 bp for the CHORI-29. These BAC libraries represent a powerful resource for functional studies, such as gene targeting in NOD embryonic stem (ES) cell lines, and for sequencing and mapping experiments.

    Funded by: Cancer Research UK; Medical Research Council: G0800024; Wellcome Trust: 062023, 077198

    Genomics 2010;95;2;105-10

  • Cancer gene discovery in mouse and man.

    Mattison J, van der Weyden L, Hubbard T and Adams DJ

    Experimental Cancer Genetics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1HH, UK.

    The elucidation of the human and mouse genome sequence and developments in high-throughput genome analysis, and in computational tools, have made it possible to profile entire cancer genomes. In parallel with these advances mouse models of cancer have evolved into a powerful tool for cancer gene discovery. Here we discuss the approaches that may be used for cancer gene identification in both human and mouse and discuss how a cross-species 'oncogenomics' approach to cancer gene discovery represents a powerful strategy for finding genes that drive tumourigenesis.

    Funded by: Cancer Research UK; Wellcome Trust: 077198

    Biochimica et biophysica acta 2009;1796;2;140-61

  • Discovery of candidate disease genes in ENU-induced mouse mutants by large-scale sequencing, including a splice-site mutation in nucleoredoxin.

    Boles MK, Wilkinson BM, Wilming LG, Liu B, Probst FJ, Harrow J, Grafham D, Hentges KE, Woodward LP, Maxwell A, Mitchell K, Risley MD, Johnson R, Hirschi K, Lupski JR, Funato Y, Miki H, Marin-Garcia P, Matthews L, Coffey AJ, Parker A, Hubbard TJ, Rogers J, Bradley A, Adams DJ and Justice MJ

    Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA.

    An accurate and precisely annotated genome assembly is a fundamental requirement for functional genomic analysis. Here, the complete DNA sequence and gene annotation of mouse Chromosome 11 was used to test the efficacy of large-scale sequencing for mutation identification. We re-sequenced the 14,000 annotated exons and boundaries from over 900 genes in 41 recessive mutant mouse lines that were isolated in an N-ethyl-N-nitrosourea (ENU) mutation screen targeted to mouse Chromosome 11. Fifty-nine sequence variants were identified in 55 genes from 31 mutant lines. 39% of the lesions lie in coding sequences and create primarily missense mutations. The other 61% lie in noncoding regions, many of them in highly conserved sequences. A lesion in the perinatal lethal line l11Jus13 alters a consensus splice site of nucleoredoxin (Nxn), inserting 10 amino acids into the resulting protein. We conclude that point mutations can be accurately and sensitively recovered by large-scale sequencing, and that conserved noncoding regions should be included for disease mutation identification. Only seven of the candidate genes we report have been previously targeted by mutation in mice or rats, showing that despite ongoing efforts to functionally annotate genes in the mammalian genome, an enormous gap remains between phenotype and function. Our data show that the classical positional mapping approach of disease mutation identification can be extended to large target regions using high-throughput sequencing.

    Funded by: Cancer Research UK; NCI NIH HHS: R01 CA115503; NHGRI NIH HHS: U54 HG004555; Wellcome Trust: 062023, 077198, 76943

    PLoS genetics 2009;5;12;e1000759

  • Post-publication sharing of data and tools.

    Schofield PN, Bubela T, Weaver T, Portilla L, Brown SD, Hancock JM, Einhorn D, Tocchini-Valentini G, Hrabe de Angelis M, Rosenthal N and CASIMIR Rome Meeting participants

    Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge, CB2 3EG, UK. PS@mole.bio.cam.ac.uk

    Despite existing guidelines on access to data and bioresources, good practice is not widespread. A meeting of mouse researchers in Rome proposes ways to promote a culture of sharing.

    Funded by: Medical Research Council: MC_U142684171; Wellcome Trust: 062023, 077198

    Nature 2009;461;7261;171-3

  • Prepublication data sharing.

    Toronto International Data Release Workshop Authors, Birney E, Hudson TJ, Green ED, Gunter C, Eddy S, Rogers J, Harris JR, Ehrlich SD, Apweiler R, Austin CP, Berglund L, Bobrow M, Bountra C, Brookes AJ, Cambon-Thomsen A, Carter NP, Chisholm RL, Contreras JL, Cooke RM, Crosby WL, Dewar K, Durbin R, Dyke SO, Ecker JR, El Emam K, Feuk L, Gabriel SB, Gallacher J, Gelbart WM, Granell A, Guarner F, Hubbard T, Jackson SA, Jennings JL, Joly Y, Jones SM, Kaye J, Kennedy KL, Knoppers BM, Kyrpides NC, Lowrance WW, Luo J, MacKay JJ, Martín-Rivera L, McCombie WR, McPherson JD, Miller L, Miller W, Moerman D, Mooser V, Morton CC, Ostell JM, Ouellette BF, Parkhill J, Raina PS, Rawlings C, Scherer SE, Scherer SW, Schofield PN, Sensen CW, Stodden VC, Sussman MR, Tanaka T, Thornton J, Tsunoda T, Valle D, Vuorio EI, Walker NM, Wallace S, Weinstock G, Whitman WB, Worley KC, Wu C, Wu J and Yu J

    Rapid release of prepublication data has served the field of genomics well. Attendees at a workshop in Toronto recommend extending the practice to other biological data sets.

    Funded by: Biotechnology and Biological Sciences Research Council; NHGRI NIH HHS: U54 HG003273, U54 HG003273-04, U54 HG003273-04S1, U54 HG003273-05, U54 HG003273-05S1, U54 HG003273-05S2, U54 HG003273-06, U54 HG003273-06S1, U54 HG003273-06S2, U54 HG003273-07, U54 HG003273-08; Wellcome Trust: 062023, 077198

    Nature 2009;461;7261;168-70

  • The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

    Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R and Lipman D

    National Center for Biotechnology Information, National Library of Medicine, Bethesda, Maryland 20894, USA. Pruitt@ncbi.nlm.nih.gov

    Effective use of the human and mouse genomes requires reliable identification of genes and their products. Although multiple public resources provide annotation, different methods are used that can result in similar but not identical representation of genes, transcripts, and proteins. The collaborative consensus coding sequence (CCDS) project tracks identical protein annotations on the reference mouse and human genomes with a stable identifier (CCDS ID), and ensures that they are consistently represented on the NCBI, Ensembl, and UCSC Genome Browsers. Importantly, the project coordinates on manually reviewing inconsistent protein annotations between sites, as well as annotations for which new evidence suggests a revision is needed, to progressively converge on a complete protein-coding set for the human and mouse reference genomes, while maintaining a high standard of reliability and biological accuracy. To date, the project has identified 20,159 human and 17,707 mouse consensus coding regions from 17,052 human and 16,893 mouse genes. Three evaluation methods indicate that the entries in the CCDS set are highly likely to represent real proteins, more so than annotations from contributing groups not included in CCDS. The CCDS database thus centralizes the function of identifying well-supported, identically-annotated, protein-coding regions.

    Funded by: NHGRI NIH HHS: 1U54HG004555-01, U54 HG004555; Wellcome Trust: 062023, 077198, WT062023, WT077198

    Genome research 2009;19;7;1316-23

  • Accurate and sensitive peptide identification with Mascot Percolator.

    Brosch M, Yu L, Hubbard T and Choudhary J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom.

    Sound scoring methods for sequence database search algorithms such as Mascot and Sequest are essential for sensitive and accurate peptide and protein identifications from proteomic tandem mass spectrometry data. In this paper, we present a software package that interfaces Mascot with Percolator, a well performing machine learning method for rescoring database search results, and demonstrate it to be amenable for both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures. Mascot Percolator can be readily used as a stand alone tool or integrated into existing data analysis pipelines.

    Funded by: Wellcome Trust: 077198

    Journal of proteome research 2009;8;6;3176-81

  • Ensembl 2009.

    Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Rios D, Schuster M, Slater G, Smedley D, Spooner W, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wilder S, Zadissa A, Birney E, Cunningham F, Curwen V, Durbin R, Fernandez-Suarez XM, Herrero J, Kasprzyk A, Proctor G, Smith J, Searle S and Flicek P

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. th@sanger.ac.uk

    The Ensembl project (http://www.ensembl.org) is a comprehensive genome information system featuring an integrated set of genome annotation, databases, and other information for chordate, selected model organism and disease vector genomes. As of release 51 (November 2008), Ensembl fully supports 45 species, and three additional species have preliminary support. New species in the past year include orangutan and six additional low coverage mammalian genomes. Major additions and improvements to Ensembl since our previous report include a major redesign of our website; generation of multiple genome alignments and ancestral sequences using the new Enredo-Pecan-Ortheus pipeline and development of our software infrastructure, particularly to support the Ensembl Genomes project (http://www.ensemblgenomes.org/).

    Funded by: Biotechnology and Biological Sciences Research Council: BBE0116401; Medical Research Council; Wellcome Trust: 062023, 077198, WT062023

    Nucleic acids research 2009;37;Database issue;D690-7

  • Petabyte-scale innovations at the European Nucleotide Archive.

    Cochrane G, Akhtar R, Bonfield J, Bower L, Demiralp F, Faruque N, Gibson R, Hoad G, Hubbard T, Hunter C, Jang M, Juhos S, Leinonen R, Leonard S, Lin Q, Lopez R, Lorenc D, McWilliam H, Mukherjee G, Plaister S, Radhakrishnan R, Robinson S, Sobhany S, Hoopen PT, Vaughan R, Zalunin V and Birney E

    EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. cochrane@ebi.ac.uk

    Dramatic increases in the throughput of nucleotide sequencing machines, and the promise of ever greater performance, have thrust bioinformatics into the era of petabyte-scale data sets. Sequence repositories, which provide the feed for these data sets into the worldwide computational infrastructure, are challenged by the impact of these data volumes. The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/embl), comprising the EMBL Nucleotide Sequence Database and the Ensembl Trace Archive, has identified challenges in the storage, movement, analysis, interpretation and visualization of petabyte-scale data sets. We present here our new repository for next generation sequence data, a brief summary of contents of the ENA and provide details of major developments to submission pipelines, high-throughput rule-based validation infrastructure and data integration approaches.

    Funded by: Wellcome Trust: 077198, 085532

    Nucleic acids research 2009;37;Database issue;D19-25

  • Annotations for all by all - the BioSapiens network.

    Thornton J and BioSapiens Network

    The BioSapiens network has developed a distributed infrastructure for genome and proteome annotation by laboratories anywhere in the world.

    Funded by: Wellcome Trust: 077198

    Genome biology 2009;10;2;401

2008 Publications

  • The Protein Feature Ontology: a tool for the unification of protein feature annotations.

    Reeves GA, Eilbeck K, Magrane M, O'Donovan C, Montecchi-Palazzi L, Harris MA, Orchard S, Jimenez RC, Prlic A, Hubbard TJ, Hermjakob H and Thornton JM

    EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. gabby@ebi.ac.uk

    Motivation: The advent of sequencing and structural genomics projects has provided a dramatic boost in the number of uncharacterized protein structures and sequences. Consequently, many computational tools have been developed to help elucidate protein function. However, such services are spread throughout the world, often with standalone web pages. Integration of these methods is needed and so far this has not been possible as there was no common vocabulary available that could be used as a standard language.

    Results: The Protein Feature Ontology has been developed to provide a structured controlled vocabulary for features on a protein sequence or structure and comprises approximately 100 positional terms, now integrated into the Sequence Ontology (SO) and 40 non-positional terms which describe features relating to the whole-protein sequence. In addition, post-translational modifications are described by using a pre-existing ontology, the Protein Modification Ontology (MOD). This ontology is being used to integrate over 150 distinct annotations provided by the BioSapiens Network of Excellence, a consortium comprising 19 partner sites in Europe.

    Availability: The Protein Feature Ontology can be browsed by accessing the ontology lookup service at the European Bioinformatics Institute (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS).

    Funded by: Wellcome Trust: 062023, 077198

    Bioinformatics (Oxford, England) 2008;24;23;2767-72

  • An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs).

    Rakyan VK, Down TA, Thorne NP, Flicek P, Kulesha E, Gräf S, Tomazou EM, Bäckdahl L, Johnson N, Herberth M, Howe KL, Jackson DK, Miretti MM, Fiegler H, Marioni JC, Birney E, Hubbard TJ, Carter NP, Tavaré S and Beck S

    Institute of Cell and Molecular Science, Barts and the London, London E1 2AT, United Kingdom. v.rakyan@qmul.ac.uk

    We report a novel resource (methylation profiles of DNA, or mPod) for human genome-wide tissue-specific DNA methylation profiles. mPod consists of three fully integrated parts, genome-wide DNA methylation reference profiles of 13 normal somatic tissues, placenta, sperm, and an immortalized cell line, a visualization tool that has been integrated with the Ensembl genome browser and a new algorithm for the analysis of immunoprecipitation-based DNA methylation profiles. We demonstrate the utility of our resource by identifying the first comprehensive genome-wide set of tissue-specific differentially methylated regions (tDMRs) that may play a role in cellular identity and the regulation of tissue-specific genome function. We also discuss the implications of our findings with respect to the regulatory potential of regions with varied CpG density, gene expression, transcription factor motifs, gene ontology, and correlation with other epigenetic marks such as histone modifications.

    Funded by: Cancer Research UK: C14303/A4646; Wellcome Trust: 077198

    Genome research 2008;18;9;1518-29

  • A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis.

    Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Gräf S, Johnson N, Herrero J, Tomazou EM, Thorne NP, Bäckdahl L, Herberth M, Howe KL, Jackson DK, Miretti MM, Marioni JC, Birney E, Hubbard TJ, Durbin R, Tavaré S and Beck S

    Wellcome Trust Cancer Research UK Gurdon Institute, and Department of Genetics, University of Cambridge, Tennis Court Road, Cambridge CB2 1QR, UK. thomas.down@gurdon.cam.ac.uk

    DNA methylation is an indispensible epigenetic modification required for regulating the expression of mammalian genomes. Immunoprecipitation-based methods for DNA methylome analysis are rapidly shifting the bottleneck in this field from data generation to data analysis, necessitating the development of better analytical tools. In particular, an inability to estimate absolute methylation levels remains a major analytical difficulty associated with immunoprecipitation-based DNA methylation profiling. To address this issue, we developed a cross-platform algorithm-Bayesian tool for methylation analysis (Batman)-for analyzing methylated DNA immunoprecipitation (MeDIP) profiles generated using oligonucleotide arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). We developed the latter approach to provide a high-resolution whole-genome DNA methylation profile (DNA methylome) of a mammalian genome. Strong correlation of our data, obtained using mature human spermatozoa, with those obtained using bisulfite sequencing suggest that combining MeDIP-seq or MeDIP-chip with Batman provides a robust, quantitative and cost-effective functional genomic strategy for elucidating the function of DNA methylation.

    Funded by: Cancer Research UK: C14303/A8646; Wellcome Trust: 077198, 083563, 084071

    Nature biotechnology 2008;26;7;779-85

  • Large-scale mutagenesis in p19(ARF)- and p53-deficient mice identifies cancer genes and their collaborative networks.

    Uren AG, Kool J, Matentzoglu K, de Ridder J, Mattison J, van Uitert M, Lagcher W, Sie D, Tanger E, Cox T, Reinders M, Hubbard TJ, Rogers J, Jonkers J, Wessels L, Adams DJ, van Lohuizen M and Berns A

    Division of Molecular Genetics and Cancer Genomics Centre, Netherlands Cancer Institute, Plesmanlaan 121, 1066CX, Amsterdam, The Netherlands.

    p53 and p19(ARF) are tumor suppressors frequently mutated in human tumors. In a high-throughput screen in mice for mutations collaborating with either p53 or p19(ARF) deficiency, we identified 10,806 retroviral insertion sites, implicating over 300 loci in tumorigenesis. This dataset reveals 20 genes that are specifically mutated in either p19(ARF)-deficient, p53-deficient or wild-type mice (including Flt3, mmu-mir-106a-363, Smg6, and Ccnd3), as well as networks of significant collaborative and mutually exclusive interactions between cancer genes. Furthermore, we found candidate tumor suppressor genes, as well as distinct clusters of insertions within genes like Flt3 and Notch1 that induce mutants with different spectra of genetic interactions. Cross species comparative analysis with aCGH data of human cancer cell lines revealed known and candidate oncogenes (Mmp13, Slamf6, and Rreb1) and tumor suppressors (Wwox and Arfrp2). This dataset should prove to be a rich resource for the study of genetic interactions that underlie tumorigenesis.

    Funded by: Cancer Research UK; Wellcome Trust: 077198, 76943

    Cell 2008;133;4;727-41

  • Comparison of Mascot and X!Tandem performance for low and high accuracy mass spectrometry and the development of an adjusted Mascot threshold.

    Brosch M, Swamy S, Hubbard T and Choudhary J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    It is a major challenge to develop effective sequence database search algorithms to translate molecular weight and fragment mass information obtained from tandem mass spectrometry into high quality peptide and protein assignments. We investigated the peptide identification performance of Mascot and X!Tandem for mass tolerance settings common for low and high accuracy mass spectrometry. We demonstrated that sensitivity and specificity of peptide identification can vary substantially for different mass tolerance settings, but this effect was more significant for Mascot. We present an adjusted Mascot threshold, which allows the user to freely select the best trade-off between sensitivity and specificity. The adjusted Mascot threshold was compared with the default Mascot and X!Tandem scoring thresholds and shown to be more sensitive at the same false discovery rates for both low and high accuracy mass spectrometry data.

    Funded by: Wellcome Trust: 077198

    Molecular & cellular proteomics : MCP 2008;7;5;962-70

  • The vertebrate genome annotation (Vega) database.

    Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T and Harrow JL

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK. lw2@sanger.ac.uk

    The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) was first made public in 2004 and has been designed to view manual annotation of human, mouse and zebrafish genomic sequences produced at the Wellcome Trust Sanger Institute. Since its initial release, the number of human annotated loci has more than doubled to close to 33 000 and now contains comprehensive annotation on 20 of the 24 human chromosomes, four whole mouse chromosomes and around 40% of the zebrafish Danio rerio genome. In addition, we offer manual annotation of a number of haplotype regions in mouse and human and regions of comparative interest in pig and dog that are unique to Vega.

    Funded by: NHGRI NIH HHS: U54 HG004555; Wellcome Trust: 077198

    Nucleic acids research 2008;36;Database issue;D753-60

  • Data growth and its impact on the SCOP database: new developments.

    Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C and Murzin AG

    MRC Centre for Protein Engineering, Hills Road, Cambridge CB2 0QH, UK.

    The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. The SCOP hierarchy comprises the following levels: Species, Protein, Family, Superfamily, Fold and Class. While keeping the original classification scheme intact, we have changed the production of SCOP in order to cope with a rapid growth of new structural data and to facilitate the discovery of new protein relationships. We describe ongoing developments and new features implemented in SCOP. A new update protocol supports batch classification of new protein structures by their detected relationships at Family and Superfamily levels in contrast to our previous sequential handling of new structural data by release date. We introduce pre-SCOP, a preview of the SCOP developmental version that enables earlier access to the information on new relationships. We also discuss the impact of worldwide Structural Genomics initiatives, which are producing new protein structures at an increasing rate, on the rates of discovery and growth of protein families and superfamilies. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.

    Funded by: Medical Research Council: MC_U105192716; NIGMS NIH HHS: R01-GM073109; Wellcome Trust: 077198

    Nucleic acids research 2008;36;Database issue;D419-25

  • Ensembl 2008.

    Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Eyre T, Fitzgerald S, Fernandez-Banet J, Gräf S, Haider S, Hammond M, Holland R, Howe KL, Howe K, Johnson N, Jenkinson A, Kähäri A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Slater G, Smedley D, Spudich G, Trevanion S, Vilella AJ, Vogel J, White S, Wood M, Birney E, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Herrero J, Hubbard TJ, Kasprzyk A, Proctor G, Smith J, Ureta-Vidal A and Searle S

    European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. flicek@ebi.ac.uk

    The Ensembl project (http://www.ensembl.org) is a comprehensive genome information system featuring an integrated set of genome annotation, databases and other information for chordate and selected model organism and disease vector genomes. As of release 47 (October 2007), Ensembl fully supports 35 species, with preliminary support for six additional species. New species in the past year include platypus and horse. Major additions and improvements to Ensembl since our previous report include extensive support for functional genomics data in the form of a specialized functional genomics database, genome-wide maps of protein-DNA interactions and the Ensembl regulatory build; support for customization of the Ensembl web interface through the addition of user accounts and user groups; and increased support for genome resequencing. We have also introduced new comparative genomics-based data mining options and report on the continued development of our software infrastructure.

    Funded by: Biotechnology and Biological Sciences Research Council: BBE0116401, BBS/B/13446, BBS/B/13470; Wellcome Trust: 062023, 077198

    Nucleic acids research 2008;36;Database issue;D707-14

  • Priorities for nucleotide trace, sequence and annotation data capture at the Ensembl Trace Archive and the EMBL Nucleotide Sequence Database.

    Cochrane G, Akhtar R, Aldebert P, Althorpe N, Baldwin A, Bates K, Bhattacharyya S, Bonfield J, Bower L, Browne P, Castro M, Cox T, Demiralp F, Eberhardt R, Faruque N, Hoad G, Jang M, Kulikova T, Labarga A, Leinonen R, Leonard S, Lin Q, Lopez R, Lorenc D, McWilliam H, Mukherjee G, Nardone F, Plaister S, Robinson S, Sobhany S, Vaughan R, Wu D, Zhu W, Apweiler R, Hubbard T and Birney E

    EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. cochrane@ebi.ac.uk

    The Ensembl Trace Archive (http://trace.ensembl.org/) and the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/), known together as the European Nucleotide Archive, continue to see growth in data volume and diversity. Selected major developments of 2007 are presented briefly, along with data submission and retrieval information. In the face of increasing requirements for nucleotide trace, sequence and annotation data archiving, data capture priority decisions have been taken at the European Nucleotide Archive. Priorities are discussed in terms of how reliably information can be captured, the long-term benefits of its capture and the ease with which it can be captured.

    Funded by: Wellcome Trust: 062023, 077198, 085532

    Nucleic acids research 2008;36;Database issue;D5-12

  • Integrating biological data--the Distributed Annotation System.

    Jenkinson AM, Albrecht M, Birney E, Blankenburg H, Down T, Finn RD, Hermjakob H, Hubbard TJ, Jimenez RC, Jones P, Kähäri A, Kulesha E, Macías JR, Reeves GA and Prlić A

    European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK. andy.jenkinson@ebi.ac.uk

    Background: The Distributed Annotation System (DAS) is a widely adopted protocol for dynamically integrating a wide range of biological data from geographically diverse sources. DAS continues to expand its applicability and evolve in response to new challenges facing integrative bioinformatics.

    Results: Here we describe the various infrastructure components of DAS and present a new extended version of the DAS specification. Version 1.53E incorporates several recent developments, including its extension to serve new data types and an ontology for protein features.

    Conclusion: Our extensions to the DAS protocol have facilitated the integration of new data types, and our improvements to the existing DAS infrastructure have addressed recent challenges. The steadily increasing numbers of available data sources demonstrates further adoption of the DAS protocol.

    Funded by: Wellcome Trust: 077198

    BMC bioinformatics 2008;9 Suppl 8;S3

  • NestedMICA as an ab initio protein motif discovery tool.

    Doğruel M, Down TA and Hubbard TJ

    Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1HH, UK. md5@sanger.ac.uk

    Background: Discovering overrepresented patterns in amino acid sequences is an important step in protein functional element identification. We adapted and extended NestedMICA, an ab initio motif finder originally developed for finding transcription binding site motifs, to find short protein signals, and compared its performance with another popular protein motif finder, MEME. NestedMICA, an open source protein motif discovery tool written in Java, is driven by a Monte Carlo technique called Nested Sampling. It uses multi-class sequence background models to represent different "uninteresting" parts of sequences that do not contain motifs of interest. In order to assess NestedMICA as a protein motif finder, we have tested it on synthetic datasets produced by spiking instances of known motifs into a randomly selected set of protein sequences. NestedMICA was also tested using a biologically-authentic test set, where we evaluated its performance with respect to varying sequence length.

    Results: Generally NestedMICA recovered most of the short (3-9 amino acid long) test protein motifs spiked into a test set of sequences at different frequencies. We showed that it can be used to find multiple motifs at the same time, too. In all the assessment experiments we carried out, its overall motif discovery performance was better than that of MEME.

    Conclusion: NestedMICA proved itself to be a robust and sensitive ab initio protein motif finder, even for relatively short motifs that exist in only a small fraction of sequences.

    Availability: NestedMICA is available under the Lesser GPL open-source license from: http://www.sanger.ac.uk/Software/analysis/nmica/

    Funded by: Wellcome Trust: 077198

    BMC bioinformatics 2008;9;19

2007 Publications

  • Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

    ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SC, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, NISC Comparative Sequencing Program, Baylor College of Medicine Human Genome Sequencing Center, Washington University Genome Sequencing Center, Broad Institute, Children's Hospital Oakland Research Institute, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CW, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper SJ, Halees A, Lin JM, Shulha HP, Zhang X, Xu M, Haidar JN, Yu Y, Ruan Y, Iyer VR, Green RD, Wadelius C, Farnham PJ, Ren B, Harte RA, Hinrichs AS, Trumbower H, Clawson H, Hillman-Jackson J, Zweig AS, Smith K, Thakkapallayil A, Barber G, Kuhn RM, Karolchik D, Armengol L, Bird CP, de Bakker PI, Kern AD, Lopez-Bigas N, Martin JD, Stranger BE, Woodroffe A, Davydov E, Dimas A, Eyras E, Hallgrímsdóttir IB, Huppert J, Zody MC, Abecasis GR, Estivill X, Bouffard GG, Guan X, Hansen NF, Idol JR, Maduro VV, Maskeri B, McDowell JC, Park M, Thomas PJ, Young AC, Blakesley RW, Muzny DM, Sodergren E, Wheeler DA, Worley KC, Jiang H, Weinstock GM, Gibbs RA, Graves T, Fulton R, Mardis ER, Wilson RK, Clamp M, Cuff J, Gnerre S, Jaffe DB, Chang JL, Lindblad-Toh K, Lander ES, Koriabine M, Nefedov M, Osoegawa K, Yoshinaga Y, Zhu B and de Jong PJ

    We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.

    Funded by: NCI NIH HHS: F32 CA108313; NHGRI NIH HHS: K22 HG003169, K22 HG003169-01A1, P41 HG002371, P41 HG002371-03S1, R01 HG002238, R01 HG002238-15, R01 HG003110, R01 HG003110-03, R01 HG003129-03, R01 HG003143, R01 HG003143-04, R01 HG003521, R01 HG003521-01, R01 HG003532, R01 HG003532-01, R01 HG003541, R01 HG003541-03, U01 HG002523, U01 HG002523-01, U01 HG003147, U01 HG003147-02, U01 HG003150-03, U01 HG003151, U01 HG003151-03, U01 HG003156, U01 HG003156-03, U01 HG003157, U01 HG003157-03, U01 HG003161, U01 HG003161-03, U01 HG003162, U01 HG003162-03, U01 HG003168-02, U54 HG003067, U54 HG003067-01, U54 HG003079, U54 HG003079-01, U54 HG003273, U54 HG003273-01; Wellcome Trust: 062023, 077198

    Nature 2007;447;7146;799-816

  • Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions.

    Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J, Dike S, Wyss C, Henrichsen CN, Holroyd N, Dickson MC, Taylor R, Hance Z, Foissac S, Myers RM, Rogers J, Hubbard T, Harrow J, Guigó R, Gingeras TR, Antonarakis SE and Reymond A

    Grup de Recerca en Informática Biomèdica, Institut Municipal d'Investigació Mèdica/Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain.

    This report presents systematic empirical annotation of transcript products from 399 annotated protein-coding loci across the 1% of the human genome targeted by the Encyclopedia of DNA elements (ENCODE) pilot project using a combination of 5' rapid amplification of cDNA ends (RACE) and high-density resolution tiling arrays. We identified previously unannotated and often tissue- or cell-line-specific transcribed fragments (RACEfrags), both 5' distal to the annotated 5' terminus and internal to the annotated gene bounds for the vast majority (81.5%) of the tested genes. Half of the distal RACEfrags span large segments of genomic sequences away from the main portion of the coding transcript and often overlap with the upstream-annotated gene(s). Notably, at least 20% of the resultant novel transcripts have changes in their open reading frames (ORFs), most of them fusing ORFs of adjacent transcripts. A significant fraction of distal RACEfrags show expression levels comparable to those of known exons of the same locus, suggesting that they are not part of very minority splice forms. These results have significant implications concerning (1) our current understanding of the architecture of protein-coding genes; (2) our views on locations of regulatory regions in the genome; and (3) the interpretation of sequence polymorphisms mapping to regions hitherto considered to be "noncoding," ultimately relating to the identification of disease-related sequence alterations.

    Funded by: NHGRI NIH HHS: U01HG03147, U01HG03150; PHS HHS: N01C012400; Wellcome Trust: 077198

    Genome research 2007;17;6;746-59

  • Large-scale discovery of promoter motifs in Drosophila melanogaster.

    Down TA, Bergman CM, Su J and Hubbard TJ

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom. td2@sanger.ac.uk

    A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes.

    Funded by: Wellcome Trust: 077198

    PLoS computational biology 2007;3;1;e7

  • Ensembl 2007.

    Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Herrero J, Holland R, Howe K, Howe K, Johnson N, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Melsopp C, Megy K, Meidl P, Ouverdin B, Parker A, Prlic A, Rice S, Rios D, Schuster M, Sealy I, Severin J, Slater G, Smedley D, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wood M, Cox T, Curwen V, Durbin R, Fernandez-Suarez XM, Flicek P, Kasprzyk A, Proctor G, Searle S, Smith J, Ureta-Vidal A and Birney E

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. th@sanger.ac.uk

    The Ensembl (http://www.ensembl.org/) project provides a comprehensive and integrated source of annotation of chordate genome sequences. Over the past year the number of genomes available from Ensembl has increased from 15 to 33, with the addition of sites for the mammalian genomes of elephant, rabbit, armadillo, tenrec, platypus, pig, cat, bush baby, common shrew, microbat and european hedgehog; the fish genomes of stickleback and medaka and the second example of the genomes of the sea squirt (Ciona savignyi) and the mosquito (Aedes aegypti). Some of the major features added during the year include the first complete gene sets for genomes with low-sequence coverage, the introduction of new strain variation data and the introduction of new orthology/paralog annotations based on gene trees.

    Funded by: Biotechnology and Biological Sciences Research Council: BBS/B/13446, BBS/B/13470; Wellcome Trust: 062023

    Nucleic acids research 2007;35;Database issue;D610-7

  • SISYPHUS--structural alignments for proteins with non-trivial relationships.

    Andreeva A, Prlić A, Hubbard TJ and Murzin AG

    MRC Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, UK. tony@mrc-lmb.cam.ac.uk

    With the increasing amount of structural data, the number of homologous protein structures bearing topological irregularities is steadily growing. These include proteins with circular permutations, segment-swapping, context-dependent folding or chameleon sequences that can adopt alternative secondary structures. Their non-trivial structural relationships are readily identified during expert analysis but their automatic identification using the existing computational tools still remains difficult or impossible. Such non-trivial cases of protein relationships are known to pose a problem to multiple alignment algorithms and to impede comparative modeling studies. They support a new emerging concept of evolutionary changeable protein fold, which creates practical difficulties for the hierarchical classifications of protein structures.To facilitate the understanding of, and to provide a comprehensive annotation of proteins with such non-trivial structural relationships we have created SISYPHUS ([Sigmaomeganuphiomicronzeta]--in Greek crafty), a compendium to the SCOP database. The SISYPHUS database contains a collection of manually curated structural alignments and their inter-relationships. The multiple alignments are constructed for protein structural regions that range from oligomeric biological units, or individual domains to fragments of different size. The SISYPHUS multiple alignments are displayed with SPICE, a browser that provides an integrated view of protein sequences, structures and their annotations. The database is available from http://sisyphus.mrc-cpe.cam.ac.uk.

    Funded by: Medical Research Council: MC_U105192716; Wellcome Trust: 077198

    Nucleic acids research 2007;35;Database issue;D253-9

  • Critical assessment of methods of protein structure prediction-Round VII.

    Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T and Tramontano A

    Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland 20850, USA. moult@umbi.umd.edu

    This paper is an introduction to the supplemental issue of the journal PROTEINS, dedicated to the seventh CASP experiment to assess the state of the art in protein structure prediction. The paper describes the conduct of the experiment, the categories of prediction included, and outlines the evaluation and assessment procedures. Highlights are improvements in model accuracy relative to that obtainable from knowledge of a single best template structure; convergence of the accuracy of models produced by automatic servers toward that produced by human modeling teams; the emergence of methods for predicting the quality of models; and rapidly increasing practical applications of the methods.

    Funded by: NIGMS NIH HHS: GM072354; NLM NIH HHS: LM07085; Wellcome Trust: 077198

    Proteins 2007;69 Suppl 8;3-9

  • Integrating sequence and structural biology with DAS.

    Prlić A, Down TA, Kulesha E, Finn RD, Kähäri A and Hubbard TJ

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK. ap3@sanger.ac.uk

    Background: The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequence.

    Results: Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments, three dimensional molecular structure data, add the possibility for registration and discovery of DAS servers, and provide a convention how to provide different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources.

    Conclusion: Our DAS extensions are essential for the management of the growing number of services and exchange of diverse biological data sets. In addition the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at http://www.dasregistry.org.

    Funded by: Wellcome Trust: 062023, 077198

    BMC bioinformatics 2007;8;333

  • Lessons learned from the initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig chromosome 17.

    Hart EA, Caccamo M, Harrow JL, Humphray SJ, Gilbert JG, Trevanion S, Hubbard T, Rogers J and Rothschild MF

    Wellcome Trust Sanger Institute, Wellcome Tust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. eah@sanger.ac.uk

    Background: We describe here the sequencing, annotation and comparative analysis of an 8 Mb region of pig chromosome 17, which provides a useful test region to assess coverage and quality for the pig genome sequencing project. We report our findings comparing the annotation of draft sequence assembled at different depths of coverage.

    Results: Within this region we annotated 71 loci, of which 53 are orthologous to human known coding genes. When compared to the syntenic regions in human (20q13.13-q13.33) and mouse (chromosome 2, 167.5 Mb-178.3 Mb), this region was found to be highly conserved with respect to gene order. The most notable difference between the three species is the presence of a large expansion of zinc finger coding genes and pseudogenes on mouse chromosome 2 between Edn3 and Phactr3 that is absent from pig and human. All of our annotation has been made publicly available in the Vertebrate Genome Annotation browser, VEGA. We assessed the impact of coverage on sequence assembly across this region and found, as expected, that increased sequence depth resulted in fewer, longer contigs. One-third of our annotated loci could not be fully re-aligned back to the low coverage version of the sequence, principally because the transcripts are fragmented over several contigs.

    Conclusion: We have demonstrated the considerable advantages of sequencing at increased read depths and discuss the implications that lower coverage sequence may have on subsequent comparative and functional studies, particularly those involving complex loci such as GNAS.

    Funded by: Biotechnology and Biological Sciences Research Council: BBE0116401; Wellcome Trust: 077198

    Genome biology 2007;8;8;R168

  • New tools and expanded data analysis capabilities at the Protein Structure Prediction Center.

    Kryshtafovych A, Prlic A, Dmytriv Z, Daniluk P, Milostan M, Eyrich V, Hubbard T and Fidelis K

    Genome Center, University of California, Davis, California 95616, USA.

    We outline the main tasks performed by the Protein Structure Prediction Center in support of the CASP7 experiment and provide a brief review of the major measures used in the automatic evaluation of predictions. We describe in more detail the software developed to facilitate analysis of modeling success over and beyond the available templates and the adopted Java-based tool enabling visualization of multiple structural superpositions between target and several models/templates. We also give an overview of the CASP infrastructure provided by the Center and discuss the organization of the results web pages available through http://predictioncenter.org.

    Funded by: NLM NIH HHS: LM07085-01; Wellcome Trust: 077198

    Proteins 2007;69 Suppl 8;19-26

2006 Publications

  • The DNA sequence and biological annotation of human chromosome 1.

    Gregory SG, Barlow KF, McLay KE, Kaul R, Swarbreck D, Dunham A, Scott CE, Howe KL, Woodfine K, Spencer CC, Jones MC, Gillson C, Searle S, Zhou Y, Kokocinski F, McDonald L, Evans R, Phillips K, Atkinson A, Cooper R, Jones C, Hall RE, Andrews TD, Lloyd C, Ainscough R, Almeida JP, Ambrose KD, Anderson F, Andrew RW, Ashwell RI, Aubin K, Babbage AK, Bagguley CL, Bailey J, Beasley H, Bethel G, Bird CP, Bray-Allen S, Brown JY, Brown AJ, Buckley D, Burton J, Bye J, Carder C, Chapman JC, Clark SY, Clarke G, Clee C, Cobley V, Collier RE, Corby N, Coville GJ, Davies J, Deadman R, Dunn M, Earthrowl M, Ellington AG, Errington H, Frankish A, Frankland J, French L, Garner P, Garnett J, Gay L, Ghori MR, Gibson R, Gilby LM, Gillett W, Glithero RJ, Grafham DV, Griffiths C, Griffiths-Jones S, Grocock R, Hammond S, Harrison ES, Hart E, Haugen E, Heath PD, Holmes S, Holt K, Howden PJ, Hunt AR, Hunt SE, Hunter G, Isherwood J, James R, Johnson C, Johnson D, Joy A, Kay M, Kershaw JK, Kibukawa M, Kimberley AM, King A, Knights AJ, Lad H, Laird G, Lawlor S, Leongamornlert DA, Lloyd DM, Loveland J, Lovell J, Lush MJ, Lyne R, Martin S, Mashreghi-Mohammadi M, Matthews L, Matthews NS, McLaren S, Milne S, Mistry S, Moore MJ, Nickerson T, O'Dell CN, Oliver K, Palmeiri A, Palmer SA, Parker A, Patel D, Pearce AV, Peck AI, Pelan S, Phelps K, Phillimore BJ, Plumb R, Rajan J, Raymond C, Rouse G, Saenphimmachak C, Sehra HK, Sheridan E, Shownkeen R, Sims S, Skuce CD, Smith M, Steward C, Subramanian S, Sycamore N, Tracey A, Tromans A, Van Helmond Z, Wall M, Wallis JM, White S, Whitehead SL, Wilkinson JE, Willey DL, Williams H, Wilming L, Wray PW, Wu Z, Coulson A, Vaudin M, Sulston JE, Durbin R, Hubbard T, Wooster R, Dunham I, Carter NP, McVean G, Ross MT, Harrow J, Olson MV, Beck S, Rogers J, Bentley DR, Banerjee R, Bryant SP, Burford DC, Burrill WD, Clegg SM, Dhami P, Dovey O, Faulkner LM, Gribble SM, Langford CF, Pandian RD, Porter KM and Prigmore E

    The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. sgregory@chg.duhs.duke.edu

    The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation reveal signals of recent selection in specific genes that may contribute to human fitness, and also in regions where no function is evident. Fine-scale recombination occurs in hotspots of varying intensity along the sequence, and is enriched near genes. These and other studies of human biology and disease encoded within chromosome 1 are made possible with the highly accurate annotated sequence, as part of the completed set of chromosome sequences that comprise the reference human genome.

    Funded by: Medical Research Council: G0000107; Wellcome Trust

    Nature 2006;441;7091;315-21

  • DNA sequence of human chromosome 17 and analysis of rearrangement in the human lineage.

    Zody MC, Garber M, Adams DJ, Sharpe T, Harrow J, Lupski JR, Nicholson C, Searle SM, Wilming L, Young SK, Abouelleil A, Allen NR, Bi W, Bloom T, Borowsky ML, Bugalter BE, Butler J, Chang JL, Chen CK, Cook A, Corum B, Cuomo CA, de Jong PJ, DeCaprio D, Dewar K, FitzGerald M, Gilbert J, Gibson R, Gnerre S, Goldstein S, Grafham DV, Grocock R, Hafez N, Hagopian DS, Hart E, Norman CH, Humphray S, Jaffe DB, Jones M, Kamal M, Khodiyar VK, LaButti K, Laird G, Lehoczky J, Liu X, Lokyitsang T, Loveland J, Lui A, Macdonald P, Major JE, Matthews L, Mauceli E, McCarroll SA, Mihalev AH, Mudge J, Nguyen C, Nicol R, O'Leary SB, Osoegawa K, Schwartz DC, Shaw-Smith C, Stankiewicz P, Steward C, Swarbreck D, Venkataraman V, Whittaker CA, Yang X, Zimmer AR, Bradley A, Hubbard T, Birren BW, Rogers J, Lander ES and Nusbaum C

    Broad Institute of MIT and Harvard, 7 Cambridge Center, Massachusetts 02142, USA.

    Chromosome 17 is unusual among the human chromosomes in many respects. It is the largest human autosome with orthology to only a single mouse chromosome, mapping entirely to the distal half of mouse chromosome 11. Chromosome 17 is rich in protein-coding genes, having the second highest gene density in the genome. It is also enriched in segmental duplications, ranking third in density among the autosomes. Here we report a finished sequence for human chromosome 17, as well as a structural comparison with the finished sequence for mouse chromosome 11, the first finished mouse chromosome. Comparison of the orthologous regions reveals striking differences. In contrast to the typical pattern seen in mammalian evolution, the human sequence has undergone extensive intrachromosomal rearrangement, whereas the mouse sequence has been remarkably stable. Moreover, although the human sequence has a high density of segmental duplication, the mouse sequence has a very low density. Notably, these segmental duplications correspond closely to the sites of structural rearrangement, demonstrating a link between duplication and rearrangement. Examination of the main classes of duplicated segments provides insight into the dynamics underlying expansion of chromosome-specific, low-copy repeats in the human genome.

    Funded by: Medical Research Council: G0000107; Wellcome Trust: 077187

    Nature 2006;440;7087;1045-9

  • Genomic anatomy of the Tyrp1 (brown) deletion complex.

    Smyth IM, Wilming L, Lee AW, Taylor MS, Gautier P, Barlow K, Wallis J, Martin S, Glithero R, Phillimore B, Pelan S, Andrew R, Holt K, Taylor R, McLaren S, Burton J, Bailey J, Sims S, Squares J, Plumb B, Joy A, Gibson R, Gilbert J, Hart E, Laird G, Loveland J, Mudge J, Steward C, Swarbreck D, Harrow J, North P, Leaves N, Greystrong J, Coppola M, Manjunath S, Campbell M, Smith M, Strachan G, Tofts C, Boal E, Cobley V, Hunter G, Kimberley C, Thomas D, Cave-Berry L, Weston P, Botcherby MR, White S, Edgar R, Cross SH, Irvani M, Hummerich H, Simpson EH, Johnson D, Hunsicker PR, Little PF, Hubbard T, Campbell RD, Rogers J and Jackson IJ

    Medical Research Council Human Genetics Unit, Edinburgh EH4 2XU, United Kingdom.

    Chromosome deletions in the mouse have proven invaluable in the dissection of gene function. The brown deletion complex comprises >28 independent genome rearrangements, which have been used to identify several functional loci on chromosome 4 required for normal embryonic and postnatal development. We have constructed a 172-bacterial artificial chromosome contig that spans this 22-megabase (Mb) interval and have produced a contiguous, finished, and manually annotated sequence from these clones. The deletion complex is strikingly gene-poor, containing only 52 protein-coding genes (of which only 39 are supported by human homologues) and has several further notable genomic features, including several segments of >1 Mb, apparently devoid of a coding sequence. We have used sequence polymorphisms to finely map the deletion breakpoints and identify strong candidate genes for the known phenotypes that map to this region, including three lethal loci (l4Rn1, l4Rn2, and l4Rn3) and the fitness mutant brown-associated fitness (baf). We have also characterized misexpression of the basonuclin homologue, Bnc2, associated with the inversion-mediated coat color mutant white-based brown (B(w)). This study provides a molecular insight into the basis of several characterized mouse mutants, which will allow further dissection of this region by targeted or chemical mutagenesis.

    Funded by: Medical Research Council: MC_U123160651, MC_U127561112; Wellcome Trust

    Proceedings of the National Academy of Sciences of the United States of America 2006;103;10;3704-9

  • Ensembl 2006.

    Birney E, Andrews D, Caccamo M, Chen Y, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Flicek P, Gräf S, Hammond M, Herrero J, Howe K, Iyer V, Jekosch K, Kähäri A, Kasprzyk A, Keefe D, Kokocinski F, Kulesha E, London D, Longden I, Melsopp C, Meidl P, Overduin B, Parker A, Proctor G, Prlic A, Rae M, Rios D, Redmond S, Schuster M, Sealy I, Searle S, Severin J, Slater G, Smedley D, Smith J, Stabenau A, Stalker J, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C and Hubbard TJ

    European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. birney@ebi.ac.uk

    The Ensembl (http://www.ensembl.org/) project provides a comprehensive and integrated source of annotation of large genome sequences. Over the last year the number of genomes available from the Ensembl site has increased from 4 to 19, with the addition of the mammalian genomes of Rhesus macaque and Opossum, the chordate genome of Ciona intestinalis and the import and integration of the yeast genome. The year has also seen extensive improvements to both data analysis and presentation, with the introduction of a redesigned website, the addition of RNA gene and regulatory annotation and substantial improvements to the integration of human genome variation data.

    Funded by: Biotechnology and Biological Sciences Research Council: BBS/B/13446, BBS/B/13470; Wellcome Trust

    Nucleic acids research 2006;34;Database issue;D556-61

  • A machine learning strategy to identify candidate binding sites in human protein-coding sequence.

    Down T, Leong B and Hubbard TJ

    Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK. td2@sanger.ac.uk

    Background: The splicing of RNA transcripts is thought to be partly promoted and regulated by sequences embedded within exons. Known sequences include binding sites for SR proteins, which are thought to mediate interactions between splicing factors bound to the 5' and 3' splice sites. It would be useful to identify further candidate sequences, however identifying them computationally is hard since exon sequences are also constrained by their functional role in coding for proteins.

    Results: This strategy identified a collection of motifs including several previously reported splice enhancer elements. Although only trained on coding exons, the model discriminates both coding and non-coding exons from intragenic sequence.

    Conclusion: We have trained a computational model able to detect signals in coding exons which seem to be orthogonal to the sequences' primary function of coding for proteins. We believe that many of the motifs detected here represent binding sites for both previously unrecognized proteins which influence RNA splicing as well as other regulatory elements.

    Funded by: Wellcome Trust: 077198

    BMC bioinformatics 2006;7;419

  • EGASP: the human ENCODE Genome Annotation Assessment Project.

    Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE and Reese MG

    Centre de Regulació Genòmica, Institut Municipal d'Investigació Mèdica-Universitat Pompeu Fabra, E08003 Barcelona, Catalonia, Spain. rguigo@imim.es

    Background: We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.

    Results: The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.

    Conclusion: This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.

    Funded by: Medical Research Council: G8225539

    Genome biology 2006;7 Suppl 1;S2.1-31

  • GENCODE: producing a reference annotation for ENCODE.

    Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, Antonarakis SE and Guigo R

    Wellcome Trust Sanger Institute, Wellcome Trust Campus, Hinxton, Cambridge CB10 1SA, UK. jla1@sanger.ac.uk

    Background: The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.

    Results: The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.

    Conclusion: In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of GENCODE annotation with RefSeq and ENSEMBL show only 40% of GENCODE exons are contained within the two sets, which is a reflection of the high number of alternative splice forms with unique exons annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP and the GENCODE collaboration is continuing to refine its annotation of 1% human genome with the aid of experimental validation.

    Genome biology 2006;7 Suppl 1;S4.1-9

2005 Publications

  • Adding some SPICE to DAS.

    Prlić A, Down TA and Hubbard TJ

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus Hinxton, Cambridge, UK. ap3@sanger.ac.uk

    Unlabelled: The distributed annotation system (DAS) defines a communication protocol used to exchange biological annotations. It is motivated by the idea that annotations should not be provided by single centralized databases but instead be spread over multiple sites. Data distribution, performed by DAS servers, is separated from visualization, which is carried out by DAS clients. The original DAS protocol was designed to serve annotation of genomic sequences. We have extended the protocol to be applicable to macromolecular structures. Here we present SPICE, a new DAS client that can be used to visualize protein sequence and structure annotations.

    Availability: http://www.efamily.org.uk/software/dasclients/spice/

    Bioinformatics (Oxford, England) 2005;21 Suppl 2;ii40-1

  • The DNA sequence of the human X chromosome.

    Ross MT, Grafham DV, Coffey AJ, Scherer S, McLay K, Muzny D, Platzer M, Howell GR, Burrows C, Bird CP, Frankish A, Lovell FL, Howe KL, Ashurst JL, Fulton RS, Sudbrak R, Wen G, Jones MC, Hurles ME, Andrews TD, Scott CE, Searle S, Ramser J, Whittaker A, Deadman R, Carter NP, Hunt SE, Chen R, Cree A, Gunaratne P, Havlak P, Hodgson A, Metzker ML, Richards S, Scott G, Steffen D, Sodergren E, Wheeler DA, Worley KC, Ainscough R, Ambrose KD, Ansari-Lari MA, Aradhya S, Ashwell RI, Babbage AK, Bagguley CL, Ballabio A, Banerjee R, Barker GE, Barlow KF, Barrett IP, Bates KN, Beare DM, Beasley H, Beasley O, Beck A, Bethel G, Blechschmidt K, Brady N, Bray-Allen S, Bridgeman AM, Brown AJ, Brown MJ, Bonnin D, Bruford EA, Buhay C, Burch P, Burford D, Burgess J, Burrill W, Burton J, Bye JM, Carder C, Carrel L, Chako J, Chapman JC, Chavez D, Chen E, Chen G, Chen Y, Chen Z, Chinault C, Ciccodicola A, Clark SY, Clarke G, Clee CM, Clegg S, Clerc-Blankenburg K, Clifford K, Cobley V, Cole CG, Conquer JS, Corby N, Connor RE, David R, Davies J, Davis C, Davis J, Delgado O, Deshazo D, Dhami P, Ding Y, Dinh H, Dodsworth S, Draper H, Dugan-Rocha S, Dunham A, Dunn M, Durbin KJ, Dutta I, Eades T, Ellwood M, Emery-Cohen A, Errington H, Evans KL, Faulkner L, Francis F, Frankland J, Fraser AE, Galgoczy P, Gilbert J, Gill R, Glöckner G, Gregory SG, Gribble S, Griffiths C, Grocock R, Gu Y, Gwilliam R, Hamilton C, Hart EA, Hawes A, Heath PD, Heitmann K, Hennig S, Hernandez J, Hinzmann B, Ho S, Hoffs M, Howden PJ, Huckle EJ, Hume J, Hunt PJ, Hunt AR, Isherwood J, Jacob L, Johnson D, Jones S, de Jong PJ, Joseph SS, Keenan S, Kelly S, Kershaw JK, Khan Z, Kioschis P, Klages S, Knights AJ, Kosiura A, Kovar-Smith C, Laird GK, Langford C, Lawlor S, Leversha M, Lewis L, Liu W, Lloyd C, Lloyd DM, Loulseged H, Loveland JE, Lovell JD, Lozado R, Lu J, Lyne R, Ma J, Maheshwari M, Matthews LH, McDowall J, McLaren S, McMurray A, Meidl P, Meitinger T, Milne S, Miner G, Mistry SL, Morgan M, Morris S, Müller I, Mullikin JC, Nguyen N, Nordsiek G, Nyakatura G, O'Dell CN, Okwuonu G, Palmer S, Pandian R, Parker D, Parrish J, Pasternak S, Patel D, Pearce AV, Pearson DM, Pelan SE, Perez L, Porter KM, Ramsey Y, Reichwald K, Rhodes S, Ridler KA, Schlessinger D, Schueler MG, Sehra HK, Shaw-Smith C, Shen H, Sheridan EM, Shownkeen R, Skuce CD, Smith ML, Sotheran EC, Steingruber HE, Steward CA, Storey R, Swann RM, Swarbreck D, Tabor PE, Taudien S, Taylor T, Teague B, Thomas K, Thorpe A, Timms K, Tracey A, Trevanion S, Tromans AC, d'Urso M, Verduzco D, Villasana D, Waldron L, Wall M, Wang Q, Warren J, Warry GL, Wei X, West A, Whitehead SL, Whiteley MN, Wilkinson JE, Willey DL, Williams G, Williams L, Williamson A, Williamson H, Wilming L, Woodmansey RL, Wray PW, Yen J, Zhang J, Zhou J, Zoghbi H, Zorilla S, Buck D, Reinhardt R, Poustka A, Rosenthal A, Lehrach H, Meindl A, Minx PJ, Hillier LW, Willard HF, Wilson RK, Waterston RH, Rice CM, Vaudin M, Coulson A, Nelson DL, Weinstock G, Sulston JE, Durbin R, Hubbard T, Gibbs RA, Beck S, Rogers J and Bentley DR

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. mtr@sanger.ac.uk

    The human X chromosome has a unique biology that was shaped by its evolution as the sex chromosome shared by males and females. We have determined 99.3% of the euchromatic sequence of the X chromosome. Our analysis illustrates the autosomal origin of the mammalian sex chromosomes, the stepwise process that led to the progressive loss of recombination between X and Y, and the extent of subsequent degradation of the Y chromosome. LINE1 repeat elements cover one-third of the X chromosome, with a distribution that is consistent with their proposed role as way stations in the process of X-chromosome inactivation. We found 1,098 genes in the sequence, of which 99 encode proteins expressed in testis and in various tumour types. A disproportionately high number of mendelian diseases are documented for the X chromosome. Of this number, 168 have been explained by mutations in 113 X-linked genes, which in many cases were characterized with the aid of the DNA sequence.

    Nature 2005;434;7031;325-37

  • The Vertebrate Genome Annotation (Vega) database.

    Ashurst JL, Chen CK, Gilbert JG, Jekosch K, Keenan S, Meidl P, Searle SM, Stalker J, Storey R, Trevanion S, Wilming L and Hubbard T

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. jla1@sanger.ac.uk

    The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) has been designed to be a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes. Its core database is based on an Ensembl-style schema, extended to incorporate curation-specific metadata. In collaboration with the genome sequencing centres, Vega attempts to present consistent high-quality annotation of the published human chromosome sequences. In addition, it is also possible to view various finished regions from other vertebrates, including mouse and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. Attempts have been made to standardize the annotation procedure across each vertebrate genome, which should aid comparative analysis of orthologues across the different finished regions.

    Nucleic acids research 2005;33;Database issue;D459-65

  • Ensembl 2005.

    Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F, Curwen V, Cutts T, Down T, Durbin R, Fernandez-Suarez XM, Gilbert J, Hammond M, Herrero J, Hotz H, Howe K, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Kokocinsci F, London D, Longden I, McVicker G, Melsopp C, Meidl P, Potter S, Proctor G, Rae M, Rios D, Schuster M, Searle S, Severin J, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Trevanion S, Ureta-Vidal A, Vogel J, White S, Woodwark C and Birney E

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Ensembl (http://www.ensembl.org/) project provides a comprehensive and integrated source of annotation of large genome sequences. Over the last year the number of genomes available from the Ensembl site has increased by 7 to 16, with the addition of the six vertebrate genomes of chimpanzee, dog, cow, chicken, tetraodon and frog and the insect genome of honeybee. The majority have been annotated automatically using the Ensembl gene build system, showing its flexibility to reliably annotate a wide variety of genomes. With the increased number of vertebrate genomes, the comparative analysis provided to users has been greatly improved, with new website interfaces allowing annotation of different genomes to be directly compared. The Ensembl software system is being increasingly widely reused in different projects showing the benefits of a completely open approach to software development and distribution.

    Nucleic acids research 2005;33;Database issue;D447-53

  • Critical assessment of methods of protein structure prediction (CASP)--round 6.

    Moult J, Fidelis K, Rost B, Hubbard T and Tramontano A

    Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland 20850, USA. moult@umbi.umd.edu

    This article is an introduction to the special issue of the journal Proteins, dedicated to the sixth CASP experiment to assess the state of the art in protein structure prediction. The article describes the conduct of the experiment and the categories of prediction included, and outlines the evaluation and assessment procedures. A brief summary of progress over the decade of CASP experiments is also provided.

    Funded by: NIGMS NIH HHS: GM072354; NLM NIH HHS: LM07085

    Proteins 2005;61 Suppl 7;3-7

  • NestedMICA: sensitive inference of over-represented motifs in nucleic acid sequence.

    Down TA and Hubbard TJ

    Wellcome Trust Sanger Institute, Hinxton Cambridge, CB10 1SA, UK. td2@sanger.ac.uk

    NestedMICA is a new, scalable, pattern-discovery system for finding transcription factor binding sites and similar motifs in biological sequences. Like several previous methods, NestedMICA tackles this problem by optimizing a probabilistic mixture model to fit a set of sequences. However, the use of a newly developed inference strategy called Nested Sampling means NestedMICA is able to find optimal solutions without the need for a problematic initialization or seeding step. We investigate the performance of NestedMICA in a range scenario, on synthetic data and a well-characterized set of muscle regulatory regions, and compare it with the popular MEME program. We show that the new method is significantly more sensitive than MEME: in one case, it successfully extracted a target motif from background sequence four times longer than could be handled by the existing program. It also performs robustly on synthetic sequences containing multiple significant motifs. When tested on a real set of regulatory sequences, NestedMICA produced motifs which were good predictors for all five abundant classes of annotated binding sites.

    Nucleic acids research 2005;33;5;1445-53

2004 Publications

  • The ENCODE (ENCyclopedia Of DNA Elements) Project.

    ENCODE Project Consortium

    The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (approximately 1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.

    Funded by: NHGRI NIH HHS: R01 HG003143

    Science (New York, N.Y.) 2004;306;5696;636-40

  • Finishing the euchromatic sequence of the human genome.

    International Human Genome Sequencing Consortium

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

    Nature 2004;431;7011;931-45

  • What can we learn from noncoding regions of similarity between genomes?

    Down TA and Hubbard TJ

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. td2@sanger.ac.uk

    Background: In addition to known protein-coding genes, large amounts of apparently non-coding sequence are conserved between the human and mouse genomes. It seems reasonable to assume that these conserved regions are more likely to contain functional elements than less-conserved portions of the genome.

    Methods: Here we used a motif-oriented machine learning method based on the Relevance Vector Machine algorithm to extract the strongest signal from a set of non-coding conserved sequences.

    Results: We successfully fitted models to reflect the non-coding sequences, and showed that the results were quite consistent for repeated training runs. Using the learned models to scan genomic sequence, we found that they often made predictions close to the start of annotated genes. We compared this method with other published promoter-prediction systems, and showed that the set of promoters which are detected by this method is substantially similar to that detected by existing methods.

    Conclusions: The results presented here indicate that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome. They also lend support to the belief that there exists a substantial subset of promoter regions which share several common features including, but not restricted to, a relative abundance of CpG dinucleotides. This subset is detectable by a variety of distinct computational methods.

    BMC bioinformatics 2004;5;131

  • The DNA sequence and comparative analysis of human chromosome 10.

    Deloukas P, Earthrowl ME, Grafham DV, Rubenfield M, French L, Steward CA, Sims SK, Jones MC, Searle S, Scott C, Howe K, Hunt SE, Andrews TD, Gilbert JG, Swarbreck D, Ashurst JL, Taylor A, Battles J, Bird CP, Ainscough R, Almeida JP, Ashwell RI, Ambrose KD, Babbage AK, Bagguley CL, Bailey J, Banerjee R, Bates K, Beasley H, Bray-Allen S, Brown AJ, Brown JY, Burford DC, Burrill W, Burton J, Cahill P, Camire D, Carter NP, Chapman JC, Clark SY, Clarke G, Clee CM, Clegg S, Corby N, Coulson A, Dhami P, Dutta I, Dunn M, Faulkner L, Frankish A, Frankland JA, Garner P, Garnett J, Gribble S, Griffiths C, Grocock R, Gustafson E, Hammond S, Harley JL, Hart E, Heath PD, Ho TP, Hopkins B, Horne J, Howden PJ, Huckle E, Hynds C, Johnson C, Johnson D, Kana A, Kay M, Kimberley AM, Kershaw JK, Kokkinaki M, Laird GK, Lawlor S, Lee HM, Leongamornlert DA, Laird G, Lloyd C, Lloyd DM, Loveland J, Lovell J, McLaren S, McLay KE, McMurray A, Mashreghi-Mohammadi M, Matthews L, Milne S, Nickerson T, Nguyen M, Overton-Larty E, Palmer SA, Pearce AV, Peck AI, Pelan S, Phillimore B, Porter K, Rice CM, Rogosin A, Ross MT, Sarafidou T, Sehra HK, Shownkeen R, Skuce CD, Smith M, Standring L, Sycamore N, Tester J, Thorpe A, Torcasso W, Tracey A, Tromans A, Tsolas J, Wall M, Walsh J, Wang H, Weinstock K, West AP, Willey DL, Whitehead SL, Wilming L, Wray PW, Young L, Chen Y, Lovering RC, Moschonas NK, Siebert R, Fechtel K, Bentley D, Durbin R, Hubbard T, Doucette-Stamm L, Beck S, Smith DR and Rogers J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK. panos@sanger.ac.uk

    The finished sequence of human chromosome 10 comprises a total of 131,666,441 base pairs. It represents 99.4% of the euchromatic DNA and includes one megabase of heterochromatic sequence within the pericentromeric region of the short and long arm of the chromosome. Sequence annotation revealed 1,357 genes, of which 816 are protein coding, and 430 are pseudogenes. We observed widespread occurrence of overlapping coding genes (either strand) and identified 67 antisense transcripts. Our analysis suggests that both inter- and intrachromosomal segmental duplications have impacted on the gene count on chromosome 10. Multispecies comparative analysis indicated that we can readily annotate the protein-coding genes with current resources. We estimate that over 95% of all coding exons were identified in this study. Assessment of single base changes between the human chromosome 10 and chimpanzee sequence revealed nonsense mutations in only 21 coding genes with respect to the human sequence.

    Nature 2004;429;6990;375-81

  • DNA sequence and analysis of human chromosome 9.

    Humphray SJ, Oliver K, Hunt AR, Plumb RW, Loveland JE, Howe KL, Andrews TD, Searle S, Hunt SE, Scott CE, Jones MC, Ainscough R, Almeida JP, Ambrose KD, Ashwell RI, Babbage AK, Babbage S, Bagguley CL, Bailey J, Banerjee R, Barker DJ, Barlow KF, Bates K, Beasley H, Beasley O, Bird CP, Bray-Allen S, Brown AJ, Brown JY, Burford D, Burrill W, Burton J, Carder C, Carter NP, Chapman JC, Chen Y, Clarke G, Clark SY, Clee CM, Clegg S, Collier RE, Corby N, Crosier M, Cummings AT, Davies J, Dhami P, Dunn M, Dutta I, Dyer LW, Earthrowl ME, Faulkner L, Fleming CJ, Frankish A, Frankland JA, French L, Fricker DG, Garner P, Garnett J, Ghori J, Gilbert JG, Glison C, Grafham DV, Gribble S, Griffiths C, Griffiths-Jones S, Grocock R, Guy J, Hall RE, Hammond S, Harley JL, Harrison ES, Hart EA, Heath PD, Henderson CD, Hopkins BL, Howard PJ, Howden PJ, Huckle E, Johnson C, Johnson D, Joy AA, Kay M, Keenan S, Kershaw JK, Kimberley AM, King A, Knights A, Laird GK, Langford C, Lawlor S, Leongamornlert DA, Leversha M, Lloyd C, Lloyd DM, Lovell J, Martin S, Mashreghi-Mohammadi M, Matthews L, McLaren S, McLay KE, McMurray A, Milne S, Nickerson T, Nisbett J, Nordsiek G, Pearce AV, Peck AI, Porter KM, Pandian R, Pelan S, Phillimore B, Povey S, Ramsey Y, Rand V, Scharfe M, Sehra HK, Shownkeen R, Sims SK, Skuce CD, Smith M, Steward CA, Swarbreck D, Sycamore N, Tester J, Thorpe A, Tracey A, Tromans A, Thomas DW, Wall M, Wallis JM, West AP, Whitehead SL, Willey DL, Williams SA, Wilming L, Wray PW, Young L, Ashurst JL, Coulson A, Blöcker H, Durbin R, Sulston JE, Hubbard T, Jackson MJ, Bentley DR, Beck S, Rogers J and Dunham I

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. sjh@sanger.ac.uk

    Chromosome 9 is highly structurally polymorphic. It contains the largest autosomal block of heterochromatin, which is heteromorphic in 6-8% of humans, whereas pericentric inversions occur in more than 1% of the population. The finished euchromatic sequence of chromosome 9 comprises 109,044,351 base pairs and represents >99.6% of the region. Analysis of the sequence reveals many intra- and interchromosomal duplications, including segmental duplications adjacent to both the centromere and the large heterochromatic block. We have annotated 1,149 genes, including genes implicated in male-to-female sex reversal, cancer and neurodegenerative disease, and 426 pseudogenes. The chromosome contains the largest interferon gene cluster in the human genome. There is also a region of exceptionally high gene and G + C content including genes paralogous to those in the major histocompatibility complex. We have also detected recently duplicated genes that exhibit different rates of sequence divergence, presumably reflecting natural selection.

    Nature 2004;429;6990;369-74

  • Domain insertions in protein structures.

    Aroul-Selvam R, Hubbard T and Sasidharan R

    The Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    Domains are the structural, functional or evolutionary units of proteins. Proteins can comprise a single domain or a combination of domains. In multi-domain proteins, the domains almost always occur end-to-end, i.e., one domain follows the C-terminal end of another domain. However, there are exceptions to this common pattern, where multi-domain proteins are formed by insertion of one domain (insert) into another domain (parent). Here, we provide a quantitative description of known insertions in the Protein Data Bank (PDB). We found that 9% of domain combinations observed in non-redundant PDB are insertions. Although 90% of all insertions involve only one insert, proteins can clearly have multiple (nested, two-domain and three-domain) inserts. We also observed correlations between the structure and function of a domain and its tendency to be found as a parent or an insert. There is a bias in insert position towards the C terminus of parents. We observed that the atomic distance between the N and C terminus of an insert is significantly smaller when compared to the N-to-C distance in a parent context or a single domain context. Insertions are found always to occur in loop regions of parent domains. Our observations regarding the relationship between domain insertions and the structure, function and evolution of proteins have implications for protein engineering.

    Journal of molecular biology 2004;338;4;633-41

  • An overview of Ensembl.

    Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz HR, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark KC, Cameron G, Durbin R, Cox A, Hubbard T and Clamp M

    EMBL European Bioinformatics Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. birney@ebi.ac.uk

    Ensembl (http://www.ensembl.org/) is a bioinformatics project to organize biological information around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of individual genomes, and of the synteny and orthology relationships between them. It is also a framework for integration of any biological data that can be mapped onto features derived from the genomic sequence. Ensembl is available as an interactive Web site, a set of flat files, and as a complete, portable open source software system for handling genomes. All data are provided without restriction, and code is freely available. Ensembl's aims are to continue to "widen" this biological integration to include other model organisms relevant to understanding human biology as they become available; to "deepen" this integration to provide an ever more seamless linkage between equivalent components in different species; and to provide further classification of functional elements in the genome that have been previously elusive.

    Funded by: Wellcome Trust: 062023

    Genome research 2004;14;5;925-8

  • The DNA sequence and analysis of human chromosome 13.

    Dunham A, Matthews LH, Burton J, Ashurst JL, Howe KL, Ashcroft KJ, Beare DM, Burford DC, Hunt SE, Griffiths-Jones S, Jones MC, Keenan SJ, Oliver K, Scott CE, Ainscough R, Almeida JP, Ambrose KD, Andrews DT, Ashwell RI, Babbage AK, Bagguley CL, Bailey J, Bannerjee R, Barlow KF, Bates K, Beasley H, Bird CP, Bray-Allen S, Brown AJ, Brown JY, Burrill W, Carder C, Carter NP, Chapman JC, Clamp ME, Clark SY, Clarke G, Clee CM, Clegg SC, Cobley V, Collins JE, Corby N, Coville GJ, Deloukas P, Dhami P, Dunham I, Dunn M, Earthrowl ME, Ellington AG, Faulkner L, Frankish AG, Frankland J, French L, Garner P, Garnett J, Gilbert JG, Gilson CJ, Ghori J, Grafham DV, Gribble SM, Griffiths C, Hall RE, Hammond S, Harley JL, Hart EA, Heath PD, Howden PJ, Huckle EJ, Hunt PJ, Hunt AR, Johnson C, Johnson D, Kay M, Kimberley AM, King A, Laird GK, Langford CJ, Lawlor S, Leongamornlert DA, Lloyd DM, Lloyd C, Loveland JE, Lovell J, Martin S, Mashreghi-Mohammadi M, McLaren SJ, McMurray A, Milne S, Moore MJ, Nickerson T, Palmer SA, Pearce AV, Peck AI, Pelan S, Phillimore B, Porter KM, Rice CM, Searle S, Sehra HK, Shownkeen R, Skuce CD, Smith M, Steward CA, Sycamore N, Tester J, Thomas DW, Tracey A, Tromans A, Tubby B, Wall M, Wallis JM, West AP, Whitehead SL, Willey DL, Wilming L, Wray PW, Wright MW, Young L, Coulson A, Durbin R, Hubbard T, Sulston JE, Beck S, Bentley DR, Rogers J and Ross MT

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK. ad1@sanger.ac.uk

    Chromosome 13 is the largest acrocentric human chromosome. It carries genes involved in cancer including the breast cancer type 2 (BRCA2) and retinoblastoma (RB1) genes, is frequently rearranged in B-cell chronic lymphocytic leukaemia, and contains the DAOA locus associated with bipolar disorder and schizophrenia. We describe completion and analysis of 95.5 megabases (Mb) of sequence from chromosome 13, which contains 633 genes and 296 pseudogenes. We estimate that more than 95.4% of the protein-coding genes of this chromosome have been identified, on the basis of comparison with other vertebrate genome sequences. Additionally, 105 putative non-coding RNA genes were found. Chromosome 13 has one of the lowest gene densities (6.5 genes per Mb) among human chromosomes, and contains a central region of 38 Mb where the gene density drops to only 3.1 genes per Mb.

    Nature 2004;428;6982;522-8

  • A census of human cancer genes.

    Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, Rahman N and Stratton MR

    Cancer Genome Project, Human Genome Analysis Group and Pfam Group, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambs, CB10 1SA, UK.

    Nature reviews. Cancer 2004;4;3;177-83

  • A new trade framework for global healthcare R&D.

    Hubbard T and Love J

    Wellcome Trust Sanger Institute in Hinxton, United Kingdom.

    PLoS biology 2004;2;2;E52

  • Ensembl 2004.

    Birney E, Andrews D, Bevan P, Caccamo M, Cameron G, Chen Y, Clarke L, Coates G, Cox T, Cuff J, Curwen V, Cutts T, Down T, Durbin R, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz H, Iyer V, Kahari A, Jekosch K, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark C, Clamp M and Hubbard T

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organize biology around the sequences of large genomes. It is a comprehensive and integrated source of annotation of large genome sequences, available via interactive website, web services or flat files. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. The facilities of the system range from sequence analysis to data storage and visualization and installations exist around the world both in companies and at academic sites. With a total of nine genome sequences available from Ensembl and more genomes to follow, recent developments have focused mainly on closer integration between genomes and external data.

    Nucleic acids research 2004;32;Database issue;D468-70

  • SCOP database in 2004: refinements integrate structure and sequence family data.

    Andreeva A, Howorth D, Brenner SE, Hubbard TJ, Chothia C and Murzin AG

    MRC Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, UK.

    The Structural Classification of Proteins (SCOP) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are hierarchically classified into families, superfamilies, folds and classes. The continual accumulation of sequence and structural data allows more rigorous analysis and provides important information for understanding the protein world and its evolutionary repertoire. SCOP participates in a project that aims to rationalize and integrate the data on proteins held in several sequence and structure databases. As part of this project, starting with release 1.63, we have initiated a refinement of the SCOP classification, which introduces a number of changes mostly at the levels below superfamily. The pending SCOP reclassification will be carried out gradually through a number of future releases. In addition to the expanded set of static links to external resources, available at the level of domain entries, we have started modernization of the interface capabilities of SCOP allowing more dynamic links with other databases. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.

    Nucleic acids research 2004;32;Database issue;D226-9

2003 Publications

  • The DNA sequence and analysis of human chromosome 6.

    Mungall AJ, Palmer SA, Sims SK, Edwards CA, Ashurst JL, Wilming L, Jones MC, Horton R, Hunt SE, Scott CE, Gilbert JG, Clamp ME, Bethel G, Milne S, Ainscough R, Almeida JP, Ambrose KD, Andrews TD, Ashwell RI, Babbage AK, Bagguley CL, Bailey J, Banerjee R, Barker DJ, Barlow KF, Bates K, Beare DM, Beasley H, Beasley O, Bird CP, Blakey S, Bray-Allen S, Brook J, Brown AJ, Brown JY, Burford DC, Burrill W, Burton J, Carder C, Carter NP, Chapman JC, Clark SY, Clark G, Clee CM, Clegg S, Cobley V, Collier RE, Collins JE, Colman LK, Corby NR, Coville GJ, Culley KM, Dhami P, Davies J, Dunn M, Earthrowl ME, Ellington AE, Evans KA, Faulkner L, Francis MD, Frankish A, Frankland J, French L, Garner P, Garnett J, Ghori MJ, Gilby LM, Gillson CJ, Glithero RJ, Grafham DV, Grant M, Gribble S, Griffiths C, Griffiths M, Hall R, Halls KS, Hammond S, Harley JL, Hart EA, Heath PD, Heathcott R, Holmes SJ, Howden PJ, Howe KL, Howell GR, Huckle E, Humphray SJ, Humphries MD, Hunt AR, Johnson CM, Joy AA, Kay M, Keenan SJ, Kimberley AM, King A, Laird GK, Langford C, Lawlor S, Leongamornlert DA, Leversha M, Lloyd CR, Lloyd DM, Loveland JE, Lovell J, Martin S, Mashreghi-Mohammadi M, Maslen GL, Matthews L, McCann OT, McLaren SJ, McLay K, McMurray A, Moore MJ, Mullikin JC, Niblett D, Nickerson T, Novik KL, Oliver K, Overton-Larty EK, Parker A, Patel R, Pearce AV, Peck AI, Phillimore B, Phillips S, Plumb RW, Porter KM, Ramsey Y, Ranby SA, Rice CM, Ross MT, Searle SM, Sehra HK, Sheridan E, Skuce CD, Smith S, Smith M, Spraggon L, Squares SL, Steward CA, Sycamore N, Tamlyn-Hall G, Tester J, Theaker AJ, Thomas DW, Thorpe A, Tracey A, Tromans A, Tubby B, Wall M, Wallis JM, West AP, White SS, Whitehead SL, Whittaker H, Wild A, Willey DJ, Wilmer TE, Wood JM, Wray PW, Wyatt JC, Young L, Younger RM, Bentley DR, Coulson A, Durbin R, Hubbard T, Sulston JE, Dunham I, Rogers J and Beck S

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. ajm@sanger.ac.uk

    Chromosome 6 is a metacentric chromosome that constitutes about 6% of the human genome. The finished sequence comprises 166,880,988 base pairs, representing the largest chromosome sequenced so far. The entire sequence has been subjected to high-quality manual annotation, resulting in the evidence-supported identification of 1,557 genes and 633 pseudogenes. Here we report that at least 96% of the protein-coding genes have been identified, as assessed by multi-species comparative sequence analysis, and provide evidence for the presence of further, otherwise unsupported exons/genes. Among these are genes directly implicated in cancer, schizophrenia, autoimmunity and many other diseases. Chromosome 6 harbours the largest transfer RNA gene cluster in the genome; we show that this cluster co-localizes with a region of high transcriptional activity. Within the essential immune loci of the major histocompatibility complex, we find HLA-B to be the most polymorphic gene on chromosome 6 and in the human genome.

    Nature 2003;425;6960;805-11

  • ddbRNA: detection of conserved secondary structures in multiple alignments.

    di Bernardo D, Down T and Hubbard T

    Telethon Institute of Genetics and Medicine, Via P Castellino 111, 80133 Naples, Italy. dibernardo@tigem.it

    Motivation: Structured non-coding RNAs (ncRNAs) have a very important functional role in the cell. No distinctive general features common to all ncRNA have yet been discovered. This makes it difficult to design computational tools able to detect novel ncRNAs in the genomic sequence.

    Results: We devised an algorithm able to detect conserved secondary structures in both pairwise and multiple DNA sequence alignments with computational time proportional to the square of the sequence length. We implemented the algorithm for the case of pairwise and three-way alignments and tested it on ncRNAs obtained from public databases. On the test sets, the pairwise algorithm has a specificity greater than 97% with a sensitivity varying from 22.26% for Blast alignments to 56.35% for structural alignments. The three-way algorithm behaves similarly. Our algorithm is able to efficiently detect a conserved secondary structure in multiple alignments.

    Funded by: Telethon: TGM03P17, TGM06S01

    Bioinformatics (Oxford, England) 2003;19;13;1606-11

  • Ensembl 2002: accommodating comparative genomics.

    Clamp M, Andrews D, Barker D, Bevan P, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Hubbard T, Kasprzyk A, Keefe D, Lehvaslaiho H, Iyer V, Melsopp C, Mongin E, Pettett R, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I and Birney E

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of human, mouse and other genome sequences, available as either an interactive web site or as flat files. Ensembl also integrates manually annotated gene structures from external sources where available. As well as being one of the leading sources of genome annotation, Ensembl is an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements. These range from sequence analysis to data storage and visualisation and installations exist around the world in both companies and at academic sites. With both human and mouse genome sequences available and more vertebrate sequences to follow, many of the recent developments in Ensembl have focusing on developing automatic comparative genome analysis and visualisation.

    Nucleic acids research 2003;31;1;38-42

  • CASP5 target classification.

    Kinch LN, Qi Y, Hubbard TJ and Grishin NV

    Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas 75390-9050, USA. lkinch@chop.swmed.edu

    This report summarizes the Critical Assessment of Protein Structure Prediction (CASP5) target proteins, which included 67 experimental models submitted from various structural genomics efforts and independent research groups. Throughout this special issue, CASP5 targets are referred to with the identification numbers T0129-T0195. Several of these targets were excluded from the assessment for various reasons: T0164 and T0166 were cancelled by the organizers; T0131, T0144, T0158, T0163, T0171, T0175, and T0180 were not available in time; T0145 was "natively unfolded"; the T0139 structure became available before the target expired; and T0194 was solved for a different sequence than the one submitted. Table I outlines the sequence and structural information available for CASP5 proteins in the context of existing folds and evolutionary relationships. This information provided the basis for a domain-based classification of the target structures into three assessment categories: comparative modeling (CM), fold recognition (FR), and new fold (NF). The FR category was further subdivided into homologues [FR(H)] and analogs [FR(A)] based on evolutionary considerations, and the overlap between assessment categories was classified as CM/FR(H) and FR(A)/NF. CASP5 domains are illustrated in Figure 1. Examples of nontrivial links between CASP5 target domains and existing structures that support our classifications are provided.

    Proteins 2003;53 Suppl 6;340-51

  • Critical assessment of methods of protein structure prediction (CASP)-round V.

    Moult J, Fidelis K, Zemla A and Hubbard T

    Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland 20850, USA. moult@umbi.umd.edu

    This article provides an introduction to the special issue of the journal Proteins dedicated to the fifth CASP experiment to assess the state of the art in protein structure prediction. The article describes the conduct, the categories of prediction, and the evaluation and assessment procedures of the experiment. A brief summary of progress over the five CASP experiments is provided. Related developments in the field are also described.

    Funded by: NIGMS NIH HHS: GM/DK61967; NLM NIH HHS: LM07085

    Proteins 2003;53 Suppl 6;334-9

2002 Publications

  • Initial sequencing and comparative analysis of the mouse genome.

    Mouse Genome Sequencing Consortium, Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigó R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC and Lander ES

    Genome Sequencing Center, Washington University School of Medicine, Campus Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA. waterston@gs.washington.edu

    The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.

    Nature 2002;420;6915;520-62

  • A physical map of the mouse genome.

    Gregory SG, Sekhon M, Schein J, Zhao S, Osoegawa K, Scott CE, Evans RS, Burridge PW, Cox TV, Fox CA, Hutton RD, Mullenger IR, Phillips KJ, Smith J, Stalker J, Threadgold GJ, Birney E, Wylie K, Chinwalla A, Wallis J, Hillier L, Carter J, Gaige T, Jaeger S, Kremitzki C, Layman D, Maas J, McGrane R, Mead K, Walker R, Jones S, Smith M, Asano J, Bosdet I, Chan S, Chittaranjan S, Chiu R, Fjell C, Fuhrmann D, Girn N, Gray C, Guin R, Hsiao L, Krzywinski M, Kutsche R, Lee SS, Mathewson C, McLeavy C, Messervier S, Ness S, Pandoh P, Prabhu AL, Saeedi P, Smailus D, Spence L, Stott J, Taylor S, Terpstra W, Tsai M, Vardy J, Wye N, Yang G, Shatsman S, Ayodeji B, Geer K, Tsegaye G, Shvartsbeyn A, Gebregeorgis E, Krol M, Russell D, Overton L, Malek JA, Holmes M, Heaney M, Shetty J, Feldblyum T, Nierman WC, Catanese JJ, Hubbard T, Waterston RH, Rogers J, de Jong PJ, Fraser CM, Marra M, McPherson JD and Bentley DR

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    A physical map of a genome is an essential guide for navigation, allowing the location of any gene or other landmark in the chromosomal DNA. We have constructed a physical map of the mouse genome that contains 296 contigs of overlapping bacterial clones and 16,992 unique markers. The mouse contigs were aligned to the human genome sequence on the basis of 51,486 homology matches, thus enabling use of the conserved synteny (correspondence between chromosome blocks) of the two genomes to accelerate construction of the mouse map. The map provides a framework for assembly of whole-genome shotgun sequence data, and a tile path of clones for generation of the reference sequence. Definition of the human-mouse alignment at this level of resolution enables identification of a mouse clone that corresponds to almost any position in the human genome. The human sequence may be used to facilitate construction of other mammalian genome maps using the same strategy.

    Funded by: NHGRI NIH HHS: U01 HG002137-03

    Nature 2002;418;6899;743-50

  • The significance of performance ranking in CASP--response to Marti-Renom et al.

    Moult J, Fidelis K, Zemla A, Hubbard T and Tramontano A

    Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville 20850, USA. jmoult@tunc.org

    Structure (London, England : 1993) 2002;10;3;291-2; discussion 292-3

  • Computational detection and location of transcription start sites in mammalian genomic DNA.

    Down TA and Hubbard TJ

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, United Kingdom. td2@sanger.ac.uk

    Transcription, the process whereby RNA copies are made from sections of the DNA genome, is directed by promoter regions. These define the transcription start site, and also the set of cellular conditions under which the promoter is active. At least in more complex species, it appears to be common for genes to have several different transcription start sites, which may be active under different conditions. Eukaryotic promoters are complex and fairly diffuse structures, which have proven hard to detect in silico. We show that a novel hybrid machine-learning method is able to build useful models of promoters for >50% of human transcription start sites. We estimate specificity to be >70%, and demonstrate good positional accuracy. Based on the structure of our learned models, we conclude that a signal resembling the well known TATA box, together with flanking regions of C-G enrichment, are the most important sequence-based signals marking sites of transcriptional initiation at a large class of typical promoters.

    Genome research 2002;12;3;458-61

  • MaxBench: evaluation of sequence and structure comparison methods.

    Leplae R and Hubbard TJ

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. lp1@sanger.ac.uk

    Summary: MaxBench is a web-based system available for evaluating the results of sequence and structure comparison methods, based on the SCOP protein domain classification. The system makes it easy for developers to both compare the overall performance of their methods to standard algorithms and investigate the results of individual comparisons.

    Availability: http://www.sanger.ac.uk/Users/lp1/MaxBench/

    Bioinformatics (Oxford, England) 2002;18;3;494-5

  • The Ensembl genome database project.

    Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I and Clamp M

    The Wellcome Trust Sanger Institute and European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Ensembl (http://www.ensembl.org/) database project provides a bioinformatics framework to organise biology around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of the human genome sequence, with confirmed gene predictions that have been integrated with external data sources, and is available as either an interactive web site or as flat files. It is also an open source software engineering project to develop a portable system able to handle very large genomes and associated requirements from sequence analysis to data storage and visualisation. The Ensembl site is one of the leading sources of human genome sequence annotation and provided much of the analysis for publication by the international human genome project of the draft genome. The Ensembl system is being installed around the world in both companies and academic sites on machines ranging from supercomputers to laptops.

    Nucleic acids research 2002;30;1;38-41

  • SCOP database in 2002: refinements accommodate structural genomics.

    Lo Conte L, Brenner SE, Hubbard TJ, Chothia C and Murzin AG

    MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK. loredana@mrc-lmb.cam.ac.uk

    The SCOP (Structural Classification of Proteins) database is a comprehensive ordering of all proteins of known structure, according to their evolutionary and structural relationships. Protein domains in SCOP are grouped into species and hierarchically classified into families, superfamilies, folds and classes. Recently, we introduced a new set of features with the aim of standardizing access to the database, and providing a solid basis to manage the increasing number of experimental structures expected from structural genomics projects. These features include: a new set of identifiers, which uniquely identify each entry in the hierarchy; a compact representation of protein domain classification; a new set of parseable files, which fully describe all domains in SCOP and the hierarchy itself. These new features are reflected in the ASTRAL compendium. The SCOP search engine has also been updated, and a set of links to external resources added at the level of domain entries. SCOP can be accessed at http://scop.mrc-lmb.cam.ac.uk/scop.

    Nucleic acids research 2002;30;1;264-7

  • Databases and tools for browsing genomes.

    Birney E, Clamp M and Hubbard T

    European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom. birney@ebi.ac.uk

    To maximize the value of genome sequences they need to be integrated with other types of biological data and with each other. The entire collection of data then needs to be made available in a way that is easy to view and mine for complex relationships. The recently determined vertebrate genome sequences of human and mouse are so large that building the infrastructure to manage these datasets is a major challenge. This article reviews the database systems and tools for analysis that have so far been developed to address this.

    Annual review of genomics and human genetics 2002;3;293-310

2001 Publications

  • The DNA sequence and comparative analysis of human chromosome 20.

    Deloukas P, Matthews LH, Ashurst J, Burton J, Gilbert JG, Jones M, Stavrides G, Almeida JP, Babbage AK, Bagguley CL, Bailey J, Barlow KF, Bates KN, Beard LM, Beare DM, Beasley OP, Bird CP, Blakey SE, Bridgeman AM, Brown AJ, Buck D, Burrill W, Butler AP, Carder C, Carter NP, Chapman JC, Clamp M, Clark G, Clark LN, Clark SY, Clee CM, Clegg S, Cobley VE, Collier RE, Connor R, Corby NR, Coulson A, Coville GJ, Deadman R, Dhami P, Dunn M, Ellington AG, Frankland JA, Fraser A, French L, Garner P, Grafham DV, Griffiths C, Griffiths MN, Gwilliam R, Hall RE, Hammond S, Harley JL, Heath PD, Ho S, Holden JL, Howden PJ, Huckle E, Hunt AR, Hunt SE, Jekosch K, Johnson CM, Johnson D, Kay MP, Kimberley AM, King A, Knights A, Laird GK, Lawlor S, Lehvaslaiho MH, Leversha M, Lloyd C, Lloyd DM, Lovell JD, Marsh VL, Martin SL, McConnachie LJ, McLay K, McMurray AA, Milne S, Mistry D, Moore MJ, Mullikin JC, Nickerson T, Oliver K, Parker A, Patel R, Pearce TA, Peck AI, Phillimore BJ, Prathalingam SR, Plumb RW, Ramsay H, Rice CM, Ross MT, Scott CE, Sehra HK, Shownkeen R, Sims S, Skuce CD, Smith ML, Soderlund C, Steward CA, Sulston JE, Swann M, Sycamore N, Taylor R, Tee L, Thomas DW, Thorpe A, Tracey A, Tromans AC, Vaudin M, Wall M, Wallis JM, Whitehead SL, Whittaker P, Willey DL, Williams L, Williams SA, Wilming L, Wray PW, Hubbard T, Durbin RM, Bentley DR, Beck S and Rogers J

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK. panos@sanger.ac.uk

    The finished sequence of human chromosome 20 comprises 59,187,298 base pairs (bp) and represents 99.4% of the euchromatic DNA. A single contig of 26 megabases (Mb) spans the entire short arm, and five contigs separated by gaps totalling 320 kb span the long arm of this metacentric chromosome. An additional 234,339 bp of sequence has been determined within the pericentromeric region of the long arm. We annotated 727 genes and 168 pseudogenes in the sequence. About 64% of these genes have a 5' and a 3' untranslated region and a complete open reading frame. Comparative analysis of the sequence of chromosome 20 to whole-genome shotgun-sequence data of two other vertebrates, the mouse Mus musculus and the puffer fish Tetraodon nigroviridis, provides an independent measure of the efficiency of gene annotation, and indicates that this analysis may account for more than 95% of all coding exons and almost all genes.

    Nature 2001;414;6866;865-71

  • Initial sequencing and analysis of the human genome.

    Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blöcker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J and International Human Genome Sequencing Consortium

    Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, Massachusetts 02142, USA. lander@genome.wi.mit.edu

    The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

    Nature 2001;409;6822;860-921

  • Mining the draft human genome.

    Birney E, Bateman A, Clamp ME and Hubbard TJ

    The European Bioinformatics Institute, Hinxton, Cambridge, UK. birney@ebi.ac.uk

    Now that the draft human genome sequence is available, everyone wants to be able to use it. However, we have perhaps become complacent about our ability to turn new genomes into lists of genes. The higher volume of data associated with a larger genome is accompanied by a much greater increase in complexity. We need to appreciate both the scale of the challenge of vertebrate genome analysis and the limitations of current gene prediction methods and understanding.

    Nature 2001;409;6822;827-8

  • Assessment of novel fold targets in CASP4: predictions of three-dimensional structures, secondary structures, and interresidue contacts.

    Lesk AM, Lo Conte L and Hubbard TJ

    Department of Haematology, University of Cambridge Clinical School, Cambridge Institute for Medical Research, Cambridge, United Kingdom. aml2@mrc-lmb.cam.ac.uk

    In the Novel Fold category, three types of predictions were assessed: three-dimensional structures, secondary structures, and residue-residue contacts. For predictions of three-dimensional models, CASP4 targets included 5 domains or structures with novel folds, and 13 on the borderline between Novel Fold and Fold Recognition categories. These elicited 1863 predictions of these and other targets by methods more general than comparative modeling or fold recognition techniques. The group of Bonneau, Tsai, Ruczinski, and Baker stood out as performing well with the greatest consistency. In many cases, several groups were able to predict fragments of the target correctly-often at a level somewhat larger than standard supersecondary structures-but were not able to assemble fragments into a correct global topology. The methods of Bonneau, Tsai, Ruczinski, and Baker have been successful in addressing the fragment assembly problem for many but not all the target structures.

    Proteins 2001;Suppl 5;98-118

  • Critical assessment of methods of protein structure prediction (CASP): round IV.

    Moult J, Fidelis K, Zemla A and Hubbard T

    Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland 20850, USA. jmoult@tunc.org

    Funded by: NIGMS NIH HHS: GM/DK61967; NLM NIH HHS: LM07085

    Proteins 2001;Suppl 5;2-7

  • Prediction targets of CASP4.

    Murzin A and Hubbard TJ

    Centre for Protein Engineering, MRC Centre, Cambridge, UK.

    Proteins 2001;Suppl 5;8-12

2000 Publications

  • A browser for expression data.

    Pocock MR and Hubbard TJ

    The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. mrp@sanger.ac.uk

    Summary: We have written a fully extensible Java application for visually browsing expression data, and clusters of genes or experimental conditions calculated from that data. The application requires a run-time environment for Java2.

    Availability: http://www. sanger.ac.uk/Users/mrp/java/ExpressionBrowser

    Bioinformatics (Oxford, England) 2000;16;4;402-3

  • Open annotation offers a democratic solution to genome sequencing.

    Hubbard T and Birney E

    Nature 2000;403;6772;825

  • SCOP: a structural classification of proteins database.

    Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG and Chothia C

    MRC Laboratory of Molecular Biology, Centre for Protein Engineering, Hills Road, Cambridge CB2 2QH, UK. loredana@mrc-lmb.cam.ac.uk

    The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and distant evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database so far. The sequences of proteins in SCOP provide the basis of the ASTRAL sequence libraries that can be used as a source of data to calibrate sequence search algorithms and for the generation of statistics on, or selections of, protein structures. Links can be made from SCOP to PDB-ISL: a library containing sequences homologous to proteins of known structure. Sequences of proteins of unknown structure can be matched to distantly related proteins of known structure by using pairwise sequence comparison methods to find homologues in PDB-ISL. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.mrc-lmb.cam.ac.uk/scop/

    Nucleic acids research 2000;28;1;257-9

1999 Publications

  • The DNA sequence of human chromosome 22.

    Dunham I, Shimizu N, Roe BA, Chissoe S, Hunt AR, Collins JE, Bruskiewich R, Beare DM, Clamp M, Smink LJ, Ainscough R, Almeida JP, Babbage A, Bagguley C, Bailey J, Barlow K, Bates KN, Beasley O, Bird CP, Blakey S, Bridgeman AM, Buck D, Burgess J, Burrill WD, O'Brien KP et al.

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. id1@sanger.ac.uk

    Knowledge of the complete genomic DNA sequence of an organism allows a systematic approach to defining its genetic components. The genomic sequence provides access to the complete structures of all genes, including those without known function, their control elements, and, by inference, the proteins they encode, as well as all other biologically important sequences. Furthermore, the sequence is a rich and permanent source of information for the design of further biological studies of the organism and for the study of evolution through cross-species sequence comparison. The power of this approach has been amply demonstrated by the determination of the sequences of a number of microbial and model organisms. The next step is to obtain the complete sequence of the entire human genome. Here we report the sequence of the euchromatic part of human chromosome 22. The sequence obtained consists of 12 contiguous segments spanning 33.4 megabases, contains at least 545 genes and 134 pseudogenes, and provides the first view of the complex chromosomal landscapes that will be found in the rest of the genome.

    Nature 1999;402;6761;489-95

  • SCOP: a Structural Classification of Proteins database.

    Hubbard TJ, Ailey B, Brenner SE, Murzin AG and Chothia C

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known proteins structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database, so far. The database can be used as a source of data to calibrate sequence search algorithms and for the generation of population statistics on protein structures. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop. mrc-lmb.cam.ac.uk/scop/

    Funded by: Wellcome Trust

    Nucleic acids research 1999;27;1;254-6

  • Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction.

    Orengo CA, Bray JE, Hubbard T, LoConte L and Sillitoe I

    Department of Biochemistry and Molecular Biology, University College, London, United Kingdom. orengo@bsm.bioc.ucl.ac.uk

    CASP3 saw a substantial increase in the volume of ab initio 3D prediction data, with 507 datasets for fifteen selected targets and sixty-one groups participating. As with CASP2, methods ranged from computationally intensive strategies that attempt to recreate the physical and chemical forces involved in protein folding to the more recent knowledge-based approaches. These exploit information from the structure databases, extracting potentially similar fragments and/or distance constraints derived from multiple sequence alignments. The knowledge-based approaches generally gave more consistently successful predictions across the range of targets, particularly that of the Baker group (Bystroff and Baker, J Mol Biol 1998;281:565-577; Simons et al. Proteins Suppl 1999;3:171-176), which used a fragment library. In the secondary structure prediction category, the most successful approaches built on the concepts used in PHD (Rost et al. Comput Appl Biosci 1994;10:53-60), an accepted standard in this field. Like PHD, they exploit neural networks but have different strategies for incorporating multiple sequence data or position-dependent weight matrices for training the networks. Analysis of the contact data, for which only six groups participated, suggested that as yet this data provides a rather weak signal. However, in combination with other types of prediction data it can sometimes be a useful constraint for identifying the correct structure.

    Funded by: Wellcome Trust

    Proteins 1999;Suppl 3;149-70

  • Critical assessment of methods of protein structure prediction (CASP): round III.

    Moult J, Hubbard T, Fidelis K and Pedersen JT

    Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville 20850, USA. jmoult@carb.nist.gov

    Proteins 1999;Suppl 3;2-6

  • RMS/coverage graphs: a qualitative method for comparing three-dimensional protein structure predictions.

    Hubbard TJ

    Sanger Centre, Hinxton, Cambridgeshire, United Kingdom. th@sanger.ac.uk

    Evaluating a set of protein structure predictions is difficult as each prediction may omit different residues and different parts of the structure may have different accuracies. A method is described that captures the best results from a large number of alternative sequence-dependent structural superpositions between a prediction and the experimental structure and represents them as a single line on a graph. Applied to CASP2 and CASP3 data the best predictions stand out visually in most cases, as judged by manual inspection. The results from this method applied to CASP data are available from the URLs http:/(/)PredictionCenter. llnl.gov/casp3/results/th/ and http:/(/)www.sanger.ac.uk/ approximately th/casp/.

    Funded by: Wellcome Trust

    Proteins 1999;Suppl 3;15-21

1998 Publications

  • Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods.

    Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T and Chothia C

    MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK.

    The sequences of related proteins can diverge beyond the point where their relationship can be recognised by pairwise sequence comparisons. In attempts to overcome this limitation, methods have been developed that use as a query, not a single sequence, but sets of related sequences or a representation of the characteristics shared by related sequences. Here we describe an assessment of three of these methods: the SAM-T98 implementation of a hidden Markov model procedure; PSI-BLAST; and the intermediate sequence search (ISS) procedure. We determined the extent to which these procedures can detect evolutionary relationships between the members of the sequence database PDBD40-J. This database, derived from the structural classification of proteins (SCOP), contains the sequences of proteins of known structure whose sequence identities with each other are 40% or less. The evolutionary relationships that exist between those that have low sequence identities were found by the examination of their structural details and, in many cases, their functional features. For nine false positive predictions out of a possible 432,680, i.e. at a false positive rate of about 1/50,000, SAM-T98 found 35% of the true homologous relationships in PDBD40-J, whilst PSI-BLAST found 30% and ISS found 25%. Overall, this is about twice the number of PDBD40-J relations that can be detected by the pairwise comparison procedures FASTA (17%) and GAP-BLAST (15%). For distantly related sequences in PDBD40-J, those pairs whose sequence identity is less than 30%, SAM-T98 and PSI-BLAST detect three times the number of relationships found by the pairwise methods.

    Journal of molecular biology 1998;284;4;1201-10

  • SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data.

    Hubbard TJ, Ailey B, Brenner SE, Murzin AG and Chothia C

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, England. th@sanger.ac.uk

    The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database, so far. The database can be used as a source of data to calibrate sequence search algorithms and for the generation of population statistics on protein structures. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop. mrc-lmb.cam.ac.uk/scop/.

    Acta crystallographica. Section D, Biological crystallography 1998;54;Pt 6 Pt 1;1147-54

  • Toward a complete human genome sequence.

    No authors listed

    We have begun a joint program as part of a coordinated international effort to determine a complete human genome sequence. Our strategy is to map large-insert bacterial clones and to sequence each clone by a random shotgun approach followed by directed finishing. As of September 1998, we have identified the map positions of bacterial clones covering approximately 860 Mb for sequencing and completed >98 Mb ( approximately 3.3%) of the human genome sequence. Our progress and sequencing data can be accessed via the World Wide Web (http://webace.sanger.ac.uk/HGP/ or http://genome.wustl.edu/gsc/).

    Genome research 1998;8;11;1097-108

  • Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.

    Brenner SE, Chothia C and Hubbard TJ

    MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, United Kingdom. brenner@hyper.stanford.edu

    Pairwise sequence comparison methods have been assessed using proteins whose relationships are known reliably from their structures and functions, as described in the SCOP database [Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation tested the programs BLAST [Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996) Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. & Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448], and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol. Biol. 147, 195-197] and their scoring schemes. The error rate of all algorithms is greatly reduced by using statistical scores to evaluate matches rather than percentage identity or raw scores. The E-value statistical scores of SSEARCH and FASTA are reliable: the number of false positives found in our tests agrees well with the scores reported. However, the P-values reported by BLAST and WU-BLAST2 exaggerate significance by orders of magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform best, and they are capable of detecting almost all relationships between proteins whose sequence identities are >30%. For more distantly related proteins, they do much less well; only one-half of the relationships between proteins with 20-30% identity are found. Because many homologs have low sequence similarity, most distant relationships cannot be detected by any pairwise comparison method; however, those which are identified may be used with confidence.

    Funded by: Wellcome Trust

    Proceedings of the National Academy of Sciences of the United States of America 1998;95;11;6073-8

  • Using neural networks for prediction of the subcellular location of proteins.

    Reinhardt A and Hubbard T

    The Sanger Centre, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK. areinha@sanger.ac.uk

    Neural networks have been trained to predict the subcellular location of proteins in prokaryotic or eukaryotic cells from their amino acid composition. For three possible subcellular locations in prokaryotic organisms a prediction accuracy of 81% can be achieved. Assigning a reliability index, 33% of the predictions can be made with an accuracy of 91%. For eukaryotic proteins (excluding plant sequences) an overall prediction accuracy of 66% for four locations was achieved, with 33% of the sequences being predicted with an accuracy of 82% or better. With the subcellular location restricting a protein's possible function, this method should be a useful tool for the systematic analysis of genome data and is available via a server on the world wide web.

    Funded by: Wellcome Trust

    Nucleic acids research 1998;26;9;2230-6

  • GLASS: a tool to visualize protein structure prediction data in three dimensions and evaluate their consistency.

    Leplae R, Hubbard T and Tramontano A

    Istituto di Ricerche di Biologia Molecolare, P. Angeletti, Pomezia, Italy.

    When a protein sequence does not share any significant sequence similarity with a protein of known structure, homology modeling cannot be applied. However, many novel and interesting methods, such as secondary structure prediction, fold recognition, and prediction of long-range interactions, are being developed and have been shown to be reasonably successful in predicting protein structures from sequence data and evolutionary information. The a priori evaluation of the correctness of a prediction obtained by one of these methods is however often problematic. Consequently, it is important to use all available information provided by as many different methods as possible and all the available experimental data about the protein of interest, since the consistency of the results is indicative of the reliability of the prediction. Hence the need has arisen for suitable tools able to compare results provided by different methods and evaluate their consistency. We have therefore constructed GLASS, a general platform to read, visualize, compare, and evaluate prediction results from many different sources and to project these prediction results into three dimensions. In addition, GLASS allows the comparison of selected parameters calculated for a model with the distribution observed in real protein structures, thus providing an easy way to test new methods for evaluating the likelihood of different structural models. GLASS can be considered as a "workbench" for structural predictions useful to both experimentalists and theoreticians.

    Funded by: Wellcome Trust

    Proteins 1998;30;4;339-51

  • SPEM: a parser for EMBL style flat file database entries.

    Pocock MR, Hubbard T and Birney E

    Informatics, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. mrp@sanger.ac.uk

    Summary: We present a set of Perl modules for the flexible and robust parsing and editing of EMBL/SWISS-PROT databases.

    Availability: The Web page at http://www.sanger.ac. uk/Software/PerlModule/ provides information about downloading the SPEM and PrEMBL modules, and provides links to documentation and example code.

    Bioinformatics (Oxford, England) 1998;14;9;823-4

1997 Publications

  • Intermediate sequences increase the detection of homology between sequences.

    Park J, Teichmann SA, Hubbard T and Chothia C

    Cambridge Centre for Protein Engineering, UK.

    Two homologous sequences, which have diverged beyond the point where their homology can be recognised by a simple direct comparison, can be related through a third sequence that is suitably intermediate between the two. High scores, for a sequence match between the first and third sequences and between the second and the third sequences, imply that the first and second sequences are related even though their own match score is low. We have tested the usefulness of this idea using a database that contains the sequences of 971 protein domains whose structures are known and whose residue identities with each other are some 40% or less (PDB40D). On the basis of sequence and structural information, 2143 pairs of these sequences are known to have an evolutionary relationship. FASTA, in an all-against-all comparison of the sequences in the database, detected 320 (15%) of these relationships as well as three false positive (i.e. 1% error rate). Using intermediate sequences found by FASTA matches of PDB40D sequences to those in the large non-redundant OWL database we could detect 550 evolutionary relationships with an error rate of 1%. This means the intermediate sequence procedure increases the ability to recognise the evolutionary relationships amongst the PDB40D sequences by 70%.

    Funded by: Wellcome Trust

    Journal of molecular biology 1997;273;1;349-54

  • Population statistics of protein structures: lessons from structural classifications.

    Brenner SE, Chothia C and Hubbard TJ

    Structural Biology Centre, National Institute for Bioscience and Human-Technology, Ibaraki, Japan. brenner@hyper.stanford.edu

    Structural classifications aid the interpretation of proteins by describing degrees of structural and evolutionary relatedness. They have also recently revealed strikingly skewed distributions at all levels; for example, a small number of folds are far more common than others, and just a few superfamilies are known to have diverged widely. The classifications also provide an indication of the total number of superfamilies in nature.

    Funded by: Wellcome Trust

    Current opinion in structural biology 1997;7;3;369-76

  • New horizons in sequence analysis.

    Hubbard TJ

    Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK. th@sanger.ac.uk

    An ever increasing number of protein sequences are being compared, partly because of the availability of full sets of protein sequences from several completed genome-sequencing projects. The resulting problem of scale has shifted the emphasis of sequence analysis method development from sensitivity and flexibility, which relies on manual intervention and interpretation, to the automatic generation of results of known reliability.

    Funded by: Wellcome Trust

    Current opinion in structural biology 1997;7;2;190-3

  • The solution structure of the S1 RNA binding domain: a member of an ancient nucleic acid-binding fold.

    Bycroft M, Hubbard TJ, Proctor M, Freund SM and Murzin AG

    Department of Chemistry, University of Cambridge, United Kingdom.

    The S1 domain, originally identified in ribosomal protein S1, is found in a large number of RNA-associated proteins. The structure of the S1 RNA-binding domain from the E. coli polynucleotide phosphorylase has been determined using NMR methods and consists of a five-stranded antiparallel beta barrel. Conserved residues on one face of the barrel and adjacent loops form the putative RNA-binding site. The structure of the S1 domain is very similar to that of cold shock protein, suggesting that they are both derived from an ancient nucleic acid-binding protein. Enhanced sequence searches reveal hitherto unidentified S1 domains in RNase E, RNase II, NusA, EMB-5, and other proteins.

    Cell 1997;88;2;235-42

  • SCOP: a structural classification of proteins database.

    Hubbard TJ, Murzin AG, Brenner SE and Chothia C

    MRC Laboratory of Molecular Biology, Cambridge CB2 2QH, UK.

    The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known proteins structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database, so far. SCOP also provides for each structure links to atomic co-ordinates, images of the structures, interactive viewers, sequence data, data on any conformational changes related to function and literature references. The database is freely accessible on the World Wide Web (WWW) with an entry point at URL http://scop.mrc-lmb.cam.ac.uk/scop/

    Nucleic acids research 1997;25;1;236-9

  • Critical assessment of methods of protein structure prediction (CASP): round II.

    Moult J, Hubbard T, Bryant SH, Fidelis K and Pedersen JT

    Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville 20850, USA. jmoult@carb.nist.gov

    Proteins 1997;Suppl 1;2-6

  • Numerical criteria for the evaluation of ab initio predictions of protein structure.

    Zemla A, Venclovas C, Reinhardt A, Fidelis K and Hubbard TJ

    Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, CA 94551, USA.

    As part of the CASP2 protein structure prediction experiment, a set of numerical criteria were defined for the evaluation of "ab initio" predictions. The evaluation package comprises a series of electronic submission formats, a submission validator, evaluation software, and a series of scripts to summarize the results for the CASP2 meeting and for presentation via the World Wide Web (WWW). The evaluation package is accessible for use on new predictions via WWW so that results can be compared to those submitted to CASP2. With further input from the community, the evaluation criteria are expected to evolve into a comprehensive set of measures capturing the overall quality of a prediction as well as critical detail essential for further development of prediction methods. We discuss present measures, limitations of the current criteria, and possible improvements.

    Funded by: Wellcome Trust

    Proteins 1997;Suppl 1;140-50

  • Protein folds in the all-beta and all-alpha classes.

    Chothia C, Hubbard T, Brenner S, Barns H and Murzin A

    MRC Laboratory of Molecular Biology, Cambridge, United Kingdom.

    Analysis of the structures in the Protein Databank, released in June 1996, shows that the number of different protein folds, i.e. the number of different arrangements of major secondary structures and/or chain topologies, is 327. Of these folds, approximately 25% belong to the all-alpha class, 20% belong to the all-beta class, 30% belong to the alpha/beta class, and 25% belong to the alpha + beta class. We describe the types of folds now known for the all-beta and all-alpha classes, emphasizing those that have been discovered recently. Detailed theories for the physical determinants of the structures of most of these folds now exist, and these are reviewed.

    Annual review of biophysics and biomolecular structure 1997;26;597-627

1996 Publications

  • Protein structure prediction: playing the fold.

    Hubbard T, Park J, Lahm A, Leplae R and Tramontano A

    Centre for Protein Engineering, MRC, Cambridge, UK.

    Trends in biochemical sciences 1996;21;8;279-81

  • Gathering them in to the fold.

    Hubbard TJ, Lesk AM and Tramontano A

    Nature structural biology 1996;3;4;313

  • Understanding protein structure: using scop for fold interpretation.

    Brenner SE, Chothia C, Hubbard TJ and Murzin AG

    Medical Research Council Centre Laboratories of Molecular Biology, Cambridge United Kingdom.

    Methods in enzymology 1996;266;635-43

  • Update on protein structure prediction: results of the 1995 IRBM workshop.

    Hubbard T and Tramontano A

    Instituto di Ricerche di Biologia Molecolare (IRBM), Pomezia, Roma, Italy. th@mrc.cpe.cam.ac.uk

    Computational tools for protein structure prediction are of great interest to molecular, structural and theoretical biologists due to a rapidly increasing number of protein sequences with no known structure. In October 1995, a workshop was held at IRBM to predict as much as possible about a number of proteins of biological interest using ab initio prediction of fold recognition methods. 112 protein sequences were collected via an open invitation for target submissions. 17 were selected for prediction during the workshop and for 11 of these a prediction of some reliability could be made. We believe that this was a worthwhile experiment showing that the use of a range of independent prediction methods and thorough use of existing databases can lead to credible and useful ab initio structure predictions.

    Folding & design 1996;1;3;R55-63

1995 Publications

  • Gene duplications in H. influenzae.

    Brenner SE, Hubbard T, Murzin A and Chothia C

    Nature 1995;378;6553;140

  • Fold recognition and ab initio structure predictions using hidden Markov models and beta-strand pair potentials.

    Hubbard TJ and Park J

    Centre for Protein Engineering (CPE), MRC Centre, Cambridge, UK.

    Protein structure predictions were submitted for 9 of the target sequences in the competition that ran during 1994. Targets sequences were selected that had no known homology with any sequence of known structure and were members of a reasonably sized family of related but divergent sequences. The objective was either to recognize a compatible fold for the target sequence in the database of known structures or to predict ab initio its rough 3D topology. The main tools used were Hidden Markov models (HMM) for fold recognition, a beta-strand pair potential to predict beta-sheet topology, and the PHD server for secondary structure prediction. Compatible folds were correctly identified in a number of cases and the beta-strand pair potential was shown to be a useful tool for ab initio topology prediction.

    Proteins 1995;23;3;398-402

  • Prediction of the structure of GroES and its interaction with GroEL.

    Valencia A, Hubbard TJ, Muga A, Bañuelos S, Llorca O, Carrascosa JL and Valpuesta JM

    Centro Nacional de Biotecnología, C.S.I.C. Universidad Autónoma de Madrid, Spain.

    The three-dimensional structure of the GroES monomer and its interaction with GroEL has been predicted using a combination of prediction tools and experimental data obtained by biophysical [electron microscope (EM), Fourier transform infrared (FTIR), and nuclear magnetic resonance (NMR)] and biochemical techniques. The GroES monomer, according to the prediction, is composed of eight beta-strands forming a beta-barrel with loose ends. In the model, beta-strands 5-8 run along the outer surface of GroES, forming an antiparallel beta-sheet with beta 4 loosely bound to one of the edges. beta-strands 1-3 would then be parallel and placed in the interior of the molecule. Loops 1-3 would face the internal cavity of the GroEL-GroES complex, and together with conserved residues in loops 5 and 7, would form the active surface interacting with GroEL.

    Proteins 1995;22;3;199-209

  • SCOP: a structural classification of proteins database for the investigation of sequences and structures.

    Murzin AG, Brenner SE, Hubbard T and Chothia C

    MRC Laboratory of Molecular Biology and Centre for Protein Engineering, Cambridge, England.

    To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity. The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http: parallel scop.mrc-lmb.cam.ac.uk magnitude of scop.

    Journal of molecular biology 1995;247;4;536-40

  • A specification for defining and annotating regions of macromolecular structures.

    Brenner SE and Hubbard TJ

    MRC Laboratory of Molecular Biology, Cambridge, England.

    We present a program- and machine-independent standard for annotating macromolecular structures. Data encoded by this specification may be used for communicating information about structures and for exchanging it between different computer systems. The format consists of a set of ASN.1 objects which are mechanically straightforward to parse, but are also easy for humans to create and understand. It differs from all other related standards in that it specifies how a molecule should be displayed without requiring a custom format for the coordinate data.

    Proceedings / ... International Conference on Intelligent Systems for Molecular Biology ; ISMB. International Conference on Intelligent Systems for Molecular Biology 1995;3;66-74

Publications before 1995

  • First protein engineering centres meeting at CAPE/GBF on May 22-24, 1991. Foundation of INPEC (International Network of Protein Engineering Centres).

    Recktenwald A, Schomburg D, Schmid RD, Ballard P, Carey PR, Clark BF, Fromageot P, Hubbard T, Ikehara M, Iokibe R et al.

    Gesellschaft für Biotechnologische Forschung (GBF), Braunschweig, Germany.

    Journal of biotechnology 1993;28;1;137-63

  • Protein design on computers. Five new proteins: Shpilka, Grendel, Fingerclasp, Leather, and Aida.

    Sander C, Vriend G, Bazan F, Horovitz A, Nakamura H, Ribas L, Finkelstein AV, Lockhart A, Merkl R, Perry LJ et al.

    European Molecular Biology Laboratory, Heidelberg, Federal Republic of Germany.

    What is the current state of the art in protein design? This question was approached in a recent two-week protein design workshop sponsored by EMBO and held at the EMBL in Heidelberg. The goals were to test available design tools and to explore new design strategies. Five novel proteins were designed: Shpilka, a sandwich of two four-stranded beta-sheets, a scaffold on which to explore variations in loop topology; Grendel, a four-helical membrane anchor, ready for fusion to water-soluble functional domains; Finger-clasp, a dimer of interdigitating beta-beta-alpha units, the simplest variant of the "handshake" structural class; Aida, an antibody binding surface intended to be specific for flavodoxin; Leather--a minimal NAD binding domain, extracted from a larger protein. Each design is available as a set of three-dimensional coordinates, the corresponding amino acid sequence and a set of analytical results. The designs are placed in the public domain for scrutiny, improvement, and possible experimental verification.

    Proteins 1992;12;2;105-10

  • The role of heat-shock and chaperone proteins in protein folding: possible molecular mechanisms.

    Hubbard TJ and Sander C

    Protein Engineering Research Institute, Osaka, Japan.

    Recently some heat-shock proteins have been linked to functions of 'chaperoning' protein folding in vivo. Here current experimental evidence is reviewed and possible requirements for such an activity are discussed. It is proposed that one mode of chaperone action is to actively unfold misfolded or badly aggregated proteins to a conformation from which they could refold spontaneously; that improperly folded proteins are recognized by excessive stretches of solvent-exposed backbone, rather than by exposed hydrophobic patches; and that the molecular mechanism for unfolding is either repeated binding and dissociation ('plucking') or translocation of the protein backbone through a binding cleft ('threading'), allowing the threaded chain to refold spontaneously. The observed hydrolysis of ATP would provide the energy for active unfolding. These hypotheses can be applied to both monomeric folding and oligomeric assembly and are sufficiently detailed to be open to directed experimental verification.

    Protein engineering 1991;4;7;711-7

  • Protein engineering and design.

    Blundell TL, Elliott G, Gardner SP, Hubbard T, Islam S, Johnson M, Mantafounis D, Murray-Rust P, Overington J, Pitts JE et al.

    Department of Crystallography, Birkbeck College, London, U.K.

    Rapid advances in site-directed mutagenesis and total gene synthesis combined with new expression systems in prokaryotic and eukaryotic cells have provided the molecular biologist with tools for modification of existing proteins to improve catalytic activity, stability and selectivity, for construction of chimeric molecules and for synthesis of completely novel molecules that may be endowed with some useful activity. Such protein engineering can be seen as a cycle in which the structures of engineered molecules are studied by X-ray analysis and two-dimensional nuclear magnetic resonance. The results are used in the improvement of the design by using knowledge-based procedures that exploit facts, rules and observations about proteins of known three-dimensional structure.

    Philosophical transactions of the Royal Society of London. Series B, Biological sciences 1989;324;1224;447-60

  • 18th Sir Hans Krebs lecture. Knowledge-based protein modelling and design.

    Blundell T, Carney D, Gardner S, Hayes F, Howlin B, Hubbard T, Overington J, Singh DA, Sibanda BL and Sutcliffe M

    Department of Crystallography, Birkbeck College, University of London.

    A systematic technique for protein modelling that is applicable to the design of drugs, peptide vaccines and novel proteins is described. Our approach is knowledge-based, depending on the structures of homologous or analogous proteins and more generally on a relational data base of protein three-dimensional structures. The procedure simultaneously aligns the known tertiary structures, selects fragments from the structurally conserved regions on the basis of sequence homology, aligns these with the 'average structure' or 'framework', builds on the loops selected from homologous proteins or a wider database, substitutes sidechains and energy minimises the resultant model. Applications to modelling an homologous structure, tissue plasminogen activator on the basis of another serine proteinase, and to modelling an analogous protein, HIV viral proteinase on the basis of aspartic proteinases, are described. The converse problem of ab initio design is also addressed: this involves the selection of an amino acid sequence to give a particular tertiary structure, in this case a symmetrical domain of two Greek-key motifs.

    European journal of biochemistry / FEBS 1988;172;3;513-20

  • Comparison of solvent-inaccessible cores of homologous proteins: definitions useful for protein modelling.

    Hubbard TJ and Blundell TL

    Department of Crystallography, Birkbeck College, University of London, UK.

    The three-dimensional structures of 41 homologous proteins (belonging to eight families) were compared by pairwise superposition. A subset of 'core' residues was defined as those whose side chains have less than 7% of their surface exposed to solvent. This subset has significantly higher sequence identity and lower root mean square (RMS) alpha carbon separation than for all topologically equivalent residues in the structure, when members of a protein family are superposed. For such superpositions the relationship between RMS distance and percentage sequence identity of this subset of residues is similar to that for all equivalent residues, although some variation is observed between families of proteins which are predominantly beta sheet and those which are mainly alpha helix. The definition of a structurally more conserved core may be useful in model building proteins from an homologous family. The RMS differences of coordinates of structures of proteins with identical sequences are found to be related to the resolutions of the structures.

    Protein engineering 1987;1;3;159-71

  • Heat-shock proteins during growth and sporulation of Bacillus subtilis.

    Todd JA, Hubbard TJ, Travers AA and Ellar DJ

    Four major heat-shock proteins (hsps) with apparent molecular masses of 84, 69, 32 and 22 kDa were detected in exponentially growing stationary phase and sporulating cells of Bacillus subtilis heat-shocked from 30 to 43 degrees C. The most abundant, hsp69, is probably analogous to the E. coli groEL protein. These proteins were transiently inducible by heat-shock. Partial purification of RNA polymerase revealed several other minor hsps. One of these, a 48 kDa polypeptide probably corresponds to sigma 43. The synthesis of this polypeptide and at least two other proteins appeared to be under sporulation and heat-shock regulation and was affected by the SpoOA mutation.

    FEBS letters 1985;188;2;209-14

* quick link - http://q.sanger.ac.uk/th