Vertebrate genome analysis

Genome sequences provide a natural index for organising and understanding biological data.

Following the sequencing of the human and other vertebrate genomes, vertebrate genome browsers such as Ensembl have become critical resources, providing biologists with integrated access to the sequence and its associated annotation. The activities of the Vertebrate genome analysis team revolve around generating and presenting core vertebrate genome annotation, particularly in the form of reference genesets, and in maintaining the reference genome sequences of human, mouse and zebrafish. As well as contributing to resources used globally, the team is involved in a wide variety of collaborations related to genome annotation and the development of improved methods for analysis and annotation, resulting in many publications. Tim Hubbard was the principal investigator of the team until he left the Sanger Institute in 2013 to become Professor of Bioinformatics, Head of Department of Medical and Molecular Genetics at King's College London and overall Director of Bioinformatics for King's Health Partners/King's College London.



Part of a 'view' from the Ensembl Genome Browser.



The team includes the Wellcome Trust Sanger Institute part of the Ensembl project (led by Steve Searle) and the Havana annotation group (led by Jen Harrow). Ensembl is a joint project with the European Bioinformatics Institute (EBI). Steve Searle's EBI counterpart is Paul Flicek, who heads the EBI Vertebrate Genomics Team. Sanger Institute Ensembl consists of the genebuild group (led by Steve Searle), which generates genesets using an automatic pipeline, and the web team (led by Anne Parker), which develops and maintains the Ensembl website.


The Otterlace Annotation Tool.



A major combined activity of Havana and the Ensembl genebuild group is to generate complete, high-accuracy genesets for the high-quality reference genomes of human and mouse. Ensembl generates complete genesets using its automatic pipeline for most of the 40+ genomes that it contains. Human and mouse are exceptions: their genesets are referred to as 'Ensembl-Havana' because they combine curated gene structures from Havana with annotation from the Ensembl automatic pipeline. So far only about 50 per cent of the human genome and 30 per cent of the mouse genome have been manually curated. Ultimately the whole of these genesets will be curated. For human this is the objective of the GENCODE project, a scale-up programme within the NHGRI-funded ENCODE project. GENCODE brings together Havana, Ensembl and seven external groups to generate the reference geneset for the human genome. The Ensembl-Havana geneset incorporates the subset of human and mouse CDS (protein-coding) regions that have been curated and agreed by the CCDS consortium, which includes curators at Havana and NCBI (RefSeq) with computational annotation and assessment from the Ensembl genebuild group and UCSC.

The gene curation carried out by Havana is supported by specialist analysis pipelines and annotation tools provided by the Anacode group (led by James Gilbert). Anacode also develops and maintains many of the software systems that support curation of reference genome sequences and WTSI sequence submission to the EMBL sequence database (the EBI partner of the INSDC database consortium). A key component of the Otterlace curation interface, which can be used by annotators anywhere in the world, is the ZMap genome display engine developed by the Acedb group (led by Ed Griffiths). The group continues to support the Acedb database package, used by the model organism database WormBase. The Havana group is involved in the annotation of genes as candidates for knockout in mouse for the Embryonic Stem (ES) Cell Mutagenesis team of Bill Skarnes, as part of the EUCOMM and KOMP projects. Otterlace is also used remotely by KOMP annotators at the Genome Center at Washington University.

The genome of the zebrafish (a key model organism) is being sequenced to reference quality by WTSI. The team includes the Zebrafish analysis group (led by Kerstin Howe), which is responsible for preparing genome assemblies and integrating functional data such as that from the EU ZF-models project and the Sanger Institute zebrafish mutagenesis project. Kerstin also leads the informatics group of the Sanger Institute's component of the Genome Reference Consortium (GRC), which is responsible for maintaining the reference genome sequences of human and mouse.

Selected Publications

  • Ensembl 2009.

    Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Rios D, Schuster M, Slater G, Smedley D, Spooner W, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wilder S, Zadissa A, Birney E, Cunningham F, Curwen V, Durbin R, Fernandez-Suarez XM, Herrero J, Kasprzyk A, Proctor G, Smith J, Searle S and Flicek P

    Nucleic acids research 2009;37;Database issue;D690-7

  • Petabyte-scale innovations at the European Nucleotide Archive.

    Cochrane G, Akhtar R, Bonfield J, Bower L, Demiralp F, Faruque N, Gibson R, Hoad G, Hubbard T, Hunter C, Jang M, Juhos S, Leinonen R, Leonard S, Lin Q, Lopez R, Lorenc D, McWilliam H, Mukherjee G, Plaister S, Radhakrishnan R, Robinson S, Sobhany S, Hoopen PT, Vaughan R, Zalunin V and Birney E

    Nucleic acids research 2009;37;Database issue;D19-25

  • The Protein Feature Ontology: a tool for the unification of protein feature annotations.

    Reeves GA, Eilbeck K, Magrane M, O'Donovan C, Montecchi-Palazzi L, Harris MA, Orchard S, Jimenez RC, Prlic A, Hubbard TJ, Hermjakob H and Thornton JM

    Bioinformatics (Oxford, England) 2008;24;23;2767-72

  • BioJava: an open-source framework for bioinformatics.

    Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M and Schreiber MJ

    Bioinformatics (Oxford, England) 2008;24;18;2096-7

  • Integrating biological data--the Distributed Annotation System.

    Jenkinson AM, Albrecht M, Birney E, Blankenburg H, Down T, Finn RD, Hermjakob H, Hubbard TJ, Jimenez RC, Jones P, Kähäri A, Kulesha E, Macías JR, Reeves GA and Prlić A

    BMC bioinformatics 2008;9 Suppl 8;S3


Team members

Hashem Koohy
Postdoctoral Fellow

I was born in Shiraz, where I completed my primary and secondary education. I then moved to Kermanshah for my undergraduate studies in pure mathematics. This was followed by a two-year MSc in group theory at Tehran and a four-year PhD in ring theory at Ahvaz (2004).

Shortly after that I moved to the UK and worked as a part-time mathematics instructor at the University of Warwick, where I became interested in applications of mathematics in the biological sciences. I therefore completed a second MSc and a second PhD, in 2007 and 2010 respectively, at the Warwick Systems Biology Centre.


I currently work in Dr Tim Hubbard's research group, and my research interests centre on transcriptional regulation. An objective of my project is to provide a comprehensive description of regulatory regions by developing mathematical and computational models. Predicting, detecting and annotating transcription factor binding sites (TFBSs) and regulatory regions such as promoters and enhancers are therefore part of my everyday work. Currently, I work on a model called "composure", an evolutionary approach to the detection of regulatory sequences.

I would also like to assess how consistent computationally detected (ab initio) motifs are with biologically reported ones.


  • An alignment-free model for comparison of regulatory sequences.

    Koohy H, Dyer NP, Reid JE, Koentges G and Ott S

    MOAC Doctoral Training Centre, Coventry House, University of Warwick, Coventry, CV4 7AL, UK.

    Motivation: Some recent comparative studies have revealed that regulatory regions can retain function over large evolutionary distances, even though the DNA sequences are divergent and difficult to align. It is also known that such enhancers can drive very similar expression patterns. This poses a challenge for the in silico detection of biologically related sequences, as they can only be discovered using alignment-free methods.

    Results: Here, we present a new computational framework called Regulatory Region Scoring (RRS) model for the detection of functional conservation of regulatory sequences using predicted occupancy levels of transcription factors of interest. We demonstrate that our model can detect the functional and/or evolutionary links between some non-alignable enhancers with a strong statistical significance. We also identify groups of enhancers that are likely to be similarly regulated. Our model is motivated by previous work on prediction of expression patterns and it can capture similarity by strong binding sites, weak binding sites and even the statistically significant absence of sites. Our results support the hypothesis that weak binding sites contribute to the functional similarity of sequences. Our model fills a gap between two families of models: detailed, data-intensive models for the prediction of precise spatio-temporal expression patterns on the one side, and crude, generally applicable models on the other side. Our model borrows some of the strengths of each group and addresses their drawbacks.

    Availability: The RRS source code is freely available upon publication of this manuscript:

    Funded by: Medical Research Council: MC_U105260799

    Bioinformatics (Oxford, England) 2010;26;19;2391-7

  • Is DNA a worm-like chain in Couette flow? In search of persistence length, a critical review.

    Rittman M, Gilroy E, Koohya H, Rodger A and Richards A

    Molecular Organisation and Assembly in Cells Doctoral Training Centre, University of Warwick, Coventry CV4 7AL, UK.

    Persistence length is the foremost measure of DNA flexibility. Its origins lie in polymer theory which was adapted for DNA following the determination of B-DNA structure in 1953. There is no single definition of persistence length used, and the links between published definitions are based on assumptions which may, or may not be, clearly stated. DNA flexibility is affected by local ionic strength, solvent environment, bound ligands and intrinsic sequence-dependent flexibility. This article is a review of persistence length providing a mathematical treatment of the relationships between four definitions of persistence length, including: correlation, Kuhn length, bending, and curvature. Persistence length has been measured using various microscopy, force extension and solution methods such as linear dichroism and transient electric birefringence. For each experimental method a model of DNA is required to interpret the data. The importance of understanding the underlying models, along with the assumptions required by each definition to determine a value of persistence length, is highlighted for linear dichroism data, where it transpires that no model is currently available for long DNA or medium to high shear rate experiments.

    Science progress 2009;92;Pt 2;163-204
