Vertebrate genome analysis

Genome sequences provide a natural index for organising and understanding biological data.

Following the sequencing of the human and other vertebrate genomes, vertebrate genome browsers such as Ensembl have become critical resources, providing biologists with integrated access to the sequence and its associated annotation. The activities of the Vertebrate genome analysis team revolve around generating and presenting core vertebrate genome annotation, particularly in the form of reference genesets, and maintaining the reference genome sequences of human, mouse and zebrafish. As well as contributing to resources used globally, the team is involved in a wide variety of collaborations on genome annotation and on the development of improved methods for analysis and annotation, which have resulted in many publications. The principal investigator of the team is Tim Hubbard, who also has a small research group.


Background

Part of a 'view' from the Ensembl Genome Browser.


The team includes the Wellcome Trust Sanger Institute part of the Ensembl project (led by Steve Searle) and the Havana annotation group (led by Jen Harrow). Ensembl is a joint project with the European Bioinformatics Institute (EBI). Steve Searle's EBI counterpart is Paul Flicek, who heads the EBI Vertebrate Genomics Team. Sanger Institute Ensembl consists of the genebuild group (led by Steve Searle), which generates genesets using an automatic pipeline, and the web team (led by James Smith), which develops and maintains the Ensembl website.

Research

The Otterlace Annotation Tool.


A major combined activity of Havana and the Ensembl genebuild group is to generate complete, high-accuracy genesets for the high-quality reference genomes of human and mouse. Ensembl generates complete genesets using its automatic pipeline for most of the 40+ genomes that it contains. Human and mouse are exceptions: their genesets are referred to as 'Ensembl-Havana' because they combine curated gene structures from Havana with annotation from the Ensembl automatic pipeline. So far only about 50 per cent of the human genome and 30 per cent of the mouse genome have been manually curated. Ultimately the whole of these genesets will be curated. For human, this is the objective of the GENCODE project, a scale-up programme of the NHGRI-funded ENCODE project that brings together Havana, Ensembl and seven external groups to generate the reference geneset for the human genome. The Ensembl-Havana geneset incorporates the subset of human and mouse CDS (protein coding) regions that have been curated and agreed by the CCDS consortium, which includes curators at Havana and NCBI (RefSeq), with computational annotation and assessment from the Ensembl genebuild group and UCSC.

The gene curation carried out by Havana is supported by specialist analysis pipelines and annotation tools provided by the Anacode group (led by James Gilbert). Anacode also develops and maintains many of the software systems that support curation of reference genome sequences and WTSI sequence submission to the EMBL sequence database (the EBI partner of the INSDC database consortium). A key component of the otterlace curation interface, which can be used by annotators anywhere in the world, is the ZMAP genome display engine developed by the Acedb group (led by Ed Griffiths). The Acedb group also continues to support the Acedb database package, which is used by the model organism database WormBase. The Havana group is involved in the annotation of genes as candidates for knockout in mouse for the Embryonic Stem (ES) Cell Mutagenesis team of Bill Skarnes, as part of the EUCOMM and KOMP projects. Otterlace is also used remotely by KOMP annotators at the Genome Center at Washington University.

The genome of the zebrafish (a key model organism) is being sequenced to reference quality by WTSI. The team includes the Zebrafish analysis group (led by Kerstin Howe), which is responsible for preparing genome assemblies and integrating functional data, such as data from the EU ZF-models project and the Sanger Institute zebrafish mutagenesis project. Kerstin also leads the informatics group of the Sanger Institute's component of the Genome Reference Consortium (GRC), which is responsible for maintaining the reference genome sequences of human and mouse.

Selected Publications

  • Ensembl 2009.

    Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Rios D, Schuster M, Slater G, Smedley D, Spooner W, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wilder S, Zadissa A, Birney E, Cunningham F, Curwen V, Durbin R, Fernandez-Suarez XM, Herrero J, Kasprzyk A, Proctor G, Smith J, Searle S and Flicek P

    Nucleic acids research 2009;37;Database issue;D690-7

  • Petabyte-scale innovations at the European Nucleotide Archive.

    Cochrane G, Akhtar R, Bonfield J, Bower L, Demiralp F, Faruque N, Gibson R, Hoad G, Hubbard T, Hunter C, Jang M, Juhos S, Leinonen R, Leonard S, Lin Q, Lopez R, Lorenc D, McWilliam H, Mukherjee G, Plaister S, Radhakrishnan R, Robinson S, Sobhany S, Hoopen PT, Vaughan R, Zalunin V and Birney E

    Nucleic acids research 2009;37;Database issue;D19-25

  • The Protein Feature Ontology: a tool for the unification of protein feature annotations.

    Reeves GA, Eilbeck K, Magrane M, O'Donovan C, Montecchi-Palazzi L, Harris MA, Orchard S, Jimenez RC, Prlic A, Hubbard TJ, Hermjakob H and Thornton JM

    Bioinformatics (Oxford, England) 2008;24;23;2767-72

  • BioJava: an open-source framework for bioinformatics.

    Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M and Schreiber MJ

    Bioinformatics (Oxford, England) 2008;24;18;2096-7

  • Integrating biological data - the Distributed Annotation System.

    Jenkinson AM, Albrecht M, Birney E, Blankenburg H, Down T, Finn RD, Hermjakob H, Hubbard TJ, Jimenez RC, Jones P, Kähäri A, Kulesha E, Macías JR, Reeves GA and Prlic A

    BMC bioinformatics 2008;9 Suppl 8;S3