Archived

Vertebrate Annotation

Human Genetics

Archive Page

This page is maintained as a historical record and is no longer being updated.

The HAVANA team relocated to EMBL-EBI in 2017 and continues to create reference gene annotation as part of Ensembl, within Paul Flicek’s team. http://www.ebi.ac.uk/research/flicek

The value of a genome is only as good as its annotation. To create a gold-standard reference annotation, the Human and Vertebrate Analysis and Annotation (HAVANA) team uses tools developed in-house to manually annotate human, mouse, zebrafish and other vertebrate genomes. This annotation appears in the Vega browser.

The Sanger Institute has contributed substantially to many vertebrate genome sequences, including all or part of human chromosomes 1, 6, 9, 10, 13, 20, 22 and X and mouse chromosomes 2, 4, 11 and X, and the full Danio rerio (zebrafish) genome sequence. The Institute has also sequenced or continues to sequence selected parts of other vertebrate genomes, including candidate diabetes gene regions (in reference and non-obese diabetic (NOD) mouse strains) and MHC regions (in wallaby, Tasmanian devil, gorilla, dog, pig, human haplotypes and mouse strains). The HAVANA team provides the manual annotation for these and other genome sequences.

The HAVANA group puts special emphasis on splice variants and pseudogenes, two areas still underdeveloped in automated annotation systems, as well as poly-adenylation features. Also, where other systems concentrate on, or are limited to, protein-coding genes, many HAVANA transcripts are annotated without a protein-coding region. These transcripts may function as non-coding RNAs or they may be incomplete gene fragments for which the coding sequence cannot yet be determined.

The HAVANA group requires that all annotated gene structures (transcripts) are supported by transcriptional evidence from cDNA, EST or protein sequences. As a result, not all annotated transcripts are necessarily complete. Support does not need to come from locus-specific evidence; it can also be homologous, paralogous or orthologous.

While transcript and protein sequences are the most important pieces of information, HAVANA annotation also draws on other data, such as CpG islands, gene predictions, repeats and genome signatures. Because the annotation software used is DAS (Distributed Annotation System) aware, the HAVANA team can link to external data sources. Ensembl gene models and data from GENCODE collaborators are some of the DAS sources the HAVANA group uses. HAVANA sources are under constant review and subject to change. For example, the group recently started to use data from new technologies such as RNAseq and protein mass spectrometry in its annotation efforts.

The team aims to develop accurate and comprehensive annotation representing the full complexity of gene loci and their features. Manual annotation is especially important in areas that are not well catered for by automated annotation systems, such as splice variation, pseudogenes, conserved gene families, duplications and non-coding genes. The HAVANA team constantly updates its methods by incorporating new data sources that are created as new technologies are developed. HAVANA annotation is freely available through genome browsers, including VEGA, Ensembl and UCSC.

If you have any queries regarding our annotation, please contact us at havana-help@sanger.ac.uk.

Our people

Previous group lead

Dr Adam Frankish

Principal Computer Biologist

As a team leader in the HAVANA group my primary responsibility is managing the production of reference gene annotation for human and mouse within the GENCODE project. My focus is driving improvement in gene annotation to support more accurate interpretation of variation in both the research and clinical environments.

Core team

Mr James Gilbert

Senior Software Developer

Dr Mark Thomas

Principal Bioinformatician

Previous core team members

Ruth Bennett

Computer Biologist

Dr Gloria Despacio-Reyes

Senior Computer Biologist

Sarah Donaldson

Senior Computer Biologist

Dr Jose M Gonzalez

Former Senior Bioinformatician at the Sanger Institute

Dr Toby Hunt

Former Senior Computer Biologist at the Sanger Institute

Mike Kay

Senior Computer Biologist

Dr Jane Loveland

Principal Computer Biologist

Deepa Manthravadi

Computer Biologist

Dr Jonathan M. Mudge

Former Senior Computer Biologist at the Sanger Institute

Dr Gaurab Mukherjee

Senior Computer Biologist

Marie-Marthe Suner

Senior Computer Biologist

Associated research

Collaborations

Collaboration

Manual Annotation

The HAVANA team provides the manual annotation of human, mouse, zebrafish and other vertebrate genomes.

Tools & software

Tool

Ensembl Genome Browser

The Ensembl project creates evidence-based annotation of genome sequences and integrates these data with other biological information. All of Ensembl' ...

Tool

GENCODE

The aim of GENCODE is to annotate all evidence-based gene features in the human and mouse genomes at high accuracy. ...

Tool

VEGA Genome Browser

VEGA displays all of the manual annotation from the HAVANA team.

Related groups

Science group

Genome Reference Informatics Team

Tree of Life Programme

Collaborate on the accuracy of gene loci.

Science group

Proteomic Mass Spectrometry

Scientific Resources

We worked with Jen Harrow on the project "Gencode: Comprehensive gene annotation for human and mouse", funded by NIH/NHGRI.

Science group

Sequence Variation Infrastructure

Human Genetics

Collaborate on maintenance of the mouse gene set for the mouse reference genome and other laboratory mouse strains.

Wellcome Sanger Institute

Programmes and Facilities

Programme

Human Genetics

The Human Genetics Programme is driving a step-change in our understanding of genetic causes and biological mechanisms of disease susceptibility and ...

Partners

We work with the following groups

External

CCDS

CCDS is a collaboration between the Wellcome Trust Sanger Institute (HAVANA), EBI (Ensembl), NCBI (RefSeq), UCSC (Genome Bioinformatics Group), HUGO Gene Nomenclature Committee (HGNC) and Mouse Genome Informatics (MGI). CCDS strives to provide a comprehensive database of high-quality coding regions from the human and mouse genomes agreed by all collaborators. Annotations from the Sanger Institute and RefSeq, which are created using different techniques, are compared, and a CCDS entry is created when the two agree on the coding sequence structure for a given transcript or locus. Conflicts are discussed between all three parties and, where a consensus can be reached, a CCDS entry is created.

External

IMPC

IMPC is a collaboration between the three main mouse knockout projects: EUCOMM (European Conditional Mouse Mutagenesis), KOMP (Knockout Mouse Project) and NorCOMM (North American Conditional Mouse Mutagenesis). Manual annotation by the HAVANA group and collaborators at Washington University, St Louis, and University of Manitoba, Winnipeg, serves as the foundation for constructing knockout mouse cell lines for every coding gene.

External

ENCODE

The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
