Vertebrate Annotation

Human Genetics

Archive Page

This page is maintained as a historical record and is no longer being updated.

The HAVANA team relocated to EMBL-EBI in 2017 and continues to create reference gene annotation as part of Ensembl, within Paul Flicek’s team.

The value of a genome is only as good as its annotation. To create a gold-standard reference annotation, the Human and Vertebrate Analysis and Annotation (HAVANA) team uses tools developed in-house to manually annotate human, mouse, zebrafish and other vertebrate genomes. This annotation appears in the Vega browser.

The Sanger Institute has contributed substantially to many vertebrate genome sequences, including all or part of human chromosomes 1, 6, 9, 10, 13, 20, 22 and X and mouse chromosomes 2, 4, 11 and X, and the full Danio rerio (zebrafish) genome sequence. The Institute has also sequenced or continues to sequence selected parts of other vertebrate genomes, including candidate diabetes gene regions (in reference and non-obese diabetic (NOD) mouse strains) and MHC regions (in wallaby, Tasmanian devil, gorilla, dog, pig, human haplotypes and mouse strains). The HAVANA team provides the manual annotation for these and other genome sequences.

The HAVANA group puts special emphasis on splice variants and pseudogenes, two areas still underdeveloped in automated annotation systems, as well as poly-adenylation features. Also, where other systems concentrate on, or are limited to, protein-coding genes, many HAVANA transcripts are annotated without a protein-coding region. These transcripts may function as non-coding RNAs or they may be incomplete gene fragments for which the coding sequence cannot yet be determined.

The HAVANA group requires that all annotated gene structures (transcripts) are supported by transcriptional evidence from cDNA, EST or protein sequences. As a result, not all annotated transcripts are necessarily complete. Support does not need to come from locus-specific evidence; it can also be homologous, paralogous or orthologous.
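The evidence requirement above can be illustrated with a minimal sketch. This is not HAVANA's software; the `Evidence` class, its field names and the `is_supported` function are hypothetical and exist only to show the rule that a transcript needs at least one cDNA, EST or protein alignment, and that the supporting sequence need not come from the same locus.

```python
from dataclasses import dataclass
from typing import List

# Evidence types named in the text; everything else here is hypothetical.
ACCEPTED_EVIDENCE_TYPES = {"cDNA", "EST", "protein"}


@dataclass
class Evidence:
    accession: str        # illustrative identifier, not a real accession
    kind: str             # e.g. "cDNA", "EST", "protein"
    locus_specific: bool  # False for homologous/paralogous/orthologous support


def is_supported(evidence: List[Evidence]) -> bool:
    """Return True if at least one piece of evidence is of an accepted type.

    Locus specificity is deliberately not checked, because the text states
    that support may also be homologous, paralogous or orthologous.
    """
    return any(item.kind in ACCEPTED_EVIDENCE_TYPES for item in evidence)
```

Under these assumptions, a transcript backed only by an orthologous protein alignment passes the check, while one backed only by a gene prediction does not.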

While the transcript and protein sequences are the most important pieces of information, HAVANA annotation also takes into account other data, such as CpG islands, gene predictions, repeats and genome signatures. Because the annotation software used is DAS (Distributed Annotation System) aware, the HAVANA team can link to external data sources. Ensembl gene models and data from GENCODE collaborators are some of the DAS sources the HAVANA group uses. HAVANA sources are under constant review and subject to change. For example, the group recently started to use data from new technologies such as RNA-seq and protein mass spectrometry in its annotation efforts.

The team aims to develop accurate and comprehensive annotation representing the full complexity of gene loci and their features. Manual annotation is especially important in areas that are not well catered for by automated annotation systems, such as splice variation, pseudogenes, conserved gene families, duplications and non-coding genes. The HAVANA team constantly updates its methods by incorporating new data sources that are created as new technologies are developed. HAVANA annotation is freely available through genome browsers, including VEGA, Ensembl and UCSC.

If you have any queries regarding our annotation, please contact us at

Core team

Mr James Gilbert

Senior Software Developer

Dr Mark Thomas

Principal Bioinformatician

Previous team members

Ruth Bennett

Computer Biologist

Dr Gloria Despacio-Reyes

Senior Computer Biologist

Sarah Donaldson

Senior Computer Biologist

Dr Jose M Gonzalez

Former Senior Bioinformatician at the Sanger Institute

Dr Toby Hunt

Former Senior Computer Biologist at the Sanger Institute

Mike Kay

Senior Computer Biologist

Dr Jane Loveland

Principal Computer Biologist

Deepa Manthravadi

Computer Biologist

Dr Jonathan M. Mudge

Former Senior Computer Biologist at the Sanger Institute

Dr Gaurab Mukherjee

Senior Computer Biologist

Marie-Marthe Suner

Senior Computer Biologist


We work with the following groups



CCDS is a collaboration between the Wellcome Trust Sanger Institute (HAVANA), EBI (Ensembl), NCBI (RefSeq), UCSC (Genome Bioinformatics Group), HUGO Gene Nomenclature Committee (HGNC) and Mouse Genome Informatics (MGI). CCDS strives to provide a comprehensive database of high-quality coding regions from the human and mouse genomes agreed by all collaborators. Annotation from the Sanger Institute and RefSeq, which is created using different techniques, is compared, and a CCDS entry is created when the two agree on the coding sequence structure for a given transcript or locus. Conflicts are discussed between all three parties and, where a consensus can be reached, a CCDS entry is created.
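The comparison step described above can be sketched as follows. This is a simplified illustration, not the CCDS pipeline: the `CodingRegion` type, the `find_consensus` function, the locus names and the coordinates are all hypothetical, and "agreement" is assumed here to mean identical chromosome, strand and coding exon coordinates.

```python
from typing import Dict, List, NamedTuple, Tuple


class CodingRegion(NamedTuple):
    chromosome: str
    strand: str
    exons: Tuple[Tuple[int, int], ...]  # coding exon (start, end) pairs


def find_consensus(
    havana: Dict[str, CodingRegion],
    refseq: Dict[str, CodingRegion],
) -> Tuple[Dict[str, CodingRegion], List[str]]:
    """Split loci annotated by both sources into agreed and conflicting sets."""
    agreed: Dict[str, CodingRegion] = {}
    conflicts: List[str] = []
    for locus in sorted(havana.keys() & refseq.keys()):
        if havana[locus] == refseq[locus]:
            agreed[locus] = havana[locus]  # candidate for a consensus entry
        else:
            conflicts.append(locus)        # left for discussion between groups
    return agreed, conflicts


# Made-up example data: LOCUS_A matches, LOCUS_B differs in its last exon end.
havana_set = {
    "LOCUS_A": CodingRegion("1", "+", ((100, 200), (300, 400))),
    "LOCUS_B": CodingRegion("2", "-", ((500, 650),)),
}
refseq_set = {
    "LOCUS_A": CodingRegion("1", "+", ((100, 200), (300, 400))),
    "LOCUS_B": CodingRegion("2", "-", ((500, 640),)),
}
agreed, conflicts = find_consensus(havana_set, refseq_set)
```

In this sketch, `LOCUS_A` ends up in `agreed` and `LOCUS_B` in `conflicts`, mirroring the two outcomes the text describes: automatic agreement, or referral for discussion.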



IMPC is a collaboration between the three main mouse knockout projects: EUCOMM (European Conditional Mouse Mutagenesis), KOMP (Knockout Mouse Project) and NorCOMM (North American Conditional Mouse Mutagenesis). Manual annotation by the HAVANA group and collaborators at Washington University, St Louis, and University of Manitoba, Winnipeg, serves as the foundation for constructing knockout mouse cell lines for every coding gene.



The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.


