Archive Page: Vertebrate Annotation | Human Genetics

Vertebrate Annotation

The HAVANA team relocated to EMBL-EBI in 2017 and continues to create reference gene annotation as part of Ensembl, within Paul Flicek's team.

Our Research and Approach


A genome is only as valuable as its annotation. To create a gold-standard reference annotation, the Human and Vertebrate Analysis and Annotation (HAVANA) team uses tools developed in-house to manually annotate human, mouse, zebrafish and other vertebrate genomes. This annotation appears in the Vega browser.

The Sanger Institute has contributed substantially to many vertebrate genome sequences, including all or part of human chromosomes 1, 6, 9, 10, 13, 20, 22 and X and mouse chromosomes 2, 4, 11 and X, and the full Danio rerio (zebrafish) genome sequence. The Institute has also sequenced, or continues to sequence, selected parts of other vertebrate genomes, including candidate diabetes gene regions (in reference and non-obese diabetic (NOD) mouse strains) and MHC regions (in wallaby, Tasmanian devil, gorilla, dog, pig, human haplotypes and mouse strains). The HAVANA team provides the manual annotation for these and other genome sequences.

The HAVANA group puts special emphasis on splice variants and pseudogenes, two areas still underdeveloped in automated annotation systems, as well as poly-adenylation features. Also, where other systems concentrate on, or are limited to, protein-coding genes, many HAVANA transcripts are annotated without a protein-coding region. These transcripts may function as non-coding RNAs or they may be incomplete gene fragments for which the coding sequence cannot yet be determined.

The HAVANA group requires that all annotated gene structures (transcripts) are supported by transcriptional evidence from cDNA, EST or protein sequences. As such, not all annotated transcripts are necessarily complete. Support need not be locus-specific: homologous, paralogous or orthologous evidence is also accepted.
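
As a rough illustration of the evidence rule described above, the following Python sketch checks whether a transcript carries at least one accepted piece of support. The `Evidence` and `Transcript` classes, their field names and the example accessions are hypothetical; they are not part of the HAVANA annotation software and only restate the rule in code form.

```python
from dataclasses import dataclass, field

# Evidence types named in the text; the string labels are illustrative.
EVIDENCE_TYPES = {"cDNA", "EST", "protein"}
# Relationships between evidence and annotated locus that the text allows.
RELATIONSHIPS = {"locus-specific", "homologous", "paralogous", "orthologous"}


@dataclass
class Evidence:
    accession: str      # hypothetical identifier of the supporting sequence
    kind: str           # expected to be one of EVIDENCE_TYPES
    relationship: str   # expected to be one of RELATIONSHIPS


@dataclass
class Transcript:
    name: str
    evidence: list = field(default_factory=list)


def is_supported(transcript):
    """Return True if at least one evidence item has an accepted type and relationship."""
    return any(
        ev.kind in EVIDENCE_TYPES and ev.relationship in RELATIONSHIPS
        for ev in transcript.evidence
    )


# Hypothetical examples: one transcript backed by an orthologous EST,
# one backed only by a gene prediction (not an accepted evidence type).
supported = Transcript("example-001", [Evidence("ACC0001", "EST", "orthologous")])
unsupported = Transcript(
    "example-002", [Evidence("PRED0001", "gene prediction", "locus-specific")]
)
print(is_supported(supported), is_supported(unsupported))
```

In this sketch a gene prediction alone does not count as support, which mirrors the statement that predictions are consulted alongside, rather than instead of, transcript and protein evidence.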

While transcript and protein sequences are the most important pieces of information, HAVANA annotation also takes into account other data, such as CpG islands, gene predictions, repeats and genome signatures. Because the annotation software used is DAS (Distributed Annotation System) aware, the HAVANA team can link to external data sources. Ensembl gene models and data from GENCODE collaborators are some of the DAS sources the HAVANA group uses. HAVANA sources are under constant review and subject to change. For example, the group recently started to use data from new technologies such as RNA-seq and protein mass spectrometry in its annotation efforts.


Dr Adam Frankish
Group Leader

As a team leader in the HAVANA group, my primary responsibility is managing the production of reference gene annotation for human and mouse within the GENCODE project. My focus is driving improvement in gene annotation to support more accurate interpretation of variation in both the research and clinical environments.

Alumni


Gray, Michael

Michael Gray
Former Senior Software Developer in the Annosoft Team

Guest, Gemma

Gemma Guest
Former Senior Software Developer in the Annosoft Team

Hunt, Toby

Dr Toby Hunt
Former Senior Computer Biologist at the Sanger Institute

Kay, Mike

Mike Kay
Senior Computer Biologist

Key Projects, Collaborations, Tools & Data

The HAVANA group collaborates with others on both small and large projects. The largest projects aim to annotate the entire human, mouse and zebrafish genomes.


  • Extension of human lncRNA transcripts by RACE coupled with long-read high-throughput sequencing (RACE-Seq).

    Lagarde J, Uszczynska-Ratajczak B, Santoyo-Lopez J, Gonzalez JM, Tapanari E et al.

    Nature communications 2016;7;12339

  • Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow.

    Wright JC, Mudge J, Weisser H, Barzine MP, Gonzalez JM et al.

    Nature communications 2016;7;11778

  • Ensembl 2016.

    Yates A, Akanni W, Amode MR, Barrell D, Billis K et al.

    Nucleic acids research 2016;44;D1;D710-6

  • The pig X and Y Chromosomes: structure, sequence, and evolution.

    Skinner BM, Sargent CA, Churcher C, Hunt T, Herrero J et al.

    Genome research 2016;26;1;130-9

  • Devising a Consensus Framework for Validation of Novel Human Coding Loci.

    Bruford EA, Lane L and Harrow J

    Journal of proteome research 2015;14;12;4945-8

  • Creating reference gene annotation for the mouse C57BL6/J genome assembly.

    Mudge JM and Harrow J

    Mammalian genome : official journal of the International Mammalian Genome Society 2015;26;9-10;366-78

  • A 2.5-kilobase deletion containing a cluster of nine microRNAs in the latency-associated-transcript locus of the pseudorabies virus affects the host response of porcine trigeminal ganglia during established latency.

    Mahjoub N, Dhorne-Pollet S, Fuchs W, Endale Ahanda ML, Lange E et al.

    Journal of virology 2015;89;1;428-42

  • Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction.

    Frankish A, Uszczynska B, Ritchie GR, Gonzalez JM, Pervouchine D et al.

    BMC genomics 2015;16 Suppl 8;S2

  • Comprehensive comparative homeobox gene annotation in human and mouse.

    Wilming LG, Boychenko V and Harrow JL

    Database : the journal of biological databases and curation 2015;2015

  • Ensembl 2015.

    Cunningham F, Amode MR, Barrell D, Beal K, Billis K et al.

    Nucleic acids research 2015;43;Database issue;D662-9

  • RNAcentral: an international database of ncRNA sequences.

    RNAcentral Consortium, Petrov AI, Kay SJE, Gibson R, Kulesha E et al.

    Nucleic acids research 2015;43;Database issue;D123-9

  • Multiple evidence strands suggest that there may be as few as 19,000 human protein-coding genes.

    Ezkurdia I, Juan D, Rodriguez JM, Frankish A, Diekhans M et al.

    Human molecular genetics 2014;23;22;5866-78

  • Comparative analysis of pseudogenes across three phyla.

    Sisu C, Pei B, Leng J, Frankish A, Zhang Y et al.

    Proceedings of the National Academy of Sciences of the United States of America 2014;111;37;13361-6

  • Comparative analysis of the transcriptome across distant species.

    Gerstein MB, Rozowsky J, Yan KK, Wang D, Cheng C et al.

    Nature 2014;512;7515;445-8

  • Genome-wide association meta-analysis of human longevity identifies a novel locus conferring survival beyond 90 years of age.

    Deelen J, Beekman M, Uh HW, Broer L, Ayers KL et al.

    Human molecular genetics 2014;23;16;4420-32

  • Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences.

    Colonna V, Ayub Q, Chen Y, Pagani L, Luisi P et al.

    Genome biology 2014;15;6;R88

  • The Vertebrate Genome Annotation browser 10 years on.

    Harrow JL, Steward CA, Frankish A, Gilbert JG, Gonzalez JM et al.

    Nucleic acids research 2014;42;Database issue;D771-9

  • Current status and new features of the Consensus Coding Sequence database.

    Farrell CM, O'Leary NA, Harte RA, Loveland JE, Wilming LG et al.

    Nucleic acids research 2014;42;Database issue;D865-72

  • Ensembl 2014.

    Flicek P, Amode MR, Barrell D, Beal K, Billis K et al.

    Nucleic acids research 2014;42;Database issue;D749-55

  • GENCODE pseudogenes.

    Frankish A and Harrow J

    Methods in molecular biology (Clifton, N.J.) 2014;1167;129-55

  • Assessment of transcript reconstruction methods for RNA-seq.

    Steijger T, Abril JF, Engström PG, Kokocinski F, RGASP Consortium et al.

    Nature methods 2013;10;12;1177-84

  • Functional transcriptomics in the post-ENCODE era.

    Mudge JM, Frankish A and Harrow J

    Genome research 2013;23;12;1961-73

  • Systematic evaluation of spliced alignment programs for RNA-seq data.

    Engström PG, Steijger T, Sipos B, Grant GR, Kahles A et al.

    Nature methods 2013;10;12;1185-91

  • Integrative annotation of variants from 1092 humans: application to cancer genomics.

    Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM et al.

    Science (New York, N.Y.) 2013;342;6154;1235587

  • Best practices in bioinformatics training for life scientists.

    Via A, Blicher T, Bongcam-Rudloff E, Brazas MD, Brooksbank C et al.

    Briefings in bioinformatics 2013;14;5;528-37

  • iAnn: an event sharing platform for the life sciences.

    Jimenez RC, Albar JP, Bhak J, Blatter MC, Blicher T et al.

    Bioinformatics (Oxford, England) 2013;29;15;1919-21

  • Transcriptome analysis of human tissues and cell lines reveals one dominant transcript per gene.

    Gonzàlez-Porta M, Frankish A, Rung J, Harrow J and Brazma A

    Genome biology 2013;14;7;R70

  • Structural and functional annotation of the porcine immunome.

    Dawson HD, Loveland JE, Pascal G, Gilbert JG, Uenishi H et al.

    BMC genomics 2013;14;332

  • The zebrafish reference genome sequence and its relationship to the human genome.

    Howe K, Clark MD, Torroja CF, Torrance J, Berthelot C et al.

    Nature 2013;496;7446;498-503

  • The non-obese diabetic mouse sequence, annotation and variation resource: an aid for investigating type 1 diabetes.

    Steward CA, Gonzalez JM, Trevanion S, Sheppard D, Kerry G et al.

    Database : the journal of biological databases and curation 2013;2013;bat032

  • Ensembl 2013.

    Flicek P, Ahmed I, Amode MR, Barrell D, Beal K et al.

    Nucleic acids research 2013;41;Database issue;D48-55

  • Sequencing and comparative analysis of the gorilla MHC genomic sequence.

    Wilming LG, Hart EA, Coggill PC, Horton R, Gilbert JG et al.

    Database : the journal of biological databases and curation 2013;2013;bat011

  • Analyses of pig genomes provide insight into porcine demography and evolution.

    Groenen MA, Archibald AL, Uenishi H, Tuggle CK, Takeuchi Y et al.

    Nature 2012;491;7424;393-8

  • An integrated map of genetic variation from 1,092 human genomes.

    1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA et al.

    Nature 2012;491;7422;56-65

  • The GENCODE pseudogene resource.

    Pei B, Sisu C, Frankish A, Howald C, Habegger L et al.

    Genome biology 2012;13;9;R51

  • An integrated encyclopedia of DNA elements in the human genome.

    ENCODE Project Consortium

    Nature 2012;489;7414;57-74

  • Landscape of transcription in human cells.

    Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T et al.

    Nature 2012;489;7414;101-8

  • The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression.

    Derrien T, Johnson R, Bussotti G, Tanzer A, Djebali S et al.

    Genome research 2012;22;9;1775-89

  • Combining RT-PCR-seq and RNA-seq to catalog all genic elements encoded in the human genome.

    Howald C, Tanzer A, Chrast J, Kokocinski F, Derrien T et al.

    Genome research 2012;22;9;1698-710

  • Comparative proteomics reveals a significant bias toward alternative protein isoforms with conserved structure and function.

    Ezkurdia I, del Pozo A, Frankish A, Rodriguez JM, Harrow J et al.

    Molecular biology and evolution 2012;29;9;2265-83

  • GENCODE: the reference human genome annotation for The ENCODE Project.

    Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M et al.

    Genome research 2012;22;9;1760-74

  • Bioinformatics Training Network (BTN): a community resource for bioinformatics trainers.

    Schneider MV, Walter P, Blatter MC, Watson J, Brazas MD et al.

    Briefings in bioinformatics 2012;13;3;383-9

  • 2-Carboxy-D-arabinitol 1-phosphate (CA1P) phosphatase: evidence for a wider role in plant Rubisco regulation.

    Andralojc PJ, Madgwick PJ, Tao Y, Keys A, Ward JL et al.

    The Biochemical journal 2012;442;3;733-42

  • A systematic survey of loss-of-function variants in human protein-coding genes.

    MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J et al.

    Science (New York, N.Y.) 2012;335;6070;823-8

  • The importance of identifying alternative splicing in vertebrate genome annotation.

    Frankish A, Mudge JM, Thomas M and Harrow J

    Database : the journal of biological databases and curation 2012;2012;bas014

  • Community gene annotation in practice.

    Loveland JE, Gilbert JG, Griffiths E and Harrow JL

    Database : the journal of biological databases and curation 2012;2012;bas009

  • Ensembl 2012.

    Flicek P, Amode MR, Barrell D, Beal K, Brent S et al.

    Nucleic acids research 2012;40;Database issue;D84-90

  • Evidence for transcript networks composed of chimeric RNAs in human cells.

    Djebali S, Lagarde J, Kapranov P, Lacroix V, Borel C et al.

    PloS one 2012;7;1;e28213

  • Tracking and coordinating an international curation effort for the CCDS Project.

    Harte RA, Farrell CM, Loveland JE, Suner MM, Wilming L et al.

    Database : the journal of biological databases and curation 2012;2012;bas008

  • The origins, evolution, and functional potential of alternative splicing in vertebrates.

    Mudge JM, Frankish A, Fernandez-Banet J, Alioto T, Derrien T et al.

    Molecular biology and evolution 2011;28;10;2949-59

  • The tammar wallaby major histocompatibility complex shows evidence of past genomic instability.

    Siddle HV, Deakin JE, Coggill P, Wilming LG, Harrow J et al.

    BMC genomics 2011;12;421

  • A conditional knockout resource for the genome-wide study of mouse gene function.

    Skarnes WC, Rosen B, West AP, Koutsourakis M, Bushell W et al.

    Nature 2011;474;7351;337-42

  • Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome.

    Brosch M, Saunders GI, Frankish A, Collins MO, Yu L et al.

    Genome research 2011;21;5;756-67

  • A user's guide to the encyclopedia of DNA elements (ENCODE).

    ENCODE Project Consortium

    PLoS biology 2011;9;4;e1001046

  • Gene inactivation and its implications for annotation in the era of personal genomics.

    Balasubramanian S, Habegger L, Frankish A, MacArthur DG, Harte R et al.

    Genes & development 2011;25;1;1-10

  • AnnoTrack--a tracking system for genome annotation.

    Kokocinski F, Harrow J and Hubbard T

    BMC genomics 2010;11;538

  • Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates.

    Zhang ZD, Frankish A, Hunt T, Harrow J and Gerstein M

    Genome biology 2010;11;3;R26

  • Meeting report: a workshop on Best Practices in Genome Annotation.

    Madupu R, Brinkac LM, Harrow J, Wilming LG, Böhme U et al.

    Database : the journal of biological databases and curation 2010;2010;baq001

  • Quantifying the mechanisms of domain gain in animal proteins.

    Buljan M, Frankish A and Bateman A

    Genome biology 2010;11;7;R74

  • Manual annotation and analysis of the defensin gene cluster in the C57BL/6J mouse reference genome.

    Amid C, Rehaume LM, Brown KL, Gilbert JG, Dougan G et al.

    BMC genomics 2009;10;606

  • Discovery of candidate disease genes in ENU-induced mouse mutants by large-scale sequencing, including a splice-site mutation in nucleoredoxin.

    Boles MK, Wilkinson BM, Wilming LG, Liu B, Probst FJ et al.

    PLoS genetics 2009;5;12;e1000759

  • MHC-linked and un-linked class I genes in the wallaby.

    Siddle HV, Deakin JE, Coggill P, Hart E, Cheng Y et al.

    BMC genomics 2009;10;310

  • The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

    Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M et al.

    Genome research 2009;19;7;1316-23

  • The genome sequence of taurine cattle: a window to ruminant biology and evolution.

    Bovine Genome Sequencing and Analysis Consortium, Elsik CG, Tellam RL, Worley KC, Gibbs RA et al.

    Science (New York, N.Y.) 2009;324;5926;522-8

  • Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes.

    Balasubramanian S, Zheng D, Liu YJ, Fang G, Frankish A et al.

    Genome biology 2009;10;1;R2

  • Identifying protein-coding genes in genomic sequences.

    Harrow J, Nagy A, Reymond A, Alioto T, Patthy L et al.

    Genome biology 2009;10;1;201

  • Efficient targeted transcript discovery via array-based normalization of RACE libraries.

    Djebali S, Kapranov P, Foissac S, Lagarde J, Reymond A et al.

    Nature methods 2008;5;7;629-35

  • Determination and validation of principal gene products.

    Tress ML, Wesselink JJ, Frankish A, López G, Goldman N et al.

    Bioinformatics (Oxford, England) 2008;24;1;11-7

  • Dynamic instability of the major urinary protein gene family revealed by genomic and phenotypic comparisons between C57 and 129 strain mice.

    Mudge JM, Armstrong SD, McLaren K, Beynon RJ, Hurst JL et al.

    Genome biology 2008;9;5;R91

  • Variation analysis and gene annotation of eight MHC haplotypes: the MHC Haplotype Project.

    Horton R, Gibson R, Coggill P, Miretti M, Allcock RJ et al.

    Immunogenetics 2008;60;1;1-18

  • Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

    ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R et al.

    Nature 2007;447;7146;799-816

  • Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions.

    Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R et al.

    Genome research 2007;17;6;746-59

  • Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution.

    Zheng D, Frankish A, Baertsch R, Kapranov P, Reymond A et al.

    Genome research 2007;17;6;839-51

  • The implications of alternative splicing in the ENCODE protein complement.

    Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ et al.

    Proceedings of the National Academy of Sciences of the United States of America 2007;104;13;5495-500

  • Lessons learned from the initial sequencing of the pig genome: comparative analysis of an 8 Mb region of pig chromosome 17.

    Hart EA, Caccamo M, Harrow JL, Humphray SJ, Gilbert JG et al.

    Genome biology 2007;8;8;R168

  • The DNA sequence and biological annotation of human chromosome 1.

    Gregory SG, Barlow KF, McLay KE, Kaul R, Swarbreck D et al.

    Nature 2006;441;7091;315-21

  • DNA sequence of human chromosome 17 and analysis of rearrangement in the human lineage.

    Zody MC, Garber M, Adams DJ, Sharpe T, Harrow J et al.

    Nature 2006;440;7087;1045-9

  • Genomic anatomy of the Tyrp1 (brown) deletion complex.

    Smyth IM, Wilming L, Lee AW, Taylor MS, Gautier P et al.

    Proceedings of the National Academy of Sciences of the United States of America 2006;103;10;3704-9

  • EGASP: the human ENCODE Genome Annotation Assessment Project.

    Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J et al.

    Genome biology 2006;7 Suppl 1;S2.1-31

  • GENCODE: producing a reference annotation for ENCODE.

    Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK et al.

    Genome biology 2006;7 Suppl 1;S4.1-9

  • Genetic analysis of completely sequenced disease-associated MHC haplotypes identifies shuffling of segments in recent human history.

    Traherne JA, Horton R, Roberts AN, Miretti MM, Hurles ME et al.

    PLoS genetics 2006;2;1;e9

  • Performance assessment of promoter predictions on ENCODE regions in the EGASP experiment.

    Bajic VB, Brent MR, Brown RH, Frankish A, Harrow J et al.

    Genome biology 2006;7 Suppl 1;S3.1-13

  • Validation of mRNA/EST-based gene predictions in human Xp11.4 revealed differences to the organization of the orthologous mouse locus.

    Wen G, Ramser J, Taudien S, Gausmann U, Blechschmidt K et al.

    Mammalian genome : official journal of the International Mammalian Genome Society 2005;16;12;934-41

  • Evidence for widespread reticulate evolution within human duplicons.

    Jackson MS, Oliver K, Loveland J, Humphray S, Dunham I et al.

    American journal of human genetics 2005;77;5;824-40

  • VEGA, the genome browser with a difference.

    Loveland J

    Briefings in bioinformatics 2005;6;2;189-93

  • The DNA sequence of the human X chromosome.

    Ross MT, Grafham DV, Coffey AJ, Scherer S, McLay K et al.

    Nature 2005;434;7031;325-37

  • Evolutionary implications of pericentromeric gene expression in humans.

    Mudge JM and Jackson MS

    Cytogenetic and genome research 2005;108;1-3;47-57

  • Genomic sequence of the class II region of the canine MHC: comparison with the MHC of other mammalian species.

    Debenham SL, Hart EA, Ashurst JL, Howe KL, Quail MA et al.

    Genomics 2005;85;1;48-59

  • Polymorphic segmental duplications at 8p23.1 challenge the determination of individual defensin gene repertoires and the assembly of a contiguous human reference sequence.

    Taudien S, Galgoczy P, Huse K, Reichwald K, Schilhabel M et al.

    BMC genomics 2004;5;1;92

  • Identification of mammalian microRNA host genes and transcription units.

    Rodriguez A, Griffiths-Jones S, Ashurst JL and Bradley A

    Genome research 2004;14;10A;1902-10

  • Organization and evolution of a gene-rich region of the mouse genome: a 12.7-Mb region deleted in the Del(13)Svea36H mouse.

    Mallon AM, Wilming L, Weekes J, Gilbert JG, Ashurst J et al.

    Genome research 2004;14;10A;1888-901

  • Complete MHC haplotype sequencing for common disease gene mapping.

    Stewart CA, Horton R, Allcock RJ, Ashurst JL, Atrazhev AM et al.

    Genome research 2004;14;6;1176-87

  • Integrative annotation of 21,037 human genes validated by full-length cDNA clones.

    Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S et al.

    PLoS biology 2004;2;6;e162

  • The DNA sequence and comparative analysis of human chromosome 10.

    Deloukas P, Earthrowl ME, Grafham DV, Rubenfield M, French L et al.

    Nature 2004;429;6990;375-81

  • DNA sequence and analysis of human chromosome 9.

    Humphray SJ, Oliver K, Hunt AR, Plumb RW, Loveland JE et al.

    Nature 2004;429;6990;369-74

  • The DNA sequence and analysis of human chromosome 13.

    Dunham A, Matthews LH, Burton J, Ashurst JL, Howe KL et al.

    Nature 2004;428;6982;522-8

  • The DNA sequence and analysis of human chromosome 6.

    Mungall AJ, Palmer SA, Sims SK, Edwards CA, Ashurst JL et al.

    Nature 2003;425;6960;805-11

  • Neocentromeres in 15q24-26 map to duplicons which flanked an ancestral centromere in 15q25.

    Ventura M, Mudge JM, Palumbo V, Burn S, Blennow E et al.

    Genome research 2003;13;9;2059-68

  • Gene annotation: prediction and testing.

    Ashurst JL and Collins JE

    Annual review of genomics and human genetics 2003;4;69-88

  • Meeting highlights: genome informatics.

    Wixon J and Ashurst J

    Comparative and functional genomics 2003;4;5;509-14