Archive Page: Vertebrate Annotation | Human Genetics
The HAVANA team relocated to EMBL-EBI in 2017 and continues to create reference gene annotation as part of Ensembl, within Paul Flicek's team. http://www.ebi.ac.uk/research/flicek
Sanger Institute, Genome Research Limited
Our Research and Approach
The value of a genome is only as good as its annotation. To create a gold standard reference annotation the Human and Vertebrate Analysis and Annotation (HAVANA) team uses tools developed in-house to manually annotate human, mouse, zebrafish and other vertebrate genomes. This annotation appears in the Vega browser.
The Sanger Institute has made major contributions to many vertebrate genome sequences, including all or part of human chromosomes 1, 6, 9, 10, 13, 20, 22 and X and mouse chromosomes 2, 4, 11 and X, and the full Danio rerio (zebrafish) genome sequence. The Institute has also sequenced or continues to sequence selected parts of other vertebrate genomes, including candidate diabetes gene regions (in reference and non-obese diabetic (NOD) mouse strains) and MHC regions (in wallaby, Tasmanian devil, gorilla, dog, pig, human haplotypes and mouse strains). The HAVANA team provides the manual annotation for these and other genome sequences.
The HAVANA group puts special emphasis on splice variants and pseudogenes, two areas still underdeveloped in automated annotation systems, as well as poly-adenylation features. Also, where other systems concentrate on, or are limited to, protein-coding genes, many HAVANA transcripts are annotated without a protein-coding region. These transcripts may function as non-coding RNAs or they may be incomplete gene fragments for which the coding sequence cannot yet be determined.
The HAVANA group requires that all annotated gene structures (transcripts) are supported by transcriptional evidence from cDNA, EST or protein sequences. As such, not all annotated transcripts are necessarily complete. Support does not need to come from locus-specific evidence, but can also be homologous, paralogous or orthologous.
While the transcript and protein sequences are the most important pieces of information, HAVANA annotation also takes into account other data, such as CpG islands, gene predictions, repeats and genome signatures. Because the annotation software used is DAS (Distributed Annotation System) aware, the HAVANA team can link to external data sources. Ensembl gene models and data from GENCODE collaborators are some of the DAS sources the HAVANA group uses. HAVANA sources are under constant review and subject to change. For example, the group recently started to use data from new technologies such as RNA-seq and protein mass spectrometry in its annotation efforts.
The team aims to develop accurate and comprehensive annotation representing the full complexity of gene loci and their features. Manual annotation is especially important in areas that are not well catered for by automated annotation systems, such as splice variation, pseudogenes, conserved gene families, duplications and non-coding genes. The HAVANA team constantly updates its methods by incorporating new data sources that are created as new technologies are developed. HAVANA annotation is freely available through genome browsers, including VEGA, Ensembl and UCSC.
If you have any queries regarding our annotation, please contact us at email@example.com.
As a team leader in the HAVANA group my primary responsibility is managing the production of reference gene annotation for human and mouse within the GENCODE project. My focus is driving improvement in gene annotation to support more accurate interpretation of variation in both the research and clinical environments.
The HAVANA group collaborates with others in both small and large projects. The largest projects are designed to annotate the entire human, mouse and zebrafish genomes. The following are the main HAVANA collaborations relating to these projects:
The Ensembl project creates evidence-based annotation of genome sequences and integrates these data with other biological information. All of Ensembl's results are freely available to geneticists, molecular biologists, bioinformaticians and the wider research community. Ensembl is a joint project between EMBL-EBI and the Wellcome Trust Sanger Institute.
The Human Genetics Programme seeks to bring genomics to population-scale studies (in the UK, and in diverse populations); progress beyond locus discovery and mapping, to causal variant and pathway identification; provide mechanistic insights into how individual variants impact health and disease; and gain knowledge of variable phenotypic expressivity, and assess reversibility of developmental phenotypes, which may yield important therapeutic insights.
CCDS is a collaboration between the Wellcome Trust Sanger Institute (HAVANA), EBI (Ensembl), NCBI (RefSeq), UCSC (Genome Bioinformatics Group), HUGO Gene Nomenclature Committee (HGNC) and Mouse Genome Informatics (MGI). CCDS strives to provide a comprehensive database of high-quality coding regions from the human and mouse genomes agreed by all collaborators. Annotation from Sanger Institute and RefSeq, which is created using different techniques, is compared and a CCDS entry is created when the two agree on the coding sequence structure for a given transcript or locus. Conflicts are discussed between all three parties and, where a consensus can be reached, a CCDS entry is created.
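The comparison step described above can be illustrated with a minimal Python sketch. It is not the CCDS pipeline: it simply assumes that each annotation set is a mapping from a transcript identifier to a coding structure (chromosome, strand and CDS exon intervals) and reports the pairs whose structures match exactly. The names `cds_key` and `find_consensus`, the identifiers and the coordinates are all hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]
# Hypothetical representation: (chromosome, strand, sorted CDS exon intervals)
CdsStructure = Tuple[str, str, Tuple[Interval, ...]]


def cds_key(chrom: str, strand: str, exons: List[Interval]) -> CdsStructure:
    """Normalise a coding structure so that exon order does not matter."""
    return (chrom, strand, tuple(sorted(exons)))


def find_consensus(
    set_a: Dict[str, CdsStructure], set_b: Dict[str, CdsStructure]
) -> List[Tuple[str, str]]:
    """Return (id_a, id_b) pairs whose coding structures are identical."""
    # Index the second set by structure for direct lookup.
    index_b: Dict[CdsStructure, List[str]] = {}
    for tid, struct in set_b.items():
        index_b.setdefault(struct, []).append(tid)

    matches: List[Tuple[str, str]] = []
    for tid_a, struct in set_a.items():
        for tid_b in index_b.get(struct, []):
            matches.append((tid_a, tid_b))
    return matches


# Made-up example data; identifiers and coordinates are illustrative only.
manual = {
    "M1": cds_key("chr1", "+", [(100, 200), (300, 400)]),
    "M2": cds_key("chr1", "+", [(500, 600)]),
}
other = {
    "R1": cds_key("chr1", "+", [(300, 400), (100, 200)]),
    "R2": cds_key("chr1", "+", [(500, 650)]),
}
consensus = find_consensus(manual, other)
print(consensus)
```

In this sketch only M1 and R1 agree; M2 and R2 differ at one exon boundary, which corresponds to the kind of conflict that would be passed to the collaborators for discussion rather than resolved automatically.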
IMPC is a collaboration between the three main mouse knockout projects: EUCOMM (European Conditional Mouse Mutagenesis), KOMP (Knockout Mouse Project) and NorCOMM (North American Conditional Mouse Mutagenesis). Manual annotation by the HAVANA group and collaborators at Washington University, St Louis, and University of Manitoba, Winnipeg, serves as the foundation for constructing knockout mouse cell lines for every coding gene.
The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
A 2.5-kilobase deletion containing a cluster of nine microRNAs in the latency-associated-transcript locus of the pseudorabies virus affects the host response of porcine trigeminal ganglia during established latency.
Mahjoub N, Dhorne-Pollet S, Fuchs W, Endale Ahanda ML, Lange E et al.