Archive Page: Sequence Variation Infrastructure | Human Genetics

Sequence Variation Infrastructure

This is an archive page of the work that the Sequence Variation Infrastructure team undertook at the Sanger Institute and is being retained as a historical record. When its leader, Dr Thomas Keane, moved to EMBL-EBI to head up the European Nucleotide Archive (ENA) and European Genome-phenome Archive (EGA) in November 2016, the group members moved to other research teams to continue their work.

Our Research and Approach

We were an interdisciplinary team consisting of Senior Bioinformaticians, Senior Software Developers, and Postdoctoral Fellows. One key goal of the group was to develop the computational infrastructure to enable global data sharing in genomics. At the level of the raw sequencing data, we were involved in the Data Working Group of the Global Alliance for Genomics and Health (GA4GH), an international effort to standardise genomics file formats from the point at which the data come off the sequencing machines. We developed and maintained Samtools, a set of tools for high-throughput processing of data in the standardised next-generation sequencing formats (SAM/BAM/CRAM). Since its launch in 2009, Samtools has been downloaded hundreds of thousands of times and has become a core piece of software for processing genomics data worldwide.

Over the past century, the mouse has become one of the premier model organisms for genetic research, with mouse models available for many diseases on different genetic backgrounds. In 2011, we led the effort to completely sequence the genomes of 17 inbred laboratory mouse strains and identified approximately 56M unique SNPs, 8.8M indels, and 0.28M structural variants. To fully understand the functional consequences of these genetic differences, the MRC and BBSRC funded us to create assembled chromosome sequences and strain-specific gene annotation for 16 strains. The results of this work enable scientists using non-C57BL/6J mouse strains in medical research to design experiments based on the genome sequence closest to the animal's genetic background.

There is a pressing need to investigate new algorithms and data structures for storing raw sequencing data and the corresponding base call qualities. Data volumes have escalated in recent years, in tandem with the rapid decline in sequencing costs, posing storage issues for research organisations worldwide. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as highly compressed, but searchable, data structures in read aligners and for de novo assembly. The team explored how the BWT structure could be used to store and compress the sequencing reads from the full set of 2,500 samples in the 1000 Genomes project, and to carry out tasks such as rapid genotyping of polymorphisms directly on the compressed data.
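
To illustrate why a BWT-based structure is both compressible and searchable, here is a minimal Python sketch of the Burrows-Wheeler transform and an FM-index count query (backward search). It is a toy, in-memory illustration under simplifying assumptions, not the team's implementation: the function names (bwt, build_fm_index, count_occurrences) are hypothetical, the input is a single short string rather than a read collection, and the transform is built by naively sorting all rotations, which would not scale to real sequencing data.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text.

    Naive construction: sort every rotation of text + sentinel and take
    the last column. The sentinel must sort before all other characters.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_str):
    """Build the two tables needed for backward search.

    c_table[c]: number of characters in the text strictly smaller than c.
    occ[c][i]:  number of occurrences of c in bwt_str[:i].
    """
    alphabet = sorted(set(bwt_str))
    c_table = {}
    total = 0
    for c in alphabet:
        c_table[c] = total
        total += bwt_str.count(c)

    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return c_table, occ


def count_occurrences(pattern, c_table, occ, n):
    """Count exact matches of pattern using FM-index backward search.

    n is the length of the BWT string. The pattern is consumed from its
    last character to its first, narrowing a half-open range [lo, hi)
    of rows in the sorted rotation matrix.
    """
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    reference = "ACGTACGA"  # made-up example sequence
    transformed = bwt(reference)
    c_table, occ = build_fm_index(transformed)
    print(transformed)
    print(count_occurrences("ACG", c_table, occ, len(transformed)))
```

The transform tends to group identical characters into runs, which is what makes it amenable to compression, while the count query touches the tables only once per pattern character, independent of the length of the indexed text. A query of this kind is the sort of primitive on which a genotyping test (does a k-mer carrying a given allele occur in the reads?) could be built.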


Dr Thomas Keane
Group Leader

Thomas Keane led the Sequence Variation Infrastructure group. His interests were in using genomic technologies to learn about biological processes with a particular focus on mouse and human disease. He is now based at EMBL-EBI.


Key Projects, Collaborations, Tools & Data

Here are some of the outputs of the team.


  • Mouse genomic variation and its effect on phenotypes and gene regulation.

    Keane TM, Goodstadt L, Danecek P, White MA, Wong K et al.

    Nature 2011;477;7364;289-94

  • Sequence-based characterization of structural variation in the mouse genome.

    Yalcin B, Wong K, Agam A, Goodson M, Keane TM et al.

    Nature 2011;477;7364;326-9

  • The genomic landscape shaped by selection on transposable elements across 18 mouse strains.

    Nellåker C, Keane TM, Yalcin B, Wong K, Agam A et al.

    Genome biology 2012;13;6;R45

  • Sequencing and characterization of the FVB/NJ mouse genome.

    Wong K, Bumpstead S, Van Der Weyden L, Reinholdt LG, Wilming LG et al.

    Genome biology 2012;13;8;R72

  • The Mouse Genomes Project: a repository of inbred laboratory mouse strain genomes.

    Adams DJ, Doran AG, Lilue J and Keane TM

    Mammalian genome : official journal of the International Mammalian Genome Society 2015;26;9-10;403-12