Sequence Variation Infrastructure

Human Genetics

Archive Page

This page is maintained as a historical record and is no longer being updated.

This is an archive page of the work that the Sequence Variation Infrastructure team undertook at the Sanger Institute and is being retained as a historical record. When its leader, Dr Thomas Keane, moved to EMBL-EBI to head up the European Nucleotide Archive (ENA) and European Genome-phenome Archive (EGA) in November 2016, the group members moved to other research teams to continue their work.

We were an interdisciplinary team consisting of Senior Bioinformaticians, Senior Software Developers, and Postdoctoral Fellows. One key goal of the group was to develop the computational infrastructure to enable global data sharing in genomics. At the level of the raw sequencing data, we were involved in the Data Working Group of the Global Alliance for Genomics and Health (GA4GH), an international effort to standardise genomics file formats from the point at which the data come off the sequencing machines. We developed and maintained Samtools, a set of tools for high-throughput processing of standardised next-generation sequencing data formats (SAM/BAM/CRAM). Since its launch in 2009, Samtools has been downloaded hundreds of thousands of times and has become a core piece of software for processing genomics data worldwide.

Over the past century, the mouse has become one of the premier model organisms for genetic research, with mouse models available for many diseases on different genetic backgrounds. In 2011, we led the effort to completely sequence the genomes of 17 inbred laboratory mouse strains and identified approximately 56M unique SNPs, 8.8M indels, and 0.28M structural variants. To fully understand the functional consequences of these genetic differences, the MRC and BBSRC funded us to create assembled chromosome sequences and strain-specific gene annotation for 16 strains. The results of this work enable scientists using non-C57BL/6J mouse strains in medical research to design experiments based on the genome sequence closest to the animal's genetic background.

There is a pressing need to investigate new algorithms and data structures for storing raw sequencing data and the corresponding base call qualities. Data volumes have escalated in recent years, in tandem with the rapid decline in sequencing costs, posing storage issues for research organisations worldwide. The Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a highly compressed, but searchable, data structure used by read aligners and for de novo assembly. The team explored how the BWT structure could be used to store and compress the sequencing reads from the full set of 2,500 samples in the 1000 Genomes Project and to carry out tasks such as rapid genotyping of polymorphisms.
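
To illustrate why the BWT is both compressible and searchable, here is a minimal Python sketch of the general technique. It is not the team's implementation: the function names (bwt, build_fm_index, count_occurrences) are hypothetical, the transform is built by naively sorting rotations (workable only for toy inputs), and it indexes a single sequence, whereas a real read collection would need a multi-string BWT with sampled, compressed occurrence tables.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G and T
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_str: str):
    """Build the two FM-index tables from a BWT string.

    first[c]  = number of characters in the text strictly smaller than c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    """
    counts = Counter(bwt_str)
    first = {}
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def count_occurrences(pattern: str, first, occ, n: int) -> int:
    """Count matches of pattern using backward search over the FM-index."""
    lo, hi = 0, n  # half-open range of matching rows in the sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("GATTACAGATTA")
    first, occ = build_fm_index(transformed)
    print(transformed)
    print(count_occurrences("ATTA", first, occ, len(transformed)))
```

Backward search touches the index once per pattern character, so a query's cost depends on the pattern length rather than on the size of the indexed data. This is the property that makes tasks such as counting the reads that support each allele of a polymorphism feasible on a compressed store.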

Our people

Previous team members

Zhicheng Liu

Senior Software Developer


We work with the following groups


Medical Research Council

We are funded by the MRC to produce new mouse reference genomes.



Biotechnology and Biological Sciences Research Council

We are funded by the BBSRC to produce new mouse reference genomes.


