Archived

Sequence Variation Infrastructure

Human Genetics

Archive Page

This page is maintained as a historical record and is no longer being updated.

This is an archive page of the work that the Sequence Variation Infrastructure team undertook at the Sanger Institute and is being retained as a historical record. When its leader, Dr Thomas Keane, moved to EMBL-EBI to head up the European Nucleotide Archive (ENA) and European Genome-phenome Archive (EGA) in November 2016, the group members moved to other research teams to continue their work.

We were an interdisciplinary team consisting of Senior Bioinformaticians, Senior Software Developers, and Postdoctoral Fellows. One key goal of the group was to develop the computational infrastructure to enable global data sharing in genomics. At the level of the raw sequencing data, we were involved in the Data Working Group of the Global Alliance for Genomics and Health (GA4GH), an international effort to standardise genomics file formats from the point at which the data come off the sequencing machines. We developed and maintained Samtools, a set of tools for high-throughput processing of standardised next-generation sequencing data formats (SAM/BAM/CRAM). Since its launch in 2009, Samtools has been downloaded hundreds of thousands of times and has become a core piece of software for processing genomics data worldwide.

Over the past century, the mouse has become one of the premier model organisms for genetic research, with mouse models available for many diseases on different genetic backgrounds. In 2011, we led the effort to completely sequence the genomes of 17 inbred laboratory mouse strains and identified approximately 56M unique SNPs, 8.8M indels, and 0.28M structural variants. To fully understand the functional consequences of these genetic differences, the MRC and BBSRC funded us to create assembled chromosome sequences and strain-specific gene annotation for 16 strains. The results of this work enable scientists using non-C57BL/6J mouse strains in medical research to design experiments based on the genome sequence closest to the animal's genetic background.

There is a pressing need to investigate new algorithms and data structures for storing raw sequencing data and the corresponding base call qualities. Data volumes have escalated in recent years, in tandem with the rapid decline in sequencing costs, posing storage issues for research organisations worldwide. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a highly compressed but searchable data structure, both in read aligners and for de novo assembly. The team explored how the BWT structure could be used to store and compress the sequencing reads from the full set of 2,500 samples in the 1000 Genomes Project, and to carry out tasks such as rapid genotyping of polymorphisms.
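To illustrate why a BWT-based index is both compact and searchable, here is a minimal Python sketch of the general technique, not the team's implementation. It builds the BWT by naive rotation sorting (practical only for short strings) and counts exact pattern occurrences with FM-index backward search. The names `bwt` and `FMIndex`, the `$` sentinel, and the toy sequence are assumptions made for this example; production tools use suffix-array construction and sampled, compressed occurrence tables instead.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Return the Burrows-Wheeler transform of text.

    Naive O(n^2 log n) construction by sorting all rotations; '$' is
    assumed to be absent from text and sorts before A, C, G and T.
    """
    if "$" in text:
        raise ValueError("text must not contain the sentinel '$'")
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


class FMIndex:
    """Toy FM-index supporting exact-match counting (illustrative only)."""

    def __init__(self, text: str) -> None:
        self.bwt = bwt(text)
        counts = Counter(self.bwt)

        # first[c]: number of characters in the text strictly smaller than c,
        # i.e. the row where the block of rotations starting with c begins.
        self.first = {}
        total = 0
        for c in sorted(counts):
            self.first[c] = total
            total += counts[c]

        # occ[c][i]: number of occurrences of c in bwt[:i].
        # Stored uncompressed here; real indexes sample this table.
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c, table in self.occ.items():
                table[i + 1] = table[i] + (1 if c == ch else 0)

    def count(self, pattern: str) -> int:
        """Count occurrences of pattern using backward search."""
        lo, hi = 0, len(self.bwt)  # half-open range of matching rows
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


if __name__ == "__main__":
    index = FMIndex("GATTACAGATTACA")  # toy sequence, not real read data
    print(index.bwt)
    print(index.count("TTA"))
```

Backward search touches the index once per pattern character, so query time depends on the pattern length rather than on the size of the indexed text. This is the property that makes the structure attractive for tasks such as looking up the sequence around a known polymorphism across a large read collection.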

Our people

Previous core team member

Zhicheng Liu

Senior Software Developer

Associated research

Tools & software

Tool

SAMtools / BCFtools / HTSlib

Co-ordinate development of these key pieces of genomics software.

Data

Data set

Mouse Genomes Project

Co-ordinate and lead the Mouse Genomes Project.

Related groups

Science group

Adams Group

Somatic Functional Genomics and Cancer

We share interests in understanding the underlying genetics of laboratory mouse strains.

Science group

DNA Pipelines Research and Development

DNA Pipelines Development

We collaborate on various ad hoc projects to test and develop new sequencing library construction techniques.

Science group

Durbin Group

Computational Genomics

We co-develop novel algorithms and software for discovering sequence variation.

Science group

Genome Reference Informatics Team

Tree of Life Programme

Collaborate on the construction and maintenance of reference genomes for the laboratory mouse strains.

Science group

Sequence Analysis and Management (SAM)

Science Support - Informatics and Digital Solutions

Collaborate on development of Samtools and maintenance of the sequencing data format specifications (SAM/BAM/CRAM/BCF/VCF).

Science group

Miska Group

Non-coding RNA and epigenetics

We are interested in all aspects of gene regulation by non-coding RNA. Current research themes include: miRNA biology and pathology, miRNA ...

Science group

Vertebrate Annotation

Human Genetics

Collaborate on maintenance of the mouse gene set for the mouse reference genome and other laboratory mouse strains.

Wellcome Sanger Institute

Programmes and Facilities

Programme

Human Genetics

The Human Genetics Programme is driving a step-change in our understanding of genetic causes and biological mechanisms of disease susceptibility and ...

Partners

We work with the following groups

External

Medical Research Council

We are funded by the MRC to produce new mouse reference genomes.

External

BBSRC

We are funded by the BBSRC to produce new mouse reference genomes.

