Computational Genomics

ARCHIVE PAGE: The Computational Genomics programme ran until October 2016. The computational genomics faculty, teams and research projects have been transferred into the Cellular Genetics and Human Genetics programmes. This page is being retained as a historical record and is not being updated.


The Computational Genomics programme developed novel computational methods for both managing and analyzing large datasets. We were interested in population genetics approaches for characterizing variation in human genomes, as well as computational methods for understanding the functional consequences of this variation.



Computational methods and resources for studying genetic variation:

Since its inception, the Sanger Institute has been a leader in the development of software, methods and resources for the analysis of large-scale DNA sequence data. Many of the techniques that we developed in this area underpin research in other programmes in the Institute as well as elsewhere in the world. Research within Computational Genomics developed and drove forward established programmes for algorithms, software and data resources for using DNA sequence data to study genetic variation, in conjunction with the Global Alliance for Genomics and Health (GA4GH); for the development of reference genome sequences for human and mouse as part of the Genome Reference Consortium; and for the development of the DECIPHER platform for exchange of clinical rare variant data. Alongside these, we conducted research into novel population genetic analysis methods based on whole genome sequences, and applied them to large genomic data sets.

Computational analysis of genome regulation:

A central goal in genomics is to understand how genome function is affected by genetic variation. To achieve this goal, the Sanger Institute strives to develop novel computational and statistical approaches, focusing in particular on non-coding and regulatory sequence. We developed new methods and tools for genomic data analysis to provide new knowledge about genome function: the identification of sequence and chromatin features involved in enhancer activity, the identification of variants and cell types involved in complex traits, and improved understanding of biological variation and the transcriptional response in single cells.

The Sanger Institute is a global leader in the technology of collecting and processing genomic data, and in the science of understanding and using it. A core requirement for this work is computational: identifying the significant information in each data set, whether by finding the genetic variation present in a sample or by quantifying measurements, and relating it to existing knowledge. The primary tools for analysing sequence data are algorithmic methods for sequence alignment based on string matching, and data representation, including compression, to manage previous data and knowledge. The underlying disciplines are computer science, statistics and genetics. This is very much the domain of Big Data, and it is no surprise that companies such as Google, Amazon and Microsoft are participating alongside science institutions such as the Sanger Institute, the Broad Institute, EBI, NCBI and UC Santa Cruz in the new Global Alliance for Genomics and Health (GA4GH), which supports genomic data exchange to further health and research.


Below are some of the research projects that the Computational Genomics programme delivered or supported:

DECIPHER - Mapping the Clinical Genome

DECIPHER is an interactive web-based database which incorporates a suite of tools designed to aid the interpretation of genomic variants. DECIPHER enhances clinical diagnosis by retrieving information from a variety of bioinformatics resources relevant to the variant found in the patient. The patient’s variant is displayed in the context of both normal variation and pathogenic variation reported at that locus thereby facilitating interpretation.

Genome Reference Consortium

The GRC aims to ensure that the human, mouse and zebrafish reference assemblies are biologically relevant by closing gaps, fixing errors and representing complex variation.

Genome Reference Informatics

As the impact of the human reference genome assembly on biomedical research has shown, the availability of a high quality reference genome assembly is essential for the understanding of a species' biology. Our team is responsible for further improving and extending the human, mouse and zebrafish reference assemblies, as well as generating and improving individual strain assemblies.


Hundreds of induced pluripotent stem cell lines for cellular genetic analysis

Mouse Genomes Project

The Mouse Genomes Project is an ongoing effort to catalog all forms of genetic variation between the common laboratory mouse strains and to construct and annotate reference genomes for the key strains.

Single-cell Consensus Clustering (SC3)

SC3 is a method for unsupervised clustering of single-cell RNA-seq data. In addition to a graphical user interface, SC3 provides information about potential outliers and marker genes for each cluster.

Zebrafish Genome Project

The zebrafish genome project led to the generation of the zebrafish reference assembly based on the Tuebingen strain, which is now being updated and maintained by the Sanger Institute division of the Genome Reference Consortium. Further strain assemblies will be generated.


Below are research teams who were part of the Computational Genomics programme:

Archive Page: Bateman Group

The Bateman group set out to classify proteins and certain RNAs into functional families with a view to producing a 'periodic table' of these molecules.

Birney Group | Using outbred genetic variation to understand basic biology

Durbin Group | Computational Genomics

Population and evolutionary genomics, novel computational genomics methods, and related mathematical and statistical models.

Gaffney Group | Genomics of gene regulation

Gene expression involves the transformation of genetic information encoded in DNA sequence into a gene product, such as a protein. Regulation of gene expression is a fundamentally important process in biology because controlling the timing, location and level of gene expression is critical for the gene product to function correctly. The majority of mutations that alter disease risk for most common diseases are thought to affect gene regulation, although how these mutations actually function is not well understood in most cases. Our group uses a combination of statistical and experimental approaches to map mutations that affect gene regulation in humans.

Genome Reference Informatics Team | Computational Genomics

The Genome Reference Informatics Team analyses genome assemblies to reveal and correct quality issues and to identify and add variation. It forms the Sanger division of the Genome Reference Consortium.

Hemberg Group | Quantitative models of gene expression

The Hemberg group is interested in developing quantitative models of gene expression. Our approach is theoretical and we strive to develop novel mathematical models as well as computational tools that can be used by other researchers.

Archive Page: Hubbard Group

The activities of the Vertebrate genome analysis team revolved around generating and presenting core vertebrate genome annotation, particularly in the form of reference genesets, and maintaining the reference genome sequences of human, mouse and zebrafish.

Miska Group | Non-coding RNA and epigenetics

We are interested in all aspects of gene regulation by non-coding RNA. Current research themes include: miRNA biology and pathology, miRNA mechanism, piRNA biology and the germline, endo-siRNAs in epigenetic inheritance and environmental conditioning, small RNA evolution and the role of RNAi in host-pathogen interaction.

Archive Page: Mustonen Group | Population genomics of adaptation

High-throughput sequencing has opened up a new chapter in the study of molecular evolution and genetics, allowing us to study in detail how the genetic composition of populations changes as they respond to external pressures such as drug therapies. Our group contributes to this effort by developing scalable methods for biomedical applications of sequencing data. We further use these data to address basic biological research questions such as how drug resistance arises.

Parts Group | Genetic screens of cellular traits

We measure, model, and modulate cell state. We use genome engineering and synthetic biology to create cell lines that can be employed for CRISPR/Cas9-based genetic screening and high throughput cell biology assays. We develop probabilistic models as well as software tools to accurately analyse the readouts.

Archive Page: Sequence Variation Infrastructure | Computational Genomics

We are part of the Computational Genomics programme at the institute.

Trynka Group | Immune Genomics

The Trynka group combines experimental and computational approaches to study how genetics controls the immune system and predisposes individuals to autoimmune diseases.

Vertebrate Annotation | Computational Genomics

This group consists of manual annotators and software developers. The HAVANA team provides the manual annotation of human, mouse, zebrafish and other vertebrate genomes that appear in the Vega browser. Our software is written and developed by the Annosoft team.

Web, Core Bioinformatics and Software Action Team | Information Communications Technology

Core Software Services encompasses the Core Sanger Web Team, Core Bioinformatics (CoreBio), the SoftWare Action Team (SWAT) and the DECIPHER Web Team at the Sanger Institute.