We were an interdisciplinary team of Senior Bioinformaticians, Senior Software Developers, and Postdoctoral Fellows. One key goal of the group was to develop the computational infrastructure needed for global data sharing in genomics. At the level of the raw sequencing data, we were involved in the Data Working Group of the Global Alliance for Genomics and Health (GA4GH), an international effort to standardise genomics file formats from the point at which the data comes off the sequencing machines. We developed and maintain Samtools, a set of tools for high-throughput processing of data in the standardised next-generation sequencing formats (SAM/BAM/CRAM). Since its launch in 2009, Samtools has been downloaded hundreds of thousands of times and has become a core piece of software for processing genomics data worldwide.
Over the past century, the mouse has become one of the premier model organisms for genetic research, with mouse models available for many diseases on different genetic backgrounds. In 2011, we led the effort to completely sequence the genomes of 17 inbred laboratory mouse strains and identified approximately 56M unique SNPs, 8.8M indels, and 0.28M structural variants. To fully understand the functional consequences of these genetic differences, the MRC and BBSRC funded us to create assembled chromosome sequences and strain-specific gene annotation for 16 strains. The results of this work enable scientists using non-C57BL/6J mouse strains in medical research to design experiments based on the genome sequence closest to the animal's genetic background.
There is a pressing need to investigate new algorithms and data structures for storing raw sequencing data and the corresponding base call qualities. Data volumes have escalated in recent years, in tandem with the rapid decline in sequencing costs, creating storage problems for research organisations worldwide. Over the same period, the Burrows-Wheeler transform (BWT) and the FM-index have been widely employed as highly compressed but searchable data structures in read aligners and for de novo assembly. The team explored how the BWT structure could be used to store and compress the sequencing reads from the full set of 2,500 samples in the 1000 Genomes Project, and to carry out tasks such as rapid genotyping of polymorphisms directly on the compressed data.
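To make the idea of a "compressed but searchable" structure concrete, the following is a minimal Python sketch of a BWT with FM-index style backward search. It is a toy illustration under stated assumptions, not the team's implementation: the function names (`bwt`, `build_fm_index`, `count_occurrences`) are hypothetical, the transform is built naively from sorted rotations, and the occurrence table is stored uncompressed, whereas a production index would use a suffix-array construction algorithm and sampled or compressed rank structures.

```python
from collections import Counter


def bwt(text):
    """Return the Burrows-Wheeler transform of text, with '$' as terminator."""
    assert "$" not in text, "'$' is reserved as the end-of-text marker"
    text += "$"
    # Naive construction via sorted rotations; acceptable only for toy inputs.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_str):
    """Build the two FM-index tables from a BWT string (uncompressed sketch)."""
    counts = Counter(bwt_str)
    # c_table[c]: number of characters in the text strictly smaller than c.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt_str[:i].
    occ = {ch: [0] * (len(bwt_str) + 1) for ch in counts}
    for i, ch in enumerate(bwt_str):
        for symbol in occ:
            occ[symbol][i + 1] = occ[symbol][i] + (1 if symbol == ch else 0)
    return c_table, occ


def count_occurrences(pattern, c_table, occ, n):
    """Count exact matches of pattern using backward search over the BWT."""
    lo, hi = 0, n  # half-open interval of matching rows in the sorted rotations
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


# Usage: index a short sequence and count occurrences of a k-mer.
transformed = bwt("ACGTACGTTACG")
c_table, occ = build_fm_index(transformed)
kmer_count = count_occurrences("ACG", c_table, occ, len(transformed))
```

Each query costs time proportional to the pattern length, independent of the size of the indexed text. One way such counts could support genotyping, offered here only as an assumption about the general approach, is to compare the number of reads containing a k-mer that spans the reference allele with the number containing the alternate allele.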