IT infrastructure and data management

The IT infrastructure at the Sanger Institute is one of the most extensive in the life sciences. Every day we serve data to researchers across the globe; every week our web pages receive 80,000 page views.

At the turn of the century the Sanger Institute had just finished a big push to produce its share of the Human Genome Project, generating DNA sequence for public release. It was a major scientific endeavour that posed significant challenges for the IT infrastructure throughout the project.

Now, with the tremendous sequencing capacity of emerging next-generation technologies, the IT infrastructure continues to grow dramatically and adapt to the Institute's scientific needs.

The data centre

In the wake of the Human Genome Project, as sequence data continued to emerge from the Sanger Institute and other centres worldwide, we decided to design and develop a purpose-built data centre on the Wellcome Genome Campus.

The data centre - comprising 1,000 square metres of floor space split equally into four rooms - was completed in April 2005 as part of a Campus extension. One room is occupied by the EBI and Wellcome Trust, two rooms by the Sanger Institute, and the fourth is awaiting development.

Designed to be as future-proof as possible, each of the current three rooms of the data centre was fitted with systems that provide up to 2 kilowatts per square metre of cooling capacity to accommodate high-performance blade computing. A traditional hot aisle/cold aisle layout was used, and the air conditioning uses a fan-coil system for overhead cooling and heat extraction.
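As a rough illustration of what these figures imply, the short Python sketch below works out the upper-bound cooling capacity per room. It assumes that the 2 kilowatts per square metre rating applies across the whole floor area of each room, which the text does not state; the function and variable names are purely illustrative.

```python
# Figures quoted in the text above.
TOTAL_FLOOR_AREA_M2 = 1_000  # total data centre floor space
ROOM_COUNT = 4               # floor space is split equally into four rooms
COOLING_KW_PER_M2 = 2        # "up to" cooling capacity per square metre


def room_cooling_kw(total_area_m2, room_count, kw_per_m2):
    """Upper-bound cooling per room.

    ASSUMPTION: the per-square-metre rating covers the full room area.
    """
    room_area_m2 = total_area_m2 / room_count
    return room_area_m2 * kw_per_m2


per_room_kw = room_cooling_kw(TOTAL_FLOOR_AREA_M2, ROOM_COUNT, COOLING_KW_PER_M2)
fitted_rooms = 3  # the three rooms currently in use

print(f"Per room: up to {per_room_kw:.0f} kW")
print(f"Three fitted rooms: up to {per_room_kw * fitted_rooms:.0f} kW")
```

Under that assumption each 250 square metre room could be cooled at up to 500 kW, or up to 1,500 kW across the three rooms in use.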

The data centre's designers estimated that it would be capable of supporting up to 50,000 processors and up to 4 petabytes (4,000 terabytes, or TB) of disk storage by 2012.

The design has proved highly successful and today far exceeds its original design capacity. We now have 17,000 cores of compute, predominantly in blade format, and approximately 40 petabytes of raw storage capacity (25 PB usable). This is four times the density of equipment for which the data centre was designed.

Expanding storage

The scale of the Institute's IT operation continues to grow and solutions to the challenges we now face will involve dramatically reshaping our IT infrastructure.

In one year, between 2008 and 2009, installed disk capacity doubled to 3 petabytes. In 2014-15, we added more than 10 PB in a single year. This rate of growth is set to continue; we expect to reach at least 80 PB of usable storage and 55,000 processors by 2021.
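To put the 2021 target in context, the sketch below estimates the compound annual growth rate needed to go from the 25 PB of usable storage quoted earlier to 80 PB. The baseline year of 2015 is an assumption made only for illustration, since the text does not date the 25 PB figure; the function name is likewise hypothetical.

```python
def implied_annual_growth(start_pb, end_pb, years):
    """Compound annual growth rate needed to go from start_pb to end_pb."""
    return (end_pb / start_pb) ** (1 / years) - 1


USABLE_NOW_PB = 25    # usable storage quoted in the text
TARGET_PB = 80        # expected usable storage by 2021
BASELINE_YEAR = 2015  # ASSUMPTION: the text does not date the 25 PB figure
TARGET_YEAR = 2021

rate = implied_annual_growth(USABLE_NOW_PB, TARGET_PB, TARGET_YEAR - BASELINE_YEAR)
print(f"Implied growth: {rate:.1%} per year")
```

With that assumed baseline the target works out to growth of a little over 20% a year. Choosing a later baseline year would raise the implied rate.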

Virtualisation

We have worked on an aggressive virtualisation project to reduce the number of physical racked servers. We now run more than 1,000 virtual machines, providing our web site and numerous other services.

The future of our IT infrastructure

Discussions with all other large-scale genome sequencing centres are integral to maintaining and improving our IT infrastructure. We must address the particular challenges posed by the explosion of genetic sequence data, and we are working with other centres to investigate international models for future data sharing, such as the Global Alliance for Genomics and Health.

To that end we are tenants in the Jisc Shared Data Centre, a collaborative effort between the Institute and Jisc, University College London, King's College London, Queen Mary University of London, the Francis Crick Institute and others. We keep a second copy of all of our sequencing data in this facility.

The same facility also hosts eMedLab, a collaborative project for scientific computing in a cloud services environment based on OpenStack. eMedLab is a collaboration between UCL, the Francis Crick Institute, the Sanger Institute, EBI, the London School of Hygiene and Tropical Medicine, QMUL and others, with operational responsibility shared between UCL, Crick and Sanger.

The shape of our IT infrastructure will change dramatically in the future. Large-scale collaborative science and an extremely diverse software landscape are driving us towards a more cloud-services-oriented approach over the next few years, allowing scientists from other organisations to run their own bespoke analyses against our data, and vice versa.

Likewise, the advent of genomics within the clinical space increases our requirements for security, validation and resilience. Meeting these needs without sacrificing the flexibility required by the Institute's cutting-edge research science is our key challenge over the next few years.
