IT infrastructure and data management

The IT infrastructure at the Sanger Institute is one of the most extensive in the life sciences. Every day we serve data to researchers across the globe; every week our web pages provide 7 to 8 million page impressions.

At the turn of the century the Sanger Institute had just finished a big push to produce its share of the Human Genome Project, generating DNA sequence for public release. It was a major scientific endeavour that posed significant challenges for the IT infrastructure throughout.

Now, with the tremendous sequencing capacity of emerging next-generation technologies, the IT infrastructure continues to grow dramatically and adapt to the Institute's scientific needs.

The data centre

In the wake of the Human Genome Project, as sequence data continued to emerge from the Sanger Institute and other centres worldwide, we decided to design and develop a purpose-built data centre on the Genome Campus.

The data centre - comprising 1,000 square metres of floor space split equally into four rooms - was completed in April 2005 as part of a Campus extension.

Designed to be as future-proof as possible, each of the four rooms of the data centre was fitted with systems that provide up to 2 kilowatts per square metre of cooling capacity to accommodate high-performance blade computing. A traditional hot aisle/cold aisle layout was used and the air conditioning uses a fan-coil system for overhead cooling and heat extraction.
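To put the cooling figure in context, the short Python sketch below multiplies the quoted floor area by the quoted maximum cooling density. It assumes, purely for illustration, that the 2 kilowatts per square metre applies across the whole floor area of each room; the constant and function names are our own and are not taken from any Institute system.

```python
# Figures quoted in the text above.
TOTAL_FLOOR_AREA_M2 = 1_000      # total data centre floor space
ROOM_COUNT = 4                   # floor space is split equally into four rooms
MAX_COOLING_KW_PER_M2 = 2        # "up to" 2 kW of cooling per square metre


def cooling_capacity_kw(area_m2, kw_per_m2=MAX_COOLING_KW_PER_M2):
    """Upper-bound cooling capacity for a given floor area.

    Assumption: the quoted density applies uniformly to the whole area.
    """
    return area_m2 * kw_per_m2


room_area_m2 = TOTAL_FLOOR_AREA_M2 / ROOM_COUNT
per_room_kw = cooling_capacity_kw(room_area_m2)
total_kw = cooling_capacity_kw(TOTAL_FLOOR_AREA_M2)

print(f"Area per room: {room_area_m2:.0f} square metres")
print(f"Cooling per room: up to {per_room_kw:.0f} kW")
print(f"Cooling for all four rooms: up to {total_kw:.0f} kW")
```

Under that assumption each 250 square metre room could be cooled at up to 500 kilowatts, or up to 2,000 kilowatts across all four rooms.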

The data centre's designers estimated that it would be capable of supporting up to 50,000 processors and up to 4 petabytes (4000 terabytes, or TB) of disk storage by 2012.

The design has proved highly successful and today the data centre accommodates all the compute and servers for both the Sanger Institute and the neighbouring European Bioinformatics Institute (EBI). We now have close to 10,000 cores of compute predominantly in blade format, split equally with EBI, and approximately 10 petabytes of raw storage capacity, again split equally with EBI.

Expanding storage

The scale of the Institute's IT operation continues to grow and solutions to the challenges we now face will involve dramatically reshaping our IT infrastructure.

In one year, between 2008 and 2009, installed disk capacity doubled to 3 petabytes and we plan to add more than 1,000 TB in the coming year. We also added 2,000 cores of blade computers, bringing the total number installed to more than 5,000. These numbers are set only to rise.

We have worked on an aggressive virtualisation project to reduce the number of physical racked servers. To date we have over 180 virtual servers installed on just six blade servers. Our aim is to virtualise most of our 300 racked servers into fewer than 20 blade computers, reducing energy consumption and saving space. To support continued expansion of the IT operation, we are investigating ways to bring additional electrical power onto the Genome Campus to allow us to populate the currently vacant fourth room of the data centre.
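The consolidation figures above can be checked with some simple arithmetic, sketched here in Python. The sketch assumes that every server places a similar load on its host blade and treats all 300 racked servers as candidates for virtualisation, which is an upper bound on "most"; the names used are illustrative only.

```python
import math

# Figures quoted in the text above.
VIRTUAL_SERVERS_TO_DATE = 180
BLADES_TO_DATE = 6
RACKED_SERVERS = 300             # upper bound: "most of our 300 racked servers"
BLADE_LIMIT = 20                 # target is "fewer than 20" blades


def servers_per_blade(servers, blades):
    """Average number of virtual servers hosted on each blade."""
    return servers / blades


# Density achieved so far: 180 virtual servers on six blades.
current_density = servers_per_blade(VIRTUAL_SERVERS_TO_DATE, BLADES_TO_DATE)

# Blades needed for all 300 servers if that density can be maintained
# (assumption: the remaining servers are no more demanding than those
# already virtualised).
blades_needed = math.ceil(RACKED_SERVERS / current_density)

# Lowest whole-number density that still fits within 19 blades.
min_density = math.ceil(RACKED_SERVERS / (BLADE_LIMIT - 1))

print(f"Current density: {current_density:.0f} virtual servers per blade")
print(f"Blades needed at that density: {blades_needed}")
print(f"Minimum density to stay under {BLADE_LIMIT} blades: {min_density}")
```

At the current density of 30 virtual servers per blade, ten blades would be enough for all 300 servers, so the target of fewer than 20 blades leaves room for more demanding workloads.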

The future of our IT infrastructure

Discussions with all other large-scale genome sequencing centres are integral to maintaining and improving our IT infrastructure. We must address the particular challenges posed by the explosion of genetic sequence data and are working with other centres to investigate international models for future data sharing.

The shape of our IT infrastructure will change dramatically in the future. As our operation continues to grow, we will switch to a combination of onsite and offsite data facilities. This development is necessary not only to enhance our ability to recover from a major incident but also to establish resilient mirrored data operations (replication) as an alternative to large tape backups, which will not scale to multipetabyte levels. We are also exploring emerging technologies such as 'cloud computing'.
