Tony Cox leads the DNA Pipelines Informatics operation.
The Sequencing Informatics group ensures that the harvesting, storage and analysis of DNA genotype and sequence information at the Sanger Institute is as swift and efficient as possible. To do this, the team develops software to support the high-throughput data production activities (sequencing, genotyping and ancillary services) of the DNA Pipelines Operations teams. This work includes:
tracking software to manage samples entering the Institute
management of research data projects
Laboratory Information Management Systems (LIMS)
primary data analysis
data quality control
The growth of the Institute’s sequencing instrument fleet, improvements in individual instrument output and the increasingly sophisticated demands of scientific end-users generate ever-increasing demands for high-quality data in ever-decreasing time frames. To meet these needs, we work to improve the capacity of our software systems to process next-generation sequence data efficiently. The rate of data production is currently approximately 25 terabases (Tb) per month.
Production Software Development
The Production Software Development team is primarily responsible for developing and maintaining Laboratory Information Management Systems (LIMS) for the Institute’s high-throughput pipelines. The team has adapted and augmented the software systems used in the laboratories to cope with a significant increase in the number of samples being processed, driven by new high-throughput sequencing technology (Illumina v4, Illumina X Ten) and by DNA sequencing methods that have moved from experimental protocols to full production processes.
New Pipelines Group
The New Pipelines Group develops and supports the core data analysis systems for our high-throughput data production processes. These systems are responsible for monitoring data production instruments (e.g. Illumina sequencers), collecting data from them, transferring it to a temporary storage area, and conducting primary analysis. This is usually followed by an automated and/or manual quality control process before the data is automatically transferred to our central data archive.
This “big data” system currently manages 33 Illumina sequencers, 2 Petabytes of “staging” disk storage and approximately 1,000 CPUs to process the data. As new sequencing technologies arrive, the group refines and develops the system so that it can cope with additional data volumes while decreasing processing times and costs, in part by using disk space as efficiently as possible.
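The stages described above — collecting data from an instrument, staging it, running primary analysis, applying quality control, and archiving — can be sketched as a simple state machine. This is an illustrative sketch only, not the Institute's actual pipeline code; the `Run` class, stage names and states are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Run:
    """A single sequencing run moving through the (hypothetical) pipeline."""
    run_id: str
    state: str = "new"
    qc_passed: Optional[bool] = None

def collect(run: Run) -> Run:
    run.state = "staged"        # data copied from the instrument to staging disk
    return run

def primary_analysis(run: Run) -> Run:
    run.state = "analysed"      # primary analysis of the raw run data
    return run

def quality_control(run: Run) -> Run:
    run.qc_passed = True        # automated and/or manual QC decision
    run.state = "qc_complete"
    return run

def archive(run: Run) -> Run:
    if run.qc_passed:
        run.state = "archived"  # transfer to the central data archive
    return run

def process(run: Run) -> Run:
    """Push a run through each pipeline stage in order."""
    for stage in (collect, primary_analysis, quality_control, archive):
        run = stage(run)
    return run
```

For example, `process(Run("run_0001"))` walks a run through every stage, leaving it in the `"archived"` state once QC has passed.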
We centralise and organise our pipeline output in the iRODS (Integrated Rule-Oriented Data System) data storage system. This single, Institute-wide archive is accessible to all and has proven to be a very effective tool for managing and distributing the vast amounts of data we produce. To control our use of disk space we devote considerable effort to developing more efficient data-storage formats and implementing these new data structures throughout our pipelines.
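Depositing pipeline output into iRODS can be done with the standard icommands client. The sketch below is illustrative only: the file name, collection path and metadata key are hypothetical examples, not the Institute's actual archive layout, and the upload is skipped when the icommands are not installed.

```shell
set -eu

FILE="lane1.cram"       # hypothetical pipeline output file
COLL="/seq/run_0001"    # hypothetical iRODS collection

if command -v iput >/dev/null 2>&1; then
    iput -K "$FILE" "$COLL/$FILE"                     # upload, verifying the checksum
    imeta add -d "$COLL/$FILE" study "example_study"  # attach searchable metadata
    ils "$COLL"                                       # list the collection contents
else
    echo "icommands not installed; skipping upload"
fi
```

Attaching metadata with `imeta` at deposit time is what makes a single Institute-wide archive practical to search and distribute from later.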