World’s largest genetic project opens the door to new era for treatments and cures: UK Biobank’s major milestone
The Wellcome Sanger Institute sequenced its share of 500,000 whole human genomes, contributing to the world’s largest single set of sequencing data
In a momentous landmark for medical research, UK Biobank has today (Thursday 30 November) unveiled incredible new data from whole genome sequencing [1] of its half a million [2] participants. This is set to drive the discovery of new diagnostics, treatments and cures and, uniquely, is available to approved researchers worldwide via a protected database containing only de-identified data (with details such as name, address, date of birth and name of GP stripped out). This abundance of genomic data is unparalleled, but what cements it as a defining moment for the future of healthcare is its use in combination with the existing wealth of data UK Biobank has collected over the past 15 years on lifestyle, whole body imaging scans, health information, and proteins found in the blood.
After five years, more than 350,000 hours of genome sequencing, and over £200 million of investment [3], UK Biobank is releasing the world’s largest-by-far single set of sequencing data, completing the most ambitious project of its kind ever undertaken.
“This is a veritable treasure trove for approved scientists undertaking health research, and I expect it to have transformative results for diagnoses, treatments and cures around the globe.”
Professor Sir Rory Collins FRS FMedSci, Principal Investigator at UK Biobank
Set up 20 years ago, the charity UK Biobank recruited half a million altruistic volunteers to create the world’s most comprehensive source of health data. It is used by researchers across the world, from academic, commercial, government and charitable settings, for scientific discoveries that improve human health.
UK Biobank now provides the most detailed picture of human health that exists, equipping researchers with the ultimate toolbox to make previously out-of-reach links and discoveries about disease development possible.
“The sheer amount of genetic data is exceptional – it is twice as much as anywhere else – but UK Biobank’s data is so illuminating because we’ve been able to follow the health of our brilliant volunteers for around 15 years.”
Professor Sir Rory Collins FRS FMedSci, Principal Investigator at UK Biobank
Game-changing data for health research
Today’s addition of sequencing data comes after a series of great leaps made using the vast UK Biobank biomedical database. These leaps include:
- finding genes associated with protection against obesity and type 2 diabetes, which has the potential to lead to the development of new drugs
- identifying individuals at very high genetic risk for diseases such as heart disease, breast cancer and prostate cancer, which may help with screening
- a link between activity and Parkinson’s that can predict the disease up to seven years before diagnosis from smartwatch data, potentially leading to early intervention
The new sequencing data will dramatically enhance the existing data’s potential.
Whole genome sequencing data on this scale, combined with UK Biobank’s existing data and biological samples, will result in extraordinary biomedical innovations, including:
– More targeted drug discovery and development
Experimental therapeutics that are developed based on evidence from human genetics are twice as likely to be approved for clinical use.
“This landmark dataset will enable us to leverage the power of artificial intelligence and machine learning for rapidly identifying novel disease targets and helping researchers predict how a candidate medicine might impact certain subpopulations of patients, based on their genetics. This could pave the way for more efficient clinical development and drive progress toward precision medicine.”
John Reed, M.D., Ph.D., Executive Vice President for Innovative Medicine R&D at Johnson & Johnson
– Discovering thousands of disease-causing non-coding genetic variants
Little is known about 98 per cent of the human genome, once erroneously called ‘junk DNA’. This is the portion of DNA that doesn’t code for proteins. Already, using earlier sequencing data, a study has found examples from this region where rare variants are associated with specific genetically determined characteristics.
– Accelerating precision medicine
With a sample size of half a million people, and data collected on over 10,000 variables (such as blood pressure, cognitive function, diet and bone density), researchers using UK Biobank are driving tailored healthcare, such as investigating why people with the same genetic predisposition for a disease have different outcomes, reactions and side-effects to the same treatment.
– Understanding the biological underpinnings of disease
For many illnesses, such as Parkinson’s, Alzheimer’s and autoimmune diseases, the underlying origins are poorly understood.
“This ground-breaking dataset allows scientists to explore how genetics affect levels of proteins, metabolites and other physiological factors, more closely than ever before, promising to accelerate our understanding of the genetic underpinnings of disease.”
David Reese, Executive Vice President R&D, Amgen
“It is an honour to represent UKRI during this landmark event for science, following our support of UK Biobank since its conception. Researchers can now apply to access de-identified full genome data from half a million participants, alongside a rich combination of medical, biochemical, lifestyle and environmental data from volunteers involved.
“Today marks an important milestone in UKRI’s commitment to realise the potential of genetics for biomedical research, innovation and translation to the clinic.”
Professor Dame Ottoline Leyser DBE FRS, Chief Executive of UK Research and Innovation (UKRI)
To date, over 30,000 researchers from more than 90 countries have registered to use UK Biobank, with over 9,000 peer-reviewed papers published as a result. Researchers are given the tools and computing power to analyse the de-identified data via UK Biobank’s secure, cloud-based Research Analysis Platform [4].
“From the sequencing of the genomes themselves through to innovative and secure data storage, the release of this rich dataset marks a significant and impressive moment in scientific research. It’s truly field-opening for understanding the interactions between our genetics, environment and health.
“Wellcome’s funding has supported a new, bespoke data platform that will provide approved researchers with the tools they need to analyse the wealth of data. Crucially, this opens up exciting opportunities for early-career researchers and those in low-and-middle-income countries, in turn offering huge potential to unlock new discoveries and enhance our understanding of health to improve lives around the world.”
Cheryl Moore, Chief Research Programmes Officer, Wellcome Trust
The consortium behind this joint venture
This project was funded by Wellcome, UKRI and four biopharmaceutical companies: Amgen, AstraZeneca, GSK and Johnson & Johnson [5].
“This world-leading project has only been possible due to the collaboration between industry, charity and Government, who have worked together to enable the sequencing of 500,000 genomes. It showcases the importance of partnership and working together to push boundaries and enhance our scientific knowledge to support the development of future medicines for patients around the world.”
Sharon Barr, Executive Vice President BioPharmaceuticals R&D from AstraZeneca
“UK Life Sciences are going from strength to strength, and UK Biobank is leading the way by combining world-leading data, fantastic infrastructure, brilliant minds and cross-sector collaboration.”
Professor Sir John Bell CH GBE FRS FMedSci
In return for significant investment, UK Biobank gives nine months’ exclusive data access to industry members of the consortium. In this way, commercial companies invest heavily to enhance a ground-breaking health dataset that is then available to approved researchers across the world.
“Bringing together science and technology to deepen our understanding of patients, human biology, and disease mechanisms is a key part of the discovery and development of new medicines, and the work of UK Biobank has been central to our approach. There is no other resource like it that combines genetic, biological and clinical data and then makes those data available to researchers across the industry with the goal of improving health. The partnership across the UK life sciences ecosystem has been critical to make this all possible.”
Robert Scott, Vice President, Human Genetics, from GSK
The DNA sequencing was completed by Amgen’s subsidiary, deCODE Genetics, and the Wellcome Sanger Institute, using Illumina NovaSeq technology, and with deCODE providing additional informatics processing support.
“I am extremely proud of the teams here at the Sanger Institute who dedicated their expertise and agility to deliver in partnership this momentous milestone of 500,000 whole human genomes. These data will enable new discoveries into the onset and progression of diseases, and accelerate drug discovery. We are privileged to have played a part in making this world’s largest single set of sequencing data available to the research community.”
Dr Cordelia Langford, Director of Scientific Operations at the Wellcome Sanger Institute
This data – and the rest of UK Biobank’s de-identified data – is now globally accessible for approved researchers on the UK Biobank Research Analysis Platform, which is hosted on Amazon Web Services (AWS) [6] in the London region and enabled by DNAnexus. This is the first time that a globally accessible resource, together with the computing power and storage needed to analyse data of this size and type, has been made available to researchers.
Following completion of the sequencing, the industry consortium led efforts to process and joint call [7] the genomes using the DRAGEN pipeline on AWS infrastructure, enabling this vast volume of data to be transformed into a single combined genetic dataset by Illumina. These outputs further enrich the scientific importance of the data, enhancing the potential to identify less frequent genetic variants and making it more cross-comparable with other large scale population health studies.
The four pharmaceutical companies plan to publicly share their summary statistical analyses arising from the consortium collaboration, including genome-wide association results, providing the research community with highly valuable insights without the costly and time-consuming burden of analysing raw data.
UK Biobank is a large-scale biomedical database and research resource containing de-identified genetic, lifestyle and health information and biological samples from half a million UK participants. It is the most comprehensive and widely-used dataset of its kind, and is globally accessible to approved researchers who are undertaking health-related research that is in the public interest, whether they are from academic, commercial, government or charitable settings. UK Biobank is helping to advance modern medicine and enable better understanding of the prevention, diagnosis, and treatment of a wide range of serious and life-threatening illnesses – including cancer, heart disease and stroke. Over 30,000 researchers from more than 90 countries are registered to use UK Biobank and more than 9,000 peer-reviewed papers have been published as a result. UK Biobank is supported by Wellcome and the Medical Research Council, as well as the British Heart Foundation, Cancer Research UK, the UK Government’s National Institute for Health and Care Research and Department of Science, Innovation and Technology, Griffin Catalyst and Schmidt Futures.
1. Whole Genome Sequencing analyses the entire human genome, a unique genetic code of three billion building blocks that contain the ~20,000 genes and other non-coding regions inside a human cell and which control the biochemical processes that underpin life.
2. The genomes of 491,554 UK Biobank participants were sequenced.
3. In March 2018, UK Biobank received £30m funding from the UK Government as part of the Industrial Strategy Challenge Fund to fund a Vanguard Phase of whole genome sequencing for an initial 50,000 participant samples. The Wellcome Sanger Institute was selected as the sequencing provider, with the first whole genome sequenced in August 2018. One year later, a consortium of industry funders (Amgen, AstraZeneca, GSK and Johnson & Johnson) came together with additional funding from Government and charity to fund a Main Phase programme to sequence the remaining 450,000 whole genomes. Each industry party committed £25m, with matched funding of £50m from Government and Wellcome to fund this £200m programme. Sequencing during the Main Phase was undertaken by deCODE Genetics and the Wellcome Sanger Institute and commenced in September 2019. At its peak, over 20,000 whole genomes were being sequenced each month, with generated data securely flowing to the European Bioinformatics Institute before uploading to UK Biobank’s Research Analysis Platform, with the final genome sequenced in early 2022. Following a period of data review and quality control, the industry parties began their research using these data, linked to all other UK Biobank data, in February 2023 and, following an agreed nine-month period of exclusive access, these data are now being made available to approved researchers worldwide.
4. UK Biobank’s Research Analysis Platform (UKB-RAP) is a cloud-based data analysis platform provided by DNAnexus and delivered upon compute and storage infrastructure provided by Amazon Web Services in the London region.
5. On behalf of Johnson & Johnson, the Whole Genome Sequencing contract was entered into by Janssen Biotech, Inc., one of the Janssen Pharmaceutical Companies of Johnson & Johnson, and the collaboration was facilitated by the Johnson & Johnson Innovation Centre in London, UK.
6. Amazon Web Services (AWS) provides cloud-based compute infrastructure and data storage to support both the Illumina DRAGEN processing and UK Biobank’s Research Analysis Platform. AWS also makes $500,000 of credits available each year to researchers in low- and middle-income countries and to early-career researchers.
7. Joint calling is a process that aggregates the variant information of all 500,000 genomes to determine the frequency of all the variants in the population. Joint calling at this scale has rarely been undertaken and represents the outputs of substantial computational engineering efforts.
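For readers who want a feel for the aggregation step described in the joint calling note, the toy sketch below pools per-sample genotype calls into a population-level allele frequency for each variant. It is an illustration under stated assumptions only, not the DRAGEN pipeline or any consortium code: the function name `allele_frequencies`, the sample and variant identifiers, and the input layout (alternate-allele counts of 0, 1 or 2 per diploid sample, with `None` for a missing call) are all hypothetical. Production joint calling does considerably more than counting.

```python
from collections import Counter


def allele_frequencies(genotypes_by_sample):
    """Pool per-sample calls into one alternate-allele frequency per variant.

    genotypes_by_sample maps a sample ID to a dict of
    variant ID -> alternate-allele count (0, 1 or 2), or None if not called.
    This layout is an assumption made for the illustration.
    """
    alt_alleles = Counter()     # alternate alleles observed per variant
    called_alleles = Counter()  # total alleles with a usable call per variant

    for calls in genotypes_by_sample.values():
        for variant, alt_count in calls.items():
            if alt_count is None:
                continue  # skip missing calls rather than treating them as reference
            alt_alleles[variant] += alt_count
            called_alleles[variant] += 2  # diploid sample: two alleles per call

    return {v: alt_alleles[v] / called_alleles[v] for v in called_alleles}


# Hypothetical three-sample cohort with two variants.
cohort = {
    "sample_1": {"chr1:1000:A>G": 1, "chr2:500:C>T": 0},
    "sample_2": {"chr1:1000:A>G": 2, "chr2:500:C>T": None},
    "sample_3": {"chr1:1000:A>G": 0, "chr2:500:C>T": 1},
}

frequencies = allele_frequencies(cohort)
print(frequencies)
```

In this toy cohort the first variant has three alternate alleles among six called alleles, and the second has one among four, because the missing call is excluded from the denominator.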