5 February 2014

Finding what's important in our genome

New integrated tool to predict the function of non-coding variants

Plots showing the performance of the classifier on three different sets of genetic variants used in the study. [doi:10.1038/nmeth.2832]


Researchers at the Wellcome Trust Sanger Institute and the EMBL-European Bioinformatics Institute have developed software that predicts how likely variants in non-coding regions are to have a functional role. These relatively unknown regions of DNA make up 98 per cent of the genome.

The software, called GWAVA, integrates an enormous amount of information about the way genes are regulated, and prioritises non-coding variants in the human genome. This helps researchers focus their research on the most promising candidates, potentially saving considerable time and resources.

In recent years scientists have found many links between our genes and susceptibility to disease, but there is still a long way to go before we fully understand how DNA variation underlies disease. While much is known about the way protein-coding genes work, our three-billion-base-pair genome is bursting with other types of information. One of the big challenges in genomics is figuring out how non-coding regions of the human genome are involved in disease.

"The information provided by the ENCODE consortium, the 1000 Genomes Project and the NIH's Roadmap Epigenomics project are extremely useful resources for understanding non-coding variants," said Paul Flicek, co-lead author from EMBL-EBI. "But ranking that information is no small task. There is a lot of benign variation in our genome, so we needed a way to narrow down which regions play a role in disease."

The team investigated whether a combination of information about genes, genomic regions associated with regulation, and genome-wide properties can be used to identify the variants in the non-coding part of the genome that are most likely to contribute to disease.

"GWAVA uses a classifier to discriminate apparently harmless non-coding variants from those that are likely to be involved in disease," said Graham Ritchie, first author from EMBL-EBI and the Sanger Institute. "We tested it out using several scenarios and found that it consistently prioritises the regions known to be associated with disease. This could be really useful for people who need to decide which mutations to look at as cancer drivers, for example."
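The idea of scoring and ranking variants with a classifier can be illustrated with a minimal sketch. This is not GWAVA's method or code: it swaps in a simple logistic regression, and the three annotation features, all training values and the variant names are invented purely for illustration.

```python
import math

# Hypothetical training data: every feature and value here is invented.
# Features per variant: (in open chromatin, conservation score, in TF binding site)
# Label: 1 = implicated in disease, 0 = apparently benign.
TRAINING = [
    ((1.0, 0.9, 1.0), 1),
    ((1.0, 0.7, 0.0), 1),
    ((0.0, 0.8, 1.0), 1),
    ((0.0, 0.1, 0.0), 0),
    ((0.0, 0.3, 0.0), 0),
    ((1.0, 0.2, 0.0), 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return a score between 0 and 1; higher means more likely functional."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(examples, epochs=2000, rate=0.5):
    """Fit logistic-regression weights by batch gradient descent."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in examples:
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - rate * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / len(examples)
    return weights, bias


def prioritise(weights, bias, variants):
    """Sort variant names from highest to lowest predicted score."""
    scored = [(predict(weights, bias, f), name) for name, f in variants.items()]
    return [name for _, name in sorted(scored, reverse=True)]


weights, bias = train(TRAINING)

# Hypothetical unlabelled candidates to rank.
candidates = {
    "variant_A": (1.0, 0.95, 1.0),
    "variant_B": (0.0, 0.05, 0.0),
    "variant_C": (1.0, 0.5, 0.0),
}
ranking = prioritise(weights, bias, candidates)
print(ranking)
```

In this toy setting, a researcher would follow up the variants at the top of `ranking` first; a real tool would draw on far richer annotation and much larger training sets.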

The authors hope that using GWAVA predictions for non-coding variants in disease association studies will substantially improve the chances of finding genetic variants that are involved in human disease.

"We've combined freely available data to predict the impact of these variants in the non-coding region of the genome," said Professor Eleftheria Zeggini, co-lead author from the Sanger Institute. "Most disease-associated variants discovered to date fall outside genes. This tool can help us start to understand how they work."

Notes to Editors

Publication details

  • Functional annotation of noncoding sequence variants.

    Ritchie GR, Dunham I, Zeggini E and Flicek P

    Nature Methods 2014;11;3;294-6


This work was funded by the Wellcome Trust and by the European Molecular Biology Laboratory.

Participating Centres

  • Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.
  • European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, UK.

EMBL-European Bioinformatics Institute

The EMBL-European Bioinformatics Institute (EBI) is part of the European Molecular Biology Laboratory (EMBL) and is located on the Wellcome Trust Genome Campus in Hinxton near Cambridge, UK. The EBI grew out of EMBL's pioneering work in providing public biological databases to the research community. It hosts some of the world's most important collections of biological data, including DNA sequences (ENA), protein sequences (UniProt), the genomes of animals and plants, three-dimensional molecular structures, data from gene expression experiments, protein-protein interactions, and reactions and pathways. EMBL-EBI's many research groups are continually developing new tools to support the biocomputing community. EMBL-EBI provides essential compute infrastructure for the ENCODE project and coordinates ELIXIR, the emerging research infrastructure for life science data in Europe.


The Wellcome Trust Sanger Institute

The Wellcome Trust Sanger Institute is one of the world's leading genome centres. Through its ability to conduct research at scale, it is able to engage in bold and long-term exploratory projects that are designed to influence and empower medical science globally. Institute research findings, generated through its own research programmes and through its leading role in international consortia, are being used to develop new diagnostics and treatments for human disease.


The Wellcome Trust

The Wellcome Trust is a global charitable foundation dedicated to achieving extraordinary improvements in human and animal health. We support the brightest minds in biomedical research and the medical humanities. Our breadth of support includes public engagement, education and the application of research to improve health. We are independent of both political and commercial interests.


Contact the Press Office

Mark Thomson Senior Media and Public Relations Officer
Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK

Tel +44 (0)1223 492 384
Mobile +44 (0)7753 775 397
Fax +44 (0)1223 494 919
Email press.office@sanger.ac.uk
