21 April 2011

A User's Guide to the Encyclopaedia of DNA Elements

International effort to decode human genes releases massive dataset and instructions for use

Selected ENCODE and other gene and transcript annotations in the region of the human TP53 gene on chromosome 17 GENCODE annotated isoforms of TP53 RNAs are shown in red at the top of the figure, along with annotation of the neighbouring WRAP53 gene...

Selected ENCODE and other gene and transcript annotations in the region of the human TP53 gene on chromosome 17 GENCODE annotated isoforms of TP53 RNAs are shown in red at the top of the figure, along with annotation of the neighbouring WRAP53 gene...


The international team of the ENCODE, or Encyclopedia Of DNA Elements, Project, have created an overview of their ongoing large-scale efforts to interpret the human genome sequence.

The April 19 publication of "A User's Guide to the Encyclopedia of DNA Elements (ENCODE)" in the journal PLoS Biology provides a guide for using the vast amounts of high-quality data and resources produced so far by the project. All of the data, tools to study them, and the paper itself are freely available through multiple websites.

The Human Genome Project and subsequent large-scale genomic efforts were based on the belief that data from large genome projects should be made freely available, to further scientific discovery. ENCODE has accomplished this goal through their database at genome.ucsc.edu, and by creating and posting tools to facilitate data use at encodeproject.org.

"This project requires collaboration from multiple people all over the world at the cutting edge of their fields, working in a coordinated manner to figure out the function of our human genome," said Dr. Richard Myers, president and director of the HudsonAlpha Institute for Biotechnology and one of the 25 principal investigators of the project. "The importance extends beyond basic knowledge of who and what we are as humans and into understanding of human health and disease."

In their publication, the team demonstrates how the data can be immediately useful in interpreting associations between single nucleotides and disease. DNA variants upstream of the c-Myc proto-oncogene are associated with multiple cancers, but until recently the mechanism behind the association was unknown. ENCODE data confirm recent observations by other groups that the variants can change binding of transcription factor proteins to an enhancer region, leading to changes in expression of the c-Myc gene and therefore to oncogenesis. Similar studies are now possible for the thousands of variants identified in genome-wide association studies, addressing mechanistic questions of susceptibility to disease.

As part of the international effort, the Wellcome Trust Sanger Institute and its collaborators aim to develop accurate, evidence-based descriptions (annotations) for all gene features in the entire human genome through the GENCODE sub-project. Their aim is to deliver annotations for all protein-coding loci, non-protein-coding loci where there is evidence of gene activity, and pseudogenes. The team curates the annotations manually as well as using computation analysis and work in biology laboratories, incorporating the work of the Sanger HAVANA and Ensembl groups into the GENCODE gene set.

Dr Tim Hubbard, who leads GENCODE at the Sanger Institute, said "Our commitments from the Human Genome Project included freely available data, high-quality genome tools and the most complete and accurate annotation that expert analysis could provide. GENCODE delivers on these commitments, providing the most accurate and complete gene annotation to ensure that the research of biomedical scientists around the world can proceed as rapidly and confidently as possible. The entire ENCODE Project is delivering data sets informing how our genome functions that builds on the foundations laid a decade ago."

" GENCODE delivers the most accurate and complete gene annotation to ensure that the research of biomedical scientists around the world can proceed as rapidly and confidently as possible. "

Dr Tim Hubbard

The Gencode gene sets are used by the entire ENCODE consortium and by many other projects (such as the 1000 Genomes Project and the International Cancer Genome Consortium) as reference gene sets.

Dr. Ewan Birney, senior team leader at the EMBL-European Bioinformatics Institute in the UK and another principal investigator, commented "We knew four years ago, from our publication of ENCODE techniques on 1% of the genome, that we had an unprecedented view of how biology works on those regions. By extending our work to the entire genome, we see the immediate impact on the interpretation of noncoding variants identified in genome-wide association studies. These studies are disease-driven but have not always yielded clear next steps, and ENCODE data provide those scientists with some new paths to follow."

The current paper not only tells how to find the data, but also explains how to apply the data to interpret the human genome. One can think of determining the human DNA sequence alone as finding a new language, but without a key to interpret the letters within. The ENCODE project adds data such as where RNA is produced from our DNA, where proteins bind to DNA, and where parts of our DNA are augmented by additional chemical markers. These proteins and chemical additions are keys to understanding how different cells within our bodies are interpreting the language of DNA.

"ENCODE resources are already being used by scientists for discovery," said Dr. Ross Hardison, professor at Penn State University and principal investigator, "but they are also used in teaching classes now, to train students in all areas of biology. Our classes are using real data on genomic variation and function in their problem sets, shortly after our labs have generated them."

Scientists with the ENCODE Project are applying up to 20 different tests in 108 commonly used cell lines to compile these important data. "Assays that are now fundamental to biology, such as chromatin immunoprecipitation and sequencing or ChIP-seq, were produced by the ENCODE Project," commented Dr. John Stamatoyannopoulos, assistant professor at the University of Washington and principal investigator. "Widely used computational tools for processing and interpretation of large-scale functional genomic data have also been developed by the project. The depth, quality, and diversity of the ENCODE data are unprecedented."

Dr. Michael Snyder, professor and chair at Stanford University and ENCODE principal investigator, summed up the project: "It is now becoming commonplace to sequence genomes. The biggest challenge ahead is to understand the information that is generated. The ENCODE project is shedding light on the vast sea of information that is present in a human genome sequence including where the genes are and the sequences that control them."

Notes to Editors

Publication details


The ENCODE Project is funded by the National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.

Selected websites

Participating Centres

A list of participating centres can be found at http://www.genome.gov/26525220

The Wellcome Trust Sanger Institute

The Wellcome Trust Sanger Institute, which receives the majority of its funding from the Wellcome Trust, was founded in 1992. The Institute is responsible for the completion of the sequence of approximately one-third of the human genome as well as genomes of model organisms and more than 90 pathogen genomes. In October 2006, new funding was awarded by the Wellcome Trust to exploit the wealth of genome data now available to answer important questions about health and disease.


The Wellcome Trust

The Wellcome Trust is a global charitable foundation dedicated to achieving extraordinary improvements in human and animal health. We support the brightest minds in biomedical research and the medical humanities. Our breadth of support includes public engagement, education and the application of research to improve health. We are independent of both political and commercial interests.


Contact the Press Office

Mark Thomson Senior Media and Public Relations Officer
Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK

Tel +44 (0)1223 492 384
Mobile +44 (0)7753 775 397
Fax +44 (0)1223 494 919
Email press.office@sanger.ac.uk

HudsonAlpha Institute for Biotechnology

Dr Chris Gunter Director of Research Affairs
HudsonAlpha Institute for Biotechnology
Email cgunter@hudsonalpha.org
Tel +1 256-327-0934

* quick link - http://q.sanger.ac.uk/ssxc9j1l