Archive Page: ENCODE pilot phase (2003-2007)

This page contains archive content about the pilot phase of the ENCODE project. It is no longer maintained and may contain out-of-date information.

Following the completion of the human genome sequence, the next major task is to understand the information contained therein.

Although considerable progress has been made in identifying the genes that code for functional proteins (for instance, see Collins et al 2003, 2004), identifying the elements in DNA sequence that control gene expression and DNA replication at a genome-wide level is far from trivial. NHGRI has therefore established a pilot project (ENCODE) to explore computational and experimental methods for developing an encyclopedia of DNA elements in the human genome. Initially, the pilot project has funded a collection of different groups who will target 1% of the genome, chosen according to the criteria outlined in the ENCODE RFA.

The Sanger Institute has two groups involved in the ENCODE project: one looking at detecting human functional sequences with microarrays, and the other looking at identification of functionally variable regulatory regions in the human genome.

Functional sequences

Detecting human functional sequences with microarrays

This work builds on recent studies in our laboratories that used microarrays to study DNA copy number (Fiegler et al 2003), replication timing and chromatin modifications at a range of scales: 400 bp resolution in a ~200 kb pilot region, ~75 kb resolution across the q arm of chromosome 22, and 1 Mb resolution across the human genome. We aim to contribute microarray-based approaches to the ENCODE consortium to provide experimental evidence of DNA elements involved in gene regulation and replication, as well as the status of chromatin, across the pilot 1% of the genome. Specifically we are:

  1. Developing two sets of genomic microarrays covering the 1% of the genome targeted in the ENCODE project. The first is a low-resolution microarray based on genomic clones (predominantly BACs, but also PACs, cosmids and fosmids) from the genomic sequence tile path. The second is an array of 22,000 1.25 kb PCR fragments designed from the DNA sequence, covering ~85% of the targeted regions.
  2. Using these microarrays to assay DNA samples enriched for sequences involved in specific biological processes and functions, by methods including flow-sorting, pulse-labeling and chromatin immunoprecipitation (ChIP), so as to develop maps of the following at both genomic clone and 1.25 kb resolution:
    • Replication timing
    • Replication origins
    • DNA methylation
    • Modified histones/active and inactive chromatin
    • Transcription factor binding sites

We will correlate these maps with genomic DNA features including C+G content, genes/exons, repeat elements and SNP density. In addition, we will correlate the elements we map with regions of conserved DNA sequence identified by the multi-species comparative sequencing being undertaken in the laboratory of Eric Green, and with maps of transcriptional activity generated within the consortium.
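As an illustration of the kind of correlation described above, the following Python sketch computes C+G content in fixed windows and a Pearson correlation between those values and a per-window array signal. The 1,250 bp default window mirrors the PCR fragment size; the function names, the choice of Pearson correlation and the toy inputs are assumptions made for illustration and are not taken from the project's own analysis.

```python
import math


def gc_content(seq: str) -> float:
    """Fraction of C and G among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # A window made only of ambiguous bases (e.g. N) has no defined value.
    return gc / total if total else float("nan")


def windowed_gc(seq: str, window: int = 1250):
    """Return (start, gc_fraction) for consecutive non-overlapping windows."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq), window)
    ]


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy sequence and a made-up per-window signal (hypothetical values).
    toy_seq = "GGCCGGCC" + "ATATGCGC" + "ATATATAT"
    gc_values = [gc for _, gc in windowed_gc(toy_seq, window=8)]
    toy_signal = [2.1, 1.0, 0.2]
    print(gc_values, pearson(gc_values, toy_signal))
```

In practice the signal would be one value per array element (for example a ChIP enrichment ratio), and windows would be defined by the actual clone or PCR fragment coordinates rather than a fixed step.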

Team

References

  • A genome annotation-driven approach to cloning the human ORFeome.

    Collins JE, Wright CL, Edwards CA, Davis MP, Grinham JA, Cole CG, Goward ME, Aguado B, Mallya M, Mokrab Y, Huckle EJ, Beare DM and Dunham I

    Genome biology 2004;5;10;R84

  • DNA microarrays for comparative genomic hybridization based on DOP-PCR amplification of BAC and PAC clones.

    Fiegler H, Carr P, Douglas EJ, Burford DC, Hunt S, Scott CE, Smith J, Vetrie D, Gorman P, Tomlinson IP and Carter NP

    Genes, chromosomes & cancer 2003;36;4;361-74

  • Reevaluating human gene annotation: a second-generation analysis of chromosome 22.

    Collins JE, Goward ME, Cole CG, Smink LJ, Huckle EJ, Knowles S, Bye JM, Beare DM and Dunham I

    Genome research 2003;13;1;27-36

Genetic variation

Identification of functionally variable regulatory regions in the human genome

One of the main reasons to annotate the human genome is to interpret the phenotypic consequences of genetic variation within functional genomic regions. We are using a novel approach for the selective identification of functionally variable regulatory sequences of the human genome. We are detecting correlations between variation in gene expression and nucleotide polymorphisms near those genes to identify regulatory regions and their variants that contribute to gene expression variation. This approach uses naturally occurring genomic variation (nucleotide polymorphism) and phenotypic variation (transcript levels) to detect significant associations (Figure 1). Polymorphisms associated with phenotypic variation will likely be in linkage disequilibrium with functional regulatory polymorphisms nearby, thereby identifying segments of the genome containing sequences that regulate gene expression.
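The association idea can be sketched in a few lines of Python. The example below regresses transcript level on SNP genotype coded as the number of copies of one allele (0, 1 or 2) and reports the slope, intercept and correlation. The additive coding, the use of ordinary least squares and the function name `dosage_regression` are assumptions for illustration; the text above does not specify which statistical test is used.

```python
import math


def dosage_regression(dosages, expression):
    """Least-squares fit of expression on allele dosage.

    dosages:    genotype per individual, coded 0, 1 or 2 (assumed additive coding)
    expression: transcript level per individual, in the same order
    Returns (slope, intercept, r).
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched genotype and expression values (n >= 3)")
    mx = sum(dosages) / n
    my = sum(expression) / n
    sxy = sum((d - mx) * (e - my) for d, e in zip(dosages, expression))
    sxx = sum((d - mx) ** 2 for d in dosages)
    syy = sum((e - my) ** 2 for e in expression)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    if syy == 0:
        raise ValueError("expression does not vary in this sample")
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


if __name__ == "__main__":
    # Made-up genotypes and transcript levels for six individuals.
    genotypes = [0, 0, 1, 1, 2, 2]
    levels = [5.1, 4.8, 6.0, 6.3, 7.2, 6.9]
    print(dosage_regression(genotypes, levels))
```

A real screen would repeat this for every SNP near every gene and would need a significance threshold that accounts for the number of tests performed.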

Our experimental design is to use Illumina technology both to screen for gene expression variation and to genotype the relevant SNPs for the association analysis. We have designed an Illumina bead array that contains approximately 350 genes from the ENCODE regions, all the human chromosome 21 genes and 100 genes from a 10 Mb genomic region of human chromosome 20. An example of a hybridized array is shown in Figure 2. The technology is highly sensitive and accurate. In Figure 3a we show the regression of two replicates from the same RNA pool and in Figure 3b the regression of two different individuals. Note the wider spread in Figure 3b, which results from differences in transcript levels between the two individuals.

We view this project as readily scalable to a whole human genome screen for gene expression variation and association with nucleotide polymorphism.

It will provide three different types of information:

  1. Genomic regions that contain variable regulatory polymorphisms
  2. Structure of regulatory variation in the human genome and determination of how it is associated with disease susceptibility
  3. Large dataset of genes that exhibit variation of expression within populations, in a manner similar to the way the HapMap project will provide the haplotype structure of the human genome

Team

  • Manolis Dermitzakis PI
  • Panos Deloukas Co-PI
  • Stylianos E. Antonarakis, University of Geneva Co-PI
  • Andrew G. Clark, Cornell University Co-PI

Data access

ENCODE - Data Access (pilot phase)

Columns Array ID to Parameters are ChIP/chip parameters; columns UCSC Track Name to Data Files are ChIP/chip data.

| Experiment         | Array ID                 | Cell Type/Line | Antibody | Parameters      | UCSC Track Name | ArrayExpress ID | Data Files |
|--------------------|--------------------------|----------------|----------|-----------------|-----------------|-----------------|------------|
| H3K4me3_GM06990_1  | ENCODE2.1.1              | GM06990        | ab8580   | 5 : 1 : 10      | Sanger ChIP     | E-MEXP-269      | ftp        |
| H3K4me3_GM06990_2  | ENCODE3.1.1              | GM06990        | ab8580   | 10 : 1 : 10     | Sanger ChIP     | E-MEXP-269      | ftp        |
| H3K4me1_GM06990_1  | ENCODE3.1.1              | GM06990        | ab8895   | 10 : 1 : 10     | Sanger ChIP     | E-MEXP-269      | ftp        |
| H3K4me2_GM06990_1  | ENCODE3.1.1              | GM06990        | ab7766   | 10 : 1 : 10     | Sanger ChIP     | E-MEXP-269      | ftp        |
| H4ac_GM06990_1     | ENCODE3.1.1              | GM06990        | 06-866   | 10 : 1 : 10     | Sanger ChIP     | E-MEXP-269      | ftp        |
| H3ac_GM06990_1     | ENCODE3.1.1              | GM06990        | 06-599   | 10 : 1 : 10     | Sanger ChIP     | E-MEXP-269      | ftp        |
| H3K27me3_GM06990_1 | ENCODE3.1.1, ENCODE5.1.1 | GM06990        | 05-851   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K36me3_GM06990_1 | ENCODE3.1.1, ENCODE5.1.1 | GM06990        | ab9050   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K79me3_GM06990_1 | ENCODE3.1.1, ENCODE5.1.1 | GM06990        | ab2621   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K9me3_GM06990_1  | ENCODE5.1.1              | GM06990        | 07-523   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| CTCF_GM06990_1     | ENCODE3.1.1, ENCODE5.1.1 | GM06990        | 15914    | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me2_K562_1     | ENCODE3.1.1              | K562           | ab7766   | 2.5 : 0.37 : 10 | Sanger ChIP     |                 | ftp        |
| H4ac_K562_1        | ENCODE2.1.1, ENCODE3.1.1 | K562           | 06-866   | 10 : 0.37 : 10  | Sanger ChIP     |                 | ftp        |
| H3ac_K562_1        | ENCODE2.1.1, ENCODE3.1.1 | K562           | 06-599   | 10 : 0.37 : 10  | Sanger ChIP     |                 | ftp        |
| H3K4me3_K562_1     | ENCODE2.1.1, ENCODE3.1.1 | K562           | ab8580   | 10 : 0.37 : 10  | Sanger ChIP     |                 | ftp        |
| H3K4me3_HeLa-S3_1  | ENCODE3.1.1              | HeLa-S3        | ab8580   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me1_HeLa-S3_1  | ENCODE3.1.1              | HeLa-S3        | ab8895   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me2_HeLa-S3_1  | ENCODE3.1.1              | HeLa-S3        | ab7766   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H4ac_HeLa-S3_1     | ENCODE3.1.1              | HeLa-S3        | 06-866   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3ac_HeLa-S3_1     | ENCODE3.1.1              | HeLa-S3        | 06-599   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me3_HFL-1_1    | ENCODE3.1.1              | HFL-1          | ab8580   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me1_HFL-1_1    | ENCODE3.1.1              | HFL-1          | ab8895   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me2_HFL-1_1    | ENCODE3.1.1              | HFL-1          | ab7766   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H4ac_HFL-1_1       | ENCODE3.1.1              | HFL-1          | 06-866   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3ac_HFL-1_1       | ENCODE3.1.1              | HFL-1          | 06-599   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me3_MOLT4_1    | ENCODE3.1.1              | MOLT4          | ab8580   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me1_MOLT4_1    | ENCODE3.1.1              | MOLT4          | ab8895   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me2_MOLT4_1    | ENCODE3.1.1              | MOLT4          | ab7766   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H4ac_MOLT4_1       | ENCODE3.1.1              | MOLT4          | 06-866   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3ac_MOLT4_1       | ENCODE3.1.1              | MOLT4          | 06-599   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me1_PTR8_1     | ENCODE3.1.1              | PTR8           | ab8580   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me2_PTR8_1     | ENCODE3.1.1              | PTR8           | ab8895   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |
| H3K4me3_PTR8_1     | ENCODE3.1.1              | PTR8           | ab7766   | 10 : 1 : 10     | Sanger ChIP     |                 | ftp        |

Parameter key

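Parameters are given in the order: amount of antibody in assay (µg) : formaldehyde concentration (%) : cross-linking time (minutes). For example, 5 : 1 : 10 means 5 µg of antibody, 1% formaldehyde and 10 minutes of cross-linking.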
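For anyone processing the table programmatically, a parameter string can be split into its three fields as in the Python sketch below. The `ChipParameters` type and `parse_chip_parameters` function are hypothetical names introduced here for illustration; only the field order comes from the key above.

```python
from typing import NamedTuple


class ChipParameters(NamedTuple):
    """One ChIP/chip parameter triple, in the order given by the key."""
    antibody_ug: float            # amount of antibody in assay (µg)
    formaldehyde_percent: float   # formaldehyde concentration (%)
    crosslink_minutes: float      # cross-linking time (minutes)


def parse_chip_parameters(text: str) -> ChipParameters:
    """Parse a string such as '2.5 : 0.37 : 10' into its three numbers."""
    parts = [part.strip() for part in text.split(":")]
    if len(parts) != 3:
        raise ValueError(f"expected three ':'-separated fields, got {text!r}")
    antibody, formaldehyde, minutes = (float(part) for part in parts)
    return ChipParameters(antibody, formaldehyde, minutes)


if __name__ == "__main__":
    params = parse_chip_parameters("5 : 1 : 10")
    print(params.antibody_ug, params.formaldehyde_percent, params.crosslink_minutes)
```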

Contact

  • Detecting human functional sequences with microarrays: Nigel Carter
  • Identification of functionally variable regulatory regions in the human genome: Panos Deloukas
