Genomics of gene regulation

The genomics of gene regulation team seeks to understand the role of gene regulation in human disease and evolution. Headed by Daniel Gaffney, the group combines computational and statistical methods with high-throughput experimental techniques to study how changes in gene regulation contribute to disease susceptibility and human evolution. We are currently involved in both generating and analysing data on molecular phenotypes in human induced pluripotent stem cells as part of the Human Induced Pluripotent Stem Cells Initiative.

All genes in the genome are regulated to control how their genetic information is turned into gene products, a process known as gene expression. Understanding this process is important because the majority of mutations that are associated with human disease and evolution are thought to affect gene expression.

The team aims to understand how genetic changes affect the level, location and timing of gene expression using a combination of experimental and computational methods. We are particularly focussed on understanding this variation in pluripotent stem cells as a model of human disease and development.

We welcome applications from prospective postdocs and PhD students. Projects are available in the genomics of gene regulation, molecular evolution and the population genomics of gene regulation. All our work involves data analysis, but there is also scope for projects with a component of laboratory work. Interested applicants should send a CV to Dan (see profile page for contact details), together with a summary of their research, a list of publications and contact details for three referees.



Since the publication of the human genome sequence, the pace of discovery in human genetics has accelerated dramatically. We have begun to identify which changes in the genome are important for a variety of human diseases and which have occurred during recent human evolution. However, biological interpretation of these results is complicated because most of these changes do not occur inside known genes. In fact, many important genetic changes occur in the non-coding fraction of the genome, and are believed to affect the regulation of gene expression.

Understanding how changes in gene regulation alter observable phenotypes is important for:

  • understanding the functional basis of genetic disease
  • development of more accurate, powerful and specific diagnostics
  • interpreting the biological changes that have occurred since we diverged from our common ancestors.

Recent technological developments mean that we can now assay key molecular phenotypes, including protein-coding and noncoding RNA transcription, transcription factor binding and chromatin accessibility, genome-wide and with high accuracy.

Our group studies epigenetic and gene expression variation in human populations. Recently, we have started work in human induced pluripotent stem cells as a model system for disease and development.


Gene expression and regulatory variation in human populations

Part of our group's research uses naturally occurring variation as a model system for testing hypotheses about gene regulation. We look for genetic variants that correlate with differences in gene expression between individuals. The genetic and epigenetic context of these variants can inform our understanding of the biology of gene regulation, and can help pinpoint likely causal disease mutations.
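As a rough illustration of what "variants that correlate with differences in gene expression" means in practice, the sketch below regresses one gene's expression on the genotype dosage at a single variant. It is a minimal example under stated assumptions, not the group's analysis pipeline: the function name `eqtl_slope` and the data are invented for illustration, and real analyses also account for covariates, population structure and multiple testing.

```python
from math import sqrt


def eqtl_slope(genotypes, expression):
    """Least-squares fit of expression on genotype dosage.

    genotypes:  alternate-allele dosage per individual (0, 1 or 2)
    expression: expression of one gene in the same individuals
    Returns (slope per allele, Pearson correlation).
    Hypothetical helper, for illustration only.
    """
    n = len(genotypes)
    if n != len(expression) or n < 3:
        raise ValueError("need matched vectors with at least three individuals")
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    # Sums of squares and cross-products around the means
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    see = sum((e - mean_e) ** 2 for e in expression)
    sge = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    if sgg == 0 or see == 0:
        raise ValueError("genotype or expression has no variation")
    return sge / sgg, sge / sqrt(sgg * see)


# Made-up data: six individuals, allele dosage and log2 expression of one gene
dosage = [0, 0, 1, 1, 2, 2]
log2_expr = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]
slope, r = eqtl_slope(dosage, log2_expr)
print(f"slope per allele = {slope:.2f}, r = {r:.3f}")
```

A slope that is reliably different from zero across many individuals is the kind of signal that marks a variant as a candidate expression QTL.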

Annotating active regulatory elements using next-generation sequencing

Our group uses experimental methods such as DNaseI digestion, chromatin-immunoprecipitation and formaldehyde-assisted recovery of regulatory elements (FAIRE) to identify active regulatory regions, and develops computational and statistical methods for interpreting these data.


We collaborate closely with a number of groups both at the Sanger Institute and elsewhere. We are currently working with Ludovic Vallier's lab in Cambridge on annotating regulatory elements in a variety of cell types. We also work with Duncan Odom's groups at the Sanger Institute and Cancer Research UK: Cambridge Research Institute to develop high-throughput methods for regulatory element annotation. We have close links with Ville Mustonen's and Carl Anderson's groups at the Sanger Institute.

  • Carl Anderson - Statistical genetics, The Wellcome Trust Sanger Institute, Hinxton
  • Duncan Odom - Regulatory evolution in mammalian tissues, The Wellcome Trust Sanger Institute, Hinxton
  • Ludovic Vallier - Gene expression variation in induced pluripotent stem cells, The Wellcome Trust Centre for Stem Cell Research, Cambridge

Selected Publications

  • Genetic background drives transcriptional variation in human induced pluripotent stem cells.

    Rouhani F, Kumasaka N, de Brito MC, Bradley A, Vallier L and Gaffney D

    PLoS genetics 2014;10;6;e1004432

  • AHT-ChIP-seq: a completely automated robotic protocol for high-throughput chromatin immunoprecipitation.

    Aldridge S, Watt S, Quail MA, Rayner T, Lukk M, Bimson MF, Gaffney D and Odom DT

    Genome biology 2013;14;11;R124

  • Global properties and functional complexity of human gene regulatory variation.

    Gaffney DJ

    PLoS genetics 2013;9;5;e1003501

  • Dense fine-mapping study identifies new susceptibility loci for primary biliary cirrhosis.

    Liu JZ, Almarri MA, Gaffney DJ, Mells GF, Jostins L, Cordell HJ, Ducker SJ, Day DB, Heneghan MA, Neuberger JM, Donaldson PT, Bathgate AJ, Burroughs A, Davies MH, Jones DE, Alexander GJ, Barrett JC, Sandford RN, Anderson CA, UK Primary Biliary Cirrhosis (PBC) Consortium and Wellcome Trust Case Control Consortium 3

    Nature genetics 2012;44;10;1137-41

  • DNA sequence-dependent compartmentalization and silencing of chromatin at the nuclear lamina.

    Zullo JM, Demarco IA, Piqué-Regi R, Gaffney DJ, Epstein CB, Spooner CJ, Luperchio TR, Bernstein BE, Pritchard JK, Reddy KL and Singh H

    Cell 2012;149;7;1474-87

  • DNase I sensitivity QTLs are a major determinant of human expression variation.

    Degner JF, Pai AA, Pique-Regi R, Veyrieras JB, Gaffney DJ, Pickrell JK, De Leon S, Michelini K, Lewellen N, Crawford GE, Stephens M, Gilad Y and Pritchard JK

    Nature 2012;482;7385;390-4

  • The contribution of RNA decay quantitative trait loci to inter-individual variation in steady-state gene expression levels.

    Pai AA, Cain CE, Mizrahi-Man O, De Leon S, Lewellen N, Veyrieras JB, Degner JF, Gaffney DJ, Pickrell JK, Stephens M, Pritchard JK and Gilad Y

    PLoS genetics 2012;8;10;e1003000

  • Controls of nucleosome positioning in the human genome.

    Gaffney DJ, McVicker G, Pai AA, Fondufe-Mittendorf YN, Lewellen N, Michelini K, Widom J, Gilad Y and Pritchard JK

    PLoS genetics 2012;8;11;e1003036

  • Dissecting the regulatory architecture of gene expression QTLs.

    Gaffney DJ, Veyrieras JB, Degner JF, Pique-Regi R, Pai AA, Crawford GE, Stephens M, Gilad Y and Pritchard JK

    Genome biology 2012;13;1;R7

  • Exon-specific QTLs skew the inferred distribution of expression QTLs detected using gene expression array data.

    Veyrieras JB, Gaffney DJ, Pickrell JK, Gilad Y, Stephens M and Pritchard JK

    PloS one 2012;7;2;e30629

  • False positive peaks in ChIP-seq and other sequencing-based functional assays caused by unannotated high copy number regions.

    Pickrell JK, Gaffney DJ, Gilad Y and Pritchard JK

    Bioinformatics (Oxford, England) 2011;27;15;2144-6

  • Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data.

    Pique-Regi R, Degner JF, Pai AA, Gaffney DJ, Gilad Y and Pritchard JK

    Genome research 2011;21;3;447-55

  • DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines.

    Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, Gilad Y and Pritchard JK

    Genome biology 2011;12;1;R10

  • Alternative splicing is frequent during early embryonic development in mouse.

    Revil T, Gaffney D, Dias C, Majewski J and Jerome-Majewska LA

    BMC genomics 2010;11;399

  • Effect of the assignment of ancestral CpG state on the estimation of nucleotide substitution rates in mammals.

    Gaffney DJ and Keightley PD

    BMC evolutionary biology 2008;8;265

  • Selective constraints in experimentally defined primate regulatory regions.

    Gaffney DJ, Blekhman R and Majewski J

    PLoS genetics 2008;4;8;e1000157

  • Genomic selective constraints in murid noncoding DNA.

    Gaffney DJ and Keightley PD

    PLoS genetics 2006;2;11;e204

  • The scale of mutational variation in the murid genome.

    Gaffney DJ and Keightley PD

    Genome research 2005;15;8;1086-94

  • Functional constraints and frequency of deleterious mutations in noncoding DNA of rodents.

    Keightley PD and Gaffney DJ

    Proceedings of the National Academy of Sciences of the United States of America 2003;100;23;13402-6


Team members

Daniel Gaffney
CDF Group Leader
Angela Goncalves
Postdoctoral Fellow
Andrew Knights
Senior Research Assistant
Natsuhiko Kumasaka
Postdoctoral Fellow

Daniel Gaffney

- CDF Group Leader

I earned my PhD in evolutionary genetics from Edinburgh University in 2006 under the supervision of Dr Peter Keightley. My graduate research used computational methods to study variation in the mutation rate and natural selection in noncoding DNA. From 2006 to 2008 I was a postdoc with Dr Jacek Majewski at McGill University and Genome Quebec Genome Centre, where I worked on the evolution of transcriptional regulation in primates and the role of alternative splicing in embryonic development. From 2008 until 2011 I worked on population genetic variation in gene expression with Dr Jonathan Pritchard at the University of Chicago.


Our current research is focused on understanding the impact of human genetic variation on molecular phenotypes such as gene transcription, and on other important cellular processes.

Angela Goncalves

- Postdoctoral Fellow

I was an undergraduate at the University of Coimbra, where I studied Computer Science. Subsequently, I spent a year at the Centre for Earth Observation (ESRIN) of the European Space Agency (ESA), working on a toolbox for analysing remote sensing satellite data. Following my time at ESA, I trained with Alvis Brazma for a PhD in Molecular Biology at the European Bioinformatics Institute (EBI-EMBL) and the University of Cambridge. There I worked on computational methods for the analysis of RNA sequencing data and applied them to study the divergence of gene expression and isoform usage in mammals.


In my current research, I am interested in uncovering genetic bases for the molecular and functional variability observed in human induced pluripotent stem cells.


  • Extensive compensatory cis-trans regulation in the evolution of mouse gene expression.

    Goncalves A, Leigh-Brown S, Thybert D, Stefflova K, Turro E, Flicek P, Brazma A, Odom DT and Marioni JC

    European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom.

    Gene expression levels are thought to diverge primarily via regulatory mutations in trans within species, and in cis between species. To test this hypothesis in mammals we used RNA-sequencing to measure gene expression divergence between C57BL/6J and CAST/EiJ mouse strains and allele-specific expression in their F1 progeny. We identified 535 genes with parent-of-origin specific expression patterns, although few of these showed full allelic silencing. This suggests that the number of imprinted genes in a typical mouse somatic tissue is relatively small. In the set of nonimprinted genes, 32% showed evidence of divergent expression between the two strains. Of these, 2% could be attributed purely to variants acting in trans, while 43% were attributable only to variants acting in cis. The genes with expression divergence driven by changes in trans showed significantly higher sequence constraint than genes where the divergence was explained by variants acting in cis. The remaining genes with divergent patterns of expression (55%) were regulated by a combination of variants acting in cis and variants acting in trans. Intriguingly, the changes in expression induced by the cis and trans variants were in opposite directions more frequently than expected by chance, implying that compensatory regulation to stabilize gene expression levels is widespread. We propose that expression levels of genes regulated by this mechanism are fine-tuned by cis variants that arise following regulatory changes in trans, suggesting that many cis variants are not the primary targets of natural selection.

    Funded by: Cancer Research UK: A15603; European Research Council: 202218; Wellcome Trust

    Genome research 2012;22;12;2376-84

  • Pol III binding in six mammals shows conservation among amino acid isotypes despite divergence among tRNA genes.

    Kutter C, Brown GD, Gonçalves A, Wilson MD, Watt S, Brazma A, White RJ and Odom DT

    Cancer Research UK, Cambridge Research Institute, Li Ka Shing Centre, Cambridge, UK.

    RNA polymerase III (Pol III) transcription of tRNA genes is essential for generating the tRNA adaptor molecules that link genetic sequence and protein translation. By mapping Pol III occupancy genome-wide in mouse, rat, human, macaque, dog and opossum livers, we found that Pol III binding to individual tRNA genes varies substantially in strength and location. However, when we took into account tRNA redundancies by grouping Pol III occupancy into 46 anticodon isoacceptor families or 21 amino acid-based isotype classes, we discovered strong conservation. Similarly, Pol III occupancy of amino acid isotypes is almost invariant among transcriptionally and evolutionarily diverse tissues in mouse. Thus, synthesis of functional tRNA isotypes has been highly constrained, although the usage of individual tRNA genes has evolved rapidly.

    Funded by: Cancer Research UK: A10185, A15603; European Research Council: 202218

    Nature genetics 2011;43;10;948-55

  • A pipeline for RNA-seq data processing and quality assessment.

    Goncalves A, Tikhonov A, Brazma A and Kapushesky M

    EMBL Outstation-Hinxton, European Bioinformatics Institute, Cambridge, UK.

    Summary: We present an R based pipeline, ArrayExpressHTS, for pre-processing, expression estimation and data quality assessment of high-throughput sequencing transcriptional profiling (RNA-seq) datasets. The pipeline starts from raw sequence files and produces standard Bioconductor R objects containing gene or transcript measurements for downstream analysis along with web reports for data quality assessment. It may be run locally on a user's own computer or remotely on a distributed R-cloud farm at the European Bioinformatics Institute. It can be used to analyse user's own datasets or public RNA-seq datasets from the ArrayExpress Archive.

    Availability: The R package is available at with online documentation at, also available as supplementary material.

    Bioinformatics (Oxford, England) 2011;27;6;867-9

  • Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads.

    Turro E, Su SY, Gonçalves Â, Coin LJ, Richardson S and Lewin A

    Department of Epidemiology and Biostatistics, Imperial College London, Norfolk Place, London, W2 1PG, UK.

    We present a novel pipeline and methodology for simultaneously estimating isoform expression and allelic imbalance in diploid organisms using RNA-seq data. We achieve this by modeling the expression of haplotype-specific isoforms. If unknown, the two parental isoform sequences can be individually reconstructed. A new statistical method, MMSEQ, deconvolves the mapping of reads to multiple transcripts (isoforms or haplotype-specific isoforms). Our software can take into account non-uniform read generation and works with paired-end reads.

    Funded by: Biotechnology and Biological Sciences Research Council: BBG0003521; Medical Research Council: G0600609

    Genome biology 2011;12;2;R13

Andrew Knights

- Senior Research Assistant

I graduated with a BSc (Hons) in Biochemistry and Microbiology from the University of Sheffield in 1998. I then joined the Sanger Institute Library Construction Group, working on the Human Genome Project. In 2004, I left the Sanger Institute for the Babraham Institute, Cambridge, to carry out a PhD investigating vertebrate and invertebrate G protein-coupled receptors (GPCRs). Following a short post-doctoral appointment within the GPCR field, I returned to the Library Construction Group at the Sanger Institute in early 2010 as a Staff Scientist, with core duties focusing on the generation and optimisation of various transcriptome libraries for the Illumina platform.


In late 2011, I joined Daniel Gaffney’s group. Using my molecular biology background, my objective is to set up the wet laboratory side of the project, introducing and optimising assays such as FAIRE-seq, ChIP-seq and DNaseI-seq. In combination with the computational side of the group, these assays are being used to study gene regulation in human populations, currently focusing on variation in iPS cells obtained from separate individuals, as well as from different tissues within individuals.

Natsuhiko Kumasaka

- Postdoctoral Fellow

I received my doctoral degree from the Graduate School of Science and Technology at Keio University, where my research focused on combining fields such as statistics, data visualization, computer science and graphic design, as a means for understanding phenomena hidden behind the data. I developed a new data visualization technique called Textile Plot with Professor Ritei Shibata. After completing my thesis, I spent time developing tools for calling copy number polymorphisms and for population structure analysis of SNP genotype data as a postdoc under Dr Naoyuki Kamatani at CGM, RIKEN. I was also involved in several genome-wide association studies at RIKEN.


I'm currently a Postdoctoral Fellow working on a project investigating transcriptional and epigenetic variation in human induced pluripotent stem cells (hiPSCs). My role as a statistician is to develop a novel statistical model based on negative-binomial regression to detect differentially expressed genes among hiPSCs derived from different tissue types, while correcting for known biological and technical biases in the RNA-seq data. I'm now extending the model within the generalised linear mixed model framework to take account of complex sample correlation structures.
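For readers unfamiliar with this class of model, a generic negative-binomial regression for RNA-seq counts can be written as below. This is a textbook formulation given only as orientation, not necessarily the model developed in the group, and all symbols are illustrative assumptions: y_{gi} is the fragment count for gene g in sample i, s_i a library-size factor, x_i a vector of covariates (for example tissue of origin and technical factors), beta_g the gene-specific effects and phi_g a gene-specific dispersion.

```latex
y_{gi} \sim \mathrm{NB}(\mu_{gi}, \phi_g), \qquad
\log \mu_{gi} = \log s_i + \mathbf{x}_i^{\top} \boldsymbol{\beta}_g, \qquad
\mathrm{Var}(y_{gi}) = \mu_{gi} + \phi_g \, \mu_{gi}^{2}
```

In a formulation like this, testing for differential expression amounts to testing whether the relevant component of beta_g is zero. A mixed-model extension would add a random-effect term to the linear predictor to capture correlation between samples.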


Here you can find data supporting our publications.

Genetic background drives transcriptional variation in human induced pluripotent stem cells. Rouhani F*, Kumasaka N*, de Brito MC, Bradley A, Vallier L and Gaffney D. PLoS genetics 2014;10;6;e1004432

  • Raw read count table: a table of fragment counts in each gene across all 47 samples in our data set.
  • Log2 FPKM: a table of log2 FPKMs (fragments per kilobase per million reads sequenced).
  • Gene annotation: the gene annotation we used (a slightly modified version of Ensembl release 69).
  • Differential expression: the results of our differential expression analysis.
  • Differential expression README: a README with details of the columns in the differential expression analysis table.
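For orientation, the relationship between the raw count table and the log2 FPKM table can be sketched as follows. This is a minimal example under stated assumptions rather than the exact procedure used for the published tables: the function name `log2_fpkm` is hypothetical, and the pseudocount added before taking the logarithm (to avoid log2 of zero) is an assumption that may differ from the processing actually applied.

```python
from math import log2


def log2_fpkm(fragments, gene_length_bp, total_fragments, pseudocount=1.0):
    """Convert a raw fragment count for one gene into log2 FPKM.

    FPKM = fragments per kilobase of gene length per million fragments
    sequenced. The pseudocount is an illustrative assumption; the published
    table may have been computed differently.
    """
    if gene_length_bp <= 0 or total_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    fpkm = fragments * 1e9 / (gene_length_bp * total_fragments)
    return log2(fpkm + pseudocount)


# Made-up example: 3 fragments on a 1 kb gene in a library of 1 million fragments
value = log2_fpkm(3, 1000, 1_000_000)
print(value)
```

Normalising by gene length and library size makes values comparable across genes and samples, which raw counts are not.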

Group leader

Dr Daniel Gaffney
Daniel's profile