The Genome Informatics group, led by Dr Richard Durbin, works on several types of sequence and variation informatics, most of which involve evolutionary analysis in one form or another.
Apart from human genome resequencing, projects that Richard is connected to include:
- the SGRP yeast sequence variation and population genomics project;
- the TreeFam database of animal gene families;
- the Ensembl resource for vertebrate genome annotation;
- the WormBase model organism database for C. elegans;
- the MitoCheck study of mitosis regulation in human cells;
- the Pfam database of protein domain families; and
- the ACEDB genome database.
- 1000 Genomes Project, a deep catalogue of human genetic variation.
- SGRP, Saccharomyces Genome Resequencing Project.
- WormBase is the repository of mapping, sequencing and phenotypic information for C. elegans and several related nematodes. It also contains large amounts of data from manually curated papers and genome wide studies.
- TreeFam, tree families database.
- Margarita, inferring genealogies from population genotype data and using these to map disease loci.
- MAQ, software for mapping short sequencing reads.
The Sequence Alignment/Map format and SAMtools.
Bioinformatics (Oxford, England) 2009;25;16;2078-9
The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.
Genome research 2009;19;7;1316-23
Population genomics of domestic and wild yeasts.
Inferring selection on amino acid preference in protein domains.
Molecular biology and evolution 2009;26;3;527-36
Accurate whole human genome sequencing using reversible terminator chemistry.
Mapping short DNA sequencing reads and calling variants using mapping quality scores.
Genome research 2008;18;11;1851-8
Mapping trait loci by use of inferred ancestral recombination graphs.
American journal of human genetics 2006;79;5;910-22
TreeFam: a curated database of phylogenetic trees of animal gene families.
Nucleic acids research 2006;34;Database issue;D572-80
I am currently a PhD student in quantitative genetics. Previously I worked at the Cancer Research UK Cambridge Research Institute for two years as a bioinformatician on breast cancer projects. I graduated from Wuhan University with a BSc in Biology, followed by an MSc in Bioinformatics from the University of Edinburgh.
I am interested in understanding how genetic variants drive observed cellular phenotypes - such as gene expression and transcription factor binding. My work focuses on developing computational methods to extract signals from large and complex data sets.
The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups.
Department of Oncology, University of Cambridge, Hills Road, Cambridge CB2 2XZ, UK.
The elucidation of breast cancer subgroups and their molecular drivers requires integrated views of the genome and transcriptome from representative numbers of patients. We present an integrated analysis of copy number and gene expression in a discovery and validation set of 997 and 995 primary breast tumours, respectively, with long-term clinical follow-up. Inherited variants (copy number variants and single nucleotide polymorphisms) and acquired somatic copy number aberrations (CNAs) were associated with expression in ~40% of genes, with the landscape dominated by cis- and trans-acting CNAs. By delineating expression outlier genes driven in cis by CNAs, we identified putative cancer genes, including deletions in PPP2R2A, MTAP and MAP2K4. Unsupervised analysis of paired DNA–RNA profiles revealed novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort. These include a high-risk, oestrogen-receptor-positive 11q13/14 cis-acting subgroup and a favourable prognosis subgroup devoid of CNAs. Trans-acting aberration hotspots were found to modulate subgroup-specific gene networks, including a TCR deletion-mediated adaptive immune response in the ‘CNA-devoid’ subgroup and a basal-specific chromosome 5 deletion-associated mitotic network. Our results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic CNAs on the transcriptome.
Funded by: Cancer Research UK: A7199; NHGRI NIH HHS: P50 HG002790, P50HG02790
Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer.
Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK.
The Tasmanian devil (Sarcophilus harrisii), the largest marsupial carnivore, is endangered due to a transmissible facial cancer spread by direct transfer of living cancer cells through biting. Here we describe the sequencing, assembly, and annotation of the Tasmanian devil genome and whole-genome sequences for two geographically distant subclones of the cancer. Genomic analysis suggests that the cancer first arose from a female Tasmanian devil and that the clone has subsequently genetically diverged during its spread across Tasmania. The devil cancer genome contains more than 17,000 somatic base substitution mutations and bears the imprint of a distinct mutational process. Genotyping of somatic mutations in 104 geographically and temporally distributed Tasmanian devil tumors reveals the pattern of evolution and spread of this parasitic clonal lineage, with evidence of a selective sweep in one geographical area and persistence of parallel lineages in other populations.
Funded by: Wellcome Trust: 077012/Z/05/Z, 088340, 095908
Studied Computer Science at the University of Helsinki, specializing in Computational Biology, specifically eukaryotic and mammalian gene transcription regulation.
Obtained a PhD in Computer Science at the University of Helsinki in 2007 under the supervision of Prof. Esko Ukkonen.
Involved with the WTCCC+ resequencing project at the Sanger Institute, 2008-09.
Joined the RD research group in 2009.
I am currently studying the genetics of isolated human populations. Specifically, I am involved in low-coverage whole-genome sequencing of population samples from the Orkney Islands, UK, and from Kuusamo, Finland. Our aim is to characterize essentially all genetic variation in these populations by sequencing a sample large enough to share a recent common ancestor with every individual in the isolate. To support this goal, I have developed methods and algorithms for analyzing genome-wide genotype data from isolated populations.
Identity-by-descent-based phasing and imputation in founder populations using graphical models.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.
Accurate knowledge of haplotypes, the combination of alleles co-residing on a single copy of a chromosome, enables powerful gene mapping and sequence imputation methods. Since humans are diploid, haplotypes must be derived from genotypes by a phasing process. In this study, we present a new computational model for haplotype phasing based on pairwise sharing of haplotypes inferred to be Identical-By-Descent (IBD). We apply the Bayesian network based model in a new phasing algorithm, called systematic long-range phasing (SLRP), that can capitalize on the close genetic relationships in isolated founder populations, and show with simulated and real genome-wide genotype data that SLRP substantially reduces the rate of phasing errors compared to previous phasing algorithms. Furthermore, the method accurately identifies regions of IBD, enabling linkage-like studies without pedigrees, and can be used to impute most genotypes with very low error rate.
Funded by: Chief Scientist Office: CZB/4/710; Medical Research Council: MC_U127561128; Wellcome Trust: 076113, 077192, 085475
Genetic epidemiology 2011;35;8;853-60
Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls.
Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.
Funded by: Arthritis Research UK: 17552, 18475; British Heart Foundation: RG/08/014/24067, RG/09/012/28096; Chief Scientist Office: CZB/4/540; Medical Research Council: G0000934, G0400874, G0500115, G0501942, G0600329, G0600705, G0700491, G0701003, G0701420, G0701810, G0701810(85517), G0800383, G0800509, G0800675, G0800759, G19/9, G90/106, G9521010, MC_UP_A390_1107; Wellcome Trust: 061858, 083948, 089989, 090532
The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling.
Department of Medical Genetics, Genome-Scale Biology Research Program, Biomedicum Helsinki, University of Helsinki, Helsinki, Finland.
Homozygosity for the G allele of rs6983267 at 8q24 increases colorectal cancer (CRC) risk approximately 1.5 fold. We report here that the risk allele G shows copy number increase during CRC development. Our computer algorithm, Enhancer Element Locator (EEL), identified an enhancer element that contains rs6983267. The element drove expression of a reporter gene in a pattern that is consistent with regulation by the key CRC pathway Wnt. rs6983267 affects a binding site for the Wnt-regulated transcription factor TCF4, with the risk allele G showing stronger binding in vitro and in vivo. Genome-wide ChIP assay revealed the element as the strongest TCF4 binding site within 1 Mb of MYC. An unambiguous correlation between rs6983267 genotype and MYC expression was not detected, and additional work is required to scrutinize all possible targets of the enhancer. Our work provides evidence that the common CRC predisposition associated with 8q24 arises from enhanced responsiveness to Wnt signaling.
Nature genetics 2009;41;8;885-90
Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity.
Molecular and Cancer Biology Program, Biomedicum Helsinki, University of Helsinki, Finland.
Understanding the regulation of human gene expression requires knowledge of the "second genetic code," which consists of the binding specificities of transcription factors (TFs) and the combinatorial code by which TF binding sites are assembled to form tissue-specific enhancer elements. Using a novel high-throughput method, we determined the DNA binding specificities of GLIs 1-3, Tcf4, and c-Ets1, which mediate transcriptional responses to the Hedgehog (Hh), Wnt, and Ras/MAPK signaling pathways. To identify mammalian enhancer elements regulated by these pathways on a genomic scale, we developed a computational tool, enhancer element locator (EEL). We show that EEL can be used to identify Hh and Wnt target genes and to predict activated TFs based on changes in gene expression. Predictions validated in transgenic mouse embryos revealed the presence of multiple tissue-specific enhancers in mouse c-Myc and N-Myc genes, which has implications for organ-specific growth control and tumor-type specificity of oncogenes.
Locating potential enhancer elements by comparative genomics using the EEL software.
Department of Computer Science, P.O. Box 68 (Gustaf Hällströmin katu 2b) FIN-00014, University of Helsinki, Finland. Kimmo.Palin@helsinki.fi
This protocol describes the use of Enhancer Element Locator (EEL), a computer program that was designed to locate distal enhancer elements in long mammalian sequences. EEL will predict the location and structure of conserved enhancers after being provided with two orthologous DNA sequences and binding specificity matrices for the transcription factors (TFs) that are expected to contribute to the function of the enhancers to be identified. The freely available EEL software can analyze two 1-Mb sequences with 100 TF motifs in about 15 min on a modern Windows, Linux or Mac computer. The output provides several hypotheses about enhancer location and structure for further evaluation by an expert on enhancer function.
Nature protocols 2006;1;1;368-74
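The protocol above relies on binding specificity matrices for transcription factors. As a rough illustration of what scoring a sequence against such a matrix involves, here is a minimal sketch. It is not the EEL algorithm, which also aligns two orthologous sequences and models enhancer structure; the function names, the uniform background, the pseudocount and the toy motif are all assumptions made for this example.

```python
import math

def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2-odds scores (uniform background assumed)."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        matrix.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix

def scan(sequence, matrix):
    """Return (offset, score) for every window of the sequence that the matrix fits."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

# Hypothetical 4-column motif that strongly prefers "ACGT" (toy data, not a real TF).
toy_counts = [
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 8, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 8, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
]
pwm = log_odds_matrix(toy_counts)
best_offset, best_score = max(scan("TTACGTGG", pwm), key=lambda hit: hit[1])
```

In this toy case the best-scoring window is the one that spells out the motif exactly. A real analysis would use measured matrices and a background model fitted to the genome in question.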
From gene networks to gene function.
European Bioinformatics Institute, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
We propose a novel method to identify functionally related genes based on comparisons of neighborhoods in gene networks. This method does not rely on gene sequence or protein structure homologies, and it can be applied to any organism and a wide variety of experimental data sets. The character of the predicted gene relationships depends on the underlying networks; they concern biological processes rather than the molecular function. We used the method to analyze gene networks derived from genome-wide chromatin immunoprecipitation experiments, a large-scale gene deletion study, and from the genomic positions of consensus binding sites for transcription factors of the yeast Saccharomyces cerevisiae. We identified 816 functional relationships between 159 genes and show that these relationships correspond to protein-protein interactions, co-occurrence in the same protein complexes, and/or co-occurrence in abstracts of scientific articles. Our results suggest functions for seven previously uncharacterized yeast genes: KIN3 and YMR269W may be involved in biological processes related to cell growth and/or maintenance, whereas IES6, YEL008W, YEL033W, YHL029C, YMR010W, and YMR031W-A are likely to have metabolic functions.
Genome research 2003;13;12;2568-76
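To make the idea of comparing neighborhoods in a gene network concrete, the sketch below pairs up genes whose sets of direct network neighbours overlap strongly. The Jaccard index, the 0.5 threshold and the toy edge list are assumptions chosen for illustration; the paper's own similarity measure and significance testing are not reproduced here.

```python
from itertools import combinations

def neighbourhoods(edges):
    """Map each gene to the set of genes it is directly linked to."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours

def jaccard(first, second):
    """Shared members divided by all members; 0.0 when both sets are empty."""
    union = first | second
    return len(first & second) / len(union) if union else 0.0

def similar_pairs(edges, threshold=0.5):
    """List gene pairs whose neighbourhoods overlap at or above the threshold."""
    neighbours = neighbourhoods(edges)
    pairs = []
    for g1, g2 in combinations(sorted(neighbours), 2):
        # Ignore a direct link between the two genes being compared.
        score = jaccard(neighbours[g1] - {g2}, neighbours[g2] - {g1})
        if score >= threshold:
            pairs.append((g1, g2, score))
    return pairs

# Hypothetical network: two regulators share the same two targets.
toy_edges = [
    ("TF1", "geneA"), ("TF1", "geneB"),
    ("TF2", "geneA"), ("TF2", "geneB"),
    ("TF3", "geneC"),
]
pairs = similar_pairs(toy_edges)
```

Here TF1 and TF2 are paired because they share both targets, and geneA and geneB are paired because they share both regulators, without any use of sequence or structure similarity.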
Correlating gene promoters and expression in gene disruption experiments.
Department of Computer Science, University of Helsinki, Finland.
Motivation: Finding putative transcription factor binding sites in the upstream sequences of similarly expressed genes has recently become a subject of intensive studies. In this paper we investigate how much gene expression regulation can be attributed to the presence of various binding sites in the gene promoters by correlating the binding sites and the changes in gene expression resulting from gene disruptions (e.g. knockouts).
Results: We have developed a data analysis method for comparing mRNA measurements of gene disruption experiments with information about gene promoters. The method was applied to a well-known dataset to uncover correlations between known transcription factor binding site motifs in the upstream regions of all S. cerevisiae genes and the gene expression changes in various gene disruption experiments. The possible explanations of the correlations were categorized and analyzed using e.g. expression cascades. Several correlations turned out to be consistent with existing biological knowledge while some new ones suggest themselves for further study.
Availability: The resulting tables are available at http://www.cs.helsinki.fi/u/kpalin/CorrDisrupt/.
Bioinformatics (Oxford, England) 2002;18 Suppl 2;S172-80
I am a researcher in computational genomics and population genetics, with particular focus on human and primate evolution. Prior to working in this field my training was in theoretical physics at Trinity College, Dublin, followed by a Ph.D. in astrophysics at the University of Cambridge. I have been at the Sanger Institute since 2007.
My research at the Sanger Institute has primarily been devoted to the Gorilla Genome Project, an international collaboration to assemble and analyse a whole genome sequence for gorilla. As part of this and other projects, I work on various aspects of high-throughput sequencing informatics including assembly, alignment and the detection and analysis of genomic variation.
Insights into hominid evolution from the gorilla genome sequence.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.
Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
Funded by: Biotechnology and Biological Sciences Research Council; Cancer Research UK: 15603, A15603; European Research Council: 202218; Howard Hughes Medical Institute; Intramural NIH HHS; Medical Research Council: G0501331, G0701805; NHGRI NIH HHS: HG002385, R01 HG002385, U54 HG003079; Wellcome Trust: 062023, 075491/Z/04, 077009, 077192, 077198, 089066, 090532, 095908
Mapping copy number variation by population-scale genome sequencing.
Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA.
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.
Funded by: CCR NIH HHS: RC2 HG005552-01; Medical Research Council: G0701805, G1000758; NHGRI NIH HHS: P41 HG004221, P41 HG004221-01, P41 HG004221-02, P41 HG004221-03, P41 HG004221-03S1, P41 HG004221-03S2, P41 HG004221-03S3, R01 HG004719, R01 HG004719-01, R01 HG004719-02, R01 HG004719-02S1, R01 HG004719-03, R01 HG004719-04, RC2 HG005552, RC2 HG005552-02, U01 HG005209, U01 HG005209-01, U01 HG005209-02, U54 HG003067, U54 HG003273; NIAAA NIH HHS: R21 AA022707; NIGMS NIH HHS: R01 GM059290, R01 GM081533, R01 GM081533-01A1, R01 GM081533-02, R01 GM081533-03, R01 GM081533-04; NIMH NIH HHS: R01 MH091350; Wellcome Trust: 062023, 077009, 077014, 077192, 085532
A large genome center's improvements to the Illumina sequencing system.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.
The Wellcome Trust Sanger Institute is one of the world's largest genome centers, and a substantial amount of our sequencing is performed with 'next-generation' massively parallel sequencing technologies: in June 2008 the quantity of purity-filtered sequence data generated by our Genome Analyzer (Illumina) platforms reached 1 terabase, and our average weekly Illumina production output is currently 64 gigabases. Here we describe a set of improvements we have made to the standard Illumina protocols to make the library preparation more reliable in a high-throughput environment, to reduce bias, tighten insert size distribution and reliably obtain high yields of data.
Funded by: Medical Research Council: G0701805; Wellcome Trust: 079643
Nature methods 2008;5;12;1005-10
Accurate whole human genome sequencing using reversible terminator chemistry.
Illumina Cambridge Ltd. (Formerly Solexa Ltd), Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex CB10 1XL, UK.
DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.
Funded by: Biotechnology and Biological Sciences Research Council: B05823, MOL04534; Intramural NIH HHS: Z01 HG200330-03; Medical Research Council: G0701805; Wellcome Trust
Postdoctoral Fellow
I studied Physics at the University of Cologne in Germany, and finished my PhD in December 2011. During my PhD I mainly worked in the field of population genetics, especially on problems related to genetic linkage in asexual populations. I also worked on population genomic models for adaptation in fruit flies.
Here at Sanger I developed a method to analyze human population histories from genomic data. I used whole-genome data from 9 geographically diverse populations to infer past population sizes and separation times, for example to better understand the spread of human agriculture or the relationship between modern humans and Neanderthals.
I am also carrying out a study of ancient DNA, sampled from Iron Age and Anglo-Saxon skeletons found on the grounds of our genome campus here in Hinxton.
Inferring human population size and separation history from multiple genome sequences.
Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK.
The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000-30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.
Funded by: Wellcome Trust: 098051
Nature genetics 2014;46;8;919-25
Quantifying selection acting on a complex trait using allele frequency time series data.
Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.
When selection is acting on a large genetically diverse population, beneficial alleles increase in frequency. This fact can be used to map quantitative trait loci by sequencing the pooled DNA from the population at consecutive time points and observing allele frequency changes. Here, we present a population genetic method to analyze time series data of allele frequencies from such an experiment. Beginning with a range of proposed evolutionary scenarios, the method measures the consistency of each with the observed frequency changes. Evolutionary theory is utilized to formulate equations of motion for the allele frequencies, following which likelihoods for having observed the sequencing data under each scenario are derived. Comparison of these likelihoods gives an insight into the prevailing dynamics of the system under study. We illustrate the method by quantifying selective effects from an experiment, in which two phenotypically different yeast strains were first crossed and then propagated under heat stress (Parts L, Cubillos FA, Warringer J, et al. [14 co-authors]. 2011. Revealing the genetic structure of a trait by sequencing a population under selection. Genome Res). From these data, we discover that about 6% of polymorphic sites evolve nonneutrally under heat stress conditions, either because of their linkage to beneficial (driver) alleles or because they are drivers themselves. We further identify 44 genomic regions containing one or more candidate driver alleles, quantify their apparent selective advantage, obtain estimates of recombination rates within the regions, and show that the dynamics of the drivers display a strong signature of selection going beyond additive models. Our approach is applicable to study adaptation in a range of systems under different evolutionary pressures.
Funded by: Wellcome Trust: 098051, WT077192/Z/05/Z
Molecular biology and evolution 2012;29;4;1187-97
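The abstract above describes deriving likelihoods for pooled sequencing data under competing evolutionary scenarios. A much simplified sketch of that comparison follows. It assumes a single unlinked allele that follows a deterministic logistic trajectory and binomial sampling of reads. The published method treats linkage, recombination and non-additive effects, none of which appear here, and the function names and read counts are hypothetical.

```python
import math

def expected_frequency(x0, s, t):
    """Deterministic logistic trajectory for an allele with selection coefficient s."""
    growth = math.exp(s * t)
    return x0 * growth / (1.0 - x0 + x0 * growth)

def binomial_log_likelihood(alt_reads, depth, frequency):
    """Log-probability of alt_reads out of depth reads at the given allele frequency."""
    log_choose = (math.lgamma(depth + 1) - math.lgamma(alt_reads + 1)
                  - math.lgamma(depth - alt_reads + 1))
    return (log_choose + alt_reads * math.log(frequency)
            + (depth - alt_reads) * math.log(1.0 - frequency))

def scenario_log_likelihood(observations, x0, s):
    """Sum the per-time-point log-likelihoods for one (x0, s) scenario."""
    return sum(
        binomial_log_likelihood(alt, depth, expected_frequency(x0, s, t))
        for t, alt, depth in observations
    )

# Hypothetical pooled read counts: (generation, reads carrying the allele, total reads).
toy_observations = [(0, 50, 100), (10, 62, 100), (20, 73, 100), (30, 82, 100)]
neutral = scenario_log_likelihood(toy_observations, x0=0.5, s=0.0)
selected = scenario_log_likelihood(toy_observations, x0=0.5, s=0.05)
```

With these toy counts the selected scenario has the higher log-likelihood, which is the kind of comparison the abstract refers to when it speaks of measuring the consistency of each scenario with the observed frequency changes.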
Emergent neutrality in adaptive asexual evolution.
Institut für Theoretische Physik, Universität zu Köln, 50937 Köln, Germany.
In nonrecombining genomes, genetic linkage can be an important evolutionary force. Linkage generates interference interactions, by which simultaneously occurring mutations affect each other's chance of fixation. Here, we develop a comprehensive model of adaptive evolution in linked genomes, which integrates interference interactions between multiple beneficial and deleterious mutations into a unified framework. By an approximate analytical solution, we predict the fixation rates of these mutations, as well as the probabilities of beneficial and deleterious alleles at fixed genomic sites. We find that interference interactions generate a regime of emergent neutrality: all genomic sites with selection coefficients smaller in magnitude than a characteristic threshold have nearly random fixed alleles, and both beneficial and deleterious mutations at these sites have nearly neutral fixation rates. We show that this dynamic limits not only the speed of adaptation, but also a population's degree of adaptation in its current environment. We apply the model to different scenarios: stationary adaptation in a time-dependent environment and approach to equilibrium in a fixed environment. In both cases, the analytical predictions are in good agreement with numerical simulations. Our results suggest that interference can severely compromise biological functions in an adapting population, which sets viability limits on adaptive evolution under linkage.
Funded by: Wellcome Trust: 091747