Genome informatics

The Genome informatics group, under Dr Richard Durbin, works on various types of sequence and variation informatics, mostly in one way or another involving evolutionary analysis.


Apart from human genome resequencing, projects that Richard is connected to include:

  • the SGRP yeast sequence variation and population genomics project;
  • the TreeFam database of animal gene families;
  • the Ensembl resource for vertebrate genome annotation;
  • the WormBase model organism database for C. elegans;
  • the MitoCheck study of mitosis regulation in human cells;
  • the Pfam database of protein domain families; and
  • the ACEDB genome database.
  • 1000 Genomes Project, a deep catalogue of human genetic variation.
  • SGRP, Saccharomyces Genome Resequencing Project.
  • WormBase is the repository of mapping, sequencing and phenotypic information for C. elegans and several related nematodes. It also contains large amounts of data from manually curated papers and genome wide studies.
  • TreeFam, tree families database.
  • Margarita, inferring genealogies from population genotype data and using these to map disease loci.
  • MAQ, software for mapping short sequencing reads.

References

  • The Sequence Alignment/Map format and SAMtools.

    Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R and 1000 Genome Project Data Processing Subgroup

    Bioinformatics (Oxford, England) 2009;25;16;2078-9

  • The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.

    Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R and Lipman D

    Genome research 2009;19;7;1316-23

  • Population genomics of domestic and wild yeasts.

    Liti G, Carter DM, Moses AM, Warringer J, Parts L, James SA, Davey RP, Roberts IN, Burt A, Koufopanou V, Tsai IJ, Bergman CM, Bensasson D, O'Kelly MJ, van Oudenaarden A, Barton DB, Bailes E, Nguyen AN, Jones M, Quail MA, Goodhead I, Sims S, Smith F, Blomberg A, Durbin R and Louis EJ

    Nature 2009;458;7236;337-41

  • Inferring selection on amino acid preference in protein domains.

    Moses AM and Durbin R

    Molecular biology and evolution 2009;26;3;527-36

  • Accurate whole human genome sequencing using reversible terminator chemistry.

    Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara E Catenazzi M, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R and Smith AJ

    Nature 2008;456;7218;53-9

  • Mapping short DNA sequencing reads and calling variants using mapping quality scores.

    Li H, Ruan J and Durbin R

    Genome research 2008;18;11;1851-8

  • Mapping trait loci by use of inferred ancestral recombination graphs.

    Minichiello MJ and Durbin R

    American journal of human genetics 2006;79;5;910-22

  • TreeFam: a curated database of phylogenetic trees of animal gene families.

    Li H, Coghlan A, Ruan J, Coin LJ, Hériché JK, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L, Wong GK, Zheng W, Dehal P, Wang J and Durbin R

    Nucleic acids research 2006;34;Database issue;D572-80

Team

Team members

Zhihao Ding

- PhD Student

I am currently a PhD student in quantitative genetics. Previously I worked at the Cancer Research UK Cambridge Research Institute for two years as a bioinformatician on breast cancer projects. I graduated from Wuhan University with a BSc in Biology, followed by an MSc in Bioinformatics from The University of Edinburgh.

Research

I am interested in understanding how genetic variants drive observed cellular phenotypes, such as gene expression and transcription factor binding. My work focuses on developing computational methods to extract signals from large and complex data sets.

References

  • The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups.

    Curtis C, Shah SP, Chin SF, Turashvili G, Rueda OM, Dunning MJ, Speed D, Lynch AG, Samarajiwa S, Yuan Y, Gräf S, Ha G, Haffari G, Bashashati A, Russell R, McKinney S, METABRIC Group, Langerød A, Green A, Provenzano E, Wishart G, Pinder S, Watson P, Markowetz F, Murphy L, Ellis I, Purushotham A, Børresen-Dale AL, Brenton JD, Tavaré S, Caldas C and Aparicio S

    Department of Oncology, University of Cambridge, Hills Road, Cambridge CB2 2XZ, UK.

    The elucidation of breast cancer subgroups and their molecular drivers requires integrated views of the genome and transcriptome from representative numbers of patients. We present an integrated analysis of copy number and gene expression in a discovery and validation set of 997 and 995 primary breast tumours, respectively, with long-term clinical follow-up. Inherited variants (copy number variants and single nucleotide polymorphisms) and acquired somatic copy number aberrations (CNAs) were associated with expression in ~40% of genes, with the landscape dominated by cis- and trans-acting CNAs. By delineating expression outlier genes driven in cis by CNAs, we identified putative cancer genes, including deletions in PPP2R2A, MTAP and MAP2K4. Unsupervised analysis of paired DNA–RNA profiles revealed novel subgroups with distinct clinical outcomes, which reproduced in the validation cohort. These include a high-risk, oestrogen-receptor-positive 11q13/14 cis-acting subgroup and a favourable prognosis subgroup devoid of CNAs. Trans-acting aberration hotspots were found to modulate subgroup-specific gene networks, including a TCR deletion-mediated adaptive immune response in the ‘CNA-devoid’ subgroup and a basal-specific chromosome 5 deletion-associated mitotic network. Our results provide a novel molecular stratification of the breast cancer population, derived from the impact of somatic CNAs on the transcriptome.

    Funded by: Cancer Research UK: A7199; NHGRI NIH HHS: P50HG02790

    Nature 2012;486;7403;346-52

  • Genome sequencing and analysis of the Tasmanian devil and its transmissible cancer.

    Murchison EP, Schulz-Trieglaff OB, Ning Z, Alexandrov LB, Bauer MJ, Fu B, Hims M, Ding Z, Ivakhno S, Stewart C, Ng BL, Wong W, Aken B, White S, Alsop A, Becq J, Bignell GR, Cheetham RK, Cheng W, Connor TR, Cox AJ, Feng ZP, Gu Y, Grocock RJ, Harris SR, Khrebtukova I, Kingsbury Z, Kowarsky M, Kreiss A, Luo S, Marshall J, McBride DJ, Murray L, Pearse AM, Raine K, Rasolonjatovo I, Shaw R, Tedder P, Tregidgo C, Vilella AJ, Wedge DC, Woods GM, Gormley N, Humphray S, Schroth G, Smith G, Hall K, Searle SM, Carter NP, Papenfuss AT, Futreal PA, Campbell PJ, Yang F, Bentley DR, Evers DJ and Stratton MR

    Wellcome Trust Sanger Institute, Hinxton, CB10 1SA, UK. elizabeth.murchison@sanger.ac.uk

    The Tasmanian devil (Sarcophilus harrisii), the largest marsupial carnivore, is endangered due to a transmissible facial cancer spread by direct transfer of living cancer cells through biting. Here we describe the sequencing, assembly, and annotation of the Tasmanian devil genome and whole-genome sequences for two geographically distant subclones of the cancer. Genomic analysis suggests that the cancer first arose from a female Tasmanian devil and that the clone has subsequently genetically diverged during its spread across Tasmania. The devil cancer genome contains more than 17,000 somatic base substitution mutations and bears the imprint of a distinct mutational process. Genotyping of somatic mutations in 104 geographically and temporally distributed Tasmanian devil tumors reveals the pattern of evolution and spread of this parasitic clonal lineage, with evidence of a selective sweep in one geographical area and persistence of parallel lineages in other populations.

    Funded by: Wellcome Trust: 077012/Z/05/Z, 088340, 095908

    Cell 2012;148;4;780-91

Kimmo Palin


Studied Computer Science at the University of Helsinki, specializing in computational biology, specifically eukaryotic and mammalian gene transcription regulation.

Obtained a PhD in Computer Science at the University of Helsinki in 2007 under the supervision of Prof. Esko Ukkonen.

Involved with the WTCCC+ resequencing project at the Sanger Institute, 2008-09.

Joined Richard Durbin's research group in 2009.

Publications: http://scholar.google.com/citations?user=NxaW34kAAAAJ

Research

I am currently studying the genetics of isolated human populations. Specifically, I am involved in low-coverage whole genome sequencing of population samples from the Orkney Islands, UK, and from Kuusamo, Finland. Our aim is to characterize essentially all genetic variation in these populations by sequencing a sample large enough to share a recent common ancestor with every individual in the isolate. To support this goal, I have developed methods and algorithms for analyzing genome-wide genotype data from isolated populations.

References

  • Identity-by-descent-based phasing and imputation in founder populations using graphical models.

    Palin K, Campbell H, Wright AF, Wilson JF and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom.

    Accurate knowledge of haplotypes, the combination of alleles co-residing on a single copy of a chromosome, enables powerful gene mapping and sequence imputation methods. Since humans are diploid, haplotypes must be derived from genotypes by a phasing process. In this study, we present a new computational model for haplotype phasing based on pairwise sharing of haplotypes inferred to be Identical-By-Descent (IBD). We apply the Bayesian network based model in a new phasing algorithm, called systematic long-range phasing (SLRP), that can capitalize on the close genetic relationships in isolated founder populations, and show with simulated and real genome-wide genotype data that SLRP substantially reduces the rate of phasing errors compared to previous phasing algorithms. Furthermore, the method accurately identifies regions of IBD, enabling linkage-like studies without pedigrees, and can be used to impute most genotypes with very low error rate.

    Funded by: Chief Scientist Office: CZB/4/710; Medical Research Council: MC_U127561128; Wellcome Trust: 076113, 077192, 085475, WT077192

    Genetic epidemiology 2011;35;8;853-60

  • Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls.

    Wellcome Trust Case Control Consortium, Craddock N, Hurles ME, Cardin N, Pearson RD, Plagnol V, Robson S, Vukcevic D, Barnes C, Conrad DF, Giannoulatou E, Holmes C, Marchini JL, Stirrups K, Tobin MD, Wain LV, Yau C, Aerts J, Ahmad T, Andrews TD, Arbury H, Attwood A, Auton A, Ball SG, Balmforth AJ, Barrett JC, Barroso I, Barton A, Bennett AJ, Bhaskar S, Blaszczyk K, Bowes J, Brand OJ, Braund PS, Bredin F, Breen G, Brown MJ, Bruce IN, Bull J, Burren OS, Burton J, Byrnes J, Caesar S, Clee CM, Coffey AJ, Connell JM, Cooper JD, Dominiczak AF, Downes K, Drummond HE, Dudakia D, Dunham A, Ebbs B, Eccles D, Edkins S, Edwards C, Elliot A, Emery P, Evans DM, Evans G, Eyre S, Farmer A, Ferrier IN, Feuk L, Fitzgerald T, Flynn E, Forbes A, Forty L, Franklyn JA, Freathy RM, Gibbs P, Gilbert P, Gokumen O, Gordon-Smith K, Gray E, Green E, Groves CJ, Grozeva D, Gwilliam R, Hall A, Hammond N, Hardy M, Harrison P, Hassanali N, Hebaishi H, Hines S, Hinks A, Hitman GA, Hocking L, Howard E, Howard P, Howson JM, Hughes D, Hunt S, Isaacs JD, Jain M, Jewell DP, Johnson T, Jolley JD, Jones IR, Jones LA, Kirov G, Langford CF, Lango-Allen H, Lathrop GM, Lee J, Lee KL, Lees C, Lewis K, Lindgren CM, Maisuria-Armer M, Maller J, Mansfield J, Martin P, Massey DC, McArdle WL, McGuffin P, McLay KE, Mentzer A, Mimmack ML, Morgan AE, Morris AP, Mowat C, Myers S, Newman W, Nimmo ER, O'Donovan MC, Onipinla A, Onyiah I, Ovington NR, Owen MJ, Palin K, Parnell K, Pernet D, Perry JR, Phillips A, Pinto D, Prescott NJ, Prokopenko I, Quail MA, Rafelt S, Rayner NW, Redon R, Reid DM, Renwick, Ring SM, Robertson N, Russell E, St Clair D, Sambrook JG, Sanderson JD, Schuilenburg H, Scott CE, Scott R, Seal S, Shaw-Hawkins S, Shields BM, Simmonds MJ, Smyth DJ, Somaskantharajah E, Spanova K, Steer S, Stephens J, Stevens HE, Stone MA, Su Z, Symmons DP, Thompson JR, Thomson W, Travers ME, Turnbull C, Valsesia A, Walker M, Walker NM, Wallace C, Warren-Perry M, Watkins NA, Webster J, Weedon MN, Wilson AG, Woodburn M, Wordsworth BP, Young AH, Zeggini E, Carter NP, Frayling TM, Lee C, McVean G, Munroe PB, Palotie A, Sawcer SJ, Scherer SW, Strachan DP, Tyler-Smith C, Brown MA, Burton PR, Caulfield MJ, Compston A, Farrall M, Gough SC, Hall AS, Hattersley AT, Hill AV, Mathew CG, Pembrey M, Satsangi J, Stratton MR, Worthington J, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand W, Parkes M, Rahman N, Todd JA, Samani NJ and Donnelly P

    Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

    Funded by: Arthritis Research UK: 17552; Chief Scientist Office: CZB/4/540, ETM/137, ETM/75; Medical Research Council: G0000934, G0400874, G0500115, G0501942, G0600329, G0600705, G0700491, G0701003, G0701420, G0701810, G0701810(85517), G0800383, G0800759, G19/9, G90/106, G9521010, MC_UP_A390_1107; Wellcome Trust: 061858, 083948, 089989

    Nature 2010;464;7289;713-20

  • The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling.

    Tuupanen S, Turunen M, Lehtonen R, Hallikas O, Vanharanta S, Kivioja T, Björklund M, Wei G, Yan J, Niittymäki I, Mecklin JP, Järvinen H, Ristimäki A, Di-Bernardo M, East P, Carvajal-Carmona L, Houlston RS, Tomlinson I, Palin K, Ukkonen E, Karhu A, Taipale J and Aaltonen LA

    Department of Medical Genetics, Genome-Scale Biology Research Program, Biomedicum Helsinki, University of Helsinki, Helsinki, Finland.

    Homozygosity for the G allele of rs6983267 at 8q24 increases colorectal cancer (CRC) risk approximately 1.5 fold. We report here that the risk allele G shows copy number increase during CRC development. Our computer algorithm, Enhancer Element Locator (EEL), identified an enhancer element that contains rs6983267. The element drove expression of a reporter gene in a pattern that is consistent with regulation by the key CRC pathway Wnt. rs6983267 affects a binding site for the Wnt-regulated transcription factor TCF4, with the risk allele G showing stronger binding in vitro and in vivo. Genome-wide ChIP assay revealed the element as the strongest TCF4 binding site within 1 Mb of MYC. An unambiguous correlation between rs6983267 genotype and MYC expression was not detected, and additional work is required to scrutinize all possible targets of the enhancer. Our work provides evidence that the common CRC predisposition associated with 8q24 arises from enhanced responsiveness to Wnt signaling.

    Nature genetics 2009;41;8;885-90

  • Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity.

    Hallikas O, Palin K, Sinjushina N, Rautiainen R, Partanen J, Ukkonen E and Taipale J

    Molecular and Cancer Biology Program, Biomedicum Helsinki, University of Helsinki, Finland.

    Understanding the regulation of human gene expression requires knowledge of the "second genetic code," which consists of the binding specificities of transcription factors (TFs) and the combinatorial code by which TF binding sites are assembled to form tissue-specific enhancer elements. Using a novel high-throughput method, we determined the DNA binding specificities of GLIs 1-3, Tcf4, and c-Ets1, which mediate transcriptional responses to the Hedgehog (Hh), Wnt, and Ras/MAPK signaling pathways. To identify mammalian enhancer elements regulated by these pathways on a genomic scale, we developed a computational tool, enhancer element locator (EEL). We show that EEL can be used to identify Hh and Wnt target genes and to predict activated TFs based on changes in gene expression. Predictions validated in transgenic mouse embryos revealed the presence of multiple tissue-specific enhancers in mouse c-Myc and N-Myc genes, which has implications for organ-specific growth control and tumor-type specificity of oncogenes.

    Cell 2006;124;1;47-59

  • Locating potential enhancer elements by comparative genomics using the EEL software.

    Palin K, Taipale J and Ukkonen E

    Department of Computer Science, P.O. Box 68 (Gustaf Hällströmin katu 2b) FIN-00014, University of Helsinki, Finland. Kimmo.Palin@helsinki.fi

    This protocol describes the use of Enhancer Element Locator (EEL), a computer program that was designed to locate distal enhancer elements in long mammalian sequences. EEL will predict the location and structure of conserved enhancers after being provided with two orthologous DNA sequences and binding specificity matrices for the transcription factors (TFs) that are expected to contribute to the function of the enhancers to be identified. The freely available EEL software can analyze two 1-Mb sequences with 100 TF motifs in about 15 min on a modern Windows, Linux or Mac computer. The output provides several hypotheses about enhancer location and structure for further evaluation by an expert on enhancer function.

    Nature protocols 2006;1;1;368-74

  • From gene networks to gene function.

    Schlitt T, Palin K, Rung J, Dietmann S, Lappe M, Ukkonen E and Brazma A

    European Bioinformatics Institute, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. schlitt@ebi.ac.uk

    We propose a novel method to identify functionally related genes based on comparisons of neighborhoods in gene networks. This method does not rely on gene sequence or protein structure homologies, and it can be applied to any organism and a wide variety of experimental data sets. The character of the predicted gene relationships depends on the underlying networks; they concern biological processes rather than the molecular function. We used the method to analyze gene networks derived from genome-wide chromatin immunoprecipitation experiments, a large-scale gene deletion study, and from the genomic positions of consensus binding sites for transcription factors of the yeast Saccharomyces cerevisiae. We identified 816 functional relationships between 159 genes and show that these relationships correspond to protein-protein interactions, co-occurrence in the same protein complexes, and/or co-occurrence in abstracts of scientific articles. Our results suggest functions for seven previously uncharacterized yeast genes: KIN3 and YMR269W may be involved in biological processes related to cell growth and/or maintenance, whereas IES6, YEL008W, YEL033W, YHL029C, YMR010W, and YMR031W-A are likely to have metabolic functions.

    Genome research 2003;13;12;2568-76

  • Correlating gene promoters and expression in gene disruption experiments.

    Palin K, Ukkonen E, Brazma A and Vilo J

    Department of Computer Science, University of Helsinki, Finland. kimmo.palin@cs.helsinki.fi

    Motivation: Finding putative transcription factor binding sites in the upstream sequences of similarly expressed genes has recently become a subject of intensive studies. In this paper we investigate how much gene expression regulation can be attributed to the presence of various binding sites in the gene promoters by correlating the binding sites and the changes in gene expression resulting from gene disruptions (e.g. knockouts).

    Results: We have developed a data analysis method for comparing mRNA measurements of gene disruption experiments with information about gene promoters. The method was applied to a well-known dataset to uncover correlations between known transcription factor binding site motifs in the upstream regions of all S. cerevisiae genes and the gene expression changes in various gene disruption experiments. The possible explanations of the correlations were categorized and analyzed using e.g. expression cascades. Several correlations turned out to be consistent with existing biological knowledge while some new ones suggest themselves for further study.

    Availability: The resulting tables are available at http://www.cs.helsinki.fi/u/kpalin/CorrDisrupt/.

    Bioinformatics (Oxford, England) 2002;18 Suppl 2;S172-80
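
The phasing problem mentioned in the first abstract of this list (haplotypes must be derived from unphased diploid genotypes) can be illustrated with a small sketch. This is not the SLRP algorithm and uses no IBD information; it only enumerates, by brute force, every haplotype pair that is compatible with one genotype. The function name and the 0/1/2 allele-count encoding of biallelic sites are assumptions made for this illustration.

```python
from itertools import product


def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs compatible with an unphased genotype.

    genotype: sequence of alternate-allele counts (0, 1 or 2), one per
    biallelic site (hypothetical encoding chosen for this sketch).
    Returns a sorted list of (hap_a, hap_b) pairs; each haplotype is a
    tuple of 0/1 alleles.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous sites are unambiguous: count 0 -> allele 0 on both
    # haplotypes, count 2 -> allele 1 on both. Heterozygous sites get a
    # placeholder here and are overwritten in the loop below.
    base = [g // 2 for g in genotype]

    pairs = set()
    for assignment in product((0, 1), repeat=len(het_sites)):
        hap_a, hap_b = list(base), list(base)
        for site, allele in zip(het_sites, assignment):
            hap_a[site] = allele
            hap_b[site] = 1 - allele  # the other copy carries the other allele
        # Sort within the pair so (a, b) and (b, a) count as one solution.
        pairs.add(tuple(sorted((tuple(hap_a), tuple(hap_b)))))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in consistent_haplotype_pairs((1, 2, 1)):
        print(pair)
```

With k heterozygous sites (k ≥ 1) there are 2^(k-1) compatible pairs, so the genotype alone quickly stops being informative; this is why phasing methods bring in additional evidence, such as the haplotype sharing between related individuals that the abstract describes.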

Aylwyn Scally

as6@sanger.ac.uk

I am a researcher in computational genomics and population genetics, with particular focus on human and primate evolution. Prior to working in this field my training was in theoretical physics at Trinity College, Dublin, followed by a Ph.D. in astrophysics at the University of Cambridge. I have been at the Sanger Institute since 2007.

Research

My research at the Sanger Institute has primarily been devoted to the Gorilla Genome Project, an international collaboration to assemble and analyse a whole genome sequence for gorilla. As part of this and other projects, I work on various aspects of high-throughput sequencing informatics including assembly, alignment and the detection and analysis of genomic variation.

References

  • Insights into hominid evolution from the gorilla genome sequence.

    Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth A, Lappalainen T, Mailund T, Marques-Bonet T, McCarthy S, Montgomery SH, Schwalie PC, Tang YA, Ward MC, Xue Y, Yngvadottir B, Alkan C, Andersen LN, Ayub Q, Ball EV, Beal K, Bradley BJ, Chen Y, Clee CM, Fitzgerald S, Graves TA, Gu Y, Heath P, Heger A, Karakoc E, Kolb-Kokocinski A, Laird GK, Lunter G, Meader S, Mort M, Mullikin JC, Munch K, O'Connor TD, Phillips AD, Prado-Martinez J, Rogers AS, Sajjadian S, Schmidt D, Shaw K, Simpson JT, Stenson PD, Turner DJ, Vigilant L, Vilella AJ, Whitener W, Zhu B, Cooper DN, de Jong P, Dermitzakis ET, Eichler EE, Flicek P, Goldman N, Mundy NI, Ning Z, Odom DT, Ponting CP, Quail MA, Ryder OA, Searle SM, Warren WC, Wilson RK, Schierup MH, Rogers J, Tyler-Smith C and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.

    Funded by: Biotechnology and Biological Sciences Research Council; Cancer Research UK: A15603; Howard Hughes Medical Institute; Medical Research Council: G0501331, G0701805; NHGRI NIH HHS: HG002385, U54 HG003079; Wellcome Trust: 062023, 075491/Z/04, 077009, 077192, 077198, 089066, 090532, 095908, WT062023, WT077009, WT077192, WT077198, WT089066

    Nature 2012;483;7388;169-75

  • Mapping copy number variation by population-scale genome sequencing.

    Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Konkel MK, Korn J, Khurana E, Kural D, Lam HY, Leng J, Li R, Li Y, Lin CY, Luo R, Mu XJ, Nemesh J, Peckham HE, Rausch T, Scally A, Shi X, Stromberg MP, Stütz AM, Urban AE, Walker JA, Wu J, Zhang Y, Zhang ZD, Batzer MA, Ding L, Marth GT, McVean G, Sebat J, Snyder M, Wang J, Ye K, Eichler EE, Gerstein MB, Hurles ME, Lee C, McCarroll SA, Korbel JO and 1000 Genomes Project

    Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA.

    Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.

    Funded by: Howard Hughes Medical Institute; Medical Research Council: G0701805; NHGRI NIH HHS: P01 HG004120, P41 HG004221, P41 HG004221-01, P41 HG004221-02, P41 HG004221-03, P41 HG004221-03S1, P41 HG004221-03S2, P41 HG004221-03S3, R01 HG004719, R01 HG004719-01, R01 HG004719-02, R01 HG004719-02S1, R01 HG004719-03, R01 HG004719-04, RC2 HG005552, RC2 HG005552-01, RC2 HG005552-02, U01 HG005209, U01 HG005209-01, U01 HG005209-02, U54 HG003273; NIGMS NIH HHS: R01 GM059290-10, R01 GM081533, R01 GM081533-01A1, R01 GM081533-02, R01 GM081533-03, R01 GM081533-04, R01 GM59290; NIMH NIH HHS: R01 MH091350-03; Wellcome Trust: 062023, 077009, 077014, 077192, 085532

    Nature 2011;470;7332;59-65

  • A large genome center's improvements to the Illumina sequencing system.

    Quail MA, Kozarewa I, Smith F, Scally A, Stephens PJ, Durbin R, Swerdlow H and Turner DJ

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK.

    The Wellcome Trust Sanger Institute is one of the world's largest genome centers, and a substantial amount of our sequencing is performed with 'next-generation' massively parallel sequencing technologies: in June 2008 the quantity of purity-filtered sequence data generated by our Genome Analyzer (Illumina) platforms reached 1 terabase, and our average weekly Illumina production output is currently 64 gigabases. Here we describe a set of improvements we have made to the standard Illumina protocols to make the library preparation more reliable in a high-throughput environment, to reduce bias, tighten insert size distribution and reliably obtain high yields of data.

    Funded by: Medical Research Council: G0701805; Wellcome Trust: 079643

    Nature methods 2008;5;12;1005-10

  • Accurate whole human genome sequencing using reversible terminator chemistry.

    Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara E Catenazzi M, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R and Smith AJ

    Illumina Cambridge Ltd. (Formerly Solexa Ltd), Chesterford Research Park, Little Chesterford, Nr Saffron Walden, Essex CB10 1XL, UK. dbentley@illumina.com

    DNA sequence information underpins genetic research, enabling discoveries of important biological or medical benefit. Sequencing projects have traditionally used long (400-800 base pair) reads, but the existence of reference sequences for the human and many other genomes makes it possible to develop new, fast approaches to re-sequencing, whereby shorter reads are compared to a reference to identify intraspecies genetic variation. Here we report an approach that generates several billion bases of accurate nucleotide sequence per experiment at low cost. Single molecules of DNA are attached to a flat surface, amplified in situ and used as templates for synthetic sequencing with fluorescent reversible terminator deoxyribonucleotides. Images of the surface are analysed to generate high-quality sequence. We demonstrate application of this approach to human genome sequencing on flow-sorted X chromosomes and then scale the approach to determine the genome sequence of a male Yoruba from Ibadan, Nigeria. We build an accurate consensus sequence from >30x average depth of paired 35-base reads. We characterize four million single-nucleotide polymorphisms and four hundred thousand structural variants, many of which were previously unknown. Our approach is effective for accurate, rapid and economical whole-genome re-sequencing and many other biomedical applications.

    Funded by: Biotechnology and Biological Sciences Research Council: B05823, MOL04534; Medical Research Council: G0701805; NHGRI NIH HHS: Z01 HG200330-03; Wellcome Trust

    Nature 2008;456;7218;53-9

Stephan Schiffels

ss27@sanger.ac.uk Postdoctoral Fellow

I studied physics at the University of Cologne in Germany and finished my PhD in December 2011. During my PhD I worked mainly in population genetics, especially on problems related to genetic linkage in asexual populations. I also worked on population genomic models of adaptation in fruit flies.

Research

Here at Sanger I am developing a method to infer human population histories from genomic data. I use whole-genome data from the 1000 Genomes Project to estimate past population sizes, for example to better understand the spread of agriculture or the relationship between modern humans and Neanderthals.

References

  • Quantifying selection acting on a complex trait using allele frequency time series data.

    Illingworth CJ, Parts L, Schiffels S, Liti G and Mustonen V

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.

    When selection is acting on a large genetically diverse population, beneficial alleles increase in frequency. This fact can be used to map quantitative trait loci by sequencing the pooled DNA from the population at consecutive time points and observing allele frequency changes. Here, we present a population genetic method to analyze time series data of allele frequencies from such an experiment. Beginning with a range of proposed evolutionary scenarios, the method measures the consistency of each with the observed frequency changes. Evolutionary theory is utilized to formulate equations of motion for the allele frequencies, following which likelihoods for having observed the sequencing data under each scenario are derived. Comparison of these likelihoods gives an insight into the prevailing dynamics of the system under study. We illustrate the method by quantifying selective effects from an experiment, in which two phenotypically different yeast strains were first crossed and then propagated under heat stress (Parts L, Cubillos FA, Warringer J, et al. [14 co-authors]. 2011. Revealing the genetic structure of a trait by sequencing a population under selection. Genome Res). From these data, we discover that about 6% of polymorphic sites evolve nonneutrally under heat stress conditions, either because of their linkage to beneficial (driver) alleles or because they are drivers themselves. We further identify 44 genomic regions containing one or more candidate driver alleles, quantify their apparent selective advantage, obtain estimates of recombination rates within the regions, and show that the dynamics of the drivers display a strong signature of selection going beyond additive models. Our approach is applicable to study adaptation in a range of systems under different evolutionary pressures.

    Funded by: Wellcome Trust: 098051, WT077192/Z/05/Z

    Molecular biology and evolution 2012;29;4;1187-97

  • Emergent neutrality in adaptive asexual evolution.

    Schiffels S, Szöllosi GJ, Mustonen V and Lässig M

    Institut für Theoretische Physik, Universität zu Köln, 50937 Köln, Germany.

    In nonrecombining genomes, genetic linkage can be an important evolutionary force. Linkage generates interference interactions, by which simultaneously occurring mutations affect each other's chance of fixation. Here, we develop a comprehensive model of adaptive evolution in linked genomes, which integrates interference interactions between multiple beneficial and deleterious mutations into a unified framework. By an approximate analytical solution, we predict the fixation rates of these mutations, as well as the probabilities of beneficial and deleterious alleles at fixed genomic sites. We find that interference interactions generate a regime of emergent neutrality: all genomic sites with selection coefficients smaller in magnitude than a characteristic threshold have nearly random fixed alleles, and both beneficial and deleterious mutations at these sites have nearly neutral fixation rates. We show that this dynamic limits not only the speed of adaptation, but also a population's degree of adaptation in its current environment. We apply the model to different scenarios: stationary adaptation in a time-dependent environment and approach to equilibrium in a fixed environment. In both cases, the analytical predictions are in good agreement with numerical simulations. Our results suggest that interference can severely compromise biological functions in an adapting population, which sets viability limits on adaptive evolution under linkage.

    Funded by: Wellcome Trust: 091747

    Genetics 2011;189;4;1361-75
