Human evolution

The Human evolution team uses information on genetic variation in modern humans and apes to answer questions about our species' past. This allows us to understand more about the genetic influences on our current health and disease.

We study human genetic variation, including both single nucleotide polymorphisms (SNPs) and structural variants, in diverse human populations, and also variation in closely related species. With this information, we investigate human origins, expansions and migrations, and how natural selection has shaped our species.

[Genome Research Limited]

Background

We are one of the great apes, but differ from orangutans, gorillas, chimpanzees and bonobos in our enormous numbers and worldwide distribution, combined with surprisingly low genetic diversity that is spread evenly among populations. All of these human-specific characteristics have a simple explanation: the recent expansion of modern humans from a small population in Africa within the last 100,000 years. All human populations therefore share most of their genetic variants and susceptibilities, because these were present in the ancestral population. But populations differ slightly because random genetic drift and natural selection affected them differently during the expansions into new environments over the last 50,000 years.

One view of the expansion of anatomically and behaviourally modern humans out of Africa around 50 thousand years ago (KYA). Times and routes are very uncertain.

With the availability of genomic sequences from humans and apes and accumulation of extensive information about the variation within humans, we can now begin to reconstruct these expansions and search directly for the functional genetic variants that have contributed to the characteristics of modern humans. Most DNA variants are evolutionarily neutral (they have no effect on fitness) but provide information on past population sizes and migrations, and we continue to investigate these, particularly using the Y chromosome and mitochondrial DNA. A few variants increase fitness and are of particular interest. We can recognise these from the patterns of variation in the surrounding DNA, or by carrying out functional studies. We would like to catalogue the positively selected regions in the human genome and understand the basis for their selection.

Disease-associated alleles are generally expected to decrease fitness, so why are they present at all and not eliminated by negative selection? New disease variants arise continually by mutation, and while some are eliminated rapidly, those that confer only a small decrease in fitness may persist in the population for many generations. Indeed, if the disease develops only after an individual has reproduced, the causal variant may be, in evolutionary terms, neutral. Occasionally, a disease-associated allele may actually confer a fitness advantage in certain circumstances and be positively selected, as the sickle allele has been in malaria-endemic regions. An evolutionary perspective can thus help us to understand our disease susceptibilities more fully.

By exploring the genetic signals left in our gene pool in these ways we can reconstruct human evolutionary history and advance our understanding of what makes us human, what makes populations differ from one another, and why we suffer from some diseases.

Research

Previous projects

  • Gene number variation and human evolution
  • Population differentiation and human evolution
  • Y-chromosomal variation and human evolution

Publications

Team publications 2014

  • Revisiting the thrifty gene hypothesis via 65 loci associated with susceptibility to type 2 diabetes.

    Ayub Q, Moutsianas L, Chen Y, Panoutsopoulou K, Colonna V, Pagani L, Prokopenko I, Ritchie GR, Tyler-Smith C, McCarthy MI, Zeggini E and Xue Y

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1HH, UK.

    We have investigated the evidence for positive selection in samples of African, European, and East Asian ancestry at 65 loci associated with susceptibility to type 2 diabetes (T2D) previously identified through genome-wide association studies. Selection early in human evolutionary history is predicted to lead to ancestral risk alleles shared between populations, whereas late selection would result in population-specific signals at derived risk alleles. By using a wide variety of tests based on the site frequency spectrum, haplotype structure, and population differentiation, we found no global signal of enrichment for positive selection when we considered all T2D risk loci collectively. However, in a locus-by-locus analysis, we found nominal evidence for positive selection at 14 of the loci. Selection favored the protective and risk alleles in similar proportions, rather than the risk alleles specifically as predicted by the thrifty gene hypothesis, and may not be related to influence on diabetes. Overall, we conclude that past positive selection has not been a powerful influence driving the prevalence of T2D risk alleles.

    Funded by: Wellcome Trust: 098051, 098381, WT090367MA

    American journal of human genetics 2014;94;2;176-85

  • Gene Conversion Violates the Stepwise Mutation Model for Microsatellites in Y-chromosomal Palindromic Repeats.

    Balaresque P, King TE, Parkin EJ, Heyer E, Carvalho-Silva D, Kraaijenbrink T, de Knijff P, Tyler-Smith C and Jobling MA

    UMR5288 CNRS/UPS - AMIS - Université Paul Sabatier, Allées Jules Guesde, Toulouse, France; Department of Genetics, University of Leicester, University Road, Leicester, UK.

    The male-specific region of the human Y chromosome (MSY) contains eight large inverted repeats (palindromes) in which high sequence similarity between repeat arms is maintained by gene conversion. These palindromes also harbor microsatellites, considered to evolve via a stepwise mutation model (SMM). Here we ask whether gene conversion between palindrome microsatellites contributes to their mutational dynamics. First, we study the duplicated tetranucleotide microsatellite DYS385a,b lying in palindrome P4. We show, by comparing observed data with simulated data under a SMM within haplogroups, that observed heteroallelic combinations in which the modal repeat number difference between copies was large, can give rise to homoallelic combinations with zero repeats difference, equivalent to many single-step mutations. These are unlikely to be generated under a strict SMM, suggesting the action of gene conversion. Second, we show that the inter-copy repeat-number difference for a large set of duplicated microsatellites in all palindromes in the MSY reference sequence is significantly reduced compared to that for non-palindrome duplicated microsatellites, suggesting that the former are characterized by unusual evolutionary dynamics. These observations indicate that gene conversion violates the SMM for microsatellites in palindromes, homogenizing copies within individual Y chromosomes, but increasing overall haplotype diversity among chromosomes within related groups.

    Human mutation 2014

  • The Andean adaptive toolkit to counteract high altitude maladaptation: genome-wide and phenotypic analysis of the Collas.

    Eichstaedt CA, Antão T, Pagani L, Cardona A, Kivisild T and Mormina M

    Division of Biological Anthropology, University of Cambridge, Cambridge, Cambridgeshire, United Kingdom.

    During their migrations out of Africa, humans successfully colonised and adapted to a wide range of habitats, including extreme high altitude environments, where reduced atmospheric oxygen (hypoxia) imposes a number of physiological challenges. This study evaluates genetic and phenotypic variation in the Colla population living in the Argentinean Andes above 3500 m and compares it to the nearby lowland Wichí group in an attempt to pinpoint evolutionary mechanisms underlying adaptation to high altitude hypoxia. We genotyped 730,525 SNPs in 25 individuals from each population. In genome-wide scans of extended haplotype homozygosity Collas showed the strongest signal around VEGFB, which plays an essential role in the ischemic heart, and ELTD1, another gene crucial for heart development and prevention of cardiac hypertrophy. Moreover, pathway enrichment analysis showed an overrepresentation of pathways associated with cardiac morphology. Taken together, these findings suggest that Colla highlanders may have evolved a toolkit of adaptive mechanisms resulting in cardiac reinforcement, most likely to counteract the adverse effects of the permanently increased haematocrit and associated shear forces that characterise the Andean response to hypoxia. Regulation of cerebral vascular flow also appears to be part of the adaptive response in Collas. These findings are not only relevant to understanding the evolution of hypoxia protection in high altitude populations but may also suggest new avenues for medical research into conditions where hypoxia constitutes a detrimental factor.

    PloS one 2014;9;3;e93314

  • Using ancestry-informative markers to identify fine structure across 15 populations of European origin.

    GCAN

    The Wellcome Trust Case Control Consortium 3 anorexia nervosa genome-wide association scan includes 2907 cases from 15 different populations of European origin genotyped on the Illumina 670K chip. We compared methods for identifying population stratification, and suggest a list of markers that may help to counter this problem. It is usual to identify population structure in such studies using only common variants with minor allele frequency (MAF) >5%; we find that this may result in highly informative SNPs being discarded, and suggest that instead all SNPs with MAF >1% may be used. We established informative axes of variation identified via principal component analysis and highlight important features of the genetic structure of diverse European-descent populations, some studied for the first time at this scale. Finally, we investigated the substructure within each of these 15 populations and identified SNPs that help capture hidden stratification. This work can provide information regarding the designing and interpretation of association results in the International Consortia. European Journal of Human Genetics advance online publication, 19 February 2014; doi:10.1038/ejhg.2014.1.

    European journal of human genetics : EJHG 2014

  • A Linguistically Informed Autosomal STR Survey of Human Populations Residing in the Greater Himalayan Region.

    Kraaijenbrink T, van der Gaag KJ, Zuniga SB, Xue Y, Carvalho-Silva DR, Tyler-Smith C, Jobling MA, Parkin EJ, Su B, Shi H, Xiao CJ, Tang WR, Kashyap VK, Trivedi R, Sitalaximi T, Banerjee J, Gaselô KT, Tuladhar NM, Opgenort JR, van Driem GL, Barbujani G and de Knijff P

    MGC Department of Human and Clinical Genetics, Leiden University Medical Centre, Leiden, the Netherlands.

    The greater Himalayan region demarcates two of the most prominent linguistic phyla in Asia: Tibeto-Burman and Indo-European. Previous genetic surveys, mainly using Y-chromosome polymorphisms and/or mitochondrial DNA polymorphisms suggested a substantially reduced geneflow between populations belonging to these two phyla. These studies, however, have mainly focussed on populations residing far to the north and/or south of this mountain range, and have not been able to study geneflow patterns within the greater Himalayan region itself. We now report a detailed, linguistically informed, genetic survey of Tibeto-Burman and Indo-European speakers from the Himalayan countries Nepal and Bhutan based on autosomal microsatellite markers and compare these populations with surrounding regions. The genetic differentiation between populations within the Himalayas seems to be much higher than between populations in the neighbouring countries. We also observe a remarkable genetic differentiation between the Tibeto-Burman speaking populations on the one hand and Indo-European speaking populations on the other, suggesting that language and geography have played an equally large role in defining the genetic composition of present-day populations within the Himalayas.

    PloS one 2014;9;3;e91534

  • Association of a germline copy number polymorphism of APOBEC3A and APOBEC3B with burden of putative APOBEC-dependent mutations in breast cancer.

    Nik-Zainal S, Wedge DC, Alexandrov LB, Petljak M, Butler AP, Bolli N, Davies HR, Knappskog S, Martin S, Papaemmanuil E, Ramakrishna M, Shlien A, Simonic I, Xue Y, Tyler-Smith C, Campbell PJ and Stratton MR

    [1] Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK. [2] Department of Medical Genetics, Addenbrooke's Hospital National Health Service (NHS) Trust, Cambridge, UK.

    The somatic mutations in a cancer genome are the aggregate outcome of one or more mutational processes operative through the lifetime of the individual with cancer. Each mutational process leaves a characteristic mutational signature determined by the mechanisms of DNA damage and repair that constitute it. A role was recently proposed for the APOBEC family of cytidine deaminases in generating particular genome-wide mutational signatures and a signature of localized hypermutation called kataegis. A germline copy number polymorphism involving APOBEC3A and APOBEC3B, which effectively deletes APOBEC3B, has been associated with modestly increased risk of breast cancer. Here we show that breast cancers in carriers of the deletion show more mutations of the putative APOBEC-dependent genome-wide signatures than cancers in non-carriers. The results suggest that the APOBEC3A-APOBEC3B germline deletion allele confers cancer susceptibility through increased activity of APOBEC-dependent mutational processes, although the mechanism by which this increase in activity occurs remains unknown.

    Nature genetics 2014

Team publications 2013

  • Genomic triumph meets clinical reality.

    Ayub Q, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. cts@sanger.ac.uk.

    A report on the 'Genomic Disorders 2013: from 60 years of DNA to human genomes in the clinic' meeting, held at Homerton College, Cambridge, UK, April 10-12, 2013.

    Genome biology 2013;14;5;307

  • FOXP2 targets show evidence of positive selection in European populations.

    Ayub Q, Yngvadottir B, Chen Y, Xue Y, Hu M, Vernes SC, Fisher SE and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. qa1@sanger.ac.uk

    Forkhead box P2 (FOXP2) is a highly conserved transcription factor that has been implicated in human speech and language disorders and plays important roles in the plasticity of the developing brain. The pattern of nucleotide polymorphisms in FOXP2 in modern populations suggests that it has been the target of positive (Darwinian) selection during recent human evolution. In our study, we searched for evidence of selection that might have followed FOXP2 adaptations in modern humans. We examined whether or not putative FOXP2 targets identified by chromatin-immunoprecipitation genomic screening show evidence of positive selection. We developed an algorithm that, for any given gene list, systematically generates matched lists of control genes from the Ensembl database, collates summary statistics for three frequency-spectrum-based neutrality tests from the low-coverage resequencing data of the 1000 Genomes Project, and determines whether these statistics are significantly different between the given gene targets and the set of controls. Overall, there was strong evidence of selection of FOXP2 targets in Europeans, but not in the Han Chinese, Japanese, or Yoruba populations. Significant outliers included several genes linked to cellular movement, reproduction, development, and immune cell trafficking, and 13 of these constituted a significant network associated with cardiac arteriopathy. Strong signals of selection were observed for CNTNAP2 and RBFOX1, key neurally expressed genes that have been consistently identified as direct FOXP2 targets in multiple studies and that have themselves been associated with neurodevelopmental disorders involving language dysfunction.

    Funded by: Wellcome Trust: 098051

    American journal of human genetics 2013;92;5;696-706

  • Y-chromosome and mtDNA genetics reveal significant contrasts in affinities of modern Middle Eastern populations with European and African populations.

    Badro DA, Douaihy B, Haber M, Youhanna SC, Salloum A, Ghassibe-Sabbagh M, Johnsrud B, Khazen G, Matisoo-Smith E, Soria-Hernanz DF, Wells RS, Tyler-Smith C, Platt DE, Zalloua PA and Genographic Consortium

    The Lebanese American University, Chouran, Beirut, Lebanon.

    The Middle East was a funnel of human expansion out of Africa, a staging area for the Neolithic Agricultural Revolution, and the home to some of the earliest world empires. Post-LGM expansions into the region and subsequent population movements created a striking genetic mosaic with distinct sex-based genetic differentiation. While prior studies have examined the mtDNA and Y-chromosome contrast in focal populations in the Middle East, none have undertaken a broad-spectrum survey including North and sub-Saharan Africa, Europe, and Middle Eastern populations. In this study 5,174 mtDNA and 4,658 Y-chromosome samples were investigated using PCA, MDS, mean-linkage clustering, AMOVA, and Fisher exact tests of F(ST)'s, R(ST)'s, and haplogroup frequencies. Geographic differentiation in affinities of Middle Eastern populations with Africa and Europe showed distinct contrasts between mtDNA and Y-chromosome data. Specifically, Lebanon's mtDNA shows a very strong association to Europe, while Yemen shows very strong affinity with Egypt and North and East Africa. Previous Y-chromosome results showed a Levantine coastal-inland contrast marked by J1 and J2, and a very strong North African component was evident throughout the Middle East. Neither of these patterns was observed in the mtDNA. While J2 has penetrated into Europe, the pattern of Y-chromosome diversity in Lebanon does not show the widespread affinities with Europe indicated by the mtDNA data. Lastly, while each population shows evidence of connections with expansions that now define the Middle East, Africa, and Europe, many of the populations in the Middle East show distinctive mtDNA and Y-haplogroup characteristics that indicate long standing settlement with relatively little impact from and movement into other populations.

    PloS one 2013;8;1;e54616

  • Where genotype is not predictive of phenotype: towards an understanding of the molecular basis of reduced penetrance in human inherited disease.

    Cooper DN, Krawczak M, Polychronakos C, Tyler-Smith C and Kehrer-Sawatzki H

    Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff, CF14 4XN, UK, cooperDN@cardiff.ac.uk.

    Some individuals with a particular disease-causing mutation or genotype fail to express most if not all features of the disease in question, a phenomenon that is known as 'reduced (or incomplete) penetrance'. Reduced penetrance is not uncommon; indeed, there are many known examples of 'disease-causing mutations' that fail to cause disease in at least a proportion of the individuals who carry them. Reduced penetrance may therefore explain not only why genetic diseases are occasionally transmitted through unaffected parents, but also why healthy individuals can harbour quite large numbers of potentially disadvantageous variants in their genomes without suffering any obvious ill effects. Reduced penetrance can be a function of the specific mutation(s) involved or of allele dosage. It may also result from differential allelic expression, copy number variation or the modulating influence of additional genetic variants in cis or in trans. The penetrance of some pathogenic genotypes is known to be age- and/or sex-dependent. Variable penetrance may also reflect the action of unlinked modifier genes, epigenetic changes or environmental factors. At least in some cases, complete penetrance appears to require the presence of one or more genetic variants at other loci. In this review, we summarize the evidence for reduced penetrance being a widespread phenomenon in human genetics and explore some of the molecular mechanisms that may help to explain this enigmatic characteristic of human inherited disease.

    Funded by: Wellcome Trust: 098051

    Human genetics 2013;132;10;1077-130

  • The GenoChip: a new tool for genetic anthropology.

    Elhaik E, Greenspan E, Staats S, Krahn T, Tyler-Smith C, Xue Y, Tofanelli S, Francalacci P, Cucca F, Pagani L, Jin L, Li H, Schurr TG, Greenspan B, Spencer Wells R and Genographic Consortium

    Department of Mental Health, Johns Hopkins University Bloomberg School of Public Health, USA.

    The Genographic Project is an international effort aimed at charting human migratory history. The project is nonprofit and nonmedical, and, through its Legacy Fund, supports locally led efforts to preserve indigenous and traditional cultures. Although the first phase of the project was focused on uniparentally inherited markers on the Y-chromosome and mitochondrial DNA (mtDNA), the current phase focuses on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide single-nucleotide polymorphism (SNP) genotyping, they were designed for medical genetic studies and contain medically related markers that are inappropriate for global population genetic studies. GenoChip, the Genographic Project's new genotyping array, was designed to resolve these issues and enable higher resolution research into outstanding questions in genetic anthropology. The GenoChip includes ancestry informative markers obtained for over 450 human populations, an ancient human (Saqqaq), and two archaic hominins (Neanderthal and Denisovan) and was designed to identify all known Y-chromosome and mtDNA haplogroups. The chip was carefully vetted to avoid inclusion of medically relevant markers. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays. Although all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The chip performances are illustrated in a principal component analysis for 14 worldwide populations. In summary, the GenoChip is a dedicated genotyping platform for genetic anthropology. With an unprecedented number of approximately 12,000 Y-chromosomal and approximately 3,300 mtDNA SNPs and over 130,000 autosomal and X-chromosomal SNPs without any known health, medical, or phenotypic relevance, the GenoChip is a useful tool for genetic anthropology and population genetics.

    Funded by: NIMH NIH HHS: T32 MH014592; Wellcome Trust: 098051

    Genome biology and evolution 2013;5;5;1021-31

  • Genome-wide diversity in the Levant reveals recent structuring by culture.

    Haber M, Gauguier D, Youhanna S, Patterson N, Moorjani P, Botigué LR, Platt DE, Matisoo-Smith E, Soria-Hernanz DF, Wells RS, Bertranpetit J, Tyler-Smith C, Comas D and Zalloua PA

    Institut de Biologia Evolutiva (CSIC-UPF), Departament de Ciències de la Salut i de la Vida, Universitat Pompeu Fabra, Barcelona, Spain.

    The Levant is a region in the Near East with an impressive record of continuous human existence and major cultural developments since the Paleolithic period. Genetic and archeological studies present solid evidence placing the Middle East and the Arabian Peninsula as the first stepping-stone outside Africa. There is, however, little understanding of demographic changes in the Middle East, particularly the Levant, after the first Out-of-Africa expansion and how the Levantine peoples relate genetically to each other and to their neighbors. In this study we analyze more than 500,000 genome-wide SNPs in 1,341 new samples from the Levant and compare them to samples from 48 populations worldwide. Our results show recent genetic stratifications in the Levant are driven by the religious affiliations of the populations within the region. Cultural changes within the last two millennia appear to have facilitated/maintained admixture between culturally similar populations from the Levant, Arabian Peninsula, and Africa. The same cultural changes seem to have resulted in genetic isolation of other groups by limiting admixture with culturally different neighboring populations. Consequently, Levant populations today fall into two main groups: one sharing more genetic characteristics with modern-day Europeans and Central Asians, and the other with closer genetic affinities to other Middle Easterners and Africans. Finally, we identify a putative Levantine ancestral component that diverged from other Middle Easterners ∼23,700-15,500 years ago during the last glacial period, and diverged from Europeans ∼15,900-9,100 years ago between the last glacial warming and the start of the Neolithic.

    Funded by: PEPFAR: 098051; Wellcome Trust

    PLoS genetics 2013;9;2;e1003316

  • Genetic signatures reveal high-altitude adaptation in a set of Ethiopian populations.

    Huerta-Sánchez E, Degiorgio M, Pagani L, Tarekegn A, Ekong R, Antao T, Cardona A, Montgomery HE, Cavalleri GL, Robbins PA, Weale ME, Bradman N, Bekele E, Kivisild T, Tyler-Smith C and Nielsen R

    Department of Integrative Biology, University of California, Berkeley, CA, USA. emiliahsc@berkeley.edu

    The Tibetan and Andean Plateaus and Ethiopian highlands are the largest regions to have long-term high-altitude residents. Such populations are exposed to lower barometric pressures and hence atmospheric partial pressures of oxygen. Such "hypobaric hypoxia" may limit physical functional capacity, reproductive health, and even survival. As such, selection of genetic variants advantageous to hypoxic adaptation is likely to have occurred. Identifying signatures of such selection is likely to help understanding of hypoxic adaptive processes. Here, we seek evidence of such positive selection using five Ethiopian populations, three of which are from high-altitude areas in Ethiopia. As these populations may have been recipients of Eurasian gene flow, we correct for this admixture. Using single-nucleotide polymorphism genotype data from multiple populations, we find the strongest signal of selection in BHLHE41 (also known as DEC2 or SHARP1). Remarkably, a major role of this gene is regulation of the same hypoxia response pathway on which selection has most strikingly been observed in both Tibetan and Andean populations. Because it is also an important player in the circadian rhythm pathway, BHLHE41 might also provide insights into the mechanisms underlying the recognized impacts of hypoxia on the circadian clock. These results support the view that Ethiopian, Andean, and Tibetan populations living at high altitude have adapted to hypoxia differently, with convergent evolution affecting different genes from the same pathway.

    Funded by: NHGRI NIH HHS: R01HG003229, R01HG003229-08S2

    Molecular biology and evolution 2013;30;8;1877-88

  • Integrative annotation of variants from 1092 humans: application to cancer genomics.

    Khurana E, Fu Y, Colonna V, Mu XJ, Kang HM, Lappalainen T, Sboner A, Lochovsky L, Chen J, Harmanci A, Das J, Abyzov A, Balasubramanian S, Beal K, Chakravarty D, Challis D, Chen Y, Clarke D, Clarke L, Cunningham F, Evani US, Flicek P, Fragoza R, Garrison E, Gibbs R, Gümüs ZH, Herrero J, Kitabayashi N, Kong Y, Lage K, Liluashvili V, Lipkin SM, MacArthur DG, Marth G, Muzny D, Pers TH, Ritchie GR, Rosenfeld JA, Sisu C, Wei X, Wilson M, Xue Y, Yu F, 1000 Genomes Project Consortium, Dermitzakis ET, Yu H, Rubin MA, Tyler-Smith C and Gerstein M

    Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA.

    Interpreting variants, especially noncoding ones, in the increasing number of personal genomes is challenging. We used patterns of polymorphisms in functionally annotated regions in 1092 humans to identify deleterious variants; then we experimentally validated candidates. We analyzed both coding and noncoding regions, with the former corroborating the latter. We found regions particularly sensitive to mutations ("ultrasensitive") and variants that are disruptive because of mechanistic effects on transcription-factor binding (that is, "motif-breakers"). We also found variants in regions with higher network centrality tend to be deleterious. Insertions and deletions followed a similar pattern to single-nucleotide variants, with some notable exceptions (e.g., certain deletions and enhancers). On the basis of these patterns, we developed a computational tool (FunSeq), whose application to ~90 cancer genomes reveals nearly a hundred candidate noncoding drivers.

    Funded by: NCATS NIH HHS: UL1 TR000457; NCI NIH HHS: CA167824, R01 CA166661, R01CA152057, U01 CA111275; NHGRI NIH HHS: HG005718, HG007000, R01 HG002898, R01HG4719, U01 HG005718, U01HG6513, U41 HG007000; NIGMS NIH HHS: GM104424; Wellcome Trust: 085532, 095908, 098051, WT085532, WT095908

    Science (New York, N.Y.) 2013;342;6154;1235587

  • A genome-wide survey of genetic variation in gorillas using reduced representation sequencing.

    Scally A, Yngvadottir B, Xue Y, Ayub Q, Durbin R and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom.

    All non-human great apes are endangered in the wild, and it is therefore important to gain an understanding of their demography and genetic diversity. Whole genome assembly projects have provided an invaluable foundation for understanding genetics in all four genera, but to date genetic studies of multiple individuals within great ape species have largely been confined to mitochondrial DNA and a small number of other loci. Here, we present a genome-wide survey of genetic variation in gorillas using a reduced representation sequencing approach, focusing on the two lowland subspecies. We identify 3,006,670 polymorphic sites in 14 individuals: 12 western lowland gorillas (Gorilla gorilla gorilla) and 2 eastern lowland gorillas (Gorilla beringei graueri). We find that the two species are genetically distinct, based on levels of heterozygosity and patterns of allele sharing. Focusing on the western lowland population, we observe evidence for population substructure, and a deficit of rare genetic variants suggesting a recent episode of population contraction. In western lowland gorillas, there is an elevation of variation towards telomeres and centromeres on the chromosomal scale. On a finer scale, we find substantial variation in genetic diversity, including a marked reduction close to the major histocompatibility locus, perhaps indicative of recent strong selection there. These findings suggest that despite their maintaining an overall level of genetic diversity equal to or greater than that of humans, population decline, perhaps associated with disease, has been a significant factor in recent and long-term pressures on wild gorilla populations.

    Funded by: Wellcome Trust: 098051

    PloS one 2013;8;6;e65066

  • Modeling the contrasting Neolithic male lineage expansions in Europe and Africa.

    Sikora MJ, Colonna V, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK. cts@sanger.ac.uk.

    Background: Patterns of genetic variation in a population carry information about the prehistory of the population, and for the human Y chromosome an especially informative phylogenetic tree has previously been constructed from fully-sequenced chromosomes. This revealed contrasting bifurcating and starlike phylogenies for the major lineages associated with the Neolithic expansions in sub-Saharan Africa and Western Europe, respectively.

    Results: We used coalescent simulations to investigate the range of demographic models most likely to produce the phylogenetic structures observed in Africa and Europe, assessing the starting and ending genetic effective population sizes, duration of the expansion, and time when expansion ended. The best-fitting models in Africa and Europe are very different. In Africa, the expansion took about 12 thousand years, ending very recently; it started from approximately 40 men and numbers expanded approximately 50-fold. In Europe, the expansion was much more rapid, taking only a few generations and occurring as soon as the major R1b lineage entered Europe; it started from just one to three men, whose numbers expanded more than a thousandfold.

    Conclusions: Although highly simplified, the demographic model we have used captures key elements of the differences between the male Neolithic expansions in Africa and Europe, and is consistent with archaeological findings.

    Investigative genetics 2013;4;1;25

  • A rare functional cardioprotective APOC3 variant has risen in frequency in distinct population isolates.

    Tachmazidou I, Dedoussis G, Southam L, Farmaki AE, Ritchie GR, Xifara DK, Matchan A, Hatzikotoulas K, Rayner NW, Chen Y, Pollin TI, O'Connell JR, Yerges-Armstrong LM, Kiagiadaki C, Panoutsopoulou K, Schwartzentruber J, Moutsianas L, UK10K consortium, Tsafantakis E, Tyler-Smith C, McVean G, Xue Y and Zeggini E

    Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK.

    Isolated populations can empower the identification of rare variation associated with complex traits through next generation association studies, but the generalizability of such findings remains unknown. Here we genotype 1,267 individuals from a Greek population isolate on the Illumina HumanExome Beadchip, in search of functional coding variants associated with lipids traits. We find genome-wide significant evidence for association between R19X, a functional variant in APOC3, with increased high-density lipoprotein and decreased triglycerides levels. Approximately 3.8% of individuals are heterozygous for this cardioprotective variant, which was previously thought to be private to the Amish founder population. R19X is rare (<0.05% frequency) in outbred European populations. The increased frequency of R19X enables discovery of this lipid traits signal at genome-wide significance in a small sample size. This work exemplifies the value of isolated populations in successfully detecting transferable rare variant associations of high medical relevance.

    Funded by: NHLBI NIH HHS: K01 HL116770, R01 HL104193, U01 HL072515, U01 HL105198; NIDDK NIH HHS: P30 DK072488; Wellcome Trust: 098051, WT091310

    Nature communications 2013;4;2872

  • Genetic basis of Y-linked hearing impairment.

    Wang Q, Xue Y, Zhang Y, Long Q, Asan, Yang F, Turner DJ, Fitzgerald T, Ng BL, Zhao Y, Chen Y, Liu Q, Yang W, Han D, Quail MA, Swerdlow H, Burton J, Fahey C, Ning Z, Hurles ME, Carter NP, Yang H and Tyler-Smith C

    Department of Otolaryngology, Head and Neck Surgery, Chinese PLA Institute of Otolaryngology, Chinese PLA General Hospital, Beijing, China.

    A single Mendelian trait has been mapped to the human Y chromosome: Y-linked hearing impairment. The molecular basis of this disorder is unknown. Here, we report the detailed characterization of the DFNY1 Y chromosome and its comparison with a closely related Y chromosome from an unaffected branch of the family. The DFNY1 chromosome carries a complex rearrangement, including duplication of several noncontiguous segments of the Y chromosome and insertion of ∼160 kb of DNA from chromosome 1, in the pericentric region of Yp. This segment of chromosome 1 is derived entirely from within a known hearing impairment locus, DFNA49. We suggest that a third copy of one or more genes from the shared segment of chromosome 1 might be responsible for the hearing-loss phenotype.

    Funded by: Wellcome Trust: 098051

    American journal of human genetics 2013;92;2;301-6

  • A calibrated human Y-chromosomal phylogeny based on resequencing.

    Wei W, Ayub Q, Chen Y, McCarthy S, Hou Y, Carbone I, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

    We have identified variants present in high-coverage complete sequences of 36 diverse human Y chromosomes from Africa, Europe, South Asia, East Asia, and the Americas, representing eight major haplogroups. After restricting our analysis to 8.97 Mb of the unique male-specific Y sequence, we identified 6662 high-confidence variants, including single-nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs), and indels. We constructed phylogenetic trees using these variants, or subsets of them, and recapitulated the known structure of the tree. Assuming a male mutation rate of 1 × 10⁻⁹ per base pair per year, the time depth of the tree (haplogroups A3-R) was ~101,000-115,000 yr, and the lineages found outside Africa dated to 57,000-74,000 yr, both as expected. In addition, we dated a striking Paleolithic male lineage expansion to 41,000-52,000 yr ago and the node representing the major European Y lineage, R1b, to 4000-13,000 yr ago, supporting a Neolithic origin for these modern European Y chromosomes. In all, we provide a nearly 10-fold increase in the number of Y markers with phylogenetic information, and novel historical insights derived from placing them on a calibrated phylogenetic tree.

    Funded by: Wellcome Trust: 098051

    Genome research 2013;23;2;388-95

  • A comparison of Y-chromosomal lineage dating using either resequencing or Y-SNP plus Y-STR genotyping.

    Wei W, Ayub Q, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, UK.

    We have compared phylogenies and time estimates for Y-chromosomal lineages based on resequencing ∼9 Mb of DNA and applying the program GENETREE to similar analyses based on the more standard approach of genotyping 26 Y-SNPs plus 21 Y-STRs and applying the programs NETWORK and BATWING. We find that deep phylogenetic structure is not adequately reconstructed after Y-SNP plus Y-STR genotyping, and that times estimated using observed Y-STR mutation rates are several-fold too recent. In contrast, an evolutionary mutation rate gives times that are more similar to the resequencing data. In principle, systematic comparisons of this kind can in future studies be used to identify the combinations of Y-SNP and Y-STR markers, and time estimation methodologies, that correspond best to resequencing data.

    Funded by: Wellcome Trust

    Forensic science international. Genetics 2013;7;6;568-72
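
The calibrated Y-chromosomal phylogeny listed above dates lineages from a per-base-pair, per-year mutation rate applied to 8.97 Mb of sequence. The arithmetic behind such a calibration can be sketched with a strict molecular clock. This is a minimal illustration only: the function names are hypothetical, and the published analysis used dedicated phylogenetic software rather than this direct calculation.

```python
# Strict molecular clock sketch (illustrative; not the authors' method).
# Assumption: mutations accumulate along a lineage at a constant rate,
# so expected mutations = age x sequence length x rate.

SEQUENCE_BP = 8.97e6          # unique male-specific Y sequence analysed (from the abstract)
RATE_PER_BP_PER_YEAR = 1e-9   # male mutation rate assumed in the abstract


def expected_mutations(age_years, sequence_bp=SEQUENCE_BP,
                       rate=RATE_PER_BP_PER_YEAR):
    """Expected number of mutations on one lineage of the given age."""
    return age_years * sequence_bp * rate


def lineage_age_years(mutations, sequence_bp=SEQUENCE_BP,
                      rate=RATE_PER_BP_PER_YEAR):
    """Age of a lineage implied by a mutation count under a strict clock."""
    if sequence_bp <= 0 or rate <= 0:
        raise ValueError("sequence length and rate must be positive")
    return mutations / (sequence_bp * rate)


if __name__ == "__main__":
    # Mutations expected along a single lineage at the reported tree depths.
    for age in (101_000, 115_000):
        print(age, round(expected_mutations(age), 2))
```

Under these assumptions, a tree depth of 101,000 years corresponds to roughly 906 mutations along a single root-to-tip lineage. Halving the assumed rate would double every date, which is why the choice of mutation rate matters so much for calibrated phylogenies.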

Team publications 2012

  • An integrated map of genetic variation from 1,092 human genomes.

    1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT and McVean GA

    By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

    Funded by: Biotechnology and Biological Sciences Research Council: BB/I021213/1; British Heart Foundation: RG/09/012/28096, RG/09/12/28096; Howard Hughes Medical Institute; Medical Research Council: G0701805, G0801823, G0900747, G0900747(91070); NCI NIH HHS: R01 CA166661, R01CA166661; NCRR NIH HHS: UL1RR024131; NHGRI NIH HHS: P01HG4120, P41HG2371, P41HG4221, R01 HG002898, R01 HG004960, R01 HG007022, R01HG2898, R01HG3698, R01HG4719, R01HG4960, R01HG5701, RC2HG5552, RC2HG5581, U01 HG005728, U01 HG006513, U01 HG006569, U01HG5208, U01HG5209, U01HG5211, U01HG5214, U01HG5715, U01HG5725, U01HG5728, U01HG6513, U01HG6569, U41HG4568, U54 HG003079, U54 HG003273, U54HG3067, U54HG3079, U54HG3273; NHLBI NIH HHS: HL078885, R01HL95045, RC2HL102925, T32HL94284; NIAID NIH HHS: AI077439, AI2009061; NIEHS NIH HHS: ES015794; NIGMS NIH HHS: R01GM59290, T32GM7748, T32GM8283; NIH HHS: DP2OD6514; NIMH NIH HHS: F30 MH098571, R01MH84698; NLM NIH HHS: T15LM7033; PHS HHS: HHSN268201100040C; Wellcome Trust: 085532, 086084, 090532, 095908, WT085475/Z/08/Z, WT085532AIA, WT086084/Z/08/Z, WT089250/Z/09/Z, WT090532/Z/09/Z, WT095552/Z/11/Z, WT098051

    Nature 2012;491;7422;56-65

  • Population differentiation of southern Indian male lineages correlates with agricultural expansions predating the caste system.

    Arunkumar G, Soria-Hernanz DF, Kavitha VJ, Arun VS, Syama A, Ashokan KS, Gandhirajan KT, Vijayakumar K, Narayanan M, Jayalakshmi M, Ziegle JS, Royyuru AK, Parida L, Wells RS, Renfrew C, Schurr TG, Smith CT, Platt DE, Pitchappan R and Genographic Consortium

    The Genographic Laboratory, School of Biological Sciences, Madurai Kamaraj University, Madurai, Tamil Nadu, India.

    Previous studies that pooled Indian populations from a wide variety of geographical locations, have obtained contradictory conclusions about the processes of the establishment of the Varna caste system and its genetic impact on the origins and demographic histories of Indian populations. To further investigate these questions we took advantage that both Y chromosome and caste designation are paternally inherited, and genotyped 1,680 Y chromosomes representing 12 tribal and 19 non-tribal (caste) endogamous populations from the predominantly Dravidian-speaking Tamil Nadu state in the southernmost part of India. Tribes and castes were both characterized by an overwhelming proportion of putatively Indian autochthonous Y-chromosomal haplogroups (H-M69, F-M89, R1a1-M17, L1-M27, R2-M124, and C5-M356; 81% combined) with a shared genetic heritage dating back to the late Pleistocene (10-30 Kya), suggesting that more recent Holocene migrations from western Eurasia contributed <20% of the male lineages. We found strong evidence for genetic structure, associated primarily with the current mode of subsistence. Coalescence analysis suggested that the social stratification was established 4-6 Kya and there was little admixture during the last 3 Kya, implying a minimal genetic impact of the Varna (caste) system from the historically-documented Brahmin migrations into the area. In contrast, the overall Y-chromosomal patterns, the time depth of population diversifications and the period of differentiation were best explained by the emergence of agricultural technology in South Asia. These results highlight the utility of detailed local genetic studies within India, without prior assumptions about the importance of Varna rank status for population grouping, to obtain new insights into the relative influences of past demographic events for the population structure of the whole of modern India.

    Funded by: Wellcome Trust: 098051

    PloS one 2012;7;11;e50269

  • Genome-wide meta-analysis of common variant differences between men and women.

    Boraska V, Jerončić A, Colonna V, Southam L, Nyholt DR, Rayner NW, Perry JR, Toniolo D, Albrecht E, Ang W, Bandinelli S, Barbalic M, Barroso I, Beckmann JS, Biffar R, Boomsma D, Campbell H, Corre T, Erdmann J, Esko T, Fischer K, Franceschini N, Frayling TM, Girotto G, Gonzalez JR, Harris TB, Heath AC, Heid IM, Hoffmann W, Hofman A, Horikoshi M, Zhao JH, Jackson AU, Hottenga JJ, Jula A, Kähönen M, Khaw KT, Kiemeney LA, Klopp N, Kutalik Z, Lagou V, Launer LJ, Lehtimäki T, Lemire M, Lokki ML, Loley C, Luan J, Mangino M, Mateo Leach I, Medland SE, Mihailov E, Montgomery GW, Navis G, Newnham J, Nieminen MS, Palotie A, Panoutsopoulou K, Peters A, Pirastu N, Polasek O, Rehnström K, Ripatti S, Ritchie GR, Rivadeneira F, Robino A, Samani NJ, Shin SY, Sinisalo J, Smit JH, Soranzo N, Stolk L, Swinkels DW, Tanaka T, Teumer A, Tönjes A, Traglia M, Tuomilehto J, Valsesia A, van Gilst WH, van Meurs JB, Smith AV, Viikari J, Vink JM, Waeber G, Warrington NM, Widen E, Willemsen G, Wright AF, Zanke BW, Zgaga L, Wellcome Trust Case Control Consortium, Boehnke M, d'Adamo AP, de Geus E, Demerath EW, den Heijer M, Eriksson JG, Ferrucci L, Gieger C, Gudnason V, Hayward C, Hengstenberg C, Hudson TJ, Järvelin MR, Kogevinas M, Loos RJ, Martin NG, Metspalu A, Pennell CE, Penninx BW, Perola M, Raitakari O, Salomaa V, Schreiber S, Schunkert H, Spector TD, Stumvoll M, Uitterlinden AG, Ulivi S, van der Harst P, Vollenweider P, Völzke H, Wareham NJ, Wichmann HE, Wilson JF, Rudan I, Xue Y and Zeggini E

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. vboraska@mefst.hr

    The male-to-female sex ratio at birth is constant across world populations with an average of 1.06 (106 male to 100 female live births) for populations of European descent. The sex ratio is considered to be affected by numerous biological and environmental factors and to have a heritable component. The aim of this study was to investigate the presence of common allele modest effects at autosomal and chromosome X variants that could explain the observed sex ratio at birth. We conducted a large-scale genome-wide association scan (GWAS) meta-analysis across 51 studies, comprising overall 114 863 individuals (61 094 women and 53 769 men) of European ancestry and 2 623 828 common (minor allele frequency >0.05) single-nucleotide polymorphisms (SNPs). Allele frequencies were compared between men and women for directly-typed and imputed variants within each study. Forward-time simulations for unlinked, neutral, autosomal, common loci were performed under the demographic model for European populations with a fixed sex ratio and a random mating scheme to assess the probability of detecting significant allele frequency differences. We do not detect any genome-wide significant (P < 5 × 10⁻⁸) common SNP differences between men and women in this well-powered meta-analysis. The simulated data provided results entirely consistent with these findings. This large-scale investigation across ~115 000 individuals shows no detectable contribution from common genetic variants to the observed skew in the sex ratio. The absence of sex-specific differences is useful in guiding genetic association study design, for example when using mixed controls for sex-biased traits.

    Funded by: Canadian Institutes of Health Research: MOP-82893; Cancer Research UK; Chief Scientist Office: CZB/4/710; Medical Research Council: G0401527, G1000143, G1001799, MC_PC_U127561128, MC_U106179471, MC_U127561128; NCRR NIH HHS: RR018787, UL1RR025005; NHGRI NIH HHS: U01HG004402; NHLBI NIH HHS: HL65234, HL67466, R01HL086694, R01HL087641, R01HL59367; NIA NIH HHS: N.1-AG-1-1, N.1-AG-1-2111, N01-AG-1-2100, N01-AG-5-0002; NIAAA NIH HHS: AA07535, AA10248, AA13320, AA13321, AA13326, AA14041; NIDDK NIH HHS: DK062370; NIMH NIH HHS: MH081802, MH66206, R01 MH059160, U24 MH068457-06; NLM NIH HHS: LM010098; PHS HHS: HHSN268200625226C, HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN268201100010C, HHSN268201100011C, HHSN268201100012C; Wellcome Trust: 076113, 089062/Z/09/Z, 092447/Z/10/Z, 095831, 098051, 89061/Z/09/Z

    Human molecular genetics 2012;21;21;4805-15

  • 'Sifting the significance from the data' - the impact of high-throughput genomic technologies on human genetics and health care.

    Clarke AJ, Cooper DN, Krawczak M, Tyler-Smith C, Wallace HM, Wilkie AO, Raymond FL, Chadwick R, Craddock N, John R, Gallacher J and Chiano M

    Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, Wales CF14 4XN, UK. clarkeaj@cardiff.ac.uk

    This report is of a round-table discussion held in Cardiff in September 2009 for Cesagen, a research centre within the Genomics Network of the UK's Economic and Social Research Council. The meeting was arranged to explore ideas as to the likely future course of human genomics. The achievements of genomics research were reviewed, and the likely constraints on the pace of future progress were explored. New knowledge is transforming biology and our understanding of evolution and human disease. The difficulties we face now concern the interpretation rather than the generation of new sequence data. Our understanding of gene-environment interaction is held back by our current primitive tools for measuring environmental factors, and in addition, there may be fundamental constraints on what can be known about these complex interactions.

    Funded by: Wellcome Trust

    Human genomics 2012;6;11

  • IFITM3 restricts the morbidity and mortality associated with influenza.

    Everitt AR, Clare S, Pertel T, John SP, Wash RS, Smith SE, Chin CR, Feeley EM, Sims JS, Adams DJ, Wise HM, Kane L, Goulding D, Digard P, Anttila V, Baillie JK, Walsh TS, Hume DA, Palotie A, Xue Y, Colonna V, Tyler-Smith C, Dunning J, Gordon SB, GenISIS Investigators, MOSAIC Investigators, Smyth RL, Openshaw PJ, Dougan G, Brass AL and Kellam P

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    The 2009 H1N1 influenza pandemic showed the speed with which a novel respiratory virus can spread and the ability of a generally mild infection to induce severe morbidity and mortality in a subset of the population. Recent in vitro studies show that the interferon-inducible transmembrane (IFITM) protein family members potently restrict the replication of multiple pathogenic viruses. Both the magnitude and breadth of the IFITM proteins' in vitro effects suggest that they are critical for intrinsic resistance to such viruses, including influenza viruses. Using a knockout mouse model, we now test this hypothesis directly and find that IFITM3 is essential for defending the host against influenza A virus in vivo. Mice lacking Ifitm3 display fulminant viral pneumonia when challenged with a normally low-pathogenicity influenza virus, mirroring the destruction inflicted by the highly pathogenic 1918 'Spanish' influenza. Similar increased viral replication is seen in vitro, with protection rescued by the re-introduction of Ifitm3. To test the role of IFITM3 in human influenza virus infection, we assessed the IFITM3 alleles of individuals hospitalized with seasonal or pandemic influenza H1N1/09 viruses. We find that a statistically significant number of hospitalized subjects show enrichment for a minor IFITM3 allele (SNP rs12252-C) that alters a splice acceptor site, and functional assays show the minor CC genotype IFITM3 has reduced influenza virus restriction in vitro. Together these data reveal that the action of a single intrinsic immune effector, IFITM3, profoundly alters the course of influenza virus infection in mouse and humans.

    Funded by: Chief Scientist Office; Medical Research Council: G0600511, G0800767, G0800777, G0802752, G0901697, MC_G1001212, MC_U122785833; NIAID NIH HHS: R01 AI091786, R01AI091786; Wellcome Trust: 090382, 090382/Z/09/Z, 090385/Z/09/Z, 098051

    Nature 2012;484;7395;519-23

  • Afghanistan's ethnic groups share a Y-chromosomal heritage structured by historical events.

    Haber M, Platt DE, Ashrafian Bonab M, Youhanna SC, Soria-Hernanz DF, Martínez-Cruz B, Douaihy B, Ghassibe-Sabbagh M, Rafatpanah H, Ghanbari M, Whale J, Balanovsky O, Wells RS, Comas D, Tyler-Smith C, Zalloua PA and Genographic Consortium

    The Lebanese American University, Chouran, Beirut, Lebanon.

    Afghanistan has held a strategic position throughout history. It has been inhabited since the Paleolithic and later became a crossroad for expanding civilizations and empires. Afghanistan's location, history, and diverse ethnic groups present a unique opportunity to explore how nations and ethnic groups emerged, and how major cultural evolutions and technological developments in human history have influenced modern population structures. In this study we have analyzed, for the first time, the four major ethnic groups in present-day Afghanistan: Hazara, Pashtun, Tajik, and Uzbek, using 52 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y-chromosome. A total of 204 Afghan samples were investigated along with more than 8,500 samples from surrounding populations important to Afghanistan's history through migrations and conquests, including Iranians, Greeks, Indians, Middle Easterners, East Europeans, and East Asians. Our results suggest that all current Afghans largely share a heritage derived from a common unstructured ancestral population that could have emerged during the Neolithic revolution and the formation of the first farming communities. Our results also indicate that inter-Afghan differentiation started during the Bronze Age, probably driven by the formation of the first civilizations in the region. Later migrations and invasions into the region have been assimilated differentially among the ethnic groups, increasing inter-population genetic differences, and giving the Afghans a unique genetic diversity in Central Asia.

    Funded by: Wellcome Trust

    PloS one 2012;7;3;e34288

  • Exploration of signals of positive selection derived from genotype-based human genome scans using re-sequencing data.

    Hu M, Ayub Q, Guerra-Assunção JA, Long Q, Ning Z, Huang N, Romero IG, Mamanova L, Akan P, Liu X, Coffey AJ, Turner DJ, Swerdlow H, Burton J, Quail MA, Conrad DF, Enright AJ, Tyler-Smith C and Xue Y

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA, UK.

    We have investigated whether regions of the genome showing signs of positive selection in scans based on haplotype structure also show evidence of positive selection when sequence-based tests are applied, whether the target of selection can be localized more precisely, and whether such extra evidence can lead to increased biological insights. We used two tools: simulations under neutrality or selection, and experimental investigation of two regions identified by the HapMap2 project as putatively selected in human populations. Simulations suggested that neutral and selected regions should be readily distinguished and that it should be possible to localize the selected variant to within 40 kb at least half of the time. Re-sequencing of two ~300 kb regions (chr4:158Mb and chr10:22Mb) lacking known targets of selection in HapMap CHB individuals provided strong evidence for positive selection within each and suggested the micro-RNA gene hsa-miR-548c as the best candidate target in one region, and changes in regulation of the sperm protein gene SPAG6 in the other.

    Funded by: Wellcome Trust: 077009

    Human genetics 2012;131;5;665-74

  • A systematic survey of loss-of-function variants in human protein-coding genes.

    MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, 1000 Genomes Project Consortium, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB and Tyler-Smith C

    Wellcome Trust Sanger Institute, Hinxton, UK. macarthur@atgu.mgh.harvard.edu

    Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

    Funded by: British Heart Foundation: RG/09/012/28096; NHGRI NIH HHS: U54 HG003273; Wellcome Trust: 085532, 090532, 090532/Z/09/Z, 098051

    Science (New York, N.Y.) 2012;335;6070;823-8

  • High altitude adaptation in Daghestani populations from the Caucasus.

    Pagani L, Ayub Q, MacArthur DG, Xue Y, Baillie JK, Chen Y, Kozarewa I, Turner DJ, Tofanelli S, Bulayeva K, Kidd K, Paoli G and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, UK. lp8@sanger.ac.uk

    We have surveyed 15 high-altitude adaptation candidate genes for signals of positive selection in North Caucasian highlanders using targeted re-sequencing. A total of 49 unrelated Daghestani from three ethnic groups (Avars, Kubachians, and Laks) living in ancient villages located at around 2,000 m above sea level were chosen as the study population. Caucasian (Adygei living at sea level, N = 20) and CEU (CEPH Utah residents with ancestry from northern and western Europe; N = 20) were used as controls. Candidate genes were compared with 20 putatively neutral control regions resequenced in the same individuals. The regions of interest were amplified by long-PCR, pooled according to individual, indexed by adding an eight-nucleotide tag, and sequenced using the Illumina GAII platform. 1,066 SNPs were called using false discovery and false negative thresholds of ~6%. The neutral regions provided an empirical null distribution to compare with the candidate genes for signals of selection. Two genes stood out. In Laks, a non-synonymous variant within HIF1A already known to be associated with improvement in oxygen metabolism was rediscovered, and in Kubachians a cluster of 13 SNPs located in a conserved intronic region within EGLN1 showing high population differentiation was found. These variants illustrate both the common pathways of adaptation to high altitude in different populations and features specific to the Daghestani populations, showing how even a mildly hypoxic environment can lead to genetic adaptation.

    Funded by: Wellcome Trust

    Human genetics 2012;131;3;423-33

  • Ethiopian genetic diversity reveals linguistic stratification and complex influences on the Ethiopian gene pool.

    Pagani L, Kivisild T, Tarekegn A, Ekong R, Plaster C, Gallego Romero I, Ayub Q, Mehdi SQ, Thomas MG, Luiselli D, Bekele E, Bradman N, Balding DJ and Tyler-Smith C

    Division of Biological Anthropology, University of Cambridge, UK. lp8@sanger.ac.uk

    Humans and their ancestors have traversed the Ethiopian landscape for millions of years, and present-day Ethiopians show great cultural, linguistic, and historical diversity, which makes them essential for understanding African variability and human origins. We genotyped 235 individuals from ten Ethiopian and two neighboring (South Sudanese and Somali) populations on an Illumina Omni 1M chip. Genotypes were compared with published data from several African and non-African populations. Principal-component and STRUCTURE-like analyses confirmed substantial genetic diversity both within and between populations, and revealed a match between genetic data and linguistic affiliation. Using comparisons with African and non-African reference samples in 40-SNP genomic windows, we identified "African" and "non-African" haplotypic components for each Ethiopian individual. The non-African component, which includes the SLC24A5 allele associated with light skin pigmentation in Europeans, may represent gene flow into Africa, which we estimate to have occurred ~3 thousand years ago (kya). The non-African component was found to be more similar to populations inhabiting the Levant rather than the Arabian Peninsula, but the principal route for the expansion out of Africa ~60 kya remains unresolved. Linkage-disequilibrium decay with genomic distance was less rapid in both the whole genome and the African component than in southern African samples, suggesting a less ancient history for Ethiopian populations.

    Funded by: Wellcome Trust: 098051

    American journal of human genetics 2012;91;1;83-96

  • Impact of restricted marital practices on genetic variation in an endogamous Gujarati group.

    Pemberton TJ, Li FY, Hanson EK, Mehta NU, Choi S, Ballantyne J, Belmont JW, Rosenberg NA, Tyler-Smith C and Patel PI

    Institute for Genetic Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA. trevorp@stanford.edu

    Recent studies have examined the influence on patterns of human genetic variation of a variety of cultural practices. In India, centuries-old marriage customs have introduced extensive social structuring into the contemporary population, potentially with significant consequences for genetic variation. Social stratification in India is evident as social classes that are defined by endogamous groups known as castes. Within a caste, there exist endogamous groups known as gols (marriage circles), each of which comprises a small number of exogamous gotra (lineages). Thus, while consanguinity is strictly avoided and some randomness in mate selection occurs within the gol, gene flow is limited with groups outside the gol. Gujarati Patels practice this form of "exogamic endogamy." We have analyzed genetic variation in one such group of Gujarati Patels, the Chha Gaam Patels (CGP), who comprise individuals from six villages. Population structure analysis of 1,200 autosomal loci offers support for the existence of distinctive multilocus genotypes in the CGP with respect to both non-Gujaratis and other Gujaratis, and indicates that CGP individuals are genetically very similar. Analysis of Y-chromosomal and mitochondrial haplotypes provides support for both patrilocal and patrilineal practices within the gol, and a low-level of female gene flow into the gol. Our study illustrates how the practice of gol endogamy has introduced fine-scale genetic structure into the population of India, and contributes more generally to an understanding of the way in which marriage practices affect patterns of genetic variation.

    Funded by: NCI NIH HHS: CA62528-01; NCRR NIH HHS: RR10600-01, RR14514-01; NICHD NIH HHS: P30 HD024064; NIGMS NIH HHS: GM081441, R01 GM081441; Wellcome Trust

    American journal of physical anthropology 2012;149;1;92-103

  • Evolutionary genetics of the human Rh blood group system.

    Perry GH, Xue Y, Smith RS, Meyer WK, Calışkan M, Yanez-Cuna O, Lee AS, Gutiérrez-Arcelus M, Ober C, Hollox EJ, Tyler-Smith C and Lee C

    Department of Anthropology, Pennsylvania State University, University Park, PA 16801, USA.

    The evolutionary history of variation in the human Rh blood group system, determined by variants in the RHD and RHCE genes, has long been an unresolved puzzle in human genetics. Prior to medical treatments and interventions developed in the last century, the D-positive (RhD positive) children of D-negative (RhD negative) women were at risk for hemolytic disease of the newborn, if the mother produced anti-D antibodies following sensitization to the blood of a previous D-positive child. Given the deleterious fitness consequences of this disease, the appreciable frequencies in European populations of the responsible RHD gene deletion variant (for example, 0.43 in our study) seem surprising. In this study, we used new molecular and genomic data generated from four HapMap population samples to test the idea that positive selection for an as-of-yet unknown fitness benefit of the RHD deletion may have offset the otherwise negative fitness effects of hemolytic disease of the newborn. We found no evidence that positive natural selection affected the frequency of the RHD deletion. Thus, the initial rise to intermediate frequency of the RHD deletion in European populations may simply be explained by genetic drift/founder effect, or by an older or more complex sweep that we are insufficiently powered to detect. However, our simulations recapitulate previous findings that selection on the RHD deletion is frequency dependent and weak or absent near 0.5. Therefore, once such a frequency was achieved, it could have been maintained by a relatively small amount of genetic drift. We unexpectedly observed evidence for positive selection on the C allele of RHCE in non-African populations (on chromosomes with intact copies of the RHD gene) in the form of an unusually high F(ST) value and the high frequency of a single haplotype carrying the C allele. RhCE function is not well understood, but the C/c antigenic variant is clinically relevant and can result in hemolytic disease of the newborn, albeit much less commonly and severely than that related to the D-negative blood type. Therefore, the potential fitness benefits of the RHCE C allele are currently unknown but merit further exploration.

    Funded by: Medical Research Council: G0801123, GO801123; NHGRI NIH HHS: P41-HG004221; NICHD NIH HHS: R01-HD21244; Wellcome Trust: 098051, WT098051

    Human genetics 2012;131;7;1205-16
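
    The abstract above reports an unusually high F(ST) value without saying how the statistic is calculated. As a rough illustration only, the sketch below computes Wright's F(ST) for a single biallelic locus from per-population allele frequencies, weighting populations equally. The function name and the example frequencies are hypothetical and are not taken from the paper, which may have used a different estimator.

```python
def wright_fst(allele_freqs):
    """Wright's F(ST) for one biallelic locus.

    allele_freqs: frequency of the same allele in each population
    (hypothetical input; populations are weighted equally).
    """
    n_pops = len(allele_freqs)
    # Mean allele frequency across populations.
    p_bar = sum(allele_freqs) / n_pops
    # Expected heterozygosity of the pooled (total) population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    # Mean expected heterozygosity within populations.
    h_within = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / n_pops
    if h_total == 0.0:
        # Locus is fixed for the same allele everywhere: no differentiation.
        return 0.0
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Illustrative frequencies only: strongly differentiated vs identical.
    print(wright_fst([0.1, 0.9]))
    print(wright_fst([0.5, 0.5]))
```

    Values near 0 indicate similar allele frequencies across populations, and values approaching 1 indicate strong differentiation, which is why an outlying F(ST) at one locus can hint at population-specific selection.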

  • Insights into hominid evolution from the gorilla genome sequence.

    Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth A, Lappalainen T, Mailund T, Marques-Bonet T, McCarthy S, Montgomery SH, Schwalie PC, Tang YA, Ward MC, Xue Y, Yngvadottir B, Alkan C, Andersen LN, Ayub Q, Ball EV, Beal K, Bradley BJ, Chen Y, Clee CM, Fitzgerald S, Graves TA, Gu Y, Heath P, Heger A, Karakoc E, Kolb-Kokocinski A, Laird GK, Lunter G, Meader S, Mort M, Mullikin JC, Munch K, O'Connor TD, Phillips AD, Prado-Martinez J, Rogers AS, Sajjadian S, Schmidt D, Shaw K, Simpson JT, Stenson PD, Turner DJ, Vigilant L, Vilella AJ, Whitener W, Zhu B, Cooper DN, de Jong P, Dermitzakis ET, Eichler EE, Flicek P, Goldman N, Mundy NI, Ning Z, Odom DT, Ponting CP, Quail MA, Ryder OA, Searle SM, Warren WC, Wilson RK, Schierup MH, Rogers J, Tyler-Smith C and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.

    Funded by: Biotechnology and Biological Sciences Research Council; Cancer Research UK: A15603; Howard Hughes Medical Institute; Medical Research Council: G0501331, G0701805; NHGRI NIH HHS: HG002385, U54 HG003079; Wellcome Trust: 062023, 075491/Z/04, 077009, 077192, 077198, 089066, 090532, 095908, WT062023, WT077009, WT077192, WT077198, WT089066

    Nature 2012;483;7388;169-75

  • A British approach to sampling.

    Tyler-Smith C and Xue Y

    Funded by: Wellcome Trust: 077009

    European journal of human genetics : EJHG 2012;20;2;129-30

  • Sibling rivalry among paralogs promotes evolution of the human brain.

    Tyler-Smith C and Xue Y

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. cts@sanger.ac.uk

    Geneticists have long sought to identify the genetic changes that made us human, but pinpointing the functionally relevant changes has been challenging. Two papers in this issue suggest that partial duplication of SRGAP2, producing an incomplete protein that antagonizes the original, contributed to human brain evolution.

    Funded by: Wellcome Trust: 098051

    Cell 2012;149;4;737-9

  • Deleterious- and disease-allele prevalence in healthy individuals: insights from current predictions, mutation databases, and population-scale resequencing.

    Xue Y, Chen Y, Ayub Q, Huang N, Ball EV, Mort M, Phillips AD, Shaw K, Stenson PD, Cooper DN, Tyler-Smith C and 1000 Genomes Project Consortium

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    We have assessed the numbers of potentially deleterious variants in the genomes of apparently healthy humans by using (1) low-coverage whole-genome sequence data from 179 individuals in the 1000 Genomes Pilot Project and (2) current predictions and databases of deleterious variants. Each individual carried 281-515 missense substitutions, 40-85 of which were homozygous, predicted to be highly damaging. They also carried 40-110 variants classified by the Human Gene Mutation Database (HGMD) as disease-causing mutations (DMs), 3-24 variants in the homozygous state, and many polymorphisms putatively associated with disease. Whereas many of these DMs are likely to represent disease-allele-annotation errors, between 0 and 8 DMs (0-1 homozygous) per individual are predicted to be highly damaging, and some of them provide information of medical relevance. These analyses emphasize the need for improved annotation of disease alleles both in mutation databases and in the primary literature; some HGMD mutation data have been recategorized on the basis of the present findings, an iterative process that is both necessary and ongoing. Our estimates of deleterious-allele numbers are likely to be subject to both overcounting and undercounting. However, our current best mean estimates of ~400 damaging variants and ~2 bona fide disease mutations per individual are likely to increase rather than decrease as sequencing studies ascertain rare variants more effectively and as additional disease alleles are discovered.

    Funded by: Wellcome Trust: 085532, WT098051

    American journal of human genetics 2012;91;6;1022-32

Team publications 2011

  • Dindel: accurate indel calls from short-read data.

    Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH and Durbin R

    Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, United Kingdom. caa@sanger.ac.uk

    Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.

    Funded by: British Heart Foundation: RG/09/012/28096; Wellcome Trust: 086084, 090532, WT089088/Z/09/Z

    Genome research 2011;21;6;961-73

  • Comprehensive comparison of three commercial human whole-exome capture platforms.

    Asan, Xu Y, Jiang H, Tyler-Smith C, Xue Y, Jiang T, Wang J, Wu M, Liu X, Tian G, Wang J, Wang J, Yang H and Zhang X

    Beijing Institute of Genomics, Chinese Academy of Sciences, No.7 Beitucheng West Road, Chaoyang District, Beijing 100029, China. asan@genomics.org.cn

    Background: Exome sequencing, which allows the global analysis of protein coding sequences in the human genome, has become an effective and affordable approach to detecting causative genetic mutations in diseases. Currently, there are several commercial human exome capture platforms; however, the relative performances of these have not been characterized sufficiently to know which is best for a particular study.

    Results: We comprehensively compared three platforms: NimbleGen's Sequence Capture Array and SeqCap EZ, and Agilent's SureSelect. We assessed their performance in a variety of ways, including number of genes covered and capture efficacy. Differences that may impact on the choice of platform were that Agilent SureSelect covered approximately 1,100 more genes, while NimbleGen provided better flanking sequence capture. Although all three platforms achieved similar capture specificity of targeted regions, the NimbleGen platforms showed better uniformity of coverage and greater genotype sensitivity at 30- to 100-fold sequencing depth. All three platforms showed similar power in exome SNP calling, including medically relevant SNPs. Compared with genotyping and whole-genome sequencing data, the three platforms achieved a similar accuracy of genotype assignment and SNP detection. Importantly, all three platforms showed similar levels of reproducibility, GC bias and reference allele bias.

    Conclusions: We demonstrate key differences between the three platforms, particularly advantages of solutions over array capture and the importance of a large gene target set.

    Funded by: Wellcome Trust

    Genome biology 2011;12;9;R95

  • Male lineages in the Himalayan foothills: a commentary on Y-chromosome haplogroup diversity in the sub-Himalayan Terai and Duars populations of East India.

    Ayub Q

    Journal of human genetics 2011;56;12;813-4

  • Parallel evolution of genes and languages in the Caucasus region.

    Balanovsky O, Dibirova K, Dybo A, Mudrak O, Frolova S, Pocheshkhova E, Haber M, Platt D, Schurr T, Haak W, Kuznetsova M, Radzhabov M, Balaganskaya O, Romanov A, Zakharova T, Soria Hernanz DF, Zalloua P, Koshel S, Ruhlen M, Renfrew C, Wells RS, Tyler-Smith C, Balanovska E and Genographic Consortium

    Research Centre for Medical Genetics, Russian Academy of Medical Sciences, Moscow, Russia. balanovsky@inbox.ru

    We analyzed 40 single nucleotide polymorphism and 19 short tandem repeat Y-chromosomal markers in a large sample of 1,525 indigenous individuals from 14 populations in the Caucasus and 254 additional individuals representing potential source populations. We also employed a lexicostatistical approach to reconstruct the history of the languages of the North Caucasian family spoken by the Caucasus populations. We found a different major haplogroup to be prevalent in each of four sets of populations that occupy distinct geographic regions and belong to different linguistic branches. The haplogroup frequencies correlated with geography and, even more strongly, with language. Within haplogroups, a number of haplotype clusters were shown to be specific to individual populations and languages. The data suggested a direct origin of Caucasus male lineages from the Near East, followed by high levels of isolation, differentiation, and genetic drift in situ. Comparison of genetic and linguistic reconstructions covering the last few millennia showed striking correspondences between the topology and dates of the respective gene and language trees and with documented historical events. Overall, in the Caucasus region, unmatched levels of gene-language coevolution occurred within geographically isolated populations, probably due to its mountainous terrain.

    Funded by: Wellcome Trust: 077009

    Molecular biology and evolution 2011;28;10;2905-20

  • Gene inactivation and its implications for annotation in the era of personal genomics.

    Balasubramanian S, Habegger L, Frankish A, MacArthur DG, Harte R, Tyler-Smith C, Harrow J and Gerstein M

    Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.

    The first wave of personal genomes documents how no single individual genome contains the full complement of functional genes. Here, we describe the extent of variation in gene and pseudogene numbers between individuals arising from inactivation events such as premature termination or aberrant splicing due to single-nucleotide polymorphisms. This highlights the inadequacy of the current reference sequence and gene set. We present a proposal to define a reference gene set that will remain stable as more individuals are sequenced. In particular, we recommend that the ancestral allele be used to define the reference sequence from which a core human reference gene annotation set can be derived. In addition, we call for the development of an expanded gene set to include human-specific genes that have arisen recently and are absent from the ancestral set.

    Funded by: Wellcome Trust

    Genes & development 2011;25;1;1-10

  • Population genetic structure in Indian Austroasiatic speakers: the role of landscape barriers and sex-specific admixture.

    Chaubey G, Metspalu M, Choi Y, Mägi R, Romero IG, Soares P, van Oven M, Behar DM, Rootsi S, Hudjashov G, Mallick CB, Karmin M, Nelis M, Parik J, Reddy AG, Metspalu E, van Driem G, Xue Y, Tyler-Smith C, Thangaraj K, Singh L, Remm M, Richards MB, Lahr MM, Kayser M, Villems R and Kivisild T

    Department of Evolutionary Biology, Institute of Molecular and Cell Biology, University of Tartu and Estonian Biocentre, Tartu, Estonia.

    The geographic origin and time of dispersal of Austroasiatic (AA) speakers, presently settled in south and southeast Asia, remains disputed. Two rival hypotheses, both assuming a demic component to the language dispersal, have been proposed. The first of these places the origin of Austroasiatic speakers in southeast Asia with a later dispersal to south Asia during the Neolithic, whereas the second hypothesis advocates pre-Neolithic origins and dispersal of this language family from south Asia. To test the two alternative models, this study combines the analysis of uniparentally inherited markers with 610,000 common single nucleotide polymorphism loci from the nuclear genome. Indian AA speakers have high frequencies of Y chromosome haplogroup O2a; our results show that this haplogroup has significantly higher diversity and coalescent time (17-28 thousand years ago) in southeast Asia, strongly supporting the first of the two hypotheses. Nevertheless, the results of principal component and "structure-like" analyses on autosomal loci also show that the population history of AA speakers in India is more complex, being characterized by two ancestral components: one represented in the pattern of Y chromosomal and EDAR results and the other by mitochondrial DNA diversity and genomic structure. We propose that AA speakers in India today are derived from dispersal from southeast Asia, followed by extensive sex-specific admixture with local Indian populations.

    Funded by: Wellcome Trust: 077009

    Molecular biology and evolution 2011;28;2;1013-24

  • A world in a grain of sand: human history from genetic data.

    Colonna V, Pagani L, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK.

    Genome-wide genotypes and sequences are enriching our understanding of the past 50,000 years of human history and providing insights into earlier periods largely inaccessible to mitochondrial DNA and Y-chromosomal studies. "To see a world in a grain of sand ..." William Blake, Auguries of Innocence.

    Funded by: Wellcome Trust

    Genome biology 2011;12;11;234

  • Contrasting signals of positive selection in genes involved in human skin-color variation from tests based on SNP scans and resequencing.

    de Gruijter JM, Lao O, Vermeulen M, Xue Y, Woodwark C, Gillson CJ, Coffey AJ, Ayub Q, Mehdi SQ, Kayser M and Tyler-Smith C

    Department of Forensic Molecular Biology, Erasmus MC University Medical Center, PO Box 2040, Rotterdam, 3000 CA, The Netherlands. o.laogrueso@erasmusmc.nl.

    Background: Numerous genome-wide scans conducted by genotyping previously ascertained single-nucleotide polymorphisms (SNPs) have provided candidate signatures for positive selection in various regions of the human genome, including in genes involved in pigmentation traits. However, it is unclear how well the signatures discovered by such haplotype-based test statistics can be reproduced in tests based on full resequencing data. Four genes (oculocutaneous albinism II (OCA2), tyrosinase-related protein 1 (TYRP1), dopachrome tautomerase (DCT), and KIT ligand (KITLG)) implicated in human skin-color variation, have shown evidence for positive selection in Europeans and East Asians in previous SNP-scan data. In the current study, we resequenced 4.7 to 6.7 kb of DNA from each of these genes in Africans, Europeans, East Asians, and South Asians.

    Results: Applying all commonly used neutrality-test statistics for allele frequency distribution to the newly generated sequence data provided conflicting results regarding evidence for positive selection. Previous haplotype-based findings could not be clearly confirmed. Although some tests were marginally significant for some populations and genes, none of them were significant after multiple-testing correction. Combined P values for each gene-population pair did not improve these results. Application of an Approximate Bayesian Computation Markov chain Monte Carlo based approach to these sequence data using a simple forward simulator revealed broad posterior distributions of the selective parameters for all four genes, providing no support for positive selection. However, when we applied this approach to published sequence data on SLC45A2, another human pigmentation candidate gene, we could readily confirm evidence for positive selection, as previously detected with sequence-based and some haplotype-based tests.

    Conclusions: Overall, our data indicate that even genes that are strong biological candidates for positive selection and show reproducible signatures of positive selection in SNP scans do not always show the same replicability of selection signals in other tests, which should be considered in future studies on detecting positive selection in genetic data.

    Investigative genetics 2011;2;1;24

  • Influences of history, geography, and religion on genetic structure: the Maronites in Lebanon.

    Haber M, Platt DE, Badro DA, Xue Y, El-Sibai M, Bonab MA, Youhanna SC, Saade S, Soria-Hernanz DF, Royyuru A, Wells RS, Tyler-Smith C, Zalloua PA and Genographic Consortium

    The Lebanese American University, Chouran, Beirut, Lebanon.

    Cultural expansions, including of religions, frequently leave genetic traces of differentiation and in-migration. These expansions may be driven by complex doctrinal differentiation, together with major population migrations and gene flow. The aim of this study was to explore the genetic signature of the establishment of religious communities in a region where some of the most influential religions originated, using the Y chromosome as an informative male-lineage marker. A total of 3139 samples were analyzed, including 647 Lebanese and Iranian samples newly genotyped for 28 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y chromosome. Genetic organization was identified by geography and religion across Lebanon in the context of surrounding populations important in the expansions of the major sects of Lebanon, including Italy, Turkey, the Balkans, Syria, and Iran by employing principal component analysis, multidimensional scaling, and AMOVA. Timing of population differentiations was estimated using BATWING, in comparison with dates of historical religious events to determine if these differentiations could be caused by religious conversion, or rather, whether religious conversion was facilitated within already differentiated populations. Our analysis shows that the great religions in Lebanon were adopted within already distinguishable communities. Once religious affiliations were established, subsequent genetic signatures of the older differentiations were reinforced. Post-establishment differentiations are most plausibly explained by migrations of peoples seeking refuge to avoid the turmoil of major historical events.

    Funded by: Wellcome Trust

    European journal of human genetics : EJHG 2011;19;3;334-40

  • Y-chromosome R-M343 African lineages and sickle cell disease reveal structured assimilation in Lebanon.

    Haber M, Platt DE, Khoury S, Badro DA, Abboud M, Tyler-Smith C and Zalloua PA

    Medical School, The Lebanese American University, Beirut, Lebanon.

    We have sought to identify signals of assimilation of African male lines in Lebanon by exploring the association of sickle cell disease (SCD) in Lebanon with Y-chromosome haplogroups that are informative of the disease origin and its exclusivity to the Muslim community. A total of 732 samples were analyzed, including 33 SCD patients from Lebanon genotyped for 28 binary markers and 19 short tandem repeats on the non-recombinant segment of the Y chromosome. Genetic organization was identified using populations known to have influenced the genetic structure of the Lebanese population, in addition to African populations with high incidence of SCD. Y-chromosome haplogroup R-M343 sub-lineages distinguish between sub-Saharan African and Lebanese Y chromosomes. We detected a limited penetration of SCD into Lebanese R-M343 carriers, restricted to Lebanese Muslims. We suggest that this penetration brought the sickle cell gene along with the African R-M343, probably with the Saharan caravan slave trade.

    Funded by: Wellcome Trust: 077009

    Journal of human genetics 2011;56;1;29-33

  • A worldwide analysis of beta-defensin copy number variation suggests recent selection of a high-expressing DEFB103 gene copy in East Asia.

    Hardwick RJ, Machado LR, Zuccherato LW, Antolinos S, Xue Y, Shawa N, Gilman RH, Cabrera L, Berg DE, Tyler-Smith C, Kelly P, Tarazona-Santos E and Hollox EJ

    Department of Genetics, University of Leicester, University Road, Leicester, United Kingdom.

    Beta-defensins are a family of multifunctional genes with roles in defense against pathogens, reproduction, and pigmentation. In humans, six beta-defensin genes are clustered in a repeated region which is copy-number variable (CNV) as a block, with a diploid copy number between 1 and 12. The role in host defense makes the evolutionary history of this CNV particularly interesting, because morbidity due to infectious disease is likely to have been an important selective force in human evolution, and to have varied between geographical locations. Here, we show CNV of the beta-defensin region in chimpanzees, and identify a beta-defensin block in the human lineage that contains rapidly evolving noncoding regulatory sequences. We also show that variation at one of these rapidly evolving sequences affects expression levels and cytokine responsiveness of DEFB103, a key inhibitor of influenza virus fusion at the cell surface. A worldwide analysis of beta-defensin CNV in 67 populations shows an unusually high frequency of high-DEFB103-expressing copies in East Asia, the geographical origin of historical and modern influenza epidemics, possibly as a result of selection for increased resistance to influenza in this region.

    Funded by: Medical Research Council: G0801123, GO801123; Wellcome Trust: 067948, 077009, 087663

    Human mutation 2011;32;7;743-50

  • PoolHap: inferring haplotype frequencies from pooled samples by next generation sequencing.

    Long Q, Jeffares DC, Zhang Q, Ye K, Nizhynska V, Ning Z, Tyler-Smith C and Nordborg M

    Gregor Mendel Institute, Vienna, Austria. quan.long@gmi.oeaw.ac.at

    With the advance of next-generation sequencing (NGS) technologies, increasingly ambitious applications are becoming feasible. A particularly powerful one is the sequencing of polymorphic, pooled samples. The pool can be naturally occurring, as in the case of multiple pathogen strains in a blood sample, multiple types of cells in a cancerous tissue sample, or multiple isoforms of mRNA in a cell. In these cases, it's difficult or impossible to partition the subtypes experimentally before sequencing, and those subtype frequencies must hence be inferred. In addition, investigators may occasionally want to artificially pool the sample of a large number of individuals for reasons of cost-efficiency, e.g., when carrying out genetic mapping using bulked segregant analysis. Here we describe PoolHap, a computational tool for inferring haplotype frequencies from pooled samples when haplotypes are known. The key insight into why PoolHap works is that the large number of SNPs that come with genome-wide coverage can compensate for the uneven coverage across the genome. The performance of PoolHap is illustrated and discussed using simulated and real data. We show that PoolHap is able to accurately estimate the proportions of haplotypes with less than 2% error for 34-strain mixtures with 2X total coverage Arabidopsis thaliana whole genome polymorphism data. This method should facilitate greater biological insight into heterogeneous samples that are difficult or impossible to isolate experimentally. Software and users manual are freely available at http://arabidopsis.gmi.oeaw.ac.at/quan/poolhap/.

    Funded by: Wellcome Trust: 085775/Z/08/Z

    PloS one 2011;6;1;e15292
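
    The PoolHap abstract above describes inferring haplotype frequencies from pooled reads when the haplotypes themselves are known. The sketch below illustrates that general idea with a simple expectation-maximisation (EM) scheme over per-SNP read counts. It is an assumption-laden illustration, not PoolHap's actual algorithm: it ignores sequencing error, treats each read as covering a single SNP, and every name and number in it is hypothetical.

```python
def estimate_haplotype_frequencies(haplotypes, alt_counts, depths, n_iter=500):
    """EM estimate of mixture proportions of known haplotypes in a pool.

    haplotypes: list of haplotypes, each a list of 0/1 alleles per SNP
    alt_counts: number of reads carrying allele 1 at each SNP
    depths:     total number of reads at each SNP
    """
    n_haps = len(haplotypes)
    n_snps = len(alt_counts)
    total_reads = sum(depths)
    # Start from equal frequencies for all haplotypes.
    freqs = [1.0 / n_haps] * n_haps
    for _ in range(n_iter):
        expected = [0.0] * n_haps
        for i in range(n_snps):
            # Probability that a read at SNP i shows allele 1 or allele 0.
            p_alt = sum(freqs[k] * haplotypes[k][i] for k in range(n_haps))
            p_ref = 1.0 - p_alt
            ref_count = depths[i] - alt_counts[i]
            for k in range(n_haps):
                # E-step: share each read among the haplotypes that could
                # have produced its allele, in proportion to their frequency.
                if haplotypes[k][i] == 1 and p_alt > 0.0:
                    expected[k] += alt_counts[i] * freqs[k] / p_alt
                elif haplotypes[k][i] == 0 and p_ref > 0.0:
                    expected[k] += ref_count * freqs[k] / p_ref
        # M-step: new frequency is the expected share of all reads.
        freqs = [e / total_reads for e in expected]
    return freqs


if __name__ == "__main__":
    # Hypothetical pool: three known haplotypes typed at six SNPs.
    known = [
        [0, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0],
        [0, 1, 0, 0, 1, 1],
    ]
    # Read counts consistent with true frequencies 0.5, 0.3 and 0.2.
    alt = [30, 20, 50, 80, 50, 70]
    depth = [100] * 6
    print(estimate_haplotype_frequencies(known, alt, depth))
```

    Because every SNP contributes a little information about the mixture, adding more SNPs can offset low or uneven depth at any one site, which is the intuition the abstract gives for why genome-wide coverage helps.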

  • The functional spectrum of low-frequency coding variation.

    Marth GT, Yu F, Indap AR, Garimella K, Gravel S, Leong WF, Tyler-Smith C, Bainbridge M, Blackwell T, Zheng-Bradley X, Chen Y, Challis D, Clarke L, Ball EV, Cibulskis K, Cooper DN, Fulton B, Hartl C, Koboldt D, Muzny D, Smith R, Sougnez C, Stewart C, Ward A, Yu J, Xue Y, Altshuler D, Bustamante CD, Clark AG, Daly M, DePristo M, Flicek P, Gabriel S, Mardis E, Palotie A, Gibbs R and 1000 Genomes Project

    Department of Biology, Boston College, 140 Commonwealth Avenue, Chestnut Hill, MA 02467, USA. gabor.marth@bc.edu

    Background: Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.

    Results: The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.

    Conclusions: This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.

    Funded by: NHGRI NIH HHS: 1U01HG005211-0109, 5U54HG003273, R01 HG003229, R01 HG004719, R01 HG004960, RC2 HG005552, U54 HG003273; Wellcome Trust: 085532, WT 077009

    Genome biology 2011;12;9;R84

  • Indian Siddis: African descendants with Indian admixture.

    Shah AM, Tamang R, Moorjani P, Rani DS, Govindaraj P, Kulkarni G, Bhattacharya T, Mustak MS, Bhaskar LV, Reddy AG, Gadhvi D, Gai PB, Chaubey G, Patterson N, Reich D, Tyler-Smith C, Singh L and Thangaraj K

    Centre for Cellular and Molecular Biology, Council of Scientific and Industrial Research, Hyderabad, India.

    The Siddis (Afro-Indians) are a tribal population whose members live in coastal Karnataka, Gujarat, and in some parts of Andhra Pradesh. Historical records indicate that the Portuguese brought the Siddis to India from Africa about 300-500 years ago; however, there is little information about their more precise ancestral origins. Here, we perform a genome-wide survey to understand the population history of the Siddis. Using hundreds of thousands of autosomal markers, we show that they have inherited ancestry from Africans, Indians, and possibly Europeans (Portuguese). Additionally, analyses of the uniparental (Y-chromosomal and mitochondrial DNA) markers indicate that the Siddis trace their ancestry to Bantu speakers from sub-Saharan Africa. We estimate that the admixture between the African ancestors of the Siddis and neighboring South Asian groups probably occurred in the past eight generations (∼200 years ago), consistent with historical records.

    American journal of human genetics 2011;89;1;154-61

  • An Exceptional Gene: Evolution of the TSPY Gene Family in Humans and Other Great Apes.

    Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs. CB10 1SA, UK. ylx@sanger.ac.uk.

    The TSPY gene stands out from all other human protein-coding genes because of its high copy number and tandemly-repeated organization. Here, we review its evolutionary history in great apes in order to assess whether these unusual properties are more likely to result from a relaxation of constraint or an unusual functional role. Detailed comparisons with chimpanzee are possible because a finished sequence of the chimpanzee Y chromosome is available, together with more limited data from other apes. These comparisons suggest that the human-chimpanzee ancestral Y chromosome carried a tandem array of TSPY genes which expanded on the human lineage while undergoing multiple duplication events followed by pseudogene formation on the chimpanzee lineage. The protein coding region is the most highly conserved of the multi-copy Y genes in human-chimpanzee comparisons, and the analysis of the dN/dS ratio indicates that TSPY is evolutionarily highly constrained, but may have experienced positive selection after the human-chimpanzee split. We therefore conclude that the exceptionally high copy number in humans is most likely due to a human-specific but unknown functional role, possibly involving rapid production of a large amount of TSPY protein at some stage during spermatogenesis.

    Genes 2011;2;1;36-47

  • Response to the comment on "The hare and the tortoise: One small step for four SNPs, one giant leap for SNP-kind".

    Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs. CB10 1SA, UK.

    The possibility of introducing new sequencing technologies into forensic genetics raises questions that go beyond the choice between SNPs and STRs as the preferred genetic markers. We suggest that many of the novel methodological and technical issues could be incorporated into the likelihood ratio frameworks currently used by forensic scientists. However, changes to ethical and legal structures may be needed before the new information could be used.

    Forensic science international. Genetics 2011;5;4;361-2

  • α-Actinin-3 deficiency is associated with reduced bone mass in human and mouse.

    Yang N, Schindeler A, McDonald MM, Seto JT, Houweling PJ, Lek M, Hogarth M, Morse AR, Raftery JM, Balasuriya D, MacArthur DG, Berman Y, Quinlan KG, Eisman JA, Nguyen TV, Center JR, Prince RL, Wilson SG, Zhu K, Little DG and North KN

    Institute for Neuroscience and Muscle Research, The Children's Hospital at Westmead, Sydney 2145, NSW, Australia. nan.yang@persongen.com

    Bone mineral density (BMD) is a complex trait that is the single best predictor of the risk of osteoporotic fractures. Candidate gene and genome-wide association studies have identified genetic variations in approximately 30 genetic loci associated with BMD variation in humans. α-Actinin-3 (ACTN3) is highly expressed in fast skeletal muscle fibres. There is a common null-polymorphism R577X in human ACTN3 that results in complete deficiency of the α-actinin-3 protein in approximately 20% of Eurasians. Absence of α-actinin-3 does not cause any disease phenotypes in muscle because of compensation by α-actinin-2. However, α-actinin-3 deficiency has been shown to be detrimental to athletic sprint/power performance. In this report we reveal additional functions for α-actinin-3 in bone. α-Actinin-3 but not α-actinin-2 is expressed in osteoblasts. The Actn3(-/-) mouse displays significantly reduced bone mass, with reduced cortical bone volume (-14%) and trabecular number (-61%) seen by microCT. Dynamic histomorphometry indicated this was due to a reduction in bone formation. In a cohort of postmenopausal Australian women, ACTN3 577XX genotype was associated with lower BMD in an additive genetic model, with the R577X genotype contributing 1.1% of the variance in BMD. Microarray analysis of cultured osteoprogenitors from Actn3(-/-) mice showed alterations in expression of several genes regulating bone mass and osteoblast/osteoclast activity, including Enpp1, Opg and Wnt7b. Our studies suggest that ACTN3 likely contributes to the regulation of bone mass through alterations in bone turnover. Given the high frequency of R577X in the general population, the potential role of ACTN3 R577X as a factor influencing variations in BMD in elderly humans warrants further study.

    Bone 2011;49;4;790-8

  • Replication of the association of a MET variant with autism in a Chinese Han population.

    Zhou X, Xu Y, Wang J, Zhou H, Liu X, Ayub Q, Wang X, Tyler-Smith C, Wu L and Xue Y

    Department of Children's and Adolescent Health, Public Health College of Harbin Medical University, Harbin, Heilongjiang, People's Republic of China.

    Background: Autism is a common, severe and highly heritable neurodevelopmental disorder in children, affecting up to 100 children per 10,000. The MET gene has been regarded as a promising candidate gene for this disorder because it is located within a replicated linkage interval, is involved in pathways affecting the development of the cerebral cortex and cerebellum in ways relevant to autism patients, and has shown significant association signals in previous studies.

    Here, we present new ASD patient and control samples from Heilongjiang, China and use them in a case-control and family-based replication study of two MET variants. One SNP, rs38845, was successfully replicated in a case-control association study, but failed to replicate in a family-based study, possibly due to small sample size. The other SNP, rs1858830, failed to replicate in both case-control and family-based studies.

    Conclusions: This is the first attempt to replicate associations in Chinese autism samples, and our result provides evidence that MET variants may be relevant to autism susceptibility in the Chinese Han population.

    Funded by: Wellcome Trust

    PloS one 2011;6;11;e27428

Team publications 2010

  • A map of human genome variation from population-scale sequencing.

    1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME and McVean GA

    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

    Funded by: British Heart Foundation: RG/09/012/28096; Howard Hughes Medical Institute; Medical Research Council: G0801823, G0801823(89305); NCRR NIH HHS: S10RR025056; NHGRI NIH HHS: 01HG3229, N01HG62088, P01HG4120, P41HG2371, P41HG4221, P41HG4222, P50HG2357, R01 HG003229, R01 HG003229-05, R01 HG004719-01, R01 HG004719-02, R01 HG004719-02S1, R01 HG004719-03, R01 HG004719-04, R01HG2651, R01HG3698, R01HG4333, R01HG4719, R01HG4960, RC2 HG005552-01, RC2 HG005552-02, RC2HG5552, U01HG5208, U01HG5209, U01HG5210, U01HG5211, U01HG5214, U41HG4568, U54 HG003273, U54HG2750, U54HG2757, U54HG3067, U54HG3079, U54HG3273; NIGMS NIH HHS: R01GM59290, R01GM72861, T32 GM007753; NIMH NIH HHS: 01MH84698; Wellcome Trust: 075491, 077009, 077014, 077192, 081407, 085532, 086084, 089061, 089062, 089088, WT075491/Z/04, WT077009, WT081407/Z/06/Z, WT085532AIA, WT086084/Z/08/Z, WT089088/Z/09/Z

    Nature 2010;467;7319;1061-73

  • A predominantly neolithic origin for European paternal lineages.

    Balaresque P, Bowden GR, Adams SM, Leung HY, King TE, Rosser ZH, Goodwin J, Moisan JP, Richard C, Millward A, Demaine AG, Barbujani G, Previderè C, Wilson IJ, Tyler-Smith C and Jobling MA

    Department of Genetics, University of Leicester, Leicester, United Kingdom.

    The relative contributions to modern European populations of Paleolithic hunter-gatherers and Neolithic farmers from the Near East have been intensely debated. Haplogroup R1b1b2 (R-M269) is the commonest European Y-chromosomal lineage, increasing in frequency from east to west, and carried by 110 million European men. Previous studies suggested a Paleolithic origin, but here we show that the geographical distribution of its microsatellite diversity is best explained by spread from a single source in the Near East via Anatolia during the Neolithic. Taken with evidence on the origins of other haplogroups, this indicates that most European Y chromosomes originate in the Neolithic expansion. This reinterpretation makes Europe a prime example of how technological and cultural change is linked with the expansion of a Y-chromosomal lineage, and the contrast of this pattern with that shown by maternally inherited mitochondrial DNA suggests a unique role for males in the transition.

    Funded by: Wellcome Trust: 057559, 065569, 084060, 087576

    PLoS biology 2010;8;1;e1000285

  • Origins and functional impact of copy number variation in the human genome.

    Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Wellcome Trust Case Control Consortium, Tyler-Smith C, Carter NP, Lee C, Scherer SW and Hurles ME

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA UK.

    Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by genome-wide association studies will not be accounted for by common CNVs.

    Funded by: Canadian Institutes of Health Research; NHGRI NIH HHS: HG004221; NIGMS NIH HHS: GM081533; Wellcome Trust: 077006/Z/05/Z, 077008, 077009, 077014

    Nature 2010;464;7289;704-12

  • Traces of sub-Saharan and Middle Eastern lineages in Indian Muslim populations.

    Eaaswarkhanth M, Haque I, Ravesh Z, Romero IG, Meganathan PR, Dubey B, Khan FA, Chaubey G, Kivisild T, Tyler-Smith C, Singh L and Thangaraj K

    National DNA Analysis Centre, Central Forensic Science Laboratory, Kolkata, India.

    Islam is the second most practiced religion in India, next to Hinduism. It is still unclear whether the spread of Islam in India has been only a cultural transformation or is associated with detectable levels of gene flow. To estimate the contribution of West Asian and Arabian admixture to Indian Muslims, we assessed genetic variation in mtDNA, Y-chromosomal and LCT/MCM6 markers in 472, 431 and 476 samples, respectively, representing six Muslim communities from different geographical regions of India. We found that most of the Indian Muslim populations received their major genetic input from geographically close non-Muslim populations. However, low levels of likely sub-Saharan African, Arabian and West Asian admixture were also observed among Indian Muslims in the form of L0a2a2 mtDNA and E1b1b1a and J(*)(xJ2) Y-chromosomal lineages. The distinction between Iranian and Arabian sources was difficult to make with mtDNA and the Y chromosome, as the estimates were highly correlated because of similar gene pool compositions in the sources. In contrast, the LCT/MCM6 locus, which shows a clear distinction between the two sources, enabled us to rule out significant gene flow from Arabia. Overall, our results support a model according to which the spread of Islam in India was predominantly cultural conversion associated with minor but still detectable levels of gene flow from outside, primarily from Iran and Central Asia, rather than directly from the Arabian Peninsula.

    Funded by: Wellcome Trust: 077009

    European journal of human genetics : EJHG 2010;18;3;354-63

  • Loss-of-function variants in the genomes of healthy humans.

    MacArthur DG and Tyler-Smith C

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK. dm8@sanger.ac.uk

    Genetic variants predicted to seriously disrupt the function of human protein-coding genes, so-called loss-of-function (LOF) variants, have traditionally been viewed in the context of severe Mendelian disease. However, recent large-scale sequencing and genotyping projects have revealed a surprisingly large number of these variants in the genomes of apparently healthy individuals (at least 100 per genome, including more than 30 in a homozygous state), suggesting a previously unappreciated level of variation in functional gene content between humans. These variants are mostly found at low frequency, suggesting that they are enriched for mildly deleterious polymorphisms suppressed by negative natural selection, and thus represent an attractive set of candidate variants for complex disease susceptibility. However, they are also enriched for sequencing and annotation artefacts, so overall present serious challenges for clinical sequencing projects seeking to identify severe disease genes amidst the 'noise' of technical error and benign genetic polymorphism. Systematic, high-quality catalogues of LOF variants present in the genomes of healthy individuals, built from the output of large-scale sequencing studies such as the 1000 Genomes Project, will help to distinguish between benign and disease-causing LOF variants, and will provide valuable resources for clinical genomics.

    Funded by: Wellcome Trust

    Human molecular genetics 2010;19;R2;R125-30

  • Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing.

    Park H, Kim JI, Ju YS, Gokcumen O, Mills RE, Kim S, Lee S, Suh D, Hong D, Kang HP, Yoo YJ, Shin JY, Kim HJ, Yavartanoo M, Chang YW, Ha JS, Chong W, Hwang GR, Darvishi K, Kim H, Yang SJ, Yang KS, Kim H, Hurles ME, Scherer SW, Carter NP, Tyler-Smith C, Lee C and Seo JS

    Genomic Medicine Institute, Medical Research Center, Seoul National University, Seoul, Korea.

    Copy number variants (CNVs) account for the majority of human genomic diversity in terms of base coverage. Here, we have developed and applied a new method to combine high-resolution array comparative genomic hybridization (CGH) data with whole-genome DNA sequencing data to obtain a comprehensive catalog of common CNVs in Asian individuals. The genomes of 30 individuals from three Asian populations (Korean, Chinese and Japanese) were interrogated with an ultra-high-resolution array CGH platform containing 24 million probes. Whole-genome sequencing data from a reference genome (NA10851, with 28.3x coverage) and two Asian genomes (AK1, with 27.8x coverage and AK2, with 32.0x coverage) were used to transform the relative copy number information obtained from array CGH experiments into absolute copy number values. We discovered 5,177 CNVs, of which 3,547 were putative Asian-specific CNVs. These common CNVs in Asian populations will be a useful resource for subsequent genetic studies in these populations, and the new method of calling absolute CNVs will be essential for applying CNV data to personalized medicine.

    Funded by: NHGRI NIH HHS: HG004221; Wellcome Trust: 077008, 077009, 077014

    Nature genetics 2010;42;5;400-5

  • A worldwide survey of human male demographic history based on Y-SNP and Y-STR data from the HGDP-CEPH populations.

    Shi W, Ayub Q, Vermeulen M, Shao RG, Zuniga S, van der Gaag K, de Knijff P, Kayser M, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambs., United Kingdom.

    We have investigated human male demographic history using 590 males from 51 populations in the Human Genome Diversity Project - Centre d'Etude du Polymorphisme Humain worldwide panel, typed with 37 Y-chromosomal Single Nucleotide Polymorphisms and 65 Y-chromosomal Short Tandem Repeats and analyzed with the program Bayesian Analysis of Trees With Internal Node Generation. The general patterns we observe show a gradient from the oldest population times to the most recent common ancestor (TMRCAs) and expansion times, together with the largest effective population sizes, in Africa, to the youngest times and smallest effective population sizes in the Americas. These parameters are significantly negatively correlated with distance from East Africa, and the patterns are consistent with most other studies of human variation and history. In contrast, growth rate showed a weaker correlation in the opposite direction. Y-lineage diversity and TMRCA also decrease with distance from East Africa, supporting a model of expansion with serial founder events starting from this source. A number of individual populations diverge from these general patterns, including previously documented examples such as recent expansions of the Yoruba in Africa, Basques in Europe, and Yakut in Northern Asia. However, some unexpected demographic histories were also found, including low growth rates in the Hazara and Kalash from Pakistan and recent expansion of the Mozabites in North Africa.

    Molecular biology and evolution 2010;27;2;385-93

  • Separating the post-Glacial coancestry of European and Asian Y chromosomes within haplogroup R1a.

    Underhill PA, Myres NM, Rootsi S, Metspalu M, Zhivotovsky LA, King RJ, Lin AA, Chow CE, Semino O, Battaglia V, Kutuev I, Järve M, Chaubey G, Ayub Q, Mohyuddin A, Mehdi SQ, Sengupta S, Rogaev EI, Khusnutdinova EK, Pshenichnov A, Balanovsky O, Balanovska E, Jeran N, Augustin DH, Baldovic M, Herrera RJ, Thangaraj K, Singh V, Singh L, Majumder P, Rudan P, Primorac D, Villems R and Kivisild T

    Division of Child and Adolescent Psychiatry and Child Development, Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, 1201 Welch Road, Stanford, CA 94304-5485, USA. under@stanford.edu

    Human Y-chromosome haplogroup structure is largely circumscribed by continental boundaries. One notable exception to this general pattern is the young haplogroup R1a that exhibits post-Glacial coalescent times and relates the paternal ancestry of more than 10% of men in a wide geographic area extending from South Asia to Central East Europe and South Siberia. Its origin and dispersal patterns are poorly understood as no marker has yet been described that would distinguish European R1a chromosomes from Asian. Here we present frequency and haplotype diversity estimates for more than 2000 R1a chromosomes assessed for several newly discovered SNP markers that introduce the onset of informative R1a subdivisions by geography. Marker M434 has a low frequency and a late origin in West Asia bearing witness to recent gene flow over the Arabian Sea. Conversely, marker M458 has a significant frequency in Europe, exceeding 30% in its core area in Eastern Europe and comprising up to 70% of all M17 chromosomes present there. The diversity and frequency profiles of M458 suggest its origin during the early Holocene and a subsequent expansion likely related to a number of prehistoric cultural developments in the region. Its primary frequency and diversity distribution correlates well with some of the major Central and East European river basins where settled farming was established before its spread further eastward. Importantly, the virtual absence of M458 chromosomes outside Europe speaks against substantial patrilineal gene flow from East Europe to Asia, including to India, at least since the mid-Holocene.

    European journal of human genetics : EJHG 2010;18;4;479-84

  • Genome-wide association study of CNVs in 16,000 cases of eight common diseases and 3,000 shared controls.

    Wellcome Trust Case Control Consortium, Craddock N, Hurles ME, Cardin N, Pearson RD, Plagnol V, Robson S, Vukcevic D, Barnes C, Conrad DF, Giannoulatou E, Holmes C, Marchini JL, Stirrups K, Tobin MD, Wain LV, Yau C, Aerts J, Ahmad T, Andrews TD, Arbury H, Attwood A, Auton A, Ball SG, Balmforth AJ, Barrett JC, Barroso I, Barton A, Bennett AJ, Bhaskar S, Blaszczyk K, Bowes J, Brand OJ, Braund PS, Bredin F, Breen G, Brown MJ, Bruce IN, Bull J, Burren OS, Burton J, Byrnes J, Caesar S, Clee CM, Coffey AJ, Connell JM, Cooper JD, Dominiczak AF, Downes K, Drummond HE, Dudakia D, Dunham A, Ebbs B, Eccles D, Edkins S, Edwards C, Elliot A, Emery P, Evans DM, Evans G, Eyre S, Farmer A, Ferrier IN, Feuk L, Fitzgerald T, Flynn E, Forbes A, Forty L, Franklyn JA, Freathy RM, Gibbs P, Gilbert P, Gokumen O, Gordon-Smith K, Gray E, Green E, Groves CJ, Grozeva D, Gwilliam R, Hall A, Hammond N, Hardy M, Harrison P, Hassanali N, Hebaishi H, Hines S, Hinks A, Hitman GA, Hocking L, Howard E, Howard P, Howson JM, Hughes D, Hunt S, Isaacs JD, Jain M, Jewell DP, Johnson T, Jolley JD, Jones IR, Jones LA, Kirov G, Langford CF, Lango-Allen H, Lathrop GM, Lee J, Lee KL, Lees C, Lewis K, Lindgren CM, Maisuria-Armer M, Maller J, Mansfield J, Martin P, Massey DC, McArdle WL, McGuffin P, McLay KE, Mentzer A, Mimmack ML, Morgan AE, Morris AP, Mowat C, Myers S, Newman W, Nimmo ER, O'Donovan MC, Onipinla A, Onyiah I, Ovington NR, Owen MJ, Palin K, Parnell K, Pernet D, Perry JR, Phillips A, Pinto D, Prescott NJ, Prokopenko I, Quail MA, Rafelt S, Rayner NW, Redon R, Reid DM, Renwick, Ring SM, Robertson N, Russell E, St Clair D, Sambrook JG, Sanderson JD, Schuilenburg H, Scott CE, Scott R, Seal S, Shaw-Hawkins S, Shields BM, Simmonds MJ, Smyth DJ, Somaskantharajah E, Spanova K, Steer S, Stephens J, Stevens HE, Stone MA, Su Z, Symmons DP, Thompson JR, Thomson W, Travers ME, Turnbull C, Valsesia A, Walker M, Walker NM, Wallace C, Warren-Perry M, Watkins NA, Webster J, Weedon MN, Wilson AG, Woodburn M, Wordsworth BP, Young AH, Zeggini E, Carter NP, Frayling TM, Lee C, McVean G, Munroe PB, Palotie A, Sawcer SJ, Scherer SW, Strachan DP, Tyler-Smith C, Brown MA, Burton PR, Caulfield MJ, Compston A, Farrall M, Gough SC, Hall AS, Hattersley AT, Hill AV, Mathew CG, Pembrey M, Satsangi J, Stratton MR, Worthington J, Deloukas P, Duncanson A, Kwiatkowski DP, McCarthy MI, Ouwehand W, Parkes M, Rahman N, Todd JA, Samani NJ and Donnelly P

    Copy number variants (CNVs) account for a major proportion of human genetic polymorphism and have been predicted to have an important role in genetic susceptibility to common disease. To address this we undertook a large, direct genome-wide study of association between CNVs and eight common human diseases. Using a purpose-designed array we typed approximately 19,000 individuals into distinct copy-number classes at 3,432 polymorphic CNVs, including an estimated 50% of all common CNVs larger than 500 base pairs. We identified several biological artefacts that lead to false-positive associations, including systematic CNV differences between DNAs derived from blood and cell lines. Association testing and follow-up replication analyses confirmed three loci where CNVs were associated with disease (IRGM for Crohn's disease, HLA for Crohn's disease, rheumatoid arthritis and type 1 diabetes, and TSPAN8 for type 2 diabetes), although in each case the locus had previously been identified in single nucleotide polymorphism (SNP)-based studies, reflecting our observation that most common CNVs that are well-typed on our array are well tagged by SNPs and so have been indirectly explored through SNP studies. We conclude that common CNVs that can be typed on existing platforms are unlikely to contribute greatly to the genetic basis of common human diseases.

    Funded by: Arthritis Research UK: 17552; Chief Scientist Office: CZB/4/540, ETM/137, ETM/75; Medical Research Council: G0000934, G0400874, G0500115, G0501942, G0600329, G0600705, G0700491, G0701003, G0701420, G0701810, G0701810(85517), G0800383, G0800759, G19/9, G90/106, G9521010, MC_UP_A390_1107; Wellcome Trust: 061858, 083948, 089989

    Nature 2010;464;7289;713-20

  • Distinct variants at LIN28B influence growth in height from birth to adulthood.

    Widén E, Ripatti S, Cousminer DL, Surakka I, Lappalainen T, Järvelin MR, Eriksson JG, Raitakari O, Salomaa V, Sovio U, Hartikainen AL, Pouta A, McCarthy MI, Osmond C, Kajantie E, Lehtimäki T, Viikari J, Kähönen M, Tyler-Smith C, Freimer N, Hirschhorn JN, Peltonen L and Palotie A

    Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland. elisabeth.widen@helsinki.fi

    We have studied the largely unknown genetic underpinnings of height growth by using a unique resource of longitudinal childhood height data available in Finnish population cohorts. After applying GWAS mapping of potential genes influencing pubertal height growth followed by further characterization of the genetic effects on complete postnatal growth trajectories, we have identified strong association between variants near LIN28B and pubertal growth (rs7759938; female p = 4.0 × 10⁻⁹, male p = 1.5 × 10⁻⁴, combined p = 5.0 × 10⁻¹¹, n = 5038). Analysis of growth during early puberty confirmed an effect on the timing of the growth spurt. Correlated SNPs have previously been implicated as influencing both adult stature and age at menarche, the same alleles associating with taller height and later age of menarche in other studies as with later pubertal growth here. Additionally, a partially correlated LIN28B SNP, rs314277, has been associated previously with final height. Testing both rs7759938 and rs314277 (pairwise r² = 0.29) for independent effects on postnatal growth in 8903 subjects indicated that the pubertal timing-associated marker rs7759938 affects prepubertal growth in females (p = 7 × 10⁻⁵) and final height in males (p = 5 × 10⁻⁴), whereas rs314277 has sex-specific effects on growth (p for interaction = 0.005) that were distinct from those observed at rs7759938. In conclusion, partially correlated variants at LIN28B tag distinctive, complex, and sex-specific height-growth-regulating effects, influencing the entire period of postnatal growth. These findings imply a critical role for LIN28B in the regulation of human growth.

    Funded by: Medical Research Council: G0500539; Wellcome Trust: 89061/Z/09/Z, WT089062

    American journal of human genetics 2010;86;5;773-82

  • The hare and the tortoise: one small step for four SNPs, one giant leap for SNP-kind.

    Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA, UK.

    A recently published study has used next-gen sequencing technology to resequence two Y chromosomes separated by 13 generations and discovered four single-base differences in approximately 10 Mb of DNA, suggesting that the Y chromosome euchromatin accumulates around one mutation per generation. Y-SNPs therefore now offer the best resolution of Y haplotypes and promise to distinguish almost every Y chromosome. This work illustrates the promise of current sequencing technology for forensically relevant applications.

    Funded by: Wellcome Trust: 077009

    Forensic science international. Genetics 2010;4;2;59-61
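
The arithmetic behind the "around one mutation per generation" estimate in the entry above can be sketched as follows. This is an illustrative calculation, not taken from the paper: the function name is hypothetical, and the figure of roughly 23 Mb for the male-specific euchromatin of the Y chromosome is an assumption supplied here, not stated in the abstract.

```python
def per_base_rate(differences: int, bases_compared: int, generations: int) -> float:
    """Mutations per base pair per generation from a direct pedigree comparison."""
    return differences / (bases_compared * generations)


# Figures quoted in the abstract: 4 differences, ~10 Mb compared, 13 generations.
rate = per_base_rate(differences=4, bases_compared=10_000_000, generations=13)

# ASSUMPTION (not in the abstract): ~23 Mb of male-specific Y euchromatin.
ASSUMED_EUCHROMATIN_BP = 23_000_000

# Expected new mutations across the euchromatin in one generation.
per_generation = rate * ASSUMED_EUCHROMATIN_BP

print(f"per-base rate: {rate:.2e} per bp per generation")
print(f"expected mutations per generation: {per_generation:.2f}")
```

Under these assumptions the result is somewhat below one, which is in line with the abstract's rounded statement of around one mutation per generation.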

Team publications 2009

  • Genetic variation in South Asia: assessing the influences of geography, language and ethnicity for understanding history and disease risk.

    Ayub Q and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK. qa1@sanger.ac.uk

    South Asia is home to more than 1.5 billion people, almost one-quarter of humanity, representing many diverse ethnic, linguistic and religious groups. Modern humans arrived here soon after their departure from Africa approximately 50,000-70,000 years before present (YBP) and several subsequent human migrations and invasions, as well as the unique social structure of the region, have helped shape the pattern of genetic diversity currently observed in these populations. Over the last few decades population geneticists and molecular anthropologists have analyzed DNA variation in indigenous populations from this region in order to catalog their genetic relationships and histories. The emphasis is gradually shifting from the study of population origins to high resolution surveys of DNA variation to address issues of population stratification and genetic susceptibility or resistance to diseases in genome-wide association surveys. We present a historical overview of the genetic studies carried out on populations from this region in order to understand the influence of geographic, linguistic and religious factors on population diversity in this region, and discuss future prospects in light of developments in high throughput genotyping and next generation sequencing technologies.

    Funded by: Wellcome Trust

    Briefings in functional genomics & proteomics 2009;8;5;395-404

  • Genomic complexity of the Y-STR DYS19: inversions, deletions and founder lineages carrying duplications.

    Balaresque P, Parkin EJ, Roewer L, Carvalho-Silva DR, Mitchell RJ, van Oorschot RA, Henke J, Stoneking M, Nasidze I, Wetton J, de Knijff P, Tyler-Smith C and Jobling MA

    Department of Genetics, University of Leicester, University Road, Leicester, LE1 7RH, UK.

    The Y-STR DYS19 is firmly established in the repertoire of Y-chromosomal markers used in forensic analysis yet is poorly understood at the molecular level, lying in a complex genomic environment and exhibiting null alleles, as well as duplications and occasional triplications in population samples. Here, we analyse three null alleles and 51 duplications and show that DYS19 can also be involved in inversion events, so that even its location within the short arm of the Y chromosome is uncertain. Deletion mapping in the three chromosomes carrying null alleles shows that their deletions are less than approximately 300 kb in size. Haplotypic analysis with binary markers shows that they belong to three different haplogroups and so represent independent events. In contrast, a collection of 51 DYS19 duplication chromosomes belong to only four haplogroups: two are singletons and may represent somatic mutation in lymphoblastoid cell lines, but two, in haplogroups G and C3c, represent founder lineages that have spread widely in Central Europe/West Asia and East Asia, respectively. Consideration of candidate mechanisms underlying both deletions and duplications provides no evidence for the involvement of non-allelic homologous recombination, and they are likely to represent sporadic events with low mutation rates. Understanding the basis and population distribution of these DYS19 alleles will aid in the utilisation and interpretation of profiles that contain them.

    Funded by: Wellcome Trust: 057559, 077009

    International journal of legal medicine 2009;123;1;15-23

  • A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia.

    Dhandapany PS, Sadayappan S, Xue Y, Powell GT, Rani DS, Nallari P, Rai TS, Khullar M, Soares P, Bahl A, Tharkan JM, Vaideeswar P, Rathinavel A, Narasimhan C, Ayapati DR, Ayub Q, Mehdi SQ, Oppenheimer S, Richards MB, Price AL, Patterson N, Reich D, Singh L, Tyler-Smith C and Thangaraj K

    Department of Biochemistry, Madurai Kamaraj University, Madurai 625 021, India.

    Heart failure is a leading cause of mortality in South Asians. However, its genetic etiology remains largely unknown. Cardiomyopathies due to sarcomeric mutations are a major monogenic cause for heart failure (MIM600958). Here, we describe a deletion of 25 bp in the gene encoding cardiac myosin binding protein C (MYBPC3) that is associated with heritable cardiomyopathies and an increased risk of heart failure in Indian populations (initial study OR = 5.3 (95% CI = 2.3-13), P = 2 × 10⁻⁶; replication study OR = 8.59 (3.19-25.05), P = 3 × 10⁻⁸; combined OR = 6.99 (3.68-13.57), P = 4 × 10⁻¹¹) and that disrupts cardiomyocyte structure in vitro. Its prevalence was found to be high (approximately 4%) in populations of Indian subcontinental ancestry. The finding of a common risk factor implicated in South Asian subjects with cardiomyopathy will help in identifying and counseling individuals predisposed to cardiac diseases in this region.

    Funded by: NHGRI NIH HHS: R01 HG006399-02; Wellcome Trust: 077009

    Nature genetics 2009;41;2;187-91

  • Geographical structure of the Y-chromosomal genetic landscape of the Levant: a coastal-inland contrast.

    El-Sibai M, Platt DE, Haber M, Xue Y, Youhanna SC, Wells RS, Izaabel H, Sanyoura MF, Harmanani H, Bonab MA, Behbehani J, Hashwa F, Tyler-Smith C, Zalloua PA and Genographic Consortium

    The Lebanese American University, Chouran, Beirut 1102 2801, Lebanon.

    We have examined the male-specific phylogeography of the Levant and its surroundings by analyzing Y-chromosomal haplogroup distributions using 5874 samples (885 new) from 23 countries. The diversity within some of these haplogroups was also examined. The Levantine populations showed clustering in SNP and STR analyses when considered against a broad Middle-East and North African background. However, we also found a coastal-inland, east-west pattern of diversity and frequency distribution in several haplogroups within the small region of the Levant. Since estimates of effective population size are similar in the two regions, this strong pattern is likely to have arisen mainly from differential migrations, with different lineages introduced from the east and west.

    Funded by: Wellcome Trust: 077009

    Annals of human genetics 2009;73;Pt 6;568-81

  • TSPY1 copy number variation influences spermatogenesis and shows differences among Y lineages.

    Giachini C, Nuti F, Turner DJ, Laface I, Xue Y, Daguin F, Forti G, Tyler-Smith C and Krausz C

    Andrology Unit, Department of Clinical Physiopathology, University of Florence, Florence 50139, Italy.

    Context: TSPY1 is a tandemly-repeated gene on the human Y chromosome forming an array of approximately 21-35 copies. The testicular expression pattern and the inferred function of the TSPY1 protein suggest possible involvement in spermatogenesis. However, data are scarce on TSPY1 copy number variation in different Y lineages and its role in spermatogenesis.

    Objectives: We sought to define: 1) the extent of TSPY1 copy number variation within and among Y chromosome haplogroups; and 2) the role of TSPY1 dosage in spermatogenic efficiency.

    A total of 154 idiopathic infertile men and 130 normozoospermic controls from Central Italy were analyzed. We used a quantitative PCR assay to measure TSPY1 copy number and also defined Y haplogroups in all subjects.

    Results: We provide evidence that TSPY1 copy number shows substantial variation among Y haplogroups and thus that population stratification does represent a potential bias in case-control association studies. We also found: 1) a significant positive correlation between TSPY1 copy number and sperm count (P < 0.001); 2) a significant difference in mean TSPY1 copy number between patients and controls (28.4 +/- 8.3 vs. 33.9 +/- 10.7; P < 0.001); and 3) a 1.5-fold increased risk of abnormal sperm parameters in men with less than 33 copies (P < 0.001).

    Conclusions: TSPY1 copy number variation significantly influences spermatogenic efficiency. Low TSPY1 copy number is a new risk factor for male infertility with potential clinical consequences.

    Funded by: Telethon: GGP08204; Wellcome Trust: 077009

    The Journal of clinical endocrinology and metabolism 2009;94;10;4016-22

  • Geographical affinities of the HapMap samples.

    He M, Gitschier J, Zerjal T, de Knijff P, Tyler-Smith C and Xue Y

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom.

    Background: The HapMap samples were collected for medical-genetic studies, but are also widely used in population-genetic and evolutionary investigations. Yet the ascertainment of the samples differs from most population-genetic studies which collect individuals who live in the same local region as their ancestors. What effects could this non-standard ascertainment have on the interpretation of HapMap results?

    We compared the HapMap samples with more conventionally-ascertained samples used in population- and forensic-genetic studies, including the HGDP-CEPH panel, making use of published genome-wide autosomal SNP data and Y-STR haplotypes, as well as producing new Y-STR data. We found that the HapMap samples were representative of their broad geographical regions of ancestry according to all tests applied. The YRI and JPT were indistinguishable from independent samples of Yoruba and Japanese in all ways investigated. However, both the CHB and the CEU were distinguishable from all other HGDP-CEPH populations with autosomal markers, and both showed Y-STR similarities to unusually large numbers of populations, perhaps reflecting their admixed origins.

    The CHB and JPT are readily distinguished from one another with both autosomal and Y-chromosomal markers, and results obtained after combining them into a single sample should be interpreted with caution. The CEU are better described as being of Western European ancestry than of Northern European ancestry as often reported. Both the CHB and CEU show subtle but detectable signs of admixture. Thus the YRI and JPT samples are well-suited to standard population-genetic studies, but the CHB and CEU less so.

    Funded by: Howard Hughes Medical Institute; Wellcome Trust

    PloS one 2009;4;3;e4684

  • The peopling of Korea revealed by analyses of mitochondrial DNA and Y-chromosomal markers.

    Jin HJ, Tyler-Smith C and Kim W

    Department of Biological Sciences, Dankook University, Cheonan, Korea.

    Background: The Koreans are generally considered a northeast Asian group because of their geographical location. However, recent findings from Y chromosome studies showed that the Korean population contains lineages from both southern and northern parts of East Asia. To understand the genetic history and relationships of Korea more fully, additional data and analyses are necessary.

    We analyzed mitochondrial DNA (mtDNA) sequence variation in the hypervariable segments I and II (HVS-I and HVS-II) and haplogroup-specific mutations in coding regions in 445 individuals from seven east Asian populations (Korean, Korean-Chinese, Mongolian, Manchurian, Han (Beijing), Vietnamese and Thais). In addition, published mtDNA haplogroup data (N = 3307), mtDNA HVS-I sequences (N = 2313), Y chromosome haplogroup data (N = 1697) and Y chromosome STR data (N = 2713) were analyzed to elucidate the genetic structure of East Asian populations. All the mtDNA profiles studied here were classified into subsets of haplogroups common in East Asia, with just two exceptions. In general, the Korean mtDNA profiles revealed similarities to other northeastern Asian populations through analysis of individual haplogroup distributions, genetic distances between populations or an analysis of molecular variance, although a minor southern contribution was also suggested. Reanalysis of Y-chromosomal data confirmed both the overall similarity to other northeastern populations, and also a larger paternal contribution from southeastern populations.

    Conclusion: The present work provides evidence that peopling of Korea can be seen as a complex process, interpreted as an early northern Asian settlement with at least one subsequent male-biased southern-to-northern migration, possibly associated with the spread of rice agriculture.

    PloS one 2009;4;1;e4210

  • Phenotypic variation within European carriers of the Y-chromosomal gr/gr deletion is independent of Y-chromosomal background.

    Krausz C, Giachini C, Xue Y, O'Bryan MK, Gromoll J, Rajpert-de Meyts E, Oliva R, Aknin-Seifer I, Erdei E, Jorgensen N, Simoni M, Ballescà JL, Levy R, Balercia G, Piomboni P, Nieschlag E, Forti G, McLachlan R and Tyler-Smith C

    Andrology Unit, Department of Clinical Physiopathology, University of Florence, Viale Pieraccini, 6 Florence 50139, Italy. c.krausz@dfc.unifi.it

    Background: Previous studies have compared sperm phenotypes between men with partial deletions within the AZFc region of the Y chromosome and non-carriers, with variable results. In this study, a separate question was investigated, the basis of the variation in sperm phenotype within gr/gr deletion carriers, which ranges from normozoospermia to azoospermia. Differences in the genes removed by independent gr/gr deletions, the occurrence of subsequent duplications or the presence of linked modifying variants elsewhere on the chromosome have been suggested as possible causal factors. This study set out to test these possibilities in a large sample of gr/gr deletion carriers with known phenotypes spanning the complete range.

    Results: In total, 169 men diagnosed with gr/gr deletions from six centres in Europe and one in Australia were studied. The DAZ and CDY1 copies retained, the presence or absence of duplications and the Y-chromosomal haplogroup were characterised. Although the study had good power to detect factors that accounted for ≥5.5% of the variation in sperm concentration, no such factor was found. A negative effect of gr/gr deletions followed by b2/b4 duplication was found within the normospermic group, which remains to be further explored in a larger study population. Finally, significant geographical differences in the frequency of different subtypes of gr/gr deletions were found, which may have relevance for the interpretation of case control studies dealing with admixed populations.

    Conclusions: The phenotypic variation of gr/gr carriers in men of European origin is largely independent of the Y-chromosomal background.

    Funded by: Wellcome Trust: 077009

    Journal of medical genetics 2009;46;1;21-31

  • HI: haplotype improver using paired-end short reads.

    Long Q, MacArthur D, Ning Z and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambs, UK. ql2@sanger.ac.uk

    Summary: We present a program to improve haplotype reconstruction by incorporating information from paired-end reads, and demonstrate its utility on simulated data. We find that given a fixed coverage, longer reads (implying fewer of them) are preferable.

    Availability: The executable and user manual can be freely downloaded from ftp://ftp.sanger.ac.uk/pub/zn1/HI.

    Funded by: Wellcome Trust

    Bioinformatics (Oxford, England) 2009;25;18;2436-7

  • Biology of Genomes: making sense of sequence.

    MacArthur DG

    Human Evolution, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. dm8@sanger.ac.uk.

    A report on the Biology of Genomes meeting held at Cold Spring Harbor Laboratory, NY, USA, 5-9 May 2009.

    Genome medicine 2009;1;6;61

  • Genetic structure of nomadic Bedouin from Kuwait.

    Mohammad T, Xue Y, Evison M and Tyler-Smith C

    Division of Genomic Medicine, University of Sheffield, Sheffield, UK.

    Bedouin are traditionally nomadic inhabitants of the Persian Gulf who claim descent from two male lineages: Adnani and Qahtani. We have investigated whether or not this tradition is reflected in the current genetic structure of a sample of 153 Bedouin males from six Kuwaiti tribes, including three tribes from each traditional lineage. Volunteers were genotyped using a panel of autosomal and Y-STRs, and Y-SNPs. The samples clustered with their geographical neighbours in both the autosomal and Y-chromosomal analyses, and showed strong evidence of genetic isolation and drift. Although there was no evidence of segregation into the two male lineages, other aspects of genetic structure were in accord with tradition.

    Funded by: Wellcome Trust: 077009

    Heredity 2009;103;5;425-33

  • A genome-wide meta-analysis identifies 22 loci associated with eight hematological parameters in the HaemGen consortium.

    Soranzo N, Spector TD, Mangino M, Kühnel B, Rendon A, Teumer A, Willenborg C, Wright B, Chen L, Li M, Salo P, Voight BF, Burns P, Laskowski RA, Xue Y, Menzel S, Altshuler D, Bradley JR, Bumpstead S, Burnett MS, Devaney J, Döring A, Elosua R, Epstein SE, Erber W, Falchi M, Garner SF, Ghori MJ, Goodall AH, Gwilliam R, Hakonarson HH, Hall AS, Hammond N, Hengstenberg C, Illig T, König IR, Knouff CW, McPherson R, Melander O, Mooser V, Nauck M, Nieminen MS, O'Donnell CJ, Peltonen L, Potter SC, Prokisch H, Rader DJ, Rice CM, Roberts R, Salomaa V, Sambrook J, Schreiber S, Schunkert H, Schwartz SM, Serbanovic-Canic J, Sinisalo J, Siscovick DS, Stark K, Surakka I, Stephens J, Thompson JR, Völker U, Völzke H, Watkins NA, Wells GA, Wichmann HE, Van Heel DA, Tyler-Smith C, Thein SL, Kathiresan S, Perola M, Reilly MP, Stewart AF, Erdmann J, Samani NJ, Meisinger C, Greinacher A, Deloukas P, Ouwehand WH and Gieger C

    Human Genetics, Wellcome Trust Sanger Institute, Hinxton, UK. ns6@sanger.ac.uk

    The number and volume of cells in the blood affect a wide range of disorders including cancer and cardiovascular, metabolic, infectious and immune conditions. We consider here the genetic variation in eight clinically relevant hematological parameters, including hemoglobin levels, red and white blood cell counts and platelet counts and volume. We describe common variants within 22 genetic loci reproducibly associated with these hematological parameters in 13,943 samples from six European population-based studies, including 6 associated with red blood cell parameters, 15 associated with platelet parameters and 1 associated with total white blood cell count. We further identified a long-range haplotype at 12q24 associated with coronary artery disease and myocardial infarction in 9,479 cases and 10,527 controls. We show that this haplotype demonstrates extensive disease pleiotropy, as it contains known risk loci for type 1 diabetes, hypertension and celiac disease and has been spread by a selective sweep specific to European and geographically nearby populations.

    Funded by: Canadian Institutes of Health Research: MOP77682, MOP82810, NA6650; Medical Research Council: G0000111; NCRR NIH HHS: U54 RR020278, U54 RR020278-01; NHLBI NIH HHS: R01 HL056931, R01 HL056931-02, R01 HL056931-03, R01 HL056931-04; Wellcome Trust

    Nature genetics 2009;41;11;1182-90

  • A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation.

    Tarpey PS, Smith R, Pleasance E, Whibley A, Edkins S, Hardy C, O'Meara S, Latimer C, Dicks E, Menzies A, Stephens P, Blow M, Greenman C, Xue Y, Tyler-Smith C, Thompson D, Gray K, Andrews J, Barthorpe S, Buck G, Cole J, Dunmore R, Jones D, Maddison M, Mironenko T, Turner R, Turrell K, Varian J, West S, Widaa S, Wray P, Teague J, Butler A, Jenkinson A, Jia M, Richardson D, Shepherd R, Wooster R, Tejada MI, Martinez F, Carvill G, Goliath R, de Brouwer AP, van Bokhoven H, Van Esch H, Chelly J, Raynaud M, Ropers HH, Abidi FE, Srivastava AK, Cox J, Luo Y, Mallya U, Moon J, Parnau J, Mohammed S, Tolmie JL, Shoubridge C, Corbett M, Gardner A, Haan E, Rujirabanjerd S, Shaw M, Vandeleur L, Fullston T, Easton DF, Boyle J, Partington M, Hackett A, Field M, Skinner C, Stevenson RE, Bobrow M, Turner G, Schwartz CE, Gecz J, Raymond FL, Futreal PA and Stratton MR

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.

    Large-scale systematic resequencing has been proposed as the key future strategy for the discovery of rare, disease-causing sequence variants across the spectrum of human complex disease. We have sequenced the coding exons of the X chromosome in 208 families with X-linked mental retardation (XLMR), the largest direct screen for constitutional disease-causing mutations thus far reported. The screen has discovered nine genes implicated in XLMR, including SYP, ZNF711 and CASK reported here, confirming the power of this strategy. The study has, however, also highlighted issues confronting whole-genome sequencing screens, including the observation that loss of function of 1% or more of X-chromosome genes is compatible with apparently normal existence.

    Funded by: Cancer Research UK: 10118; NICHD NIH HHS: HD26202; Wellcome Trust: 077012

    Nature genetics 2009;41;5;535-43

  • The will-o'-the-wisp of genetics--hunting for the azoospermia factor gene.

    Tyler-Smith C and Krausz C

    Funded by: Wellcome Trust: 077009

    The New England journal of medicine 2009;360;9;925-7

  • Improving global and regional resolution of male lineage differentiation by simple single-copy Y-chromosomal short tandem repeat polymorphisms.

    Vermeulen M, Wollstein A, van der Gaag K, Lao O, Xue Y, Wang Q, Roewer L, Knoblauch H, Tyler-Smith C, de Knijff P and Kayser M

    Department of Forensic Molecular Biology, Erasmus University Medical Center Rotterdam, 3000 CA Rotterdam, The Netherlands.

    We analyzed 67 short tandem repeat polymorphisms from the non-recombining part of the Y-chromosome (Y-STRs), including 49 rarely studied simple single-copy (ss)Y-STRs and 18 widely used Y-STRs, in 590 males from 51 populations belonging to 8 worldwide regions (HGDP-CEPH panel). Although autosomal DNA profiling provided no evidence for close relationship, we found 18 Y-STR haplotypes (defined by 67 Y-STRs) that were shared by two to five men in 13 worldwide populations, revealing high and widespread levels of cryptic male relatedness. Maximal (95.9%) haplotype resolution was achieved with the best 25 out of 67 Y-STRs in the global dataset, and with the best 3-16 markers in regional datasets (89.6-100% resolution). From the 49 rarely studied ssY-STRs, the 25 most informative markers were sufficient to reach the highest possible male lineage differentiation in the global (92.2% resolution), and 3-15 markers in the regional datasets (85.4-100%). Considerably lower haplotype resolutions were obtained with the three commonly used Y-STR sets (Minimal Haplotype, PowerPlex Y, and AmpFlSTR Yfiler). Six ssY-STRs (DYS481, DYS533, DYS549, DYS570, DYS576 and DYS643) were most informative to supplement the existing Y-STR kits for increasing haplotype resolution, or - together with additional ssY-STRs - as a new set for maximizing male lineage differentiation. Mutation rates of the 49 ssY-STRs were estimated from 403 meiotic transfers in deep-rooted pedigrees, and ranged from approximately 4.8 × 10⁻⁴ for 31 ssY-STRs with no mutations observed to 1.3 × 10⁻² and 1.5 × 10⁻² for DYS570 and DYS576, respectively, the latter representing the highest mutation rates reported for human Y-STRs so far. Our findings thus demonstrate that ssY-STRs are useful for maximizing global and regional resolution of male lineages, either as a new set, or when added to commonly used Y-STR sets, and support their application to forensic, genealogical and anthropological studies.

    Funded by: Wellcome Trust: 077009

    Forensic science international. Genetics 2009;3;4;205-13

  • Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree.

    Xue Y, Wang Q, Long Q, Ng BL, Swerdlow H, Burton J, Skuce C, Taylor R, Abdellah Z, Zhao Y, Asan, MacArthur DG, Quail MA, Carter NP, Yang H and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambs CB10 1SA, UK. ylx@sanger.ac.uk

    Understanding the key process of human mutation is important for many aspects of medical genetics and human evolution. In the past, estimates of mutation rates have generally been inferred from phenotypic observations or comparisons of homologous sequences among closely related species. Here, we apply new sequencing technology to measure directly one mutation rate, that of base substitutions on the human Y chromosome. The Y chromosomes of two individuals separated by 13 generations were flow sorted and sequenced by Illumina (Solexa) paired-end sequencing to an average depth of 11x or 20x, respectively. Candidate mutations were further examined by capillary sequencing in cell-line and blood DNA from the donors and additional family members. Twelve mutations were confirmed in approximately 10.15 Mb; eight of these had occurred in vitro and four in vivo. The latter could be placed in different positions on the pedigree and led to a mutation-rate measurement of 3.0 × 10⁻⁸ mutations/nucleotide/generation (95% CI: 8.9 × 10⁻⁹ to 7.0 × 10⁻⁸), consistent with estimates of 2.3 × 10⁻⁸ to 6.3 × 10⁻⁸ mutations/nucleotide/generation for the same Y-chromosomal region from published human-chimpanzee comparisons depending on the generation and split times assumed.

    Funded by: Wellcome Trust

    Current biology : CB 2009;19;17;1453-7

  • Population differentiation as an indicator of recent positive selection in humans: an empirical evaluation.

    Xue Y, Zhang X, Huang N, Daly A, Gillson CJ, MacArthur DG, Yngvadottir B, Nica AC, Woodwark C, Chen Y, Conrad DF, Ayub Q, Mehdi SQ, Li P and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, United Kingdom.

    We have evaluated the extent to which SNPs identified by genomewide surveys as showing unusually high levels of population differentiation in humans have experienced recent positive selection, starting from a set of 32 nonsynonymous SNPs in 27 genes highlighted by the HapMap1 project. These SNPs were genotyped again in the HapMap samples and in the Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain (HGDP-CEPH) panel of 52 populations representing worldwide diversity; extended haplotype homozygosity was investigated around all of them, and full resequence data were examined for 9 genes (5 from public sources and 4 from new data sets). For 7 of the genes, genotyping errors were responsible for an artifactual signal of high population differentiation and for 2, the population differentiation did not exceed our significance threshold. For the 18 genes with confirmed high population differentiation, 3 showed evidence of positive selection as measured by unusually extended haplotypes within a population, and 7 more did in between-population analyses. The 9 genes with resequence data included 7 with high population differentiation, and 5 showed evidence of positive selection on the haplotype carrying the nonsynonymous SNP from skewed allele frequency spectra; in addition, 2 showed evidence of positive selection on unrelated haplotypes. Thus, in humans, high population differentiation is (apart from technical artifacts) an effective way of enriching for recently selected genes, but is not an infallible pointer to recent positive selection supported by other lines of evidence.

    Funded by: Wellcome Trust

    Genetics 2009;183;3;1065-77

  • The promise and reality of personal genomics.

    Yngvadottir B, MacArthur DG, Jin H and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    The publication of the highest-quality and best-annotated personal genome yet tells us much about sequencing technology, something about genetic ancestry, but still little of medical relevance.

    Funded by: Wellcome Trust

    Genome biology 2009;10;9;237

  • A genome-wide survey of the prevalence and evolutionary forces acting on human nonsense SNPs.

    Yngvadottir B, Xue Y, Searle S, Hunt S, Delgado M, Morrison J, Whittaker P, Deloukas P and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA, UK.

    Nonsense SNPs introduce premature termination codons into genes and can result in the absence of a gene product or in a truncated and potentially harmful protein, so they are often considered disadvantageous and are associated with disease susceptibility. As such, we might expect the disrupted allele to be rare and, in healthy people, observed only in a heterozygous state. However, some, like those in the CASP12 and ACTN3 genes, are known to be present at high frequencies and to occur often in a homozygous state and seem to have been advantageous in recent human evolution. To evaluate the selective forces acting on nonsense SNPs as a class, we have carried out a large-scale experimental survey of nonsense SNPs in the human genome by genotyping 805 of them (plus control synonymous SNPs) in 1,151 individuals from 56 worldwide populations. We identified 169 genes containing nonsense SNPs that were variable in our samples, of which 99 were found with both copies inactivated in at least one individual. We found that the sampled humans differ on average by 24 genes (out of about 20,000) because of these nonsense SNPs alone. As might be expected, nonsense SNPs as a class were found to be slightly disadvantageous over evolutionary timescales, but a few nevertheless showed signs of being possibly advantageous, as indicated by unusually high levels of population differentiation, long haplotypes, and/or high frequencies of derived alleles. This study underlines the extent of variation in gene content within humans and emphasizes the importance of understanding this type of variation.

    Funded by: Wellcome Trust: 062023

    American journal of human genetics 2009;84;2;224-34
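
As a rough check of the arithmetic in the Xue et al. Y-chromosome mutation-rate abstract above (four in vivo mutations, approximately 10.15 Mb sequenced, 13 generations), the reported point estimate can be reproduced with a simple ratio. This is a minimal sketch that assumes the rate is just the number of mutations divided by (sites × generations). The function name `per_site_rate` is illustrative and not taken from the paper, and the sketch does not attempt to reproduce the published confidence interval.

```python
def per_site_rate(mutations: int, sites: float, generations: int) -> float:
    """Point estimate of mutations per nucleotide per generation.

    Assumption for illustration: a plain ratio of observed mutations
    to the number of site-generations surveyed.
    """
    return mutations / (sites * generations)


# Figures quoted in the abstract: 4 in vivo mutations, ~10.15 Mb, 13 generations.
rate = per_site_rate(mutations=4, sites=10.15e6, generations=13)
print(f"{rate:.2e} mutations/nucleotide/generation")
```

The result rounds to the 3.0 × 10⁻⁸ figure quoted in the abstract. The eight in vitro (cell-line) mutations are left out of the count because they did not arise in the germline.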

Team publications 2008

  • Dynamic nature of the proximal AZFc region of the human Y chromosome: multiple independent deletion and duplication events revealed by microsatellite analysis.

    Balaresque P, Bowden GR, Parkin EJ, Omran GA, Heyer E, Quintana-Murci L, Roewer L, Stoneking M, Nasidze I, Carvalho-Silva DR, Tyler-Smith C, de Knijff P and Jobling MA

    Department of Genetics, University of Leicester, Leicester, United Kingdom.

    The human Y chromosome shows frequent structural variants, some of which are selectively neutral, while others cause impaired fertility due to the loss of spermatogenic genes. The large-scale use of multiple Y-chromosomal microsatellites in forensic and population genetic studies can reveal such variants, through the absence or duplication of specific markers in haplotypes. We describe Y chromosomes in apparently normal males carrying null and duplicated alleles at the microsatellite DYS448, which lies in the proximal part of the azoospermia factor c (AZFc) region, important in spermatogenesis, and made up of "ampliconic" repeats that act as substrates for nonallelic homologous recombination (NAHR). Physical mapping in 26 DYS448 deletion chromosomes reveals that only three cases belong to a previously described class, representing independent occurrences of an approximately 1.5-Mb deletion mediated by recombination between the b1 and b3 repeat units. The remainder belong to five novel classes; none appears to be mediated through homologous recombination, and all remove some genes, but are likely to be compatible with normal fertility. A combination of deletion analysis with binary-marker and microsatellite haplotyping shows that the 26 deletions represent nine independent events. Nine DYS448 duplication chromosomes can be explained by four independent events. Some lineages have risen to high frequency in particular populations, in particular a deletion within haplogroup (hg) C(*)(xC3a,C3c) found in 18 Asian males. The nonrandom phylogenetic distribution of duplication and deletion events suggests possible structural predisposition to such mutations in hgs C and G.

    Funded by: Wellcome Trust: 057559, 077009

    Human mutation 2008;29;10;1171-80

  • A novel 154-bp deletion in the human mitochondrial DNA control region in healthy individuals.

    Behar DM, Blue-Smith J, Soria-Hernanz DF, Tzur S, Hadid Y, Bormans C, Moen A, Tyler-Smith C, Quintana-Murci L, Wells RS and Genographic Consortium

    Molecular Medicine Laboratory, Rambam Health Care Campus, Haifa, Israel. GenoPubs@ngs.org

    The biological role of the mitochondrial DNA (mtDNA) control region in mtDNA replication remains unclear. In a worldwide survey of mtDNA variation in the general population, we have identified a novel large control region deletion spanning positions 16154 to 16307 (m.16154_16307del154). The population prevalence of this deletion is low, since it was only observed in 1 out of over 120,000 mtDNA genomes studied. The deletion is present in a nonheteroplasmic state, and was transmitted by a mother to her two sons with no apparent past or present disease conditions. The identification of this large deletion in healthy individuals challenges the current view of the control region as playing a crucial role in the regulation of mtDNA replication, and supports the existence of a more complex system of multiple or epigenetically-determined replication origins.

    Funded by: Wellcome Trust: 077009

    Human mutation 2008;29;12;1387-91

  • The dawn of human matrilineal diversity.

    Behar DM, Villems R, Soodyall H, Blue-Smith J, Pereira L, Metspalu E, Scozzari R, Makkan H, Tzur S, Comas D, Bertranpetit J, Quintana-Murci L, Tyler-Smith C, Wells RS, Rosset S and Genographic Consortium

    Molecular Medicine Laboratory, Rambam Health Care Campus, Haifa 31096, Israel. behardm@usernet.com

    The quest to explain demographic history during the early part of human evolution has been limited because of the scarce paleoanthropological record from the Middle Stone Age. To shed light on the structure of the mitochondrial DNA (mtDNA) phylogeny at the dawn of Homo sapiens, we constructed a matrilineal tree composed of 624 complete mtDNA genomes from sub-Saharan Hg L lineages. We paid particular attention to the Khoi and San (Khoisan) people of South Africa because they are considered to be a unique relic of hunter-gatherer lifestyle and to carry paternal and maternal lineages belonging to the deepest clades known among modern humans. Both the tree phylogeny and coalescence calculations suggest that Khoisan matrilineal ancestry diverged from the rest of the human mtDNA pool 90,000-150,000 years before present (ybp) and that at least five additional, currently extant maternal lineages existed during this period in parallel. Furthermore, we estimate that a minimum of 40 other evolutionarily successful lineages flourished in sub-Saharan Africa during the period of modern human dispersal out of Africa approximately 60,000-70,000 ybp. Only much later, at the beginning of the Late Stone Age, about 40,000 ybp, did introgression of additional lineages occur into the Khoisan mtDNA pool. This process was further accelerated during the recent Bantu expansions. Our results suggest that the early settlement of humans in Africa was already matrilineally structured and involved small, separately evolving isolated populations.

    Funded by: Wellcome Trust

    American journal of human genetics 2008;82;5;1130-40

  • The functional impact of structural variation in humans.

    Hurles ME, Dermitzakis ET and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. meh@sanger.ac.uk

    Structural variation includes many different types of chromosomal rearrangement and encompasses millions of bases in every human genome. Over the past 3 years, the extent and complexity of structural variation has become better appreciated. Diverse approaches have been adopted to explore the functional impact of this class of variation. As disparate indications of the important biological consequences of genome dynamism are accumulating rapidly, we review the evidence that structural variation has an appreciable impact on cellular phenotypes, disease and human evolution.

    Funded by: Wellcome Trust: 077009, 077014, 077046

    Trends in genetics : TIG 2008;24;5;238-45

  • Copy number variation and evolution in humans and chimpanzees.

    Perry GH, Yang F, Marques-Bonet T, Murphy C, Fitzgerald T, Lee AS, Hyland C, Stone AC, Hurles ME, Tyler-Smith C, Eichler EE, Carter NP, Lee C and Redon R

    School of Human Evolution & Social Change, Arizona State University, Tempe, Arizona 85287, USA.

    Copy number variants (CNVs) underlie many aspects of human phenotypic diversity and provide the raw material for gene duplication and gene family expansion. However, our understanding of their evolutionary significance remains limited. We performed comparative genomic hybridization on a single human microarray platform to identify CNVs among the genomes of 30 humans and 30 chimpanzees as well as fixed copy number differences between species. We found that human and chimpanzee CNVs occur in orthologous genomic regions far more often than expected by chance and are strongly associated with the presence of highly homologous intrachromosomal segmental duplications. By adapting population genetic analyses for use with copy number data, we identified functional categories of genes that have likely evolved under purifying or positive selection for copy number changes. In particular, duplications and deletions of genes with inflammatory response and cell proliferation functions may have been fixed by positive selection and involved in the adaptive phenotypic differentiation of humans and chimpanzees.

    Funded by: Howard Hughes Medical Institute; NCRR NIH HHS: RR014491, RR015087, RR016483; NHGRI NIH HHS: HG004221; Wellcome Trust

    Genome research 2008;18;11;1698-710

  • Maximum-likelihood estimation of site-specific mutation rates in human mitochondrial DNA from partial phylogenetic classification.

    Rosset S, Wells RS, Soria-Hernanz DF, Tyler-Smith C, Royyuru AK, Behar DM and Genographic Consortium

    Department of Statistics and Operations Research, Tel Aviv University, Tel Aviv, Israel. saharon@post.tau.ac.il

    The mitochondrial DNA hypervariable segment I (HVS-I) is widely used in studies of human evolutionary genetics, and therefore accurate estimates of mutation rates among nucleotide sites in this region are essential. We have developed a novel maximum-likelihood methodology for estimating site-specific mutation rates from partial phylogenetic information, such as haplogroup association. The resulting estimation problem is a generalized linear model, with a nonstandard link function. We develop inference and bias correction tools for our estimates and a hypothesis-testing approach for site independence. We demonstrate our methodology using 16,609 HVS-I samples from the Genographic Project. Our results suggest that mutation rates among nucleotide sites in HVS-I are highly variable. The 16,400-16,500 region exhibits significantly lower rates compared to other regions, suggesting potential functional constraints. Several loci identified in the literature as possible termination-associated sequences (TAS) do not yield statistically slower rates than the rest of HVS-I, casting doubt on their functional importance. Our tests do not reject the null hypothesis of independent mutation rates among nucleotide sites, supporting the use of site-independence assumption for analyzing HVS-I. Potential extensions of our methodology include its application to estimation of mutation rates in other genetic regions, like Y chromosome short tandem repeats.

    Funded by: Wellcome Trust

    Genetics 2008;180;3;1511-24

  • Maternal footprints of Southeast Asians in North India.

    Thangaraj K, Chaubey G, Kivisild T, Selvi Rani D, Singh VK, Ismail T, Carvalho-Silva D, Metspalu M, Bhaskar LV, Reddy AG, Chandra S, Pande V, Prathap Naidu B, Adarsh N, Verma A, Jyothi IA, Mallick CB, Shrivastava N, Devasena R, Kumari B, Singh AK, Dwivedi SK, Singh S, Rao G, Gupta P, Sonvane V, Kumari K, Basha A, Bhargavi KR, Lalremruata A, Gupta AK, Kaur G, Reddy KK, Rao AP, Villems R, Tyler-Smith C and Singh L

    Centre for Cellular and Molecular Biology, Hyderabad, India.

    We have analyzed 7,137 samples from 125 different caste, tribal and religious groups of India and 99 samples from three populations of Nepal for the length variation in the COII/tRNA(Lys) region of mtDNA. Samples showing length variation were subjected to detailed phylogenetic analysis based on HVS-I and informative coding region sequence variation. The overall frequencies of the 9-bp deletion and insertion variants in South Asia were 1.9 and 0.6%, respectively. We have also defined a novel deep-rooting haplogroup M43 and identified the rare haplogroup H14 in Indian populations carrying the 9-bp deletion by complete mtDNA sequencing. Moreover, we redefined haplogroup M6 and dissected it into two well-defined subclades. The presence of haplogroups F1 and B5a in Uttar Pradesh suggests minor maternal contribution from Southeast Asia to Northern India. The occurrence of haplogroup F1 in the Nepalese sample implies that Nepal might have served as a bridge for the flow of eastern lineages to India. The presence of R6 in the Nepalese, on the other hand, suggests that the gene flow between India and Nepal has been reciprocal.

    Funded by: Wellcome Trust: 077009

    Human heredity 2008;66;1;1-9

  • Long-range, high-throughput haplotype determination via haplotype-fusion PCR and ligation haplotyping.

    Turner DJ, Tyler-Smith C and Hurles ME

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK. djt@sanger.ac.uk

    Ligation Haplotyping is a robust, novel method for experimental determination of haplotypes over long distances, which can be applied to assaying both sequence and structural variation. The simplicity and efficacy of the method for genotyping large chromosomal rearrangements and haplotyping SNPs over long distances make it a valuable and powerful addition to the methodological repertoire, which will be beneficial to studies of population genetics and evolution, disease association and inheritance, and genomic variation. We illustrate the versatility of the method both by genotyping a Yp paracentric inversion, found in approximately 60% of Northwest European males, that strongly influences the germline rate of infertility-causing XY translocations and by haplotyping two autosomal SNPs that lie 16.4 kb apart on chromosome 7, and which influence an individual's susceptibility to systemic lupus erythematosus.

    Funded by: Wellcome Trust

    Nucleic acids research 2008;36;13;e82

  • An evolutionary perspective on Y-chromosomal variation and male infertility.

    Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridgeshire, UK. cts@sanger.ac.uk

    Genetic variation on the Y chromosome is one of the best-documented causes of male infertility, but the genes responsible have still not been identified. This review discusses how an evolutionary perspective may help with interpretation of the data available and suggest novel approaches to identify key genes. Comparison with the chimpanzee Y chromosome indicates that USP9Y is dispensable in apes, but that multiple copies of TSPY1 may have an important role. Comparisons between infertile and control groups in search of genetic susceptibility factors are more complex for the Y chromosome than for the rest of the genome because of population stratification and require unusual levels of confirmation. But the extreme population stratification exhibited by the Y also allows populations particularly suitable for some studies to be identified, such as the partial AZFc deletions common in Northern European populations where further dissection of this complex structural region would be facilitated.

    Funded by: Wellcome Trust

    International journal of andrology 2008;31;4;376-82

  • Variation of the oxytocin/neurophysin I (OXT) gene in four human populations.

    Xu Y, Xue Y, Asan, Daly A, Wu L and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs., CB10 1SA, UK.

    Oxytocin is a short peptide with multiple functions in human biology and has been implicated in autism. We aimed to determine the normal pattern of variation around the oxytocin gene and resequenced it and its flanking regions in 91 individuals from four HapMap populations and one chimpanzee. We identified 14 single nucleotide polymorphisms (SNPs), all noncoding, including eight that were novel. Population genetic analyses were largely consistent with a neutral evolutionary history, but a Hudson-Kreitman-Aguadé (HKA) test revealed more variation within the human population than expected from the level of chimpanzee-human divergence.

    Funded by: Wellcome Trust: 077009

    Journal of human genetics 2008;53;7;637-43

  • Adaptive evolution of UGT2B17 copy-number variation.

    Xue Y, Sun D, Daly A, Yang F, Zhou X, Zhao M, Huang N, Zerjal T, Lee C, Carter NP, Hurles ME and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    The human UGT2B17 gene varies in copy number from zero to two per individual and also differs in mean number between populations from Africa, Europe, and East Asia. We show that such a high degree of geographical variation is unusual and investigate its evolutionary history. This required first reinterpreting the reference sequence in this region of the genome, which is misassembled from the two different alleles separated by an artifactual gap. A corrected assembly identifies the polymorphism as a 117 kb deletion arising by nonallelic homologous recombination between approximately 4.9 kb segmental duplications and allows the deletion breakpoint to be identified. We resequenced approximately 12 kb of DNA spanning the breakpoint in 91 humans from three HapMap and one extended HapMap populations and one chimpanzee. Diversity was unusually high and the time to the most recent common ancestor was estimated at approximately 2.4 or approximately 3.0 million years by two different methods, with evidence of balancing selection in Europe. In contrast, diversity was low in East Asia where a single haplotype predominated, suggesting positive selection for the deletion in this part of the world.

    Funded by: Wellcome Trust

    American journal of human genetics 2008;83;3;337-46

  • Identifying genetic traces of historical expansions: Phoenician footprints in the Mediterranean.

    Zalloua PA, Platt DE, El Sibai M, Khalife J, Makhoul N, Haber M, Xue Y, Izaabel H, Bosch E, Adams SM, Arroyo E, López-Parra AM, Aler M, Picornell A, Ramon M, Jobling MA, Comas D, Bertranpetit J, Wells RS, Tyler-Smith C and Genographic Consortium

    Lebanese American University, Chouran, Beirut 1102 2801, Lebanon.

    The Phoenicians were the dominant traders in the Mediterranean Sea two thousand to three thousand years ago and expanded from their homeland in the Levant to establish colonies and trading posts throughout the Mediterranean, but then they disappeared from history. We wished to identify their male genetic traces in modern populations. Therefore, we chose Phoenician-influenced sites on the basis of well-documented historical records and collected new Y-chromosomal data from 1330 men from six such sites, as well as comparative data from the literature. We then developed an analytical strategy to distinguish between lineages specifically associated with the Phoenicians and those spread by geographically similar but historically distinct events, such as the Neolithic, Greek, and Jewish expansions. This involved comparing historically documented Phoenician sites with neighboring non-Phoenician sites for the identification of weak but systematic signatures shared by the Phoenician sites that could not readily be explained by chance or by other expansions. From these comparisons, we found that haplogroup J2, in general, and six Y-STR haplotypes, in particular, exhibited a Phoenician signature that contributed > 6% to the modern Phoenician-influenced populations examined. Our methodology can be applied to any historically documented expansion in which contact and noncontact sites can be identified.

    Funded by: Wellcome Trust: 057559

    American journal of human genetics 2008;83;5;633-42

  • Y-chromosomal diversity in Lebanon is structured by recent historical events.

    Zalloua PA, Xue Y, Khalife J, Makhoul N, Debiane L, Platt DE, Royyuru AK, Herrera RJ, Hernanz DF, Blue-Smith J, Wells RS, Comas D, Bertranpetit J, Tyler-Smith C and Genographic Consortium

    The Lebanese American University, Chouran, Beirut 1102 2801, Lebanon.

    Lebanon is an eastern Mediterranean country inhabited by approximately four million people with a wide variety of ethnicities and religions, including Muslim, Christian, and Druze. In the present study, 926 Lebanese men were typed with Y-chromosomal SNP and STR markers, and unusually, male genetic variation within Lebanon was found to be more strongly structured by religious affiliation than by geography. We therefore tested the hypothesis that migrations within historical times could have contributed to this situation. Y-haplogroup J*(xJ2) was more frequent in the putative Muslim source region (the Arabian Peninsula) than in Lebanon, and it was also more frequent in Lebanese Muslims than in Lebanese non-Muslims. Conversely, haplogroup R1b was more frequent in the putative Christian source region (western Europe) than in Lebanon and was also more frequent in Lebanese Christians than in Lebanese non-Christians. The most common R1b STR-haplotype in Lebanese Christians was otherwise highly specific for western Europe and was unlikely to have reached its current frequency in Lebanese Christians without admixture. We therefore suggest that the Islamic expansion from the Arabian Peninsula beginning in the seventh century CE introduced lineages typical of this area into those who subsequently became Lebanese Muslims, whereas the Crusader activity in the 11th-13th centuries CE introduced western European lineages into Lebanese Christians.

    Funded by: Wellcome Trust

    American journal of human genetics 2008;82;4;873-82

Team

Team members

Qasim Ayub
qa1@sanger.ac.uk Staff Scientist
Maria Cerezo-Fernandez
Visiting postdoctoral scientist
Yuan Chen
Senior Computer Biologist
Vincenza Colonna
Visiting Scientist
Jose Espinosa
Visiting Undergraduate Student
Min Hu
PhD student
Daniel MacArthur
Visiting Scientist
Luca Pagani
Visiting Scientist
Michal Szpak
PhD Student
Wei Wei
Visiting PhD student
Yali Xue
Staff Scientist
Bryndis Yngvadottir
by1@sanger.ac.uk

Qasim Ayub

qa1@sanger.ac.uk Staff Scientist

I graduated from the Khyber Medical College, Peshawar, Pakistan and obtained my Ph.D. from the University of North Texas, USA in 1992 on a Thomas Jefferson Fellowship. Back in Pakistan I joined the Biomedical and Genetic Engineering Laboratories that became the focal point for the Human Genome Diversity Project's South Asian sample collection. Over the last decade I have analyzed DNA variation in ethnic and linguistic groups from Pakistan, in order to understand their genetic origins and relatedness with world populations. In 2006 I was awarded the President of Pakistan's Order of Imtiaz for contributions to science.

Research

I joined the Human Evolution Team in 2008. I am responsible for the team's wet lab and am part of the Analysis Group of the 1000 Genomes Project. My research focuses on the analysis of DNA variation in humans and primates in order to understand how humans adapted to local environments as they established themselves in different parts of the world. I have developed a method that helps identify gene sets that show evidence of Darwinian (positive) selection in comparison with matched controls. I continue to maintain my interest in refining the human Y-chromosomal phylogeny and in South Asian population genetics.
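The idea of comparing a gene set with matched controls can be illustrated with a short sketch. It is not the team's published implementation. It assumes each gene has a single hypothetical neutrality score (for example a Tajima's D value, where lower values are taken as more suggestive of selection) and that controls are matched on gene length alone. The names `matched_controls`, `empirical_p`, `scores`, `lengths` and `tolerance` are invented for this illustration.

```python
import random
from statistics import mean


def matched_controls(target, pool, lengths, tolerance=0.2):
    """Return genes in `pool` whose length is within +/- `tolerance`
    (as a fraction) of the target gene's length.
    Matching on length only is an assumption of this sketch."""
    low = lengths[target] * (1 - tolerance)
    high = lengths[target] * (1 + tolerance)
    return [gene for gene in pool if low <= lengths[gene] <= high]


def empirical_p(targets, scores, lengths, n_sets=1000, tolerance=0.2, seed=0):
    """One-sided empirical test: is the mean score of `targets` lower than
    the mean score of randomly drawn, length-matched control sets?

    Returns (observed_mean, p_value). The +1 terms keep the estimated
    p-value above zero for a finite number of control sets."""
    rng = random.Random(seed)
    target_set = set(targets)
    pool = [gene for gene in scores if gene not in target_set]

    # Pre-compute the candidate controls for every target gene.
    candidates = {t: matched_controls(t, pool, lengths, tolerance) for t in targets}
    if any(not genes for genes in candidates.values()):
        raise ValueError("no matched control available for at least one target gene")

    observed = mean(scores[t] for t in targets)
    hits = 0
    for _ in range(n_sets):
        # One control set = one matched control gene drawn per target gene.
        control_mean = mean(scores[rng.choice(candidates[t])] for t in targets)
        if control_mean <= observed:
            hits += 1
    return observed, (hits + 1) / (n_sets + 1)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    lengths = {"T1": 1000, "T2": 2000, "C1": 1000, "C2": 1100, "C3": 2000, "C4": 1900}
    scores = {"T1": -2.0, "T2": -2.2, "C1": 0.1, "C2": -0.3, "C3": 0.4, "C4": 0.0}
    print(empirical_p(["T1", "T2"], scores, lengths, n_sets=99))
```

A real analysis would probably match on more properties than length and combine several neutrality statistics. The sketch shows only the core comparison of an observed gene-set summary against a distribution built from matched control sets.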

References

  • FOXP2 targets show evidence of positive selection in European populations.

    Ayub Q, Yngvadottir B, Chen Y, Xue Y, Hu M, Vernes SC, Fisher SE and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. qa1@sanger.ac.uk

    Forkhead box P2 (FOXP2) is a highly conserved transcription factor that has been implicated in human speech and language disorders and plays important roles in the plasticity of the developing brain. The pattern of nucleotide polymorphisms in FOXP2 in modern populations suggests that it has been the target of positive (Darwinian) selection during recent human evolution. In our study, we searched for evidence of selection that might have followed FOXP2 adaptations in modern humans. We examined whether or not putative FOXP2 targets identified by chromatin-immunoprecipitation genomic screening show evidence of positive selection. We developed an algorithm that, for any given gene list, systematically generates matched lists of control genes from the Ensembl database, collates summary statistics for three frequency-spectrum-based neutrality tests from the low-coverage resequencing data of the 1000 Genomes Project, and determines whether these statistics are significantly different between the given gene targets and the set of controls. Overall, there was strong evidence of selection of FOXP2 targets in Europeans, but not in the Han Chinese, Japanese, or Yoruba populations. Significant outliers included several genes linked to cellular movement, reproduction, development, and immune cell trafficking, and 13 of these constituted a significant network associated with cardiac arteriopathy. Strong signals of selection were observed for CNTNAP2 and RBFOX1, key neurally expressed genes that have been consistently identified as direct FOXP2 targets in multiple studies and that have themselves been associated with neurodevelopmental disorders involving language dysfunction.

    Funded by: Wellcome Trust: 098051

    American journal of human genetics 2013;92;5;696-706

  • A calibrated human Y-chromosomal phylogeny based on resequencing.

    Wei W, Ayub Q, Chen Y, McCarthy S, Hou Y, Carbone I, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

    We have identified variants present in high-coverage complete sequences of 36 diverse human Y chromosomes from Africa, Europe, South Asia, East Asia, and the Americas, representing eight major haplogroups. After restricting our analysis to 8.97 Mb of the unique male-specific Y sequence, we identified 6662 high-confidence variants, including single-nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs), and indels. We constructed phylogenetic trees using these variants, or subsets of them, and recapitulated the known structure of the tree. Assuming a male mutation rate of 1 × 10^-9 per base pair per year, the time depth of the tree (haplogroups A3-R) was ~101,000-115,000 yr, and the lineages found outside Africa dated to 57,000-74,000 yr, both as expected. In addition, we dated a striking Paleolithic male lineage expansion to 41,000-52,000 yr ago and the node representing the major European Y lineage, R1b, to 4000-13,000 yr ago, supporting a Neolithic origin for these modern European Y chromosomes. In all, we provide a nearly 10-fold increase in the number of Y markers with phylogenetic information, and novel historical insights derived from placing them on a calibrated phylogenetic tree.

    Funded by: Wellcome Trust: 098051

    Genome research 2013;23;2;388-95

  • An integrated map of genetic variation from 1,092 human genomes.

    1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT and McVean GA

    By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

    Funded by: Biotechnology and Biological Sciences Research Council: BB/I021213/1; British Heart Foundation: RG/09/012/28096, RG/09/12/28096; Howard Hughes Medical Institute; Medical Research Council: G0701805, G0801823, G0900747, G0900747(91070); NCI NIH HHS: R01 CA166661, R01CA166661; NCRR NIH HHS: UL1RR024131; NHGRI NIH HHS: P01HG4120, P41HG2371, P41HG4221, R01 HG002898, R01 HG004960, R01 HG007022, R01HG2898, R01HG3698, R01HG4719, R01HG4960, R01HG5701, RC2HG5552, RC2HG5581, U01 HG005728, U01 HG006513, U01 HG006569, U01HG5208, U01HG5209, U01HG5211, U01HG5214, U01HG5715, U01HG5725, U01HG5728, U01HG6513, U01HG6569, U41HG4568, U54 HG003079, U54 HG003273, U54HG3067, U54HG3079, U54HG3273; NHLBI NIH HHS: HL078885, R01HL95045, RC2HL102925, T32HL94284; NIAID NIH HHS: AI077439, AI2009061; NIEHS NIH HHS: ES015794; NIGMS NIH HHS: R01GM59290, T32GM7748, T32GM8283; NIH HHS: DP2OD6514; NIMH NIH HHS: F30 MH098571, R01MH84698; NLM NIH HHS: T15LM7033; PHS HHS: HHSN268201100040C; Wellcome Trust: 085532, 086084, 090532, 095908, WT085475/Z/08/Z, WT085532AIA, WT086084/Z/08/Z, WT089250/Z/09/Z, WT090532/Z/09/Z, WT095552/Z/11/Z, WT098051

    Nature 2012;491;7422;56-65

  • Insights into hominid evolution from the gorilla genome sequence.

    Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, Hobolth A, Lappalainen T, Mailund T, Marques-Bonet T, McCarthy S, Montgomery SH, Schwalie PC, Tang YA, Ward MC, Xue Y, Yngvadottir B, Alkan C, Andersen LN, Ayub Q, Ball EV, Beal K, Bradley BJ, Chen Y, Clee CM, Fitzgerald S, Graves TA, Gu Y, Heath P, Heger A, Karakoc E, Kolb-Kokocinski A, Laird GK, Lunter G, Meader S, Mort M, Mullikin JC, Munch K, O'Connor TD, Phillips AD, Prado-Martinez J, Rogers AS, Sajjadian S, Schmidt D, Shaw K, Simpson JT, Stenson PD, Turner DJ, Vigilant L, Vilella AJ, Whitener W, Zhu B, Cooper DN, de Jong P, Dermitzakis ET, Eichler EE, Flicek P, Goldman N, Mundy NI, Ning Z, Odom DT, Ponting CP, Quail MA, Ryder OA, Searle SM, Warren WC, Wilson RK, Schierup MH, Rogers J, Tyler-Smith C and Durbin R

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    Gorillas are humans' closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago. In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.

    Funded by: Biotechnology and Biological Sciences Research Council; Cancer Research UK: A15603; Howard Hughes Medical Institute; Medical Research Council: G0501331, G0701805; NHGRI NIH HHS: HG002385, U54 HG003079; Wellcome Trust: 062023, 075491/Z/04, 077009, 077192, 077198, 089066, 090532, 095908, WT062023, WT077009, WT077192, WT077198, WT089066

    Nature 2012;483;7388;169-75

  • A systematic survey of loss-of-function variants in human protein-coding genes.

    MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, 1000 Genomes Project Consortium, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB and Tyler-Smith C

    Wellcome Trust Sanger Institute, Hinxton, UK. macarthur@atgu.mgh.harvard.edu

    Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

    Funded by: British Heart Foundation: RG/09/012/28096; NHGRI NIH HHS: U54 HG003273; Wellcome Trust: 085532, 090532, 090532/Z/09/Z, 098051

    Science (New York, N.Y.) 2012;335;6070;823-8

  • Genetic variation in South Asia: assessing the influences of geography, language and ethnicity for understanding history and disease risk.

    Ayub Q and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK. qa1@sanger.ac.uk

    South Asia is home to more than 1.5 billion humans, representing many diverse ethnic, linguistic and religious groups and almost one-quarter of humanity. Modern humans arrived here soon after their departure from Africa approximately 50,000-70,000 years before present (YBP), and several subsequent human migrations and invasions, as well as the unique social structure of the region, have helped shape the pattern of genetic diversity currently observed in these populations. Over the last few decades population geneticists and molecular anthropologists have analyzed DNA variation in indigenous populations from this region in order to catalog their genetic relationships and histories. The emphasis is gradually shifting from the study of population origins to high resolution surveys of DNA variation to address issues of population stratification and genetic susceptibility or resistance to diseases in genome-wide association surveys. We present a historical overview of the genetic studies carried out on populations from this region in order to understand the influence of geographic, linguistic and religious factors on population diversity in this region, and discuss future prospects in light of developments in high throughput genotyping and next generation sequencing technologies.

    Funded by: Wellcome Trust

    Briefings in functional genomics & proteomics 2009;8;5;395-404

  • A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia.

    Dhandapany PS, Sadayappan S, Xue Y, Powell GT, Rani DS, Nallari P, Rai TS, Khullar M, Soares P, Bahl A, Tharkan JM, Vaideeswar P, Rathinavel A, Narasimhan C, Ayapati DR, Ayub Q, Mehdi SQ, Oppenheimer S, Richards MB, Price AL, Patterson N, Reich D, Singh L, Tyler-Smith C and Thangaraj K

    Department of Biochemistry, Madurai Kamaraj University, Madurai 625 021, India.

    Heart failure is a leading cause of mortality in South Asians. However, its genetic etiology remains largely unknown. Cardiomyopathies due to sarcomeric mutations are a major monogenic cause for heart failure (MIM600958). Here, we describe a deletion of 25 bp in the gene encoding cardiac myosin binding protein C (MYBPC3) that is associated with heritable cardiomyopathies and an increased risk of heart failure in Indian populations (initial study OR = 5.3 (95% CI = 2.3-13), P = 2 × 10^-6; replication study OR = 8.59 (3.19-25.05), P = 3 × 10^-8; combined OR = 6.99 (3.68-13.57), P = 4 × 10^-11) and that disrupts cardiomyocyte structure in vitro. Its prevalence was found to be high (approximately 4%) in populations of Indian subcontinental ancestry. The finding of a common risk factor implicated in South Asian subjects with cardiomyopathy will help in identifying and counseling individuals predisposed to cardiac diseases in this region.

    Funded by: NHGRI NIH HHS: R01 HG006399-02; Wellcome Trust: 077009

    Nature genetics 2009;41;2;187-91

  • Y-chromosomal evidence for a limited Greek contribution to the Pathan population of Pakistan.

    Firasat S, Khaliq S, Mohyuddin A, Papaioannou M, Tyler-Smith C, Underhill PA and Ayub Q

    Biomedical and Genetic Engineering Division, Dr. AQ Khan Research Laboratories, Islamabad, Pakistan.

    Three Pakistani populations residing in northern Pakistan, the Burusho, Kalash and Pathan, claim descent from Greek soldiers associated with Alexander's invasion of southwest Asia. Earlier studies have excluded a substantial Greek genetic input into these populations, but left open the question of a smaller contribution. We have now typed 90 binary polymorphisms and 16 multiallelic, short-tandem-repeat (STR) loci mapping to the male-specific portion of the human Y chromosome in 952 males, including 77 Greeks, in order to re-investigate this question. In pairwise comparisons between the Greeks and the three Pakistani populations using genetic distance measures sensitive to recent events, the lowest distances were observed between the Greeks and the Pathans. Clade E3b1 lineages, which were frequent in the Greeks but not in Pakistan, were nevertheless observed in two Pathan individuals, one of whom shared a 16 Y-STR haplotype with the Greeks. The worldwide distribution of a shortened (9 Y-STR) version of this haplotype, determined from database information, was concentrated in Macedonia and Greece, suggesting an origin there. Although based on only a few unrelated descendants, this provides strong evidence for a European origin for a small proportion of the Pathan Y chromosomes.

    Funded by: Wellcome Trust: 077009

    European journal of human genetics : EJHG 2007;15;1;121-6

  • Reconstruction of human evolutionary tree using polymorphic autosomal microsatellites.

    Ayub Q, Mansoor A, Ismail M, Khaliq S, Mohyuddin A, Hameed A, Mazhar K, Rehman S, Siddiqi S, Papaioannou M, Piazza A, Cavalli-Sforza LL and Mehdi SQ

    Biomedical and Genetic Engineering Division, Dr. A.Q. Khan Research Laboratories, Islamabad 44000, Pakistan.

    Allelic frequencies of 182 tri- and tetra-autosomal microsatellites were used to examine phylogenetic relationships among 19 extant human populations. In particular, because the languages of the Basques and Hunza Burusho have been suggested to have an ancient relationship, this study sought to explore the genetic relationship between these two major language isolate populations and to compare them with other human populations. The work presented here shows that the microsatellite allelic diversity and the number of unique alleles were highest in sub-Saharan Africans. Neighbor-joining trees based on genetic distances and principal component analyses separated populations from different continents, and are consistent with an African origin for modern humans. For the first time, with biparentally transmitted markers, the microsatellite tree also shows that the San are the first branch of the human tree before the branch leading to all other Africans. In contrast to an earlier study, these results provided no evidence of a genetic relationship among the two language isolate groups. Genetic relationships, as ascertained by these microsatellites, are dictated primarily by geographic proximity rather than by remote linguistic origin, Mantel test, R(0) = 0.484, g = 3.802 (critical g value = 1.645; P = 0.05).

    American journal of physical anthropology 2003;122;3;259-68

  • The genetic legacy of the Mongols.

    Zerjal T, Xue Y, Bertorelle G, Wells RS, Bao W, Zhu S, Qamar R, Ayub Q, Mohyuddin A, Fu S, Li P, Yuldasheva N, Ruzibakiev R, Xu J, Shu Q, Du R, Yang H, Hurles ME, Robinson E, Gerelsaikhan T, Dashnyam B, Mehdi SQ and Tyler-Smith C

    Department of Biochemistry, University of Oxford, Oxford, United Kingdom.

    We have identified a Y-chromosomal lineage with several unusual features. It was found in 16 populations throughout a large region of Asia, stretching from the Pacific to the Caspian Sea, and was present at high frequency: approximately 8% of the men in this region carry it, and it thus makes up approximately 0.5% of the world total. The pattern of variation within the lineage suggested that it originated in Mongolia approximately 1,000 years ago. Such a rapid spread cannot have occurred by chance; it must have been a result of selection. The lineage is carried by likely male-line descendants of Genghis Khan, and we therefore propose that it has spread by a novel form of social selection resulting from their behavior.

    American journal of human genetics 2003;72;3;717-21

Maria Cerezo-Fernandez

- Visiting postdoctoral scientist

I hold a BSc in Molecular Biology and an MSc in Molecular Medicine. In 2011 I completed my PhD under the supervision of Antonio Salas and Angel Carracedo in Santiago de Compostela, Spain. I was also supervised by Prof. Cristian Capelli in Oxford during a predoctoral period. My PhD focused on the analysis of human mitochondrial DNA variability and its application to population genetic, forensic and clinical studies. Most of my work has focused on population genetic studies with African samples, but I have also worked with European and American samples and on forensic and clinical studies.

Research

At Sanger, I am studying the genetic variability of human populations using the results of the 1000 Genomes Project. Here the focus is not on a single uniparental marker but on whole-genome data. I am interested in our evolution as a species from a genetic perspective: I try to infer ancient events from modern human populations and to understand the timing of these events. Knowledge of our variation as a species can also be used in medical and forensic studies.

References

  • Reconstructing ancient mitochondrial DNA links between Africa and Europe.

    Cerezo M, Achilli A, Olivieri A, Perego UA, Gómez-Carballa A, Brisighelli F, Lancioni H, Woodward SR, López-Soto M, Carracedo A, Capelli C, Torroni A and Salas A

    Unidade de Xenética, Departamento de Anatomía Patolóxica e Ciencias Forenses, and Instituto de Ciencias Forenses, Facultade de Medicina, Universidad de Santiago de Compostela, Santiago de Compostela, Galicia, Spain.

    Mitochondrial DNA (mtDNA) lineages of macro-haplogroup L (excluding the derived L3 branches M and N) represent the majority of the typical sub-Saharan mtDNA variability. In Europe, these mtDNAs account for <1% of the total but, when analyzed at the level of control region, they show no signals of having evolved within the European continent, an observation that is compatible with a recent arrival from the African continent. To further evaluate this issue, we analyzed 69 mitochondrial genomes belonging to various L sublineages from a wide range of European populations. Phylogeographic analyses showed that ~65% of the European L lineages most likely arrived in rather recent historical times, including the Romanization period, the Arab conquest of the Iberian Peninsula and Sicily, and during the period of the Atlantic slave trade. However, the remaining 35% of L mtDNAs form European-specific subclades, revealing that there was gene flow from sub-Saharan Africa toward Europe as early as 11,000 yr ago.

    Genome research 2012;22;5;821-6

  • New insights into the Lake Chad Basin population structure revealed by high-throughput genotyping of mitochondrial DNA coding SNPs.

    Cerezo M, Černý V, Carracedo Á and Salas A

    Unidade de Xenética, Departamento de Anatomía Patolóxica e Ciencias Forenses, Instituto de Medicina Legal, Facultade de Medicina, Universidade de Santiago de Compostela, CIBERER, Galicia, Spain.

    Background: Located in the Sudan belt, the Chad Basin forms a remarkable ecosystem, where several unique agricultural and pastoral techniques have been developed. Both from an archaeological and a genetic point of view, this region has been interpreted to be the center of a bidirectional corridor connecting West and East Africa, as well as a meeting point for populations coming from North Africa through the Saharan desert.

    Samples from twelve ethnic groups from the Chad Basin (n = 542) have been high-throughput genotyped for 230 coding region mitochondrial DNA (mtDNA) Single Nucleotide Polymorphisms (mtSNPs) using Matrix-Assisted Laser Desorption/Ionization Time-Of-Flight (MALDI-TOF) mass spectrometry. This set of mtSNPs allowed for much better phylogenetic resolution than previous studies of this geographic region, enabling new insights into its population history. Notable haplogroup (hg) heterogeneity has been observed in the Chad Basin mirroring the different demographic histories of these ethnic groups. As estimated using a Bayesian framework, nomadic populations showed negative growth which was not always correlated to their estimated effective population sizes. Nomads also showed lower diversity values than sedentary groups.

    Compared to sedentary population, nomads showed signals of stronger genetic drift occurring in their ancestral populations. These populations, however, retained more haplotype diversity in their hypervariable segments I (HVS-I), but not their mtSNPs, suggesting a more ancestral ethnogenesis. Whereas the nomadic population showed a higher Mediterranean influence signaled mainly by sub-lineages of M1, R0, U6, and U5, the other populations showed a more consistent sub-Saharan pattern. Although lifestyle may have an influence on diversity patterns and hg composition, analysis of molecular variance has not identified these differences. The present study indicates that analysis of mtSNPs at high resolution could be a fast and extensive approach for screening variation in population studies where labor-intensive techniques such as entire genome sequencing remain unfeasible.

    PloS one 2011;6;4;e18682

  • Linking the sub-Saharan and West Eurasian gene pools: maternal and paternal heritage of the Tuareg nomads from the African Sahel.

    Pereira L, Cerný V, Cerezo M, Silva NM, Hájek M, Vasíková A, Kujanová M, Brdicka R and Salas A

    Instituto de Patologia e Imunologia Molecular da Universidade do Porto (IPATIMUP), Porto, Portugal.

    The Tuareg presently live in the Sahara and the Sahel. Their ancestors are commonly believed to be the Garamantes of the Libyan Fezzan, ever since it was suggested by authors of antiquity. Biological evidence, based on classical genetic markers, however, indicates kinship with the Beja of Eastern Sudan. Our study of mitochondrial DNA (mtDNA) sequences and Y chromosome SNPs of three different southern Tuareg groups from Mali, Burkina Faso and the Republic of Niger reveals a West Eurasian-North African composition of their gene pool. The data show that certain genetic lineages could not have been introduced into this population earlier than approximately 9000 years ago whereas local expansions establish a minimal date at around 3000 years ago. Some of the mtDNA haplogroups observed in the Tuareg population were involved in the post-Last Glacial Maximum human expansion from Iberian refugia towards both Europe and North Africa. Interestingly, no Near Eastern mtDNA lineages connected with the Neolithic expansion have been observed in our population sample. On the other hand, the Y chromosome SNPs data show that the paternal lineages can very probably be traced to the Near Eastern Neolithic demic expansion towards North Africa, a period that is otherwise concordant with the above-mentioned mtDNA expansion. The time frame for the migration of the Tuareg towards the African Sahel belt overlaps that of early Holocene climatic changes across the Sahara (from the optimal greening approximately 10 000 YBP to the extant aridity beginning at approximately 6000 YBP) and the migrations of other African nomadic peoples in the area.

    European journal of human genetics : EJHG 2010;18;8;915-23

  • Applications of MALDI-TOF MS to large-scale human mtDNA population-based studies.

    Cerezo M, Cerný V, Carracedo A and Salas A

    Unidade de Xenética, Departamento de Anatomía Patolóxica e Ciencias Forenses, Instituto de Medicina Legal, Facultade de Medicina, Universidade de Santiago de Compostela, Galicia, Spain.

    Analysis of the mitochondrial DNA variation in populations is commonly carried out in many fields of biomedical research. We propose the analysis of mitochondrial DNA coding region SNP (mtSNP) variation to a high level of phylogenetic resolution based on MALDI-TOF MS. The African phylogeny has been chosen to test the applicability of the technique but any other part of the worldwide phylogeny (or any other mtSNP panel) could be equally suitable for MALDI-TOF MS genotyping. SNP selection thus aimed to fully cover all the mtSNPs defining major and minor branches of the known African tree, including, macro-haplogroup L, and haplogroups M1, and U6. A total of 230 mtSNPs were finally selected. We used tests samples collected from two different African locations, namely, Mozambique and Chad Basin. Different internal genotyping controls and other indirect approaches (e.g. phylogenetic checking coupled with automatic sequencing) were used in order to evaluate the reproducibility of the technique, which resulted to be 100% using samples previously subjected to whole genome amplification. The advantages of the MALDI-TOF MS are also discussed in comparison with other popular methods such as minisequencing, highlighting its high-throughput nature, which is particularly suitable for case-control medical studies, forensic databasing or population and anthropological studies.

    Electrophoresis 2009;30;21;3665-73

Yuan Chen

- Senior Computer Biologist

From 1980 to 1985, I studied for a Bachelor of Medicine (equivalent to M.B., Ch.B.) at the Department of Medicine, Tong-Ji Medical University, Wuhan, P.R. China. I then worked for one year as a physician at a hospital in Wuhan. I gained an MSc in Bio-computing / Bioinformatics at the University of Manchester in 1989 and worked there as a research assistant on molecular modelling projects. I joined the Sanger Centre in 1998, where I investigated SNP detection using overlapping clones from chromosome 22. About three years later, I moved to the European Bioinformatics Institute (EBI), where I worked on the Variation database in the Ensembl project for about 9 years.

Research

In May 2010, I rejoined the Sanger Institute as part of the Human Evolution Group, where I develop pipelines using Perl and a MySQL database to provide data for analysis in various projects, such as the 1000 Genomes, Gorilla, 500 Exomes and Gene Selection Detection projects.

References

  • A calibrated human Y-chromosomal phylogeny based on resequencing.

    Wei W, Ayub Q, Chen Y, McCarthy S, Hou Y, Carbone I, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom.

    We have identified variants present in high-coverage complete sequences of 36 diverse human Y chromosomes from Africa, Europe, South Asia, East Asia, and the Americas, representing eight major haplogroups. After restricting our analysis to 8.97 Mb of the unique male-specific Y sequence, we identified 6662 high-confidence variants, including single-nucleotide polymorphisms (SNPs), multi-nucleotide polymorphisms (MNPs), and indels. We constructed phylogenetic trees using these variants, or subsets of them, and recapitulated the known structure of the tree. Assuming a male mutation rate of 1 × 10(-9) per base pair per year, the time depth of the tree (haplogroups A3-R) was ~101,000-115,000 yr, and the lineages found outside Africa dated to 57,000-74,000 yr, both as expected. In addition, we dated a striking Paleolithic male lineage expansion to 41,000-52,000 yr ago and the node representing the major European Y lineage, R1b, to 4000-13,000 yr ago, supporting a Neolithic origin for these modern European Y chromosomes. In all, we provide a nearly 10-fold increase in the number of Y markers with phylogenetic information, and novel historical insights derived from placing them on a calibrated phylogenetic tree.

    Funded by: Wellcome Trust: 098051

    Genome research 2013;23;2;388-95

  • Deleterious- and disease-allele prevalence in healthy individuals: insights from current predictions, mutation databases, and population-scale resequencing.

    Xue Y, Chen Y, Ayub Q, Huang N, Ball EV, Mort M, Phillips AD, Shaw K, Stenson PD, Cooper DN, Tyler-Smith C and 1000 Genomes Project Consortium

    The Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    We have assessed the numbers of potentially deleterious variants in the genomes of apparently healthy humans by using (1) low-coverage whole-genome sequence data from 179 individuals in the 1000 Genomes Pilot Project and (2) current predictions and databases of deleterious variants. Each individual carried 281-515 missense substitutions, 40-85 of which were homozygous, predicted to be highly damaging. They also carried 40-110 variants classified by the Human Gene Mutation Database (HGMD) as disease-causing mutations (DMs), 3-24 variants in the homozygous state, and many polymorphisms putatively associated with disease. Whereas many of these DMs are likely to represent disease-allele-annotation errors, between 0 and 8 DMs (0-1 homozygous) per individual are predicted to be highly damaging, and some of them provide information of medical relevance. These analyses emphasize the need for improved annotation of disease alleles both in mutation databases and in the primary literature; some HGMD mutation data have been recategorized on the basis of the present findings, an iterative process that is both necessary and ongoing. Our estimates of deleterious-allele numbers are likely to be subject to both overcounting and undercounting. However, our current best mean estimates of ~400 damaging variants and ~2 bona fide disease mutations per individual are likely to increase rather than decrease as sequencing studies ascertain rare variants more effectively and as additional disease alleles are discovered.

    Funded by: Wellcome Trust: 085532, WT098051

    American journal of human genetics 2012;91;6;1022-32

  • An integrated map of genetic variation from 1,092 human genomes.

    1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT and McVean GA

    By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

    Funded by: Biotechnology and Biological Sciences Research Council: BB/I021213/1; British Heart Foundation: RG/09/012/28096, RG/09/12/28096; Howard Hughes Medical Institute; Medical Research Council: G0701805, G0801823, G0900747, G0900747(91070); NCI NIH HHS: R01 CA166661, R01CA166661; NCRR NIH HHS: UL1RR024131; NHGRI NIH HHS: P01HG4120, P41HG2371, P41HG4221, R01 HG002898, R01 HG004960, R01 HG007022, R01HG2898, R01HG3698, R01HG4719, R01HG4960, R01HG5701, RC2HG5552, RC2HG5581, U01 HG005728, U01 HG006513, U01 HG006569, U01HG5208, U01HG5209, U01HG5211, U01HG5214, U01HG5715, U01HG5725, U01HG5728, U01HG6513, U01HG6569, U41HG4568, U54 HG003079, U54 HG003273, U54HG3067, U54HG3079, U54HG3273; NHLBI NIH HHS: HL078885, R01HL95045, RC2HL102925, T32HL94284; NIAID NIH HHS: AI077439, AI2009061; NIEHS NIH HHS: ES015794; NIGMS NIH HHS: R01GM59290, T32GM7748, T32GM8283; NIH HHS: DP2OD6514; NIMH NIH HHS: F30 MH098571, R01MH84698; NLM NIH HHS: T15LM7033; PHS HHS: HHSN268201100040C; Wellcome Trust: 085532, 086084, 090532, 095908, WT085475/Z/08/Z, WT085532AIA, WT086084/Z/08/Z, WT089250/Z/09/Z, WT090532/Z/09/Z, WT095552/Z/11/Z, WT098051

    Nature 2012;491;7422;56-65

  • High altitude adaptation in Daghestani populations from the Caucasus.

    Pagani L, Ayub Q, MacArthur DG, Xue Y, Baillie JK, Chen Y, Kozarewa I, Turner DJ, Tofanelli S, Bulayeva K, Kidd K, Paoli G and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, UK. lp8@sanger.ac.uk

    We have surveyed 15 high-altitude adaptation candidate genes for signals of positive selection in North Caucasian highlanders using targeted re-sequencing. A total of 49 unrelated Daghestani from three ethnic groups (Avars, Kubachians, and Laks) living in ancient villages located at around 2,000 m above sea level were chosen as the study population. Caucasian (Adygei living at sea level, N = 20) and CEU (CEPH Utah residents with ancestry from northern and western Europe; N = 20) were used as controls. Candidate genes were compared with 20 putatively neutral control regions resequenced in the same individuals. The regions of interest were amplified by long-PCR, pooled according to individual, indexed by adding an eight-nucleotide tag, and sequenced using the Illumina GAII platform. 1,066 SNPs were called using false discovery and false negative thresholds of ~6%. The neutral regions provided an empirical null distribution to compare with the candidate genes for signals of selection. Two genes stood out. In Laks, a non-synonymous variant within HIF1A already known to be associated with improvement in oxygen metabolism was rediscovered, and in Kubachians a cluster of 13 SNPs located in a conserved intronic region within EGLN1 showing high population differentiation was found. These variants illustrate both the common pathways of adaptation to high altitude in different populations and features specific to the Daghestani populations, showing how even a mildly hypoxic environment can lead to genetic adaptation.

    Funded by: Wellcome Trust

    Human genetics 2012;131;3;423-33

  • A map of human genome variation from population-scale sequencing.

    1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME and McVean GA

    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

    Funded by: British Heart Foundation: RG/09/012/28096; Howard Hughes Medical Institute; Medical Research Council: G0801823, G0801823(89305); NCRR NIH HHS: S10RR025056; NHGRI NIH HHS: 01HG3229, N01HG62088, P01HG4120, P41HG2371, P41HG4221, P41HG4222, P50HG2357, R01 HG003229, R01 HG003229-05, R01 HG004719-01, R01 HG004719-02, R01 HG004719-02S1, R01 HG004719-03, R01 HG004719-04, R01HG2651, R01HG3698, R01HG4333, R01HG4719, R01HG4960, RC2 HG005552-01, RC2 HG005552-02, RC2HG5552, U01HG5208, U01HG5209, U01HG5210, U01HG5211, U01HG5214, U41HG4568, U54 HG003273, U54HG2750, U54HG2757, U54HG3067, U54HG3079, U54HG3273; NIGMS NIH HHS: R01GM59290, R01GM72861, T32 GM007753; NIMH NIH HHS: 01MH84698; Wellcome Trust: 075491, 077009, 077014, 077192, 081407, 085532, 086084, 089061, 089062, 089088, WT075491/Z/04, WT077009, WT081407/Z/06/Z, WT085532AIA, WT086084/Z/08/Z, WT089088/Z/09/Z

    Nature 2010;467;7319;1061-73

  • A database and API for variation, dense genotyping and resequencing data.

    Rios D, McLaren WM, Chen Y, Birney E, Stabenau A, Flicek P and Cunningham F

    European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

    Background: Advances in sequencing and genotyping technologies are leading to the widespread availability of multi-species variation data, dense genotype data and large-scale resequencing projects. The 1000 Genomes Project and similar efforts in other species are challenging the methods previously used for storage and manipulation of such data necessitating the redesign of existing genome-wide bioinformatics resources.

    Results: Ensembl has created a database and software library to support data storage, analysis and access to the existing and emerging variation data from large mammalian and vertebrate genomes. These tools scale to thousands of individual genome sequences and are integrated into the Ensembl infrastructure for genome annotation and visualisation. The database and software system is easily expanded to integrate both public and non-public data sources in the context of an Ensembl software installation and is already being used outside of the Ensembl project in a number of database and application environments.

    Conclusions: Ensembl's powerful, flexible and open source infrastructure for the management of variation, genotyping and resequencing data is freely available at http://www.ensembl.org.

    Funded by: Medical Research Council; Wellcome Trust

    BMC bioinformatics 2010;11;238

  • Ensembl variation resources.

    Chen Y, Cunningham F, Rios D, McLaren WM, Smith J, Pritchard B, Spudich GM, Brent S, Kulesha E, Marin-Garcia P, Smedley D, Birney E and Flicek P

    European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

    Background: The maturing field of genomics is rapidly increasing the number of sequenced genomes and producing more information from those previously sequenced. Much of this additional information is variation data derived from sampling multiple individuals of a given species with the goal of discovering new variants and characterising the population frequencies of the variants that are already known. These data have immense value for many studies, including those designed to understand evolution and connect genotype to phenotype. Maximising the utility of the data requires that it be stored in an accessible manner that facilitates the integration of variation data with other genome resources such as gene annotation and comparative genomics.

    Description: The Ensembl project provides comprehensive and integrated variation resources for a wide variety of chordate genomes. This paper provides a detailed description of the sources of data and the methods for creating the Ensembl variation databases. It also explores the utility of the information by explaining the range of query options available, from using interactive web displays, to online data mining tools and connecting directly to the data servers programmatically. It gives a good overview of the variation resources and future plans for expanding the variation data within Ensembl.

    Conclusions: Variation data is an important key to understanding the functional and phenotypic differences between individuals. The development of new sequencing and genotyping technologies is greatly increasing the amount of variation data known for almost all genomes. The Ensembl variation resources are integrated into the Ensembl genome browser and provide a comprehensive way to access this data in the context of a widely used genome bioinformatics system. All Ensembl data is freely available at http://www.ensembl.org and from the public MySQL database server at ensembldb.ensembl.org.

    Funded by: Medical Research Council; Wellcome Trust

    BMC genomics 2010;11;293

  • Locus Reference Genomic sequences: an improved basis for describing human DNA variants.

    Dalgleish R, Flicek P, Cunningham F, Astashyn A, Tully RE, Proctor G, Chen Y, McLaren WM, Larsson P, Vaughan BW, Béroud C, Dobson G, Lehväslaiho H, Taschner PE, den Dunnen JT, Devereau A, Birney E, Brookes AJ and Maglott DR

    Department of Genetics, University of Leicester, University Road, Leicester LE1 7RH, UK. raymond.dalgleish@le.ac.uk.

    As our knowledge of the complexity of gene architecture grows, and we increase our understanding of the subtleties of gene expression, the process of accurately describing disease-causing gene variants has become increasingly problematic. In part, this is due to current reference DNA sequence formats that do not fully meet present needs. Here we present the Locus Reference Genomic (LRG) sequence format, which has been designed for the specific purpose of gene variant reporting. The format builds on the successful National Center for Biotechnology Information (NCBI) RefSeqGene project and provides a single-file record containing a uniquely stable reference DNA sequence along with all relevant transcript and protein sequences essential to the description of gene variants. In principle, LRGs can be created for any organism, not just human. In addition, we recognize the need to respect legacy numbering systems for exons and amino acids and the LRG format takes account of these. We hope that widespread adoption of LRGs - which will be created and maintained by the NCBI and the European Bioinformatics Institute (EBI) - along with consistent use of the Human Genome Variation Society (HGVS)-approved variant nomenclature will reduce errors in the reporting of variants in the literature and improve communication about variants affecting human health. Further information can be found on the LRG web site: http://www.lrg-sequence.org.

    Genome medicine 2010;2;4;24

  • A first-generation linkage disequilibrium map of human chromosome 22.

    Dawson E, Abecasis GR, Bumpstead S, Chen Y, Hunt S, Beare DM, Pabial J, Dibling T, Tinsley E, Kirby S, Carter D, Papaspyridonos M, Livingstone S, Ganske R, Lõhmussaar E, Zernant J, Tõnisson N, Remm M, Mägi R, Puurand T, Vilo J, Kurg A, Rice K, Deloukas P, Mott R, Metspalu A, Bentley DR, Cardon LR and Dunham I

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    DNA sequence variants in specific genes or regions of the human genome are responsible for a variety of phenotypes such as disease risk or variable drug response. These variants can be investigated directly, or through their non-random associations with neighbouring markers (called linkage disequilibrium (LD)). Here we report measurement of LD along the complete sequence of human chromosome 22. Duplicate genotyping and analysis of 1,504 markers in Centre d'Etude du Polymorphisme Humain (CEPH) reference families at a median spacing of 15 kilobases (kb) reveals a highly variable pattern of LD along the chromosome, in which extensive regions of nearly complete LD up to 804 kb in length are interspersed with regions of little or no detectable LD. The LD patterns are replicated in a panel of unrelated UK Caucasians. There is a strong correlation between high LD and low recombination frequency in the extant genetic map, suggesting that historical and contemporary recombination rates are similar. This study demonstrates the feasibility of developing genome-wide maps of LD.

    Nature 2002;418;6897;544-8

  • A SNP resource for human chromosome 22: extracting dense clusters of SNPs from the genomic sequence.

    Dawson E, Chen Y, Hunt S, Smink LJ, Hunt A, Rice K, Livingston S, Bumpstead S, Bruskiewich R, Sham P, Ganske R, Adams M, Kawasaki K, Shimizu N, Minoshima S, Roe B, Bentley D and Dunham I

    The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK.

    The recent publication of the complete sequence of human chromosome 22 provides a platform from which to investigate genomic sequence variation. We report the identification and characterization of 12,267 potential variants (SNPs and other small insertions/deletions) of human chromosome 22, discovered in the overlaps of 460 clones used for the chromosome sequencing. We found, on average, 1 potential variant every 1.07 kb and approximately 18% of the potential variants involve insertions/deletions. The SNPs have been positioned both relative to each other, and to genes, predicted genes, repeat sequences, other genetic markers, and the 2730 SNPs previously identified on the chromosome. A subset of the SNPs were verified experimentally using either PCR-RFLP or genomic Invader assays. These experiments confirmed 92% of the potential variants in a panel of 92 individuals. [Details of the SNPs and RFLP assays can be found at http://www.sanger.ac.uk and in dbSNP.]

    Genome research 2001;11;1;170-8

Vincenza Colonna

- Visiting Scientist

I am a research scientist at the National Research Council (Institute of Genetics and Biophysics) in Napoli, Italy.

I graduated in Napoli and worked as a postdoc at the University of Ferrara (Italy) and at Sanger. I have been a lecturer in Genetics and Biological Databases at the University of Ferrara.

Research

I am interested in understanding the processes leading to the current levels and distribution of genomic variation in humans. My current work is mainly focused on population genetic analyses of the “1000 Genomes Project” data. In addition to this I continue to work on population isolates.

References

  • Small effective population size and genetic homogeneity in the Val Borbera isolate.

    Colonna V, Pistis G, Bomba L, Mona S, Matullo G, Boano R, Sala C, Viganò F, Torroni A, Achilli A, Hooshiar Kashani B, Malerba G, Gambaro G, Soranzo N and Toniolo D

    Institute of Genetics and Biophysics 'A. Buzzati-Traverso', National Research Council (CNR), Naples, Italy. vincenza.colonna@igb.cnr.it

    Population isolates are a valuable resource for medical genetics because of their reduced genetic, phenotypic and environmental heterogeneity. Further, extended linkage disequilibrium (LD) allows accurate haplotyping and imputation. In this study, we use nuclear and mitochondrial DNA data to determine to what extent the geographically isolated population of the Val Borbera valley also presents features of genetic isolation. We performed a comparative analysis of population structure and estimated effective population size exploiting LD data. We also evaluated haplotype sharing through the analysis of segments of autozygosity. Our findings reveal that the valley has features characteristic of a genetic isolate, including reduced genetic heterogeneity and reduced effective population size. We show that this population has been subject to prolonged genetic drift and thus we expect many variants that are rare in the general population to reach significant frequency values in the valley, making this population suitable for the identification of rare variants underlying complex traits.

    European journal of human genetics : EJHG 2013;21;1;89-94

  • An integrated map of genetic variation from 1,092 human genomes.

    1000 Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT and McVean GA

    By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

    Funded by: Biotechnology and Biological Sciences Research Council: BB/I021213/1; British Heart Foundation: RG/09/012/28096, RG/09/12/28096; Howard Hughes Medical Institute; Medical Research Council: G0701805, G0801823, G0900747, G0900747(91070); NCI NIH HHS: R01 CA166661, R01CA166661; NCRR NIH HHS: UL1RR024131; NHGRI NIH HHS: P01HG4120, P41HG2371, P41HG4221, R01 HG002898, R01 HG004960, R01 HG007022, R01HG2898, R01HG3698, R01HG4719, R01HG4960, R01HG5701, RC2HG5552, RC2HG5581, U01 HG005728, U01 HG006513, U01 HG006569, U01HG5208, U01HG5209, U01HG5211, U01HG5214, U01HG5715, U01HG5725, U01HG5728, U01HG6513, U01HG6569, U41HG4568, U54 HG003079, U54 HG003273, U54HG3067, U54HG3079, U54HG3273; NHLBI NIH HHS: HL078885, R01HL95045, RC2HL102925, T32HL94284; NIAID NIH HHS: AI077439, AI2009061; NIEHS NIH HHS: ES015794; NIGMS NIH HHS: R01GM59290, T32GM7748, T32GM8283; NIH HHS: DP2OD6514; NIMH NIH HHS: F30 MH098571, R01MH84698; NLM NIH HHS: T15LM7033; PHS HHS: HHSN268201100040C; Wellcome Trust: 085532, 086084, 090532, 095908, WT085475/Z/08/Z, WT085532AIA, WT086084/Z/08/Z, WT089250/Z/09/Z, WT090532/Z/09/Z, WT095552/Z/11/Z, WT098051

    Nature 2012;491;7422;56-65

  • IFITM3 restricts the morbidity and mortality associated with influenza.

    Everitt AR, Clare S, Pertel T, John SP, Wash RS, Smith SE, Chin CR, Feeley EM, Sims JS, Adams DJ, Wise HM, Kane L, Goulding D, Digard P, Anttila V, Baillie JK, Walsh TS, Hume DA, Palotie A, Xue Y, Colonna V, Tyler-Smith C, Dunning J, Gordon SB, GenISIS Investigators, MOSAIC Investigators, Smyth RL, Openshaw PJ, Dougan G, Brass AL and Kellam P

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    The 2009 H1N1 influenza pandemic showed the speed with which a novel respiratory virus can spread and the ability of a generally mild infection to induce severe morbidity and mortality in a subset of the population. Recent in vitro studies show that the interferon-inducible transmembrane (IFITM) protein family members potently restrict the replication of multiple pathogenic viruses. Both the magnitude and breadth of the IFITM proteins' in vitro effects suggest that they are critical for intrinsic resistance to such viruses, including influenza viruses. Using a knockout mouse model, we now test this hypothesis directly and find that IFITM3 is essential for defending the host against influenza A virus in vivo. Mice lacking Ifitm3 display fulminant viral pneumonia when challenged with a normally low-pathogenicity influenza virus, mirroring the destruction inflicted by the highly pathogenic 1918 'Spanish' influenza. Similar increased viral replication is seen in vitro, with protection rescued by the re-introduction of Ifitm3. To test the role of IFITM3 in human influenza virus infection, we assessed the IFITM3 alleles of individuals hospitalized with seasonal or pandemic influenza H1N1/09 viruses. We find that a statistically significant number of hospitalized subjects show enrichment for a minor IFITM3 allele (SNP rs12252-C) that alters a splice acceptor site, and functional assays show the minor CC genotype IFITM3 has reduced influenza virus restriction in vitro. Together these data reveal that the action of a single intrinsic immune effector, IFITM3, profoundly alters the course of influenza virus infection in mouse and humans.

    Funded by: Chief Scientist Office; Medical Research Council: G0600511, G0800767, G0800777, G0802752, G0901697, MC_G1001212, MC_U122785833; NIAID NIH HHS: R01 AI091786, R01AI091786; Wellcome Trust: 090382, 090382/Z/09/Z, 090385/Z/09/Z, 098051

    Nature 2012;484;7395;519-23

  • A world in a grain of sand: human history from genetic data.

    Colonna V, Pagani L, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK.

    Genome-wide genotypes and sequences are enriching our understanding of the past 50,000 years of human history and providing insights into earlier periods largely inaccessible to mitochondrial DNA and Y-chromosomal studies.

    "To see a world in a grain of sand ..." William Blake, Auguries of Innocence.

    Funded by: Wellcome Trust

    Genome biology 2011;12;11;234

  • Human genome diversity: frequently asked questions.

    Barbujani G and Colonna V

    Department of Biology and Evolution, University of Ferrara, 44121 Ferrara, Italy. g.barbujani@unife.it

    Despite our relatively large population size, humans are genetically less variable than other primates. Many allele frequencies and statistical descriptors of genome diversity form broad gradients, tracing the main expansion from Africa, local migrations, and sometimes adaptation. However, this continuous variation is discordant across loci, and principally seems to reflect different blends of common and often cosmopolitan alleles rather than the presence of distinct gene pools in different regions of the world. The elusive structure of human populations could lead to spurious associations if the effects of shared ancestry are not properly dealt with; indeed, this is among the causes (although not the only one) of the difficulties encountered in discovering the loci responsible for quantitative traits and complex diseases. However, the rapidly growing body of data on our genomic diversity has already cast new light on human population history and is now revealing intricate biological relationships among individuals and populations of our species.

    Trends in genetics : TIG 2010;26;7;285-95

  • Long-range comparison between genes and languages based on syntactic distances.

    Colonna V, Boattini A, Guardiano C, Dall'ara I, Pettener D, Longobardi G and Barbujani G

    Dipartimento di Biologia ed Evoluzione, Università di Ferrara, Ferrara, Italy.

    Objective: To propose a new approach for comparing genetic and linguistic diversity in populations belonging to distantly related groups.

    Background: Comparisons of linguistic and genetic differences have proved powerful tools to reconstruct human demographic history. Current models assume on both sides that similarities reflect either descent from common ancestry or the balance between isolation and contact. Most linguistic phylogenies are ultimately based on lexical evidence (roughly, words and morphemes with their sounds and meanings). However, measures of lexical divergence are reliable only for closely related languages, thus large-scale comparisons of genetic and linguistic diversity have appeared problematic so far.

    Methods: Syntax (abstract rules to combine words into sentences) appears more measurable, universally comparable, and stable than the lexicon, and hence certain syntactic similarities might reflect deeper linguistic relationships, such as those between distant language families. In this study, we for the first time compared genetic data to a matrix of syntactic differences among selected populations of three continents.

    Results: Comparing two databases of microsatellite (Short Tandem Repeat) markers and Single Nucleotide Polymorphisms (SNPs), with a linguistic matrix based on the values of 62 grammatical parameters, we show that there is indeed a correlation of syntactic and genetic distances. We also identified a few outliers and suggest a possible interpretation of the overall pattern.

    Conclusions: These results strongly support the possibility of better investigating population history by combining genetic data with linguistic information of a new type, provided by a theoretically more sophisticated method to assess the relationships between distantly related languages and language families.

    Human heredity 2010;70;4;245-54

  • Comparing population structure as inferred from genealogical versus genetic information.

    Colonna V, Nutile T, Ferrucci RR, Fardella G, Aversano M, Barbujani G and Ciullo M

    Dipartimento di Biologia ed Evoluzione, Università di Ferrara, Ferrara, Italy.

    Algorithms for inferring population structure from genetic data (ie, population assignment methods) have shown to effectively recognize genetic clusters in human populations. However, their performance in identifying groups of genealogically related individuals, especially in scanty-differentiated populations, has not been tested empirically thus far. For this study, we had access to both genealogical and genetic data from two closely related, isolated villages in southern Italy. We found that nearly all living individuals were included in a single pedigree, with multiple inbreeding loops. Despite F(st) between villages being a low 0.008, genetic clustering analysis identified two clusters roughly corresponding to the two villages. Average kinship between individuals (estimated from genealogies) increased at increasing values of group membership (estimated from the genetic data), showing that the observed genetic clusters represent individuals who are more closely related to each other than to random members of the population. Further, average kinship within clusters and F(st) between clusters increases with increasingly stringent membership threshold requirements. We conclude that a limited number of genetic markers is sufficient to detect structuring, and that the results of genetic analyses faithfully mirror the structuring inferred from detailed analyses of population genealogies, even when F(st) values are low, as in the case of the two villages. We then estimate the impact of observed levels of population structure on association studies using simulated data.

    European journal of human genetics : EJHG 2009;17;12;1635-41

  • Identification and replication of a novel obesity locus on chromosome 1q24 in isolated populations of Cilento.

    Ciullo M, Nutile T, Dalmasso C, Sorice R, Bellenguez C, Colonna V, Persico MG and Bourgain C

    Institute of Genetics and Biophysics A. Buzzati-Traverso, CNR, Via Pietro Castellino, 111, 80131 Naples, Italy. ciullo@igb.cnr.it

    Objective: Obesity is a complex trait with a variety of genetic susceptibility variants. Several loci linked to obesity and/or obesity-related traits have been identified, and relatively few regions have been replicated. Studying isolated populations can be a useful approach to identify rare variants that will not be detected with whole-genome association studies in large populations.

    Research design and methods: Random individuals were sampled from Campora, an isolated village of the Cilento area in South Italy, phenotyped for BMI, and genotyped using a dense microsatellite marker map. An efficient pedigree-breaking strategy was applied to perform genome-wide linkage analyses of both BMI and obesity. Significance was assessed with ad hoc simulations for the two traits and with an original local false discovery rate approach to quantitative trait linkage analysis for BMI. A genealogy-corrected association test was performed for a single nucleotide polymorphism located in one of the linkage regions. A replication study was conducted in the neighboring village of Gioi.

    Results: A new locus on chr1q24 significantly linked to BMI was identified in Campora. Linkage at the same locus is suggested with obesity. Three additional loci linked to BMI were also detected, including the locus including the INSIG2 gene region. No evidence of association between the rs7566605 variant and BMI or obesity was found. In Gioi, the linkage on chr1q24 was replicated with both BMI and obesity.

    Conclusions: Overall, our results confirm that successful linkage studies can be accomplished in these populations both to replicate known linkages and to identify novel quantitative trait linkages.

    Diabetes 2008;57;3;783-90

  • Campora: a young genetic isolate in South Italy.

    Colonna V, Nutile T, Astore M, Guardiola O, Antoniol G, Ciullo M and Persico MG

    Institute of Genetics and Biophysics A. Buzzati-Traverso, CNR Naples, Naples, Italy. colonna@igb.cnr.it

    Genetic isolates have been successfully used in the study of complex traits, mainly because due to their features, they allow a reduction in the complexity of the genetic models underlying the trait. The aim of the present study is to describe the population of Campora, a village in the South of Italy, highlighting its properties of a genetic isolate. Both historical evidence and multi-locus genetic data (genomic and mitochondrial DNA polymorphisms) have been taken into account in the analyses. The extension of linkage disequilibrium (LD) regions has been evaluated on autosomes and on a region of the X chromosome. We defined a study sample population on the basis of the genealogy and exogamy data. We found in this population a few different mitochondrial and Y chromosome haplotypes and we ascertained that, similarly to other isolated populations, in Campora LD extends over wider region compared to large and genetically heterogeneous populations. These findings indicate a conspicuous genetic homogeneity in the genome. Finally, we found evidence for a recent population bottleneck that we propose to interpret as a demographic crisis determined by the plague of the 17th century. Overall our findings demonstrate that Campora displays the genetic characteristics of a young isolate.

    Human heredity 2007;64;2;123-35

  • New susceptibility locus for hypertension on chromosome 8q by efficient pedigree-breaking in an Italian isolate.

    Ciullo M, Bellenguez C, Colonna V, Nutile T, Calabria A, Pacente R, Iovino G, Trimarco B, Bourgain C and Persico MG

    Institute of Genetics and Biophysics, A. Buzzati-Traverso, CNR Naples, Italy. ciullo@igb.cnr.it

    Essential hypertension (EH) affects a large proportion of the adult population in Western countries and is a major risk factor for cardiovascular diseases. EH is a multifactorial disease with a complex genetic component. To tackle the complexity of this genetic component, we have initiated a study of Campora, an isolated village in South Italy. A random sample of 389 adults was genotyped for a very dense microsatellite genome scan and phenotyped for EH. Of this sample, 173 affected individuals were all related through a 2,180-member pedigree and could be integrated within a linkage analysis. The complexity of the pedigree prevented its direct use for a non-parametric linkage (NPL) analysis. Therefore, the method proposed by Falchi et al. [2004, Am. J. Hum. Genet., 75, 1015-1031] was used for automatic pedigree-breaking. We identified a new locus for EH on chromosome 8q22-23 and detected linkage with two known loci for EH: 1q42-43 and 4p16. Simulations showed that the linkage with 8q22-23 is highly genome-wide significant, even when accounting for the breaking of the pedigree. An extension to qualitative traits of another pedigree-breaking approach [Pankratz et al., 2001, Genet. Epidemiol., 21 (Suppl. 1), S258-S263] also detected a significant linkage on 8q22-23 using a remarkably different set of sub-pedigrees and helped to refine the location of the linkage signal. This work both identifies a new locus strongly linked to hypertension and shows that the power of linkage analysis can be improved by the appropriate use of efficient pedigree-breaking strategies.

    Human molecular genetics 2006;15;10;1735-43

Jose Espinosa

- Visiting Undergraduate Student

I completed my undergraduate degree in genomic sciences at the National Autonomous University of Mexico (UNAM).

My first research experience was at the UNAM Biotechnology Institute, where I analysed microRNA expression levels in bean using NGS RNA libraries. I joined the Sanger Institute first in 2010 and again in early 2012, conducting a project on structural variation in Y chromosomes from different populations. During this period I also collaborated with the National Institute of Genomic Medicine in Mexico, analysing copy number variation in glioblastoma samples.

Research

A deep understanding of the genotype-phenotype relationship is one of the major scientific challenges of the years to come. If we ever comprehend this complex interplay in fine detail, the medical and biological implications will be huge, and we may also be able to tell, at the same level of detail, what in our genome has contributed to making us human. Structural variation and its study undoubtedly play an important role in that.

I'm studying structural variation in Y chromosomes from the 1000 Genomes Project.

Min Hu

- PhD student

Before I came to the UK in 2008, I obtained my undergraduate degree in life sciences at Peking University in China.

Research

My research focuses on finding regions of the human genome that have been positively selected during modern human evolution. I am applying statistical approaches to sequencing data from multiple populations, aiming to understand: 1) what types of selective sweeps we can detect using current models and statistical tests; 2) which genes and other functional elements in the human genome have been favored by positive natural selection since modern humans emerged about 200,000 years ago.

References

  • Exploration of signals of positive selection derived from genotype-based human genome scans using re-sequencing data.

    Hu M, Ayub Q, Guerra-Assunção JA, Long Q, Ning Z, Huang N, Romero IG, Mamanova L, Akan P, Liu X, Coffey AJ, Turner DJ, Swerdlow H, Burton J, Quail MA, Conrad DF, Enright AJ, Tyler-Smith C and Xue Y

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1SA, UK.

    We have investigated whether regions of the genome showing signs of positive selection in scans based on haplotype structure also show evidence of positive selection when sequence-based tests are applied, whether the target of selection can be localized more precisely, and whether such extra evidence can lead to increased biological insights. We used two tools: simulations under neutrality or selection, and experimental investigation of two regions identified by the HapMap2 project as putatively selected in human populations. Simulations suggested that neutral and selected regions should be readily distinguished and that it should be possible to localize the selected variant to within 40 kb at least half of the time. Re-sequencing of two ~300 kb regions (chr4:158Mb and chr10:22Mb) lacking known targets of selection in HapMap CHB individuals provided strong evidence for positive selection within each and suggested the micro-RNA gene hsa-miR-548c as the best candidate target in one region, and changes in regulation of the sperm protein gene SPAG6 in the other.

    Funded by: Wellcome Trust: 077009

    Human genetics 2012;131;5;665-74

  • A systematic survey of loss-of-function variants in human protein-coding genes.

    MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB, Albers CA, Zhang ZD, Conrad DF, Lunter G, Zheng H, Ayub Q, DePristo MA, Banks E, Hu M, Handsaker RE, Rosenfeld JA, Fromer M, Jin M, Mu XJ, Khurana E, Ye K, Kay M, Saunders GI, Suner MM, Hunt T, Barnes IH, Amid C, Carvalho-Silva DR, Bignell AH, Snow C, Yngvadottir B, Bumpstead S, Cooper DN, Xue Y, Romero IG, 1000 Genomes Project Consortium, Wang J, Li Y, Gibbs RA, McCarroll SA, Dermitzakis ET, Pritchard JK, Barrett JC, Harrow J, Hurles ME, Gerstein MB and Tyler-Smith C

    Wellcome Trust Sanger Institute, Hinxton, UK. macarthur@atgu.mgh.harvard.edu

    Genome-sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in nonessential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.

    Funded by: British Heart Foundation: RG/09/012/28096; NHGRI NIH HHS: U54 HG003273; Wellcome Trust: 085532, 090532, 090532/Z/09/Z, 098051

    Science (New York, N.Y.) 2012;335;6070;823-8

  • A map of human genome variation from population-scale sequencing.

    1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME and McVean GA

    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

    Funded by: British Heart Foundation: RG/09/012/28096; Howard Hughes Medical Institute; Medical Research Council: G0801823, G0801823(89305); NCRR NIH HHS: S10RR025056; NHGRI NIH HHS: 01HG3229, N01HG62088, P01HG4120, P41HG2371, P41HG4221, P41HG4222, P50HG2357, R01 HG003229, R01 HG003229-05, R01 HG004719-01, R01 HG004719-02, R01 HG004719-02S1, R01 HG004719-03, R01 HG004719-04, R01HG2651, R01HG3698, R01HG4333, R01HG4719, R01HG4960, RC2 HG005552-01, RC2 HG005552-02, RC2HG5552, U01HG5208, U01HG5209, U01HG5210, U01HG5211, U01HG5214, U41HG4568, U54 HG003273, U54HG2750, U54HG2757, U54HG3067, U54HG3079, U54HG3273; NIGMS NIH HHS: R01GM59290, R01GM72861, T32 GM007753; NIMH NIH HHS: 01MH84698; Wellcome Trust: 075491, 077009, 077014, 077192, 081407, 085532, 086084, 089061, 089062, 089088, WT075491/Z/04, WT077009, WT081407/Z/06/Z, WT085532AIA, WT086084/Z/08/Z, WT089088/Z/09/Z

    Nature 2010;467;7319;1061-73

  • Origins and functional impact of copy number variation in the human genome.

    Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Wellcome Trust Case Control Consortium, Tyler-Smith C, Carter NP, Lee C, Scherer SW and Hurles ME

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA UK.

    Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by genome-wide association studies will not be accounted for by common CNVs.

    Funded by: Canadian Institutes of Health Research; NHGRI NIH HHS: HG004221; NIGMS NIH HHS: GM081533; Wellcome Trust: 077006/Z/05/Z, 077008, 077009, 077014

    Nature 2010;464;7289;704-12

Daniel MacArthur

- Visiting Scientist

I completed my PhD at the Institute for Neuromuscular Research in Sydney, Australia. My PhD focused on the genetics of human athletic performance, and specifically on the effect of variation in the ACTN3 gene on muscle function. During my PhD I generated and analysed a knockout mouse model of ACTN3 and analysed its recent evolutionary history in humans.

I moved to the Sanger Institute in September 2008. For the last two years I have been funded by an Australian National Health and Medical Research Council Overseas Biomedical Fellowship.

Research

My current research is focused on predicting the functional effects of genetic variants. I coordinated the functional annotation of genetic variants in the 1000 Genomes pilot projects, and have also led an international collaboration investigating the impact of "loss-of-function" variants - sequence changes that are predicted to severely damage the function of protein-coding genes. We have identified over 800 completely novel variants of this kind as part of the 1000 Genomes Project, and explored their effects on gene expression, complex disease risk, and recent human evolution.

References

  • α-Actinin-3 deficiency is associated with reduced bone mass in human and mouse.

    Yang N, Schindeler A, McDonald MM, Seto JT, Houweling PJ, Lek M, Hogarth M, Morse AR, Raftery JM, Balasuriya D, MacArthur DG, Berman Y, Quinlan KG, Eisman JA, Nguyen TV, Center JR, Prince RL, Wilson SG, Zhu K, Little DG and North KN

    Institute for Neuroscience and Muscle Research, The Children's Hospital at Westmead, Sydney 2145, NSW, Australia. nan.yang@persongen.com

    Bone mineral density (BMD) is a complex trait that is the single best predictor of the risk of osteoporotic fractures. Candidate gene and genome-wide association studies have identified genetic variations in approximately 30 genetic loci associated with BMD variation in humans. α-Actinin-3 (ACTN3) is highly expressed in fast skeletal muscle fibres. There is a common null-polymorphism R577X in human ACTN3 that results in complete deficiency of the α-actinin-3 protein in approximately 20% of Eurasians. Absence of α-actinin-3 does not cause any disease phenotypes in muscle because of compensation by α-actinin-2. However, α-actinin-3 deficiency has been shown to be detrimental to athletic sprint/power performance. In this report we reveal additional functions for α-actinin-3 in bone. α-Actinin-3 but not α-actinin-2 is expressed in osteoblasts. The Actn3(-/-) mouse displays significantly reduced bone mass, with reduced cortical bone volume (-14%) and trabecular number (-61%) seen by microCT. Dynamic histomorphometry indicated this was due to a reduction in bone formation. In a cohort of postmenopausal Australian women, ACTN3 577XX genotype was associated with lower BMD in an additive genetic model, with the R577X genotype contributing 1.1% of the variance in BMD. Microarray analysis of cultured osteoprogenitors from Actn3(-/-) mice showed alterations in expression of several genes regulating bone mass and osteoblast/osteoclast activity, including Enpp1, Opg and Wnt7b. Our studies suggest that ACTN3 likely contributes to the regulation of bone mass through alterations in bone turnover. Given the high frequency of R577X in the general population, the potential role of ACTN3 R577X as a factor influencing variations in BMD in elderly humans warrants further study.

    Bone 2011;49;4;790-8

  • Deficiency of α-actinin-3 is associated with increased susceptibility to contraction-induced damage and skeletal muscle remodeling.

    Seto JT, Lek M, Quinlan KG, Houweling PJ, Zheng XF, Garton F, MacArthur DG, Raftery JM, Garvey SM, Hauser MA, Yang N, Head SI and North KN

    Institute for Neuroscience and Muscle Research, The Children's Hospital at Westmead, Locked Bag 4001, Sydney, NSW 2145, Australia.

    Sarcomeric α-actinins (α-actinin-2 and -3) are a major component of the Z-disk in skeletal muscle, where they crosslink actin and other structural proteins to maintain an ordered myofibrillar array. Homozygosity for the common null polymorphism (R577X) in ACTN3 results in the absence of fast fiber-specific α-actinin-3 in ∼20% of the general population. α-Actinin-3 deficiency is associated with decreased force generation and is detrimental to sprint and power performance in elite athletes, suggesting that α-actinin-3 is necessary for optimal forceful repetitive muscle contractions. Since Z-disks are the structures most vulnerable to eccentric damage, we sought to examine the effects of α-actinin-3 deficiency on sarcomeric integrity. Actn3 knockout mouse muscle showed significantly increased force deficits following eccentric contraction at 30% stretch, suggesting that α-actinin-3 deficiency results in an increased susceptibility to muscle damage at the extremes of muscle performance. Microarray analyses demonstrated an increase in muscle remodeling genes, which we confirmed at the protein level. The loss of α-actinin-3 and up-regulation of α-actinin-2 resulted in no significant changes to the total pool of sarcomeric α-actinins, suggesting that alterations in fast fiber Z-disk properties may be related to differences in functional protein interactions between α-actinin-2 and α-actinin-3. In support of this, we demonstrated that the Z-disk proteins, ZASP, titin and vinculin preferentially bind to α-actinin-2. Thus, the loss of α-actinin-3 changes the overall protein composition of fast fiber Z-disks and alters their elastic properties, providing a mechanistic explanation for the loss of force generation and increased susceptibility to eccentric damage in α-actinin-3-deficient individuals.

    Human molecular genetics 2011;20;15;2914-27

  • Dindel: accurate indel calls from short-read data.

    Albers CA, Lunter G, MacArthur DG, McVean G, Ouwehand WH and Durbin R

    Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, United Kingdom. caa@sanger.ac.uk

    Small insertions and deletions (indels) are a common and functionally important type of sequence polymorphism. Most of the focus of studies of sequence variation is on single nucleotide variants (SNVs) and large structural variants. In principle, high-throughput sequencing studies should allow identification of indels just as SNVs. However, inference of indels from next-generation sequence data is challenging, and so far methods for identifying indels lag behind methods for calling SNVs in terms of sensitivity and specificity. We propose a Bayesian method to call indels from short-read sequence data in individuals and populations by realigning reads to candidate haplotypes that represent alternative sequence to the reference. The candidate haplotypes are formed by combining candidate indels and SNVs identified by the read mapper, while allowing for known sequence variants or candidates from other methods to be included. In our probabilistic realignment model we account for base-calling errors, mapping errors, and also, importantly, for increased sequencing error indel rates in long homopolymer runs. We show that our method is sensitive and achieves low false discovery rates on simulated and real data sets, although challenges remain. The algorithm is implemented in the program Dindel, which has been used in the 1000 Genomes Project call sets.

    Funded by: British Heart Foundation: RG/09/012/28096; Wellcome Trust: 086084, 090532, WT089088/Z/09/Z

    Genome research 2011;21;6;961-73

  • Gene inactivation and its implications for annotation in the era of personal genomics.

    Balasubramanian S, Habegger L, Frankish A, MacArthur DG, Harte R, Tyler-Smith C, Harrow J and Gerstein M

    Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA.

    The first wave of personal genomes documents how no single individual genome contains the full complement of functional genes. Here, we describe the extent of variation in gene and pseudogene numbers between individuals arising from inactivation events such as premature termination or aberrant splicing due to single-nucleotide polymorphisms. This highlights the inadequacy of the current reference sequence and gene set. We present a proposal to define a reference gene set that will remain stable as more individuals are sequenced. In particular, we recommend that the ancestral allele be used to define the reference sequence from which a core human reference gene annotation set can be derived. In addition, we call for the development of an expanded gene set to include human-specific genes that have arisen recently and are absent from the ancestral set.

    Funded by: Wellcome Trust

    Genes & development 2011;25;1;1-10

  • A map of human genome variation from population-scale sequencing.

    1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME and McVean GA

    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

    Funded by: British Heart Foundation: RG/09/012/28096; Howard Hughes Medical Institute; Medical Research Council: G0801823, G0801823(89305); NCRR NIH HHS: S10RR025056; NHGRI NIH HHS: 01HG3229, N01HG62088, P01HG4120, P41HG2371, P41HG4221, P41HG4222, P50HG2357, R01 HG003229, R01 HG003229-05, R01 HG004719-01, R01 HG004719-02, R01 HG004719-02S1, R01 HG004719-03, R01 HG004719-04, R01HG2651, R01HG3698, R01HG4333, R01HG4719, R01HG4960, RC2 HG005552-01, RC2 HG005552-02, RC2HG5552, U01HG5208, U01HG5209, U01HG5210, U01HG5211, U01HG5214, U41HG4568, U54 HG003273, U54HG2750, U54HG2757, U54HG3067, U54HG3079, U54HG3273; NIGMS NIH HHS: R01GM59290, R01GM72861, T32 GM007753; NIMH NIH HHS: 01MH84698; Wellcome Trust: 075491, 077009, 077014, 077192, 081407, 085532, 086084, 089061, 089062, 089088, WT075491/Z/04, WT077009, WT081407/Z/06/Z, WT085532AIA, WT086084/Z/08/Z, WT089088/Z/09/Z

    Nature 2010;467;7319;1061-73

  • Loss-of-function variants in the genomes of healthy humans.

    MacArthur DG and Tyler-Smith C

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, UK. dm8@sanger.ac.uk

    Genetic variants predicted to seriously disrupt the function of human protein-coding genes, so-called loss-of-function (LOF) variants, have traditionally been viewed in the context of severe Mendelian disease. However, recent large-scale sequencing and genotyping projects have revealed a surprisingly large number of these variants in the genomes of apparently healthy individuals (at least 100 per genome, including more than 30 in a homozygous state), suggesting a previously unappreciated level of variation in functional gene content between humans. These variants are mostly found at low frequency, suggesting that they are enriched for mildly deleterious polymorphisms suppressed by negative natural selection, and thus represent an attractive set of candidate variants for complex disease susceptibility. However, they are also enriched for sequencing and annotation artefacts, so overall present serious challenges for clinical sequencing projects seeking to identify severe disease genes amidst the 'noise' of technical error and benign genetic polymorphism. Systematic, high-quality catalogues of LOF variants present in the genomes of healthy individuals, built from the output of large-scale sequencing studies such as the 1000 Genomes Project, will help to distinguish between benign and disease-causing LOF variants, and will provide valuable resources for clinical genomics.

    Funded by: Wellcome Trust

    Human molecular genetics 2010;19;R2;R125-30

  • Origins and functional impact of copy number variation in the human genome.

    Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, MacArthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Wellcome Trust Case Control Consortium, Tyler-Smith C, Carter NP, Lee C, Scherer SW and Hurles ME

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA UK.

    Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by genome-wide association studies will not be accounted for by common CNVs.

    Funded by: Canadian Institutes of Health Research; NHGRI NIH HHS: HG004221; NIGMS NIH HHS: GM081533; Wellcome Trust: 077006/Z/05/Z, 077008, 077009, 077014

    Nature 2010;464;7289;704-12

  • An Actn3 knockout mouse provides mechanistic insights into the association between alpha-actinin-3 deficiency and human athletic performance.

    MacArthur DG, Seto JT, Chan S, Quinlan KG, Raftery JM, Turner N, Nicholson MD, Kee AJ, Hardeman EC, Gunning PW, Cooney GJ, Head SI, Yang N and North KN

    Institute for Neuromuscular Research, The Children's Hospital at Westmead, Sydney 2145, NSW, Australia.

    A common nonsense polymorphism (R577X) in the ACTN3 gene results in complete deficiency of the fast skeletal muscle fiber protein alpha-actinin-3 in an estimated one billion humans worldwide. The XX null genotype is under-represented in elite sprint athletes, associated with reduced muscle strength and sprint performance in non-athletes, and is over-represented in endurance athletes, suggesting that alpha-actinin-3 deficiency increases muscle endurance at the cost of power generation. Here we report that muscle from Actn3 knockout mice displays reduced force generation, consistent with results from human association studies. Detailed analysis of knockout mouse muscle reveals reduced fast fiber diameter, increased activity of multiple enzymes in the aerobic metabolic pathway, altered contractile properties, and enhanced recovery from fatigue, suggesting a shift in the properties of fast fibers towards those characteristic of slow fibers. These findings provide the first mechanistic explanation for the reported associations between R577X and human athletic performance and muscle function.

    Human molecular genetics 2008;17;8;1076-86

  • Loss of ACTN3 gene function alters mouse muscle metabolism and shows evidence of positive selection in humans.

    MacArthur DG, Seto JT, Raftery JM, Quinlan KG, Huttley GA, Hook JW, Lemckert FA, Kee AJ, Edwards MR, Berman Y, Hardeman EC, Gunning PW, Easteal S, Yang N and North KN

    Institute for Neuromuscular Research, Children's Hospital at Westmead, Sydney, New South Wales 2145, Australia.

    More than a billion humans worldwide are predicted to be completely deficient in the fast skeletal muscle fiber protein alpha-actinin-3 owing to homozygosity for a premature stop codon polymorphism, R577X, in the ACTN3 gene. The R577X polymorphism is associated with elite athlete status and human muscle performance, suggesting that alpha-actinin-3 deficiency influences the function of fast muscle fibers. Here we show that loss of alpha-actinin-3 expression in a knockout mouse model results in a shift in muscle metabolism toward the more efficient aerobic pathway and an increase in intrinsic endurance performance. In addition, we demonstrate that the genomic region surrounding the 577X null allele shows low levels of genetic variation and recombination in individuals of European and East Asian descent, consistent with strong, recent positive selection. We propose that the 577X allele has been positively selected in some human populations owing to its effect on skeletal muscle metabolism.

    Nature genetics 2007;39;10;1261-5

  • ACTN3 genotype is associated with human elite athletic performance.

    Yang N, MacArthur DG, Gulbin JP, Hahn AG, Beggs AH, Easteal S and North K

    Institute for Neuromuscular Research, Children's Hospital at Westmead, Sydney, Australia.

    There is increasing evidence for strong genetic influences on athletic performance and for an evolutionary "trade-off" between performance traits for speed and endurance activities. We have recently demonstrated that the skeletal-muscle actin-binding protein alpha-actinin-3 is absent in 18% of healthy white individuals because of homozygosity for a common stop-codon polymorphism in the ACTN3 gene, R577X. alpha-Actinin-3 is specifically expressed in fast-twitch myofibers responsible for generating force at high velocity. The absence of a disease phenotype secondary to alpha-actinin-3 deficiency is likely due to compensation by the homologous protein, alpha-actinin-2. However, the high degree of evolutionary conservation of ACTN3 suggests function(s) independent of ACTN2. Here, we demonstrate highly significant associations between ACTN3 genotype and athletic performance. Both male and female elite sprint athletes have significantly higher frequencies of the 577R allele than do controls. This suggests that the presence of alpha-actinin-3 has a beneficial effect on the function of skeletal muscle in generating forceful contractions at high velocity, and provides an evolutionary advantage because of increased sprint performance. There is also a genotype effect in female sprint and endurance athletes, with higher than expected numbers of 577RX heterozygotes among sprint athletes and lower than expected numbers among endurance athletes. The lack of a similar effect in males suggests that the ACTN3 genotype affects athletic performance differently in males and females. The differential effects in sprint and endurance athletes suggests that the R577X polymorphism may have been maintained in the human population by balancing natural selection.

    American journal of human genetics 2003;73;3;627-31

Luca Pagani

- Visiting Scientist

I received my B.A. and MSci in Molecular Biology from the Scuola Normale Superiore of Pisa, Italy, in 2007 and 2009 respectively. I first came to Sanger in 2009 through an international exchange program (Erasmus), and my involvement with the institute has continued after a PhD project in the Biological Anthropology department of the University of Cambridge.

Research

I have always been fascinated by the migration events that led a single African species to colonize the whole planet. It took me some time to understand that biology could help retrace the migration routes our ancestors followed on their way out of Africa. The PhD project I am currently involved in concerns the human populations now inhabiting Eastern Africa. Its aim is to better understand the demographic dynamics that occurred in the area during the last 200,000 years, in order to clarify the processes that led to our expansion out of Africa.

References

  • Revisiting the thrifty gene hypothesis via 65 loci associated with susceptibility to type 2 diabetes.

    Ayub Q, Moutsianas L, Chen Y, Panoutsopoulou K, Colonna V, Pagani L, Prokopenko I, Ritchie GR, Tyler-Smith C, McCarthy MI, Zeggini E and Xue Y

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1HH, UK.

    We have investigated the evidence for positive selection in samples of African, European, and East Asian ancestry at 65 loci associated with susceptibility to type 2 diabetes (T2D) previously identified through genome-wide association studies. Selection early in human evolutionary history is predicted to lead to ancestral risk alleles shared between populations, whereas late selection would result in population-specific signals at derived risk alleles. By using a wide variety of tests based on the site frequency spectrum, haplotype structure, and population differentiation, we found no global signal of enrichment for positive selection when we considered all T2D risk loci collectively. However, in a locus-by-locus analysis, we found nominal evidence for positive selection at 14 of the loci. Selection favored the protective and risk alleles in similar proportions, rather than the risk alleles specifically as predicted by the thrifty gene hypothesis, and may not be related to influence on diabetes. Overall, we conclude that past positive selection has not been a powerful influence driving the prevalence of T2D risk alleles.

    Funded by: Wellcome Trust: 098051, 098381, WT090367MA

    American journal of human genetics 2014;94;2;176-85

  • Genome-wide evidence of Austronesian-Bantu admixture and cultural reversion in a hunter-gatherer group of Madagascar.

    Pierron D, Razafindrazaka H, Pagani L, Ricaut FX, Antao T, Capredon M, Sambo C, Radimilahy C, Rakotoarisoa JA, Blench RM, Letellier T and Kivisild T

    Laboratoire d'Anthropologie Moléculaire et Imagerie de Synthèse, Unité Mixte de Recherche 5288, Centre National de la Recherche Scientifique, Université de Toulouse, 31073 Toulouse, France.

    Linguistic and cultural evidence suggest that Madagascar was the final point of two major dispersals of Austronesian- and Bantu-speaking populations. Today, the Mikea are described as the last-known Malagasy population reported to be still practicing a hunter-gatherer lifestyle. It is unclear, however, whether the Mikea descend from a remnant population that existed before the arrival of Austronesian and Bantu agriculturalists or whether it is only their lifestyle that separates them from the other contemporary populations of South Madagascar. To address these questions we have performed a genome-wide analysis of >700,000 SNP markers on 21 Mikea, 24 Vezo, and 24 Temoro individuals, together with 50 individuals from Bajo and Lebbo populations from Indonesia. Our analyses of these data in the context of data available from other Southeast Asian and African populations reveal that all three Malagasy populations are derived from the same admixture event involving Austronesian and Bantu sources. In contrast to the fact that most of the vocabulary of the Malagasy speakers is derived from the Barito group of the Austronesian language family, we observe that only one-third of their genetic ancestry is related to the populations of the Java-Kalimantan-Sulawesi area. Because no additional ancestry components distinctive for the Mikea were found, it is likely that they have adopted their hunter-gatherer way of life through cultural reversion, and selection signals suggest a genetic adaptation to their new lifestyle.

    Funded by: European Research Council: 261213

    Proceedings of the National Academy of Sciences of the United States of America 2014;111;3;936-41

  • Genetic signatures reveal high-altitude adaptation in a set of Ethiopian populations.

    Huerta-Sánchez E, Degiorgio M, Pagani L, Tarekegn A, Ekong R, Antao T, Cardona A, Montgomery HE, Cavalleri GL, Robbins PA, Weale ME, Bradman N, Bekele E, Kivisild T, Tyler-Smith C and Nielsen R

    Department of Integrative Biology, University of California, Berkeley, CA, USA. emiliahsc@berkeley.edu

    The Tibetan and Andean Plateaus and Ethiopian highlands are the largest regions to have long-term high-altitude residents. Such populations are exposed to lower barometric pressures and hence atmospheric partial pressures of oxygen. Such "hypobaric hypoxia" may limit physical functional capacity, reproductive health, and even survival. As such, selection of genetic variants advantageous to hypoxic adaptation is likely to have occurred. Identifying signatures of such selection is likely to help understanding of hypoxic adaptive processes. Here, we seek evidence of such positive selection using five Ethiopian populations, three of which are from high-altitude areas in Ethiopia. As these populations may have been recipients of Eurasian gene flow, we correct for this admixture. Using single-nucleotide polymorphism genotype data from multiple populations, we find the strongest signal of selection in BHLHE41 (also known as DEC2 or SHARP1). Remarkably, a major role of this gene is regulation of the same hypoxia response pathway on which selection has most strikingly been observed in both Tibetan and Andean populations. Because it is also an important player in the circadian rhythm pathway, BHLHE41 might also provide insights into the mechanisms underlying the recognized impacts of hypoxia on the circadian clock. These results support the view that Ethiopian, Andean, and Tibetan populations living at high altitude have adapted to hypoxia differently, with convergent evolution affecting different genes from the same pathway.

    Funded by: NHGRI NIH HHS: R01HG003229, R01HG003229-08S2

    Molecular biology and evolution 2013;30;8;1877-88

  • Evolution of the pygmy phenotype: evidence of positive selection from genome-wide scans in African, Asian, and Melanesian pygmies.

    Migliano AB, Romero IG, Metspalu M, Leavesley M, Pagani L, Antao T, Huang DW, Sherman BT, Siddle K, Scholes C, Hudjashov G, Kaitokai E, Babalu A, Belatti M, Cagan A, Hopkinshaw B, Shaw C, Nelis M, Metspalu E, Mägi R, Lempicki RA, Villems R, Lahr MM and Kivisild T

    Department of Anthropology, University College London, London, UK.

    Human pygmy populations inhabit different regions of the world, from Africa to Melanesia. In Asia, short-statured populations are often referred to as "negritos." Their short stature has been interpreted as a consequence of thermoregulatory, nutritional, and/or locomotory adaptations to life in tropical forests. A more recent hypothesis proposes that their stature is the outcome of a life history trade-off in high-mortality environments, where early reproduction is favored and, consequently, early sexual maturation and early growth cessation have coevolved. Some serological evidence of deficiencies in the growth hormone/insulin-like growth factor axis have been previously associated with pygmies' short stature. Using genome-wide single-nucleotide polymorphism genotype data, we first tested whether different negrito groups living in the Philippines and Papua New Guinea are closely related and then investigated genomic signals of recent positive selection in African, Asian, and Papuan pygmy populations. We found that negritos in the Philippines and Papua New Guinea are genetically more similar to their nonpygmy neighbors than to one another and have experienced positive selection at different genes. These results indicate that geographically distant pygmy groups are likely to have evolved their short stature independently. We also found that selection on common height variants is unlikely to explain their short stature and that different genes associated with growth, thyroid function, and sexual development are under selection in different pygmy groups.

    Human biology 2013;85;1-3;251-84

  • The GenoChip: a new tool for genetic anthropology.

    Elhaik E, Greenspan E, Staats S, Krahn T, Tyler-Smith C, Xue Y, Tofanelli S, Francalacci P, Cucca F, Pagani L, Jin L, Li H, Schurr TG, Greenspan B, Spencer Wells R and Genographic Consortium

    Department of Mental Health, Johns Hopkins University Bloomberg School of Public Health, USA.

    The Genographic Project is an international effort aimed at charting human migratory history. The project is nonprofit and nonmedical, and, through its Legacy Fund, supports locally led efforts to preserve indigenous and traditional cultures. Although the first phase of the project was focused on uniparentally inherited markers on the Y-chromosome and mitochondrial DNA (mtDNA), the current phase focuses on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide single-nucleotide polymorphism (SNP) genotyping, they were designed for medical genetic studies and contain medically related markers that are inappropriate for global population genetic studies. GenoChip, the Genographic Project's new genotyping array, was designed to resolve these issues and enable higher resolution research into outstanding questions in genetic anthropology. The GenoChip includes ancestry informative markers obtained for over 450 human populations, an ancient human (Saqqaq), and two archaic hominins (Neanderthal and Denisovan) and was designed to identify all known Y-chromosome and mtDNA haplogroups. The chip was carefully vetted to avoid inclusion of medically relevant markers. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays. Although all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The chip performances are illustrated in a principal component analysis for 14 worldwide populations. In summary, the GenoChip is a dedicated genotyping platform for genetic anthropology. 
    With an unprecedented number of approximately 12,000 Y-chromosomal and approximately 3,300 mtDNA SNPs and over 130,000 autosomal and X-chromosomal SNPs without any known health, medical, or phenotypic relevance, the GenoChip is a useful tool for genetic anthropology and population genetics.

    Funded by: NIMH NIH HHS: T32 MH014592; Wellcome Trust: 098051

    Genome biology and evolution 2013;5;5;1021-31

  • Ethiopian genetic diversity reveals linguistic stratification and complex influences on the Ethiopian gene pool.

    Pagani L, Kivisild T, Tarekegn A, Ekong R, Plaster C, Gallego Romero I, Ayub Q, Mehdi SQ, Thomas MG, Luiselli D, Bekele E, Bradman N, Balding DJ and Tyler-Smith C

    Division of Biological Anthropology, University of Cambridge, UK. lp8@sanger.ac.uk

    Humans and their ancestors have traversed the Ethiopian landscape for millions of years, and present-day Ethiopians show great cultural, linguistic, and historical diversity, which makes them essential for understanding African variability and human origins. We genotyped 235 individuals from ten Ethiopian and two neighboring (South Sudanese and Somali) populations on an Illumina Omni 1M chip. Genotypes were compared with published data from several African and non-African populations. Principal-component and STRUCTURE-like analyses confirmed substantial genetic diversity both within and between populations, and revealed a match between genetic data and linguistic affiliation. Using comparisons with African and non-African reference samples in 40-SNP genomic windows, we identified "African" and "non-African" haplotypic components for each Ethiopian individual. The non-African component, which includes the SLC24A5 allele associated with light skin pigmentation in Europeans, may represent gene flow into Africa, which we estimate to have occurred ~3 thousand years ago (kya). The non-African component was found to be more similar to populations inhabiting the Levant rather than the Arabian Peninsula, but the principal route for the expansion out of Africa ~60 kya remains unresolved. Linkage-disequilibrium decay with genomic distance was less rapid in both the whole genome and the African component than in southern African samples, suggesting a less ancient history for Ethiopian populations.

    Funded by: Wellcome Trust: 098051

    American journal of human genetics 2012;91;1;83-96

  • The dual origin of Tati-speakers from Dagestan as written in the genealogy of uniparental variants.

    Bertoncini S, Bulayeva K, Ferri G, Pagani L, Caciagli L, Taglioli L, Semyonov I, Bulayev O, Paoli G and Tofanelli S

    Department of Biology, University of Pisa, Pisa, Italy. stef.bertoncini@gmail.com

    Objectives: Tat language is classified in an Iranian subbranch of the Indo-European family. It is spoken in the Caucasus and in the West Caspian region by populations with heterogeneous cultural traditions and religion whose ancestry is unknown. The aim of this study is to get a first insight about the genetic history of this peculiar linguistic group.

    Methods: We investigated the uniparental gene pools, defined by NRY and mtDNA high-resolution markers, in two Tati-speaking communities from Dagestan: Mountain Jews or Juhur, who speak the Judeo-Tat dialect, and the Tats, who speak the Muslim-Tat dialect. The samples have been collected in monoethnic rural villages and selected on the basis of genealogical relationships. A novel approach aimed at resolving cryptic cases in the recent history of human populations, which combines the properties of uniparental genetic markers with the potential of "forward-in-time" computer simulations, is presented.

    Results: Judeo-Tats emerged as a group with tight matrilineal genetic legacy who separated early from other Jewish communities. Tats exhibited genetic signals of a much longer in situ evolution, which appear as substantially unlinked with other Indo-Iranian enclaves in the Caucasus.

    Conclusions: The independent demographic histories of the two samples, with mutually reversed profiles at paternally and maternally transmitted genetic systems, suggest that geographic proximity and linguistic assimilation of Tati-speakers from Dagestan do not reflect a common ancestry.

    American journal of human biology : the official journal of the Human Biology Council 2012;24;4;391-9

  • High altitude adaptation in Daghestani populations from the Caucasus.

    Pagani L, Ayub Q, MacArthur DG, Xue Y, Baillie JK, Chen Y, Kozarewa I, Turner DJ, Tofanelli S, Bulayeva K, Kidd K, Paoli G and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, UK. lp8@sanger.ac.uk

    We have surveyed 15 high-altitude adaptation candidate genes for signals of positive selection in North Caucasian highlanders using targeted re-sequencing. A total of 49 unrelated Daghestani from three ethnic groups (Avars, Kubachians, and Laks) living in ancient villages located at around 2,000 m above sea level were chosen as the study population. Caucasian (Adygei living at sea level, N = 20) and CEU (CEPH Utah residents with ancestry from northern and western Europe; N = 20) were used as controls. Candidate genes were compared with 20 putatively neutral control regions resequenced in the same individuals. The regions of interest were amplified by long-PCR, pooled according to individual, indexed by adding an eight-nucleotide tag, and sequenced using the Illumina GAII platform. 1,066 SNPs were called using false discovery and false negative thresholds of ~6%. The neutral regions provided an empirical null distribution to compare with the candidate genes for signals of selection. Two genes stood out. In Laks, a non-synonymous variant within HIF1A already known to be associated with improvement in oxygen metabolism was rediscovered, and in Kubachians a cluster of 13 SNPs located in a conserved intronic region within EGLN1 showing high population differentiation was found. These variants illustrate both the common pathways of adaptation to high altitude in different populations and features specific to the Daghestani populations, showing how even a mildly hypoxic environment can lead to genetic adaptation.

    Funded by: Wellcome Trust

    Human genetics 2012;131;3;423-33

  • A world in a grain of sand: human history from genetic data.

    Colonna V, Pagani L, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, UK.

    Genome-wide genotypes and sequences are enriching our understanding of the past 50,000 years of human history and providing insights into earlier periods largely inaccessible to mitochondrial DNA and Y-chromosomal studies.

    "To see a world in a grain of sand ..." William Blake, Auguries of Innocence.

    Funded by: Wellcome Trust

    Genome biology 2011;12;11;234

  • The key role of patrilineal inheritance in shaping the genetic variation of Dagestan highlanders.

    Caciagli L, Bulayeva K, Bulayev O, Bertoncini S, Taglioli L, Pagani L, Paoli G and Tofanelli S

    Dipartimento di Biologia, Università di Pisa, Pisa, Italy.

    The Caucasus region is a complex cultural and ethnic mosaic, comprising populations that speak Caucasian, Indo-European and Altaic languages. Isolated mountain villages (auls) in Dagestan still preserve a high level of genetic and cultural diversity and have patriarchal societies with a long history of isolation. The aim of this study was to understand the genetic history of five Dagestan highland auls with distinct ethnic affiliation (Avars, Chechens-Akkins, Kubachians, Laks, Tabasarans) using markers on the male-specific region of the Y chromosome. The groups analyzed here are all Muslims but speak different languages all belonging to the Nakh-Dagestanian linguistic family. The results show that the Dagestan ethnic groups share a common Y-genetic background, with deep-rooted genealogies and rare alleles, dating back to an early phase in the post-glacial recolonization of Europe. Geography and stochastic factors, such as founder effect and long-term genetic drift, driven by the rigid structuring of societies in groups of patrilineal descent, most likely acted as mutually reinforcing key factors in determining the high degree of Y-genetic divergence among these ethnic groups.

    Journal of human genetics 2009;54;12;689-94

Michal Szpak

- PhD Student

I received my B.Sc. in Biology from the University of Warsaw. My research in the Molecular Archaeology research group focused on ancient DNA analyses, evolutionary genetics and phylogenetics. I investigated bones of various Pleistocene mammalian species in order to shed more light on their evolutionary history. Subsequently, I conducted my master's degree research at the University of Virginia Medical School. During that period I was involved in several projects investigating population genetics and human genomics at The Center for Public Health Genomics. I was responsible for microarray-based detection of CNVs and for examining their distribution across four ethnic groups.

Research

I recently joined the Human Evolution team at the Sanger Institute, where I have worked on positive selection and the evolutionary history of human genes involved in interactions with viruses. I am currently sequencing the mtDNA of mountain and other gorillas in order to investigate their population genetics and phylogeny. The next step, and my PhD project, will be a functional study of evolutionarily interesting human variants that are highly differentiated between human populations, using model organisms. To study their function, I will model the selected and non-selected human alleles in mice, zebrafish, iPS cells or lymphoblastoid cell lines, depending on the predicted phenotype.

Wei Wei

- Visiting PhD student

I am a third-year PhD student at the Institute of Forensic Medicine, Sichuan University, Chengdu, China, and joined the Human Evolution team in September 2011 as a visiting student.

Research

My PhD project started with identifying informative Y-chromosomal markers for populations in China and applying them in forensic science using traditional PCR-based methods. I am now extending my research by using publicly available whole-Y-chromosome resequencing data, such as those from Complete Genomics, to refine the Y-chromosome phylogenetic tree by identifying new Y markers, and to understand human male history through population genetic analysis.

References

  • Exploring of new Y-chromosome SNP loci using Pyrosequencing and the SNaPshot methods.

    Wei W, Luo HB, Yan J and Hou YP

    Department of Forensic Genetics, West China School of Basic Science and Forensic Medicine, Sichuan University (West China University of Medical Sciences), Chengdu, 610041, Sichuan, China. weiwei090818@163.com

    The single nucleotide polymorphisms on the Y chromosome (Y-SNP) have been considered to be important in forensic casework. However, Y-SNP loci were mostly population specific and lacked biallelic polymorphisms in the Asian population. In this study, we developed a strategy for seeking and genotyping new Y-SNP markers based on both Pyrosequencing and the SNaPshot methods. As a result, 34 new biallelic markers were observed to be polymorphic in the Chinese Han population by estimation of allele frequencies of 103 candidate Y-SNP loci in DNA pools using Pyrosequencing technology. Then, a multiplex system with 20 Y-SNP loci was genotyped using the SNaPshot™ multiplex kit. Twenty Y-SNP loci defined 56 different haplotypes, and the haplotype diversity was estimated to be 0.9539. Our result demonstrated that the strategy could be used as an efficient tool to search and genotype biallelic markers from a large amount of candidate loci. In addition, 20 Y-SNP loci constructed a multiplex system, which could provide supplementary information for forensic identification.

    International journal of legal medicine 2012;126;6;825-33
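The haplotype diversity quoted in the abstract above (0.9539 for 56 haplotypes) is a summary statistic that can be sketched in a few lines. The abstract does not say which estimator was used; the example below assumes Nei's unbiased estimator, h = n/(n-1) * (1 - sum of squared haplotype frequencies), which is a common choice in forensic genetics. The function name and the sample data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Nei's unbiased haplotype diversity for a list of haplotype labels.

    Assumption: this estimator is used for illustration only; the paper's
    abstract does not state which formula produced the value 0.9539.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    counts = Counter(haplotypes)
    # Sum of squared haplotype frequencies (the sample homozygosity).
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical sample of eight Y chromosomes carrying five haplotypes.
sample = ["H1", "H1", "H2", "H3", "H3", "H3", "H4", "H5"]
h = haplotype_diversity(sample)
print(round(h, 4))
```

A value close to 1 means that two chromosomes drawn at random from the sample are very likely to carry different haplotypes, which is the property that makes a marker set useful for identification.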

Yali Xue

- Staff Scientist

I studied public health as an undergraduate, epidemiology for my master’s degree, and medical and population genetics for my Ph.D. at Harbin Medical University, China. I collected samples from different ethnic groups in China and established cell lines from them, some of which are now included in the HGDP panel. In all, I studied human genetic diversity in China for 8 years, also making visits to Oxford University, UK and Cleveland University, US during this period. I received the national scientific research award in 2005.

Research

I joined Sanger in 2004, working initially on Y-chromosomal diversity, including involvement in the Genographic project. Subsequently, I focused more on identifying signatures of positive selection in the human genome. Since 2008, I have concentrated on applying new sequencing technology to address human evolution and population genetics questions, e.g. directly measuring the Y-chromosomal mutation rate. I am involved in the 1000 Genomes Project, including work on Y-chromosomal diversity, a genome-wide scan for positive selection, identifying disease variants in the general population, and functional prediction of the consequences of variants. I also coordinate two major team projects for the new quinquennium: Native American and Himalayan population genetics studies.

References

  • Response to the comment on "The hare and the tortoise: One small step for four SNPs, one giant leap for SNP-kind".

    Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs. CB10 1SA, UK.

    The possibility of introducing new sequencing technologies into forensic genetics raises questions that go beyond the choice between SNPs and STRs as the preferred genetic markers. We suggest that many of the novel methodological and technical issues could be incorporated into the likelihood ratio frameworks currently used by forensic scientists. However, changes to ethical and legal structures may be needed before the new information could be used.

    Forensic science international. Genetics 2011;5;4;361-2

  • A worldwide analysis of beta-defensin copy number variation suggests recent selection of a high-expressing DEFB103 gene copy in East Asia.

    Hardwick RJ, Machado LR, Zuccherato LW, Antolinos S, Xue Y, Shawa N, Gilman RH, Cabrera L, Berg DE, Tyler-Smith C, Kelly P, Tarazona-Santos E and Hollox EJ

    Department of Genetics, University of Leicester, University Road, Leicester, United Kingdom.

    Beta-defensins are a family of multifunctional genes with roles in defense against pathogens, reproduction, and pigmentation. In humans, six beta-defensin genes are clustered in a repeated region which is copy-number variable (CNV) as a block, with a diploid copy number between 1 and 12. The role in host defense makes the evolutionary history of this CNV particularly interesting, because morbidity due to infectious disease is likely to have been an important selective force in human evolution, and to have varied between geographical locations. Here, we show CNV of the beta-defensin region in chimpanzees, and identify a beta-defensin block in the human lineage that contains rapidly evolving noncoding regulatory sequences. We also show that variation at one of these rapidly evolving sequences affects expression levels and cytokine responsiveness of DEFB103, a key inhibitor of influenza virus fusion at the cell surface. A worldwide analysis of beta-defensin CNV in 67 populations shows an unusually high frequency of high-DEFB103-expressing copies in East Asia, the geographical origin of historical and modern influenza epidemics, possibly as a result of selection for increased resistance to influenza in this region.

    Funded by: Medical Research Council: G0801123, GO801123; Wellcome Trust: 067948, 077009, 087663

    Human mutation 2011;32;7;743-50

  • A map of human genome variation from population-scale sequencing.

    1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME and McVean GA

    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

    Funded by: British Heart Foundation: RG/09/012/28096; Howard Hughes Medical Institute; Medical Research Council: G0801823, G0801823(89305); NCRR NIH HHS: S10RR025056; NHGRI NIH HHS: 01HG3229, N01HG62088, P01HG4120, P41HG2371, P41HG4221, P41HG4222, P50HG2357, R01 HG003229, R01 HG003229-05, R01 HG004719-01, R01 HG004719-02, R01 HG004719-02S1, R01 HG004719-03, R01 HG004719-04, R01HG2651, R01HG3698, R01HG4333, R01HG4719, R01HG4960, RC2 HG005552-01, RC2 HG005552-02, RC2HG5552, U01HG5208, U01HG5209, U01HG5210, U01HG5211, U01HG5214, U41HG4568, U54 HG003273, U54HG2750, U54HG2757, U54HG3067, U54HG3079, U54HG3273; NIGMS NIH HHS: R01GM59290, R01GM72861, T32 GM007753; NIMH NIH HHS: 01MH84698; Wellcome Trust: 075491, 077009, 077014, 077192, 081407, 085532, 086084, 089061, 089062, 089088, WT075491/Z/04, WT077009, WT081407/Z/06/Z, WT085532AIA, WT086084/Z/08/Z, WT089088/Z/09/Z

    Nature 2010;467;7319;1061-73

  • A worldwide survey of human male demographic history based on Y-SNP and Y-STR data from the HGDP-CEPH populations.

    Shi W, Ayub Q, Vermeulen M, Shao RG, Zuniga S, van der Gaag K, de Knijff P, Kayser M, Xue Y and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambs., United Kingdom.

    We have investigated human male demographic history using 590 males from 51 populations in the Human Genome Diversity Project - Centre d'Etude du Polymorphisme Humain worldwide panel, typed with 37 Y-chromosomal Single Nucleotide Polymorphisms and 65 Y-chromosomal Short Tandem Repeats and analyzed with the program Bayesian Analysis of Trees With Internal Node Generation. The general patterns we observe show a gradient from the oldest population time to the most recent common ancestors (TMRCAs) and expansion times together with the largest effective population sizes in Africa, to the youngest times and smallest effective population sizes in the Americas. These parameters are significantly negatively correlated with distance from East Africa, and the patterns are consistent with most other studies of human variation and history. In contrast, growth rate showed a weaker correlation in the opposite direction. Y-lineage diversity and TMRCA also decrease with distance from East Africa, supporting a model of expansion with serial founder events starting from this source. A number of individual populations diverge from these general patterns, including previously documented examples such as recent expansions of the Yoruba in Africa, Basques in Europe, and Yakut in Northern Asia. However, some unexpected demographic histories were also found, including low growth rates in the Hazara and Kalash from Pakistan and recent expansion of the Mozabites in North Africa.

    Molecular biology and evolution 2010;27;2;385-93

  • Population differentiation as an indicator of recent positive selection in humans: an empirical evaluation.

    Xue Y, Zhang X, Huang N, Daly A, Gillson CJ, Macarthur DG, Yngvadottir B, Nica AC, Woodwark C, Chen Y, Conrad DF, Ayub Q, Mehdi SQ, Li P and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, United Kingdom.

    We have evaluated the extent to which SNPs identified by genomewide surveys as showing unusually high levels of population differentiation in humans have experienced recent positive selection, starting from a set of 32 nonsynonymous SNPs in 27 genes highlighted by the HapMap1 project. These SNPs were genotyped again in the HapMap samples and in the Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain (HGDP-CEPH) panel of 52 populations representing worldwide diversity; extended haplotype homozygosity was investigated around all of them, and full resequence data were examined for 9 genes (5 from public sources and 4 from new data sets). For 7 of the genes, genotyping errors were responsible for an artifactual signal of high population differentiation and for 2, the population differentiation did not exceed our significance threshold. For the 18 genes with confirmed high population differentiation, 3 showed evidence of positive selection as measured by unusually extended haplotypes within a population, and 7 more did in between-population analyses. The 9 genes with resequence data included 7 with high population differentiation, and 5 showed evidence of positive selection on the haplotype carrying the nonsynonymous SNP from skewed allele frequency spectra; in addition, 2 showed evidence of positive selection on unrelated haplotypes. Thus, in humans, high population differentiation is (apart from technical artifacts) an effective way of enriching for recently selected genes, but is not an infallible pointer to recent positive selection supported by other lines of evidence.

    Funded by: Wellcome Trust

    Genetics 2009;183;3;1065-77

  • Human Y chromosome base-substitution mutation rate measured by direct sequencing in a deep-rooting pedigree.

    Xue Y, Wang Q, Long Q, Ng BL, Swerdlow H, Burton J, Skuce C, Taylor R, Abdellah Z, Zhao Y, Asan, MacArthur DG, Quail MA, Carter NP, Yang H and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Hinxton, Cambs CB10 1SA, UK. ylx@sanger.ac.uk

    Understanding the key process of human mutation is important for many aspects of medical genetics and human evolution. In the past, estimates of mutation rates have generally been inferred from phenotypic observations or comparisons of homologous sequences among closely related species. Here, we apply new sequencing technology to measure directly one mutation rate, that of base substitutions on the human Y chromosome. The Y chromosomes of two individuals separated by 13 generations were flow sorted and sequenced by Illumina (Solexa) paired-end sequencing to an average depth of 11x or 20x, respectively. Candidate mutations were further examined by capillary sequencing in cell-line and blood DNA from the donors and additional family members. Twelve mutations were confirmed in approximately 10.15 Mb; eight of these had occurred in vitro and four in vivo. The latter could be placed in different positions on the pedigree and led to a mutation-rate measurement of 3.0 x 10(-8) mutations/nucleotide/generation (95% CI: 8.9 x 10(-9)-7.0 x 10(-8)), consistent with estimates of 2.3 x 10(-8)-6.3 x 10(-8) mutations/nucleotide/generation for the same Y-chromosomal region from published human-chimpanzee comparisons depending on the generation and split times assumed.

    Funded by: Wellcome Trust

    Current biology : CB 2009;19;17;1453-7

  • A systematic, large-scale resequencing screen of X-chromosome coding exons in mental retardation.

    Tarpey PS, Smith R, Pleasance E, Whibley A, Edkins S, Hardy C, O'Meara S, Latimer C, Dicks E, Menzies A, Stephens P, Blow M, Greenman C, Xue Y, Tyler-Smith C, Thompson D, Gray K, Andrews J, Barthorpe S, Buck G, Cole J, Dunmore R, Jones D, Maddison M, Mironenko T, Turner R, Turrell K, Varian J, West S, Widaa S, Wray P, Teague J, Butler A, Jenkinson A, Jia M, Richardson D, Shepherd R, Wooster R, Tejada MI, Martinez F, Carvill G, Goliath R, de Brouwer AP, van Bokhoven H, Van Esch H, Chelly J, Raynaud M, Ropers HH, Abidi FE, Srivastava AK, Cox J, Luo Y, Mallya U, Moon J, Parnau J, Mohammed S, Tolmie JL, Shoubridge C, Corbett M, Gardner A, Haan E, Rujirabanjerd S, Shaw M, Vandeleur L, Fullston T, Easton DF, Boyle J, Partington M, Hackett A, Field M, Skinner C, Stevenson RE, Bobrow M, Turner G, Schwartz CE, Gecz J, Raymond FL, Futreal PA and Stratton MR

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK.

    Large-scale systematic resequencing has been proposed as the key future strategy for the discovery of rare, disease-causing sequence variants across the spectrum of human complex disease. We have sequenced the coding exons of the X chromosome in 208 families with X-linked mental retardation (XLMR), the largest direct screen for constitutional disease-causing mutations thus far reported. The screen has discovered nine genes implicated in XLMR, including SYP, ZNF711 and CASK reported here, confirming the power of this strategy. The study has, however, also highlighted issues confronting whole-genome sequencing screens, including the observation that loss of function of 1% or more of X-chromosome genes is compatible with apparently normal existence.

    Funded by: Cancer Research UK: 10118; NICHD NIH HHS: HD26202; Wellcome Trust: 077012

    Nature genetics 2009;41;5;535-43

  • A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia.

    Dhandapany PS, Sadayappan S, Xue Y, Powell GT, Rani DS, Nallari P, Rai TS, Khullar M, Soares P, Bahl A, Tharkan JM, Vaideeswar P, Rathinavel A, Narasimhan C, Ayapati DR, Ayub Q, Mehdi SQ, Oppenheimer S, Richards MB, Price AL, Patterson N, Reich D, Singh L, Tyler-Smith C and Thangaraj K

    Department of Biochemistry, Madurai Kamaraj University, Madurai 625 021, India.

    Heart failure is a leading cause of mortality in South Asians. However, its genetic etiology remains largely unknown. Cardiomyopathies due to sarcomeric mutations are a major monogenic cause for heart failure (MIM600958). Here, we describe a deletion of 25 bp in the gene encoding cardiac myosin binding protein C (MYBPC3) that is associated with heritable cardiomyopathies and an increased risk of heart failure in Indian populations (initial study OR = 5.3 (95% CI = 2.3-13), P = 2 x 10(-6); replication study OR = 8.59 (3.19-25.05), P = 3 x 10(-8); combined OR = 6.99 (3.68-13.57), P = 4 x 10(-11)) and that disrupts cardiomyocyte structure in vitro. Its prevalence was found to be high (approximately 4%) in populations of Indian subcontinental ancestry. The finding of a common risk factor implicated in South Asian subjects with cardiomyopathy will help in identifying and counseling individuals predisposed to cardiac diseases in this region.

    Funded by: NHGRI NIH HHS: R01 HG006399-02; Wellcome Trust: 077009

    Nature genetics 2009;41;2;187-91

  • A genome-wide survey of the prevalence and evolutionary forces acting on human nonsense SNPs.

    Yngvadottir B, Xue Y, Searle S, Hunt S, Delgado M, Morrison J, Whittaker P, Deloukas P and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA, UK.

    Nonsense SNPs introduce premature termination codons into genes and can result in the absence of a gene product or in a truncated and potentially harmful protein, so they are often considered disadvantageous and are associated with disease susceptibility. As such, we might expect the disrupted allele to be rare and, in healthy people, observed only in a heterozygous state. However, some, like those in the CASP12 and ACTN3 genes, are known to be present at high frequencies and to occur often in a homozygous state and seem to have been advantageous in recent human evolution. To evaluate the selective forces acting on nonsense SNPs as a class, we have carried out a large-scale experimental survey of nonsense SNPs in the human genome by genotyping 805 of them (plus control synonymous SNPs) in 1,151 individuals from 56 worldwide populations. We identified 169 genes containing nonsense SNPs that were variable in our samples, of which 99 were found with both copies inactivated in at least one individual. We found that the sampled humans differ on average by 24 genes (out of about 20,000) because of these nonsense SNPs alone. As might be expected, nonsense SNPs as a class were found to be slightly disadvantageous over evolutionary timescales, but a few nevertheless showed signs of being possibly advantageous, as indicated by unusually high levels of population differentiation, long haplotypes, and/or high frequencies of derived alleles. This study underlines the extent of variation in gene content within humans and emphasizes the importance of understanding this type of variation.

    Funded by: Wellcome Trust: 062023

    American journal of human genetics 2009;84;2;224-34

  • Adaptive evolution of UGT2B17 copy-number variation.

    Xue Y, Sun D, Daly A, Yang F, Zhou X, Zhao M, Huang N, Zerjal T, Lee C, Carter NP, Hurles ME and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    The human UGT2B17 gene varies in copy number from zero to two per individual and also differs in mean number between populations from Africa, Europe, and East Asia. We show that such a high degree of geographical variation is unusual and investigate its evolutionary history. This required first reinterpreting the reference sequence in this region of the genome, which is misassembled from the two different alleles separated by an artifactual gap. A corrected assembly identifies the polymorphism as a 117 kb deletion arising by nonallelic homologous recombination between approximately 4.9 kb segmental duplications and allows the deletion breakpoint to be identified. We resequenced approximately 12 kb of DNA spanning the breakpoint in 91 humans from three HapMap and one extended HapMap populations and one chimpanzee. Diversity was unusually high and the time to the most recent common ancestor was estimated at approximately 2.4 or approximately 3.0 million years by two different methods, with evidence of balancing selection in Europe. In contrast, diversity was low in East Asia where a single haplotype predominated, suggesting positive selection for the deletion in this part of the world.

    Funded by: Wellcome Trust

    American journal of human genetics 2008;83;3;337-46
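The point estimate in the Y-chromosome mutation-rate paper listed above can be reproduced from the figures given in its abstract: four in vivo mutations, approximately 10.15 Mb of sequence, and 13 generations separating the two men. The sketch below assumes the rate is simply the mutation count divided by the product of sequence length and number of generations; the function name is hypothetical, and the published confidence interval is not recomputed here because the abstract does not describe how it was derived.

```python
def per_generation_mutation_rate(mutations, bases, generations):
    """Point estimate of mutations per nucleotide per generation.

    Assumption: a plain count / (bases * generations) estimate, used here
    only to illustrate the arithmetic behind the abstract's figure.
    """
    if bases <= 0 or generations <= 0:
        raise ValueError("bases and generations must be positive")
    return mutations / (bases * generations)


# Figures quoted in the abstract: 4 in vivo mutations, ~10.15 Mb, 13 generations.
rate = per_generation_mutation_rate(mutations=4, bases=10.15e6, generations=13)
print(f"{rate:.1e}")  # 3.0e-08
```

With so few observed mutations the estimate is necessarily imprecise, which is why the abstract reports a wide 95% confidence interval alongside it.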

Bryndis Yngvadottir

by1@sanger.ac.uk

I received my B.A. in Social Anthropology from the University of Iceland in 2001 and my M.A. in Biological Anthropology from the same university in 2004. I gained my Ph.D. from the University of Cambridge in 2008, after undertaking a four-year Ph.D. programme at the Wellcome Trust Sanger Institute. My doctoral project was in the field of Evolutionary Genetics under the supervision of Dr. Chris Tyler-Smith. Subsequently, I joined the Human Evolution team as a postdoctoral fellow.

Research

My primary research interests are in the field of human evolution. Specifically, they include the subjects of genetic variation in humans and non-human great apes, natural selection, cultural history and genome-wide comparison of closely related species. My current work is focused on analysing genetic variation in modern gorillas to make inferences about their demographic past. To this end I am using the de novo assembly of Kamilah, a western lowland gorilla, as well as reduced representation sequence data from additional individuals representing both the eastern and western species.

References

  • Population differentiation as an indicator of recent positive selection in humans: an empirical evaluation.

    Xue Y, Zhang X, Huang N, Daly A, Gillson CJ, Macarthur DG, Yngvadottir B, Nica AC, Woodwark C, Chen Y, Conrad DF, Ayub Q, Mehdi SQ, Li P and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, United Kingdom.

    We have evaluated the extent to which SNPs identified by genomewide surveys as showing unusually high levels of population differentiation in humans have experienced recent positive selection, starting from a set of 32 nonsynonymous SNPs in 27 genes highlighted by the HapMap1 project. These SNPs were genotyped again in the HapMap samples and in the Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain (HGDP-CEPH) panel of 52 populations representing worldwide diversity; extended haplotype homozygosity was investigated around all of them, and full resequence data were examined for 9 genes (5 from public sources and 4 from new data sets). For 7 of the genes, genotyping errors were responsible for an artifactual signal of high population differentiation and for 2, the population differentiation did not exceed our significance threshold. For the 18 genes with confirmed high population differentiation, 3 showed evidence of positive selection as measured by unusually extended haplotypes within a population, and 7 more did in between-population analyses. The 9 genes with resequence data included 7 with high population differentiation, and 5 showed evidence of positive selection on the haplotype carrying the nonsynonymous SNP from skewed allele frequency spectra; in addition, 2 showed evidence of positive selection on unrelated haplotypes. Thus, in humans, high population differentiation is (apart from technical artifacts) an effective way of enriching for recently selected genes, but is not an infallible pointer to recent positive selection supported by other lines of evidence.

    Funded by: Wellcome Trust

    Genetics 2009;183;3;1065-77

  • A genome-wide survey of the prevalence and evolutionary forces acting on human nonsense SNPs.

    Yngvadottir B, Xue Y, Searle S, Hunt S, Delgado M, Morrison J, Whittaker P, Deloukas P and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA, UK.

    Nonsense SNPs introduce premature termination codons into genes and can result in the absence of a gene product or in a truncated and potentially harmful protein, so they are often considered disadvantageous and are associated with disease susceptibility. As such, we might expect the disrupted allele to be rare and, in healthy people, observed only in a heterozygous state. However, some, like those in the CASP12 and ACTN3 genes, are known to be present at high frequencies and to occur often in a homozygous state and seem to have been advantageous in recent human evolution. To evaluate the selective forces acting on nonsense SNPs as a class, we have carried out a large-scale experimental survey of nonsense SNPs in the human genome by genotyping 805 of them (plus control synonymous SNPs) in 1,151 individuals from 56 worldwide populations. We identified 169 genes containing nonsense SNPs that were variable in our samples, of which 99 were found with both copies inactivated in at least one individual. We found that the sampled humans differ on average by 24 genes (out of about 20,000) because of these nonsense SNPs alone. As might be expected, nonsense SNPs as a class were found to be slightly disadvantageous over evolutionary timescales, but a few nevertheless showed signs of being possibly advantageous, as indicated by unusually high levels of population differentiation, long haplotypes, and/or high frequencies of derived alleles. This study underlines the extent of variation in gene content within humans and emphasizes the importance of understanding this type of variation.

    Funded by: Wellcome Trust: 062023

    American journal of human genetics 2009;84;2;224-34

  • The promise and reality of personal genomics.

    Yngvadottir B, Macarthur DG, Jin H and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.

    The publication of the highest-quality and best-annotated personal genome yet tells us much about sequencing technology, something about genetic ancestry, but still little of medical relevance.

    Funded by: Wellcome Trust

    Genome biology 2009;10;9;237

  • Insights into modern disease from our distant evolutionary past.

    Yngvadottir B

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK. bya@sanger.ac.uk

    An EMBO workshop entitled 'Human Evolution and Disease' was held recently (6-9 December 2006, Hyderabad, India) where 141 scientists from many disciplines came together to discuss recent studies of human variation, origins and dispersal, natural selection and disease susceptibility. The meeting tackled the subject of human evolution and disease from the different perspectives of archaeology, linguistics, genetics and genomics based on both new and publicly available data sets. In this report, we highlight the latest fashion crazes in the discipline, in particular, the use of large public data sets and new methods to analyse modern human variation and the links between human evolution and disease susceptibility.

    European journal of human genetics : EJHG 2007;15;5;603-6

  • A shared Y-chromosomal heritage between Muslims and Hindus in India.

    Gutala R, Carvalho-Silva DR, Jin L, Yngvadottir B, Avadhanula V, Nanne K, Singh L, Chakraborty R and Tyler-Smith C

    Department of Medicine, University of Texas Health Science Center, San Antonio, TX, USA.

    Arab forces conquered the Indus Delta region in 711 AD and, although a Muslim state was established there, their influence was barely felt in the rest of South Asia at that time. By the end of the tenth century, Central Asian Muslims moved into India from the northwest and expanded throughout the subcontinent. Muslim communities are now the largest minority religion in India, comprising more than 138 million people in a predominantly Hindu population of over one billion. It is unclear whether the Muslim expansion in India was a purely cultural phenomenon or had a genetic impact on the local population. To address this question from a male perspective, we typed eight microsatellite loci and 16 binary markers from the Y chromosome in 246 Muslims from Andhra Pradesh, and compared them to published data on 4,204 males from East Asia, Central Asia, other parts of India, Sri Lanka, Pakistan, Iran, the Middle East, Turkey, Egypt and Morocco. We find that the Muslim populations in general are genetically closer to their non-Muslim geographical neighbors than to other Muslims in India, and that there is a highly significant correlation between genetics and geography (but not religion). Our findings indicate that, despite the documented practice of marriage between Muslim men and Hindu women, Islamization in India did not involve large-scale replacement of Hindu Y chromosomes. The Muslim expansion in India was predominantly a cultural change and was not accompanied by significant gene flow, as seen in other places, such as China and Central Asia.

    Funded by: Wellcome Trust: 077009

    Human genetics 2006;120;4;543-51

  • mtDNA variation in Inuit populations of Greenland and Canada: migration history and population structure.

    Helgason A, Pálsson G, Pedersen HS, Angulalik E, Gunnarsdóttir ED, Yngvadóttir B and Stefánsson K

    deCODE Genetics, Inc., 101 Reykjavik, Iceland. agnar@decode.is

    We examined 395 mtDNA control-region sequences from Greenlandic Inuit and Canadian Kitikmeot Inuit with the aim of shedding light on the migration history that underlies the present geographic patterns of genetic variation at this locus in the Arctic. In line with previous studies, we found that Inuit populations carry only sequences belonging to haplotype clusters A2 and D3. However, a comparison of Arctic populations from Siberia, Canada, and Greenland revealed considerable differences in the frequencies of these haplotypes. Moreover, large sample sizes and regional information about birthplaces of maternal grandmothers permitted the detection of notable differences in the distribution of haplotypes among subpopulations within Greenland. Our results cast doubt on the prevailing hypothesis that contemporary Inuit trace all of their ancestry to so-called Thule groups that expanded from Alaska about 800-1,000 years ago. In particular, discrepancies in mutational divergence between the Inuit populations and their putative source mtDNA pool in Siberia/Alaska for the two predominant haplotype clusters, A2a and A2b, are more consistent with the possibility that expanding Thule groups encountered and interbred with existing Dorset populations in Canada and Greenland.

    American journal of physical anthropology 2006;130;1;123-34

  • Spread of an inactive form of caspase-12 in humans is due to recent positive selection.

    Xue Y, Daly A, Yngvadottir B, Liu M, Coop G, Kim Y, Sabeti P, Chen Y, Stalker J, Huckle E, Burton J, Leonard S, Rogers J and Tyler-Smith C

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA, United Kingdom.

    The human caspase-12 gene is polymorphic for the presence or absence of a stop codon, which results in the occurrence of both active (ancestral) and inactive (derived) forms of the gene in the population. It has been shown elsewhere that carriers of the inactive gene are more resistant to severe sepsis. We have now investigated whether the inactive form has spread because of neutral drift or positive selection. We determined its distribution in a worldwide sample of 52 populations and resequenced the gene in 77 individuals from the HapMap Yoruba, Han Chinese, and European populations. There is strong evidence of positive selection from low diversity, skewed allele-frequency spectra, and the predominance of a single haplotype. We suggest that the inactive form of the gene arose in Africa approximately 100-500 thousand years ago (KYA) and was initially neutral or almost neutral but that positive selection beginning approximately 60-100 KYA drove it to near fixation. We further propose that its selective advantage was sepsis resistance in populations that experienced more infectious diseases as population sizes and densities increased.

    Funded by: Wellcome Trust

    American journal of human genetics 2006;78;4;659-70

  • An Icelandic example of the impact of population structure on association studies.

    Helgason A, Yngvadóttir B, Hrafnkelsson B, Gulcher J and Stefánsson K

    deCODE Genetics, Sturlugata 8, 101 Reykjavík, Iceland. agnar@decode.is

    The impact of population structure on association studies undertaken to identify genetic variants underlying common human diseases is an issue of growing interest. Spurious associations of alleles with disease phenotypes may be obtained or true associations overlooked when allele frequencies differ notably among subpopulations that are not represented equally among cases and controls. Population structure influences even carefully designed studies and can affect the validity of association results. Most study designs address this problem by sampling cases and controls from groups that share the same nationality or self-reported ethnic background, with the implicit assumption that no substructure exists within such groups. We examined population structure in the Icelandic gene pool using extensive genealogical and genetic data. Our results indicate that sampling strategies need to take account of substructure even in a relatively homogenous genetic isolate. This will probably be even more important in larger populations.

    Nature genetics 2005;37;1;90-5
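
    The mechanism described in this abstract, a spurious association arising when subpopulations with different allele frequencies are unequally represented among cases and controls, can be shown with a small worked example. This is a sketch with invented counts, not data from the Icelandic study; the subpopulation labels, the counts and the `odds_ratio` helper are all hypothetical.

```python
# Minimal sketch of confounding by population structure.
# All counts are invented for illustration; none come from the paper.

def odds_ratio(case_carriers, case_total, control_carriers, control_total):
    """Odds ratio for carrying the allele in cases versus controls."""
    a = case_carriers
    b = case_total - case_carriers
    c = control_carriers
    d = control_total - control_carriers
    return (a * d) / (b * c)

# Two hypothetical subpopulations. Within each one, carriers are equally common
# among cases and controls (50% in A, 10% in B), so there is no true association.
# Cases are drawn mostly from A and controls mostly from B.
strata = {
    "A": {"case_carriers": 400, "case_total": 800, "control_carriers": 100, "control_total": 200},
    "B": {"case_carriers": 20, "case_total": 200, "control_carriers": 80, "control_total": 800},
}

# Odds ratio computed separately inside each subpopulation.
per_stratum = {name: odds_ratio(**counts) for name, counts in strata.items()}

# Odds ratio computed after pooling the subpopulations, ignoring structure.
pooled_counts = {
    key: sum(counts[key] for counts in strata.values())
    for key in ("case_carriers", "case_total", "control_carriers", "control_total")
}
pooled = odds_ratio(**pooled_counts)

for name, value in per_stratum.items():
    print(f"subpopulation {name}: odds ratio = {value:.2f}")
print(f"pooled sample:   odds ratio = {pooled:.2f}")
```

    With these numbers the odds ratio is 1 inside each subpopulation but well above 1 in the pooled sample, which is the kind of artifact that sampling strategies accounting for substructure are meant to avoid.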
