Population genomics of molecular phenotypes

High-throughput sequencing is opening up a new chapter in the study of molecular evolution and genetics by allowing deep sequencing of whole populations of organisms and cells.

We are now in a unique position to study the nature of the evolutionary forces responsible for the amazing diversity of life. We can ask: What is the role of genetics in a person's susceptibility to cancer or another potentially fatal disease? Are the observed differences between individuals mostly a result of neutral evolution, or do they confer a fitness advantage? Answering these questions is not only interesting for evolutionary biologists but can also make a fundamental contribution to biomedical applications. The promise of personalised medicine will critically depend on finding and understanding molecular disease phenotypes.

Our group will contribute to this effort by developing population genetic analysis methods which use experimental measurements of sequence properties in conjunction with deep sequencing data to elucidate functional consequences of genomic variation.

Background

Genomic sequences are the result of evolutionary processes; hence there is a close relationship between statistical methods for genome sequence analysis and quantitative population genetics. Evolution is a complex stochastic process whose outcome is decided at the level of populations of individual organisms or cells. The major evolutionary forces are mutation and recombination (which provide variation), genetic drift (the noise in reproduction) and selection (the differential reproductive success of individuals). All of them contribute to observed evolutionary change, so their individual roles are difficult to disentangle. However, it is precisely this decomposition that will be critical when we attempt to understand the functional consequences of mutations, that is, the link between genotype and phenotype, from observations of naturally occurring variation. For instance, somatic mutations in tumors can be classified into driver and passenger mutations. Driver mutations are causal, for example by increasing the net growth rate of the cell, while passenger mutations have no effect on the cancer phenotype of the cell. Classifying somatic mutations correctly is therefore important for gaining new insight into the processes involved in cancer (The Cancer Genome, Stratton et al. Nature, 2009).

Exploiting genome-wide sequence data for functional studies has turned out to be challenging, as it is often not easy to distinguish functionally relevant genomic variation at the molecular level from changes without phenotypic effects. Assigning function to regulatory sequence variation has been harder still, and its quantitative understanding remains a work in progress.

We aim to improve this situation by combining experimental measurements of sequence properties, for example the binding affinities of transcription factors to DNA, with deep sequencing data, and by interpreting such data sets using population genetic theory.

Research

Our Aims

The main objective of the Population genomics of molecular phenotypes group is to increase understanding of the functional and evolutionary consequences of naturally occurring variation.

Our Approach

Sequences

New sequencing technologies enable deep sequencing of multiple populations and are thus making a detailed observation of naturally occurring genetic variation possible. However, the reads generated using so-called second generation sequencing platforms need a substantial informatics effort, for example assembly and imputation, before they can be used. Multiple groups at Sanger are contributing to this effort so that other investigators (including our group) can focus on downstream analyses, for example evolutionary interpretation of the sequence variation.

Molecular phenotypes

An increasing number of high-throughput technologies are available to measure molecular phenotypes, for example protein binding microarrays (Badis et al., 2009), mechanically induced trapping of molecular interactions (Maerkl and Quake, 2007) and ChIP-Seq. Figure 2 depicts an example molecular phenotype: the sequence-specific binding energy E between a transcription factor molecule and DNA, which can be either measured experimentally or, in some cases, inferred statistically. The resulting mapping from a genotype to a molecular phenotype, i.e. a → E(a), exemplifies our approach to analyzing genomic variation. Variation found in a binding site is relevant only if it changes the binding energy E and hence the binding probability of the transcription factor.
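The genotype-to-phenotype map a → E(a) can be illustrated with a minimal Python sketch. It assumes an additive energy matrix, in which each position of the site contributes independently to E, and a two-state (bound or unbound) binding probability; neither assumption is stated above, and the matrix values, the site length of four and the chemical_potential parameter are invented purely for illustration.

```python
import math

# Hypothetical energy matrix (illustrative values, in units of k_B*T):
# one dict per site position, giving the energy cost of each nucleotide.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 2.0, "G": 1.5, "T": 2.5},
    {"A": 1.0, "C": 0.0, "G": 2.0, "T": 1.5},
    {"A": 2.0, "C": 1.5, "G": 0.0, "T": 1.0},
    {"A": 2.5, "C": 2.0, "G": 1.0, "T": 0.0},
]


def binding_energy(site):
    """Map a genotype a (the site sequence) to its binding energy E(a).

    Assumes additivity: E is the sum of independent per-position terms.
    """
    if len(site) != len(ENERGY_MATRIX):
        raise ValueError("site length must match the energy matrix")
    return sum(column[base] for column, base in zip(ENERGY_MATRIX, site))


def binding_probability(energy, chemical_potential=2.0):
    """Two-state occupancy: probability that the factor is bound.

    chemical_potential is a hypothetical parameter standing in for the
    factor concentration; sites with E well below it are mostly bound.
    """
    return 1.0 / (1.0 + math.exp(energy - chemical_potential))


if __name__ == "__main__":
    for site in ("ACGT", "CCGT", "CAGT"):
        e = binding_energy(site)
        print(site, e, round(binding_probability(e), 3))
```

In this sketch, two sequence variants matter for the phenotype only insofar as they differ in E; many distinct genotypes collapse onto the same energy value.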

Population and evolutionary genomics

Individual organisms within and across species share ancestry, a fact that can readily be observed from correlations between their genomic sequences. Elucidating the tempo and mode of the processes that change these correlations is central to the study of molecular evolution.

We interpret the observed variation using population genetic models whose compatibility (or incompatibility) with the data is judged using statistical methods, for example Bayesian inference. A key step in such an inference is to identify a set of observables that capture a substantial part of the biologically relevant variation in the sequence region. It is precisely here that molecular phenotype maps become indispensable. In short, they provide a systematic way of reducing the dimensionality of the problem, which is critical to avoid undersampling: genotype space is vast even for short sequences such as binding sites (a site of 10 base pairs already has 4^10, about a million, possible genotypes).

Once the sequences are projected onto the corresponding molecular phenotypes, we can perform inferences based on evolutionary models (see Figure 3). Figure 3a shows examples of outgroup-directed allele probability distributions Q(x), a classic observable in population genetics that exploits polymorphism data in an ingroup species A and polarizes the direction of mutations using an outgroup species B. Here x = k/m is the frequency of the ingroup allele that differs from the outgroup allele: if x = 0 both species share the allele, if 0 < x < 1 the locus is polymorphic, and if x = 1 a substitution between the species has occurred. The model distributions shown are for Kimura's two-allele model with mutation, selection and genetic drift. Figure 3b shows a time series from an evolutionary simulation of the frequency x(t) at a genomic locus. Such time-series data are increasingly becoming available at single-nucleotide resolution through resequencing technologies and will help to infer, for example, whether a newly arisen mutation evolves under positive selection.
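The two observables described here can be sketched in a few lines of Python. This is an illustrative sketch, not the group's actual simulation code: it assumes that k counts the sampled ingroup sequences carrying the allele that differs from the outgroup and that m is the sample size, and it uses a simple haploid Wright-Fisher scheme with symmetric mutation as a stand-in for a two-allele model with mutation, selection and drift. The function names and all parameter values are hypothetical.

```python
import random


def classify_site(k, m):
    """Classify a site from its outgroup-directed frequency x = k/m."""
    if m <= 0 or not 0 <= k <= m:
        raise ValueError("need m > 0 and 0 <= k <= m")
    if k == 0:
        return "shared"        # x = 0: both species share the allele
    if k == m:
        return "substitution"  # x = 1: fixed difference between species
    return "polymorphic"       # 0 < x < 1


def wright_fisher(n, s, mu, generations, x0=0.0, seed=0):
    """Simulate the frequency x(t) of one allele in a haploid population.

    n: population size, s: selection coefficient of the tracked allele,
    mu: symmetric per-generation mutation rate between the two alleles.
    """
    rng = random.Random(seed)
    x = x0
    trajectory = [x]
    for _ in range(generations):
        # Selection: reweight the allele by its relative fitness 1 + s.
        x_sel = x * (1.0 + s) / (1.0 + s * x)
        # Mutation: symmetric flux between the two alleles.
        x_mut = x_sel * (1.0 - mu) + (1.0 - x_sel) * mu
        # Genetic drift: binomial sampling of n offspring.
        count = sum(1 for _ in range(n) if rng.random() < x_mut)
        x = count / n
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    print(classify_site(3, 10))
    print(wright_fisher(n=100, s=0.05, mu=0.001, generations=10))
```

Running many such trajectories with different seeds and recording the frequency at a fixed time would give an empirical counterpart of the distribution Q(x) under the chosen parameters.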

Population graphics. [Genome Research Limited]

An extra layer of complication in interpreting sequence variation is that many different evolutionary scenarios can leave similar traces in the sequences. For instance, loci under moderate negative selection undergo an increased number of substitutions during a population bottleneck. If the bottleneck happened in the distant past, the polymorphism spectrum that we observe now would have had enough time to equilibrate again to the current population size. However, the full probability density Q(x) would have a surplus of substitution events, which is also a hallmark of positive selection. Unraveling such interpretation challenges makes the field exciting.

Summary

We develop methods that use molecular phenotype data in conjunction with sequence variation data from population sequencing of multiple genomes. These integrated data are then interpreted using population genetic theory to elucidate the functional and evolutionary consequences of the variation.

Selected Publications

  • Quantifying selection acting on a complex trait using allele frequency time series data.

    Illingworth CJ, Parts L, Schiffels S, Liti G and Mustonen V

    Molecular biology and evolution 2012;29;4;1187-97

  • A method to infer positive selection from marker dynamics in an asexual population.

    Illingworth CJ and Mustonen V

    Bioinformatics (Oxford, England) 2012;28;6;831-7

  • Distinguishing driver and passenger mutations in an evolutionary history categorized by interference.

    Illingworth CJ and Mustonen V

    Genetics 2011;189;3;989-1000

  • Germline fitness-based scoring of cancer mutations.

    Fischer A, Greenman C and Mustonen V

    Genetics 2011;188;2;383-93

  • Fitness flux and ubiquity of adaptive evolution.

    Mustonen V and Lässig M

    Proceedings of the National Academy of Sciences of the United States of America 2010;107;9;4248-53

  • From fitness landscapes to seascapes: non-equilibrium dynamics of selection and adaptation.

    Mustonen V and Lässig M

    Trends in genetics : TIG 2009;25;3;111-9

  • Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites.

    Mustonen V, Kinney J, Callan CG and Lässig M

    Proceedings of the National Academy of Sciences of the United States of America 2008;105;34;12376-81

  • Molecular evolution under fitness fluctuations.

    Mustonen V and Lässig M

    Physical review letters 2008;100;10;108101

  • Adaptations to fluctuating selection in Drosophila.

    Mustonen V and Lässig M

    Proceedings of the National Academy of Sciences of the United States of America 2007;104;7;2277-82

  • Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies.

    Mustonen V and Lässig M

    Proceedings of the National Academy of Sciences of the United States of America 2005;102;44;15936-41

Team

Team members

Chris Illingworth

I studied mathematics at St. John's College, Cambridge and subsequently completed a PhD at the University of Essex on the topic of flexibility in protein-ligand binding, encompassing issues of protein sequence, protein structure, and both quantum and classical molecular models. I subsequently moved to the University of Oxford, where I applied computational modelling to study electrical polarization in ion channels and to binding in the HIF-1α-pVHL complex, moving from there into a short-term post lecturing in physical chemistry and bioinformatics at the University of Essex. I moved to the Sanger Institute in June 2010.

Research

Improvements in genome sequencing have led to the availability of data describing in detail the evolution of a biological system over a period of time. Such data has the potential to give insight into processes such as the development of drug resistance in bacteria, the adaptation of viruses to combat the human immune system, and the changes which make healthy cells become cancerous. I am working on the development of statistical models with which to best understand these processes, so as to combat the threat caused by cancer and infectious disease.

References

  • Components of selection in the evolution of the influenza virus: linkage effects beat inherent selection.

    Illingworth CJ and Mustonen V

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom. ci3@sanger.ac.uk

    The influenza virus is an important human pathogen, with a rapid rate of evolution in the human population. The rate of homologous recombination within genes of influenza is essentially zero. As such, where two alleles within the same gene are in linkage disequilibrium, interference between alleles will occur, whereby selection acting upon one allele has an influence upon the frequency of the other. We here measured the relative importance of selection and interference effects upon the evolution of influenza. We considered time-resolved allele frequency data from the global evolutionary history of the haemagglutinin gene of human influenza A/H3N2, conducting an in-depth analysis of sequences collected since 1996. Using a model that accounts for selection-caused interference between alleles in linkage disequilibrium, we estimated the inherent selective benefit of individual polymorphisms in the viral population. These inherent selection coefficients were in turn used to calculate the total selective effect of interference acting upon each polymorphism, considering the effect of the initial background upon which a mutation arose, and the subsequent effect of interference from other alleles that were under selection. Viewing events in retrospect, we estimated the influence of each of these components in determining whether a mutant allele eventually fixed or died in the global viral population. Our inherent selection coefficients, when combined across different regions of the protein, were consistent with previous measurements of dN/dS for the same system. Alleles going on to fix in the global population tended to be under more positive selection, to arise on more beneficial backgrounds, and to avoid strong negative interference from other alleles under selection. However, on average, the fate of a polymorphism was determined more by the combined influence of interference effects than by its inherent selection coefficient.

    Funded by: Wellcome Trust: 098051

    PLoS pathogens 2012;8;12;e1003091

  • Quantifying selection acting on a complex trait using allele frequency time series data.

    Illingworth CJ, Parts L, Schiffels S, Liti G and Mustonen V

    Wellcome Trust Sanger Institute, Hinxton, Cambridge, United Kingdom.

    When selection is acting on a large genetically diverse population, beneficial alleles increase in frequency. This fact can be used to map quantitative trait loci by sequencing the pooled DNA from the population at consecutive time points and observing allele frequency changes. Here, we present a population genetic method to analyze time series data of allele frequencies from such an experiment. Beginning with a range of proposed evolutionary scenarios, the method measures the consistency of each with the observed frequency changes. Evolutionary theory is utilized to formulate equations of motion for the allele frequencies, following which likelihoods for having observed the sequencing data under each scenario are derived. Comparison of these likelihoods gives an insight into the prevailing dynamics of the system under study. We illustrate the method by quantifying selective effects from an experiment, in which two phenotypically different yeast strains were first crossed and then propagated under heat stress (Parts L, Cubillos FA, Warringer J, et al. [14 co-authors]. 2011. Revealing the genetic structure of a trait by sequencing a population under selection. Genome Res). From these data, we discover that about 6% of polymorphic sites evolve nonneutrally under heat stress conditions, either because of their linkage to beneficial (driver) alleles or because they are drivers themselves. We further identify 44 genomic regions containing one or more candidate driver alleles, quantify their apparent selective advantage, obtain estimates of recombination rates within the regions, and show that the dynamics of the drivers display a strong signature of selection going beyond additive models. Our approach is applicable to study adaptation in a range of systems under different evolutionary pressures.

    Funded by: Wellcome Trust: 098051, WT077192/Z/05/Z

    Molecular biology and evolution 2012;29;4;1187-97

  • A method to infer positive selection from marker dynamics in an asexual population.

    Illingworth CJ and Mustonen V

    Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK.

    Motivation: The observation of positive selection acting on a mutant indicates that the corresponding mutation has some form of functional relevance. Determining the fitness effects of mutations thus has relevance to many interesting biological questions. One means of identifying beneficial mutations in an asexual population is to observe changes in the frequency of marked subsets of the population. We here describe a method to estimate the establishment times and fitnesses of beneficial mutations from neutral marker frequency data.

    Results: The method accurately reproduces complex marker frequency trajectories. In simulations for which positive selection is close to 5% per generation, we obtain correlations upwards of 0.91 between correct and inferred haplotype establishment times. Where mutation selection coefficients are exponentially distributed, the inferred distribution of haplotype fitnesses is close to being correct. Applied to data from a bacterial evolution experiment, our method reproduces an observed correlation between evolvability and initial fitness defect.

    Funded by: Wellcome Trust: 098051

    Bioinformatics (Oxford, England) 2012;28;6;831-7

  • Distinguishing driver and passenger mutations in an evolutionary history categorized by interference.

    Illingworth CJ and Mustonen V

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom.

    In many biological scenarios, from the development of drug resistance in pathogens to the progression of healthy cells toward cancer, quantifying the selection acting on observed mutations is a central question. One difficulty in answering this question is the complexity of the background upon which mutations can arise, with multiple potential interactions between genetic loci. We here present a method for discerning selection from a population history that accounts for interference between mutations. Given sequences sampled from multiple time points in the history of a population, we infer selection at each locus by maximizing a likelihood function derived from a multilocus evolution model. We apply the method to the question of distinguishing between loci where new mutations are under positive selection (drivers) and loci that emit neutral mutations (passengers) in a Wright-Fisher model of evolution. Relative to an otherwise equivalent method in which the genetic background of mutations was ignored, our method inferred selection coefficients more accurately for both driver mutations evolving under clonal interference and passenger mutations reaching fixation in the population through genetic drift or hitchhiking. In a population history recorded by 750 sets of sequences of 100 individuals taken at intervals of 100 generations, a set of 50 loci were divided into drivers and passengers with a mean accuracy of >0.95 across a range of numbers of driver loci. The potential application of our model, either in full or in part, to a range of biological systems, is discussed.

    Funded by: Wellcome Trust: 091747

    Genetics 2011;189;3;989-1000