Population genomics of molecular phenotypes

High-throughput sequencing is opening up a new chapter in the study of molecular evolution and genetics by allowing deep sequencing of whole populations of organisms and cells.

We are now in a unique position to study the nature of the evolutionary forces responsible for the amazing diversity of life. We can ask: What is the role of genetics in a person's susceptibility to cancer or another potentially fatal disease? Are the observed differences between individuals mostly a result of neutral evolution, or do they confer a fitness advantage? These questions are not only interesting for evolutionary biologists but can also make a fundamental contribution to biomedical applications. The promise of personalised medicine will critically depend on finding and understanding molecular disease phenotypes.

Our group will contribute to this effort by developing population genetic analysis methods which use experimental measurements of sequence properties in conjunction with deep sequencing data to elucidate functional consequences of genomic variation.

Background

Genomic sequences are the result of evolutionary processes; hence there is a close relationship between statistical methods for genome sequence analysis and quantitative population genetics. Evolution is a complex stochastic process whose outcome is decided at the level of populations consisting of individual organisms or cells. The major evolutionary forces, namely mutation and recombination (which provide variation), genetic drift (the noise in reproduction) and selection (differential reproductive success of individuals), all contribute to observed evolutionary change, so their individual roles are difficult to disentangle. However, it is precisely this decomposition that will be critical when we attempt to understand the functional consequences of mutations, that is, the link between genotype and phenotype, from observation of naturally occurring variation. For instance, somatic mutations in tumors can be classified into driver and passenger mutations. Mutations in the former class are causal, for example by increasing the net growth rate of the cell, while those in the latter class have no effect on the cancer phenotype of the cell. Classifying somatic mutations correctly is therefore important for gaining new insight into the processes involved in cancer (The Cancer Genome, Stratton et al., Nature, 2009).

Exploiting genome-wide sequence data for functional studies has turned out to be challenging, as it is often not easy to discern functionally relevant genomic variation at the molecular level from changes without phenotypic effects. Assigning function to regulatory sequence variation has been harder still and its quantitative understanding remains work in progress.

We aim to improve this situation by combining experimental measurements of sequence properties, for example the binding affinities between transcription factors and DNA, with deep sequencing data, and by interpreting such data sets using population genetic theory.

Research

Our Aims

The main objective of the Population genomics of molecular phenotypes group is to increase our understanding of the functional and evolutionary consequences of naturally occurring variation.

Our Approach

Sequences

New sequencing technologies enable deep sequencing of multiple populations and are thus making a detailed observation of naturally occurring genetic variation possible. However, the reads generated using so-called second generation sequencing platforms need a substantial informatics effort, for example assembly and imputation, before they can be used. Multiple groups at Sanger are contributing to this effort so that other investigators (including our group) can focus on downstream analyses, for example evolutionary interpretation of the sequence variation.

Molecular phenotypes

Increasing numbers of high-throughput technologies are available to measure molecular phenotypes, for example protein binding microarrays (Badis et al., 2009), mechanically induced trapping of molecular interactions (Maerkl and Quake, 2007) and ChIP-Seq technology. Figure 2 depicts an example molecular phenotype: the sequence-specific binding energy E between a transcription factor molecule and DNA, which can be either measured experimentally or, in some cases, inferred statistically. The resulting mapping from a genotype to a molecular phenotype, i.e. a → E(a), exemplifies our approach to analyzing genomic variation. Variation found in a binding site is relevant only if it changes the binding energy E and hence the binding probability of the transcription factor.
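
The genotype-to-phenotype map a → E(a) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each position of the site contributes additively to the binding energy and that the binding probability follows a two-state (Fermi-function) form. The matrix values, the function names and the chemical potential are hypothetical and are not taken from the group's work.

```python
import math

# Hypothetical position-specific energy contributions (in units of k_B*T)
# for a 4-bp binding site. Real values would come from experiment or
# statistical inference; these numbers are made up for illustration.
ENERGY_MATRIX = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.5},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 1.5},
    {"A": 1.5, "C": 1.0, "G": 2.0, "T": 0.0},
]


def binding_energy(site, matrix=ENERGY_MATRIX):
    """Map a genotype a to E(a), assuming additive position contributions."""
    if len(site) != len(matrix):
        raise ValueError("site length must match the energy matrix")
    return sum(column[base] for column, base in zip(matrix, site))


def binding_probability(energy, chemical_potential=2.0):
    """Two-state occupancy 1 / (1 + exp(E - mu)); mu is an assumed value."""
    return 1.0 / (1.0 + math.exp(energy - chemical_potential))


if __name__ == "__main__":
    # Compare the lowest-energy site with a single-mutation variant.
    for site in ("ACGT", "TCGT"):
        energy = binding_energy(site)
        print(site, energy, binding_probability(energy))
```

In this toy setting, two sites that differ in sequence but have the same E(a) are indistinguishable at the phenotype level, which is the sense in which variation matters only through its effect on E.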

Population and evolutionary genomics

Individual organisms within and across species share ancestry - a fact which can readily be observed from correlations between their genomic sequences. Elucidating the tempo and mode of the changes in these correlations is central to the study of molecular evolution.

We try to interpret the observed variation using population genetic models whose (in)compatibility with the data is judged using statistical methods, for example Bayesian inference. A key step in such an inference is to identify a set of observables that capture a substantial part of the biologically relevant variation in a given sequence region. It is precisely here that molecular phenotype maps become indispensable. In short, they provide a systematic way of reducing the dimensionality of the problem, which is critical to avoid undersampling (genotype space is vast even for short sequences such as binding sites).

Once the sequences are projected onto the corresponding molecular phenotypes, we can perform inferences based on evolutionary models (see Figure 3). Figure 3a shows examples of outgroup-directed allele probability distributions Q(x), a classic observable in population genetics that exploits polymorphism data in an ingroup species A and polarizes the direction of mutations using an outgroup species B. Here x = k/m is the frequency of the ingroup allele that differs from the outgroup allele: if x = 0 both species share the allele, if 0 < x < 1 the locus is polymorphic, and if x = 1 a substitution event between the species has happened. The model distributions shown are for Kimura's two-allele model with mutation, selection and genetic drift. Figure 3b shows a time series from an evolutionary simulation of the frequency x(t) at a genomic locus. Such time-series data are increasingly becoming available at single-nucleotide resolution through resequencing technologies and will help to infer, for example, whether a newly arisen mutation evolves under positive selection.
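
The two observables described above can be sketched in a few lines of Python. The first part computes x = k/m for a locus, reading k as the number of the m sampled ingroup alleles that differ from the outgroup allele (our interpretation of the notation), and classifies the locus. The second part simulates a frequency trajectory x(t) under a haploid Wright-Fisher model with two alleles, symmetric mutation, selection and drift. This is a common textbook stand-in and is not necessarily the simulation used for Figure 3b. All function names and parameter values are hypothetical.

```python
import random


def derived_allele_frequency(ingroup_alleles, outgroup_allele):
    """Return x = k/m, the fraction of ingroup alleles differing from the outgroup."""
    m = len(ingroup_alleles)
    if m == 0:
        raise ValueError("need at least one ingroup allele")
    k = sum(1 for allele in ingroup_alleles if allele != outgroup_allele)
    return k / m


def classify_locus(x):
    """Classify a locus by its outgroup-directed frequency x."""
    if x == 0:
        return "shared"
    if x == 1:
        return "substitution"
    return "polymorphic"


def wright_fisher_trajectory(n, s, x0, generations, mu=0.0, seed=0):
    """Simulate x(t) in a haploid population of size n (assumed model).

    s is the selection coefficient of the tracked allele (s > -1) and
    mu is a symmetric per-generation mutation rate between the two alleles.
    """
    rng = random.Random(seed)
    x = x0
    trajectory = [x]
    for _ in range(generations):
        # Selection shifts the expected frequency of the tracked allele.
        p = x * (1.0 + s) / (1.0 + s * x)
        # Symmetric mutation between the two alleles.
        p = p * (1.0 - mu) + (1.0 - p) * mu
        # Genetic drift: binomial sampling of n offspring.
        k = sum(1 for _ in range(n) if rng.random() < p)
        x = k / n
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    x = derived_allele_frequency(list("AAGG"), "A")
    print(x, classify_locus(x))
    print(wright_fisher_trajectory(n=100, s=0.05, x0=0.05, generations=50)[-1])
```

Collecting x over many loci would give an empirical counterpart of Q(x), while repeated runs of the simulation with different seeds show how strongly drift can scatter trajectories that share the same selection coefficient.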

Population graphics. [The Wellcome Trust Sanger Institute]

An extra layer of complication in interpreting sequence variation is that many different evolutionary scenarios can leave similar traces in the sequences. For instance, loci under moderate negative selection undergo an increased number of substitutions during a population bottleneck. If the bottleneck happened in the distant past, the polymorphism spectrum that we observe now would have had enough time to equilibrate again to the current population size. However, the full probability density Q(x) would show a surplus of substitution events, which is otherwise a hallmark of positive selection. Unraveling such interpretation challenges makes the field exciting.
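
The bottleneck effect can be made concrete with a short calculation. The sketch below uses Kimura's diffusion approximation for the fixation probability of a new mutation in a haploid population of size N, u = (1 - exp(-2s)) / (1 - exp(-2Ns)), and compares the substitution rate relative to the neutral rate, N·u, for a large and a small population. The choice of a haploid model, the function names and the parameter values are assumptions made for illustration only.

```python
import math


def fixation_probability(n, s):
    """Kimura's diffusion approximation for a single new mutant (haploid, size n)."""
    if abs(s) < 1e-12:
        return 1.0 / n  # neutral limit
    # expm1 keeps precision for small arguments; the ratio equals
    # (1 - exp(-2s)) / (1 - exp(-2ns)).
    return math.expm1(-2.0 * s) / math.expm1(-2.0 * n * s)


def relative_substitution_rate(n, s):
    """Substitution rate divided by the neutral rate, i.e. n * u(n, s)."""
    return n * fixation_probability(n, s)


if __name__ == "__main__":
    s = -0.001  # hypothetical weakly deleterious selection coefficient
    for n in (10_000, 100):  # hypothetical pre-bottleneck and bottleneck sizes
        print(n, relative_substitution_rate(n, s))
```

With these illustrative numbers the deleterious allele almost never fixes in the large population but fixes at close to the neutral rate in the small one, so a bottleneck alone can produce extra substitutions without any positive selection.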

Summary

We develop methods to use molecular phenotype data in conjunction with sequence variation data from population sequencing of multiple genomes. These integrated data are then interpreted using population genetic theory to elucidate the functional and evolutionary consequences of the variation.

Selected Publications

  • From fitness landscapes to seascapes: non-equilibrium dynamics of selection and adaptation.

    Mustonen V and Lässig M

    Trends in genetics : TIG 2009;25;3;111-9

  • Energy-dependent fitness: a quantitative model for the evolution of yeast transcription factor binding sites.

    Mustonen V, Kinney J, Callan CG and Lässig M

    Proceedings of the National Academy of Sciences of the United States of America 2008;105;34;12376-81

  • Molecular evolution under fitness fluctuations.

    Mustonen V and Lässig M

    Physical review letters 2008;100;10;108101

  • Adaptations to fluctuating selection in Drosophila.

    Mustonen V and Lässig M

    Proceedings of the National Academy of Sciences of the United States of America 2007;104;7;2277-82

  • Evolutionary population genetics of promoters: predicting binding sites and functional phylogenies.

    Mustonen V and Lässig M

    Proceedings of the National Academy of Sciences of the United States of America 2005;102;44;15936-41