Dr Richard Durbin
Richard is Joint Head of Human Genetics at the Wellcome Trust Sanger Institute and leader of the Genome Informatics group.
Richard has worked on many areas of biological sequence analysis, and currently focuses on studying human genetic variation by genome-wide resequencing using new sequencing technologies.
Apart from human genome resequencing, projects that Richard is connected to include the SGRP yeast sequence variation and population genomics project, the TreeFam database of animal gene families, the Ensembl resource for vertebrate genome annotation, the WormBase model organism database for C. elegans, the MitoCheck study of mitosis regulation in human cells, the Pfam database of protein domain families, and the ACEDB genome database.
Richard is currently supervising four postdocs and a research student. He welcomes applications from prospective research students and postdocs, particularly in the area of population genome sequence variation analysis. During the last few years Richard's group have been using evolutionary probabilistic methods based on phylogenetic trees, and two new projects, described below, have grown out of this work.
TreeFam
First, alongside continued method development, Richard's team have initiated a new comprehensive data resource, TreeFam, much as the Pfam project grew out of this group in the past. This project started in 2004 in collaboration with the Beijing Genomics Institute (BGI). It is developing a high-quality comprehensive resource that shows how genes in animal gene families are related in an evolutionary tree, and hence assigns orthology and paralogy relationships between members of the families. The approach taken is analogous to that used by Pfam: automated methods are used to develop candidate families, and these families are then progressively curated, at which point names and basic references are assigned and any clear errors are fixed. Once a family is curated, new sequences can be assigned to it during regular database rebuilds, allowing the classification to be maintained as more genomes are finished. An initial paper on TreeFam was published in the NAR database issue in January 2006. TreeFam now contains 289,083 genes from 25 species in 1,203 curated TreeFam-A families (39,000 genes) and 15,002 automatically generated TreeFam-B families.
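As an illustration of how a gene tree implies orthology and paralogy, the following is a minimal Python sketch, not TreeFam's own pipeline. It assumes a toy tree whose internal nodes are already labelled as speciation ("S") or duplication ("D") events, and applies the standard definition: two genes are orthologs if their last common ancestor is a speciation, and paralogs if it is a duplication. The tree, the gene names and the function names are all hypothetical.

```python
# Toy gene tree (hypothetical data): internal nodes are (event, left, right),
# where event is "S" for speciation or "D" for duplication; leaves are strings.
TREE = ("D",
        ("S", "human:A1", "mouse:A1"),
        ("S", "human:A2", "mouse:A2"))


def leaves(node):
    """Return all leaf names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each pair of leaves by the event at their last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    relation = "ortholog" if event == "S" else "paralog"
    # Leaves on opposite sides of this node have this node as their
    # last common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = relation
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


pairs = classify_pairs(TREE)
for pair, relation in sorted(pairs.items(), key=lambda kv: sorted(kv[0])):
    print(" / ".join(sorted(pair)), "->", relation)
```

In this toy family the human and mouse copies of A1 are orthologs, while any A1 gene and any A2 gene are paralogs, because they are separated by the duplication at the root.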
New methods to handle genetic variation data
Second, Richard's team have developed a new way to analyse genetic variation data within species, based on heuristic reconstructions of Ancestral Recombination Graphs (ARGs; software is available). An ARG describes the tree that relates individuals at each position in the genome, analogous to a phylogenetic tree, and how these trees vary as one moves along the chromosome because of ancestral recombination events. Although it is well established that, in principle, knowing the ARG relating a set of individuals would allow optimal analysis of, for example, genetic disease association, inferring the ARG from genotype data is underdetermined, and estimation or sampling using full likelihood or Bayesian methods is intractable. Rather than work with a simplified model, the team have developed a computationally efficient way to reconstruct plausible ARGs from large-scale data sets, and have shown using both simulated and real data how this can help association fine mapping. They have also shown, in the yeast resequencing project, how ARGs can be used to integrate low-coverage sequence data from many strains (S. cerevisiae and S. paradoxus) to infer full sequences for each strain with error estimates, and to support population genetic analyses of sequence variation. The team are interested in extending this to human and pathogen data.
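The idea that an ARG yields a different tree at different positions along the chromosome can be sketched in a few lines of Python. This is only an illustrative data structure under stated assumptions, not the team's software or inference method: the samples, intervals, node names and functions below are hypothetical, and the ARG is simplified to a list of local trees, each stored as a child-to-parent map over a half-open interval.

```python
from bisect import bisect_right

# Hypothetical toy data: three samples (a, b, c) and two local trees,
# separated by one ancestral recombination breakpoint at position 500.
# Each entry is (start, end, child -> parent map) for the interval [start, end).
LOCAL_TREES = [
    (0, 500, {"a": "n1", "b": "n1", "n1": "root", "c": "root"}),
    (500, 1000, {"b": "n2", "c": "n2", "n2": "root", "a": "root"}),
]
STARTS = [start for start, _, _ in LOCAL_TREES]


def tree_at(pos):
    """Return the local tree covering a chromosome position."""
    i = bisect_right(STARTS, pos) - 1
    if i < 0:
        raise ValueError(f"position {pos} is not covered")
    start, end, parents = LOCAL_TREES[i]
    if not (start <= pos < end):
        raise ValueError(f"position {pos} is not covered")
    return parents


def ancestors(parents, node):
    """Return the path from a node up to the root, inclusive."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path


def mrca(pos, x, y):
    """Most recent common ancestor of samples x and y at a position."""
    parents = tree_at(pos)
    seen = set(ancestors(parents, x))
    for node in ancestors(parents, y):
        if node in seen:
            return node
    raise ValueError("samples do not share an ancestor")


print(mrca(100, "a", "b"))  # left of the breakpoint
print(mrca(700, "a", "b"))  # right of the breakpoint
```

In this toy example, samples a and b share the recent ancestor n1 to the left of the breakpoint, but only meet at the root to the right of it. Changes in local relatedness of this kind are the signal that ARG-based association mapping draws on.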
Suggested reading
"Biological Sequence Analysis", Sean Eddy, Anders Krogh and Graeme Mitchison (Cambridge: Cambridge University Press, 1998)
Selected Publications
-
The Sequence Alignment/Map format and SAMtools.
Bioinformatics (Oxford, England) 2009;25;16;2078-9
PUBMED: 19505943; PMC: 2723002; DOI: 10.1093/bioinformatics/btp352
-
The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes.
Genome research 2009;19;7;1316-23
PUBMED: 19498102; PMC: 2704439; DOI: 10.1101/gr.080531.108
-
Population genomics of domestic and wild yeasts.
Nature 2009;458;7236;337-41
PUBMED: 19212322; PMC: 2659681; DOI: 10.1038/nature07743
-
Inferring selection on amino acid preference in protein domains.
Molecular biology and evolution 2009;26;3;527-36
PUBMED: 19095755; PMC: 2716081; DOI: 10.1093/molbev/msn286
-
Accurate whole human genome sequencing using reversible terminator chemistry.
Nature 2008;456;7218;53-9
PUBMED: 18987734; PMC: 2581791; DOI: 10.1038/nature07517
-
Mapping short DNA sequencing reads and calling variants using mapping quality scores.
Genome research 2008;18;11;1851-8
PUBMED: 18714091; PMC: 2577856; DOI: 10.1101/gr.078212.108
-
Mapping trait loci by use of inferred ancestral recombination graphs.
American journal of human genetics 2006;79;5;910-22
PUBMED: 17033967; PMC: 1698562; DOI: 10.1086/508901
-
TreeFam: a curated database of phylogenetic trees of animal gene families.
Nucleic acids research 2006;34;Database issue;D572-80
PUBMED: 16381935; PMC: 1347480; DOI: 10.1093/nar/gkj118

