Yeast Data

SGRP, the Saccharomyces Genome Resequencing Project

This work was a collaboration between the Sanger Institute and Professor Ed Louis' group at the Institute of Genetics, University of Nottingham.

Archive Page

This page is maintained as a historical record and is no longer being updated.

Saccharomyces Genome Resequencing

The goal of the project was to advance understanding of genomic variation and evolution by analysing sequences from multiple strains of the two Saccharomyces species, S. cerevisiae and S. paradoxus.

We have completed ABI sequencing of haploids of 37 cerevisiae strains and 27 paradoxus strains to a depth of between 1x and 3x, yielding a total of 1.42 million reads (1,292 megabases); and Illumina GA (Solexa) sequencing of four of the 37 cerevisiae strains and an additional 10 paradoxus strains.

The sequence data has been aligned to the respective reference genome sequences using SsahaSNP (for ABI) and Maq (for Illumina), followed by the application of heuristics to select the most plausible alignments. The SNPs (single-nucleotide polymorphisms) implied by these alignments have been extracted. We have also developed methods, based on ancestral recombination graphs, for imputing nucleotide values at positions in the genome where some strains have no evidence or only poor-quality evidence while other, closely related strains are better represented.


  • Download the reads, alignments and provisional assemblies of each strain. This is what you need if you are interested in carrying out genome-wide analyses. You will also need:
    • The reads from the above download are also available from the NCBI Trace Archive, and can be accessed by following the instructions below.
  • Purchase the strains
  • SsahaSNP alignment software.
  • Maq alignment software.

There are also BLAST servers for all S. cerevisiae strains and all S. paradoxus strains, and an alternative BLAST server at the University of Toronto.


To download the SGRP reads from the NCBI Trace Archive, enter a query such as

CENTER_NAME = "SC" and STRAIN = "W303"

(substituting the strain name of your choice for W303) and click “Submit”.
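When reads are needed for several strains, the same query can be generated for each strain name in turn. The following is a minimal Python sketch: the function name is hypothetical, only the query syntax is taken from the example above, and it assumes that the Trace Archive uses the strain names exactly as they are written on this page.

```python
def trace_archive_query(strain, center="SC"):
    """Build a Trace Archive query string in the form shown above."""
    return f'CENTER_NAME = "{center}" and STRAIN = "{strain}"'


# Print one query per strain; paste each into the Trace Archive search box.
for strain in ["W303", "SK1", "Y55"]:
    print(trace_archive_query(strain))
```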

Be aware that, because of some plate-handling errors, the names of some of the reads in the Trace Archive need to be corrected. These corrections have already been applied in the SGRP browser and the FTP download data, which you should use unless you specifically need NCBI format. Also, quality clipping has been applied to the FTP download data, but not to the versions in the Trace Archive.

The full list of corrections is available on the FTP site. In that file, a single name on a line by itself means that the read with that name in the Trace Archive should be ignored, while two names on a line mean that the read with the first name should be renamed to the second name so that the p1k and q1k reads are correctly paired. The strains in question, and the number of reads affected, are as follows.

S. cerevisiae    Reads    S. paradoxus     Reads
BC187              619    A4
DBVPG1373           85    CBS5829            651
DBVPG6044         1161    DBVPG4650          180
DBVPG6765         1128    DBVPG6304         1981
L_1374              96    N_17                78
SK1              19594    N_43               530
Y55                647    N_44              2273
YGPM              1343    N_45               720
YPS128           16347    Q59_1              871
YPS606            1151    T21_4              389
273614N            194    UFRJ50816          354
NCYC361            188    YPS138             201
UWOPS03_461_4     2822    UWOPS91_917_1      471
YJM975              24
YJM978            1114
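
The corrections-file format described above (one name on a line to ignore a read, two names to rename it) could be applied to a list of Trace Archive read names along the following lines. This is a sketch only: it assumes that names are separated by whitespace and that the file has no header or comment lines, neither of which is stated here. The function names and the read names in the usage example are hypothetical.

```python
def parse_corrections(lines):
    """Split corrections-file lines into reads to ignore and reads to rename."""
    ignore = set()
    rename = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 1:
            ignore.add(fields[0])          # single name: ignore this read
        elif len(fields) == 2:
            rename[fields[0]] = fields[1]  # two names: first becomes second
        else:
            raise ValueError(f"unexpected corrections line: {line!r}")
    return ignore, rename


def apply_corrections(read_names, ignore, rename):
    """Drop ignored reads and return the corrected names of the rest."""
    return [rename.get(name, name) for name in read_names if name not in ignore]


# Made-up read names, for illustration only.
ignore, rename = parse_corrections(["readA.p1k", "readB.q1k readC.q1k"])
print(apply_corrections(["readA.p1k", "readB.q1k", "readD.p1k"], ignore, rename))
# prints ['readC.q1k', 'readD.p1k']
```

Reads taken from the FTP download do not need this step, since the corrections have already been applied there.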

Data Release Policy

The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the Fort Lauderdale meeting, sponsored by the Wellcome Trust, one of the Project funders. The report from that meeting can be viewed here.

The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of “community resource projects”, with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work. The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. “Responsible use” was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so will also ensure that the data generated are fully described.

Data use

This sequencing centre plans on publishing the completed and annotated sequences in a peer-reviewed journal as soon as possible. Permission of the principal investigator should be obtained before publishing analyses of the sequence/open reading frames/genes on a chromosome or genome scale. See our data sharing policy.