Saccharomyces Genome Resequencing
The goal of the project was to advance understanding of genomic variation and evolution by analysing sequences from multiple strains of the two Saccharomyces species, S cerevisiae and S paradoxus.
We have completed ABI sequencing of haploids of 37 cerevisiae strains and 27 paradoxus strains to a depth of between 1x and 3x, yielding a total of 1.42 million reads (1,292 megabases); and Illumina GA (Solexa) sequencing of four of the 37 cerevisiae strains and an additional 10 paradoxus strains.
The sequence data has been aligned to the respective reference genome sequences using SsahaSNP (for ABI) and Maq (for Illumina), followed by the application of heuristics to select the most plausible alignments. The SNPs (single-nucleotide polymorphisms) implied by these alignments have been extracted. We have also developed methods, based on ancestral recombination graphs, for imputing nucleotide values at positions in the genome where some strains have no evidence or only poor-quality evidence, while other, closely related strains are better represented.
- Browse the data. (This is no longer actively maintained and so may at times be temporarily unavailable.)
- Download the reads, alignments and provisional assemblies of each strain. This is what you need if you are interested in carrying out genome-wide analyses. You will also need:
- Documentation on the data.
- The reads from the above download are also available from the NCBI Trace Archive, and can be accessed by following the instructions below.
- Purchase the strains.
- SsahaSNP alignment software.
- Maq alignment software.
A brief illustration of how to view variation around a particular gene in one or a few strains is available here.
To download the SGRP reads from the NCBI Trace Archive, enter a query such as
CENTER_NAME = "SC" and STRAIN = "W303"
(substituting the strain name of your choice for W303) and click “Submit”.
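If you need queries for several strains, the query text above can be generated rather than typed by hand. The following is a minimal sketch: the function name `trace_archive_query` is an illustrative assumption, and the code only builds the query string shown above; it does not contact the Trace Archive.

```python
def trace_archive_query(strain: str, center: str = "SC") -> str:
    """Build the Trace Archive query text for one strain.

    Hypothetical helper: it only formats the query string shown in the
    text; the result still has to be pasted into the query form.
    """
    return f'CENTER_NAME = "{center}" and STRAIN = "{strain}"'


if __name__ == "__main__":
    # W303 is the strain used in the example above; add other strain names as needed.
    for strain in ["W303"]:
        print(trace_archive_query(strain))
```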
However, be aware that because of some plate-handling errors, the names of some of the reads in the Trace Archive need to be corrected. These corrections have already been applied in the SGRP browser and the FTP download data, which you should use unless you specifically need NCBI format. Also, quality clipping has been applied to the FTP download data, but not to the versions in the Trace Archive.
The full list of corrections is available on the FTP site. In that file, a single name on a line by itself means that the read with that name in the Trace Archive should be ignored, while two names on a line mean that the read with the first name should be renamed to the second name, so that the p1k and q1k reads are correctly paired. The strains in question, and the number of reads affected, are as follows.
|S cerevisiae||S paradoxus||–|
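The two rules for the corrections file (one name means ignore the read, two names mean rename the first to the second) can be applied to Trace Archive read names along the following lines. This is a sketch under stated assumptions, not the project's own tooling: it assumes the names on each line are separated by whitespace, and the function names and the read names in the example are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple


def parse_corrections(lines: Iterable[str]) -> Tuple[Set[str], Dict[str, str]]:
    """Split corrections-file lines into reads to ignore and reads to rename.

    Assumes whitespace-separated names: one name on a line means "ignore this read",
    two names mean "rename the first to the second".
    """
    ignore: Set[str] = set()
    rename: Dict[str, str] = {}
    for line in lines:
        names = line.split()
        if not names:
            continue  # skip blank lines
        if len(names) == 1:
            ignore.add(names[0])
        elif len(names) == 2:
            rename[names[0]] = names[1]
        else:
            raise ValueError(f"unexpected corrections line: {line!r}")
    return ignore, rename


def corrected_names(read_names: Iterable[str],
                    ignore: Set[str],
                    rename: Dict[str, str]) -> Iterator[str]:
    """Yield corrected read names, dropping ignored reads."""
    for name in read_names:
        if name in ignore:
            continue
        yield rename.get(name, name)


if __name__ == "__main__":
    # Hypothetical read names, for illustration only.
    corrections = ["bad_read.p1k", "old_name.q1k new_name.q1k"]
    ignore, rename = parse_corrections(corrections)
    reads = ["bad_read.p1k", "old_name.q1k", "untouched.p1k"]
    print(list(corrected_names(reads, ignore, rename)))
```

Renaming is done by lookup against the original names only, so a corrected name is never corrected a second time.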
Data Release Policy
The release of pre-publication data from large resource-generating scientific projects was the subject of a meeting held in January 2003, the Fort Lauderdale meeting, sponsored by the Wellcome Trust, one of the Project funders. The report from that meeting can be viewed here.
The recommendations of the Fort Lauderdale meeting address the roles and responsibilities of data producers, data users, and funders of “community resource projects”, with the aim of establishing and maintaining an appropriate balance between the interests of data users in rapid access to data and the needs of data producers to receive recognition for their work. The conclusion of the attendees at the meeting was that responsible use of the data is necessary to ensure that first-rate data producers will continue to participate in such projects and produce and quickly release valuable large-scale data sets. “Responsible use” was defined as allowing the data producers to have the opportunity to publish the initial global analyses of the data, as articulated at the outset of the project. Doing so also will ensure that the data generated are fully described.
This sequencing centre plans on publishing the completed and annotated sequences in a peer-reviewed journal as soon as possible. Permission of the principal investigator should be obtained before publishing analyses of the sequence/open reading frames/genes on a chromosome or genome scale. See our data sharing policy.