The project employs a whole chromosome shotgun (WCS) strategy with limited skims of mapped P1 and BAC clones (a "tiling set" of clones), aiming for 8-10x coverage of the chromosome.
Chromosomal bands are isolated and gel-purified from pulsed-field gels; the DNA is then sheared, size-selected (1.0-2.0 kb), and cloned into the pUC18 vector. Forward and reverse reads are produced from clones selected at random from the chromosome-enriched library (20,000 in the case of chromosome I), followed by initial automatic assembly of the sequence reads into contiguous sequences (or "contigs"). Contig sequences represent secondary sequence data, in that they are the condensation of a number of shotgun reads. Contigs reflect the finished sequence more reliably than individual reads because the depth of coverage of the assembled shotgun reads ensures that the majority of ambiguities are identified and at least partially resolved. Gaps and low-quality regions of sequence are resolved using primer walking, PCR and re-sequencing of clones. To reduce the complexity of the finishing process, positioning of reads and gap closure are aided by "skimming" (about 2x coverage) of mapped P1 and/or BAC clones. As these may derive from either homologue, the database contains mostly sequence derived from a particular chromosome, but there is always some contamination of the gel-eluted DNA, and this generally derives from all other chromosomes (not only those closest in the gel).
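As a rough illustration of what "8-10x coverage" means, the sketch below computes average depth of coverage as (number of reads x mean read length) / target size, together with the standard Lander-Waterman estimate of the fraction of the target covered by at least one read. The mean read length (500 bp) and chromosome size (1 Mb) are hypothetical values chosen only for the arithmetic, and treating the figure of 20,000 as a read count is also an assumption; none of these are stated project parameters.

```python
import math


def expected_coverage(num_reads, mean_read_length_bp, target_size_bp):
    """Average depth of coverage: c = N * L / G."""
    return num_reads * mean_read_length_bp / target_size_bp


def expected_fraction_covered(coverage):
    """Lander-Waterman estimate of the fraction of the target
    covered by at least one read: 1 - e^(-c)."""
    return 1.0 - math.exp(-coverage)


# Hypothetical inputs, for illustration only.
num_reads = 20_000            # assumed to be a read count
mean_read_length_bp = 500     # assumed mean usable read length
target_size_bp = 1_000_000    # assumed chromosome size

depth = expected_coverage(num_reads, mean_read_length_bp, target_size_bp)
print(f"average depth: {depth:.1f}x")
print(f"expected fraction covered: {expected_fraction_covered(depth):.6f}")
```

The same estimate shows why a 2x skim of a mapped clone is useful for positioning reads but not sufficient on its own: at 2x, roughly 1 - e^(-2), or about 86%, of the clone is expected to be covered.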
The various stages of the project are:
- Library Preparation - Shotgun library in E. coli
- Shotgun Started - Preliminary sequencing in progress
- Shotgun Complete - Sequence in contigs after automatic assembly, awaiting gap closure
- Finishing - Ordering of contigs and gap closure
- Finished - Gaps closed; sequence contiguous on both strands
- Submitted - Sequence analysed and submitted/retrievable from EMBL
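For anyone tracking chromosome status programmatically, the stages above form a strictly ordered progression, which can be sketched as an ordered enumeration. The names `ProjectStage` and `next_stage` are hypothetical and are not part of any Sanger Institute tool.

```python
from enum import IntEnum


class ProjectStage(IntEnum):
    """Project stages in the order listed above (hypothetical names)."""
    LIBRARY_PREPARATION = 1   # shotgun library in E. coli
    SHOTGUN_STARTED = 2       # preliminary sequencing in progress
    SHOTGUN_COMPLETE = 3      # contigs assembled, awaiting gap closure
    FINISHING = 4             # ordering of contigs and gap closure
    FINISHED = 5              # gaps closed, contiguous on both strands
    SUBMITTED = 6             # analysed and retrievable from EMBL


def next_stage(stage):
    """Return the stage that follows `stage`, or None after the last one."""
    if stage is ProjectStage.SUBMITTED:
        return None
    return ProjectStage(stage + 1)


print(next_stage(ProjectStage.SHOTGUN_COMPLETE).name)
```

Using `IntEnum` keeps the stages comparable, so a check such as `stage >= ProjectStage.SHOTGUN_COMPLETE` can be used to decide whether contig data should already exist for a chromosome.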
Nomenclature
An essential part of any genome project is the tracking and unique identification of clones and primary sequence reads. The Sanger Institute uses a standard nomenclature system for read names and library identification.
Sequence Data
The data can be accessed in three formats, which reflect the progression from primary shotgun reads, through contiguation into finished chromosomes, and finally to the predicted gene structures. You can find out more about the datasets available for sequence similarity searching and how to make use of them effectively here.
Data can be:
- downloaded from the FTP site as reads, as contigs, or in EMBL format for submitted sequences
- accessed through similarity searches (BLAST, omniBLAST or protein motif searches)
- viewed through GeneDB
If you would like additional help (such as FASTA nucleotide/protein databases), please contact either Matt Berriman (mb4@sanger.ac.uk) or Christiane Hertz-Fowler (chf@sanger.ac.uk).



