Last updated 1st December 2010
Searching the Sanger web resources
- 1. How can I find my gene? Do I use Ensembl or Vega?
First, decide which resource to search for your gene: the finished clone sequence or the whole genome assembly. Both resources have their pros and cons.
- The clone mapping and finishing project provides highly reliable and stable sequence, but is not yet complete. At the time of writing, the finished clones cover approximately 87% of the genomic sequence.
- The assemblies are built from all finished clones plus a whole genome shotgun assembly. All the data are integrated based on the physical map with the missing bits filled in by the whole genome shotgun contigs. This results in a fairly complete genome sequence, but the sequence parts differ in reliability.
The finished clones on the current chromosomal tile path are being manually annotated and can be browsed in Vega. This results in highly accurate gene structures based on supporting evidence. The project is ongoing, and Vega is updated frequently to reflect new clones being sequenced and annotated. Detailed information about how to use Vega can be found in module 1 of our tutorial. Vega only contains finished clones on the current tile path. If you want to search ALL finished and unfinished zebrafish clones at the Sanger Institute, you can do this here.
If you want to look for a gene in the current assembly, use Ensembl. Assemblies are built and annotated about once a year to reflect additional data being available. Detailed information about how to use Ensembl can be found in module 2 of our tutorial.
- 2. How can I find my marker?
You can use the text search box in Ensembl or the PGP viewer.
- 3. How can I find a BAC/PAC clone?
You can check where the clone lives and find out about its neighbours as well as download sequence in Chromoview. More explanation of the format of the information presented there can be found here.
You can search/browse/download clones and manual annotation in the Vega database.
In Ensembl, you can search for a clone by its name (accession.version, e.g. BX649292.16). You can also identify a clone by this name on the 'Region overview' and 'Region in detail' pages. If you want to find a clone that covers a certain region in a WGS contig within the assembly (EMBL accession starting "CABZ"), switch on the 'BAC end' sequences in the 'Configure this page' panel under 'Other DNA alignments'. If you zoom out wide enough, ends belonging to the same clone will be shown as a box.
We also offer an FTP site for downloading clone sequences regardless of whether they have been manually annotated yet. For each clone you will find a FASTA file and also an EMBL file, which is updated as the clone moves through the finishing and manual annotation process.
You can use our Blast server to search these clones.
If you are interested in the progress of a particular clone through the sequencing pipeline, you can look up its status in Chromoview. You will have to enter an accession number or the internal clone name. Please check FAQ 7 and FAQ 8 for further details of the clone naming schemes.
- 4. How can I retrieve data from the whole genome shotgun assemblies?
You can search the trace repository for your sequence with Megablast or Blast, using the NCBI trace archive search.
- 5. How can I retrieve data about the integrated genome assemblies (Zv1-Zv9)?
You can find all information about the assemblies here. The assemblies are also submitted to EMBL/Genbank and can be downloaded there.
- 6. How can I retrieve data from the Ensembl database?
Please have a look at module 2 of our tutorial.
If you are looking for a certain gene or marker, try FAQ no. 1 and/or 2. If you are interested in complex queries in the Ensembl database, such as retrieving all genes with a certain domain, or downloading 1 kb upstream of the transcription start of all predicted genes, try Biomart, which is also described in module 2 of our tutorial.
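As a rough illustration of the kind of complex query mentioned above (all genes with a certain domain), the sketch below builds a BioMart-style XML query string. It does not contact any server. The helper `build_biomart_query` is hypothetical, and the dataset name `drerio_gene_ensembl`, the filter name `interpro`, the attribute name `ensembl_gene_id` and the example InterPro accession are assumptions for illustration only; check them against the names offered in the Biomart interface before use.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Return a BioMart-style XML query as a string (hypothetical helper).

    dataset    -- dataset name (assumed, e.g. "drerio_gene_ensembl")
    filters    -- dict of filter name -> value (names are assumptions)
    attributes -- list of attribute names to return (names are assumptions)
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE Query>\n' + body


# Example: all genes annotated with one InterPro domain (accession chosen for illustration).
example_query = build_biomart_query(
    "drerio_gene_ensembl",
    {"interpro": "IPR001356"},
    ["ensembl_gene_id"],
)
print(example_query)
```

The resulting text can be pasted into the XML query box of a BioMart server, or used as a template once the real filter and attribute names have been confirmed.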
If you want to download sequence data from Ensembl, you can click on the blue box representing the contig or clone itself. The 'Export' option leads you to the Export data window. Please note that this option will only give you the part of the sequence that was used in the assembly. If you want to download the sequence of a whole finished clone and its overlaps with its neighbours, then only the non-overlapping part will be returned in the exported file. For the whole sequence of a finished clone, please look up the entire entry in EMBL using its accession.
- 7. What do the names in Ensembl for clones, contigs and scaffolds mean?
All clones and whole genome shotgun contigs in the integrated assembly in Ensembl are identified by their EMBL accessions. In the case of the WGS contigs, these accessions always start with CABZ.
Clones and WGS contigs are grouped into "scaffolds". A scaffold corresponds to a single fingerprint contig of clones, which has had its gaps filled in using WGS contigs placed using sequence alignment. Some scaffolds on chromosomes consist purely of WGS contigs; these WGS contigs have been placed using marker information. Scaffolds are given names which begin with Zv9_scaffold, followed by a number. The first scaffold on chromosome 1 is Zv9_scaffold1, the second is Zv9_scaffold2, and so on up to the end of chromosome 25.
There are also scaffolds which are not placed on chromosomes. Some of these are based on fingerprint contigs of clones whose parent chromosome could not be identified. These have scaffold names which are Zv9_scaffold followed by a number, just like scaffolds on chromosomes. The remaining scaffolds not on chromosomes consist of WGS contigs which could not be placed on a chromosome using alignment, marker, or cDNA information. These have names which start with Zv9_NA, followed by an arbitrary number.
In assemblies prior to Zv9, WGS contigs were given identifiers like Zv8_scaffold36.2 (for the second WGS contig in scaffold Zv8_scaffold36). As mentioned above, as of Zv9 this system is no longer used; WGS contigs are now identified using their EMBL accession.
Whole genome shotgun assembly Zv3 used identifiers like ctg12079 to label virtual contigs. These contigs could be related to clone fingerprint contigs from the mapping project at the time of the data freeze and were therefore labelled with the appropriate fingerprint contig name. Contigs in Zv3 which couldn't be related to any fingerprint contig were given names such as NA10008.
In Zv1, the first whole genome shotgun assembly, identifiers like z06s024429 were given to virtual supercontigs.
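The naming schemes described above can be told apart mechanically. The following is a minimal sketch, assuming regular expressions generalised from the single example given for each scheme (BX649292.16, Zv9_scaffold1, Zv8_scaffold36.2, ctg12079, NA10008, z06s024429). The patterns and the function `classify_identifier` are illustrative assumptions, not an official specification of the identifiers.

```python
import re

# Patterns are assumptions generalised from the examples in the text.
# Order matters: the first matching pattern wins.
PATTERNS = [
    ("WGS contig (EMBL accession starting CABZ)", re.compile(r"^CABZ\w*(\.\d+)?$")),
    ("Zv9 scaffold", re.compile(r"^Zv9_scaffold\d+$")),
    ("Zv9 unplaced WGS scaffold", re.compile(r"^Zv9_NA\d+$")),
    ("pre-Zv9 WGS contig within a scaffold", re.compile(r"^Zv[1-8]_scaffold\d+\.\d+$")),
    ("Zv3 contig named after a fingerprint contig", re.compile(r"^ctg\d+$")),
    ("Zv3 contig without a fingerprint contig", re.compile(r"^NA\d+$")),
    ("Zv1 supercontig", re.compile(r"^z\d+s\d+$")),
    ("clone (EMBL accession.version)", re.compile(r"^[A-Z]{1,2}\d+\.\d+$")),
]


def classify_identifier(name):
    """Return a description of the naming scheme that `name` appears to follow."""
    for label, pattern in PATTERNS:
        if pattern.match(name):
            return label
    return "unknown"


for example in ["BX649292.16", "Zv9_scaffold1", "Zv8_scaffold36.2",
                "ctg12079", "NA10008", "z06s024429"]:
    print(example, "->", classify_identifier(example))
```

Anything that falls through to "unknown" is best looked up by hand, since the patterns only cover the forms mentioned in this FAQ.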
- 8. I found a sequence of interest. How can I order the biological clone?
All BAC and PAC clones from the clone sequencing project can be ordered, though not from the Sanger Institute. The source for ordering depends upon the library the clone belongs to, which can be derived from the international clone name.
Each clone has three different identifiers: its EMBL accession, its international (or "external") name, and its internal Sanger name. The EMBL accession allows the sequence of the clone to be obtained in the EMBL database. The international and internal names are useful for ordering the physical clone, and are related to one another as described below. As stated above in FAQ 7, in Ensembl all clones are identified using their EMBL accession.
If you have only the EMBL accession for the clone, you can look up the record in EMBL or Chromoview to determine the external name.
International clone names like RP71-1H3 start with the library identifier RP71 followed by the plate number 1 and the well number H3.
Internal clone identifiers such as bZ1H3 are the same, apart from the library identifier which is bZ in this case. For details of how prefixes relate to individual libraries please check the mapping page or see the following list of external/internal prefixes:
library           external prefix  internal prefix  order contact
CHORI-211         CH211            zC               bacpacorders@chori.org
DanioKey          DKEY             zK               sales@imagenes-bio.de
DanioKey Pilot    DKEYP            zKp              sales@imagenes-bio.de
CHORI-73          CH73             zH               bacpacorders@chori.org
RPCI-71           RP71             bZ               bacpacorders@chori.org
BUSM1 (PAC)       BUSM1            dZ               camemiya@benaroyaresearch.org
ZfishFos          ZFOS             zF               archives@sanger.ac.uk
CHORI-1073 (FOS)  CH1073           zFD              bacpacorders@chori.org
Further details of these libraries are available on our library page.
Some of the clones we sequenced were submitted by members of the community and we don't know the libraries they are from. In this case, the external prefix will be 'XX' and some of them will translate to the internal prefix 'bY'.
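The relationship between international and internal clone names can be sketched in code. This is a minimal sketch, assuming that an international name always has the layout prefix, hyphen, plate number, well (as in RP71-1H3) and that a well is one letter followed by digits; the regular expressions and function names are illustrative assumptions. The prefix and contact data are copied from the list above. The community-submitted 'XX'/'bY' clones are left out because their mapping is not one-to-one.

```python
import re

# (library, external prefix, internal prefix, order contact), copied from the list above.
LIBRARIES = [
    ("CHORI-211", "CH211", "zC", "bacpacorders@chori.org"),
    ("DanioKey", "DKEY", "zK", "sales@imagenes-bio.de"),
    ("DanioKey Pilot", "DKEYP", "zKp", "sales@imagenes-bio.de"),
    ("CHORI-73", "CH73", "zH", "bacpacorders@chori.org"),
    ("RPCI-71", "RP71", "bZ", "bacpacorders@chori.org"),
    ("BUSM1 (PAC)", "BUSM1", "dZ", "camemiya@benaroyaresearch.org"),
    ("ZfishFos", "ZFOS", "zF", "archives@sanger.ac.uk"),
    ("CHORI-1073 (FOS)", "CH1073", "zFD", "bacpacorders@chori.org"),
]
EXTERNAL_TO_INTERNAL = {ext: internal for _, ext, internal, _ in LIBRARIES}
INTERNAL_TO_EXTERNAL = {internal: ext for _, ext, internal, _ in LIBRARIES}
ORDER_CONTACT = {ext: contact for _, ext, _, contact in LIBRARIES}

# Assumed layouts: "<prefix>-<plate><well>" and "<prefix><plate><well>".
EXTERNAL_RE = re.compile(r"^([A-Z0-9]+)-(\d+)([A-Z]\d+)$")
INTERNAL_RE = re.compile(r"^([A-Za-z]+)(\d+)([A-Z]\d+)$")


def external_to_internal(name):
    """Convert e.g. 'RP71-1H3' to 'bZ1H3'."""
    match = EXTERNAL_RE.match(name)
    if not match or match.group(1) not in EXTERNAL_TO_INTERNAL:
        raise ValueError(f"unrecognised international clone name: {name!r}")
    prefix, plate, well = match.groups()
    return f"{EXTERNAL_TO_INTERNAL[prefix]}{plate}{well}"


def internal_to_external(name):
    """Convert e.g. 'bZ1H3' to 'RP71-1H3'."""
    match = INTERNAL_RE.match(name)
    if not match or match.group(1) not in INTERNAL_TO_EXTERNAL:
        raise ValueError(f"unrecognised internal clone name: {name!r}")
    prefix, plate, well = match.groups()
    return f"{INTERNAL_TO_EXTERNAL[prefix]}-{plate}{well}"


def order_contact(external_name):
    """Return the order contact for the library an international name belongs to."""
    match = EXTERNAL_RE.match(external_name)
    if not match or match.group(1) not in ORDER_CONTACT:
        raise ValueError(f"unrecognised international clone name: {external_name!r}")
    return ORDER_CONTACT[match.group(1)]


print(external_to_internal("RP71-1H3"))
print(internal_to_external("bZ1H3"))
print(order_contact("RP71-1H3"))
```

Names that do not fit the assumed layout raise a `ValueError` rather than being guessed at, so unusual clone names still need checking against EMBL or Chromoview.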
ESTs with the prefix ZF_mu belong to the muscle library created within the ZF Models EU project by Sarah Baxendale (Sheffield University). They can be ordered from Geneservices (contact Sebastien Allouis).
If you are interested in a contig from one of the whole genome shotgun assemblies, we are unfortunately unable to send you the plasmids corresponding to individual traces. You can try to find a matching clone and then order it from the appropriate address.
Facts about the sequencing project
- 1. Which strategies were chosen to sequence the zebrafish genome?
When the zebrafish genome project was started in spring 2001, we chose two different strategies to obtain the sequence.
The first strategy is the traditional clone mapping and sequencing technique. The BAC libraries CHORI211 and DanioKey were chosen and fingerprinted to generate a map. From this map a tiling path is calculated that covers the genome sequence clone by clone. Clones from this tiling path are then chosen for individual high quality sequencing. The genome sequence is then pieced together clone by clone. This approach takes time but leads to a high quality genome sequence, featured in Vega. You can find all clone mapping and sequencing related links here.
Due to the number of individuals chosen for the generation of the initially used clone libraries, the project suffered from several haplotypic variations being present, leading to artificial duplications. In order to solve this we have resorted to additional libraries, made from a single double haploid fish. Clones sequenced from these libraries are treated as reference and given priority in mapping and assembly processes.
In order to provide a full genome sequence whilst the above project is ongoing, we also produce integrated genome assemblies built on the above clone path, with the gaps filled by whole genome shotgun contigs, as featured in Ensembl. The whole genome shotgun assembly which contributed to the present Zv9 integrated assembly (WGS31) was created using Illumina sequencing reads from a double-haploid Tübingen fish (289 million reads providing approximately 30-fold coverage), combined with capillary sequencing reads from a second related double-haploid Tübingen fish (12.2 million reads providing approximately 7.5-fold coverage).
This use of data from double-haploid Tübingen fish results in less artificial haplotypic duplication than was found in previous WGS assemblies which were generated from multiple individual diploid fish. A novel de Bruijn graph based algorithm called Fuzzypath was used to assemble the Illumina reads into short sequence contigs; these contigs were then combined with the capillary reads using the Phusion assembler.
The resulting WGS traces can be searched with Megablast or Blast using the NCBI trace archive search. All whole genome shotgun and assembly related links can be found here.
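The read counts and fold-coverage figures quoted above for WGS31 are linked by the usual relation coverage = number of reads × mean read length / genome size. The sketch below rearranges this to show the mean read length each figure implies. The genome size is an assumption: the text does not say which value the coverage figures were calculated against, so the total Zv9 scaffold size of 1.412 Gb quoted elsewhere in this FAQ is used here. The results are back-of-envelope values, not reported read lengths.

```python
# Assumption: genome size behind the quoted coverage figures (not stated in the text).
ASSUMED_GENOME_SIZE_BP = 1.412e9


def implied_mean_read_length(n_reads, fold, genome_size_bp=ASSUMED_GENOME_SIZE_BP):
    """Mean read length (bp) implied by fold = n_reads * read_length / genome_size."""
    return fold * genome_size_bp / n_reads


def coverage(n_reads, mean_read_length_bp, genome_size_bp=ASSUMED_GENOME_SIZE_BP):
    """Fold coverage given a read count and a mean read length."""
    return n_reads * mean_read_length_bp / genome_size_bp


illumina_bp = implied_mean_read_length(289e6, 30)     # 289 million reads, ~30-fold
capillary_bp = implied_mean_read_length(12.2e6, 7.5)  # 12.2 million reads, ~7.5-fold

print(f"Illumina reads:  ~{illumina_bp:.0f} bp implied mean length")
print(f"Capillary reads: ~{capillary_bp:.0f} bp implied mean length")
```

Under this assumption the figures imply roughly 147 bp per Illumina read and roughly 870 bp per capillary read; a different assumed genome size scales both values proportionally.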
- 2. Who looks after the genome sequence?
Following the release of the integrated genome assembly Zv9, the zebrafish genome sequence has officially been handed over to the care of the Genome Reference Consortium (GRC). The GRC will continue to improve the zebrafish clone path by closing gaps, fixing errors and representing complex variation. More details can be found in this talk, given at the 2010 zebrafish conference in Madison.
- 3. Which strain of zebrafish was used?
Both the clone sequence and whole genome shotgun assemblies are based on DNA derived from Tübingen fish. Information on the DNA source of the BAC/PAC clones is listed on the library page.
- 4. What is the construction strategy of the physical map?
We are using a combination of approaches to provide contiguation of the zebrafish genome. Along with labs in Utrecht and Tübingen, we have generated around 20-fold coverage of fingerprinted clones across the genome. These data are analysed and assembled at the Sanger Institute.
The contigs generated are acting as a template for clone by clone sequencing of the genome. It is hoped these data will provide a useful basis for positional cloning within the community. Contigs identified as of special interest to the community can be prioritised for walking. Please contact zfish-help for details.
- 5. What is the 'clone mapping and sequencing' approach?
Please have a look at FAQ no. 1.
- 6. How does manual annotation work?
Clones from the 'clone mapping and sequencing' approach are subject to manual annotation to find all genes in them. For this, the alignments of ESTs, cDNAs or protein sequences (supporting evidence) to the genomic region are investigated and a gene is added if the match is of sufficient quality, continuous and features correct splice sites. Genes are not built on the results of ab initio gene prediction programs. Once a clone is manually annotated, it is submitted to EMBL/Genbank and can be downloaded from there and/or browsed in the Vega database.
- 7. How and when are assemblies built?
Assemblies are built when the underlying datasets change. We currently aim at a release every one or two years. For the assembly process we use the sequence from mapped and finished clones as a starting point. The remaining gaps get filled with sequence from whole genome shotgun supercontigs, as described in FAQ number 1 from the "Facts" section. Markers and cDNAs are used as anchors to merge clone and WGS contig sequence.
This results in three categories of contigs. The first consists of finished BAC sequences, with the gaps in between filled with WGS sequence, placed onto a chromosome. The second is the same but without any chromosome placement. These two types of contigs are named Zv9_scaffold<number>. The third consists of WGS contigs that can't be tied to the finished clone sequences at all; these go into the assembly as Zv9_NA<number>.
After all this, the assembly is released to the public.
For further details about identifiers in the assembly, see FAQ number 7 about searching the Sanger web resources.
- 8. How and when are Ensembl databases built?
When a new assembly is available, the process starts with loading the assembly sequence into a new database. Analyses like Blast searches, repeat masking, marker e-PCR and ab initio gene predictions (Genscan) are run. As soon as these initial analyses are finished, the results get publicly released as a pre-Ensembl database.
After the initial analyses, the gene build starts. Genewise is used to predict genes on the basis of homology matches between the genomic sequence and protein sequences from zebrafish and other species. After the gene build, additional data is added. For a full description of available information please consult Ensembl. When this process is finished, the data gets released as an Ensembl database, replacing the previous one. The pre-Ensembl database will then be taken off the website.
Ensembl databases are released every month to reflect changes in the underlying code and the featured data. This means that the content of an Ensembl database can change slightly over time, such as when a new marker set is used. The genomic sequence will not change until the release of a new assembly, and the Ensembl genes will usually not change (unless otherwise announced).
- 9. How much has been sequenced so far?
Integrated whole genome assembly
The latest assembly Zv9 is built on 11,099 finished clones placed onto the physical map. These clones cover 87% of the estimated euchromatic genome size of 1.335 Gb. The remaining gaps were filled with contigs from the whole genome shotgun assembly WGS31. The total size of scaffolds in Zv9 (including sequence that has not yet been localised to a chromosome) is 1.412 Gb.
The current in-house statistics are displayed on the homepage.
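The Zv9 figures quoted above can be combined into a few derived numbers. This is a minimal sketch that uses only the values given in the text (11,099 finished clones, 87% coverage, 1.335 Gb estimated euchromatic genome size); the variable names are illustrative. The per-clone value is the mean non-redundant contribution to the clone path, not the clone insert size, because neighbouring clones overlap.

```python
FINISHED_CLONES = 11_099
CLONE_COVERAGE = 0.87            # fraction of the estimated euchromatic genome
EUCHROMATIC_GENOME_BP = 1.335e9  # estimated euchromatic genome size

# Sequence covered by finished clones.
clone_covered_bp = CLONE_COVERAGE * EUCHROMATIC_GENOME_BP

# Part of the estimate not covered by finished clones (filled with WGS contigs).
remaining_bp = EUCHROMATIC_GENOME_BP - clone_covered_bp

# Mean non-redundant contribution of one finished clone.
mean_bp_per_clone = clone_covered_bp / FINISHED_CLONES

print(f"Covered by finished clones: {clone_covered_bp / 1e9:.2f} Gb")
print(f"Not covered by clones:      {remaining_bp / 1e9:.2f} Gb")
print(f"Mean per clone:             {mean_bp_per_clone / 1e3:.0f} kb")
```

This gives about 1.16 Gb covered by finished clones, about 0.17 Gb of the estimate left to the whole genome shotgun contigs, and a mean of roughly 105 kb of unique sequence per clone.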
- 10. What is the time scale for the project?
We intend to release a new genome assembly every one or two years. This depends on the amount of additional information gathered within a given period, so the intervals between releases may vary.
The genome sequence released as integrated assemblies with automated annotation (latest: Zv9) already provides a valuable and reliable resource for the zebrafish community. At the same time, the Vega database features finished clones with manual annotation for those in search of gold-standard sequence and annotation. 2011 will see a new merged gene set, created from the automated annotation in Ensembl and the manual annotation in Vega. Both the manual annotation and the sequencing project are ongoing. The genome sequence has now been taken on by the Genome Reference Consortium.
- 11. What is the Sanger data use policy?
If you have used our data in your analysis and wish to publish, please have a look at our data use policy first.
- 12. How do I cite the zebrafish genome assembly?
A publication describing the zebrafish genome is currently being written. In the meantime, please use the following information to cite the genome assembly.
Primary Notation: The Danio rerio Sequencing Project (http://www.sanger.ac.uk/Projects/D_rerio/); Wellcome Trust Sanger Institute
Location: Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
Funding body: Wellcome Trust
NCBI ProjectID: 11776
Accessions for Zv9 chromosomes can be found here.
