last updated 18.06.2006
Searching the Sanger web resources
- How can I find my gene? Ensembl or Vega?
First, you need to choose the resource in which to search for your gene: the whole genome assembly or the finished clone sequence. Both projects have their pros and cons. The assemblies are built from all finished clones plus a whole genome shotgun assembly. All the data are integrated based on the physical map, with the missing bits filled in by the whole genome shotgun contigs. This results in a rather complete genome sequence, but the sequence parts differ in reliability. The clone mapping and finishing project provides highly reliable and stable sequence, but is not complete yet. At the time of writing, the finished clones cover approximately 60% of the genomic sequence.
The clones on the current chromosomal tile path are being manually annotated and can be browsed in Vega. This results in highly accurate gene structures based on supporting evidence. The project is ongoing and Vega is updated frequently to reflect new clones being sequenced and annotated. Detailed information about how to use Vega can be found in our tutorial which you can download here.
Vega only contains finished clones on the current tile path. If you want to search ALL finished and unfinished zebrafish clones at the Sanger Institute, you can do this here.
If you want to look for a gene in the current assembly use Ensembl. Assemblies are built and annotated about once a year to reflect additional data being available. Detailed information about how to use Ensembl can be found in our tutorial which you can download here.
- How can I find my marker?
You can use the text search box in Ensembl, or the fpc database.
- How can I find a BAC/PAC clone?
You can check where the clone lives, find out about its neighbours and download its sequence in our TPF and AGP browser. More explanation can be found here.
You can search/browse/download clones and manual annotation in the Vega database.
In Ensembl, you can identify a clone by its name (accession.version) on the ContigView pages under 'Overview' and 'Detailed View'. In case you want to find a clone that covers a certain region in a WGS contig within the assembly (Zv8_...), switch on the 'BAC end' sequences under 'Features'. If you zoom out wide enough, ends belonging to the same clone will be shown as a red box.
We also offer an ftp site for downloading clone sequences regardless of whether they have been manually annotated yet. For each clone you'll find a fasta file and also an embl file, which will be updated according to the finishing/manual annotation process.
You can use our Blast server to search these clones.
If you are interested in BAC end sequences: They have been submitted to dbGSS and can be searched/downloaded there. In case our zebrafish fpc database lists them as sequenced but you can't find them in dbGSS yet, you can get help here. Just send a list of the BAC ends you are interested in and we'll try our best.
In case you submitted clones for sequencing you can look up their status here. You will have to enter an accession number or the internal clone name. Please check FAQ no. 7 if in doubt.
- How can I retrieve data from the whole genome shotgun?
You can search the trace repository with your sequence using SSAHA or translated SSAHA. You can learn more about SSAHA here.
- How can I retrieve data from the assembly?
You can find all information about the assemblies here. The assemblies are also submitted to EMBL/Genbank and can be downloaded there.
- How can I retrieve data from the Ensembl database?
Please have a look at our tutorial which you can download here.
If you are looking for a certain gene or marker, try FAQ no. 1 and/or 2. If you are interested in complex queries in the Ensembl database, e.g. retrieving all genes with a certain domain, or downloading 1 kb upstream of the transcription start of all predicted genes, try Biomart.
If you want to download sequence data from Ensembl, you can use the mouse-over on the blue sequence box in the middle of the contigview window. The 'export this file' option leads you to the Export data window. Please note that this option will only give you the part of the sequence that was used in the assembly. If you want to download the sequence of a whole finished clone and it overlaps with its neighbours, then only the non-overlapping part will be returned in the exported file. For the whole sequence of a finished clone, please use the 'EMBL source file' option from the mouse-over.
- I found a sequence of interest. What does the identifier stand for?
Whole genome shotgun trace names like Z35728-a15f12.q1c can be split into the zebrafish ligation number Z35728, the plate number a15, the well number f12 and the direction of the read q1c. These clones cannot be ordered, unfortunately.
External clone names like RP71-1H3 out of the finishing project start with the library identifier RP71 (for others please look up the library page) followed by the plate number 1 and the well number H3. These clones can be ordered from the respective distributors.
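The split described above can be sketched in a few lines of Python. This is only an illustration: the regular expression is an assumption generalised from the single example Z35728-a15f12.q1c, and the names TraceName and parse_trace_name are hypothetical, not part of any Sanger tool.

```python
import re
from typing import NamedTuple, Optional


class TraceName(NamedTuple):
    ligation: str  # zebrafish ligation number, e.g. Z35728
    plate: str     # plate number, e.g. a15
    well: str      # well number, e.g. f12
    read: str      # direction of the read, e.g. q1c


# Assumed layout: <ligation>-<plate><well>.<read>, with plate and well each
# written as one lowercase letter followed by digits (as in the example).
_TRACE_RE = re.compile(r"^(Z\d+)-([a-z]\d+)([a-z]\d+)\.(\w+)$")


def parse_trace_name(name: str) -> Optional[TraceName]:
    """Split a trace name into its parts, or return None if it does not fit."""
    match = _TRACE_RE.match(name)
    return TraceName(*match.groups()) if match else None


print(parse_trace_name("Z35728-a15f12.q1c"))
```

Names that do not follow the assumed layout return None, so the function can also be used to tell trace names apart from clone names.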
Internal clone identifiers like bZ1H3 are basically the same, apart from the library identifier, which is bZ in this case. For others please check the mapping page or this list of external/internal prefixes:
library            external prefix  internal prefix  order contact
CHORI-211          CH211            zC               bacpacorders@chori.org
DanioKey           DKEY             zK               sales@imagenes-bio.de
DanioKey Pilot     DKEYP            zKp              sales@imagenes-bio.de
CHORI-73           CH73             zH               bacpacorders@chori.org
RPCI-71            RP71             bZ               bacpacorders@chori.org
BUSM1 (PAC)        BUSM1            dZ               camemiya@benaroyaresearch.org
ZFISHFOS           ZFISHFOS         zF               archives@sanger.ac.uk
CHORI-1073 (FOS)   CH1073           zFD              bacpacorders@chori.org
Some of the clones we sequenced were submitted by members of the community and we don't know the libraries they are from. In this case, the external prefix will be 'XX' and some of them will translate to the internal prefix 'bY'.
ESTs with the prefix ZF_mu belong to the muscle library created within the ZF Models EU project by Sarah Baxendale (Sheffield University). They can be ordered from Geneservices (contact Sebastien Allouis).
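The prefix table above can be used as a simple lookup to turn an external clone name into its internal identifier. The following Python sketch assumes that every library follows the same rule as the documented example (RP71-1H3 becomes bZ1H3: swap the prefix and drop the hyphen); only that one example is confirmed in the text. The function name to_internal is hypothetical. The 'XX' prefix is deliberately left out because only some of those clones translate to 'bY'.

```python
from typing import Optional

# External library prefix -> internal prefix, copied from the table above.
EXTERNAL_TO_INTERNAL = {
    "CH211": "zC",
    "DKEY": "zK",
    "DKEYP": "zKp",
    "CH73": "zH",
    "RP71": "bZ",
    "BUSM1": "dZ",
    "ZFISHFOS": "zF",
    "CH1073": "zFD",
}


def to_internal(external_name: str) -> Optional[str]:
    """Translate e.g. 'RP71-1H3' to 'bZ1H3'; return None for unknown prefixes."""
    prefix, hyphen, position = external_name.partition("-")
    internal_prefix = EXTERNAL_TO_INTERNAL.get(prefix)
    if not hyphen or internal_prefix is None:
        return None
    # Assumption: plate and well are carried over unchanged.
    return internal_prefix + position


print(to_internal("RP71-1H3"))
```

Returning None for unknown prefixes keeps community-submitted 'XX' clones from being silently mistranslated.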
Identifiers like z06s024429 mean that you are dealing with a virtual supercontig from the first whole genome shotgun assembly Zv1.
Identifiers like ctg12079 mean that you are looking at the virtual contig ctg12079 from the whole genome shotgun assembly Zv3. These contigs could be related to fpc contigs from the mapping project at the time of the data freeze and therefore got the appropriate fpc contig names. Contigs in this assembly might also be named like NA10008 which means they couldn't be related to any fpc contigs.
From the assembly Zv4 onwards, we named supercontigs with relation to fpc contigs Zv4_scaffold followed by a random number. Unmapped supercontigs are called Zv4_NA, again followed by a random number.
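The naming schemes for assembly identifiers described above can be summarised as a small classifier. This Python sketch is an illustration only: the patterns are assumptions inferred from the example identifiers in the text (z06s024429, ctg12079, NA10008, Zv4_scaffold and Zv4_NA followed by a number), and describe_identifier is a hypothetical helper.

```python
import re
from typing import Optional

# (pattern, description) pairs; {0} is filled with the assembly number if captured.
_PATTERNS = [
    (re.compile(r"^z\d+s\d+$"),
     "virtual supercontig, WGS assembly Zv1"),
    (re.compile(r"^ctg\d+$"),
     "virtual contig related to an fpc contig, WGS assembly Zv3"),
    (re.compile(r"^NA\d+$"),
     "contig not related to any fpc contig, WGS assembly Zv3"),
    (re.compile(r"^Zv(\d+)_scaffold\d+$"),
     "supercontig tied to the clone map, assembly Zv{0}"),
    (re.compile(r"^Zv(\d+)_NA\d+$"),
     "unmapped supercontig, assembly Zv{0}"),
]


def describe_identifier(identifier: str) -> Optional[str]:
    """Return a short description of an assembly identifier, or None if unknown."""
    for pattern, template in _PATTERNS:
        match = pattern.match(identifier)
        if match:
            return template.format(*match.groups())
    return None


# The numbers after Zv4_scaffold and Zv4_NA below are made up for illustration.
for name in ["z06s024429", "ctg12079", "NA10008", "Zv4_scaffold123", "Zv4_NA456"]:
    print(name, "->", describe_identifier(name))
```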
- I found a sequence of interest. How can I order the biological clone?
If you found a trace out of the whole genome shotgun project we are unfortunately unable to send you the plasmid. You can try to find a matching clone and then order it from the appropriate address.
All BAC and PAC clones from the clone sequencing project can be ordered, though not from the Sanger Institute. Please check the library page for details or order using the information displayed above.
ZF_mu EST clones can be ordered from Geneservices (contact Sebastien Allouis).
Facts about the sequencing project
- Which strategies were chosen to sequence the zebrafish genome?
When the zebrafish genome project was started in spring 2001, we chose two different strategies to obtain the sequence, which is estimated to have a size of 1.6 to 1.7 Gb.
The first strategy is the traditional clone mapping and sequencing technique. The BAC libraries CHORI-211 and DanioKey were chosen and fingerprinted to generate a map. From this map a tiling path is calculated that covers the genome sequence clone by clone. Clones from this tiling path are then chosen for individual high quality sequencing. The whole genome sequence is then pieced together clone by clone. This approach takes time but leads to a high quality genome sequence, featured in Vega. You can find all clone mapping and sequencing related links here. Recently, it emerged that, due to the number of individuals chosen for the generation of the clone libraries, the project suffers from haplotypic variation. This variation is difficult for the mapping team to resolve. We are currently trying to solve this by resorting to yet another library, CHORI-73, which was made from a single doubled haploid fish. Clones sequenced from this library are treated as reference and given priority in the mapping and assembly processes.
In order to provide a full genome sequence whilst the above project is ongoing, we also produce integrated genome assemblies built on the above, with the gaps filled with whole genome shotgun contigs, as featured in Ensembl. For the whole genome shotgun sequencing, DNA from Tübingen embryos was used to generate plasmid libraries with 2-10 kb inserts and fosmid libraries with 40 kb inserts. The resulting traces are stored in our trace repository and can be searched with the search tool SSAHA. All whole genome shotgun and assembly related links can be found here.
- What animals were chosen to get DNA from for the projects?
The whole genome shotgun was made from DNA derived from Tübingen embryos. Information on the DNA source of the BAC/PAC clones is listed on the library page.
- What is the construction strategy of the physical map?
We are using a combination of approaches to provide contiguation of the zebrafish genome. Along with labs in Utrecht and Tübingen, we have generated around 20 fold coverage of fingerprinted clones across the genome. These data are analysed and assembled at the Sanger Institute, and released using Webfpc.
The contigs generated are acting as a template for clone by clone sequencing of the genome. It is hoped these data will provide a useful substrate for positional cloning within the community. Contigs identified as of special interest to the community can be prioritised for walking. Please contact Sean Humphray for details.
- What is the 'clone mapping and sequencing' approach?
Please have a look at FAQ no. 1.
- How does manual annotation work?
Clones from the 'clone mapping and sequencing' approach are subject to manual annotation to find all genes in them. For this, the alignments of ESTs, cDNAs or protein sequences (supporting evidence) to the genomic region are investigated and a gene is added if the match is of sufficient quality, continuous and features correct splice sites. Genes are not built on 'guesses' like the results of gene prediction programs. Once a clone is manually annotated, it is submitted to EMBL/Genbank and can be downloaded from there and/or browsed in the Vega database.
- How and when are assemblies built?
Assemblies are built when the underlying datasets change. We currently aim at one to two releases per year. For the assembly process we use the sequence from mapped and finished clones as a scaffold. The remaining gaps get filled with sequence from whole genome shotgun supercontigs, produced by assembling reads with Phusion. Markers, BAC end sequences and cDNAs are used as anchors to merge clone and contig sequence.
This results in three categories of contigs. Firstly, contigs consisting of finished BAC sequences, with the gaps in between filled with WGS sequence, placed onto a chromosome. Secondly, the same but without any chromosome placement. These two types of contigs will be named Zv8_scaffold<number>. Thirdly, some WGS contigs can't be tied to the finished clone sequences at all; these will go into the assembly as Zv8_NA<number>.
After all this, the assembly is released to the public.
- How and when are Ensembl databases built?
When a new assembly is available, the process starts with loading the assembly sequence into a new database. Analyses like Blast searches, repeat masking, Marker-EPCR and ab initio gene predictions (Genscan) are run and as soon as these initial analyses are finished, the results get publicly released as a pre-Ensembl database.
After the initial analyses, the gene build starts. Genewise is used to predict genes on the basis of homology matches between the genomic sequence and protein sequences from zebrafish and other species. After the gene build, additional data is added. For the full current list of available information please check Ensembl. When these processes are finished, the data gets released as an Ensembl database replacing the previous one, and the pre-Ensembl database will be taken off the web pages.
Ensembl databases are released every month to reflect changes in the underlying code and the featured data. This means that the content of an Ensembl database can change slightly over time, e.g. when a new marker set gets used. The genomic sequence won't change until the release of a new assembly. The genes usually won't change either (unless otherwise announced).
- How much has been sequenced so far?
The latest assembly Zv8 is built on 9,816 clones placed onto the physical map. The remaining gaps were filled with contigs from a 10x whole genome shotgun assembly.
The current in-house statistics are displayed on the homepage.
- What is the time scale for the project?
We intend to release a new genome assembly once or twice a year. This of course depends on the amount of additional information gathered within a certain time period, so the intervals between releases might vary.
The genome sequence released as integrated assemblies with automated annotation (latest: Zv8) already provides a valuable and reliable resource for the zebrafish community. At the same time, the Vega database features finished clones with manual annotation for those in search of gold standard sequence and annotation. We aim at providing a completely finished and manually annotated genome sequence in Vega by the end of 2009.
- What is the Sanger data use policy?
If you have used our data in your analysis and wish to publish, please have a look at our data use policy first.



