Zebrafish Genome Project: Frequently Asked Questions


Searching the Sanger web resources

1. How can I find my gene?

Ensembl provides automated annotation on the latest reference assembly, merged with high-quality manual annotation from the Havana group. This merge is repeated every six months so that Ensembl keeps up with the current annotation provided in the Vega browser.

2. How can I find a certain marker?

You can use the text search box in Ensembl or the PGPviewer.

3. How can I find a clone?

You can check the current clone path and download sequence in Chromoview and PGPviewer; the latter also provides information on the reliability of a clone's position in the path. The latest reference assembly can be searched for specific clones in Ensembl and Vega. We also offer an FTP site for downloading clone sequences and a Blast server for searching them, and you can follow the progress of a particular clone through the sequencing pipeline.

All this information is available on the Genome Project home page.

5. How can I retrieve zebrafish genome assembly sequence data?

You can find all information about the assemblies here. The assemblies are also submitted to ENA/Genbank and can be downloaded there.

6. How can I retrieve data from the Ensembl database?

Please have a look at the tutorials at the bottom of the Genome Project home page. The Ensembl web pages also provide extensive information, including video tutorials.

7. What do the names in Ensembl for clones, contigs and scaffolds mean?

All clones and whole genome shotgun contigs in the integrated assembly in Ensembl are identified by their EMBL accessions. In the case of the WGS contigs, these accessions always start with CABZ.

Clones and WGS contigs are grouped into "scaffolds". A scaffold corresponds to a single fingerprint contig of clones, which has had its gaps filled in using WGS contigs placed using sequence alignment. Some scaffolds on chromosomes consist purely of WGS contigs; these are WGS contigs which have been placed using marker information. Scaffolds are given names which begin with Zv9_scaffold, followed by a number. The first scaffold on chromosome 1 is Zv9_scaffold1, the second is Zv9_scaffold2, and so on up to the end of chromosome 25.

There are also scaffolds which are not placed on chromosomes. Some of these are based on fingerprint contigs of clones whose parent chromosome could not be identified. These have scaffold names consisting of Zv9_scaffold followed by a number, just like scaffolds on chromosomes. The remaining scaffolds not on chromosomes consist of WGS contigs which could not be placed on a chromosome using alignment, marker, or cDNA information. These have names which start with Zv9_NA, followed by an arbitrary number.

In assemblies prior to Zv9, WGS contigs were given identifiers like Zv8_scaffold36.2 (for the second WGS contig in scaffold Zv8_scaffold36). As mentioned above, as of Zv9 this system is no longer used; WGS contigs are now identified using their ENA accession.

The whole genome shotgun assembly Zv3 used identifiers like ctg12079 to label virtual contigs. Contigs that could be related to clone fingerprint contigs from the mapping project at the time of the data freeze were labelled with the appropriate fingerprint contig name. Contigs in Zv3 which couldn't be related to any fingerprint contig were given names such as NA10008.
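The naming rules in this FAQ can be sketched as a small classifier. This is only an illustration: the category labels and the function name classify_identifier are made up for this example, and the exact digit layout of CABZ accessions is an assumption. Note that a Zv9_scaffold name alone does not say whether the scaffold is placed on a chromosome.

```python
import re

# Patterns follow the naming rules described in this FAQ.
# The category labels are illustrative, not official terms.
PATTERNS = [
    # Zv9 scaffolds, placed on a chromosome or not (the name does not tell which)
    ("zv9_scaffold", re.compile(r"Zv9_scaffold\d+")),
    # Zv9 WGS contigs that could not be placed on a chromosome
    ("zv9_unplaced_wgs", re.compile(r"Zv9_NA\d+")),
    # Pre-Zv9 WGS contig inside a scaffold, e.g. Zv8_scaffold36.2
    ("legacy_wgs_contig", re.compile(r"Zv[1-8]_scaffold\d+\.\d+")),
    # WGS contig accession; "CABZ + digits + optional version" is an assumption
    ("wgs_accession", re.compile(r"CABZ\d+(\.\d+)?")),
    # Zv3 virtual contig, e.g. ctg12079
    ("zv3_contig", re.compile(r"ctg\d+")),
    # Zv3 contig not related to any fingerprint contig, e.g. NA10008
    ("zv3_unassigned", re.compile(r"NA\d+")),
]


def classify_identifier(name):
    """Return an illustrative category label for an assembly identifier."""
    for label, pattern in PATTERNS:
        # fullmatch keeps Zv8_scaffold36.2 from being read as a plain scaffold
        if pattern.fullmatch(name):
            return label
    # Clone EMBL accessions and anything else fall through here
    return "unknown"


if __name__ == "__main__":
    for example in ["Zv9_scaffold1", "Zv9_NA12", "Zv8_scaffold36.2", "ctg12079", "NA10008"]:
        print(example, "->", classify_identifier(example))
```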

8. I found a sequence of interest. How can I order the biological clone?

All BAC and PAC clones from the clone sequencing project can be ordered, though not from the Sanger Institute. The source for ordering depends upon the library the clone belongs to, which can be derived from the international clone name.

Each clone has three different identifiers: its EMBL accession, its international (or "external") name, and its internal Sanger name. The EMBL accession allows the sequence of the clone to be retrieved from the EMBL database. The international and internal names are useful for ordering the physical clone, and are related to one another as described below. As stated in FAQ 7 above, all clones in Ensembl are identified by their EMBL accession.

If you have only the EMBL accession for the clone, you can look up the record in ENA or Chromoview to determine the external name.

International clone names like RP71-1H3 start with the library identifier RP71 followed by the plate number 1 and the well number H3.

Internal clone identifiers such as bZ1H3 are the same, apart from the library identifier which is bZ in this case.

For details of how prefixes relate to individual libraries please check the mapping page or see the following list of external/internal prefixes:

Library           External prefix  Internal prefix  Order contact
CHORI-211         CH211            zC               bacpacorders@chori.org
DanioKey          DKEY             zK               sales@imagenes-bio.de
DanioKey Pilot    DKEYP            zKp              sales@imagenes-bio.de
CHORI-73          CH73             zH               bacpacorders@chori.org
RPCI-71           RP71             bZ               bacpacorders@chori.org
BUSM1 (PAC)       BUSM1            dZ               camemiya@benaroyaresearch.org
ZfishFos          ZFOS             zF               We are currently negotiating the distribution of this library
CHORI-1073 (FOS)  CH1073           zFD              bacpacorders@chori.org

Some of the clones we sequenced were submitted by members of the community and we don't know the libraries they are from. In this case, the external prefix will be 'XX' and some of them will translate to the internal prefix 'bY'.
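As a rough illustration of how international and internal clone names relate, the sketch below converts one into the other using the prefixes listed above. The function name to_internal_name and the regular expression are assumptions made for this example (the pattern assumes names of the form PREFIX-plate-well as in RP71-1H3). Clones with the 'XX' prefix are deliberately not handled, since only some of them map to 'bY'.

```python
import re

# External prefix -> (internal prefix, library), copied from the list above.
LIBRARIES = {
    "CH211": ("zC", "CHORI-211"),
    "DKEY": ("zK", "DanioKey"),
    "DKEYP": ("zKp", "DanioKey Pilot"),
    "CH73": ("zH", "CHORI-73"),
    "RP71": ("bZ", "RPCI-71"),
    "BUSM1": ("dZ", "BUSM1 (PAC)"),
    "ZFOS": ("zF", "ZfishFos"),
    "CH1073": ("zFD", "CHORI-1073 (FOS)"),
}

# Assumed layout: <library prefix>-<plate number><well>, e.g. RP71-1H3
# (plate 1, well H3). Real names may have variants not covered here.
NAME_RE = re.compile(r"([A-Z0-9]+)-(\d+)([A-Z]\d+)")


def to_internal_name(external_name):
    """Convert an international clone name to the internal Sanger name."""
    match = NAME_RE.fullmatch(external_name)
    if match is None:
        raise ValueError(f"unrecognised clone name: {external_name!r}")
    prefix, plate, well = match.groups()
    if prefix not in LIBRARIES:
        # Includes 'XX' clones, whose library of origin is unknown
        raise KeyError(f"unknown library prefix: {prefix!r}")
    internal_prefix, _library = LIBRARIES[prefix]
    return f"{internal_prefix}{plate}{well}"


if __name__ == "__main__":
    print(to_internal_name("RP71-1H3"))  # the example used in this FAQ
```

The library name stored alongside each prefix can be used to pick the matching order contact from the list.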

ESTs with the prefix ZF_mu belong to the muscle library created within the ZF Models EU project by Sarah Baxendale (Sheffield University). They can be ordered from Source Bioscience via their online ordering service.

If you are interested in a contig from one of the whole genome shotgun assemblies, we are unfortunately unable to send you the plasmids corresponding to individual traces. You can try to find a matching clone and then order it from the appropriate address.

Facts about the sequencing project

1. Which strategies were chosen to sequence the zebrafish genome?

The zebrafish genome project (BioProject PRJNA11776) is based on a clone-by-clone mapping and sequencing approach. This technique provides accurate reference sequence but is extremely time-consuming. In order to provide useful data to the research community as soon as possible, whole genome shotgun sequencing (WGS, Zv1-Zv3) was undertaken at the same time and the clone path was supplemented with contigs generated from this additional sequence, leading to integrated genome assemblies (Zv4-Zv9). The latest assembly, Zv9, contains 83% clone sequence, derived from several Tuebingen libraries including those from a single double-haploid individual. This sequence was complemented with sequence from the WGS31 assembly, accession CABZ00000000.1 (17%), which was generated from whole genome shotgun reads and next generation sequencing of a single double-haploid Tuebingen individual.

2. Who looks after the genome sequence?

Following the release of the integrated genome assembly Zv9, the zebrafish genome sequence has officially been handed over to the care of the Genome Reference Consortium (GRC). The GRC will continue to improve the zebrafish clone path by closing gaps, fixing errors and representing complex variation.

3. Which strain of zebrafish was used?

Both the clone sequence and whole genome shotgun assemblies are based on DNA derived from Tübingen fish. Information on the DNA source of the BAC/PAC clones is listed on the library page.

4. What is the construction strategy of the physical map?

We are using a combination of approaches to provide contiguation of the zebrafish genome. Along with labs in Utrecht and Tübingen, we generated around 20-fold coverage of fingerprinted clones across the genome. These data have been analysed and assembled at the Sanger Institute.

The resulting clone path has since been substantially improved by correcting it with information from meiotic maps, mainly MGH and HS, plus the recently generated high-density map SATMAP, which was first used for Zv8. Improvements are ongoing and the current clone path can be browsed in Chromoview.

5. How does manual annotation work?

Clones from the 'clone mapping and sequencing' approach are subject to manual annotation to find all genes in them. For this, the alignments of ESTs, cDNAs or protein sequences (supporting evidence) to the genomic region are investigated and a gene is added if the match is of sufficient quality, continuous and features correct splice sites. Genes are not built on the results of ab initio gene prediction programs. Once a clone is manually annotated, it is submitted to ENA/Genbank and can be downloaded from there and/or browsed in the Vega database.

6. How and when are assemblies built?

Assemblies are built when the underlying datasets change. We currently aim at a release about every two years. For the assembly process we use the sequence from mapped and finished clones as a starting point. The remaining gaps get filled with sequence from whole genome shotgun supercontigs, as described in FAQ number 1 from the "Facts" section. Markers and cDNAs are used as anchors to merge clone and WGS contig sequence.

This results in three categories of contigs. Firstly, contigs consisting of finished BAC (and fosmid) sequences, with the gaps in between filled with WGS sequence, placed onto a chromosome. Secondly, the same but without any chromosome placement. These two types of contigs are named Zv9_scaffold<number>. Thirdly, some WGS contigs can't be tied to the finished clone sequences at all; these go into the assembly as scaffolds named Zv9_NA<number>.

After all this, the assembly is released to the public.

For further details about identifiers in the assembly, see FAQ number 7 about searching the Sanger web resources.

7. How and when are Ensembl databases built?

When a new assembly is available, the process starts with loading the assembly sequence into a new database. Analyses like Blast searches, repeat masking, marker e-PCR and ab initio gene predictions (Genscan) are run. As soon as these initial analyses are finished, the results get publicly released as a pre-Ensembl database.

After the initial analyses, the gene build starts. Genewise is used to predict genes on the basis of homology matches between the genomic sequence and protein sequences, cDNA and RNAseq alignments from zebrafish and other species. The resulting genes are merged with those deriving from manual annotation of the genome. For a full description of the gene build and all available data and information, please consult the extensive documentation at Ensembl.

When this process is finished, the data gets released as an Ensembl database, replacing the previous one. The pre-Ensembl database will then be taken off the website.

Ensembl databases are released every month to reflect changes in the underlying code and the featured data. This means that the content of an Ensembl database can change slightly over time, such as when a new marker set is used. The genomic sequence will not change until the release of a new assembly, and the Ensembl genes will usually not change (unless otherwise announced).

8. What is the time scale for the project?

Zv10 - or better GRCz10 - is planned for early 2014.

The genome sequence released as integrated assemblies with automated annotation (latest: Zv9) already provides a valuable and reliable resource for the zebrafish community. At the same time, the Vega database features finished clones with manual annotation for those in search of gold-standard sequence and annotation. Merged gene sets are updated on a frequent basis to reflect the ongoing manual annotation. Improvement and maintenance of the genome sequence has now been taken on by the Genome Reference Consortium.

9. What is the Sanger data use policy?

If you have used our data in your analysis and wish to publish, please have a look at our data use policy first.

10. How do I cite the zebrafish genome assembly?

A paper describing the zebrafish genome has been published.
