First assembly of the zebrafish genome released
Please note that this is a *preliminary* assembly and there are a number of points to remember:
There is a high level of misassembly. This is because the source DNA came from approximately 1000 five-day-old embryos, and the polymorphism rate is at least 1 per 200 bp, with additional significant indels. Highly variable regions of the genome therefore do not form clusters for assembly, since the sequences that originate from a given region are quite likely to come from different haplotypes. This causes assembly dropouts for some regions and false duplications in other regions, where phrap splits different haplotypes into multiple paths. We are working on the assembly code, Phusion, to address these issues. However, there is an enormous amount of useful sequence in this assembly, and we hope this outweighs the problems in the assembly.
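To give a rough sense of scale, the quoted polymorphism rate can be combined with the average read length reported in the assembly statistics. This is only a back-of-the-envelope sketch: it assumes the 1/200 figure applies uniformly between any two haplotypes and that two reads overlap along their full length, neither of which is stated in this announcement. Indels are ignored, and the variable names are illustrative.

```python
# Back-of-the-envelope estimate, not part of the assembly pipeline.
# Assumption: the polymorphism rate is uniform and applies between
# any two haplotypes sampled from the embryo pool.
polymorphism_rate = 1 / 200   # "at least" one difference per 200 bp
average_read_length = 630     # bp, from the assembly statistics

# Expected number of single-base differences between two fully
# overlapping reads that come from different haplotypes.
expected_diffs = polymorphism_rate * average_read_length

print(f"expected differences per read pair: at least {expected_diffs:.2f}")
```

Because the stated rate is a lower bound and indels are excluded, the real divergence between reads from different haplotypes is likely to be higher than this figure.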
More information is available at:
ftp://ftp.ensembl.org/pub/traces/zebrafish/assembly/assembly06/README
Although the assembly is being made available as early as possible to the research community, an Ensembl gene build has NOT yet been performed. We are investigating this now but for the moment Ensembl will continue to present clone-based data.
In a few weeks we plan to release an updated Ensembl that presents all normal Ensembl features except Ensembl gene predictions.
The assembly may be searched using BLAST at:
http://www.ensembl.org/Danio_rerio/blastview
and by SSAHA at:
http://www.ensembl.org/Danio_rerio/ssahaview
Note that Zebrafish SSAHA now supports very rapid queries using protein sequences. This feature will be extended to all Ensembl species in due course.
Assembly data are available at:
ftp://ftp.ensembl.org/pub/traces/zebrafish/assembly/assembly06
Assembly Statistics
We started with 9643640 reads comprising 6.07 Gbp (average read length approximately 630 bp). Of these, 7942778 unique reads, 82.4% of the starting reads, are in the assembly.
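The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in this announcement; the variable names are illustrative, and 6.07 Gbp is taken to mean 6.07 x 10^9 bases.

```python
# Cross-check of the read statistics quoted in the announcement.
total_reads = 9_643_640
total_bases = 6.07e9        # 6.07 Gbp, assumed to mean 6.07 x 10^9 bases
unique_reads = 7_942_778    # unique reads placed in the assembly

# Average read length in base pairs.
mean_read_length = total_bases / total_reads

# Fraction of the starting reads that ended up in the assembly.
unique_fraction = unique_reads / total_reads

print(f"average read length: {mean_read_length:.1f} bp")
print(f"reads in assembly:   {unique_fraction:.1%}")
```

The computed average comes out just under the quoted 630 bp, which is consistent with the total having been rounded to 6.07 Gbp.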
Phusion was used to cluster the reads, and phrap was used for cluster assembly and consensus generation.
Small supercontigs with fewer than 3 reads or shorter than 1 kb were rejected. A further 3.5 Mbp of the assembly was rejected as possible contamination based on read source statistics at the supercontig level.
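The size rule above can be sketched as a simple filter. This is an illustration only, not the code used to build the assembly: the `Supercontig` record, its field names, and the example entries are hypothetical, and 1 kb is assumed to mean 1000 bp. The contamination check based on read source statistics is not modelled here.

```python
from dataclasses import dataclass

MIN_READS = 3          # supercontigs with fewer reads are rejected
MIN_LENGTH_BP = 1000   # assumption: "1 kb" means 1000 bp


@dataclass
class Supercontig:
    """Hypothetical record; field names are illustrative."""
    name: str
    n_reads: int
    length_bp: int


def passes_size_filter(sc: Supercontig) -> bool:
    """Keep a supercontig only if it has enough reads AND is long enough."""
    return sc.n_reads >= MIN_READS and sc.length_bp >= MIN_LENGTH_BP


# Made-up examples covering each side of the rule.
examples = [
    Supercontig("sc_a", n_reads=12, length_bp=5400),  # kept
    Supercontig("sc_b", n_reads=2, length_bp=1800),   # too few reads
    Supercontig("sc_c", n_reads=4, length_bp=900),    # too short
]

kept = [sc.name for sc in examples if passes_size_filter(sc)]
rejected = [sc.name for sc in examples if not passes_size_filter(sc)]

print("kept:", kept)
print("rejected:", rejected)
```

Note that a supercontig is rejected if it fails either condition, so the keep test requires both thresholds to be met.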
For the supercontigs (bp measures include estimated gap sizes):
Estimated coverage based on 12Mbp of 143 finished clones gives:
