Zv8, the 8th integrated whole genome assembly of the zebrafish genome, has been released
General information
The assembly comprises a total sequence length of 1,481,241,295 bp in 11,623 fragments. It is tied to the clone path as it stood on the 12th of June 2008. 1.14 Gb of sequence from 9,816 sequenced clones (8,726 finished and 1,090 unfinished) was taken as a scaffold. The clone path was calculated from marker information on genetic and radiation hybrid maps (see details below). The resulting chromosome sequences were then compared to a whole genome shotgun (WGS) assembly, and WGS contigs were used to fill gaps where appropriate (see details below).
The scaffolds that are either based on clone contigs or could be associated with chromosome placements through marker information are named Zv8_scaffold followed by a number. The WGS contigs that could not be placed onto chromosomes are named Zv8_NA followed by a number. Previously, these contigs were discarded unless they matched certain criteria (length > 2 kb and features present, or length > 5 kb). For Zv8, all Zv8_NA contigs > 2 kb were retained, hence the rise in contig numbers. According to the agreement reached at the European Zebrafish Meeting in Paris, 2003, we translated linkage group numbers directly into chromosome numbers (e.g. linkage group 1 = chromosome 1).
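The change in the retention rule for unplaced contigs can be illustrated with a minimal sketch. The `Contig` class, the example names and the reading of 1 kb as 1,000 bp are assumptions made for illustration; this is not the pipeline code used for Zv8.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str           # hypothetical identifier, not a real Zv8_NA name
    length_bp: int
    has_features: bool  # whether any annotated feature lies on the contig


def kept_before_zv8(contig: Contig) -> bool:
    # Earlier rule as described in the text:
    # (length > 2 kb and features present) or length > 5 kb.
    # Assumption: 1 kb is taken as 1,000 bp.
    return (contig.length_bp > 2_000 and contig.has_features) or contig.length_bp > 5_000


def kept_in_zv8(contig: Contig) -> bool:
    # Zv8 rule as described in the text: every unplaced contig > 2 kb is retained.
    return contig.length_bp > 2_000


contigs = [
    Contig("example_a", 1_500, True),
    Contig("example_b", 3_000, False),
    Contig("example_c", 3_000, True),
    Contig("example_d", 6_000, False),
]
old_kept = [c.name for c in contigs if kept_before_zv8(c)]
new_kept = [c.name for c in contigs if kept_in_zv8(c)]
```

In this toy set, `example_b` (3 kb, no features) is the kind of contig that the earlier rule would drop and the Zv8 rule keeps.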
Please note:
This is still a *preliminary* assembly and there are a number of points to remember. The regions of the assembly covered by WGS contigs are of lower quality. The assembly will still contain misjoins and misassemblies, and artificial duplications caused by retained haplotypic sequences are likely to remain despite our efforts to remove them. During the generation of Zv8, particular attention has been paid to improving the order of the clone path.
Assembly Strategy
Previous zebrafish genome assemblies have been problematic for several reasons. Firstly, haplotypic duplications were artificially integrated into these assemblies. Secondly, because of the choice of genetic map on which the assemblies were built, there were discrepancies between mapping / positional cloning experiments and the positions of the markers represented in the assemblies. Finally, in the transition to Zv7, several regions of the genome were lost because of the choice of whole genome shotgun assembly used to fill the gaps between finished BAC/fosmid fingerprint contig (FPC) sequence.
Table 1: The three main maps used to organize the Zv8 assembly: Heat Shock (HS), MGH and T51.
Although T51 has double the coverage and 7 times the resolution of the other two maps, it is the least reliable over long distances and for the general organization of the genome. However, the T51 map is useful over short distances (see ZFIN for more details).
The version of the HS map used in Zv8 was produced by Matt Clark (Derek Stemple lab) in collaboration with the John Postlethwait lab. A new set of 971 SNPs (ss# and ZSNP#) has been scored in the HS panel and added to the existing map, improving coverage by ~23%.
The MGH and T51 maps were kindly provided by Yi Zhou at the Children's Hospital Boston.
To overcome these problems, we have made two major changes to the assembly process for Zv8:
1) We have reorganized FPC order and orientation by more careful use of the existing genetic maps, Heat Shock (HS) and MGH, and the T51 radiation hybrid map (see Table 1). In the process we have identified a number of haplotypic duplications, which have been removed.
2) To fill in gaps between finished BAC/fosmid FPC sequence we have used a whole genome shotgun assembly (WGS) with a greater fold coverage than was used in Zv7.
Reorganization of FPCs:
Analysis of the correlation between Zv7 and the three genetic maps (HS, MGH and T51) highlights a particular problem in the Zv7 assembly (Fig. 2). Although the T51 map was used to anchor FPCs in Zv7, there is nevertheless poor correlation between the assembly and the map. Indeed, the average correlation between the three maps and the assembly is only around 0.7 (Fig. 1).
To overcome this problem we have reorganized the FPCs, prioritising the meiotic maps HS and MGH for long-range order and chromosome assignment, then using the T51 radiation hybrid map mostly to resolve local order and orientation. In this process we consider an FPC as an indivisible unit. The FPC provides a link between the markers mapped on the FPC and therefore makes an association between the three different maps. We have used this association to integrate map position data derived from the three different maps by sequence alignment between genetic markers and the sequence of each FPC. By using all three maps we increase the coverage and resolution to the maximum possible given available information.
Figure 1: Correlation between the genetic maps and the assemblies (Zv7 and Zv8), expressed as the Spearman rank correlation coefficient (rho).
We weighted each marker based on the overall quality of each map and the mapping quality of the marker on the genome. In principle, the two meiotic maps represent the genome structure more accurately, but have lower resolution than the T51 map. Of the two meiotic maps, the HS map is more reliable because its allele scoring is more accurate than that of the MGH map. Using the initial marker mapping information, each FPC was first assigned to a chromosome, and its position was estimated as a weighted mean of the positions of its markers. Given the chromosome assignment and position of each FPC, the FPCs were then sorted by map position, giving precedence to HS data, then MGH and finally T51.
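The weighting and ordering described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used for Zv8: the `Marker` fields, the rule of assigning a chromosome by the largest summed weight, and the rescaling of every map to a common 0..1 position scale are all hypothetical simplifications.

```python
from collections import defaultdict
from dataclasses import dataclass

# Precedence of the maps when choosing which position to sort on.
MAP_PRIORITY = ("HS", "MGH", "T51")


@dataclass
class Marker:
    fpc: str          # FPC contig that the marker sequence aligns to
    map_name: str     # one of MAP_PRIORITY
    chromosome: int   # linkage group reported by the map
    position: float   # assumption: rescaled to a common 0..1 scale
    weight: float     # assumption: > 0, combines map and alignment quality


def place_fpcs(markers):
    """Return {fpc: (chromosome, map used, weighted mean position)}."""
    by_fpc = defaultdict(list)
    for m in markers:
        by_fpc[m.fpc].append(m)

    placements = {}
    for fpc, hits in by_fpc.items():
        # Assumed rule: the chromosome with the largest summed weight wins.
        votes = defaultdict(float)
        for m in hits:
            votes[m.chromosome] += m.weight
        chrom = max(votes, key=votes.get)

        # Use the highest-priority map that has markers on that chromosome.
        for map_name in MAP_PRIORITY:
            chosen = [m for m in hits
                      if m.map_name == map_name and m.chromosome == chrom]
            if chosen:
                total = sum(m.weight for m in chosen)
                mean_pos = sum(m.weight * m.position for m in chosen) / total
                placements[fpc] = (chrom, map_name, mean_pos)
                break
    return placements


def order_fpcs(placements):
    """Sort FPCs along each chromosome by their estimated position."""
    by_chrom = defaultdict(list)
    for fpc, (chrom, _map_name, pos) in placements.items():
        by_chrom[chrom].append((pos, fpc))
    return {chrom: [fpc for _pos, fpc in sorted(items)]
            for chrom, items in by_chrom.items()}


markers = [
    Marker("ctgA", "HS", 1, 0.2, 3.0),
    Marker("ctgA", "HS", 1, 0.4, 1.0),
    Marker("ctgA", "T51", 2, 0.9, 1.0),  # conflicting, low-weight marker
    Marker("ctgB", "MGH", 1, 0.1, 2.0),
    Marker("ctgC", "T51", 1, 0.5, 1.0),
]
placements = place_fpcs(markers)
order = order_fpcs(placements)
```

Treating each FPC as an indivisible unit is what lets a single conflicting marker (the T51 hit on chromosome 2 above) be outvoted by the other markers on the same contig.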
We see a striking improvement in the correlation between the Zv8 assembly and each of the genetic maps, which is now at an average of 0.96 for each chromosome (Fig. 1 and Fig. 2).
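For readers unfamiliar with the statistic reported in Fig. 1, the sketch below computes a Spearman rank correlation between marker positions on a genetic map and their positions in an assembly. The function names and the toy positions are illustrative assumptions; the actual analysis may well have used a statistics package instead.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Toy data: map positions (e.g. cM) and assembly coordinates (bp) for 5 markers.
map_pos = [0, 10, 20, 30, 40]
collinear_bp = [100, 250, 900, 1200, 5000]   # same order as the map
shuffled_bp = [100, 250, 5000, 1200, 900]    # two markers swapped

rho_collinear = spearman_rho(map_pos, collinear_bp)
rho_shuffled = spearman_rho(map_pos, shuffled_bp)
```

Because only ranks matter, the statistic does not depend on the different units of the maps (cM or cR) and the assembly (bp); it measures only whether markers appear in the same order.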
In addition, we find that Chromosome 4 has a significantly increased size, now closer to the size expected from flow cytometry. Assignment of sequence to Chromosome 4 needs to be interpreted with caution. The long arm of Chromosome 4 contains very repetitive sequence, which could lead to mis-localisation of FPCs.
Figure 2: Distribution of markers in the assembly vs the MGH genetic map. The Zv7 assembly (left panel) shows a fair number of markers mislocalized at the end of each chromosome (red arrow). There are also markers mapping outside their genetic linkage group. In Zv8 (right panel) the markers within each chromosome are properly aligned and there are no longer markers mislocated at the end of the chromosome. The number of markers on the wrong chromosome is also reduced in Zv8. A detailed view of the marker distribution from Chr 1 to Chr 25 can be displayed by clicking on the chromosomes in the graph legend. Use the buttons on the right to change the genetic map.
Resources
A pre-Ensembl database built on the Zv8 assembly, featuring the sequence and automated feature annotation, is now available. A full gene build will be released in a 'full Ensembl database' in spring 2009.
The whole assembly can be downloaded from ftp://ftp.ensembl.org/pub/assembly/zebrafish/Zv8release
Assembly Statistics
The WGS assembly is based on 20,541,433 reads comprising 14,160,626,498 bp with a coverage of 6.5-7x. This set includes 6,882,050 reads from a new library generated from a single Tuebingen, double haploid zebrafish. In order to increase continuity of contigs in the finished or near finished regions, we shredded 1,366,419 reads from finished clones in the tiling path. From this set, 18,969,500 reads were finally placed in the assembly. Phusion was used to cluster the reads and phrap was used for cluster assembly and consensus generation. This resulted in 247,928 contigs with an N50 size of 20,629 bp. Contigs are joined into supercontigs based on read-pair information, where the sizes of gaps are estimated using insert sizes of different lengths. Small supercontigs with fewer than 3 reads or smaller than 0.5 kb were rejected. There are 105,987 supercontigs in the WGS assembly with an N50 size of 687,451 bp.
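As a reminder of how the N50 figures quoted in these statistics are defined, here is a minimal sketch using the usual convention: the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total length. The function name and the toy lengths are illustrative assumptions, and whether gaps are counted depends on the lengths that are passed in.

```python
def n50(lengths):
    """Return (N50 length, number of sequences needed to reach half the total)."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: the full set always covers half the total")


# Toy scaffold lengths in bp (not real Zv8 values).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50, toy_n = n50(toy_lengths)
```

The second return value corresponds to the "n" that accompanies an N50 in assembly summaries: the number of sequences at or above the N50 length that are needed to cover half of the assembly.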
The integration of the WGS assembly with the clone sequences results in the Zv8 assembly (bp measures include estimated gap sizes):
- Total bases = 1,481,241,295 bp
- Scaffolds = 11,623
- Largest = 76,918,211 bp
- N50 = 50,748,729 bp, n = 13
- 1,322,655,876 bp in scaffolds placed on chromosomes 1-25 (includes 100 bp gaps between scaffolds).
- 43,712,372 bp in 180 scaffolds tied to unplaced FPC contigs.
- 123,873,047 bp in 11,418 NA scaffolds.
