Comparison of draft human sequence versions from the public and private domain

Summary

The greater part of the data for Celera's assemblies comes from the public Human Genome Project (HGP).

Despite this benefit, Celera's assembly is only comparable with that of the HGP and is dependent upon it.

Far from "winning the race", as they have claimed and many commentators have believed, their methodology has been found wanting.

The situation today

Now that the papers from the two teams can be inspected, we can for the first time judge the relative quality of the two versions and the strengths of the methods.

It would be expected that Celera's version would be somewhat superior, because they had the opportunity to pick up the HGP data as it was continuously released and combine it with their own. The HGP could not do the reverse, because that would have required it to agree not to redistribute Celera's data, which would defeat its objective of the widest possible dissemination.

The sources of the data

We first note that Celera makes no attempt to assemble their data on its own.

They use a combination of sequence data derived from the following sources: 5.1-fold coverage generated by Celera, from random shotgun sequencing, and 7.5-fold coverage generated by the public HGP, from clone-based sequencing. Together these give 12.6-fold total coverage.
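The arithmetic behind these figures can be checked with a short sketch. The function and constant names are illustrative only; the two fold-coverage values are the ones quoted above, and "share" here simply means the fraction of total fold coverage, which is an assumption about how to compare the two contributions.

```python
# Fold-coverage figures quoted in the text above.
CELERA_SHOTGUN_FOLD = 5.1  # Celera's random shotgun sequencing
HGP_CLONE_FOLD = 7.5       # public HGP clone-based sequencing


def combined_coverage(celera: float, hgp: float) -> float:
    """Total fold coverage when both data sets are pooled."""
    return celera + hgp


def hgp_share(celera: float, hgp: float) -> float:
    """Fraction of the pooled fold coverage contributed by the HGP."""
    return hgp / combined_coverage(celera, hgp)


total = combined_coverage(CELERA_SHOTGUN_FOLD, HGP_CLONE_FOLD)
share = hgp_share(CELERA_SHOTGUN_FOLD, HGP_CLONE_FOLD)

print(f"total coverage: {total:.1f}-fold")
print(f"HGP share of pooled coverage: {share:.1%}")
```

On these figures the HGP contributes a little under 60% of the pooled coverage, consistent with the statement in the Summary that the greater part of the data comes from the HGP.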

In Celera's use of HGP data for their assemblies the sequences were artificially broken up into so-called 'faux reads' - that is, a set of perfect, evenly spaced sequence reads with no gaps (page 10, column 3).

Celera's paper then incorrectly counts the 'faux reads' from the HGP as being the equivalent of 2.9-fold random coverage. Although the number of 'faux reads' corresponds to 2.9-fold coverage, the 'faux reads' were not randomly chosen by Celera, which would have resulted in many gaps, but rather were a perfect tiling path, with no gaps, reflecting the full 7.5-fold coverage embodied in the HGP data.
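To give a rough sense of why randomly chosen reads would leave gaps, the sketch below uses an idealised Poisson model of random shotgun sequencing, in which the probability that a given base is hit by no read at fold coverage c is exp(-c). This model is an assumption introduced here for illustration; it is not taken from either paper, and real gap counts also depend on read length, overlap detection and cloning bias.

```python
import math


def expected_uncovered_fraction(fold_coverage: float) -> float:
    """Idealised Poisson estimate of the fraction of bases hit by zero reads.

    Assumes reads land uniformly at random; this is a simplification,
    not the method used in either paper.
    """
    if fold_coverage < 0:
        raise ValueError("fold coverage must be non-negative")
    return math.exp(-fold_coverage)


# 2.9-fold is the figure Celera's paper assigns to the 'faux reads'.
random_gap_fraction = expected_uncovered_fraction(2.9)

# A perfect tiling path covers every base by construction.
tiling_gap_fraction = 0.0

print(f"2.9-fold random reads: ~{random_gap_fraction:.1%} of bases uncovered")
print(f"perfect tiling path:    {tiling_gap_fraction:.1%} of bases uncovered")
```

Under this simplified model, 2.9-fold random coverage would leave roughly one base in twenty unsequenced, whereas the tiled 'faux reads' leave none, which is why the two cannot be treated as equivalent.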

Assembly methods

Celera claims that their method of "whole genome assembly" of the sequence data into a coherent product is superior to the HGP's method of hierarchical assembly. The latter first breaks the genome randomly ("shotguns" it) into 150,000-base pieces called clones. These are ordered (mapped), and a suitable subset is sequenced individually by a second shotgun. Celera attempts to achieve the same goal without the intermediate step.

The results of three assembly methods are shown in this table.

The first column shows pure whole genome assembly. Even at 12.6X coverage, it fails badly both in sequence continuity (number of gaps) and in long-range order (number of components requiring localisation to a specific place on the genome). Connecting all the pieces and filling in the gaps between them would be a formidable task.

The third column shows the HGP's hierarchical shotgun assembly. At this (draft) stage it has 94% coverage and many gaps, as expected. Nevertheless it is already much better than the whole genome assembly, even though it is based on a more economical 7.5X coverage. All components are localised by the mapping step, which makes "finishing" straightforward.

The second column shows Celera's compartmentalised assembly: an attempt to make greater use of HGP data, by importing its map information as well as its shotgun coverage. Since this assembly has 68% more shotgun data than HGP alone, it should give a much better assembly. Remarkably, it appears quite similar. However, the local ordering of pieces of sequence ("contigs") has been improved, as expected at this stage when the HGP product is unfinished.
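The 68% figure follows from the coverage values given under "The sources of the data". A minimal check, with illustrative variable names:

```python
# Fold-coverage figures quoted earlier in the text.
hgp_fold = 7.5     # HGP clone-based shotgun coverage
celera_fold = 5.1  # Celera's additional random shotgun coverage

# Extra shotgun data in the compartmentalised assembly, relative to HGP alone.
relative_increase = celera_fold / hgp_fold

print(f"additional shotgun data over HGP alone: {relative_increase:.0%}")
```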

We note in passing that Celera's whole genome assembly of Drosophila, which was claimed as a proof of its power, also had many problems that are only now being resolved by map-based finishing.

Finishing

Finishing is the process whereby additional sequence is obtained to achieve full coverage and continuity over very long distances by filling in the gaps. Only the HGP is undertaking finishing.

At this draft stage, with 94% coverage, 35% of the HGP sequence is already finished, and all will be done in the next year or two. The finished sequence, which includes the whole of chromosomes 21 and 22, is of much higher quality than any of the draft assemblies.

Conclusions

Celera's whole genome shotgun is the method that was claimed to make methods based on clones and maps redundant and wasteful. Their compartmentalised assembly depends on HGP maps, and is really a version of HGP's hierarchical shotgun assembly. It should be noted that Celera's analysis uses the compartmentalised assembly exclusively.

It is difficult to escape the conclusion that pure whole genome shotgun has failed as far as generating the sequence of the human genome is concerned.

We conclude that on current showing both the mapping and the finishing process that HGP employs are essential for determining the sequence of the human and other large genomes.

Contact the Press Office

Don Powell Media and Public Relations Manager
Wellcome Trust Sanger Institute, Hinxton, Cambs, CB10 1SA, UK

Tel +44 (0)1223 496 928
Mobile +44 (0)7753 775 397
Fax +44 (0)1223 494 919
Email press.office@sanger.ac.uk
