MetaHIT Consortium (Metagenomics of the Human Intestinal Tract consortium)

The MetaHIT project aims to understand the role of the human intestinal microbiota in health and disease.

The consortium involves 13 research centres from eight countries. The project is funded with €11.4 million from the European Commission and runs for four years from 1 January 2008.

MetaHIT Consortium website


Background

The Sanger Institute contributes to MetaHIT by producing draft nucleotide sequence for the genomes of 100 bacterial strains commonly found in the human intestinal tract. This will provide a reference set of genomes for future studies. The first 30 strains sequenced will be cultured bacteria, while the remaining 70 will be bacteria that cannot yet be grown in the laboratory. For these 70 uncultured bacteria, individual bacterial cells will be isolated and the whole of the single copy of the genome amplified to generate sufficient DNA for nucleotide sequencing.

This work is being carried out with the help of our collaborators at the Rowett Institute of Nutrition and Health in Aberdeen, and Dr Annick Bernalier (Microbiology Unit, INRA Clermont-Ferrand).

Paired-end sequencing generates sequence reads from both ends of a DNA fragment. Sequence reads from many DNA fragments are then assembled into contiguous sequences (contigs). If the two reads from one DNA fragment occur in different contigs, it is likely that the two contigs represent regions that are adjacent in the genome. In this way, contigs may be linked together into scaffolds. Within a scaffold, the sequences between the contigs are unknown, but their lengths may be estimated from the length of the DNA fragments being sequenced.
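The linking idea can be illustrated with a minimal Python sketch. It is not the assembler's actual method: the function names, the input layout (one tuple of contig names per read pair) and the simple subtraction used for the gap estimate are all assumptions made for illustration.

```python
from collections import Counter


def link_contigs(read_pairs):
    """Count read pairs whose two reads fall in different contigs.

    read_pairs: iterable of (contig_of_read_1, contig_of_read_2) tuples
    (a hypothetical input layout). Returns a Counter keyed by the
    unordered pair of contig names.
    """
    links = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return links


def estimate_gap(fragment_len, bases_in_left, bases_in_right):
    """Estimate the gap between two linked contigs from one read pair.

    fragment_len:   expected length of the sequenced DNA fragment
    bases_in_left:  part of the fragment lying inside the left contig
    bases_in_right: part of the fragment lying inside the right contig
    A negative result suggests that the two contigs overlap.
    """
    return fragment_len - bases_in_left - bases_in_right


pairs = [("c1", "c2"), ("c1", "c1"), ("c2", "c1"), ("c2", "c3")]
print(link_contigs(pairs))             # c1-c2 linked twice, c2-c3 once
print(estimate_gap(3000, 1200, 1500))  # 300
```

In practice many pairs would support each link, and the individual gap estimates would be averaged, since real fragment lengths vary around the expected value.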

The scaffold information for each genome is given in two files produced by the GS De Novo Assembler. The first is a FASTA file of the concatenated contig sequences that were scaffolded by paired-end analysis. The contigs are separated by Ns, with the number of Ns corresponding to the estimated gap size (each gap is marked by a minimum of 20 Ns). The same scaffold information is also presented as a text file in the NCBI AGP format.
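As a minimal sketch of how the scaffold FASTA layout can be read back, the Python function below splits a scaffold sequence into contigs and gaps at every run of 20 or more Ns. The function name and the tuple layout are assumptions for illustration; the 20-N threshold comes from the description above. Shorter runs of N are treated as ambiguous bases inside a contig.

```python
import re

MIN_GAP = 20  # gaps are marked by at least 20 Ns


def split_scaffold(seq, min_gap=MIN_GAP):
    """Split a scaffold into contigs and gaps.

    Returns a list of (kind, start, end) tuples, where kind is
    "contig" or "gap" and the coordinates are 1-based and inclusive.
    """
    parts = []
    pos = 0
    for match in re.finditer(r"[Nn]{%d,}" % min_gap, seq):
        if match.start() > pos:
            parts.append(("contig", pos + 1, match.start()))
        parts.append(("gap", match.start() + 1, match.end()))
        pos = match.end()
    if pos < len(seq):
        parts.append(("contig", pos + 1, len(seq)))
    return parts


scaffold = "ACGT" * 3 + "N" * 25 + "GGCC"
for kind, start, end in split_scaffold(scaffold):
    print(kind, start, end, end - start + 1)
# contig 1 12 12
# gap 13 37 25
# contig 38 41 4
```

The length of each gap recovered this way is the estimated gap size, which should agree with the gap lines in the accompanying AGP text file.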

Draft genome sequences for cultured strains

Please note that these draft sequences are unchecked and unedited, and will contain errors. 454 sequencing is known to have problems with homopolymeric tracts, so these regions are likely to contain a significant number of errors.

Note: The improved assemblies listed in the table below were produced using both 454 and Illumina reads. Each genome was first assembled with SOAP, and Newbler was then used to create a combined assembly. Contigs were joined to the scaffolds created by Newbler on the basis of overlaps and read-pair information. IMAGE was then run on each genome: it closes positive gaps using Illumina reads that were not assembled, and it also identifies negative gaps so that these can be closed manually. The sequence was then corrected using ICORN; all indels and SNPs were checked, and the changes suggested by ICORN were made where appropriate. Finally, all repeats over 100 bp within the genome were checked to ensure that each was confirmed by at least two spanning read pairs. Any obvious misassemblies were addressed, and repeats that were not confirmed by spanning read pairs were broken apart.
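The final repeat check can be sketched in Python as follows. This is an illustration only, not the pipeline that was actually run: the function names, the coordinate layout, and the definition of "spanning" (the fragment starts before the repeat and ends after it) are assumptions. The thresholds of 100 bp and two read pairs come from the note above.

```python
def count_spanning_pairs(repeat_start, repeat_end, pairs):
    """Count read pairs whose fragment covers the whole repeat.

    pairs: list of (fragment_start, fragment_end) positions on the
    assembly (a hypothetical layout). A pair is assumed to span the
    repeat if its fragment starts before and ends after the repeat.
    """
    return sum(
        1
        for frag_start, frag_end in pairs
        if frag_start < repeat_start and frag_end > repeat_end
    )


def unconfirmed_repeats(repeats, pairs, min_len=100, min_pairs=2):
    """Return repeats longer than min_len with too few spanning pairs."""
    return [
        (start, end)
        for start, end in repeats
        if end - start + 1 > min_len
        and count_spanning_pairs(start, end, pairs) < min_pairs
    ]


repeats = [(1000, 1150), (5000, 5050), (8000, 8200)]
pairs = [(900, 1300), (950, 1200), (7900, 8100), (7800, 8300)]
print(unconfirmed_repeats(repeats, pairs))  # [(8000, 8200)]
```

In this example the first repeat is confirmed by two spanning pairs, the second is too short to be checked, and the third is spanned by only one pair, so it would be a candidate for breaking apart.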

| Strain | Fold coverage | De novo sequence (fasta/qual) | Scaffolds (fasta/text) | Improved assembly | EMBL accession | Last update |
|---|---|---|---|---|---|---|
| Alistipes shahii WAL 8301 | 22x | seq, qual | seq, text | seq | FP929032 | 26/08/2011 |
| Bacteroides xylanisolvens XB1A | 18x | seq, qual | seq, text | seq | FP929033 | 26/08/2011 |
| Bifidobacterium longum subsp. longum F8 | 45x | seq, qual | seq, text | seq | FP929034 | 26/08/2011 |
| Bifidobacterium pseudocatenulatum D2CA | 17x | seq, qual | seq, text | seq | | 26/08/2011 |
| Brachyspira aalborgii 513 | 21x | seq, qual | seq, text | seq | | 26/08/2011 |
| Brachyspira pilosicoli WesB | 34x | seq, qual | seq, text | seq | | 26/08/2011 |
| Butyrivibrio fibrisolvens 16/4 | 63x | seq, qual | seq, text | seq | FP929036 | 26/08/2011 |
| Clostridiales sp. SM4/1 | 14x | seq, qual | seq, text | seq | FP929060 | 09/11/2011 |
| Clostridiales sp. SSC/2 | 29x | seq, qual | seq, text | seq | FP929061 | 09/11/2011 |
| Clostridiales sp. SS3/4 | 16x | seq, qual | seq, text | seq | FP929062 | 26/08/2011 |
| Clostridium saccharolyticum-like K10 | 20x | seq, qual | seq, text | seq | FP929037 | 26/08/2011 |
| Coprococcus catus GD/7 | 21x | seq, qual | seq, text | seq | FP929038 | 26/08/2011 |
| Coprococcus comes SL7/1 | 14x | seq, qual | | seq | | 26/08/2011 |
| Coprococcus sp. ART55/1 | 17x | seq, qual | seq, text | seq | FP929039 | 26/08/2011 |
| Enterobacter cloacae subsp. cloacae NCTC 9394 | 15x | seq, qual | seq, text | seq | FP929040 | 26/08/2011 |
| Enterococcus sp. 7L76 | 16x | seq, qual | seq, text | seq | FP929058 | 26/08/2011 |
| Eubacterium cylindroides T2-87 | 19x | seq, qual | seq, text | seq | FP929041 | 26/08/2011 |
| Eubacterium rectale DSM 17629 | 25x | seq, qual | seq, text | seq | FP929042 | 26/08/2011 |
| Eubacterium rectale M104/1 | 21x | seq, qual | seq, text | seq | FP929043 | 09/11/2011 |
| Eubacterium siraeum 70/3 | 25x | seq, qual | seq, text | seq | FP929044 | 26/08/2011 |
| Eubacterium siraeum V10Sc8a | 26x | seq, qual | seq, text | seq | FP929059 | 26/08/2011 |
| Faecalibacterium prausnitzii L2-6 | 29x | seq, qual | seq, text | seq | FP929045 | 26/08/2011 |
| Faecalibacterium prausnitzii SL3/3 | 20x | seq, qual | seq, text | seq | FP929046 | 09/11/2011 |
| Gordonibacter pamelaeae 7-10-1-b | 20x | seq, qual | seq, text | seq | FP929047 | 26/08/2011 |
| Megamonas hypermegale ART12/1 | 27x | seq, qual | seq, text | seq | FP929048 | 26/08/2011 |
| Roseburia faecis CC123 | | | | | | |
| Roseburia faecis 11SE37 | | | | | | |
| Roseburia intestinalis M50/1 | 25x | seq, qual | seq, text | seq | FP929049 | 26/08/2011 |
| Roseburia intestinalis XB6B4 | 34x | seq, qual | seq, text | seq | FP929050 | 26/08/2011 |
| Ruminococcus bromii L2-63 | 26x | seq, qual | seq, text | seq | FP929051 | 26/08/2011 |
| Ruminococcus sp. 18P13 | 26x | seq, qual | seq, text | seq | FP929052 | 26/08/2011 |
| Ruminococcus sp. SR1/5 | 32x | seq, qual | seq, text | seq | FP929053 | 26/08/2011 |
| Ruminococcus obeum A2-162 | 22x | seq, qual | seq, text | seq | FP929054 | 09/11/2011 |
| Ruminococcus torques L2-14 | 27x | seq, qual | seq, text | seq | FP929055 | 26/08/2011 |

Draft genome sequences for uncultured strains

Note that the following draft genomes were derived from whole-genome-amplified DNA. In addition to the potential sequencing errors mentioned above for the cultured strains, these draft genome sequences may therefore also contain errors caused by rearrangements that occurred during genome amplification.

Note: The improved assemblies listed in the table below were produced using both 454 and Illumina reads. Each genome was first assembled with SOAP, and Newbler was then used to create a combined assembly. Contigs were joined to the scaffolds created by Newbler on the basis of overlaps and read-pair information. IMAGE was then run on each genome: it closes positive gaps using Illumina reads that were not assembled, and it also identifies negative gaps so that these can be closed manually. The sequence was then corrected using ICORN; all indels and SNPs were checked, and the changes suggested by ICORN were made where appropriate. Finally, all repeats over 100 bp within the genome were checked to ensure that each was confirmed by at least two spanning read pairs. Any obvious misassemblies were addressed, and repeats that were not confirmed by spanning read pairs were broken apart.

| Strain | Fold coverage | De novo sequence (fasta/qual) | Scaffolds (fasta/text) | Improved assembly | EMBL accession | Last update |
|---|---|---|---|---|---|---|
| Bacteroides dorei D8 | 66x | seq, qual | seq, text | seq | | 26/08/2011 |
| Eubacterium hallii SM6/1 | 22x | seq, qual | seq, text | seq | | 26/08/2011 |
| Synergistetes sp. SGP1 | 47x | seq, qual | seq, text | seq | FP929056 | 26/08/2011 |

Bulk data download

To download MetaHIT data in bulk, please use this ftp link.

Contact

Please address all sequencing enquiries to Dr Keith Turner.
