In addition to sequencing the megabase chromosomes of the T. brucei genome, the Wellcome Trust Sanger Institute and TIGR have both carried out extensive genome survey sequencing.
TIGR has provided 47,000 single-pass reads of randomly selected clones. These derive from the ends of P1 and BAC genomic clones and from genomic DNA clones selected from a sheared whole-genome DNA library of T. brucei TREU927 GUTat 10.1 constructed by TIGR (average insert size 2-3 kb). These reads have proved an immensely useful resource to the research community for gene discovery. The end-sequences of the P1 and BAC clones have also been used in physical mapping.
The Sanger Institute has in turn submitted more than 43,000 genome survey sequences (GSS) from the 2-kb sheared genomic DNA clones constructed by TIGR. These end sequences have since been clustered with ESTs available through public databases, and some preliminary automated analysis has been carried out. The sequences can be obtained from ftp.sanger.ac.uk/pub/databases/T.brucei_sequences/GSS/.
As an aid to the community, all GSS sequences were searched against the Swissprot/TrEMBL databases using BLASTX in February 2002. The summary data are shown below:
Applying a probability cut-off of 1e-10 to the BLAST output:
- 8196 had a hit (~21 percent)
of which, according to their description lines (the following now have html-linked sequences):
- 1095 were probably INGI-related (ORF 1, 2)
- 441 were adenylate cyclases
- 77 were described as ESAG
- 632 were VSGs
- 112 were ribosomal proteins
- 66 were helicases
- 1454 showed similarity to hypothetical proteins
- 4170 did not fall into the above "classes"
- 2025 had no hits at all.
- species-by-species tally of top BLASTX hits
(Note: T. brucei brucei and T. brucei are treated as separate items)
Each of these datasets is available, either by clicking on the above links or from the GSS ftp site. The entire set of Sanger GSS is also available as a FASTA database.
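The tally above can be approximated with a short script. The following is a minimal Python sketch, not the procedure actually used for this page: it assumes the top BLASTX hit for each query is already available as a (query ID, E-value, description) tuple, and the keyword rules in `CLASS_KEYWORDS` are hypothetical, since the page does not say how description lines were matched. Only the 1e-10 cut-off is taken from the text.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # cut-off stated in the text

# Hypothetical keyword rules (assumption): checked in order, first match wins.
CLASS_KEYWORDS = [
    ("INGI-related", ("ingi",)),
    ("adenylate cyclase", ("adenylate cyclase", "adenylyl cyclase")),
    ("ESAG", ("esag", "expression site-associated")),
    ("VSG", ("vsg", "variant surface glycoprotein")),
    ("ribosomal protein", ("ribosomal protein",)),
    ("helicase", ("helicase",)),
    ("hypothetical protein", ("hypothetical",)),
]


def classify(description):
    """Assign a description line to the first class whose keyword it contains."""
    text = description.lower()
    for label, keywords in CLASS_KEYWORDS:
        if any(keyword in text for keyword in keywords):
            return label
    return "other"


def tally_hits(top_hits, cutoff=E_VALUE_CUTOFF):
    """Count classes among top hits whose E-value is at or below the cut-off.

    top_hits: iterable of (query_id, e_value, description), one per query.
    """
    counts = Counter()
    for _query_id, e_value, description in top_hits:
        if e_value <= cutoff:
            counts[classify(description)] += 1
    return counts
```

Because the first matching rule wins, a description that mentions two classes (for example an ESAG that is also an adenylate cyclase) is counted once, under whichever class comes first in the list. The order chosen here is illustrative only.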
GSS and EST clustering
All T. brucei genome survey sequences plus approximately 5,500 EST/mRNA sequences were clustered using the sequence assembly programme phrap. The ESTs were retrieved from EMBL in February 2001 by searching with Trypanosoma brucei as the organism. The set therefore includes EST data generated from different Trypanosoma brucei subspecies and strains. The dataset totalled 96,474 sequences (~45.87Mb). 12,251 contigs were generated, while 8,242 sequences could not be placed in a contig (singletons). The GSS/EST clusters have an estimated coverage of >95% of the T. brucei genome. They are accessible for similarity searching and a summary of top BLAST hits can be viewed here.
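A few derived figures follow directly from the clustering numbers quoted above. The Python sketch below is plain arithmetic on those numbers. It assumes that "Mb" means 10^6 bases and that every sequence is either a singleton or a member of exactly one contig. The function name `cluster_summary` is introduced here for illustration.

```python
def cluster_summary(total_sequences, total_megabases, contigs, singletons):
    """Derive simple summary figures from the reported clustering totals."""
    in_contigs = total_sequences - singletons           # sequences placed in a contig
    mean_per_contig = in_contigs / contigs              # average contig membership
    mean_length_bp = total_megabases * 1e6 / total_sequences  # assumes 1 Mb = 10^6 bases
    return in_contigs, mean_per_contig, mean_length_bp


# Figures quoted in the text.
in_contigs, mean_per_contig, mean_length_bp = cluster_summary(
    total_sequences=96_474,
    total_megabases=45.87,
    contigs=12_251,
    singletons=8_242,
)
print(in_contigs)                 # 88232
print(round(mean_per_contig, 1))  # 7.2
print(round(mean_length_bp))      # 475
```

On these assumptions, 88,232 sequences fall into contigs, giving roughly 7 sequences per contig and a mean sequence length of about 475 bases.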



