Overview of Data for Sequence Similarity Searches
Data types available for searching
- Sequence reads
These are the individual sequence reads, generally 500-600 bp in
length. Each read name consists of the chromosome of origin,
followed by the shotgun clone id and either "q1c" or "p1c" to
indicate whether the forward or reverse primer was used.
- Contig sequences
- Contig sequences represent secondary sequence data, in that
they are the condensation of a number of shotgun reads. Contigs
reflect the finished sequence data more reliably because the
depth of coverage of assembled shotgun reads ensures that the
majority of ambiguities are identified and at least partially
resolved. Contigs may nevertheless contain insertion and/or
deletion events, usually as a consequence of the algorithm used
to create the consensus. Please note that the Sanger Institute
is currently unable to track contigs through assembly, and
contig ids will therefore change.
- Individual contig sequences which are highlighted by Blast
analysis can be retrieved by following the 'Sequence' link in the
returned HTML page.
- If contigs have already been submitted to the HTG section of the
public databases, a link will take you there.
- Only contigs greater than 2 kb are present in the Blast
searchable dataset.
- Annotated genes and proteins
Annotated and curated tRNA, snRNA, rRNA and protein-coding genes and
pseudogenes on manually annotated chromosomes/contigs, available
through GeneDB.
- Automatically predicted genes and peptides
Automatic predictions and analyses of open reading frames and putative
protein products, available through GeneDB.
- EMBL
All available data submitted to the public databases with
T. brucei listed as the organism.
- GSS
See here for further detail
- GSS/EST clusters
See here for further detail
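As an illustration of the 2 kb cutoff mentioned above for the Blast
searchable contig dataset, the following is a minimal sketch of how
such a length filter could be applied to a FASTA file. It is not the
Sanger Institute's own procedure: the function names are hypothetical,
and "greater than 2 kb" is assumed here to mean strictly more than
2000 bp.

```python
from typing import Iterable, Iterator, List, Tuple

# Assumption: "greater than 2 kb" is read as strictly more than 2000 bp.
MIN_CONTIG_LENGTH = 2000


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def searchable_contigs(lines: Iterable[str],
                       min_length: int = MIN_CONTIG_LENGTH) -> List[Tuple[str, str]]:
    """Keep only contigs strictly longer than min_length bases."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) > min_length]
```

A contig of exactly 2000 bp is dropped under this reading of the
cutoff; pass a different min_length if another threshold is intended.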
Searching data
- Searching the contig database
is the most direct method of searching for a gene. This should be
your initial dataset to search: use BlastN with a homologous DNA
query to identify an exact match, and TBlastX with paralogous DNA
or DNA from a related species to identify weaker matches. Finally,
use TBlastN with a peptide query to match proteins back to the
genomic DNA.
- Searching data available through GeneDB
will allow you to see whether your sequence of interest has
already been annotated and if so, how.
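The choice of Blast program for the contig database described above
can be summarised as a small lookup. This is only a sketch: the
function name and the query-type labels are hypothetical and are used
here purely to restate the guidance in code.

```python
# Hypothetical labels for the kinds of query described in the text,
# mapped to the Blast program suggested for each.
PROGRAM_FOR_QUERY = {
    "homologous_dna": "blastn",        # exact match expected
    "paralogous_dna": "tblastx",       # weaker, translated DNA match
    "related_species_dna": "tblastx",  # weaker, translated DNA match
    "peptide": "tblastn",              # protein matched back to genomic DNA
}


def choose_blast_program(query_type: str) -> str:
    """Return the Blast program suggested for a given kind of query."""
    try:
        return PROGRAM_FOR_QUERY[query_type]
    except KeyError:
        known = ", ".join(sorted(PROGRAM_FOR_QUERY))
        raise ValueError(
            f"unknown query type {query_type!r}; expected one of: {known}"
        )
```

For example, choose_blast_program("peptide") returns "tblastn".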
Tips when searching data
- When interpreting the Blast output, remember which query and Blast
program you used. BlastN results should essentially be clear-cut
(you are looking for an exact match with a %identity in the high
90s). Search algorithms such as TBlastX and TBlastN match
similarities at the peptide level, so matches are likely to be
confined to conserved regions of coding exons. Look for co-linear
matches along the contig sequence.
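The two checks in the tip above (a high %identity for BlastN, and
co-linear matches for TBlastX/TBlastN) can be sketched as follows. The
sketch assumes the hits have been saved in the 12-column tab-separated
format produced by NCBI BLAST+ with "-outfmt 6"; the search pages
described here return HTML, so this is an assumption about how you
obtained the results. The 95% cutoff stands in for "high 90s" and is
also an assumption, as are all function names. Strand consistency is
not checked.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    pident: float  # percent identity
    qstart: int
    qend: int
    sstart: int
    send: int


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated hits: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]),
                        int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return hits


def near_exact(hits: Iterable[Hit], min_identity: float = 95.0) -> List[Hit]:
    """Keep hits at or above the assumed 'high 90s' identity cutoff."""
    return [h for h in hits if h.pident >= min_identity]


def is_colinear(hits: Iterable[Hit]) -> bool:
    """True if, ordered along the query, the matches also run in a
    single direction (all ascending or all descending) along the contig."""
    ordered = sorted(hits, key=lambda h: h.qstart)
    starts = [min(h.sstart, h.send) for h in ordered]
    pairs = list(zip(starts, starts[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


def colinear_subjects(hits: Iterable[Hit]) -> Dict[str, bool]:
    """Group hits by contig and report whether each group is co-linear."""
    by_subject = defaultdict(list)
    for h in hits:
        by_subject[h.subject].append(h)
    return {subject: is_colinear(group) for subject, group in by_subject.items()}
```

A contig with a single hit is trivially reported as co-linear, so this
check is only informative when several matches fall on the same contig.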