A dataset of experimentally confirmed genes for training.
The C. elegans genomic sequence will be completed during 1998. In order
to improve our gene predictions and annotations, we are currently appraising various
gene-finding programs on C. elegans sequence. We anticipate that
these programs could then be installed at the Sanger Institute and Genome Sequencing
Centre and used to assist in our annotation process.
As you can appreciate, one of the more time-consuming parts of developing gene-finding
algorithms is assembling a training set of confirmed genes. To assist, we have made
ready-made training sets available. These were generated as follows:
C. elegans genes that are biologically confirmed were identified.
The genomic sequence for each gene was extracted together with flanking intergenic sequence
equal to half the distance to each neighbouring gene.
The confirmed genomic sequences were then joined into a single large contig, with the orientation
of each gene assigned randomly.
The large contig of genes was then cut into segments of 50 kb. The reasoning was
that this represents a 'real world' problem for the programs, which would typically be run
on cosmid-sized sequences. The entire contig is also made available.
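The construction steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the script actually used to build the datasets. The function names, the 0-based half-open coordinates, the treatment of the ends of the source sequence as if they were neighbouring genes, and the reading of 50 kb as 50,000 bases are all assumptions made for the example.

```python
import random

# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def flanked_windows(genes, seq_len):
    """Extend each gene by half the gap to each neighbour.

    genes: sorted, non-overlapping (start, end) pairs, 0-based half-open.
    Assumption: the start and end of the source sequence are treated
    like neighbouring genes for the first and last gene.
    """
    windows = []
    for i, (start, end) in enumerate(genes):
        prev_end = genes[i - 1][1] if i > 0 else 0
        next_start = genes[i + 1][0] if i + 1 < len(genes) else seq_len
        windows.append((start - (start - prev_end) // 2,
                        end + (next_start - end) // 2))
    return windows


def build_contig(sequence, genes, rng):
    """Join the flanked gene regions, flipping each one at random."""
    pieces = []
    for win_start, win_end in flanked_windows(genes, len(sequence)):
        piece = sequence[win_start:win_end]
        if rng.random() < 0.5:  # random orientation for each gene
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


def cut_segments(contig, size=50_000):
    """Cut the contig into consecutive segments of at most `size` bases."""
    return [contig[i:i + size] for i in range(0, len(contig), size)]
```

Passing a seeded generator such as random.Random(0) as rng makes the random orientations reproducible. The last segment is shorter than 50,000 bases whenever the contig length is not an exact multiple of the segment size.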
The training files are in GFF format.
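For readers who want to load the annotations programmatically, a minimal reader might look like the sketch below. It assumes the usual tab-delimited GFF layout (sequence name, source, feature, start, end, score, strand, frame, and an optional trailing group or attributes field) with 1-based inclusive coordinates. The field names and the sample records in the usage note are illustrative and are not taken from the training files.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the file gives "."
    strand: str
    frame: str
    attributes: str         # empty when the optional last column is absent


def parse_gff(lines: Iterable[str]) -> Iterator[GffRecord]:
    """Yield one record per data line, skipping blank and comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 tab-separated columns: {line!r}")
        yield GffRecord(
            seqname=cols[0],
            source=cols[1],
            feature=cols[2],
            start=int(cols[3]),
            end=int(cols[4]),
            score=None if cols[5] == "." else float(cols[5]),
            strand=cols[6],
            frame=cols[7],
            attributes=cols[8] if len(cols) > 8 else "",
        )
```

For example, list(parse_gff(open(path))) with a hypothetical path returns all records in a file, and filtering on the feature field selects one feature type.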
Contact worm@sanger.ac.uk for more information or if you would like to participate.
Dataset 1 (released 20/11/1997, last revised 13/3/98) containing 271 kb and 65 genes.
Dataset 2 (released 12/1/1998) containing 163 kb and 20 genes (1 of which displays alternative splicing).
Dataset 3 (released 30/1/1998, last revised 20/02/98) containing 128 kb and 27 genes (2 of which display alternative splicing).
Splice sites
We have a dataset of 8,192 splice site pairs which have been confirmed using EST data. The
file consists of pairs of 5' and 3' intron splice sites, each given with 50 bp
of sequence before and after the splice site.
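To make the 50 bp context concrete, the sketch below cuts such windows out of a genomic sequence given the coordinates of one intron. It is only an illustration of the idea and does not describe the layout of the splice site file. The function name and the 0-based half-open intron coordinates are assumptions.

```python
FLANK = 50  # bases kept on each side of a splice site


def splice_site_windows(sequence, intron_start, intron_end, flank=FLANK):
    """Return (donor_window, acceptor_window) for one intron.

    intron_start, intron_end: 0-based half-open coordinates of the intron,
    so sequence[intron_start:intron_end] is the intron itself.
    Each window holds `flank` bases before and `flank` bases after the
    exon/intron boundary, giving 2 * flank bases in total.
    """
    if intron_start >= intron_end:
        raise ValueError("intron_start must be less than intron_end")
    if intron_start < flank or intron_end + flank > len(sequence):
        raise ValueError("not enough flanking sequence around the intron")
    # 5' (donor) site: the boundary lies at intron_start.
    donor = sequence[intron_start - flank:intron_start + flank]
    # 3' (acceptor) site: the boundary lies at intron_end.
    acceptor = sequence[intron_end - flank:intron_end + flank]
    return donor, acceptor
```

With this layout the first two intron bases sit at positions 50 and 51 of the donor window and the last two at positions 48 and 49 of the acceptor window. For introns shorter than 50 bp, each window runs past the intron into the adjacent exon.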
Intronic Sequences
We also have a dataset of 8,192 confirmed and complete intronic sequences (confirmed by EST data). These should be useful
for calculating length distributions, base biases, etc.
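As a rough illustration of such calculations, the sketch below tabulates a binned length distribution and overall base frequencies from a list of intron sequences held as plain strings. The function names and the 10 bp bin size are arbitrary choices for the example, and reading the sequences from the dataset file is left out because its layout is not described here.

```python
from collections import Counter


def length_distribution(introns, bin_size=10):
    """Count introns per length bin; keys are the lower edge of each bin."""
    counts = Counter((len(seq) // bin_size) * bin_size for seq in introns)
    return dict(sorted(counts.items()))


def base_frequencies(introns):
    """Return the fraction of each base over all introns combined."""
    counts = Counter()
    for seq in introns:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in sorted(counts)}
```

Position-specific biases, for example near the splice sites, would need per-position counts instead of the pooled counts used here.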