Resources for Genefinding in C. elegans

A dataset of experimentally confirmed genes for training.

The C. elegans genomic sequence is due to be completed during 1998. To improve our gene predictions and annotations, we are currently appraising various gene-finding programs on C. elegans. We anticipate that these programs could then be installed at the Sanger Institute and the Genome Sequencing Centre and used to assist our annotation process.

As you can appreciate, one of the more time-consuming parts of developing gene-finding algorithms is assembling a training set of confirmed genes. To assist with this, we have ready-made training sets available. These have been generated by:

  • Identifying C. elegans genes that are biologically confirmed.
  • Extracting the genomic sequence for each gene, with flanking intergenic sequence on each side equal to half the distance to the neighbouring gene.
  • Joining the confirmed genomic sequences into a single large contig, with the orientation of each gene assigned randomly.
  • Cutting the large contig into segments of 50 kb. The reasoning is that this represents a 'real world' problem for the programs, which would typically be run on cosmid-sized sequences. The entire contig is also made available.
  • The training files are in GFF format.

    Contact worm@sanger.ac.uk for more information or if you would like to participate.
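
    The construction steps listed above can be illustrated with a minimal Python sketch. It is not the code used to build the datasets: the function names, the 0-based half-open coordinates, the integer rounding of the half-distance, and the 50,000-base segment size are all assumptions made for illustration.

```python
import random

# Hypothetical helpers illustrating the steps described above;
# names and coordinate conventions are assumptions, not the original pipeline.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def flanked_region(gene_start, gene_end, prev_gene_end, next_gene_start):
    """Extend a gene by half the intergenic distance to each neighbour.

    Coordinates are 0-based, half-open. Integer division is an assumption
    about how odd distances would be handled.
    """
    left = gene_start - (gene_start - prev_gene_end) // 2
    right = gene_end + (next_gene_start - gene_end) // 2
    return left, right

def build_contig(region_seqs, rng):
    """Join regions into one contig, flipping each with probability 0.5."""
    parts = []
    for seq in region_seqs:
        if rng.random() < 0.5:
            seq = reverse_complement(seq)
        parts.append(seq)
    return "".join(parts)

def cut_segments(contig, size=50_000):
    """Cut the contig into consecutive segments; the last may be shorter."""
    return [contig[i:i + size] for i in range(0, len(contig), size)]
```

    Passing a seeded `random.Random` instance as `rng` makes the orientation assignment reproducible, which helps when comparing programs on exactly the same contig.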

  • Dataset 1 (released 20/11/1997, last revised 13/3/98) containing 271 kb and 65 genes.
  • Dataset 2 (released 12/1/1998) containing 163 kb and 20 genes (1 of which displays alternative splicing).
  • Dataset 3 (released 30/1/1998, last revised 20/02/98) containing 128 kb and 27 genes (2 of which display alternative splicing).
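
    Because the training files are in GFF format, a small reader is often the first thing a gene-finding program needs. The sketch below assumes the conventional tab-separated GFF layout (seqname, source, feature, start, end, score, strand, frame, optional group field); the exact feature names and group contents used in these datasets are not stated here, so the example line in the usage note is invented.

```python
from collections import namedtuple

# Assumed column layout: the conventional tab-separated GFF fields.
GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame group",
)

def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and '#' comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    group = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),  # start and end are 1-based, inclusive
        fields[5], fields[6], fields[7], group,
    )

def read_gff(lines):
    """Return all records from an iterable of lines, skipping comments."""
    records = (parse_gff_line(line) for line in lines)
    return [rec for rec in records if rec is not None]
```

    For example, `parse_gff_line("seg1\tcurated\texon\t101\t250\t.\t+\t0\tgeneA")` yields a record with `feature == "exon"` and `start == 101`; the values in that line are hypothetical.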

    Splice sites

    We have a dataset of 8,192 splice site pairs which have been confirmed using EST data. The file consists of pairs of 5' and 3' intron splice sites, each given with 50 bp of sequence before and after the splice site.
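
    To make the windowing concrete, the sketch below shows how a pair of 50 bp flanking windows could be taken around the two ends of an intron in a genomic sequence. The function name, the 0-based half-open intron coordinates, and the toy sequence in the usage note are assumptions for illustration; the layout of the distributed file itself is not described here.

```python
def splice_site_windows(seq, intron_start, intron_end, flank=50):
    """Return (five_prime, three_prime) windows around an intron's ends.

    The intron is seq[intron_start:intron_end] (0-based, half-open).
    Each window holds `flank` bases before and `flank` bases after the
    splice site, so the site sits at offset `flank` in each window.
    """
    if intron_start < flank or intron_end + flank > len(seq):
        raise ValueError("not enough flanking sequence for the window")
    five_prime = seq[intron_start - flank:intron_start + flank]
    three_prime = seq[intron_end - flank:intron_end + flank]
    return five_prime, three_prime
```

    With a toy sequence built as `"A" * 60 + "GT" + "C" * 20 + "AG" + "T" * 60` and the intron at positions 60 to 84, each window is 100 bases long; the 5' window has `GT` immediately after its midpoint and the 3' window has `AG` immediately before its midpoint.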

    Intronic Sequences

    We also have a dataset of 8,192 confirmed and complete intronic sequences (confirmed by EST data). These should be useful for calculating length distributions, base biases, etc.
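
    As a rough illustration of the statistics mentioned above, the following sketch computes a binned length distribution and overall base frequencies from a list of intron sequences. The function names and the 10-base bin width are assumptions, and the input is taken to be plain sequence strings already read from the file.

```python
from collections import Counter

def length_histogram(introns, bin_size=10):
    """Count introns per length bin; keys are the lower edge of each bin."""
    hist = Counter((len(seq) // bin_size) * bin_size for seq in introns)
    return dict(sorted(hist.items()))

def base_frequencies(introns):
    """Return the fraction of each base across all intron sequences."""
    counts = Counter()
    for seq in introns:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: counts[base] / total for base in sorted(counts)}
```

    The same `base_frequencies` call can be applied to fixed slices of each intron (for example the first or last few bases) to look at position-specific bias near the splice sites.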

    Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK  Tel:+44 (0)1223 834244

    Last Modified Wed Oct 15 14:47:23 2003