Wormpep database format

The Wormpep database is in FASTA format. Each protein has one descriptor line beginning with '>', followed by the protein sequence. The format of the descriptor line is outlined in the following example:

>B0041.7 CE17314 WBGene00006961 locus:xnp-1 helicase status:Confirmed TR:O02061 protein_id:AAC24256.1

  • B0041.7 is the coding sequence (CDS) identifier. It is made up of the name of the clone from which the gene is partially or wholly derived, in this case the cosmid B0041, followed by a number.
  • CE17314 is the Wormpep accession number (the letters CE followed by a five-digit number). Every accession number corresponds to one particular protein sequence, so when several CDSs encode an identical protein, the same accession number is associated with each of their CDS identifiers.
  • WBGene00006961 is the WormBase gene identifier. All C. elegans genes have one identifier per locus, i.e. all splice variants of a gene share the same gene identifier.
  • locus:xnp-1 gives the locus to which the protein corresponds, here xnp-1. The locus name is always preceded by the prefix 'locus:'.
  • helicase is a brief annotation of the protein. If you think the annotation for any protein is inappropriate, wrong or misleading, or that it can be improved, please let us know.
  • status:Confirmed indicates that the gene encoding this protein has complete EST/mRNA coverage; otherwise the status is 'Predicted'.
  • TR:O02061 is the TrEMBL accession number for the protein. It is replaced by the Swiss-Prot accession number (prefixed SW:) if the protein has been given one.
  • protein_id:AAC24256.1 is the protein_id that the nucleotide databases have assigned to the protein. The number after the dot is the version: every time the sequence changes but still represents the same CDS, the version number is incremented by one.
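
The field layout described above can be split apart programmatically. The following is a minimal sketch only, assuming that fields are separated by whitespace, that the first three tokens are always the CDS identifier, the Wormpep accession number and the gene identifier, and that every remaining token without one of the known prefixes belongs to the free-text annotation. The function name parse_descriptor and the dictionary keys are hypothetical and not part of Wormpep.

```python
# Prefixes of the key:value fields named in the format description above.
# Treating every other token as free-text annotation is an assumption.
KEYED_PREFIXES = ("locus", "status", "TR", "SW", "protein_id")


def parse_descriptor(line):
    """Split one Wormpep descriptor line into a dict of named fields."""
    if not line.startswith(">"):
        raise ValueError("descriptor line must begin with '>'")
    tokens = line[1:].split()
    if len(tokens) < 3:
        raise ValueError("expected CDS, accession and gene identifiers")
    record = {"cds": tokens[0], "accession": tokens[1], "gene": tokens[2]}
    description = []
    for token in tokens[3:]:
        key, sep, value = token.partition(":")
        if sep and key in KEYED_PREFIXES:
            record[key] = value          # e.g. record["status"] = "Confirmed"
        else:
            description.append(token)    # part of the brief annotation
    record["description"] = " ".join(description)
    return record


example = (">B0041.7 CE17314 WBGene00006961 locus:xnp-1 helicase "
           "status:Confirmed TR:O02061 protein_id:AAC24256.1")
fields = parse_descriptor(example)
print(fields["cds"], fields["accession"], fields["locus"], fields["status"])
```

Fields that are absent from a given line (for example SW: when only a TR: accession exists) are simply missing from the returned dictionary, so callers should use fields.get(...) for optional entries.
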
The Wormpep database also contains each of the individual proteins derived from alternatively spliced sequences. Protein sequences from alternatively spliced genes are given a letter suffix, for example R07B1.5A and R07B1.5B, which in this case represent the two alternative transcripts of R07B1.5.
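
To collect the proteins that belong to one alternatively spliced gene, the letter suffix can be separated from the CDS identifier. This is a sketch under the assumption that an identifier consists of a clone name, a dot, a number and at most one trailing letter, as in the examples on this page; the pattern and the function names are illustrative and may not cover every identifier in the database.

```python
import re
from collections import defaultdict

# Assumed shape: clone name, a dot, a number, then an optional single letter
# marking the alternatively spliced form (as in R07B1.5A / R07B1.5B).
CDS_PATTERN = re.compile(r"^(?P<base>[A-Za-z0-9]+\.\d+)(?P<isoform>[A-Za-z]?)$")


def split_isoform(cds_id):
    """Return (base CDS identifier, isoform letter or '')."""
    match = CDS_PATTERN.match(cds_id)
    if match is None:
        raise ValueError(f"unrecognised CDS identifier: {cds_id!r}")
    return match.group("base"), match.group("isoform")


def group_by_cds(cds_ids):
    """Group CDS identifiers that differ only in their isoform letter."""
    groups = defaultdict(list)
    for cds_id in cds_ids:
        base, _ = split_isoform(cds_id)
        groups[base].append(cds_id)
    return dict(groups)


grouped = group_by_cds(["R07B1.5A", "R07B1.5B", "B0041.7"])
print(grouped)
```
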

Problem proteins

If you have any queries or comments about a protein, please contact:

worm@sanger.ac.uk

Please provide the Wormpep accession number (see above) in any correspondence.

Note for GCG users: All Wormpep CDS identifiers contain a dot ('.'), which the program fromfasta removes. Different CDSs can therefore end up with the same name: F38E1.11 and F38E11.1, for example, both become F38E111. To avoid this, change the dots to some other character that GCG accepts, e.g. '_'.
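
The name collision and the suggested workaround can be illustrated with a short sketch. It assumes that only the CDS identifier (the first token of each descriptor line) needs rewriting and that the rest of the line, including the dot in the protein_id field, should stay unchanged. strip_dots merely imitates the dot removal described in the note; it is not the fromfasta program itself, and rewrite_header is a hypothetical helper.

```python
def strip_dots(cds_id):
    """Imitate the dot removal described in the note above."""
    return cds_id.replace(".", "")


def rewrite_header(line, replacement="_"):
    """Replace dots in the CDS identifier of a descriptor line.

    Sequence lines (those not starting with '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    cds_id, sep, rest = line[1:].partition(" ")
    return ">" + cds_id.replace(".", replacement) + sep + rest


ids = ["F38E1.11", "F38E11.1"]
collided = [strip_dots(i) for i in ids]              # both names become identical
safe = [rewrite_header(">" + i)[1:] for i in ids]    # names stay distinct
print(collided, safe)
```

Applied to every line of the Wormpep file before it is passed to GCG, rewrite_header would keep the two identifiers apart as F38E1_11 and F38E11_1.
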

Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK  Tel:+44 (0)1223 834244

Last Modified Fri Jun 3 13:58:51 2011

Genome Research Limited is a charity registered in England with number 1021457
