The Wormpep database is in FASTA format. There is one descriptor line beginning with '>' for each protein, followed by the protein sequence. The format of the descriptor line is outlined in the following example:
>B0041.7 CE17314 WBGene00006961 locus:xnp-1 helicase status:Confirmed TR:O02061 protein_id:AAC24256.1
- B0041.7 is the coding sequence (CDS) identifier. It is made up of the name of the clone from which the gene is partially or wholly derived, in this case the cosmid B0041, followed by a number.
- CE17314 is the Wormpep accession number (beginning with the letters CE, followed by a five-digit number). Every accession number corresponds to one particular protein sequence. Therefore, the same accession number can be associated with several different CDS identifiers.
- WBGene00006961 is the WormBase gene identifier. All C. elegans genes have one identifier per locus, i.e. all splice variants of a gene share the same gene identifier.
- locus:xnp-1 is the locus to which the protein corresponds; the locus name always follows the keyword 'locus' and a colon.
- helicase is a brief annotation of the protein. If you think the annotation for any protein is inappropriate, wrong or misleading, or that it can be improved, please let us know.
- status:Confirmed indicates that the gene encoding this protein has complete EST/mRNA coverage (otherwise the status is 'Predicted').
- TR:O02061 is the TrEMBL accession number for the protein. This number is replaced by the Swiss-Prot accession number (SW:) if the protein has been given one.
- protein_id:AAC24256.1 is the protein_id assigned to the protein by the nucleotide databases. Whenever the sequence changes but still represents the same CDS, the version number (the part after the dot) is incremented by one.
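The field layout described above can be sketched as a small Python parser. This is only an illustration, not an official Wormpep tool: the function name parse_wormpep_header and the dictionary keys are hypothetical, and the sketch assumes that fields are separated by whitespace, that the first three fields are always the CDS identifier, the Wormpep accession and the gene identifier, and that any field without a recognised 'key:' prefix belongs to the brief annotation.

```python
def parse_wormpep_header(line):
    """Split a Wormpep descriptor line into named fields (illustrative sketch)."""
    if not line.startswith(">"):
        raise ValueError("descriptor line must start with '>'")
    tokens = line[1:].split()
    if len(tokens) < 3:
        raise ValueError("expected at least CDS, accession and gene identifier")

    # Assumption: the first three fields are positional.
    record = {
        "cds": tokens[0],
        "accession": tokens[1],
        "gene": tokens[2],
        "locus": None,
        "status": None,
        "trembl": None,
        "swissprot": None,
        "protein_id": None,
    }
    # Prefixes taken from the example descriptor line and the SW: note.
    key_map = {
        "locus": "locus",
        "status": "status",
        "TR": "trembl",
        "SW": "swissprot",
        "protein_id": "protein_id",
    }

    annotation = []
    for token in tokens[3:]:
        key, sep, value = token.partition(":")
        if sep and key in key_map:
            record[key_map[key]] = value
        else:
            # Anything else is treated as part of the brief annotation.
            annotation.append(token)
    record["annotation"] = " ".join(annotation)
    return record


example = (">B0041.7 CE17314 WBGene00006961 locus:xnp-1 helicase "
           "status:Confirmed TR:O02061 protein_id:AAC24256.1")
fields = parse_wormpep_header(example)
```

For the example line, fields["cds"] is 'B0041.7', fields["accession"] is 'CE17314' and fields["annotation"] is 'helicase'. Fields that are absent from a descriptor line are left as None.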
Problem Proteins
If you have any queries or comments about a protein, please contact: worm@sanger.ac.uk
Please provide the Wormpep accession number (see above) in any correspondence.
Note for GCG users: All Wormpep CDS identifiers contain a dot ('.'), which the program fromfasta removes. As a result, different CDSs can end up with the same name; for example, F38E1.11 and F38E11.1 both become F38E111. You should therefore change the dots to some other character that GCG accepts, e.g. '_'.
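One way to make this change before running fromfasta is sketched below in Python. It is a hedged example rather than a supported utility: the function name gcg_safe_header is hypothetical, and the sketch assumes that the CDS identifier is the first field of the descriptor line and is followed by a single space. Only the CDS identifier is rewritten, so dots elsewhere on the line (for instance in the protein_id version) are left alone.

```python
def gcg_safe_header(line, replacement="_"):
    """Replace dots in the CDS identifier of a descriptor line (illustrative sketch)."""
    if not line.startswith(">"):
        # Sequence lines are returned unchanged.
        return line
    # Assumption: the CDS identifier is the first space-delimited field.
    cds, sep, rest = line[1:].partition(" ")
    return ">" + cds.replace(".", replacement) + sep + rest


# The name clash that arises when the dots are simply removed:
assert "F38E1.11".replace(".", "") == "F38E11.1".replace(".", "")

# With an underscore the two identifiers stay distinct:
first = gcg_safe_header(">F38E1.11 example")
second = gcg_safe_header(">F38E11.1 example")
assert first != second
```

Applied line by line to a Wormpep file, this gives identifiers such as F38E1_11 and F38E11_1, which remain distinct.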
