These are notes associated with the paper "Genome Sequence of the Nematode Caenorhabditis elegans: A Platform for Investigating Biology", The C. elegans Sequencing Consortium, Science (1998) 282:2012-2018 (for a full list of authors, see below).
This page and the data resources that it links to will be maintained on the Sanger Institute web site under http://www.sanger.ac.uk/Projects/C_elegans/Science98/ at least until the end of 2001. The resources are provided for archival purposes: they reflect the state of the sequence and annotation during 1998, when the Science papers were being written. For a current view of the sequence and annotation please see our main C. elegans web page.
Many data sets are compressed with the Unix "gzip" or PC "zip" program, giving a file ending of ".gz" or ".zip" respectively. If your browser does not uncompress these files automatically on download, the files should be saved to disk and then uncompressed with an appropriate utility; most PC and Mac compression packages, such as Winzip and Stuffit, can uncompress Unix ".gz" files. When possible, a link to an uncompressed version has also been provided.
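For readers who prefer to script the decompression step, the following is a minimal sketch using Python's standard gzip module. The file name I.dna.gz in the usage note is an assumption made for illustration; substitute whatever name the downloaded file actually has.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst=None):
    """Decompress a gzip file and return the path of the output file."""
    src = Path(src)
    if dst is None:
        # Default output name: strip the trailing ".gz" (x.dna.gz -> x.dna).
        if src.suffix != ".gz":
            raise ValueError("expected a file name ending in .gz")
        dst = src.with_suffix("")
    dst = Path(dst)
    # Stream the data so that large chromosome files are not held in memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Calling gunzip_file("I.dna.gz") would write the uncompressed data to I.dna in the same directory. The ".zip" archives need a different reader (for example Python's zipfile module), since zip and gzip are distinct formats.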
Sequence annotation is an inexact science, and while the gene predictions reflect our best efforts we know that many of them will turn out to be wrong in places. There are also inevitably errors in the sequence itself, although we believe these are at a very low level. If you find errors, or have other corrections to our annotation, please email us at wormquery@sanger.ac.uk. We will acknowledge you, correct our master database, and from there correct the corresponding entries in the public databases.
The Protein Datasets
The analysis of the C. elegans protein data sets in this study preceded the completion of the genomic sequence. Three protein sets were made available to the contributing authors.
June 1998 (16626 proteins) in zip or gzip format (also available via ftp).
August 1998 (18581 proteins) in zip or gzip format (also available via ftp).
October 1998 (19099 proteins) in zip or gzip format (also available via ftp).
In addition, some authors used the Wormpep option on the C. elegans BLAST server, which at the time contained 18452 proteins (available in zip or gzip format, or by ftp).
For these data sets to be as representative of the whole genome as possible we also included conceptual protein translations from genes predicted in the unfinished but contiguous sequence data. These are preliminary gene predictions produced using GENEFINDER (build version 1998/06/02) [Green et al, unpublished] and have had no manual inspection or editing. They have identifiers ending .[letter] (see below).
The authors also had access to the Wormpep database, which includes only protein translations from genes that have passed human review and been submitted to EMBL/GenBank.
Nomenclature within the protein data sets
Genes identified by the C. elegans sequencing project are given a unique identifier based on the name of the clone containing (at least a part of) them, followed by a dot and then an additional number and/or letter. These identifiers are stable, in that when gene predictions are changed due to new evidence, the same identifier is used for the new version.
Genes that have been subjected to human review, and whose predictions have been consolidated with other available biological information (e.g. EST sequences and protein homologies), have a [clone].[number] nomenclature. Where multiple proteins are derived from alternatively spliced transcripts of a single gene, each protein translation is designated with a further letter, e.g. B0399.2a, B0399.2b etc.
Preliminary gene predictions can be identified by their [clone].[letter] nomenclature e.g. ZK1086.c. In the case of preliminary gene predictions the identifiers are temporary and are lost when the gene is manually reviewed.
Proteins in Wormpep have an identifier corresponding to the gene identifier, and an accession number that is unique for the literal sequence, so when a gene structure is changed the identifier remains the same, but the accession number changes. Two different identifiers can share the same accession number if the sequence is identical, e.g. some histone proteins.
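The naming rules above can be sketched as a small classifier that separates reviewed genes ([clone].[number], optionally followed by an isoform letter) from preliminary predictions ([clone].[letter]). This is an illustrative sketch only: the function name is hypothetical, and the assumption that clone names consist solely of letters, digits and underscores is ours, not a statement from the project.

```python
import re

# Assumption: clone names contain only letters, digits and underscores.
_REVIEWED = re.compile(r"^(?P<clone>[A-Za-z0-9_]+)\.(?P<number>\d+)(?P<isoform>[a-z])?$")
_PRELIMINARY = re.compile(r"^(?P<clone>[A-Za-z0-9_]+)\.(?P<letter>[a-z])$")


def classify_identifier(identifier):
    """Classify a protein identifier as 'reviewed', 'preliminary' or 'unknown'."""
    match = _REVIEWED.match(identifier)
    if match:
        clone = match.group("clone")
        return {
            "status": "reviewed",
            "clone": clone,
            # The gene identifier is the part shared by all spliced isoforms.
            "gene": clone + "." + match.group("number"),
            "isoform": match.group("isoform"),  # None when there is no letter
        }
    match = _PRELIMINARY.match(identifier)
    if match:
        return {
            "status": "preliminary",
            "clone": match.group("clone"),
            "gene": identifier,
            "isoform": None,
        }
    return {"status": "unknown", "clone": None, "gene": None, "isoform": None}
```

With the examples given in the text, B0399.2a and B0399.2b are both classified as reviewed isoforms of gene B0399.2, while ZK1086.c is classified as a preliminary prediction.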
The DNA sequence
The assembled DNA sequence for the six chromosomes is available below. These sequences were used for the various chromosomal analyses and plots presented in the paper. There are also associated GFF format files which describe the genomic features of the chromosomal sequences, including the predicted intron/exon structures, repeat information etc. A description of GFF format is available here. These DNA sequences and annotation also form the basis of the October authors' protein set.
The Chromosome DNA files
These are the compressed FASTA files for the six chromosomes, each containing a single DNA sequence. The sequences are a composite of finished and unfinished sequence material, with gaps represented by runs of consecutive N's of nominal length.
GZIP format: I.dna, II.dna, III.dna, IV.dna, V.dna, X.dna
ZIP format: I.dna, II.dna, III.dna, IV.dna, V.dna, X.dna
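Because gaps in the chromosome files are written as runs of N's, they can be located with a simple scan. The sketch below reads a single-record FASTA file and reports each run; the function names are hypothetical, and the reader assumes an uncompressed file with one sequence, as described above.

```python
import re


def read_single_fasta(lines):
    """Return (header, sequence) from FASTA lines holding exactly one record."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("more than one sequence in input")
            header = line[1:]
        else:
            chunks.append(line)
    return header, "".join(chunks)


def find_gaps(sequence):
    """Return 0-based, half-open (start, end) spans of consecutive N's."""
    return [match.span() for match in re.finditer(r"[Nn]+", sequence)]
```

A typical use would be to open the uncompressed file, pass the file object to read_single_fasta, and then call find_gaps on the returned sequence. Since the gap lengths are nominal, the span lengths should not be read as estimates of the true gap sizes.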
Corresponding GFF files
The following compressed files give, for each of the above chromosomal sequences, all the annotation information used for the paper, including the predicted intron/exon structures, repeat information etc. A description of GFF format is available here.
GZIP format: I.gff, II.gff, III.gff, IV.gff, V.gff, X.gff
ZIP format: I.gff, II.gff, III.gff, IV.gff, V.gff, X.gff
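As a rough illustration of how the GFF files can be consumed, the sketch below parses tab-delimited GFF lines into named fields and counts features by type. It assumes the standard GFF column order (seqname, source, feature, start, end, score, strand, frame, then an optional group field); the function names are hypothetical, and the exact source and feature labels used in these files should be checked against the files themselves.

```python
from collections import Counter, namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame group",
)


def parse_gff_line(line):
    """Parse one tab-delimited GFF line; return None for comments and blanks."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    # The ninth (group) field is optional.
    group = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7], group,
    )


def count_features(lines):
    """Count GFF records by the value of their feature column."""
    counts = Counter()
    for line in lines:
        record = parse_gff_line(line)
        if record is not None:
            counts[record.feature] += 1
    return counts
```

Passing an open, uncompressed GFF file to count_features gives a quick summary of which feature types a chromosome file contains before any more detailed analysis.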
Resources used for specific analyses in the Genome Consortium paper
The protein data set was the October set described above, and the DNA sequence and positional annotation used were as in the previous section.
Cross-species comparison
The derivation of each organismal set of proteins:-
Yeast - Proteins were derived from the ORF set maintained in the Saccharomyces Genome Database. The actual protein set used is available in gzip or zip format.
Human - Proteins used were the human proteins present in SwissProt version 36. The actual protein set used is available in gzip or zip format. However, RL41_HUMAN could not be used as a search query because it was too short (25 aa). The size of the searched set was therefore 4979.
E.coli - Proteins used were derived from the set maintained at the NCBI Entrez genomes division. The actual protein set used is available in gzip or zip format.
The wublastp parameters used were
B=1 E=1e-3 -filter seg
Resources used by companion papers
Neurobiology of the Caenorhabditis elegans Genome, Cornelia I. Bargmann, Science 282:2028-2033. Methods and results for searches. The blast server protein data set was used.
Caenorhabditis elegans Is a Nematode, Mark Blaxter, Science 282:2041-2046. Notes on methods used, and further resources available. The Wormpep 14 protein data set was used.
Comparison of the Complete Protein Sets of Worm and Yeast: Orthology and Divergence, Stephen A. Chervitz, L. Aravind, Gavin Sherlock, Catherine A. Ball, Eugene V. Koonin, Selina S. Dwight, Midori A. Harris, Kara Dolinski, Scott Mohr, Temple Smith, Shuai Weng, J. Michael Cherry, and David Botstein, Science 282:2022-2028. Notes on methods used, and further resources available. The October protein data set was used.
Zinc Fingers in Caenorhabditis elegans: Finding Families and Probing Pathways, Neil D. Clarke and Jeremy M. Berg, Science 282:2018-2022. The June protein data set was used. Further information and data are available.
The Taxonomy of Developmental Control in Caenorhabditis elegans, Gary Ruvkun and Oliver Hobert, Science 282:2033-2041. Methods used. The blast server protein data set was used.
Gene Prediction and Standard Analysis in the C. elegans Genome Project
The C. elegans genomic data have been produced primarily as a resource for experimental biologists and have been under active curation for this purpose for many years. Our understanding of metazoan genomes is far from complete, and it would be naive to expect that we could produce a complete set of correct gene translations at this point. We anticipate that refinement will continue for many years. The current gene predictions were made using the best tools and biological information available at the time. In many cases improvements were incorporated into the analysis process even though it was not feasible to apply them retrospectively and update previous work.
It is also important to note that we have actively solicited corrections to the sequence annotation from the scientific community. In many ways, the gene predictions can be considered to have been under peer review by the scientific community. Sequences that have been in the public domain for many years will have had the long-term benefit of this process.
An overview of the annotation process and the tools employed at the time the Science paper was written is shown below:-
GENEFINDER
Ab-initio gene prediction. [Green et al. unpublished, phg@u.washington.edu]
The command line used was:-
genefinder -tablenamefile tablefile -intronPenalty intron_penalty.lookup -exonPenalty exon_penalty.lookup sequence_file.fasta
The tables given in tablefile are contained in the compressed Unix tar file nemtables.tar.gz.
POSTWISE
Gene prediction based on protein homology [Birney E. (1997). ISMB,5,56.]
The command line used was:-
postwise -silent -ace -gene worm.gf sequence.fasta exblx_file
tRNASCAN-SE
Transfer RNA prediction [Lowe, T.M. and Eddy, S.R. (1997). Nucl. Acids Res.,25,955.]
The command line used was:-
tRNAscan-SE -a -q sequence.fasta
Version used was tRNAscan-SE 1.11 (Nov 97)
INV
Inverted Repeat Detection [R. Durbin unpub. available from http://www.sanger.ac.uk/Software]
TAN
Tandem Repeat Detection [R. Durbin unpub. available from http://www.sanger.ac.uk/Software]
POLY
Tandem Repeat Detection [R. Durbin unpub. available from http://www.sanger.ac.uk/Software]
MSPcrunch
Blast Post Processor [Sonnhammer, E.L.L. and Durbin R. (1994). J. Comp. Biol., 2,9.]
Version used was Version 2.1, compiled Jun 18 1997.
BLASTX
Six frame translation and comparison to protein database [Altschul et al. (1990). J. Mol. Biol., 215,403.]
The command line used was:-
blastx swir sequence.fasta B=1000000 -span1 M=BLOSUM62-12 V=0 H=0
Version used was BLASTX 1.4.6 [16-Oct-94] [Build 00:04:26 Oct 20 1994]
TBLASTX
DNA vs DNA comparisons at the protein level. [Altschul et al. (1990). J. Mol. Biol., 215,403.]
Version used was TBLASTX 1.4.7 [16-Oct-94] [Build 00:14:27 Oct 20 1994]
EST_genome
Alignment of EST sequences to Genomic DNA [Mott, R. (1997). CABIOS,13,477.]
To reduce the number of candidate ESTs to align to genomic sequences using EST_genome, EST sequences were pre-filtered using BLASTN and MSPcrunch. The command line for this operation is given by:-
blastn est_database sequence.fasta B=1000000 | MSPcrunch -l 0 -
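The pre-filter pipeline above could be driven from a script along the following lines. This is a sketch under stated assumptions: it presumes that blastn and MSPcrunch binaries of the vintage described here are on the PATH and accept exactly the arguments shown in the command line above (current BLAST+ releases use a different argument syntax), and the function names are hypothetical.

```python
import subprocess


def build_prefilter_commands(est_database, sequence_fasta):
    """Return the two argument lists for the blastn | MSPcrunch pipeline."""
    blastn_cmd = ["blastn", est_database, sequence_fasta, "B=1000000"]
    mspcrunch_cmd = ["MSPcrunch", "-l", "0", "-"]
    return blastn_cmd, mspcrunch_cmd


def run_prefilter(est_database, sequence_fasta):
    """Run blastn, pipe its output into MSPcrunch and return the text output."""
    blastn_cmd, mspcrunch_cmd = build_prefilter_commands(est_database, sequence_fasta)
    blastn = subprocess.Popen(blastn_cmd, stdout=subprocess.PIPE)
    mspcrunch = subprocess.Popen(
        mspcrunch_cmd, stdin=blastn.stdout, stdout=subprocess.PIPE, text=True
    )
    # Close our copy of the pipe so blastn is signalled if MSPcrunch exits early.
    blastn.stdout.close()
    output, _ = mspcrunch.communicate()
    blastn.wait()
    if blastn.returncode != 0 or mspcrunch.returncode != 0:
        raise RuntimeError("pre-filter pipeline failed")
    return output
```

The MSPcrunch output returned by run_prefilter would then be used to choose which ESTs are passed on to EST_genome for full alignment.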



