Notes associated with the 1998 Science Paper

These are notes associated with the paper "Genome Sequence of the Nematode Caenorhabditis elegans. A Platform for Investigating Biology", The C. elegans Sequencing Consortium, Science (1998) 282:2012-2018. (for a full list of authors see below).

This page and the data resources that it links to will be maintained on the Sanger Institute web site under http://www.sanger.ac.uk/Projects/C_elegans/Science98/ at least until the end of 2001. The resources are provided for archival purposes: they reflect the state of the sequence and annotation during 1998, when the Science papers were being written. For a current view of the sequence and annotation please see our main C. elegans web page.

Many data sets are compressed with the Unix "gzip" or PC "zip" program, giving a file ending of ".gz" or ".zip" respectively. If your browser does not uncompress these files automatically on download, the files should be saved to disk and then uncompressed with an appropriate utility; most PC and Mac compression packages, such as Winzip and Stuffit, can uncompress Unix ".gz" files. When possible, a link to an uncompressed version has also been provided.
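
For readers who would rather script the decompression than use a desktop utility, the following is a minimal sketch using Python's standard gzip module. The function name gunzip_file is ours, not part of any resource on this page, and the sketch assumes the downloaded file name ends in ".gz".

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gunzip_file(src: Path, dest: Optional[Path] = None) -> Path:
    """Uncompress a .gz file; by default write next to it without the .gz suffix."""
    src = Path(src)
    if dest is None:
        if src.suffix != ".gz":
            raise ValueError(f"expected a .gz file, got {src.name}")
        dest = src.with_suffix("")  # e.g. I.dna.gz -> I.dna
    # Stream the data so that large chromosome files are never held in memory.
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return Path(dest)
```

The ".zip" versions can be handled in a similar way with the standard zipfile module.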

Sequence annotation is an inexact science, and while the gene predictions reflect our best efforts we know that many of them will turn out to be wrong in places. There are also inevitably errors in the sequence itself, although we believe these are at a very low level. If you find errors, or have other corrections to our annotation, please email us at wormquery@sanger.ac.uk. We will acknowledge you, correct our master database, and from there correct the corresponding entries in the public databases.

The Protein Datasets

The analysis of the C. elegans protein data sets in this study preceded the completion of the genomic sequence. Three protein sets were made available to the contributing authors.

  • June 1998 (16626 proteins) in zip or gzip format. (also available via ftp)
  • August 1998 (18581 proteins) in zip or gzip format. (also available via ftp)
  • October 1998 (19099 proteins) in zip or gzip format. (also available via ftp)
  • In addition, some authors used the Wormpep option on the C. elegans blast server, which at the time contained 18452 proteins (available in zip or gzip format, or by ftp).

    For these data sets to be as representative of the whole genome as possible we also included conceptual protein translations from genes predicted in the unfinished but contiguous sequence data. These are preliminary gene predictions produced using GENEFINDER (build version 1998/06/02) [Green et al, unpublished] and have had no manual inspection or editing. They have identifiers ending .[letter] (see below).

    The authors also had access to the WormPep database which only includes protein translations from genes passed by human review and submitted to EMBL/Genbank.

    Nomenclature within the protein data sets

    Genes identified by the C. elegans sequencing project are given a unique identifier based on the name of the clone containing (at least a part of) them, followed by a dot then an additional number and/or letter. These identifiers are stable, in that when gene predictions are changed due to new evidence, the same identifier is used for the new version.

    Genes that have been subjected to human review, and whose predictions have been consolidated with other available biological information (e.g. EST sequences and protein homologies), have a [clone].[number] nomenclature. Where multiple proteins are derived from alternatively spliced transcripts of a single gene, each protein translation is designated with a further letter, e.g. B0399.2a, B0399.2b etc.

    Preliminary gene predictions can be identified by their [clone].[letter] nomenclature e.g. ZK1086.c. In the case of preliminary gene predictions the identifiers are temporary and are lost when the gene is manually reviewed.
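
The two naming patterns described above can be told apart mechanically. The following is a minimal sketch, not a tool used by the project. It assumes that clone names are purely alphanumeric and that the letters are lower case, which matches the examples on this page but is not stated as a rule. The names GeneId and classify are ours.

```python
import re
from typing import NamedTuple, Optional

# [clone].[number] with an optional isoform letter, e.g. B0399.2a
# (assumption: clone names contain only letters and digits)
_REVIEWED = re.compile(r"^([A-Za-z0-9]+)\.(\d+)([a-z]?)$")
# [clone].[letter], e.g. ZK1086.c
_PRELIMINARY = re.compile(r"^([A-Za-z0-9]+)\.([a-z])$")


class GeneId(NamedTuple):
    clone: str              # clone the gene was named after
    status: str             # "reviewed" or "preliminary"
    gene: str               # identifier without any isoform letter
    isoform: Optional[str]  # isoform letter, or None


def classify(identifier: str) -> GeneId:
    """Split an identifier into its parts and report its review status."""
    match = _REVIEWED.match(identifier)
    if match:
        clone, number, letter = match.groups()
        return GeneId(clone, "reviewed", f"{clone}.{number}", letter or None)
    match = _PRELIMINARY.match(identifier)
    if match:
        clone, _letter = match.groups()
        return GeneId(clone, "preliminary", identifier, None)
    raise ValueError(f"unrecognised identifier: {identifier!r}")
```

With this sketch, classify("B0399.2a") reports a reviewed gene B0399.2 with isoform "a", and classify("ZK1086.c") reports a preliminary prediction.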

    Proteins in Wormpep have an identifier corresponding to the gene identifier, and an accession number that is unique to the literal sequence. When a gene structure is changed, the identifier therefore remains the same but the accession number changes. Two different identifiers can share the same accession number if their sequences are identical, e.g. some histone proteins.
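
The relationship between identifiers and accession numbers can be illustrated with a toy model. This is a sketch of the idea only, not how Wormpep is built. The class name, the "ACC" prefix, the numbering scheme, and the identifiers and sequences in the usage note are all hypothetical.

```python
from typing import Dict


class AccessionRegistry:
    """Toy model: one accession number per distinct literal sequence."""

    def __init__(self, prefix: str = "ACC") -> None:
        self._prefix = prefix                    # hypothetical accession prefix
        self._by_sequence: Dict[str, str] = {}   # sequence -> accession
        self.current: Dict[str, str] = {}        # identifier -> current accession

    def record(self, identifier: str, sequence: str) -> str:
        """Register the current sequence for an identifier; return its accession."""
        accession = self._by_sequence.get(sequence)
        if accession is None:
            # A sequence not seen before gets a fresh accession number.
            accession = f"{self._prefix}{len(self._by_sequence) + 1:05d}"
            self._by_sequence[sequence] = accession
        self.current[identifier] = accession
        return accession
```

Recording the same sequence under two identifiers (say "CLONE1.1" and "CLONE2.3") returns the same accession for both. Recording a revised sequence under "CLONE1.1" keeps the identifier and gives it a new accession.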

    The DNA sequence

    The assembled DNA sequence for the six chromosomes is available below. These sequences were used for the various chromosomal analyses and plots presented in the paper. There are also associated GFF format files which describe the genomic features of the chromosomal sequences, including the predicted intron/exon structures, repeat information etc. A description of GFF format is available here. These DNA sequences and annotation also form the basis of the October authors' protein set.

  • The Chromosome DNA files

    These are the compressed fasta files for the six chromosomes, each containing a single DNA sequence. The sequences are a composite of finished and unfinished sequence material, with gaps represented by sequences of consecutive N's of nominal length.

    GZIP format: I.dna, II.dna, III.dna, IV.dna, V.dna, X.dna
    ZIP format: I.dna, II.dna, III.dna, IV.dna, V.dna, X.dna
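
Because each chromosome file holds a single sequence in which gaps appear as runs of N, the gaps can be located with a short script. The following is a minimal sketch. The function names are ours, and reporting coordinates as 1-based inclusive is our choice rather than something the files prescribe.

```python
import re
from typing import Iterable, List, Tuple


def read_single_fasta(lines: Iterable[str]) -> Tuple[str, str]:
    """Read a FASTA stream that must contain exactly one sequence."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("more than one sequence in input")
            header = line[1:]
        elif header is None:
            raise ValueError("sequence data found before a header line")
        else:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def find_gaps(sequence: str) -> List[Tuple[int, int]]:
    """Return (start, end) for each run of N, as 1-based inclusive positions."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[Nn]+", sequence)]
```

Since the gaps are of nominal length, the sizes reported this way say where sequence is missing but not how much.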

  • Corresponding GFF files

    For each of the above chromosomal sequences, the following compressed files give all the annotation information used for the paper, including the predicted intron/exon structures, repeat information etc. A description of GFF format is available here.

    GZIP format: I.gff, II.gff, III.gff, IV.gff, V.gff, X.gff
    ZIP format: I.gff, II.gff, III.gff, IV.gff, V.gff, X.gff
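
GFF is a tab-delimited format with eight mandatory fields per line (sequence name, source, feature, start, end, score, strand, frame), optionally followed by further attribute text. The following is a minimal sketch of a line parser under that assumption. The record and function names are ours, and the sample line in the usage note is invented for illustration, not taken from the files above.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the file gives "."
    strand: str
    frame: str
    attributes: str         # everything after the eighth field, unparsed


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank lines and '#' comments."""
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand, frame, "\t".join(fields[8:]),
    )
```

For example, an invented line such as "chrI<TAB>curated<TAB>exon<TAB>100<TAB>250<TAB>.<TAB>+<TAB>0" would yield a record with start 100, end 250 and no score.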


  • Resources used for specific analyses in the Genome Consortium paper

    The protein data set was the October set described above, and the DNA sequence and positional annotation used were as in the previous section. Where BLAST settings differ from the general analysis described later on this page, they are given with the analysis concerned.

    Cross-species comparison

    The derivation of each organismal set of proteins:-

  • Yeast - Proteins were derived from the ORF set maintained in the Saccharomyces Genome Database. The actual protein set used is available in gzip or zip format.

  • Human - Proteins used were the human proteins present in SwissProt version 36. The actual protein set used is available in gzip or zip format. However, RL41_HUMAN could not be used as a search query because it was too short (25 aa), so the size of the searched set was 4979.

  • E. coli - Proteins used were derived from the set maintained at the NCBI Entrez genomes division. The actual protein set used is available in gzip or zip format.
  • The wublastp parameters used were

     B=1 E=1e-3 -filter seg

    Resources used by companion papers

    Neurobiology of the Caenorhabditis elegans Genome, Cornelia I. Bargmann, Science 282:2028-2033. Methods and results for searches. The blast server protein data set was used.

    Caenorhabditis elegans Is a Nematode, Mark Blaxter, Science 282:2041-2046. Notes on methods used, and further resources available. The Wormpep 14 protein data set was used.

    Comparison of the Complete Protein Sets of Worm and Yeast: Orthology and Divergence, Stephen A. Chervitz, L. Aravind, Gavin Sherlock, Catherine A. Ball, Eugene V. Koonin, Selina S. Dwight, Midori A. Harris, Kara Dolinski, Scott Mohr, Temple Smith, Shuai Weng, J. Michael Cherry, and David Botstein, Science 282:2022-2028. Notes on methods used, and further resources available. The October protein data set was used.

    Zinc Fingers in Caenorhabditis elegans: Finding Families and Probing Pathways Neil D. Clarke and Jeremy M. Berg, Science 282:2018-2022. The June protein data set was used. Further information and data are available.

    The Taxonomy of Developmental Control in Caenorhabditis elegans Gary Ruvkun and Oliver Hobert, Science 282:2033-2041. Methods used. The blast server protein data set was used.

    Gene Prediction and Standard Analysis in C. elegans Genome Project

    The C. elegans genomic data has been produced primarily as a resource for experimental biologists and has been under active curation for this purpose for many years. Our understanding of metazoan genomes is far from complete, and it would be naive to expect that we can produce a complete set of correct gene translations at this point. We anticipate that refinement will continue for many years. The current gene predictions were made using the best tools and biological information available at the time. In many cases improvements were incorporated into the analysis process even though it was not feasible to apply them retrospectively and update previous work.

    It is also important to note that we have actively solicited corrections to the sequence annotation from the scientific community. In many ways, the gene predictions can be considered to have been under the peer review of the scientific community. Sequences which have been in the public domain for many years will have had the long-term benefit of this process.

    An overview of the annotation process and the tools employed at the time the Science paper was written is shown below:- Analysis overview

    GENEFINDER

    Ab-initio gene prediction. [Green et al. unpublished, phg@u.washington.edu]

    The command line used was:-

    genefinder  -tablenamefile tablefile -intronPenalty intron_penalty.lookup 
                        -exonPenalty exon_penalty.lookup sequence_file.fasta

    The tables given in tablefile are contained in the compressed Unix tar file nemtables.tar.gz.

    POSTWISE

    Gene Prediction based on protein homology [Birney E. (1997). ISMB,5,56.]

    The command line used was:-

    postwise -silent -ace -gene worm.gf sequence.fasta exblx_file

    tRNASCAN-SE

    Transfer RNA prediction [Lowe, T.M. and Eddy, S.R. (1997). Nucl. Acids Res., 25, 955.]

    The command line used was:-

    tRNAscan-SE -a -q sequence.fasta

    Version used was tRNAscan-SE 1.11 (Nov 97)

    INV

    Inverted Repeat Detection [R. Durbin unpub. available from http://www.sanger.ac.uk/Software]

    TAN

    Tandem Repeat Detection [R. Durbin unpub. available from http://www.sanger.ac.uk/Software]

    POLY

    Tandem Repeat Detection [R. Durbin unpub. available from http://www.sanger.ac.uk/Software]

    MSPcrunch

    Blast Post Processor [Sonnhammer, E.L.L. and Durbin R. (1994). J. Comp. Biol., 2,9.]

    Version used was Version 2.1, compiled Jun 18 1997.

    BLASTX

    Six frame translation and comparison to protein database [Altschul et al. (1990). J. Mol. Biol., 215, 403-410.]

    The command line used was:-

    blastx swir sequence.fasta B=1000000 -span1 M=BLOSUM62-12 V=0 H=0

    Version used was BLASTX 1.4.6 [16-Oct-94] [Build 00:04:26 Oct 20 1994]

    TBLASTX

    DNA vs DNA comparisons at the protein level. [Altschul et al. (1990). J. Mol. Biol., 215, 403-410.]

    Version used was TBLASTX 1.4.7 [16-Oct-94] [Build 00:14:27 Oct 20 1994]

    EST_genome

    Alignment of EST sequences to Genomic DNA [Mott, R. (1997). CABIOS,13,477.]

    To reduce the number of candidate ESTs to align to genomic sequences using EST_genome, EST sequences were pre-filtered using BLASTN and MSPcrunch. The command line for this operation is given by:-

    blastn est_database sequence.fasta  B=1000000 | MSPcrunch -l 0 - 
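
For anyone wanting to drive this pre-filter from a script rather than a shell, the pipe can be reproduced with Python's standard subprocess module. This is a minimal sketch, not part of the original pipeline. The helper name run_pipeline is ours, and the two argument lists simply restate the command line above, so running them assumes that blastn and MSPcrunch are installed and that est_database and sequence.fasta exist.

```python
import subprocess
from typing import Sequence


def run_pipeline(producer: Sequence[str], consumer: Sequence[str]) -> bytes:
    """Run the equivalent of `producer | consumer`; return the consumer's stdout."""
    first = subprocess.Popen(producer, stdout=subprocess.PIPE)
    second = subprocess.Popen(consumer, stdin=first.stdout, stdout=subprocess.PIPE)
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    first.stdout.close()
    output, _ = second.communicate()
    first_status = first.wait()
    if first_status != 0 or second.returncode != 0:
        raise RuntimeError(
            f"pipeline failed: producer={first_status}, consumer={second.returncode}"
        )
    return output


# Argument lists restating the command line given in the text above.
BLASTN_CMD = ["blastn", "est_database", "sequence.fasta", "B=1000000"]
MSPCRUNCH_CMD = ["MSPcrunch", "-l", "0", "-"]

# Example call (requires the tools and input files to be present):
# filtered_hits = run_pipeline(BLASTN_CMD, MSPCRUNCH_CMD)
```

Passing the commands as argument lists avoids shell quoting issues and makes it easy to substitute a different database or sequence file.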

    Authors

    The following were involved in the C. elegans genome sequencing project at or associated with the Sanger Institute

    Rachael Ainscough, Simon Bardill, Karen Barlow, Victoria Basham, Caroline Baynes, Lisa Beard, Alastair Beasley, Mary Berks, James Bonfield, Jacqueline Brown, Christine Burrows, John Burton, Connie Chui, Emma Clark, Louise Clark, Gerard Colville, Theresa Copsey, Amanda Cottage, Alan Coulson, Molly Craxton, Auli Cummings, Paul Cummings, Simon Dear, Thomas Dibling, Richard Dobson, Jonathan Doggett, Richard Durbin, Jillian Durham, Andrew Ellington, David Evans, Kerry Fleming, John Fowler, Audrey Fraser, Debbie Frame, Alison Gardner, Jane Garnett, Iain Gray, Jane Gregory, Mark Griffiths, Sarah Hall, Barbara Harris, Trevor Hawkins, Cathy Hembry, Sarah Holmes, Bijay Jassal, Matt Jones, Steve Jones, Ann Joy, Paul Kelly, Joanna Kershaw, Andrew Kimberley, Yuji Kohara, Neil Laister, Dan Lawson, Nicola Lennard, Julia Lightning, Simon Limbrey, Sarah Lindsay, Christine Lloyd, Simon Margerison, Anna Marrone, Lucy Matthews, Paul Matthews, Rebecca Mayes, Kirsten McLay, Amanda McMurray, Mark Metzstein, Simon Miles, Nicholas Mills, Maryam Mohammadi, Beverley Mortimore, Mary O'Callaghan, Anthony Osborn, Sophie Palmer, Chantal Percy, Adelaide Pettett, Emma Playford, Michelle Pound, Rebecca Rocheford, Jane Rogers, David Saunders, Maggie Searle, Katherine Seeger, Ratna Shownkeen, Matthew Sims, Nicola Smaldon, Andrew Smith, Michelle Smith, Mike Smith, Rebekah Smye, Erik Sonnhammer, Rodger Staden, Charles Steward, John Sulston, June Swinburne, Ruth Taylor, Louise Tee, Jean Thierry-Mieg, Karen Thomas, Jeanette Usher, Mellanie Wall, Justine Wallis, Andy Watson, Sarah White, Anna Wild, Jane Wilkinson, Leanne Williams, Jenny Winster, Isabel Wragg

    The corresponding list for the Genome Sequencing Center, St Louis is

    Amanda Abbott, Jane Abu-Threideh, Craig Ahrens, Ella Alexander, Johar Ali, Mark Ames, Kirsten Anderson, Stephanie Andrews, Susanna Angell, Paul Antonacci, Lucinda Antonacci-Fulton, Bessie Antoniou, Damon Baisden, Lilla Bartko, Shiv Basu, Chris Bauer, Cathy Beck, Michael Becker, Louis Begnel, Kirk Behymer, Gary Bemis, Dan Bentley, Zachary Bevins, Thomas Biewald, Linda Blackwood, Donald Blair, Mary Blanchard, Mary Blandford, Elizabeth Boatright, Sherell Bourne, Kyle Bova, Holland Bradshaw, Ryan Brinkman, Rose Brockhouse, Michelle Broy, Christina Budnicki, Jennifer Burkhart, Tracy Caffrey, Kelly Carpenter, Tim Carter, Brandi Chiapelli, Asif Chinwalla, Stephanie Chissoe, Kathleen Clarke, Sandy Clifton, Jim Cloud, Molly Cofman, Megan Connell, Mark Cook, Judy Cooper, Matt Cooper, Matthew Cordes, Mark Cotton, Jennifer Couch, Laura Courtney, Krista Creason, Robin Crocker, Jye'Mon Crockett, Taquilla Crum, Michael Dante, Betty Darron, Ruth Davenport, Michelle David, Sharon Davidson, Teresa Davidson, Shanoa Davis, Andy Delehaunty, Sandy Dempsey, Jasna Despot, Hong Ding, Maggie Dotson, Kristy Drone, Hui Du, Zijin Du, Chad Dubbelde, Treasa DuBuque, Grant Duckels, Sean Eddy, Jennifer Edwards, Glendoria Elliott, Efrem Exum, Anthony Favello, Ginger Fewell, Tanya Fiedler, Lisa Flagg, William Fronick, Bob Fulton, Tony Gaige, Stacie Gattung, Cynthia Geisel, Steve Geisel, Alicia Gibson, Candi Giddings, Barbara Gillam, Warren Gish, Danielle Glossip, Jennifer Godfrey, Deepa Goela, Norma Goins, Tina Graves, Tracie Greco, Phil Green, Serena Gregory, William Haakenson, Priscilla Hale, Charles Harkins, Gwen Harmon, Mark Harper, Anthony Harris, Michelle Harrison, James Hawkins, Maria Hawkins, Clay Hawryszko, Chuck Heidbrink, John Henkhaus, LaDeana Hillier, Kurt Hinds, Michael Holman, Andrea Holmes, Donna Hopson, Melisa Hotic, Monica Hultman, Ann Jacobs, Craig Jenkins, Mohamed Jier, Doug Johnson, Mark Johnston, Brenda Jones, Kimberly Jones, Corinne Joshu, Paula Kassos, Kimberly Keen, 
Jennifer Kellen, Kimberley Kemp, Deana Keppler, Amy Kerstetter, Melissa Ketterman, Kyung Kim, Mark King, Jennifer Kirsten, Bill Klinke, Jeremy Kock, Sara Kohlberg, Ian Korf, Amy Kozlowicz, Jason Kramer, Rebecca Krauss, Tamara Kucaba, Michelle Lacy, Thomas Lakanen, Betty Lamar, Yvonne Langston, Yvonne LaPlant, John Latreille, Daniel Layman, Thomas Le, Thuy-Tien Le, Tri-Tin Le, John Ledwith, Lynn Lehnert, Darcy Leimbach, Sarah Lennox, Shawn Leonard, Lili Li, Paul Lowery, Terrie Lynch, Chris Macri, Len Maggi, Maggie Maher, Elaine Mardis, Marco Marra, Gabor Marth, John Martin, Rachel Maupin, Ken McDonald, Ramonna McDonald, Rebecca McGrane, Kelly Mead, Becky Meininger, Sandra Menezes, Brian Merry, Rebecca Miko, Kevin Miller, Nancy Miller, Walt Miller, Brian Minges, Patrick Minx, Tonya Modde, Bradley Moore, Matthew Morris, Garrett Mullen, Molly Mullen, Jennifer Murray, Diane Nelson, Joanne Nelson, Amy Nguyen, Christine Nguyen, Nham Nhan, Susan Nichols, Laura Niemann, David O'Brien, Darla O'Neal, Ben Oberkfell, Amy Ozanich, Philip Ozersky, Dimitrios Panussis, Kimberly Pape, Jeremy Parsons, Adele Pauley, Charlene Pearman, Dale Peluso, Kymberlie Pepin, Denise Peterson, Amy Phillips, Craig Pohl, Faye Prevedell, Tim Raichle, Jennifer Randall, Mary Reynolds, Carrie Rhine, Lorrie Rice, Joanne Rieff, Lisa Rifkin, Linda Riles, Judy Robertson, Kerry Robinson, David Rohleder, Tracy Rohlfing, Chris Rose, Ellen Ryan, Laura Sammons, Brent Sandberg, Jill Sansone, Lisa Sapetti, Mark Schaller, Carrie Schaus, Paul Scheet, Emilie Scherger, Ann Schrader, Brian Schultz, Doug Scronce, Shawn Shafer, Kimberly Shih, Arthur Simonyan, Joanne Small, Aimee Smith, Reene Smith, Jackie Snider, Lisa Spalding, John Spieth, Peter, St. 
Zachary Stacy, David States, Shayla Stein, Laurita Stellyes, Nathan Stitziel, Tamberlyn Stoneking, Cindy Strong, Joe Strong, Catrina Strowmatt, Eric Stuebe, Jessica Stumpf, Veronika Sudnekevich, Carrie Sutterer, Alison Taich, Sameer Talcherkar, Aye Tin-Wollam, Evanne Trevaskis, Susan Tucci, Bradley Twyman, Karen Underwood, Phillip Valencia, Scott Valentine, Mark Vaudin, Kevin Vaughan, Joelle Veizer, Dana Vignati, Caryn Wagner-McPherson, Christopher Walker, Pamela Wamsley, Robert Waterston, Lori Weinstock, Michael Wendl, Rod White, Lori Wilcox, Alma Willis, Curtis Wilson, Richard Wilson, Mark Winkelmann, Jeffrey Woessner, Patricia Wohldmann, Cliff Wollam, Kimberly Woods, Xiaoyun Wu, Shiaw-Pyng Yang, Martin Yoakum, Xiao Zheng, Hui Zhu, Michael Zidanic
    webmaster@sanger.ac.uk

    Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK  Tel:+44 (0)1223 834244

    Last Modified Wed Jan 26 17:35:02 2005

    Genome Research Limited is a charity registered in England with number 1021457
