C. elegans sequence analysis at the Sanger Institute

The C. elegans 100 Mb genome is near completion, representing the first genome of a multicellular organism to be sequenced. The sequencing of the C. elegans genome has relied almost entirely on the sequence-ready contigs provided by the physical map. The resulting product has been a set of overlapping DNA sequences derived from physical map clones. Because each sequence is derived from a discrete clone, this approach establishes an obvious link between the sequence data and the physical map. Analysing the finished sequences on a clone-by-clone basis has the further advantage that the average clone size of 30 kb is easily managed by the majority of sequence analysis tools. The tools for analysing the sequence are subject to change, and the most recent versions can be seen on the Sanger Institute C. elegans web site.

When a sequencing project for a clone is finished, the required sequence is excised from the GAP database (1) using the program MKCON-GAP (2). Automated analyses are performed on the DNA sequence and the results are converted into ACEDB file format (Table 1). Each analysis program is called from within a UNIX shell script. The script also logs each process and its exit status so that any problems can be quickly identified and resolved. When automated analysis is complete, the resulting log file is mailed to the user.
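The logging behaviour described above can be sketched as follows. This is a minimal illustration in Python rather than the UNIX shell script the project used; the function name `run_pipeline` and the placeholder commands are assumptions made for the example, not the real analysis programs or their arguments.

```python
import io
import subprocess
import sys

def run_pipeline(steps, log):
    """Run each (name, argv) step in order and log its exit status.

    `log` is any writable text stream. A failing step is recorded but
    does not stop the remaining steps from running.
    """
    results = []
    for name, argv in steps:
        proc = subprocess.run(argv, capture_output=True, text=True)
        status = "OK" if proc.returncode == 0 else "FAILED"
        log.write(f"{name}\texit={proc.returncode}\t{status}\n")
        results.append((name, proc.returncode))
    return results

# Hypothetical stand-ins for analysis programs: one succeeds, one fails.
steps = [
    ("step_ok", [sys.executable, "-c", "print('done')"]),
    ("step_fail", [sys.executable, "-c", "import sys; sys.exit(3)"]),
]
log = io.StringIO()
results = run_pipeline(steps, log)
print(log.getvalue(), end="")
```

Recording the exit status per step, rather than stopping at the first failure, means a single log shows every analysis that needs to be rerun.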

Table 1. Summary of analysis programs used

Program        Function
GENEFINDER     Ab-initio gene prediction
POSTWISE       Gene prediction based on protein homology
tRNAscan-SE    Transfer RNA prediction (3)
INV            Inverted repeat detection (4)
TAN            Tandem repeat detection (4)
HMMLS, HMMFS   Hidden Markov model detection of repeat families (5)
POLY           Detection of repeat family members present in tandem arrays (4)
MSPcrunch      BLAST post-processor (6)
BLASTX         Six-frame translation and comparison to protein
TBLASTX        DNA vs DNA comparisons at protein level (7)
EST_genome     Alignment of EST sequences to genomic DNA (8)

The program GENEFINDER (Green et al. unpublished) was developed to predict putative protein coding genes within the C. elegans sequence data. GENEFINDER uses statistical criteria derived from log likelihood ratios to detect potential genes based on genomic features such as splice sites, translation start sites and codon biases. A dynamic programming algorithm is used to find a set of non-overlapping candidate genes with the highest total score for each DNA strand.
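The final selection step, choosing the highest-scoring set of non-overlapping candidates on one strand, has the shape of the classic weighted interval scheduling problem. The sketch below illustrates that general technique only; it is not GENEFINDER's implementation, and the candidate coordinates and scores are invented for the example.

```python
from bisect import bisect_right

def best_gene_set(candidates):
    """Pick non-overlapping candidates with the highest total score.

    candidates: list of (start, end, score) with inclusive integer
    coordinates on a single strand. Returns (total_score, chosen).
    """
    genes = sorted(candidates, key=lambda g: g[1])
    ends = [g[1] for g in genes]
    n = len(genes)
    best = [0.0] * (n + 1)   # best[i]: best total using the first i genes
    take = [False] * (n + 1)  # whether gene i is used in best[i]
    prev = [0] * (n + 1)      # how many earlier genes end before gene i starts
    for i, (start, end, score) in enumerate(genes, 1):
        p = bisect_right(ends, start - 1, 0, i - 1)
        prev[i] = p
        if best[p] + score > best[i - 1]:
            best[i] = best[p] + score
            take[i] = True
        else:
            best[i] = best[i - 1]
    # Trace back through the table to recover the chosen candidates.
    chosen = []
    i = n
    while i > 0:
        if take[i]:
            chosen.append(genes[i - 1])
            i = prev[i]
        else:
            i -= 1
    chosen.reverse()
    return best[n], chosen

# Invented candidates: the second overlaps both the first and the fourth.
candidates = [(100, 500, 12.0), (400, 900, 15.0), (950, 1400, 8.0), (520, 940, 9.5)]
total, chosen = best_gene_set(candidates)
print(total, chosen)
```

In practice this would be run once per strand, since candidates on opposite strands do not compete with each other.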

One of the major problems in ab-initio gene prediction methods is the inability to detect splice site signals accurately and sensitively. Enough false positive splice sites are produced to make gene finding problematic. C. elegans also lacks a detectable branch site consensus, in contrast to both yeast and mammalian introns. However, gene detection is aided by the fact that C. elegans introns tend to be relatively small, with a median size of 57 bp, and by the fact that up to 29% of the genomic sequence is predicted to be protein coding. The translation-initiating AUG codon does have a weak surrounding consensus, showing a preference for A in the preceding four bases; however, the 5' ends of genes remain difficult to predict with high accuracy. One potentially confusing aspect of gene prediction is trans-splicing in C. elegans (9). In this process short RNA sequences are spliced onto the 5' ends of mRNAs. The recognition site for this spliced leader sequence addition has the same consensus sequence as a splice acceptor site. This can lead to the gene prediction misinterpreting the initiating exon as an internal exon and erroneously extending the gene upstream. Another problem is that some genes in C. elegans are co-transcribed, producing polycistronic messages (10). One feature of such operons is that the distance between the polyadenylation site of an upstream gene and the trans-splice site of the downstream gene is short, usually about 100 bp. Together, the close spacing of the genes and the trans-splicing signals create a strong possibility that many operons will be predicted as a single gene.

As gene finding methodologies are currently non-optimal, consolidation of gene predictions with other biological information has proven to be essential. The most informative sources of extra information have been expressed sequence tags (ESTs) from C. elegans (11, 12; Y. Kohara, unpublished) and similarity to proteins in the public databases. Many of the initial GENEFINDER predictions have therefore been manually edited to take account of the EST and protein homology information. The ACEDB sequence display (FMAP) allows both the visualisation of the homologous regions and the rapid editing of the gene structures.

ESTs provide valuable transcriptional data, confirming not only that predicted genes are indeed transcribed in vivo but also the positions of intron/exon boundaries. EST mapping to genomic sequence is carried out by EST_genome (8). This program aligns the EST sequence to the genomic sequence while preferentially allowing gaps at intron (NN/GT..AG/) boundaries. Introns confirmed by strong EST matches are assigned an ACEDB tag by a Perl script that parses the EST_genome output (EST_genome2ace, S. Jones), and are highlighted in green on the FMAP. A small fraction of C. elegans introns (<1%) begin with GC instead of the canonical GT and at present can only be detected reliably where confirmatory EST data exist.
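To make the GT..AG and GC..AG boundary rule concrete, the following sketch classifies a candidate intron by its first and last two bases. The function name, the coordinate convention (1-based, inclusive, forward strand) and the toy sequences are assumptions for illustration; EST_genome itself works by alignment and is not reproduced here.

```python
def classify_intron(genomic, start, end):
    """Classify an intron by its terminal dinucleotides.

    start and end are 1-based inclusive coordinates on the forward
    strand. Returns "GT-AG", "GC-AG" or "non-canonical".
    """
    intron = genomic[start - 1:end].upper()
    if len(intron) < 4:
        return "non-canonical"
    donor, acceptor = intron[:2], intron[-2:]
    if acceptor == "AG" and donor in ("GT", "GC"):
        return f"{donor}-AG"
    return "non-canonical"

# Toy sequences: a 6 bp exon, a 13 bp intron (positions 7 to 19), a 6 bp exon.
canonical = "ATGGCC" + "GTAAGTTTTTCAG" + "GGATAA"
gc_donor = "ATGGCC" + "GCAAGTTTTTCAG" + "GGATAA"
print(classify_intron(canonical, 7, 19))
print(classify_intron(gc_donor, 7, 19))
```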

Many examples of alternative transcripts have been shown to exist in C. elegans. Alternative splicing is particularly problematic when analysing genome sequence using ab-initio methods, and currently only EST data can be used reliably to predict alternative transcripts. However, the use of EST data in the detection of splice variants is hampered by the fact that alternative transcripts present at low levels will be poorly represented in the EST dataset. Currently, only about 1% of genes have been determined to have alternative transcripts using this method.

Protein homology mapping is done by comparing a six-frame translation of the genomic sequence to a protein database using BLASTX (7) in conjunction with a BLAST post-processor, MSPcrunch (6). MSPcrunch improves the signal-to-noise ratio by eliminating matches due to low complexity regions whilst increasing the significance of low scoring fragmentary hits to the same protein sequence. MSPcrunch also has the added advantage of being able to produce output in ACEDB file format.
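A six-frame translation simply translates the sequence in three reading frames on each strand. The sketch below shows the idea using the standard genetic code; the frame labels (`+1` to `-3`) and function names are conventions chosen for this example and are not taken from BLASTX.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Return a dict mapping frame label to its conceptual translation."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

frames = six_frame("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```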

We maintain a non-redundant protein database, SWIR (E. Sonnhammer and P. Rice, unpublished), for protein homology searches. The SWIR database is made up from three protein databases: Wormpep (13), SwissProt (14) and TrEMBL (14). Duplicate sequences in these databases are detected and removed by their database cross-references, and peptide sequences are retained in the priority order Wormpep, SwissProt, then TrEMBL. Redundancy of this protein database is limited further by the removal of closely related sequences. This is achieved by first identifying candidates by their dimer composition and then eliminating sequences where the identity is 95% or greater as determined by a Needleman-Wunsch alignment (15).
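The two-stage filter described above can be sketched as a cheap dimer-composition comparison followed by a global alignment for the candidates that pass. Everything beyond the 95% threshold and the database priority order is an assumption made for illustration: the dimer similarity measure, the 0.8 prefilter cutoff, the alignment scores, the definition of identity as matches divided by alignment columns, and applying the priority order at this stage. This is not the SWIR build code.

```python
from collections import Counter

PRIORITY = {"Wormpep": 0, "SwissProt": 1, "TrEMBL": 2}

def dimer_profile(seq):
    """Count overlapping dipeptides in a protein sequence."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

def dimer_similarity(p, q):
    """Shared dimer count divided by the larger profile size (assumed measure)."""
    total = max(sum(p.values()), sum(q.values()))
    return sum((p & q).values()) / total if total else 1.0

def nw_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns matches / alignment columns."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:  # trace back one optimal alignment
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                matches += 1
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return matches / columns if columns else 1.0

def remove_redundant(entries, threshold=0.95, prefilter=0.8):
    """entries: (database, name, sequence). Higher-priority entries are kept."""
    kept = []
    for entry in sorted(entries, key=lambda e: PRIORITY[e[0]]):
        profile = dimer_profile(entry[2])
        if not any(dimer_similarity(profile, kp) >= prefilter
                   and nw_identity(entry[2], ke[2]) >= threshold
                   for ke, kp in kept):
            kept.append((entry, profile))
    return [e for e, _ in kept]

# Invented sequences: the TrEMBL entry differs from the Wormpep one at one residue.
worm = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
entries = [("TrEMBL", "T1", worm[:-1] + "L"),
           ("Wormpep", "W1", worm),
           ("SwissProt", "S1", "GSHMLEDPVDAFKQWERTYNCCGHHPPLLAAMMNNQQRRSS")]
kept_entries = remove_redundant(entries)
print([name for _, name, _ in kept_entries])
```

The prefilter exists because the alignment is quadratic in sequence length, so it is only worth running on pairs whose composition already looks similar.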

Protein coding elements are also detected by comparison with EST data sets from other nematode species. The current data set consists predominantly of ESTs from Brugia malayi and Caenorhabditis briggsae. This enables detection of protein families which have arisen within the nematode lineage. Comparisons are done at the protein level using TBLASTX, which compares conceptual six-frame translations of both the EST sequences and the C. elegans genomic sequence.

As the public protein databases become more complete, it also becomes more feasible to predict genes based on homology information. Gene predictions based on homology data are produced using POSTWISE (Birney et al., unpublished). POSTWISE uses an algorithm that combines gene prediction and protein homology in a single probabilistic model (16). It is envisaged that tools such as POSTWISE will play an important role in the curation and updating of the predicted gene structures as new orthologs and paralogs enter the protein databases.

Currently the ab-initio detection of non-protein-coding genes has been limited to the prediction of tRNA genes, using tRNAscan-SE (3). This approach utilises the program tRNAscan (17) to rapidly identify initial tRNA gene candidates. The resultant set is then filtered using the COVE probabilistic RNA prediction package (18) to remove false positives. The program is also able to predict the amino acid charged by each tRNA, and genes that possess incomplete primary or secondary structure are marked as pseudogenes.

One problem in the assignment of putative function is the variance between different human annotators. Annotation may vary in detail, nomenclature, accuracy and spelling. A common error is 'over-annotation', where the protein homology is over-interpreted, resulting in the annotation being more specific than the homology justifies. Other problems occur when the annotator annotates a single functional domain whilst multiple functional domains are present, or fails to record correctly the actual number of functional domains present. Such errors are common in proteins where multiple domains are prevalent, e.g. extracellular receptor signal transduction proteins.

Another problem in using homology information alone is that in many cases the annotation of the database proteins themselves may be incorrect, and by utilising their annotation we simply propagate and proliferate the error. It is therefore beneficial if additional evidence for a functional domain can be provided. Initially, the PROSITE database of regular expressions was used to provide diagnostic motifs for protein domains (19). However, the use of PROSITE in the C. elegans project has now been superseded by the PFAM database, a collection of hidden Markov models of protein domains (20). The PFAM database has been constructed to provide a sensitive and accurate automatic method of finding protein domains. Searching this database allows us not only to detect known domains but also to record their number and position accurately. As PFAM predicts functional domains, annotations based on PFAM hits can be expressed in a consistent manner. The confidence in PFAM hits allows protein annotation to become semi-automated and easily updated. However, a drawback is that PFAM hits are currently limited to only 26% of the C. elegans predicted proteins.
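As an illustration of why regular-expression motifs are simple to apply but limited in sensitivity, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a protein with it. It handles only the common syntax elements (`x`, `[..]`, `{..}`, repeat counts and the `<`/`>` anchors outside brackets); the converter and the example protein sequence are assumptions for illustration, while the N-glycosylation pattern is the well-known PROSITE entry PS00001.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a compiled regular expression.

    Supported: x (any residue), [ABC] (allowed), {ABC} (forbidden),
    e(n) or e(n,m) repeats, and < / > terminal anchors outside brackets.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return re.compile("".join(parts))

# PS00001, N-glycosylation site: N, not P, S or T, not P.
glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
protein = "MKNGSAANPTQ"  # invented sequence for the example
hits = [(m.start(), m.group()) for m in glyco.finditer(protein)]
print(glyco.pattern, hits)
```

A pattern either matches or it does not, so a diverged domain copy is missed entirely, whereas a profile hidden Markov model assigns a graded score.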

From the phylogeny of C. elegans it would be expected that little in the way of DNA sequence data from closely related organisms would be readily available and that most of the protein similarities would be between conserved domains predating the divergence of the major animal phyla.

ACEDB

ACEDB (21) was developed for the C. elegans genome project to meet a number of requirements.

  • The database needed to be as flexible and configurable as possible so that new analysis methods and data could be easily incorporated into the database schema and graphical displays.
  • Rapid editing of predicted gene structures and genome features was needed for when more information became available.
  • The database also needed to be easily disseminated so that it is accessible to the end-user biologist.


For genomic sequence analysis ACEDB has a graphical feature map (FMAP) (Figure 3) for viewing DNA sequences. The FMAP allows viewing of genomic features such as open reading frames, start codons, codon biases, homology data, intron/exon structures and putative splice sites, in addition to the actual sequence data. An analogous graphical display also exists for protein sequences (PEPMAP), which allows the viewing of peptide-specific features such as hydropathy plots and the identification of chemically similar amino acids. Features from other analysis programs can be easily added to the graphical displays. The simple structure of the database format allows the output of other analysis methods to be quickly converted to the ACEDB file format by relatively simple AWK or Perl programs. ACEDB also allows the analysed sequence data to be converted into the EMBL file format so that sequence data can be easily submitted to the public databases, and allows rapid resubmission of sequence entries when required.
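The kind of conversion script mentioned above can be very small. The sketch below, written in Python rather than AWK or Perl, turns a list of repeat hits into a text record laid out as an object header line followed by one tag line per hit and a terminating blank line. The tag name `Feature`, its column order and the clone name are hypothetical and chosen only for illustration; the tags actually available depend on the database schema in use.

```python
def hits_to_ace(clone, hits, method):
    """Format (start, end, score) hits as one ACEDB-style text record.

    The 'Feature' tag and its column layout are assumptions for this
    sketch, not a statement of the real schema.
    """
    lines = [f'Sequence : "{clone}"']
    for start, end, score in hits:
        lines.append(f'Feature "{method}" {start} {end} {score:g}')
    # A blank line terminates the record.
    return "\n".join(lines) + "\n\n"

# Hypothetical clone name and tandem repeat hit.
record = hits_to_ace("CLONE1", [(100, 250, 37.5)], method="tandem")
print(record, end="")
```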

ACEDB also has a number of built-in tools which aid genomic sequence analysis. A version of GENEFINDER is incorporated into ACEDB, allowing genes to be predicted in sequences displayed in the FMAP as described above. BLIXEM (Figure 2) (6) allows sequence homologies to be viewed graphically as multiple sequence alignments, and dot plots can be generated between matching sequences using the program DOTTER (22).

Figure 2. BLIXEM display showing translated region of genomic DNA with protein matches


Figure 3. ACEDB FMAP display


The C. elegans ACEDB database can be downloaded via FTP (ref.) but as it exceeds 1 GB in size the preferred route is via the web from http://www.sanger.ac.uk/Projects/C_elegans using the WEBACE interface.

DNA and protein sequences can also be searched online using the C. elegans BLAST server. The resultant BLAST alignments also provide hyperlinks to the EMBL entries and, via WEBACE, to the ACEDB database.

References

1. Bonfield, J.K., Smith, K.F. and Staden, R. (1995). Nucl. Acids Res., 23, 4992.

2. MKCON available from http://www.sanger.ac.uk/Software

3. Lowe, T.M. and Eddy, S.R. (1997). Nucl. Acids Res., 25, 955.

4. INV, TAN and POLY available from http://www.sanger.ac.uk/Software

5. Eddy, S.R., Mitchison, G. and Durbin, R. (1995). J. Comp. Biol., 2, 9.

6. Sonnhammer, E.L.L. and Durbin, R. (1994). Comput. Applic. Biosci., 10, 301.

7. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990). J. Mol. Biol., 215, 403.

8. Mott, R. (1997). CABIOS, 13, 477.

9. Krause, M. and Hirsh, D. (1987). Cell, 49, 753.

10. Spieth, J., Brooke, G., Kuersten, S., Lea, K. and Blumenthal, T. (1993). Cell, 73, 521.

11. Waterston, R.H., Martin, C., Craxton, M., Huynh, C., Coulson, A., Hillier, L., Durbin, R., Green, P., Shownkeen, R., Halloran, N., Metzstein, M., Hawkins, T., Wilson, R., Berks, M., Du, Z., Thomas, K., Thierry-Mieg, J. and Sulston, J. (1992). Nature Genet., 1, 114.

12. McCombie, W.R., Adams, M.D., Kelley, J.M., Fitzgerald, M.G., Utterback, T.R., Khan, M., Dubnick, M., Kerlavage, A.R., Venter, J.C. and Fields, C. (1992). Nature Genet., 1, 124.

13. Wormpep available from http://www.sanger.ac.uk/C_elegans/Wormpep

14. Bairoch, A. and Apweiler, R. (1997). Nucl. Acids Res., 24, 21.

15. Needleman, S. and Wunsch, C. (1970). J. Mol. Biol., 48, 444.

16. Birney, E. (1997). ISMB, 5, 56.

17. Fichant, G.A. and Burks, C. (1991). J. Mol. Biol., 220, 659.

18. Eddy, S.R. and Durbin, R. (1994). Nucl. Acids Res., 22, 2079.

19. Bairoch, A. (1993). Nucl. Acids Res., 21, 3097.

20. Sonnhammer, E.L., Eddy, S.R. and Durbin, R. (1997). Proteins, 28, 405.

21. Durbin, R. and Thierry-Mieg, J. (1991-). Code and data available from the anonymous FTP server at ncbi.nlm.nih.gov.

22. Sonnhammer, E.L.L. and Durbin, R. (1995). Gene, 167, GC1.

Last Modified Tue Dec 16 13:52:43 2003

Genome Research Limited is a charity registered in England with number 1021457

Data Sharing | Copyright