GFF features used in WormBase

WormBase is released in a number of data formats. The GFF files associated with each release constitute a simple flatfile version of the sequence and sequence features. This document explains the naming of features and methods utilised in the GFF files.

GFF format

Briefly, a GFF file is a tab-delimited flatfile with the following structure:
‹seqname› ‹source› ‹feature› ‹start› ‹end› ‹score› ‹strand› ‹frame› ‹attributes›
See the GFF specification for a fuller explanation of the GFF format.
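The column layout above can be read with a few lines of Python. This is a minimal sketch, not part of the WormBase tooling: the names `GffRecord` and `parse_gff_line` are hypothetical, and the example line (sequence name, coordinates and CDS name) is invented purely for illustration. It assumes that the attributes column may be absent and that a score of `.` means "no score".

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF line, split into the nine columns listed above (hypothetical helper)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one tab-delimited GFF line; return None for blank and comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame = cols[:8]
    # Assumption: the attributes column is optional, so default to an empty string.
    attributes = cols[8] if len(cols) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )


# Invented example line, for illustration only.
example = 'CHROMOSOME_I\tcurated\tCDS\t1000\t1200\t.\t+\t0\tCDS "AB12.1"'
record = parse_gff_line(example)
```

Keeping `frame` as a string preserves the `.` placeholder used by features that carry no frame.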

Current GFF Source list

Source | Possible features | Description
Genomic_canonical | Sequence | Genome sequences forming the minimal tiling path along each chromosome. The named sequences represent those objects in ACEDB declared 'Genomic_canonical'; these are the sequences submitted to the public nucleotide databases. The full list of clone paths can be found in the summary tables for each WormBase release.
Link | Sequence | Level of organisation of the genome sequences into chromosome sequences. Clones are aligned into LINK and SUPERLINK objects to form chromosomes.
curated | Sequence, exon, intron, CDS | Gene prediction for a protein-coding sequence. 'Curated' genes have been appraised by an annotator. The nomenclature for curated CDSs is ‹Clone›.[1-99]. Curated predictions are the primary data for generating the wormpep data set.
provisional | Sequence, exon, intron, CDS | Gene prediction for a protein-coding sequence. 'Provisional' genes are preliminary Genefinder predictions and have not been appraised by an annotator. The nomenclature for provisional CDSs is ‹Clone›.[a-z]. 'Provisional' predictions are included in the wormpep data set.
tRNAscan-SE-1.11 | Sequence, exon, intron | tRNA gene predictions. tRNAs are predicted using the tRNAscan program (Sean Eddy).
RNA | Sequence, exon, intron | RNA gene predictions.
Pseudogene | Sequence, exon, intron | Pseudogene predictions.
Transposon | repeat | Transposon predictions.
GenePair_STS | Structural | The PCR product that results from amplification using a defined set of oligonucleotides. The PCR_products constitute a resource for investigating C. elegans biology, especially functional analysis. Hence, a number of features have utilised these sequences, and the associated data will have identical coordinates (e.g. RNAi phenotypes and Expression profile analysis).
WTP | partial_gene | Worm Transcriptome Project (WTP) predictions of partial transcripts. These are essentially mappings of the EST/mRNA transcript data to genomic sequence, marking the extent of overlap. WTP spans relate to confirmed transcripts and form a data set of experimentally supported worm genes. The exon/intron structure is not stored, and the span includes potential 5' and 3' UTR sequences.
RNAi | experimental | RNAi experiments. The start/stop span relates to the physical DNA used in the RNAi assay. RNAi has been performed using PCR products and cDNA clones.
cDNA_for_RNAi | experimental | cDNA sequences used in RNAi assays. The start/stop span relates to the full-length mapping of the cDNA to genomic sequence (i.e. it includes the introns of the unprocessed transcript).
Expr_profile | Expression | Stuart Kim's expression profiles. A distillation of microarray expression studies based on the Research Genetics designed oligo data set. Each profile relates to the PCR product on the microarray slide.
BLAT_EST_BEST | similarity | BLAT alignment of EST sequence to genomic sequence. 'BEST' indicates that this is the optimal alignment in the current release.
BLAT_EST_OTHER | similarity | BLAT alignment of EST sequence to genomic sequence. 'OTHER' indicates that this is a sub-optimal alignment in the current release and that a better match is found elsewhere.
BLAT_mRNA_BEST | similarity | BLAT alignment of mRNA sequence to genomic sequence. 'BEST' indicates that this is the optimal alignment in the current release.
BLAT_mRNA_OTHER | similarity | BLAT alignment of mRNA sequence to genomic sequence. 'OTHER' indicates that this is a sub-optimal alignment in the current release and that a better match is found elsewhere.
BLATX_NEMATODE | similarity | BLATX alignment of nematode EST consensus sequences to genomic sequence.
WABA_weak | similarity |
WABA_strong | similarity |
WABA_coding | similarity |
BLASTX | similarity |
inverted | repeat | Inverted repeat regions.
tandem | repeat | Tandem repeat regions.
RepeatMasker | repeat | Dispersed repeats as mapped by RepeatMasker.
Allele | SNP, complex_change_in_nucleotide_sequence, deletion, insertion, substitution, transposable_element_insertion_site | Known differences from the reference genome sequence.
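The naming rules for curated and provisional CDSs lend themselves to a simple pattern check. The following is a rough sketch only: `classify_cds_name` is a hypothetical helper, the clone name `AB12` is invented, and the regular expressions assume that `[1-99]` means a number from 1 to 99 and `[a-z]` a single lowercase letter after the final dot. Names that fit neither pattern are reported as "unknown" rather than guessed at.

```python
import re

# Assumption: ‹Clone›.[1-99] means the clone name, a dot, then a number 1..99.
CURATED_RE = re.compile(r"^.+\.[1-9][0-9]?$")
# Assumption: ‹Clone›.[a-z] means the clone name, a dot, then one lowercase letter.
PROVISIONAL_RE = re.compile(r"^.+\.[a-z]$")


def classify_cds_name(name: str) -> str:
    """Return 'curated', 'provisional' or 'unknown' based on the suffix after the last dot."""
    if CURATED_RE.match(name):
        return "curated"
    if PROVISIONAL_RE.match(name):
        return "provisional"
    return "unknown"


# Invented names, for illustration only.
labels = {name: classify_cds_name(name) for name in ["AB12.1", "AB12.42", "AB12.c", "AB12.100"]}
```

In practice the source column of the GFF line already states whether a prediction is curated or provisional, so a check like this is mainly useful for spotting names that disagree with their source.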

Current GFF Feature list

Feature | Quoted in | Description
Sequence | Genomic_canonical, Link, curated, provisional, Pseudogene, tRNAscan, RNA | Sequence features relate to regions defined by a start and stop coordinate. Sequences have a direction based on the strand attribute. See Figure 1 feature A for an example of the Sequence feature for a gene prediction.
exon | curated, provisional, Pseudogene, tRNAscan, RNA | Exon features are part of predicted gene models. An exon is defined by a start and stop coordinate with a direction based on the strand attribute. See Figure 1 feature B for an example of the exon features for a protein-coding gene prediction. Exons do not necessarily correlate with coding sequence (see exon 1 and exon 4 in the Figure 1 example); hence an exon feature will not have a frame (phase) attribute.
intron | curated, provisional, Pseudogene, tRNAscan, RNA | Intron features are part of predicted gene models. An intron is defined by a start and stop coordinate with a direction based on the strand attribute. See Figure 1 feature D for an example of the intron features for a protein-coding gene prediction. Introns do not form part of the coding sequence.
CDS | curated, provisional | CDS features are part of predicted gene models. A CDS feature is defined by a start and stop coordinate with a direction based on the strand attribute. See Figure 1 feature C for an example of the CDS features for a protein-coding gene prediction. CDS regions are those which can be translated into peptide sequence. CDS features must correlate with coding sequence and hence must have a frame (phase) attribute.
structural | Genepair_STS | Genomic DNA region representing a physical DNA substrate.
partial_gene | WTP | Sequence region which covers a partial gene prediction. This span has no exon/intron structure information. The region is supported by transcript data (essentially EST sequences), but the full span may not translate into a valid peptide.
experimental | RNAi, cDNA_for_RNAi | RNAi experiments and the cDNA sequences used in RNAi assays.
expression | Expr_profile | Expression studies.
similarity | BLAT_EST_BEST, BLAT_EST_OTHER, BLAT_mRNA_BEST, BLAT_mRNA_OTHER, BLATX_NEMATODE, WABA_weak, WABA_strong, WABA_coding, BLASTX | Similarity matches.
repeat | transposon, inverted, tandem, RepeatMasker | Repeat features.
ALLELE | | no data
substitution | Allele | Simple allelic changes, usually mono- or di-nucleotide substitutions.
complex_change_in_nucleotide_sequence | Allele | Deletion alleles in which the deleted sequence is replaced by an insertion sequence, usually a shorter one.
insertion | Allele | Simple insertion alleles which are not due to transposon insertions. The Allele object should ideally capture the nature of the inserted sequence.
deletion | Allele | Deletion alleles in WormBase represent cases where other strains of C. elegans have a tract of genomic sequence deleted with respect to the sequenced N2 strain. Many deletion alleles in WormBase were generated by the C. elegans Knockout Consortium, though details of many other deletion alleles have been extracted from the literature. Some deletion alleles represent deletions which have also been associated with an insertion, i.e. a large tract of sequence is deleted and replaced by a smaller tract of different sequence.
Allele | |
Clone_left_end | |
Clone_right_end | |
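The frame rule described for exon and CDS features can be expressed as a small consistency check. This is an illustrative sketch under stated assumptions: `frame_problem` is a hypothetical helper, a CDS frame is assumed to be one of `0`, `1` or `2`, and "no frame" is assumed to be written as `.` in the frame column.

```python
from typing import Optional

# Assumption: a CDS frame (phase) is written as 0, 1 or 2 in the frame column.
VALID_FRAMES = {"0", "1", "2"}


def frame_problem(feature: str, frame: str) -> Optional[str]:
    """Return a message if the frame column contradicts the feature type, else None."""
    if feature == "CDS" and frame not in VALID_FRAMES:
        return f"CDS has frame {frame!r}; expected 0, 1 or 2"
    if feature == "exon" and frame != ".":
        return f"exon has frame {frame!r}; expected '.'"
    # Other feature types are not checked in this sketch.
    return None


# Invented (feature, frame) pairs, for illustration only.
rows = [("CDS", "0"), ("CDS", "."), ("exon", "."), ("exon", "2")]
problems = [msg for msg in (frame_problem(f, fr) for f, fr in rows) if msg is not None]
```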

Figure 1 - Example of a sequence region which contains a gene model. Exons are shown as black boxes linked by introns. The coding sequence (ATG->STOP) is shaded in red. The spans (coordinate start/stop) for GFF features are marked as horizontal coloured bars. Features displayed include [A] Sequence, [B] exon, [C] CDS coding exon and [D] introns.
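The relationship between exons and introns in the figure can be sketched in code: given the exon spans of one gene model, the introns are the gaps between consecutive exons. This is only an illustration, not a description of how the WormBase GFF files are produced. `introns_from_exons` is a hypothetical helper, the coordinates are invented, and spans are assumed to be 1-based and inclusive at both ends.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), assumed 1-based and inclusive


def introns_from_exons(exons: List[Span]) -> List[Span]:
    """Return the gaps between consecutive exons of a single gene model."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit an intron when there is at least one base between the exons.
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Invented exon spans, for illustration only.
exons = [(100, 200), (301, 400), (501, 650)]
introns = introns_from_exons(exons)
```

Sorting by start coordinate means the same function works for models on either strand, since GFF start is always the lower coordinate.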


Last Modified Wed Oct 1 11:45:17 2008
