WormBase is released in a number of data formats. The GFF files associated with each release constitute a simple flatfile version of the sequence and sequence features. This document explains the naming of features and methods utilised in the GFF files.
GFF format
Briefly, a GFF file is a tab-delimited flatfile with the following structure:

seqname source feature start end score strand frame attributes

See the GFF specification for a fuller explanation of the GFF format.
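
As an illustration of the nine-column layout above, the following is a minimal sketch of reading one GFF line in Python. The `GffRecord` type, the `parse_gff_line` helper and the example line (sequence name, coordinates and attribute text) are invented for this example and are not taken from a WormBase release. The sketch assumes the usual GFF convention that `.` marks an undefined score or frame.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF line, using the column names listed above (hypothetical type)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: Optional[int]
    attributes: str


def parse_gff_line(line: str) -> GffRecord:
    """Split a tab-delimited GFF line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-delimited fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    # The attributes column may be absent; keep it as raw text if present.
    attributes = "\t".join(fields[8:])
    return GffRecord(
        seqname=seqname,
        source=source,
        feature=feature,
        start=int(start),
        end=int(end),
        score=None if score == "." else float(score),  # "." means undefined
        strand=strand,
        frame=None if frame == "." else int(frame),    # "." means undefined
        attributes=attributes,
    )


# Invented example line, for illustration only.
example = "\t".join(
    ["EXAMPLE_SEQ", "curated", "CDS", "1200", "1350", ".", "+", "0", 'CDS "EXAMPLE.1"']
)
record = parse_gff_line(example)
print(record.feature, record.start, record.end, record.strand, record.frame)
```

Keeping the attributes column as raw text avoids guessing at its internal syntax, which differs between GFF versions.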
Current GFF Source list
Current GFF Feature list
| Feature | Quoted in | Description | |
|---|---|---|---|
| Sequence | Genomic_canonical Link curated provisional Pseudogene tRNAscan RNA | Sequence features relate to regions defined by a start and stop coordinate. Sequences have a direction based on the strand attribute. See Figure 1 feature A for an example of the sequence feature for a gene prediction. | |
| exon | curated provisional Pseudogene tRNAscan RNA | Exon features are part of predicted gene models. An exon is defined by a start and stop coordinate with a direction based on the strand attribute. See Figure 1 feature B for an example of the exon features of a protein-coding gene prediction. Exons do not necessarily correlate with coding sequences (see exon 1 and exon 4 in the Figure 1 example), hence an exon feature will not have a frame (phase) attribute. | |
| intron | curated provisional Pseudogene tRNAscan RNA | Intron features are part of predicted gene models. An intron is defined by a start and stop coordinate with a direction based on the strand attribute. See Figure 1 feature D for an example of the intron features of a protein-coding gene prediction. Introns do not form part of the coding sequence. | |
| CDS | curated provisional | CDS features are part of predicted gene models. A CDS feature is defined by a start and stop coordinate with a direction based on the strand attribute. See Figure 1 feature C for an example of the CDS features of a protein-coding gene prediction. CDS regions are those which can be translated into peptide sequence. CDS features must correlate with coding sequences and hence must have a frame (phase) attribute. | |
| structural | Genepair_STS | Genomic DNA region representing a physical DNA substrate. | |
| partial_gene | WTP | Sequence region which covers a partial gene prediction. This span has no exon/intron structure information. The region is supported by transcript data (essentially EST sequences) but the full span may not translate into a valid peptide. | |
| experimental | RNAi cDNA_for_RNAi | waffle | |
| expression | Expr_profile | Expression studies. | |
| similarity | BLAT_EST_BEST BLAT_EST_OTHER BLAT_mRNA_BEST BLAT_mRNA_OTHER BLATX_NEMATODE WABA_weak WABA_strong WABA_coding BLASTX | Similarity matches. | |
| repeat | transposon inverted tandem RepeatMasker | Repeat features. | |
| ALLELE | no data | | |
| substitution | Allele | Simple allelic changes, usually mono- or di-nucleotide substitutions. | |
| complex_change_in_nucleotide_sequence | Allele | Deletion alleles in which the deleted sequence is replaced by an inserted sequence, usually a shorter one. | |
| insertion | Allele | Simple insertion alleles which are not due to transposon insertions. The Allele object should ideally capture the nature of the inserted sequence. | |
| deletion | Allele | Deletion alleles in WormBase mark regions where other strains of C. elegans have a tract of genomic sequence deleted with respect to the sequenced N2 strain. Many deletion alleles in WormBase were generated by the C. elegans Knockout Consortium, though details of many other deletion alleles have been extracted from the literature. Some deletion alleles represent deletions which have also been associated with an insertion, i.e. a large tract of sequence is deleted and replaced by a smaller tract of different sequence. | |
| Allele | | | |
| Clone_left_end | | | |
| Clone_right_end | | | |

Figure 1 - Example of a sequence region which contains a gene model. Exons are shown as black boxes linked by introns. The coding sequence (ATG->STOP) is shaded in red. The spans (coordinate start/stop) for GFF features are marked as horizontal coloured bars. Features displayed include [A] Sequence, [B] exon, [C] CDS (coding exon) and [D] intron.
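
The exon and CDS descriptions in the table imply a simple consistency rule: CDS lines carry a frame (phase) value, while exon lines do not. The following is a rough sketch of checking that rule on raw GFF lines. The `follows_frame_rule` function and the example lines are hypothetical, and the sketch assumes that an undefined frame is written as `.` and that a defined frame is one of `0`, `1` or `2`.

```python
def follows_frame_rule(line: str) -> bool:
    """Check the exon/CDS frame rule on one tab-delimited GFF line.

    Hypothetical helper: CDS lines must have a frame of 0, 1 or 2,
    exon lines must leave the frame undefined ("."), and all other
    feature types are accepted without a check.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-delimited fields, got {len(fields)}")
    feature, frame = fields[2], fields[7]
    if feature == "CDS":
        return frame in {"0", "1", "2"}
    if feature == "exon":
        return frame == "."
    return True


# Invented example lines, for illustration only.
cds_line = "EXAMPLE_SEQ\tcurated\tCDS\t1200\t1350\t.\t+\t0"
exon_line = "EXAMPLE_SEQ\tcurated\texon\t1100\t1350\t.\t+\t."
bad_cds_line = "EXAMPLE_SEQ\tcurated\tCDS\t1200\t1350\t.\t+\t."

for name, line in [("cds", cds_line), ("exon", exon_line), ("bad cds", bad_cds_line)]:
    print(name, follows_frame_rule(line))
```

In the invented lines above the exon starts before the CDS, mirroring the Figure 1 case where an exon extends beyond the coding sequence.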
