GFF: an Exchange Format for Feature Description

GFF is a format for describing genes and other features associated with DNA, RNA and Protein sequences.

The current version level of GFF is Version 2 with the following specification.

This page is a starting point for finding out about this format and its use in bioinformatics. Since GFF was proposed, a considerable amount of software has been developed for use with it, and this page is intended as a focus for the collation of this software, whether developed at the Sanger Institute or elsewhere.

A GFF record is an extension of a basic (name,start,end) tuple (or "NSE") that can be used to identify a substring of a biological sequence. (For example, the NSE (ChromosomeI,2000,3000) specifies the third kilobase of the sequence named "ChromosomeI".) GFF allows for moderately verbose annotation of single NSEs. It also provides limited support for NSE pairs in a rather asymmetrical way. An alternative format for representing NSE pairs that is used by several of the programs listed below is EXBLX, as used by MSPcrunch (Sonnhammer and Durbin (1994), "An expert system for processing sequence homology data", Proceedings of ISMB 94, 363-368).
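
As a minimal sketch of the NSE idea, the following Python fragment pulls the (name,start,end) tuple out of a single GFF record. It assumes the tab-delimited GFF Version 2 column order (seqname, source, feature, start, end, score, strand, frame, optional group), which comes from the Version 2 specification rather than from this page. The names NSE and nse_from_gff_line are hypothetical and are not part of any tool listed here.

```python
from typing import NamedTuple


class NSE(NamedTuple):
    """A (name, start, end) tuple identifying a substring of a sequence."""
    name: str
    start: int
    end: int


def nse_from_gff_line(line: str) -> NSE:
    """Extract the NSE from one tab-delimited GFF record.

    Assumed column order (GFF Version 2): seqname, source, feature,
    start, end, score, strand, frame, then an optional group field.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-delimited fields")
    return NSE(name=fields[0], start=int(fields[3]), end=int(fields[4]))


# Illustrative record only; the source, feature and group values are made up.
example = "ChromosomeI\tdemo\texon\t2000\t3000\t.\t+\t.\tgene1"
print(nse_from_gff_line(example))
```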

The most common operations that one tends to want to perform on sets of NSEs and NSE-pairs include intersection, exclusion, union, filtration, sorting, transformation (to a new co-ordinate system) and dereferencing (access to the described sequence). With a suitably flexible definition of NSE "similarity", these operations form a basis for more sophisticated algorithms like clustering and joining-together by dynamic programming. Programs to perform all of these tasks are described below, with links to local copies.
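
Two of the operations named above, intersection and union, can be sketched in a few lines of Python. This is an illustration only, not the algorithm used by any of the programs listed on this page. It assumes inclusive co-ordinates and treats two NSEs as comparable only when their sequence names match; the function names are hypothetical.

```python
from typing import List, Optional, Tuple

NSE = Tuple[str, int, int]  # (name, start, end); inclusive co-ordinates assumed


def intersect(a: NSE, b: NSE) -> Optional[NSE]:
    """Return the region shared by two NSEs, or None if they do not overlap."""
    if a[0] != b[0]:
        return None  # different sequences never intersect
    start, end = max(a[1], b[1]), min(a[2], b[2])
    return (a[0], start, end) if start <= end else None


def union(nses: List[NSE]) -> List[NSE]:
    """Merge overlapping NSEs after sorting by sequence name and start point."""
    merged: List[NSE] = []
    for name, start, end in sorted(nses):
        if merged and merged[-1][0] == name and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (name, prev[1], max(prev[2], end))
        else:
            merged.append((name, start, end))
    return merged
```

A looser definition of "similarity" (for example, allowing near-neighbours) could be obtained by padding the co-ordinates before calling intersect.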

Criticism of and new links for this page are always welcome. Please contact the page administrator, whose email address appears at the foot of the page.

Sanger Institute GFF Perl Modules

Broad-functionality Perl 5.0 modules developed by Tim Hubbard and extended/maintained by Richard Bruskiewich. Provided that the modules lie in your Perl module path (@INC), "use GFF" imports all the associated modules for use. These modules include those named in the 29/4/99 advisory below.

A GFF Perl installable archive of all these modules and their associated HTML documentation is now available.

29/4/99 Advisory: Module (package) spaces reorganized and modules renamed:

  • GFFObject.pm => GFF.pm - is the only module users need to 'use' in their scripts (pulls in the other modules...)
  • GFF.pm => GFF::GeneFeatureSet.pm
  • GeneFeature.pm => GFF::GeneFeature.pm
  • HomolGeneFeature.pm => GFF::HomolGeneFeature.pm

19/4/99 Advisory: GeneFeaturePair.pm and GFFPair.pm (formerly a part of the broad-functionality Perl 5.0 modules) have been completely deprecated, with the corresponding functionality now merged into GFF.pm (the score() method) and GeneFeature.pm (all the *Match*() methods).

Josep Abril's GFF programs (IMIM, Spain)

Web site for gff2ps and gff2aplot, programs for graphically representing GFF file data (highlighted at ISMB '99).

Ian Holmes's GFF programs & scripts (pre-1998 repository; no longer updated at the Sanger Institute)

Updated versions of some of these scripts, maintained by Ian Holmes, can be found at http://biowiki.org/GffTools/

  • GFF dynamic programming: gffdp.pl - a Perl program for joining together GFF segments using Generalised Hidden Markov Models with stacks, written by Ian Holmes. (Requires the BraceParser.pm module.) The architecture and scoring schemes of the underlying models are entirely flexible and can be specified in a separate file. Example model files include:
    • gene.model - a model for assembling exon predictions
    • transposon.model - a model for finding DNA transposons (or indeed any proteins flanked by inverted repeats)

More information about this program is available on request.

  • EXBLX dynamic programming: bigdp - a C++ program that assembles EXBLX segments using an affine gap penalty by doing linear-space divide-and-conquer dynamic programming, written by Ian Holmes. The program does not examine the sequences to which the EXBLX data refer, but finds optimal connections between the segments given their co-ordinates. GFF pair format can be converted to EXBLX using gff2exblx.pl.

    EXBLX records are single lines comprising eight whitespace-delimited fields: (SCORE, PERCENT-ID, START#1, END#1, NAME#1, START#2, END#2, NAME#2). bigdp requires that the two NSEs are the same length (i.e. END#1 - START#1 = END#2 - START#2). The output of bigdp is modified EXBLX. Each line of the output describes a set of several input segments joined together; the percent-ID field is replaced by the number of input segments that were used, and a ninth field, compactly describing the co-ordinates of the input segments, is added. The algorithm used by the program is documented more fully in Ian Holmes's PhD thesis.
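
The eight-field EXBLX layout and the equal-length requirement described above can be illustrated with a short Python sketch. The numeric types chosen for SCORE and PERCENT-ID are assumptions, and the names ExblxRecord, parse_exblx_line and equal_length are hypothetical; none of this is taken from bigdp or gff2exblx.pl.

```python
from typing import NamedTuple


class ExblxRecord(NamedTuple):
    """One EXBLX line: a scored pair of NSEs."""
    score: float       # assumed numeric
    percent_id: float  # assumed numeric
    start1: int
    end1: int
    name1: str
    start2: int
    end2: int
    name2: str


def parse_exblx_line(line: str) -> ExblxRecord:
    """Split one whitespace-delimited EXBLX line into its eight fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError("expected 8 whitespace-delimited fields")
    score, pid, s1, e1, n1, s2, e2, n2 = fields
    return ExblxRecord(float(score), float(pid),
                       int(s1), int(e1), n1, int(s2), int(e2), n2)


def equal_length(rec: ExblxRecord) -> bool:
    """Check the condition END#1 - START#1 = END#2 - START#2."""
    return rec.end1 - rec.start1 == rec.end2 - rec.start2
```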

  • gffhitcount - a C++ program that counts the number of times each base in a set of sequences is spanned by a GFF record and returns the results in GFF format.
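
The kind of per-base counting that gffhitcount performs can be sketched in Python as follows. This is not the gffhitcount implementation: it assumes inclusive co-ordinates, and it reports results as plain (name, start, end, count) tuples rather than in GFF format. The function name hit_counts is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hit_counts(records: List[Tuple[str, int, int]]) -> List[Tuple[str, int, int, int]]:
    """Return (name, start, end, count) runs of non-zero coverage.

    Each input record is a (name, start, end) tuple with inclusive
    co-ordinates. Coverage changes are recorded as +1 at each start and
    -1 just past each end, then accumulated in positional order.
    """
    deltas: Dict[str, Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for name, start, end in records:
        deltas[name][start] += 1
        deltas[name][end + 1] -= 1

    runs: List[Tuple[str, int, int, int]] = []
    for name in sorted(deltas):
        depth = 0
        prev = None
        for pos in sorted(deltas[name]):
            if depth > 0 and prev is not None:
                runs.append((name, prev, pos - 1, depth))
            depth += deltas[name][pos]
            prev = pos
    return runs
```

Working from change points rather than a per-base array keeps memory proportional to the number of records, which matters for chromosome-sized inputs.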

  • Miscellaneous Perl scripts:
    • gffintersect.pl - efficiently finds the intersection (or exclusion) of two GFF streams, reporting intersection information in the Group field. Definition of "intersection" allows for near-neighbours and minimum-overlap
    • intersectlookup.pl - used with gffintersect.pl to do reverse lookups and other manipulations on the results of an intersection test. Useful for e.g. pruning the lowest-scoring redundant entries from a GFF file
    • gffmask.pl - uses a GFF file to mask out specified sections of a FASTA-format DNA database with "n"'s (or any other character)
    • gfftransform.pl - transforms a GFF stream from one co-ordinate system to another (e.g. from clone to chromosome co-ordinates), given another GFF file describing the transformation. Requires GFFTransform.pm
    • gff2seq.pl - given chromosome co-ordinates, a clone database and a physical map co-ordinate file, returns the specified section of chromosomal sequence, even if it spans multiple clones. Requires SeqFileIndex.pm and FileIndex.pm
    • gfffilter.pl - filters lines out of a GFF stream according to user-specified criteria
    • gffsort.pl - sorts GFF streams by sequence name and startpoint
    • gffmerge.pl - merges sorted GFF streams
    • cluster2gff.pl - converts a list of whitespace-separated NSE clusters (in the format "name/start-end") into a GFF data set.
    • exblxgffintersect.pl - similar to gffintersect.pl, but finds NSE pairs in an EXBLX file that intersect with single NSEs in a GFF file. Useful for e.g. filtering out all hits between known genes from an all-vs-all BLAST comparison of genomic DNA
    • GFFTransform.pm - module to convert between GFF co-ordinate systems. Used by gfftransform.pl, blasttransform.pl and exblxtransform.pl
    • SeqFileIndex.pm - module to access a clone database using a map file. Requires FileIndex.pm. Used by gff2seq.pl
    • FileIndex.pm - module to build a quick lookup table for flatfiles. Used by exblxsym.pl, gff2seq.pl and SeqFileIndex.pm
    • BraceParser.pm - module to parse gffdp.pl model files, wherein fields are enclosed by braces {like this}
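
A co-ordinate transformation of the kind described for gfftransform.pl (for example, from clone to chromosome co-ordinates) can be sketched in Python as below. The format of the GFF file that describes the transformation is not documented on this page, so the sketch substitutes a plain dictionary and assumes that every clone lies in the forward orientation. The names transform and clone_map are hypothetical.

```python
from typing import Dict, Tuple

NSE = Tuple[str, int, int]  # (name, start, end)


def transform(nse: NSE, clone_map: Dict[str, Tuple[str, int]]) -> NSE:
    """Map a clone-relative NSE onto chromosome co-ordinates.

    clone_map[clone] = (chromosome, chromosome position of clone base 1).
    Forward orientation is assumed; reversed clones are not handled.
    """
    name, start, end = nse
    chrom, clone_start = clone_map[name]
    offset = clone_start - 1
    return (chrom, start + offset, end + offset)


# Illustrative map only: cloneA is assumed to begin at base 2001 of ChromosomeI.
print(transform(("cloneA", 1, 100), {"cloneA": ("ChromosomeI", 2001)}))
```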

Several of these scripts duplicate functionality provided by Tim Hubbard's Perl modules (see above), but may be less algorithmically complex (a significant consideration for chromosome-sized GFF files!).

Please do email Ian Holmes if you require documentation for these programs.

  • Programs that are only tangentially related to GFF, but complement the GFF tools well:
    • exblxsym.pl - symmetrises an EXBLX file (ensures that for every A:B pair there is a single corresponding pair B:A)
    • exblxasym.pl - asymmetrises an EXBLX file (filters through only those pairs A:B for which B>A)
    • exblxcluster.pl - builds optimal clusters from an EXBLX stream
    • exblxfastcluster.pl - builds clusters from an EXBLX stream using a fast incremental heuristic
    • seqcluster.pl - builds optimal clusters from an EXBLX stream, ignoring sequence start and endpoint
    • exblxindex.pl - builds a quick lookup index for an EXBLX file
    • exblxsingles.pl - filters through only non-overlapping entries from an EXBLX stream
    • exblxsort.pl - sorts an EXBLX stream
    • exblxtidy.pl - tidies up an EXBLX stream (joins overlapping matches, prunes out lines corresponding to BLAST errors, etc.)
    • exblxtransform.pl - transforms from one co-ordinate system to another (e.g. clones to chromosomes). Requires GFFTransform.pm
    • cfilter.pl - flags low-complexity regions in a FASTA DNA database. The complexity is calculated as the entropy of variable-length oligomer composition in a variable-length sliding window
    • blasttransform.pl - BLASTs a clone database against itself then transforms, sorts and merges the results into chromosome co-ordinates according to a physical (sequence) map file, which is in GFF format. Requires GFFTransform.pm
    • SequenceIterator.pm - module to assist iterations on FASTA DNA databases; creates temporary files for each sequence
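
The entropy-based complexity measure mentioned for cfilter.pl can be illustrated with a simplified Python sketch. Unlike cfilter.pl, which is described as using variable-length oligomers and windows, this version takes a single fixed oligomer length and a single fixed window size. The threshold and the function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List, Tuple


def kmer_entropy(seq: str, k: int) -> float:
    """Shannon entropy (in bits) of the k-mer composition of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(seq: str, window: int, k: int,
                           threshold: float) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) windows with entropy below threshold."""
    flagged: List[Tuple[int, int]] = []
    for i in range(len(seq) - window + 1):
        if kmer_entropy(seq[i:i + window], k) < threshold:
            flagged.append((i + 1, i + window))
    return flagged
```

The flagged windows are themselves (start, end) pairs, so they could be written out as GFF records and passed to a masking step.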