GFF Perl Object Modules

GFF::GeneFeature.pm - Perl object module extension for GFF Gene Features

Synopsis

use GFF ; # which contains an implicit 'use GFF::GeneFeature'

Authors

Copyright (c) 1999. Created by Tim Hubbard (th@sanger.ac.uk).

Augmented by Richard Bruskiewich

Sanger Institute, Wellcome Trust Genome Campus, Cambs, UK. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

Description

GFF::GeneFeature (derived from the GFF base class) is a class of Perl object describing a single gene feature record ("line") in the General Feature Format ("GFF"). A GFF::GeneFeatureSet is the container object for a set of GFF::GeneFeature objects. A GFF::HomolGeneFeature is a homology-specific gene feature object class derived from the GFF::GeneFeature object class.

How to Read Method Protocols

Normal Perl data type notations are used for argument declarations in the method protocols. A backslash denotes argument passing by reference. Class methods are invoked using the 'class->method(args)' or 'method class args' Perl call formats.

Source code

The most current release of the Perl source code for this module is available here.

GFF::GeneFeature Construction/Input Methods

For all construction methods, an optional "$version" argument may be given which sets the created object to a specified GFF specification version. If this argument is not given, then the GFF class (package) default version is used (see GFF->version()).

new( $version )
Class method to construct a new, empty GFF::GeneFeature object.
new_from_line( $line, $version )
Class method to parse a gene feature $line string into a GFF::GeneFeature object (creates and returns the object reference).
new_from_parse( $line, \&parser, $version )
Class method to parse the $line string into a GFF::GeneFeature object using a user-defined &parser function (creates and returns the object reference). The &parser should expect the (empty) GFF::GeneFeature object reference as its first argument and the input line (string) as its second argument. Given these, the function should parse the input $line and load the GFF::GeneFeature object with the resulting data. A typical use for this method is to parse non-GFF native formats into GFF format.
parse_group( $group_field )
Method to parse the $group_field string into the invoking GFF::GeneFeature object [group] field. The method assumes that the invoking object already knows what GFF version it is.
copy( $version )
Method to duplicate the invoking GFF::GeneFeature object. If the optional '$version' argument is specified (and greater than 0) then the new copy is cast into the specified version. This allows for GFF version casting of GeneFeatureSets.
fromHomolGeneFeature( $version )
A "type cast" method to convert a GFF::HomolGeneFeature object into a GFF::GeneFeature object.
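As an illustration of the &parser protocol described for new_from_parse(), here is a minimal sketch of a parser callback. The comma-separated input layout, the 'csv_import' source name and the MockFeature package are all hypothetical: MockFeature only stands in for GFF::GeneFeature so the sketch runs without the module installed, and it mimics nothing beyond the documented set/get accessors.

```perl
use strict;
use warnings;

# Hypothetical stand-in for GFF::GeneFeature (NOT the real class): it only
# provides set/get accessors named after the documented access methods.
package MockFeature;
sub new { return bless {}, shift }
for my $field (qw(seqname source feature start end score strand frame)) {
    no strict 'refs';
    *{"MockFeature::$field"} = sub {
        my $self = shift;
        $self->{$field} = shift if @_;
        return $self->{$field};
    };
}

package main;

# Parser callback: receives the (empty) feature object first and the raw
# input line second, then loads the object through its accessors.
# The "name,start,end,strand" layout is an invented non-GFF format.
sub parse_csv {
    my ($gf, $line) = @_;
    chomp $line;
    my ($name, $start, $end, $strand) = split /,/, $line;
    $gf->seqname($name);
    $gf->source('csv_import');
    $gf->feature('exon');
    $gf->start($start);
    $gf->end($end);
    $gf->strand($strand);
    return $gf;    # returning the object is a convenience of this sketch
}

my $gf = parse_csv(MockFeature->new, "chr1,100,250,+\n");
print join("\t", $gf->seqname, $gf->feature, $gf->start, $gf->end, $gf->strand), "\n";
```

With the real module, the same callback would presumably be passed as GFF::GeneFeature->new_from_parse( $line, \&parse_csv ), leaving object creation to the class method.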

GFF::GeneFeature Output Methods

dump( \*OUTPUT, $tab, $newline, $flen, $tag )
Method uses dump_string() to write a formatted output of a GFF::GeneFeature object to a filehandle, OUTPUT. If \*OUTPUT is not given, \*STDOUT is used. When the optional $tab, $newline and $flen arguments are omitted, this method is guaranteed to dump well-formed GFF records meeting GFF version standards. The use of the optional parameters provides alternate, non-GFF compliant tabular text formats for output.
tab
The "$tab" argument is a boolean flag, where a "true" (non-null) value directs the use of tab as the field delimiter in the output line; otherwise, use blank space (flag is assumed "true" (non-null) if not specified).
newline
The "$newline" argument is passed to dump_group (see below) via dump_string.
flen
The "$flen" argument is a boolean flag (assumed "false" if not specified), where a non-null value stipulates that the length of the current output line should be printed as an extra field at the end of the output line. (Note: the extra length of this field is *not* added to the displayed line size, but the extra field is tab delimited, if $tab is set).
tag
Version 2 GFF: the optional $tag argument, passed to the dump_string() method, restricts [group] field dumping of the GFF to the specified tag. If $tag is undefined, all [group] tag values are dumped. If $tag is defined but null (empty string), then no [group] tag-values are dumped. Otherwise, a defined and non-empty $tag value is assumed to be a simple identifier or Perl regex matching tags whose values are to be included in the dump.
dump_string( $tab, $newline, $tag )
Method to return a formatted output of all the fields (including [group]) of a GFF::GeneFeature object to a string.
The optional "$tab" argument is a boolean flag, where a "true" (non-null) value directs the use of tab as the field delimiter in the output line; otherwise, use blank space (flag is assumed "true" (non-null) if not specified).
The "$newline" and $tag arguments are passed to dump_group (see below). See the dump() method above and the dump_group() method below for an explanation of the $tag argument.
dump_group( $tab, $newline, $len, $tag )
Method to return a formatted output of a GFF::GeneFeature object [group] field to a string. For Version 2 object dumps, this consists of semicolon delimited tag-value structures (see GFF specification).
The optional "$tab" argument is a boolean flag, where a "true" (non-null) value directs the use of tab as the field delimiter in the output line; otherwise, use blank space (flag is assumed "true" (non-null) if not specified).
The "$newline" argument is a boolean flag (assumed "false" if omitted), where a "true" (non-null) value directs both that [group] field tag-value pairs are printed one per line (with the semi-colon omitted), and that any '\n' characters in normally double quoted free text [group] value strings are converted to "real" newlines which wrap the text into multiline printed format. Both newline effects are printed within the restricted context of the [group] field column. In other words, the $newline flag is used for some semblance of "pretty printing" group fields.
If $tab and/or $newline are specified, then the $len argument should contain the length of the dump line preceding the [group] field.
If the $tag argument is undefined, then all tag-value pairs are dumped. Otherwise, $tag is assumed to be a simple tag or Perl regular expression matching the tag-value fields which are to be dumped.
Note: A subtle feature (or bug, depending upon your point of view) of this routine is that tag-values are dumped in "tag" ascending alphanumeric order, not necessarily in the original order read into the system (i.e. from an original GFF file read in by GFF->read()).
dump_matches( \*OUTPUT, $tab )
Method uses dump_string() to write a formatted output of a GFF::GeneFeature object to a filehandle. Includes information about any (overlap) matching GFF::GeneFeatures (note: match output for each record is multi-line, the matches designated by an indented '=>' bullet). If \*OUTPUT is not given, \*STDOUT is used. The "$tab" argument is a boolean flag, where a "true" (non-null) value directs the use of tab as the field delimiter in the output line; otherwise, blank space is used as the delimiter (assumed "true" (non-null) if not specified).
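The $tag filtering and tag ordering rules described for dump() and dump_group() can be sketched as a stand-alone function. This is not the module's code: format_group is a hypothetical name, and the quoting of non-numeric values and the ' ; ' separator are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Sketch only: format a Version 2 style [group] hash (tag => \@values) as a
# single semicolon-delimited string.
sub format_group {
    my ($group, $tag) = @_;
    return '' if defined $tag && $tag eq '';    # defined but empty: dump nothing
    my @parts;
    for my $name (sort keys %$group) {          # tags in ascending order
        # an undefined $tag dumps everything; otherwise treat $tag as a regex
        next if defined $tag && $name !~ /$tag/;
        # assumption: numbers are printed bare, free text is double quoted
        my @values = map { /^-?\d+(?:\.\d+)?$/ ? $_ : qq("$_") }
                     @{ $group->{$name} || [] };
        push @parts, join ' ', $name, @values;
    }
    return join ' ; ', @parts;
}

my %group = (
    Sequence => ['B0250'],
    Note     => ['first exon'],
    Exons    => [3],
);
print format_group(\%group), "\n";            # all tags, sorted by tag name
print format_group(\%group, '^Seq'), "\n";    # only tags matching the regex
```

Because the tags are sorted before output, the order in which they were stored in the hash (or read from a file) plays no part in the result, which mirrors the note about dump order above.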

GFF::GeneFeature Access Methods

The various GFF record fields may be set or queried by the following access methods. All the methods can take a single string argument to set the variable. With or without an argument, the methods return the current (or newly set) value, as a string, except as specifically noted below:

version()
GFF version.
seqname()
Name of the host sequence.
source()
Source of the sequence.
feature()
Feature type name.
start()
Start coordinate of feature.
end()
End coordinate of feature.
score()
Source score of feature (by method). Returned as a string even if a float number.
strand()
"+" (forward), "-" (reverse) or "." (n/a).
frame()
"0", "1", "2" or "." (n/a)
group() - Version 1
Returns or sets the optional [group] field value (a string).
group($tag,$value0,$value1,...,$valueN) - Version 2+ GFF
Under the Version 2 specification, the (optional) group field of a GFF record must be structured as an ACEDB .ace style tag-value set, flattened to one line by using semicolon delimiters instead of newlines. For this reason, the Version 2 group() method here returns a reference to an anonymous Perl hash which indexes references to arrays of values by the tag names (i.e. $gf->group()->{$tag} == \@values).
If the argument list is empty, then the reference to this hash is returned. If only a '$tag' name is given as an argument, the value list for that tag is returned (valueless tags are created in the object, but return undef as the valuelist - use the TAG() AUTOLOAD feature to test for such tags (see below)).
If values are provided in the call, the tag is set to these values (overwriting any previous values - see also group_value_list()).
group_value_list( $tag, \@values, $append )
Version 2 GFF: method sets and/or returns the reference to the array of values associated with a given $tag of a (Version 2) tag-value structure. The (optional) specification of a reference to such an array of values (\@values) resets the $tag hash value to the new array reference (and returns it). If $tag is undefined or null, then the function just returns the value list of the first tag (if any) in the tag-value pairs. If the $append argument is defined and non-null, then the given @values are appended to any existing value list (default: 0).
Version 1 GFF: just returns the [group] string value embedded in a single member list; $tag is ignored.
group_value( $tag, $index, $value )
Version 2 GFF: method sets and/or returns the element at the (zero based) $index position in the value list associated with a given $tag of a (Version 2) tag-value structure. If an $index is not given, the first value of the value list is returned. If $value is specified, the element at $index is set to it (note: if the $tag associated value array is undefined when this method is called, then it is created and its value is set to a single element list containing $value).
Version 1 GFF: just returns the [group] string, ignoring any $tag or $index provided. If $value is provided, then the [group] name is reset (same as $gf->group($value)).
TAG(\@values) - Version 2 GFF only
[group] field tags given as method names are recognized, via Perl AUTOLOAD, as accesses to the corresponding tag-value field. An optional reference to an array of tag values may be given to associate with the tag. With or without values, the current value list reference is returned. Valueless tags simply return a boolean '1' if they exist. If the tag does NOT exist in the given object, then '0' is returned.
deleteTag($tag)
This method deletes [group] values and tags. Note: this operation directly modifies the gene feature objects concerned.
Version 1 GFF - $tag argument is ignored; the method undefines the [group] field value.
Version 2 GFF - if no $tag argument is provided, the entire [group] tag-value array is cleared of tags and values. If a $tag argument is provided, only the indicated tag (if it exists) is deleted from the [group] field, along with any associated values.
comment()
Trailing line comments associated with a GFF record. Under Version 2, such comments, starting after the [group] field, must be delimited with a '#' character.
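The Version 2 [group] structure described above (a hash of tag names, each indexing a reference to an array of values) can be pictured with plain Perl data. The functions value_list and value_at below are hypothetical analogues of group_value_list() and group_value(), written only to show the replace, append and zero-based index behaviour; they are a sketch, not the module's implementation, and storing a valueless tag as undef is an assumption.

```perl
use strict;
use warnings;

# Tag name => reference to value array, as in $gf->group()->{$tag} == \@values.
my %group = (
    Sequence => ['B0250'],
    Alias    => ['gene-1', 'gene-2'],
    Pseudo   => undef,    # assumption: a valueless tag has no value list
);

# Rough analogue of group_value_list( $tag, \@values, $append ).
sub value_list {
    my ($group, $tag, $values, $append) = @_;
    if ($values) {
        if ($append && $group->{$tag}) {
            push @{ $group->{$tag} }, @$values;    # extend the existing list
        } else {
            $group->{$tag} = [@$values];           # replace (or create) the list
        }
    }
    return $group->{$tag};
}

# Rough analogue of group_value( $tag, $index ): zero based, defaults to 0.
sub value_at {
    my ($group, $tag, $index) = @_;
    my $list = $group->{$tag} or return undef;
    return $list->[ $index || 0 ];
}

value_list(\%group, 'Alias', ['gene-3'], 1);          # append to Alias
print join(',', @{ value_list(\%group, 'Alias') }), "\n";
print value_at(\%group, 'Alias', 1), "\n";            # second Alias value
print exists $group{Pseudo} ? 1 : 0, "\n";            # valueless tag exists
```

Testing a valueless tag with exists, rather than by its value list, corresponds to the distinction drawn above between group($tag) returning undef and the TAG() AUTOLOAD feature returning '1'.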

GFF::GeneFeature Analysis Methods

The following methods compute upon GFF features.

length()
Calculate the segment length (end-start+1) of a given feature.
remap( $offset )
Method to add an $offset amount to the start and end coordinates of a GFF::GeneFeature. Note: the start and end of the original object, not a copy, are changed.
match( $GF2, $tolerance, $single, $strand )
Method to compare two GFF::GeneFeature objects for an overlap; it returns 5 scalars as an array. The first GFF::GeneFeature object invokes the method, giving the second GFF::GeneFeature object as the first argument ("$GF2" above), plus optional method arguments "$tolerance", "$single" and "$strand" to guide the analysis (each is assumed to be 0 when not explicitly given).
Based upon their specified start and end coordinates, two GFF::GeneFeatures will either overlap perfectly, partially or not at all (are "misses"). The $tolerance value specified controls the match decision for each category of overlap as follows:
  1. Specifying a $tolerance value of 0 dictates that an exact match is required, that is, that the corresponding 5' and 3' coordinate ends of both GFF::GeneFeature objects must be equal to one another.

    Release 2.106 revision: optionally, the $tolerance argument can now be a reference to an array of 5' and 3' end specific tolerance pairs [t55, t53, t35, t33], where t55 == 5' of the 5' end of the gene feature, t53 == 3' of the 5' end of the gene feature, etc. (Note: 5' is a function of strandedness, if any, or simply 'start' for '.' strand objects.)

  2. For partial overlaps of GFF::GeneFeature objects, if the "$tolerance" is set to a negative number then, ceteris paribus, the two overlapping GFF::GeneFeature objects are matched unconditionally.

  3. If two GFF::GeneFeature object segments overlap imperfectly but a positive, non-zero $tolerance is specified, then a match is successful if the (absolute value of the) extent of both the 5' and 3' mismatch in coordinates is less than or equal to the tolerance value (see also the effect of the "$single" method argument below).

  4. If two GFF::GeneFeature object segments do not overlap at all, but if a negative $tolerance value less than -1 is specified, then a match is declared if the misses are within $tolerance, that is, if the difference in coordinates of the closest segment ends of the two GFF::GeneFeature objects is less than the $tolerance.

If a value of 1 is given for the "$single" argument to the method, then the above $tolerance conditions, for positive $tolerance values and imperfect GFF::GeneFeature overlaps, are relaxed such that either a 5' or a 3' end mismatch within tolerance results in a positive match.
If "$strand" is given (i.e. not 0), then the match fails unless the two GFF::GeneFeature objects lie on the same strand. If $strand is zero, then strandedness of features is completely ignored in the match comparison (i.e. can also be '.' == unknown).
The 5 scalar fields returned in the match @result array have the following values:
$result[0]
1 ("true") or 0 ("false") indicates if objects match by criteria supplied
$result[1]
For overlaps: number indicating the signed difference in the 5' ends of the 2 objects. For misses within tolerance: 3' segment end "closest" coordinate of the 5'-most GFF::GeneFeature.
$result[2]
For overlaps: number indicating the signed difference in the 3' ends of the 2 objects. For misses within tolerance: 5' segment end "closest" coordinate of the 3'-most GFF::GeneFeature.
$result[3]
Detailed match information: 0 if no overlap; 1 if match; 2 if overlap, but rejected on tolerance+single; 3 if overlap, but rejected on strand; 4 if no overlap but accepted on tolerance
$result[4]
Descriptive string about overlap, if there is one
overlap_logical( $GF2, $verbose )
Method that uses the GFF::GeneFeature::match() method to return a non-null ("true") or 0 ("false") answer on whether or not there is any detectable 'overlap' (on either strand) between a pair of features: the invoking GFF::GeneFeature object and a second object specified as the first argument ("$GF2"). If the $verbose flag is defined and not null, then a detailed match description is returned.
match_logical( $GF2, $verbose )
Method that uses the GFF::GeneFeature::match() method to return a non-null ("true") or 0 ("false") answer on whether or not there is an exact strand/coordinate (overlap) match between a pair of features: the invoking GFF::GeneFeature object and a second object specified as the first argument ("$GF2"). If the $verbose flag is defined and not null, then a detailed match description is returned.
overlap_merge( $GF2, $tolerance, $strand, $group_tag, $addscores, $copy )
Method to merge the start and end coordinates of a second GFF::GeneFeature object (given as the "$GF2" object reference argument) into those of the invoking GFF::GeneFeature, if the two GFF::GeneFeatures overlap within $tolerance positions (default 0).
Optional '$strand' argument forces strand sensitivity in merging.
The optional $group_tag argument is overloaded to two forms:
  1. a simple [group] field tag may be given, under which the method records merge info. This includes: [group] 'Sequence' tag value (if available), <source>, <feature>, <start>, <end> values.
  2. a reference to a Perl function may be given, which expects to be invoked with the two overlapping gene features in the set. This function would generally modify the first gene feature object in some customized way.
The optional $addscores argument stipulates that the scores of merged objects are to be added.
If the $copy argument is set, then the method returns a copy constructed from the merged object, instead of the original invoking object (Note: separate copies are made for each merger event, so a 'self_overlap_merge' may be useful after this method is called).
Returns the merged invoking GeneFeature (or a copy of it, if $copy is set) iff an overlap merge occurred, otherwise 0.
addMatch( $gfm, $e5, $e3 )
Method to record, for the invoking GFF::GeneFeature, an overlap match of this GFF::GeneFeature with another GFF::GeneFeature object ('$gfm'). The associated 5' and 3' overlap offsets (see GFF::GeneFeature::match()) are also recorded.
getMatches()
Method to get GFF::GeneFeature match records for the invoking GFF::GeneFeature object. Returned as a reference to a hash, keyed on GFF::GeneFeature object references ('$gfm'), with values that are references to an anonymous array [$gfm, $e5, $e3] of the matched GFF::GeneFeature and associated offsets (see addMatch above).
getMatches_logical()
Method to return non-null ("true") or 0 ("false") depending upon whether the GFF::GeneFeature has matches or not.
addMember( $GFF, $membergroup )
Method to add a "member" record to indicate the parent GFF object for a specified grouping ("$membergroup") of GFF::GeneFeatures.
getMember( $membergroup )
Method to retrieve the reference to the parent GFF object of the GFF::GeneFeature object indexed under this particular grouping ("$membergroup").
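The $tolerance and $single rules given for match() on overlapping features can be illustrated with a simplified, stand-alone sketch. It is not the module's code: simple_match is a hypothetical name, features are reduced to [start, end] pairs, strand is ignored (so "5' end" just means start), the sign convention of the differences is an assumption, and neither the "miss within tolerance" case nor the array form of $tolerance is handled.

```perl
use strict;
use warnings;

# Returns (matched, start difference, end difference) for two features given
# as [start, end] array references. Differences are second minus first,
# which is an assumption of this sketch.
sub simple_match {
    my ($f1, $f2, $tolerance, $single) = @_;
    $tolerance ||= 0;
    my $d5 = $f2->[0] - $f1->[0];
    my $d3 = $f2->[1] - $f1->[1];
    my $overlap = $f1->[0] <= $f2->[1] && $f2->[0] <= $f1->[1];
    return (0, $d5, $d3) unless $overlap;      # misses are not handled here
    return (1, $d5, $d3) if $tolerance < 0;    # negative: any overlap matches
    my $ok5 = abs($d5) <= $tolerance;          # tolerance 0 demands equal ends
    my $ok3 = abs($d3) <= $tolerance;
    # $single relaxes the test to either end, for positive tolerances only
    my $matched = ($single && $tolerance > 0) ? ($ok5 || $ok3) : ($ok5 && $ok3);
    return ($matched ? 1 : 0, $d5, $d3);
}

my @cases = (
    [ [100, 200], [100, 200], 0,  0 ],    # identical coordinates
    [ [100, 200], [105, 198], 5,  0 ],    # both ends within tolerance
    [ [100, 200], [105, 220], 5,  0 ],    # 3' end outside tolerance
    [ [100, 200], [105, 220], 5,  1 ],    # same, but $single is set
    [ [100, 200], [150, 300], -1, 0 ],    # negative tolerance, partial overlap
);
for my $case (@cases) {
    my ($matched, $d5, $d3) = simple_match(@$case);
    print "matched=$matched d5=$d5 d3=$d3\n";
}
```

Keeping the overlap test separate from the tolerance test makes it easy to see which rule decided a given outcome, much as $result[3] distinguishes the reasons for acceptance or rejection in the real method.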

Revision History

2.106 (19/10/99) - rbsk
match() method can take a reference to an array for the $tolerance value, consisting of pairs of 5' and 3' end specific tolerances.
2.105 (3/10/99) - rbsk
$group_tag in overlap_merge() and associated code inherited from GFF::GeneFeatureSet::self_overlap_merge().
group() method bug: fixed Version 1 GFF crash bug
2.104 (30/9/99) - rbsk
$tag argument in dump(), dump_string(), dump_group()
2.103 (27/9/99) - rbsk
$strand argument in overlap_merge()
2.102 (21/9/99) - rbsk
added $copy argument to overlap_merge() method; now also returns $self instead of simple '1' iff overlap occurs
created the deleteTag() method
16/9/99 - rbsk
match() method should totally ignore <strand> if $strand not set?
8/9/99 - rbsk
overlap_merge() $addscores argument.
8/9/99 - rbsk
fixed GeneFeatureSet.pm handling of valueless [group] tags
group() method can now be used to set tags without values or tags with values, by $gf->group('tag') or $gf->group('tag','value0','value1',...,'valueN'). This seems redundant with the methods group_value() and group_value_list(), which are similar but slightly different in their operation.
For Version 2, AUTOLOAD now returns a boolean '1' for any [group] value tag which exists but has no values. Note that for Version 2, AUTOLOAD names not recognized ALL default to tag methods and fail silently by returning NULL (rather than throwing an exception, as in Version 1 GFF).
31/8/99 - rbsk
$append argument added to group_value_list() method
overlap_merge() method: optional '$tolerance' value provides for overlap merge where the two features lie within $tolerance base pairs of each other
27/8/99 - rbsk
sub parse_group() bug fix: couldn't parse some instances of 'end' double quotes
18/8/99 - rbsk
coded explicit primary field access methods, rather than relying upon AUTOLOAD (i.e. to gain efficiency - George Hartzwell suggestion :-)
12/7/99 - rbsk
custom AUTOLOAD recognizes GFF Version 2 [group] tags as access functions i.e. $gf->Sequence() == $gf->group_value_list('Sequence') and $gf->Sequence(\@VALUES) == $gf->group_value_list('Sequence',\@VALUES)
4/5/99 - rbsk
renamed GeneFeature.pm => GFF::GeneFeature.pm
20/4/99 - rbsk
Instead of a list, the getMatches() method now returns a reference to a hash, keyed on GeneFeature references (getMatches_logical is now strictly boolean)
19/4/99 - rbsk
GeneFeaturePair.pm functionality merged with GeneFeature.pm (semantic change)
Changed 'pairs' to 'matches' in object data structure in order to capture 'one-to-many' semantics:
  • $fields{'Pairs'} => $fields{'matches'}
  • addPair => addMatch # adds a matching GeneFeature
  • getPair => getMatches # returns array of matches
  • getPair_logical => getMatches_logical # returns number of matches (0=> false)
  • dump_pairs => dump_matches # dumps all GF's with matches (and the matches too)
intersect_overlap_pairs() => intersect_overlap_matches:
bug found: use 'absolute' coordinate offsets of second relative to first GF
17/3/99 - rbsk
added $newline arg to dump(), dump_string() & dump_group()
16/3/99 - rbsk
GeneFeature objects now subclassed from GFFObject class;
13/3/99 - rbsk
bug fix in group_value() function, dump_group(), et al.
3/3/99 - rbsk
extensively revised and improved the documentation
added Version 2 GFF code, especially group() field management methods