GFF Perl Object Modules
GFF::Analysis.pm - Perl utility library for General Feature Format ("GFF") analysis routines using the GFF Perl Object libraries.
Synopsis
# include what functions you need; GFF::Analysis.pm contains an implicit 'use GFF ;'
use GFF::Analysis qw(constructGene makeGenes mRNA featureLengthStats segregateGeneFeatures normalize
mergeGeneFeatures cleanUpSeqName normalize_mRNA);
Description
GFF::Analysis (derived from GFF) is a utility library for General Feature Format (GFF) analysis, built upon the GFF Perl module library.
Exports:
- constructGene()
- makeGenes()
- mRNA()
- featureLengthStats()
- segregateGeneFeatures()
- cleanUpSeqName() # name cleanup protocol used by normalize()
- normalize()
- mergeGeneFeatures()
- normalize_mRNA() # calls segregateGeneFeatures(), normalize(), and mergeGeneFeatures() sequentially
Authorship
Copyright (c) 1999 Created by Richard Bruskiewich.
Sanger Institute, Wellcome Trust Genome Campus, Cambs, UK. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.
Source code
The most current release of the Perl source code for this module is available here. All bug reports may be submitted to Richard Bruskiewich.
Methods
Note: These methods are not 'object' invoked, but take a GFF::GeneFeatureSet reference as their first argument.
- constructGene($gffi)
- Given a GeneFeatureSet object ('$gffi') whose GeneFeatures have <feature> fields specifically labelled 'exon' and possibly 'promoter', 'transcription_start' and/or 'polyA_signal', and which all belong to a single gene as defined by a common [group] field label, this method returns an augmented GeneFeatureSet object fully describing a 'gene', containing introns, UTRs and flanking sequences inferred from the original GeneFeature set.
- GFF::GeneFeatures in the calling GeneFeatureSet object are all assumed to be on the same <strand>, and to come from the same <seqname>, <source> and group_value('Sequence') named gene (i.e. makeGenes() clustering); thus, the method looks at the first GeneFeature encountered in the GeneFeatureSet for all these values! The <frame> and <score> are assumed to be irrelevant for all features added in this method (e.g. introns), and are thus set to '.'. The incoming GeneFeatures are also assumed to be non-overlapping, since this assumption drives the identification of 'inter' GeneFeature gaps ('introns' et al.).
- Also, if the first (and/or last) GeneFeature start (end) does not coincide with the start (end) of the <seqname> region range, then the 5' (and 3') flanking regions are inferred and so labelled in the <feature> field. This latter labelling is also influenced by the presence of 'promoter', 'transcription_start' and/or 'polyA_signal' GeneFeatures.
- makeGenes($gffi)
- After clustering a GeneFeatureSet of predicted exons, promoters, polyA's etc. by 'gene' groups (i.e. by Version 1 [group] tags or by Version 2 [group] field 'Sequence' tag-values), this method uses the constructGene() method (see above) to infer additional GeneFeatures (e.g. introns, [5'|3'] UTRs and flanking regions). The method then returns all the GeneFeatures (old and new) in a new GeneFeatureSet object.
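The gap inference described for constructGene() can be illustrated without the GFF libraries. The following is a minimal sketch only, assuming exons are given as plain [start, end] pairs on a single strand; infer_gaps() is a hypothetical helper and not part of GFF::Analysis.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GFF::Analysis): given non-overlapping
# [start, end] exon spans, return the gaps between consecutive spans.
sub infer_gaps {
    my @exons = sort { $a->[0] <=> $b->[0] } @_;
    my @gaps;
    for my $i (1 .. $#exons) {
        my ($prev_end, $next_start) = ($exons[$i - 1][1], $exons[$i][0]);
        # A gap exists only if at least one base separates the two features.
        push @gaps, [$prev_end + 1, $next_start - 1]
            if $next_start - $prev_end > 1;
    }
    return @gaps;
}

my @introns = infer_gaps([100, 200], [301, 400], [501, 650]);
print join(', ', map { "$_->[0]-$_->[1]" } @introns), "\n";   # 201-300, 401-500
```

The sketch relies on the same non-overlap assumption stated above; with overlapping input it would simply report no gap between the overlapping features.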
- mRNA($gffi, $seq, $pattern)
- Method to return a single string representing an mRNA or similar gapped entity, built from the gene features in the given GeneFeatureSet whose <feature> field matches the $pattern. The method expects a string '$seq' corresponding to the sequence from which the features are to be extracted. Returns a concatenated string of all the subsequences defined by the matching gene features.
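The concatenation step can be sketched in plain Perl. This is an illustration under stated assumptions, not the module's code: features are modelled as plain hashes rather than GeneFeature objects, only the forward strand is handled, and features are joined in order of start coordinate. splice_features() is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for GeneFeature records: plain hashes with a
# feature label and 1-based inclusive start/end coordinates (as in GFF).
sub splice_features {
    my ($features, $seq, $pattern) = @_;
    my $out = '';
    for my $f (sort { $a->{start} <=> $b->{start} } @$features) {
        next unless $f->{feature} =~ /$pattern/;
        # GFF coordinates are 1-based; substr() offsets are 0-based.
        $out .= substr($seq, $f->{start} - 1, $f->{end} - $f->{start} + 1);
    }
    return $out;
}

my $seq      = 'AAACCCGGGTTTAAACCC';
my @features = (
    { feature => 'exon',   start => 1,  end => 3  },
    { feature => 'intron', start => 4,  end => 9  },
    { feature => 'exon',   start => 10, end => 12 },
);
print splice_features(\@features, $seq, 'exon'), "\n";   # AAATTT
```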
- featureLengthStats($gffo,\%statab,$label)
- This method applies the GFF::GeneFeatureSet::lengthStats() method to a given GeneFeatureSet, storing the results in a primary hash (passed by reference, \%statab) as a secondary-level hash reference indexed under the given $label, i.e. \%statab->{"$label"}->{'<data_label>'}
- The '<data_label>' secondary hash keys for these statistics are as follows:
- 'M'
- mean length
- 'SD'
- standard deviation of lengths
- 'N'
- total number of features
- 'Cov'
- total sum of lengths ('coverage')
- 'Cov2'
- total sum of lengths squared
- 'LenC'
- reference to an array of feature incidence counts for each class of length
- In addition to setting the table, the method returns these values (in a list context) as a list, in the order indicated above.
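To make the table layout concrete, here is a minimal sketch that fills the same keys from a plain list of feature lengths. It does not call GFF::GeneFeatureSet::lengthStats(); length_stats() is a hypothetical helper, and three points are assumptions: 'Cov2' is taken to be the sum of squared lengths, 'SD' is computed as a sample standard deviation, and 'LenC' uses one class per exact length.

```perl
use strict;
use warnings;

# Hypothetical helper: fill $statab->{$label} with the keys listed above.
sub length_stats {
    my ($lengths, $statab, $label) = @_;
    my ($n, $cov, $cov2) = (0, 0, 0);
    my @lenc;
    for my $len (@$lengths) {
        $n++;
        $cov  += $len;
        $cov2 += $len * $len;   # assumption: sum of squared lengths
        $lenc[$len]++;          # assumption: one class per exact length
    }
    my $mean = $n ? $cov / $n : 0;
    # Assumption: sample (n - 1) variance; the module's choice is not stated.
    my $var = $n > 1 ? ($cov2 - $n * $mean * $mean) / ($n - 1) : 0;
    $var = 0 if $var < 0;       # guard against rounding below zero
    my %s = (M => $mean, SD => sqrt($var), N => $n,
             Cov => $cov, Cov2 => $cov2, LenC => \@lenc);
    $statab->{$label} = \%s;
    return @s{qw(M SD N Cov Cov2 LenC)};
}

my %statab;
length_stats([100, 200, 300], \%statab, 'Exon');
printf "N=%d mean=%.1f SD=%.1f coverage=%d\n",
    @{ $statab{Exon} }{qw(N M SD Cov)};   # N=3 mean=200.0 SD=100.0 coverage=600
```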
- segregateGeneFeatures($gffi,\%statab,$trace)
- Given a GeneFeatureSet containing 'sequence', 'exon' and 'CDS' records, this method returns a list of four $gff object references, one for each of the segregated subsets: 'genes', 'pseudogenes', 'exons' and 'CDSs', respectively.
- If a suitable reference '\%statab' to a hash table is given, then the method also compiles statistics for each subset into that table using the featureLengthStats() method (see above). That is, the table is of the form: \%statab->{('Gene'|'Exon'|'CDS')}->{'<data item>'} where the '<data item>'s are the statistics returned by featureLengthStats().
- Another side effect of the method is that mRNA/coding_exon and CDS/exon redundancies are filtered out of the dataset.
- The optional '$trace' boolean flag, when non-null, turns on GFF module tracing.
- mergeGeneFeatures($gffg,$gffp,$gffe,$gffc,\%statab,$trace)
- This method performs the reverse operation to that of segregateGeneFeatures(), taking four types of features - 'gene (sequence)', 'pseudogene (sequence)', 'exon' and 'CDS' records - and remerging them into cohesive 'gene' sets based upon overlapping GFF coordinates. The method then returns a list of references to each of the 'true' and 'pseudo' GeneFeatureSets.
- Along the way, a further set of statistics may (optionally) be computed and stored in a hash table passed to the function by reference (\%statab). These statistics pertain to 'exons' and 'CDSs' per (pseudo)gene, and are stored in the hash at the primary level under the 'Gene' key, at the secondary level under keys such as 'Exon' and 'CDS', and then under the specific data items, e.g. \%statab->{Gene}->{(Exon|CDS|Transcript|Translation)}->{<data item>}
The '<data item>'s are as follows:
- 'M'
- mean features per gene
- 'SD'
- standard deviation of features per gene
- 'X'
- total sum of given features
- 'X2'
- total sum of given features squared
- 'GeneC'
- reference to an array of gene incidence counts for each class of 'per gene' values
- The optional '$trace' boolean flag, when non-null, turns on GFF module tracing.
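Merging by overlapping coordinates, and the 'per gene' counts that feed the statistics above, can be sketched as follows. This is an illustration only: genes and exons are plain hashes, strand is ignored, and overlaps() and exons_per_gene() are hypothetical helpers rather than functions of this module.

```perl
use strict;
use warnings;

# Two closed intervals overlap when each starts before the other ends.
sub overlaps {
    my ($x, $y) = @_;
    return $x->{start} <= $y->{end} && $y->{start} <= $x->{end};
}

# Hypothetical helper: count, for each gene span, the exons overlapping it.
sub exons_per_gene {
    my ($genes, $exons) = @_;
    my %count;
    for my $g (@$genes) {
        # grep in scalar context yields the number of matching exons.
        $count{ $g->{name} } = grep { overlaps($g, $_) } @$exons;
    }
    return \%count;
}

my @genes = ( { name => 'geneA', start => 100,  end => 900  },
              { name => 'geneB', start => 2000, end => 2600 } );
my @exons = ( { start => 100,  end => 250  },
              { start => 700,  end => 900  },
              { start => 2100, end => 2300 } );
my $count = exons_per_gene(\@genes, \@exons);
print "$_: $count->{$_} exon(s)\n" for sort keys %$count;
```

The per-gene counts produced this way are the kind of values from which 'M', 'SD', 'X' and 'X2' could then be derived.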
- cleanUpSeqName($name)
- Default name filter used by normalize(), which removes:
- CDS/mRNA suffixes
- 5'/3' suffixes
- alphabetic or digit isoform suffixes
- The cleaned up name is returned.
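A user-supplied name filter following the same contract (take a name, return the cleaned name) might look like the sketch below. The suffix spellings and the example names are assumptions made for illustration; they are not taken from the GFF::Analysis source, and digit isoform suffixes are deliberately left alone here. my_name_filter() is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical filter; the suffix conventions below are assumptions for
# illustration and are not copied from cleanUpSeqName().
sub my_name_filter {
    my ($name) = @_;
    $name =~ s/\.(?:CDS|mRNA)$//i;   # trailing ".CDS" or ".mRNA"
    $name =~ s/[._-][53]'$//;        # trailing 5' or 3' marker
    $name =~ s/(?<=\d)[a-z]$//;      # single-letter isoform suffix after a digit
    return $name;
}

# Made-up example names.
print my_name_filter('ZK637.8a.mRNA'), "\n";   # ZK637.8
print my_name_filter('ZK637.8'), "\n";         # ZK637.8
```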
- normalize($gffndg, $gffndp, $gffnde, $gffndc, \&nameFilter, \%statab, $trace)
- Method to return a normalized set of gene descriptions in which all isoforms and exons have been merged into distinct, non-overlapping, non-duplicated sets of data.
- &nameFilter should take a $name string as input, perform clean-up of name decorations, then return the $name string. If nameFilter is undef or NULL, then name clean-up is suppressed. If nameFilter is defined and non-NULL but not a reference to a subroutine, then a standard clean-up of names is performed. Otherwise, the user-supplied subroutine is used.
- normalize_mRNA($gffi, \&nameFilter, $source, \%statab, $trace)
- Invokes segregate, normalize and merge routines (see above) to normalize transcript GFF.
- The optional $source argument is used to rewrite the <source> field of the file to a uniform value (default: 'Gene').
- Providing a defined reference to an empty hash, '\%statab', triggers the compilation of statistics about the file, as generated in the segregateGeneFeatures() and mergeGeneFeatures() methods (see above).
- Separate statistics are generated for each of transcripts, genes, exons and CDSs. (Note that 'genes' are defined as the normalized transcript sequence spans output by the method, whereas 'transcripts' are the sequence records before merge overlaps are done.) Note that the gene sequence count 'N' is taken after normalization, but all other feature counts 'N' are unnormalized numbers.
- A defined and non-null $trace flag turns on runtime tracing of the normalization method.
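The distinction between 'transcripts' (spans before overlaps are merged) and 'genes' (normalized spans) can be illustrated with a simple interval merge. This is a sketch under assumptions, not the module's algorithm: spans are plain [start, end] pairs on one strand of one sequence, and merge_spans() is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse overlapping [start, end] spans into
# non-overlapping spans (single strand, single sequence assumed).
sub merge_spans {
    my @spans = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @merged;
    for my $s (@spans) {
        if (@merged && $s->[0] <= $merged[-1][1]) {
            # Overlap: extend the current merged span if needed.
            $merged[-1][1] = $s->[1] if $s->[1] > $merged[-1][1];
        }
        else {
            push @merged, [@$s];   # copy so the input is not modified
        }
    }
    return @merged;
}

my @transcripts = ([100, 500], [300, 800], [1200, 1500]);
my @genes       = merge_spans(@transcripts);
printf "%d transcripts -> %d genes: %s\n", scalar @transcripts, scalar @genes,
    join(', ', map { "$_->[0]-$_->[1]" } @genes);
```

In this example the transcript count ('N' before merging) is 3, while the gene count ('N' after normalization) is 2.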
Revision history
- 2.09 (19/10/99) - rbsk
- segregateGeneFeatures() and mergeGeneFeatures() handle pseudogenes separately and explicitly
- 2.08 (16/10/99) - rbsk
- for greater flexibility, I extracted normalize_mRNA() functionality into externally visible methods:
- featureLengthStats()
- segregateGeneFeatures()
- mergeGeneFeatures()
- 2.07 (13/10/99) - rbsk
- added $statab argument to normalize_mRNA()
- 2.06 (9/10/99) - rbsk
- now exporting cleanUpSeqName(); keep 'em:' prefixes for now
- 2.05 (27/9/99) - rbsk
- need to make all normalization procedures strand sensitive!
- 2.04 (21/7/99) - rbsk
- normalize_mRNA() should not consider source(*CDS) records with feature(sequence) to be redundant, in case the specific 'sequence' only has a CDS specified (but no mRNA); this method now also relabels <source> fields to 'TranscriptSet'
- 2.03 (16/7/99) - rbsk
- removed draw_graph() from here (into GFF::Graph()) because of Curve_plot.pm usage, which won't be universal outside Sanger (at least until Raphael decides to release it for general use?)
- 2.02 (14/7/99) - rbsk
- transferred draw_graph() from GeneFeatureSet to here and generalized to multiple GFF plot (API changed)
- 2.01 (12/7/99) - rbsk
- creation from miscellaneous GFF analysis code; transferred methods makeGenes(), constructGene() and mRNA() from GFF::GeneFeatureSet to this module


