Overview
How does Projector work?

Figure 1.
Projector takes two input DNA sequences (one from mouse, one from human) which are known to be, or which seem to be,
similar to each other, together with a set of known genes for one of the two input sequences. It simultaneously
predicts the corresponding genes of the un-annotated sequence and an alignment between the two sequences. Unlike other
homology-based gene prediction methods (for example Genewise by E. Birney et al.), which map the amino acid sequence
of a protein to a DNA sequence, Projector maps known gene structures to another, similar DNA sequence in order to
predict related gene structures. Projector thus uses sequence homology directly at the DNA level and takes the
conservation of gene structures between related genes explicitly into account. Projector can model partial, complete
and multiple genes and can also predict pairs of genes which are related by events of exon-fusion or exon-splitting
(i.e. which have a different number of exons). The mathematical method underlying Projector is a pair hidden Markov
model, the same one used in Doublescan, a comparative gene prediction program which predicts genes ab initio. Like
Doublescan, Projector requires the major similarities in the two input sequences to be collinear, i.e. to appear in
the same order in both sequences.
Data sets
Here is the Projector test set:
as well as the set of genes predicted by:
that were used in the publication.
Acknowledgements
Many thanks to Roger Pettett for helping me set up this webserver.
Web server
How does this Projector Web server work?
Submitting jobs
To submit jobs, go to the submission web site
-
DNA sequences
-
The first of the two input files to Projector contains the pairs of similar DNA sequences (each pair containing one
mouse and one human sequence) which are to be analysed by Projector. Projector requires the DNA sequences in a variant
of the fasta format which uses a special header line. Make sure that the DNA sequences consist of the letters A, C, G
and T only.
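The letter check can be done before pasting anything into the submission form. The following is a minimal sketch, not part of Projector: it assumes that header lines start with ">" as in standard fasta (the Projector-specific header layout is not parsed) and that only upper-case A, C, G and T are acceptable, since the text above does not say whether lower-case letters are allowed. The function name `check_sequences` is hypothetical.

```python
def check_sequences(lines):
    """Return {header: set of offending characters} for a fasta-like input.

    Assumptions (not confirmed by the Projector documentation):
    - header lines start with '>' and are skipped without being parsed
    - only upper-case A, C, G and T count as valid sequence letters
    """
    allowed = set("ACGT")
    problems = {}
    header = None  # sequence lines seen before any header are reported under None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line
            continue
        bad = set(line) - allowed
        if bad:
            problems.setdefault(header, set()).update(bad)
    return problems


if __name__ == "__main__":
    example = [">pair1", "ACGT", "ACNT", ">pair2", "GGCC"]
    for name, bad in check_sequences(example).items():
        print(name, "contains invalid letters:", sorted(bad))
```

An empty result means every sequence line passed the check; any reported header points to a sequence that should be cleaned up or removed before submission.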
-
Paste the contents of your fasta file in the first window of the submission
web site. Before submitting the sequences to Projector, it is a good idea to verify for each sequence pair that
the larger regions of similarity between the two sequences are collinear. This can be done using programs like
Dotter or Blastn.
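As a rough first pass before running Dotter or Blastn, collinearity can also be estimated with a few lines of code. The sketch below is a crude heuristic of my own, not something Projector provides and not a replacement for a dot plot: it collects k-mers that occur exactly once in each sequence, uses the shared ones as anchors, and reports the fraction of anchors that lie on the longest chain that increases in both sequences. The function names and the default `k=12` are assumptions chosen for illustration, and only same-strand matches are considered.

```python
from bisect import bisect_left


def unique_kmer_positions(seq, k):
    """Map each k-mer that occurs exactly once in seq to its start position."""
    seen = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # -1 marks a k-mer that has been seen more than once
        seen[kmer] = i if kmer not in seen else -1
    return {kmer: pos for kmer, pos in seen.items() if pos >= 0}


def collinear_fraction(seq_a, seq_b, k=12):
    """Fraction of shared unique k-mer anchors lying on one increasing chain.

    Returns None when the two sequences share no unique k-mers.
    A value close to 1.0 suggests that the similarities are collinear.
    """
    pos_a = unique_kmer_positions(seq_a, k)
    pos_b = unique_kmer_positions(seq_b, k)
    anchors = sorted((pos_a[m], pos_b[m]) for m in pos_a.keys() & pos_b.keys())
    if not anchors:
        return None
    # Longest strictly increasing subsequence of the positions in seq_b,
    # with anchors already ordered by their position in seq_a.
    tails = []
    for _, b in anchors:
        idx = bisect_left(tails, b)
        if idx == len(tails):
            tails.append(b)
        else:
            tails[idx] = b
    return len(tails) / len(anchors)


if __name__ == "__main__":
    x, y = "ACGTTGCA", "GGATCCAA"
    print(collinear_fraction(x + y, x + y, k=4))  # same order in both
    print(collinear_fraction(x + y, y + x, k=4))  # blocks swapped
```

A low fraction only signals that a pair deserves a closer look with a proper dot plot; short or repetitive sequences may yield too few anchors for the number to mean much.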
-
example of a valid fasta file
with three pairs of mouse and human DNA sequences
-
Gene structures
-
The second of the two input files to Projector contains the gene structures which are known for the FIRST sequence in
each DNA sequence pair in the above file. The gene structures should come in gtf-format. In the above example, in
which the first sequence of each DNA pair is always the mouse sequence, the corresponding gtf file has to contain the
known mouse genes. Even though Projector can deal with partial and multiple genes, you still have to ensure that the
input gtf ANNOTATION is COMPLETE, as Projector will deduce the remaining gene structure from the gtf information (e.g.
if there are only three CDS lines and no Start_Codon and Stop_Codon lines in the input gtf file, Projector will assume
that the sequence contains three exons with intronic sequence between them and at the start and end of the sequence).
Note that the order of the lines in the gtf file does not matter.
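Because a missing Start_Codon or Stop_Codon line changes how the annotation is interpreted, it can help to list the sequences whose annotation will be read as partial before submitting. The sketch below is an illustration under assumptions, not a Projector tool: it assumes tab-separated gtf lines with the sequence name in column 1 and the feature type in column 3, and it compares feature names case-insensitively. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_gtf(gtf_lines):
    """Count feature types (lower-cased) per sequence name.

    Assumes tab-separated gtf lines: column 1 = sequence name,
    column 3 = feature type. Blank lines, '#' comments and lines with
    fewer than three columns are skipped.
    """
    summary = defaultdict(Counter)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        summary[fields[0]][fields[2].lower()] += 1
    return summary


def sequences_read_as_partial(summary):
    """Names of sequences that have CDS lines but lack start or stop codon lines."""
    flagged = []
    for seqname, counts in summary.items():
        if counts["cds"] and not (counts["start_codon"] and counts["stop_codon"]):
            flagged.append(seqname)
    return sorted(flagged)


if __name__ == "__main__":
    lines = [
        "mouse_1\tknown\tCDS\t100\t180\t.\t+\t0\tgene_1",
        "mouse_1\tknown\tCDS\t300\t390\t.\t+\t0\tgene_1",
        "mouse_2\tknown\tStart_Codon\t50\t52\t.\t+\t0\tgene_2",
        "mouse_2\tknown\tCDS\t50\t140\t.\t+\t0\tgene_2",
        "mouse_2\tknown\tStop_Codon\t141\t143\t.\t+\t0\tgene_2",
    ]
    print(sequences_read_as_partial(summarize_gtf(lines)))
```

A flagged sequence is not necessarily wrong: if the gene really is partial, the annotation is correct as it stands. The list only highlights where a forgotten line would silently change the constraints.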
-
These known gene structures are used within Projector as constraints to predict the corresponding gene structures in
the un-annotated DNA sequence, i.e. in our example the human genes. If you were in the opposite position, knowing all
the human genes and wanting to predict the corresponding mouse genes, each sequence pair of your input fasta file
would have to start with the human DNA sequence and the gtf input file would have to contain all the known human gene
structures.
-
Paste the contents of your gtf file in the second window of the submission web
site.
-
example of a gtf file with
known mouse genes which corresponds to the above fasta file
-
Notes
-
A word of caution: Projector can model many, but not all, types of gene structures (see the states and transitions of
the underlying pairhmm (ps-format)). It cannot, for example, model intermediate and terminal (protein-coding) exons
shorter than one codon, or initial exons which consist only of the start codon. Feeding Projector these gene
structures as known genes (i.e. as constraints) will make it fall over, as it is technically impossible to retrieve a
valid state path. Please remove the corresponding sequence pairs from your analysis.
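The two cases named above can be screened for ahead of submission. The following sketch is an assumption-laden illustration rather than Projector's own check: it takes the coding exons of one gene as (start, end) pairs in transcript order, treats coordinates as 1-based and inclusive as in gtf, and assumes the coding part of an initial exon includes the start codon. The function name `unsupported_exons` and its `first_is_initial` parameter are hypothetical, and the pairhmm may rule out further structures that this check does not cover.

```python
def unsupported_exons(cds_intervals, first_is_initial=True):
    """Flag coding exons matching the two unsupported cases described above.

    cds_intervals: list of (start, end) pairs, 1-based inclusive, given in
    transcript order (assumption: the caller has already sorted them and
    accounted for the strand).
    first_is_initial: set to False for a partial gene whose first listed
    exon is not the initial exon.
    Returns a list of (index, reason) tuples.
    """
    problems = []
    for index, (start, end) in enumerate(cds_intervals):
        length = end - start + 1
        if index == 0 and first_is_initial:
            if length == 3:
                problems.append((index, "initial exon consists only of the start codon"))
        elif length < 3:
            problems.append((index, "intermediate or terminal exon shorter than one codon"))
    return problems


if __name__ == "__main__":
    gene = [(1, 3), (100, 101), (200, 260)]
    for index, reason in unsupported_exons(gene):
        print("exon", index, gene[index], "->", reason)
```

Any sequence pair with a flagged gene is a candidate for removal from the input files, as recommended above.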
Retrieving data
The two output files can be retrieved from the retrieval web site:
-
gtf-file:
-
contains the predicted genes in gtf-format (example of a Projector
output gtf-file)
-
gtf_w_c_e-file:
-
contains information on the genes as well as the subsequences which are conserved between the two sequences, in a
variant of the gtf-format (example of a
Projector gtf_w_c_e output file)