How does Projector work?

Figure 1.
Projector takes two input DNA sequences (one from mouse, one from human) that are known or appear to be similar to
each other, together with a set of known genes for one of the two sequences. It simultaneously predicts the
corresponding genes of the un-annotated sequence and an alignment between the two sequences. Unlike other
homology-based gene prediction methods (for example Genewise by E. Birney et al.), which map the amino acid sequence
of a protein to a DNA sequence, Projector maps known gene structures to another, similar DNA sequence in order to
predict related gene structures. Projector thus uses sequence homology directly at the DNA level and explicitly takes
the conservation of gene structures between related genes into account. Projector can model partial, complete
and multiple genes and can also predict pairs of genes that are related by exon-fusion or exon-splitting events
(i.e. that have a different number of exons). The mathematical method underlying Projector is a pair hidden Markov
model, the same one used in Doublescan, a comparative gene prediction program that predicts genes ab initio.
Like Doublescan, Projector requires the major similarities in the two input sequences to be collinear.
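To make the collinearity requirement concrete, here is a minimal Python sketch that checks whether a set of matching regions occurs in the same order in both sequences. It is not part of Projector: the function name `is_collinear` and the representation of a match as a pair (start in sequence 1, start in sequence 2) are assumptions made purely for illustration, and the coordinates are invented.

```python
def is_collinear(matches):
    """Return True if the matches occur in the same order in both sequences.

    `matches` is a list of (start_in_seq1, start_in_seq2) pairs.
    This representation is hypothetical, chosen only for illustration.
    """
    ordered = sorted(matches)  # order the matches along sequence 1
    # Collinear means the positions in sequence 2 also increase strictly.
    return all(a[1] < b[1] for a, b in zip(ordered, ordered[1:]))


# Invented coordinates: same order in both sequences.
print(is_collinear([(100, 250), (900, 1100), (2000, 2300)]))  # True
# Invented coordinates: order reversed in the second sequence.
print(is_collinear([(100, 2300), (900, 1100), (2000, 250)]))  # False
```

In practice the match coordinates would come from a tool such as Dotter or Blastn; how their output is parsed is left open here.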
Reference & data sets
Here are the Projector test set (gtf-file and fasta-file) as well as the sets of genes predicted by Projector
(gtf-file) and Genewise (gtf-file) that were used in the publication.
How does this Projector Web server work?
Input: to submit jobs, go to the submission web site.
- The first of the two input files to Projector contains the pairs of similar DNA sequences (each pair containing
one mouse and one human sequence) that are to be analysed by Projector. The DNA sequences must be in a variant of the
fasta format that uses a special header line. Make sure that the DNA sequences consist of the letters A, C, G and T
only. Paste the contents of your fasta file into the first window of the submission web site. Before submitting the
sequences to Projector, it is a good idea to verify for each sequence pair that the larger regions of similarity
between the two sequences are collinear. This can be done using programs like Dotter or Blastn.
- The second of the two input files to Projector contains the gene structures that are known for the FIRST
sequence in each DNA sequence pair in the above file. The gene structures must be in gtf format. In the above
example, in which the first sequence of each DNA pair is always the mouse sequence, the corresponding gtf file has to
contain the known mouse genes. Even though Projector can deal with partial and multiple genes, you still have to
ensure that the input gtf ANNOTATION is COMPLETE, as Projector will deduce the remaining gene structure from the gtf
information (e.g. if there are only three CDS lines and no Start_Codon and Stop_Codon lines in the input gtf file,
Projector will assume that the sequence contains three exons with intronic sequence between them and at the start and
end of the sequence). Note that the order of the lines in the gtf file does not matter. These known gene structures
are used within Projector as constraints to predict the corresponding gene structures in the un-annotated DNA
sequence, i.e. in our example the human genes. If you were in the opposite position, knowing all the human genes and
wanting to predict the corresponding mouse genes, each sequence pair of your input fasta file would have to start
with the human DNA sequence and the gtf input file would have to contain all the known human gene structures. Paste
the contents of your gtf file into the second window of the submission web site.
- Example of a gtf file with known mouse genes that corresponds to the above fasta file.
- A word of caution: Projector can model many, but not all, types of gene structures (see the states and transitions
of the underlying pair HMM (ps-format)). It cannot, for example, model intermediate and terminal (protein-coding)
exons shorter than one codon, or initial exons that consist only of the start codon. Feeding Projector such gene
structures as known genes (i.e. as constraints) will make it fail, as it is technically impossible to retrieve a
valid state path. Please remove the corresponding sequence pairs from your analysis.
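Two of the checks described in the list above can be run locally before submitting a job. The following Python sketch is only an illustration and not part of Projector. It makes several assumptions: header lines start with ">" as in standard fasta (the special Projector header is not parsed), only upper-case A, C, G and T count as valid, and the gtf file uses the usual nine tab-separated columns with 1-based inclusive coordinates. The function names and the example data are hypothetical. The gtf check flags every CDS line shorter than one codon without distinguishing initial from intermediate or terminal exons, and it does not detect initial exons that consist only of the start codon, so its output is a list of candidates to inspect by hand.

```python
VALID_BASES = set("ACGT")  # assumption: upper-case letters only


def invalid_bases(fasta_text):
    """Map each header to the sorted invalid characters found in its sequence."""
    problems = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # assumption: standard fasta header marker
            header = line[1:]
            continue
        bad = set(line) - VALID_BASES
        if bad:
            problems.setdefault(header, set()).update(bad)
    return {h: sorted(chars) for h, chars in problems.items()}


def short_cds_features(gtf_text, min_length=3):
    """Return (line_number, length) for CDS lines shorter than min_length."""
    flagged = []
    for number, line in enumerate(gtf_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        length = end - start + 1  # gtf coordinates are 1-based and inclusive
        if length < min_length:
            flagged.append((number, length))
    return flagged


# Invented example data, for illustration only.
fasta = ">pair1_mouse\nACGTTGCA\nACGNNA\n"
print(invalid_bases(fasta))  # {'pair1_mouse': ['N']}

attributes = 'gene_id "g1"; transcript_id "g1.t1";'
gtf = "\n".join([
    "\t".join(["seq1", "example", "CDS", "101", "250", ".", "+", "0", attributes]),
    "\t".join(["seq1", "example", "CDS", "400", "401", ".", "+", "0", attributes]),
])
print(short_cds_features(gtf))  # [(2, 2)]
```

Sequence pairs whose gtf lines are flagged, and are confirmed on inspection to be unsupported gene structures, would then be removed from both input files before submission.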
Output: the two output files can be retrieved on the retrieval web site.
Questions, suggestions, requests, problems?
I am happy to receive your comments and suggestions about Projector. Projector is available on request. Please email
irmtraud.meyer@cantab.net.
Acknowledgements
Many thanks to Roger Pettett for helping me set up this webserver.