Projector

Projector is a program for the comparative, homology-based prediction of protein-coding genes in mouse and human DNA.

Projector takes the known genes of one DNA sequence and predicts the corresponding genes in an evolutionarily related DNA sequence.


Overview

How does Projector work?

Figure 1.

Projector takes two input DNA sequences (one from mouse, one from human) that are known or suspected to be similar to each other, together with a set of known genes for one of the two sequences. It simultaneously predicts the corresponding genes in the un-annotated sequence and an alignment between the two sequences. Unlike other homology-based gene prediction methods (for example Genewise by E. Birney et al.), which map the amino-acid sequence of a protein to a DNA sequence, Projector maps known gene structures to another, similar DNA sequence in order to predict related gene structures. Projector thus uses sequence homology directly at the DNA level and explicitly takes the conservation of gene structures between related genes into account. Projector can model partial, complete and multiple genes, and can also predict pairs of genes that are related by exon-fusion or exon-splitting events (i.e. that have different numbers of exons). The mathematical method underlying Projector is a pair hidden Markov model, the same one used in Doublescan, a comparative gene prediction program that predicts genes ab initio. Like Doublescan, Projector requires the major similarities in the two input sequences to be collinear.

Data sets

Here is the Projector test set:

as well as the set of genes predicted by:

that were used in the publication.

Acknowledgements

Many thanks to Roger Pettett for helping me set up this webserver.

Web server

How does this Projector Web server work?

Submitting jobs

To submit jobs, go to the submission web site.

DNA sequences
The first of the two input files to Projector contains the pairs of similar DNA sequences (each pair consisting of one mouse and one human sequence) that Projector is to analyse. The required format is a variant of the fasta format with a special header line. Make sure that the DNA sequences consist only of the letters A, C, G and T.
Paste the contents of your fasta file into the first window of the submission web site. Before submitting the sequences to Projector, it is a good idea to verify for each sequence pair that the larger regions of similarity between the two sequences are collinear. This can be done using programs such as Dotter or Blastn.
example of a valid fasta file with three pairs of mouse and human DNA sequences
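The requirement that sequences contain only A, C, G and T can be checked before pasting the file into the submission form. The following is a minimal sketch, not part of Projector itself. The function name `find_invalid_bases` is hypothetical. It assumes that header lines start with `>` as in standard fasta, and it does not validate the special header format that Projector requires. Because the text above does not say whether lower-case letters are accepted, the check ignores case.

```python
def find_invalid_bases(fasta_text):
    """Return (line_number, character) for every non-ACGT character
    found on a sequence line of the given fasta text."""
    allowed = set("ACGT")
    problems = []
    for line_number, line in enumerate(fasta_text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: header lines start with '>' as in standard fasta.
        if not stripped or stripped.startswith(">"):
            continue
        for character in stripped:
            # Case is ignored because the accepted case is not documented.
            if character.upper() not in allowed:
                problems.append((line_number, character))
    return problems
```

An empty result means that no sequence line contains ambiguity codes such as N, gap characters or other symbols. Any reported line should be corrected, or the affected pair removed, before submission.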
Gene structures
The second of the two input files to Projector contains the gene structures which are known for the FIRST sequence in each DNA sequence pair in the above file. The gene structures should be in gtf-format.
In the above example, in which the first sequence of each DNA pair is always the mouse sequence, the corresponding gtf file has to contain the known mouse genes. Even though Projector can deal with partial and multiple genes, you still have to ensure that the input gtf ANNOTATION is COMPLETE, as Projector will deduce the remaining gene structure from the gtf information (e.g. if the input gtf file contains only three CDS lines and no Start_Codon or Stop_Codon lines, Projector will assume that the sequence contains three exons, with intronic sequence between them and at the start and end of the sequence). Note that the order of the lines in the gtf file does not matter.
These known gene structures are used within Projector as constraints to predict the corresponding gene structures in the un-annotated DNA sequence, i.e. in our example the human genes. If, conversely, you knew all the human genes and wanted to predict the corresponding mouse genes, each sequence pair in your input fasta file would have to start with the human DNA sequence, and the gtf input file would have to contain all the known human gene structures.
Paste the contents of your gtf file into the second window of the submission web site.
example of a gtf file with known mouse genes which corresponds to the above fasta file
Notes
A word of caution: Projector can model many, but not all, types of gene structures (see the states and transitions of the underlying pair HMM (ps-format)). For example, it cannot model intermediate and terminal (protein-coding) exons shorter than one codon, or initial exons that consist only of the start codon. Feeding Projector such gene structures as known genes (i.e. as constraints) will make it fail, because it is technically impossible to retrieve a valid state path. Please remove the corresponding sequence pairs from your analysis.
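The two unsupported cases named above can be screened for in the input gtf file before submission. The sketch below is illustrative only and is not part of Projector. The function name `find_unsupported_exons` is hypothetical. It assumes tab-separated gtf lines with the feature name in column 3 and 1-based, inclusive start and end coordinates in columns 4 and 5, and it assumes that the start codon is included in the first CDS feature, as in the standard gtf convention. Feature names are compared case-insensitively.

```python
def find_unsupported_exons(gtf_text):
    """Return (line_number, reason) for CDS lines that are shorter than
    one codon or that coincide exactly with a start codon feature."""
    cds_features = []     # (line_number, seqname, start, end)
    start_codons = set()  # (seqname, start, end)
    for line_number, line in enumerate(gtf_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 5:
            continue  # not a feature line
        seqname, feature = fields[0], fields[2].lower()
        start, end = int(fields[3]), int(fields[4])
        if feature == "cds":
            cds_features.append((line_number, seqname, start, end))
        elif feature == "start_codon":
            start_codons.add((seqname, start, end))

    problems = []
    for line_number, seqname, start, end in cds_features:
        length = end - start + 1  # coordinates are 1-based and inclusive
        if length < 3:
            problems.append((line_number, "CDS shorter than one codon"))
        elif (seqname, start, end) in start_codons:
            # Assumption: the start codon is part of the first CDS feature.
            problems.append((line_number, "CDS consists only of the start codon"))
    return problems
```

The sequence pairs belonging to any reported line are candidates for removal from the analysis. The sketch does not distinguish initial, intermediate and terminal exons, so it may also flag short initial exons, which the text above does not list as unsupported.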

Retrieving data

The two output files can be retrieved from the retrieval web site.

gtf-file:
contains the predicted genes in gtf-format (example of a projector output gtf-file)
gtf_w_c_e-file:
contains information on the genes as well as the subsequences which are conserved between the two sequences in a variant of the gtf-format (example of a projector gtf_w_c_e output file)

Publications

  • Gene structure conservation aids similarity based gene prediction.

    Meyer IM and Durbin R

    Nucleic acids research 2004;32;2;776-83

Contacts

Questions, suggestions, requests, problems?

I am happy to receive your comments and suggestions about Projector. Projector is available on request. Please email irmtraud.meyer@cantab.net.
