Doublescan

Doublescan is a program for comparative ab initio prediction of protein coding genes in mouse and human DNA.

[Genome Research Limited]

How does Doublescan work?

Doublescan takes two input DNA sequences (one from mouse, one from human) that are known to be, or appear to be, similar to each other. It simultaneously predicts the genes in both sequences and the alignment of the two sequences. Doublescan can model partial, complete and multiple genes (as well as no genes at all) and can also align pairs of genes which are related by exon-fusion or exon-splitting events. The mathematical method underlying Doublescan is a pair hidden Markov model.

Where do I find more information?

  • Comparative ab initio prediction of gene structures using pair HMMs.

    Meyer IM and Durbin R

    Bioinformatics (Oxford, England) 2002;18;10;1309-18

How does this Doublescan Web server work?

Input: two DNA sequences, submitted via the submission web site.

  • Select a pair of DNA sequences (one from mouse, one from human) to be analysed. Larger regions of similarity between the two sequences should be collinear, i.e. appear in the same order in both sequences (you can check this using programs like Dotter or Blastn).
  • Copy and paste the two selected DNA sequences into the two windows of the submission web site and submit them for analysis. The sequences have to come in a variant of the fasta format (described below) and the DNA sequences have to consist of the letters A, C, G and T only (example fasta files: human sequence, mouse sequence).

Output: the two output files can be retrieved from the retrieval web site.

Questions, suggestions, requests, problems?

Email me at irmtraud.meyer@cantab.net.

Fasta format required by Doublescan

Doublescan uses absolute coordinates in its output files to indicate the positions of the predicted genes and features. In order to do this, the input files have to come in a variant of the default fasta format which requires a header line of the format below.

Header line:

>name start_position-end_position orientation

where

  • name is the name of the sequence (example: Mm.X13235.5)
  • start_position is an integer which is the position of the first character in the sequence (example: 100); its value has to be smaller than that of the end_position
  • end_position is an integer which is the position of the last character in the sequence (example: 737 i.e. the sequence is 737-100+1 = 638 nucleotides long)
  • orientation can be either 'forward' or 'reverse', depending on the strand which is to be analysed for genes. Note that the value of the orientation in the header line does not indicate the orientation of the sequence itself (gggaatg...), as the fasta file should always give the sequence of the forward strand.
  • the fields in the header line have to be tab-delimited
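
The header rules above can be checked before submission. The following is a minimal sketch, not part of Doublescan itself: the names Header and parse_header are hypothetical, and it assumes the three fields are separated by exactly one tab each, as the last rule states.

```python
import re
from typing import NamedTuple


class Header(NamedTuple):
    """Hypothetical container for the fields of a Doublescan-style header."""
    name: str
    start: int
    end: int
    orientation: str

    @property
    def length(self) -> int:
        # Both coordinates are inclusive, hence the +1.
        return self.end - self.start + 1


def parse_header(line: str) -> Header:
    """Parse '>name<TAB>start-end<TAB>orientation' and apply the stated rules."""
    if not line.startswith(">"):
        raise ValueError("header line must start with '>'")
    fields = line[1:].rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected three tab-delimited fields")
    name, coords, orientation = fields
    match = re.fullmatch(r"(\d+)-(\d+)", coords)
    if match is None:
        raise ValueError("coordinates must have the form start-end")
    start, end = int(match.group(1)), int(match.group(2))
    if start >= end:
        raise ValueError("start_position must be smaller than end_position")
    if orientation not in ("forward", "reverse"):
        raise ValueError("orientation must be 'forward' or 'reverse'")
    return Header(name, start, end, orientation)
```

For a header with coordinates 100-737, the length property gives 737-100+1 = 638, matching the calculation in the end_position rule.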

The DNA sequence:

  • has to consist of A, C, G and T only
  • always has to be the sequence of the forward strand
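
The alphabet rule can be checked with a few lines of code. This is only a sketch under assumptions: the function name invalid_positions is hypothetical, and the comparison is case-insensitive because the example sequence on this page is written in lowercase; whether Doublescan itself treats upper and lower case identically is not stated here.

```python
from typing import List


def invalid_positions(sequence: str) -> List[int]:
    """Return the 0-based indices of all characters other than A, C, G, T.

    Assumption: case is ignored, since the example sequence is lowercase.
    """
    allowed = set("ACGT")
    return [i for i, base in enumerate(sequence.upper()) if base not in allowed]
```

An empty result means the sequence contains only the four permitted letters; ambiguity codes such as N, gaps and whitespace are all reported.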

Example:

>Mm.X13235.5 100-737 forward
gggaatgaagtttttctgcaggatttaaatgtggtctttaagagacaccgcatgcaaaga
atagctggggcttgctagccaatgaaaacattcagattccaatgacgcatccttttttct
ccacccccttccaagacccggattcggaaaccccgcctaacgctctagttttcaaccagg
tccgcagaaggcctatttaaagggacgattgctgtctccctgctgtcataaccatgtctg
gacgtggcaagggtggtaaaggccttgggaaaggcggcgctaagcgccaccgtaaggttc
tccgcgataacatccagggcatcaccaagcctgccatccgccgcctggcccggcgcgggg
gagtgaagcgcatctccggcctcatctacgaggagacccgcggtgtgctgaaggtgttcc
tggagaacgtgatccgcgacgccgtcacctacacggagcacgccaagcgcaagaccgtca
ccgccatggacgtggtctacgcgctcaagcgccagggccgcactctctacggattcggcg
gttaatcgactaacaaacgattttccactgtcaacaaaaggcccttttcagggccaccca
caaattcctagaaggagttgttcacttaccgaagctt
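
A record like the example above could be produced from a raw forward-strand sequence as follows. This is an illustrative sketch only: format_record is a hypothetical helper, and the 60-character line width simply mirrors the example; the text does not say that Doublescan requires a particular line width.

```python
def format_record(name: str, start: int, sequence: str,
                  orientation: str = "forward", width: int = 60) -> str:
    """Build a fasta record with a tab-delimited Doublescan-style header."""
    if orientation not in ("forward", "reverse"):
        raise ValueError("orientation must be 'forward' or 'reverse'")
    if len(sequence) < 2:
        # start_position has to be smaller than end_position.
        raise ValueError("sequence must contain at least two nucleotides")
    # Coordinates are inclusive, so the last position is start + length - 1.
    end = start + len(sequence) - 1
    header = f">{name}\t{start}-{end}\t{orientation}"
    # Wrap the sequence into lines of at most `width` characters.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"
```

Deriving end_position from the sequence length, rather than typing it by hand, keeps the header coordinates consistent with the number of nucleotides actually supplied.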