SSAHA: Sequence Search and Alignment by Hashing Algorithm

SSAHA is a software tool for very fast matching and alignment of DNA sequences.

It achieves its fast search speed by converting sequence information into a 'hash table' data structure, which can then be searched very rapidly for matches.
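
The idea can be illustrated with a minimal Python sketch. This is not the SSAHA source code; the function names and the choice of a Python dictionary as the hash table are assumptions made for illustration. The subject sequence is cut into fixed-length words sampled every `step` bases, each word is mapped to the positions where it starts, and every word of the query is then looked up in that table.

```python
from collections import defaultdict


def build_hash_table(subject, word_len=10, step=10):
    """Map each word of length word_len, sampled every `step` bases,
    to the list of positions where it starts in the subject.

    word_len and step are illustrative stand-ins for the word length
    and step length described in the text.
    """
    table = defaultdict(list)
    for pos in range(0, len(subject) - word_len + 1, step):
        table[subject[pos:pos + word_len]].append(pos)
    return table


def find_word_hits(query, table, word_len=10):
    """Return (query_pos, subject_pos) for every query word found in the table.

    The query is scanned at every position, so a word sampled from the
    subject is found wherever it occurs in the query.
    """
    hits = []
    for qpos in range(len(query) - word_len + 1):
        # .get() avoids creating empty entries in the defaultdict.
        for spos in table.get(query[qpos:qpos + word_len], ()):
            hits.append((qpos, spos))
    return hits
```

For example, with the subject ACGTACGTTTGACCA, word_len=4 and step=4, the table holds ACGT at positions 0 and 4 and TTGA at position 8, and the query GTTTGAC yields the single hit (2, 8). A query shorter than word_len produces no hits at all, because it contains no complete word to look up.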

For improved alignment and mapping of paired-end sequencing reads, please use SSAHA2.

Overview

Getting The Best From The Software

  • The fast sequence search speed is achieved partly at the expense of greater memory usage. You will probably need a machine with at least 1 GB of RAM to make much use of this software. If you keep running out of memory, you have four options:
    • Try using the UNIX command unlimit (a csh/tcsh built-in) to lift the shell's resource limits, so that the program can use more of your machine's memory.
    • Try using a smaller value for the hash word length (command line option -wl, set to 10 by default).
    • Try using the -pf option. This stores the hash table in a compressed format (based on a trick by Jim Kent at UCSC), although search speed is slightly impaired.
    • Try using a machine with more RAM ...
  • The SSAHA algorithm is best suited to applications that need exact or 'almost exact' matches between two sequences, such as SNP detection or sequence assembly. Decreasing the step length (command line option -sl) increases the sensitivity of the algorithm, but note that it also increases RAM usage. Whatever the settings, the algorithm cannot detect a stretch of consecutive matching bases that is shorter than the hash word length (10 bases by default).
  • If you are likely to need to search the same set of sequence data on more than one occasion, use the -sn option on the first run to save the hash table to a file. Subsequent runs can then load in this hash table using the -sf hash option instead of computing it from scratch.
  • Loads and loads of short matches? Try the following:
    • Set the -ms parameter to a lower value (default is 100000). This causes the software to ignore more of the commonest words in the database. Conversely, sensitivity is increased by setting this parameter to a higher value.
    • Set the -nr parameter. This causes each query sequence to be scanned for tandem repeats using a simple algorithm.
    • Set the -mg and -mi parameters. When set, these cause the software to try to join up adjacent shorter matches into larger matches.
    • Set the -mp parameter. When set, the software prints only matches whose total number of matching bases exceeds a threshold.
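
The three filtering ideas in the last group of options (ignoring very common words, joining adjacent matches, and printing only long matches) can be sketched in Python as follows. This is a rough illustration only: the function names, the parameters max_occurrences, max_gap and threshold, and the exact comparison rules are assumptions, not the actual semantics of -ms, -mg, -mi or -mp. Word hits are assumed to be (query_pos, subject_pos) pairs, and the span of a joined match is used as a stand-in for its number of matching bases.

```python
from collections import defaultdict


def drop_common_words(table, max_occurrences):
    """Ignore words that occur more than max_occurrences times in the
    subject (hypothetical analogue of lowering the -ms value)."""
    return {word: positions for word, positions in table.items()
            if len(positions) <= max_occurrences}


def merge_hits(hits, word_len, max_gap):
    """Join word hits that lie on the same diagonal (subject_pos - query_pos)
    and are at most max_gap bases apart into longer matches.

    Returns a list of (query_start, subject_start, span) tuples.
    """
    by_diagonal = defaultdict(list)
    for qpos, spos in hits:
        by_diagonal[spos - qpos].append(qpos)

    matches = []
    for diagonal, query_positions in sorted(by_diagonal.items()):
        query_positions.sort()
        start = end = None
        for qpos in query_positions:
            if start is None:
                start, end = qpos, qpos + word_len
            elif qpos - end <= max_gap:
                # Close enough to the current match: extend it.
                end = max(end, qpos + word_len)
            else:
                # Too far away: close the current match and start a new one.
                matches.append((start, start + diagonal, end - start))
                start, end = qpos, qpos + word_len
        matches.append((start, start + diagonal, end - start))
    return matches


def keep_long_matches(matches, threshold):
    """Keep only matches whose span exceeds the threshold."""
    return [match for match in matches if match[2] > threshold]
```

With word_len=4 and max_gap=0, the hits (0, 10), (4, 14) and (8, 18) share a diagonal and are joined into one match spanning 12 bases, while isolated hits such as (30, 40) remain 4 bases long and are dropped by keep_long_matches with a threshold of 4.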

Some Applications

  • Fast sequence assembly (Zemin Ning)
  • SNP detection (Jim Mullikin)
  • Ordering and orientation of contigs (Tony Cox)

Contact

Publications

If you find this software or results useful please cite the following paper:

  • SSAHA: a fast search method for large DNA databases.

    Ning Z, Cox AJ and Mullikin JC

    Genome Research 2001;11(10):1725-9
