SSAHA2: Sequence Search and Alignment by Hashing Algorithm

SSAHA2 (Sequence Search and Alignment by Hashing Algorithm) is a pairwise sequence alignment program designed for the efficient mapping of sequencing reads onto genomic reference sequences.

SSAHA2 reads of most sequencing platforms (ABI-Sanger, Roche 454, Illumina-Solexa) and a range of output formats (SAM, CIGAR, PSL etc.) are supported. A pile-up pipeline for analysis and genotype calling is available as a separate package.

[Genome Research Limited]

Information

The SSAHA2 manual explains the software and what most parts of the program do. The FAQs tab may helpful if you are experiencing problems.

Availability

The SSAHA2 mapper is available free of charge as binaries compiled for a range of computer architectures. The pile-up pipeline is available as source code. If you do not find your architecture represented here we may be able to provide a binary for it upon request.

If you like SSAHA2 also check out the related software SMALT which is more accurate, efficient and simpler to use.

Reference:

If you publish scientific work based on results obtained using SSAHA2 please cite the following paper:

  • SSAHA: a fast search method for large DNA databases.

    Ning Z, Cox AJ and Mullikin JC

    Genome research 2001;11;10;1725-9

Downloads

The latest release of SSAHA2 and the Pile-up pipeline is available for Linux:

FTP download

v2.5.5

Released November 11th 2011

Other

Notes

[1] Note for MacOSX: occasionally a browser decides to display the contents of the .dmg.gz archive file rather than downloading it. If this happens hold down the <control> key and click on the download link. A popup menu should appear, containing several choices. One of the choices should be something like "Save Link As" (or perhaps "Download Link...", "Save Link to Desktop", or a variation on this theme). Select that option, and the archive file should be download correctly.

Output formats

For each query sequence submitted to SSAHA2 the output consists of a list of summary lines, one for each alignment of that query sequence to the subject sequence. Optionally, a full alignment follows each summary line. The numbering of base positions of the read begins with 1 at the leftmost (5') end of the read and coordinates of the mapped read segments are usually reported from left to right along the reference sequence (except in PSL format).

SSAHA2 summary lines are available in default, GFF, SUGAR, CIGAR, VULGAR, PSL or PSLX format. Tags are available in those formats to aid parsing. In addition, alignments can be output in SAM format. Full alignments can be written below the summary lines, nomatter what format the summary line is written in. The format of the alignment is always the same, nomatter what format the summary line is written in.

The examples below use query.fa,query_rc.fa and subject.fa. Assuming that you have first built the hashtable 'subject' like this:

./ssaha2Build -save subject subject.fa

query.fa

>query
AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGCT AGGCTGAGATACTACTATGAACTTCATTTTTAAATTATGATACATTTCCAATAAAGTAAAAA 

query_rc.fa

>query_RC
TTTTTACTTTATTGGAAATGTATCATAATTTAAAAATGAAGTTCATAGTAGTATCTCAGCCT AGCCTAAAATCAATGTCAAGAGTATTCCCAAAGATGCTTGCAGATATGAAAGCTTAGTT 

subject.fa

>subject
AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGGCT AGGTTGAGATACTACTATGAACTTCATTTTTAAAATGATACATTTCCAATAAAGTAAAAA

Tags

Whatever the chosen output format, when the SSAHA2 option '-tags' is set to 1 a tag will be prepended to the summary line for each alignment. The actual string used as the tag varies between formats. In the default format the tag string is 'ALIGNMENT' for historical reasons, otherwise it is simply the name of the format followed by a colon e.g 'cigar:'. To obtain only the summary lines without headers just grep for them:

./ssaha2 -tags 1 subject query.fa | grep '^ALIGNMENT'
ALIGNMENT  95    query subject        1      121         1       120   F     121 95.87 121

or

./ssaha2 -tags 1 -output vulgar subject query.fa | grep '^vulgar:'
vulgar: query 0 121 + subject 0 120 + 95 M 57 57 G 0 1 M 36 36 G 2 0 M 26 26

You can then just use awk or perl to get the fields you want. To separate hits by query you can either build a hash on the query sequence names or use the default output with a header for each query and reset the record separator to something like "\n\n===================================================\n" when you read the file.

Alignments

Whatever the chosen output format, when the SSAHA2 option '-align' is set to 1 a full alignment will follow each summary line. The alignment consists of three logical lines, which may be split over many physical lines to fit on the terminal. The first line shows the query bases, the second line identifies the alignment and the third line shows the subject bases. The local ordinates of the sequences are given at the beginning and end of each physical line. The middle line has a space for matching bases, a '-' for a gap in the alignment, a 'v' for a transversion (C/T<->A/G) and an 'i' for a transition (T<->C or A<->G). For example

./ssaha2 -align 1 subject query.fa

===================================================
Matches For Query 0 (121 bases): query
===================================================
Score     Q_Name             S_Name            Q_Start    Q_End  S_Start    S_End Direction #Bases identity
95    query subject        1      121         1       120   F     121 95.87 121

Query          1 AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG-CT 59
                                 v                                      -
Sbjct          1 AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGGCT 60

Query         60 AGGCTGAGATACTACTATGAACTTCATTTTTAAATTATGATACATTTCCAATAAAGTAAA 119
                  i                              --
Sbjct         61 AGGTTGAGATACTACTATGAACTTCATTTTTAAA--ATGATACATTTCCAATAAAGTAAA 118

Query        120 AA 121
                  
Sbjct        119 AA 120

The subject sequence is always assumed to be in the forward orientation. This implies that when you get a reverse complement hit, it is in fact the reverse strand of the query that has aligned to the subject. So for reverse complement hits the numbering of the query ordinates goes down as the subject ordinates increase, eg:

./ssaha2  -align 1 subject query_RC.fa

===================================================
Matches For Query 0 (121 bases): query_RC
===================================================
Score     Q_Name             S_Name            Q_Start    Q_End  S_Start    S_End Direction #Bases identity
95    query_RC subject        121      1       120         1   C     121 95.87 121

Query        121 AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG-CT 63
                                 v                                      -
Sbjct          1 AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGGCT 60 

Query         62 AGGCTGAGATACTACTATGAACTTCATTTTTAAATTATGATACATTTCCAATAAAGTAAA 3
                  i                              --
Sbjct         61 AGGTTGAGATACTACTATGAACTTCATTTTTAAA--ATGATACATTTCCAATAAAGTAAA 118

Query          2 AA 1
                  
Sbjct        119 AA 120

The default format

For each query sequence the default SSAHA2 ouput consists of a header, followed by a summary line for each alignment of that query sequence as described below. The header contains the index of the query sequence with respect to the input file (starting at zero), the query length and the query name. Using an example

./ssaha2 subject query.fa

===================================================
Matches For Query 0 (120 bases): query
===================================================
Score     Q_Name             S_Name            Q_Start    Q_End  S_Start    S_End Direction #Bases identity
113   query subject        1      120         1       120   F     120 100.00 120

Each summary line has 11 fields:

Score
Smith-Waterman score obtained from the cross_match alignment.
Q_Name
Name of the query sequence from the input file.
S_Name
Name of the subject sequence from the input file.
Q_Start
Start of the alignment in the query.
Q_End
End of the alignment in the query.
S_Start
Start of the alignment in the subject.
S_End
End of the alignment in the subject.
Direction
Orientation of the alignment (F/C), subject is always assumed to be forward so this really refers to the query orientation.
#Bases
Alignment length (matches + mismatches + insertions + deletions).
identity
Number of matching bases divided by the alignment length times 100.

Finally we write the query sequence length on the end of the line in case you don't want to parse the header.

For forward alignments Q_Start and S_Start are the numerically lower end points of the alignment and Q_End and S_End are the numerically higher end points of the alignment. Whereas for reverse complement alignments Q_Start is the numerically higher end point and Q_End is the numerically lower end point. This is to remain consistent with the physical alignment (as it is drawn when the option -align 1 is used) and reflects the fact that the query has been reverse complemented in order to align it to the subject; which is always assumed to be in the forward orientation.

The GFF format

The General Feature Format gives a way to describe any sequence feature in the following tab delimited form

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

In SSAHA2 output the mandatory fields (the first eight fields) refer to the query sequence:

seqname
Name of the query sequence from the input file.
source
'SSAHA2'
feature
'similarity'
start
Start locus of the alignment in the query.
end
Start locus of the alignment in the query.
score
Smith-Waterman score obtained from the cross_match alignment.
strand
Orientation of the alignment (+/-), subject is always assumed to be forward so this refers to the query orientation.
frame
'.'

To fully describe an alignment we also need to specify the name of the subject sequence and the start and end positions of the alignment in the subject. Since in GFF2 we are allowed to specify multiple values per tag we can give this information in a single attribute. Since all free text values must be quoted with double quotes the attribute is as follows:

Subject <"subject name"> <subject_start> <subject_end>;

To fully describe a gapped alignment we also need to specify the gap structure. To do this we follow the proposed tag-value structure for gapped alignments in the format specification:

Align <local_query_start> <local_subject_start> [length];

A tag-value set is used to define each ungapped block in the alignment, multiple Align tags then specify the complete gapped alignment. The length value is optional in the proposed format but will always be present in SSAHA2 output.

Each field is separated by a tab, but no tabs are written within a field such as the Subject or Align attribute. No comments are written by the current version of SSAHA2. So finally the GFF output for the example is:

./ssaha2 -output gff subject query.fa 

##gff-version 2 
##source-version 1.0.2
##date 2006-08-18

query SSAHA2  similarity      1       120     95     +       .       Subject "subject" 1 120;       Align 1 1 57;       Align 58 59 36;       Align 96 95 26;
    

For forward alignments the mandatory field 'start' and the Subject value 'subject_start' are the numerically lower end points of the alignment and the mandatory field 'end' and the Subject value 'subject_end' are the numerically higher end points of the alignment. Whereas for reverse complement alignments 'start' is the numerically higher end point and 'end' is the numerically lower end point as these refer to the query sequence. This is to remain consistent with the physical alignment (as it is drawn when the option -align 1 is used) and reflects the fact that the query has been reverse complemented in order to align it to the subject; which is always assumed to be in the forward orientation. This also affects the Align values, 'local_query_start for a reverse complement hit will always be the numerically higher end point of the alignment block in the query. For example:

./ssaha2 -output gff subject query_RC.fa 
##gff-version 2
##source-version 1.0.2
##date 06/08/21
query_RC        SSAHA2  similarity      121     1       95      -       .       Subject "subject" 1 120;        Align 121 1 57; Align 64 59 36; Align 26 95 26;

Note that SSAHA2 also writes a header at the beginning of the output when in GFF format:

##gff-version 2 
##source-version <1.0.2>
##date <date the job is run>

The header is written once at the beginning of the output. Output files should be named with a .gff extension.

The SUGAR format

The Simple UnGapped Alignment Report consists of 9 fields:

query_name
Name of the query sequence from the input file.
query_start
Start of the alignment in the query.
query_end
End of the alignment in the query.
query_strand
Query orientation (+/-).
subject_name
Name of the subject sequence from the input file.
subject_start
Start of the alignment in the subject.
subject_end
End of the alignment in the subject.
subject_strand
Subject orientation (always + for SSAHA2).
score
Smith-Waterman score obtained from the cross_match alignment.

Again 'query_start', 'subject_start' are the numerically lower end points and 'query_end', 'subject_end' are the numerically higher end points; except for reverse complement alignments where 'query_start' is the numerically higher query end point and 'query_end' is the numerically lower query end point. Example:

./ssaha2 -output sugar subject query.fa 
query 1 121 + subject 1 120 + 95

The CIGAR format

The Compact Idiosyncratic Gapped Alignment Report format starts with the same nine fields as SUGAR (see above); followed by a series of <operation, length> pairs where operation is one of match, insertion or deletion, and the length describes the number of times this operation is repeated. A match describes the situation where the bases are aligned opposite to each other, whether or not they actually match. The insertion and deletion operations are with repsect to the subject, so within the alignment an insertion means a gap in the subject sequence and a deletion means a gap in the query sequence. 'M' shows a match, 'I' shows an insertion and 'D' shows a deletions. Example:

./ssaha2 -output cigar subject query.fa 
query 1 121 + subject 1 120 + 95 M 57 D 1 M 36 I 2 M 26

The VULGAR format

The Verbosely Ugly Labelled Gapped Alignment Report format also starts with the same nine fields as SUGAR (see above); but is followed by a series of <label, query_length, subject_length> triplets, where label can be 'M' for match or 'G' for gap. The distinction between an insertion and a deletion is now made by the relative lengths of 'query_length' and 'subject_length'. Example:

./ssaha2 -output vulgar subject query.fa 
query 1 121 + subject 1 120 + 95 M 57 57 G 0 1 M 36 36 G 2 0 M 26 26

The PSL and PSLX format

PSL is the default format for the BLAT program. It consists of a single header at the start of the output followed by a summary line for each of the alignments.
There are 21 mandatory fields to each summary line

<matches> <misMatches> <repMatches> <nCount> <qNumInsert> <qBaseInsert> <tNumInsert> <tBaseInsert> <strand> <qName> <qSize> <qStart> <qEnd> <tName> <tSize> <tStart> <tEnd>

<blockCount> <blockSizes> <qStarts> <tStarts> 
matches
Number of bases that match that aren't repeats.
misMatches
Number of bases that don't match.
repMatches
Number of bases that match but are part of repeats.
nCount
Number of 'N' bases.
qNumInsert
Number of insertions in query.
qBaseInsert
Number of bases inserted in query.
tNumInsert
Number of insertions in subject.
tBaseInsert
Number of bases inserted in subject.
strand
Orientation of query strand (+/-).
qName
Query sequence name.
qSize
Query sequence size.
qStart
Alignment start position in query.
qEnd
Alignment end position in query.
tName
Subject sequence name.
tSize
Subject sequence size.
tStart
Alignment start position in subject.
tEnd
Alignment end position in subject.
blockCount
Number of blocks in the alignment.
blockSizes
Comma-separated list of sizes for each alignment block.
qStarts
Comma-separated list of starting loci for each block in query.
tStarts
Comma-separated list of starting loci for each block in subject.

The only SSAHA2 specific change is that 'matches' just referes to the number of matching bases and 'repMatches' is always zero. PSLX has the same first 21 fields as PSL augmented by the actual sequence blocks of the alignment. These blocks correspond to the comma-separated list of starting positions of each block in the query and subject. Our example alignment in PSL is

./ssaha2 -output psl subject query.fa 
psLayout version 3

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q       Q       Q       Q       T       T       T       T       block   blockSizes qStarts  tStarts
      match   match           count   bases   count   bases           name    size    start   end     name    size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
117     2       0       0       1       2       1       1       +       query   121     0       121     subject 120     0       120     3       57,36,26,  0,57,95, 0,58,94,

The example in PSLX is

./ssaha2 -output pslx subject query.fa 
psLayout version 3

match   mis-    rep.    N's     Q gap   Q gap   T gap   T gap   strand  Q       Q       Q       Q       T       T       T       T       block   blockSizes qStarts  tStarts
      match   match           count   bases   count   bases           name    size    start   end     name    size    start   end     count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
117     2       0       0       1       2       1       1       +       query   121     0       121     subject 120     0       120     3       57,36,26,  0,57,95, 0,58,94,
AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG,CTAGGCTGAGATACTACTATGAACTTCATTTTTAAA,ATGATACATTTCCAATAAAGTAAAAA,
AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG,CTAGGTTGAGATACTACTATGAACTTCATTTTTAAA,ATGATACATTTCCAATAAAGTAAAAA,

To remain consistent with BLAT 'qStart', 'tStart' are the numerically lower end points and 'qEnd', 'tEnd' are the numerically higher end points for BOTH forward and reverse complement hits. This is the exception that proves the rule for all the other formats. Likewise, the counting of bases starts at zero, again this is the exception that proves the rule. It is more sensible to be consistent with BLAT than with the SSAHA2 graphical alignment in this case because if you're using PSL and you want to see the bases it makes more sense to use PSLX than to use the '-align 1' option.

Last line.

Currently there is a final line appended to all output, nomatter what the format

Finished job

This is used by people variously to terminate parsing, check that the job ran to completion, etc. Depending on whether people let me know whether they love it ot hate it, I'm happy to remove it or leave it there.

Notes

... Did I mention? The subject is always assumed to be in the forward orientation.

FAQ

Wellcome to the SSAHA2 Frequently Asked Questions.

What are the changes since version 2.4.1?

New features:

  • new output option '-output sam_soft' produced SAM output with 'soft clipping', i.e. the sequence of the entire read is output. -output sam only outputs the aligned segment of the read (hard clipping).
  • changed -kmer and -skip default parameters for -rtype abi to -kmer 13 -skip 13. The options -solexa, -454 and -rtype <read_type> can now be specified with ssaha2Build as well as ssaha2. i.e. -kmer and -skip no longer have to be explicitly specified with ssaha2Build.

    This means one can build a hash table with

    ssaha2Build -454 -save <htab_name> <fasta_file>
    

    then run the mapping with something like

    ssaha2 -454 -output cigar -save <htab_name> <fastq_file>
    
  • added option -multi <rand_seed> for the random selection of a read-pair in cases were there are multiple mappings with the same score.
  • The mates of paired-end reads can now differ in length. In previous versions they had to be of identical length.
  • Output in SAM format now includes unmapped reads. Hitherto their output had been suppressed.
  • SAM output: 'proper' pairs are now assigned the average score rather than the sum of scores of the mates.

Bug fixes:

  • Fixed a core dump with option -output sam when query reads were given in FASTA files without quality values.
  • Fixed SAM output to extra file (-output sam -outfile <file_name>) for single-end reads (i.e. -pair option not specified). Previous versions ignored the -outfile option if the -pair option was not specified. As a consequence mappings of single-end reads were written to standard output even for the SAM format.
  • Fixed PSLX and PSL output formats. Apparently, those never worked properly in any version.
  • Fixed -disk 1 and -disk 2 options which were broken for paired-end reads.
  • Fixed a bug in the alignment display (option -align 1) of paired-end reads. The mapping coordinates of the 2nd mate were wrong in the display.
Can I use samtools with SSAHA2 for SNP/indel calling?

SSAHA2 produces output in SAM format with the -output sam option. It is best to have SSAHA2 write the SAM output to an extra file with the -outfile <samfile> option.

So if your genomic sequences are in a FASTA file called genome.fa and you mapped paired-end reads in the FASTQ files mate_1.fq and mate_2.fq using, say, the following commands:

  1. ssaha2Build -rtype 454 -save genome_hash genome.fa to generate a hash table
  2. ssaha2 -rtype 454 -pair 20,3000 -output sam -outfile mapped.sam -save genome_hash mate_1.fq mate_2.fq

    You can use the SAM output file as input for samtools pileup -S mapped.sam. For example, to convert the file mapped.sam to BAM format use

  3. samtools faidx genome.fa

    which produces a file genome.fa.fai

  4. samtools view -Sb -t genome.fa.fai -o mapped.bam mapped.sam
Can SSAHA2 read/write BAM files?
No, for the time being all SSAHA2 input and output is ASCII text based. Input files have to contain sequences in FASTA or FASTQ format. If you require a BAM file as output, have SSAHA2 write a SAM file with the options -output sam -outfile <samfile> and convert to BAM format as described in the answer to the previous question.
I have produced SAM output using SSAHA2 but I am having trouble figuring out what the number in the FLAG field translates to?
You can use the C-program samflag.c to decompose the number into the binary flags. Save the program under the name samflag.c and compile e.g. cc -o samflag samflag.c. Then issue ./samflag <FLAG value (decimal)>.
How do I know whether a SSAHA2 job finished correctly?
SSAHA2 prints the line "SSAHA2 finished." as the last line to standard output if no errors occurred.
I get the error message "command line -kmer 13 -skip 2 contradicts hash table parameters -kmer 11 -skip 2" . What does that mean?
This message means that ssaha2 tries to read a hash table from disk which was built for a different k-mer word length than requested by the ssaha2 command line. You will either have to rebuild the hash table with ssaha2Build and the k-mer word length you require or change the value set with the -kmer <k> option in the ssaha2 command line to the value of the hash table (in this case -kmer 11 ).
How much memory is required to run SSAHA2?
The memory requirements are determined by the size of the head (the actual index) and the body (the word locations) of the hash table.
The size of the head depends on the length k of the hashed k-mer words and is approx. 4^(k+1) bytes. The word length is specified with the -kmer <k> option.
The size of the body depends on the total length N of the hashed reference sequences and the skip step s specified with the option -skip <s> . The body size is approx. 4*N/s bytes.
For example, a hash table built for the human genome (N ~3x10^9 bases) with the parameters -kmer 12 -skip 2 would have a header of 67 MB and a body of 6 GB. With parameters -kmer 13 -skip 3 the header would occupy 268 MB, the body 4 GB.
The option -disk 1 allows you to leave the body of the table on disk, but the program will then be slower because it has to access the disk frequently. The header is always kept in memory.
Can SSAHA2 map long paired-end sequences?
Yes, paired-end mapping is invoked with the option -pair <a>,<b>. By default SSAHA2 tunes for Sanger capillary reads. The -rtype 454 flag tunes for reads from the Roche 454 FLX platform and The -rtype solexa flag tunes for reads from the Illumina/Solexa platform.

Contact

Please read the FAQ page before contacting us about a problem. Remember to provide sufficient information for us to reproduce the error including the SSAHA2 version number (-v option) and command line options used.

SSAHA2 is currently being maintained by Hannes Ponstingl, the pile-up pipeline by Yong Gu and Zemin Ning.

* quick link - http://q.sanger.ac.uk/r77m6hlg