SSAHA2 (Sequence Search and Alignment by Hashing Algorithm) is a pairwise sequence alignment program designed for the efficient mapping of sequencing reads onto genomic reference sequences.
SSAHA2 reads of most sequencing platforms (ABI-Sanger, Roche 454, Illumina-Solexa) and a range of output formats (SAM, CIGAR, PSL etc.) are supported. A pile-up pipeline for analysis and genotype calling is available as a separate package.
[Genome Research Limited]
The SSAHA2 manual explains the software and what most parts of the program do. The FAQs tab may helpful if you are experiencing problems.
The SSAHA2 mapper is available free of charge as binaries compiled for a range of computer architectures. The pile-up pipeline is available as source code. If you do not find your architecture represented here we may be able to provide a binary for it upon request.
If you like SSAHA2 also check out the related software SMALT which is more accurate, efficient and simpler to use.
If you publish scientific work based on results obtained using SSAHA2 please cite the following paper:
Genome research 2001;11;10;1725-9
PUBMED: 11591649; PMC: 311141; DOI: 10.1101/gr.194201
The latest release of SSAHA2 and the Pile-up pipeline is available for Linux:
v2.5.5
Released November 11th 2011
Other
[1] Note for MacOSX: occasionally a browser decides to display the contents of the .dmg.gz archive file rather than downloading it. If this happens hold down the <control> key and click on the download link. A popup menu should appear, containing several choices. One of the choices should be something like "Save Link As" (or perhaps "Download Link...", "Save Link to Desktop", or a variation on this theme). Select that option, and the archive file should be download correctly.
For each query sequence submitted to SSAHA2 the output consists of a list of summary lines, one for each alignment of that query sequence to the subject sequence. Optionally, a full alignment follows each summary line. The numbering of base positions of the read begins with 1 at the leftmost (5') end of the read and coordinates of the mapped read segments are usually reported from left to right along the reference sequence (except in PSL format).
SSAHA2 summary lines are available in default, GFF, SUGAR, CIGAR, VULGAR, PSL or PSLX format. Tags are available in those formats to aid parsing. In addition, alignments can be output in SAM format. Full alignments can be written below the summary lines, nomatter what format the summary line is written in. The format of the alignment is always the same, nomatter what format the summary line is written in.
The examples below use query.fa,query_rc.fa and subject.fa. Assuming that you have first built the hashtable 'subject' like this:
./ssaha2Build -save subject subject.fa
>query AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGCT AGGCTGAGATACTACTATGAACTTCATTTTTAAATTATGATACATTTCCAATAAAGTAAAAA
>query_RC TTTTTACTTTATTGGAAATGTATCATAATTTAAAAATGAAGTTCATAGTAGTATCTCAGCCT AGCCTAAAATCAATGTCAAGAGTATTCCCAAAGATGCTTGCAGATATGAAAGCTTAGTT
>subject AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGGCT AGGTTGAGATACTACTATGAACTTCATTTTTAAAATGATACATTTCCAATAAAGTAAAAA
Whatever the chosen output format, when the SSAHA2 option '-tags' is set to 1 a tag will be prepended to the summary line for each alignment. The actual string used as the tag varies between formats. In the default format the tag string is 'ALIGNMENT' for historical reasons, otherwise it is simply the name of the format followed by a colon e.g 'cigar:'. To obtain only the summary lines without headers just grep for them:
./ssaha2 -tags 1 subject query.fa | grep '^ALIGNMENT' ALIGNMENT 95 query subject 1 121 1 120 F 121 95.87 121
or
./ssaha2 -tags 1 -output vulgar subject query.fa | grep '^vulgar:' vulgar: query 0 121 + subject 0 120 + 95 M 57 57 G 0 1 M 36 36 G 2 0 M 26 26
You can then just use awk or perl to get the fields you want. To separate hits by query you can either build a hash on the query sequence names or use the default output with a header for each query and reset the record separator to something like "\n\n===================================================\n" when you read the file.
Whatever the chosen output format, when the SSAHA2 option '-align' is set to 1 a full alignment will follow each summary line. The alignment consists of three logical lines, which may be split over many physical lines to fit on the terminal. The first line shows the query bases, the second line identifies the alignment and the third line shows the subject bases. The local ordinates of the sequences are given at the beginning and end of each physical line. The middle line has a space for matching bases, a '-' for a gap in the alignment, a 'v' for a transversion (C/T<->A/G) and an 'i' for a transition (T<->C or A<->G). For example
./ssaha2 -align 1 subject query.fa
===================================================
Matches For Query 0 (121 bases): query
===================================================
Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction #Bases identity
95 query subject 1 121 1 120 F 121 95.87 121
Query 1 AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG-CT 59
v -
Sbjct 1 AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGGCT 60
Query 60 AGGCTGAGATACTACTATGAACTTCATTTTTAAATTATGATACATTTCCAATAAAGTAAA 119
i --
Sbjct 61 AGGTTGAGATACTACTATGAACTTCATTTTTAAA--ATGATACATTTCCAATAAAGTAAA 118
Query 120 AA 121
Sbjct 119 AA 120
The subject sequence is always assumed to be in the forward orientation. This implies that when you get a reverse complement hit, it is in fact the reverse strand of the query that has aligned to the subject. So for reverse complement hits the numbering of the query ordinates goes down as the subject ordinates increase, eg:
./ssaha2 -align 1 subject query_RC.fa
===================================================
Matches For Query 0 (121 bases): query_RC
===================================================
Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction #Bases identity
95 query_RC subject 121 1 120 1 C 121 95.87 121
Query 121 AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG-CT 63
v -
Sbjct 1 AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGGGCT 60
Query 62 AGGCTGAGATACTACTATGAACTTCATTTTTAAATTATGATACATTTCCAATAAAGTAAA 3
i --
Sbjct 61 AGGTTGAGATACTACTATGAACTTCATTTTTAAA--ATGATACATTTCCAATAAAGTAAA 118
Query 2 AA 1
Sbjct 119 AA 120
For each query sequence the default SSAHA2 ouput consists of a header, followed by a summary line for each alignment of that query sequence as described below. The header contains the index of the query sequence with respect to the input file (starting at zero), the query length and the query name. Using an example
./ssaha2 subject query.fa =================================================== Matches For Query 0 (120 bases): query =================================================== Score Q_Name S_Name Q_Start Q_End S_Start S_End Direction #Bases identity 113 query subject 1 120 1 120 F 120 100.00 120
Each summary line has 11 fields:
Finally we write the query sequence length on the end of the line in case you don't want to parse the header.
For forward alignments Q_Start and S_Start are the numerically lower end points of the alignment and Q_End and S_End are the numerically higher end points of the alignment. Whereas for reverse complement alignments Q_Start is the numerically higher end point and Q_End is the numerically lower end point. This is to remain consistent with the physical alignment (as it is drawn when the option -align 1 is used) and reflects the fact that the query has been reverse complemented in order to align it to the subject; which is always assumed to be in the forward orientation.
The General Feature Format gives a way to describe any sequence feature in the following tab delimited form
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
In SSAHA2 output the mandatory fields (the first eight fields) refer to the query sequence:
To fully describe an alignment we also need to specify the name of the subject sequence and the start and end positions of the alignment in the subject. Since in GFF2 we are allowed to specify multiple values per tag we can give this information in a single attribute. Since all free text values must be quoted with double quotes the attribute is as follows:
Subject <"subject name"> <subject_start> <subject_end>;
To fully describe a gapped alignment we also need to specify the gap structure. To do this we follow the proposed tag-value structure for gapped alignments in the format specification:
Align <local_query_start> <local_subject_start> [length];
A tag-value set is used to define each ungapped block in the alignment, multiple Align tags then specify the complete gapped alignment. The length value is optional in the proposed format but will always be present in SSAHA2 output.
Each field is separated by a tab, but no tabs are written within a field such as the Subject or Align attribute. No comments are written by the current version of SSAHA2. So finally the GFF output for the example is:
./ssaha2 -output gff subject query.fa
##gff-version 2
##source-version 1.0.2
##date 2006-08-18
query SSAHA2 similarity 1 120 95 + . Subject "subject" 1 120; Align 1 1 57; Align 58 59 36; Align 96 95 26;
For forward alignments the mandatory field 'start' and the Subject value 'subject_start' are the numerically lower end points of the alignment and the mandatory field 'end' and the Subject value 'subject_end' are the numerically higher end points of the alignment. Whereas for reverse complement alignments 'start' is the numerically higher end point and 'end' is the numerically lower end point as these refer to the query sequence. This is to remain consistent with the physical alignment (as it is drawn when the option -align 1 is used) and reflects the fact that the query has been reverse complemented in order to align it to the subject; which is always assumed to be in the forward orientation. This also affects the Align values, 'local_query_start for a reverse complement hit will always be the numerically higher end point of the alignment block in the query. For example:
./ssaha2 -output gff subject query_RC.fa ##gff-version 2 ##source-version 1.0.2 ##date 06/08/21 query_RC SSAHA2 similarity 121 1 95 - . Subject "subject" 1 120; Align 121 1 57; Align 64 59 36; Align 26 95 26;
Note that SSAHA2 also writes a header at the beginning of the output when in GFF format:
##gff-version 2 ##source-version <1.0.2> ##date <date the job is run>
The header is written once at the beginning of the output. Output files should be named with a .gff extension.
The Simple UnGapped Alignment Report consists of 9 fields:
Again 'query_start', 'subject_start' are the numerically lower end points and 'query_end', 'subject_end' are the numerically higher end points; except for reverse complement alignments where 'query_start' is the numerically higher query end point and 'query_end' is the numerically lower query end point. Example:
./ssaha2 -output sugar subject query.fa query 1 121 + subject 1 120 + 95
The Compact Idiosyncratic Gapped Alignment Report format starts with the same nine fields as SUGAR (see above); followed by a series of <operation, length> pairs where operation is one of match, insertion or deletion, and the length describes the number of times this operation is repeated. A match describes the situation where the bases are aligned opposite to each other, whether or not they actually match. The insertion and deletion operations are with repsect to the subject, so within the alignment an insertion means a gap in the subject sequence and a deletion means a gap in the query sequence. 'M' shows a match, 'I' shows an insertion and 'D' shows a deletions. Example:
./ssaha2 -output cigar subject query.fa query 1 121 + subject 1 120 + 95 M 57 D 1 M 36 I 2 M 26
The Verbosely Ugly Labelled Gapped Alignment Report format also starts with the same nine fields as SUGAR (see above); but is followed by a series of <label, query_length, subject_length> triplets, where label can be 'M' for match or 'G' for gap. The distinction between an insertion and a deletion is now made by the relative lengths of 'query_length' and 'subject_length'. Example:
./ssaha2 -output vulgar subject query.fa query 1 121 + subject 1 120 + 95 M 57 57 G 0 1 M 36 36 G 2 0 M 26 26
PSL is the default format for the BLAT program. It consists of a single header at the start of
the output followed by a summary line for each of the alignments.
There are 21 mandatory fields to each summary line
<matches> <misMatches> <repMatches> <nCount> <qNumInsert> <qBaseInsert> <tNumInsert> <tBaseInsert> <strand> <qName> <qSize> <qStart> <qEnd> <tName> <tSize> <tStart> <tEnd> <blockCount> <blockSizes> <qStarts> <tStarts>
The only SSAHA2 specific change is that 'matches' just referes to the number of matching bases and 'repMatches' is always zero. PSLX has the same first 21 fields as PSL augmented by the actual sequence blocks of the alignment. These blocks correspond to the comma-separated list of starting positions of each block in the query and subject. Our example alignment in PSL is
./ssaha2 -output psl subject query.fa
psLayout version 3
match mis- rep. N's Q gap Q gap T gap T gap strand Q Q Q Q T T T T block blockSizes qStarts tStarts
match match count bases count bases name size start end name size start end count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
117 2 0 0 1 2 1 1 + query 121 0 121 subject 120 0 120 3 57,36,26, 0,57,95, 0,58,94,
The example in PSLX is
./ssaha2 -output pslx subject query.fa
psLayout version 3
match mis- rep. N's Q gap Q gap T gap T gap strand Q Q Q Q T T T T block blockSizes qStarts tStarts
match match count bases count bases name size start end name size start end count
---------------------------------------------------------------------------------------------------------------------------------------------------------------
117 2 0 0 1 2 1 1 + query 121 0 121 subject 120 0 120 3 57,36,26, 0,57,95, 0,58,94,
AACTAAGCTTTCATATCTGCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG,CTAGGCTGAGATACTACTATGAACTTCATTTTTAAA,ATGATACATTTCCAATAAAGTAAAAA,
AACTAAGCTTTCATATCTCCAAGCATCTTTGGGAATACTCTTGACATTGATTTTAGG,CTAGGTTGAGATACTACTATGAACTTCATTTTTAAA,ATGATACATTTCCAATAAAGTAAAAA,
To remain consistent with BLAT 'qStart', 'tStart' are the numerically lower end points and 'qEnd', 'tEnd' are the numerically higher end points for BOTH forward and reverse complement hits. This is the exception that proves the rule for all the other formats. Likewise, the counting of bases starts at zero, again this is the exception that proves the rule. It is more sensible to be consistent with BLAT than with the SSAHA2 graphical alignment in this case because if you're using PSL and you want to see the bases it makes more sense to use PSLX than to use the '-align 1' option.
Currently there is a final line appended to all output, nomatter what the format
Finished job
This is used by people variously to terminate parsing, check that the job ran to completion, etc. Depending on whether people let me know whether they love it ot hate it, I'm happy to remove it or leave it there.
... Did I mention? The subject is always assumed to be in the forward orientation.
Wellcome to the SSAHA2 Frequently Asked Questions.
changed -kmer and -skip default parameters for -rtype abi to -kmer 13 -skip 13. The options -solexa, -454 and -rtype <read_type> can now be specified with ssaha2Build as well as ssaha2. i.e. -kmer and -skip no longer have to be explicitly specified with ssaha2Build.
This means one can build a hash table with
ssaha2Build -454 -save <htab_name> <fasta_file>
then run the mapping with something like
ssaha2 -454 -output cigar -save <htab_name> <fastq_file>
SSAHA2 produces output in SAM format with the -output sam option. It is best to have SSAHA2 write the SAM output to an extra file with the -outfile <samfile> option.
So if your genomic sequences are in a FASTA file called genome.fa and you mapped paired-end reads in the FASTQ files mate_1.fq and mate_2.fq using, say, the following commands:
ssaha2Build -rtype 454 -save genome_hash genome.fa to generate a hash table
ssaha2 -rtype 454 -pair 20,3000 -output sam -outfile mapped.sam -save genome_hash mate_1.fq
mate_2.fq
You can use the SAM output file as input for samtools pileup -S mapped.sam. For example, to
convert the file mapped.sam to BAM format use
samtools faidx genome.fa
which produces a file genome.fa.fai
samtools view -Sb -t genome.fa.fai -o mapped.bam mapped.sam
samflag.c and compile e.g.
cc -o samflag samflag.c. Then issue ./samflag <FLAG value (decimal)>.
Please read the FAQ page before contacting us about a problem. Remember to provide sufficient information for us to reproduce the error including the SSAHA2 version number (-v option) and command line options used.
SSAHA2 is currently being maintained by Hannes Ponstingl, the pile-up pipeline by Yong Gu and Zemin Ning.