SMALT efficiently aligns DNA sequencing reads with genomic reference sequences.
Reads from a range of sequencing platforms, for example Illumina-Solexa, Roche-454, PacBio or ABI-Sanger, can be processed, including paired-end reads.
The software employs a perfect hash index of short words (< 21 nucleotides long), sampled at equidistant steps along the genomic reference sequences.
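The sampling scheme can be illustrated with a minimal Python sketch. It assumes a plain dictionary in place of SMALT's perfect hash index, and the function name build_word_index and its parameters k and step are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_word_index(reference, k=13, step=6):
    """Map each k-mer sampled every `step` bases to its start positions.

    Illustrative only: an ordinary dict stands in for a perfect hash index.
    """
    index = defaultdict(list)
    # Visit every step-th start position that still leaves room for a full word.
    for pos in range(0, len(reference) - k + 1, step):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


# Toy reference with short words so the result is easy to inspect.
toy_index = build_word_index("ACGTACGTAC", k=4, step=2)
for word, positions in sorted(toy_index.items()):
    print(word, positions)
```

A larger step stores fewer words and so needs less memory, at the cost of fewer potential seed matches per read.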
For each read, potentially matching segments in the reference are identified from seed matches in the index and subsequently aligned with the read using a banded Smith-Waterman algorithm.
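The seed-and-extend idea described above might be sketched as follows. This is not SMALT's implementation: the dictionary index, the voting on diagonals, the linear gap penalty, the scoring values and all function names are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_word_index(reference, k, step):
    """Index k-mers sampled every `step` bases (a dict stands in for the hash index)."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, step):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_diagonals(read, index, k):
    """Count seed hits per diagonal (reference position minus read position)."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            votes[ref_pos - offset] += 1
    return votes


def banded_smith_waterman(read, reference, diagonal, band=3,
                          match=1, mismatch=-2, gap=-3):
    """Best local alignment score using only cells within `band` of `diagonal`.

    Scoring values are arbitrary; gaps are penalised linearly (an assumption).
    """
    best = 0
    prev = {}  # scores of the previous row, keyed by 1-based reference column
    for i in range(1, len(read) + 1):
        cur = {}
        lo = max(1, i + diagonal - band)
        hi = min(len(reference), i + diagonal + band)
        for j in range(lo, hi + 1):
            sub = match if read[i - 1] == reference[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + sub,  # match or mismatch
                        prev.get(j, 0) + gap,      # gap in the reference
                        cur.get(j - 1, 0) + gap)   # gap in the read
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


reference = "GATTACAGATCCGTAGGCTA"
read = "CAGATCCGTA"  # copied from reference positions 5 to 14
index = build_word_index(reference, k=4, step=2)
votes = seed_diagonals(read, index, k=4)
best_diagonal, hits = votes.most_common(1)[0]
score = banded_smith_waterman(read, reference, best_diagonal)
print(best_diagonal, hits, score)
```

Restricting the dynamic programming to a band around the seeded diagonal keeps the cost roughly proportional to read length times band width, rather than read length times reference length.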
The best gapped alignment of each read is reported, including a score for the reliability of the best mapping. The user can adjust the trade-off between sensitivity and speed by tuning the length and spacing of the hashed words.
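The trade-off can be made concrete with some arithmetic. Assuming words are sampled at every step-th reference position starting from position 0 (an assumption about the sampling scheme, not a documented detail of SMALT), any error-free stretch of k + step - 1 bases contains at least one complete sampled word, and the number of indexed words shrinks roughly in proportion to the step. The function names below are hypothetical.

```python
def min_exact_stretch(k, step):
    """Shortest error-free stretch guaranteed to contain one sampled word."""
    return k + step - 1


def sampled_word_count(reference_length, k, step):
    """Number of word start positions sampled from a reference of this length."""
    return len(range(0, reference_length - k + 1, step))


for k, step in [(13, 6), (13, 2), (20, 13)]:
    print(k, step, min_exact_stretch(k, step), sampled_word_count(1_000_000, k, step))
```

Shorter words and smaller steps lower the required error-free stretch, which favours sensitivity, while longer words and larger steps reduce the number of index entries and seed hits to examine, which favours speed.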
A mode for the detection of split (chimeric) reads is provided. Multi-threaded program execution is supported.
Mapping with SMALT involves two steps: First, a hash index has to be generated for the genomic reference sequences. Then the sequencing reads are mapped onto the reference using the index.
All sequence input files have to be in FASTA or FASTQ format.
The command smalt index -k 13 -s 6 hs37k13s6 NCBI37.fasta builds a hash table for the human genome in the file NCBI37.fasta. Two files, hs37k13s6.smi and hs37k13s6.sma, are written to disk.
The option -k 13 specifies the length and -s 6 the spacing of the hashed words. This setting is suitable for human DNA reads from the Illumina-Solexa platform with read lengths above 70 nucleotides.
The command smalt map -i 800 -f samsoft -o map.sam hs37k13s6 mate_1.fastq mate_2.fastq loads the hash table created in the previous step into memory and maps the paired-end reads in the files mate_1.fastq and mate_2.fastq, with expected insert sizes of up to 800 bp.
The output is written to the file map.sam in SAM output format using soft clipping of sequences.
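The two steps could be scripted, for instance from Python. The sketch below only assembles the command lines shown above, using the same options (-k, -s, -i, -f samsoft, -o). The helper names index_command, map_command and run_pipeline are hypothetical, and run_pipeline assumes a smalt binary is available on the PATH.

```python
import subprocess


def index_command(index_name, reference_fasta, word_length=13, step=6):
    """Build the argument list for the indexing step."""
    return ["smalt", "index", "-k", str(word_length), "-s", str(step),
            index_name, reference_fasta]


def map_command(index_name, mate_1, mate_2, output_sam, max_insert=800):
    """Build the argument list for mapping paired-end reads to SAM (soft clipping)."""
    return ["smalt", "map", "-i", str(max_insert), "-f", "samsoft",
            "-o", output_sam, index_name, mate_1, mate_2]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = [
    index_command("hs37k13s6", "NCBI37.fasta"),
    map_command("hs37k13s6", "mate_1.fastq", "mate_2.fastq", "map.sam"),
]
for command in commands:
    print(" ".join(command))
# Call run_pipeline(commands) to execute both steps; it is not invoked here
# because it requires the smalt binary and the input files.
```

Passing arguments as a list avoids shell quoting problems with file names, and check=True ensures that mapping does not start if indexing fails.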
Released 11th November 2011
Older versions of SMALT are available on the FTP site.
© 2010, 2011 Genome Research Limited.
Binaries are available free of charge.
The source code will be made available shortly under the GNU General Public License (www.gnu.org/licenses/).
Questions and comments about SMALT should be directed to the author, Hannes Ponstingl.