Tools to automatically generate high-quality sequence by ordering contigs, closing gaps, correcting sequence errors and transferring annotation.
With the advent of next-generation sequencing, a lot of effort was put into developing software for mapping or aligning short reads and for performing genome assembly. The problem of generating a draft assembly (i.e. a set of unordered contigs) has now been very well addressed, but users who need high-quality assemblies for their analyses still face unresolved issues. This is where PAGIT is used.
PAGIT addresses the need for software to generate high-quality draft genomes. It is based on four programs that we developed: ABACAS, IMAGE, iCORN and RATT.
PAGIT bundles these programs and makes them more accessible to users.
We have a PAGIT mailing list for announcements and questions.
PAGIT was published in Nature Protocols: A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs. Nat Protoc. 2012 Jun 7;7(7):1260-84. doi: 10.1038/nprot.2012.068
Extra care must be taken when working with genomes larger than 200 Mb.
We have bundled the four tools together with some other helpful scripts. In the download area they can be downloaded as precompiled versions, or pre-installed on a virtual machine.
By following the links to the individual tools it is possible to download the source code for each tool.
The virtual machine version makes it possible to run the suite on almost any operating system, as long as you can install VirtualBox:
https://www.virtualbox.org/wiki/Downloads
PAGIT is free software and is distributed under the terms of the GNU General Public License.
PAGIT relies on other freely available bioinformatics software developed by third parties. The list of this third-party software is as follows:
For questions or comments, please contact Thomas D. Otto.
PAGIT is compiled for Linux/Unix systems and is also available as a virtual machine. The installation procedure is below.
mv PAGIT.V1.64bit.tgz /path/to/my/installed/software
cd /path/to/my/installed/software
tar xzf PAGIT.V1.64bit.tgz
bash ./installme.sh
source PAGIT/sourceme.pagit
The virtual machine was tested on Windows and Mac OS. At least 4 GB of memory is recommended when working with bacterial-size genomes. If the machine has less memory, setting up swap space might be required, see below.
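As a rough illustration of the memory check, the sketch below compares the machine's total RAM against the 4 GB recommendation and prints (without running them) a generic set of Linux commands for adding a swap file. The `/swapfile` path, the 4G size and the helper names `needs_swap` and `print_swap_commands` are assumptions made for this example; they are not part of PAGIT, and the procedure on your system may differ.

```shell
#!/usr/bin/env bash
# Sketch only: the 4 GB threshold comes from the text above; the /swapfile
# path and the 4G size are assumptions chosen for illustration.

MIN_KB=$((4 * 1024 * 1024))   # 4 GB expressed in kB

# Succeeds (exit status 0) when the given memory size in kB is below 4 GB.
needs_swap() {
  [ "$1" -lt "$MIN_KB" ]
}

# Prints, without running them, common Linux commands for adding a swap file.
print_swap_commands() {
  cat <<'EOF'
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
EOF
}

# Read total memory from /proc/meminfo where it exists (Linux only).
if [ -r /proc/meminfo ]; then
  mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  if needs_swap "$mem_kb"; then
    echo "Only ${mem_kb} kB of RAM detected; consider adding swap:"
    print_swap_commands
  else
    echo "${mem_kb} kB of RAM detected; swap is probably not needed."
  fi
fi
```

The commands are only printed so that they can be reviewed before being run as root inside the virtual machine or on the host.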
cd ~/bin/PAGIT/exampleTestset/
./dotestrun.sh
All four programs of PAGIT should run through and, at the end, an ACT window will open.
IMPORTANT: The password for root is wt. For the user pagit it is pagitvm.
The script to join chromosomes for ABACAS was missing. Please download it and unzip the content in the PAGIT/ABACAS directory.
ABACAS - Algorithm Based Automatic Contiguation of Assembled Sequences
ABACAS is intended to rapidly contiguate (align, order, orientate), visualise and design primers to close gaps on shotgun assembled contigs based on a reference sequence.
Bioinformatics (Oxford, England) 2009;25;15;1968-9
PUBMED: 19497936; PMC: 2712343; DOI: 10.1093/bioinformatics/btp347
For further information see the SourceForge page of ABACAS
IMAGE - Iterative Mapping and Assembly for Gap Elimination
IMAGE is software designed to close gaps in any draft assembly using Illumina paired-end reads. IMAGE is best described in several stages: aligning Illumina reads at contig ends; locally assembling the reads into new contigs; extending or merging the reference contigs; and iterating the whole process to extend and merge more contigs.
Genome biology 2010;11;4;R41
PUBMED: 20388197; PMC: 2884544; DOI: 10.1186/gb-2010-11-4-r41
For further information see the SourceForge page of IMAGE
iCORN - Iterative Correction of Reference Nucleotides
iCORN is software to correct reference genome sequences. The main idea is to iteratively map reads and find differences in the sequence: as the sequence is corrected, a greater proportion of the reads is able to map. Results are exported for visualisation in Artemis or Gap4.
Bioinformatics (Oxford, England) 2010;26;14;1704-7
PUBMED: 20562415; PMC: 2894513; DOI: 10.1093/bioinformatics/btq269
For further information see the SourceForge page of iCORN
RATT - Rapid Annotation Transfer Tool
RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome. It was first developed to transfer annotations between different genome assembly versions. However, it can also transfer annotations between strains and even different species. RATT is able to transfer any entry present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation.
Nucleic acids research 2011;39;9;e57
PUBMED: 21306991; PMC: 3089447; DOI: 10.1093/nar/gkq1268
A crucial step in RATT is setting the correct transfer parameter. The possible options, and the MUMmer parameters each one implies, are:
| parameter name | word size | identity cutoff | cluster size | max extend cluster | anchor choice | rearrange | Faux SNP |
|---|---|---|---|---|---|---|---|
| Assembly | 30 | 99 | 400 | 1000 | | -g -o 0 | yes |
| Assembly.Repetitive | 30 | 99 | 400 | 1000 | --maxmatch | -g -o 0 | yes |
| Strain | 20 | 90 | 400 | 500 | | -r -o 1 | yes |
| Strain.global | 20 | 90 | 400 | 500 | | -g -o 1 | yes |
| Strain.Repetitive | 20 | 90 | 400 | 500 | --maxmatch | -r -o 1 | yes |
| Strain.global.Repetitive | 20 | 90 | 400 | 500 | --maxmatch | -g -o 1 | yes |
| Species | 10 | 40 | 400 | 1000 | | -r -o 5 | no |
| Species.global | 10 | 40 | 400 | 1000 | | -g -o 5 | no |
| Species.Repetitive | 10 | 40 | 400 | 1000 | --maxmatch | -r -o 5 | no |
| Species.global.Repetitive | 10 | 40 | 400 | 1000 | --maxmatch | -g -o 5 | no |
| Multiple | 25 | 98 | 400 | 1000 | --maxmatch | -q -o 1 | no |
| Free* | RATT_l | RATT_ind | RATT_c | RATT_g | RATT_anchor | RATT_rearrange | no |
(*) For the Free transfer type, these values must be set as bash variables. Alternatively, the user can simply update the start.ratt.sh file.
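The table can also be read as a lookup from transfer type to settings. The sketch below encodes it as a shell function, taking the anchor choice to be `--maxmatch` for the Repetitive and Multiple rows and empty otherwise, and reading the Free values from the bash variables named in the last row. The function name `ratt_params` and the `key=value` output format are hypothetical, chosen for illustration; this is not how RATT itself is implemented, and the Free values in the usage lines are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch only: ratt_params is a hypothetical helper, not part of RATT.
# It prints the table settings for a given transfer type.

ratt_params() {
  local type="$1"
  local l ind c g anchor="" rearrange snp
  case "$type" in
    Assembly|Assembly.Repetitive)
      l=30; ind=99; c=400; g=1000; rearrange="-g -o 0"; snp=yes ;;
    Strain|Strain.Repetitive)
      l=20; ind=90; c=400; g=500; rearrange="-r -o 1"; snp=yes ;;
    Strain.global|Strain.global.Repetitive)
      l=20; ind=90; c=400; g=500; rearrange="-g -o 1"; snp=yes ;;
    Species|Species.Repetitive)
      l=10; ind=40; c=400; g=1000; rearrange="-r -o 5"; snp=no ;;
    Species.global|Species.global.Repetitive)
      l=10; ind=40; c=400; g=1000; rearrange="-g -o 5"; snp=no ;;
    Multiple)
      l=25; ind=98; c=400; g=1000; anchor="--maxmatch"
      rearrange="-q -o 1"; snp=no ;;
    Free)
      # Free takes its values from the bash variables named in the table.
      if [ -z "${RATT_l:-}" ] || [ -z "${RATT_ind:-}" ] ||
         [ -z "${RATT_c:-}" ] || [ -z "${RATT_g:-}" ] ||
         [ -z "${RATT_rearrange:-}" ]; then
        echo "Free needs RATT_l, RATT_ind, RATT_c, RATT_g, RATT_rearrange" >&2
        return 1
      fi
      l=$RATT_l; ind=$RATT_ind; c=$RATT_c; g=$RATT_g
      anchor=${RATT_anchor:-}; rearrange=$RATT_rearrange; snp=no ;;
    *)
      echo "Unknown transfer type: $type" >&2
      return 1 ;;
  esac
  # The Repetitive variants differ only in the anchor choice.
  case "$type" in
    *.Repetitive) anchor="--maxmatch" ;;
  esac
  printf 'word_size=%s identity=%s cluster=%s max_extend=%s anchor=%s rearrange=%s faux_snp=%s\n' \
    "$l" "$ind" "$c" "$g" "$anchor" "$rearrange" "$snp"
}

# Usage: a predefined type, then Free with illustrative (arbitrary) values.
ratt_params Strain.Repetitive
( export RATT_l=15 RATT_ind=80 RATT_c=300 RATT_g=800 RATT_rearrange="-r -o 2"
  ratt_params Free )
```

Keeping the Repetitive handling in a second `case` mirrors the table, where each Repetitive row repeats its base row and changes only the anchor choice.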
For further information see the SourceForge page of RATT