PAGIT - Post Assembly Genome Improvement Toolkit

Tools to automatically generate high-quality sequence by ordering contigs, closing gaps, correcting sequence errors and transferring annotation.

With the advent of next generation sequencing a lot of effort was put into developing software for mapping or aligning short reads and performing genome assembly. For genome assembly the problem of generating a draft assembly (i.e. a set of unordered contigs) has now been very well addressed - but for users who need high quality assemblies for their analyses there are still unresolved issues: this is where PAGIT is used.

PAGIT addresses the need for software to generate high quality draft genomes. It is based on a series of programs that we developed:

  1. ABACAS, which contiguates contigs from a de novo assembly against a closely related reference.
  2. IMAGE, an iterative approach for closing gaps in assembled genomes using mate pair information. It is able to close gaps left open by the assembler in a draft genome, even when using the same data sets as used by the original assembler.
  3. iCORN, which corrects errors in the consensus sequence by iteratively mapping reads to the current assembly.
  4. RATT, a tool to transfer the annotation from a reference genome, or an earlier assembly, onto the latest assembly.

PAGIT bundles these programs and makes them more accessible to users.

A complete description of the tools in PAGIT was published in Nature Protocols: A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs. Nat Protoc. 2012 Jun 7;7(7):1260-84. doi: 10.1038/nprot.2012.068. A copy of the manuscript is also available in PubMed Central.

We have a mailing list for announcements and questions: the PAGIT mailing list.

Contact: tdo sanger.ac.uk

Planned updates for Version 1.1

  • several bug fixes
  • include iCORN2, which can efficiently correct errors in PacBio assemblies
  • include REAPR, a tool to correct assemblies

Extra care must be taken when working with genomes larger than 200 Mb.
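
To see whether a genome falls above that size, the total sequence length of a FASTA file can be measured before starting. The following is a minimal sketch, not part of PAGIT: the function names are hypothetical, and it assumes the limit refers to 200 million bases.

```shell
# Sketch: total sequence length of a FASTA file and a check against a size limit.
# genome_size_bp and is_large_genome are hypothetical helpers, not PAGIT commands.

# Print the number of sequence characters in a FASTA file ($1).
genome_size_bp() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END   { printf "%.0f\n", total }' "$1"
}

# Succeed if the genome in $1 is larger than the limit in $2.
# The default limit of 200000000 bases is an assumption about what "200mb" means.
is_large_genome() {
    limit=${2:-200000000}
    [ "$(genome_size_bp "$1")" -gt "$limit" ]
}

# Example (assembly.fa is a placeholder file name):
#   is_large_genome assembly.fa && echo "large genome: take extra care"
```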

[Genome Research Limited]

Overview

How to Get PAGIT:

We have bundled the four tools together with some other helpful scripts. In the download area they can be downloaded as precompiled versions, or pre-installed on a virtual machine.

By following the links to the individual tools it is possible to download the source code for each tool.

The virtual machine version makes it possible to run the suite on almost every operating system, as long as you can install VirtualBox:
https://www.virtualbox.org/wiki/Downloads

License

PAGIT is free software and is distributed under the terms of the GNU General Public License.

PAGIT relies on other freely available bioinformatics software developed by third parties. The list of this third-party software is as follows:

  • Artemis - annotation & BAM visualization tool
  • ACT - a DNA sequence comparison viewer
  • BLASTALL - sequence comparison tool
  • BWA - read mapping tool (Burrows-Wheeler transformation based)
  • MUMmer - sequence comparison tools
  • SAMTOOLS - suite to work with BAM files
  • SMALT - read mapping tool (k-mer based)
  • VELVET - short read assembler

Contact

For questions or comments, please contact Thomas D. Otto.

Download

PAGIT is compiled for Linux/Unix systems and is also available as a virtual machine. The installation procedure is described below.

Linux
Virtual Machine

Installation

Linux

  • Download the appropriate compressed tar archive for your Linux system by clicking on the Linux 64-bit binary link above.
  • Move the compressed tar archive to the location where you want PAGIT installed, then decompress the tarball by typing the following commands in a terminal window:
    mv PAGIT.V1.64bit.tgz /path/to/my/installed/software
    cd /path/to/my/installed/software
    tar xzf PAGIT.V1.64bit.tgz
    
  • Now execute the install script by typing the following in a terminal window:
    bash ./installme.sh
    
  • Each time you want to run PAGIT, first source the environment settings:
    source PAGIT/sourceme.pagit
    
  • (Optional) Instead of sourcing the environment settings each time PAGIT is executed, you can add the command source PAGIT/sourceme.pagit to your shell startup file (for example, .bashrc) so that the PAGIT environment is initialised automatically.
  • We assume that the tcsh shell and Java (version 1.6 or above) are installed on the system.
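
The last two points can be scripted. The following is a minimal sketch under stated assumptions: the function names are hypothetical, only sourceme.pagit comes from the instructions above, and the Java check only tests that a java command exists, not that it is version 1.6 or above.

```shell
# Sketch: check the stated prerequisites, then register PAGIT in a startup file.
# Both function names are hypothetical; they are not shipped with PAGIT.

# Print each prerequisite that is not on PATH; return 1 if any is missing.
pagit_missing_prereqs() {
    missing=0
    for tool in tcsh java; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Append "source <pagit_dir>/sourceme.pagit" to a startup file unless the
# exact line is already present.
#   $1 = PAGIT directory, $2 = startup file (defaults to ~/.bashrc)
pagit_add_to_rc() {
    pagit_dir=$1
    rc_file=${2:-"$HOME/.bashrc"}
    line="source $pagit_dir/sourceme.pagit"
    touch "$rc_file"
    if ! grep -qxF "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example:
#   pagit_missing_prereqs || echo "install the tools listed above first"
#   pagit_add_to_rc /path/to/my/installed/software/PAGIT
```

Calling pagit_add_to_rc more than once is harmless, because the line is only appended when it is not already in the file.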

Virtual Machine

The virtual machine was tested on Windows and Mac OS. We recommend at least 4GB of memory when working with bacterial-size genomes. If the machine has less memory, setting up swap space might be required (see below).
  • If you have not already done so, download the VirtualBox software and install it according to the VirtualBox documentation.
  • Download the PAGIT virtual machine required for your operating system. Click on either the Virtual Machine 32 bit or the Virtual Machine 64 bit link above.
  • If you choose the bzip2 version, you will need to decompress the file first. Depending on your operating system, double-clicking the file should be enough.
  • Open VirtualBox and click on New to create a new virtual machine. Click on Next to move through the registration screens.
  • You will need to give the virtual machine a name (e.g. PAGIT) and select the operating system and version: Linux, and then either Ubuntu or Ubuntu64.
  • Specify the amount of memory to be allocated. You should not give the virtual machine more than 75% of the total memory available, but it should have at least 2GB.
  • Specify the virtual hard disk: select the use existing hard disk option and click on the file icon to find and select the downloaded PAGIT virtual machine.
  • To start the virtual machine, select it and click on the green arrow.
  • If a terminal is not already open, open one (left side, third-to-last icon).
  • As all variables are already set, you can try the test set with:
    cd ~/bin/PAGIT/exampleTestset/
    ./dotestrun.sh
    
    All four programs of PAGIT should run through and at the end, an ACT window will open.
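
The memory rule in the steps above (no more than 75% of the host's memory, but at least 2GB) can be worked out with a little arithmetic. A minimal sketch, assuming memory is given in MB; vm_memory_mb is a hypothetical helper, not part of PAGIT or VirtualBox:

```shell
# Sketch: largest memory allocation (in MB) that respects the rule above.
#   $1 = total host memory in MB
# Prints 75% of the host memory; returns 1 when that value would fall
# below the 2 GB minimum, i.e. when both conditions cannot hold.
vm_memory_mb() {
    total_mb=$1
    cap_mb=$(( total_mb * 75 / 100 ))   # upper limit: 75% of host memory
    min_mb=2048                          # lower limit: 2 GB
    if [ "$cap_mb" -lt "$min_mb" ]; then
        echo "host has ${total_mb} MB; 75% (${cap_mb} MB) is below the 2 GB minimum" >&2
        return 1
    fi
    echo "$cap_mb"
}

# Example: vm_memory_mb 8192   # prints 6144
```

On a host where the function fails, the two limits conflict and a swap setup, as mentioned above, may be needed.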

IMPORTANT: The password for root is wt. For the user pagit it is pagitvm.

Bugs

  • The script to join chromosomes for ABACAS was missing. Please download it and unzip the contents into the PAGIT/ABACAS directory.
  • Promer in ABACAS: the paths in the promer file were set incorrectly. A new file is available for download; please use it to replace the file in the directory PAGIT/bin/.
  • ABACAS option order: a bug was reported in ABACAS in which the order of the parameters matters. Please double-check the order of your parameters.

ABACAS

ABACAS - Algorithm Based Automatic Contiguation of Assembled Sequences

ABACAS figure.

ABACAS is intended to rapidly contiguate (align, order, orientate), visualise and design primers to close gaps on shotgun assembled contigs based on a reference sequence.

  • ABACAS: algorithm-based automatic contiguation of assembled sequences.

    Assefa S, Keane TM, Otto TD, Newbold C and Berriman M

    Bioinformatics (Oxford, England) 2009;25;15;1968-9

For further information see the SourceForge page of ABACAS

IMAGE

IMAGE - Iterative Mapping and Assembly for Gap Elimination

IMAGE figure.

IMAGE is software designed to close gaps in any draft assembly using Illumina paired-end reads. IMAGE is best described in several stages: aligning Illumina reads at contig ends; locally assembling the reads into new contigs; extending or merging the reference contigs; and iterating the whole process to extend and merge more contigs.
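
Because the process is iterative, it helps to count the remaining gaps before and after a run. The following is a minimal sketch, not part of IMAGE: count_gaps is a hypothetical helper, and it assumes that gaps appear as runs of N in a FASTA file, which is a common convention rather than something stated here.

```shell
# Sketch: count gaps (maximal runs of N or n) in a FASTA file.
# count_gaps is a hypothetical helper; it is not an IMAGE command.
#   $1 = FASTA file
count_gaps() {
    awk '
        function tally() {
            # gsub returns how many maximal runs of N it replaced in this record
            if (seq != "") total += gsub(/[Nn]+/, "x", seq)
            seq = ""
        }
        /^>/ { tally(); next }      # header line: finish the previous record
        { seq = seq $0 }            # sequence line: append to the current record
        END { tally(); print total + 0 }
    ' "$1"
}

# Example (file names are placeholders):
#   before=$(count_gaps draft.fa)
#   after=$(count_gaps improved.fa)
#   echo "gaps closed: $(( before - after ))"
```

Each record is joined before counting, so a gap that is split across several lines of the file is counted once.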

  • Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps.

    Tsai IJ, Otto TD and Berriman M

    Genome biology 2010;11;4;R41

For further information see the SourceForge page of IMAGE

iCORN

iCORN - Iterative Correction of Reference Nucleotides

iCORN figure.

iCORN is software to correct reference genome sequences. The main idea is to iteratively map reads and find differences in the sequence: as the sequence is corrected, a greater proportion of the reads are able to map. Results are exported for Artemis or Gap4 for visualisation.

  • Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology.

    Otto TD, Sanders M, Berriman M and Newbold C

    Bioinformatics (Oxford, England) 2010;26;14;1704-7

For further information see the SourceForge page of iCORN

RATT

RATT - Rapid Annotation Transfer Tool

RATT figure.

RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome. It was first developed to transfer annotations between different genome assembly versions. However, it can also transfer annotations between strains and even different species. RATT is able to transfer any entry present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation.

  • RATT: Rapid Annotation Transfer Tool.

    Otto TD, Dillon GP, Degrave WS and Berriman M

    Nucleic acids research 2011;39;9;e57

A crucial step in RATT is setting the correct transfer parameter. The possible options, and the MUMmer parameters each one implies, are:

parameter name             word size  identity cutoff  cluster size  max extend cluster  anchor choice  rearrange       Faux SNP
Assembly                   30         99               400           1000                (none)         -g -o 0         yes
Assembly.Repetitive        30         99               400           1000                --maxmatch     -g -o 0         yes
Strain                     20         90               400           500                 (none)         -r -o 1         yes
Strain.global              20         90               400           500                 (none)         -g -o 1         yes
Strain.Repetitive          20         90               400           500                 --maxmatch     -r -o 1         yes
Strain.global.Repetitive   20         90               400           500                 --maxmatch     -g -o 1         yes
Species                    10         40               400           1000                (none)         -r -o 5         no
Species.global             10         40               400           1000                (none)         -g -o 5         no
Species.Repetitive         10         40               400           1000                --maxmatch     -r -o 5         no
Species.global.Repetitive  10         40               400           1000                --maxmatch     -g -o 5         no
Multiple                   25         98               400           1000                --maxmatch     -q -o 1         no
Free*                      RATT_l     RATT_ind         RATT_c        RATT_g              RATT_anchor    RATT_rearrange  no

(*) For the Free option, these values must be set as bash variables. Alternatively, the user can simply edit the start.ratt.sh file.
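
The table can also be expressed as a lookup, which is handy when scripting around RATT. This is a minimal sketch and not RATT code: ratt_params and its output format are hypothetical, the values are copied from the table above, and for Free it reads the bash variables named in the table.

```shell
# Sketch: print the settings implied by a RATT transfer parameter.
# ratt_params is a hypothetical helper; values are copied from the table above.
# Output (tab-separated): word size, identity cutoff, cluster size,
# max extend cluster, anchor choice, rearrange, Faux SNP.
ratt_params() {
    case $1 in
        Assembly)                  set -- 30 99 400 1000 ""         "-g -o 0" yes ;;
        Assembly.Repetitive)       set -- 30 99 400 1000 --maxmatch "-g -o 0" yes ;;
        Strain)                    set -- 20 90 400 500  ""         "-r -o 1" yes ;;
        Strain.global)             set -- 20 90 400 500  ""         "-g -o 1" yes ;;
        Strain.Repetitive)         set -- 20 90 400 500  --maxmatch "-r -o 1" yes ;;
        Strain.global.Repetitive)  set -- 20 90 400 500  --maxmatch "-g -o 1" yes ;;
        Species)                   set -- 10 40 400 1000 ""         "-r -o 5" no ;;
        Species.global)            set -- 10 40 400 1000 ""         "-g -o 5" no ;;
        Species.Repetitive)        set -- 10 40 400 1000 --maxmatch "-r -o 5" no ;;
        Species.global.Repetitive) set -- 10 40 400 1000 --maxmatch "-g -o 5" no ;;
        Multiple)                  set -- 25 98 400 1000 --maxmatch "-q -o 1" no ;;
        Free)
            # Free: take the values from the bash variables listed in the table
            set -- "${RATT_l-}" "${RATT_ind-}" "${RATT_c-}" "${RATT_g-}" \
                   "${RATT_anchor-}" "${RATT_rearrange-}" no ;;
        *)  echo "unknown transfer parameter: $1" >&2; return 1 ;;
    esac
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@"
}

# Example: ratt_params Strain.Repetitive
```

An empty anchor field means that no anchor option is listed for that transfer parameter.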

For further information see the SourceForge page of RATT

FAQ

 

  1. Are there other ways to improve the assembly, e.g. manually?
    • This is a very complex topic, and not really part of PAGIT as such, but the Wellcome Trust Advanced Courses do teach how to generate and improve assemblies (in the working with pathogens workshop). The module is available as a PDF and might help you find mis-assemblies, understand the reasons behind them, and fix them manually.