PAGIT - Post Assembly Genome Improvement Toolkit

Tools to automatically generate high-quality sequence by ordering contigs, closing gaps, correcting sequence errors and transferring annotation.

With the advent of next generation sequencing a lot of effort was put into developing software for mapping or aligning short reads and performing genome assembly. For genome assembly the problem of generating a draft assembly (i.e. a set of unordered contigs) has now been very well addressed - but for users who need high quality assemblies for their analyses there are still unresolved issues: this is where PAGIT is used.

PAGIT addresses the need for software to generate high quality draft genomes. It is based on a series of programs that we developed:

  1. ABACAS, which contiguates (aligns, orders and orientates) contigs from a de novo assembly against a closely related reference.
  2. IMAGE, an iterative approach for closing gaps in assembled genomes using mate pair information. It is able to close gaps left open by the assembler in a draft genome, even when using the same data sets as used by the original assembler.
  3. iCORN, which corrects errors in the consensus sequence by iteratively mapping reads to the current assembly.
  4. RATT, a tool to transfer the annotation from a reference genome, or an earlier assembly, onto the latest assembly.

PAGIT bundles these programs and makes them more accessible to users.

Overview

A complete description of the tools in PAGIT was published in:

  • A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs.

    Swain MT, Tsai IJ, Assefa SA, Newbold C, Berriman M and Otto TD

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK.

    Genome projects now produce draft assemblies within weeks owing to advanced high-throughput sequencing technologies. For milestone projects such as Escherichia coli or Homo sapiens, teams of scientists were employed to manually curate and finish these genomes to a high standard. Nowadays, this is not feasible for most projects, and the quality of genomes is generally of a much lower standard. This protocol describes software (PAGIT) that is used to improve the quality of draft genomes. It offers flexible functionality to close gaps in scaffolds, correct base errors in the consensus sequence and exploit reference genomes (if available) in order to improve scaffolding and generating annotations. The protocol is most accessible for bacterial and small eukaryotic genomes (up to 300 Mb), such as pathogenic bacteria, malaria and parasitic worms. Applying PAGIT to an E. coli assembly takes ∼24 h: it doubles the average contig size and annotates over 4,300 gene models.

    Funded by: Wellcome Trust: 098051

    Nature protocols 2012;7;7;1260-84

How to Get PAGIT:

We have bundled the four tools together with some other helpful scripts. In the download area they can be downloaded as precompiled versions, or pre-installed on a virtual machine.

By following the links to the individual tools it is possible to download the source code for each tool.

The virtual machine version makes it possible to run the suite on almost any operating system, as long as you can install VirtualBox:
https://www.virtualbox.org/wiki/Downloads

License

PAGIT is free software and is distributed under the terms of the GNU General Public License.

PAGIT relies on other freely available bioinformatics software developed by third parties. The list of this third-party software is as follows:

  • Artemis - annotation & BAM visualization tool
  • ACT - a DNA sequence comparison viewer
  • BLASTALL - sequence comparison tool
  • BWA - read mapping tool (Burrows-Wheeler transformation based)
  • MUMmer - sequence comparison tools
  • SAMTOOLS - suite to work with BAM files
  • SMALT - read mapping tool (k-mer based)
  • VELVET - short read assembler

Planned updates for Version 1.1

  • several bug fixes
  • include icorn2, which can efficiently correct errors in pacbio assemblies
  • include REAPR, a tool to correct assemblies

Warning

Extra care must be taken when working with genomes larger than 200 Mb.
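
One way to check an assembly against the size guideline in the warning above is to total the sequence length in its FASTA file. The following is a minimal sketch, not part of PAGIT: it assumes a plain (uncompressed) FASTA file, and the function names genome_size_bp and is_large_genome are hypothetical.

```shell
# Sketch only: these helper functions are illustrative and not part of PAGIT.

# Print the number of sequence characters in a FASTA file
# (header lines starting with ">" and all whitespace are excluded).
genome_size_bp() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Succeed (exit status 0) if the FASTA file is larger than the limit.
# The default limit of 200,000,000 bp (200 Mb) comes from the warning above;
# an optional second argument overrides it.
is_large_genome() {
    local limit_bp="${2:-200000000}"
    [ "$(genome_size_bp "$1")" -gt "$limit_bp" ]
}
```

For example, `is_large_genome contigs.fa && echo "take extra care"` would print the message only for an assembly above the 200 Mb guideline (contigs.fa is a placeholder file name).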

Contact

For questions or comments, please contact Thomas D. Otto.

Additionally, we have a mailing list for announcements and questions: the PAGIT mailing list.

Download

PAGIT is compiled for Linux/Unix systems and is also available as a virtual machine. The installation procedure is described below.

Linux

Virtual Machine

Installation

Linux

  • Download the appropriate compressed tar archive for your Linux system by clicking on the Linux 64-bit binary link above.
  • Move the compressed tar archive to the location where you want PAGIT installed, then decompress the tarball by typing the following commands in a terminal window:
    mv PAGIT.V1.64bit.tgz /path/to/my/installed/software
    cd /path/to/my/installed/software
    tar xzf PAGIT.V1.64bit.tgz
    
  • Now execute the install script by typing the following in a terminal window:
    bash ./installme.sh
    
  • Each time you want to run PAGIT, first source the environment settings:
    source PAGIT/sourceme.pagit
    
  • (Optional) Instead of sourcing the environment settings by hand each time, you can add the command source PAGIT/sourceme.pagit to your shell start-up file (for example .bashrc), using the full path to your installation, so that the PAGIT environment is initialised automatically.
  • We assume that the tcsh shell and Java (version 1.6 or above) are installed on the system.
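
The prerequisite check and the optional start-up-file step above can be sketched as shell functions. This is a minimal sketch under stated assumptions, not part of the PAGIT installer: the function names missing_tools, check_pagit_prereqs and add_pagit_to_rc are hypothetical, and the installation directory is whatever location you unpacked the archive into.

```shell
# Sketch only: these helper functions are illustrative and not part of PAGIT.

# Print the name of each listed command that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check for the prerequisites named above (tcsh and Java).
# The Java version line is printed so it can be compared with 1.6 by eye.
check_pagit_prereqs() {
    local missing
    missing=$(missing_tools tcsh java)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing" | sed 's/^/Missing prerequisite: /' >&2
        return 1
    fi
    java -version 2>&1 | head -n 1
}

# Append the source command to a start-up file unless it is already there.
# $1: directory containing the unpacked PAGIT folder (absolute path).
# $2: start-up file to edit (defaults to ~/.bashrc).
add_pagit_to_rc() {
    local install_dir="$1" rc_file="${2:-$HOME/.bashrc}"
    local line="source $install_dir/PAGIT/sourceme.pagit"
    grep -qxF "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

For example, `add_pagit_to_rc /path/to/my/installed/software` would add the line once; running it again leaves the file unchanged.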

Virtual Machine

The virtual machine was tested on Windows and Mac OS. At least 4 GB of memory is recommended when working with bacterial-size genomes. If the machine has less memory, setting up swap space might be required (see below).

  • If you have not already done so, download the VirtualBox software (https://www.virtualbox.org/wiki/Downloads) and install it according to the VirtualBox documentation.
  • Download the PAGIT virtual machine required for your operating system. Click on either the Virtual Machine 32 bit or the Virtual Machine 64 bit link above.
  • If you choose the bzip2 version, you will need to decompress the file first. Depending on your operating system, double-clicking the file should do this.
  • Open VirtualBox and click on New to create a new virtual machine. Click on Next to move through the registration screens.
  • You will need to give the virtual machine a name (e.g. PAGIT) and select the operating system and version: Linux, and then either Ubuntu or Ubuntu64.
  • Specify the amount of memory to be allocated. You should not give the virtual machine more than 75% of the total memory available, but it should have at least 2 GB.
  • Specify the virtual hard disk: select the Use existing hard disk option and click on the file icon to find and select the downloaded PAGIT virtual machine.
  • To start the virtual machine, select it and click on the green arrow.
  • If a terminal is not already open, open one using the third-last icon on the left side.
  • As all variables are already set, you can try the test set with:
    cd ~/bin/PAGIT/exampleTestset/
    ./dotestrun.sh
      
    
    All four programs of PAGIT should run to completion and, at the end, an ACT window will open.

IMPORTANT: The password for root is wt. For the user pagit it is pagitvm.
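
The memory guidance in the virtual machine steps above (no more than 75% of the host's total memory, and at least 2 GB) can be written as a small calculation. This is a minimal sketch: vm_memory_cap_mb is a hypothetical helper, and the host's total memory in MB is assumed to be supplied by the user.

```shell
# Sketch only: this helper is illustrative and not part of PAGIT or VirtualBox.

# Print the largest amount of memory (in MB) to give the virtual machine,
# i.e. 75% of the host total passed as $1.
# The exit status is non-zero if that amount is below the 2 GB (2048 MB) minimum.
vm_memory_cap_mb() {
    local total_mb="$1"
    local cap_mb=$(( total_mb * 75 / 100 ))
    printf '%s\n' "$cap_mb"
    [ "$cap_mb" -ge 2048 ]
}
```

For a host with 8192 MB, `vm_memory_cap_mb 8192` prints 6144. For a host with 2048 MB it prints 1536 and returns a non-zero status, which indicates that the 2 GB minimum cannot be met within the 75% limit.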

Bugs

  • The script to join chromosomes for ABACAS was missing. Please download it and unzip the content into the PAGIT/ABACAS directory.
  • Promer in ABACAS: the paths in the promer file were set incorrectly. A corrected file is available for download; please use it to replace the existing file in the directory PAGIT/bin/.
  • ABACAS option order: a bug was reported in ABACAS whereby the order of the parameters matters. Please double-check the order of your options.

ABACAS

ABACAS - Algorithm Based Automatic Contiguation of Assembled Sequences

ABACAS figure.

ABACAS is intended to rapidly contiguate (align, order, orientate), visualise and design primers to close gaps on shotgun assembled contigs based on a reference sequence.

  • ABACAS: algorithm-based automatic contiguation of assembled sequences.

    Assefa S, Keane TM, Otto TD, Newbold C and Berriman M

    Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SA, UK. sa4@sanger.ac.uk

    Summary: Due to the availability of new sequencing technologies, we are now increasingly interested in sequencing closely related strains of existing finished genomes. Recently a number of de novo and mapping-based assemblers have been developed to produce high quality draft genomes from new sequencing technology reads. New tools are necessary to take contigs from a draft assembly through to a fully contiguated genome sequence. ABACAS is intended as a tool to rapidly contiguate (align, order, orientate), visualize and design primers to close gaps on shotgun assembled contigs based on a reference sequence. The input to ABACAS is a set of contigs which will be aligned to the reference genome, ordered and orientated, visualized in the ACT comparative browser, and optimal primer sequences are automatically generated.

    ABACAS is implemented in Perl and is freely available for download from http://abacas.sourceforge.net.

    Funded by: Wellcome Trust: WT085775/Z/08/Z

    Bioinformatics (Oxford, England) 2009;25;15;1968-9

For further information see the SourceForge page of ABACAS

IMAGE

IMAGE - Iterative Mapping and Assembly for Gap Elimination

IMAGE figure.

IMAGE is software designed to close gaps in any draft assembly using Illumina paired-end reads. IMAGE is best described in several stages: aligning Illumina reads at contig ends; locally assembling the reads into new contigs; extending or merging the reference contigs; and iterating the whole process to extend and merge more contigs.

  • Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps.

    Tsai IJ, Otto TD and Berriman M

    Parasite Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. jit@sanger.ac.uk

    Advances in sequencing technology allow genomes to be sequenced at vastly decreased costs. However, the assembled data frequently are highly fragmented with many gaps. We present a practical approach that uses Illumina sequences to improve draft genome assemblies by aligning sequences against contig ends and performing local assemblies to produce gap-spanning contigs. The continuity of a draft genome can thus be substantially improved, often without the need to generate new data.

    Funded by: Wellcome Trust: WT 085775/Z/08/Z

    Genome biology 2010;11;4;R41

For further information see the SourceForge page of IMAGE

iCORN

iCORN - Iterative Correction of Reference Nucleotides

iCORN figure.

iCORN is software to correct reference genome sequences. The main idea is to iteratively map reads and find differences in the sequence: as the sequence is corrected, a greater proportion of the reads are able to map. Results are exported for visualisation in Artemis or Gap4.

  • Iterative Correction of Reference Nucleotides (iCORN) using second generation sequencing technology.

    Otto TD, Sanders M, Berriman M and Newbold C

    Parasite Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK. tdo@sanger.ac.uk

    Motivation: The accuracy of reference genomes is important for downstream analysis but a low error rate requires expensive manual interrogation of the sequence. Here, we describe a novel algorithm (Iterative Correction of Reference Nucleotides) that iteratively aligns deep coverage of short sequencing reads to correct errors in reference genome sequences and evaluate their accuracy.

    Results: Using Plasmodium falciparum (81% A + T content) as an extreme example, we show that the algorithm is highly accurate and corrects over 2000 errors in the reference sequence. We give examples of its application to numerous other eukaryotic and prokaryotic genomes and suggest additional applications.

    Availability: The software is available at http://icorn.sourceforge.net

    Funded by: Wellcome Trust: WT085775/Z/08/Z

    Bioinformatics (Oxford, England) 2010;26;14;1704-7

For further information see the SourceForge page of iCORN

RATT

RATT - Rapid Annotation Transfer Tool

RATT figure.

RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome. It was first developed to transfer annotations between different genome assembly versions. However, it can also transfer annotations between strains and even different species. RATT is able to transfer any entry present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation.

  • RATT: Rapid Annotation Transfer Tool.

    Otto TD, Dillon GP, Degrave WS and Berriman M

    Parasite Genomics, Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK. tdo@sanger.ac.uk

    Second-generation sequencing technologies have made large-scale sequencing projects commonplace. However, making use of these datasets often requires gene function to be ascribed genome wide. Although tool development has kept pace with the changes in sequence production, for tasks such as mapping, de novo assembly or visualization, genome annotation remains a challenge. We have developed a method to rapidly provide accurate annotation for new genomes using previously annotated genomes as a reference. The method, implemented in a tool called RATT (Rapid Annotation Transfer Tool), transfers annotations from a high-quality reference to a new genome on the basis of conserved synteny. We demonstrate that a Mycobacterium tuberculosis genome or a single 2.5 Mb chromosome from a malaria parasite can be annotated in less than five minutes with only modest computational resources. RATT is available at http://ratt.sourceforge.net.

    Funded by: Wellcome Trust: WT 085775/Z/08/Z

    Nucleic acids research 2011;39;9;e57

A crucial step in RATT is setting the correct transfer parameter. The possible options, and the MUMmer parameters that each implies, are listed here:

parameter name             word size  identity cutoff  cluster size  max extend cluster  anchor choice  rearrange       Faux SNP
Assembly                   30         99               400           1000                (none)         -g -o 0         yes
Assembly.Repetitive        30         99               400           1000                --maxmatch     -g -o 0         yes
Strain                     20         90               400           500                 (none)         -r -o 1         yes
Strain.global              20         90               400           500                 (none)         -g -o 1         yes
Strain.Repetitive          20         90               400           500                 --maxmatch     -r -o 1         yes
Strain.global.Repetitive   20         90               400           500                 --maxmatch     -g -o 1         yes
Species                    10         40               400           1000                (none)         -r -o 5         no
Species.global             10         40               400           1000                (none)         -g -o 5         no
Species.Repetitive         10         40               400           1000                --maxmatch     -r -o 5         no
Species.global.Repetitive  10         40               400           1000                --maxmatch     -g -o 5         no
Multiple                   25         98               400           1000                --maxmatch     -q -o 1         no
Free*                      RATT_l     RATT_ind         RATT_c        RATT_g              RATT_anchor    RATT_rearrange  no

(*) - these values must be set as bash variables. Alternatively, the user can simply edit the start.ratt.sh file.
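
As an illustration of the Free option, the six RATT_* variables can be filled from one of the predefined rows of the table above and then adjusted. This is a minimal sketch, not part of RATT: set_ratt_free_from is a hypothetical helper, and it assumes that rows listing no --maxmatch use no extra anchor option (an empty RATT_anchor).

```shell
# Sketch only: set_ratt_free_from is illustrative and not part of RATT.
# It copies one predefined row of the parameter table into the RATT_*
# variables named in the Free row, and exports them.
set_ratt_free_from() {
    # Assumption: an empty anchor means "no extra anchor option".
    local l ind c g anchor="" rearrange
    case "$1" in
        Assembly)                  l=30 ind=99 c=400 g=1000 rearrange="-g -o 0" ;;
        Assembly.Repetitive)       l=30 ind=99 c=400 g=1000 rearrange="-g -o 0" anchor="--maxmatch" ;;
        Strain)                    l=20 ind=90 c=400 g=500  rearrange="-r -o 1" ;;
        Strain.global)             l=20 ind=90 c=400 g=500  rearrange="-g -o 1" ;;
        Strain.Repetitive)         l=20 ind=90 c=400 g=500  rearrange="-r -o 1" anchor="--maxmatch" ;;
        Strain.global.Repetitive)  l=20 ind=90 c=400 g=500  rearrange="-g -o 1" anchor="--maxmatch" ;;
        Species)                   l=10 ind=40 c=400 g=1000 rearrange="-r -o 5" ;;
        Species.global)            l=10 ind=40 c=400 g=1000 rearrange="-g -o 5" ;;
        Species.Repetitive)        l=10 ind=40 c=400 g=1000 rearrange="-r -o 5" anchor="--maxmatch" ;;
        Species.global.Repetitive) l=10 ind=40 c=400 g=1000 rearrange="-g -o 5" anchor="--maxmatch" ;;
        Multiple)                  l=25 ind=98 c=400 g=1000 rearrange="-q -o 1" anchor="--maxmatch" ;;
        *) printf 'Unknown transfer type: %s\n' "$1" >&2; return 1 ;;
    esac
    export RATT_l="$l" RATT_ind="$ind" RATT_c="$c" RATT_g="$g"
    export RATT_anchor="$anchor" RATT_rearrange="$rearrange"
}
```

For example, calling `set_ratt_free_from Strain` and then setting `RATT_ind=95` would start from the Strain row with a stricter identity cutoff, ready for a run that uses the Free transfer type.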

For further information see the SourceForge page of RATT

FAQ

Are there other ways to improve the assembly, e.g. manually?
This is a very complex topic and not really part of PAGIT as such, but the Wellcome Trust Advanced Courses teach how to generate and improve assemblies (in the working with pathogens workshop). The PDF of the module is available here; it might help you to find mis-assemblies, understand the reasons behind them, and fix them manually.
* quick link - http://q.sanger.ac.uk/pagit