A tool that evaluates the accuracy of a genome assembly using mapped paired end reads.
REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison. It can be used at any stage of an assembly pipeline to automatically break incorrect scaffolds and flag other errors in an assembly for manual inspection. It reports mis-assemblies and other warnings, and produces a new assembly that has been broken at the positions of the error calls.
The software requires as input an assembly in FASTA format and paired reads mapped to the assembly in a BAM file. Mapping information such as the fragment coverage and insert size distribution is analysed to locate mis-assemblies. REAPR works best using mapped read pairs from a large insert library (at least 1000bp). Additionally, if a short insert Illumina library is also available, REAPR can combine this with the large insert library in order to score each base of the assembly.
Latest Linux version
Please see inside the tarball for installation instructions (in the README file) and the manual (pdf).
Note: to make input for REAPR, it is recommended that reads are mapped with version 0.7.0.1 of SMALT without the -f bam option (use -f samsoft and convert the output to BAM afterwards). Higher versions of SMALT have not been tested with REAPR. The latest version of REAPR can also run the mapping for you.
Latest Mac/Windows version (virtual machine)
REAPR was developed for and intended to be run on Linux. If you have a Windows machine or a Mac (or even Linux) then you can run REAPR using a virtual machine with VirtualBox. REAPR is installed on the Sanger Institute pathogens virtual machine.
Previous versions of REAPR are available on the FTP site.
How does REAPR use the short insert reads? Why are they optional?
They are used to accurately score each base of the assembly. They are not used to call errors so if you only want error calls, then you do not need to use them. Not using them means skipping the perfectmap or perfectfrombam stage that can be run before the main pipeline. Only read pairs that can map perfectly to the genome are used, so they need to be of high quality.
How does REAPR use the large insert reads?
REAPR uses the large insert reads to call errors in the assembly. All error calls and warnings, except for ‘low perfect coverage’, are generated from the large insert reads.
What do I do if I only have reads from one library?
The most common example of this is an assembly made from a single library of short fragment paired end reads. In this case they are likely to be high quality, so can be used as both the short and ‘long’ insert reads. If you only have large insert reads, then they are unlikely to be of high enough quality to be used as ‘short’ insert reads and it is probably best to skip the perfectmap stage and use them only for error calling. See page 10 of the manual for the commands to run.
How much coverage do I need?
For short insert reads, this depends on the quality of the reads. By default, a base of the assembly will not be called as error free if it has less than 5X perfect and unique coverage. For large insert reads, it is better to think in terms of fragment coverage, which needs to be at least around 15X. This could equate to only about 1X of read coverage, depending on your insert size.
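The relationship between fragment coverage and read coverage can be worked through with a little arithmetic. The sketch below is illustrative only and is not part of REAPR: the function names, the 1 Mb genome, the 3000 bp insert size and the 100 bp read length are all assumptions chosen for the example. It treats fragment coverage as (number of pairs × insert size) / genome size and read coverage as (2 × number of pairs × read length) / genome size.

```python
import math


def fragment_coverage(n_pairs: int, insert_size: int, genome_size: int) -> float:
    """Mean number of fragments (read pair plus the gap between mates) spanning a base."""
    return n_pairs * insert_size / genome_size


def read_coverage(n_pairs: int, read_length: int, genome_size: int) -> float:
    """Mean number of sequenced bases covering a base (two reads per pair)."""
    return 2 * n_pairs * read_length / genome_size


def pairs_needed(target_fragment_coverage: float, insert_size: int, genome_size: int) -> int:
    """Smallest number of read pairs giving at least the target fragment coverage."""
    return math.ceil(target_fragment_coverage * genome_size / insert_size)


# Hypothetical example values, not taken from the REAPR documentation.
GENOME_SIZE = 1_000_000  # bp
INSERT_SIZE = 3_000      # bp
READ_LENGTH = 100        # bp

n_pairs = pairs_needed(15, INSERT_SIZE, GENOME_SIZE)
print(n_pairs)                                                 # 5000
print(fragment_coverage(n_pairs, INSERT_SIZE, GENOME_SIZE))    # 15.0
print(read_coverage(n_pairs, READ_LENGTH, GENOME_SIZE))        # 1.0
```

With these assumed numbers, 15X fragment coverage corresponds to 1X read coverage, which is the situation described in the answer above.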
What do I do if I have reads from several libraries?
REAPR assumes that the short insert reads all came from the same insert size distribution. The same is true of the large insert reads. So you can combine reads from different libraries if they have approximately the same insert size. If you need to choose from multiple long insert lengths, use the longest that has enough coverage.
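The rule "use the longest that has enough coverage" can be sketched as a small selection function. This is only an illustration and not something REAPR does for you: the library names, the dictionary layout and the figures are hypothetical, and the 15X default threshold is borrowed from the coverage answer above as an assumption about what "enough" means.

```python
from typing import Dict, List, Optional


def choose_large_insert_library(
    libraries: List[Dict],
    min_fragment_coverage: float = 15.0,  # assumed threshold for "enough" coverage
) -> Optional[Dict]:
    """Return the library with the longest insert size among those with enough
    fragment coverage, or None if no library qualifies."""
    usable = [lib for lib in libraries if lib["fragment_coverage"] >= min_fragment_coverage]
    if not usable:
        return None
    return max(usable, key=lambda lib: lib["insert_size"])


# Hypothetical libraries: insert sizes in bp, fragment coverage in X.
libraries = [
    {"name": "lib_3kb", "insert_size": 3_000, "fragment_coverage": 40.0},
    {"name": "lib_8kb", "insert_size": 8_000, "fragment_coverage": 20.0},
    {"name": "lib_20kb", "insert_size": 20_000, "fragment_coverage": 6.0},
]

best = choose_large_insert_library(libraries)
print(best["name"])  # lib_8kb
```

Here the 20 kb library has the longest inserts but too little fragment coverage, so the 8 kb library is chosen.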
Can I use two or more libraries of different insert sizes to call errors?
No. Please see the answer to the previous question.
Can reads from technology other than Illumina be used?
Read pairs from any technology can be used, as long as you can make a BAM file of mapped read pairs. For example, REAPR has been shown to work well with 454 data.
Can REAPR be used with transcriptome reads?
This is not recommended because REAPR assumes that the coverage is approximately uniform across the assembly. Tests have shown that the uneven read depth of transcriptome data causes a very high false positive rate of error calls.
What does ‘error-free’ mean?
A base is called error-free if: 1) it has at least 5X perfect and unique coverage of the short insert reads and 2) the FCD (fragment coverage distribution) at that base is OK.
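The two conditions can be written as a simple per-base test. The following is a minimal sketch, not REAPR's implementation: it assumes that the perfect and unique coverage and the FCD status of each base have already been computed elsewhere, and the function names and example values are hypothetical.

```python
from typing import Iterable, Tuple


def is_error_free(perfect_unique_coverage: int, fcd_ok: bool, min_perfect_coverage: int = 5) -> bool:
    """Apply both conditions: enough perfect and unique coverage, and an OK FCD."""
    return fcd_ok and perfect_unique_coverage >= min_perfect_coverage


def error_free_fraction(bases: Iterable[Tuple[int, bool]]) -> float:
    """Fraction of bases called error-free, given (coverage, fcd_ok) for each base."""
    bases = list(bases)
    if not bases:
        return 0.0
    return sum(is_error_free(cov, ok) for cov, ok in bases) / len(bases)


# Hypothetical per-base values: (perfect and unique coverage, FCD OK?).
example = [(12, True), (5, True), (4, True), (30, False)]
print(error_free_fraction(example))  # 0.5
```

The third base fails on coverage alone and the fourth fails on the FCD alone, so only half of the example bases are called error-free.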
Can I use a different read mapper?
Yes, but be aware that we use SMALT because it has an option (-x) to map each read independently of its mate. This stops reads within a pair from incorrectly getting mapped near to each other and therefore helps with error calls. If your favourite read mapper cannot do this, then your results may be worse than using SMALT.
REAPR is available under the GPL3 license.
If you make use of this software in your research, please cite as:
REAPR: a universal tool for genome assembly evaluation. Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. Genome Biology 2013, 14(5):R47.