ESGI High throughput sequencing analysis tools

The European Sequencing and Genotyping Infrastructure (ESGI) project is a collaborative project and coordination support action for integration funded by the European Union FP7 programme.

ESGI involves a network of leading European genome research centres that aims to integrate the capacity for large-scale genomic analysis and technical expertise in sequencing and genotyping technology among the partner institutes. This web site was produced as part of Work Package 4 of ESGI in order to facilitate access to high throughput sequencing analysis tools, acting as an information resource for the project and the wider research community. It contains information about bioinformatics and computational tools available for the analysis of high-throughput genomic data.

In the accompanying pages on this site reached from the tabs below we identify tools in use at the centres in the network, classified into categories. In each case there is a brief description, together with links to the home pages for the relevant software package which support download and provide documentation and references.

The use of these tools can be automated with pipeline management software, which let you define a certain workflow and then push data through that pipeline. A pipeline manager ensures that all the tools in the pipeline get run successfully, spreading the work load over a compute cluster. VRPipe was used at the Sanger Institute to do the bulk of the data processing for the 1000 Genomes Project, and can be used to run any of the software packages listed below. A detailed guide for using VRPipe in Amazon's cloud is available, enabling those centres with limited local computing infrastructure to carry out large-scale analyses.

These pages are maintained by Yasin Memari on behalf of ESGI WP4.

[ESGI]

Formats

Standard formats are important for exchange of information between software components, allowing development of modular pipelines. Different stages of sequence analysis require different file formats. fastq format has emerged as de facto standard for storing primary nucleotide sequences and quality scores from various technologies, with a possibility of improvement (SRF). SAM/BAM provides a flexible and compact format for storing and manipulating the aligned sequences. Furthermore, new file formats have been proposed for variation data to allow for a more comprehensive storage of annotations and quality scores from sequencing.

  • SAM - Sequence Alignment/Map Format. The binary equivalent (BAM) is current standard for storing alignment data. The SAMtools and Picard toolkits provide the core generic interfaces for manipulating SAM/BAM.
  • CRAM - Proposed new format by the EBI, it is intended to replace BAM format, due to its greater compression capability. The CRAM toolkit is designed to read and write CRAM/BAM files, allow conversion between formats, and support integration with various software.
  • VCF - Variant Call Format designed for storing variant and genotype information for the 1000 Genomes Project, and now used more widely. Variants of different types, including single nucleotide variants (SNVs), insertion/deletions (indels), multi-allelic genotypes, etc can be stored along with annotations and quality scores. VCFtools can be used to manipulate the VCF files and to carry out basic population genetics calculations.
  • BED - UCSC format for defining annotation tracks. It can be used to store the coordinates for any sequence feature e.g. peaks for the ChIP-seq data.

Alignment

A critical first step in use of sequence data is alignment to a reference genome. High throughput sequencing requires efficient algorithms for mapping millions of query sequences to reference genomes. This stage is inevitable especially for data consisting of very short reads. Initial short-read aligners used the seed-and-index methods to build hash tables for individual query sequences (e.g. MAQ). More modern aligners (e.g. BWA) use the Burrows-Wheeler Transform to compress the entire suffix arrays of a reference genome into a string that has smaller memory footprint and handles repeats more gracefully. There is some trade-off between speed and accuracy, and the extent to which the confidence in an alignment is known, a quantity represented in the mapping quality measure.

  • BWA - A fast and accurate algorithm for gapped alignment of nucleotide sequences. It combines two algorithms, one that works for short query sequences (<200bp), and one that allows longer sequences (up to 100kbp), and is widely used for mapping data to be used for variant calling.
  • Mosaik - In addition to several other functions, Mosaik can align reads of a wide range of lengths generated by a number of platforms. It uses the Smith-Waterman algorithm to produce gapped alignments.
  • SOAPaligner - This is a fast Burrows-Wheeler based algorithm for mapping reads generated by the Illumina platforms. It is part of the SOAP package that performs other operations, including assembly, SNP, indel and structural variation calling.
  • BowTie - One of the fastest algorithms, it can quickly index and map reads to a large reference genome. Although originally designed to map very short reads (<50bp) with high speed, this is often the method of choice when quantifying read counts from RNA-seq or ChIP-seq data, but the slightly lower mapping accuracy of the default mode makes it less appropriate for variant calling. The latest versions can support longer reads (up to 1000bp) and allow tuning the trade-off between speed and accuracy.
  • GEM - The Genome Multitool mapper is a fast and highly accurate short read aligner which adopts a filtration-based approach to approximate string matching. The program has been implemented in the GEM library which facilitates constructing integrated workflows and pipelines for different purposes, including DNA/RNA reads mapping, SNP calling, microRNA analysis and ChIP-seq data analysis.

Assembly

When a reference genome is not available, or when it is desired to analyse the data in a reference-free fashion, primary sequencing reads must be assembled de novo. The complexity of assembly scales with the number of DNA fragments and genome size of the species. Most short-read assemblers are memory hungry and result in fragmentary sequence contigs, breaking chromosomal sequences at repetitive regions. Currently de novo assembly can yield high coverage for small eukaryotic genomes, but the assembly of mammalian genomes can be cumbersome. For newly sequenced species, extra steps are also required to enhance the assembly. For example, paired-end sequence data from different size insert libraries can be used to increase scaffold size. To assign scaffolds to chromosomes requires anchoring to known markers on a genetic map of the organism.

  • Velvet - Perhaps the most widely used algorithm for assembly of bacterial-sized genomes, it uses a de Bruijin graph model to construct the contigs. The memory requirement of this program can be prohibitive for larger genomes.
  • SOAPdenovo - Developed by the Beijing Genomics Institute, this scales to vertebrate-sized genomes of several Gbp on a large memory computer and has been used to assemble reference genomes for multiple individuals and species.
  • ALLPATHS-LG - Another proven assembler for vertebrate scale genomes developed by the Broad Institute, which delivers good de novo assemblies on large memory machines, but is designed for a particular recipe of sequencing libraries.
  • ABySS - A short-read assembler for vertebrate-sized genomes, albeit with fragmentary results. It utilises a distributed memory approach to minimize its memory footprint.
  • SGA - String Graph Assembler uses the concept of string graphs for memory efficiency. It can be used on large genomes within 100GB memory, but requires a minimum depth of coverage.

Variant calling

Variant calling can be done per sample or combined across samples. Particularly the latter is relevant for trios and for low coverage population sequencing. Variants are called usually using a Bayesian approach that assigns posterior probabilities to possible genotypes. Variant callers are distinct in the way that they utilize mapping and depth-of-coverage information in order to call a position as homozygous or heterozygous. There is also the possibility of sequencing errors or wrong mapping that can affect the false discovery rate. Small indel calling can be challenging. As the alignment algorithms map each DNA fragment independently to the reference genome, this can introduce mismatches on reads supporting indels, as mismatches are preferable to gaps. New methods re-align the reads near gaps jointly to determine whether an indel event is a likely alternative to multiple mismatches. In principle, this improves indel calling and reduces the rate of false positive single nucleotide calls near these loci.

  • SAMtools - In addition to its utility in manipulating BAM files, the SAMtools mpileup commands can be used to calculate genotype likelihoods in single samples or across multiple samples. It integrates with bcftools that does the actual consensus/variant calling.
  • GATK Unified Genotyper - The Genome Analysis Toolkit (GATK) provides a number of utilities. In addition to base quality recalibration and local re-alignment, the GATK performs fragment-based calling using a two-stage inference approach. It also implements a training algorithm for variant quality score recalibration (VQSR) in order to distinguish a true SNP from sequencing artefacts.
  • Dindel - Used for calling indels for the 1000 Genomes Project, it implements a probabilistic realignment model that reconstructs a set of local candidate haplotypes for each candidate indel region. These haplotypes are then used to assign posterior probabilities to each indel event and nearby SNPs.
  • pibase - A new tool for validating novel Single Nucleotide Variants (SNVs). It utilizes the details in the alignment files in order to identify the reproducible genotypes, hence distinguishing the true variants from artifacts. It can also be used to compare the nucleotide signals in different BAM files.

Imputation

Genotype calling from low coverage data may require an extra step of imputation. This both fills the gaps that exist due to lack of coverage, and results in more accurate genotypes. Several imputation methods exist for inferring the missing genotype in outbred or related populations. A few algorithms attempt to identify the candidate haplotypes for the study population. Once the data is phased (haplotypes known), missing genotypes can be inferred from the haplotype sharing patterns among the individuals or from external reference panels. The release of reference panels for HapMap2 & 3 and the 1000 Genomes Project has meant that many, with access to GWAS data, will attempt to extend the association studies to imputed data before whole genome sequencing becomes widely accessible. For imputing related individuals, IBD-based phasing can be combined with haplotype frequency models to increase the imputation power.

  • IMPUTE2 - A flexible and computationally feasible algorithm for imputing from large reference panels, based on a haplotype-switching hidden Markov model. It can integrate information across multiple reference panels containing overlapping or non-overlapping variants. The new version can impute X chromosome data along with indels and structural variants.
  • MACH - Another hidden Markov model based imputation package comparable to IMPUTE2.
  • BEAGLE - A powerful software that is often faster than IMPUTE2 and MACH, with perhaps slightly lower accuracy. It can handle unrelated individuals and parent-offspring data, and can detect regions which are identical-by-descent or homozygous-by-decent and can perform haplotypic association analysis.

Structural variation

Structural variations (SV) are more difficult to detect than single nucleotide variants or short indels. They include a range of long variants, including copy number variants (CNV), large insertions-deletions and inversions or translocations. Traditionally, structural variants were identified by arrayCGH techniques and verified by capillary sequencing. The emergence of next generation sequencing led to the design of algorithms for SV detection using short-read data. Available SV callers mostly take advantage of the features of paired-end sequencing, e.g. CNVs can be identified in regions with higher than expected depth-of-coverage. Likewise, if mate pairs span a shorter or longer than expected distance, or they are mapped to the reference with a change of orientation, this could be respectively attributed to insertion, deletion or inversions in the donor genome. Given these, there are also confounding factors such as GC-content biases or mapping biases in the repeat regions that can complicate the structural variant discovery. The actual coordinates of the structural variants are hard to detect, and our best bet is currently re-sequencing or de novo assembly.

  • BreakDancer - It can detect various types of structural variants from short paired-end sequence data. It can integrate the data across a number of samples to enhance the detection of common variants.
  • Pindel - It uses a pattern growth approach to detect the precise break points within reads spanning medium-sized insertions (up to 50bp or so) or deletions (up to 10kbp) using the read-spanning and mate pair mapping information.
  • Genome STRiP - This tool is designed for discovery and genotyping of structural variation using date from multiple individuals. The cross-sampling approach aims to overcome the biases that arise when a reference is required for comparison, as structural variants are under-represented in reference data.
  • PeSVFisher - A software for accurate detection of breakpoints for copy number variants as well as other structural variants including inversions, duplications and translocations. The positions of the breakpoints are informative about co-localization of structural changes in the genome.

RNAseq

RNA-seq is used to identify transcript structures and quantify transcript expression levels. Transcripts can be of different types, including large or small RNAs, splicing isoforms and fusion transcripts. The assembly of the transcriptome from short-read data is challenging, due to variable sequencing depth in different transcripts, the strand-specific nature of RNA-seq data, and the alternative splicing which can be difficult to resolve. Transcriptome assembly may require reference-based alignment, de novo assembly or a combination of both. Alignment can be used when a high quality reference is available, however it cannot be used to assemble alternatively spliced variants, trans-spliced genes or novel unannotated transcripts. In these cases, de novo approach can be integrated to enhance the assembly.

  • TopHat - Built on the Bowtie aligner, it is used to map the RNA-seq data to a reference. It first identifies the potential exons and then creates a library of possible splice junctions.
  • Cufflinks - Used to assemble transcripts and estimate their abundance. It can handle stranded reads and different insert sizes. It also performs differential expression analysis.
  • SeqBuster - A toolkit designed to process datasets enriched for miRNAs or non-miRNA small RNAs. It integrates several types of analysis, including deep characterization of miRNA variability and differential expression analyses between different samples.

ChIPseq

ChIP-seq is used to determine the interaction between DNA and chromatin-associated proteins. It can identify the sequences where transcription factors or proteins bind to the DNA and can define the 'chromatin states' from combinations of modified histones, hence affecting the regulation of gene expression.

Raw reads derived from ChIP-seq are commonly aligned using the fast and memory-efficient short read aligner Bowtie and systematic bias can be corrected by using BEADS prior to downstream analysis. In post-mapping analysis workflow, genomic regions which are significantly enriched in aligned tags can be determined by the use of peak calling algorithms such as MACS. The defined peak intervals can be further characterized relative to genome features e.g chromosomes, promoters or exon, using CEAS or GREAT. DNA sequence motifs can be identified using MEME or Peak-motifs. It is worth noting that the analysis through MEME is limited to 600 peaks, while Peak-motifs unlimited.

Many bioinformatics tools are readily available from the public domain , i.e. Galaxy. However, a local Galaxy server or a Galaxy instance on the Amazon cloud be set up to meet particular needs. A workflow for basic ChIP-seq analysis using Galaxy is provided here. Powerful solutions for post-mapping analyses by diverse high-dimensional methods can also be implemented in the statistical R programming language and gathered through the software package Bioconductor.

  • Bowtie - Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
  • BEADS - BEADS is a normalization scheme that corrects nucleotide composition bias, mappability variations and differential local DNA structural effects in deep sequencing data.
  • MACS - MACS analyzes data generated by short read sequencers. MACS empirically models the shift size of ChIP-Seq tags, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome, allowing for more robust predictions.
  • CEAS - CEAS (Cis-regulatory Element Annotation System) is designed to characterize genome-wide protein-DNA interaction patterns from ChIP-chip and ChIP-Seq of both sharp and broad binding factors. It provides statistics on ChIP enrichment at important genome features such as specific chromosome, promoters, gene bodies, or exons, and infers genes most likely to be regulated by a binding factor.
  • GREAT - GREAT assigns biological meaning to a set of non-coding genomic regions by analyzing the annotations of the nearby genes. Thus, it is particularly useful in studying cis functions of sets of non-coding genomic regions. Cis-regulatory regions can be identified via both experimental methods (e.g. ChIP-seq) and by computational methods (e.g. comparative genomics).
  • MEME - The MEME Suite web server provides a unified portal for online discovery and analysis of sequence motifs representing features such as DNA binding sites and protein interaction domains.
  • Peak-motifs - RSAT peak-motifs performs a complete motif analysis and offers a user-friendly web interface without any restriction on sequence size or number of peaks.
  • Galaxy - Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.
  • Bioconductor - Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data. Bioconductor uses the R statistical programming language, and is open source and open development.

Related links

* quick link - http://q.sanger.ac.uk/esgi