Artemis: visualising, analysing and browsing next generation sequence data
Example 1: Plasmodium falciparum 3D7
A region of chromosome 1 of Plasmodium falciparum 3D7, displaying transcriptome data for expression levels at two time points, 8 hours and 24 hours. The sequence, annotation and BAM alignment files can be launched in Artemis (please see our GitHub page for download and installation instructions).
When this example has launched, right click on the BAM view and open the coverage plot from the 'Graph' option in the pop-up menu. This shows the different levels of expression at 8 hours (red) and 24 hours (green). As well as showing the differences in expression over time, this example also confirms the annotation of the exon boundaries.
New insights into the blood-stage transcriptome of Plasmodium falciparum using RNA-Seq.
Otto TD, Wilinski D, Assefa S, Keane TM, Sarry LR, Böhme U, Lemieux J, Barrell B, Pain A, Berriman M, Newbold C and Llinás M
Example 2: Chlamydia trachomatis
The Chlamydia trachomatis genome and plasmid are concatenated together in this example. It can be seen from the reads mapped to the plasmid sequence that there is a deletion of ca. 370 bp in the variant of this organism. The deletion removes the primer binding site used in the standard diagnostic test. This led to a significant increase in cases of what is now called the new variant C. trachomatis in Sweden, where it is found.
Changing the view to plot the reads by their inferred size shows the characteristic increase in this value for reads that span the deleted region in the variant. This view is illustrated in the image below.
Co-evolution of genomes and plasmids within Chlamydia trachomatis and the emergence in Sweden of a new variant strain.
Seth-Smith HM, Harris SR, Persson K, Marsh P, Barron A, Bignell A, Bjartling C, Clark L, Cutcliffe LT, Lambden PR, Lennard N, Lockey SJ, Quail MA, Salim O, Skilton RJ, Wang Y, Holland MJ, Parkhill J, Thomson NR and Clarke IN
Example 3: Mycobacterium tuberculosis
Mycobacterium tuberculosis H37Rv genome example with Illumina re-sequencing data. Right clicking on the BAM view gives an option to 'Show' -> 'SNP marks', and vertical red lines on the reads indicate differences from the reference. Additionally, a SNP density plot can be shown ('Graph' -> 'SNP'). Sequencing errors produce randomly distributed lines, whereas true SNPs can be seen where lines align vertically.
Example 4: Streptococcus pneumoniae
A region of Streptococcus pneumoniae antibiotic resistant pandemic clone PMEN1. This shows the variation data for 11 strains of this clone compared to the reference. The pspA gene in the reference is a pseudogene as a result of a frame shift. The magenta insertions in most strains show where this frame shift has been caused by a deletion in the reference strain (and one of the other strains). There are also some other points of interest: the gene has independently become a pseudogene in a second strain, because of a SNP leading to a premature stop codon, and a third strain has a recombination that spans this whole region (shown by the increased SNP density). For single base changes the colour represents the base it is being changed to, i.e. T black, G blue, A green, C red. Right clicking on the panel that is displaying the variation data provides a number of options, including filtering and different colour schemes.
Example 5: Plasmodium berghei
An example in Plasmodium berghei of a repeat that was collapsed during the assembly. Clone the BAM panel and zoom out. Right click on the top BAM panel and change the view to show the coverage (Views -> Coverage). In the bottom BAM panel, right click and use the 'Filter' option to hide proper pairs. There are three ways of confirming this in Artemis:
the number of reads that are not proper pairs increases at the break point of the duplication
Example 6: Salmonella bongori
An example in Salmonella bongori of directional transcriptome sequencing using Illumina technology. When launched change the view to 'Strand Stack' (right-click: Views -> Strand Stack). For this protocol, the reads indicating the direction are on the opposite strand.
A simple method for directional transcriptome sequencing using Illumina technology.
Croucher NJ, Fookes MC, Perkins TT, Turner DJ, Marguerat SB, Keane T, Quail MA, He M, Assefa S, Bähler J, Kingsley RA, Parkhill J, Bentley SD, Dougan G and Thomson NR
Example 7: Plasmodium berghei
An example in Plasmodium berghei of a misassembly shown in ACT (included in the Artemis tool set). A BAM panel is loaded for each assembly to identify a misassembly in the top sequence in the region 1060000-1080000 bp. The bottom sequence is the de novo assembly. To speed up the display of the BAM files (over HTTP) for the ACT launch below, the Illumina BAM files contain just 10% of the reads. The view can be cloned to show a coverage plot of mapped Illumina reads and the insert size view of mapped 454 reads. In the corrected genome the coverage is more even and does not drop except in the gap regions.
Example 8: Plasmodium falciparum
Genetic crosses can be studied using ACT. In this example reads from a genetic cross are mapped against the Pf3D7 and HB3 Plasmodium falciparum chromosome 14 sequences. This shows the variations found when compared to these two sequences. This can be used to identify crossovers in the progeny.
Example 9: Mouse Genomes
Artemis showing data from the Mouse Genomes Project. The reference genome is loaded from an indexed FASTA file. Read alignments of 3 strains are shown in the 'paired stack' view, with read pairs joined by a grey line. The reads are colour coded by strain, with the coverage plot for each underneath. These reveal a deletion in chromosome 19, as seen from the coverage of 129S1/SvImJ (red), that is not present in the other two strains, A/J (green) and C57BL/6NJ (blue).
The read alignment files (in BAM format) can be downloaded along with their index files (*.bai). These files are large (40-140 GB), which can make downloading them locally prohibitive. An alternative is to download only the region or chromosome of interest, which can be done using samtools. For example, the script below can be used to download the read alignments for the 17 strains used in the Mouse Genomes Project containing just the region in chromosome 19 described above. It creates the index file for each downloaded BAM and creates a file 'artemis_bam.list' that can be used to load the read alignments into Artemis.
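The script itself is not preserved in this text. As a rough sketch of the approach described above, and not the project's own script, the shell function below uses samtools to pull one region from remote, indexed BAM files, index each local copy and write their paths to a list file. The function name fetch_region_bams, the base URL and the strain file names are hypothetical placeholders, and the region string assumes the chromosome is named 19 in the BAM headers.

```shell
# Sketch: fetch one region from remote, indexed BAM files and list them for Artemis.
# Assumption: each remote file is <base_url>/<strain>.bam with a .bai index beside it.
fetch_region_bams() {
    local base_url="$1" region="$2" list_file="$3"
    shift 3
    local strain out
    : > "$list_file"                      # start with an empty list file
    for strain in "$@"; do
        out="${strain}.region.bam"
        # samtools uses the remote index to retrieve only the requested region
        samtools view -b -o "$out" "${base_url}/${strain}.bam" "$region" || return 1
        # index the local BAM so that Artemis can read it
        samtools index "$out" || return 1
        # one BAM path per line
        printf '%s/%s\n' "$PWD" "$out" >> "$list_file"
    done
}

# Example call (placeholder URL and strain names):
# fetch_region_bams "https://example.org/bams" "19:57270000-57300000" artemis_bam.list strainA strainB
```

Strain names that contain a slash, such as 129S1/SvImJ, would need a slash-free file name before being used as an output prefix.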
Launch Artemis. When Artemis has been launched open the sequence file NCBIM37_um.fa. From the drop down menu in the entry panel at the top of Artemis, change the view to chromosome 19.
Read the sorted and indexed annotation file, Mus_musculus.NCBIM37.64.sorted.gtf.gz, into Artemis (from the 'File' menu select the 'Read An Entry' option).
Read the file containing the paths to the BAM files (artemis_bam.list) into Artemis ('File' → 'Read BAM /VCF...').
You will then get a message to indicate that Artemis is displaying reads for Chromosome 19 (rather than the default chromosome 1).
Navigate to the region of the BAM that was downloaded (57270000-57300000). This can be done using the Navigator (from the menus 'Goto' → 'Navigator...').
Mouse genomic variation and its effect on phenotypes and gene regulation.
Keane TM, Goodstadt L, Danecek P, White MA, Wong K, Yalcin B, Heger A, Agam A, Slater G, Goodson M, Furlotte NA, Eskin E, Nellåker C, Whitley H, Cleak J, Janowitz D, Hernandez-Pliego P, Edwards A, Belgard TG, Oliver PL, McIntyre RE, Bhomra A, Nicod J, Gan X, Yuan W, van der Weyden L, Steward CA, Bala S, Stalker J, Mott R, Durbin R, Jackson IJ, Czechanski A, Guerra-Assunção JA, Donahue LR, Reinholdt LG, Payseur BA, Ponting CP, Birney E, Flint J and Adams DJ
Download the annotation. Unzip this annotation file. This file needs to be indexed for Artemis to be able to read it. The reference names need to match the names in the sequence file (human_g1k_v37.fasta). This can be achieved by stripping out the 'chr' string from the first column of the gtf file (e.g. chr1 becomes 1) using this perl command:
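The Perl command is not reproduced in this text. As a stand-in, here is a minimal sketch that does the same renaming with sed and then sorts the file into the order tabix expects. The function names and the file names in the usage comments are hypothetical, and the bgzip and tabix steps are shown only as comments.

```shell
# Strip a leading "chr" from the first column of each GTF line (chr1 -> 1).
strip_chr_prefix() {
    sed 's/^chr//'
}

# Print header lines ("#...") first, then the records sorted by sequence name
# and numerically by start position (column 4).
sort_gtf() {
    local gtf="$1"
    local tab
    tab="$(printf '\t')"
    grep '^#' "$gtf" || true              # a file without header lines is fine
    grep -v '^#' "$gtf" | sort -t "$tab" -k1,1 -k4,4n
}

# Possible usage (placeholder file names):
# strip_chr_prefix < annotation.gtf > annotation.nochr.gtf
# sort_gtf annotation.nochr.gtf | bgzip > annotation.sorted.gtf.gz
# tabix -p gff annotation.sorted.gtf.gz
```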
Launch Artemis. When Artemis has been launched open the sequence file human_g1k_v37.fasta. From the drop down menu in the entry panel at the top of Artemis, change the view to chromosome 20.
Read the annotation file, gencode_v4.annotation.GRCh37.sorted.gtf.gz, into Artemis (from the 'File' menu select the 'Read An Entry' option).
Read the BAM files into Artemis ('File' → 'Read BAM /VCF...').
You will then get a message to indicate that Artemis is displaying reads for Chromosome 20 (rather than the default chromosome 1).
To differentiate reads from each BAM right click in the BAM view and select 'Colour By' → 'Coverage Plot Colours'.
Note: A minimum Java version of 1.8 is required to load the next-generation file formats.
Artemis and ACT can be downloaded and run as standalone tools and used to browse indexed BAM, VCF (Variant Call Format) and BCF (Binary VCF) files on a user's local machine. Please see our GitHub page for additional information, download and installation instructions.
Artemis and ACT read BAM files that have been sorted and indexed using SAMtools. This provides an integrated BamView panel displaying sequence alignment mappings to a reference sequence (see Example 1 above).
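As an illustration of the sorting and indexing step, the sketch below wraps the two samtools calls in a small helper. The function name prepare_bam and the file names in the usage comment are placeholders, and a samtools version that accepts 'sort -o' is assumed.

```shell
# Sketch: coordinate-sort a BAM file and index the result for Artemis/ACT.
prepare_bam() {
    local in_bam="$1" out_bam="$2"
    samtools sort -o "$out_bam" "$in_bam" || return 1   # sort by reference position
    samtools index "$out_bam"                           # writes <out_bam>.bai
}

# Possible usage (placeholder file names):
# prepare_bam aligned.bam aligned.sorted.bam
```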
Variant Call Format (VCF) files can also be read (see Example 3 above). The VCF files need to be compressed and indexed using bgzip and tabix respectively (see the tabix manual and download page). The compressed file gets read in (e.g. file.vcf.gz) and below are the commands for generating this from a VCF file:
bgzip file.vcf
tabix -p vcf file.vcf.gz
Alternatively a BCF file can be indexed with BCFtools and read into Artemis or ACT.
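A minimal sketch of the BCF route, assuming BCFtools is installed: the first helper converts a compressed VCF to BCF and the second indexes a BCF file. The function and file names are placeholders. Which index types a given Artemis release accepts is not stated here, so treat the default 'bcftools index' output as an assumption to check.

```shell
# Sketch: convert a bgzip-compressed VCF to BCF (optional step).
vcf_to_bcf() {
    local vcf_gz="$1" bcf="$2"
    bcftools view -O b -o "$bcf" "$vcf_gz"   # -O b selects compressed BCF output
}

# Sketch: index a BCF file; the index is written alongside the BCF.
index_bcf() {
    local bcf="$1"
    bcftools index "$bcf"
}

# Possible usage (placeholder file names):
# vcf_to_bcf file.vcf.gz file.bcf && index_bcf file.bcf
```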
As with reading in multiple BAM files, it is possible to read a number of (compressed and indexed) VCF files by listing their full paths in a single file. They then get displayed in separate rows in the VCF panel.
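One way to build such a list file is sketched below: it writes the full path of every compressed VCF in a directory that has a tabix index beside it, one per line. The function name write_vcf_list, the directory and the list file name are hypothetical, and the check assumes the index is named file.vcf.gz.tbi.

```shell
# Sketch: list the full paths of compressed, tabix-indexed VCF files.
write_vcf_list() {
    local vcf_dir="$1" list_file="$2"
    local abs_dir vcf
    abs_dir="$(cd "$vcf_dir" && pwd)" || return 1   # resolve to an absolute path
    : > "$list_file"
    for vcf in "$abs_dir"/*.vcf.gz; do
        [ -e "$vcf" ] || continue                   # no matching files at all
        if [ -e "$vcf.tbi" ]; then
            printf '%s\n' "$vcf" >> "$list_file"
        else
            echo "skipping $vcf: no tabix index found" >&2
        fi
    done
}

# Possible usage (placeholder names):
# write_vcf_list ./vcfs artemis_vcf.list
```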
A free genome browser and annotation tool that allows visualisation of sequence features, next generation data and the results of analyses within the context of the sequence, and also its six-frame translation.
A Java application for displaying pairwise comparisons between two or more DNA sequences. It can be used to identify and analyse regions of similarity and difference between genomes and to explore conservation of synteny, in the context of the entire sequences and their annotation.