
Next Generation Sequencing

Where does the MapSeq/pf data come from?
MapSeq/pf presents data from a collaborative project, which brings together sequencing data from samples submitted by investigators in many countries around the world. Collaborators in this project submit patient blood samples, from which parasite DNA is extracted and sequenced at the Wellcome Trust Sanger Institute (WTSI) using 'next-generation' sequencing technologies by Illumina/Solexa. Our sequencing pipeline analyzes the vast quantities of sequencing data generated from the samples, identifying variable positions across the whole genome, and genotyping every sample at those positions. The resulting genotyping data is the data you can browse and analyze when you use MapSeq. The following paragraphs describe in more detail the sequencing, assembly and genotyping process.

From the arm to the browser: the MapSeq/pf sequencing pipeline
We receive isolates submitted by partners around the world. Since Illumina sequencing requires relatively small quantities of DNA, we are able to process both cultured samples and 'from the arm' clinical samples that have not been cultured. When clinical samples are collected, the blood is filtered to isolate the red blood cells and thus remove human DNA. Subsequent DNA extraction is followed by Illumina sequencing at WTSI, which produces millions of short nucleotide sequences (reads) for each sample. Reads are paired, in that we sequence both ends of short genomic fragments, which makes the data somewhat easier to reassemble. The data currently stored in MapSeq/pf was generated with a variety of read lengths: 36, 54 and 76 base pairs.

Reassembling the Illumina data into full-genome data for each of the samples is a complex challenge, and we have put considerable effort into developing a sophisticated pipeline to perform this task. To begin with, Illumina reads from our samples are mapped against the 3D7 Plasmodium falciparum reference genome (version 2.1.5) using the maq program (Li, 2008). This process identifies over one million variable positions, each located on the reference genome. This list of SNPs is the full catalogue of variable positions used for genotyping.

Genotyping is performed by the SNP'o'matic program (Manske, 2009), which re-screens the raw Illumina short reads, mapping them against the reference genome according to stringent sequence matching criteria. Following this mapping, genotypes are determined at each position for each sample: to be called, an individual allele's coverage must exceed a minimum of 5 reads, or 20% of the sample's total coverage, at that position. If no allele at a position fulfils this criterion, the sample's genotype is treated as missing. Under these criteria, many genotypes are called as heterozygous (in other words, we call multiple alleles at that position).
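To make the calling rule concrete, here is a minimal Python sketch of how per-allele read counts at one position could be turned into a genotype. It is not the SNP'o'matic implementation. The function name call_genotype, its parameters and the input format (a dictionary of allele to read count) are hypothetical. The text does not say exactly how the 5-read and 20% thresholds combine; this sketch assumes by default that an allele must meet both, because otherwise a position covered by a single read would still be called.

```python
MIN_READS = 5        # minimum reads supporting an allele (value from the text)
MIN_FRACTION = 0.20  # minimum share of the sample's coverage (value from the text)


def call_genotype(allele_counts, min_reads=MIN_READS,
                  min_fraction=MIN_FRACTION, require_both=True):
    """Return the sorted alleles called at one position for one sample.

    allele_counts maps an allele (e.g. "A") to its number of supporting reads.
    An empty list means the genotype is missing; more than one allele
    corresponds to a heterozygous (multiple-allele) call.

    ASSUMPTION: with require_both=True an allele must pass both thresholds;
    with require_both=False passing either one is enough.
    """
    total = sum(allele_counts.values())
    if total == 0:
        return []  # no coverage at all: genotype is missing

    called = []
    for allele, reads in allele_counts.items():
        enough_reads = reads >= min_reads
        enough_share = reads / total >= min_fraction
        if require_both:
            passes = enough_reads and enough_share
        else:
            passes = enough_reads or enough_share
        if passes:
            called.append(allele)
    return sorted(called)


if __name__ == "__main__":
    print(call_genotype({"A": 40, "T": 12}))  # two alleles pass: mixed call
    print(call_genotype({"A": 50, "G": 6}))   # G falls below the 20% share
    print(call_genotype({"A": 3}))            # too few reads: missing
```

Under the default assumption, a low-coverage position yields a missing genotype, and a minor allele is only reported when it has both enough reads and a sufficient share of the coverage.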

Challenges in the Assembly and Genotyping process
Assembling the P. falciparum genome presents huge challenges. Its AT content is approximately 80%, which means that many regions of the genome have extremely low sequence diversity, with frequent repeats and low-complexity sequences. These factors make the mapping of short reads (such as those generated by Illumina methods) challenging at best, and practically impossible in regions where the high AT content makes extended stretches of DNA non-unique across the genome.
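The effect of AT richness on short-read mapping can be illustrated with a toy Python sketch. It is not part of the MapSeq/pf pipeline, and the function names at_fraction and non_unique_kmer_fraction are hypothetical. The sketch computes the AT fraction of a sequence, and the fraction of read-length windows (k-mers) whose sequence occurs more than once, which is a rough proxy for reads that cannot be placed unambiguously. As a simplification, it considers only exact matches on the forward strand.

```python
from collections import Counter


def at_fraction(seq):
    """Fraction of A/T bases among the A, C, G and T bases in seq."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    acgt = at + seq.count("C") + seq.count("G")
    return at / acgt if acgt else 0.0


def non_unique_kmer_fraction(seq, k):
    """Fraction of length-k windows in seq whose sequence occurs more than once.

    A read of length k drawn from such a window matches several places
    equally well, so it cannot be mapped unambiguously.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    repeated = sum(1 for kmer in kmers if counts[kmer] > 1)
    return repeated / len(kmers)


if __name__ == "__main__":
    repetitive = "ATATATATATATATAT"  # toy low-complexity, AT-only sequence
    mixed = "ACGTTGCAGGCTAACT"       # toy sequence with a mixed composition
    for name, seq in (("repetitive", repetitive), ("mixed", mixed)):
        print(name, at_fraction(seq), non_unique_kmer_fraction(seq, 4))
```

In the toy repetitive sequence every 4-base window is repeated elsewhere, whereas windows in a compositionally mixed sequence are far more likely to be unique. Longer reads and paired ends reduce this ambiguity but do not eliminate it in extended repeats.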

Our SNP'o'matic mapping process attempts to minimize mapping ambiguity, in many cases discarding reads with low mapping confidence. Even then, our coverage of the genome is uneven, and there are regions where assembly mismatches are more likely to occur. Such mismatches may produce insufficient coverage at certain positions (which may then not be called for a given sample), and incorrect multiple-allele calls at others. Although we try to minimize these occurrences, you should be aware of these limitations when you inspect genotyping data. Generally, coding regions are more GC-rich than non-coding regions, and therefore tend to have deeper and more even coverage.

MapSeq provides tools that allow you to assess the credibility of a call. When you view genotyping data, you can click on any call to display the number of reads that support it. MapSeq also allows you to invoke LookSeq to view the 'pileup data', which shows how reads are assembled at any given position.

What about other types of genomic variations?
Currently, when you use MapSeq, you will only see genotypes at SNP positions. Of course, SNPs are only one type of variation you may want to investigate, since P. falciparum genomes frequently present other variations, such as insertions, deletions, copy number variations (CNV), and so on. The detection of these variations presents substantial challenges when assembling genomes from short nucleotide reads.

It is our plan to detect and describe all variations in the P. falciparum genome, and to provide tools that will facilitate their analysis. However, although we have made much progress towards characterizing these variations, we are not yet able to do so with the reliability and confidence with which we genotype SNPs. This will change over time, and our data will be re-analyzed as our methods are refined.
