Reassembling the Illumina data into full-genome data for each sample is a complex challenge, and we have invested considerable effort in developing a pipeline to perform this task. First, Illumina reads from our samples are mapped against the 3D7 Plasmodium falciparum reference genome (version 2.1.5) using the maq program (Li, 2008). This process identifies over one million variable positions on the reference genome. This list of SNPs forms the full catalogue of variable positions used for genotyping.
Genotyping is performed by the SNP'o'matic program (Manske, 2009), which re-screens the raw Illumina short reads, mapping them
against the reference genome according to stringent sequence matching criteria. Following this mapping, genotypes are determined at
each position for each sample: to be called, an individual allele's coverage must exceed a minimum of 5 reads, or 20% of the sample's
total coverage, at that position. If no allele at a position fulfils this criterion, the sample's genotype is treated as missing.
Under these criteria, many genotypes are called as heterozygous (in other words, we call multiple alleles at that position).
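The calling rule described above can be sketched in a few lines of Python. This is only an illustration, not the SNP'o'matic implementation: the function name call_genotype, its parameters, and the input format (a dictionary of per-allele read counts) are all hypothetical. The text does not say exactly how the two thresholds combine or whether they are inclusive, so the sketch assumes inclusive thresholds and, by default, that an allele must satisfy both; the require_both flag switches to the either-threshold reading.

```python
from typing import Dict, List


def call_genotype(
    allele_counts: Dict[str, int],
    min_reads: int = 5,          # read-count threshold from the text
    min_fraction: float = 0.20,  # fraction of the sample's total coverage
    require_both: bool = True,   # ASSUMPTION: how the two thresholds combine
) -> List[str]:
    """Return the alleles called at one position for one sample.

    An empty list means the genotype is missing; more than one allele
    corresponds to a heterozygous (multiple-allele) call.
    """
    total = sum(allele_counts.values())
    if total == 0:
        return []  # no coverage at this position: genotype is missing

    called = []
    for allele, count in sorted(allele_counts.items()):
        passes_reads = count >= min_reads
        passes_fraction = count / total >= min_fraction
        if require_both:
            passed = passes_reads and passes_fraction
        else:
            passed = passes_reads or passes_fraction
        if passed:
            called.append(allele)
    return called


if __name__ == "__main__":
    print(call_genotype({"A": 30, "T": 10}))  # two alleles pass both thresholds
    print(call_genotype({"A": 40, "T": 3}))   # T falls below both thresholds
    print(call_genotype({"A": 2, "T": 1}))    # too few reads overall
```

Under the default reading, a position covered by only a handful of reads yields a missing genotype, while a well-covered position with a substantial minority allele yields a multiple-allele call.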
Our SNP'o'matic mapping process attempts to minimize mapping ambiguity, in many cases discarding reads with low mapping confidence. Even then, our coverage of the genome is uneven, and there are regions where assembly mismatches are more likely to occur. Such mismatches may produce insufficient coverage at certain positions (which may then not be called for a given sample), and incorrect multiple-allele calls at others. Although we try to minimize these occurrences, you should be aware of these limitations when you inspect genotyping data. Generally, coding regions are more GC-rich than non-coding regions, and therefore tend to have deeper and more even coverage.
MapSeq provides tools that allow you to assess the credibility of a call. When you view genotyping data, you can click on any call
to display the number of reads that support it. MapSeq also allows you to invoke LookSeq to view the 'pileup data', which shows
how reads are assembled at any given position.
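To make the idea of read support at a position concrete, the following Python sketch tallies the bases that aligned reads place at one reference position. It does not use MapSeq or LookSeq, whose interfaces are not described here: the function pileup_counts and the read representation (a 0-based start coordinate plus an ungapped read sequence) are hypothetical simplifications.

```python
from collections import Counter
from typing import Iterable, Tuple


def pileup_counts(reads: Iterable[Tuple[int, str]], position: int) -> Counter:
    """Count the bases observed at `position` across aligned reads.

    ASSUMPTION: each read is (start, sequence), where start is the 0-based
    reference coordinate of the read's first base and the alignment has
    no gaps.
    """
    counts = Counter()
    for start, sequence in reads:
        offset = position - start
        if 0 <= offset < len(sequence):  # the read covers this position
            counts[sequence[offset]] += 1
    return counts


if __name__ == "__main__":
    reads = [(0, "ACGT"), (2, "GTTA"), (3, "AAC")]
    print(pileup_counts(reads, 3))
```

The resulting per-allele counts are the kind of figure shown when you click on a call: the more reads that agree on an allele, the more credible the call.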
It is our plan to detect and describe all variations in the P. falciparum genome,
and to provide tools that will facilitate their analysis. However, although we have made much
progress towards characterizing these variations, we are not yet able to do so with the reliability and
confidence with which we genotype SNPs. This will change over time, and our data will be re-analyzed as our
methods are refined.