|AutoCSA Algorithm Description|
AutoCSA Algorithm Description
a) Individual trace analysis
The first step in the analysis is to identify the position (scan index) and height (intensity) of the peaks in each channel of the trace file. Each peak must be a local maximum in its respective channel, i.e. a positive gradient on the left and negative gradient on the right. Two or more adjacent peaks in the same channel that fail to separate are detected by testing for an inflexion like pattern in the peak downslope gradient, quantified by a gradient change ratio of > 2.5 between two successive scans.
AutoCSA uses the known amplimer DNA sequence to match nucleotides with peaks in the sequence trace. When the first target peak is aligned to the amplimer with a match the analysis moves forward a prescribed number of scan indices, identifies the next peak and compares this to the known sequence. If there is a match the process iterates through the peaks/DNA sequence to generate a minimum match of 15 consecutive bases to guarantee the true alignment of trace and amplimer sequence. The matching process continues until the beginning of the reverse PCR primer sequence. However, the matching procedure can break down, for example in the presence of i) a Homozygous Substitution, ii) a Homozygous Deletion, or iii) local poor quality trace. In this event the relevant amplimer base is termed a "trace hole" and the matching procedure recommences at the next base. A quality value is assigned to each base with a matched peak and defined as a signal (matched peak) to noise (unmatched peak) ratio of intensities.
b) Comparative analysis
The sequence trace that is being screened for mutations (test trace) is compared to a reference sequence trace to detect heterozygous variants. The goal here is to detect any decrease in peak intensity of test trace nucleotides compared to the reference trace, i.e. calculate peak height drops. The best quality reference sequence trace is selected automatically from a panel of sequences, by scoring each trace based on quality and coverage.
Since peak intensities vary on a trace by trace basis the intensities of the peaks must be normalised before trace comparison. In fact it is the intensity ratios of test to reference trace that are normalised. Peak intensity ratios at each nucleotide of the compared traces are normalised over the region of coverage common to both traces. Normalisation is performed trace wise rather than channel wise and yields an intensity ratio free of any trace and channel scaling effects. In order to perform the trace-wise normalisation it is necessary to account for the different scaling of each channel. For this purpose channel correction factors are calculated, firstly by assigning one channel as a reference, then finding the average intensity ratio (test to reference over all peaks) of each channel and dividing these by the same quantity for the reference channel.
The steps in the normalisation procedure can be summarised as follows;
This ratio (Dropj) quantifies the test trace nucleotide peak height drop which is the main signature used to identify heterozygous bases.
AutoCSA uses distinct algorithms to detect the different classes of variants, heterozygous/homozygous substitutions and indels. Using this approach has enabled sensitivity/specificity to be maximised for each variant type.
a) Heterozygous Substitutions
Each base in the test trace is analysed in a serial fashion for a heterozygous variant. The discriminant for indicating the presence of a single base heterozygous substitution is a peak height drop between the base of the trace under investigation and reference sequence and the presence of an additional mutant peak. For the analysis of somatic mutations in primary tumours and cancer cell lines the critical peak height drop is set to 20%. This is a crucial parameter which can be varied depending on the sensitivity required, for detecting SNPS and mutations in haploid DNA the drop can be increased. If the critical peak height drop test has been satisfied, the software searches for heterozygous novel peaks.
Any novel peak is then tested for it's viability, i.e. does it have an intensity suitable to be considered a heterozygous peak. The peak's intensity is compared against a median intensity of the 2 matched peaks both upstream and downstream in the same channel. The critical factor for this test is 23% (15% for the G channel to capture the inherently low intensity heterozygotes associated with this channel). Next, the novel peak in the test sequence is compared to the equivalent signal (same channel) in the reference trace to check the peak is not reproducible noise. This calculation is the reverse of the peak height drop calculation and generates a peak height increase metric of the novel peak. To detect heterozygotes with subtle novel peaks this metric must be > 100%. Further this metric is used to identify the putative variant by selecting the peak with the highest score.
In addition, an optional final search for more subtle heterozygous substitutions can be performed for peak height drops in the range 14-20%, here we restrict calls to regions of good quality data.
b) Homozygous Substitutions
Homozygous substitutions can be identified from individual "trace holes" resulting from the amplimer matching procedure. Each trace hole is interrogated for the presence of a novel peak (other than that indicated by the amplimer). If more than one novel peak is present the one with the highest intensity is annotated as the variant base. Next a viability test is performed on any novel peak, which is the same as the viability test described for heterozygous substitutions, but in this case a critical ratio of 35% is used.
c) Homozygous Deletions
Homozygous deletions can be identified from any contiguous set of "trace holes" resulting from the amplimer matching procedure. The scan index gap between the adjacent bases either side of the hole is calculated. This index gap is compared to the local upstream base spacing and if equivalent a putative variant is assigned.
d) Homozygous Insertions
Homozygous insertions can be identified by interrogating the spacings between neighbouring nucleotides. A scan index gap is calculated between neighbouring bases that have been aligned to the amplimer sequence for the trace under investigation. If there is a homozygous insertion there will be a larger than expected scan gap. If this large scan gap is greater than a 1.75 multiple of the local upstream base spacing the data is interrogated for viable peaks that could correspond to a novel allele. Essentially the novel sequence in the scan gap is basecalled. The novel peaks are compared to matched neighbours in the same channel to ascertain if they are true peaks, again using a viability test as used for Homozygous Substitutions, with a critical ratio of 35%. If the number of true novel peaks correlates to the length of the scan gap a variant is assigned.
e) Heterozygous Insertions and Deletions
In order to detect a heterozygous indel the trace being screened is required to have two specific properties. The first of these is a "step" in the quality profile, marked by an abrupt drop in quality. This step is identified by detection of a maximum in the ratio of average qualities calculated over adjacent windows of 30 bases (where the quality score used here is the signal/noise ratio over 5 bases). If a "step" is detected then a "normalisation filter" is also applied in the peak height drop calculation (previously described), this ensures that the change in wildtype peak scaling after the quality step, is taken into account and is an essential aspect of detecting heterozygous indels. The second criteria for detection is a critical concentration (33%) of individual closely spaced (<3 bases) heterozygous substitutions from the quality step to the end of coverage. The position of the quality step and the first heterozygous substitution in the concentrated region are then used to define the start of the variant. Correct identification of the start position is essential for correct annotation but the most important aspect of detecting indels is that some variant call has been made, even if it is not correctly annotated.
The next stage is to annotate the indel, and this is achieved by generating a "mutated sequence" string beginning at the suspected start position. The length and sequence of the variant are identified from the amplimer sequence (for deletions) or mutated sequence (for insertions). A "speculative indel" is assigned if the type or length information is undefinable.
Flagging rules are applied in the order given, proceeding to the next variant as soon as a PASS or FAIL result is reached.
b) Heterozygous Insertions and Deletions
c) Homozygous Insertions and Deletions