Contact WTSI Webmaster Printer friendly format Login to WTSI resources WTSI RSS feed
Genomics & Genetics
  • Overview
  • CGP
  • Software
  • Public Projects
  • AutoCSA
  • BioView
  • DbCon
  • Information
  • Links
  • Team
  • News
  • Publications
  • Conditions of use
  • Licence Agreement
Cancer Genome Project Software
AutoCSA Algorithm Description

AutoCSA Algorithm Description


  • Preprocessing
  • Variant Detection
  • Post-processing, Flagging

  • Preprocessing


    a) Individual trace analysis

    The first step in the analysis is to identify the position (scan index) and height (intensity) of the peaks in each channel of the trace file. Each peak must be a local maximum in its respective channel, i.e. a positive gradient on the left and negative gradient on the right. Two or more adjacent peaks in the same channel that fail to separate are detected by testing for an inflexion like pattern in the peak downslope gradient, quantified by a gradient change ratio of > 2.5 between two successive scans.

    AutoCSA uses the known amplimer DNA sequence to match nucleotides with peaks in the sequence trace. When the first target peak is aligned to the amplimer with a match the analysis moves forward a prescribed number of scan indices, identifies the next peak and compares this to the known sequence. If there is a match the process iterates through the peaks/DNA sequence to generate a minimum match of 15 consecutive bases to guarantee the true alignment of trace and amplimer sequence. The matching process continues until the beginning of the reverse PCR primer sequence. However, the matching procedure can break down, for example in the presence of i) a Homozygous Substitution, ii) a Homozygous Deletion, or iii) local poor quality trace. In this event the relevant amplimer base is termed a "trace hole" and the matching procedure recommences at the next base. A quality value is assigned to each base with a matched peak and defined as a signal (matched peak) to noise (unmatched peak) ratio of intensities.


    b) Comparative analysis

    The sequence trace that is being screened for mutations (test trace) is compared to a reference sequence trace to detect heterozygous variants. The goal here is to detect any decrease in peak intensity of test trace nucleotides compared to the reference trace, i.e. calculate peak height drops. The best quality reference sequence trace is selected automatically from a panel of sequences, by scoring each trace based on quality and coverage.

    Since peak intensities vary on a trace by trace basis the intensities of the peaks must be normalised before trace comparison. In fact it is the intensity ratios of test to reference trace that are normalised. Peak intensity ratios at each nucleotide of the compared traces are normalised over the region of coverage common to both traces. Normalisation is performed trace wise rather than channel wise and yields an intensity ratio free of any trace and channel scaling effects. In order to perform the trace-wise normalisation it is necessary to account for the different scaling of each channel. For this purpose channel correction factors are calculated, firstly by assigning one channel as a reference, then finding the average intensity ratio (test to reference over all peaks) of each channel and dividing these by the same quantity for the reference channel.

    The steps in the normalisation procedure can be summarised as follows;

    This ratio (Dropj) quantifies the test trace nucleotide peak height drop which is the main signature used to identify heterozygous bases.


    Variant Detection

    AutoCSA uses distinct algorithms to detect the different classes of variants, heterozygous/homozygous substitutions and indels. Using this approach has enabled sensitivity/specificity to be maximised for each variant type.


    a) Heterozygous Substitutions

    Each base in the test trace is analysed in a serial fashion for a heterozygous variant. The discriminant for indicating the presence of a single base heterozygous substitution is a peak height drop between the base of the trace under investigation and reference sequence and the presence of an additional mutant peak. For the analysis of somatic mutations in primary tumours and cancer cell lines the critical peak height drop is set to 20%. This is a crucial parameter which can be varied depending on the sensitivity required, for detecting SNPS and mutations in haploid DNA the drop can be increased. If the critical peak height drop test has been satisfied, the software searches for heterozygous novel peaks.

    Any novel peak is then tested for it's viability, i.e. does it have an intensity suitable to be considered a heterozygous peak. The peak's intensity is compared against a median intensity of the 2 matched peaks both upstream and downstream in the same channel. The critical factor for this test is 23% (15% for the G channel to capture the inherently low intensity heterozygotes associated with this channel). Next, the novel peak in the test sequence is compared to the equivalent signal (same channel) in the reference trace to check the peak is not reproducible noise. This calculation is the reverse of the peak height drop calculation and generates a peak height increase metric of the novel peak. To detect heterozygotes with subtle novel peaks this metric must be > 100%. Further this metric is used to identify the putative variant by selecting the peak with the highest score.

    In addition, an optional final search for more subtle heterozygous substitutions can be performed for peak height drops in the range 14-20%, here we restrict calls to regions of good quality data.


    b) Homozygous Substitutions

    Homozygous substitutions can be identified from individual "trace holes" resulting from the amplimer matching procedure. Each trace hole is interrogated for the presence of a novel peak (other than that indicated by the amplimer). If more than one novel peak is present the one with the highest intensity is annotated as the variant base. Next a viability test is performed on any novel peak, which is the same as the viability test described for heterozygous substitutions, but in this case a critical ratio of 35% is used.


    c) Homozygous Deletions

    Homozygous deletions can be identified from any contiguous set of "trace holes" resulting from the amplimer matching procedure. The scan index gap between the adjacent bases either side of the hole is calculated. This index gap is compared to the local upstream base spacing and if equivalent a putative variant is assigned.


    d) Homozygous Insertions

    Homozygous insertions can be identified by interrogating the spacings between neighbouring nucleotides. A scan index gap is calculated between neighbouring bases that have been aligned to the amplimer sequence for the trace under investigation. If there is a homozygous insertion there will be a larger than expected scan gap. If this large scan gap is greater than a 1.75 multiple of the local upstream base spacing the data is interrogated for viable peaks that could correspond to a novel allele. Essentially the novel sequence in the scan gap is basecalled. The novel peaks are compared to matched neighbours in the same channel to ascertain if they are true peaks, again using a viability test as used for Homozygous Substitutions, with a critical ratio of 35%. If the number of true novel peaks correlates to the length of the scan gap a variant is assigned.


    e) Heterozygous Insertions and Deletions

    In order to detect a heterozygous indel the trace being screened is required to have two specific properties. The first of these is a "step" in the quality profile, marked by an abrupt drop in quality. This step is identified by detection of a maximum in the ratio of average qualities calculated over adjacent windows of 30 bases (where the quality score used here is the signal/noise ratio over 5 bases). If a "step" is detected then a "normalisation filter" is also applied in the peak height drop calculation (previously described), this ensures that the change in wildtype peak scaling after the quality step, is taken into account and is an essential aspect of detecting heterozygous indels. The second criteria for detection is a critical concentration (33%) of individual closely spaced (<3 bases) heterozygous substitutions from the quality step to the end of coverage. The position of the quality step and the first heterozygous substitution in the concentrated region are then used to define the start of the variant. Correct identification of the start position is essential for correct annotation but the most important aspect of detecting indels is that some variant call has been made, even if it is not correctly annotated.

    The next stage is to annotate the indel, and this is achieved by generating a "mutated sequence" string beginning at the suspected start position. The length and sequence of the variant are identified from the amplimer sequence (for deletions) or mutated sequence (for insertions). A "speculative indel" is assigned if the type or length information is undefinable.


    Post-processing, Flagging

    Flagging rules are applied in the order given, proceeding to the next variant as soon as a PASS or FAIL result is reached.


    a) Substitutions
  • 1) If the variant is on both strands at the same position, PASS.
  • 2) If the reverse strand does not contain the variant and the mutant quality at the variant position > 4.5, FAIL.
  • 3) If the reference quality at the variant position is < 3.0, FAIL (heterozygotes only).
  • 4) If the average quality of at least 5 of the 7 upstream bases is > 4.0 AND the quality of the mutant base is < 4.5, PASS.

  • b) Heterozygous Insertions and Deletions
  • 1) If the variant is called an INS or DEL, PASS.
  • 2) For speculative Indel variants;
  • a) if there is no comparison coverage on the opposite strand, PASS.
  • b) if the reverse trace has an average quality < 3.5, PASS.
  • 3) If the opposite strand has an indel (using a search window of half width max (indel length, 15) centred on indel position), PASS.

  • c) Homozygous Insertions and Deletions
  • 1) If the opposite strand has an indel (using a window of half width indel length + 2 centred on indel position), PASS.
  • 2) If the opposite strand has no indel (centred window of length 50 bases) and the opposite average quality (over a centred window of half width max(indel length, 10)) is >5.0, FAIL.
  • 3) If the current strand's local average quality (both 5' and 3' using 5 of a maximum of 7 bases in both directions) is >5.0, PASS.
  • Information Projects Other Services
    Sanger Home
    Sitemap
    Site Search
    Information
    Careers
    Press
    News
    Seminars
    Workshops
    Publications
    Staff Theses
    Travel Directions
    Research Teams
    Research Faculty
    Personnel Search
    Human Genetics
    Model Organism Genetics
    Pathogen Genetics
    Bioinformatics
    Sequencing
    Library
    Helpdesk
    Webmail
    VPN Access
    Sign In
    SSO Pass. Reset

    webmaster@sanger.ac.uk

    Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK  Tel:+44 (0)1223 834244

    Last Modified Mon Feb 12 14:29:42 2007

    Genome Research Limited is a charity registered in England with number 1021457

    Help | Contact us | Legal | Cookies policy | Data sharing