Cancer Genome Project Software

AutoCSA Installation and User Guide


  1. Introduction
  2. Release
  3. The Main Concepts of AutoCSA
  4. Installing Java and AutoCSA
  5. Setting Up AutoCSA
  6. Running AutoCSA
  7. Web Interface
  8. Parameters
  9. Protein Annotation
  10. Requirements
  11. Detailed Algorithm
  12. Region of Interest (ROI)
  13. Licence

1. Introduction


AutoCSA (Automatic Comparative Sequence Analysis) is a mutation detection program designed to detect small mutations (1-50 bases) in sequence traces. It can detect both homozygous and heterozygous base substitutions, as well as small insertions and deletions. The software is written in Java and has a modular design, so it can be integrated into sequencing pipelines or run as a standalone application. It comes with a web-based viewer so that potential mutations can be rapidly inspected. A spreadsheet of the results is also generated so that the data can be easily edited and moved into other applications. Optionally, the gene DNA sequences can be added so that mutations are automatically annotated at the protein level. AutoCSA has been designed specifically with high-throughput environments in mind, so the analysis of large amounts of data can be automated with little manual intervention.




2. Release


The current release of AutoCSA is version 1.0Beta.




3. The Main Concepts of AutoCSA


In order to run AutoCSA successfully it is important to understand the main concepts of the algorithm. This section can be skipped but it should help you in running and getting the best out of the software.


    3.a) Peak Matching

    AutoCSA uses the known DNA amplimer sequence to match the peaks in the trace with bases in the amplimer (Figure 1). This stage is also used to provide information for identification of homozygous mutations.


    Figure 1 Example of Peak Matching


    3.b) Heterozygote Mutation Signature

    The main concept of AutoCSA is that it compares the trace of interest against a reference trace in order to detect heterozygous mutations. AutoCSA scans both traces looking for drops in the height of corresponding peaks between the reference trace and the trace of interest; these drops are marked as potential mutations (Figure 2). The peak height drop needs to be 20% or greater to be flagged as a potential mutation. Any drop of less than 20% is ignored, and any drop of 20% or more is run through a further set of tests to see if it is a valid mutation.


    Figure 2
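The drop test described above can be sketched in a few lines of Java. This is only an illustration, not AutoCSA's own code: the class and method names are hypothetical, and it assumes the two peak heights have already been matched to the same amplimer base and are on a comparable scale (the guide does not say how heights are normalised).

```java
/** Illustrative sketch of the peak height drop test (hypothetical names). */
final class PeakDropSketch {
    /** Default drop threshold from the text: 20% or greater. */
    static final double DEFAULT_DROP = 0.20;

    /** Fractional drop of the sample peak relative to the reference peak. */
    static double relativeDrop(double referenceHeight, double sampleHeight) {
        if (referenceHeight <= 0.0) {
            throw new IllegalArgumentException("reference peak height must be positive");
        }
        return (referenceHeight - sampleHeight) / referenceHeight;
    }

    /** True when the drop is at or above the threshold, i.e. worth further testing. */
    static boolean isCandidate(double referenceHeight, double sampleHeight, double threshold) {
        return relativeDrop(referenceHeight, sampleHeight) >= threshold;
    }

    public static void main(String[] args) {
        // A 30% drop is passed on to the further validity tests.
        System.out.println(isCandidate(1000.0, 700.0, DEFAULT_DROP));
        // A 10% drop is ignored.
        System.out.println(isCandidate(1000.0, 900.0, DEFAULT_DROP));
    }
}
```

Passing the threshold as a parameter mirrors the fact that the drop value is configurable (see the Parameters section).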

    3.c) Reference Trace Selection

    It is important that the reference trace chosen is of good quality for effective mutation detection. If AutoCSA is given a series of reference traces it will choose the best trace by comparing the quality, length of coverage and number of trace holes (unidentified bases) of all the traces.
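One way to picture this selection is as a ranking over per-trace summary metrics. The sketch below is an assumption-laden illustration: the guide names the three criteria but not how they are weighed, so the ordering used here (quality first, then coverage, then fewest holes) and all type and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative ranking of candidate reference traces (hypothetical names). */
final class ReferenceSelectionSketch {
    /** Summary metrics for one candidate reference trace. */
    static final class TraceSummary {
        final String name;
        final double meanQuality; // mean per-base q score
        final int coveredBases;   // length of coverage
        final int traceHoles;     // number of unidentified bases

        TraceSummary(String name, double meanQuality, int coveredBases, int traceHoles) {
            this.name = name;
            this.meanQuality = meanQuality;
            this.coveredBases = coveredBases;
            this.traceHoles = traceHoles;
        }
    }

    // ASSUMED ordering: higher quality, then longer coverage, then fewer holes.
    static final Comparator<TraceSummary> BEST_FIRST =
        Comparator.comparingDouble((TraceSummary t) -> t.meanQuality).reversed()
            .thenComparing(Comparator.comparingInt((TraceSummary t) -> t.coveredBases).reversed())
            .thenComparingInt(t -> t.traceHoles);

    /** Returns the candidate that sorts first under BEST_FIRST. */
    static TraceSummary chooseBest(List<TraceSummary> candidates) {
        return candidates.stream()
            .min(BEST_FIRST)
            .orElseThrow(() -> new IllegalArgumentException("no reference traces supplied"));
    }

    public static void main(String[] args) {
        List<TraceSummary> candidates = Arrays.asList(
            new TraceSummary("A", 12.0, 400, 2),
            new TraceSummary("B", 12.0, 450, 5),
            new TraceSummary("C", 8.0, 500, 0));
        System.out.println(chooseBest(candidates).name);
    }
}
```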


    3.d) Per Base Quality Score (q)

    AutoCSA gives a simple quality score to each base identified in the trace, known as q. It is a simple signal to noise ratio of the matched base against any noise or signal at that same position. This metric is used in calculating whether a mutation is considered real and also in assessing the coverage of a trace. Q scores vary in the range from 0-25, with a q score of below 4.5 being of low quality.

    • q < 4.5 Poor quality
    • 4.5 - 10.0 Moderate quality
    • q > 10.0 Excellent quality
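The three quality bands translate directly into a small classifier. The sketch below uses hypothetical names and takes the band edges from the list above, treating 4.5 and 10.0 themselves as moderate.

```java
/** Illustrative mapping from a q score to the quality bands listed above. */
final class QualityBandSketch {
    enum Band { POOR, MODERATE, EXCELLENT }

    /** q below 4.5 is poor, 4.5 to 10.0 is moderate, above 10.0 is excellent. */
    static Band classify(double q) {
        if (q < 4.5) {
            return Band.POOR;
        }
        if (q <= 10.0) {
            return Band.MODERATE;
        }
        return Band.EXCELLENT;
    }

    public static void main(String[] args) {
        for (double q : new double[] {3.0, 7.2, 18.5}) {
            System.out.println(q + " -> " + classify(q));
        }
    }
}
```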

    3.e) Mutation Flagging

    Mutation flagging is a series of rules that are applied by AutoCSA to remove false positive calls. The rules use the local and global quality of the trace and also the concentration of mutation calls. If bi-directional sequencing is used extra rules are applied which examine the corresponding bases on both strands.

    NOTE: Flagging was designed with bi-directional sequencing in mind and so performs better under these conditions.


    3.f) Mutation Types

    Listed below is a description of all possible sequence variant types called by AutoCSA. The first 5 types are basic mutations found in the trace under investigation, types 6-7 are mutations in the reference trace, and types 8-9 are mutations which cannot be fully characterised by AutoCSA.


    1. Heterozygous Substitutions
       Called when AutoCSA detects a peak height drop in the trace with respect to the reference trace and a mutant peak satisfying a number of validity tests.
    2. Homozygous Substitutions
       Called when AutoCSA detects a missing base in the trace, and instead finds an alternate base, which must satisfy a number of validity tests.
    3. Heterozygous Insertions and Deletions
       These variant types are characterised by a high concentration of individual heterozygous substitutions emanating from a position in the trace with a marked decrease in the quality profile. In order for these calls to be made AutoCSA must also be able to derive an annotation of the indel, i.e. insertion or deletion and the changed bases.
    4. Homozygous Insertions and Deletions
       These are called when AutoCSA detects either a series of extra bases (not present in the amplimer) or missing bases in the trace without an associated jump in scan index.
    5. Homozygous Complex (Indel)
       These are called when an insertion and deletion exist at the same base location (indel), i.e. a set of bases is deleted and replaced with a different set of bases.
    6. Homozygous Germline Substitutions
       Called when the identical substitution is detected on both the reference trace and the trace being screened.
    7. Homozygous Reference Substitutions
       These are called when the reference trace contains a homozygous substitution when compared to the DNA amplimer sequence.
    8. Speculative Indels (Het and Hom)
       These variant types are called when AutoCSA cannot fully characterise the location and/or base change of an insertion or deletion.
    9. Trace Hole
       Trace holes represent a failure to match a nucleotide in the amplimer with a base in the trace, and a subsequent failure to characterise a mutation.



4. Installing Java and AutoCSA


AutoCSA should run on most computer systems as long as they support a recent version of the Java programming language (or Java runtime environment). However, we have only tested the code on Linux, MS Windows XP and Mac OS X 10 and cannot guarantee that it will work or run as designed on other systems.


    4.a) Installing Java

    If you already have a version of Java running on your machine this stage can be skipped. Otherwise you will need to install an up-to-date version of the Java Virtual Machine (JVM). The JVM is available from the Sun Microsystems website (http://www.java.com).


    Click the download button on the java.com website. The web pages will guide you through the installation process and will also test whether the install has been successful (Figure 3).


    Figure 3 java.com website



    4.b) Installing AutoCSA

    AutoCSA comes with an installer so you just need to double click on csa-installer.jar which was included in the root directory of the zip download. The installer will then guide you through the initial setup stages.


    Click through the installation guide (Figure 4). You must accept our licence agreement in order to complete the installation. The guide will also prompt you for the location where the software should be installed. The default location is C:\Program Files\AutoCSA (hint: note where the software is installed, as you will need to go to this folder to run AutoCSA). Once this has been completed the installation guide will automatically run the setup wizard (see next section).


    Figure 4 Installation Wizard




5. Setting up AutoCSA


The initial setup of AutoCSA has been simplified by the use of a setup wizard which will guide you through a series of menus and questions. The wizard system prompts the user for where the trace files are located, where the output should go and the make up of the trace file name. The setup only needs to run through once. AutoCSA can also be configured manually and the final part of this section outlines how this can be achieved.


    5.a) Defining the Filename Format and Input/Output Folders

    The second page of the wizard prompts you for an example of a sample trace file and 2 folder locations, input/output (Figure 5). AutoCSA requires that certain information about a file is recorded in its actual name. An identifier is used to split these fields within the file name which can be defined by the user. Below is an example format for the reference and sample trace names with underscore (_) being used to split the different components of the file name.


    Example of reference trace filenames:

    BRCA1exon2_Normal1_reference_f_Run1.ab1
    BRCA1exon2_Normal1_reference_r_Run1.ab1


    Example of sample trace filenames:

    BRCA1exon2_Patient1_sample_f_Run1.ab1
    BRCA1exon2_Patient1_sample_r_Run1.ab1


    All filenames should contain the five fields below:-

    • Amplimer name (e.g. BRCA1exon2)
    • Sample name (e.g. Normal1, Patient1)
    • Indication if the file is a reference or a sample trace (e.g. reference or sample)
    • Direction sequencing was carried out, forward (f) or reverse (r) strand (e.g. f or r)
    • Indication which forward and reverse traces are paired together (Must contain one or more numerical digits)(e.g. Run1, Run2)

    The following three bullet points should help you to fill in page 2 of the wizard (Figure 5):-

    • An example sequencing file. The wizard requires the location of an example trace file, which it then uses to help define the fields in the file name on the next page. Click the browse button, navigate to a folder with a trace file and select the file. Then click Open; the wizard should automatically add the path of this file to the text box.
    • Root of input directories. This is the top folder where the trace files are stored. AutoCSA will look in all subdirectories below this for trace files and if all the other data which is needed is available will analyse the trace. Again, click the browse button and navigate to the directory. Then Click Open.
    • An output location. This is the directory where the webpages and excel spreadsheets containing the output will be saved to. You need to create a directory to store the output in a convenient place. Then click the browse button, navigate to the output directory and select it. Then click Open.

    Figure 5 Page 2 of the Setup Wizard, (define sample trace file and input and output folders)


    The wizard then takes the sample file that you provided on the last page and using the file as an example asks a series of questions so the fields in the filename can be defined (Figure 6).


    An example file name:-

    BRAFexon11_NCI-H1395_sample_f_Run1.scf


    The questions are shown below, each followed by the answer required if the file name above is used:-
    1. Please enter the trace file extension used, e.g. ab1, scf (do not include "."). Answer: scf
    2. Please enter the character used to divide the components of the file names, e.g. "_". Answer: _

    The wizard then splits the filename up and lists the fields in the filename components table (Figure 6, text box on right side).

    3. Please enter the number (shown in the file components table) for the "amplimer name". Answer: 1
    4. Please enter the number (shown in the file components table) for the "sample name". Answer: 2
    5. Please enter the number (shown in the file components table) indicating if the trace is a "reference" or a "sample to be screened". Answer: 3
    6. Please enter the text that defines the reference trace, e.g. "reference". Answer: reference
    7. Please enter the number (shown in the file components table) indicating if the trace is forward or reverse. Answer: 4
    8. Please enter the text that defines the forward trace, e.g. "f" or "s", (any numeric characters will be removed). Answer: f
    9. Please enter the text that defines the reverse trace, e.g. "r" or "a", (any numeric characters will be removed). Answer: r
    10. Please enter the number (shown in the file components table) indicating which traces are paired together in this experiment, e.g. "Run1", (any non-numeric characters will be removed). Answer: 5

    Figure 6 Setup Wizard Completed File Format Form
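The filename convention can be made concrete with a small parser. The following Java sketch is illustrative only (the class and field names are hypothetical, not part of AutoCSA): it hard-codes the field positions 1 to 5 and the identifiers "reference", "f" and "r" from the example answers above, strips digits from the direction field and non-digits from the run field as the wizard describes, and rejects names that do not fit.

```java
import java.util.regex.Pattern;

/** Illustrative parser for trace file names such as BRAFexon11_NCI-H1395_sample_f_Run1.scf. */
final class TraceNameSketch {
    final String amplimer;
    final String sample;
    final boolean reference;
    final boolean forward;
    final String runNumber;

    private TraceNameSketch(String amplimer, String sample, boolean reference,
                            boolean forward, String runNumber) {
        this.amplimer = amplimer;
        this.sample = sample;
        this.reference = reference;
        this.forward = forward;
        this.runNumber = runNumber;
    }

    static TraceNameSketch parse(String fileName, String extension, String delimiter) {
        String suffix = "." + extension;
        if (!fileName.endsWith(suffix)) {
            throw new IllegalArgumentException("unexpected extension: " + fileName);
        }
        String stem = fileName.substring(0, fileName.length() - suffix.length());
        // Quote the delimiter so characters such as "." or "|" are taken literally.
        String[] parts = stem.split(Pattern.quote(delimiter));
        if (parts.length < 5) {
            throw new IllegalArgumentException("expected five fields in: " + fileName);
        }
        String direction = parts[3].replaceAll("[0-9]", ""); // numeric characters removed
        String run = parts[4].replaceAll("[^0-9]", "");      // non-numeric characters removed
        if (!direction.equals("f") && !direction.equals("r")) {
            throw new IllegalArgumentException("unknown direction in: " + fileName);
        }
        if (run.isEmpty()) {
            throw new IllegalArgumentException("run field needs at least one digit: " + fileName);
        }
        return new TraceNameSketch(parts[0], parts[1], parts[2].equals("reference"),
                                   direction.equals("f"), run);
    }

    public static void main(String[] args) {
        TraceNameSketch t = parse("BRAFexon11_NCI-H1395_sample_f_Run1.scf", "scf", "_");
        System.out.println(t.amplimer + " " + t.sample + " reference=" + t.reference
            + " forward=" + t.forward + " run=" + t.runNumber);
    }
}
```

Checking a handful of real file names against a parser like this before running the wizard is a quick way to spot names that are missing a field or use a different delimiter.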



    5.b) DNA Amplimer Sequence

    Once the fields for the trace file have been defined the wizard goes on to define the DNA amplimer sequences (Figure 7). In order for AutoCSA to run it must have the actual DNA sequence for each DNA amplimer defined. The wizard will prompt you for the name of the amplimer sequence, the actual DNA sequence and the coordinates of the region of interest (ROI; AutoCSA will ignore any mutations outside this region). Fill in the New Amplimer Section and click the add button, and the wizard will load this information into AutoCSA (use ctrl-v to paste in the sequence). Multiple entries can be added; once all the amplimers have been added, click the finish button to complete setup. To add further amplimers the wizard can be run again or the information can be manually added to the amplimer properties file (see the Manually configuring AutoCSA section). AutoCSA is now ready to run.


    Figure 7 Setup Wizard Amplimer Form


    5.c) Manually configuring AutoCSA

    There are two main configuration files for AutoCSA, the standaloneCsa.properties file and the amplimer.properties file. Both files are located in the directory where AutoCSA is installed (the default location is C:\Program Files\AutoCSA). These files can be edited manually using a basic text editor; take care, as some editors add hidden characters or file extensions which could crash AutoCSA. Alternatively, setup can be run again and the files will be overwritten.


    The standaloneCsa properties file contains information on the makeup of the file name, where input/output directories are and several other parameters (See Parameters Section, Figure 8).


    Figure 8 Example of a StandaloneCSA Property File


    The amplimer file contains the DNA sequence for each amplimer sequence and the Region of Interest coordinates on the amplimer (Figure 9). Rather than going through the wizard it is easier for users to directly edit and add amplimers manually to the file.


    Figure 9 Example of an Amplimer File




6. Running AutoCSA


Once setup is complete AutoCSA can be run by clicking on the run icon in the folder where AutoCSA was installed (Figure 10). The default location is C:\Program Files\AutoCSA. A terminal will appear which displays the progress of the software and any errors. Once AutoCSA has finished running it will automatically open a web browser with a summary of the results. The web interface is described in the next section.


Figure 10 Running AutoCSA


Running AutoCSA's example set

An example set of trace files is available from our website so you can test AutoCSA and get used to configuring the system. The example set has examples of the 4 main mutations that AutoCSA can detect: a heterozygous substitution, a homozygous substitution, a heterozygous indel and a homozygous indel. The zipped file is available from here. Click on the link and save the file to a folder on your computer. Once downloaded, the zip file must be decompressed by double clicking on its icon. Then follow these instructions to configure AutoCSA (further help on running the example data set is available from the quick start document):-

  1. Rerun the setup program and follow the instructions as in the setup AutoCSA section above. The files are named as set out in the above section. The amplimer section can be ignored as a completed amplimer.properties file comes in the top directory of the zip file.
  2. Copy the amplimer.properties file from the downloaded directory to the directory where AutoCSA is installed.
  3. Run AutoCSA by clicking on the run icon in the AutoCSA directory (see start of this section).



7. Web interface


The resulting mutations are displayed in a series of webpages. The webpages are generated in the output folder specified during setup, and the entry page is called index.html. To view the results, type the address of this file into the web browser or double click on the index.html icon.

For a demo click here

The index.html page lists the amplimer sequences that have been screened and links to the comparisons that were performed (Comparison data), information about the reference traces used (Wildtypes) and number of mutations identified (Figure 11).


Figure 11 AutoCSA Summary Page


Clicking on an amplimer name gives information on the length of the DNA amplimer, the region of interest (ROI), the CDS mapping and translation (if provided) and the actual DNA sequence (Figure 12). The region of interest defines the area of the trace in which AutoCSA will mark up mutations. Mutations outside this area will be ignored.


Figure 12 Amplimer Information


The results link gives a list of the comparisons carried out and the mutations that were called (Figure 13). The DNA name, direction of sequencing (forward or reverse), run id that pairs the forward and reverse trace, status of trace (pass/fail), coverage, quality score, number of substitutions and other mutation types are recorded for each comparison. The mutations can be viewed by clicking on the substitution or other mutation column. Clicking on the run id for a trace brings up a full view of that trace (if the option to generate full traces has been switched on; the default is off).


Figure 13 Comparison Summary Page


The detailed view of a substitution displays four traces with 20 bases either side of the potential mutation (Figure 14). The top two traces are the traces that AutoCSA used to call the mutation: the very top trace is the reference trace and the second is the trace being screened for mutations. The mutation position is marked with a shaded region, colour coded for the actual base in the amplimer sequence (Guanine=grey, Cytosine=blue, Thymine=red, Adenine=green). The third and fourth traces are the reverse sequenced traces: the third is the reverse trace under investigation and the bottom is the reverse reference trace. Information on the position, zygosity, type and base change of the mutation is recorded in a text box to the right of the traces.


Figure 14 Detailed Substitution View


Insertions and Deletions are displayed using a slightly different format (Figure 15). In this case the complete trace and the reference trace are displayed in a scrollable window. For homozygous mutations the mutation is shaded and the wild type bases put above the position.


Figure 15 Detailed Homozygous Insertion and Deletion Page


For heterozygous insertions and deletions a scrollable window with the trace and reference trace is provided (Figure 16). The start of the mutation is marked on the trace.


Figure 16 Detailed Heterozygous Insertion and Deletion Page


8. Parameters

AutoCSA has a number of parameters which can be set for advanced users.


Drop parameter

As outlined in the "Main Concepts of AutoCSA" section above, the main signature that the software uses to detect heterozygous mutations is a drop in signal intensity between a reference trace and the trace under investigation for a particular base. The software default value for this is a drop of 20% or greater. Any drop of less than 20% is ignored, and any drop of 20% or more is run through a further set of tests to see if it is a valid mutation. Increasing this parameter has the effect of decreasing the false calls made by the software but also decreases its sensitivity. If AutoCSA is being used to detect homozygous mutations or to look for 50:50 heterozygotes, the drop can be increased to around 40%, which will significantly reduce false calls.


The drop value can be altered in AutoCSA by editing the critMutRatio parameter in the csa_analysis.properties file. This file is located in the resource subfolder of the directory where StandaloneCSA has been installed. The change can be made using a simple text editor such as Notepad.



standaloneCsa.properties

Property                  Default               Usage
amplimer_name             initialised in setup  position in filename that is the amplimer name
antisense_identifier      initialised in setup  character indicating antisense/reverse sequencing direction
delimiter                 initialised in setup  character between filename components
direction                 initialised in setup  position in filename that indicates sequencing direction
dna_name                  initialised in setup  position in filename that indicates sample name
flagging                  1                     flagging will run when '1'
generate_all_full_traces  0                     all full trace views generated when '1' (normally only generated when a variant is found or for the best normal trace)
mobility_correct          0                     applies to 'scf' files, currently not used
remove_not_in_roi         1                     removes variants outside of ROI from view when '1'
run_num                   initialised in setup  position in filename that indicates run number
sample_type               initialised in setup  position in filename that indicates wildtype or variant sample
sense_identifier          initialised in setup  character indicating sense/forward sequencing direction
seq_file_extension        initialised in setup  extension on sequencing files
seq_files_path            initialised in setup  path to root directory containing sequencing files
view_output               initialised in setup  path where output files will be written
wildtype_identifier       initialised in setup  text indicating sample as reference/wildtype

Flagging switch (flagging)

Flagging uses a series of rules to filter out false calls. If the sequence traces are of very good quality or sensitivity needs to be kept very high then flagging can be switched off. This is achieved by setting the "flagging" parameter to 0 in the standaloneCsa.properties file.


Full trace file generation (generate_all_full_traces)

AutoCSA generates an image of the mutation and surrounding bases to allow manual review by the user. There is also the option to switch on full trace generation, which generates full trace views for all traces rather than only for traces in which a variant is found or for the best normal trace. By default this option is switched off as it slows down the analysis.


Removal of variants not in ROI (remove_not_in_roi)

Variants found outside of the region of interest are normally removed as these are expected to be in areas of poor quality (the beginning and end of the amplimer).
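The three switches above are simple '1'/'0' values, so they can be inspected with the standard java.util.Properties class. The sketch below is an illustration under assumptions: it assumes standaloneCsa.properties follows the usual Java key=value properties syntax, the class and method names are hypothetical, and it reads from an in-memory string rather than the installed file. The fallback values are the defaults from the table above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Illustrative reader for the '1'/'0' switches in standaloneCsa.properties. */
final class SwitchSketch {
    /** Parses properties text (assumed to be standard key=value syntax). */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** A switch is on when its value is '1'; a missing key falls back to the default. */
    static boolean isOn(Properties props, String key, String defaultValue) {
        return "1".equals(props.getProperty(key, defaultValue).trim());
    }

    public static void main(String[] args) {
        // Hypothetical file contents: flagging off, full traces on.
        Properties props = parse("flagging=0\ngenerate_all_full_traces=1\n");
        System.out.println("flagging: " + isOn(props, "flagging", "1"));
        System.out.println("full traces: " + isOn(props, "generate_all_full_traces", "0"));
        System.out.println("remove outside ROI: " + isOn(props, "remove_not_in_roi", "1"));
    }
}
```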


9. Protein Annotation

AutoCSA can annotate the mutation change at the protein level. For this option to work the CDS of the gene must be provided and linked to the amplimer sequence being used. This information is recorded in the amplimer.properties file (Figure 17). For each amplimer name a line indicating which gene CDS it maps to must be provided:-


map-STK11exon1=STK11


The actual cds sequence must also be added to the file:-


cds-STK11=atggaggtggtgga.............


Once the relevant information has been added to the amplimer.properties file AutoCSA will then attempt to automatically annotate called mutations.


Figure 17 Fields required in the amplimer.properties file for protein annotation
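The two-step lookup implied by the map- and cds- entries (amplimer name to gene, then gene to CDS) can be sketched as follows. This is not AutoCSA's own code: the class and method names are hypothetical, the file is assumed to use standard Java properties syntax, and the gene name GENE1 and its six-base sequence are made-up placeholders rather than real data.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;

/** Illustrative lookup of the CDS linked to an amplimer via map- and cds- keys. */
final class CdsLookupSketch {
    /** Parses properties text (assumed to be standard key=value syntax). */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Follows map-<amplimer> to a gene name, then cds-<gene> to the sequence. */
    static Optional<String> cdsFor(Properties props, String amplimerName) {
        return Optional.ofNullable(props.getProperty("map-" + amplimerName))
            .map(String::trim)
            .map(gene -> props.getProperty("cds-" + gene));
    }

    public static void main(String[] args) {
        // Placeholder entries; GENE1 and its sequence are hypothetical.
        Properties props = parse("map-GENE1exon1=GENE1\ncds-GENE1=atggcc\n");
        System.out.println(cdsFor(props, "GENE1exon1").orElse("no CDS linked"));
        System.out.println(cdsFor(props, "GENE2exon4").orElse("no CDS linked"));
    }
}
```

An empty result for an amplimer means either the map- line or the matching cds- line is missing, which are the two things to check if protein annotation does not appear.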


10. Requirements

AutoCSA is written in Java and will run under Java 1.4, which must be downloaded and installed on your computer before installing AutoCSA. It is available from Sun's Java site; see also the installation instructions above.


The software is optimised to run on sequence traces generated by ABI3730 sequencers running with 36cm capillaries, RapidSeq-15sec run module and POP7 sequencing buffer. AutoCSA uses the raw sequence channels which are also present in the files generated by the sequencers (.ab1 file extension).


Due to the nature of sequence traces, AutoCSA does not attempt to match the DNA sequence until base 50. We therefore recommend having a buffer of 75 bases either side of the region of interest. The software has been used extensively on amplimers of length 500 bases but has only been minimally tested on longer amplimers.


The software has been tested successfully under Linux, MS Windows XP, and Mac OS X 10.4.6.


11. Detailed Algorithm

Please click here for a detailed description of the algorithm.



12. Region of Interest (ROI)

The region of interest serves two functions:

  1. Indicates the approximate coding portion of the amplimer
  2. Indicates the region of the amplimer expected to give good quality data

Generally it is anticipated that the first 50 residues of a sequencing reaction will contain high amounts of noise. This noise can affect the quality calculations used in normal selection and cause variants to be lost in flagging.

For this reason we recommend that the ROI should be defined as approximately:

  • amplimerName-roi=50,(amplimerLength-50) for bi-directional sequencing
  • amplimerName-roi=50,amplimerLength for uni-directional sequencing
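As a worked example of the two recommendations, the sketch below builds the roi line for a given amplimer length. The class and method names are hypothetical, and the 500-base length used for BRCA1exon2 in main is an illustrative assumption, not a measured value.

```java
/** Illustrative helper that builds the recommended roi line for an amplimer. */
final class RoiSketch {
    /** Number of leading bases expected to be noisy, from the text above. */
    static final int NOISY_LEAD = 50;

    /** Returns "start,end" following the recommendation for the sequencing mode. */
    static String recommendedRoi(int amplimerLength, boolean biDirectional) {
        int end = biDirectional ? amplimerLength - NOISY_LEAD : amplimerLength;
        if (end <= NOISY_LEAD) {
            throw new IllegalArgumentException("amplimer too short for a ROI: " + amplimerLength);
        }
        return NOISY_LEAD + "," + end;
    }

    /** Formats the entry as it would appear in amplimer.properties. */
    static String roiLine(String amplimerName, int amplimerLength, boolean biDirectional) {
        return amplimerName + "-roi=" + recommendedRoi(amplimerLength, biDirectional);
    }

    public static void main(String[] args) {
        // Assumed 500-base amplimer, for illustration only.
        System.out.println(roiLine("BRCA1exon2", 500, true));
        System.out.println(roiLine("BRCA1exon2", 500, false));
    }
}
```

For bi-directional sequencing the last 50 bases are trimmed as well, because they are the first 50 bases of the reverse read.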



13. Licence

Copyright (c) 2006 Genome Research Ltd.
Author: Cancer Genome Project, cgpit@sanger.ac.uk


THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


You are granted a non-exclusive, non-transferable licence to use this software for your own personal and non-commercial purposes only. You shall not lease, license, loan, sell or distribute the software in whole or in part to any third party. You shall not modify, decompile, disassemble, reverse engineer or create derivative works of this software without the prior consent of the authors.



Last Modified Fri Feb 16 11:52:54 2007
