AutoCSA Installation and User Guide
AutoCSA (Automatic Comparative Sequence Analysis) is a mutation detection program designed to detect small mutations (1-50 bases) in sequence traces. It can detect both homozygous and heterozygous base substitutions and small insertions and deletions. The software is written in Java and has a modular design, so it can be easily integrated into sequencing pipelines or run as a standalone application. It comes with a web-based viewer so that potential mutations can be rapidly inspected, and it generates a spreadsheet of the results so that the data can be easily edited and moved into other applications. There is also the option to add the gene DNA sequences so that mutations can be automatically annotated at the protein level. AutoCSA has been designed specifically with high-throughput environments in mind, so it is easy to automate the analysis of large amounts of data with little manual intervention.
The current release of AutoCSA is version 1.0Beta.
3. The Main Concepts of AutoCSA
In order to run AutoCSA successfully it is important to understand the main concepts of the algorithm. This section can be skipped but it should help you in running and getting the best out of the software.
3.a) Peak Matching
AutoCSA uses the known DNA amplimer sequence to match the peaks in the trace with bases in the amplimer (Figure 1). This stage is also used to provide information for identification of homozygous mutations.
3.b) Heterozygote Mutation Signature
The main concept of AutoCSA is that it compares the trace of interest against a reference trace in order to detect heterozygous mutations. AutoCSA scans both traces looking for drops in the height of corresponding peaks between the reference trace and the trace of interest; such drops are marked as potential mutations (Figure 2). The peak height drop needs to be 20% or greater to be flagged as a potential mutation. Any drop of less than 20% is ignored, and any drop equal to or above 20% is run through a further set of tests to see if it is a valid mutation.
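The threshold test described above can be sketched in a few lines of Java. This is an illustration only and not AutoCSA's own code: the class and method names are hypothetical, and the sketch assumes the two peak heights have already been scaled so that they are directly comparable (how AutoCSA normalises traces is not covered in this guide).

```java
/** Hypothetical helper illustrating the peak height drop test; not AutoCSA's own code. */
class PeakDropSketch {
    /** Default from the text: a drop of 20% or more marks a potential mutation. */
    static final double DEFAULT_DROP_THRESHOLD = 0.20;

    /** Fractional drop of the sample peak relative to the reference peak (0.25 means 25% lower). */
    static double relativeDrop(double referenceHeight, double sampleHeight) {
        if (referenceHeight <= 0) {
            throw new IllegalArgumentException("reference peak height must be positive");
        }
        return (referenceHeight - sampleHeight) / referenceHeight;
    }

    /** True when the drop is equal to or above the threshold, so the further tests would apply. */
    static boolean isCandidate(double referenceHeight, double sampleHeight, double threshold) {
        return relativeDrop(referenceHeight, sampleHeight) >= threshold;
    }

    public static void main(String[] args) {
        // 15% drop: below the default threshold, so it would be ignored.
        System.out.println(isCandidate(1000, 850, DEFAULT_DROP_THRESHOLD));
        // 50% drop: above the default threshold, so it would go on to further tests.
        System.out.println(isCandidate(1000, 500, DEFAULT_DROP_THRESHOLD));
    }
}
```

Passing the threshold as an argument makes it easy to see the effect of a stricter setting on the same pair of peaks.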
3.c) Reference Trace Selection
It is important that the reference trace chosen is of good quality for effective mutation detection. If AutoCSA is given a series of reference traces it will choose the best trace by comparing the quality, length of coverage and number of trace holes (unidentified bases) of all the traces.
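One way to picture this selection is as a ranking over a small summary of each candidate trace. The sketch below is an assumption-laden illustration, not the AutoCSA implementation: the guide names the three criteria but not how they are weighted, so the ordering used here (higher quality first, then longer coverage, then fewer holes) and all class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical ranking of candidate reference traces; the priority order is an assumption. */
class ReferenceSelectionSketch {
    static final class TraceSummary {
        final String name;
        final double meanQuality; // average q score over the trace
        final int coverage;       // number of bases matched
        final int holes;          // number of unidentified bases

        TraceSummary(String name, double meanQuality, int coverage, int holes) {
            this.name = name;
            this.meanQuality = meanQuality;
            this.coverage = coverage;
            this.holes = holes;
        }
    }

    /** Orders traces best first: higher quality, then longer coverage, then fewer holes. */
    static final Comparator<TraceSummary> BEST_FIRST =
            Comparator.comparingDouble((TraceSummary t) -> t.meanQuality).reversed()
                    .thenComparing(Comparator.comparingInt((TraceSummary t) -> t.coverage).reversed())
                    .thenComparingInt(t -> t.holes);

    static TraceSummary chooseBest(List<TraceSummary> candidates) {
        return Collections.min(candidates, BEST_FIRST);
    }

    public static void main(String[] args) {
        List<TraceSummary> candidates = Arrays.asList(
                new TraceSummary("ref_A", 12.0, 400, 3),
                new TraceSummary("ref_B", 15.5, 380, 1),
                new TraceSummary("ref_C", 15.5, 420, 2));
        // ref_B and ref_C tie on quality; ref_C wins on coverage.
        System.out.println(chooseBest(candidates).name);
    }
}
```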
3.d) Per Base Quality Score (q)
AutoCSA gives a simple quality score, known as q, to each base identified in the trace. It is a simple signal to noise ratio of the matched base against any noise or signal at that same position. This metric is used in calculating whether a mutation is considered real and also in assessing the coverage of a trace. q scores range from 0 to 25; a q score below 4.5 indicates low quality.
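As a rough illustration of a per-base signal to noise ratio, the sketch below divides the intensity of the matched channel by the strongest intensity in the other three channels at the same position. The exact formula AutoCSA uses is not given in this guide, so the choice of the strongest competing channel as "noise", the capping at 25 and all names are assumptions made for the example.

```java
/** Hypothetical per-base signal to noise score; the exact AutoCSA formula is not documented here. */
class QualityScoreSketch {
    static final double MAX_Q = 25.0;            // upper end of the range given in the text
    static final double LOW_QUALITY_BELOW = 4.5; // scores below this are treated as low quality

    /**
     * @param channels       intensities of the four dye channels at one peak position
     * @param matchedChannel index of the channel for the base matched to the amplimer
     */
    static double q(double[] channels, int matchedChannel) {
        double noise = 0.0;
        for (int i = 0; i < channels.length; i++) {
            if (i != matchedChannel) {
                noise = Math.max(noise, channels[i]); // assumption: strongest other channel is the noise
            }
        }
        double signal = channels[matchedChannel];
        if (noise <= 0.0) {
            return signal > 0.0 ? MAX_Q : 0.0;        // avoid division by zero
        }
        return Math.min(MAX_Q, signal / noise);        // assumption: scores are capped at 25
    }

    static boolean isLowQuality(double q) {
        return q < LOW_QUALITY_BELOW;
    }

    public static void main(String[] args) {
        double clean = q(new double[] {900, 100, 50, 30}, 0);
        double noisy = q(new double[] {300, 100, 0, 0}, 0);
        System.out.println(clean + " low quality? " + isLowQuality(clean));
        System.out.println(noisy + " low quality? " + isLowQuality(noisy));
    }
}
```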
3.e) Mutation Flagging
Mutation flagging is a series of rules that are applied by AutoCSA to remove false positive calls.
The rules use the local and global quality of the trace and also the
concentration of mutation calls. If bi-directional sequencing is used
extra rules are applied which examine the corresponding bases on both
3.f) Mutation Types
Listed below are descriptions of all possible sequence variant types called by AutoCSA. Types 1-5 are basic mutations found in the trace under investigation, types 6-7 define mutations in the reference trace, and types 8-9 are mutations which cannot be fully characterised by AutoCSA.
4. Installing Java and AutoCSA
AutoCSA should run on most computer systems as long as they support a recent version of the Java programming language (or Java runtime environment). However, we have only tested the code on Linux, MS Windows XP and Mac OS X 10 and cannot guarantee that it will work or run as designed on other systems.
4.a) Installing Java
If you already have a version of Java running on your machine this stage can be skipped. Otherwise you will need to install an up-to-date version of the Java Virtual Machine (JVM), which is available from the Sun Microsystems website (http://www.java.com).
Click the download button on the java.com website; the webpages will guide you through the installation process and will also test whether the install has been successful (Figure 3).
4.b) Installing AutoCSA
AutoCSA comes with an installer, so you just need to double click on csa-installer.jar, which is included in the root directory of the zip download. The installer will then guide you through the initial setup stages.
Click through the installation guide (Figure 4). You must accept our license agreement in order to complete the installation. The guide will also prompt you for the location where you would like the software to be installed. The default location is
5. Setting up AutoCSA
The initial setup of AutoCSA has been simplified by the use of a setup wizard which will guide you through a series of menus and questions. The wizard prompts the user for where the trace files are located, where the output should go and the makeup of the trace file name. The setup only needs to be run once. AutoCSA can also be configured manually, and the final part of this section outlines how this can be achieved.
5.a) Defining the Filename Format and Input/Output Folders
The second page of the wizard prompts you for an example of a sample trace file and two folder locations, one for input and one for output (Figure 5). AutoCSA requires that certain information about a file is recorded in its actual name. A separator character, which can be defined by the user, is used to split these fields within the file name. Below is an example format for the reference and sample trace names, with underscore (_) being used to split the different components of the file name.
Example of reference trace filenames:
Example of sample trace filenames:
All filenames should contain the five fields below:-
The following three bullet points should help you to fill in page 2 of the wizard (Figure 5):-
The wizard then takes the sample file that you provided on the last page and using the file as an example asks a series of questions so the fields in the filename can be defined (Figure 6).
An example file name:-
The questions are shown below along with the answers required if the file name above is used (in red):-
1. Please enter the trace file extension used, e.g. ab1, scf (do not include "."). scf
2. Please enter the character used to divide the components of the file names, e.g. "_". _
The wizard then splits the filename up and lists the fields in the filename components table (Figure 6, text box on right side).
3. Please enter the number (shown in the file components table) for the "amplimer name". 1
4. Please enter the number (shown in the file components table) for the "sample name". 2
5. Please enter the number (shown in the file components table) indicating if the trace is a "reference" or a "sample to be screened". 3
6. Please enter the text that defines the reference trace, e.g. "reference". reference
7. Please enter the number (shown in the file components table) indicating if the trace is forward or reverse. 4
8. Please enter the text that defines the forward trace, e.g. "f" or "s", (any numeric characters will be removed). f
9. Please enter the text that defines the reverse trace, e.g. "r" or "a", (any numeric characters will be removed). r
10. Please enter the number (shown in the file components table) indicating which traces are paired together in this experiment, e.g. "Run1", (any non-numeric characters will be removed). 5
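To make the effect of these ten answers concrete, the sketch below splits a file name using the same answers (extension scf, separator "_", and fields 1 to 5 in the order amplimer, sample, reference/sample, direction, run). It is an illustration only: the class is hypothetical, the file names used are invented for the example, and details such as case-sensitive matching of "reference" are assumptions rather than documented AutoCSA behaviour.

```java
import java.util.regex.Pattern;

/** Hypothetical parser mirroring the wizard answers above; not part of AutoCSA. */
class TraceFileNameSketch {
    final String amplimer;
    final String sample;
    final boolean reference;
    final boolean forward;
    final int runId;

    private TraceFileNameSketch(String amplimer, String sample, boolean reference,
                                boolean forward, int runId) {
        this.amplimer = amplimer;
        this.sample = sample;
        this.reference = reference;
        this.forward = forward;
        this.runId = runId;
    }

    static TraceFileNameSketch parse(String fileName) {
        String extension = ".scf";                                  // question 1
        if (!fileName.endsWith(extension)) {
            throw new IllegalArgumentException("unexpected extension: " + fileName);
        }
        String stem = fileName.substring(0, fileName.length() - extension.length());
        String[] fields = stem.split(Pattern.quote("_"));           // question 2
        if (fields.length < 5) {
            throw new IllegalArgumentException("expected five fields: " + fileName);
        }
        boolean reference = fields[2].equals("reference");          // questions 5 and 6
        String direction = fields[3].replaceAll("[0-9]", "");       // numeric characters removed
        boolean forward;
        if (direction.equals("f")) {                                // question 8
            forward = true;
        } else if (direction.equals("r")) {                         // question 9
            forward = false;
        } else {
            throw new IllegalArgumentException("unknown direction: " + fields[3]);
        }
        // Question 10: non-numeric characters are removed, so "Run1" becomes 1.
        int runId = Integer.parseInt(fields[4].replaceAll("[^0-9]", ""));
        return new TraceFileNameSketch(fields[0], fields[1], reference, forward, runId);
    }

    public static void main(String[] args) {
        // Invented file name, used purely for illustration.
        TraceFileNameSketch t = parse("amp1_sampleA_reference_f_Run1.scf");
        System.out.println(t.amplimer + " " + t.sample + " reference=" + t.reference
                + " forward=" + t.forward + " run=" + t.runId);
    }
}
```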
5.b) DNA Amplimer Sequence
Once the fields for the trace file have been defined, the wizard goes on to define the DNA amplimer sequences (Figure 7). In order for AutoCSA to run, it must have the actual DNA sequence for each DNA amplimer defined. The wizard will prompt you for the name of the amplimer sequence, the actual DNA sequence and the coordinates of the region of interest (ROI; AutoCSA will ignore any mutations outside this region). Filling in the New Amplimer Section and clicking the add button loads this information into AutoCSA (use ctrl-v to paste in the sequence). Multiple entries can be added; once all the amplimers have been added, click the finish button to complete setup. To add further amplimers, the wizard can be run again or the information can be manually added to the amplimer properties file (see the section on manually configuring AutoCSA). AutoCSA is now ready to run.
5.c) Manually configuring AutoCSA
There are two main configuration files for AutoCSA, the standaloneCsa.properties file and the amplimer.properties file. Both files are located
in the directory where AutoCSA is installed, (the default location is
The standaloneCsa properties file contains information on the makeup of the file name, where input/output directories are and several other parameters (See Parameters Section, Figure 8).
The amplimer file contains the DNA sequence for each amplimer and the Region of Interest coordinates on the amplimer (Figure 9). Rather than going through the wizard, users may find it easier to add or edit amplimers directly in this file.
6. Running AutoCSA
Once setup is complete AutoCSA can be run by clicking on the run icon in the folder where AutoCSA was installed (Figure 10). The default location is
Running AutoCSA's example set
An example set of trace files is available from our website so you can test AutoCSA and get used to configuring the system. The example set has examples of the 4 main mutations that AutoCSA can detect: a heterozygous substitution, a homozygous substitution, a heterozygous indel and a homozygous indel. The zipped file is available from here. You just need to click on the link and save the file to a folder on your computer. Once downloaded, the zip file must be decompressed by double clicking on its icon. You then need to follow these instructions to configure AutoCSA (further help on running the example data set is available from the quick start document):-
7. Web interface
The resulting mutations are displayed in a series of webpages. The webpages are generated in the output folder that was specified during setup, and the entry page is called index.html. To view the results, type the address of this page into a web browser or double click on the index.html icon.
The index.html page lists the amplimer sequences that have been screened and links to the comparisons that were performed (Comparison data), information about the reference traces used (Wildtypes) and number of mutations identified (Figure 11).
Clicking on an amplimer name gives information on the length of the DNA amplimer, the region of interest (roi), the CDS mapping and translation (if provided) and the actual DNA sequence (Figure 12). The region of interest (roi) defines the area of the trace in which AutoCSA will mark up mutations. Mutations outside this area will be ignored.
The results link gives a list of the comparisons carried out and the mutations that were called (Figure 13). The DNA name, direction of sequencing (forward or reverse), run id that pairs the forward and reverse trace, status of trace (pass/fail), coverage, quality score, number of substitutions and other mutation types are recorded for each comparison. The mutations can be viewed by clicking on the substitution or other mutation column. Clicking on the run id for a trace brings up a full view of the trace (if the option to generate full traces has been switched on; the default is off).
The detailed view of a substitution displays four traces with 20 bases either side of the potential mutation (Figure 14). The top two traces are the traces that AutoCSA used to call the mutation: the very top trace is the reference trace and the second trace is the trace being screened for mutations. The mutation position is marked with a shaded region, colour coded for the actual base in the amplimer sequence (Guanine=grey, Cytosine=blue, Thymine=red, Adenine=green). The third and fourth traces are the reverse sequenced traces, with the third being the reverse trace under investigation and the bottom being the reverse reference trace. Information on the position, zygosity, type and base change of the mutation is recorded in a text box to the right of the traces.
Insertions and Deletions are displayed using a slightly different format (Figure 15). In this case the complete trace and the reference trace are displayed in a scrollable window. For homozygous mutations the mutation is shaded and the wild type bases put above the position.
For heterozygous insertions and deletions a scrollable window with the trace and reference trace is provided (Figure 16). The start of the mutation is marked on the trace.
8. Parameters
AutoCSA has a number of parameters which advanced users can set.
As outlined in the "Main Concepts of AutoCSA" section above, the main signature that the software uses to detect heterozygote mutations is a drop in signal intensity between a reference trace and the trace under investigation for a particular base. The software default value for this is a drop of 20% or greater. Any drop of less than 20% is ignored, and any drop equal to or above 20% is run through a further set of tests to see if it is a valid mutation. Increasing this parameter has the effect of decreasing the false calls made by the software but will also decrease its sensitivity. If AutoCSA is being used to detect homozygous mutations or to look for 50:50 heterozygotes, the drop can be increased to around 40%, which will significantly reduce false calls.
The drop value can be altered in AutoCSA by editing the critMutRatio parameter in the csa_analysis.properties file. This file is located in the resource subfolder where StandaloneCSA has been installed. The change can be made using a simple text editor such as Notepad.
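For users who wish to check the setting from their own scripts, the sketch below reads critMutRatio using the standard java.util.Properties class. It assumes that csa_analysis.properties follows the usual Java key=value properties syntax and that the value is stored as a fraction (for example 0.2 for 20%); neither detail is confirmed in this guide, so inspect the file itself before relying on either. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical reader for the critMutRatio setting; the value scale is an assumption. */
class CritMutRatioSketch {
    /** Returns critMutRatio from properties-style text, or the fallback if the key is absent. */
    static double readCritMutRatio(String propertiesText, double fallback) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty("critMutRatio");
        return raw == null ? fallback : Double.parseDouble(raw.trim());
    }

    public static void main(String[] args) {
        // In practice the text would be read from csa_analysis.properties in the resource subfolder.
        String text = "# stricter setting for 50:50 heterozygotes (assumed fractional scale)\n"
                + "critMutRatio=0.4\n";
        System.out.println(readCritMutRatio(text, 0.2));
    }
}
```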
Flagging switch (flagging)
Flagging uses a series of rules to filter out false calls. If the sequence traces are of very good quality or sensitivity needs to be kept very high then flagging can be switched off. This is achieved by setting the "flagging" parameter to 0 in the standaloneCSA properties file.
Full trace file generation (generate_all_full_traces)
AutoCSA generates an image of the mutation and surrounding bases to allow manual review by the user. There is also the option to switch on full trace generation, which produces an image of the complete trace. By default this is switched off as it slows down the analysis.
Removal of variants not in ROI (remove_not_in_roi)
Variants found outside of the region of interest are normally removed as these are expected to be in areas of poor quality (the beginning and end of the amplimer).
9. Protein Annotation
AutoCSA can annotate the mutation change at the protein level. In order for this option to work, the CDS of the gene must be provided and linked to the amplimer sequence being used. This information is recorded in the amplimer.properties file (Figure 17). For each amplimer name a line indicating which gene CDS it maps to must be provided:-
The actual CDS sequence must also be added to the file:-
Once the relevant information has been added to the amplimer.properties file AutoCSA will then attempt to automatically annotate called mutations.
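To show what annotation at the protein level involves, the sketch below translates the codon containing a substituted base using the standard genetic code and reports the amino acid change. It is a simplified illustration and not AutoCSA's implementation: it assumes the mutation position has already been converted to a 1-based CDS coordinate (the amplimer to CDS mapping is not shown), and the "A2V" style of output and all names are choices made for the example.

```java
/** Hypothetical illustration of annotating a CDS substitution at the protein level. */
class ProteinAnnotationSketch {
    // Standard genetic code, codons ordered with bases T, C, A, G at each position.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /** Translates one codon (upper case T, C, A, G only); '*' denotes a stop codon. */
    static char translateCodon(String codon) {
        int first = BASES.indexOf(codon.charAt(0));
        int second = BASES.indexOf(codon.charAt(1));
        int third = BASES.indexOf(codon.charAt(2));
        if (first < 0 || second < 0 || third < 0) {
            throw new IllegalArgumentException("unsupported base in codon: " + codon);
        }
        return AMINO_ACIDS.charAt(16 * first + 4 * second + third);
    }

    /**
     * @param cds         coding sequence starting at the first base of the start codon
     * @param cdsPosition 1-based position of the substituted base within the CDS
     * @param newBase     the base observed in the mutant trace
     * @return wild type residue, residue number and mutant residue, e.g. "A2V"
     */
    static String annotateSubstitution(String cds, int cdsPosition, char newBase) {
        String upper = cds.toUpperCase();
        int index = cdsPosition - 1;
        int codonStart = (index / 3) * 3;
        if (index < 0 || codonStart + 3 > upper.length()) {
            throw new IllegalArgumentException("position is outside a complete codon");
        }
        String wildCodon = upper.substring(codonStart, codonStart + 3);
        StringBuilder mutantCodon = new StringBuilder(wildCodon);
        mutantCodon.setCharAt(index - codonStart, Character.toUpperCase(newBase));
        char wildResidue = translateCodon(wildCodon);
        char mutantResidue = translateCodon(mutantCodon.toString());
        return "" + wildResidue + (codonStart / 3 + 1) + mutantResidue;
    }

    public static void main(String[] args) {
        // Codon 2 changes from GCC (alanine) to GTC (valine).
        System.out.println(annotateSubstitution("ATGGCC", 5, 'T'));
    }
}
```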
AutoCSA is written in Java and will run under Java 1.4, which must be downloaded and installed on your computer before AutoCSA is installed. It is available from Sun's Java site; see also the installation instructions.
The software is optimised to run on sequence traces generated by ABI3730 sequencers running with 36cm capillaries, RapidSeq-15sec run module and POP7 sequencing buffer. AutoCSA uses the raw sequence channels which are also present in the files generated by the sequencers (.ab1 file extension).
Due to the nature of sequence traces, AutoCSA does not attempt to match the DNA sequence until base 50. We therefore recommend having a buffer of 75 bases either side of the region of interest. The software has been used extensively on amplimers of length 500 bases but has only been minimally tested on longer amplimers.
The software has been tested successfully under Linux, MS Windows XP and Mac OS X 10.4.6.
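The 75 base buffer recommendation can be checked with simple arithmetic before an amplimer is added. The sketch below is a hypothetical helper, not part of AutoCSA, and it assumes the region of interest is given as 1-based, inclusive coordinates on the amplimer.

```java
/** Hypothetical check of the recommended 75 base buffer around the region of interest. */
class RoiBufferSketch {
    static final int RECOMMENDED_BUFFER = 75;

    /**
     * @param amplimerLength length of the amplimer in bases
     * @param roiStart       first base of the ROI (assumed 1-based, inclusive)
     * @param roiEnd         last base of the ROI (assumed 1-based, inclusive)
     */
    static boolean hasRecommendedBuffer(int amplimerLength, int roiStart, int roiEnd) {
        if (roiStart < 1 || roiEnd < roiStart || roiEnd > amplimerLength) {
            throw new IllegalArgumentException("ROI must lie within the amplimer");
        }
        int leftFlank = roiStart - 1;             // bases before the ROI
        int rightFlank = amplimerLength - roiEnd; // bases after the ROI
        return leftFlank >= RECOMMENDED_BUFFER && rightFlank >= RECOMMENDED_BUFFER;
    }

    public static void main(String[] args) {
        // A 500 base amplimer with an ROI of 76-425 leaves exactly 75 bases on each side.
        System.out.println(hasRecommendedBuffer(500, 76, 425));
        // Starting the ROI at base 40 leaves only 39 bases before it.
        System.out.println(hasRecommendedBuffer(500, 40, 425));
    }
}
```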
11. Detailed Algorithm
Please click here for a detailed description of the algorithm.
12. Region of Interest (ROI)
The region of interest serves two functions:
Generally it is anticipated that the first 50 residues of a sequencing reaction will contain high amounts of noise. This noise can affect the quality calculations used in normal selection and cause variants to be lost in flagging.
For this reason we recommend that the ROI should be defined as approximately:
Copyright (c) 2006 Genome Research Ltd.
THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
You are granted a non-exclusive, non-transferable licence to use this software for your own personal and non-commercial purposes only. You shall not lease, license, loan, sell or distribute the software in whole or in part to any third party. You shall not modify, decompile, disassemble, reverse engineer or create derivative works of this software without the prior consent of the authors.