CAF

Common assembly format (CAF).

CAF is a text format for describing sequence assemblies.

About

It is acedb-compliant and is an extension of the ace-file format used earlier, but with support for base quality measures and a more extensive description of the Sequence data. CAF was designed during the Sanger sequencing era; its modern-day successor is the SAM format, or its binary equivalents BAM and CRAM.

Downloads

Tools for manipulating CAF files can be downloaded from the Sanger FTP site:

caftools source – Source code for the ‘caftools’ package. This includes a number of tools for manipulating CAF files.

gap2caf source – Source code for the ‘gap2caf’ package. gap2caf converts a gap4 database to CAF format. Building this requires a copy of the Staden package source code.

mini phrap2gap – A collection of Perl scripts that wrap the process of running a sequence assembly with phrap and converting it to a gap4 database. It requires a number of other tools, notably the caftools package above, and phrap itself.

Further information

CAF Overview

CAF is intended to be sufficiently comprehensive that any assembly engine or editor, such as Phrap, Consed, Gap, Acembly or FAK, can derive all the information it needs from the CAF file without reading any other data (except for trace information, which is still held in SCF files). The eventual aim is to be able to freely convert between CAF and any other format so that different assembly programs can be combined. This aim is still some way off because of incompatibilities between these programs. Currently it is possible to convert to and from Phrap and FAK, and into GAP (the reverse is possible with some loss of information).

CAF Version 2 supports some additional tags (Insert_size, Ligation_no) for reads. More importantly, it has support for describing partially-assembled groups of contigs. The basic idea is that an Is_assembly Sequence object is an ordered set of Is_group Contig Group objects, listed by the Group_order tag. In turn, each Contig group is an ordered set of contigs listed by the Contig_order tag. It is thus easy to manipulate groups of contigs (e.g. those believed to be adjacent) without having to use absolute coordinates.
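
As an illustration of this hierarchy, the following Python sketch flattens an assembly into a linear list of contigs by following Group_order and then Contig_order. The object names (Assembly1, Group1, Contig1 and so on) and the dictionary layout are hypothetical, chosen only for this example; they are not taken from a real CAF file or from caftools.

```python
# Hypothetical in-memory view of Group_order / Contig_order lines.
# Keys are object names; values are (name, position) pairs in arbitrary order.
group_order = {
    "Assembly1": [("Group2", 2), ("Group1", 1)],
}
contig_order = {
    "Group1": [("Contig3", 2), ("Contig1", 1)],
    "Group2": [("Contig2", 1)],
}

def ordered_contigs(assembly):
    """Return contig names in assembly order, using only relative positions."""
    result = []
    for group, _ in sorted(group_order[assembly], key=lambda item: item[1]):
        for contig, _ in sorted(contig_order[group], key=lambda item: item[1]):
            result.append(contig)
    return result

print(ordered_contigs("Assembly1"))  # ['Contig1', 'Contig3', 'Contig2']
```

Moving a whole group then only requires changing its Group_order position; no contig coordinates need to be rewritten.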

In CAF files, comments are any text preceded by //

CAF supports three object types, Sequence, DNA and BaseQuality. The Sequence type is the most complex. All base coordinates start at position 1 (NOT 0). These are the important Sequence attributes (others are available in the CAF acedb model but are not currently used by any processing module).

Sequence type

Sequence : "Name" // Name of the Sequence
Is_read | Is_contig | Is_group | Is_assembly // Type
Padded | Unpadded // State
ProcessStatus "State" "Text" // Reads only: Asp pass or failure, with reason.
// "State" can be:
// PASS,
// SVEC (completely seq vector),
// QUAL (poor trace quality),
// CONT (contaminant, eg E.coli)
Asped "Date" // Reads only: Date processed
Dye Dye_terminator | Dye_primer // Reads only: Chemistry
SCF_File "Filename" // Reads only: Name of SCF file containing trace data
Primer Unknown_primer | Universal_primer | Custom "Oligo" // Reads only: primer type,
// including oligo sequence if Custom
Template "Template" // Reads only: Template name
Insert_size x1 x2 // Predicted range [x1,x2] of insert size
Ligation_no "Text" // The Ligation number (ie library identifier) for the read
Strand Forward | Reverse // Reads only: forward or reverse strand
Seq_vec "Type" x1 x2 "Text" // Sequencing vector from position x1 to position x2 inclusive
// "Type" is redundant, set to SVEC by default.
Clone_vec "Type" x1 x2 "Text" // Cloning vector from position x1 to position x2 inclusive
// "Type" is redundant, set to CVEC by default.
Clipping "Type" x1 x2 "Text" // Clipping from position x1 to position x2 inclusive
GoldenPath "Read" x1 x2 // Contigs only: Phrap Golden path:
// Use "Read" between contig coords x1, x2 inclusive
Tag "Type" x1 x2 "Text" // General Tag on interval [x1,x2] (used in GAP)
Assembled_from "Read" s1 s2 r1 r2 // Contigs only: Alignment of Read to contig.
// Interval [r1, r2] in the read align with [s1,s2] in contig.
// If s1 > s2 then align the reverse complement of [r1,r2] with [s1,s2].
Align_to_SCF s1 s2 r1 r2 // Reads only: Alignment of Read to original SCF base-calls.
// Similar to Assembled_from
Group_order Group p1 // Assemblies only: defines Group to be at group position p1 within assembly
Contig_order Contig q1 // Groups only: Defines Contig to be at relative position q1 within group

DNA type

DNA : "Name"                            // Name of sequence
ACGTGCGG......                          // The sequence: Use ACGT, N for unknowns, - for pads.

BaseQuality type

BaseQuality : "Name"                    // Name of sequence
0 12 13 90 ...                          // Base qualities. These must be integers
                                        // between 0 and 99 inclusive. If the Base Quality is present
                                        // it must be the same length as the DNA
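
The three object types can be told apart by their header lines alone. The Python sketch below splits CAF text into objects and checks the BaseQuality rules stated above. It is a simplified illustration, not the caftools parser: the function names are hypothetical, and comment stripping is naive in that it would also cut a // appearing inside a quoted string.

```python
import re

# Header lines look like:  Sequence : "Name"   (the quotes are treated as optional here)
HEADER = re.compile(r'^(Sequence|DNA|BaseQuality)\s*:\s*"?([^"\s]+)"?\s*$')

def parse_caf(text):
    """Split CAF text into {(type, name): [body lines]}."""
    objects = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop comments (ignores quoting)
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            current = (match.group(1), match.group(2))
            objects[current] = []
        elif current is not None:
            objects[current].append(line)
    return objects

def check_quality(objects, name):
    """True if qualities are integers in 0..99 and match the DNA length."""
    dna = "".join(objects[("DNA", name)])
    quals = [int(q) for row in objects[("BaseQuality", name)] for q in row.split()]
    return len(quals) == len(dna) and all(0 <= q <= 99 for q in quals)

# Hypothetical five-base read, for illustration only.
example = '''
Sequence : read1   // made-up read
Is_read
Unpadded
DNA : read1
ACGTN
BaseQuality : read1
0 12 13 90 5
'''
objects = parse_caf(example)
print(objects[("Sequence", "read1")])  # ['Is_read', 'Unpadded']
print(check_quality(objects, "read1"))  # True
```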

Sequence Attributes

Sequence objects in the CAF file are used to represent both reads and contigs (constructed during sequence assembly from reads).

Field      Description
Is_read    The sequence represents a reading
Is_contig  The sequence represents a set of aligned reads

Attributes common to all Sequence objects

An assembly can be in either a padded or unpadded state. In the padded state, padding characters are added to the contig sequence and the reads so that the reads align with one another perfectly. In the unpadded state, the alignment between reads and the contig is described fully.

Field     Description
Padded    The sequence is padded
Unpadded  The sequence is unpadded

Features identified within sequences are annotated using one of the following four fields. In each case, Method is either the name of a program, or some generic name describing the method. We currently use the name of the GAP4 (Bonfield et al, 1996) tag id. The feature is considered to cover bases From through To in the Sequence. Comment is optional. If more than one word, the comment should be surrounded by double quotes.

Field      Attributes              Description
Clone_vec  Method From To Comment  Location of Cloning vector.
Seq_vec    Method From To Comment  Location of Sequencing vector.
Clipping   Method From To          High quality bases.
Tag        Method From To Comment  Some arbitrary feature.

Attributes of reads only

Field              Attributes                     Description
Sequencing_vector  Vector_name                    The name of the sequencing vector used.
Template           Template_name                  The name of the DNA template from which the sequence was determined.
Insert_size        From To                        Minimum and maximum size estimate of insert of DNA template.
SCF_File           File_name                      The name of the SCF file for this reading.
Base_caller        Base_caller_name               The name of the base calling software used to generate read.
Stolen             Comment                        The read was not originally sequenced for the clone
Staden_id          Integer                        The internal read number from the Staden GAP database (gap2caf).
Asped              Date                           The date the read was sequenced/pre-processed.
ProcessStatus      Text                           Preprocessing status (PASS or FAIL).
Align_to_SCF       ReadFrom ReadTo SCFFrom SCFTo  The alignment between the read and the original base calls

Information about the sequencing chemistry used to generate the read is included. One of the following must be specified.

Field           Description
Dye_primer      The read was sequenced using Dye primer chemistry
Dye_terminator  The read was sequenced using Dye terminator chemistry

Orientation of the read with respect to the template is specified by one of the following:

Field    Description
Forward  The read is on the Forward strand of the template
Reverse  The read is on the Reverse strand of the template

The location of the read with respect to the insert is specified by one of the following:

Field             Attributes   Description
Universal_primer               Either a forward or reverse universal primer was used.
Custom            Primer_name  A custom primer was used. The name is optional.

Attributes of contigs only

The alignment of the reading to the contig is described in lines:

Field           Attributes                           Description
Assembled_from  ContigFrom ContigTo ReadFrom ReadTo  The alignment between the contig and the read

Sequence State: Padded vs Unpadded

The alignment of a reading to the contig can be Padded or Unpadded. Padded means that gaps (“-”) have been inserted where required in both contig and aligned readings so that there is a 1-1 correspondence between the aligned DNAs. In a Padded assembly there is exactly one Assembled_from line for each aligned read in a contig, and the DNA objects contain “-” padding characters.

In an unpadded alignment all the pads are removed from the DNA objects and there are multiple Assembled_from lines for each reading in a contig. However, within each Assembled_from line there is a 1-1 correspondence between the (possibly reverse-complemented) read interval [r1,r2] and the contig interval [s1,s2].

Some jobs (notably ones that need to compare reads to the contig consensus sequence) are easier with the padded alignment, while others are better done unpadded. The programs caf_pad and caf_depad allow one to move transparently between padded and unpadded states, without loss of information (well, almost – columns of pads are removed). Note that in a padded alignment with BaseQuality information it is necessary to attach a quality value to each pad to keep the lengths equal. caf_pad currently does this by interpolating the quality values of the surrounding bases.
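
To make the pad-quality step concrete, here is a minimal Python sketch of one possible interpolation. It assumes each pad receives the integer mean of the nearest real base quality on either side; the text above only says that caf_pad interpolates, so the exact rule, the rounding and the function name pad_qualities are assumptions rather than a description of caf_pad itself.

```python
def pad_qualities(padded_dna, qualities, pad="-"):
    """Expand unpadded qualities to the padded length.

    Assumed rule: a pad gets the integer mean of the nearest non-pad
    quality on each side (or the single neighbour at either end).
    """
    values = iter(qualities)
    slots = [None if base == pad else next(values) for base in padded_dna]
    result = []
    for i, quality in enumerate(slots):
        if quality is None:
            left = next((v for v in reversed(slots[:i]) if v is not None), None)
            right = next((v for v in slots[i + 1:] if v is not None), None)
            neighbours = [v for v in (left, right) if v is not None]
            quality = sum(neighbours) // len(neighbours) if neighbours else 0
        result.append(quality)
    return result

print(pad_qualities("AC-GT", [16, 16, 0, 0]))  # [16, 16, 8, 0, 0]
print(pad_qualities("A--C", [20, 10]))         # [20, 15, 15, 10]
```

After this step the BaseQuality list has one value per character of the padded DNA, so the two objects stay the same length.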

Example:

Sequence : hh26e2.s1
Is_read
Unpadded
SCF_File hh26e2.s1SCF
Template hh26e2
Dye Dye_primer
Primer Universal_primer
Strand Forward
ProcessStatus PASS
Align_to_SCF 1 171 1 171
Align_to_SCF 172 595 173 596
Tag DONE 119 119 "AUTO-EDIT: replaced C by t at 119 (double, isolated, strong)"
Tag DONE 145 145 "AUTO-EDIT: replaced N by t at 145 (double, compound, strong)"
Tag DONE 146 146 "AUTO-EDIT: replaced T by g at 146 (double, compound, strong)"
Tag DONE 171 171 "AUTO-EDIT: deleted G at 171 (double, isolated, strong)"
Tag DONE 193 193 "AUTO-EDIT: replaced A by c at 193 (double, isolated, strong)"
Seq_vec SVEC 1 24 "M13mp18"
Clipping QUAL 56 117
Clipping ECLIP 46 241
DNA : hg02b9.s1
CGCTGCAGGTCGACTCTAGAGGCTCCCCTGAGCCGCTGTGGATTGAGGAGGTGAGGCGTG
AGGAGGTGAGGAGTGAGAAGGTCAGGAGGGACGGAGGTGACGAGTGAGGAGGCGAGGTGA
GGCGTTAGGAGGTGGGGAAGTCAGGAGGTGAGTCAGGACCTGAGGAGTCAGGGGGTGAGG
AGTTAGGTGGTCAGGAGTCAGGAGGTGACGAGTTAGGAGGTGGGGCAAGTGAGGAGGTGA
GGAGTGAGGACATGACGAGTGAGTAGGTGAGGAGTCACGGGGGTCAGGAACGTGACAAGG
TTTCGACGTCCAATCCATCGTTCCAGGACCTTCCAGCTTGTGTCCTCTGACAGTGACCTC
ACCTGCCAGGTCTGGCCCTCCTGGCAGGCAAGAGGGCCGGCCGTGGGGGCGGTGGAGGGG
GTGGCCTCCCAGGGGTGAAGTCGGGGGTTGGGCTCCGACCGTCTGGCCACCGTTGGGGGT
GAGCCCGGTGGGAGTGTTGGGGGGG
BaseQuality : hg02b9.s1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0 15 0 0 0 0 0
0 18 17 17 0 18 0 0 0 0 0 0 0 0 0 0 19 0 15 0 0 0 0 18 17 24
22 18 21 18 18 0 15 15 21 23 21 0 0 0 0 0 21 0 0 0 16 19 23 21
0 0 0 0 0 21 21 22 0 0 0 0 0 0 0 0 0 0 0 0 18 18 23 22 23 23
15 15 17 0 23 15 21 22 21 21 21 25 19 23 19 30 18 18 18 27 21
0 0 0 0 0 0 25 18 22 15 0 0 0 15 0 21 17 0 0 0 0 0 18 0 0 0 16
16 21 16 0 0 0 0 0 16 16 0 0 0 0 0 0 15 15 0 0 15 15 0 0 0 0
0 0 0 18 18 21 20 20 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 25 19 19 0 0 0 0 0 0 0 0 0 0 16 0 0
0 0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0

This is an unpadded reading, sequenced with standard dye primer chemistry and a universal primer. The base at position 172 in the SCF trace has been deleted, hence the Align_to_SCF maps read position 171 -> trace position 171, and read position 172 -> trace position 173.

Positions 1 to 24 are sequencing vector. Positions 56 to 117 are high quality, according to the base-quality measures. Positions 46 to 241 align well with the contig consensus.

The read has been autoedited (hence the deleted base) and tags attached to the edited bases.
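
The Align_to_SCF mapping described above can be followed programmatically. The Python sketch below uses the two Align_to_SCF lines from the example read; the function names are hypothetical and the code is only an illustration of the coordinate arithmetic.

```python
# Align_to_SCF lines from the example: (read_from, read_to, scf_from, scf_to)
ALIGN_TO_SCF = [(1, 171, 1, 171), (172, 595, 173, 596)]

def read_to_scf(pos, segments=ALIGN_TO_SCF):
    """Map a 1-based read position to its 1-based SCF base-call position."""
    for read_from, read_to, scf_from, _scf_to in segments:
        if read_from <= pos <= read_to:
            return scf_from + (pos - read_from)
    return None  # position not covered by any segment

def skipped_scf_positions(segments=ALIGN_TO_SCF):
    """SCF positions lying between consecutive segments (deleted base calls)."""
    gaps = []
    for (_, _, _, prev_to), (_, _, next_from, _) in zip(segments, segments[1:]):
        gaps.extend(range(prev_to + 1, next_from))
    return gaps

print(read_to_scf(171))         # 171
print(read_to_scf(172))         # 173
print(skipped_scf_positions())  # [172]
```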

The corresponding contig is:

Sequence : Contig3
Is_contig
Unpadded
Assembled_from hg82b3.s1 7635 7677 1 43
Assembled_from hg82b3.s1 7678 7697 45 64
Assembled_from hg82b3.s1 7698 7703 66 71
.
.
.
Assembled_from hg02b9.s1 7968 7878 1 91
Assembled_from hg02b9.s1 7877 7745 93 225
Assembled_from hg02b9.s1 7744 7694 227 277
Assembled_from hg02b9.s1 7693 7607 279 365
Assembled_from hg02b9.s1 7606 7603 367 370
Assembled_from hg02b9.s1 7601 7599 371 373
Assembled_from hg02b9.s1 7598 7584 375 389
Assembled_from hg02b9.s1 7583 7570 391 404
Assembled_from hg02b9.s1 7569 7470 406 505
.
.
.

The reverse-complemented read hg02b9.s1 positions [1,505] align to [7470,7968] (this is the full alignment, i.e. not clipped back). The individual unpadded subsections align as indicated, e.g. [1,91] with [7878,7968].
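
The following Python sketch shows how a read position can be mapped to a contig position from unpadded Assembled_from lines, including the reverse-complement case where s1 > s2. The segment tuples are copied from the Contig3 listing above; the function name read_to_contig is hypothetical.

```python
# Assembled_from lines for hg02b9.s1: (s1, s2, r1, r2) = contig interval, read interval
HG02B9 = [
    (7968, 7878, 1, 91), (7877, 7745, 93, 225), (7744, 7694, 227, 277),
    (7693, 7607, 279, 365), (7606, 7603, 367, 370), (7601, 7599, 371, 373),
    (7598, 7584, 375, 389), (7583, 7570, 391, 404), (7569, 7470, 406, 505),
]

def read_to_contig(pos, segments):
    """Map a 1-based read position to a contig position, or None if unaligned."""
    for s1, s2, r1, r2 in segments:
        if r1 <= pos <= r2:
            step = 1 if s1 <= s2 else -1  # s1 > s2 means reverse complement
            return s1 + step * (pos - r1)
    return None

print(read_to_contig(1, HG02B9))    # 7968
print(read_to_contig(91, HG02B9))   # 7878
print(read_to_contig(92, HG02B9))   # None (base not aligned to the contig)
print(read_to_contig(505, HG02B9))  # 7470
```

Read positions that fall between segments, such as 92, have no counterpart in the unpadded contig; in the padded form they would sit opposite a pad in the contig.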

After padding the alignment (with caf_pad) we have:

Sequence : hg02b9.s1
Is_read
Padded
SCF_File hg02b9.s1SCF
Template hg02b9
Dye Dye_primer
Primer Universal_primer
ProcessStatus QUAL
Align_to_SCF 1 3 1 3
Align_to_SCF 5 20 4 19
Align_to_SCF 22 37 20 35
Align_to_SCF 39 46 36 43
Align_to_SCF 48 48 44 44
Align_to_SCF 51 74 45 68
Align_to_SCF 76 80 69 73
Align_to_SCF 82 94 74 86
Align_to_SCF 96 108 87 99
Align_to_SCF 110 121 100 111
Align_to_SCF 123 128 112 117
Align_to_SCF 130 199 118 187
Align_to_SCF 201 267 188 254
Align_to_SCF 269 274 255 260
Align_to_SCF 276 278 261 263
Align_to_SCF 280 281 264 265
Align_to_SCF 283 284 266 267
Align_to_SCF 286 290 268 272
Align_to_SCF 292 292 273 273
Align_to_SCF 294 308 274 288
Align_to_SCF 311 312 289 290
Align_to_SCF 314 317 291 294
Align_to_SCF 319 334 295 310
Align_to_SCF 336 336 311 311
Align_to_SCF 338 371 312 345
Align_to_SCF 373 376 346 349
Align_to_SCF 378 398 350 370
Align_to_SCF 400 442 371 413
Align_to_SCF 444 444 414 414
Align_to_SCF 446 446 415 415
Align_to_SCF 448 468 416 436
Align_to_SCF 470 534 437 501
Align_to_SCF 536 539 502 505
Seq_vec SVEC 1 30 "M13mp18"
Clipping QUAL 2 537
DNA hg02b9.s1 539
Strand Forward
Clone 4B5
Asped 25-Mar-1996
DNA : hg02b9.s1
CGC-TGCAGGTCGACTCTAG-AGGCTCCCCTGAGCCG-CTGTGGAT-T--GAGGAGGTGA
GGCGTGAGGAGGTG-AGGAG-TGAGAAGGTCAGG-AGGGACGGAGGTG-ACGAGTGAGGA
G-GCGAGG-TGAGGCGTTAGGAGGTGGGGAAGTCAGGAGGTGAGTCAGGACCTGAGGAGT
CAGGGGGTGAGGAGTTAGG-TGGTCAGGAGTCAGGAGGTGACGAGTTAGGAGGTGGGGCA
AGTGAGGAGGTGAGGAGTGAGGACATG-ACGAGT-GAG-TA-GG-TGAGG-A-GTCACGG
GGGTCAGG--AA-CGTG-ACAAGGTTTCGACGTC-C-AATCCATCGTTCCAGGACCTTCC
AGCTTGTGTCC-TCTG-ACAGTGACCTCACCTGCCAGG-TCTGGCCCTCCTGGCAGGCAA
GAGGGCCGGCCGTGGGGGCGGT-G-G-AGGGGGTGGCCTCCCAGGGGT-GAAGTCGGGGG
TTGGGCTCCGACCGTCTGGCCACCGTTGGGGGTGAGCCCGGTGGGAGTGTTGGG-GGGG
BaseQuality : hg02b9.s1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17 0 0 0 0 0 0
15 0 0 0 0 0 0 18 17 17 0 18 0 0 0 0 0 0 0 0 0 0 0 0 19 0 15
0 0 0 0 18 17 24 1 22 18 21 18 18 0 15 15 21 23 21 0 0 0 0 0
0 21 0 0 0 16 19 23 21 0 0 0 0 0 0 21 21 21 22 0 0 0 0 0 0 0 0
0 0 0 0 18 18 23 22 23 23 15 15 17 0 23 15 21 22 21 21 21 25
19 23 19 30 18 18 18 27 21 0 0 0 0 0 0 25 18 22 15 0 0 0 15 0
21 17 0 0 0 0 0 18 0 0 0 16 16 21 16 8 0 0 0 0 0 16 16 0 0 0
0 0 0 15 15 0 0 15 15 0 0 0 0 0 0 0 18 18 21 20 20 20 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 19
19 9 0 0 0 0 0 0 0 0 0 0 0 0 16 8 0 0 0 0 0 15 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20 0 0 0 0 0 0 0 0 0 0 0 0
0 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15 0 0 0 0 0 0

Note how the Align_to_SCF is now much more complex since it has to take into account all the pads inserted into the DNA. The DNA contains pads “-” and the BaseQuality data has been interpolated at each pad position. The alignment of the read to the contig is now much simpler, however:

Sequence : Contig3
Is_contig
Padded
.
.
Assembled_from hg02b9.s1 8618 8080 1 539
.
.
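
The padded Align_to_SCF lines of the read can be related to the pads in its DNA with a short Python sketch. It walks a padded sequence and emits one segment per run of non-pad bases, assuming the unpadded bases map one-to-one onto the SCF base calls (true for the first segments of this example, but not for reads with deleted base calls). The function name padded_segments is hypothetical and this is not the caf_pad implementation.

```python
def padded_segments(padded_dna, pad="-"):
    """Return (padded_from, padded_to, unpadded_from, unpadded_to) per run of bases."""
    segments = []
    unpadded = 0   # count of real bases seen so far
    start = None   # (padded position, unpadded position) where the current run began
    for i, base in enumerate(padded_dna, start=1):
        if base == pad:
            if start is not None:
                segments.append((start[0], i - 1, start[1], unpadded))
                start = None
        else:
            unpadded += 1
            if start is None:
                start = (i, unpadded)
    if start is not None:
        segments.append((start[0], len(padded_dna), start[1], unpadded))
    return segments

# First 60 characters of the padded DNA for hg02b9.s1 shown above.
first_line = "CGC-TGCAGGTCGACTCTAG-AGGCTCCCCTGAGCCG-CTGTGGAT-T--GAGGAGGTGA"
for segment in padded_segments(first_line):
    print("Align_to_SCF", *segment)
```

The first five segments printed agree with the first five Align_to_SCF lines of the padded read; the last one stops at position 60 only because the input was cut at the end of the first DNA line.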

Further information

The CAF format was based on the AceDB dump format, so it has a corresponding AceDB model.

An example CAF file.