Follow the yellow brick road: agp files for the worm

Introduction

The agp (A Golden Path) file format maps a path through a tiling set of genome sequences based on their accession and sequence version in the public nucleotide databases (EMBL and GenBank).
  • File format
  • Generating agp files
  • Validating agp files
  • Documentation format

File format

agp files are simple flat files with the following structure:

Chromosome  Chromosome coordinates  Ordinal  Type  EMBL data    EMBL accession coordinates  Strand
            Start    Stop                                       Start    Stop
X               1    2649           1        F     AL031272.2       1    2649               +
X            2650   63549           2        N
X           63550   93313           3        F     Z83097.1         1   29764               +

The chromosome is the chromosome to which this agp file relates.
The chromosome coordinates are non-overlapping absolute coordinates for each DNA sequence in the tiling path.
The ordinal gives the order of the DNA sequences in the tiling path. It starts at 1 and increments by one for each row, up to the total number of rows in the path.
The type denotes whether the row is a finished sequence ('F') or a padding gap ('N' in the example above).
The EMBL data is the accession number and sequence version in the EMBL nucleotide database, written as accession.version.
The EMBL accession coordinates are relative coordinates within the named EMBL entry.
The strand is the strand alignment for this tile in the path (usually '+' for forward).
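
As an illustration of the column layout described above, the following is a minimal Python sketch of a parser for the example rows. It is not part of the WormBase build scripts: the names AgpRow and parse_agp_line are hypothetical, and the sketch assumes whitespace-separated columns and that gap rows carry no EMBL data, as in the example.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    """One row of an agp file (hypothetical container, for illustration only)."""
    chromosome: str
    chrom_start: int
    chrom_stop: int
    ordinal: int
    row_type: str                # 'F' for finished sequence, 'N' for a gap
    accession: Optional[str]     # None for gap rows
    version: Optional[int]
    acc_start: Optional[int]
    acc_stop: Optional[int]
    strand: Optional[str]


def parse_agp_line(line: str) -> AgpRow:
    """Parse one whitespace-separated agp row."""
    fields = line.split()
    chrom, start, stop, ordinal, row_type = fields[:5]
    if row_type == "F":
        acc_ver, acc_start, acc_stop, strand = fields[5:9]
        # EMBL data is written as accession.version, e.g. AL031272.2
        accession, version = acc_ver.rsplit(".", 1)
        return AgpRow(chrom, int(start), int(stop), int(ordinal), row_type,
                      accession, int(version), int(acc_start), int(acc_stop), strand)
    # Gap rows: only the chromosome coordinates, ordinal and type are read.
    return AgpRow(chrom, int(start), int(stop), int(ordinal), row_type,
                  None, None, None, None, None)


EXAMPLE = """\
X 1 2649 1 F AL031272.2 1 2649 +
X 2650 63549 2 N
X 63550 93313 3 F Z83097.1 1 29764 +"""

rows = [parse_agp_line(line) for line in EXAMPLE.splitlines()]
for row in rows:
    print(row)
```

For every 'F' row the length on the chromosome (chrom_stop - chrom_start + 1) should equal the length within the EMBL entry (acc_stop - acc_start + 1), which gives a quick consistency check on a parsed file.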


Generating agp files

The agp files are constructed using the tiling path of genome sequences from autoace. The autoace GFF dumps are split using GFFsplitter. One of the resulting files, CHROMOSOME_*.clone_path.gff, is processed by the script GFF_with_accessions to produce the GFF file CHROMOSOME_*.clone_acc.gff. This file contains the absolute coordinates for each clone (tile), strand information and data pertaining to the relevant EMBL entry:
CHROMOSOME_X  Genomic_canonical  Sequence      1   2649  .  +  .  Sequence "CTEL7X" acc=AL031272 ver=2
CHROMOSOME_X  Genomic_canonical  Sequence  63550  93313  .  +  .  Sequence "AC8"    acc=Z83097   ver=1
make_agp_files.pl reads this file and constructs the non-overlapping 'golden' tile path. Each clone 'tile' contributes sequence from its base 1 up to the last non-redundant base before the overlap with the next clone. This process continues until the last clone in the tiling path, which is included in its entirety. Note: this means that the relative coordinates for the sequence extracted from each clone always begin at position 1 and continue to the last unique base before the overlap with the clone to the right. Compare the GFF file above with the agp file in the File format section.
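
The construction described above can be sketched in Python as follows. This is an illustration of the rule, not the code of make_agp_files.pl: the function name golden_path and the tuple layout are hypothetical, clones are assumed to be sorted by start coordinate and on the forward strand, and a gap row is assumed to be emitted wherever two neighbouring clones do not meet, as in the example agp file.

```python
def golden_path(clones):
    """Build agp-style rows from tiling-path clones.

    clones: list of (start, stop, accession, version, strand) tuples in
    absolute chromosome coordinates, sorted by start (assumed input layout).
    Returns tuples of (type, chrom_start, chrom_stop, ordinal, ...).
    """
    rows = []
    ordinal = 0
    for i, (start, stop, acc, ver, strand) in enumerate(clones):
        if i + 1 < len(clones):
            next_start = clones[i + 1][0]
            # Stop at the last base before the overlap with the next clone.
            tile_stop = min(stop, next_start - 1)
        else:
            next_start = None
            tile_stop = stop  # the last clone is included in its entirety
        ordinal += 1
        # Relative coordinates always begin at position 1 of the clone.
        rows.append(("F", start, tile_stop, ordinal, f"{acc}.{ver}",
                     1, tile_stop - start + 1, strand))
        if next_start is not None and next_start > tile_stop + 1:
            # Neighbouring clones do not meet: pad with a gap row.
            ordinal += 1
            rows.append(("N", tile_stop + 1, next_start - 1, ordinal))
    return rows


# The two clones from the GFF excerpt above.
clones = [
    (1, 2649, "AL031272", 2, "+"),
    (63550, 93313, "Z83097", 1, "+"),
]
for row in golden_path(clones):
    print(*row)
```

Run on the two GFF records shown above, the sketch yields the same three rows as the example in the File format section: a finished tile, a gap from 2650 to 63549, and a second finished tile with relative coordinates 1 to 29764.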

 
 
 
 
Figure 1 - How to construct the golden path from the tiling path of genome sequences. Each 'tile' is included in the consensus from base 1 to the last base not overlapping the clone to the right.
 
 
 
 


Validating agp files

The agp files are validated using the agp2dna.pl script. This reads the agp file and extracts the sequence from the current EMBL entry (current means the one presently indexed by SRS at the Sanger; EMBLNEW takes precedence over EMBL). The entire chromosome consensus sequence is reconstructed and written to the file CHROMOSOME_*.agp.seq. If there are no discrepancies between this file and the DNA file dumped from ACEDB, CHROMOSOME_*.dna, then there will be no error messages in the log file, CHROMOSOME_*.agp_seq.log. If errors occurred, they are caused by asynchrony between ACEDB and EMBL and must be dealt with elsewhere. [dl This information should be included in the release letter.]
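
The validation idea can be sketched in Python as below. This does not reproduce agp2dna.pl: the function names are hypothetical, a plain dictionary stands in for the lookup of EMBL entries through SRS, only forward-strand tiles are handled, and filling gap rows with 'n' is an assumption about how gaps appear in the dumped DNA.

```python
def rebuild_chromosome(agp_rows, sequences, gap_char="n"):
    """Reconstruct a chromosome sequence from agp rows.

    agp_rows:  list of dicts with keys type, chrom_start, chrom_stop and,
               for 'F' rows, acc_ver, acc_start, acc_stop (assumed layout).
    sequences: dict mapping 'accession.version' to the entry sequence,
               standing in for a fetch from EMBL.
    """
    parts = []
    for row in agp_rows:
        if row["type"] == "F":
            seq = sequences[row["acc_ver"]]
            # agp coordinates are 1-based and inclusive.
            parts.append(seq[row["acc_start"] - 1:row["acc_stop"]])
        else:
            gap_len = row["chrom_stop"] - row["chrom_start"] + 1
            parts.append(gap_char * gap_len)
    return "".join(parts)


def first_mismatch(a, b):
    """Return the 1-based position of the first difference, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i + 1
    if len(a) != len(b):
        return min(len(a), len(b)) + 1
    return None


# Toy data (invented for the example, not real EMBL entries).
sequences = {"A.1": "acgtacgt", "B.1": "ttttgggg"}
agp_rows = [
    {"type": "F", "chrom_start": 1, "chrom_stop": 5,
     "acc_ver": "A.1", "acc_start": 1, "acc_stop": 5},
    {"type": "N", "chrom_start": 6, "chrom_stop": 7},
    {"type": "F", "chrom_start": 8, "chrom_stop": 15,
     "acc_ver": "B.1", "acc_start": 1, "acc_stop": 8},
]
rebuilt = rebuild_chromosome(agp_rows, sequences)
reference = "acgtannttttgggg"  # stands in for the CHROMOSOME_*.dna dump
print(first_mismatch(rebuilt, reference))
```

A result of None corresponds to an empty log file. Any reported position points at the first base where the sequence rebuilt from EMBL and the sequence dumped from ACEDB disagree.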


Documentation format

This guide uses the same nomenclature as the build guide.

Text in red indicates either the name of a script/program to run, or commands to be typed within interactive tace sessions. Where script names also exist as hyperlinks, they can be clicked to access the POD documentation for that script.

Text in blue represents comments/warnings that should be checked. Some of these may only be temporary comments and should possibly be removed if no longer valid/relevant.

Text in green refers to file or path names.



Last Modified Wed Oct 15 14:47:22 2003
