MAPTAG is an informatics tool that annotates batches of unknown sequences to the mouse genome, and assigns gene ID where possible through sequence match to the Ensembl database system. It consists of several linked sequence search modules that are controlled by a management script. Sequences are submitted for analysis in simple FASTA format, and resulting data is generated in tab-delimited text files that can be manually or automatically imported into relational databases or PC-based desktop analysis packages such as Microsoft Excel.
Interpretation of results requires some familiarity with the Ensembl mouse genome database and with the SSAHA sequence analysis and mapping program. The program was designed to handle protein-coding genomic sequences. Although MAPTAG can also be used on potential non-coding sequences, the annotation of such sequences in the Ensembl database is currently incomplete or unavailable, which makes the program less powerful for them. MAPTAG was originally created to annotate short sequences identified by gene trapping (Sanger Institute Gene Trap Resource), but the analysis methods are broadly applicable to most tags.
The package we describe here consists of two main components. The first is a sequence similarity searching phase that uses SSAHA (Ning et al., 2001) to locate and map tag sequences to mouse chromosomes. The second phase takes the matches and queries the Ensembl mouse database system (Hubbard et al., 2002) to identify overlapping gene and EST features. To ensure the best possible specificity, the similarity searches are performed in several tiers and at two quality thresholds (Figure 1). The initial round of analysis is performed using a sequence match-length cut-off of 50 nucleotides. All matches below this length are ignored.
SSAHA employs a hashing algorithm to facilitate extremely fast sequence matching, meaning the entire mouse genome may be searched in only a few seconds. This speed comes at the expense of some sensitivity, and SSAHA is therefore optimised to find near-exact matches. The MAPTAG annotation process first looks for tag matches that are longer than 50bp but will allow several mismatches and insertions. These matches are notionally identified as the "higher-quality" matches. The remaining unmatched sequences are subjected to the same mapping process but with the length cutoff reduced to 28bp. Such short matches may be quite legitimate, since a tag sequence may map in a fragmented way across the boundaries of several exons.
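The general idea behind hash-based sequence matching can be illustrated with a toy k-mer index. This is only a simplified sketch of the technique and not SSAHA's actual implementation; the function names and the choice of indexing every overlapping k-mer are assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in `reference` to the list of positions where it starts.

    Toy illustration only: indexing every overlapping k-mer is an assumption
    made for clarity, not a description of SSAHA's internals.
    """
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_exact_seeds(index, query, k):
    """Return (query_pos, reference_pos) pairs for every k-mer shared exactly."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        # .get() avoids inserting empty entries into the defaultdict
        for rpos in index.get(query[qpos:qpos + k], []):
            seeds.append((qpos, rpos))
    return seeds
```

Because each lookup is a dictionary access, the cost of searching a tag depends mainly on the tag length rather than on the size of the reference, which is why this style of search is fast but favours near-exact matches.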
First, tag sequences are searched against Ensembl mouse exon sequences; matches are recorded and the matching tags are removed from the pool of remaining sequences. Unmatched sequences from this phase are next searched against Ensembl EST gene exons, again with matching sequences removed from further processing. Remaining unmatched tags are searched against intron sequences from Ensembl genes and then from Ensembl EST genes. Any sequences unmatched at this stage are searched against the entire genomic sequence. This step ensures that tags mapping to genes that have not been annotated by the Ensembl pipeline are identified. Finally, all remaining unmatched sequences are submitted to a second iteration of the same search procedure but with the match-length threshold reduced to 28 nucleotides. This lower threshold ensures that short or partial matches to exons are not missed.
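The tiered procedure described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions and not the MAPTAG scripts themselves: the tier labels and the `search` callable are hypothetical stand-ins for the SSAHA searches against each sequence set.

```python
# Tier order and thresholds follow the description in the text.
# The tier labels themselves are hypothetical names chosen for this sketch.
TIERS = ["exon", "est_exon", "intron", "est_intron", "genomic"]
THRESHOLDS = [50, 28]  # match-length cut-offs in nucleotides


def run_cascade(tags, search):
    """Assign each tag to the first tier and threshold at which it matches.

    tags   -- dict mapping tag name to sequence
    search -- hypothetical callable: search(tier, sequence, min_length)
              returning a (possibly empty) list of hits
    Returns (assigned, remaining): assigned maps tag name to
    (tier, min_length, hits); remaining holds tags that never matched.
    """
    assigned = {}
    remaining = dict(tags)
    for min_length in THRESHOLDS:
        for tier in TIERS:
            unmatched = {}
            for name, seq in remaining.items():
                hits = search(tier, seq, min_length)
                if hits:
                    assigned[name] = (tier, min_length, hits)
                else:
                    unmatched[name] = seq
            # Only tags without a hit move on to the next tier
            remaining = unmatched
    return assigned, remaining
```

Removing matched tags before moving to the next tier means each tag is reported at the most specific level at which it can be placed.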
The output of the various search stages is processed automatically to identify any annotation on the matching sequence region that resides in the Ensembl mouse database. Matches to exons and introns are linked back to annotated genes, or novel predictions, along with other information such as Swissprot, Refseq and SpTrembl identifiers. Matches to Ensembl EST gene exons are mapped back to Ensembl genes, where possible, by exonic overlaps. MAPTAG output is in tabular form, listing all hits at each level of analysis for a specific tag and giving detailed information on all exon matches returned.
MAPTAG analysis modules are written in Perl and managed by a shell script, and can be run (or adapted to run) on any UNIX-based system. The package requires a local installation of the Ensembl software, although the Ensembl databases may be accessed remotely, which makes it unnecessary to download and install large local database copies. Ensembl software can be used to export FASTA files of the necessary Ensembl and Ensembl EST exons and introns. A client/server version of SSAHA is available for download from the Sanger Institute website.
MAPTAG is implemented as a simple UNIX shell script that executes a series of Perl or shell scripts and passes data from one to the other via intermediate disk files. It should therefore run under most types of UNIX for which C-shell, Perl and SSAHA binaries are available. Data are supplied in the form of FASTA files, and a series of tab-delimited text output files is generated that is suitable for importing directly into database tables or other analysis applications such as Excel. The MAPTAG files are available via FTP and include instructions for installing the system locally.
Figure 1. The MAPTAG pipeline. See main text for a description of the components.
Overview: MAPTAG generates a series of text-based files that correspond to each level of informatic analysis. The program generates two sets of text files, one at a match-length threshold of 50 bases and the second at 28 bases. The files are named according to which level of analysis they represent (e.g. Core_Hits.txt or EST_Overlap.txt). These files are compatible with any spreadsheet or database program that can import delimited text files, such as Excel, Access, or MySQL.
Importing files into analysis program: The text files are tab-separated and should open readily in most programs. The specific import method depends on the chosen analysis program; consult that program's user manual if it is unclear how to import a text-based file.
Data analysis - Primary Tables: All tables are organized around the original sequence name provided for the FASTA sequence. The column headings for all tables, with the exception of EST_Overlap, are as follows:
Cell Line | Chromosome | Ensembl Gene ID | Strand | Class | RefSeq Accession ID | Swissprot | SpTrembl | Gene Start | Gene End | Ensembl Exon ID | Exon Start | Exon End | Hit Start | Hit End | Length | Sense | SSAHA Score
Core_Hits: The core hits table is made up of sequences which match at high confidence to a single, unique gene based on exon matches within the gene. Multiple results are given for sequences that match more than one exon within the gene, with each entry representing a unique exon of the same gene. Filtering the list for unique entries in the sequence name column gives a list of genes hit. Note: attention should be given to the sense column. Negative signs (-) in this column represent antisense hits, found on the opposite DNA strand to the gene listed.
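As an illustration of working with a table such as Core_Hits, the following sketch reads tab-delimited rows, collects the genes hit by each tag, and picks out antisense hits. It assumes that the file has no header row and that its columns appear in the order of the headings listed above; both points are assumptions and should be checked against real output. The function names are hypothetical.

```python
import csv

# Assumed column order, taken from the headings listed for the primary tables.
COLUMNS = [
    "Cell Line", "Chromosome", "Ensembl Gene ID", "Strand", "Class",
    "RefSeq Accession ID", "Swissprot", "SpTrembl", "Gene Start", "Gene End",
    "Ensembl Exon ID", "Exon Start", "Exon End", "Hit Start", "Hit End",
    "Length", "Sense", "SSAHA Score",
]


def read_hits(handle):
    """Read tab-delimited rows from an open file into a list of dicts.

    Assumes no header row is present (an assumption, not confirmed by the
    documentation); blank lines are skipped.
    """
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def genes_per_tag(rows):
    """Collapse per-exon rows into the set of gene IDs hit by each tag."""
    genes = {}
    for row in rows:
        genes.setdefault(row["Cell Line"], set()).add(row["Ensembl Gene ID"])
    return genes


def antisense_rows(rows):
    """Return rows whose Sense column carries a negative sign."""
    return [row for row in rows if row["Sense"].strip().startswith("-")]
```

A typical use would be `rows = read_hits(open("Core_Hits.txt", newline=""))` followed by `genes_per_tag(rows)`.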
EST_Hits: This table is made up of sequences which match at high confidence to a single, unique EST. General processing is the same as for core hits.
Intron_Hits: This table represents high confidence matches made to single intronic genomic loci. The gene ID given is to the gene into which the intron falls. The exon column gives the two exons that the intron bridges.
Genomic_Hits: Matches to genomic loci where no gene or EST features are present in the Ensembl database. Further analysis can be done on this category to link the original sequence to genes through BLAST, dbEST, and other bioinformatics tools. Sequence tags that hit preferentially in the 5' UTR of genes will often fall in this category due to current limitations in the annotation of the Ensembl database.
Core_Multi, EST_multi, Intron_multi, Genomic_multi: The multi hit tables represent cases where the sequence tag matches multiple locations in the Ensembl database. The format is the same as the Core table, except with multiple genes as well as multiple exons. In order to resolve these matches into one correct hit, the gene with the largest number of tagged exons should be selected, as analysis has demonstrated that other measures of match quality, such as the total length of exon coverage, can bias the gene selection in favour of pseudogenes.
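The selection rule for multi hit tables can be sketched as follows. This is an illustrative helper rather than part of MAPTAG; the function name and input layout are hypothetical, and returning no choice when two genes tie on exon count is an assumption, since the text does not say how ties should be handled.

```python
from collections import defaultdict


def pick_gene_by_exon_count(rows):
    """Choose, for each tag, the gene with the most distinct tagged exons.

    rows -- iterable of (tag, gene_id, exon_id) tuples taken from a multi
            hit table (hypothetical input layout for this sketch)
    Returns a dict mapping tag to gene_id, or to None when the top exon
    count is shared by several genes (tie handling is an assumption).
    """
    exons = defaultdict(lambda: defaultdict(set))
    for tag, gene, exon in rows:
        exons[tag][gene].add(exon)

    chosen = {}
    for tag, by_gene in exons.items():
        top_count = max(len(exon_ids) for exon_ids in by_gene.values())
        tied = [gene for gene, exon_ids in by_gene.items()
                if len(exon_ids) == top_count]
        chosen[tag] = tied[0] if len(tied) == 1 else None
    return chosen
```

Counting distinct exons, rather than summing matched length, follows the recommendation above and avoids favouring a pseudogene that matches the tag as a single long stretch.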
EST_Overlap: The results in this table are generated by the EST_Overlap module, which takes hits in the EST category and compares EST exonic coverage with gene exonic coverage to link an Ensembl EST transcript to a gene. All entries in this table correspond to a detailed entry in the EST_hits table. The column headings for this table are as follows:
Cell Line | Ensembl EST ID | Strand | Ensembl Gene ID | Overlap Start | Overlap End | Strand
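The exonic overlap comparison can be illustrated with a small interval calculation. This sketch is not the EST_Overlap module itself; the function names are hypothetical, and it assumes inclusive start and end coordinates on the same chromosome.

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Return the (start, end) shared by two inclusive intervals, or None."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return (start, end) if start <= end else None


def exonic_overlaps(est_exons, gene_exons):
    """List every overlap between EST exons and gene exons.

    Both arguments are lists of (start, end) tuples in the same coordinate
    system (an assumption of this sketch).
    """
    overlaps = []
    for est_start, est_end in est_exons:
        for gene_start, gene_end in gene_exons:
            shared = interval_overlap(est_start, est_end, gene_start, gene_end)
            if shared is not None:
                overlaps.append(shared)
    return overlaps
```

An EST transcript with at least one such overlap on a gene's exons is a candidate for being linked to that gene; strand should also be compared, since the table reports a strand for both features.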
Data analysis - Summary: Using the output from the various levels of MAPTAG analysis, the user can build a comprehensive set of matches, with a known annotation method and confidence level for each. Ensembl gene IDs can be used to link Ensembl data with information contained in other databases.
For questions and comments concerning the scripts, programming, and implementation:
Tony Cox (avc@sanger.ac.uk)
For questions and comments concerning MAPTAG interpretation and general bioinformatics issues:
Alex Nord (an1@sanger.ac.uk)
For more information on the Sanger Institute Gene Trap Resource:
info.genetrap@sanger.ac.uk
Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK Tel:+44 (0)1223 834244
Last Modified Fri Jan 19 23:24:45 2007