Est_db

The est_db package is a software suite and database system designed to support expressed sequence tag (EST) sequencing projects, and to provide comprehensive bioinformatic analysis of sequenced EST libraries, for gene discovery and other purposes.

The database can hold and efficiently process hundreds of thousands of EST sequences, track the cDNA libraries and clones to which they belong, and store the results of their analysis. Should they be available, large compute farms can be used for the analysis.


New to Est_db?

The manual explains the software and what most parts of the program do.

What does Est_db do?

Extensive bioinformatic analysis can be carried out on the sequenced EST libraries, including similarity (BLAST) searches, protein sequence prediction, and the import of EST clustering and assembly data from external sources. Results are searchable via a web page, with graphic output of the various analyses. One can retrieve information pertaining to a particular cDNA clone or EST read, view EST clustering results, and view graphical representations of BLAST results for the searched EST sequences.

The est_db package is likely to appeal not only to sequencing groups directly engaged in EST sequencing, but also to groups interested in performing bespoke analysis of ESTs that may already be publicly available, in order to support their ongoing research aims. The package is easily extensible, via an API designed specifically to handle ESTs and their analysis. It is open source and is made available free of charge, and, where possible, similarly open-licensed components have been used in its development.

Application at the Sanger Institute

The est_db software package has been developed and used at the Sanger Institute to support the Xenopus tropicalis EST Project - a collaboration between the Sanger Institute and the Wellcome/Cancer Research UK Gurdon Institute in Cambridge. To date, est_db has been used to process the nearly 400,000 ESTs sequenced as part of the project, approximately 305,000 of which passed its quality control (QC) checks and have been submitted to public databanks.

The extensive facilities offered by est_db to analyse large numbers of ESTs have been used for the bioinformatic analysis of these sequenced X. tropicalis EST libraries, facilitating use of the data by the scientific community. This analysis can be viewed and searched live via the X. tropicalis est_db web interface.

Description

The est_db software package consists of three principal components: a relational database back-end (MySQL), a Perl API (EST_DB.pm), and a CGI web script. The MySQL database holds all the information stored in the est_db system, including the EST data itself and details of the cDNA clones and libraries from which the DNA sequences were produced. Also stored are the results obtained from the various bioinformatic programs incorporated into the est_db analysis pipeline (currently WuBLAST, RepeatMasker and ESTScan). EST clustering and sequence assembly results are stored in the database, together with the information required to control the analysis pipeline, and the tracking information necessary for submitting ESTs to public databanks.

All the stored information can be accessed and manipulated in a high-level manner using the object-orientated perl API. This makes it straightforward to implement sophisticated analyses of both the raw EST data and derived analysis. Classes are provided to handle the EST sequence data, EST clustering results, and subsequent BLAST and other analysis of both ESTs and consensus sequences generated from EST clustering. The schema is neutral to the method or package used to cluster and assemble the ESTs, but a database adaptor is provided which can directly extract results from a StackPACK2.1.1 MySQL analysis database.

Web functionality is implemented with a Perl script, using the CGI.pm and GD.pm modules. A set of easily extensible classes (EST_DB::ESTView) is provided as a high-level means to generate and place features on the graphic representations of sequences, allowing the graphic web views to be extended or customised as additional analysis results are added to the pipeline.

The est_db pipeline has features designed to handle job creation and management within the est_db system, with the LSF scheduler being used to execute the underlying tasks. This allows lengthy analysis processes, such as some BLAST searches, which might take days or weeks on a single CPU, to be completed in a few hours. The pipeline splits the whole analysis into a number of smaller jobs, each of which can be executed on a separate CPU or machine in parallel. The pipeline has been tested with more than 300 machines reading and writing concurrently to the MySQL database as analyses are performed. The user can specify various parameters to control the pipeline (job granularity, etc.), allowing the software installation to be customised to the available hardware resources.
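The idea of job granularity can be illustrated with a minimal sketch. This is not est_db code and does not talk to LSF or MySQL; the function name, the parameter ests_per_job and the EST identifiers are all hypothetical, chosen only to show how one large analysis might be divided into independent batches that a scheduler could then run in parallel.

```python
from typing import Iterator, List, Sequence


def split_into_jobs(est_ids: Sequence[str], ests_per_job: int) -> Iterator[List[str]]:
    """Yield consecutive batches of EST identifiers, one batch per job.

    ests_per_job plays the role of a granularity setting (hypothetical name):
    smaller values give more, shorter jobs; larger values give fewer, longer ones.
    """
    if ests_per_job < 1:
        raise ValueError("ests_per_job must be at least 1")
    for start in range(0, len(est_ids), ests_per_job):
        yield list(est_ids[start:start + ests_per_job])


if __name__ == "__main__":
    # Made-up identifiers for illustration; real names would come from the database.
    ids = [f"EST{n:06d}" for n in range(1, 11)]
    for job_number, batch in enumerate(split_into_jobs(ids, ests_per_job=4), start=1):
        print(job_number, batch)
```

With ten identifiers and four per job this produces three batches, the last one shorter than the others. In a real installation the batch size would presumably be tuned to the number of CPUs available and to how long a single search takes.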

Familiarity with the Ensembl API will aid use of the est_db API, as the latter shares many design features with the Ensembl genome annotation system and web browser (www.ensembl.org). The majority of programs and modules are documented with embedded Perl documentation (POD). Additionally, examples of running the pipeline and summaries of the methods available in the principal database adaptor (EST_DB::DB_Adaptor::Sanger) and the ESTView classes are provided in the /doc and /sanger/doc directories (see below).

Licensing conditions

Open source, available free of charge under the terms of the Perl Artistic License.

Package download

The package is available as a single gzipped tar archive. The FTP address is listed under Brief install instructions.

Software requirements

MySQL (tested with server version 3.23.32)

Perl modules

bioperl
  Modules are required from both Bioperl 0.7x (BPLite BLAST parsing) and Bioperl 1.x (OBDA for indexing). Both should be placed in PERL5LIB, ordered so that 0.7 is checked first. www.bioperl.org
DBI
  for MySQL database access
CGI
  for the web script (CGI.pm, a Perl5 CGI library)
GD
  for graphics generation (GD.pm, an interface to the Gd graphics library)
Hum::EMBL
  for the generation of EMBL flat files; available upon request from jgrg@sanger.ac.uk
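Because Perl searches the directories in PERL5LIB in the order they are listed, the ordering requirement for the two Bioperl versions can be sketched as follows. The function name and all directory paths are hypothetical; substitute the locations where the two Bioperl releases are actually installed.

```python
import os


def build_perl5lib(bioperl_07_dir: str, bioperl_1x_dir: str, existing: str = "") -> str:
    """Return a PERL5LIB value with Bioperl 0.7x ahead of Bioperl 1.x.

    Entries already present in `existing` are kept, after the two Bioperl
    directories, with duplicates dropped.
    """
    ordered = [bioperl_07_dir, bioperl_1x_dir]
    for entry in existing.split(os.pathsep):
        if entry and entry not in ordered:
            ordered.append(entry)
    return os.pathsep.join(ordered)


if __name__ == "__main__":
    # Hypothetical install locations, for illustration only.
    value = build_perl5lib(
        "/opt/bioperl-0.7",
        "/opt/bioperl-1.x",
        os.environ.get("PERL5LIB", ""),
    )
    print(f"PERL5LIB={value}")
```

The printed value can then be exported in the shell that runs the est_db scripts.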

Other applications

Brief install instructions

Download the compressed software package (above)
  
ftp://ftp.sanger.ac.uk/pub/EST_data/Xenopus/est_db_software/est_db_06_11_03.tar.gz
  Size : 222321 bytes
  MD5  : b37d57863ef8ab69448b2d28196e1393
  
Download one of the current X. tropicalis EST_DB dumps

Individual libraries clustered separately:
  
ftp://ftp.sanger.ac.uk/pub/EST_data/Xenopus/est_db_dump/X_tropicalis_06_11_03_by_library.tar.gz
  Size : 177244356 bytes
  MD5  : ff137b86ed2e5d1686845967a737c7e7

Global clustering of all libraries:
  
ftp://ftp.sanger.ac.uk/pub/EST_data/Xenopus/est_db_dump/X_tropicalis_06_11_03_global.tar.gz
  Size : 144148585 bytes
  MD5  : 04bda23cd3c837d86d537d38d8a9bf8e

Extract all the files from the archives

Set the PERL5LIB environment variable so that the EST_DB modules can be found,
as well as the other modules mentioned above.

Run scripts/create_est_tables to create a blank EST_DB on your MySQL server
(the script must first be edited to set the MySQL username)

Reload the data with scripts/reload_text_MySQL_EST_DB_dump
(edit the .conf file in the conf/ directory first)
(must be run locally on the server)

Install the CGI script on the web server
(edit the web_config file: MySQL host/user, tmp file location)

Run perldoc on the files to generate a set of script and API documentation.
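Before extracting the archives, it may be worth checking each download against the MD5 value listed with it above. The following is a minimal sketch using only the Python standard library; the function names are illustrative and it assumes the archive has already been saved to the local path you pass in.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks
    so that large dumps do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with a published value,
    ignoring case and surrounding whitespace in the published value."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches_checksum("est_db_06_11_03.tar.gz", "b37d57863ef8ab69448b2d28196e1393") should return True for an intact copy of the software package, assuming the file was saved under that name in the current directory.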

Support

While we hope the software is reused as widely as possible, the time we have to support off-site use and installation is rather limited. Should there be wide demand for the software, we may be able to increase the amount of documentation currently available. Successfully installing the package is likely to require significant Perl and relational database experience.

Software queries should be addressed to transcript@sanger.ac.uk

est_db package documentation notes

sanger/doc/EST_pipeline.txt         running the est_db pipeline
sanger/doc/setup.txt                setting up BLAST dbs & searches

doc/est_db_api.txt                  the est_db perl API
doc/Sanger_DB_Adaptor_methods.txt   DB_Adaptor methods
doc/ESTView_classes.txt             graphical classes and their methods
doc/similarity_search_analysis.txt  using dev/similarity_search_statistics

est_db package directory notes

CGI/                                web scripts
conf/                               files holding MySQL access parameters for scripts
dev/                                development scripts, potentially incomplete or not working
doc/                                documentation
modules/                            location of the EST_DB:: modules
mothball/                           deprecated scripts and other files
run/                                pipeline programmes
scripts/                            core scripts that populate and modify the database
scripts/utils/                      utility scripts
submissions/                        EMBL control and submission scripts
t/                                  test scripts, not all working
sanger/                             Sanger Institute specific scripts
web_config/