Mascot Percolator is a software package that interfaces the database search algorithm Mascot [10612281] with Percolator [17952086], a well performing machine learning algorithm for rescoring database search results.
We have demonstrated it to be amenable to both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures [19338334].
Follow the instructions as provided by SUN. Type "java -version" at the command line to check the installation and the version.
Unzip package.
Extract the files and copy everything from within the java subfolder into the root of the MascotPercolator folder. That should comprise two files: msparser.jar plus the native library for your platform, libmsparserj.so (Linux only) or msparser.dll (Windows only).
You now need to compile Percolator (see README file in Percolator package). On UNIX machines you might need to make Percolator executable after compilation by performing the following command: "chmod u+x percolator". You should then be able to run Percolator with "./percolator". Path to the executable must be specified in the config file as described in the next step.
"java -jar MascotPercolator.jar". For help regarding the installation or execution, feel free to contact James Wright.
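The installation steps above can be sanity-checked with a small shell sketch. The helper name check_tool is hypothetical and not part of the Mascot Percolator package; it only tests whether a command is on the PATH or whether a given file is executable, which is what "java -version" and "chmod u+x percolator" are meant to ensure.

```shell
#!/bin/sh
# check_tool (hypothetical helper): succeeds if the argument is an executable
# regular file or a command that can be found on the PATH.
check_tool() {
    if { [ -f "$1" ] && [ -x "$1" ]; } || command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing or not executable: $1" >&2
        return 1
    fi
}

# Report on the two prerequisites named in the steps above.
check_tool java || echo "hint: install Java, then re-check with: java -version"
check_tool ./percolator || echo "hint: compile Percolator, then run: chmod u+x percolator"
```

The script only reports; it changes nothing on disk, so it can be re-run after each installation step.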
To run Mascot Percolator:
java -cp MascotPercolator.jar cli.MascotPercolator [options...]
java -cp MascotPercolator.jar cli.MascotPercolator -rankdelta 1 -newDat -target 11083 -decoy 11084 -out 11083-11084
Mascot Percolator extracts all necessary data from the Mascot dat file(s), trains Percolator and writes the results to the specified summary file. Mascot Percolator requires a separate target and decoy search, which can be achieved in two ways:
[1] Note: Provided that the Mascot results are in the default results folder as specified in the config file, the 'log ID' is the integer part of the file name of the Mascot result file of interest. Example: if /mascot/results/ is the root folder of the Mascot results and /mascot/results/20090330/F001234.dat is the results file of interest, then the 'log ID' is 1234.
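As an illustration of the note above, the following sketch derives the 'log ID' from the path of a results file. The function name log_id_from_path is hypothetical, and the sketch assumes that result files are always named F<digits>.dat as in the example.

```shell
#!/bin/sh
# log_id_from_path (hypothetical helper): print the integer part of a Mascot
# result file name, e.g. .../F001234.dat gives 1234.
log_id_from_path() {
    name=$(basename "$1" .dat)   # strip directory and extension: F001234
    digits=${name#F}             # strip the leading F: 001234
    # Drop leading zeros but keep a single 0 if the number is all zeros.
    printf '%s\n' "$digits" | sed 's/^0*\([0-9]\)/\1/'
}

log_id_from_path /mascot/results/20090330/F001234.dat
```

The printed value is what would be passed to the -target or -decoy option.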
The queueing system was implemented to distribute the Mascot Percolator processes onto various machines (nodes). Thereby the post processing time can be reduced linearly with the number of machines available.
If you have a Load Sharing Facility (LSF) installed and your nodes have access to the Mascot results files, you are certainly better off using LSF directly.
WARNING: Even though we run this queuing system without any problems in our IT environment, the distributed computing package shall be seen as experimental. Please feel free to send us bug reports.
There are four separate components involved:
java -cp libs/hsqldb.jar org.hsqldb.Server -database.0 file:mascotPercolatorLogDB -dbname.0 mascotPercolatorLog
This example starts up a hsqldb database server.
'database.0' specifies the file where the database is saved; 'dbname.0' specifies the database name. You can connect to this SQL database using the HSQLDB server JDBC driver: 'jdbc:hsqldb:hsql://localhost:9001/mascotpercolatorlog' with user 'sa' and no password.
Please note that user 'sa' has full read/write access.
java -Djava.rmi.server.hostname=yourhost -cp MascotPercolator.jar queue.Server [options ...]
Replace 'yourhost' with the hostname of your machine.
Parameters (replacing the "[options ...]" expression):
Example:
java -Djava.rmi.server.hostname=mascotsrv -cp MascotPercolator.jar queue.Server \
  -dbHost localhost -dbAlias mascotpercolatorlog -htmlStatusFile /mascot/mascot/html/percolator/index.html -port 1198
Please note that Nodes cannot connect to the Mascot queue.Server and will fail unless you specifically allow them to do so. For this, you need to create a file called 'server.policy' before starting the queue.Server and set specific permissions that grant access to local system resources. Please read: http://java.sun.com/developer/onlineTraining/Programming/JDCBook/appA.html. We use the 'AllPermission' setting, but make sure you understand the implications. We do not take any responsibility for your chosen settings.
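A minimal sketch of such a policy file, written from the shell, is shown below. The grant block is standard Java policy file syntax for the 'AllPermission' setting mentioned above; the helper name write_policy is hypothetical. How queue.Server locates the file is not described here, so treat the standard JVM property -Djava.security.policy=server.policy as an assumption to verify in your environment.

```shell
#!/bin/sh
# write_policy (hypothetical helper): write a policy file that grants
# AllPermission to all code. This removes every security restriction, so only
# use it on a trusted network.
write_policy() {
    cat > "$1" <<'EOF'
grant {
    permission java.security.AllPermission;
};
EOF
}

# Usage (commented out so that no file is created by accident):
# write_policy server.policy
```

A narrower grant block that lists only the permissions the nodes need is the safer choice once the setup works.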
java -cp MascotPercolator.jar queue.Node [options ...]
Parameters (replacing the "[options ...]" expression):
Example:
java -cp MascotPercolator.jar queue.Node -copyDat -server mascotsrv
Note: 'copyDat' is currently only supported on UNIX machines. For it to work, run the server and node processes as the same user to avoid file permission issues. If not all nodes are among your known ssh hosts, the server will halt and ask for manual confirmation whenever it connects to a new, unknown node. We set 'StrictHostKeyChecking no' in the ssh config to automatically accept all new hosts. Make sure you understand the implications.
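The ssh setting mentioned above could be added along the following lines. This is only a sketch: the helper name append_ssh_stanza and the host pattern 'node*' are assumptions for illustration, and limiting the setting to a host pattern rather than applying it to all hosts is our suggestion, not something the package requires.

```shell
#!/bin/sh
# append_ssh_stanza (hypothetical helper): append a stanza to an ssh config
# file that turns off host key confirmation for hosts matching a pattern.
append_ssh_stanza() {
    config_file=$1
    host_pattern=$2
    printf 'Host %s\n    StrictHostKeyChecking no\n' "$host_pattern" >> "$config_file"
}

# Usage (commented out so that your ssh config is not changed by accident):
# append_ssh_stanza "$HOME/.ssh/config" 'node*'
```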
java -cp MascotPercolator.jar queue.SubmitJob [options ...]
Parameters (replacing the "[options ...]" expression):
All remaining options are identical to those used when executing Mascot Percolator directly.
Example:
java -cp MascotPercolator.jar queue.SubmitJob -server mascotsrv -user 'markus' -target 12787 -decoy 12789 -out '/tmp/12787-12789'
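When many searches need to be queued, the SubmitJob call above can be generated in a loop. The sketch below reads "target decoy" log ID pairs from standard input and prints one SubmitJob command per pair, using only the options shown in the example. The helper name submit_commands and the "<target>-<decoy>" output naming are assumptions carried over from that example.

```shell
#!/bin/sh
# submit_commands (hypothetical helper): print one queue.SubmitJob command per
# "target decoy" pair read from stdin. Nothing is executed here.
submit_commands() {
    server=$1
    user=$2
    outdir=$3
    while read -r target decoy; do
        # Skip blank or incomplete lines.
        [ -n "$target" ] && [ -n "$decoy" ] || continue
        printf "java -cp MascotPercolator.jar queue.SubmitJob -server %s -user '%s' -target %s -decoy %s -out '%s/%s-%s'\n" \
            "$server" "$user" "$target" "$decoy" "$outdir" "$target" "$decoy"
    done
}

# Print the commands for two pairs of log IDs.
printf '12787 12789\n12865 12866\n' | submit_commands mascotsrv markus /tmp
```

Printing the commands first makes it easy to review them; piping the output to sh would then submit the jobs.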
If you have an LSF queue on your system but no access to the Mascot results files, this queue package is still useful when 'OneShotNodes' are used instead of the standard Nodes. Instead of starting up nodes manually and submitting jobs individually, a OneShotNode takes care of both and can therefore be embedded into a standard LSF command. A OneShotNode has a job associated with it at start-up and, unlike the standard nodes, terminates upon successful completion. The basic command is:
java -cp MascotPercolator.jar queue.OneShotNode [options...]
Options are a superset of queue.SubmitJob and queue.Node.
Example of using OneShotNode as part of a bsub LSF command:
bsub -q long -M7500000 -R'select[mem>7500] rusage[mem=7500]' -o /lustre/log/percolator/9 \
"java -Djava.io.tmpdir=/lustre/temp -cp MascotPercolator.jar queue.OneShotNode -server mascotsrv -serverPort 1198 -copyDat -user mb8 -target 12865 -decoy 12866 -out /lustre/percolator/12865-12866"
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
A variety of methods are described in the literature to assign peptide sequences to observed tandem MS data. Typically, the identified peptides are associated only with an arbitrary score that reflects the quality of the peptide-spectrum match but not with a statistically meaningful significance measure. In this chapter, we discuss why statistical significance measures can simplify and unify the interpretation of MS-based proteomic experiments. In addition, we also present available software solutions that convert scores into sound statistical measures.
Methods in molecular biology (Clifton, N.J.) 2010;604;43-53
PUBMED: 20013363; DOI: 10.1007/978-1-60761-444-9_4
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom.
Sound scoring methods for sequence database search algorithms such as Mascot and Sequest are essential for sensitive and accurate peptide and protein identifications from proteomic tandem mass spectrometry data. In this paper, we present a software package that interfaces Mascot with Percolator, a well performing machine learning method for rescoring database search results, and demonstrate it to be amenable for both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures. Mascot Percolator can be readily used as a stand alone tool or integrated into existing data analysis pipelines.
Funded by: Wellcome Trust: 077198
Journal of proteome research 2009;8;6;3176-81
PUBMED: 19338334; PMC: 2734080; DOI: 10.1021/pr800982s
Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.
Automated methods for assigning peptides to observed tandem mass spectra typically return a list of peptide-spectrum matches, ranked according to an arbitrary score. In this article, we describe methods for converting these arbitrary scores into more useful statistical significance measures. These methods employ a decoy sequence database as a model of the null hypothesis, and use false discovery rate (FDR) analysis to correct for multiple testing. We first describe a simple FDR inference method and then describe how estimating and taking into account the percentage of incorrectly identified spectra in the entire data set can lead to increased statistical power.
Funded by: NCRR NIH HHS: P41 RR11823; NIBIB NIH HHS: R01 EB007057
Journal of proteome research 2008;7;1;29-34
PUBMED: 18067246; DOI: 10.1021/pr700600n
Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.
A variety of methods have been described in the literature for assigning statistical significance to peptides identified via tandem mass spectrometry. Here, we explain how two types of scores, the q-value and the posterior error probability, are related and complementary to one another.
Funded by: NCRR NIH HHS: P41 RR11823; NIBIB NIH HHS: R01 EB007057
Journal of proteome research 2008;7;1;40-4
PUBMED: 18052118; DOI: 10.1021/pr700739d
Department of Genome Sciences, University of Washington, 1705 NE Pacific St., William H. Foege Building, Seattle, Washington 98195, USA.
Shotgun proteomics uses liquid chromatography-tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic Saccharomyces cerevisiae dataset, and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.
Funded by: NCRR NIH HHS: P41 RR011823; NIBIB NIH HHS: R01 EB007057
Nature methods 2007;4;11;923-5
PUBMED: 17952086; DOI: 10.1038/nmeth1113
Imperial Cancer Research Fund, London, UK.
Several algorithms have been described in the literature for protein identification by searching a sequence database using mass spectrometry data. In some approaches, the experimental data are peptide molecular weights from the digestion of a protein by an enzyme. Other approaches use tandem mass spectrometry (MS/MS) data from one or more peptides. Still others combine mass data with amino acid sequence data. We present results from a new computer program, Mascot, which integrates all three types of search. The scoring algorithm is probability based, which has a number of advantages: (i) A simple rule can be used to judge whether a result is significant or not. This is particularly useful in guarding against false positives. (ii) Scores can be compared with those from other types of search, such as sequence homology. (iii) Search parameters can be readily optimised by iteration. The strengths and limitations of probability-based scoring are discussed, particularly in the context of high throughput, fully automated protein identification.
Electrophoresis 1999;20;18;3551-67
PUBMED: 10612281; DOI: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2