Mascot Percolator is a software package that interfaces the database search algorithm Mascot [10612281] with Percolator [17952086], a well performing machine learning algorithm for rescoring database search results.
We have demonstrated it to be amenable to both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures [19338334] [22493177].
Follow the instructions as provided by SUN.
Type "java -version" at the command line to check the installation and the version.
Unzip package.
Extract the files and copy everything from within the java subfolder into the root of the MascotPercolator folder. That should comprise two files: msparser.jar plus either libmsparserj.so (Linux) or msparser.dll (Windows).
You now need to compile Percolator (see the README file in the Percolator package). On UNIX machines you might need to make
Percolator executable after compilation by running the following command: "chmod u+x percolator".
You should then be able to run Percolator with "./percolator". The path to the executable must be
specified in the config file as described in the next step.
java -jar MascotPercolator.jar".
For help regarding the installation or execution, feel free to contact James Wright.
To run Mascot Percolator:
java -cp MascotPercolator.jar cli.MascotPercolator [options...]
java -cp MascotPercolator.jar cli.MascotPercolator -rankdelta 1 -newDat -u -target 11083 -decoy 11084 -out 11083-11084
Mascot Percolator extracts all necessary data from the Mascot dat file(s), trains Percolator and writes the results to the specified summary file. Mascot Percolator requires a separate target and decoy search, which can be achieved in two ways:
[1] Note: Provided the Mascot results are in the default results folder as specified in the config file, the 'log ID' is the integer part of the name of the Mascot result file of interest. Example: given /mascot/results/ is the root folder of the Mascot results and /mascot/results/20090330/F001234.dat is the results file of interest, then the 'log ID' would be 1234.
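The log ID rule in the note above can be sketched as a small shell function. This is only an illustration: the function name log_id_from_dat is hypothetical, and it assumes result files are always named F followed by a zero-padded integer and the .dat extension, as in the example.

```shell
# Derive the Mascot 'log ID' from a result file path.
# Assumption: the file name has the form F<zero-padded integer>.dat.
log_id_from_dat() {
    id="${1##*/}"      # strip directories   -> F001234.dat
    id="${id%.dat}"    # strip the extension -> F001234
    id="${id#F}"       # strip the leading F -> 001234
    # Reject anything that is not purely digits at this point
    case "$id" in
        ''|*[!0-9]*) echo "not a Mascot result file: $1" >&2; return 1 ;;
    esac
    # Remove leading zeros, keeping at least one digit
    while [ "${#id}" -gt 1 ] && [ "${id#0}" != "$id" ]; do
        id="${id#0}"
    done
    printf '%s\n' "$id"
}
```

Called as log_id_from_dat /mascot/results/20090330/F001234.dat, the function prints 1234, which is the value passed to -target or -decoy.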
The queueing system was implemented to distribute the Mascot Percolator processes onto various machines (nodes). Thereby the post processing time can be reduced linearly with the number of machines available.
If you have a Load Sharing Facility (LSF) installed and your nodes have access to the Mascot results files, you are certainly better off using LSF directly.
WARNING: Even though we run this queuing system without any problems in our IT environment, the distributed computing package should be considered experimental. Please feel free to send us bug reports.
There are four separate components involved:
java -cp libs/hsqldb.jar org.hsqldb.Server -database.0 file:mascotPercolatorLogDB -dbname.0 mascotPercolatorLog
This example starts up an HSQLDB database server.
'database.0' specifies the file where the database is saved.
'dbname.0' specifies the database name.
You can connect to this SQL database using the HSQLDB server JDBC driver:
'jdbc:hsqldb:hsql://localhost:9001/mascotpercolatorlog' with user 'sa' and no password.
Please note that user 'sa' has full read/write access.
java -Djava.rmi.server.hostname=yourhost -cp MascotPercolator.jar queue.Server [options ...]
Replace 'yourhost' with the hostname of your machine.
Parameters (replacing the "[options ...]" expression):
Example:
java -Djava.rmi.server.hostname=mascotsrv -cp MascotPercolator.jar queue.Server \
-dbHost localhost -dbAlias mascotpercolatorlog -htmlStatusFile /mascot/mascot/html/percolator/index.html -port 1198
Please note that nodes cannot connect to the queue.Server and will fail unless you specifically allow them
to do so. To do this, create a file called 'server.policy' before
starting the queue.Server and set permissions that grant access to local system resources. Please read:
http://java.sun.com/developer/onlineTraining/Programming/JDCBook/appA.html.
We use the 'AllPermission' setting, but make sure you understand the implications. We do not take any responsibility
for your chosen settings.
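As a minimal sketch of the 'AllPermission' setting mentioned above, the following shell function writes a fully permissive policy file. The function name write_server_policy is hypothetical. The grant block uses standard Java policy file syntax, but it removes all security restrictions, so treat it as a starting point to be narrowed rather than a recommended configuration.

```shell
# Write a fully permissive Java policy file to the given path
# (default: server.policy in the current directory).
write_server_policy() {
    policy_file="${1:-server.policy}"
    cat > "$policy_file" <<'EOF'
// Grants every permission to all code: convenient but insecure.
grant {
    permission java.security.AllPermission;
};
EOF
}

# Possible usage (assumption: the policy is handed to the JVM through the
# standard java.security.policy system property; the original text does not
# state how queue.Server locates the file):
#   write_server_policy server.policy
#   java -Djava.security.policy=server.policy -Djava.rmi.server.hostname=yourhost \
#        -cp MascotPercolator.jar queue.Server [options ...]
```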
java -cp MascotPercolator.jar queue.Node [options ...]
Parameters (replacing the "[options ...]" expression):
Example:
java -cp MascotPercolator.jar queue.Node -copyDat -server mascotsrv
Note: 'copyDat' is currently only supported on UNIX machines. For this to work, make sure you run
the server and node processes as the same user to avoid file permission issues. If the ssh host keys of some nodes
are not yet known, the server will halt and ask for manual confirmation whenever it connects to a new, unknown node.
We set 'StrictHostKeyChecking no' in the ssh config to automatically accept all new hosts. Make sure you
understand the implications.
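The ssh setting described above could be applied with a sketch like the following. The function name add_relaxed_host_block and the host pattern in the usage comment are hypothetical. Restricting the option to a host pattern for your own nodes, rather than setting it globally, is a choice made here to limit the exposure; it is not something the original setup specifies.

```shell
# Append a host block that disables strict host key checking to an ssh config file.
# $1: host pattern for the compute nodes, $2: path of the ssh config file.
add_relaxed_host_block() {
    host_pattern="$1"
    config_file="$2"
    # Do nothing if a block for exactly this pattern already exists
    if [ -f "$config_file" ] && grep -qxF "Host $host_pattern" "$config_file"; then
        return 0
    fi
    printf 'Host %s\n    StrictHostKeyChecking no\n' "$host_pattern" >> "$config_file"
}

# Possible usage (the pattern is a placeholder for your own node names):
#   add_relaxed_host_block 'node*.example.org' "$HOME/.ssh/config"
```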
java -cp MascotPercolator.jar queue.SubmitJob [options ...]
Parameters (replacing the "[options ...]" expression):
All remaining options are identical to those used when executing Mascot Percolator directly.
Example:
java -cp MascotPercolator.jar queue.SubmitJob -server mascotsrv -user 'markus' -target 12787 -decoy 12789 -out '/tmp/12787-12789'
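When many target/decoy searches need to be queued, the SubmitJob command can be generated in a loop. The sketch below only prints the commands so that they can be inspected first. The function name print_submit_commands and the input format (one "target decoy" pair of log IDs per line) are assumptions for illustration; the flags and the target-decoy output naming follow the example above.

```shell
# Print one queue.SubmitJob command per "target decoy" pair read from stdin.
# $1: queue server host, $2: user name, $3: output directory.
print_submit_commands() {
    server="$1"
    user_name="$2"
    out_dir="$3"
    while read -r target decoy; do
        # Skip blank or incomplete lines
        [ -n "$target" ] && [ -n "$decoy" ] || continue
        printf 'java -cp MascotPercolator.jar queue.SubmitJob -server %s -user %s -target %s -decoy %s -out %s/%s-%s\n' \
            "$server" "$user_name" "$target" "$decoy" "$out_dir" "$target" "$decoy"
    done
}

# Possible usage (pairs.txt is a hypothetical file of log ID pairs):
#   print_submit_commands mascotsrv markus /tmp < pairs.txt        # inspect
#   print_submit_commands mascotsrv markus /tmp < pairs.txt | sh   # submit
```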
If you have an LSF queue implemented on your system, but no access to the Mascot results files, this queue package is still useful if you use 'OneShotNodes' instead of the standard nodes. Instead of starting up nodes manually and submitting jobs individually, a OneShotNode takes care of both and can thereby be embedded into a standard LSF command. A OneShotNode has a job associated with it upon start-up and, unlike the standard nodes, terminates upon successful completion. The basic command is:
java -cp MascotPercolator.jar queue.OneShotNode [options...]
Options are a superset of queue.SubmitJob and queue.Node.
Example of using OneShotNode as part of a bsub LSF command:
bsub -q long -M7500000 -R'select[mem>7500] rusage[mem=7500]' -o /lustre/log/percolator/9 \
"java -Djava.io.tmpdir=/lustre/temp -cp MascotPercolator.jar queue.OneShotNode -server mascotsrv -serverPort 1198 -copyDat -user mb8 -target 12865 -decoy 12866 -out /lustre/percolator/12865-12866"
To run the experimental GUI wizard:
java -cp MascotPercolator.jar cli.MascotPercolator -gui
This feature is still under development and is not fully supported. It is available in Mascot Percolator v2.02; however, it is likely to have some bugs. Please try it out; feedback is much appreciated.
Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridge.
Peptide identification using tandem mass spectrometry is a core technology in proteomics. Latest generations of mass spectrometry instruments enable the use of electron transfer dissociation (ETD) to complement collision induced dissociation (CID) for peptide fragmentation. However, a critical limitation to the use of ETD has been optimal database search software. Percolator is a post-search algorithm, which uses semi-supervised machine learning to improve the rate of peptide spectrum identifications (PSMs) together with providing reliable significance measures. We have previously interfaced the Mascot search engine with Percolator and demonstrated sensitivity and specificity benefits with CID data. Here, we report recent developments in the Mascot Percolator V2.0 software including an improved feature calculator and support for a wider range of ion series. The updated software is applied to the analysis of several CID and ETD fragmented peptide data sets. This version of Mascot Percolator increases the number of CID PSMs by up to 80% and ETD PSMs by up to 60% at a 0.01 q-value (1% false discovery rate) threshold over a standard Mascot search, notably recovering PSMs from high charge state precursor ions. The greatly increased number of PSMs and peptide coverage afforded by Mascot Percolator has enabled a fuller assessment of CID/ETD complementarity to be performed. Using a data set of CID and ETcaD spectral pairs, we find that at a 1% false discovery rate, the overlap in peptide identifications by CID and ETD is 83%, which is significantly higher than that obtained using either stand-alone Mascot (69%) or OMSSA (39%). We conclude that Mascot Percolator is a highly sensitive and accurate post-search algorithm for peptide identification and allows direct comparison of peptide identifications using multiple alternative fragmentation techniques.
Funded by: Wellcome Trust: 079643/Z/06/Z
Molecular & cellular proteomics : MCP 2012;11;8;478-91
PUBMED: 22493177; PMC: 3412976; DOI: 10.1074/mcp.O111.014522
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.
A variety of methods are described in the literature to assign peptide sequences to observed tandem MS data. Typically, the identified peptides are associated only with an arbitrary score that reflects the quality of the peptide-spectrum match but not with a statistically meaningful significance measure. In this chapter, we discuss why statistical significance measures can simplify and unify the interpretation of MS-based proteomic experiments. In addition, we also present available software solutions that convert scores into sound statistical measures.
Methods in molecular biology (Clifton, N.J.) 2010;604;43-53
PUBMED: 20013363; DOI: 10.1007/978-1-60761-444-9_4
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom.
Sound scoring methods for sequence database search algorithms such as Mascot and Sequest are essential for sensitive and accurate peptide and protein identifications from proteomic tandem mass spectrometry data. In this paper, we present a software package that interfaces Mascot with Percolator, a well performing machine learning method for rescoring database search results, and demonstrate it to be amenable for both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures. Mascot Percolator can be readily used as a stand alone tool or integrated into existing data analysis pipelines.
Funded by: Wellcome Trust: 077198
Journal of proteome research 2009;8;6;3176-81
PUBMED: 19338334; PMC: 2734080; DOI: 10.1021/pr800982s
Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.
Automated methods for assigning peptides to observed tandem mass spectra typically return a list of peptide-spectrum matches, ranked according to an arbitrary score. In this article, we describe methods for converting these arbitrary scores into more useful statistical significance measures. These methods employ a decoy sequence database as a model of the null hypothesis, and use false discovery rate (FDR) analysis to correct for multiple testing. We first describe a simple FDR inference method and then describe how estimating and taking into account the percentage of incorrectly identified spectra in the entire data set can lead to increased statistical power.
Funded by: NCRR NIH HHS: P41 RR11823; NIBIB NIH HHS: R01 EB007057
Journal of proteome research 2008;7;1;29-34
PUBMED: 18067246; DOI: 10.1021/pr700600n
Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.
A variety of methods have been described in the literature for assigning statistical significance to peptides identified via tandem mass spectrometry. Here, we explain how two types of scores, the q-value and the posterior error probability, are related and complementary to one another.
Funded by: NCRR NIH HHS: P41 RR11823; NIBIB NIH HHS: R01 EB007057
Journal of proteome research 2008;7;1;40-4
PUBMED: 18052118; DOI: 10.1021/pr700739d
Department of Genome Sciences, University of Washington, 1705 NE Pacific St., William H. Foege Building, Seattle, Washington 98195, USA.
Shotgun proteomics uses liquid chromatography-tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic Saccharomyces cerevisiae dataset, and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.
Funded by: NCRR NIH HHS: P41 RR011823; NIBIB NIH HHS: R01 EB007057
Nature methods 2007;4;11;923-5
PUBMED: 17952086; DOI: 10.1038/nmeth1113
Imperial Cancer Research Fund, London, UK.
Several algorithms have been described in the literature for protein identification by searching a sequence database using mass spectrometry data. In some approaches, the experimental data are peptide molecular weights from the digestion of a protein by an enzyme. Other approaches use tandem mass spectrometry (MS/MS) data from one or more peptides. Still others combine mass data with amino acid sequence data. We present results from a new computer program, Mascot, which integrates all three types of search. The scoring algorithm is probability based, which has a number of advantages: (i) A simple rule can be used to judge whether a result is significant or not. This is particularly useful in guarding against false positives. (ii) Scores can be compared with those from other types of search, such as sequence homology. (iii) Search parameters can be readily optimised by iteration. The strengths and limitations of probability-based scoring are discussed, particularly in the context of high throughput, fully automated protein identification.
Electrophoresis 1999;20;18;3551-67
PUBMED: 10612281; DOI: 10.1002/(SICI)1522-2683(19991201)20:18<3551::AID-ELPS3551>3.0.CO;2-2