Mascot Percolator: accurate and sensitive peptide identification

Mascot Percolator is a software package that interfaces the database search algorithm Mascot [10612281] with Percolator [17952086], a well performing machine learning algorithm for rescoring database search results.

We have demonstrated it to be amenable to both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures [19338334] [22493177].

[Genome Research Limited]

Installation instructions

  1. Install the SUN Java runtime environment, version 1.5 or higher from http://www.java.com/en/download

    Follow the instructions as provided by SUN. Type "java -version" at the command line to check the installation and the version.

  2. Download the Mascot Percolator package: ftp://ftp.sanger.ac.uk/pub/resources/software/mascotpercolator/

    Unzip package.

  3. Download Mascot Parser from http://www.matrixscience.com/msparser_download.html

    Extract the files and copy everything from within the java subfolder into the root of the MascotPercolator folder. That should comprise two files: msparser.jar plus either libmsparserj.so (Linux only) or msparser.dll (Windows only).

  4. Download Percolator from http://noble.gs.washington.edu/proj/percolator/
    • Please note that for Mascot Percolator version 2.00 or earlier you will need to use Percolator v1.14. Mascot Percolator v2.02 onwards supports later versions of Percolator although the -u option is required.

    You now need to compile Percolator (see the README file in the Percolator package). On UNIX machines you might need to make Percolator executable after compilation by running "chmod u+x percolator". You should then be able to run Percolator with "./percolator". The path to the executable must be specified in the config file as described in the next step.

  5. Update the config.properties file, available in the root folder of Mascot Percolator, according to your needs.
    • specify the path to the root folder of the Mascot results files by modifying the available example.
    • specify the path to the Percolator executable by modifying the available example.
    • enable or disable specific features as described in [19338334] (changing these settings is not recommended).
  6. To test whether Mascot Percolator can be executed, enter "java -jar MascotPercolator.jar".
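
Before running step 6, it can help to confirm that the files named in the steps above are actually in place. The following is a minimal Python sketch, not part of the Mascot Percolator distribution; the function name and the idea of checking by file name only are assumptions made for illustration.

```python
import platform
from pathlib import Path

# File names are taken from the installation steps above. Checking by name
# only (not by version or content) is a simplification of this sketch.
COMMON_FILES = ["MascotPercolator.jar", "config.properties", "msparser.jar"]
NATIVE_LIBS = {"Linux": "libmsparserj.so", "Windows": "msparser.dll"}


def missing_install_files(root, system=None):
    """Return the expected files that are absent from the install folder.

    root   -- path to the Mascot Percolator root folder
    system -- "Linux" or "Windows"; defaults to the current platform
    """
    system = system or platform.system()
    expected = list(COMMON_FILES)
    native = NATIVE_LIBS.get(system)
    if native is not None:
        expected.append(native)
    root = Path(root)
    return [name for name in expected if not (root / name).is_file()]
```

Calling missing_install_files(".") from within the Mascot Percolator folder returns an empty list when all expected files are present. The Percolator executable is not checked here because its location is whatever you configured in config.properties.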

For help regarding the installation or execution, feel free to contact James Wright.

Standalone usage

To run Mascot Percolator:

java -cp MascotPercolator.jar cli.MascotPercolator [options...]

Parameters (replacing the "[options ...]" expression):

target VAL (required)
Log ID [1] or path/file name of the Mascot target results dat file
decoy VAL (required)
Log ID [1] or path/file name of the Mascot decoy results dat file. Note: if Mascot's 'auto-decoy' mode was used, use same logID/file as for the target parameter.
out VAL (required)
Results path and file name (without extension)
overwrite (optional)
If the result files already exist, this option forces them to be overwritten
validate FILE (optional)
File with a list of correct peptides/proteins (sequences simply concatenated or alternatively one sequence per line without identifiers)
rankdelta N (optional)
Maximum allowed Mascot score difference of peptide hit at hand as compared to top hit match.
Default = -1: If set to 1 all peptide hit ranks that have a delta score of < 1 to the top hit match are processed.
A setting of -1 strictly reports only the top hit match of a spectrum.
newDat (optional flag)
Write a new Mascot dat file that replaces the Mascot scores with Percolator's posterior error probabilities, transformed as follows: newMascotScore = -10log10(PosteriorErrorProbability). The Mascot Identity Threshold is set to 13 (score equivalent to posterior error probabilities <= 0.05).
  1. This option does not replace the existing dat files.
  2. The decoy section of the new dat file is written only when the Mascot auto-decoy method was used for the target/decoy search.
  3. Peptide hit ranks may be different from the original Mascot search, since Mascot Percolator re-ranks the peptide hits based on the reported posterior error probabilities obtained from Percolator
rt (optional/flag)
Enables retention time; will only be switched on when available from input data; default off; largely untested.
xml (optional/flag)
Write supplemental XML output as defined here: http://noble.gs.washington.edu/proj/percolator/model/percolator_out.xsd
features (optional/flag)
Write out feature file with results
chargefeature (optional/flag)
Switch to using a single value feature to represent precursor charge state rather than the standard 4 feature format
highcharge (optional/flag)
Calculates series specific features for higher (up to 5+) fragment charge states
nofilter (optional/flag)
Switches off the filter that ignores spectra with fewer than 15 fragment peaks
u (optional/flag)
This flag switches Percolator between PSM mode and unique peptide mode. With the latest versions of Percolator, using this option makes Percolator, and hence Mascot Percolator, report all PSMs rather than unique peptides. With earlier versions of Percolator (pre v2.0) it does the opposite and forces Percolator and Mascot Percolator to report only unique peptides. (Only available in Mascot Percolator v2.02 onwards.)
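
The score transformation described for the newDat option can be checked numerically. This is a small Python sketch of the stated formula only; the function names are made up here, and rejecting a PEP of exactly 0 is an assumption of the sketch (how Mascot Percolator itself handles that case is not documented above).

```python
import math


def pep_to_score(pep):
    """newMascotScore = -10 * log10(PEP), as stated for the newDat option."""
    if not 0.0 < pep <= 1.0:
        # Assumption of this sketch: a PEP of 0 would give an infinite score.
        raise ValueError("PEP must be in the interval (0, 1]")
    return -10.0 * math.log10(pep)


def score_to_pep(score):
    """Inverse transformation: PEP = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


print(round(pep_to_score(0.05), 2))   # about 13, the identity threshold
print(round(score_to_pep(13.0), 4))   # about 0.05
```

A PEP of 0.05 maps to a score of roughly 13, which is why the identity threshold in the rewritten dat file is fixed at that value.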

Example:

java -cp MascotPercolator.jar cli.MascotPercolator -rankdelta 1 -newDat -u -target 11083 -decoy 11084 -out 11083-11084
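
The example uses -rankdelta 1. As an illustration of the selection rule given in the parameter description, here is a hedged Python sketch; it is not the actual Mascot Percolator code. The function name is hypothetical, and the extra score cut-off of 13 for lower ranks is taken from the v1.06 note in the version history, so it may not reflect current behaviour.

```python
def select_ranks(scores, rankdelta=-1.0, min_lower_rank_score=13.0):
    """Return the 1-based ranks of the peptide hits that would be processed.

    scores               -- Mascot scores for one spectrum, best hit first
    rankdelta            -- maximum allowed score difference to the top hit;
                            a negative value keeps only the top hit
    min_lower_rank_score -- assumption from the v1.06 release note: hits of
                            rank 2 and above need a score greater than this
    """
    if not scores:
        return []
    kept = [1]  # the top hit is always processed
    if rankdelta < 0:
        return kept
    top = scores[0]
    for rank, score in enumerate(scores[1:], start=2):
        if top - score < rankdelta and score > min_lower_rank_score:
            kept.append(rank)
    return kept


print(select_ranks([45.2, 44.8, 40.0], rankdelta=1))  # [1, 2]
print(select_ranks([45.2, 44.8, 40.0]))               # [1]
```

With rankdelta set to 1, the second hit is kept because it scores within 1 of the top hit, while the third is dropped.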

Mascot Percolator extracts all necessary data from the Mascot dat file(s), trains Percolator and writes the results to the specified summary file. Mascot Percolator requires a separate target and decoy search, which can be achieved in two ways:

  1. Either a Mascot search is performed with the Mascot auto-decoy option enabled. In this case, the "-target" and "-decoy" parameters refer to the same logID or results file.
  2. Two independent searches against a target and decoy database are performed, using identical search parameter settings. The "-target" and "-decoy" parameters are set accordingly.
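
The two set-ups differ only in what is passed to "-decoy". A small Python helper, sketched here under the assumption that you want to assemble the command line programmatically (the helper name is hypothetical), makes that explicit:

```python
import shlex


def build_command(target, out, decoy=None, extra_flags=()):
    """Assemble the Mascot Percolator command line as an argument list.

    If decoy is None, the auto-decoy set-up is assumed and the target
    log ID / file is reused for the -decoy parameter.
    """
    decoy = target if decoy is None else decoy
    argv = ["java", "-cp", "MascotPercolator.jar", "cli.MascotPercolator"]
    argv += list(extra_flags)
    argv += ["-target", str(target), "-decoy", str(decoy), "-out", str(out)]
    return argv


# Separate target and decoy searches (reproduces the example above).
print(shlex.join(build_command(11083, "11083-11084", decoy=11084,
                               extra_flags=["-rankdelta", "1", "-newDat", "-u"])))

# Auto-decoy search: -target and -decoy refer to the same log ID.
print(shlex.join(build_command(11083, "11083")))
```

The returned list can be passed directly to subprocess.run; shlex.join (Python 3.8 or later) is used only to display it as a single shell line.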

Notes

[1] Note: Given the Mascot results are in the default results folder as specified in the config file, then the 'log ID' is the integer part of the Mascot result file of interest. Example: given /mascot/results/ is the root folder of the Mascot results and /mascot/results/20090330/F001234.dat is the results file of interest, then the 'log ID' would be 1234.
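
The log ID rule in the note can be written as a short Python function. This is a sketch based on the single example given; the assumption that result files are always named "F" followed by digits and ".dat" is inferred from that example and is not confirmed elsewhere in this document.

```python
import re
from pathlib import PurePosixPath

# Assumed naming pattern, inferred from the example F001234.dat.
_DAT_NAME = re.compile(r"^F(\d+)\.dat$")


def log_id_from_path(path):
    """Return the integer part of a Mascot result file name as the log ID."""
    name = PurePosixPath(path).name
    match = _DAT_NAME.match(name)
    if match is None:
        raise ValueError("not a recognised Mascot result file name: " + name)
    return int(match.group(1))


print(log_id_from_path("/mascot/results/20090330/F001234.dat"))  # 1234
```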

Distributed usage

The queueing system was implemented to distribute the Mascot Percolator processes onto various machines (nodes). Thereby the post processing time can be reduced linearly with the number of machines available.

If you have a Load Sharing Facility (LSF) installed and your nodes have access to the Mascot results files, you are certainly better off using LSF directly.

WARNING: Even though we run this queuing system without any problems in our IT environment, the distributed computing package shall be seen as experimental. Please feel free to send us bug reports.

There are four separate components involved:

  1. A queue database server that keeps track of the processes. To start-up the database, execute:
    java -cp libs/hsqldb.jar org.hsqldb.Server -database.0 file:mascotPercolatorLogDB -dbname.0 mascotPercolatorLog
    

    This example starts up a hsqldb database server.

    • 'database.0' specifies the file where the database is saved
    • 'dbname.0' specifies the database name.

    You can connect to this SQL database using the HSQLDB server JDBC driver: 'jdbc:hsqldb:hsql://localhost:9001/mascotpercolatorlog' with user 'sa' and no password.

    Please note that user 'sa' has full read/write access.

  2. A queue server that receives and dispatches jobs to available nodes and writes log changes to the database server. To start-up the server, execute:
    java -Djava.rmi.server.hostname=yourhost -cp MascotPercolator.jar queue.Server [options ...]
    

    Replace 'yourhost' with the hostname of your machine.

    Parameters (replacing the "[options ...]" expression):

    dbAlias VAL
    database name, e.g. mascotpercolatorlog
    dbHost VAL
    database host, e.g. localhost
    htmlStatusFile VAL
    simple static html status page will be written to this path and updated periodically as runs are queued & processed
    port N (optional)
    port

    Example:

    java -Djava.rmi.server.hostname=mascotsrv -cp MascotPercolator.jar queue.Server \
    -dbHost localhost -dbAlias mascotpercolatorlog -htmlStatusFile /mascot/mascot/html/percolator/index.html -port 1198
    

    Please note that Nodes cannot connect to the Mascot queue.Server and will fail, unless you allow them specifically to do so. For this, you need to create a file called 'server.policy' before starting the queue.server and set specific permissions that grant access to local system resources. Please read: http://java.sun.com/developer/onlineTraining/Programming/JDCBook/appA.html. We use 'AllPermission' setting, but make sure you understand the implications. We do not take any responsibility for your chosen settings.

  3. Time to start the node(s) which will execute the jobs. Start as many nodes as you wish on your various machines. To start-up a node, execute:
    java -cp MascotPercolator.jar queue.Node [options ...]
    

    Parameters (replacing the "[options ...]" expression):

    server VAL
    Server host name, where Mascot Percolator queue is running
    serverPort N (optional)
    Port of Mascot Percolator queue server
    copyDat (optional)
    Given the node has no access to the Mascot dat file location as specified in the config.properties file, it is copied via secure copy (SCP) to a temporary file on the node, which is deleted upon completion.

    Example:

    java -cp MascotPercolator.jar queue.Node -copyDat -server mascotsrv
    

    Note: 'copyDat' is currently only supported on UNIX machines. For this to work, make sure you run the server and node processes as the same user to avoid file permission issues. If not all nodes are among your known ssh host keys, the server will halt and ask for manual confirmation whenever it connects to a new, unknown node. We set 'StrictHostKeyChecking no' in the ssh config to auto-accept all new hosts. Make sure you understand the implications.

  4. Finally, to submit jobs to the server, execute:
    java -cp MascotPercolator.jar queue.SubmitJob [options ...]
    

    Parameters (replacing the "[options ...]" expression):

    server VAL
    Server host name, where Mascot Percolator queue is running
    serverPort N (optional)
    Port of Mascot Percolator queue server

    All remaining options are identical to those used when executing Mascot Percolator directly.

    Example:

    java -cp MascotPercolator.jar queue.SubmitJob -server mascotsrv -user 'markus' -target 12787 -decoy 12789 -out '/tmp/12787-12789'
    
  5. Special case: OneShotNode

    If you have an LSF queue on your system but no access to the Mascot results files, this queue package is still useful: use 'OneShotNodes' instead of the standard nodes. Instead of starting up nodes manually and submitting jobs individually, a OneShotNode takes care of both and can thereby be embedded into a standard LSF command. A OneShotNode has a job associated with it upon start-up and, unlike the standard nodes, terminates upon successful completion. The basic command is:

    java -cp MascotPercolator.jar queue.OneShotNode [options...]
    

    Options are a superset of queue.SubmitJob and queue.Node.

    Example of using OneShotNode as part of a bsub LSF command:

    bsub -q long -M7500000 -R'select[mem>7500] rusage[mem=7500]' -o /lustre/log/percolator/9 \
          "java -Djava.io.tmpdir=/lustre/temp -cp MascotPercolator.jar queue.OneShotNode -server mascotsrv -serverPort 1198 -copyDat -user mb8 -target 12865 -decoy 12866 -out /lustre/percolator/12865-12866"
        
    
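When many OneShotNode jobs are submitted, the bsub line can be generated rather than typed. The following Python sketch only rearranges the options that appear in the example above; the function name is hypothetical, and the memory options (-M, -R) are deliberately left out because their values and units depend on the local LSF configuration.

```python
import shlex


def oneshot_bsub_command(server, target, decoy, out, user, log_file,
                         queue="long", server_port=None, copy_dat=False,
                         tmpdir=None):
    """Build a bsub argument list that wraps one queue.OneShotNode job."""
    java = ["java"]
    if tmpdir is not None:
        java.append("-Djava.io.tmpdir=" + tmpdir)
    java += ["-cp", "MascotPercolator.jar", "queue.OneShotNode",
             "-server", server]
    if server_port is not None:
        java += ["-serverPort", str(server_port)]
    if copy_dat:
        java.append("-copyDat")
    java += ["-user", user, "-target", str(target),
             "-decoy", str(decoy), "-out", out]
    # bsub receives the whole java command as one quoted argument,
    # as in the example above.
    return ["bsub", "-q", queue, "-o", log_file, shlex.join(java)]


cmd = oneshot_bsub_command("mascotsrv", 12865, 12866,
                           "/lustre/percolator/12865-12866", "mb8",
                           "/lustre/log/percolator/9", server_port=1198,
                           copy_dat=True, tmpdir="/lustre/temp")
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run on a machine where bsub is available; add your site's memory requests to the list as needed.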

GUI Wizard

To run the experimental GUI wizard:

java -cp MascotPercolator.jar cli.MascotPercolator -gui

This feature is still under development and is not fully supported. It is available in Mascot Percolator v2.02, but it is likely to have some bugs. Please try it; feedback is much appreciated.

FAQ

How should I interpret the q-values and posterior error probabilities (PEPs)?
Please refer to Ref. [18052118], Ref. [18067246] and Ref. [20013363] at the end of this document.
Why are the peptide N and C termini always set to "X"?
Percolator requires the pre- and post-fixes to be set, however, Mascot Percolator does not apportion the proteins and since a peptide can match several proteins, we keep these blank ("X").
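
As a rough intuition for the first question: a q-value is the minimum false discovery rate at which a given PSM would still be accepted, whereas a PEP is the probability that this individual PSM is incorrect. The Python sketch below computes simple target/decoy q-values to illustrate the idea. It is not Percolator's estimator; it assumes equally sized target and decoy searches, higher scores being better, and no correction for the proportion of incorrect target hits.

```python
def q_values(target_scores, decoy_scores):
    """Return one q-value per target score, in the order given."""
    labelled = [(s, True, i) for i, s in enumerate(target_scores)]
    labelled += [(s, False, -1) for s in decoy_scores]
    # Best score first; at equal scores decoys come first (conservative).
    labelled.sort(key=lambda item: (-item[0], item[1]))

    # Estimated FDR when accepting everything down to each position.
    fdrs = []
    n_targets = n_decoys = 0
    for _, is_target, _ in labelled:
        if is_target:
            n_targets += 1
        else:
            n_decoys += 1
        fdrs.append(min(1.0, n_decoys / max(n_targets, 1)))

    # q-value: smallest FDR at this or any more permissive threshold.
    result = [0.0] * len(target_scores)
    running_min = 1.0
    for (_, is_target, index), fdr in zip(reversed(labelled), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if is_target:
            result[index] = running_min
    return result


print(q_values([10, 8, 6, 4], [7, 3]))  # [0.0, 0.0, 0.25, 0.25]
```

In this toy example the two best target hits outscore every decoy and get a q-value of 0, while accepting the remaining two targets also lets one decoy through, giving an estimated FDR of 1 in 4.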

Version History

v1.00:
  • initial release.
v1.01:
  • bugfix: the Mascot Parser library appeared to calculate incorrect peptide fragmentation at times, resulting in biased results with Mascot Percolator under certain conditions.
  • Workaround implemented for Mascot Parser versions pre 2.2.0.0
  • Bug was reported to Matrix Science and was fixed with Mascot Parser library version post 2.2.05. The problem was SWIG, which caused premature garbage collection.
  • bugfix: we found that Mascot (in cluster mode only) up to version 2.2.03 calculated the auto decoy search incorrectly, resulting in biased Mascot Percolator results.
  • Fixed by Matrix Science in Mascot version 2.2.04. Bug #2584 under http://www.matrixscience.com/mascot_support.html#2.2
  • new: simplified command line interface (now only one output file name required).
v1.02:
  • feature refinement.
v1.03:
  • bugfix: query column in results file was not reported correctly.
v1.04:
  • release 1.04 requires Percolator version 1.09 onwards.
  • new: write Mascot results dat file with -10log10(PosteriorErrorProbability) as a Mascot score replacement (-newDat parameter).
  • new: retention time prediction is available as an option now (-rt flag), given the data mgf file contains 'RTINSECONDS' for each spectrum (see http://www.matrixscience.com/help/data_file_help.html)
  • new: xml output instead of tab delimited output available (-xml parameter).
  • internal change: switched to new Percolator tab delimited input format.
v1.05 (30/03/2009):
  • bugfix: when a semicolon was used to separate protein identifiers of the same entry (e.g. Prot1;Prot1a), a parser error occurred.
  • new: distributed computing package implemented; documentation available: http://www.sanger.ac.uk/Software/analysis/MascotPercolator/
v1.06 (03/04/2009):
  • new: new training feature 'varMods' defined; the number of variable mods divided by all possible variable mods for the peptide at hand
  • new: the parameter 'ranks' is no longer supported and was replaced with 'rankdelta'. rankdelta = maximum allowed Mascot score difference between best peptide match (rank 1) and peptide matches of rank 2..10. Default = 1: all peptide hit ranks that have a delta Mascot score of less than 1 to the top hit match are processed, ideal e.g. to report isobaric peptides etc. A setting of -1 strictly reports only the top hit match. Hits of rank 2 and above are only considered if Mascot score greater than 13.
  • new: the 'newDat' parameter flag now writes also the decoy part of the new dat file. However, this is only available for Mascot Percolator runs that are based on Mascot's Auto-Decoy feature.
  • new: when new dat files are generated, a warning is written into the header: "Result file re-written by Mascot Percolator using scores derived from Percolator PEP values".
  • new: new parameter flag '-features' to save the training data e.g. for debugging purposes.
  • bugfix: various users reported some Exceptions that are now handled.
v1.07 (08/05/2009):
  • bugfix: the overwrite parameter appended results to the old results file instead of replacing it. Fixed.
  • bugfix: queue.Server threw an Exception if a Mascot search result was to be searched that was generated after the queue.Server was started.
  • bugfix: protein feature is now disabled, since we have seen some artefacts on some datasets. This will be further investigated.
  • bugfix: isotope corrected delta mass feature reported incorrect values when charge was 3+.
  • bugfix: delta score now reported without log transformation; before it was always reported as log(deltaScore). Also no more upper maximum that was used as sanity check before.
  • new: the static html page that the queue.Server writes now includes an auto-refresh tag, so the user always sees the latest status.
  • new: if queue.Server copies dat files to nodes via scp, it sets the permissions of the temporary copy to 666 (every UNIX user can read/write the file). Make sure you understand the implications.
v1.08 (16/05/2009):
  • bugfix: selection of sub-optimal ranks was biased towards target hits when rankdelta > 0. Now preserving balanced target/decoy set.
  • new: complete re-implementation of queuing system; avoids bi-directional network communication for more stability and reliability.
  • new: config.default.feature file is now part of the distribution. Do not change the file name or content. It presets the Mascot score to be used for the initial SVM training round.
  • new: auto-refresh tag is removed from static html page upon request.
  • new: updated config.properties file.
  • bugfix: reduced memory footprint.
v1.09 (24/06/2009):
  • (basic features identical to 1.08; changes only affect extended feature set)
  • update: refined spectrum processing (top 20 ions per 100mz mass window instead of global 0.1% ion intensity cut-off).
  • update: relative intensities are now reported as % of total intensity.
  • update: sequence coverage feature is removed since redundant with ion coverage when single series are used;
  • update: peptide score feature removed due to strong correlation with Mascot score.
  • new: MS2 mass accuracy is accounted for by median and IQR of MS2 ion mass errors.
  • new: feature implemented longest consecutive series of B+/B++ and Y/Y++ ions.
v1.10 (10/12/2009):
  • new: percolated ion-series are not hardcoded anymore and depend on the Mascot search parameters (instrument). More flexible use.
  • bugfix: if unimod information is not available in the dat files, check for mod_file and throw error if it cannot be found.
v1.11 (30/04/2010):
  • new: Added charge filtering capability to the MascotROC utility
v1.12 (16/10/2010):
  • new: Added mascotList utility which will list all query top rank hits and their MHT and MIT FDR values
  • bugfix: Fix server version to read correctly from percolator v14 output
v1.13:
  • Minor Bug Fixes
v1.14:
  • Minor Bug Fixes
v1.15 (05/01/11):
  • bugfix: Made new dat writer compatible with Mascot 2.3
  • bugfix: file scanner depth increased
  • bugfix: rankdelta default set to -1
  • new: Multi database searches are now compatible
v1.16 (01/02/11):
  • update: removed individual charge state features and replaced with single charge value
  • update: added proteins to MascotList tool
v2.00 (28/02/11):
  • update: optional single charge value feature rather than individual features (consider increasing range of individual features?)
  • update: Added optional matching of high mass fragments, 3+ and 4+ fragments are calculated and used in fragment matching features
  • update: Added spectrum quality filter (option to disable) to remove spectra with low number of peaks
  • bugfix: Fix to solve totalMatchedIntensity (and hence relMatchedIntensity) being calculated incorrectly
  • bugfix: Fix to make sure each fragment intensity is only considered once in the total matched intensity (consider doing this for individual ion series as well?)
v2.01 (2/11/12):
  • bugfix: made server/node compatible with new mascot cluster
v2.02 (3/12/12):
  • update: Error Tolerant Experimental Option added (BETA)
  • update: New -u option for use with newer versions of percolator
  • update: GUI Wizard (ALPHA)
  • bugfix: ion series headers

References

  • Enhanced peptide identification by electron transfer dissociation using an improved Mascot Percolator.

    Wright JC, Collins MO, Yu L, Käll L, Brosch M and Choudhary JS

    Proteomic Mass Spectrometry, Wellcome Trust Sanger Institute, Hinxton, Cambridge.

    Peptide identification using tandem mass spectrometry is a core technology in proteomics. Latest generations of mass spectrometry instruments enable the use of electron transfer dissociation (ETD) to complement collision induced dissociation (CID) for peptide fragmentation. However, a critical limitation to the use of ETD has been optimal database search software. Percolator is a post-search algorithm, which uses semi-supervised machine learning to improve the rate of peptide spectrum identifications (PSMs) together with providing reliable significance measures. We have previously interfaced the Mascot search engine with Percolator and demonstrated sensitivity and specificity benefits with CID data. Here, we report recent developments in the Mascot Percolator V2.0 software including an improved feature calculator and support for a wider range of ion series. The updated software is applied to the analysis of several CID and ETD fragmented peptide data sets. This version of Mascot Percolator increases the number of CID PSMs by up to 80% and ETD PSMs by up to 60% at a 0.01 q-value (1% false discovery rate) threshold over a standard Mascot search, notably recovering PSMs from high charge state precursor ions. The greatly increased number of PSMs and peptide coverage afforded by Mascot Percolator has enabled a fuller assessment of CID/ETD complementarity to be performed. Using a data set of CID and ETcaD spectral pairs, we find that at a 1% false discovery rate, the overlap in peptide identifications by CID and ETD is 83%, which is significantly higher than that obtained using either stand-alone Mascot (69%) or OMSSA (39%). We conclude that Mascot Percolator is a highly sensitive and accurate post-search algorithm for peptide identification and allows direct comparison of peptide identifications using multiple alternative fragmentation techniques.

    Funded by: Wellcome Trust: 079643/Z/06/Z

    Molecular & cellular proteomics : MCP 2012;11;8;478-91

  • Scoring and validation of tandem MS peptide identification methods.

    Brosch M and Choudhary J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

    A variety of methods are described in the literature to assign peptide sequences to observed tandem MS data. Typically, the identified peptides are associated only with an arbitrary score that reflects the quality of the peptide-spectrum match but not with a statistically meaningful significance measure. In this chapter, we discuss why statistical significance measures can simplify and unify the interpretation of MS-based proteomic experiments. In addition, we also present available software solutions that convert scores into sound statistical measures.

    Methods in molecular biology (Clifton, N.J.) 2010;604;43-53

  • Accurate and sensitive peptide identification with Mascot Percolator.

    Brosch M, Yu L, Hubbard T and Choudhary J

    The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, United Kingdom.

    Sound scoring methods for sequence database search algorithms such as Mascot and Sequest are essential for sensitive and accurate peptide and protein identifications from proteomic tandem mass spectrometry data. In this paper, we present a software package that interfaces Mascot with Percolator, a well performing machine learning method for rescoring database search results, and demonstrate it to be amenable for both low and high accuracy mass spectrometry data, outperforming all available Mascot scoring schemes as well as providing reliable significance measures. Mascot Percolator can be readily used as a stand alone tool or integrated into existing data analysis pipelines.

    Funded by: Wellcome Trust: 077198

    Journal of proteome research 2009;8;6;3176-81

  • Assigning significance to peptides identified by tandem mass spectrometry using decoy databases.

    Käll L, Storey JD, MacCoss MJ and Noble WS

    Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.

    Automated methods for assigning peptides to observed tandem mass spectra typically return a list of peptide-spectrum matches, ranked according to an arbitrary score. In this article, we describe methods for converting these arbitrary scores into more useful statistical significance measures. These methods employ a decoy sequence database as a model of the null hypothesis, and use false discovery rate (FDR) analysis to correct for multiple testing. We first describe a simple FDR inference method and then describe how estimating and taking into account the percentage of incorrectly identified spectra in the entire data set can lead to increased statistical power.

    Funded by: NCRR NIH HHS: P41 RR11823; NIBIB NIH HHS: R01 EB007057

    Journal of proteome research 2008;7;1;29-34

  • Posterior error probabilities and false discovery rates: two sides of the same coin.

    Käll L, Storey JD, MacCoss MJ and Noble WS

    Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA.

    A variety of methods have been described in the literature for assigning statistical significance to peptides identified via tandem mass spectrometry. Here, we explain how two types of scores, the q-value and the posterior error probability, are related and complementary to one another.

    Funded by: NCRR NIH HHS: P41 RR11823; NIBIB NIH HHS: R01 EB007057

    Journal of proteome research 2008;7;1;40-4

  • Semi-supervised learning for peptide identification from shotgun proteomics datasets.

    Käll L, Canterbury JD, Weston J, Noble WS and MacCoss MJ

    Department of Genome Sciences, University of Washington, 1705 NE Pacific St., William H. Foege Building, Seattle, Washington 98195, USA.

    Shotgun proteomics uses liquid chromatography-tandem mass spectrometry to identify proteins in complex biological samples. We describe an algorithm, called Percolator, for improving the rate of confident peptide identifications from a collection of tandem mass spectra. Percolator uses semi-supervised machine learning to discriminate between correct and decoy spectrum identifications, correctly assigning peptides to 17% more spectra from a tryptic Saccharomyces cerevisiae dataset, and up to 77% more spectra from non-tryptic digests, relative to a fully supervised approach.

    Funded by: NCRR NIH HHS: P41 RR011823; NIBIB NIH HHS: R01 EB007057

    Nature methods 2007;4;11;923-5

  • Probability-based protein identification by searching sequence databases using mass spectrometry data.

    Perkins DN, Pappin DJ, Creasy DM and Cottrell JS

    Imperial Cancer Research Fund, London, UK.

    Several algorithms have been described in the literature for protein identification by searching a sequence database using mass spectrometry data. In some approaches, the experimental data are peptide molecular weights from the digestion of a protein by an enzyme. Other approaches use tandem mass spectrometry (MS/MS) data from one or more peptides. Still others combine mass data with amino acid sequence data. We present results from a new computer program, Mascot, which integrates all three types of search. The scoring algorithm is probability based, which has a number of advantages: (i) A simple rule can be used to judge whether a result is significant or not. This is particularly useful in guarding against false positives. (ii) Scores can be compared with those from other types of search, such as sequence homology. (iii) Search parameters can be readily optimised by iteration. The strengths and limitations of probability-based scoring are discussed, particularly in the context of high throughput, fully automated protein identification.

    Electrophoresis 1999;20;18;3551-67
