Mascot Percolator: accurate and sensitive peptide identification

Mascot Percolator is a software package that interfaces the database search algorithm Mascot [1] with Percolator [2], a well-performing machine learning algorithm for rescoring database search results.

We have demonstrated that it is suitable for both low- and high-accuracy mass spectrometry data, outperforming all available Mascot scoring schemes and providing reliable significance measures [3].

[The Wellcome Trust Sanger Institute]

  1. Install the SUN Java runtime environment, version 1.5 or higher from http://www.java.com/en/download

    Follow the instructions as provided by SUN. Type "java -version" at the command line to check the installation and the version.

  2. Download the Mascot Percolator package: ftp://ftp.sanger.ac.uk/pub4/resources/software/mascotpercolator/

    Unzip the package.

  3. Download Mascot Parser from http://www.matrixscience.com/msparser_download.html

    Extract the files and copy everything from within the java subfolder into the root of the MascotPercolator folder. That should comprise two files: msparser.jar plus the native library for your platform, which is libmsparserj.so on Linux or msparser.dll on Windows.

  4. Download Percolator (version 1.11) from http://noble.gs.washington.edu/proj/percolator/

    You now need to compile Percolator (see the README file in the Percolator package). On UNIX machines you might need to make Percolator executable after compilation by running "chmod u+x percolator". You should then be able to run Percolator with "./percolator". The path to the executable must be specified in the config file, as described in the next step.

  5. Update the config.properties file, available in the root folder of Mascot Percolator, according to your needs.

    - specify the path to the root folder of the Mascot results files by modifying the available example.

    - specify the path to the Percolator executable by modifying the available example.

    - enable or disable specific features as described in [3]. Changing these settings is not recommended.

  6. To test whether Mascot Percolator can be executed, enter "java -jar MascotPercolator.jar".

For help regarding the installation or execution, feel free to contact mb8[at]sanger[.]ac[.]uk (Markus Brosch).

java -cp MascotPercolator.jar cli.MascotPercolator [options ...]

Parameters (replacing the "[options ...]" expression):

  • target VAL : (required) Log ID (*) or path/file name of the Mascot target results dat file
  • decoy VAL : (required) Log ID (*) or path/file name of the Mascot decoy results dat file. Note: if Mascot's 'auto-decoy' mode was used, use same logID/file as for the target parameter.
  • out VAL : (required) Results path and file name (without extension)
  • overwrite : (optional) If the result files already exist, this option forces them to be overwritten
  • validate FILE : (optional) File with a list of correct peptides/proteins (sequences simply concatenated or alternatively one sequence per line without identifiers)
  • rankdelta N : (optional) Maximum allowed Mascot score difference between a peptide hit and the top hit match of the same spectrum. Default = 1: all peptide hit ranks with a delta score of < 1 relative to the top hit match are processed. A setting of -1 strictly reports only the top hit match of a spectrum.
  • newDat : (optional flag) Write a new Mascot dat file in which the Mascot scores are replaced by Percolator's posterior error probabilities, transformed as follows: newMascotScore = -10 * log10(PosteriorErrorProbability). The Mascot Identity Threshold is set to 13 (the score equivalent to a posterior error probability <= 0.05).
    Note 1: This option does not replace the existing dat files.
    Note 2: The decoy section of the new dat file is written only when the Mascot auto-decoy method was used for the target/decoy search.
    Note 3: Peptide hit ranks may differ from the original Mascot search, since Mascot Percolator re-ranks the peptide hits based on the posterior error probabilities reported by Percolator.
  • rt : (optional flag) Enables retention time; will only be switched on when available from input data; default off; largely untested.
  • xml : (optional flag) Write supplemental XML output as defined here: http://noble.gs.washington.edu/proj/percolator/model/percolator_out.xsd
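
The score transformation used by the newDat option can be checked by hand. The following is a minimal Python sketch of the formula given above, not the code used by Mascot Percolator itself; the function names are hypothetical, and rejecting a probability of 0 is an assumption, since the handling of that case is not described here.

```python
import math


def pep_to_mascot_score(pep: float) -> float:
    """Apply newMascotScore = -10 * log10(PosteriorErrorProbability)."""
    # Assumption: a PEP of exactly 0 has no finite score, so it is rejected.
    if not 0.0 < pep <= 1.0:
        raise ValueError("posterior error probability must be in (0, 1]")
    return -10.0 * math.log10(pep)


def mascot_score_to_pep(score: float) -> float:
    """Invert the transformation: PEP = 10 ** (-score / 10)."""
    return 10.0 ** (-score / 10.0)


if __name__ == "__main__":
    # A PEP of 0.05 maps to a score of about 13, the identity threshold.
    print(round(pep_to_mascot_score(0.05), 2))
```

Under this transformation a higher score always means a lower posterior error probability, so a score of 20 corresponds to a PEP of 0.01.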

Example:

java -cp MascotPercolator.jar cli.MascotPercolator -rankdelta 1 -newDat -target 11083 -decoy 11084 -out 11083-11084

Mascot Percolator extracts all necessary data from the Mascot dat file(s), trains Percolator and writes the results to the specified summary file. Mascot Percolator requires a separate target and decoy search, which can be achieved in two ways:

1. A single Mascot search is performed with the Mascot auto-decoy option enabled. In this case, the "-target" and "-decoy" parameters refer to the same log ID or results file.

2. Two independent searches against a target and decoy database are performed, using identical search parameter settings. The "-target" and "-decoy" parameters are set accordingly.
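
The two ways of supplying target and decoy results differ only in the value passed to "-decoy". As an illustration, the hypothetical Python helper below assembles the command line for either case; it uses only options documented above, and the helper name and its arguments are assumptions made for this sketch.

```python
def build_command(target, out, decoy=None, rank_delta=None, new_dat=False,
                  jar="MascotPercolator.jar"):
    """Return the Mascot Percolator command line as a list of arguments."""
    # Auto-decoy search: target and decoy refer to the same log ID or file.
    if decoy is None:
        decoy = target
    cmd = ["java", "-cp", jar, "cli.MascotPercolator"]
    if rank_delta is not None:
        cmd += ["-rankdelta", str(rank_delta)]
    if new_dat:
        cmd.append("-newDat")
    cmd += ["-target", str(target), "-decoy", str(decoy), "-out", str(out)]
    return cmd


if __name__ == "__main__":
    # Separate target and decoy searches, as in the example above.
    print(" ".join(build_command(11083, "11083-11084", decoy=11084,
                                 rank_delta=1, new_dat=True)))
    # Auto-decoy search: the decoy defaults to the target.
    print(" ".join(build_command(11083, "11083")))
```

The returned list can be passed to subprocess.run to launch the process from a script.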

(*) Note: If the Mascot results are in the default results folder specified in the config file, the 'log ID' is the integer part of the name of the Mascot result file of interest. Example: if /mascot/results/ is the root folder of the Mascot results and /mascot/results/20090330/F001234.dat is the results file of interest, the 'log ID' is 1234.
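
The mapping from a results file to its log ID can be sketched in Python as follows. This assumes that result files are always named "F", followed by zero-padded digits, followed by ".dat", as in the example above; the function name is hypothetical.

```python
import re
from pathlib import PurePosixPath


def log_id_from_dat(path: str) -> int:
    """Extract the integer log ID from a Mascot result file path."""
    name = PurePosixPath(path).name
    # Assumed naming scheme: F<digits>.dat, e.g. F001234.dat
    match = re.fullmatch(r"F(\d+)\.dat", name)
    if match is None:
        raise ValueError(f"not a Mascot result file name: {name!r}")
    # int() drops the leading zeros: "001234" -> 1234
    return int(match.group(1))


if __name__ == "__main__":
    print(log_id_from_dat("/mascot/results/20090330/F001234.dat"))
```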

The queueing system was implemented to distribute the Mascot Percolator processes across several machines (nodes). The post-processing time can thereby be reduced linearly with the number of machines available.

If you have a Load Sharing Facility (LSF) installed and your nodes have access to the Mascot results files, you are certainly better off using LSF directly.

WARNING: Even though we run this queueing system without any problems in our IT environment, the distributed computing package should be regarded as experimental. Please feel free to send us bug reports.

There are four separate components involved:

  1. A queue database server that keeps track of the processes. To start up the database, execute:

    java -cp libs/hsqldb.jar org.hsqldb.Server -database.0 file:mascotPercolatorLogDB -dbname.0 mascotPercolatorLog

    This example starts up an HSQLDB database server.

    • 'database.0' specifies the file where the database is saved
    • 'dbname.0' specifies the database name.

    You can connect to this SQL database using the HSQLDB server JDBC driver: 'jdbc:hsqldb:hsql://localhost:9001/mascotpercolatorlog' with user 'sa' and no password.

    Please note that user 'sa' has full read/write access.

  2. A queue server that receives jobs, dispatches them to available nodes and writes log changes to the database server. To start up the server, execute:

    java -Djava.rmi.server.hostname=yourhost -cp MascotPercolator.jar queue.Server [options ...]

    Replace 'yourhost' with the hostname of your machine.

    Parameters (replacing the "[options ...]" expression):

    • dbAlias VAL : database name, e.g. mascotpercolatorlog
    • dbHost VAL : database host, e.g. localhost
    • htmlStatusFile VAL : simple static html status page will be written to this path and updated periodically as runs are queued & processed
    • port N : (optional) port

    Example:

    java -Djava.rmi.server.hostname=mascotsrv -cp MascotPercolator.jar queue.Server -dbHost localhost -dbAlias mascotpercolatorlog -htmlStatusFile /mascot/mascot/html/percolator/index.html -port 1198

    Please note that nodes cannot connect to the queue.Server and will fail unless you specifically allow them to do so. For this, you need to create a file called 'server.policy' before starting the queue.Server and set specific permissions that grant access to local system resources. Please read: http://java.sun.com/developer/onlineTraining/Programming/JDCBook/appA.html. We use the 'AllPermission' setting, but make sure you understand the implications. We do not take any responsibility for your chosen settings.

  3. One or more nodes that execute the jobs. Start as many nodes as you wish on your various machines. To start up a node, execute:

    java -cp MascotPercolator.jar queue.Node [options ...]

    Parameters (replacing the "[options ...]" expression):

    • server VAL : Server host name, where Mascot Percolator queue is running
    • serverPort N : (optional) Port of Mascot Percolator queue server
    • copyDat : (optional) If the node has no access to the Mascot dat file location specified in the config.properties file, the dat file is copied via secure copy (SCP) to a temporary file on the node, which is deleted upon completion.

    Example:

    java -cp MascotPercolator.jar queue.Node -copyDat -server mascotsrv

    Note: 'copyDat' is currently only supported on UNIX machines. For this to work, make sure you run the server and node processes as the same user to avoid file permission issues. If not all nodes are among your known SSH hosts, the server will halt and ask for manual confirmation whenever it connects to a new, unknown node. We set 'StrictHostKeyChecking no' in the ssh config to automatically accept all new hosts. Make sure you understand the implications.

  4. Finally, to submit jobs to the server, execute:

    java -cp MascotPercolator.jar queue.SubmitJob [options ...]

    Parameters (replacing the "[options ...]" expression):

    • server VAL : Server host name, where Mascot Percolator queue is running
    • serverPort N : (optional) Port of Mascot Percolator queue server
    • all remaining options are identical to those used when executing Mascot Percolator directly.

    Example:

    java -cp MascotPercolator.jar queue.SubmitJob -server mascotsrv -user 'markus' -target 12787 -decoy 12789 -out '/tmp/12787-12789'

  5. Special case: OneShotNode

    If you have an LSF queue on your system but the nodes have no access to the Mascot results files, this queue package is still useful when 'OneShotNodes' are used instead of the standard nodes. Instead of starting up nodes manually and submitting jobs individually, a OneShotNode takes care of both and can thereby be embedded into a standard LSF command. A OneShotNode has a job associated with it upon start-up and, unlike the standard nodes, terminates upon successful completion. The basic command is:

    java -cp MascotPercolator.jar queue.OneShotNode [options ...]

    Options are a superset of queue.SubmitJob and queue.Node.

    Example of using OneShotNode as part of a bsub LSF command:

    bsub -q long -M7500000 -R'select[mem>7500] rusage[mem=7500]' -o /lustre/log/percolator/9 "java -Djava.io.tmpdir=/lustre/temp -cp MascotPercolator.jar queue.OneShotNode -server mascotsrv -serverPort 1198 -copyDat -user mb8 -target 12865 -decoy 12866 -out /lustre/percolator/12865-12866"

  • How should I interpret the q-values and posterior error probabilities (PEP)?

    » Please refer to Ref. [4], Ref. [5] and Ref. [6] at the end of this document.

  • Why are the peptides' N- and C-terminal flanking residues always set to "X"?

    » Percolator requires the pre- and post-fixes to be set. However, Mascot Percolator does not apportion the proteins, and since a peptide can match several proteins, we keep these blank ("X").

  1. Probability-based protein identification by searching sequence databases using mass spectrometry data.

    Perkins DN, Pappin DJ, Creasy DM and Cottrell JS

    Electrophoresis 1999;20;18;3551-67

  2. Semi-supervised learning for peptide identification from shotgun proteomics datasets.

    Käll L, Canterbury JD, Weston J, Noble WS and MacCoss MJ

    Nature methods 2007;4;11;923-5

  3. Accurate and sensitive peptide identification with Mascot Percolator.

    Brosch M, Yu L, Hubbard T and Choudhary J

    Journal of proteome research 2009;8;6;3176-81

  4. Posterior error probabilities and false discovery rates: two sides of the same coin.

    Käll L, Storey JD, MacCoss MJ and Noble WS

    Journal of proteome research 2008;7;1;40-4

  5. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases.

    Käll L, Storey JD, MacCoss MJ and Noble WS

    Journal of proteome research 2008;7;1;29-34

  6. Scoring and validation of tandem MS peptide identification methods.

    Brosch M and Choudhary J

    Proteome bioinformatics: Informatics for mass-spectrometry based protein science. Humana Press, 2009