Percolator
Percolator is an algorithm that uses semi-supervised machine learning to improve the discrimination between correct and incorrect spectrum identifications. The matches from searching a decoy database provide the negative examples for the classifier, and a subset of the high-scoring matches from the target database provide the positive examples. Percolator trains a machine learning algorithm called a support vector machine (SVM) to discriminate between the positive and negative matches by assigning weights to a number of features. Examples of features include Mascot score, precursor mass error, fragment mass error, and number of variable modifications. The vector of features with their optimal weights is then used to re-rank matches from all queries, often leading to improved sensitivity.
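The final re-ranking step can be pictured as a weighted sum of feature values. The sketch below is only an illustration of that idea, not Percolator's code: the function name, the input layout and the weights are hypothetical, and in practice the weights come from the trained SVM and the features are normalised first.

```python
def rerank(matches, weights, bias=0.0):
    """Sort matches by a linear discriminant score, best first.

    matches: list of dicts with an 'id' and a 'features' dict (name -> value).
    weights: dict of feature name -> weight (hypothetical values here;
             Percolator learns them from target and decoy matches).
    """
    def score(match):
        return bias + sum(w * match["features"][name] for name, w in weights.items())

    return sorted(matches, key=score, reverse=True)


# Toy values for illustration only.
toy_weights = {"mScore": 1.0, "absDM": -10.0}
toy_matches = [
    {"id": "A", "features": {"mScore": 30.0, "absDM": 2.0}},
    {"id": "B", "features": {"mScore": 25.0, "absDM": 0.1}},
]
ranked_ids = [m["id"] for m in rerank(toy_matches, toy_weights)]
```

In this toy case match B overtakes match A, because its smaller mass error outweighs its lower Mascot score.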
Percolator was developed by Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, & Michael J MacCoss at the University of Washington, Department of Genome Sciences. The software is released under an Apache 2.0 licence and included with Mascot by permission.
We would also like to acknowledge the work of Markus Brosch and colleagues at the Sanger Centre, Hinxton, UK, who first applied Percolator to Mascot results and developed a wrapper application called Mascot Percolator.
There are a number of relevant publications:
- Käll, L., et al., Semi-supervised learning for peptide identification from shotgun proteomics datasets, Nature Methods 4 923-925 (2007)
- Käll, L., et al., Posterior error probabilities and false discovery rates: Two sides of the same coin, Journal of Proteome Research 7 40-44 (2008)
- Käll, L., et al., Assigning significance to peptides identified by tandem mass spectrometry using decoy databases, Journal of Proteome Research 7 29-34 (2008)
- Käll, L., et al., Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry, Bioinformatics 24 i42-i48 (2008)
- Brosch, M., et al., Accurate and Sensitive Peptide Identification with Mascot Percolator, Journal of Proteome Research 8 3176-3181 (2009)
- Spivak, M., et al., Improvements to the Percolator Algorithm for Peptide Identification from Shotgun Proteomics Data Sets, Journal of Proteome Research 8 3737-3745 (2009)
Percolator returns p values, q values and Posterior Error Probabilities (PEPs) for each match. The q value of a match is the minimum false discovery rate at which that match would be accepted. For example, if we accept all matches with q values of 0.01 or less, the estimated false discovery rate among the accepted matches is 1%. The PEP is the probability that an individual match is a chance event.
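To make the link between q values and a decoy search concrete, here is a minimal target-decoy sketch. It is not Percolator's own estimator: it assumes target and decoy databases of equal size, ignores tied scores, and the function name and input format are hypothetical.

```python
def q_values(matches):
    """Estimate q values for target matches from a target-decoy search.

    matches: list of (score, is_decoy) tuples, higher score = better.
    Returns a list of (score, q_value) for target matches, best score first.
    """
    ordered = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for score, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            # Estimated FDR if the threshold were set at this score.
            fdrs.append((score, min(1.0, decoys / targets)))

    # The q value is the lowest FDR at this threshold or any more lenient one.
    result = []
    best = 1.0
    for score, fdr in reversed(fdrs):
        best = min(best, fdr)
        result.append((score, best))
    result.reverse()
    return result


example = q_values([(50, False), (45, True), (40, False), (35, False), (30, False)])
```

Accepting every target match whose q value is at or below a chosen cut-off then gives a list with that estimated false discovery rate.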
The requirements for using Percolator to re-rank the matches from a Mascot search are:
- MS/MS search
- The search must include the results from an automatic decoy database search
- The search must contain at least 750 queries
- At least 100 database entries must be searched.
- The search must not be an error tolerant search.
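The requirements listed above can be expressed as a simple pre-check. This is a sketch only: the `SearchSummary` class and `percolator_blockers` function are hypothetical names invented for illustration, not part of Mascot, and the limits are taken directly from the list.

```python
from dataclasses import dataclass


@dataclass
class SearchSummary:
    """Hypothetical summary of a completed search."""
    is_msms: bool
    has_auto_decoy: bool
    num_queries: int
    num_db_entries: int
    is_error_tolerant: bool


def percolator_blockers(search, min_queries=750, min_entries=100):
    """Return the list of unmet requirements (empty if the search qualifies)."""
    blockers = []
    if not search.is_msms:
        blockers.append("not an MS/MS search")
    if not search.has_auto_decoy:
        blockers.append("no automatic decoy database search")
    if search.num_queries < min_queries:
        blockers.append(f"fewer than {min_queries} queries")
    if search.num_db_entries < min_entries:
        blockers.append(f"fewer than {min_entries} database entries searched")
    if search.is_error_tolerant:
        blockers.append("error tolerant search")
    return blockers
```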
If these requirements are met, the result report will include a checkbox Show Percolator scores. When this is checked and the report re-loaded, the original Mascot scores will be replaced as follows:
- Score: -10log10(PEP)
- Expect value: PEP
- Identity threshold score for p<0.05: 13
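The fixed threshold of 13 follows from the score definition: a PEP of 0.05 corresponds to a score of about 13. The sketch below shows the conversion in both directions, assuming the logarithm is base 10 (which is what reproduces the threshold of 13); the function names are hypothetical.

```python
import math


def percolator_score(pep):
    """Convert a posterior error probability into a score, -10 * log10(PEP)."""
    if not 0.0 < pep <= 1.0:
        raise ValueError("PEP must be in the interval (0, 1]")
    return -10.0 * math.log10(pep)


def pep_from_score(score):
    """Inverse conversion: recover the PEP from a score."""
    return 10.0 ** (-score / 10.0)


threshold = percolator_score(0.05)  # about 13.01
```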
Percolator will usually give a worthwhile improvement in sensitivity, but there are occasions when it can fail. For example, if there are very few good matches in the search results, it may not have enough positive examples to work with.
Features
The complete set of features that can be made available to Percolator is defined in code. You can choose a sub-set of these features using a setting in the Options section of the Mascot configuration file, mascot.dat. The default setting, as shipped, is:
PercolatorFeatures dM, mScore, MIT, MHT, peptideLength, z1, z2, z4, z7, isoSysDM, isoSysDMppm, isoSysDMz, 12C, mc0, mc1, mc2, varmods, varmodsCount, totInt, intMatchedTot, relIntMatchedTot, RMS, RMSppm, meanAbsFragDa, meanAbsFragPPM, rawScore
Feature name | Description |
---|---|
retentionTime | Retention time in seconds if available |
dM | Calculated minus observed peptide mass in Da |
mScore | Mascot score (always on) |
lgDScore | Mascot score minus Mascot score of next best non-isobaric peptide hit |
mrCalc | Calculated Mr |
charge | Charge |
dMppm | Calculated minus observed peptide mass in ppm |
absDM | Absolute value of calculated minus observed peptide mass in Da |
absDMppm | Absolute value of calculated minus observed peptide mass in ppm |
isoDM | Absolute value of calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in Da |
isoDMppm | Absolute value of calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in ppm |
isoDmz | Absolute value of calculated minus observed peptide m/z |
mc | Number of missed cleavages (always 0 if no enzyme) |
varmods | Number of modified sites divided by number of modifiable sites (set to 0 if number of modifiable sites is 0) |
varcount | Number of distinct varmods present |
varmodsCount | The number of variable mods used in the peptide. That is, if there are 10 Met and 5 of these are oxidised, this counts as 1. A peptide with Met-OX, phosphoS, deamidation, and acetylation would count as 4. |
modifiable | Total number of modifiable sites |
modified | Total number of modified residues and termini |
totInt | Log of the total ion intensity. Only the 20 most intense peaks in each 100 Da bin are used, for this and for all other features |
intMatchedTot | Log total matched ion intensity |
relIntMatchedTot | Total matched ion intensity divided by total ion intensity as a percentage (no logs involved) |
fragDeltaMed | Median value of all matched fragment errors in Da |
fragDeltaIqr | Interquartile range value of all matched fragment errors in Da |
fragDeltaMedPPM | Median value of all matched fragment errors in ppm |
fragDeltaIqrPPM | Interquartile range value of all matched fragment errors in ppm |
fragDeltaPolyFit | 2nd order polynomial fit to m/z vs delta. Result is RSquared multiplied by the number of points divided by 100 |
longest | Longest sequence matched ions, reported separately for each ion series (backbone only), as with fracIonsMatched |
fracIonsMatched | Fraction of calculated ions matched, reported separately for each ion series, with NLs lumped together (e.g. fracIonsMatchedB1, fracIonsMatchedB1deriv, fracIonsMatchedB2, fracIonsMatchedB2deriv) |
matchedIntensity | Matched ion intensity, reported separately for each ion series, as with fracIonsMatched |
qmatch | The number of peptide matches for which an ms-ms match was attempted |
MIT | Mascot identity threshold |
MHT | Mascot homology threshold |
peptideLength | Peptide length |
z1 | 1 if charge = 1 |
z2 | 1 if charge = 2 or 3 |
z4 | 1 if charge = 4, 5, or 6 |
z7 | 1 if charge = 7 or more |
12C | 1 if peptide mass is 12C value (no isotope error) |
mc0 | 1 if missed cleavages = 0 or if no enzyme |
mc1 | 1 if missed cleavages = 1 |
mc2 | 1 if missed cleavages = 2 or more |
RMS | RMS m/z error for matched fragments |
RMSppm | RMS ppm error for matched fragments |
meanAbsFragDa | Mean absolute m/z error for matched fragments |
meanAbsFragPPM | Mean absolute PPM error for matched fragments |
rawscore | Simple binomial score using matches to main series sequence ions and p = 2*ITOL*n/100 where n is the number of peaks selected in each 100 Da bin |
peptide | The peptide string that was matched, interspersed with numbers to represent modifications, e.g. X.DAKAAM1AGRLM1IR.X |
proteins | A tab separated list of accessions of proteins that contain this peptide. Must be last feature in list |
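Several of the features in the table are simple derived values. The sketch below shows how three of them could be computed from the table's descriptions: the charge indicators (z1, z2, z4, z7), varmods and relIntMatchedTot. The function names and arguments are hypothetical and this is not Mascot's code.

```python
def charge_indicators(charge):
    """Binned indicator flags for precursor charge, as described in the table."""
    return {
        "z1": int(charge == 1),
        "z2": int(charge in (2, 3)),
        "z4": int(4 <= charge <= 6),
        "z7": int(charge >= 7),
    }


def varmods_fraction(modified_sites, modifiable_sites):
    """varmods: modified sites / modifiable sites, or 0 if nothing is modifiable."""
    if modifiable_sites == 0:
        return 0.0
    return modified_sites / modifiable_sites


def rel_int_matched_tot(matched_intensity, total_intensity):
    """relIntMatchedTot: matched intensity as a percentage of total (no logs)."""
    return 100.0 * matched_intensity / total_intensity
```

Exactly one charge flag is set for any positive charge, so the four columns together act as a coarse encoding of charge state.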
One feature is treated differently from the others: retention time. If retention time is included in the peak list, so that it is available in the Mascot result file, it can be used as a feature by comparing the experimental RT values with values calculated by Percolator. To enable this:
- The peak list must supply retention time information using the MGF RTINSECONDS parameter. It is not sufficient to have the information embedded in the scan title string
- In the Options section of mascot.dat, set PercolatorUseRT to 1 to turn this feature on by default. Please note that retention time calculation in Percolator is very time consuming and the sensitivity improvement is only marginal for most data sets, so we advise against turning it on as a global default. It is better to try it on specific searches by adding the argument percolate_rt=1 to the report URL.
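Adding the argument to a report URL can be done by hand or scripted. The sketch below uses Python's standard urllib.parse module; the example URL and the function name are placeholders, and only the percolate_rt=1 argument comes from the text above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_percolate_rt(report_url):
    """Return report_url with percolate_rt=1 set in its query string."""
    parts = urlsplit(report_url)
    # Drop any existing percolate_rt argument so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "percolate_rt"
    ]
    query.append(("percolate_rt", "1"))
    # safe="/" keeps path-like values such as file names readable.
    return urlunsplit(parts._replace(query=urlencode(query, safe="/")))


# Placeholder URL for illustration; substitute the URL of a real report.
url = with_percolate_rt("http://example.org/cgi/report.pl?file=F001234.dat")
```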
Two options in mascot.dat control whether target matches other than rank 1 are Percolated:
- PercolatorTargetRankScoreThreshold: A target match below rank 1 is not Percolated if its score is less than this value (default 20)
- PercolatorTargetRankRelativeThreshold: A target match below rank 1 is not Percolated if the score difference divided by the rank 1 score is greater than this value (default 0.2)
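The combined effect of these two thresholds can be sketched as a single test. The function below is hypothetical and makes one assumption that the text does not spell out: that "score difference" means the rank 1 score minus the score of the lower-ranked match.

```python
def is_percolated(rank, score, rank1_score,
                  score_threshold=20.0, relative_threshold=0.2):
    """Decide whether a target match is passed to Percolator.

    Defaults mirror PercolatorTargetRankScoreThreshold (20) and
    PercolatorTargetRankRelativeThreshold (0.2).
    """
    if rank == 1:
        return True  # the thresholds only apply below rank 1
    if score < score_threshold:
        return False
    # Assumption: score difference = rank 1 score minus this match's score.
    if rank1_score > 0 and (rank1_score - score) / rank1_score > relative_threshold:
        return False
    return True
```

For example, with a rank 1 score of 50, a rank 2 match scoring 45 is kept (relative difference 0.1), while one scoring 30 is dropped (relative difference 0.4).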
Data flow
- At the completion of a qualifying search, nph-mascot.exe creates a Percolator input file (*.pip) in the result directory
- When a report for Percolated results is loaded, the Percolator executable is called by nph-cache_families.pl to create a pair of output files (*.target.pop, *.decoy.pop) in the result directory. If Percolation is on by default, which is not recommended, this will occur when the report is first loaded. Otherwise, it occurs when the Percolator checkbox is checked and the report reloaded using Format As.
- Finally, nph-cache_families.pl uses the *.pop files to create new cache files that allow a report to be displayed using Percolated scores in place of the original Mascot scores.