Error tolerant search

Introduction

When the results of an MS/MS Ions Search of an LC-MS/MS dataset are reviewed, there will often be a number of spectra that remain unmatched. Assuming that a given MS/MS spectrum contains adequate information, i.e. a reasonable number of fragment ion peaks at usable signal to noise, possible reasons for this failure include:

  • Underestimated mass measurement error
  • Incorrect determination of precursor charge
  • Enzyme non-specificity
  • Unsuspected chemical and post-translational modifications
  • Peptide sequence not in the database

If mass measurement error has been underestimated, this should be apparent from the graphs showing the differences between the calculated and measured mass values in the Peptide View and Protein View reports.

Incorrect determination of precursor charge has to be dealt with during peak detection. If it is not possible to determine the precursor charge reliably, then one option is to generate peak lists for all probable charge states.

The Mascot Error Tolerant Search addresses the final three difficulties by searching selected database entries with relaxed enzyme specificity, while iterating through a comprehensive list of chemical and post-translational modifications, together with a residue substitution matrix.

There are two ways to perform an error tolerant search. The preferred method is to check the error tolerant checkbox on the search form, which leads to an automatic, second pass search. There is also a manual procedure, in which the user selects the proteins that will go forward for the second pass search. This was an earlier implementation, and is retained mainly for compatibility with existing workflows and third party software.

Note that both methods are only applicable to MS/MS data; it is not possible to perform an error tolerant peptide mass fingerprint. For a truly unknown modification, or a sequence variation of more than a single base or residue, the error tolerant sequence tag is worth investigating.

Automatic Error Tolerant Search

An automatic error tolerant search is performed by checking the error tolerant checkbox on the search form. A standard, first pass search is performed using the search parameters specified in the form. From the results of the first pass search, all of the database entries that contain one or more peptide matches with expect values less than the significance threshold (0.05 by default) are selected for an error tolerant, second pass search. At the completion of the second pass search, a single report is generated, combining the results from both passes.
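
The selection of entries for the second pass can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Mascot's implementation; the PeptideMatch record, the select_entries function, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeptideMatch:
    query: int        # query (spectrum) number
    accession: str    # database entry the peptide was matched to
    expect: float     # expect value of the match


def select_entries(matches, threshold=0.05):
    """Return the accessions that have at least one match with an
    expect value below the significance threshold."""
    return {m.accession for m in matches if m.expect < threshold}


# Made-up first pass results: only P1 has a significant match.
first_pass = [
    PeptideMatch(1, "P1", 0.001),
    PeptideMatch(2, "P1", 0.2),
    PeptideMatch(3, "P2", 0.3),
]
selected = select_entries(first_pass)
```

Here only entry P1 would go forward, because P2 has no match below the threshold.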

During the error tolerant, second pass search:

  1. The selected enzyme becomes semi-specific (that is, only one end of a peptide needs to match the cleavage specificity), and the value of the missed cleavage parameter is increased by 1
  2. The complete list of modifications is tested, serially
  3. For a protein, the set of all possible amino acid substitutions is tested. For a nucleic acid sequence, all single base insertions, deletions, and substitutions are tested.
  4. Only one of the above is allowed per peptide. That is, an individual peptide can be semi-specific OR have one unsuspected modification OR have one primary sequence mutation.
  5. If the mass delta of the modification is less than the smaller of the precursor mass tolerance and the fragment mass tolerance, the modification is rejected. This eliminates modifications that are meaningless given the estimated mass error, like Q->K, in most cases.
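
Rule 5 can be illustrated with a short Python sketch. The function name is hypothetical, both tolerances are assumed to be expressed in Da, and taking the absolute value of the delta is an assumption made here so that negative deltas are handled sensibly.

```python
def is_testable(delta_da, precursor_tol_da, fragment_tol_da):
    """Keep a modification only if the size of its mass delta is at
    least the smaller of the two tolerances (both assumed to be in Da)."""
    return abs(delta_da) >= min(precursor_tol_da, fragment_tol_da)


# Q->K changes the monoisotopic residue mass by about 0.036385 Da.
Q_TO_K = 0.036385

wide = is_testable(Q_TO_K, precursor_tol_da=0.8, fragment_tol_da=0.5)
narrow = is_testable(Q_TO_K, precursor_tol_da=0.01, fragment_tol_da=0.01)
```

With tolerances of 0.8 Da and 0.5 Da the Q->K change is rejected, while with 0.01 Da tolerances it is kept, which is why the text says the rule eliminates it "in most cases".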

The following constraints apply to the standard, first pass search:

  1. The enzyme must be fully specific
  2. The ceiling on the number of variable modifications is reduced (the default is 2, but this can be changed globally in mascot.dat or for a user group in Mascot security)
  3. The search cannot be combined with quantitation
  4. The search cannot include an error tolerant sequence tag

If an automatic decoy database search is also specified, and the Target PSM FDR is (no target), the behaviour is as above. If a value is specified for Target PSM FDR, the significance threshold for the first pass search is adjusted to achieve this target. This determines which queries go forward to the second pass (those without significant matches) and which proteins will be searched (those with one or more significant matches).

The Target PSM FDR is applied independently to the results from the second pass, which means that the significance thresholds for the two passes may be very different. Since the target is based on PSM counts, if it can be achieved for the results from both passes, then it will also be true for the combined results. If it is not possible to get within a factor of 2 of the target, a warning will appear in the report.
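
The claim that the combined results meet the target whenever both passes do follows from simple arithmetic on PSM counts, sketched below in Python. The estimator used here (decoy PSMs divided by target PSMs) is an assumption for illustration, and the counts are invented.

```python
def estimated_fdr(n_target, n_decoy):
    """Estimate the PSM FDR as decoy PSMs / target PSMs
    (an assumed estimator, used here only for illustration)."""
    return n_decoy / n_target if n_target else 0.0


def combined_fdr(passes):
    """passes is a list of (target PSMs, decoy PSMs), one tuple per pass."""
    total_target = sum(t for t, _ in passes)
    total_decoy = sum(d for _, d in passes)
    return estimated_fdr(total_target, total_decoy)


target_fdr = 0.01
passes = [(1000, 8), (200, 2)]          # made-up counts for pass 1 and pass 2
per_pass = [estimated_fdr(t, d) for t, d in passes]
overall = combined_fdr(passes)
```

Because the combined value is a count-weighted average of the per-pass values, it can never exceed the larger of the two, so it stays at or below the target if each pass does.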

Manual Error Tolerant Search

Database entries are selected from the results report of a standard search. Check the Error tolerant checkbox, near the Search selected button, and choose one or more proteins to be included in the search. (On the public web site, a maximum of 3 proteins can be chosen.) Clicking on the Search selected button loads a modified search form, from which you can change many of the search parameters. Cleavage agent defaults to None, though an enzyme can be chosen if desired.

During the error tolerant search:

  1. The complete list of modifications is tested, serially
  2. For a protein, the set of all possible amino acid substitutions is tested. For a nucleic acid sequence, all single base insertions, deletions, and substitutions are tested.

The manual error tolerant search should only be used in exceptional cases. One reason is that the level of "junk" matches will be higher than in the automatic search, because enzyme specificity is dropped entirely, modifications can be combined with non-specificity, and fewer database entries tend to be searched. Another reason is that, in the automatic search, the results from both passes are saved to the result file, which provides greater reporting flexibility. For example, you can choose to show or hide the additional, error tolerant matches. The combined report also reduces compatibility problems for applications that read Mascot result files.

Reviewing the Results

It is important to recognise that only the matches from the standard, first pass search provide evidence for the identity of a protein. The additional matches found in the error tolerant, second pass search are valuable because they are the most likely assignments of the spectra. Occasionally, an additional match will provide useful biological information, such as distinguishing between two isoforms. If the same modification shows up many times, this may indicate an experimental artefact that needs to be eliminated or, at least, selected as a variable modification for standard searches.

Nevertheless, these additional matches have been obtained by selecting a small number of database entries and beating them into submission with non-specificity, substitutions and a long list of modifications, so should be viewed with caution.

The second pass matches do not contribute to the protein score. If the query also has a lower scoring match to the same protein in the first pass search, that first pass match contributes instead, so that the protein scores are identical to those that would be obtained in a standard, single pass search. (If you perform a manual error tolerant search, the report will show protein scores derived from all the matches listed.)

It is advisable to select the decoy database option whenever performing an automatic error tolerant search. If this is not done, statistics are based on the number of trials, and will be less reliable. For an automatic error tolerant search that includes the decoy option, expect values are estimated from the decoy results. The target and decoy proteins are treated as pairs. After the first pass search, when proteins are selected, each significant match, whether target or decoy, causes the relevant pair of target and decoy proteins to be selected for the second pass. This means that the second pass target and decoy databases are of identical size and contain all the entries with significant PSMs from the first pass.
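
The pairing of target and decoy proteins can be sketched as follows. This is an illustration only: the function name and the "DECOY_" accession prefix are hypothetical, and nothing here reflects how Mascot actually labels decoy entries.

```python
def select_pairs(significant_accessions, decoy_prefix="DECOY_"):
    """Return (targets, decoys) for the second pass, given the accessions
    that had significant matches in the first pass."""
    targets = set()
    for acc in significant_accessions:
        # A decoy hit is mapped back to its target partner;
        # a target hit maps to itself.
        if acc.startswith(decoy_prefix):
            acc = acc[len(decoy_prefix):]
        targets.add(acc)
    decoys = {decoy_prefix + acc for acc in targets}
    return targets, decoys


# Made-up first pass hits: one target entry (twice) and one decoy entry.
targets, decoys = select_pairs(["P1", "DECOY_P2", "P1"])
```

Selecting whole pairs is what keeps the two second pass databases the same size, whichever member of a pair produced the significant match.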

Queries that get significant matches in the first pass search do not go forward to the second pass. (Actually, they do, but we blindly discard the second pass results.) Queries that failed to get significant matches in the first pass go forward to the second pass, where they are searched against the selected entries. Statistics for the second pass matches are based on the total number of trials from both passes. This could mean that a much higher score is required to get a significant match in the second pass than in the first.

The following examples refer to the results from an automatic error tolerant search with decoy, specifically hit 2, Alkaline phosphatase.

[Figure: error tolerant search results]

In some cases, the additional match is the result of non-specific cleavage, such as queries 133 and 162. If the error tolerant match was found by introducing a modification or a sequence change, the mass delta and its location are given at the end of the row, in square brackets. When the mouse rests over the mass delta hyperlink, all the known assignments of this delta are displayed in a pop-up. Take a look at query 260. The mass tolerance for this search was fairly wide, ±0.8 Da, so the observed mass difference could correspond to either carbamidomethylation or carboxymethylation at the N-terminus. Since this sample was alkylated with iodoacetamide, we would choose carbamidomethylation as the more likely suspect, especially as this brings the error on the precursor mass into line with the general trend, whereas carboxymethylation would give an error of +0.5 Da. The assignment to carbamidomethylation is also very believable, because this is a known artefact of over-alkylation. The same modification is found for other queries. Another easily believable assignment is pyro-Glu for the match to query 252.
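
The ambiguity described for query 260 can be reproduced with a small Python sketch that lists every known delta falling within the tolerance of an observed mass difference. The table holds Unimod monoisotopic values for three modifications only, and the observed values used in the example are invented, since the actual delta for query 260 is not quoted here.

```python
# Monoisotopic mass deltas in Da (a small, hand-picked subset of Unimod).
KNOWN_DELTAS = {
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
    "Carboxymethyl": 58.005479,
}


def candidates(observed_delta, tol):
    """Return the names of all known deltas within +/- tol of the observation."""
    return sorted(
        name for name, delta in KNOWN_DELTAS.items()
        if abs(delta - observed_delta) <= tol
    )


wide = candidates(57.5, tol=0.8)      # hypothetical observed delta, wide tolerance
narrow = candidates(57.02, tol=0.1)   # hypothetical observed delta, tight tolerance
```

At ±0.8 Da both carbamidomethylation and carboxymethylation are candidates, so the choice has to be made from knowledge of the sample; at ±0.1 Da only one assignment survives.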

In other cases, the match may be good, but the assignment is not believable. For example, look at query 218, which has a mass difference of 15.0 Da on the N-term D, assigned to Hydroxamic_acid (DE). If you look this up in Unimod, it is described as an artefact of exposure to hydroxylamine. This is possible, but note that the amino terminus also carries an Acetyl, mass 42.0, which was included in the search as a variable modification. The sum of these two mods is 57.0, which happens to be the mass of N-term carbamidomethylation, an altogether more likely explanation.
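
The arithmetic behind this example is easy to check. The Python sketch below uses Unimod monoisotopic deltas for the three modifications involved and looks for pairs whose sum matches a single, simpler modification; the function name is hypothetical.

```python
from itertools import combinations

# Monoisotopic mass deltas in Da, taken from Unimod.
MONO_DELTA = {
    "Hydroxamic_acid": 15.010899,
    "Acetyl": 42.010565,
    "Carbamidomethyl": 57.021464,
}


def single_mod_equivalents(deltas, tol=0.001):
    """Return ((mod_a, mod_b), single_mod) wherever the deltas of two
    modifications sum to the delta of a third, within tol Da."""
    found = []
    for a, b in combinations(sorted(deltas), 2):
        total = deltas[a] + deltas[b]
        for name, delta in deltas.items():
            if name not in (a, b) and abs(total - delta) <= tol:
                found.append(((a, b), name))
    return found


equivalents = single_mod_equivalents(MONO_DELTA)
```

The only pair found is Acetyl plus Hydroxamic_acid, which is indistinguishable by mass from a single Carbamidomethyl on the same terminus.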

Always check the alternative matches that are displayed if you expand the twisty in the rank column or click on the query number to load a Peptide View report. It is common to get multiple matches with similar scores, and the best match may be an unlikely modification, while a match with a slightly lower score has a more credible explanation. Query 124 provides a good example. The displayed match has a delta corresponding to succinylation and the peptide sequence found in family member 2.3. A match with the same score is obtained from the peptide sequence found in family member 2.1 plus a delta of 114.0, which corresponds to double carbamidomethylation – a more likely modification.

More about Modifications

The list of modifications used by Mascot is taken directly from the Unimod database. For further details of individual modifications, please refer to Unimod.

Note that only a small sub-set of modifications is displayed by default in the Mascot search form. If you want to see the complete list, you must go to the search form defaults page and tick the checkbox for Show all mods.

In an Error Tolerant search, all the entries on the modifications list are tested serially, and all permutations of each individual modification are tested. For example, if a modification affects serine, and a peptide contains three serines, but has a molecular mass consistent with just two modifications, there are 3 permutations to be tested (110,101,011).
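
The serine example can be enumerated directly. This Python sketch, with a hypothetical function name, lists every way of placing a given number of identical modifications on a given number of sites, written as a string of 1s (modified) and 0s (unmodified).

```python
from itertools import combinations


def placements(n_sites, n_mods):
    """List every arrangement of n_mods identical modifications
    over n_sites candidate residues."""
    result = []
    for chosen in combinations(range(n_sites), n_mods):
        result.append(
            "".join("1" if i in chosen else "0" for i in range(n_sites))
        )
    return result


three_serines = placements(3, 2)
```

For three serines and two modifications this gives the same three arrangements as in the text: 110, 101 and 011.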

This differs from the behaviour for any variable modifications explicitly specified in the search form, when all permutations and combinations of the selected modifications are tested. Specifying more than a handful of variable modifications leads to a drastic loss of discrimination, because the number of permutations and combinations increases geometrically with the total fractional abundance of modifiable residues.
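
A rough feel for this growth can be had by counting arrangements for a single peptide. The Python sketch below is a simplified upper bound, not Mascot's algorithm: it assumes each residue is either unmodified or carries one applicable modification, and it ignores the precursor mass filter and any ceiling on the number of modifications. The peptide sequence is invented.

```python
from math import prod


def count_arrangements(peptide, variable_mods):
    """Upper bound on the number of modified forms of a peptide.
    variable_mods maps a modification name to the residues it can modify."""
    return prod(
        1 + sum(residue in sites for sites in variable_mods.values())
        for residue in peptide
    )


peptide = "SAMSTK"  # made-up sequence
one_mod = count_arrangements(peptide, {"Phospho": "ST"})
two_mods = count_arrangements(peptide, {"Phospho": "ST", "Oxidation": "M"})
three_mods = count_arrangements(
    peptide, {"Phospho": "ST", "Oxidation": "M", "Acetyl": "K"}
)
```

Each additional modification that finds a site on the peptide at least doubles the count (8, 16, 32 here), and this is for one short peptide.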

More about Sequence Variants

Variations in the primary sequence generally result from variations in the DNA sequence. These may be DNA sequencing errors, they may be mutations or polymorphisms, or they may be more extensive evolutionary changes, because the database entry is not the authentic protein, but a related sequence from a different species.

When searching a nucleic acid database, single base deletions and insertions can be tested in addition to substitutions. The consequences of deletions and insertions cannot be tested for a protein database because they cause a frame shift, which completely changes the amino acid sequence from that point onwards.
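
The effect of a frame shift is easy to demonstrate. The Python sketch below translates a short, made-up DNA sequence with the standard genetic code, then repeats the translation after a single base substitution and after a single base deletion. The sequence and helper names are for illustration only.

```python
# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 1, ignoring any incomplete trailing codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON[dna[i:i + 3]] for i in range(0, usable, 3))


original = "ATGGCCAAGTTTGGA"                     # made-up sequence
substituted = original[:4] + "A" + original[5:]  # one base changed
deleted = original[:3] + original[4:]            # one base removed

protein = translate(original)
protein_sub = translate(substituted)
protein_del = translate(deleted)
```

The substitution changes a single residue (MAKFG becomes MDKFG), whereas the deletion changes every residue after the first (MPSL), which is why such events can only be modelled at the nucleic acid level.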

Amino acid substitutions in protein sequences are handled like modifications, and the composition and mass changes are taken from Unimod entries.
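
The mass change for a substitution is simply the difference between the two residue masses. The Python sketch below computes it from a small table of monoisotopic residue masses; this is equivalent arithmetic for illustration, not the way Mascot obtains the values (which, as stated above, come from Unimod entries). The function name is hypothetical and only six residues are included.

```python
# Monoisotopic residue masses in Da (a subset of the 20 standard residues).
RESIDUE_MONO = {
    "G": 57.021464,
    "A": 71.037114,
    "I": 113.084064,
    "L": 113.084064,
    "Q": 128.058578,
    "K": 128.094963,
}


def substitution_delta(old, new):
    """Mass delta in Da when residue `old` is replaced by residue `new`."""
    return RESIDUE_MONO[new] - RESIDUE_MONO[old]


q_to_k = substitution_delta("Q", "K")
g_to_a = substitution_delta("G", "A")
i_to_l = substitution_delta("I", "L")
```

Q->K gives a delta of about 0.036 Da and I->L gives zero, so neither can be detected unless the mass tolerances are very tight (or, for I->L, at all), while G->A gives about 14.016 Da.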