Sequence database setup: Database Manager

Database Manager is a browser-based utility for configuring and updating local copies of sequence databases. It replaces both the Database Maintenance Utility and the Database Update script, which were components of Mascot 2.3 and earlier.

Sequence databases may be Fasta files, containing AA or NA sequences, for searching with Mascot, or MSP files, which are spectral libraries for searching with NIST MSPepSearch.

The file formats and download locations of sequence databases change from time to time. One of the smart features of Database Manager is that database configurations for the most popular public databases are updated automatically, by downloading configuration data from the Matrix Science web site. This means that, for databases such as SwissProt and NCBIprot, all you need do to make a new sequence database available for searching in Mascot is:

  • Choose to enable the database
  • Decide where the files should live
  • Optionally, specify an update schedule

If the file format or the download URL changes at some future date, Database Manager will get new configuration information prior to updating the database files.

Concepts & Definitions

Predefined Database Definition: Configuration information for the most popular public databases is kept up-to-date on the Matrix Science web site, and downloaded as required by Database Manager. You don’t need to know file URLs or worry about parse rules, etc. for a Predefined Database.

Custom Database Definition: If you want to search a database that is not included in the list of Predefined Database Definitions, or if you want to configure one of these databases in some non-standard way, you create a Custom Database Definition.

Synchronisation: If a custom definition is very similar to a predefined definition, it can be converted into a predefined definition by being synchronised. The advantage of doing this is that the configuration will then be kept up-to-date automatically.

Tasks: Database files can be very large, and downloading may take a long time. Database Manager processes tasks serially in the background as long as Mascot Monitor (ms-monitor.exe) is running.

Update Schedule: A schedule can be created to update all the files associated with a database automatically, for example once each week or once each month. Files will only be downloaded if a new version is available.

Parse Rules: The format of Fasta title lines varies from database to database. Each database definition must include accession and description parse rules, which tell Mascot how to extract a unique identifier and a description for each entry. Protein accessions and descriptions are optional for library (MSP) files, but if they are present, parse rules are required.
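
To make the idea concrete, here is a minimal Python sketch of what an accession rule and a description rule accomplish. It uses Python's re module rather than Mascot's own parse rule syntax, and the title line and both patterns are illustrative assumptions, not rules taken from a Mascot configuration.

```python
import re

# Hypothetical Fasta title line (illustrative only).
title = ">db|ACC00001|EXAMPLE_ENTRY Example protein description"

# Illustrative patterns; Mascot's real parse rules use their own syntax.
ACCESSION_RULE = re.compile(r"^>[^|]*\|([^|]*)\|")  # text between the first two bars
DESCRIPTION_RULE = re.compile(r"^>\S+\s+(.*)$")     # everything after the first space


def apply_rule(rule, line):
    """Return the captured group, or None if the rule does not match."""
    match = rule.search(line)
    return match.group(1) if match else None


accession = apply_rule(ACCESSION_RULE, title)
description = apply_rule(DESCRIPTION_RULE, title)
print(accession)
print(description)
```

A rule that returns None for some title lines is the kind of problem that testing parse rules against the real file is meant to catch.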

Active vs. Inactive: An Inactive database definition is effectively hidden from Mascot, and the database will not appear on the Database Status page. A new custom definition is inactive until configuration is complete and all the required files are in their final locations. An Active database definition is visible to Mascot. If the required files are present, Mascot will try to compress them and bring the database On-line. This can fail for all sorts of reasons, such as a missing or corrupt file or a mistake in the configuration, and Database Status will show an error. Hence, an active database definition does not necessarily mean the database is In Use. You might wish to set an active definition to be inactive if you don’t want to see it listed in the search form or if there is some problem with the definition that you don’t want to resolve immediately.

Running Database Manager for the first time

IMPORTANT: Database Manager must be allowed exclusive control of database configuration. Editing mascot.dat outside of Database Manager will just cause confusion because Database Manager re-writes mascot.dat whenever a configuration changes. If you prefer to configure sequence databases manually, by editing mascot.dat, never run Database Manager. Manual procedures are described here.

When Database Manager is run for the first time, it imports existing database definitions from mascot.dat. If a definition looks similar to a Predefined Definition, you will be offered the option to synchronise it. Database Manager will also try to download the latest configuration file from the Matrix Science web site. It is at this point that any problem with the connection to the Internet will be discovered. If warnings about the Internet connection are displayed then, unless the Mascot Server is intentionally isolated from the Internet, choose to configure the proxy settings and save them.

If the Mascot Server is intentionally isolated from the Internet, choose Do not use the Internet to avoid seeing constant error messages about failed connections.

Once any connection issues have been resolved, the configuration import page is displayed. If Database Manager is being run after a clean installation of Mascot, the only existing definitions will be SwissProt (a Fasta) and PRIDE_Contaminants (a spectral library). In most cases, you will want to synchronise these definitions with the predefined ones, and choose Import. If you have upgraded an existing Mascot installation, other database definitions in mascot.dat will also be listed, and you need to decide which to synchronise and which to keep as custom definitions.

Database Manager tries to match existing definitions against predefined definitions and reports the quality of the match as none, poor, good, or perfect. For poor or good matches, the differences can be inspected. Usually, these arise because the existing definition is out-of-date in some respect.

If the Mascot Server is not allowed to access the Internet, choose Keep as Custom unless the match is perfect. This is because synchronisation of any definition where the match is not perfect requires the database files to be updated. Even if you have an Internet connection, choose Keep as Custom for any database with a poor or good match if you do not want to update the database files, or if you see a difference in the existing definition that you want to preserve.

Choose Import to proceed. The list of Databases will be displayed, with status information for those that have been synchronised and need updating.

Custom definitions that are possible matches to predefined ones can be made predefined at any time by choosing Synchronise custom definitions.

Adding a new database

You can add new databases in four different ways:

  1. Enable predefined definition

    Apart from confirming a location for the downloaded files, everything will be handled automatically. Only one instance of each predefined definition can be enabled at any one time, as database names must be unique. If you want to enable a predefined database, but make changes to the configuration, e.g. to keep an old version on-line, choose Create New; Use predefined definition template.

  2. Create New; Custom

    Create a new custom database definition from scratch.

  3. Create New; Copy Of

    Create a new custom database definition by copying an existing definition. You will be required to enter a new database name and given the choice of copying the existing database files.

  4. Create New; Use predefined definition template

    Create a new custom database definition by starting from a predefined definition. The differences between this and enabling a predefined definition are (i) you can make changes to the configuration, (ii) the definition will not be kept up-to-date automatically.

When a new database is created, unless it is predefined, you will either need to supply download URLs for the files or copy the files manually to the specified directory on the Mascot Server before configuration can be completed. This is primarily to allow parse rules to be tested, but it also verifies that the download URL works or that the manually copied files have the correct names and security settings / permissions.

Drop-down help is provided for each element in the configuration pages. The following terms may benefit from additional explanation:

Database Name: Each database must have a unique name. Ideally, the name should be short and descriptive. Note that these names are case sensitive, and much confusion can be caused by creating both SwissProt and swissprot.
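
As a small illustration of the case-sensitivity pitfall, the following Python sketch flags names that differ only by letter case. The function name and the list of names are hypothetical; Database Manager is not known to expose such a check.

```python
from collections import defaultdict


def case_collisions(names):
    """Group database names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Keep only groups that contain more than one distinct spelling.
    return [group for group in groups.values() if len(set(group)) > 1]


existing = ["SwissProt", "NCBIprot", "swissprot"]
print(case_collisions(existing))
```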

Local paths: The delimiters between directories must always be forward slashes, even if Mascot is running on a Windows system. The default parent directory for sequence database directories can be specified on the Settings page.
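
If a path has been copied from Windows Explorer, it can be rewritten with forward slashes before it is entered. The following is a minimal Python sketch using the standard pathlib module; the example path is hypothetical.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path):
    """Rewrite a Windows-style path with forward-slash delimiters."""
    return PureWindowsPath(path).as_posix()


# Hypothetical database directory on a Windows system.
print(to_forward_slashes(r"D:\data\sequence\SwissProt"))
```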

Memory mapping and locking: Memory mapped files can be locked in memory, but only if the computer has sufficient RAM. Having a database locked in memory means that it can never be swapped out to disk, ensuring maximum possible search speed. If you try to lock databases into RAM when there isn’t room, this will not be a major problem. The locking will fail, generate an error message, and Mascot will carry on regardless. A more serious problem is when there is just sufficient RAM to lock the databases, but none left over for searches or other applications. In this case, the whole system will slow down and the hard disk will be observed to be “thrashing”. Eventually, the system is likely to hang or crash.

Threads: A Mascot search can use multiple threads, so as to make use of all the logical processors covered by the licence. Usually, it is best to leave threads set to -1, which means automatic. If you want to restrict the number of threads on a non-cluster (SMP) system, you can do so by setting a value of 1 or more. Each CPU in the Mascot licence allows use of up to 4 cores, which requires 8 threads for a hyperthreaded processor or 4 otherwise. On a cluster system, the number of threads is set for each search node in a separate configuration file, nodelist.txt.
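
The arithmetic in the previous paragraph can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that the allowance scales linearly with the number of licensed CPUs, which is an extrapolation from the single-CPU figures given above.

```python
CORES_PER_LICENSED_CPU = 4  # from the text: each licensed CPU allows up to 4 cores


def max_useful_threads(licensed_cpus, hyperthreaded):
    """Estimate the thread count needed to use every licensed core."""
    # A hyperthreaded core presents two logical processors, otherwise one.
    threads_per_core = 2 if hyperthreaded else 1
    return licensed_cpus * CORES_PER_LICENSED_CPU * threads_per_core


print(max_useful_threads(1, hyperthreaded=True))
print(max_useful_threads(1, hyperthreaded=False))
```

In most cases there is no need to do this calculation, because the automatic setting of -1 is recommended.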

Creating a spectral library from search results

When you create a custom definition for a spectral library, the library can be an existing MSP file that you download or it can be created locally from your Mascot search results.

When a library is created from search results, only results files and peptide matches that pass suitable filtering criteria will be included in the library. More information about spectral library filters can be found on the Spectral library search help page.

Scheduled updates

If a URL is specified for downloading the Fasta or MSP file, or if the library is being created from search results, you can create an update schedule. This can be done when the database is first added or later, by clicking on the name hyperlink in the databases list.

Global settings

Allow Internet access: If the Mascot Server machine has no Internet connection or if you do not wish Database Manager to access the Internet, this should be set explicitly to avoid getting error messages. If Internet access is prevented, you cannot download databases, which means that predefined definitions cannot be enabled. Use Create New; Use predefined definition template instead, and manually copy the required files to the Mascot server.

Predefined database definitions are taken from two files that were part of the Mascot installation, but will eventually become out-of-date. These two files are databases_1.xml and libraries_1.xml. If you find that there is a newer version of either file, you can download it manually on a machine with Internet access, rename it using an approximate timestamp that is unique for your configuration files (e.g. 2016-12-25-11-11-11.xml) and copy it to your Mascot config/db_manager/public directory. Do not over-write the original databases_1.xml or libraries_1.xml files.
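
A file name in this timestamp format can be generated with a few lines of Python. This is only a sketch: the function names are hypothetical, and bumping the time by one second until the name is unique is an assumed way of satisfying the "approximate but unique" requirement.

```python
from datetime import datetime, timedelta


def timestamped_name(moment):
    """Format a datetime as a file name like 2016-12-25-11-11-11.xml."""
    return moment.strftime("%Y-%m-%d-%H-%M-%S") + ".xml"


def unique_name(moment, existing):
    """Advance the timestamp one second at a time until the name is unused."""
    while timestamped_name(moment) in existing:
        moment += timedelta(seconds=1)
    return timestamped_name(moment)


# Names already present in config/db_manager/public (hypothetical listing).
present = {"databases_1.xml", "libraries_1.xml", "2016-12-25-11-11-11.xml"}
print(unique_name(datetime(2016, 12, 25, 11, 11, 11), present))
```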

Allow external full-text reports: Even when Internet access is enabled, it may be undesirable to allow reports for specific database entries to be retrieved from Internet sources. You can change the source for external reports in a custom definition but not in a predefined definition. Disabling external sources here blocks all external full-text reports.

Proxy: If automatic HTTP proxy detection fails, or if the proxy server is password protected, enter the proxy details and save them. Native FTP proxy servers are not supported.

Sequence directory: The files for each database reside in a directory with the same name as the database. When a new database is added, the sequence directory specifies the default path under which the database directory will be created unless it already exists. This is only a default, and you can change the path during configuration of a database. Database directories do not have to be kept together, and can be distributed across drives or partitions as convenient. If you choose remote storage, make sure the connection is fast and reliable and that memory mapping is supported. Windows UNC paths are not supported. The delimiters between directories must always be forward slashes, even if Mascot is running on a Windows system.

Technical

Important files:

  • db_manager.etags.2

    Successful downloads of database files are recorded in a file called db_manager.etags.2 in the incoming directory for the database. Each new version of a database is downloaded once. If you try to download the same file(s) a second time, maybe because the original was accidentally deleted, Database Manager will report that no new files are available. To force a new download, delete db_manager.etags.2 before choosing Update.

  • mascot.dat

    The general configuration file, mascot.dat in the config directory, is re-written by Database Manager whenever configuration changes are saved, so it is pointless to edit the database-related sections. Files with names like 2012-04-12_135833.mascot.dat are backups of mascot.dat.

  • global.conf

    Global settings, such as proxy server details, are saved to global.conf in the config/db_manager directory.

  • databases_1.xml

    When Mascot is installed, the initial set of predefined Fasta definitions is a file called databases_1.xml in the config/db_manager/public directory. If there is an Internet connection, whenever Database Manager tries to update a database, it checks for updates to this file. If a new version is available, it is downloaded to a file with a name like 2017-04-13-15-34-30.xml. Note that these timestamped files are not backups and must not be deleted.

  • libraries_1.xml

    When Mascot is installed, the initial set of predefined library definitions is a file called libraries_1.xml in the config/db_manager/public directory. If there is an Internet connection, whenever Database Manager tries to update a database, it checks for updates to this file. If a new version is available, it is downloaded to a file with a name like 2017-04-13-15-34-30.xml. Note that these timestamped files are not backups and must not be deleted.

  • configuration.xml

    Database configuration information is saved to configuration.xml in the config/db_manager directory. Files with names like 2012-04-14_144014.configuration.xml are backups. For custom definitions, all the configuration information is in configuration.xml. For predefined databases, only limited settings are in configuration.xml. Most settings are inherited from databases_1.xml, libraries_1.xml, or a later update file, e.g. 2017-04-13-15-34-30.xml. Note that the configuration for a predefined database is only updated when the database files themselves are updated.

    To illustrate, imagine we have SwissProt 2017_01 as a predefined database. Six months later, with release 2017_07, the Fasta title line changes so as to require a new accession parse rule. A new version of databases_1.xml is posted on the Matrix Science web site and downloaded by Database Manager. However, until your local copy of SwissProt is updated to 2017_07 or later, you don’t actually want to use the new accession parse rule, because this could break the configuration for the files from the earlier release. So, the definition in configuration.xml specifies that the configuration settings are inherited from the earlier file until SwissProt is updated, at which point, the definition in configuration.xml will be changed to specify that settings are inherited from the latest public file.
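
The inheritance behaviour in the SwissProt example can be modelled with a toy Python class. This is a sketch of the logic only: the class, its methods and the timestamped file name are hypothetical and do not reflect the actual structure of configuration.xml.

```python
from dataclasses import dataclass, field


@dataclass
class DefinitionStore:
    """Toy model: which public file each predefined database inherits from."""
    latest_public_file: str
    inherits_from: dict = field(default_factory=dict)

    def enable(self, name):
        # A newly enabled database inherits from the current public file.
        self.inherits_from[name] = self.latest_public_file

    def download_public_file(self, file_name):
        # A newer public file arrives; existing databases keep their old source.
        self.latest_public_file = file_name

    def update_database(self, name):
        # Only when the database files are updated does the source move forward.
        self.inherits_from[name] = self.latest_public_file


store = DefinitionStore("databases_1.xml")
store.enable("SwissProt")
store.download_public_file("2017-07-01-00-00-00.xml")  # hypothetical file name
before = store.inherits_from["SwissProt"]
store.update_database("SwissProt")
after = store.inherits_from["SwissProt"]
print(before, after)
```

Keeping the old source until the files change is what protects the older release from a parse rule written for the newer title line format.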

Editing configuration.xml:

In the current release of Database Manager, there is no user interface for creating or modifying a taxonomy parse rule. You can only select from those that already exist. If you need to make changes in this area, the procedure is described in Chapter 9 of the Installation & Setup manual. You’ll want to create a custom definition for the database in question, possibly by using a predefined definition as a template. Trying to modify a predefined definition is problematic because your changes will be lost each time the database is updated after a new databases_1.xml file has been downloaded.

If you simply want to add a new category to the taxonomy filter drop-down list that appears in the search form, this does not require any changes to database configuration files. Just edit the file called taxonomy in the Mascot config directory, as explained in Chapter 9 of the Installation & Setup manual.