Sequence database setup: UniProt proteomes
Overview
A UniProt complete proteome consists of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.
UniProtKB is a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource.
First, find the Proteome ID for your proteome of interest by searching https://www.uniprot.org/proteomes/. This example uses rice (Oryza sativa subsp. japonica), which has Proteome ID UP000059680.
In Database Manager, create a new custom definition, as follows:
- Fasta or New database; Create New
- Use pre-defined template; UniProt_proteome_template
- Create
- Download from remote URL; Next
- Set up download URL
- Paste the following into the FASTA file URL field, substituting the Proteome ID for your proteome of interest:
https://www.uniprot.org/uniprot/?query=proteome:UP000059680&format=fasta&compress=no&include=yes
- Save; Start downloading
- Activate
Note that HTTPS support in Database Manager requires Mascot Server 2.6.2 or later.
(Screenshot: the complete configuration for the rice proteome in Database Manager.)
Once configured, you can enable automatic updating by clicking on the database name and then choosing Edit schedule.
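If you set up several proteomes, it can help to generate the FASTA file URL rather than editing it by hand. The following is a minimal sketch that fills the Proteome ID into the URL shown above. The function name `proteome_fasta_url` is hypothetical, and the check that an ID is "UP" followed by digits is an assumption based on the example ID, not a documented rule.

```python
import re

# URL pattern copied from the setup steps above; only the Proteome ID varies.
URL_TEMPLATE = (
    "https://www.uniprot.org/uniprot/"
    "?query=proteome:{proteome_id}&format=fasta&compress=no&include=yes"
)


def proteome_fasta_url(proteome_id: str) -> str:
    """Return the FASTA file URL for one Proteome ID (hypothetical helper)."""
    # Assumption: Proteome IDs look like "UP" followed by digits, e.g. UP000059680.
    if not re.fullmatch(r"UP\d+", proteome_id):
        raise ValueError(f"Unexpected Proteome ID format: {proteome_id!r}")
    return URL_TEMPLATE.format(proteome_id=proteome_id)


if __name__ == "__main__":
    # Rice, Oryza sativa subsp. japonica
    print(proteome_fasta_url("UP000059680"))
```

The printed URL is what you would paste into the FASTA file URL field.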
Manual download
- Locate the proteome for your organism of interest by searching by name or by taxonomy ID at https://www.uniprot.org/proteomes/
- Click on the Proteome ID link
- Click on the Download button and choose All protein entries, Fasta (Canonical and isoform), compressed
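After a manual download, a quick sanity check is to count the entries in the compressed file before unpacking it into the sequence directory. The sketch below assumes the download is gzip-compressed. The function name `count_fasta_entries` is hypothetical, and the two entries in the demonstration are made up.

```python
import gzip
import io


def count_fasta_entries(source) -> int:
    """Count FASTA title lines in a gzip-compressed file.

    `source` may be a file path or a binary file object.
    Assumption: the download is gzip-compressed.
    """
    with gzip.open(source, "rt", encoding="utf-8") as fasta:
        # Every FASTA entry begins with a title line starting with ">".
        return sum(1 for line in fasta if line.startswith(">"))


if __name__ == "__main__":
    # In-memory demonstration with two made-up entries.
    demo = io.BytesIO(
        gzip.compress(b">sp|X1|A_TEST first\nMKV\n>sp|X2|B_TEST second\nMAA\n")
    )
    print(count_fasta_entries(demo))
```

For a real download, pass the path of the compressed file instead of the in-memory buffer.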
Taxonomy
Taxonomy is not required for a single-organism database.
Parse Rules
When a single entry is expanded into separate entries for multiple isoforms, they all share the same ID (4CL4_ORYSJ in the example below), so the accession (AC) must be used as the unique identifier.
>sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4
AC from Fasta title: ">..|\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
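To see what these two rules extract, you can try equivalent patterns outside Mascot. The sketch below translates them into Python syntax, where grouping uses plain parentheses and a literal `|` must be escaped. It only illustrates the patterns and is not how Mascot applies them internally. The function name `parse_title` is hypothetical.

```python
import re

# Python equivalents of the two parse rules above.
AC_RULE = re.compile(r">..\|([^|]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")

TITLE = (
    ">sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 "
    "OS=Oryza sativa subsp. japonica GN=4CL4"
)


def parse_title(title: str) -> tuple[str, str]:
    """Return (accession, description) from a FASTA title line (hypothetical helper)."""
    ac = AC_RULE.match(title)
    description = DESCRIPTION_RULE.match(title)
    if ac is None or description is None:
        raise ValueError(f"Title does not match the parse rules: {title!r}")
    return ac.group(1), description.group(1)


if __name__ == "__main__":
    accession, description = parse_title(TITLE)
    print(accession)    # Q67W82-2
    print(description)
```

The accession keeps its isoform suffix (-2), which is what makes it unique where the ID is not.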
Configuration (Mascot 2.3 and earlier)
A Fasta file containing the canonical and isoform sequences for the rice proteome was downloaded to /usr/local/mascot/sequence/rice_proteome/current and renamed to rice_proteome_20120414.fasta.
Full text for individual entries can be retrieved over the web from UniProt. Note that port 80, as shown in the screenshot, no longer works; use port 443.
Host: www.uniprot.org
Port: 443
Path: /uniprot/#ACCESSION#.txt
Parse rule: RULE_23 "\(.*\)"
Always test a new definition before applying the changes to mascot.dat.
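The Host, Port and Path settings above combine into a single address once an accession replaces the #ACCESSION# placeholder. The following is a minimal sketch of that substitution, useful for checking the address in a browser. The function name `full_text_url` is hypothetical, and treating port 443 as HTTPS and any other port as plain HTTP is an assumption of the sketch.

```python
def full_text_url(
    accession: str,
    host: str = "www.uniprot.org",
    port: int = 443,
    path: str = "/uniprot/#ACCESSION#.txt",
) -> str:
    """Build the full text address for one accession (hypothetical helper)."""
    # Assumption: port 443 means HTTPS; any other port is treated as plain HTTP.
    scheme = "https" if port == 443 else "http"
    # Default ports are left out of the address; other ports are appended to the host.
    netloc = host if port in (80, 443) else f"{host}:{port}"
    return f"{scheme}://{netloc}{path.replace('#ACCESSION#', accession)}"


if __name__ == "__main__":
    print(full_text_url("Q67W82"))
```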