Format Scoremats Into A CDD Database Using formatrpsdb

Tao Tao, PhD
User Services
NCBI, NLM, NIH

1. Introduction

A standalone PSI-BLAST (blastpgp) search produces a separate file when the -C parameter is specified. This file contains the input sequence and its associated Position Specific Scoring Matrix (PSSM), encoded as an ASN.1 "PssmWithParameters" object ("scoremat" for short). The scoremat is useful in functional analysis of protein sequences. A collection of scoremat files can be converted into a database suitable for searching with Reverse Position Specific BLAST (rpsblast) using the program formatrpsdb. When given a list of these files, formatrpsdb produces the corresponding database.

formatrpsdb performs in a single pass the work that previously had to be done stepwise with copymat, makemat, and formatdb, without generating the large number of intermediate files those utilities need to create a final rpsblast database. Furthermore, scoremat objects are of more general use than the binary format makemat requires. It is our hope that direct manipulation of scoremat objects will encourage the conversion of more diverse sequence collections into rpsblast databases.

Databases generated by formatrpsdb are binary compatible with databases generated by copymat/makemat/formatdb, although the database files generally will not be byte-for-byte identical. The database is also endian-specific due to database indexing.

1.1. Other relevant documents

This document is a rewrite of a few existing documents on rpsblast database tools with a focus on the practical usage. Other documents on relevant BLAST tools are at:

www.ncbi.nlm.nih.gov/staff/tao/URLAPI/rpsblast.html: rpsblast usage
www.ncbi.nlm.nih.gov/staff/tao/URLAPI/blastpgp.html: blastpgp usage
ftp.ncbi.nlm.nih.gov/blast/documents/rpsblast.html: technical information on rpsblast
ftp.ncbi.nlm.nih.gov/blast/documents/formatdb.html: technical information on formatdb
ftp.ncbi.nlm.nih.gov/blast/documents/blastpgp.html: technical information on blastpgp
www.ncbi.nlm.nih.gov/data_specs/asn/scoremat.asn: ASN.1 specification for PssmWithParameters

1.2. Prerequisites for using formatrpsdb

This section assumes some familiarity with the aforementioned documents.

An rpsblast database consists of two groups of files. The first group is a standard protein database, and the second group of files contains precomputations used to speed up rpsblast searches of the standard protein database. Previously, formatdb would build the first group of files, and makemat/copymat would be used to build the second group (the 'RPS data files').

As mentioned before, formatrpsdb performs all of these steps in a single pass. However, the collection of sequences passed to formatrpsdb must already be consistent in several important ways:

  • All sequences must use the same protein alphabet, currently a 28-letter alphabet.
  • Scores in all PSSMs must be scaled by the same factor.
  • If a scoremat does not contain a PSSM, it must contain a set of residue frequencies that formatrpsdb can use to create a PSSM manually.
    • The PSSM creation process is identical to that performed by makemat, and requires a scaling factor, gap existence and extension penalties, and an underlying score matrix.
    • These must be provided as command line options to formatrpsdb, or each scoremat can contain one or more of these values, which will be used in place of the values specified as input arguments.
    • If a sequence contains both a PSSM and residue frequencies, the latter will be ignored.
    See the command line options below.

Regarding the last requirement, a collection of sequences passed to formatrpsdb may include a mixture of sequences for which a PSSM is available, and sequences for which only the residue frequencies are available. The present version of formatrpsdb requires that all parameters (scale factor, gap open/extend, underlying score matrix), whether appearing within a scoremat or supplied from the command line, must be the same for all sequences.

Prebuilt collections of sequences that satisfy these criteria are available from NCBI, along with tools capable of building compliant sequence files. Further, blastpgp is capable of reading and writing scoremat files containing residue frequencies.

2. rpsblast Databases

rpsblast databases can be formatted from scoremats using formatrpsdb. Those scoremats can come from NCBI, from a third party, or be generated de novo from custom blastpgp searches.

2.1 CDD Databases from NCBI

CDD databases accessible through the NCBI BLAST homepage (www.ncbi.nlm.nih.gov/BLAST/) are also available for download, both preformatted and as unformatted scoremats. For users with no need to convert custom scoremats into an rpsblast database, the preformatted version is recommended. Simply download the set you need, unpack the archive, and search it with rpsblast. Note that the rpsblast search databases generated by formatrpsdb are architecture dependent. It is NOT possible to create them on a platform of one endianness and use them on a platform of the other.

Table 2.1. CDD Database from NCBI
Intended Platform | Archive | Content (1)
Intel/AMD chip with Linux, Solaris, or Windows | little_endian/Cdd_LE.tar.gz | Preformatted cdd, Pfam, Smart, Cog, and Kog databases
Sun/Solaris, SGI/IRIX, PowerPC/MacOSX (2) | big_endian/Cdd_BE.tar.gz | Same as above
All platforms | cdd.tar.gz | Unformatted scoremats for the above databases
Note:
(1) For more information, see CDD help
(2) A Mac with an Intel Core Duo is a little endian platform.

More information on "endianness" is given in Table 1.1 of http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/unix_setup.html.

To build search databases for rpsblast from the scoremat archives, we need to unpack the archive and extract its contents. It contains ASCII formatted files with the following extensions:

  • *.smp Position Specific Scoring Matrices (PSSMs) stored in a new ASN.1 format ("scoremat"), which is shared between various BLAST applications.
  • *.pn lists of PSSM file names

These files allow for the compilation of five rpsblast databases, i.e., Smart, Pfam, Cog, Kog, and Cdd. Note that Cdd contains only a subset of domains from the Smart, Pfam, and COG databases, chosen by NCBI curators to reduce redundancy. It only represents domains with wide phylogenetic distribution and is the set that is indexed in NCBI's Entrez. Those files can be formatted into rpsblast search databases. See Section 3.4 for details.

2.2. CDD Databases from Custom Scoremat

Users can now take any arbitrary subset of PSSMs and compile them into an rpsblast search database. All formatrpsdb needs is a list of file names (such as "Smart.pn" in the example above) and the corresponding scoremat (*.smp) files. Newer versions of blastpgp can also write out "checkpoints" in the scoremat format, using the combination of the "-C" and "-u" options. The -C output of blastpgp can be combined with arbitrary subsets of the scoremat-formatted PSSMs distributed here to create customized rpsblast search sets. The scoremat-formatted PSSMs distributed here are scaled by a factor of 100.0; if they are combined with blastpgp-generated scoremats, the same scaling factor must be set with the formatrpsdb command line parameter -S.
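Before formatting a custom subset, it can help to confirm that every name in the list file has a matching scoremat file. The following is a minimal sketch in Python, used here only for illustration; the helper names read_scoremat_list and missing_scoremats are hypothetical and are not part of any NCBI tool, and the rule for resolving relative names is an assumption.

```python
from pathlib import Path


def read_scoremat_list(list_path):
    """Return the scoremat file names in a list file, one name per line."""
    lines = Path(list_path).read_text().splitlines()
    # Blank lines are skipped so a trailing newline does not count as a name.
    return [line.strip() for line in lines if line.strip()]


def missing_scoremats(list_path, base_dir="."):
    """Return listed names that have no matching file under base_dir.

    Assumption: names are resolved relative to base_dir. How formatrpsdb
    itself resolves relative names is not stated in this document.
    """
    base = Path(base_dir)
    return [name for name in read_scoremat_list(list_path)
            if not (base / name).is_file()]
```

An empty result from missing_scoremats means every listed scoremat was found; any names returned should be fixed in the list file before running formatrpsdb.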

3. General usages of formatrpsdb

3.1. Format scoremat from blastpgp into a basic RPS blast database

Given a set of three sequence files from blastpgp -C output, 'scoremat1', 'scoremat2', and 'scoremat3', along with a text file 'list' containing the names of these three scoremat files in the following format:

scoremat1
scoremat2
scoremat3

we can use the following command line to create an rpsblast database. The database components are listed after the command line.

formatrpsdb -i list -p T  

list.pin   list.psq   list.phr   list.rps   list.loo   list.aux

The first three files are a standard non-indexed protein database, and the last three are RPS data files. To search the resulting database with rpsblast, refer to it using "-d list".
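The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the BLAST distribution: the helper names write_scoremat_list and basic_format_command are hypothetical, and the command is only assembled as an argument list, not executed.

```python
from pathlib import Path


def write_scoremat_list(list_path, scoremat_names):
    """Write one scoremat file name per line, as expected by formatrpsdb -i."""
    text = "".join(f"{name}\n" for name in scoremat_names)
    path = Path(list_path)
    path.write_text(text)
    return path


def basic_format_command(list_path):
    """Return the Section 3.1 command line as an argument list (not executed)."""
    return ["formatrpsdb", "-i", str(list_path), "-p", "T"]
```

To actually run the command, the argument list could be passed to subprocess.run with check=True, assuming formatrpsdb is installed and on the PATH.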

3.2 Format scoremats with additional index files

To index the formatted rpsblast database, we need to add "-o T" to the command line. The files generated by formatrpsdb are listed under the command line. The additional index files are list.psd and list.psi.

formatrpsdb -i list -p T -o T

list.psd   list.psi   list.pin   list.psq   list.phr   list.rps   
list.loo   list.aux

The index files allow retrieval of individual sequences by their seqid through the fastacmd program found in the standalone BLAST distribution.
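As a quick check after formatting, the expected component files can be derived from the database name. The sketch below is written in Python for illustration only; the function names are hypothetical, the extensions are the ones listed in Sections 3.1 and 3.2, and a single-volume database is assumed.

```python
from pathlib import Path

# Extensions listed in Section 3.1 (always produced) and Section 3.2 (with -o T).
BASE_EXTENSIONS = ("pin", "psq", "phr", "rps", "loo", "aux")
INDEX_EXTENSIONS = ("psd", "psi")


def expected_database_files(db_name, indexed=False):
    """Return the file names expected for a single-volume rpsblast database."""
    extensions = (INDEX_EXTENSIONS if indexed else ()) + BASE_EXTENSIONS
    return [f"{db_name}.{ext}" for ext in extensions]


def missing_database_files(directory, db_name, indexed=False):
    """Return expected database files that are absent from the directory."""
    folder = Path(directory)
    return [name for name in expected_database_files(db_name, indexed)
            if not (folder / name).exists()]
```

The database name is the value given to -n, or the input list file name when -n is not used.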

3.3 Change the default database name and add informative title

The default name for the formatrpsdb formatted database is the input file name. Often that name is not descriptive of the database content. We can rename the output database by providing our desired database name to the '-n' option. For example, to call the produced database in 3.1 'kinase' while adding a descriptive title, we can use the following command line:

formatrpsdb -i list -p T -o T -n kinase -t "protein serine kinase profile"

The files of the created database will be named 'kinase.*' instead of the 'list.*' given in Section 3.1. A file name alone is often not informative in the rpsblast search result. For this reason, we use the -t parameter to add a fuller description of the database, which will be displayed at the beginning of an rpsblast result.

3.4 Format the NCBI provided cdd files

The scoremats for cdd databases are available as an ftp archive at: ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz

This ftp file archives collections of position-specific scoring matrices created for the conserved domain search through rpsblast. The PSSMs are meant to be used for compiling rpsblast search databases only and they originate from various alignment collections:

  • Pfam: PSSMs from a mirror of the Pfam-A seed alignment database
  • Smart: PSSMs from a mirror of the Smart domain alignment database
  • COG: PSSMs from automatically aligned sequences or fragments from COGs
  • KOG: PSSMs from automatically aligned sequences or fragments from KOGs
  • cd: alignment models curated at NCBI as part of the CDD project

The following commands will build the rpsblast searchable databases:

formatrpsdb -i Smart.pn -o T -f 9.82 -n Smart -S 100.0 
formatrpsdb -i Pfam.pn -o T -f 9.82 -n Pfam -S 100.0 
formatrpsdb -i Cog.pn -o T -f 9.82 -n Cog -S 100.0 
formatrpsdb -i Kog.pn -o T -f 9.82 -n Kog -S 100.0 	
formatrpsdb -i Cdd.pn -o T -f 9.82 -n Cdd -S 100.0 

Note that the parameter '-f' supplied with formatrpsdb, the three-letter word score threshold for detecting and extending hits in RPS-Blast, will determine the size of the search database. A lower threshold will result in larger databases and a slightly increased search sensitivity, at the cost of additional memory requirements and reduced search speed.

Matrices distributed for creating rpsblast search databases are scaled by a factor (option -S) of 100. A score threshold value of 9.82 will result in search-databases of a size very similar to using unscaled matrices and a threshold value of 11.
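The five commands above differ only in the database name, so they can be generated rather than typed. The following Python sketch is illustrative; the function name cdd_format_commands is hypothetical, and the default values are simply the ones used in the commands above.

```python
def cdd_format_commands(names=("Smart", "Pfam", "Cog", "Kog", "Cdd"),
                        threshold="9.82", scale="100.0"):
    """Build one formatrpsdb argument list per database (nothing is executed).

    The options mirror the commands shown above: -i <name>.pn, -o T,
    -f <threshold>, -n <name>, -S <scale>.
    """
    return [
        ["formatrpsdb", "-i", f"{name}.pn", "-o", "T",
         "-f", threshold, "-n", name, "-S", scale]
        for name in names
    ]


if __name__ == "__main__":
    for command in cdd_format_commands():
        print(" ".join(command))
```

Each argument list could be passed to subprocess.run with check=True from the directory holding the unpacked archive, assuming formatrpsdb is installed and on the PATH.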

5. Feedback

Please direct bug reports, inquiries for assistance, and requests for new features to

blast-help@ncbi.nlm.nih.gov

For inquiries on other NCBI resources not related to blast, please send them to:

info@ncbi.nlm.nih.gov

6. Appendix: CDD database related FTP files and legacy procedures

6.1 Program parameters of formatrpsdb

The current version of formatrpsdb and a complete list of its command line parameters can be obtained by running formatrpsdb with only a dash and pressing the Enter key, as in:

formatrpsdb - <enter>

The formatrpsdb parameters are listed below individually with explanations and usage examples.

Table 6.1.1
Parameter: -t
Function: Title for database file
Default: Optional
Input format: [String]
Example: To name the database "kinase pssm", use: -t "kinase pssm"
Note:
The string given to this option is read by programs such as fastacmd and rpsblast and printed in their output, so that the database content can be identified.

Table 6.1.2
Parameter: -i
Function: Input file containing list of ASN.1 Scoremat filenames
Default: -
Input format: [File In]
Example: To format the scoremats listed in kinase.sn, use: -i kinase.sn
Note:
This option must be specified using the complete file name with extension. Each scoremat file contains the score matrix (or residue frequencies) and identification data for a single sequence. The scoremat filenames should appear in the input file one per line. There are no restrictions on the number of files or on their names.

Table 6.1.3
Parameter: -l (lower case L)
Function: Logfile name
Default: formatrpsdb.log
Input format: [File Out]
Example: -
Note:
Status and error information will be recorded in this file. Change from default not recommended.

Table 6.1.4
Parameter: -o
Function: Create index files for formatted database
Default: F
Input format: [T/F]
Example: To turn this on, use: -o T
Note:
If "-o T" is used and the sequence identifiers within each scoremat allow it, formatrpsdb will generate index files for the generated database. These allow retrieval of individual sequences through fastacmd. For custom scoremats generated from protein databases with deflines not conforming to the NCBI format, this may not work.

Table 6.1.5
Parameter: -v
Function: Specifies database volume size (in units of 10^6 letters)
Default: 0
Input format: [Integer]
Example: To break up a large collection into volumes of N million letters each, use: -v N
Note:
Default volume size is 1000 million letters.

Table 6.1.6
Parameter: -b
Function: Input scoremat files are binary
Default: F
Input format: [T/F]
Example: If the scoremat files are binary, use: -b T
Note:
In blastpgp, scoremat files output mode is controlled by "-u". The scoremat ASN.1 format allows sequence data in human-readable text format or a more compact binary format. Setting this option to 'T' signals to formatrpsdb that all of the scoremat files listed in the file for '-i' option contain binary ASN.1 scoremat data. Default is to read in ASN.1 scoremat files in ascii format.

Table 6.1.7
Parameter: -f
Function: Threshold for extending initial word hits
Default: 11.0
Input format: [Real]
Example: To increase the threshold to 13, use: -f 13
Note:
This determines which word matches are extended. Fractional threshold values (for example, the 9.82 used in Section 3.4) are acceptable. While the database is being generated, formatrpsdb builds a BLAST lookup table, which indexes each input sequence for searches using rpsblast. The argument to '-f' specifies the threshold value; groups of letters in any input sequence which score above this value are added to the lookup table.

Table 6.1.8
Parameter: -n
Function: Name of the output database
Default: -
Input format: [String]
Example: To name the output database "kinase_pssm", use: -n "kinase_pssm"
Note:
This parameter is optional. By default, the database generated will consist of a collection of files whose prefix matches that of the -i input filename. Use this option to name the produced database to a more informative name.

Table 6.1.9
Parameter: -S
Function: The scaling factor to apply when creating PSSMs
Default: 100.0
Input format: [Real]
Example: To increase this to 200, use: -S 200
Note:
This is optional and is for scoremats that contain only residue frequencies. When given a scoremat file that does not contain a PSSM, formatrpsdb looks for a set of residue frequencies in the file, and attempts to create a PSSM using those residue frequencies. The creation process requires a scale factor for the computed scores, provided by this argument.

Table 6.1.10
Parameter: -G
Function: The gap opening penalty, if not present in the scoremat
Default: 11
Input format: [Integer]
Example: To increase this to 12, use: -G 12
Note:
This is primarily intended for scoremat files that contain only residue frequencies. If an input file does not contain gap opening and extension penalties, the values of these two arguments will be substituted.

Table 6.1.11
Parameter: -E
Function: The gap extension penalty, if not present in the scoremat
Default: 1
Input format: [Integer]
Example: To increase this to 2, use: -E 2
Note:
This is primarily intended for scoremat files that contain only residue frequencies. If an input file does not contain gap opening and extension penalties, the values of these two arguments will be substituted.

Table 6.1.12
Parameter: -U
Function: Underlying score matrix, if not present in the scoremat
Default: BLOSUM62
Input format: [String]
Example: To change it to BLOSUM45, use: -U BLOSUM45
Note:
If an input file does not contain the name of the NCBI standard score matrix from which residue frequencies were derived, the matrix name specified by the -U option will be substituted. This is primarily intended for scoremat files that contain only residue frequencies.

6.2 Additional CDD database related FTP files

The "acd.tar.gz" archive contains the CD data as used by the CD-server for visualization of CD-search results. They are stored as binary ASN.1.

The file "fasta.tar.gz" contains sequence alignments from the CDs in mFASTA format. Note that sequence fragments are identified with GIs and/or accessions, but the alignments do not contain full-length sequences: the fragments span the region between the first and last aligned residue only.

The file "cddid.tbl.gz" contains summary information about the CD models in the distribution. This is a tab-delimited text file, with a single row per CD model and the following columns:

  1. PSSM-Id (unique numerical identifier)
  2. CD accession (starting with 'cd', 'pfam', 'smart', 'COG', 'KOG', or 'LOAD')
  3. CD "short name"
  4. CD description
  5. PSSM-Length (number of columns, the size of the search model)
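The five-column layout described above maps directly onto a small record type. The following Python sketch is an illustration only; the names CddModel, parse_cddid_lines, and read_cddid are hypothetical, and UTF-8 text inside the gzip file is an assumption.

```python
import gzip
from typing import NamedTuple


class CddModel(NamedTuple):
    pssm_id: int       # column 1: unique numerical identifier
    accession: str     # column 2: CD accession
    short_name: str    # column 3: CD "short name"
    description: str   # column 4: CD description
    pssm_length: int   # column 5: number of columns in the search model


def parse_cddid_lines(lines):
    """Parse tab-delimited rows with the five columns listed above."""
    models = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, found {len(fields)}")
        models.append(CddModel(int(fields[0]), fields[1], fields[2],
                               fields[3], int(fields[4])))
    return models


def read_cddid(path):
    """Read a gzip-compressed table such as cddid.tbl.gz (UTF-8 assumed)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return parse_cddid_lines(handle)
```

For example, parse_cddid_lines applied to a single invented row with fields 12345, cd00001, Example, "An invented description", and 120 yields one CddModel whose pssm_length is 120.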

The file "cddannot.dat.gz" contains information about conserved family features as recorded for curated CD models. This is a tab-delimited text file, with a single row per "feature" and the following columns:

  1. PSSM-Id (unique numerical identifier)
  2. CD accession (starting with 'cd')
  3. CD "short name"
  4. Feature number
  5. Feature description/name
  6. Boolean flag (0/1), indicating presence of structure-based feature evidence
  7. Boolean flag (0/1), indicating presence of reference-based feature evidence
  8. Boolean flag (0/1), indicating presence of additional comments
  9. comma-separated feature addresses
The feature addresses are positions on the alignment's "master sequence", which is a consensus sequence, and on the alignment's PSSM (the database search model).
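The nine-column feature table can be read in the same way. This Python sketch is hedged accordingly: the names CddFeature, parse_flag, and parse_cddannot_lines are hypothetical, the feature number is assumed to be an integer, and the feature addresses are kept as uninterpreted strings because their exact syntax is not described here.

```python
from typing import List, NamedTuple


class CddFeature(NamedTuple):
    pssm_id: int
    accession: str
    short_name: str
    feature_number: int            # assumed to be an integer
    description: str
    has_structure_evidence: bool
    has_reference_evidence: bool
    has_comments: bool
    addresses: List[str]           # kept as strings; syntax not interpreted


def parse_flag(value):
    """Convert a 0/1 column into a bool, rejecting anything else."""
    if value not in ("0", "1"):
        raise ValueError(f"expected 0 or 1, found {value!r}")
    return value == "1"


def parse_cddannot_lines(lines):
    """Parse tab-delimited rows with the nine columns listed above."""
    features = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        f = line.split("\t")
        if len(f) != 9:
            raise ValueError(f"expected 9 columns, found {len(f)}")
        addresses = [a.strip() for a in f[8].split(",") if a.strip()]
        features.append(CddFeature(int(f[0]), f[1], f[2], int(f[3]), f[4],
                                   parse_flag(f[5]), parse_flag(f[6]),
                                   parse_flag(f[7]), addresses))
    return features
```

The compressed file can be opened with gzip.open(path, "rt") and passed to parse_cddannot_lines line by line.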

6.3 Legacy procedures for rpsblast database preparation (in three steps)

The following are legacy procedures for preparing rpsblast databases using three different tools. This multi-step procedure is no longer needed since the introduction of formatrpsdb (in 2.2.10). It is provided for reference purposes only.

6.3.1 Binaries used in rpsblast database preparation

The following binary files are used to set up rpsblast databases:

  1. makemat - primary profile preprocessor to convert a collection of binary profiles, created by the -C option of PSI-BLAST, into portable ASCII form;
  2. copymat - secondary profile preprocessor to convert ASCII matrices, produced by the primary preprocessor, into a database that can be read into memory quickly;
  3. formatdb - general BLAST database formatter.
The database of score matrices prepared by copymat and formatdb is searched by rpsblast, which produces BLAST-like output.

6.3.2 Conversion of profiles into searchable database

Note: if you are starting with *.mtx files obtained from the NCBI FTP site or another source, you should skip the steps listed in 6.3.2.1.

6.3.2.1 Primary preprocessing

Prepare the following files:

  1. a collection of PSI-BLAST-generated profiles with arbitrary names and suffix .chk;
  2. a collection of "profile master sequences", associated with the profiles, each in a separate file with arbitrary name and a 3 character suffix starting with c; the sequences can have deflines; they need not be sequences in nr or in any other sequence database; if the sequences have deflines, then the deflines must be unique.
  3. a list of profile file names, one per line, named <database_name>.pn;
  4. a list of master sequence file names, one per line, in the same order as a list of profile names, named <database_name>.sn;

The following files will be created:

  • a collection of ASCII files, corresponding to each of the original profiles, named <profile_name>.mtx;
  • a list of ASCII matrix files, named <database_name>.mn;
  • ASCII file with auxiliary information, named <database_name>.aux;

Table 6.3.2.1 Program Options for makemat
Option Name | Function and Note
-P | database name (required)
-G | Cost to open a gap (optional), default = 11
-E | Cost to extend a gap (optional), default = 1. Note: It is not enforced that the values of -G and -E passed to makemat were actually used in making the checkpoints. However, the values fed in to makemat are propagated to copymat and rpsblast.
-U | Underlying amino acid scoring matrix (optional), default = BLOSUM62
-d | Underlying sequence database used to create profiles (optional), default = nr. Note: It may make sense to use -z without -d when the profiles were created with an older, smaller version of an existing database.
-z | Effective size of sequence database given by -d, default = current size of -d option
-S | Scaling factor for matrix outputs to avoid round-off problems; default = PRO_DEFAULT_SCALING_UP, currently defined as 100. Use 1.0 to have no scaling. Output scores will be scaled back down to a unit scale to make them look more like BLAST scores, but we found working with a larger scale to help with round-off problems. ATTENTION: It is strongly recommended to use -S 1 - the scaling factor should be set to 1 for rpsblast at this point in time.
-H | Get help (overrides all other arguments)

6.3.2.2 Secondary preprocessing

Prepare the following files:

  1. a collection of ASCII files, corresponding to each of the original profiles, named <profile_name>.mtx (created by makemat);
  2. a collection of "profile master sequences", associated with the profiles, each in a separate file with arbitrary name and a 3 character suffix starting with c;
  3. a list of ASCII matrix files, named <database_name>.mn (created by makemat);
  4. a list of master sequence file names, one per line, in the same order as the list of matrix names, named <database_name>.sn;
  5. an ASCII file with auxiliary information, named <database_name>.aux (created by makemat);

The files input to copymat are in ASCII format and thus portable between machines with different encodings for machine-readable files. The following files will be created:

  1. a huge binary file, containing all profile matrices, named <database_name>.rps;
  2. a huge binary file, containing the lookup table for the BLAST search corresponding to the matrices, named <database_name>.loo;
  3. a file containing the concatenation of all FASTA "profile master sequences", named <database_name> (without extension).

Table 6.3.2.2 Program Options for copymat
Option Name | Function and Note
-P | database name (required)
-H | get help (overrides all other arguments)
-r | format data for rpsblast. "-r T" has to be set to format data for rpsblast at this step
NOTE: copymat requires a fair amount of memory, as it first constructs the lookup table in memory before writing it to disk. Users have found that they require a machine with at least 6.0 Meg of memory for this task.

6.3.2.3 Creating the BLAST database

To create a BLAST database from the <database_name> file containing all "profile master sequences", run formatdb to build a regular BLAST database of these sequences:

formatdb -i <database_name> -p T -o T

6.4 Documentation of the .mtx file format

Format of the .mtx file:

L = Length of SEQ
SEQ = Sequence
ka#-* = Karlin/Altschul parameters, block #. There are three blocks, each containing four floating point numbers on separate lines.
pX-Y = The position specific scores as integers.

The first element of this file format is [L]. This is the sequence length. The second line contains the sequence itself, in NCBI AA notation. After this, there are three KA blocks (four lines of floating point numbers each), then the positional scores.

The positional scores are arranged in a grid. Each line contains 26 elements, corresponding to the 26 elements in the NCBI AA encoding, and there are L lines where L is the previously mentioned sequence length. Using the symbols mentioned above, it looks something like this:

<L>
<SEQ>
<ka1-1>
<ka1-2>
<ka1-3>
<ka1-4>
<ka2-1>
<ka2-2>
<ka2-3>
<ka2-4>
<ka3-1>
<ka3-2>
<ka3-3>
<ka3-4>
<p1-1> <p1-2> <p1-3> ... <p1-26>
<p2-1> <p2-2> <p2-3> ... <p2-26>
...
<pL-1> <pL-2> <pL-3> ... <pL-26>

One can find the explanation for the three blocks of KA-parameters in makemat source code, lines 188-190:

putMatrixKbp(checkFile, compactSearch->kbp_gap_std[0], scaleScores, 1/scalingFactor);
putMatrixKbp(checkFile, compactSearch->kbp_gap_psi[0], scaleScores, 1/scalingFactor);
putMatrixKbp(checkFile, sbp->kbp_ideal, scaleScores, 1/scalingFactor);

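Thus, the first KA block holds the standard gapped parameters (kbp_gap_std), the second holds the position-specific gapped parameters used by rpsblast (kbp_gap_psi), and the third holds the ideal parameters (kbp_ideal).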
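The .mtx layout described in this section can be read with a few lines of code. The Python sketch below is illustrative only: the function name parse_mtx is hypothetical, and it assumes exactly the layout given above (length, sequence, twelve Karlin/Altschul values, then L rows of 26 integer scores) with no extra header or trailer lines.

```python
def parse_mtx(text):
    """Parse the .mtx layout described above into a dictionary.

    Assumed layout: line 1 is L, line 2 is the sequence, lines 3-14 are
    three KA blocks of four floats each, followed by L rows of 26 integers.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 14:
        raise ValueError("file is too short to hold the header and KA blocks")

    length = int(lines[0])
    sequence = lines[1].strip()

    ka_values = [float(value) for value in lines[2:14]]
    ka_blocks = [ka_values[i:i + 4] for i in range(0, 12, 4)]

    score_lines = lines[14:14 + length]
    if len(score_lines) != length:
        raise ValueError(f"expected {length} score rows, found {len(score_lines)}")

    scores = []
    for row_number, line in enumerate(score_lines, start=1):
        row = [int(value) for value in line.split()]
        if len(row) != 26:
            raise ValueError(f"row {row_number} has {len(row)} scores, expected 26")
        scores.append(row)

    return {"length": length, "sequence": sequence,
            "ka_blocks": ka_blocks, "scores": scores}
```

The returned "ka_blocks" list keeps the three blocks in file order, and "scores" holds one list of 26 integers per sequence position.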