Format Scoremats Into A CDD Database Using
formatrpsdb
Tao Tao, PhD User Services NCBI, NLM, NIH
1. Introduction
A standalone PSI-BLAST (blastpgp) search produces a separate file when the
-C parameter is specified. This file
contains the input sequence and its associated Position Specific
Scoring Matrix (PSSM), encoded as an
ASN.1 "PssmWithParameters" object ("scoremat" for short). The
scoremat is useful in functional analysis of
protein sequences. A collection of scoremat files can be converted to a
database suitable for searching with Reverse
Position Specific (RPS) BLAST (rpsblast) using the program formatrpsdb.
When given a list of these files, formatrpsdb
produces the corresponding database.
formatrpsdb performs in a single pass the work that used to be
performed stepwise by copymat, makemat,
and formatdb, without generating the large number of intermediate files
these utilities need in order to create
a final rpsblast database. Furthermore, scoremat objects are of more
general use than the binary format makemat
requires. It is our hope that direct manipulation of scoremat objects will
encourage the conversion of more diverse
sequence collections into rpsblast databases.
Databases generated by formatrpsdb are binary compatible with databases
generated by copymat/makemat/formatdb,
although the database files generally will not be byte-for-byte identical.
The database is also endian-specific due
to database indexing.
1.1. Other relevant documents
This document is a rewrite of a few existing documents on rpsblast
database tools with a focus on the practical
usage. Other documents on relevant BLAST tools are at:
1.2. Prerequisites for using
formatrpsdb
This section assumes some familiarity with the aforementioned documents.
An rpsblast database consists of two groups of files. The first group is a
standard protein database, and the second group
of files contains precomputations used to speed up rpsblast searches of
the standard protein database. Previously, formatdb
would build the first group of files, and makemat/copymat would be used to
build the second group (the 'RPS data files').
As mentioned before, formatrpsdb performs all of these steps in a single
pass. However, the collection of sequences passed
to formatrpsdb must already be consistent in several important ways:
- All sequences must use the same protein alphabet, currently a
28-letter alphabet.
- Scores in all PSSMs must be scaled by the same factor.
- If a scoremat does not contain a PSSM, it must contain a set
of residue frequencies that formatrpsdb can use to create
a PSSM manually.
- The PSSM creation process is identical to that performed
by makemat, and requires a scaling factor,
gap existence and extension penalties, and an underlying
score matrix.
- These must be provided as command line options to
formatrpsdb, or each scoremat can contain one or more of these
values, which will be used in place of the values
specified as input arguments.
- If a sequence contains both a PSSM and residue
frequencies, the latter will be ignored.
See the command line options below.
Regarding the last requirement, a collection of sequences passed to
formatrpsdb may include a mixture of sequences for which a
PSSM is available, and sequences for which only the residue frequencies
are available. The present version of formatrpsdb requires
that all parameters (scale factor, gap open/extend, underlying score
matrix), whether appearing within a scoremat or supplied from
the command line, must be the same for all sequences.
Prebuilt collections of sequences that satisfy these criteria are
available from NCBI, along with tools capable of building compliant
sequence files. Further, blastpgp is capable of reading and writing
scoremat files containing residue frequencies.
2. rpsblast Databases
rpsblast databases can be formatted from scoremats using formatrpsdb.
These scoremats can come from NCBI, from a third party,
or be generated de novo from custom blastpgp searches.
2.1 CDD Databases from NCBI
CDD databases accessible through the NCBI BLAST homepage
(www.ncbi.nlm.nih.gov/BLAST/) are also available for download in
preformatted
as well as unformatted scoremat form. For users with no need to
convert custom scoremats into an rpsblast database, the preformatted
version is recommended. Simply download the set you need, unpack the
archive, and search it with rpsblast. Note that
the rpsblast search databases generated by formatrpsdb are architecture
dependent. It is NOT possible to create them on one platform
and use them on another platform with a different byte order.
| Table 2.1. CDD Database from NCBI |
| Intended Platform | Archive | Content [1] |
| Intel/AMD Chip with Linux, Solaris, Windows | little_endian/Cdd_LE.tar.gz | Preformatted cdd, Pfam, Smart, Cog, and Kog database |
| Sun/Solaris, SGI/IRIX, PowerPC/MacOSX [2] | big_endian/Cdd_BE.tar.gz | Same as above |
| All platforms | cdd.tar.gz | Unformatted scoremats for the above databases |
Note:
[1] For more information, see CDD help.
[2] Macs with Intel Core Duo processors are little endian platforms.
More information on "endianness" is given in Table 1.1 of
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/unix_setup.html.
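As a rough illustration of the platform choice in Table 2.1, the following Python sketch reports the byte order of the local machine and maps it to the corresponding preformatted archive. The function name `preformatted_archive` is hypothetical, and the archive paths are copied from the table (the little endian name is assumed to follow the same pattern as the big endian one).

```python
import sys

# Archive names as listed in Table 2.1. The little endian spelling is an
# assumption made by analogy with the big endian entry.
ARCHIVES = {
    "little": "little_endian/Cdd_LE.tar.gz",
    "big": "big_endian/Cdd_BE.tar.gz",
}


def preformatted_archive(byteorder: str = sys.byteorder) -> str:
    """Return the preformatted archive that matches the given byte order."""
    try:
        return ARCHIVES[byteorder]
    except KeyError:
        raise ValueError(f"unknown byte order: {byteorder!r}") from None


if __name__ == "__main__":
    # sys.byteorder is "little" or "big" for the machine running Python.
    print(sys.byteorder, "->", preformatted_archive())
```

The unformatted cdd.tar.gz archive is independent of byte order, so this check only matters when downloading a preformatted set.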
To build search databases for rpsblast from the scoremat archives, we need
to unpack the archive and extract its contents.
It contains ASCII formatted files with the following extensions:
- *.smp Position Specific Scoring Matrices (PSSMs) stored in a new
ASN.1 format ("scoremat"), which is shared between various BLAST
applications.
- *.pn lists of PSSM file names
These files allow for the compilation of five rpsblast databases, i.e.,
Smart, Pfam, Cog, Kog, and Cdd. Note that Cdd contains
only a subset of domains from the Smart, Pfam, and COG databases, chosen by NCBI
curators to reduce redundancy. It represents only
domains with wide phylogenetic distribution and is the set that is indexed
in NCBI's Entrez. These files can be formatted
into rpsblast search databases. See Section 3.4 for
details.
2.2. CDD Databases from Custom
Scoremat
Users can now take any arbitrary subset of PSSMs and compile them into an
rpsblast search database. All formatrpsdb needs is a
list of file names (such as "Smart.pn" in the example above) and the
corresponding "scoremat" (*.smp) files. Newer versions of
blastpgp can also write out "checkpoints" in the "scoremat" format
using the combination of the "-C" and "-u" options. The -C output
of blastpgp can be combined with arbitrary subsets of the
scoremat-formatted PSSMs distributed here to create customized
rpsblast search sets. The scoremat-formatted PSSMs distributed here are
scaled by a factor of 100.0; if you combine them
with blastpgp-generated scoremats, the same scaling factor must be set
with the formatrpsdb command line parameter -S.
3. General usage of formatrpsdb
3.1. Format scoremat from blastpgp into
a basic RPS blast database
Given a set of three sequence files from blastpgp -C output, 'scoremat1',
'scoremat2', and 'scoremat3', along with a
text file 'list' containing the names of these three scoremat files
in the following format:
scoremat1
scoremat2
scoremat3
we can use the following command line to create an rpsblast database. The
database components are listed after the command line.
formatrpsdb -i list -p T
list.pin list.psq list.phr list.rps list.loo list.aux
The first three files are a standard non-indexed protein database, and the
last three are RPS data files. To search the
resulting database with rpsblast, specify it using "-d list".
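The 'list' file used above is plain text with one scoremat file name per line, so it can be generated rather than typed by hand. The Python sketch below is one possible way to do this. The helper names and the `scoremat*` glob pattern are assumptions chosen to match the example file names, not part of the BLAST distribution.

```python
from pathlib import Path


def list_file_text(names):
    """Return list-file content: one scoremat file name per line."""
    cleaned = [name.strip() for name in names if name.strip()]
    return "".join(f"{name}\n" for name in cleaned)


def write_list_file(directory, list_name="list", pattern="scoremat*"):
    """Write the names of files matching `pattern` in `directory` to `list_name`.

    The pattern is a hypothetical naming convention; adjust it to the
    names your blastpgp -C runs actually produced.
    """
    folder = Path(directory)
    names = sorted(p.name for p in folder.glob(pattern) if p.is_file())
    (folder / list_name).write_text(list_file_text(names))
    return names
```

Calling `write_list_file(".")` in the directory holding scoremat1, scoremat2, and scoremat3 would produce a 'list' file in the format shown above, ready to pass to the -i option.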
3.2 Format scoremats with additional
index files
To index the formatted rpsblast database, add "-o T" to the
command line. The files generated by formatrpsdb are
listed under the command line. The additional index files are
list.psd and list.psi.
formatrpsdb -i list -p T -o T
list.psd list.psi list.pin list.psq list.phr list.rps
list.loo list.aux
They allow retrieval of individual sequences by their seqid through
the fastacmd program found in the standalone BLAST distribution.
3.3 Change the default database name
and add informative title
The default name for a database formatted by formatrpsdb is the input file
name. Often that name is not descriptive of the database
content. We can rename the output database by passing the desired
database name to the '-n' option. For example, to name the
database produced in Section 3.1 'kinase' while adding a descriptive title, we can use the
following command line:
formatrpsdb -i list -p T -o T -n kinase -t "protein serine kinase profile"
The database files created will be named 'kinase.*' instead of
'list.*' as in Section 3.1.
The file name alone is often not informative in rpsblast search results. For
this reason, we use the -t parameter
to add a description of the database, which is displayed at the
beginning of an rpsblast result.
3.4 Format the NCBI provided cdd
files
The scoremats for cdd databases are available as an ftp archive at:
ftp.ncbi.nih.gov/pub/mmdb/cdd/cdd.tar.gz
This ftp file archives collections of position-specific scoring matrices
created for the conserved domain search through rpsblast.
The PSSMs are meant to be used for compiling rpsblast search databases
only and they originate from various alignment collections:
- Pfam: PSSMs from a mirror of the Pfam-A seed alignment database
- Smart: PSSMs from a mirror of the Smart domain alignment database
- COG: PSSMs from automatically aligned sequences or fragments from
COGs
- KOG: PSSMs from automatically aligned sequences or fragments from
KOGs
- cd: alignment models curated at NCBI as part of the CDD
project
The following commands will build the rpsblast searchable databases:
formatrpsdb -i Smart.pn -o T -f 9.82 -n Smart -S 100.0
formatrpsdb -i Pfam.pn -o T -f 9.82 -n Pfam -S 100.0
formatrpsdb -i Cog.pn -o T -f 9.82 -n Cog -S 100.0
formatrpsdb -i Kog.pn -o T -f 9.82 -n Kog -S 100.0
formatrpsdb -i Cdd.pn -o T -f 9.82 -n Cdd -S 100.0
Note that the parameter '-f' supplied with formatrpsdb, the three-letter
word score threshold for detecting and
extending hits in RPS-Blast, will determine the size of the search
database. A lower threshold will result in
larger databases and a slightly increased search sensitivity, at the cost
of additional memory requirements and
reduced search speed.
Matrices distributed for creating rpsblast search databases are scaled by
a factor (option -S) of 100. A score
threshold value of 9.82 will result in search-databases of a size very
similar to using unscaled matrices and a
threshold value of 11.
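Since the five commands above differ only in the database name, they can be generated from a small template. The following Python sketch builds the argument lists using only the options shown in this section. The function name `formatrpsdb_command` is hypothetical, and the sketch only prints the commands rather than running them.

```python
# Database names used in the commands above.
DATABASES = ["Smart", "Pfam", "Cog", "Kog", "Cdd"]


def formatrpsdb_command(name, threshold=9.82, scale=100.0):
    """Build the formatrpsdb argument list for one <name>.pn list file."""
    return [
        "formatrpsdb",
        "-i", f"{name}.pn",    # list of scoremat file names
        "-o", "T",             # create index files
        "-f", str(threshold),  # word score threshold
        "-n", name,            # output database name
        "-S", str(scale),      # scaling factor of the distributed matrices
    ]


if __name__ == "__main__":
    for db in DATABASES:
        print(" ".join(formatrpsdb_command(db)))
```

Each returned list could be handed to `subprocess.run` on a machine where formatrpsdb is installed and the unpacked *.pn and *.smp files are in the working directory.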
5. Feedback
Please direct bug reports, inquiries for assistance, and requests for new
features to
blast-help@ncbi.nlm.nih.gov
For inquiries on other NCBI resources not related to blast, please send
them to:
info@ncbi.nlm.nih.gov
6. Appendix: CDD database related FTP
files and legacy procedures
6.1 Program parameters of
formatrpsdb
The current version of formatrpsdb and a complete list of its command
line parameters can be obtained by executing
formatrpsdb followed by a lone dash and pressing the Enter key, as in:
formatrpsdb - < enter >
The formatrpsdb parameters are listed below individually, with explanations
and usage examples.
| Table 6.1.1 |
| Parameter | -t |
| Function | Title for database file |
| Default | Optional |
| Input format | [String] |
| Example | To give the database the title "kinase pssm", use: -t "kinase pssm" |
Note:
The string given to this option is read by programs such as fastacmd and
rpsblast and
printed in their output to describe the database content.
| Table 6.1.2 |
| Parameter | -i |
| Function | Input file containing list of ASN.1 Scoremat filenames |
| Default | - |
| Input format | [File In] |
| Example | To format the scoremats listed in kinase.sn, use: -i kinase.sn |
Note: This option must be specified using the complete file name, including its
extension. Each scoremat
file contains the score matrix (or residue frequencies) and identification
data for a single sequence. The
scoremat file names should appear in the input file one per line.
There are no restrictions on
the number or names of the files.
| Table 6.1.3 |
| Parameter | -l (lower case L) |
| Function | Logfile name |
| Default | formatrpsdb.log |
| Input format | [File Out] |
| Example | - |
Note: Status and error information will be recorded in this file.
Change from default not recommended.
| Table 6.1.4 |
| Parameter | -o |
| Function | Create index files for formatted database |
| Default | F |
| Input format | [T/F] |
| Example | To turn this on, use: -o T |
Note: If "-o T" is used and the sequence identifiers within each
scoremat allow it,
formatrpsdb will generate index files for the generated database. These
allow retrieval of individual sequences through
fastacmd. For custom scoremats generated from protein databases whose
deflines do not conform to the NCBI format, this may not work.
| Table 6.1.5 |
| Parameter | -v |
| Function | Specifies database volume size (unit is 10^6 letters) |
| Default | 0 |
| Input format | [Integer] |
| Example | To break up a large collection into volumes of N million letters each, use: -v N |
Note: The default volume size is 1000 million letters.
| Table 6.1.6 |
| Parameter | -b |
| Function | Input scoremat files are binary |
| Default | F |
| Input format | [T/F] |
| Example | If the scoremat files are binary, use: -b T |
Note: In blastpgp, the output mode for scoremat files is controlled by "-u".
The scoremat ASN.1 format
allows data in human-readable text format or in a more compact
binary format. Setting this option to 'T' tells formatrpsdb
that all of the scoremat files listed in the file given to the '-i' option contain
binary ASN.1 scoremat data. The default is to read ASN.1
scoremat files in ASCII format.
| Table 6.1.7 |
| Parameter | -f |
| Function | Threshold for extending initial word hits |
| Default | 11.0 |
| Input format | [Real] |
| Example | To increase the threshold to 13, use: -f 13 |
Note: This determines which word matches to extend. Fractional threshold
values, such as 9.82, are acceptable.
While the database is being generated, formatrpsdb builds a BLAST lookup
table, which indexes each input sequence
for searches using rpsblast. The argument to '-f' specifies the threshold
value; groups of letters in any input
sequence that score above this value are added to the lookup table.
| Table 6.1.8 |
| Parameter | -n |
| Function | Name of the output database |
| Default | - |
| Input format | [String] |
| Example | To name the output database "kinase_pssm", use: -n "kinase_pssm" |
Note: This parameter is optional. By default, the database generated
will consist of a collection
of files whose prefix matches that of the -i input filename. Use this
option to give the produced database
a more informative name.
| Table 6.1.9 |
| Parameter | -S |
| Function | The scaling factor to apply when creating PSSMs |
| Default | 100.0 |
| Input format | [Real] |
| Example | To increase this to 200, use: -S 200 |
Note: This is optional and is for scoremats that contain only residue
frequencies. When given
a scoremat file that does not contain a PSSM, formatrpsdb looks for a set
of residue frequencies in the file, and attempts to
create a PSSM using those residue frequencies. The creation process
requires a scale factor for the computed scores, provided
by this argument.
| Table 6.1.10 |
| Parameter | -G |
| Function | The gap opening penalty, if not present in the scoremat |
| Default | 11 |
| Input format | [Integer] |
| Example | To increase this to 12, use: -G 12 |
Note: This is primarily intended for scoremat files that contain only
residue frequencies. If an
input file does not contain gap opening and extension penalties, the
values of the -G and -E arguments will
be substituted.
| Table 6.1.11 |
| Parameter | -E |
| Function | The gap extension penalty, if not present in the scoremat |
| Default | 1 |
| Input format | [Integer] |
| Example | To increase this to 2, use: -E 2 |
Note: This is primarily intended for scoremat files that contain only
residue frequencies. If an input
file does not contain gap opening and extension penalties, the values of
the -G and -E arguments will be substituted.
| Table 6.1.12 |
| Parameter | -U |
| Function | Underlying score matrix, if not present in the scoremat |
| Default | BLOSUM62 |
| Input format | [String] |
| Example | To change it to BLOSUM45, use: -U BLOSUM45 |
Note: If an input file does not contain the name of the NCBI standard
score matrix from which
residue frequencies were derived, the matrix name specified by the -U
option will be substituted.
This is primarily intended for scoremat files that contain only residue
frequencies.
6.2 Additional CDD database related FTP
files
The "acd.tar.gz" archive contains the CD data as used by the CD-server for
visualization of CD-search
results. They are stored as binary ASN.1.
The file "fasta.tar.gz" contains sequence alignments from the CDs in
mFASTA format. Note that sequence
fragments are identified with GIs and/or accessions, but the alignments do
not contain full-length sequences:
the fragments span the region between the first and last aligned residue
only.
The file "cddid.tbl.gz" contains summary information about the CD models
in the distribution. This is a
tab-delimited text file, with a single row per CD model and the following
columns:
- PSSM-Id (unique numerical identifier)
- CD accession (starting with 'cd', 'pfam', 'smart', 'COG',
'KOG', or 'LOAD')
- CD "short name"
- CD description
- PSSM-Length (number of columns, the size of the search
model)
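The five-column layout of "cddid.tbl.gz" described above maps naturally onto a small record type. The following Python sketch reads such rows. The class and function names are hypothetical, and it assumes every row has exactly the five tab-separated columns listed, with integer values in the first and last.

```python
from typing import Iterable, Iterator, NamedTuple


class CdModel(NamedTuple):
    pssm_id: int       # unique numerical identifier
    accession: str     # e.g. starts with 'cd', 'pfam', 'smart', ...
    short_name: str
    description: str
    pssm_length: int   # number of columns in the search model


def parse_cddid(lines: Iterable[str]) -> Iterator[CdModel]:
    """Yield one CdModel per non-empty tab-delimited row."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
        pssm_id, accession, short_name, description, length = fields
        yield CdModel(int(pssm_id), accession, short_name, description, int(length))
```

With the standard library, the compressed file could be read directly, for example `with gzip.open("cddid.tbl.gz", "rt") as handle: models = list(parse_cddid(handle))`.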
The file "cddannot.dat.gz" contains information about conserved family
features as recorded for curated CD models.
This is a tab-delimited text file, with a single row per "feature" and the
following columns:
- PSSM-Id (unique numerical identifier)
- CD accession (starting with 'cd')
- CD "short name"
- Feature number
- Feature description/name
- Boolean flag (0/1), indicating presence of structure-based
feature evidence
- Boolean flag (0/1), indicating presence of reference-based
feature evidence
- Boolean flag (0/1), indicating presence of additional
comments
- comma-separated feature addresses
The feature addresses are positions on the alignment's "master sequence",
which is a consensus sequence, and on the
alignment's PSSM (the database search model).
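The nine columns of "cddannot.dat.gz" can be read in a similar way. The Python sketch below converts the 0/1 flags to booleans and splits the comma-separated feature addresses into a list of raw strings. The names are hypothetical, the feature number is assumed to be an integer, and the internal syntax of each address is not interpreted because it is not specified here.

```python
from typing import List, NamedTuple


class CdFeature(NamedTuple):
    pssm_id: int
    accession: str
    short_name: str
    feature_number: int
    description: str
    has_structure_evidence: bool
    has_reference_evidence: bool
    has_comments: bool
    addresses: List[str]  # raw address strings, left uninterpreted


def parse_cddannot_line(line: str) -> CdFeature:
    """Parse one tab-delimited feature row into a CdFeature."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    return CdFeature(
        pssm_id=int(fields[0]),
        accession=fields[1],
        short_name=fields[2],
        feature_number=int(fields[3]),
        description=fields[4],
        has_structure_evidence=fields[5] == "1",
        has_reference_evidence=fields[6] == "1",
        has_comments=fields[7] == "1",
        addresses=[a.strip() for a in fields[8].split(",") if a.strip()],
    )
```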
6.3 Legacy procedures for rpsblast
database preparation (in three steps)
The following are legacy procedures for preparing rpsblast databases
using three different tools. This multi-step procedure is
no longer needed after the introduction of formatrpsdb (since 2.2.10). It is
provided for reference purposes only.
6.3.1 Binaries used in rpsblast
database preparation
The following binaries are used to set up rpsblast databases:
- makemat - primary profile preprocessor that converts a collection of
binary profiles, created by the -C option of PSI-BLAST, into portable
ASCII form;
- copymat - secondary profile preprocessor that converts the ASCII matrices
produced by
the primary preprocessor into a database that can be read into memory
quickly;
- formatdb - general BLAST database formatter.
The database of score matrices prepared by copymat and formatdb is
searched by rpsblast, which produces BLAST-like output.
6.3.2 Conversion of profiles into
searchable database
Note: if you are starting with *.mtx files obtained from the NCBI FTP site
or another source, skip the steps listed
in Section 6.3.2.1.
6.3.2.1 Primary
preprocessing
Prepare the following files:
- a collection of PSI-BLAST-generated profiles with arbitrary
names and suffix .chk;
- a collection of "profile master sequences", associated with
the profiles, each in a separate file with an arbitrary name and a
3-character suffix starting with c; the sequences can have deflines; they
need not be sequences in nr or in any other sequence database; if the
sequences have deflines, then the deflines must be unique;
- a list of profile file names, one per line, named
<database_name>.pn;
- a list of master sequence file names, one per line, in the
same order as the list of profile names, named
<database_name>.sn.
The following files will be created:
- a collection of ASCII files, corresponding to each of the original
profiles, named <profile_name>.mtx;
- a list of ASCII matrix files, named <database_name>.mn;
- ASCII file with auxiliary information, named
<database_name>.aux;
| Table 6.3.2.1 Program Options for makemat |
| Option Name | Function and Note |
| -P | database name (required) |
| -G | Cost to open a gap (optional), default = 11 |
| -E | Cost to extend a gap (optional), default = 1. Note: It is not enforced that the values of -G and -E passed to makemat were actually used in making the checkpoints. However, the values fed in to makemat are propagated to copymat and rpsblast. |
| -U | Underlying amino acid scoring matrix (optional), default = BLOSUM62 |
| -d | Underlying sequence database used to create profiles (optional), default = nr. Note: It may make sense to use -z without -d when the profiles were created with an older, smaller version of an existing database. |
| -z | Effective size of sequence database given by -d, default = current size of -d option |
| -S | Scaling factor for matrix outputs to avoid round-off problems; default = PRO_DEFAULT_SCALING_UP, currently defined as 100. Use 1.0 to have no scaling. Output scores will be scaled back down to a unit scale to make them look more like BLAST scores, but we found working with a larger scale to help with round-off problems. ATTENTION: It is strongly recommended to use -S 1 - the scaling factor should be set to 1 for rpsblast at this point in time. |
| -H | Get help (overrides all other arguments) |
6.3.2.2 Secondary
preprocessing
Prepare the following files:
- a collection of ASCII files, corresponding to each of the
original profiles, named <profile_name>.mtx (created by makemat);
- a collection of "profile master sequences", associated with
the profiles, each in a separate file with an arbitrary name and a
3-character suffix starting with c;
- a list of ASCII matrix files, named <database_name>.mn
(created by makemat);
- a list of master sequence file names, one per line, in the
same order as the list of matrix names, named <database_name>.sn;
- an ASCII file with auxiliary information, named
<database_name>.aux (created by makemat).
The files input to copymat are in ASCII format and thus portable
between machines with different encodings for machine-readable files.
The following files will be created:
- a huge binary file containing all profile matrices, named
<database_name>.rps;
- a huge binary file containing the lookup table for the BLAST
search corresponding to the matrices, named <database_name>.loo;
- a file containing the concatenation of all FASTA "profile master
sequences", named <database_name> (without extension).
| Table 6.3.2.2 Program Options for copymat |
| Option Name | Function and Note |
| -P | database name (required) |
| -H | get help (overrides all other arguments) |
| -r | format data for rpsblast. "-r T" has to be set to format data for rpsblast at this step |
| NOTE: copymat requires a fair amount of memory, as it first constructs the lookup table in memory before writing it to disk. Users have found that they require a machine with at least 6.0 Meg of memory for this task. |
6.3.2.3 Creation of BLAST
database
To create a BLAST database from the <database_name> file containing all
"profile master sequences", run formatdb to build a regular
BLAST database of all "profile master sequences":
formatdb -i <database_name> -p T -o T
6.4 Documentation of the .mtx file
format
Format of the .mtx file:
| L | = Length of SEQ |
| SEQ | = Sequence |
| ka#-* | = Karlin/Altschul parameters, block #. There are three blocks, each containing four floating point numbers on separate lines. |
| pX-Y | = The position specific scores as integers. |
The first element of this file format is [L]. This is the sequence length.
The second line contains the sequence itself, in NCBI AA notation.
After this, there are three KA blocks (four lines of floating point
numbers each), then the positional scores.
The positional scores are arranged in a grid. Each line contains 26
elements, corresponding to the 26 elements in the NCBI AA encoding, and
there are L lines where L is the previously mentioned sequence length.
Using the symbols mentioned above, it looks something like this:
<L>
<SEQ>
<ka1-1>
<ka1-2>
<ka1-3>
<ka1-4>
<ka2-1>
<ka2-2>
<ka2-3>
<ka2-4>
<ka3-1>
<ka3-2>
<ka3-3>
<ka3-4>
<p1-1> <p1-2> <p1-3> ... <p1-26>
<p2-1> <p2-2> <p2-3> ... <p2-26>
...
<pL-1> <pL-2> <pL-3> ... <pL-26>
One can find the explanation for the three blocks of KA-parameters in
makemat source code, lines 188-190:
putMatrixKbp(checkFile, compactSearch->kbp_gap_std[0], scaleScores,
1/scalingFactor);
putMatrixKbp(checkFile, compactSearch->kbp_gap_psi[0], scaleScores,
1/scalingFactor);
putMatrixKbp(checkFile, sbp->kbp_ideal, scaleScores, 1/scalingFactor);
Thus, the first KA block is the standard score, the second is for
rpsblast, and the third is the ideal score.
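To make the layout concrete, here is a minimal Python sketch of a reader for the .mtx format as described in this section: a length line, a sequence line, three Karlin/Altschul blocks of four floating point values, and L rows of integer scores. The names `MtxProfile` and `parse_mtx` are hypothetical, and the sketch assumes that the scores on each row are separated by whitespace and that each row has 26 entries, as stated above.

```python
from typing import List, NamedTuple


class MtxProfile(NamedTuple):
    length: int
    sequence: str
    ka_blocks: List[List[float]]  # three blocks of four values each
    scores: List[List[int]]       # `length` rows of position specific scores


def parse_mtx(text: str, alphabet_size: int = 26) -> MtxProfile:
    """Parse the text of one .mtx file following the layout described above."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("missing length or sequence line")
    length = int(lines[0])
    sequence = lines[1]
    if len(sequence) != length:
        raise ValueError("sequence length does not match the declared length")

    # Three KA blocks, four values each, one value per line.
    ka_values = [float(v) for v in lines[2:14]]
    if len(ka_values) != 12:
        raise ValueError("expected 12 Karlin/Altschul values")
    ka_blocks = [ka_values[i:i + 4] for i in range(0, 12, 4)]

    # One row of scores per sequence position.
    score_lines = lines[14:14 + length]
    if len(score_lines) != length:
        raise ValueError("fewer score rows than sequence positions")
    scores = []
    for row_text in score_lines:
        row = [int(v) for v in row_text.split()]
        if len(row) != alphabet_size:
            raise ValueError(f"expected {alphabet_size} scores per row, got {len(row)}")
        scores.append(row)
    return MtxProfile(length, sequence, ka_blocks, scores)
```

Following the makemat source excerpt above, `ka_blocks[0]` would hold the standard values, `ka_blocks[1]` the values used by rpsblast, and `ka_blocks[2]` the ideal values.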