NCBI C Toolkit Cross Reference

C/doc/blast/blastclust.html


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator"
    content="HTML Tidy for Linux/x86 (vers 1st October 2002), see www.w3.org" />

    <title></title>
  </head>

  <body>
<pre>
BLASTCLUST - BLAST score-based single-linkage clustering.

1. Clustering procedure.

BLASTCLUST automatically and systematically clusters protein or DNA sequences
based on pairwise matches found using the BLAST algorithm for proteins or the
Mega BLAST algorithm for DNA. In the latter case a single Mega BLAST search is
performed for all the sequences combined against a database created from the
same sequences. BLASTCLUST finds pairs of sequences that have statistically
significant matches and clusters them using single-linkage clustering.

BLASTCLUST uses the default values for the BLAST and Mega BLAST parameters.
For protein sequences these are: matrix BLOSUM62; gap opening cost 11; gap
extension cost 1; no low-complexity filtering.
For DNA sequences: match reward 1, mismatch penalty -3, non-affine gapping costs
(see README.mbl document for explanation), wordsize 28.
In both cases the e-value threshold is set to 1e-6.
For each pair of sequences the top-scoring alignment is evaluated according to
the following criteria:

       x1                   x2      HSP length on seqX: Hx = x2-x1+1
        |                    |      gaps in seqX: Gx
seqX ---======================----- seqX length: Lx
         \\|||||||||||||||||//      BLAST score: S
seqY ----====================------ number of identical residues: N
         |                  |       seqY length: Ly
        y1                 y2       gaps in seqY: Gy
                                    HSP length on seqY: Hy = y2-y1+1

coverage of seqX: Cx = Hx/Lx
coverage of seqY: Cy = Hy/Ly
coverage:         min(Cx,Cy) if -b is T (both sequences), max(Cx,Cy) if -b is F
alignment length: Al = Hx+Gx = Hy+Gy
score density:    S/min(Hx,Hy) if -S is below 3, or N/Al*100% if -S is 3 or more

If the coverage is above the threshold set by -L
 AND
the score density is above the threshold set by -S,

the two sequences are considered to be neighbors.
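
The neighbor test can be sketched in Python as follows. This is only an
illustration of the formulas in this section, not the BLASTCLUST source: the
Hsp class and is_neighbor function are hypothetical names, and the thresholds
are treated as inclusive, which is an assumption (the text only says "above").

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Top-scoring alignment between seqX and seqY (names follow the text)."""
    hx: int       # HSP length on seqX (Hx)
    hy: int       # HSP length on seqY (Hy)
    lx: int       # seqX length (Lx)
    ly: int       # seqY length (Ly)
    score: float  # BLAST score (S)
    n_ident: int  # number of identical residues (N)
    gx: int = 0   # gaps in seqX (Gx)


def is_neighbor(h, s_thr=1.75, l_thr=0.9, both=True):
    """Apply the coverage and similarity test to one alignment."""
    cx, cy = h.hx / h.lx, h.hy / h.ly
    # -b T requires coverage on both sequences, -b F on either one.
    coverage = min(cx, cy) if both else max(cx, cy)
    if s_thr >= 3:
        # Percent identity: N/Al*100%, with Al = Hx+Gx.
        similarity = 100.0 * h.n_ident / (h.hx + h.gx)
    else:
        # Score density: S/min(Hx,Hy).
        similarity = h.score / min(h.hx, h.hy)
    # Assumption: both thresholds are inclusive.
    return coverage >= l_thr and similarity >= s_thr
```

With the default thresholds, an alignment covering 95 of 100 residues on
each side with score 200 passes, since 200/95 is well above 1.75.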

The neighbor relationship determined this way is treated as symmetric and
provides the basis for clustering by a single-linkage method (which adds a
sequence to a cluster if the sequence is a neighbor of at least one sequence
in the cluster).
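
Single-linkage clustering over a symmetric neighbor relation amounts to
finding connected components. The sketch below uses a union-find structure
for that; this is a standard technique chosen for illustration, and nothing
here implies that BLASTCLUST is implemented this way. Sequences are
represented by hypothetical ordinal numbers 0..n-1.

```python
def single_linkage(n, neighbor_pairs):
    """Group items 0..n-1 so that every neighbor pair shares a cluster."""
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in neighbor_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Note that sequences 0 and 2 end up in one cluster when only the pairs (0,1)
and (1,2) are neighbors; that chaining is what single linkage means.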

2. Input formats.

The primary input format for BLASTCLUST is a FASTA-format sequence file.
Each sequence should have a unique identifier (as defined by formatdb).
BLASTCLUST formats this sequence set into a BLAST-searchable database
(in the directory pointed to by the environment variable TMPDIR or in
the current directory), and removes the database afterwards.

Instead of a FASTA file, a database prepared by formatdb with the -o option
set to TRUE can be supplied as input.

Another type of input is a sequence hit-list previously saved by
BLASTCLUST (in this case BLASTCLUST will use the pre-computed HSP data
instead of making de novo comparisons).

You can restrict clustering to a subset of your data by supplying an ID
list file (IDs separated by spaces, tabs, newlines, commas or semicolons).
This is intended for re-clustering subsets of sequences using
a previously computed hit-list file.
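
As an illustration of the ID list format, the following sketch splits a
list on the separators named above. The function name is hypothetical, and
treating a carriage return as a separator is an added assumption for files
with Windows line endings.

```python
import re


def parse_id_list(text):
    """Split ID-list text on spaces, tabs, newlines, commas and semicolons."""
    # "\r" is an assumption, not one of the separators listed in the text.
    return [tok for tok in re.split(r"[ \t\r\n,;]+", text) if tok]
```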

3. Output format.

BLASTCLUST prints clusters of sequence IDs, one cluster per line, sorted
from largest to smallest cluster (alphabetically by the ID of the first
sequence if two clusters are the same size). Sequence identifiers within a
cluster are space-separated and sorted from longest to shortest sequence
(alphabetically by ID if two sequences are the same length).
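
The ordering rules can be expressed as two sort keys. In this sketch the
function and argument names are hypothetical, and "first sequence" is
assumed to mean the first ID of a cluster after its members are sorted.

```python
def format_clusters(clusters, lengths):
    """clusters: list of lists of IDs; lengths: dict mapping ID to length."""
    # Within a cluster: longest sequence first, ties broken by ID.
    ordered = [sorted(c, key=lambda sid: (-lengths[sid], sid))
               for c in clusters]
    # Between clusters: largest first, ties broken by the first ID
    # (assumption: the first ID after the within-cluster sort).
    ordered.sort(key=lambda c: (-len(c), c[0]))
    return "\n".join(" ".join(c) for c in ordered)
```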

4. Crash recovery.

If the program crashed because of a system error, you can restart it
in crash recovery mode. This works only if you were saving the
hit-list during clustering. Start the job with the same command
line as before, saving the hit-list to the same file (-s), and
also set the "continue unfinished clustering" option (-C) to TRUE. The
process will restart from the last saved point and will append to the
hit-list file.

5. Environment.

BLASTCLUST is intended to run in a normal NCBI environment; in
particular, the BLOSUM62 matrix should be available via .ncbirc or the
BLASTMAT environment variable.

6. Program options.

Input:

 -i &lt;file&gt; sequence file in the FASTA format (default = stdin)
 -d &lt;file&gt; sequence database name
 -r &lt;file&gt; name of a hit-list file saved by BLASTCLUST

 These three options are mutually exclusive.

 -l &lt;file&gt; a file with a list of IDs to restrict the clustering,
    applicable only when reclustering from a saved hit-list.

Thresholds:

 -S &lt;threshold&gt; similarity threshold
    if &lt;3 then the threshold is set as a BLAST score density
    (0.0 to 3.0; default = 1.75)
    if &gt;=3 then the threshold is set as a percent of identical
    residues (3 to 100)
 -L &lt;threshold&gt; minimum length coverage (0.0 to 1.0; default = 0.9)
 -b &lt;T|F&gt; require coverage as specified by -L and -S on both (T) or
    only one (F) sequence of a pair (default = TRUE)

Output:

 -o &lt;file&gt; file to save cluster list (default = stdout)
 -s &lt;file&gt; file to save hit-list (this file may not be portable across
    platforms)
 -p &lt;T|F&gt; protein (T) or nucleotide (F) sequences in the input
    (default = TRUE)

Misc:

 -C &lt;T|F&gt; continue unfinished clustering (crash recovery mode)
    (default = FALSE).
 -a &lt;number&gt; Number of CPUs to use in multi-threaded mode
    (default = 1).
 -v &lt;logfile&gt; Progress report destination (printed every 1000 sequences).
    Set to F to suppress report messages (default = stderr).
 -e &lt;T|F&gt; Enable sequence ID parsing in database formatting. Set to F if
    multiple sequences have identical IDs (default = TRUE).
 -W &lt;number&gt; Word size to use for initial matches (default = 0, translates
    to 3 for proteins and 32 for nucleotides).
 -c &lt;config file&gt; Configuration file with advanced options, containing any
    of the following options with their values, separated by whitespace:
    -r, -q, -G, -E - match, mismatch, gap open and gap extension scores
        respectively,
    -e - e-value cutoff,
    -y, -X - dropoff values for the ungapped and gapped extension respectively,
    -A - window size for the two-hit version,
    -I - hitlist size,
    -Y, -z - effective search space and database length respectively, to be
        used for e-value and bit score calculations,
    -F - filter string,
    -s - raw score cutoff for nucleotide search,
    -S - strand option.
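
To show how the options combine, including crash recovery, here is a small
sketch that only builds an argument list and does not run anything. The
helper name is hypothetical, and the executable name "blastclust" is an
assumption about how the program is installed; the flags themselves are the
ones listed above.

```python
def blastclust_argv(fasta, out, hitlist=None, s_thr=1.75, l_thr=0.9,
                    both=True, protein=True, resume=False):
    """Build a command line from the options documented in section 6."""
    argv = ["blastclust",               # assumed executable name
            "-i", fasta, "-o", out,
            "-S", str(s_thr), "-L", str(l_thr),
            "-b", "T" if both else "F",
            "-p", "T" if protein else "F"]
    if hitlist is not None:
        argv += ["-s", hitlist]         # save the hit-list while clustering
    if resume:
        # Crash recovery only works when a hit-list was being saved.
        if hitlist is None:
            raise ValueError("crash recovery needs the saved hit-list file")
        argv += ["-C", "T"]
    return argv
```

Restarting after a crash then means calling the helper with the same
arguments as the original run plus resume=True.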

7. Credits:

Ilya Dondoshansky (dondosha@ncbi.nlm.nih.gov)
Yuri Wolf (wolf@ncbi.nlm.nih.gov)

05 August, 2000

8. Questions, requests and/or bug reports:

blast-help@ncbi.nlm.nih.gov

APPENDIX A.
Format of the hit-list file.

The hit-list file consists of the following parts:

 - header
 - sequence ID list
 - sequence length list
 - hit list

The byte-by-byte layout is platform-dependent; the field sizes given here
hold on most UNIX platforms.

A.1. Header.

        4-byte integer  IDtype  1 if numeric IDs; 0 if string IDs
        4-byte integer  ListSz  size of the ID list; if IDs are numeric this
                                is the number of SeqID records, otherwise this
                                is the length of the ID list (in bytes)

A.2. Sequence ID list.

If IDtype is 1 (numeric IDs) then the list is ListSz records of

        4-byte integer  SeqID   sequence ID (numeric)

If IDtype is 0 (string IDs) then the list is a list of records of

        var-length char SeqID   sequence ID (string)
        space (' ')             separator

(total length is ListSz bytes; the number of sequences is equal to the number
of spaces).

A.3. Sequence length list.

This is a list of records, one per sequence, of

        4-byte integer  SeqLen  sequence length

A.4. Hit list.

The list consists of the following records, repeated to the end of the file:

        4-byte integer  N1      ordinal number of the 1st sequence
        4-byte integer  N2      ordinal number of the 2nd sequence
        4-byte integer  HSPL1   HSP length on the 1st sequence
        4-byte integer  HSPL2   HSP length on the 2nd sequence
        8-byte float    Score   BLAST score
        8-byte float    PercId  Percent of identical residues
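
The layout above can be read with Python's struct module, as in the sketch
below. It rests on several assumptions that the appendix does not guarantee:
native byte order, no padding between fields, ASCII string IDs, and one
length record per sequence. The function name is hypothetical. Because the
layout is platform-dependent, a file written on a different machine may not
parse correctly.

```python
import struct


def read_hitlist(data):
    """Parse hit-list bytes into (ids, lengths, hits) following Appendix A."""
    # "=" means native byte order with fixed sizes and no padding (assumed).
    id_type, list_sz = struct.unpack_from("=ii", data, 0)
    pos = 8
    if id_type == 1:
        # Numeric IDs: ListSz 4-byte records.
        ids = list(struct.unpack_from("=%di" % list_sz, data, pos))
        pos += 4 * list_sz
    else:
        # String IDs: ListSz bytes, each ID followed by one space.
        ids = data[pos:pos + list_sz].decode("ascii").split(" ")[:-1]
        pos += list_sz
    n = len(ids)
    lengths = list(struct.unpack_from("=%di" % n, data, pos))
    pos += 4 * n
    # Remaining bytes: 32-byte hit records (N1, N2, HSPL1, HSPL2, Score,
    # PercId) up to the end of the file.
    hits = list(struct.iter_unpack("=iiiidd", data[pos:]))
    return ids, lengths, hits
```

Typical use would be read_hitlist(open(path, "rb").read()), where path is
the file given to -s. A truncated final record raises struct.error, which
can happen if the file was cut off by a crash.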
</pre>
  </body>
</html>
