1 =+= README =+============Last update: April 4, 2000 ============
2
3 the latest version of this document can be found at:
4
5 ftp://ftp.ncbi.nih.gov/fa2htgs/README
6
7 -----
8
9 After having consulted with NCBI staff (see contact information below)
10 submitters from Genome Sequencing centers will establish what the best
11 protocol will be for them to deposit their sequence submission data to
12 NCBI.
13
14 One of these protocols may require the fa2htgs tool, present in this
15 directory. fa2htgs is a program used to generate Seq-submits (an ASN.1
16 sequence submission file) for high throughput genome sequencing
17 projects. Presently we have built fa2htgs for the following platforms:
18
19 alphaOSF1.tar.Z
20 ibmaix.tar.Z
21 linux.tar.Z
22 sgi.tar.Z
23 solaris.tar.Z
24 sun.tar.Z
25 win32/fa2htgs.exe (win95/NT)
26
27 If fa2htgs is required for a platform not present here,
28 please let us know (address below) and we will be happy to
29 try to provide it.
30
31 fa2htgs will read a FASTA file (or an Ace Contig file with Phrap sequence
32 quality values), a Sequin submission template file, (to get contact
33 and citation information for the submission), and a series of command line
34 arguments (see below). The program then combines this
35 information to make a submission suitable for GenBank. Once you have
36 generated your submission file, you need to follow the submission
37 protocol (see the README present on your FTP account or mailed out to
38 your Center).
39
40 fa2htgs is intended for the automation by scripts for bulk submission of
41 unannotated genome sequence. It can easily be extended from its current
42 simple form to allow more complicated processing. A submission
43 prepared with fa2htgs can also be read into Sequin, and then annotated
44 more extensively. See the Sequin home page at:
45
46 http://www.ncbi.nlm.nih.gov/Sequin/
47
48 Contacting NCBI about HTGS submissions and about using fa2htgs:
49
50 Questions and concerns about this processing protocol, or how to
51 use this tool should be forwarded to:
52
53 htgs@ncbi.nlm.nih.gov.
54
55
56 =========+=========
57
58 using fa2htgs:
59
60 typing "fa2htgs -" will cause the program to show its command line
61 arguments. Below we show these with additional comments (what we show
62 within { } does not appear on the command line)
63
64 fa2htgs 2.0 arguments:
65
66 -i Filename for fasta input [File In]
67 default = stdin
68 -t Filename for Seq-submit template [File In]
69 default = template.sub
70 -o Filename for asn.1 output [File Out] Optional
71 default = stdout
72 -e Log errors to file named: [File Out] Optional
73 -n Organism name? [String] Optional
74 default = Homo sapiens
75 -s Sequence name? [String]
76
77 { The sequence must have a name that is unique within }
78 { the genome center. We use the combination of the genome }
79 { center name (-g argument) and the sequence name (-s) to }
80 { track this sequence and to talk to you about it. }
81 { The name can have any form you like but must be unique }
82 { within your center. }
83
84 -l length of sequence in bp? [Integer]
85
86 { The length is checked against the actual number of }
87 { bases we get. For phase 1 and 2 sequence it is also }
88 { used to estimate gap lengths. For phase 1 and 2 }
89 { records, it is important to use a number GREATER than }
90 { the amount of provided nucleotide, otherwise this will }
91 { generate false 'gaps'. Here it is assumed that the }
92 { putative full length of the BAC or cosmid will be used. }
93 { There should be at least 20 to 30 'n' in between the }
94 { segments (you can check for these in Sequin), as this }
95 { will ensure proper behavior when this sequence }
96 { is used with BLAST. Otherwise 'artifactual' unrelated }
97 { segment neighbors may be brought into proximity of }
98 { each other. }
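
The length rule above can be checked before fa2htgs is run. Below is a minimal
sketch in Python, assuming the segmented FASTA layout shown later in this
README (every header or gap marker line starts with ">"). The names
count_bases and check_declared_length are hypothetical helpers, not part of
fa2htgs, and the exact-match rule for phase 3 is our assumption.

```python
def count_bases(fasta_text):
    """Count sequence characters, skipping header and gap-marker lines ('>...')."""
    total = 0
    for line in fasta_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith(">"):
            total += len(stripped)
    return total


def check_declared_length(fasta_text, declared_length, phase):
    """Return True if the value intended for -l looks plausible.

    Phase 1 and 2: the README asks for a number GREATER than the bases provided.
    Phase 3: assumed here (not stated in the README) to match exactly.
    """
    provided = count_bases(fasta_text)
    if phase in (1, 2):
        return declared_length > provided
    return declared_length == provided
```

For a phase 1 or 2 record, a False result means the -l value is no larger than
the sequence supplied and should be revisited before submission.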
99
100 -g Genome Center tag? [String]
101
102 { This is probably the same as your login name on the }
103 { NCBI FTP server }
104
105 -p HTGS phase? [Integer]
106 default = 1
107 range from 1 to 3
108
109 { Phase 1 - a collection of unordered contigs with }
110 { gaps of unknown length. A phase 1 record must }
111 { at the very least have two segments with }
112 { one gap. }
113 { Phase 2 - a series of ordered contigs, gap lengths may }
114 { be known. This could be a single sequence, }
115 { without gaps, if the sequence has ambiguities }
116 { which will be resolved. }
117 { Phase 3 - a single contiguous sequence. This sequence }
118 { is finished, although it may, or may not }
119 { be annotated. }
120
121 -a GenBank accession (if an update) [String] Optional
122
123 { this argument is required if this is an update, do }
124 { not use it if you are preparing a new submission }
125
126 -r Remark for update? [String] Optional
127
128 { if this is an update, you can add a brief comment }
129 { (within "") describing the nature of the update }
130 { ("new sequence", "new citation", "updated features") }
131
132 -c Clone name? [String] Optional
133
134 { will appear as /clone in the source feature }
135 { This could be the same as the -s argument (sequence }
136 { name) but this one will appear in the /clone qualifier }
137
138 -h Chromosome? [String] Optional
139
140 { will appear as /chromosome in the source feature }
141
142 -d Title for sequence? [String] Optional
143
144 { the text that will appear in the DEFINITION line }
145 { of the GenBank flatfile. }
146
147 -m Take comment from template ? [T/F] Optional
148 default = F
149 -u Take biosource from template ? [T/F] Optional
150 default = F
151 -x Secondary accession number, separate by commas if multiple, s.t. U10000,L11000 [String] Optional
152
153 { ACCESSION AC000000 L00000 }
154 { ^ ^ }
155 { | secondary accession number }
156 { primary accession number }
157 { }
158 { In some cases a large segment will supersede another }
159 { or group of other accession numbers (records). These }
160 { records which are no longer wanted in GenBank should be }
161 { made secondary. Using the -x argument you can list the }
162 { Accession Numbers you want to make secondary. This will}
163 { instruct us to remove the accession number(s) from }
164 { GenBank, and they will no longer be part of the GenBank }
165 { release. They will nonetheless be available from Entrez.}
166 { }
167 { !!!GREAT CARE should be taken when using this argument!!!}
168 { Improper use of accession numbers here will result in }
169 { the inappropriate withdrawal of GenBank records from }
170 { GenBank, EMBL and DDBJ. We provide this parameter as }
171 { a convenience to submitting centers, but it may need }
172 { to be removed if it is not used carefully. }
173
174 -C Clone library name? [String] Optional
175
176 { will appear as /clone-lib="string" on the source feature }
177
178 -M Map? [String] Optional
179
180 { will appear as /map="string" on the source feature }
181
182 -O Filename for the comment: [File In] Optional
183
184 { will read the comment from a given file. }
185 { maximum 100 characters per line. }
186 { new lines can be incorporated with "~", and if you }
187 { actually want to include the "~" in your text, you }
188 { need to escape it with "`". Please ensure that the }
189 { correct format is obtained by viewing your comment }
190 { in Sequin. }
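
The "~" and "`" conventions above can be applied by a script when the comment
text is generated automatically. A minimal sketch in Python follows. We assume
that the escape character goes in front of the tilde ("`~"), and we do not know
how fa2htgs treats physical line breaks in the file, so the sketch only reports
lines over the 100 character limit. All function names are hypothetical.

```python
MAX_LINE = 100  # per-line limit stated in this README


def escape_tildes(text):
    """Prefix each literal '~' with '`' (assumed order: backquote first)."""
    return text.replace("~", "`~")


def build_comment(logical_lines):
    """Join logical comment lines with '~', escaping literal tildes in each."""
    return "~".join(escape_tildes(line) for line in logical_lines)


def overlong_lines(file_text, limit=MAX_LINE):
    """Return 1-based numbers of physical lines longer than the limit."""
    return [
        number
        for number, line in enumerate(file_text.splitlines(), start=1)
        if len(line) > limit
    ]
```

Whatever the script produces, viewing the result in Sequin remains the real
check of the format.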
191
192
193 -T Filename for phrap input [File In] Optional
194
195 { Using this argument implies that you are NOT using the }
196 { -i above }
197
198 -P Contigs to use, separate by commas if multiple [String] Optional
199
200 { if -P is not indicated with the -T option, then the }
201 { fragments will go in the order that they are in the }
202 { ace file (which is appropriate for a phase 1 record, }
203 { but not for a phase 2 or 3). If you need to set the }
204 { order of the segments of the ace file, you need to set }
205 { it with the -P flag, like this: }
206 { -P "Contig1,Contig4,Contig3,Contig2,Contig5" }
207
208
209 -A Filename for accession list input [File In] Optional
210
211 { Using this argument implies that you are NOT using the }
212 { -i or -T arguments above. The input file contains a }
213 { tab-delimited table with three to five columns, which }
214 { are accession number, start position, stop position, }
215 { and (optionally) length and strand. If start > stop, }
216 { the minus strand on the referenced accession is used. }
217 { A gap is indicated by the word "gap" instead of an }
218 { accession, 0 for the start and stop positions, and a }
219 { number for the length. }
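
To make the table layout concrete, here is a minimal sketch in Python that
reads such a table and interprets each row the way the text above describes.
It is not the parser used by fa2htgs. The function name and the returned
dictionary keys are hypothetical, and the "span" value assumes 1-based,
inclusive coordinates.

```python
def parse_accession_table(text):
    """Interpret rows of: accession, start, stop, optional length, optional strand."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        name, start, stop = fields[0], int(fields[1]), int(fields[2])
        if name.lower() == "gap":
            # A gap row carries 0 for start and stop, and its size in the length column.
            rows.append({"kind": "gap", "length": int(fields[3])})
        else:
            rows.append({
                "kind": "segment",
                "accession": name,
                "start": start,
                "stop": stop,
                "minus_strand": start > stop,    # rule stated above
                "span": abs(stop - start) + 1,   # assumes inclusive coordinates
            })
    return rows
```

A row such as "L05146<TAB>29410<TAB>133" (coordinates reversed here purely for
illustration) would be reported as a minus-strand segment.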
220
221 -X Coordinates are on the resulting sequence ? [T/F] Optional
222 default = F
223
224 { if -X is TRUE, then the coordinates in the input file }
225 { are on the resulting segmented sequence. This implies }
226 { that bases 1 through n of each accession are used. }
227 { if -X is FALSE, the coordinates are on the individual }
228 { accessions, and these need not start at base 1 of the }
229 { record. }
230
231
232 -D HTGS_DRAFT sequence? [T/F] Optional
233 default = F
234
235 -S Strain name? [String] Optional
236
237 -b Gap length [Integer]
238 default = 100
239 range from 0 to 1000000000
240
241 -N Annotate assembly_fragments [T/F] Optional
242 default = F
243
244 -6 SP6 clone (e.g., Contig1,left) [String] Optional
245
246 -7 T7 clone (e.g., Contig2,right) [String] Optional
247
248 -L Filename for phrap contig order [File In] Optional
249
250 { This is a tab-delimited file that can be used to drive }
251 { the order of contigs (normally specified by -P), as well }
252 { as indicating the SP6 and T7 ends. It can also be used }
253 { when contigs are known to be in opposite orientation. }
254 { For example: }
255 { }
256 { Contig2 + 1 SP6 left }
257 { Contig3 + 1 }
258 { Contig1 - T7 right }
259 { }
260 { The first column is the contig name, the second is the }
261 { orientation, the third is the fragment_group, the fourth }
262 { indicates the SP6 or T7 end, and the fifth says which }
263 { side of SP6 or T7 end had vector removed. }
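
A file like the one shown for -L can be written by a script. The sketch below
(Python) is one way to do it. It assumes that a missing column, such as the
fragment_group of Contig1 in the example, is written as an empty field between
two tabs, which this README does not confirm. The function name is
hypothetical.

```python
def contig_order_lines(entries):
    """Format (contig, orientation, fragment_group, end, side) tuples as tab-delimited text.

    Use None for a column that has no value. Trailing empty columns are dropped;
    an empty column in the middle is kept as an empty field (an assumption).
    """
    lines = []
    for entry in entries:
        fields = ["" if value is None else str(value) for value in entry]
        while fields and fields[-1] == "":
            fields.pop()
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


example = contig_order_lines([
    ("Contig2", "+", 1, "SP6", "left"),
    ("Contig3", "+", 1, None, None),
    ("Contig1", "-", None, "T7", "right"),
])
```

The value of example reproduces the three rows shown above, with real tab
characters between the columns.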
264
265
266 Presented here is an example of a phase 2 submission from an Arabidopsis
267 sequencing center. It is followed by the command line arguments used in
268 an example with a Phrap ace file.
269
270
271 BEFORE YOU BEGIN: fa2htgs does depend on the presence of some external
272 files. These are provided with Sequin, so if a networked version of
273 Sequin is already installed (see URL above for Sequin info) all the
274 default files that need to be present will be there and allow fa2htgs
275 to run.
276
277
278 Here are the files you need (let's assume we have a 100Kb BAC):
279
280 1) fasta file (example below)
281 2) sequin submission file (more on this below)
282 3) genome center name ("pgec" in this example, use your
283 FTP login name)
284 4) the sequence/clone name (this will *always* stay with the record)
285 5) The phase number:
286
287 phase 1: multiple pieces, not in order (always >= 2 pieces,
288 often many more)
289 phase 2: multiple pieces, in order, but can be as few as
290 one unfinished sequence
291 phase 3: 1 piece, where the sequence is "finished"
292
293 6) the full sequence length, when the project is finished (eg 100000
294 in our example).
295
296 7) A new submission has no Accession Number, and an update always
297 does. You will need to keep track of this (ie which sequence name has
298 which accession number)
299
300 8) The organism, in this example "Arabidopsis thaliana"
301
302 9) The chromosome number, 1 in this example.
303
304 10) The output file name: the convention so far has been to call it the
305 clone name.ss (eg P74A8.ss); "ss" stands for seq-submit, or sequence
306 submission. Our scripts/code then report with the same file
307 name convention. Also note that because we are working in Unix space,
308 the case of letters is significant; try to avoid 'metacharacters'
309 (like ^*/\ etc).
310
311 So the phase 1 or 2 FASTA file will look like this (this example
312 has 3 segments, but in phase 1 you could have many more):
313
314 >P74A8 pcr product joining p130c12 and p91c10
315 gatcagcccaaagcattgattaggggaacttacctgtagagggctgcagcaatggggaac
316 acctggctgggtcacagagtggtcaatgcactccatgacttttgggtcaggacacagaaa
317 gaaagagcggggaaccggggggccctacagtgatgaattatactaactgattttagaatg
318 >?
319 >fake next line
320 ttaaacaaacattgcatttccagaataaaccccatttagtaacgcatagtgtgcttgtat
321 ctcagcctcccaaagtgctgggattatagacatgagccagcgcacctggctttgttagcc
322 >?200
323 >fake another line
324 ttttcaaataactttttgaactttgttaattttttaattgcacgttttctccttcattta
325 ctaattccattcaaaagtagcatcaatgagaataaattacttaggaatacatttaattaa
326 aaagtgctagacttgtacactgaaaattacaaagtactctggagatatattc
327
328
329
330 The first line has the sequence id and a title; then each segment
331 is separated by
332
333 >?
334 >foobar
335
336 or:
337
338 >?200
339 >foobar
340
341 where you put a "?" if you don't know the distance between the pieces,
342 or a number of bp if you do know the distance (eg 200 bp), and the
343 other line is the FASTA header of the next segment (foobar). So that is it
344 for phase 1 or 2. Phase 3 will be a single FASTA sequence. Phase 1 gaps
345 will probably always be >?.
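
The layout just described is easy to produce from a list of segments. Below is
a minimal sketch in Python, not an NCBI tool. The function name is
hypothetical, and the header text written for the second and later segments
("segment2", "segment3", ...) is an arbitrary placeholder, like "foobar"
above.

```python
def segmented_fasta(seq_id, title, segments, gaps, width=60):
    """Build a phase 1 or 2 FASTA text.

    segments: list of sequence strings, in submission order.
    gaps: one entry per junction (len(segments) - 1); None for an unknown
          distance (">?"), or a number of bp (for example 200 gives ">?200").
    """
    if len(gaps) != len(segments) - 1:
        raise ValueError("need exactly one gap entry between each pair of segments")

    def wrap(sequence):
        return [sequence[i:i + width] for i in range(0, len(sequence), width)]

    lines = [f">{seq_id} {title}"]
    lines.extend(wrap(segments[0]))
    for number, (gap, sequence) in enumerate(zip(gaps, segments[1:]), start=2):
        lines.append(">?" if gap is None else f">?{gap}")
        lines.append(f">segment{number}")  # placeholder header for the next segment
        lines.extend(wrap(sequence))
    return "\n".join(lines) + "\n"
```

For a phase 1 record every entry in gaps would normally be None, since the
distances between the pieces are not known.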
346
347 The other thing you need is a submission prepared by Sequin. This
348 will allow you to put in the references, authors, titles, and submission
349 information the way you want it. You simply need to make a 1 bp
350 submission. fa2htgs will read that file and copy the
351 information over to the HTGS submission with the "real" data.
352
353 Once you have made the submission, you deposit it on the FTP account
354 under the "SEQSUBMIT" directory. We have software that looks for it there
355 every day, validates the center and clone (sequence) IDs, checks if it's an
356 update and so on, and writes a report that you can pick up the next
357 day.
358
359 It is good to put the output of fa2htgs in Sequin and validate the
360 record. This is especially important for phase 3 records where many
361 annotations may be present (added with the help of Sequin): Sequin has
362 a very good validation suite (look under Search -> Validate)
363
364 This finished record is now ready for deposition to your FTP account
365 in the SEQSUBMIT directory.
366
367
368 Example of the command line arguments using a quality score/Phrap ace file
369 (all on the same command line):
370
371 ./fa2htgs -t nuc1.sqn -o test.cmd32.out -s Phrap_Contig_Test2 -l 111505
372 -g pgec -p 2 -h 1 -d Phrap_Contig_Test2 -n "Arabidopsis thaliana"
373 -T g5129z079.ace -P "Contig1,Contig2,Contig4,Contig3,Contig7"
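
Since fa2htgs is meant to be driven by scripts, a command line like the one
above is usually assembled by a program. Here is a minimal sketch in Python
that builds the argument list from a dictionary keyed by the option letters
documented in this README. The helper names are hypothetical, and the values
are the ones from the example above.

```python
import shlex


def fa2htgs_args(options, program="./fa2htgs"):
    """Turn {"t": "nuc1.sqn", "l": 111505, ...} into an argument list.

    Options whose value is None are left out, so optional arguments
    (for example -a on a new submission) can simply be set to None.
    """
    args = [program]
    for flag, value in options.items():
        if value is None:
            continue
        args.extend([f"-{flag}", str(value)])
    return args


def command_line(args):
    """Render the argument list as one shell-quoted line, for logging."""
    return shlex.join(args)


args = fa2htgs_args({
    "t": "nuc1.sqn",
    "o": "test.cmd32.out",
    "s": "Phrap_Contig_Test2",
    "l": 111505,
    "g": "pgec",
    "p": 2,
    "h": 1,
    "d": "Phrap_Contig_Test2",
    "n": "Arabidopsis thaliana",
    "a": None,  # new submission: no accession yet
    "T": "g5129z079.ace",
    "P": "Contig1,Contig2,Contig4,Contig3,Contig7",
})
```

The list can be passed to subprocess.run(args, check=True) where the fa2htgs
binary is installed. Passing a list rather than a single string means values
containing spaces, such as the organism name, need no extra quoting.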
374
375
376 example of a contig file for a yeast chromosome (with coordinates on the
377 individual accessions):
378
379 U73805 1 2669
380 U12980 79 103687
381 L05146 133 29410
382 L22015 2001 41988
383 L28920 148 54812
384
385
386 -- Questions about fa2htgs or how to submit?
387
388 Just contact us at NCBI:
389
390 e-mail: htgs@ncbi.nlm.nih.gov
391
392 ==============+= end of the fa2htgs README =+==========================