  1 =+= README =+============Last update: April 4, 2000 ============
  2 
  3 the latest version of this document can be found at:
  4 
  5 ftp://ftp.ncbi.nih.gov/fa2htgs/README
  6 
  7 -----
  8 
  9 After having consulted with NCBI staff (see contact information below)
 10 submitters from Genome Sequencing centers will establish what the best
 11 protocol will be for them to deposit their sequence submission data to
 12 NCBI.
 13 
 14 One of these protocols may require the fa2htgs tool, present in this
 15 directory. fa2htgs is a program used to generate Seq-submits (an ASN.1
 16 sequence submission file) for high throughput genome sequencing
 17 projects. Presently we have built fa2htgs for the following platforms:
 18 
 19    alphaOSF1.tar.Z
 20       ibmaix.tar.Z
 21        linux.tar.Z
 22          sgi.tar.Z
 23      solaris.tar.Z
 24          sun.tar.Z
 25  win32/fa2htgs.exe (win95/NT)
 26 
 27 If fa2htgs is required for a platform not present here, 
 28 please let us know (address below) and we will be happy to 
 29 try to provide it.
 30 
 31 fa2htgs will read a FASTA file (or an Ace Contig file with Phrap sequence
 32 quality values), a Sequin submission template file (to get contact
 33 and citation information for the submission), and a series of command line
 34 arguments (see below).  The program then combines this
 35 information to make a submission suitable for GenBank. Once you have
 36 generated your submission file, you need to follow the submission
 37 protocol (see the README present on your FTP account or mailed out to
 38 your Center).
 39 
 40 fa2htgs is intended for automation by scripts for bulk submission of
 41 unannotated genome sequence. It can easily be extended from its current
 42 simple form to allow more complicated processing.  A submission
 43 prepared with fa2htgs can also be read into Sequin, and then annotated
 44 more extensively.  See the Sequin home page at:
 45 
 46 http://www.ncbi.nlm.nih.gov/Sequin/
 47 
 48 Contacting NCBI about HTGS submissions and about using fa2htgs:
 49 
 50   Questions and concerns about this processing protocol, or how to 
 51   use this tool should be forwarded to:
 52 
 53   htgs@ncbi.nlm.nih.gov.
 54 
 55 
 56 =========+=========
 57 
 58 Using fa2htgs:
 59 
 60 Typing "fa2htgs -" will cause the program to show its command line
 61 arguments. Below we show these with additional comments (what we show
 62 within { } does not appear on the command line).
 63 
 64 fa2htgs 2.0   arguments:
 65 
 66   -i  Filename for fasta input [File In]
 67     default = stdin
 68   -t  Filename for Seq-submit template [File In]
 69     default = template.sub
 70   -o  Filename for asn.1 output [File Out]  Optional
 71     default = stdout
 72   -e  Log errors to file named: [File Out]  Optional
 73   -n  Organism name? [String]  Optional
 74     default = Homo sapiens
 75   -s  Sequence name? [String]
 76 
 77      { The sequence must have a name that is unique within     }
 78      { the genome center. We use the combination of the genome }
 79      { center name (-g argument) and the sequence name (-s) to }
 80      { track this sequence and to talk to you about it.        }
 81      { The name can have any form you like but must be unique  }
  82      { within your center.                                   }
 83 
 84   -l  length of sequence in bp? [Integer]
 85 
 86      { The length is checked against the actual number of      }
 87      { bases we get. For phase 1 and 2 sequence it is also     }
 88      { used to estimate gap lengths. For phase 1 and 2         }
 89      { records, it is important to use a number GREATER than   }
 90      { the amount of provided nucleotide, otherwise this will  }
 91      { generate false 'gaps'.  Here it is assumed that the     }
 92      { putative full length of the BAC or cosmid will be used. } 
 93      { There should be at least 20 to 30 'n' in between the    }
 94      { segments (you can check for these in Sequin), as this   }
 95      { will ensure proper behavior when this sequence          }
 96      { is used with BLAST.  Otherwise 'artifactual' unrelated  }
 97      { segment neighbors may be brought into proximity of      }
 98      { each other.                                             }
 99 
100   -g  Genome Center tag? [String]
101 
102      { This is probably the same as your login name on the     }
103      { NCBI FTP server                                         }
104 
105   -p  HTGS phase? [Integer]
106     default = 1
107     range from 1 to 3
108  
109      { Phase 1 - a collection of unordered contigues with      }
110      {           gaps of unknown length.  Phase 1 record must  }
111      {           at the very least have two segments with      }
112      {           one gap.                                      }
113      { Phase 2 - a series of ordered contigs, gap lengths may  }
114      {           be known.  This could be a single sequence,   }
115      {           without gaps, if the sequence has ambiguities }
116      {           which will be resolved.                       }
117      { Phase 3 - a single contiguous sequence.  This sequence  }
118      {           is finished, although it may, or may not      } 
119      {           be annotated.                                 }
120 
121   -a  GenBank accession (if an update) [String]  Optional
122    
123      { this argument is required if this is an update, do      }
124      { not use it if you are preparing a new submission        }
125 
126   -r  Remark for update? [String]  Optional
127 
128      { if this is an update, you can add a brief comment       }
129      { (within "") describing the nature of the update         }
130      { ("new sequence", "new citation", "updated features")    }
131 
132   -c  Clone name? [String]  Optional
133     
134      { will appear as /clone in the source feature             }
135      { This could be the same as the -s argument (sequence     }
136      { name) but this one will appear in the /clone qualifier  }
137 
138   -h  Chromosome? [String]  Optional
139 
140      { will appear as /chromosome in the source feature        }
141 
142   -d  Title for sequence? [String]  Optional
143 
144      { the text that will appear in the DEFINITION line        }
145      { of the GenBank flatfile.                                }
146 
147   -m  Take comment from template ? [T/F]  Optional
148     default = F
149   -u  Take biosource from template ? [T/F]  Optional
150     default = F
151   -x  Secondary accession number, separate by commas if multiple, s.t. U10000,L11000 [String]  Optional
152 
153       { ACCESSION AC000000 L00000                               }
154       {           ^        ^                                    }
155       {           |        secondary accession number           }
156       {           primary accession number                      }
157       {                                                         }
158       { In some cases a large segment will supersede another    }
159       { or group of other accession numbers (records).  These   }
160       { records which are no longer wanted in GenBank should be } 
161       { made secondary. Using the -x argument you can list the  }
162       { Accession Numbers you want to make secondary.  This will} 
163       { instruct us to remove the accession number(s) from      }
164       { GenBank, and they will no longer be part of the GenBank }
165       { release. They will nonetheless be available from Entrez.}
166       {                                                         }
167       { !!GREAT CARE should be taken when using this argument!!!}
168       { improper use of accession numbers here will result in   }
169       { the inappropriate withdrawal of GenBank records from    }
170       { GenBank, EMBL and DDBJ.  We provide this parameter as   }
171       { a convenience to submitting centers, but it may need to }
172       { be removed if it is not used carefully.                 }
173 
174   -C  Clone library name? [String]  Optional
175     
176      { will appear as /clone_lib="string" on the source feature }
177 
178   -M  Map? [String]  Optional
179     
180      { will appear as /map="string" on the source feature       }
181 
182   -O  Filename for the comment: [File In]  Optional
183     
184      { will read the comment from a given file.                 }
185      { maximum 100 characters per line.                         }
186      { new lines can be incorporated with "~", and if you       }
187      { actually want to include the "~" in your text, you       }
188      { need to escape it with "`".  Please ensure that the      }
189      { correct format is obtained by viewing your comment       }
190      { in Sequin.                                               }
191 
192 
193   -T  Filename for phrap input [File In]  Optional
194 
195      { Using this argument infers that you are NOT using the    }
196      {  -i above                                                }
197 
198   -P  Contigs to use, separate by commas if multiple [String]  Optional
199   
200      { if -P is not indicated with the -T option, then the      }
201      { fragments will go in in the order that they are in the   }
202      { ace file (which is appropriate for a phase 1 record,     }
203      { but not for a phase 2 or 3.  If you need to set the      }
204      { order of the segments of the ace file, you need to set   }
205      { it with the -P flag, like this:                          }
206      { -P "Contig1,Contig4,Contig3,Contig2,Contig5"             }
207 
208 
209   -A  Filename for accession list input [File In]  Optional
210 
211      { Using this argument infers that you are NOT using the    }
212      {  -i or -T arguments above.  The input file contains a    }
213      { tab-delimited table with three to five columns, which    }
214      { are accession number, start position, stop position,     }
215      { and (optionally) length and  strand.  If start > stop,   }
216      { the minus strand on the referenced accession is used.    }
217      { A gap is indicated by the word "gap" instead of an       }
218      { accession, 0 for the start and stop positions, and a     }
219      { number for the length.                                   }
220 
221   -X  Coordinates are on the resulting sequence ? [T/F]  Optional
222     default = F
223   
224      { if -X is TRUE, then the coordinates in the input file    }
225      { are on the resulting segmented sequence.  This implies   }
226      { that bases 1 through n of each accession are used.       }
227      { if -X is FALSE, the coordinates are on the individual    }
228      { accessions, and these need not start at base 1 of the    }
229      { record.                                                  }
230 
231 
232   -D  HTGS_DRAFT sequence? [T/F]  Optional
233     default = F
234 
235   -S  Strain name? [String]  Optional
236 
237   -b  Gap length [Integer]
238     default = 100
239     range from 0 to 1000000000
240   
241   -N  Annotate assembly_fragments [T/F]  Optional
242     default = F
243   
244   -6  SP6 clone (e.g., Contig1,left) [String]  Optional
245   
246   -7  T7 clone (e.g., Contig2,right) [String]  Optional
247   
248   -L  Filename for phrap contig order [File In]  Optional
249   
250      { This is a tab-delimited file that can be used to drive   }
251      { the order of contigs (normally specified by -P), as well }
252      { as indicating the SP6 and T7 ends.  It can also be used  }
253      { when contigs are known to be in opposite orientation.    }
254      { For example:                                             }
255      {                                                          }
256      { Contig2     +       1       SP6     left                 }
257      { Contig3     +       1                                    }
258      { Contig1     -               T7      right                }
259      {                                                          }
260      { The first column is the contig name, the second is the   }
261      { orientation, the third is the fragment_group, the fourth }
262      { indicates the SP6 or T7 end, and the fifth says which    }
263      { side of SP6 or T7 end had vector removed.                }
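
The -L contig order file described above can be produced by a script. The following is a minimal Python sketch, not part of fa2htgs: the function name contig_order_line is hypothetical, and it assumes that empty trailing columns may simply be omitted and that an empty middle column is written as two consecutive tabs (the README example suggests this but does not state it).

```python
def contig_order_line(name, orientation="+", fragment_group=None,
                      end=None, side=None):
    """Format one row of a -L contig order file (tab-delimited).

    Columns follow the README: contig name, orientation,
    fragment_group, SP6/T7 end, and side with vector removed.
    """
    if orientation not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    if end not in (None, "SP6", "T7"):
        raise ValueError("end must be 'SP6', 'T7' or None")
    if side not in (None, "left", "right"):
        raise ValueError("side must be 'left', 'right' or None")
    fields = [
        name,
        orientation,
        "" if fragment_group is None else str(fragment_group),
        end or "",
        side or "",
    ]
    # Assumption: trailing empty columns can be dropped.
    return "\t".join(fields).rstrip("\t")


rows = [
    contig_order_line("Contig2", "+", 1, "SP6", "left"),
    contig_order_line("Contig3", "+", 1),
    contig_order_line("Contig1", "-", None, "T7", "right"),
]
print("\n".join(rows))
```

The three rows reproduce the example table given for -L; writing them to a file and passing that file name to -L is left to the calling script.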
264 
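
The rules for the -O comment file (at most 100 characters per line, "~" for a new line, "`" to escape a literal "~") can also be applied by a script. This is only a sketch under assumptions: encode_comment is a hypothetical helper, the escape is assumed to be written as "`~", and physical line breaks in the file are assumed to be treated like spaces. Viewing the result in Sequin, as recommended above, remains the real check.

```python
import textwrap


def encode_comment(lines, max_width=100):
    """Encode display lines for a -O comment file.

    lines: the lines as they should appear in the final comment.
    Returns file text whose physical lines are <= max_width characters.
    """
    # Assumption: a literal "~" is escaped by a preceding back-quote.
    escaped = [line.replace("~", "`~") for line in lines]
    # "~" marks a new line in the displayed comment.
    joined = "~".join(escaped)
    wrapped = textwrap.wrap(joined, width=max_width,
                            break_long_words=False,
                            break_on_hyphens=False)
    for physical in wrapped:
        if len(physical) > max_width:
            raise ValueError("a single word exceeds the line limit")
    return "\n".join(wrapped) + "\n"


print(encode_comment(["Draft sequence.", "Coverage ~5x."]), end="")
```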
265 
266 Presented here is an example of a phase 2 submission from an Arabidopsis 
267 sequencing center. It is followed by an command line arguments used in
268 an example with a Phrap ace file.
269 
270 
271 BEFORE YOU BEGIN: fa2htgs does depend on the presence of some external
272   files.  These are provided with Sequin, so if a networked version of
273   Sequin is already installed (see URL above for Sequin info) all the
274   default files that need to be present will be there and allow fa2htgs
275   to run.
276 
277 
278 Here are the files you need (let's assume we have a 100Kb BAC):
279 
280 1) fasta file (example below)
281 2) sequin submission file (more on this below)
282 3) genome center name ("pgec" in this example, use your 
283    FTP login name)
284 4) the sequence/clone name (this will *always* stay with the record)
285 5) The phase number:
286 
287 phase 1: multiple pieces, not in order (alway >= 2 pieces, 
288          often many more)
289 phase 2: multiple pieces, in order, but can be as few as 
290          one unfinished sequence
291 phase 3: 1 piece, where the sequence is "finished"
292 
293 6) the full sequence length, when the project is finished (eg 100000 
294    in our example).
295 
296 7) A new submission has no Accession Number, and and an update always
297    does.  You will need to keep track of this (ie which sequence name has
298    which accession number)
299 
300 8) The organism, in this example "Arabidopsis thaliana"
301 
302 9) The chromosome number, 1 in this example.
303 
304 10) the output (file name) convention so far has been to call it the
305     clone name.ss (eg P74A8.ss)  "ss" is a seq-submit, or sequence
306     submission.  We then have our scripts/code report with the same file
307     name convention.  Also note that because we are working in Unix space,
308     'case' of letter is important, and try to avoid 'metacharacters' 
309     (like ^*/\ etc).
310 
311 so the phase 1 or 2 FASTA file will look like this (in this example,
312 this is one has 3 segments, but you could (in phase 1) have many more):
313 
314 >P74A8 pcr product joining p130c12 and p91c10 
315 gatcagcccaaagcattgattaggggaacttacctgtagagggctgcagcaatggggaac
316 acctggctgggtcacagagtggtcaatgcactccatgacttttgggtcaggacacagaaa
317 gaaagagcggggaaccggggggccctacagtgatgaattatactaactgattttagaatg
318 >?
319 >fake next line
320 ttaaacaaacattgcatttccagaataaaccccatttagtaacgcatagtgtgcttgtat
321 ctcagcctcccaaagtgctgggattatagacatgagccagcgcacctggctttgttagcc
322 >?200
323 >fake another line
324 ttttcaaataactttttgaactttgttaattttttaattgcacgttttctccttcattta
325 ctaattccattcaaaagtagcatcaatgagaataaattacttaggaatacatttaattaa
326 aaagtgctagacttgtacactgaaaattacaaagtactctggagatatattc
327 
328 
329 
330 The first line has the sequence id, and a title, then each segment 
331 is separated by
332 
333 >?
334 >foobar
335 
336 or:
337 
338 >?200
339 >foobar
340 
341 where you put a "?" if you don't know the distance between the pieces,
342 or a number of bp if you do know the distance (eg 200 bp), and the
343 other line is the fasta formatted next segment (foobar).  So that is it
344 for phase 1 or 2.  Phase 3 will be a single fasta file.  All phase 1
345 gaps will probably always be >?. 
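
The segment layout described above can be generated from a list of sequences. This is a minimal Python sketch and not part of fa2htgs; the function name build_htgs_fasta and the placeholder headers ("segment2", "segment3", ...) are hypothetical, chosen in the same spirit as the throwaway header lines in the example.

```python
def build_htgs_fasta(seq_id, title, segments, gaps, width=60):
    """Return FASTA text for a phase 1 or 2 record.

    segments: nucleotide strings, in submission order.
    gaps:     one entry per junction; None for an unknown distance
              (">?"), or an int number of bp (">?200").
    """
    if len(gaps) != len(segments) - 1:
        raise ValueError("need one gap entry between each pair of segments")
    lines = []
    for i, seq in enumerate(segments):
        if i == 0:
            # Only the first header carries the sequence id and title.
            lines.append(f">{seq_id} {title}".rstrip())
        else:
            gap = gaps[i - 1]
            lines.append(">?" if gap is None else f">?{gap}")
            lines.append(f">segment{i + 1}")  # placeholder header
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


print(build_htgs_fasta("P74A8", "test clone",
                       ["acgt", "ttaa", "ggcc"], [None, 200]), end="")
```

For a phase 3 record, pass a single segment and an empty gap list; the result is then an ordinary single-sequence FASTA file.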
346 
347 So the other thing you need is a submission prepared by Sequin.  This
348 will allow you to put in the references, authors, titles, and submission
349 information the way you want it.  You simply need to make a 1 bp
350 submission really.  fa2htgs will read that file and copy the
351 information over to the htgs submission with the "real" data.
352 
353 So once you have made the submission, you deposit it on the FTP account
354 under the "SEQSUBMIT" directory.  We have software that looks for it there
355 every day, validates the center and clone (sequence) id's, checks if it's an
356 update and so on, and writes a report that you can pick up the next
357 day.
358 
359 It is good to put the output of fa2htgs in Sequin and validate the
360 record.  This is especially important for phase 3 records where many
361 annotations may be present (added with the help of Sequin): Sequin has
362 a very good validation suite (look under Search -> Validate).
363 
364 This finished record is now ready for deposition to your FTP account
365 in the SEQSUBMIT directory.
366 
367 
368 Example of the command line arguments using a quality score/Phrap ace file
369 (all on the same command line):
370 
371 ./fa2htgs -t nuc1.sqn -o test.cmd32.out  -s Phrap_Contig_Test2  -l 111505 
372 -g pgec -p 2 -h 1 -d Phrap_Contig_Test2 -n "Arabidopsis thaliana"  
373 -T g5129z079.ace -P "Contig1,Contig2,Contig4,Contig3,Contig7"
374 
375 
376 example of a contig file for a yeast chromosome (with coordinates on the
377 individual accessions):
378 
379 U73805  1       2669
380 U12980  79      103687
381 L05146  133     29410
382 L22015  2001    41988
383 L28920  148     54812
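
A table like the one above (the -A input) can be read and sanity-checked before submission. This Python sketch is an illustration under assumptions: the names Row and parse_accession_list are hypothetical, fields are split on any whitespace (so a row with an empty length column but a strand column is not handled), and the optional strand column is ignored.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    accession: str
    start: int
    stop: int
    length: Optional[int]

    @property
    def is_gap(self):
        return self.accession.lower() == "gap"

    @property
    def minus_strand(self):
        # start > stop means the minus strand of the accession is used.
        return self.start > self.stop

    @property
    def span(self):
        # Bases covered by this row; for a gap, its stated length.
        if self.is_gap:
            return self.length
        return abs(self.stop - self.start) + 1


def parse_accession_list(text):
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if not 3 <= len(fields) <= 5:
            raise ValueError(f"expected 3 to 5 columns: {line!r}")
        length = int(fields[3]) if len(fields) >= 4 else None
        row = Row(fields[0], int(fields[1]), int(fields[2]), length)
        if row.is_gap and row.length is None:
            raise ValueError("a gap row needs a length")
        rows.append(row)
    return rows


example = "U73805\t1\t2669\nU12980\t79\t103687\ngap\t0\t0\t100\n"
for row in parse_accession_list(example):
    print(row.accession, row.span, "-" if row.minus_strand else "+")
```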
384 
385 
386 -- Questions about fa2htgs or how to submit?  
387 
388    Just contact us at NCBI:
389 
390         e-mail: htgs@ncbi.nlm.nih.gov
391 
392 ==============+= end of the fa2htgs README =+==========================