NCBI C Toolkit Cross Reference

C/doc/tbl2asn.txt


  1 TBL2ASN AUTOMATED BULK SUBMISSION PROGRAM
  2 
  3 tbl2asn is a program that automates the submission of sequence records to
  4 GenBank.  It uses many of the same functions as Sequin, but is driven
  5 entirely by data files, and records need no additional manual editing before
  6 submission.  Entire genomes, consisting of many chromosomes with feature
  7 annotation, can be processed in seconds using this method.
  8 
  9 For a submission, tbl2asn expects a template file containing a text ASN.1
 10 Submit-block object.  These can be generated in Sequin and saved for use by
 11 tbl2asn.  The Submit-block contains contact information (to whom questions
 12 on the submission can be addressed) and a submission citation (which lists
 13 the authors who get scientific credit for the sequencing).
 14 
 15 The template file can also contain one or more text ASN.1 Seq-descr objects
 16 (such as Title or BioSource) appended after the Submit-block.  These can
 17 also be generated in Sequin and saved to a file, and then appended to the
 18 template with a text editor.  They will become descriptors packaged at the
 19 top of each submission file.
 20 
 21 tbl2asn reads six other kinds of data files.  Nucleotide sequence data is
 22 expected in FASTA format, and these files are identified by having a .fsa
 23 suffix.  Feature table files, in the five-column format described later,
 24 have a .tbl suffix.  These can be easily generated by most genome centers
 25 that maintain feature locations in a spreadsheet or database.  For sets of
 26 data records, a source qualifier table can be placed in a .src file.  The
 27 protein translations of CDS features can be supplied as FASTA sequences in
 28 files with a .pep suffix.  These will replace the tbl2asn-generated
 29 conceptual translations, and can be used to verify correct CDS intervals.
 30 The nucleotide sequence products of mRNA features can be provided as FASTA
 31 files with a .rna suffix.  Sequence quality scores can be supplied in files
 32 with a .qvl suffix.
 33 
 34 tbl2asn generates a .sqn file for submission to the database from these
 35 input files.
 36 
 37 
 38 COMMAND-LINE ARGUMENTS
 39 
 40 To process a set of chromosomes, sets of .fsa and .tbl files (along with
 41 optional .src, .pep, .rna, and .qvl files) are placed into a source
 42 directory.  The path to this directory is specified in the -p command-line
 43 argument.  The path for the resulting .sqn submission files is given in the
 44 -r argument.  If the -r argument is not given, the .sqn files are saved in
 45 the source directory.
 46 
 47 For example, if an organism has fifteen chromosomes, one would expect at
 48 least the following files in the source directory:
 49 
 50   chr01.fsa
 51   chr01.tbl
 52   chr02.fsa
 53   ...
 54   chr14.tbl
 55   chr15.fsa
 56   chr15.tbl
 57 
 58 The exact names of the files are not important, but when a file with a
 59 suffix of .fsa is found, tbl2asn will look for a file with the same prefix
 60 that has a .tbl suffix, and then generate a .sqn file.
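
The pairing rule above can be sketched in a few lines of Python.  This is
only an illustration of the prefix matching described in the text, not the
code tbl2asn itself uses; the function name pair_inputs is hypothetical.

```python
from pathlib import Path


def pair_inputs(source_dir):
    """List each .fsa file with the .tbl file sharing its prefix, if any.

    Hypothetical helper: it mirrors the lookup described in the text
    (same prefix, .tbl suffix) and nothing more.
    """
    pairs = []
    for fsa in sorted(Path(source_dir).glob("*.fsa")):
        tbl = fsa.with_suffix(".tbl")
        pairs.append((fsa.name, tbl.name if tbl.exists() else None))
    return pairs
```

A prefix with no matching .tbl file is reported with None, which makes it
easy to spot records that would be processed without feature annotation.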
 61 
 62 The -t command-line argument specifies the template file.
 63 
 64 Normally a single FASTA sequence per .fsa file is expected.  If there are
 65 multiple sequences, only the first is processed unless one of the two -a
 66 flag variants described below is given.
 67 
 68 The -a s flag tells tbl2asn to package the multiple FASTA components as a set
 69 of unrelated sequences.  This accommodates users who create a single file
 70 instead of one file per sequence.  A single FASTA component can contain gap
 71 indications (e.g., >?unk100) on separate lines, as long as each is followed
 72 immediately by more sequence lines; no '>' line with a mock identifier is needed.
 73 
 74 The -a d flag tells the program to make a delta sequence out of the multiple
 75 components.  This can be used for HTGS submissions where the sequence of
 76 the BAC/PAC clone has not been completely determined.  By convention, gaps
 77 of 100 base pairs should be inserted in between the actual sequence
 78 segments with lines containing an angle bracket '>', a question mark '?',
 79 the letters "unk", and the length of the gap.
 80 
 81 >?unk100
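
As a rough sketch, the gap convention can be applied when writing a .fsa
file from a list of sequenced segments.  The function name, the single
definition line ahead of the first segment, and the 70-column wrapping are
assumptions made for illustration; only the >?unk100 gap line comes from
the convention described above.

```python
def join_with_gaps(seq_id, segments, gap_length=100, width=70):
    """Write segments as FASTA text with a gap line between each pair.

    Hypothetical helper.  The gap line is '>', '?', "unk" and the gap
    length, as described in the text.
    """
    lines = [">" + seq_id]
    for index, segment in enumerate(segments):
        if index > 0:
            lines.append(">?unk%d" % gap_length)
        # Wrap each segment to a fixed width (an assumed layout choice).
        lines.extend(segment[i:i + width] for i in range(0, len(segment), width))
    return "\n".join(lines) + "\n"
```

For example, join_with_gaps("bac1", ["ACGT", "GGCC"]) yields a definition
line, the first segment, a >?unk100 line, and the second segment.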
 82 
 83 The -g flag causes tbl2asn to generate a genomic product set.  Within the
 84 set, the products of each related mRNA and CDS are packaged together in an
 85 internal nuc-prot set.  The feature table must provide reciprocal
 86 protein_id and transcript_id qualifiers in order to correctly identify each
 87 mRNA/CDS pair.  From the resulting .sqn file, the genomic sequence, all
 88 transcripts, and all proteins will be entered into the database and given
 89 accessions.  Note, however, that -g cannot be used for records submitted to
 90 GenBank.  It is only suitable for records going into RefSeq.
 91 
 92 If a feature table is not given, the -k c flag tells tbl2asn to annotate the
 93 longest Open Reading Frame (ORF) on each record.  The -k m flag allows
 94 alternative start codons to be used when finding the ORF.  The protein will
 95 be named 'unknown' unless the name is present in the .fsa file definition
 96 line, e.g., [protein=helicase].  The two flags can be combined as -k cm.
 97 
 98 Data records are validated when the -V v flag is given.  Output is
 99 saved to files with a .val suffix.  The validator checks for many things,
100 including internal stops in CDS features and mismatches between the CDS
101 translation and the supplied protein sequence.  Errors need to be corrected
102 before submitting files to GenBank.
103 
104 GenBank format output is generated when the -V b flag is used.  Resulting
105 files have a .gbf suffix.  The validation and flatfile generation flags can
106 be combined as -V vb.
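
The arguments described in this section can be assembled programmatically.
The sketch below builds an argument list using only the flags named above
(-p, -r, -t, -a, -V); the function name and the assumption that the
executable is called "tbl2asn" and is on the search path are illustrative.

```python
def tbl2asn_command(source_dir, template, results_dir=None,
                    fasta_mode=None, verify=""):
    """Build a tbl2asn argument list from the flags described in the text.

    Hypothetical helper; it does not run the program.
    """
    cmd = ["tbl2asn", "-p", source_dir, "-t", template]
    if results_dir:
        cmd += ["-r", results_dir]      # otherwise .sqn files go to source_dir
    if fasta_mode:
        cmd += ["-a", fasta_mode]       # "s" = unrelated set, "d" = delta sequence
    if verify:
        cmd += ["-V", verify]           # "v" = validate, "b" = flatfile, "vb" = both
    return cmd
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a
system where the program is installed.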
107 
108 
109 NUCLEOTIDE SEQUENCE FORMAT
110 
111 tbl2asn can read nucleotide sequences of any size in FASTA format.  A FASTA
112 record consists of a single definition line, beginning with a '>' and
113 followed by optional text, and subsequent lines of sequence.  At minimum,
114 all definition lines must contain an identifier for the sequence, called
115 the SeqID.  The SeqID cannot begin with "assembly", as this is reserved for
116 entry of accession lists in Sequin.  Other optional information about the
117 biological source of the organism can also be encoded in brackets on the
118 definition line.  A sample definition line is
119 
120 >Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]
121 
122 Other elements include [topology=circular] and [location=mitochondrion].
123 RNA viruses would be indicated by [molecule=rna] and [moltype=genomic]. The
124 sequencing technique can be supplied as [tech=fli cDNA].  Many other source
125 qualifiers, such as map, clone, isolate, cell-line, and cultivar, can be
126 used.  For organisms that are not commonly submitted with tbl2asn, the
127 nuclear and mitochondrial genetic codes can be indicated by [gcode=1] and
128 [mgcode=3], respectively.  This will ensure proper translation of CDS
129 features.  Primary accessions of TPA (third party annotation) records are
130 given by [primary=xxx,xxx,...].  Finally, a general note can be added with
131 [note=xxx].
132 
133 Note that the definition line must be a single line, with no return or
134 newline characters.  Some word processors will word-wrap text, either during
135 display or when saving to a file, and care must be taken to avoid unwanted
136 newlines introduced by the editor.
137 
138 >slpy [organism=Zea mays] [chromosome=9] Sleepy transposon
139 TGTAAGATCACTGCTGGGTTGTTGATGAGTTGAGCACCGCTCCCGGCACCCGTCTCCTCTCACGAAGATC
140 TTTAGGGTATGAAAAGTATCTGGAGTTCTTACACGACGGCGAGCCGCCTCTTCTCCGGACGCAGCCGGCC
141 AGCCTTCTTCTCCAAGTCACCTTTTACCGACTCCAAACCCCACCTCAAATACTCCACTCAATCCAGATCA
142 ...
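
To make the definition line layout concrete, the following sketch splits a
line into its SeqID, its bracketed [name=value] elements, and any remaining
title text.  It is a simplified reading of the format for illustration and
is not the parser tbl2asn uses; the function name is hypothetical.

```python
import re

# Matches one bracketed element such as [organism=Zea mays].
_MODIFIER = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(line):
    """Return (seq_id, modifiers, title) for a FASTA definition line."""
    if not line.startswith(">"):
        raise ValueError("definition line must start with '>'")
    body = line[1:].strip()
    modifiers = {name.strip(): value.strip()
                 for name, value in _MODIFIER.findall(body)}
    # Whatever is left after removing the brackets is the SeqID plus title.
    words = _MODIFIER.sub(" ", body).split()
    seq_id = words[0] if words else ""
    return seq_id, modifiers, " ".join(words[1:])
```

Applied to the sample line above, this gives the SeqID "slpy", the
elements organism and chromosome, and the title "Sleepy transposon".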
143 
144 Multiple SeqIDs can be indicated in FASTA-style parsable strings.
145 
146 >gnl|ZGP|chr1|gb|U28041
147 >gi|54465|emb|X16935|MMTCRAC
148 
149 
150 FEATURE TABLE FORMAT
151 
152 tbl2asn reads features from a simple five-column tab-delimited table.  This
153 is described in more detail at
154 
155 http://www.ncbi.nlm.nih.gov/Sequin/table.html
156 
157 The feature table specifies the location and type of each feature, and
158 tbl2asn processes the feature intervals and translates any CDSs into
159 proteins. The first line of the table contains the following basic
160 information.
161 
162 >Feature SeqId table_name 
163 
164 The SeqId must be the same as that used on the sequence. The table name is
165 optional.  Subsequent lines of the table list the features.  Columns are
166 separated by tabs.
167 
168 The first and second columns are the start and stop locations of the
169 feature, respectively.  The third column is the type of feature (the feature
170 key, e.g., gene, mRNA, CDS), the fourth column is a qualifier name (e.g.,
171 "product"), and the fifth is a qualifier value (e.g., the name of the protein
172 or gene).
173 
174 A simple feature table is
175 
176 >Feature sde3g
177 240     4084    gene
178                         gene    SDE3
179 240     1361    mRNA
180 1450    1641
181 1730    3184
182 3275    4084
183                         product RNA helicase SDE3
184 
185 579     1361    CDS
186 1450    1641
187 1730    3184
188 3275    3880
189                         product RNA helicase SDE3
190 
191 If a feature contains multiple intervals, each interval is listed on a
192 separate line by its start and stop position.   Features that are on the
193 complementary strand are indicated by reversing the interval locations. 
194 Locations of partial (incomplete) features are indicated with a '>' or '<'
195 next to the number.
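
For centers that keep feature locations in a database, rows of the table
can be produced with a small routine like the one sketched here.  The
function name is hypothetical, and listing the intervals of a
complementary-strand feature in reverse order (as well as reversing each
start and stop) is an assumption; the text above only states that the
interval locations are reversed.

```python
def format_feature(key, intervals, qualifiers=(), minus_strand=False):
    """Return five-column table lines for one feature.

    intervals are (start, stop) pairs with start <= stop, in ascending
    order.  Hypothetical helper for illustration.
    """
    if minus_strand:
        # Assumed convention: last interval first, each written stop..start.
        intervals = [(stop, start) for start, stop in reversed(intervals)]
    lines = []
    for index, (start, stop) in enumerate(intervals):
        columns = [str(start), str(stop)]
        if index == 0:
            columns.append(key)         # the feature key goes on the first interval
        lines.append("\t".join(columns))
    for name, value in qualifiers:
        # Qualifiers leave the first three columns empty.
        lines.append("\t\t\t%s\t%s" % (name, value))
    return lines
```

Calling format_feature("gene", [(240, 4084)], [("gene", "SDE3")]) reproduces
the first two lines of the sample table, with tabs between the columns.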
196 
197 Gene features are always a single interval, and their location should cover
198 the intervals of all the relevant features.  If the gene feature spans the
199 intervals of the CDS or mRNA features for that gene, there is no need to
200 include gene qualifiers on those features in the table, since they will be
201 picked up by overlap.  Use of the overlapping gene can be suppressed by
202 adding a gene qualifier with the value "-".  This is important when, for
203 example, a tRNA is encoded within an intron of a housekeeping gene.
204 
205 Translation exception qualifiers are parsed from the same style used in the
206 GenBank flatfile.
207 
208                         transl_except   (pos:591..593,aa:Sec)
209 
210 The codon recognized and anticodon position of tRNAs can also be given.
211 
212                         codon_recognized   TGG
213                         anticodon   (pos:7591..7593,aa:Trp)
214 
215 In addition to the standard qualifiers seen in GenBank format, several other
216 tokens are used to direct values to specific fields in the ASN.1 data. 
217 These include gene_syn, gene_desc, locus_tag, prot_desc, prot_note,
218 region_name, bond_type, and site_type.
219 
220 Genomic product sets require protein_id and transcript_id qualifiers on each
221 mRNA and CDS feature.  These are used to associate the correct pair of
222 features for packaging.
223 
224                         protein_id      lcl|sde3p
225                         transcript_id   lcl|sde3m
226 
227 Exceptional biological situations can be annotated by use of the exception
228 qualifier.  For example
229 
230                         exception       ribosomal slippage
231 
232 The following are legal exception qualifier values
233 
234   RNA editing
235   reasons given in citation
236   ribosomal slippage
237   trans-splicing
238   alternative processing
239   artificial frameshift
240   nonconsensus splice site
241   rearrangement required for product
242   modified codon recognition
243   alternative start codon
244   dicistronic gene
245   transcribed pseudogene
246 
247 Since the International Nucleotide Sequence Database collaboration only
248 allows "RNA editing" and "reasons given in citation" to appear in release
249 mode, other exceptions are mapped to the /note qualifier in the flatfile.
250 However, each exception text string turns off specific validator tests that
251 would otherwise produce warning messages, so they should be entered as
252 exception qualifiers, not as notes.
253 
254 Gene Ontology (GO) terms can be indicated with the following qualifiers
255 
256                         go_component    endoplasmic reticulum|0005783
257                         go_process      glycolysis and gluconeogenesis|57|89197757|ACT,TEM
258                         go_function     excision repair|93||IPD
259 
260 The value field is separated by vertical bars '|' into a descriptive
261 string, the GO identifier (leading zeroes are retained), and optionally a
262 PubMed ID (or GO Reference number starting with a leading 0) and one or more
263 GO evidence codes.
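
The layout of the value field can be illustrated with a short parsing
sketch.  The function name and the dictionary keys are hypothetical, and
treating the evidence codes as a comma-separated list is inferred from the
"ACT,TEM" example above.

```python
def parse_go_value(value):
    """Split a GO qualifier value on '|' into its parts.

    Hypothetical helper: text, GO identifier, optional reference
    (PubMed ID or GO Reference number), optional evidence codes.
    """
    fields = value.split("|")
    fields += [""] * (4 - len(fields))  # pad the optional trailing fields
    text, go_id, reference, evidence = fields[:4]
    return {
        "text": text,
        "go_id": go_id,
        "reference": reference or None,
        "evidence": [code for code in evidence.split(",") if code],
    }
```

An empty third field, as in "excision repair|93||IPD", is reported as a
missing reference while the evidence code is still picked up.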
264 
265 
266 SOURCE TABLE FORMAT
267 
268 For sets of sequences, a source qualifier table can optionally be placed in
269 a tab-delimited file with a .src extension.  The first line gives the
270 source qualifier names, separated by tabs.  The first column must be the
271 sequence identifier.  For example
272 
273 sequence_id     organism    strain       isolate
274 
275 The remaining lines each give the source qualifiers for one sequence.  For
276 example
277 
278 sde3g           Zea mays    A69Y         JH90.6-2x12
279 
280 The same information can be provided in the FASTA definition line or in the
281 source section of the five-column feature table.
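
A .src file of this shape is simple to generate from existing records.
The sketch below is one possible way to do so; the function name is
hypothetical, and the header text "sequence_id" is taken from the example
above rather than from any stated requirement.

```python
def source_table(qualifiers, rows):
    """Return tab-delimited source table text.

    qualifiers: column names after the identifier column.
    rows: mapping of sequence identifier to {qualifier: value}.
    Hypothetical helper for illustration.
    """
    lines = ["\t".join(["sequence_id"] + list(qualifiers))]
    for seq_id, values in rows.items():
        # Missing qualifiers are written as empty columns.
        lines.append("\t".join([seq_id] + [values.get(q, "") for q in qualifiers]))
    return "\n".join(lines) + "\n"
```

The returned text can be written to a file whose prefix matches the
corresponding .fsa file and whose suffix is .src.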
282 
283 
284 PROTEIN SEQUENCE FORMAT
285 
286 Protein sequences are FASTA files with a .pep extension that can substitute
287 for the translated product of a CDS feature.  Supplying these files acts as
288 a reality check that the CDS intervals do in fact translate to the expected
289 protein sequence.  The FASTA defline with a '>' and sequence identifier is
290 required.
291 
292 >sde3p
293 MSVSUYKSDDEYSVIADKGEIGFIDYQNDGSSGCYNPFDEGPVVVSVPFPFKKEKPQSVTVGETSFDSFT
294 VKNTMDEPVDLWTKIYASNPEDSFTLSILKPPSKDSDLKERQCFYETFTLEDRMLEPGDTLTIWVSCKPK
295 ...
296 
297 The SeqID must match a protein_id in the .tbl file.  In the table above,
298 the protein_id and transcript_id need to explicitly use a 'lcl|' prefix
299 before the SeqID string to indicate a local identifier.  A local sequence
300 identifier is assumed when reading FASTA, but a database accession is
301 assumed in the feature table.
302 
303 Sequin's Suggest Interval functionality, which can determine CDS intervals
304 from nucleotide and protein sequences plus the genetic code, is not used in
305 tbl2asn.  Instead, the CDS intervals are required, and the supplied protein
306 sequence is just used to confirm proper translation.
307 
308 
309 MESSENGER RNA SEQUENCE FORMAT
310 
311 mRNA sequences are FASTA files with a .rna extension that can substitute
312 for the transcribed product of an mRNA feature.  Like the .pep files, they
313 act as a reality check that the supplied intervals do in fact encode the
314 expected mRNA sequence.
315 
316 >sde3m
317 TTTTCATGTTTCTTCTCCTTTGAAGCCTGCCTGCGTTAGTCTGGCTTCATTGCTTCTCCATTTCTTGGTG
318 TGATCGAATCAAAGAGTGTAACCCATTTTGCTACTGATTCAGTACGTATGATCAATTCTCTCAATTTCAG
319 ...
320 
321 The SeqID must match the transcript_id from an mRNA feature.
322 
323 
324 QUALITY SCORES FORMAT
325 
326 Phrap/Consed quality scores can be supplied in .qvl files.  These generate
327 Seq-graph data that will be attached to the nucleotide sequence from the
328 .fsa file. Programs such as Sequin can display these in a graphical view.
329 
330 >chr1
331  51 63 70 82 82 82 90 90 90 90 86 86 86 86 90 90 90 90 90 86
332  86 86 86 86 86 86 86 90 90 90 90 90 90 86 86 78 78 90 90 86
333 ...
334 
335 These values can be extracted from the output files of the Phrap and Consed
336 programs used to process raw data from automated sequencing machines.
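
If the scores are already held in memory, a file in the layout shown above
can be written with a routine along these lines.  The function name, the
twenty values per line, and the leading space on each line are copied from
the appearance of the example and are assumptions, not documented
requirements.

```python
def format_qvl(seq_id, scores, per_line=20):
    """Return .qvl-style text: a definition line, then rows of scores.

    Hypothetical helper; the row layout imitates the example only.
    """
    lines = [">" + seq_id]
    for start in range(0, len(scores), per_line):
        row = scores[start:start + per_line]
        lines.append(" " + " ".join(str(score) for score in row))
    return "\n".join(lines) + "\n"
```

The identifier on the definition line should be the same SeqID used in the
.fsa file, so that the scores attach to the right nucleotide sequence.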
337 
338 
339 SUBMISSION TEMPLATE FORMAT
340 
341 The submission template is an ASN.1 Submit-block that can be generated by
342 Sequin.  A simple example is shown below.
343 
344 Submit-block ::= {
345   contact {
346     contact {
347       name
348         name {
349           last "Darwin" ,
350           first "Charles" ,
351           initials "C.R." ,
352           suffix "" } ,
353       affil
354         std {
355           affil "Oxbridge University" ,
356           div "Evolutionary Biology Department" ,
357           city "Camford" ,
358           country "United Kingdom" ,
359           street "1859 Tennis Court Lane" ,
360           email "darwin@beagle.edu.uk" ,
361           phone "01 44 171-007-1212" ,
362           postal-code "OX1 2BH" } } } ,
363   cit {
364     authors {
365       names
366         std {
367           {
368             name
369               name {
370                 last "Darwin" ,
371                 first "Charles" ,
372                 initials "C.R." } } } ,
373       affil
374         std {
375           affil "Oxbridge University" ,
376           div "Evolutionary Biology Department" ,
377           city "Camford" ,
378           country "United Kingdom" ,
379           street "1859 Tennis Court Lane" ,
380           postal-code "OX1 2BH" } } ,
381     date
382       std {
383         year 2003 ,
384         month 2 ,
385         day 28 } } ,
386   subtype new  }
387 
388 This can be exported from the Desktop view of a template file in Sequin, or
389 from the initial submission dialogs.  In addition, an unpublished reference or
390 comments can be generated in Sequin and saved from the Desktop.  The
391 two files can be concatenated to make a .sbt template with the publication or
392 comment descriptor after the Submit-block.
393 
