1 TBL2ASN AUTOMATED BULK SUBMISSION PROGRAM
2
3 tbl2asn is a program that automates the submission of sequence records to
4 GenBank. It uses many of the same functions as Sequin, but is driven
5 entirely by data files, and records need no additional manual editing before
6 submission. Entire genomes, consisting of many chromosomes with feature
7 annotation, can be processed in seconds using this method.
8
9 For a submission, tbl2asn expects a template file containing a text ASN.1
10 Submit-block object. These can be generated in Sequin and saved for use by
11 tbl2asn. The Submit-block contains contact information (to whom questions
12 on the submission can be addressed) and a submission citation (which lists
13 the authors who get scientific credit for the sequencing).
14
15 The template file can also contain one or more text ASN.1 Seq-descr objects
16 (such as Title or BioSource) appended after the Submit-block. These can
17 also be generated in Sequin and saved to a file, and then appended to the
18 template with a text editor. They will become descriptors packaged at the
19 top of each submission file.
20
21 tbl2asn reads six other kinds of data files. Nucleotide sequence data is
22 expected in FASTA format, and these files are identified by having a .fsa
23 suffix. Feature table files, in the five-column format described later,
24 have a .tbl suffix. These can be easily generated by most genome centers
25 that maintain feature locations in a spreadsheet or database. For sets of
26 data records, a source qualifier table can be placed in a .src file. The
27 protein translations of CDS features can be supplied as FASTA sequences in
28 files with a .pep suffix. These will replace the tbl2asn-generated
29 conceptual translations, and can be used to verify correct CDS intervals.
30 The nucleotide sequence products of mRNA features can be provided as FASTA
31 files with a .rna suffix. Sequence quality scores can be supplied in files
32 with a .qvl suffix.
33
34 tbl2asn generates a .sqn file for submission to the database from these
35 input files.
36
37
38 COMMAND-LINE ARGUMENTS
39
40 To process a set of chromosomes, sets of .fsa and .tbl files (along with
41 optional .src, .pep, .rna, and .qvl files) are placed into a source
42 directory. The path to this directory is specified in the -p command-line
43 argument. The path for the resulting .sqn submission files is given in the
44 -r argument. If the -r argument is not given, the .sqn files are saved in
45 the source directory.
46
47 For example, if an organism has fifteen chromosomes, one would expect at
48 least the following files in the source directory:
49
50 chr01.fsa
51 chr01.tbl
52 chr02.fsa
53 ...
54 chr14.tbl
55 chr15.fsa
56 chr15.tbl
57
58 The exact names of the files are not important, but when a file with a
59 suffix of .fsa is found, tbl2asn will look for a file with the same prefix
60 that has a .tbl suffix, and then generate a .sqn file.
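
The pairing rule above can be sketched in a few lines of Python. This is only an illustration of the same-prefix lookup, not tbl2asn's own code; the function name find_input_sets is hypothetical.

```python
from pathlib import Path

# Optional companion suffixes named in the text above.
OPTIONAL_SUFFIXES = (".src", ".pep", ".rna", ".qvl")


def find_input_sets(source_dir):
    """Group each .fsa file with the same-prefix companion files that exist.

    Returns a dict mapping the file prefix to a dict of suffix -> Path.
    """
    sets = {}
    for fsa in sorted(Path(source_dir).glob("*.fsa")):
        found = {".fsa": fsa}
        for suffix in (".tbl",) + OPTIONAL_SUFFIXES:
            candidate = fsa.with_suffix(suffix)  # same prefix, other suffix
            if candidate.exists():
                found[suffix] = candidate
        sets[fsa.stem] = found
    return sets
```

A prefix whose entry lacks ".tbl" has no feature table, which is worth checking before a run.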
61
62 The -t command-line argument specifies the template file.
63
64 Normally a single FASTA sequence per .fsa file is expected. If there are
65 multiple sequences, only the first is processed, unless one of the two -a flag
66 variants is given. These are discussed below.
67
68 The -a s flag tells tbl2asn to package the multiple FASTA components as a set
69 of unrelated sequences. This accommodates users who create a single file
70 instead of one file per sequence. A single FASTA component can also contain
71 gap indications (e.g., >?unk100) on separate lines, as long as each is followed
72 immediately by more sequence lines, without a new '>' and mock identifier.
73
74 The -a d flag tells the program to make a delta sequence out of the multiple
75 components. This can be used for HTGS submissions where the sequence of
76 the BAC/PAC clone has not been completely determined. By convention, gaps
77 of 100 base pairs should be inserted in between the actual sequence
78 segments with lines containing an angle bracket '>', a question mark '?',
79 the letters "unk", and the length of the gap.
80
81 >?unk100
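
As a rough sketch, a gapped .fsa file of this kind could be written from a list of sequence segments as follows. The function name delta_fasta and the 70-column line width are assumptions made for illustration; only the >?unk100 gap line comes from the convention described above.

```python
def delta_fasta(seq_id, segments, gap_length=100, width=70):
    """Return FASTA text with a '>?unkN' gap line between sequence segments."""
    lines = [">" + seq_id]
    for index, segment in enumerate(segments):
        if index > 0:
            # Gap line: '>', '?', "unk", then the gap length.
            lines.append(">?unk%d" % gap_length)
        for start in range(0, len(segment), width):
            lines.append(segment[start:start + width])
    return "\n".join(lines) + "\n"
```

The resulting text would then be saved with a .fsa suffix and processed with the -a d flag.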
82
83 The -g flag causes tbl2asn to generate a genomic product set. Within the
84 set, the products of each related mRNA and CDS are packaged together in an
85 internal nuc-prot set. The feature table must provide reciprocal
86 protein_id and transcript_id qualifiers in order to correctly identify each
87 mRNA/CDS pair. From the resulting .sqn file, the genomic sequence, all
88 transcripts, and all proteins will be entered into the database and given
89 accessions. Note, however, that -g cannot be used for records submitted to
90 GenBank. It is only suitable for records going into RefSeq.
91
92 If a feature table is not given, the -k c flag tells tbl2asn to annotate the
93 longest Open Reading Frame (ORF) on each record. The -k m flag allows
94 alternative start codons to be used when finding the ORF. The protein will
95 be named 'unknown' unless the name is present in the .fsa file definition
96 line, e.g., [protein=helicase]. The two flags can be combined as -k cm.
97
98 Data records will be validated when the -V v flag is given. Output is
99 saved to files with a .val suffix. The validator checks for many things,
100 including internal stops in CDS features and mismatches between the CDS
101 translation and the supplied protein sequence. Errors need to be corrected
102 before submitting files to GenBank.
103
104 GenBank format output is generated when the -V b flag is used. Resulting
105 files have a .gbf suffix. The validation and flatfile generation flags can
106 be combined as -V vb.
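
The flags described in this section can be assembled programmatically. The following Python sketch builds an argument list from the -p, -r, -t, -a, and -V flags only. The function name build_tbl2asn_args is hypothetical, and the executable name "tbl2asn" is assumed to be on the search path.

```python
def build_tbl2asn_args(source_dir, template, results_dir=None,
                       fasta_set=None, validate=False, flatfile=False):
    """Assemble a tbl2asn argument list from the flags described above."""
    args = ["tbl2asn", "-p", str(source_dir), "-t", str(template)]
    if results_dir is not None:
        args += ["-r", str(results_dir)]
    if fasta_set is not None:
        if fasta_set not in ("s", "d"):
            raise ValueError("fasta_set must be 's' or 'd'")
        args += ["-a", fasta_set]
    # -V v validates, -V b writes GenBank format, -V vb does both.
    verify = ("v" if validate else "") + ("b" if flatfile else "")
    if verify:
        args += ["-V", verify]
    return args
```

The returned list can be passed to subprocess.run(args, check=True) from the Python standard library to launch the program.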
107
108
109 NUCLEOTIDE SEQUENCE FORMAT
110
111 tbl2asn can read nucleotide sequences of any size in FASTA format. A FASTA
112 record consists of a single definition line, beginning with a '>' and
113 followed by optional text, and subsequent lines of sequence. At minimum,
114 all definition lines must contain an identifier for the sequence, called
115 the SeqID. The SeqID cannot begin with "assembly", as this is reserved for
116 entry of accession lists in Sequin. Other optional information about the
117 biological source of the organism can also be encoded in brackets on the
118 definition line. A sample definition line is
119
120 >Sc_16 [organism=Saccharomyces cerevisiae] [strain=S288C] [chromosome=XVI]
121
122 Other elements include [topology=circular] and [location=mitochondrion].
123 RNA viruses would be indicated by [molecule=rna] and [moltype=genomic]. The
124 sequencing technique can be supplied as [tech=fli cDNA]. Many other source
125 qualifiers, such as map, clone, isolate, cell-line, and cultivar, can be
126 used. For organisms that are not commonly submitted with tbl2asn, the
127 nuclear and mitochondrial genetic codes can be indicated by [gcode=1] and
128 [mgcode=3], respectively. This will ensure proper translation of CDS
129 features. Primary accessions of TPA (third party annotation) records are
130 given by [primary=xxx,xxx,...]. Finally, a general note can be added with
131 [note=xxx].
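
To check definition lines before running tbl2asn, the bracketed elements can be pulled apart with a short Python routine. This is a minimal sketch that assumes every element has the form [name=value]; the function name parse_defline is hypothetical and the parsing is not taken from tbl2asn itself.

```python
import re

# One bracketed element of the form [name=value].
MODIFIER = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into (seq_id, modifiers, title)."""
    text = line.rstrip("\r\n")
    if "\n" in text or "\r" in text:
        raise ValueError("definition line must be a single line")
    if not text.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    parts = text[1:].split(None, 1)
    if not parts:
        raise ValueError("definition line has no SeqID")
    seq_id = parts[0]
    if seq_id.startswith("assembly"):
        raise ValueError("SeqID cannot begin with 'assembly'")
    rest = parts[1] if len(parts) > 1 else ""
    modifiers = {key.strip(): value.strip()
                 for key, value in MODIFIER.findall(rest)}
    # Whatever remains outside the brackets is treated as free text.
    title = " ".join(MODIFIER.sub(" ", rest).split())
    return seq_id, modifiers, title
```

For the sample line above, the routine returns the SeqID "Sc_16" and a dictionary with the organism, strain, and chromosome values.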
132
133 Note that the definition line must be a single line, with no return or
134 newline characters. Some word-processors will word-wrap text, either during
135 display or when saving to a file, and care must be taken to avoid unwanted
136 newlines introduced by the editor.
137
138 >slpy [organism=Zea mays] [chromosome=9] Sleepy transposon
139 TGTAAGATCACTGCTGGGTTGTTGATGAGTTGAGCACCGCTCCCGGCACCCGTCTCCTCTCACGAAGATC
140 TTTAGGGTATGAAAAGTATCTGGAGTTCTTACACGACGGCGAGCCGCCTCTTCTCCGGACGCAGCCGGCC
141 AGCCTTCTTCTCCAAGTCACCTTTTACCGACTCCAAACCCCACCTCAAATACTCCACTCAATCCAGATCA
142 ...
143
144 Multiple SeqIDs can be indicated in FASTA-style parsable strings.
145
146 >gnl|ZGP|chr1|gb|U28041
147 >gi|54465|emb|X16935|MMTCRAC
148
149
150 FEATURE TABLE FORMAT
151
152 tbl2asn reads features from a simple five-column tab-delimited table. This
153 is described in more detail at
154
155 http://www.ncbi.nlm.nih.gov/Sequin/table.html
156
157 The feature table specifies the location and type of each feature, and
158 tbl2asn processes the feature intervals and translates any CDSs into
159 proteins. The first line of the table contains the following basic
160 information.
161
162 >Feature SeqId table_name
163
164 The SeqId must be the same as that used on the sequence. The table name is
165 optional. Subsequent lines of the table list the features. Columns are
166 separated by tabs.
167
168 The first and second columns are the start and stop locations of the
169 feature, respectively, the third column is the type of feature (the feature
170 key, e.g., gene, mRNA, CDS), the fourth column is a qualifier name (e.g.,
171 "product"), and the fifth is a qualifier value (e.g., the name of the protein
172 or gene).
173
174 A simple feature table is
175
176 >Feature sde3g
177 240	4084	gene
178 			gene	SDE3
179 240	1361	mRNA
180 1450	1641
181 1730	3184
182 3275	4084
183 			product	RNA helicase SDE3
184
185 579	1361	CDS
186 1450	1641
187 1730	3184
188 3275	3880
189 			product	RNA helicase SDE3
190
191 If a feature contains multiple intervals, each interval is listed on a
192 separate line by its start and stop position. Features that are on the
193 complementary strand are indicated by reversing the interval locations.
194 Locations of partial (incomplete) features are indicated with a '>' or '<'
195 next to the number.
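
The interval conventions above (reversed coordinates for the complementary strand, '<' or '>' for partial ends) can be illustrated with a small Python helper. This is a sketch under stated assumptions: the name parse_interval is hypothetical, the partial marker is accepted on either side of the number, and an interval whose start equals its stop is reported on the plus strand.

```python
def parse_interval(start_text, stop_text):
    """Interpret the start and stop columns of one feature table line."""
    # A '<' or '>' next to a number marks that end as partial.
    partial_start = "<" in start_text or ">" in start_text
    partial_stop = "<" in stop_text or ">" in stop_text
    start = int(start_text.strip("<>"))
    stop = int(stop_text.strip("<>"))
    # Reversed coordinates indicate the complementary strand.
    strand = "-" if start > stop else "+"
    return {
        "low": min(start, stop),
        "high": max(start, stop),
        "strand": strand,
        "partial_start": partial_start,
        "partial_stop": partial_stop,
    }
```

The partial_start and partial_stop entries refer to the feature's own start and stop columns, not to the lower and higher coordinates.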
196
197 Gene features are always a single interval, and their location should cover
198 the intervals of all the relevant features. If the gene feature spans the
199 intervals of the CDS or mRNA features for that gene, there is no need to
200 include gene qualifiers on those features in the table, since they will be
201 picked up by overlap. Use of the overlapping gene can be suppressed by
202 adding a gene qualifier with the value "-". This is important when, for
203 example, a tRNA is encoded within an intron of a housekeeping gene.
204
205 Translation exception qualifiers are parsed from the same style used in the
206 GenBank flatfile.
207
208 transl_except (pos:591..593,aa:Sec)
209
210 The codon recognized and anticodon position of tRNAs can also be given.
211
212 codon_recognized TGG
213 anticodon (pos:7591..7593,aa:Trp)
214
215 In addition to the standard qualifiers seen in GenBank format, several other
216 tokens are used to direct values to specific fields in the ASN.1 data.
217 These include gene_syn, gene_desc, locus_tag, prot_desc, prot_note,
218 region_name, bond_type, and site_type.
219
220 Genomic product sets require protein_id and transcript_id qualifiers on each
221 mRNA and CDS feature. These are used to associate the correct pair of
222 features for packaging.
223
224 protein_id lcl|sde3p
225 transcript_id lcl|sde3m
226
227 Exceptional biological situations can be annotated by use of the exception
228 qualifier. For example
229
230 exception ribosomal slippage
231
232 The following are legal exception qualifier values
233
234 RNA editing
235 reasons given in citation
236 ribosomal slippage
237 trans-splicing
238 alternative processing
239 artificial frameshift
240 nonconsensus splice site
241 rearrangement required for product
242 modified codon recognition
243 alternative start codon
244 dicistronic gene
245 transcribed pseudogene
246
247 Since the International Nucleotide Sequence Database collaboration only
248 allows "RNA editing" and "reasons given in citation" to appear in release
249 mode, other exceptions are mapped to the /note qualifier in the flatfile.
250 However, each exception text string turns off specific validator tests that
251 would otherwise produce warning messages, so they should be entered as
252 exception qualifiers, not as notes.
253
254 Gene Ontology (GO) terms can be indicated with the following qualifiers
255
256 go_component endoplasmic reticulum|0005783
257 go_process glycolysis and gluconeogenesis|57|89197757|ACT,TEM
258 go_function excision repair|93||IPD
259
260 The value field is separated by vertical bars '|' into a descriptive
261 string, the GO identifier (leading zeroes are retained), and optionally a
262 PubMed ID (or GO Reference number starting with a leading 0) and one or more
263 GO evidence codes.
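
The layout of a GO qualifier value can be checked with a few lines of Python. In this sketch the function name parse_go_value is hypothetical, and the comma between evidence codes is inferred from the go_process example above rather than stated by the format description.

```python
def parse_go_value(value):
    """Split a GO qualifier value on '|' into its named parts."""
    fields = value.split("|")
    if len(fields) < 2:
        raise ValueError("expected at least 'description|GO identifier'")
    fields += [""] * (4 - len(fields))  # pad the optional trailing fields
    description, go_id, reference, evidence = fields[:4]
    return {
        "description": description.strip(),
        "go_id": go_id.strip(),  # kept as text so leading zeroes survive
        "reference": reference.strip() or None,
        "evidence": [code.strip() for code in evidence.split(",")
                     if code.strip()],
    }
```

Keeping the identifier and reference as strings avoids losing leading zeroes, which an integer conversion would drop.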
264
265
266 SOURCE TABLE FORMAT
267
268 For sets of sequences, a source qualifier table can optionally be placed in
269 a tab-delimited file with a .src extension. The first line gives the
270 source qualifier names, separated by tabs. The first column must be the
271 sequence identifier. For example
272
273 sequence_id	organism	strain	isolate
274
275 The remaining lines each give the source qualifiers for one sequence. For
276 example
277
278 sde3g	Zea mays	A69Y	JH90.6-2x12
279
280 The same information can be provided in the FASTA definition line or in the
281 source section of the five-column feature table.
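
A .src table can be inspected before a run with the Python standard library. The sketch below reads a tab-delimited table into a dictionary keyed by sequence identifier; the function name read_source_table is hypothetical.

```python
import csv


def read_source_table(handle):
    """Return {sequence_id: {qualifier: value}} from a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)  # first line: qualifier names, first is the ID
    table = {}
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        # Pair each remaining column with its qualifier name.
        table[row[0]] = dict(zip(header[1:], row[1:]))
    return table
```

The handle can be an open file, for example open("chr01.src", newline=""), where the file name is only a placeholder.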
282
283
284 PROTEIN SEQUENCE FORMAT
285
286 Protein sequences are FASTA files with a .pep extension that can substitute
287 for the translated product of a CDS feature. Supplying these files acts as
288 a reality check that the CDS intervals do in fact translate to the expected
289 protein sequence. The FASTA defline with a '>' and sequence identifier is
290 required.
291
292 >sde3p
293 MSVSUYKSDDEYSVIADKGEIGFIDYQNDGSSGCYNPFDEGPVVVSVPFPFKKEKPQSVTVGETSFDSFT
294 VKNTMDEPVDLWTKIYASNPEDSFTLSILKPPSKDSDLKERQCFYETFTLEDRMLEPGDTLTIWVSCKPK
295 ...
296
297 The SeqID must match a protein_id in the .tbl file. In the table above,
298 the protein_id and transcript_id need to explicitly use a 'lcl|' prefix
299 before the SeqID string to indicate a local identifier. A local sequence
300 identifier is assumed when reading FASTA, but a database accession is
301 assumed in the feature table.
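
Because a missing 'lcl|' prefix is an easy mistake to make, the match between .pep identifiers and protein_id values can be checked in advance. The following Python sketch assumes the identifiers have already been collected into lists; the function names are hypothetical.

```python
def local_id(qualifier_value):
    """Return the local identifier from a 'lcl|...' value, or None."""
    prefix = "lcl|"
    if qualifier_value.startswith(prefix):
        return qualifier_value[len(prefix):]
    return None  # without the prefix, a database accession is assumed


def unmatched_proteins(pep_ids, protein_id_values):
    """List .pep SeqIDs that no 'lcl|' protein_id in the table refers to."""
    local = {local_id(value) for value in protein_id_values}
    return sorted(set(pep_ids) - local)
```

An empty result means every protein sequence has a matching protein_id qualifier.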
302
303 Sequin's Suggest Interval functionality, which can determine CDS intervals
304 from nucleotide and protein sequences plus the genetic code, is not used in
305 tbl2asn. Instead, the CDS intervals are required, and the supplied protein
306 sequence is just used to confirm proper translation.
307
308
309 MESSENGER RNA SEQUENCE FORMAT
310
311 mRNA sequences are FASTA files with a .rna extension that can substitute
312 for the transcribed product of an mRNA feature. Like the .pep files, they
313 act as a reality check that the supplied intervals do in fact encode the
314 expected mRNA sequence.
315
316 >sde3m
317 TTTTCATGTTTCTTCTCCTTTGAAGCCTGCCTGCGTTAGTCTGGCTTCATTGCTTCTCCATTTCTTGGTG
318 TGATCGAATCAAAGAGTGTAACCCATTTTGCTACTGATTCAGTACGTATGATCAATTCTCTCAATTTCAG
319 ...
320
321 The SeqID must match the transcript_id from an mRNA feature.
322
323
324 QUALITY SCORES FORMAT
325
326 Phrap/Consed quality scores can be supplied in .qvl files. These generate
327 Seq-graph data that will be attached to the nucleotide sequence from the
328 .fsa file. Programs such as Sequin can display these in a graphical view.
329
330 >chr1
331 51 63 70 82 82 82 90 90 90 90 86 86 86 86 90 90 90 90 90 86
332 86 86 86 86 86 86 86 90 90 90 90 90 90 86 86 78 78 90 90 86
333 ...
334
335 These values can be extracted from the output files of the Phrap and Consed
336 programs used to process raw data from automated sequencing machines.
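
A .qvl file laid out as in the example above can be read with a short Python routine. This is a sketch only: the function name read_quality_scores is hypothetical, and the scores are assumed to be whitespace-separated integers.

```python
def read_quality_scores(lines):
    """Return {seq_id: [scores]} from FASTA-like quality score text."""
    scores = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("definition line has no identifier")
            current = parts[0]
            scores[current] = []
        elif current is None:
            raise ValueError("scores found before a '>' definition line")
        else:
            scores[current].extend(int(token) for token in line.split())
    return scores
```

The argument can be any iterable of lines, such as an open file.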
337
338
339 SUBMISSION TEMPLATE FORMAT
340
341 The submission template is an ASN.1 Submit-block that can be generated by
342 Sequin. A simple example is shown below.
343
344 Submit-block ::= {
345 contact {
346 contact {
347 name
348 name {
349 last "Darwin" ,
350 first "Charles" ,
351 initials "C.R." ,
352 suffix "" } ,
353 affil
354 std {
355 affil "Oxbridge University" ,
356 div "Evolutionary Biology Department" ,
357 city "Camford" ,
358 country "United Kingdom" ,
359 street "1859 Tennis Court Lane" ,
360 email "darwin@beagle.edu.uk" ,
361 phone "01 44 171-007-1212" ,
362 postal-code "OX1 2BH" } } } ,
363 cit {
364 authors {
365 names
366 std {
367 {
368 name
369 name {
370 last "Darwin" ,
371 first "Charles" ,
372 initials "C.R." } } } ,
373 affil
374 std {
375 affil "Oxbridge University" ,
376 div "Evolutionary Biology Department" ,
377 city "Camford" ,
378 country "United Kingdom" ,
379 street "1859 Tennis Court Lane" ,
380 postal-code "OX1 2BH" } } ,
381 date
382 std {
383 year 2003 ,
384 month 2 ,
385 day 28 } } ,
386 subtype new }
387
388 This can be exported from the Desktop view of a template file in Sequin, or
389 from the initial submission dialogs. In addition, an unpublished reference or
390 comment can be generated in Sequin and saved from the Desktop. The
391 two files can be concatenated to make a .sbt template with the publication or
392 comment descriptor after the submit block.
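
Joining the saved pieces can be done in a text editor or with a few lines of Python. The sketch below assumes the Submit-block and descriptor text have already been read into strings; the function name build_template is hypothetical.

```python
def build_template(submit_block_text, descriptor_texts):
    """Join a Submit-block and descriptor objects into one template string."""
    block = submit_block_text.strip()
    if not block.startswith("Submit-block ::="):
        raise ValueError("template must start with a Submit-block object")
    # Descriptors follow the Submit-block, one object after another.
    parts = [block]
    parts += [text.strip() for text in descriptor_texts if text.strip()]
    return "\n".join(parts) + "\n"
```

The returned string would be written to a file with a .sbt suffix and named in the -t argument.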
393