NCBI C Toolkit Cross Reference

C/doc/sequin.htm


  1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  2     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  3 <html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  4 
  5 <head>
  6 <meta name="generator" content=
  7 "HTML Tidy for Mac OS X (vers 1 September 2005), see www.w3.org" />
  8 <title>Sequin Quick Guide</title>
  9 <meta http-equiv="Content-Type" content=
 10 "text/html; charset=us-ascii" />
 11 <!-- if you use the following meta tags, uncomment them.
 12  <meta name="author" content="sequindoc">
 13  <META NAME="keywords" CONTENT="national center for biotechnology information, ncbi, national library of medicine, nlm, national institutes of health, nih, database, archive, bookshelf, pubmed, pubmed central, bioinformatics, biomedicine, sequence submission, sequin, bankit, submitting sequences, quick guide, format">
 14  <META NAME="description" CONTENT="Sequin is a stand-alone software tool developed by the National Center for Biotechnology Information (NCBI) for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. ">
 15 -->
 16 <link rel="stylesheet" href="images/ncbi_sequin.css" type="text/css" />
 17 </head>
 18 
 19 <body>
 20 <!-- change the link and vlink colors from the original orange (link="#CC6600" vlink="#CC6600") -->
 21 <!--  the header   -->
 22 <div id="header"><a href="http://www.ncbi.nlm.nih.gov" title=
 23 "NCBI home page"><img src="images/logo.png" alt="NCBI logo"
 24 id="ncbilogo" name="ncbilogo" /></a>
 25 <h1 id="tophead">Sequin Quick Guide</h1>
 26 </div>
 27 <!--  the quicklinks bar   -->
 28 <ul id="topnav">
 29 <li><a href=
 30 "http://www.ncbi.nlm.nih.gov/Sequin/index.html">Sequin</a></li>
 31 <li><a href="http://www.ncbi.nlm.nih.gov/Entrez/">Entrez</a></li>
 32 <li><a href="http://www.ncbi.nlm.nih.gov/BLAST/">BLAST</a></li>
 33 <li><a href="http://www.ncbi.nlm.nih.gov/omim/">OMIM</a></li>
 34 <li><a href=
 35 "http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html">Taxonomy</a></li>
 36 <li><a href=
 37 "http://www.ncbi.nlm.nih.gov/Structure/">Structure</a></li>
 38 </ul>
 39 <!--  the contents   -->
 40 
 41 <h1>Sequin for Database Submissions and Updates:<br />
 42 A Quick Guide</h1>
 43 
 44 <hr />
 45 <!-- use img src="http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/image##.png" align=bottom-->
 46 
 47 <h2>Introduction</h2>
 48 
 49 <p>Sequin is a stand-alone software tool developed by the National
 50 Center for Biotechnology Information (NCBI) for submitting and
 51 updating sequences to the GenBank, EMBL, and DDBJ databases. Sequin
 52 has the capacity to handle long sequences and sets of sequences
 53 (segmented entries, as well as population, phylogenetic, and
 54 mutation studies). It also allows sequence editing and updating,
 55 and provides complex annotation capabilities. In addition, Sequin
 56 contains a number of built-in validation functions for enhanced
 57 quality assurance.</p>
 58 
 59 <p>This overview is intended to provide a quick guide to Sequin's
 60 capabilities, including automatic annotation of coding regions, the
 61 graphical viewer, quality control features, and editing features. We
 62 suggest that you read this entire document before beginning your Sequin
 63 submission. More detailed instructions on these and other functions can
 64 be found in Sequin's on-screen <b>Help</b> file, also
 65 available on the World-Wide Web from the Sequin homepage at:</p>
 66 
 67 <p><a href=
 68 "http://www.ncbi.nlm.nih.gov/Sequin/">
 69 <tt>http://www.ncbi.nlm.nih.gov/Sequin/</tt></a></p>
 70 
 71 <p>Email help is also available from <a href=
 72 "mailto:info@ncbi.nlm.nih.gov">
 73 <tt>info@ncbi.nlm.nih.gov</tt></a>.</p>
 74 
 75 
 76 <h2>Table of Contents</h2>
 77 
 78 <ul>
 79 <li><a href="#BeforeYouBegin">Before You Begin</a>
 80 <ul>
 81 <li><a href="#PrepareSequenceData">
 82 Preparing Nucleotide and Amino Acid Data</a></li>
 83 <li><a href="#DefinitionLine">Definition Lines</a></li>
 84 <li><a href="#FASTAformat">FASTA Format</a>
 85 <ul>
 86 <li><a href="#SingleSequence">Single Sequence</a></li>
 87 <li><a href="#SegmentedSequences">Segmented Nucleotide Sequences</a></li>
 88 <li><a href="#GappedSequences">Gapped Sequences</a></li>
 89 </ul>
 90 </li>
 91 <li><a href="#AlignmentFormats">Alignment Formats</a>
 92 <ul>
 93 <li><a href="#FASTAplusGAP">FASTA+GAP</a></li>
 94 <li><a href="#PHYLIPformat">PHYLIP</a></li>
 95 <li><a href="#NEXUSInterleaved">NEXUS Interleaved</a></li>
 96 <li><a href="#NEXUSContiguous">NEXUS Contiguous</a></li>
 97 <li><a href="#SetsOfSegmentedSequences">Sets of Segmented Sequences</a></li>
 98 </ul>
 99 </li>
100 </ul>
101 </li>
102 <li><a href="#CreatingASubmission">Creating a Submission</a>
103 <ul>
104 <li><a href="#BasicSequinOrganization">Basic Sequin Organization</a></li>
105 <li><a href="#WelcomeToSequinForm">Welcome to Sequin Form</a></li>
106 <li><a href="#SubmittingAuthorsForm">Submitting Authors Form</a>
107 <ul>
108 <li><a href="#SubmissionPage">Submission Page</a></li>
109 <li><a href="#ContactPage">Contact Page</a></li>
110 <li><a href="#AuthorsPage">Authors Page</a></li>
111 <li><a href="#AffiliationPage">Affiliation Page</a></li>
112 </ul>
113 </li>
114 <li><a href="#SequenceFormatForm">Sequence Format Form</a>
115 <ul>
116 <li><a href="#SubmissionType">Submission Type</a></li>
117 <li><a href="#SequenceDataFormat">Sequence Data Format</a></li>
118 <li><a href="#SubmissionCategory">Submission Category</a></li>
119 </ul>
120 </li>
121 <li><a href="#OrganismAndSequencesForm">
122 Organism and Sequences Form</a>
123 <ul>
124 <li><a href="#NucleotidePage">Nucleotide Page</a>
125 <ul>
126 <li><a href="#NucleotidePageSingleSequence">
127 Importing Nucleotide FASTA for a Single Sequence</a></li>
128 <li><a href="#NucleotidePageSequenceSet">
129 Importing Nucleotide FASTA for a Sequence Set</a></li>
130 <li><a href="#NucleotidePageAlignment">
131 Importing an Alignment</a></li>
132 <li><a href="#AfterImporting">After Importing Files</a></li>
133 </ul>
134 </li>
135 <li><a href="#OrganismPage">Organism Page</a></li>
136 <li><a href="#ProteinPage">Proteins Page</a></li>
137 <li><a href="#AnnotationPage">Annotation Page</a></li>
138 </ul>
139 </li>
140 </ul>
141 </li>
142 <li><a href="#viewing">Viewing Your Submission</a>
143 <ul>
144 <li><a href="#GenBankView">GenBank View</a></li>
145 <li><a href="#GraphicalView">Graphical View</a></li>
146 <li><a href="#SequenceView">Sequence View</a></li>
147 </ul>
148 </li>
149 <li><a href="#editing">Editing and Annotating Your Submission</a>
150 <ul>
151 <li><a href="#SequenceEditor">Sequence Editor</a></li>
152 <li><a href="#UpdatingTheSequence">Updating the Sequence</a></li>
153 <li><a href="#autodefline">Generating the Definition Line</a></li>
154 <li><a href="#Validation">Record Validation</a></li>
155 <li><a href="#SubmittingTheEntry">Submitting the Entry</a></li>
156 </ul>
157 </li>
158 <li><a href="#Advanced">Advanced Topics</a>
159 <ul>
160 <li><a href="#FeatureEditorDesign">Feature Editor Design</a>
161 <ul>
162 <li><a href="#CodingRegionPage">Coding Region Page</a></li>
163 <li><a href="#PropertiesPage">Properties Page</a></li>
164 <li><a href="#LocationPage">Location Page</a></li>
165 </ul>
166 </li>
167 <li><a href="#NCBIDesktop">NCBI Desktop</a></li>
168 <li><a href="#AdditionalInformation">Additional Information</a></li>
169 </ul>
170 </li>
171 <li><a href="#Reference">Reference</a>
172 <ul>
173 <li><a href="#NetworkConfiguration">Network Configuration</a></li>
174 <li><a href="#FeatureTableFormat">Feature Table Format</a></li>
175 </ul>
176 </li>
177 </ul>
178 
179 <a name="BeforeYouBegin" id="BeforeYouBegin"></a>
180 <h2>Before You Begin</h2>
181 
182 <a name="PrepareSequenceData" id="PrepareSequenceData"></a>
183 <h3>Preparing Nucleotide and Amino Acid Data</h3>
184 
185 <p>Sequin normally expects to read sequence files in FASTA format.
186 Note that most sequence analysis software packages include FASTA or
187 "raw" as one of the available output formats. Population studies,
188 phylogenetic studies, mutation studies, and environmental samples
189 may be entered in either FASTA format, or in PHYLIP, NEXUS, MACAW,
190 or FASTA+GAP formats if you are submitting an alignment.</p>
191 
192 <p>See <a href=
193 "http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#FASTAFormatforNucleotideSequences">
194 <tt>http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#FASTAFormatforNucleotideSequences</tt></a>
195 for detailed examples of each of the various input data
196 formats.</p>
197 
198 <p>Prepare your sequence data files using a text editor, and save
199 in ASCII text format (plain text). If your nucleotide sequence
200 encodes one or more protein products, Sequin expects two files, one
201 for the nucleotides and one for the proteins.</p>
202 
203 <a name="DefinitionLine" id="DefinitionLine"></a>
204 <h3>Definition Lines</h3>
205 
206 <p>FASTA format is simply the raw sequence preceded by a definition
207 line. The definition line begins with a &gt; sign and is followed
208 immediately by a name for the sequence (your own local
209 identification code, or sequence ID) and a title. During the
210 submission process, indexing staff at the database to which you are
211 submitting will change your sequence ID to an Accession number. You
212 can embed other important information in the title, and Sequin uses
213 this information to construct a record. Specifically, you can enter
214 organism and strain or clone information in the nucleotide
215 definition line and gene and protein information in the protein
216 definition line using name-value pairs surrounded by square
217 brackets. Example: [organism=Drosophila melanogaster]
218 [strain=Oregon R]</p>
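
<p>As an illustration of how the bracketed name-value pairs can be
separated from the rest of a definition line, here is a minimal Python
sketch. It is not part of Sequin and does not reflect Sequin's own
parser. The function name <tt>parse_defline</tt> is hypothetical, and
the sketch assumes that the sequence ID, when present, is separated
from the modifiers by whitespace.</p>

<pre>
```python
import re

# Hypothetical helper for illustration only; not Sequin's own parser.
# Matches one [name=value] pair; names and values may not contain brackets.
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(line):
    """Split a FASTA definition line into (sequence ID, modifiers, title)."""
    if not line.startswith(">"):
        raise ValueError("definition line must begin with '>'")
    body = line[1:].strip()

    # Assumption: the ID is the first whitespace-delimited token, unless the
    # line starts directly with a bracketed modifier (no ID given).
    seq_id = ""
    if body and not body.startswith("["):
        parts = body.split(None, 1)
        seq_id = parts[0]
        body = parts[1] if len(parts) == 2 else ""

    modifiers = {name.strip(): value.strip()
                 for name, value in MODIFIER.findall(body)}

    # Whatever is left after removing the bracketed pairs is the title.
    title = " ".join(MODIFIER.sub(" ", body).split())
    return seq_id, modifiers, title
```
</pre>

<p>Given a definition line, the sketch returns the ID, a dictionary of
modifiers, and the title text that remains once the bracketed pairs
have been removed.</p>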
219 
220 <p>Some modifier names have restricted values or formats.</p>
221 
222 <ul>
223 <li><b>organism</b> should use the unabbreviated scientific name.
224 Example: [organism=Drosophila melanogaster]</li>
225 <li><b>molecule</b> should use either "DNA" or "RNA". Example:
226 [molecule=DNA]</li>
227 
228 <li><b>moltype</b> should use one of the following values. Example:
229 [moltype=genomic]
230 <ul>
231 <li>genomic</li>
232 <li>precursor RNA</li>
233 <li>mRNA</li>
234 <li>rRNA</li>
235 <li>tRNA</li>
236 <li>snRNA</li>
237 <li>scRNA</li>
238 <li>other-genetic</li>
239 <li>cRNA</li>
240 <li>snoRNA</li>
241 <li>transcribed RNA</li>
242 </ul>
243 </li>
244 
245 <li><b>location</b> should use one of the following values.
246 Example: [location=mitochondrion]
247 <ul>
248 <li>genomic</li>
249 <li>chloroplast</li>
250 <li>kinetoplast</li>
251 <li>mitochondrion</li>
252 <li>plastid</li>
253 <li>macronuclear</li>
254 <li>extrachromosomal</li>
255 <li>plasmid</li>
256 <li>cyanelle</li>
257 <li>proviral</li>
258 <li>virion</li>
259 <li>nucleomorph</li>
260 <li>apicoplast</li>
261 <li>leucoplast</li>
262 <li>proplastid</li>
263 <li>endogenous-virus</li>
264 <li>hydrogenosome</li>
265 </ul>
266 </li>
267 
268 <li><b>collection-date</b> should be in the form YYYY or Mmm-YYYY
269 or DD-Mmm-YYYY. Example: [collection-date=2005] or
270 [collection-date=Oct-2005] or
271 [collection-date=25-Oct-2005]</li>
272 </ul>
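
<p>The three collection-date forms listed above can be checked before
a file is imported. The following Python sketch is one possible way to
do so. It is not Sequin's validator; the name
<tt>is_valid_collection_date</tt> is hypothetical, and the sketch
assumes a two-digit day and three-letter English month abbreviations
with an initial capital, as in the examples.</p>

<pre>
```python
import re
from datetime import date

# Assumption: months are the three-letter English abbreviations.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Optional DD-, optional Mmm-, then a four-digit year.
DATE = re.compile(r"(?:(\d{2})-)?(?:([A-Z][a-z]{2})-)?(\d{4})")


def is_valid_collection_date(text):
    """Return True for YYYY, Mmm-YYYY, or DD-Mmm-YYYY; False otherwise."""
    match = DATE.fullmatch(text)
    if match is None:
        return False
    day, month, year = match.groups()
    if month is None:
        # Year only; a day without a month is not one of the listed forms.
        return day is None
    if month not in MONTHS:
        return False
    if day is None:
        return True
    try:
        # Let the standard library reject impossible days such as 31-Feb.
        date(int(year), MONTHS.index(month) + 1, int(day))
    except ValueError:
        return False
    return True
```
</pre>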
273 
274 <p>The following modifiers should use only TRUE or FALSE. Example:
275 [transgenic=TRUE].</p>
276 
277 <ul>
278 <li><b>environmental-sample</b></li>
279 <li><b>germline</b></li>
280 <li><b>metagenomic</b></li>
281 <li><b>rearranged</b></li>
282 <li><b>transgenic</b></li>
283 </ul>
284 
285 <p>This is the list of the remaining modifier names that you can
286 include in your definition lines for nucleotide files:</p>
287 
288 <table id="sourcemods" summary="remaining modifiers">
289 <tr>
290 <td valign="top">
291 <ul>
292 <li>acronym</li>
293 <li>anamorph</li>
294 <li>authority</li>
295 <li>bio-material</li>
296 <li>biotype</li>
297 <li>biovar</li>
298 <li>breed</li>
299 <li>cell-line</li>
300 <li>cell-type</li>
301 <li>chemovar</li>
302 <li>chromosome</li>
303 <li>clone</li>
304 <li>clone-lib</li>
305 <li>collected-by</li>
306 <li>common</li>
307 <li>country</li>
308 <li>cultivar</li>
309 <li>culture-collection</li>
310 <li>dev-stage</li>
311 <li>ecotype</li>
312 <li>endogenous-virus-name</li>
313 </ul>
314 </td>
315 <td valign="top">
316 <ul>
317 <li>forma</li>
318 <li>forma-specialis</li>
319 <li>fwd-pcr-primer-name</li>
320 <li>fwd-pcr-primer-seq</li>
321 <li>genotype</li>
322 <li>group</li>
323 <li>haplotype</li>
324 <li>identified-by</li>
325 <li>isolate</li>
326 <li>isolation-source</li>
327 <li>lab-host</li>
328 <li>lat-lon</li>
329 <li>map</li>
330 <li>metagenome-source</li>
331 <li>metagenomic</li>
332 <li>note</li>
333 <li>pathovar</li>
334 <li>plasmid-name</li>
335 <li>plastid-name</li>
336 <li>pop-variant</li>
337 <li>rev-pcr-primer-name</li>
338 </ul>
339 </td>
340 <td valign="top">
341 <ul>
342 <li>rev-pcr-primer-seq</li>
343 <li>segment</li>
344 <li>serogroup</li>
345 <li>serotype</li>
346 <li>serovar</li>
347 <li>sex</li>
348 <li>specific-host</li>
349 <li>specimen-voucher</li>
350 <li>strain</li>
351 <li>sub-species</li>
352 <li>subclone</li>
353 <li>subgroup</li>
354 <li>substrain</li>
355 <li>subtype</li>
356 <li>synonym</li>
357 <li>teleomorph</li>
358 <li>tissue-lib</li>
359 <li>tissue-type</li>
360 <li>type</li>
361 <li>variety</li>
362 </ul>
363 </td>
364 </tr>
365 </table>
366 
367 <p>Example: [strain=BALB/c]</p>
368 
369 <p>Some population studies are a mixture of integrated provirus and
370 excised virion. These can be indicated by molecule and location
371 qualifiers, e.g., [molecule=DNA] [location=proviral] or
372 [molecule=RNA] [location=virion]. You can also embed
373 [moltype=genomic] or [moltype=mRNA] to indicate from what source
374 the molecule was isolated. If you're unsure of which modifier to
375 use, use [note=...], and database staff will determine the
376 appropriate modifier to use.</p>
377 
378 <p>This is the list of modifier names that you can include in your
379 definition lines for protein files:</p>
380 
381 <ul>
382 <li><b>gene</b></li>
383 <li><b>protein</b></li>
384 <li><b>prot_desc</b></li>
385 </ul>
386 
387 <p>A coding region feature will be created on the nucleotide
388 sequence indicating where the protein sequence is encoded. If you
389 specify "gene" in the protein sequence definition line, a gene that
390 covers the coding region will be created with a locus specified by
391 the value of "gene".</p>
392 
393 <p>The product name for the coding region will be the "protein" value
394 specified in the protein sequence definition line, if supplied. The
395 product description for the coding region will be the "prot_desc"
396 value specified in the protein sequence definition line, if
397 supplied.</p>
398 
399 <p>Note that the [ and ] brackets actually appear in the text.
400 (Brackets are sometimes used in computer documentation to denote
401 optional text. This convention is not followed here.) The bracketed
402 information will be removed from the definition line for each
403 sequence. Sequin can also calculate a new definition line by
404 computing on features in the annotated record (see "<a href=
405 "#autodefline">Generating the Definition Line</a>").</p>
406 
407 <p>The ability to embed this information in the definition line is
408 provided as a convenience to the submitter. If these annotations
409 are not present, they can be entered in subsequent forms. Sequin is
410 designed to use this information, and that provided in the initial
411 forms, to build a properly structured record. <b>In many cases,
412 the final submission can be completely prepared from these data, so
413 that no additional manual annotation is necessary once the record
414 is displayed.</b></p>
415 
416 <p><b>It is much easier to produce the final submission if you
417 let Sequin work for you in this manner.</b></p>
418 
419 <p>In this example we show alternative splicing, where a single
420 gene produces multiple messenger RNAs that encode two similar but
421 distinct protein products. Examples for the definition lines for
422 the nucleotide and protein files are shown here:</p>
423 
424 <pre>
425 Nucleotide Sequence:
426 
427 &gt;eIF4E [organism=Drosophila melanogaster] [strain=Oregon R] Drosophila ...
428 CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCA ...
429 
430 Protein Sequences:
431 
432 &gt;4E-I [gene=eIF4E] [protein=eukaryotic initiation factor 4E-I]
433 MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGN ...
434 &gt;4E-II [gene=eIF4E] [protein=eukaryotic initiation factor 4E-II]
435 MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGNTATTTAPAGDD ...
436 </pre>
437 
438 <p>Also, please note that there must be a line break (carriage
439 return) between the definition line and the first line of sequence.
440 Some word processors will break a single line onto two lines
441 without actually adding a carriage return. (This feature is known
442 as "word wrapping".) If you are unsure whether there is a carriage
443 return, you can either set up your word processor so it shows
444 invisible characters like carriage returns, or view the file in a
445 text editor that does not create artificial line breaks. <b>The
446 definition line itself must not have a line break within it,
447 because the second line would then be misinterpreted as the
448 beginning of the sequence data.</b> The actual sequence is usually
449 broken every 50 to 80 characters, but this is not necessary for
450 Sequin to be able to read it.</p>
451 
452 <a name="FASTAformat" id="FASTAformat"></a>
453 <h3>FASTA Format</h3>
454 
455 <p>There are three types of sequences that may be represented using
456 the FASTA format: single contiguous sequences, segmented sequences,
457 and gapped sequences.</p>
458 
459 <a name="SingleSequence" id="SingleSequence"></a>
460 <h4>Single Sequence</h4>
461 
462 <p>This is the definition line followed by the sequence data. A
463 sample single sequence file is shown here:</p>
464 
465 <pre>
466 &gt;ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
467 ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCATTGA
468 TGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT
469 </pre>
470 
471 <a name="SegmentedSequences" id="SegmentedSequences"></a>
472 <h4>Segmented Nucleotide Sequences</h4>
473 
474 <p>A segmented nucleotide entry is an earlier method for capturing
475 a set of non-contiguous sequences that has a defined order and
476 orientation. For example, a genomic DNA segmented set could include
477 encoding exons along with fragments of their flanking introns. An
478 example of an mRNA segmented pair of records would be the 5' and 3'
479 ends of an mRNA, where the middle region has not been sequenced. To
480 import nucleotides in a segmented set, each individual sequence
481 must be in FASTA format with an appropriate definition line, and
482 all sequences should be in the same file. Organism information
483 should only be included in the definition line for the first
484 segment. Notice that there is a square open bracket on a line by
485 itself before the first segment and a square close bracket on a
486 line by itself after the last segment. These square brackets are
487 required if you are importing multiple segmented sequences, but may
488 be omitted if you are importing a file that contains all of the
489 segments and using the "segmented sequence" format. Sequin will
490 also generate an additional sequence to represent the combination
491 of the segments, and that sequence will have a distinct sequence
492 ID. A sample segmented sequence file is shown here:</p>
493 
494 <pre>
495 [
496 &gt;m_gagei_seg1 [organism=Mansonia gagei] Mansonia gagei NADH dehydrogenase ...
497 ATGGAGCATACATATCAATATTCATGGATCATACCGTTTGTGCCACTTCCAATTCCTATTTTAATAGGAA
498 TTGGACTCCTACTTTTTCCGACGGCAACAAAAAATCTTCGTCGTATGTGGGCTCTTCCCAATATTTTATT
499 GTTAAGTATAGTTATGATTTTTTCGGTCGATCTGTCCATTCAGCAAATAAATAAAAGTTCTATCTATCAA
500 TATGTATGGTCTTGGACCATCAATAATGATTTTTCTTTCGAGTTTGGCTACTTTATTGATTCGCTTACCT
501 AGTTCGAATTTGATACAAATTTATATTTTTTGGGAATTAGTTGGAATGTGTTCTTATCTATTAATAGGGT
502 TTTGGTTCACACGACCCGCTGCGGCAAACGCCTGTCAAAAAGCATTTGTAACTAATCGGATAGGCGATTT
503 TGGTTTATTATTAGGAATCTTAGGTTTTTATTGGATAACGGGAAGTTTCGAATTTCAAGATTTGTTCGAA
504 ATATTTAATAACTTGATTTATAATAATGAGGTTCAGTTTTTATTTGTTACTTTATGTGCCTCTTTATTA
505 &gt;m_gagei_seg2
506 GGTATAATAACAGTATTATTAGGGGCTACTTTAGCTCTTGC
507 TCAAAAAGATATTAAGAGGGGTTTAGCCTATTCTACAATGTCCCAACTGGGTTATATGATGTTAGCTCTA
508 GGTATGGGGTCTTATCGAGCCGCTTTATTTCATTTGATTACTCATGCTTATTCGAAGGCATTGTTGTTTT
509 TAGGATCCGGATCCGTTATTCATTCCATGGAAGCTATTGTTGGATATTCTCCAGATAAAAGCCAGAATAT
510 GGTTTTTATGGGCGGTTTAAGAAAGCATGTGCCAATTACACAAATTGCTTTTTTAGTGGGTACACTTTCT
511 CTTTGTGGTATTCCACCCCTTGCTTGTTTTTGGTCCAAAGATGAAATTCTTAGTGACAGCTGGTTGT
512 &gt;m_gagei_seg3
513 TCAATAAAACTATGGGGTAAAGAAGAACAAAAAATAATTAACAGAAATTTTCGTTTATCTCCTTTATTAA
514 TATTAACGATGAATAATAATGAGAAGCCATATAGAATTGGTGATAATGTAAAAAAAGGGGCTCTTATTAC
515 TATTACGAGTTTTGGCTACAAGAAGGCTTTTTCTTATCCTCATGAATCGGATAATACTATGCTATTTCCT
516 ATGCTTATATTGGCTCTATTTACTTTTTTTGTTGGAGCCATAGCAATTCCTTTTAATCAAGAAGGACTAC
517 ATTTGGATATATTATCCAAATTATTAACTCCATCTATAAATCTTTTACATCAAAATTCAAATGATTTTGA
518 GGATTGGTATCAATTTTTAACAAATGCAACTCTTTCAGTGAGTATAGCCTGTTTCGGAATATTTACAGCA
519 TTCCTTTTATATAAGCCTTTTTATTCATCTTTACAAAATTTGAACTTACTAAATTTATTTTCGAAAGGGG
520 GTCCTAAAAGAATTTTTTTGGATAAAATAATATACTTGATATACGATTGGTCATATAATCGTGGTTACAT
521 AGATACGTTTTATTCAGTATCCTTAACAAAAGGTATAAGAGGATTGGCCGAACTAACTCATTTTTTTGAT
522 AGGCGAGTAATCGATGGAATTACAAATGGAGTACGCATCACAAGTTTTTTTATAGGCGAAGGTATCAAAT
523 ATT
524 ]
525 </pre>
526 
527 <a name="GappedSequences" id="GappedSequences"></a>
528 <h4>Gapped Sequences</h4>
529 
530 <p>A gapped sequence represents a newer method for describing
531 non-contiguous sequences, but only requires a single sequence
532 identifier. A gap is represented by a line that starts with &gt;?
533 and is immediately followed by either a length (for gaps of known
534 length) or "unk100" for gaps of unknown length. For example,
535 "&gt;?200". The next sequence segment continues on the next line,
536 with no separate definition line or identifier. The difference
537 between a gapped sequence and a segmented sequence is that the
538 gapped sequence uses a single identifier and can specify gaps of
539 known length. Gapped sequences are preferred over segmented
540 sequences. A sample gapped sequence file is shown here:</p>
541 
542 <pre>
543 &gt;m_gagei [organism=Mansonia gagei] Mansonia gagei NADH dehydrogenase ...
544 ATGGAGCATACATATCAATATTCATGGATCATACCGTTTGTGCCACTTCCAATTCCTATTTTAATAGGAA
545 TTGGACTCCTACTTTTTCCGACGGCAACAAAAAATCTTCGTCGTATGTGGGCTCTTCCCAATATTTTATT
546 GTTAAGTATAGTTATGATTTTTTCGGTCGATCTGTCCATTCAGCAAATAAATAAAAGTTCTATCTATCAA
547 TATGTATGGTCTTGGACCATCAATAATGATTTTTCTTTCGAGTTTGGCTACTTTATTGATTCGCTTACCT
548 AGTTCGAATTTGATACAAATTTATATTTTTTGGGAATTAGTTGGAATGTGTTCTTATCTATTAATAGGGT
549 TTTGGTTCACACGACCCGCTGCGGCAAACGCCTGTCAAAAAGCATTTGTAACTAATCGGATAGGCGATTT
550 TGGTTTATTATTAGGAATCTTAGGTTTTTATTGGATAACGGGAAGTTTCGAATTTCAAGATTTGTTCGAA
551 ATATTTAATAACTTGATTTATAATAATGAGGTTCAGTTTTTATTTGTTACTTTATGTGCCTCTTTATTA
552 &gt;?200
553 GGTATAATAACAGTATTATTAGGGGCTACTTTAGCTCTTGC
554 TCAAAAAGATATTAAGAGGGGTTTAGCCTATTCTACAATGTCCCAACTGGGTTATATGATGTTAGCTCTA
555 GGTATGGGGTCTTATCGAGCCGCTTTATTTCATTTGATTACTCATGCTTATTCGAAGGCATTGTTGTTTT
556 TAGGATCCGGATCCGTTATTCATTCCATGGAAGCTATTGTTGGATATTCTCCAGATAAAAGCCAGAATAT
557 GGTTTTTATGGGCGGTTTAAGAAAGCATGTGCCAATTACACAAATTGCTTTTTTAGTGGGTACACTTTCT
558 CTTTGTGGTATTCCACCCCTTGCTTGTTTTTGGTCCAAAGATGAAATTCTTAGTGACAGCTGGTTGT
559 &gt;?unk100
560 TCAATAAAACTATGGGGTAAAGAAGAACAAAAAATAATTAACAGAAATTTTCGTTTATCTCCTTTATTAA
561 TATTAACGATGAATAATAATGAGAAGCCATATAGAATTGGTGATAATGTAAAAAAAGGGGCTCTTATTAC
562 TATTACGAGTTTTGGCTACAAGAAGGCTTTTTCTTATCCTCATGAATCGGATAATACTATGCTATTTCCT
563 ATGCTTATATTGGCTCTATTTACTTTTTTTGTTGGAGCCATAGCAATTCCTTTTAATCAAGAAGGACTAC
564 ATTTGGATATATTATCCAAATTATTAACTCCATCTATAAATCTTTTACATCAAAATTCAAATGATTTTGA
565 GGATTGGTATCAATTTTTAACAAATGCAACTCTTTCAGTGAGTATAGCCTGTTTCGGAATATTTACAGCA
566 TTCCTTTTATATAAGCCTTTTTATTCATCTTTACAAAATTTGAACTTACTAAATTTATTTTCGAAAGGGG
567 GTCCTAAAAGAATTTTTTTGGATAAAATAATATACTTGATATACGATTGGTCATATAATCGTGGTTACAT
568 AGATACGTTTTATTCAGTATCCTTAACAAAAGGTATAAGAGGATTGGCCGAACTAACTCATTTTTTTGAT
569 AGGCGAGTAATCGATGGAATTACAAATGGAGTACGCATCACAAGTTTTTTTATAGGCGAAGGTATCAAAT
570 ATT
571 </pre>
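
<p>To make the gapped layout concrete, the following Python sketch
reads a file of this shape into one definition line plus an ordered
list of sequence pieces and gaps. It is an illustration under stated
assumptions, not Sequin's reader: the name
<tt>parse_gapped_fasta</tt> is hypothetical, and any gap line that
begins with &gt;?unk is treated simply as a gap of unknown
length.</p>

<pre>
```python
def parse_gapped_fasta(text):
    """Return (definition line, parts) for a gapped FASTA record.

    parts is an ordered list of ("seq", bases) and ("gap", length) tuples;
    length is None when the gap length is unknown.
    """
    defline = None
    parts = []
    current = []  # sequence lines collected since the last gap

    def flush():
        # Join the collected sequence lines into one piece.
        if current:
            parts.append(("seq", "".join(current)))
            current.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">?"):
            flush()
            spec = line[2:]
            # Assumption: ">?unk..." marks an unknown-length gap; otherwise
            # the text after ">?" is the gap length as an integer.
            if spec.startswith("unk"):
                parts.append(("gap", None))
            else:
                parts.append(("gap", int(spec)))
        elif line.startswith(">"):
            if defline is not None:
                raise ValueError("expected a single definition line")
            defline = line
        else:
            current.append(line)
    flush()
    return defline, parts
```
</pre>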
572 
573 <a name="AlignmentFormats" id="AlignmentFormats"></a>
574 <h3>Alignment Formats</h3>
575 
576 <p>Once you have created your alignment file, be sure to note the
577 characters used to indicate ambiguous bases, bases that match the
578 master sequence, and gaps in the alignment. Be aware that some
579 alignment formats use different characters to indicate gaps used to
580 pad sequences at the beginning, middle, and end of the alignment.
581 You will be able to specify these characters separately before
582 importing the alignment file.</p>
583 
584 <a name="FASTAplusGAP" id="FASTAplusGAP"></a>
585 <h4>FASTA+GAP</h4>
586 
587 <pre>
588 &gt;ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
589 ---ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCAT
590 TGATGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT
591 &gt;ABC-2 [organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
592 GATATTGCTTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCAT
593 TGATGCACCTGGACACAGAAATTTCATCAAGAACATGATCACTGGTACTT
594 &gt;ABC-3 [organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
595 ---ATTGCTTTATGGAAATTCGAAACTGCCAAATACTATGTTA-------
596 TGATGCACCTGGACACAGAGATTTCATCAAAAACATGATCACTGGTACTT
597 </pre>
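
<p>In an alignment, every row is expected to have the same length once
gap characters are counted. The Python sketch below shows one way to
confirm this for a FASTA+GAP file before importing it. The function
names are hypothetical and the check is only an illustration; it says
nothing about how Sequin itself validates alignments. It assumes the
sequence ID is the first whitespace-delimited token after the &gt;
sign.</p>

<pre>
```python
def read_aligned_fasta(text):
    """Collect {sequence ID: aligned sequence} from FASTA+GAP text."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token on the definition line.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            if name in records:
                raise ValueError("duplicate sequence ID: " + repr(name))
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first definition line")
        else:
            records[name] += line
    return records


def check_alignment(text):
    """Return the shared aligned length, or raise if the rows differ."""
    records = read_aligned_fasta(text)
    if not records:
        raise ValueError("no sequences found")
    lengths = {name: len(seq) for name, seq in records.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError("aligned sequences differ in length: " + repr(lengths))
    return next(iter(lengths.values()))
```
</pre>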
598 
599 <a name="PHYLIPformat" id="PHYLIPformat"></a>
600 <h4>PHYLIP</h4>
601 
602 <pre>
603       3  100
604 ABC-1      ---ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
605 ABC-2      GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
606 ABC-3      ---ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------
607 
608            TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
609            TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
610            TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
611 
612 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
613 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
614 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
615 </pre>
616 
617 <a name="NEXUSInterleaved" id="NEXUSInterleaved"></a>
618 <h4>NEXUS Interleaved</h4>
619 
620 <pre>
621 #NEXUS
622 
623 begin data;
624         dimensions  ntax=3 nchar=100;
625         format datatype=dna  missing=? gap=-  interleave ;
626         matrix
627 
628 [     1                                                   50]
629 ABC_1 ???ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
630 ABC_2 GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
631 ABC_3 ???ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------
632 
633 [     51                                                 100]
634 ABC_1 TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
635 ABC_2 TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
636 ABC_3 TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
637 ;
638 END;
639 
640 begin ncbi;
641 sequin
642 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
643 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
644 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
645 ;
646 end;
647 </pre>
648 
649 <a name="NEXUSContiguous" id="NEXUSContiguous"></a>
650 <h4>NEXUS Contiguous</h4>
651 
652 <pre>
653 #NEXUS
654 
655 begin data;
656         dimensions  ntax=3 nchar=100;
657         format datatype=dna  missing=? gap=-  ;
658         matrix
659 
660 ABC_1   
661 ???ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
662 TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
663 ABC_2  
664 GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
665 TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
666 ABC_3  
667 ???ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------
668 TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
669 ;
670 END;
671 
672 begin ncbi;
673 sequin
674 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
675 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
676 &gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
677 ;
678 end;
679 </pre>
680 
681 <a name="SetsOfSegmentedSequences" id="SetsOfSegmentedSequences"></a>
682 <h4>Sets of Segmented Sequences</h4>
683 
684 <p>If the sequences in a phylogenetic study are really segmented
685 (e.g., exons 2 and 3 of a gene without intron 2), the individual
686 segments from a single organism can be grouped within square
687 brackets. Subsequent segments are detected by the presence of a
688 FASTA definition line. For example:</p>
689 
690 <pre>
691 [
692 &gt;Qruex2 [organism=Quercus rubra]
693 CGAAAACCTGCACAGCAGAAACGACTCGCAAACTAGTAATAACTGACGGAGGACGGAGGG ...
694 &gt;Qruex3
695 CATCATTGCCCCCCATCCTTTGGTTTGGTTGGGTTGGAAGTTCACCTCCCATATGTGCCC ...
696 ]
697 [
698 &gt;Qsuex2 [organism=Quercus suber]
699 CAAACCTACACAGCAGAACGACTCGAGAACTGGTGACAGTTGAGGAGGGCAAGCACCTTG ...
700 &gt;Qsuex3
701 CATCGTTGCCCCCCTTCTTTGGTTTGGTTGGGTTGGAAGTTGGCCTTCCATATGTGCCCT ...
702 ]
703 ...
704 </pre>
705 
706 <p>FASTA+GAP format can also use this convention for encoding sets
707 of aligned segmented sequences.</p>
708 
709 <a name="CreatingASubmission" id="CreatingASubmission"></a>
710 <h2>Creating a Submission</h2>
711 
712 <p>The sequence data we will use for this example is the genomic
713 sequence of the <span class="taxonomy">Drosophila melanogaster</span>
714 eukaryotic initiation factors 4E-I and 4E-II (GenBank Accession number
715 U54469).</p>
716 
717 <a name="BasicSequinOrganization" id="BasicSequinOrganization"></a>
718 <h3>Basic Sequin Organization</h3>
719 
720 <p>Sequin is organized into a series of forms for entering
721 submitting authors, entering organism and sequences, entering
722 information such as strain, gene, and protein names, viewing the
723 complete submission, and editing and annotating the submission. The
724 goal is to go quickly from raw sequence data to an assembled record
725 that can be viewed, edited, and submitted to your database of
726 choice.</p>
727 
728 <p>Advance through the pages that make up each form by clicking on
729 labeled folder tabs or the <span class="buttonlabel">Next Page</span>
730 button. After the basic information forms have been completed and the
731 sequence data imported, Sequin provides a complete view of your
732 submission, in your choice of text or graphic format. At this point,
733 any of the information fields can be easily modified by double-clicking
734 on any area of the record, and additional biological annotations can be
735 entered by selecting from a menu.</p>
736 
737 <p>Sequin has an on-screen <span class="buttonlabel">Help</span> file
738 that is opened automatically when you start the program. Because it is
739 context sensitive, the <span class="buttonlabel">Help</span> text will
740 change and follow your steps as you progress through the program. A
741 "Find" function is also provided.</p>
742 
743 <a name="WelcomeToSequinForm" id="WelcomeToSequinForm"></a>
744 <h3>Welcome to Sequin Form</h3>
745 
746 <p><img class="figure" src="images/welcome.png" alt=
747 "Welcome to Sequin Form" /></p>
748 
749 <p>Once you have finished preparing the sequence files, you are
750 ready to start the Sequin program. Sequin's first window asks you
751 to indicate the database to which the sequence will be submitted
752 and prompts you to start a new project or continue with an existing
753 one. Once you choose a database, Sequin will remember it in
754 subsequent sessions. In general, each sequence submission should be
755 entered as a separate project. However, segmented DNA sequences,
756 gapped sequences, population studies, phylogenetic studies, and
757 mutation studies should be submitted together as one project. This
758 feature also eliminates the need to save Sequin information
759 templates for each sequence.</p>
760 
761 <p>To begin creating your submission, click the <span
762 class="buttonlabel">Start New Submission</span> button.</p>
763 
764 <a name="SubmittingAuthorsForm" id="SubmittingAuthorsForm"></a>
765 <h3>Submitting Authors Form</h3>
766 
767 <p>The pages in the <span class="dialoglabel">Submitting Authors</span>
768 form ask you to provide the release date, a working title, names and
769 contact information of submitting authors, and affiliation information.
770 To create a personal template for use in future submissions, use the
771 <span class="menulabel">File-&gt;Export</span> menu item after
772 completing each page of this form.</p>
773 
774 <a name="SubmissionPage" id="SubmissionPage"></a>
775 <h4>Submission Page</h4>
776 
777 <p><img class="figure" src="images/submit.png" alt=
778 "Submission Page" /></p>
779 
780 <p>The <span class="folderlabel">Submission</span> page asks for a
781 tentative title for a manuscript describing the sequence and will
782 initially mark the manuscript as being unpublished. When the article is
783 published, the database staff will update the sequence record with the
784 new citation. This page also lets you indicate that a record should be
785 held confidential by the database until a specified date, although the
786 preferred policy is to release the record immediately into the public
787 databases.</p>
788 
789 <a name="ContactPage" id="ContactPage"></a>
790 <h4>Contact Page</h4>
791 
792 <p><img class="figure" src="images/contact.png" alt=
793 "Contact Page" /></p>
794 
795 <p>The <span class="folderlabel">Contact</span> page asks for the name,
796 phone number, and email address of the person responsible for making
797 the submission. Database staff members will contact this person if
798 there are any questions about the record.</p>
799 
800 <p>The Sfx (suffix) popup is used to enter personal name suffixes
801 (e.g., Jr., Sr., or III), not a person's academic degrees (e.g., MD
802 or PhD). Also, it is not necessary to type periods after
803 initials.</p>
804 
805 <a name="AuthorsPage" id="AuthorsPage"></a>
806 <h4>Authors Page</h4>
807 
808 <p><img class="figure" src="images/authors.png" alt=
809 "Authors Page" /></p>
810 
811 <p>In the <span class="folderlabel">Authors</span> page, enter the
812 names of the people who should get scientific credit for the sequence
813 presented in this record. These will become the authors for the initial
814 (unpublished) manuscript.</p>
815 
816 <p>Authors are entered in a spreadsheet. As soon as anything is
817 typed in the last row, a new (blank) row is added below it. Use the
818 tab key to move between fields. Tabbing from the last column
819 automatically moves to the First Name column in the next row.</p>
820 
821 <a name="AffiliationPage" id="AffiliationPage"></a>
822 <h4>Affiliation Page</h4>
823 
824 <p><img class="figure" src="images/affil.png" alt=
825 "Affiliation Page" /></p>
826 
827 <p>The <span class="folderlabel">Affiliation</span> page asks for the
828 institutional affiliation of the primary author.</p>
829 
830 <a name="SequenceFormatForm" id="SequenceFormatForm"></a>
831 <h3>Sequence Format Form</h3>
832 
833 <p><img class="figure" src="images/format.png" alt=
834 "Format Form" /></p>
835 
836 <p>With Sequin, the actual sequence data are imported from an
837 outside data file. So before you begin, prepare your sequence data
838 files using a text editor, perhaps one associated with your
839 laboratory sequence analysis software (see "<a href=
840 "#BeforeYouBegin">Before you Begin</a>").</p>
841 
842 <a name="SubmissionType" id="SubmissionType"></a>
843 <h4>Submission Type</h4>
844 
845 <p>If you have sequence data from a single source, choose from one of
846 the following submission types:</p>
847 
848 <ul>
849 <li><span class="buttonlabel">Single Sequence</span> if you have a
850 single contiguous mRNA or genomic DNA sequence.</li>
851 <li><span class="buttonlabel">Segmented Sequence</span> if you have a
852 single collection of non-overlapping, non-contiguous sequences that
853 cover a specified genetic region from a single source. A standard
854 example is a set of genomic DNA sequences that encode exons from a gene
855 along with fragments of their flanking introns.</li>
856 <li><span class="buttonlabel">Gapped Sequence</span> if you have a
857 single non-contiguous mRNA or genomic DNA sequence. A gapped sequence
858 contains specified gaps of known or unknown length where the exact
859 nucleotide sequence has not been determined.</li>
860 </ul>
861 
862 <p>See <a href="#BeforeYouBegin">Before You Begin</a> if you have
863 questions about how to format your files or about the differences
864 between these formats.</p>
865 
866 <p>If you have a set of single sequences, segmented sequences, or
867 gapped sequences or a mixture of these types of sequences, you will
868 need to choose one of the following submission types:</p>
869 
870 <ul>
871 <li><span class="buttonlabel">Population Study</span> for a set derived
872 by sequencing the same gene from different isolates of the same
873 organism.</li>
874 <li><span class="buttonlabel">Phylogenetic Study</span> for a set
875 derived by sequencing the same gene from different organisms.</li>
876 <li><span class="buttonlabel">Mutation Study</span> for a set derived
877 by sequencing multiple mutations of a single gene.</li>
878 <li><span class="buttonlabel">Environmental Samples</span> for a set
879 derived by sequencing the same gene from a population of unclassified
880 or unknown organisms.</li>
881 <li><span class="buttonlabel">Batch Submission</span> for a set that is
882 not a population study, mutation study, phylogenetic study, or
883 environmental samples. The sequences should be related in some way,
884 such as coming from the same publication or organism. You should plan
885 that all sequences will be released to the public on the same date.</li>
886 </ul>
887 
888 <a name="SequenceDataFormat" id="SequenceDataFormat"></a>
889 <h4>Sequence Data Format</h4>
890 
891 <p>If you have chosen <span class="buttonlabel">Single Sequence</span>,
892 <span class="buttonlabel">Segmented Sequence</span>, <span
893 class="buttonlabel">Gapped Sequence</span>, or <span
894 class="buttonlabel">Batch Submission</span> for the submission type,
895 you will only be able to select <span class="buttonlabel">FASTA (no
896 alignment).</span></p>
897 
898 <p>If you have chosen one of the other submission types, you may import
899 the sequences in FASTA format, or you may choose to import the
900 sequences using an alignment file by selecting
901 <span class="buttonlabel">Alignment (FASTA+GAP, NEXUS, PHYLIP, etc.)</span>.
902 See <a href= "#AlignmentFormats">Alignment Formats</a> for an
903 explanation of the available formats for alignment files.</p>
904 
905 <a name="SubmissionCategory" id="SubmissionCategory"></a>
906 <h4>Submission Category</h4>
907 
908 <p>Choose <span class="buttonlabel">Original Submission</span> if you
909 have directly sequenced the nucleotide sequence in your laboratory.</p>
910 
911 <p>Choose <span class="buttonlabel">Third Party Annotation</span> if
912 you have downloaded or assembled sequence from GenBank and modified it
913 with your own annotations. See <a href=
914 "http://www.ncbi.nih.gov/Genbank/TPA.html">
915 <tt>http://www.ncbi.nih.gov/Genbank/TPA.html</tt></a> for more information
916 about Third Party Annotation rules.</p>
917 
918 <a name="OrganismAndSequencesForm" id="OrganismAndSequencesForm"></a>
919 <h3>Organism and Sequences Form</h3>
920 
921 <p>The <span class="dialoglabel">Organism and Sequences</span> form has
922 been enhanced with a number of Assistants that allow entry or editing
923 of sequence and source information.</p>
924 
925 <a name="NucleotidePage" id="NucleotidePage"></a>
926 <h4>Nucleotide Page</h4>
927 
928 <p>The <span class="folderlabel">Nucleotide</span> page will have one of
929 three appearances, based on whether you have chosen to import a single
930 sequence, a set of sequences, or an alignment.</p>
931 
932 <a name="NucleotidePageSingleSequence" id="NucleotidePageSingleSequence"></a>
933 <h5>Importing Nucleotide FASTA for a Single Sequence</h5>
934 
935 <p><img class="figure" src="images/nucsing1.png" alt=
936 "Single Sequence Page" /></p>
937 
938 <p>To import a single sequence, click on <span
939 class="buttonlabel">Import Nucleotide FASTA</span> and enter the name
940 of the file that contains your FASTA sequence. See
941 <a href="http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#BeforeYouBegin">
942 Before You Begin</a> for information on how to format your FASTA file.
943 In addition to importing from a file, sequences can also be read by
944 pasting from the computer's "clipboard" using the <span
945 class="menulabel">Edit-&gt;Paste</span> menu item or by using the <span
946 class="buttonlabel">Add/Modify Sequences</span> button.</p>
947 
948 <a name="NucleotidePageSequenceSet" id="NucleotidePageSequenceSet"></a>
949 <h5>Importing Nucleotide FASTA for a Sequence Set</h5>
950 
951 <p><img class="figure" src="images/nucset.png" alt=
952 "Sequence Set Page" /></p>
953 
954 <p>To import a set of sequences, click on <span
955 class="buttonlabel">Import Nucleotide FASTA</span> and enter the name
956 of the file that contains some or all of your FASTA sequences. See
957 <a href="http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#BeforeYouBegin">
958 Before You Begin</a> for information on how to format your FASTA file.
959 You may click on <span class="buttonlabel">Import Additional Nucleotide
960 FASTA</span> to import additional files if your sequences are in more
961 than one file. In addition to importing from a file, sequences can also
962 be read by pasting from the computer's "clipboard" using the <span
963 class="menulabel">Edit-&gt;Paste</span> menu item or by using the <span
964 class="buttonlabel">Add/Modify Sequences</span> button.</p>
965 
966 <p>If you would like to create an alignment for your set of sequences,
967 check <span class="buttonlabel">Create Alignment</span> on this page.</p>
968 
969 <a name="NucleotidePageAlignment" id="NucleotidePageAlignment"></a>
970 <h5>Importing an Alignment</h5>
971 
972 <p><img class="figure" src="images/nucaln.png" alt=
973 "Importing an Alignment" /></p>
974 
975 <p>See <a href=
976 "http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#BeforeYouBegin">
977 Before You Begin</a> for information on how to format your
978 alignment file. Before importing your alignment, choose which
979 characters in the alignment file represent gaps, ambiguous or
980 unknown nucleotides, and "matches".</p>
981 
982 <p>Some data files distinguish between gaps at the beginning, in the
983 middle, and at the end of a sequence. These characters can be
984 entered separately if needed, or you may specify the same character
985 for all three kinds of gaps if appropriate.</p>
986 
987 <p><span class="textlabel">Ambiguous/Unknown</span> characters
988 represent nucleotides that are present in the sequence but were not
989 sequenced. Usually this is "N". <span class="textlabel">Match</span>
990 characters are characters in a sequence other than the first that match
991 the character at that alignment position in the first sequence. When
992 match characters are used, usually they are specified as ".", but when
993 match characters are not used, "." is frequently used as a gap
994 character, so the ":" is supplied instead as a default.</p>
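<p>The match-character convention described above can be illustrated
with a short Python sketch. This is not Sequin's own code: the function
name <tt>expand_match_chars</tt> and the sample rows are hypothetical,
and the sketch assumes that every row of the alignment has the same
length.</p>

```python
def expand_match_chars(rows, match_chars="."):
    """Replace match characters in every row after the first with the
    character found at the same column of the first row.

    Hypothetical helper for illustration only; Sequin performs this
    step internally when it imports an alignment.
    """
    if not rows:
        return []
    reference = rows[0]
    expanded = [reference]
    for row in rows[1:]:
        if len(row) != len(reference):
            raise ValueError("aligned rows must all have the same length")
        expanded.append("".join(
            ref if ch in match_chars else ch
            for ref, ch in zip(reference, row)
        ))
    return expanded


# Made-up alignment rows: "." marks a match to the first row, "-" is a gap.
aligned = ["ACGT-ACGT", "..C.-...A", "...TT...."]
for row in expand_match_chars(aligned):
    print(row)
```

<p>Passing <tt>match_chars=":"</tt> would model the default mentioned
above, in which "." is left free to act as a gap character.</p>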
995 
996 <p>You may specify more than one character for each of these
997 categories. When you have filled out the character information, click
998 on <span class="buttonlabel">Import Nucleotide Alignment</span> and
999 enter the name of your alignment file.</p>
1000 
1001 <a name="AfterImporting" id="AfterImporting"></a>
1002 <h5>After Importing Files</h5>
1003 
1004 <p><img class="figure" src="images/nucsing2.png" alt=
1005 "After Importing Files" /></p>
1006 
1007 <p>When the sequence file or alignment file import is complete, a box
1008 will appear showing the number of nucleotide segments imported, the
1009 total length in nucleotides of the sequences entered, and the sequence
1010 ID(s) you designated. The actual sequence data are <b>not</b>
1011 shown. If any of this information is missing or incorrect, check the
1012 file containing the sequence data for proper FASTA format, click on the
1013 <span class="buttonlabel">Clear Sequences</span> button, then reimport
1014 the sequence(s).</p>
1015 
1016 <p>If the imported nucleotide sequence or sequences or alignment
1017 have any problems, such as colliding local identifiers in a set or
1018 mismatched brackets in the definition line, an Assistant dialog
1019 appears to help correct the problems. Severe problems must be fixed
1020 before you can continue with the Sequin submission.</p>
1021 
1022 <a name="OrganismPage" id="OrganismPage"></a>
1023 <h4>Organism Page</h4>
1024 
1025 <p><img class="figure" src="images/organism.png" alt=
1026 "Organism Page" /></p>
1027 
1028 <p>The second page of the <span class="folderlabel">Organism and
1029 Sequences</span> form requests information regarding the scientific
1030 name of the organism from which the sequence was derived, if it was not
1031 already encoded in the nucleotide FASTA file. There are Assistants for
1032 manually adding organism name information or adding source
1033 qualifiers.</p>
1034 
1035 <p>Sequin has extracted the organism and strain names from the FASTA
1036 definition line in this example, eliminating the need to manually enter
1037 information in the <span class="folderlabel">Organism</span> page.</p>
1038 
1039 <a name="ProteinPage" id="ProteinPage"></a>
1040 <h4>Proteins Page</h4>
1041 
1042 <p><img class="figure" src="images/protein1.png" alt=
1043 "Proteins Page" /></p>
1044 
1045 <p>If your sequence or sequences encode one or more proteins, you can
1046 enter the sequences of the protein products in this page. To import the
1047 amino acid sequences, click on the <span
1048 class="folderlabel">Proteins</span> folder tab and click on the <span
1049 class="buttonlabel">Import Protein FASTA</span> button. You may import
1050 more than one file by clicking the button again after importing the
1051 first file. See <a href="#BeforeYouBegin">Before You Begin</a> for
1052 information on how to format your protein files.</p>
1053 
1054 <p><img class="figure" src="images/protein2.png" alt=
1055 "Proteins Example" /></p>
1056 
1057 <p>In this example, we imported two protein sequences. These are
1058 the alternative splice products of the same gene. Both protein
1059 sequences were in the same data file, but each had its own
1060 definition line.</p>
1061 
1062 <p>Sequin has extracted the gene and protein names from the FASTA
1063 definition lines, and will use these to construct the initial
1064 sequence record.</p>
1065 
1066 <a name="AnnotationPage" id="AnnotationPage"></a>
1067 <h4>Annotation Page</h4>
1068 
1069 <p><img class="figure" src="images/annot.png" alt=
1070 "Annotation Page" /></p>
1071 
1072 <p>The <span class="folderlabel">Annotation</span> page allows you to
1073 add an rRNA or CDS feature to the entire length of all sequences in the
1074 set. In addition, you can add a title to any sequences that didn't
1075 obtain them from a FASTA definition line. It is much easier to add
1076 these in bulk at this step than to add individual rRNA or CDS features
1077 to each sequence after the record is constructed.</p>
1078 
1079 <p>It is customary in a nucleotide record to format titles for
1080 sequences containing coding region features in the following
1081 way:</p>
1082 
1083 <p>Genus species protein name (gene symbol) mRNA/gene,
1084 complete/partial cds.</p>
1085 
1086 <p>The choice of "mRNA" or "gene" depends upon the molecule type (use
1087 "mRNA" for mRNA or cDNA, and "gene" for genomic DNA). Use "partial" for
1088 incomplete features. The proper organism name in a phylogenetic study
1089 can be added to the beginning of each title automatically by checking
1090 the <span class="buttonlabel">Prefix title with organism name</span>
1091 box.</p>
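<p>As a rough illustration of the title convention above, the following
Python sketch assembles a title from its parts. The function name
<tt>cds_title</tt>, its parameters, and the sample values are
assumptions made for this example; they are not part of Sequin.</p>

```python
def cds_title(organism, protein, gene, molecule, complete=True):
    """Build a title of the form
    'Genus species protein name (gene symbol) mRNA/gene, complete/partial cds.'

    Hypothetical helper: 'molecule' is one of "mRNA", "cDNA", or
    "genomic DNA", following the rule stated in the text.
    """
    if molecule in ("mRNA", "cDNA"):
        kind = "mRNA"
    elif molecule == "genomic DNA":
        kind = "gene"
    else:
        raise ValueError(f"unsupported molecule type: {molecule!r}")
    extent = "complete" if complete else "partial"
    return f"{organism} {protein} ({gene}) {kind}, {extent} cds."


# Placeholder values, not taken from a real record.
print(cds_title("Genus species", "example protein", "exmP", "cDNA"))
print(cds_title("Genus species", "example protein", "exmP",
                "genomic DNA", complete=False))
```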
1092 
1093 <p>However, for records containing CDS, rRNA, or tRNA features,
1094 Sequin can generate the definition line automatically by computing
1095 on the features (see "<a href="#autodefline">Generating the
1096 Definition Line</a>").</p>
1097 
1098 <p>More complex situations, such as a population study of HIV
1099 sequences, can include multiple CDS features in each sequence. In this
1100 case, do not use the <span class="folderlabel">Annotation</span> page
1101 to create features. (You can still use it for a common title, however.)
1102 After the initial submission has been created, you would manually
1103 annotate features onto one of the sequences. If you are submitting an
1104 alignment, or if you are submitting a set of sequences and you have
1105 checked <span class="buttonlabel">Create Alignment</span> on the
1106 <span class="folderlabel">Nucleotide</span> page, you will be able to
1107 use feature propagation to annotate the same features at the equivalent
1108 aligned locations on the remaining sequences.</p>
1109 
1110 <a name="viewing" id="viewing"></a>
1111 <h2>Viewing Your Submission</h2>
1112 
1113 <a name="GenBankView" id="GenBankView"></a>
1114 <h3>GenBank View</h3>
1115 
1116 <p>After you have completed importing the data files, Sequin will
1117 display your full submission information in the GenBank format (or
1118 EMBL format if you chose EMBL as the database for submission in the
1119 first form).</p>
1120 
1121 <p><img class="figure" src="images/genbank.png" alt=
1122 "GenBank Format" /></p>
1123 
1124 <p>On the basis of the information provided in your DNA and amino
1125 acid sequence files, any coding regions will be automatically
1126 identified and annotated for you. The figure shows only the top
1127 portion of the GenBank record, but you can see the first of two
1128 coding region (CDS) features. The vertical bar to the left of the
1129 paragraph indicates that the CDS has been selected by clicking with
1130 the computer's mouse.</p>
1131 
1132 <p>You may now make changes to the coding region, publication, source,
1133 and other features in the record by double clicking on the appropriate
1134 paragraphs in the GenBank display format. You may also use the <span
1135 class="menulabel">Annotate-&gt;Generate Definition Line</span> menu
1136 item to <a href="#autodefline">compute a definition line</a> for the
1137 annotated features in the record.</p>
1138 
1139 <a name="GraphicalView" id="GraphicalView"></a>
1140 <h3>Graphical View</h3>
1141 
1142 <p><img class="figure" src="images/graphic.png" alt=
1143 "Graphic Format" /></p>
1144 
1145 <p>To get a graphical view, change the <span
1146 class="popuplabel">Format</span> popup menu from <span
1147 class="menulabel">GenBank</span> to <span
1148 class="menulabel">Graphic</span>. Reviewing your submission in Graphic
1149 format allows you to visually confirm expected location of exons,
1150 introns, and other features in multiple interval coding regions. The
1151 Graphic view in our eukaryotic initiation factor example illustrates
1152 how the coding region intervals for the two protein products are
1153 spatially related to each other.</p>
1154 
1155 <p>The <span class="menulabel">File-&gt;Duplicate View</span> menu item
1156 will launch a second viewer on the record. The display format on each
1157 viewer can be independently set, allowing you to see a graphical view
1158 and a GenBank text report simultaneously. This is useful for getting an
1159 overall view of the features and seeing the details of annotation.</p>
1160 
1161 <a name="SequenceView" id="SequenceView"></a>
1162 <h3>Sequence View</h3>
1163 
1164 <p><img class="figure" src="images/sequence.png" alt=
1165 "Sequence Format" /></p>
1166 
1167 <p>Sequence view is a static version of the sequence and alignment
1168 editor. It shows the actual nucleotide sequence, with feature
1169 intervals annotated directly on the sequence. Protein translations
1170 of CDS features are also shown, as are all features shown in the
1171 graphical view.</p>
1172 
1173 <a name="editing" id="editing"></a>
1174 <h2>Editing and Annotating Your Submission</h2>
1175 
1176 <p>At this point, Sequin could process your entry based on what you
1177 have entered so far, and you could send it to your nucleotide database
1178 of choice (as set in the initial form). However, to optimize the
1179 usefulness of your entry for the scientific community, you may want to
1180 provide additional information to indicate biologically significant
1181 regions of the sequence. But first, save the entry so that if you make
1182 any unwanted changes during the editing process you can revert to the
1183 original copy.</p>
1184 
1185 <p>Additional information may be in the form of Descriptors or
1186 Features. Descriptors are annotations that apply to an entire
1187 sequence or set of sequences. They are used to remove redundant
1188 information in a record. Features are annotations that apply to a
1189 specific sequence interval.</p>
1190 
1191 <p>Sequin provides two methods to modify your entry: (1) to edit
1192 existing information, double click on the text or graphic area you want
1193 to modify, and Sequin will display forms requesting needed information;
1194 or (2) to add new information, use the <span
1195 class="menulabel">Annotate</span> menu and select from the list of
1196 available annotations.</p>
1197 
1198 <a name="SequenceEditor" id="SequenceEditor"></a>
1199 <h3>Sequence Editor</h3>
1200 
1201 <p>Additional sequence data can also be added using Sequin's sequence
1202 editor, which can be launched using the <span
1203 class="menulabel">Edit-&gt;Edit Sequence</span> menu item. Sequin will
1204 automatically adjust feature intervals when editing the sequence. Prior
1205 to Sequin, it was usually easier to reannotate everything from scratch
1206 when the sequence changed. But an even easier way to update sequences
1207 is described in the following section.</p>
1208 
1209 <a name="UpdatingTheSequence" id="UpdatingTheSequence"></a>
1210 <h3>Updating the Sequence</h3>
1211 
1212 <p>Sequin can also read in a replacement sequence, or an
1213 overlapping sequence extension, and perform the alignment and
1214 feature propagation calculations necessary to adjust feature
1215 intervals, even though the individual editing operations were not
1216 done with the sequence editor.</p>
1217 
1218 <p>The <span class="menulabel">Edit-&gt;Update Sequence</span> submenu
1219 has several choices. These are for use by the original submitter of a
1220 record.</p>
1221 
1222 <p>You can read a FASTA file or raw sequence file. This can be a
1223 replacement sequence, or it can overlap the original sequence at
1224 the 5' or 3' end. After Sequin aligns the two sequences, and you
1225 select optional parameters, the sequence in your record is updated,
1226 with all feature intervals adjusted properly.</p>
1227 
1228 <p>You can also update with an existing sequence record that
1229 contains features. This can be obtained from a file, or retrieved
1230 from Entrez via an Accession number. The latter choice
1231 requires the <a href=
1232 "http://www.ncbi.nlm.nih.gov/Sequin/netaware.html">network-aware</a>
1233 version of Sequin. Once it gets the new record, Sequin aligns the
1234 two sequences as before. This is typically used either to merge two
1235 records that overlap, or to copy features from database records
1236 onto a new large contig.</p>
1237 
1238 <p><img class="figure" src="images/update.png" alt=
1239 "Update Sequence Form" /></p>
1240 
1241 <p>The first panel shows how the two sequences align to each other.
1242 In this case, it is a 5' extension of the existing sequence. 400
1243 bases are new, 70 bases overlap the old sequence, and there are 30
1244 bases of vector on the new sequence that do not align to the old
1245 sequence and will be trimmed off.</p>
1246 
1247 <p>The second panel shows details of the 70-base aligned region.
1248 There is one single-base gap in each sequence. The total number of
1249 sequence letters plus gap characters is the alignment length, 71 in
1250 this example. (This number was shown between the sequence figures
1251 in the first panel.) Mismatched bases are indicated by vertical red
1252 lines between the two sequences.</p>
1253 
1254 <p>The third panel shows the actual sequence letters in the aligned
1255 region. Clicking on a gap or mismatch in the second panel scrolls
1256 to the appropriate place in this panel.</p>
1257 
1258 <p>Before pressing <span class="buttonlabel">Update Sequence</span>,
1259 you need to enter optional parameters. The alignment relationship is
1260 calculated by Sequin, but in some cases you may want to replace or
1261 patch rather than extend the existing sequence.</p>
1262 
1263 <a name="autodefline" id="autodefline"></a>
1264 <h3>Generating the Definition Line</h3>
1265 
1266 <p>The <span class="menulabel">Annotate-&gt;Generate Definition
1267 Line</span> menu item can make the appropriate titles once the record
1268 has been annotated with features. The general format for sequences
1269 containing coding region features is:</p>
1270 
1271 <p>Genus species protein name (gene symbol) mRNA/gene,
1272 complete/partial cds.</p>
1273 
1274 <p>Exceptional cases, where this automatic function is unable to
1275 generate a reasonable definition line, will be edited by the
1276 database staff to conform to the style conventions.</p>
1277 
1278 <p>The new definition line will replace any previous title,
1279 including that originally on the FASTA definition line.</p>
1280 
1281 <a name="Validation" id="Validation"></a>
1282 <h3>Record Validation</h3>
1283 
1284 <p>Once you are satisfied that you have entered all the relevant
1285 information, save your file! Then select the <span
1286 class="menulabel">Search-&gt;Validate</span> menu item. You will either
1287 receive a message that the validation test succeeded or see a screen
1288 listing the validation errors and warnings. Just double click on an
1289 error item to launch the appropriate editor for making corrections. The
1290 validator includes checks for such things as missing organism
1291 information, incorrect coding region lengths, internal stop codons in
1292 coding regions, inconsistent genetic codes, mismatched amino acids, and
1293 non-consensus splice sites.</p>
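<p>To make two of these checks concrete, here is a minimal Python
sketch of a coding region length check and an internal stop codon
check. It is not the Sequin validator: the function name
<tt>check_cds</tt> is hypothetical, and the sketch assumes the standard
genetic code (stop codons TAA, TAG, and TGA) and a coding sequence that
starts in frame.</p>

```python
# Stop codons of the standard genetic code (an assumption of this sketch;
# other genetic codes use different sets).
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})


def check_cds(cds, stops=STANDARD_STOPS):
    """Return a list of problem descriptions for a coding sequence.

    Hypothetical helper covering only two checks: length divisible
    by three, and no stop codon before the final codon.
    """
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The last codon may legitimately be a stop, so it is skipped here.
    for index, codon in enumerate(codons[:-1], start=1):
        if codon in stops:
            problems.append(f"internal stop codon {codon} at codon {index}")
    return problems


print(check_cds("ATGAAATAA"))
print(check_cds("ATGTAAAAATAA"))
```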
1294 
1295 <p><img class="figure" src="images/validate.png" alt=
1296 "Record Validator Form" /></p>
1297 
1298 <a name="SubmittingTheEntry" id="SubmittingTheEntry"></a>
1299 <h3>Submitting the Entry</h3>
1300 
1301 <p>When the entry is properly formatted and error-free, click the <span
1302 class="buttonlabel">Done</span> button or select the <span
1303 class="menulabel">File-&gt;Prepare Submission</span> menu item. You
1304 will be prompted to save your entry and email it to the database you
1305 selected. The address for GenBank is <tt>gb-sub@ncbi.nlm.nih.gov</tt>.
1306 The address for EMBL is <tt>datasubs@ebi.ac.uk</tt>. The address for
1307 DDBJ is <tt>ddbjsub@ddbj.nig.ac.jp</tt>.</p>
1308 
1309 <a name="Advanced" id="Advanced"></a>
1310 <h2>Advanced Topics</h2>
1311 
1312 <a name="FeatureEditorDesign" id="FeatureEditorDesign"></a>
1313 <h3>Feature Editor Design</h3>
1314 
1315 <p>Sequin uses a common structure for all feature editor forms, with
1316 (usually) three top-level folder tabs. One folder tab page is specific
1317 to the given feature type (biological source and publications have
1318 more). The <span class="folderlabel">Properties</span> and <span
1319 class="folderlabel">Location</span> pages are common to all features.
1320 Some of these pages may have subpages, accessed by a secondary set of
1321 smaller folder tabs. This organization allows editors for complex data
1322 structures to fit in a reasonably small window size. The most important
1323 information in a given section is always presented in the first
1324 subpage.</p>
1325 
1326 <a name="CodingRegionPage" id="CodingRegionPage"></a>
1327 <h4>Coding Region Page</h4>
1328 
1329 <p><img class="figure" src="images/cds_edit.png" alt=
1330 "Coding Region Page" /></p>
1331 
1332 <p>The coding region editor is perhaps the most complicated form in
1333 Sequin. Within the <span class="folderlabel">Coding Region</span> page,
1334 the <span class="folderlabel">Product</span> subpage lets you predict
1335 the coding region intervals from the protein sequence or translate the
1336 protein sequence from the location. (Importing a protein sequence from
1337 a file will also interpret the [gene=...] and [protein=...] definition
1338 line information and automatically attempt to predict the coding region
1339 intervals.) It also displays the genetic code used for translation and
1340 the reading frame. (Please note that there are currently 17 different
1341 genetic codes present in Sequin. For more information on these, see <a
1342 href= "http://www.ncbi.nlm.nih.gov/Taxonomy/">
1343 <tt>http://www.ncbi.nlm.nih.gov/Taxonomy/</tt></a>.)</p>
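<p>The bracketed definition line information mentioned above can be
pictured with a short Python sketch. The parsing rules here are a
simplification assumed for illustration (a leading "&gt;", a sequence
ID as the first token, and <tt>[name=value]</tt> pairs); the function
name <tt>parse_defline</tt> and the sample line are hypothetical, and
Sequin's own parser may behave differently.</p>

```python
import re

# Matches one [name=value] pair; values may contain spaces.
MODIFIER = re.compile(r"\[\s*([^=\]\s]+)\s*=\s*([^\]]*?)\s*\]")


def parse_defline(defline):
    """Split a FASTA definition line into (seq_id, modifiers, title).

    Hypothetical helper: modifiers is a dict keyed by lower-cased
    modifier name; title is whatever free text remains after the ID.
    """
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    body = defline[1:].strip()
    modifiers = {name.lower(): value
                 for name, value in MODIFIER.findall(body)}
    remainder = MODIFIER.sub(" ", body).split()
    seq_id = remainder[0] if remainder else ""
    title = " ".join(remainder[1:])
    return seq_id, modifiers, title


# Made-up definition line for illustration.
seq_id, modifiers, title = parse_defline(
    ">prot1 [gene=exmP] [protein=example protein]")
print(seq_id, modifiers, title)
```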
1344 
1345 <p>The <span class="folderlabel">Protein</span> subpage lets you set
1346 the name (or, if not known, a description) of the protein product. The
1347 <span class="folderlabel">Exceptions</span> subpage allows you to
1348 indicate translation exceptions to the normal genetic code, such as
1349 insertion of selenocysteine, suppression of terminator codons by a
1350 suppressor tRNA, or completion of a stop codon by poly-adenylation of
1351 an mRNA.</p>
1352 
1353 <p>Additional annotation on the protein product might include a leader
1354 peptide, transmembrane regions, disulfide bonds, or binding sites.
1355 These can be added after setting the <span class="popuplabel">Target
1356 Sequence</span> popup on the sequence viewer to the desired protein
1357 sequence. You can also launch a duplicate view, already targeted to the
1358 appropriate protein, from the <span class="folderlabel">Protein</span>
1359 subpage.</p>
1360 
1361 <a name="PropertiesPage" id="PropertiesPage"></a>
1362 <h4>Properties Page</h4>
1363 
1364 <p><img class="figure" src="images/props_pg.png" alt=
1365 "Properties Page" /></p>
1366 
1367 <p>All features have a number of fields in common. The <span
1368 class="buttonlabel">Partial</span> box will be checked if the 5'
1369 partial or 3' partial boxes on the <span
1370 class="folderlabel">Location</span> page were selected. <span
1371 class="buttonlabel">Exception</span> means that the sequence of the
1372 protein product doesn't match the translation of the DNA sequence
1373 because of some known biological reason (e.g., RNA editing). The <span
1374 class="popuplabel">Evidence</span> popup is now deprecated by the <span
1375 class="folderlabel">Evidence</span> subpage.</p>
1376 
1377 <p>In addition, nucleotide features (other than genes themselves)
1378 can reference a gene feature. This is frequently done by overlap.
1379 (The overlapping gene will show up on the feature as a /gene
1380 qualifier in GenBank format.) Extension of the feature location
1381 will automatically extend the gene that is selected in the editor.
1382 In rare cases, you may want to set a gene by cross-reference.</p>
1383 
1384 <p>The <span class="folderlabel">Comment</span> subpage allows text to
1385 be associated with a feature. In GenBank format, this appears as a
1386 /note qualifier. The <span class="folderlabel">Citations</span> subpage
1387 attaches citations to the feature. (The citations should first be added
1388 to the record using items in the <span
1389 class="menulabel">Annotate-&gt;Publication</span> submenu, whereupon they
1390 will appear in the REFERENCE section.) For example, an article that
1391 justifies a non-obvious or controversial biological conclusion would be
1392 cited here. In GenBank format, for example, if the publication is
1393 listed as Reference 2, the feature citation appears as /citation=[2].
1394 <span class="folderlabel">Cross-Refs</span> are cross-references to
1395 other databases. The contents of this subpage may only be changed by
1396 the GenBank, EMBL, or DDBJ database staff. <span
1397 class="folderlabel">Evidence</span> has experiment and inference
1398 qualifier fields. The experiment qualifier must include details of the
1399 experiment used to justify the annotation.</p>
1400 
1401 <a name="LocationPage" id="LocationPage"></a>
1402 <h4>Location Page</h4>
1403 
1404 <p><img class="figure" src="images/loc_page.png" alt=
1405 "Location Page" /></p>
1406 
1407 <p>All features are required to have a location, i.e., one or more
1408 intervals on a sequence coordinate. The <span
1409 class="folderlabel">Location</span> page provides a spreadsheet for
1410 entering and editing this information. An arbitrary number of lines can
1411 be entered. In this coding region example, the intervals correspond to
1412 the exons. For an mRNA, the intervals would be the exons and UTRs. The
1413 5' Partial and 3' Partial check boxes will show up as
1414 &lt; or &gt; in front of a feature coordinate in the GenBank flatfile,
1415 indicating partial locations.</p>
1416 
1417 <p>The GenBank flatfile view of this location would be:</p>
1418 
1419 <pre>
1420 join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
1421 </pre>
1422 
1423 <p>If the <span class="buttonlabel">5' Partial</span> or <span
1424 class="buttonlabel">3' Partial</span> boxes were checked, &lt; and &gt;
1425 symbols would appear at the appropriate end of the join statement:</p>
1426 
1427 <pre>
1428 join(&lt;201..224,1550..1920,1986..2085,2317..2404,2466..&gt;2629)
1429 </pre>
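<p>As a minimal sketch (not part of Sequin), the following Python function
shows how a list of intervals and the two partial flags could be rendered as
a location string. The name <tt>format_location</tt> and its parameters are
hypothetical, chosen only for illustration.</p>

```python
def format_location(intervals, partial5=False, partial3=False):
    """Render 1-based inclusive (start, stop) pairs as a GenBank-style location.

    Hypothetical helper for illustration; it is not a Sequin or NCBI API.
    """
    last = len(intervals) - 1
    parts = []
    for i, (start, stop) in enumerate(intervals):
        # The 5' partial marker goes before the first coordinate,
        # the 3' partial marker before the last coordinate.
        left = "<" if partial5 and i == 0 else ""
        right = ">" if partial3 and i == last else ""
        parts.append(f"{left}{start}..{right}{stop}")
    if len(parts) == 1:
        return parts[0]
    return "join(" + ",".join(parts) + ")"


# Exon intervals of the coding region used in the example above.
EXONS = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]

print(format_location(EXONS))
print(format_location(EXONS, partial5=True, partial3=True))
```

<p>Run as written, the two <tt>print</tt> calls should reproduce the two join
statements shown above.</p>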
1430 
1431 <p>If the sequence was reverse complemented (based on a length of 2881
1432 nucleotides), the <span class="popuplabel">Strand</span> popups would
1433 all indicate <span class="popuplabel">Minus</span>, and the join
1434 statement for the resulting feature location would be as follows:</p>
1435 
1436 <pre>
1437 complement(join(253..416,478..565,797..896,962..1332,2658..2681))
1438 </pre>
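<p>The reverse-complemented coordinates can be checked by hand: on a sequence
of length N, position p maps to N - p + 1, so each interval (start, stop)
becomes (N - stop + 1, N - start + 1). The Python sketch below applies that
arithmetic to the example intervals. The function name
<tt>reverse_complement_intervals</tt> is hypothetical and is not part of
Sequin.</p>

```python
def reverse_complement_intervals(intervals, seq_length):
    """Map 1-based inclusive intervals onto the reverse-complemented sequence.

    Hypothetical helper for illustration only.
    """
    flipped = [(seq_length - stop + 1, seq_length - start + 1)
               for start, stop in intervals]
    # Sort so the intervals are listed in ascending coordinate order.
    return sorted(flipped)


# Exon intervals of the coding region, on a sequence of 2881 nucleotides.
EXONS = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]

flipped = reverse_complement_intervals(EXONS, 2881)
location = "complement(join(" + ",".join(f"{a}..{b}" for a, b in flipped) + "))"
print(location)
```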
1439 
1440 <a name="NCBIDesktop" id="NCBIDesktop"></a>
1441 <h3>NCBI DeskTop</h3>
1442 
1443 <p><img class="figure" src="images/desktop.png" alt=
1444 "NCBI DeskTop Window" /></p>
1445 
1446 <p>The NCBI DeskTop is a window that directly displays the internal
1447 structure of the record being viewed in Sequin. It can be
1448 understood as a Venn diagram.</p>
1449 
1450 <p>As with other views on a record, the DeskTop indicates selected
1451 items and lets you select items by clicking.</p>
1452 
1453 <p>In this example, Sequin was given the genomic nucleotide and protein
1454 sequences for <span class="taxonomy">Drosophila melanogaster</span>
1455 eukaryotic initiation factor 4E. It then determined the coding region
1456 intervals and built an initial structure. The organism (BioSource
1457 descriptor) is at the nuc-prot set and thus applies to both the
1458 nucleotide and protein sequences.</p>
1459 
1460 <a name="AdditionalInformation" id="AdditionalInformation"></a>
1461 <h3>Additional Information</h3>
1462 
1463 <p>The Sequin homepage <a href=
1464 "http://www.ncbi.nlm.nih.gov/Sequin/">
1465 <tt>http://www.ncbi.nlm.nih.gov/Sequin/</tt></a>
1466 has a Frequently Asked Questions section and more detailed
1467 instructions on using the capabilities of network-aware Sequin.</p>
1468 
1469 <a name="Reference" id="Reference"></a>
1470 <h2>Reference</h2>
1471 
1472 <a name="NetworkConfiguration" id="NetworkConfiguration"></a>
1473 <h3>Network Configuration</h3>
1474 
1475 <p><img class="figure" src="images/net_cfg.png" alt=
1476 "Network Configuration Form" /></p>
1477 
1478 <p>When first downloaded, Sequin runs in stand-alone mode, without
1479 access to the network. However, the program can also be configured
1480 to exchange information with the NCBI (GenBank) over the Internet.
1481 The network-aware mode of Sequin is identical to the stand-alone
1482 mode, but it contains some additional useful options.</p>
1483 
1484 <p>Sequin can only function in its network-aware mode if the
1485 computer on which it resides has a direct Internet connection.
1486 Electronic mail access to the Internet is insufficient. In general,
1487 if you can install and use a WWW browser on your system, you should
1488 be able to install and use network-aware Sequin. Check with your
1489 system administrator or Internet provider if you are uncertain as
1490 to whether you have direct Internet connectivity.</p>
1491 
1492 <p>To launch the configuration form, select Net Configure under the
1493 Misc menu, from either the initial Welcome to Sequin form or from a
1494 viewer on an existing sequence record.</p>
1495 
1496 <p>If you are not behind a firewall, set the <span
1497 class="buttonlabel">Connection</span> control to <span
1498 class="buttonlabel">Normal</span>. If you also have a Domain Name
1499 Server (DNS) available, you can now simply press <span
1500 class="buttonlabel">Accept</span>.</p>
1501 
1502 <p>If DNS is not available, uncheck the <span
1503 class="buttonlabel">Domain Name Server</span> box. If you are behind
1504 a firewall, set the <span class="buttonlabel">Connection</span> control
1505 to <span class="buttonlabel">Firewall</span>. The <span
1506 class="buttonlabel">HTTP Proxy Server</span> box then becomes active.
1507 If you also use a proxy server, type in its address. (If you have
1508 access to DNS, it will be of the form
1509 <tt>www.myproxy.myuniversity.edu</tt>. If you do not have DNS, you
1510 should use the numerical IP address of the form <tt>127.45.23.6</tt>.)
1511 Once you type something in the <span class="buttonlabel">HTTP Proxy
1512 Server</span> box, the <span class="buttonlabel">Port</span> box
1513 becomes active and can be filled in or changed as appropriate. (By
1514 default the <span class="buttonlabel">Non-transparent Proxy
1515 Server</span> box is empty, indicating a CERN-like proxy.) Ask your
1516 network administrator for advice on the proper settings to use.</p>
1517 
1518 <p>If you are in the United States, the default <span
1519 class="textlabel">Timeout</span> of 30 seconds should suffice. From
1520 other countries with a slow Internet connection to the U.S., you can
1521 select a timeout of up to 5 minutes.</p>
1522 
1523 <p>Finally, you will need to quit and restart Sequin for the
1524 network-aware settings to take effect.</p>
1525 
1526 <p>If you are behind a firewall, it must be configured correctly to
1527 access NCBI services. Your network administrator may have done this
1528 already. If not, please have them contact NCBI for further
1529 instructions on setting up firewalls to work with NCBI
1530 services.</p>
1531 
1532 <p><b>The following section is intended for network
1533 administrators:</b></p>
1534 
1535 <p>Using NCBI services from behind a security firewall requires
1536 opening ports in your firewall. Please consult <a href=
1537 "http://www.ncbi.nlm.nih.gov/IEB/ToolBox/NETWORK/firewall.html">
1538 <tt>http://www.ncbi.nlm.nih.gov/IEB/ToolBox/NETWORK/firewall.html</tt></a>
1539 for the list of current hosts and ports that have the firewall
1540 daemon configured.</p>
1541 
1542 <p>If your firewall is not transparent, the firewall port number
1543 should be mapped to the same port number on the external host.</p>
1544 
1545 <p>Note: Old NCBI clients used different application configuration
1546 settings and ports than listed above. If you need to support such
1547 clients, which are becoming obsolete, please contact <a href=
1548 "mailto:info@ncbi.nlm.nih.gov"><tt>info@ncbi.nlm.nih.gov</tt></a>
1549 for further information.</p>
1550 
1551 <a name="FeatureTableFormat" id="FeatureTableFormat"></a>
1552 <h3>Feature Table Format</h3>
1553 
1554 <p>Sequin can now annotate features by reading in a tab-delimited
1555 table. This is most often used by genome centers that store feature
1556 interval information in relational databases or spreadsheets. For
1557 most submitters, it is usually better to supply protein sequences
1558 in FASTA format with gene and protein names embedded in the
1559 definition line.</p>
1560 
1561 <p>The feature table specifies the location and type of each feature,
1562 and Sequin processes the feature intervals and translates any CDSs. The
1563 table is read in the record viewer (after the sequence has been
1564 imported) using the <span class="menulabel">File-&gt;Open</span> menu
1565 item. The table must follow a defined format. The first line starts
1566 with &gt;Feature, a space, and then the Sequence ID of the sequence you
1567 are annotating. In the example below, eIF4E is the Sequence ID, and it
1568 is a local identifier.</p>
1569 
1570 <p>The table is composed of five columns: start, stop, feature key,
1571 qualifier key, and qualifier value. The columns are separated by
1572 tabs. The first row for any given feature has start, stop, and
1573 feature key. Additional feature intervals just have start and stop.
1574 The qualifiers follow on lines starting with three tabs.</p>
1575 
1576 <p>For example, a table that looks like this:</p>
1577 
1578 <pre>
1579 &gt;Features lcl|eIF4E
1580 80      2881    gene
1581                         gene     eIF4E
1582 
1583 201     224     CDS
1584 1550    1920
1585 1986    2085
1586 2317    2404
1587 2466    2629
1588                         product  eukaryotic initiation factor 4E-II
1589 
1590 1402    1458    CDS
1591 1550    1920
1592 1986    2085
1593 2317    2404
1594 2466    2629
1595                         product  eukaryotic initiation factor 4E-I
1596                         note     encoded by two messenger RNAs
1597 
1598 80      224     mRNA
1599 1550    1920
1600 1986    2085
1601 2317    2404
1602 2466    2881
1603                         product  eukaryotic initiation factor 4E-II
1604 
1605 80      224     mRNA
1606 892     1458
1607 1550    1920
1608 1986    2085
1609 2317    2404
1610 2466    2881
1611                         product  eukaryotic initiation factor 4E-I
1612 
1613 80      224     mRNA
1614 1129    1458
1615 1550    1920
1616 1986    2085
1617 2317    2404
1618 2466    2881
1619                         product  eukaryotic initiation factor 4E-I
1620 </pre>
1621 
1622 <p>will result in a GenBank flatfile that contains this:</p>
1623 
1624 <pre>
1625      mRNA            join(80..224,1129..1458,1550..1920,1986..2085,2317..2404,
1626                      2466..2881)
1627                      /gene="eIF4E"
1628                      /product="eukaryotic initiation factor 4E-I"
1629      mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
1630                      2466..2881)
1631                      /gene="eIF4E"
1632                      /product="eukaryotic initiation factor 4E-I"
1633      mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
1634                      /gene="eIF4E"
1635                      /product="eukaryotic initiation factor 4E-II"
1636      gene            80..2881
1637                      /gene="eIF4E"
1638      CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
1639                      /gene="eIF4E"
1640                      /codon_start=1
1641                      /product="eukaryotic initiation factor 4E-II"
1642                      /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETG
1643                      EPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTV
1644                      EDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDN
1645                      LWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDAL
1646                      RLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
1647      CDS             join(1402..1458,1550..1920,1986..2085,2317..2404,
1648                      2466..2629)
1649                      /gene="eIF4E"
1650                      /note="encoded by two messenger RNAs"
1651                      /codon_start=1
1652                      /product="eukaryotic initiation factor 4E-I"
1653                      /translation="MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKD
1654                      VKPKEDPQETGEPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWED
1655                      MQNEITSFDTVEDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVIT
1656                      LNKSSKTDLDNLWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAA
1657                      LEIGHKLRDALRLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
1658 </pre>
1659 
1660 <p>Note that if the gene feature spans the intervals of the CDS and
1661 mRNA features for that gene, you don't need to include gene
1662 "qualifiers" in those features, because they will be picked up by
1663 overlap.</p>
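<p>For readers who generate or check such tables with scripts, the following
Python sketch reads the five-column layout described above into a simple list
of features. It is an illustration only, not the parser Sequin uses. The
function name <tt>parse_feature_table</tt> and the dictionary layout are
assumptions, and the sketch expects plain integer coordinates and
tab-separated columns.</p>

```python
def parse_feature_table(text):
    """Parse a five-column, tab-delimited feature table.

    Returns (sequence_id, features). Hypothetical helper for illustration;
    it assumes plain integer coordinates and tab-separated columns.
    """
    seq_id = None
    features = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue  # skip blank separator lines
        if line.startswith(">Feature"):
            # Header line: the keyword, a space, then the Sequence ID.
            seq_id = line.split()[1]
            continue
        cols = line.split("\t")
        cols += [""] * (5 - len(cols))  # pad short rows to five columns
        start, stop, key, qual, value = cols[:5]
        if start and stop:
            if key:
                # First row of a new feature: start, stop, feature key.
                features.append({"key": key, "intervals": [], "qualifiers": []})
            # Rows with only start and stop add an interval to the feature.
            features[-1]["intervals"].append((int(start), int(stop)))
        elif qual:
            # Qualifier rows begin with three tabs.
            features[-1]["qualifiers"].append((qual, value))
    return seq_id, features


TABLE = (
    ">Features lcl|eIF4E\n"
    "80\t2881\tgene\n"
    "\t\t\tgene\teIF4E\n"
    "\n"
    "201\t224\tCDS\n"
    "1550\t1920\n"
    "\t\t\tproduct\teukaryotic initiation factor 4E-II\n"
)

seq_id, features = parse_feature_table(TABLE)
for feature in features:
    print(seq_id, feature["key"], feature["intervals"], feature["qualifiers"])
```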
1664 
1665 <p>Features that are on the complementary strand are indicated by
1666 reversing the interval locations. For example, the table:</p>
1667 
1668 <pre>
1669 &gt;Features lcl|dna2
1670 5284    5202    tRNA
1671                         product  tRNA-Glu
1672 </pre>
1673 
1674 <p>will result in a GenBank flatfile containing:</p>
1675 
1676 <pre>
1677      tRNA            complement(5202..5284)
1678                      /product="tRNA-Glu"
1679 </pre>
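<p>The strand convention can be expressed in a few lines of Python. This is a
sketch under the assumption that a row whose start is greater than its stop
lies on the complementary strand. The function name
<tt>interval_to_location</tt> is hypothetical, and treating a single-base
interval (start equal to stop) as plus strand is also an assumption.</p>

```python
def interval_to_location(start, stop):
    """Turn one table interval into a GenBank-style location string.

    Hypothetical helper: reversed coordinates (start > stop) are taken to
    mean the complementary strand; start == stop is treated as plus strand.
    """
    if start > stop:
        return f"complement({stop}..{start})"
    return f"{start}..{stop}"


print(interval_to_location(5284, 5202))
print(interval_to_location(80, 2881))
```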
1680 
1681 <p>More instructions on using the feature table format for
1682 submitting large genomic records are available at<br />
1683 <a href="http://www.ncbi.nlm.nih.gov/Sequin/table.html">
1684 <tt>http://www.ncbi.nlm.nih.gov/Sequin/table.html</tt></a>.</p>
1685 
1686 <hr />
1687 
1688 <div id="footer">
1689 <b>Questions or Comments?</b><br />
1690 Write to the <a href="mailto:info@ncbi.nlm.nih.gov">NCBI Service Desk</a>
1691 <br />
1692 <br />
1693 Revised August 21, 2007<br />
1694 </div>
1695 
1696 <!--  end of content  -->
1697 </body>
1698 </html>
