<!-- NCBI C Toolkit Cross Reference: C/doc/sequin.htm -->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 1 September 2005), see www.w3.org" />
<title>Sequin Quick Guide</title>
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />
<!-- if you use the following meta tags, uncomment them.
<meta name="author" content="sequindoc">
<META NAME="keywords" CONTENT="national center for biotechnology information, ncbi, national library of medicine, nlm, national institutes of health, nih, database, archive, bookshelf, pubmed, pubmed central, bioinformatics, biomedicine, sequence submission, sequin, bankit, submitting sequences, quick guide, format">
<META NAME="description" CONTENT="Sequin is a stand-alone software tool developed by the National Center for Biotechnology Information (NCBI) for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases.
">
-->
<link rel="stylesheet" href="images/ncbi_sequin.css" type="text/css" />
</head>

<body>
<!-- change the link and vlink colors from the original orange (link="#CC6600" vlink="#CC6600") -->
<!-- the header -->
<div id="header"><a href="http://www.ncbi.nlm.nih.gov" title=
"NCBI home page"><img src="images/logo.png" alt="NCBI logo"
id="ncbilogo" name="ncbilogo" /></a>
<h1 id="tophead">Sequin Quick Guide</h1>
</div>
<!-- the quicklinks bar -->
<ul id="topnav">
<li><a href=
"http://www.ncbi.nlm.nih.gov/Sequin/index.html">Sequin</a></li>
<li><a href="http://www.ncbi.nlm.nih.gov/Entrez/">Entrez</a></li>
<li><a href="http://www.ncbi.nlm.nih.gov/BLAST/">BLAST</a></li>
<li><a href="http://www.ncbi.nlm.nih.gov/omim/">OMIM</a></li>
<li><a href=
"http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html">Taxonomy</a></li>
<li><a href=
"http://www.ncbi.nlm.nih.gov/Structure/">Structure</a></li>
</ul>
<!-- the contents -->

<h1>Sequin for Database Submissions and Updates:<br />
A Quick Guide</h1>

<hr />
<!-- use img src="http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/image##.png" align=bottom-->

<h2>Introduction</h2>

<p>Sequin is a stand-alone software tool developed by the National
Center for Biotechnology Information (NCBI) for submitting and
updating sequences to the GenBank, EMBL, and DDBJ databases. Sequin
has the capacity to handle long sequences and sets of sequences
(segmented entries, as well as population, phylogenetic, and
mutation studies). It also allows sequence editing and updating,
and provides complex annotation capabilities.
In addition, Sequin
contains a number of built-in validation functions for enhanced
quality assurance.</p>

<p>This overview is intended to provide a quick guide to Sequin's
capabilities, including automatic annotation of coding regions, the
graphical viewer, quality control features, and editing features. We
suggest that you read this entire document before beginning your Sequin
submission. More detailed instructions on these and other functions can
be found in Sequin's on-screen <b>Help</b> file, also
available on the World-Wide Web from the Sequin homepage at:</p>

<p><a href=
"http://www.ncbi.nlm.nih.gov/Sequin/">
<tt>http://www.ncbi.nlm.nih.gov/Sequin/</tt></a></p>

<p>Email help is also available from <a href=
"mailto:info@ncbi.nlm.nih.gov">
<tt>info@ncbi.nlm.nih.gov</tt></a></p>


<h2>Table of Contents</h2>

<ul>
<li><a href="#BeforeYouBegin">Before You Begin</a>
<ul>
<li><a href="#PrepareSequenceData">
Preparing Nucleotide and Amino Acid Data</a></li>
<li><a href="#DefinitionLine">Definition Lines</a></li>
<li><a href="#FASTAformat">FASTA Format</a>
<ul>
<li><a href="#SingleSequence">Single Sequence</a></li>
<li><a href="#SegmentedSequences">Segmented Nucleotide Sequences</a></li>
<li><a href="#GappedSequences">Gapped Sequences</a></li>
</ul>
</li>
<li><a href="#AlignmentFormats">Alignment Formats</a>
<ul>
<li><a href="#FASTAplusGAP">FASTA+GAP</a></li>
<li><a href="#PHYLIPformat">PHYLIP</a></li>
<li><a href="#NEXUSInterleaved">NEXUS Interleaved</a></li>
<li><a href="#NEXUSContiguous">NEXUS Contiguous</a></li>
<li><a href="#SetsOfSegmentedSequences">Sets of Segmented Sequences</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#CreatingASubmission">Creating a Submission</a>
<ul>
<li><a href="#BasicSequinOrganization">Basic Sequin Organization</a></li>
<li><a href="#WelcomeToSequinForm">Welcome to Sequin
Form</a></li>
<li><a href="#SubmittingAuthorsForm">Submitting Authors Form</a>
<ul>
<li><a href="#SubmissionPage">Submission Page</a></li>
<li><a href="#ContactPage">Contact Page</a></li>
<li><a href="#AuthorsPage">Authors Page</a></li>
<li><a href="#AffiliationPage">Affiliation Page</a></li>
</ul>
</li>
<li><a href="#SequenceFormatForm">Sequence Format Form</a>
<ul>
<li><a href="#SubmissionType">Submission Type</a></li>
<li><a href="#SequenceDataFormat">Sequence Data Format</a></li>
<li><a href="#SubmissionCategory">Submission Category</a></li>
</ul>
</li>
<li><a href="#OrganismAndSequencesForm">
Organism and Sequences Form</a>
<ul>
<li><a href="#NucleotidePage">Nucleotide Page</a>
<ul>
<li><a href="#NucleotidePageSingleSequence">
Importing Nucleotide FASTA for a Single Sequence</a></li>
<li><a href="#NucleotidePageSequenceSet">
Importing Nucleotide FASTA for a Sequence Set</a></li>
<li><a href="#NucleotidePageAlignment">
Importing an Alignment</a></li>
<li><a href="#AfterImporting">After Importing Files</a></li>
</ul>
</li>
<li><a href="#OrganismPage">Organism Page</a></li>
<li><a href="#ProteinPage">Proteins Page</a></li>
<li><a href="#AnnotationPage">Annotation Page</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#viewing">Viewing Your Submission</a>
<ul>
<li><a href="#GenBankView">GenBank View</a></li>
<li><a href="#GraphicalView">Graphical View</a></li>
<li><a href="#SequenceView">Sequence View</a></li>
</ul>
</li>
<li><a href="#editing">Editing and Annotating Your Submission</a>
<ul>
<li><a href="#SequenceEditor">Sequence Editor</a></li>
<li><a href="#UpdatingTheSequence">Updating the Sequence</a></li>
<li><a href="#autodefline">Generating the Definition Line</a></li>
<li><a href="#Validation">Record Validation</a></li>
<li><a href="#SubmittingTheEntry">Submitting
the Entry</a></li>
</ul>
</li>
<li><a href="#Advanced">Advanced Topics</a>
<ul>
<li><a href="#FeatureEditorDesign">Feature Editor Design</a>
<ul>
<li><a href="#CodingRegionPage">Coding Region Page</a></li>
<li><a href="#PropertiesPage">Properties Page</a></li>
<li><a href="#LocationPage">Location Page</a></li>
</ul>
</li>
<li><a href="#NCBIDesktop">NCBI Desktop</a></li>
<li><a href="#AdditionalInformation">Additional Information</a></li>
</ul>
</li>
<li><a href="#Reference">Reference</a>
<ul>
<li><a href="#NetworkConfiguration">Network Configuration</a></li>
<li><a href="#FeatureTableFormat">Feature Table Format</a></li>
</ul>
</li>
</ul>

<a name="BeforeYouBegin" id="BeforeYouBegin"></a>
<h2>Before You Begin</h2>

<a name="PrepareSequenceData" id = "PrepareSequenceData"></a>
<h3>Preparing Nucleotide and Amino Acid Data</h3>

<p>Sequin normally expects to read sequence files in FASTA format.
Note that most sequence analysis software packages include FASTA or
"raw" as one of the available output formats. Population studies,
phylogenetic studies, mutation studies, and environmental samples
may be entered in either FASTA format, or in PHYLIP, NEXUS, MACAW,
or FASTA+GAP formats if you are submitting an alignment.</p>

<p>See <a href=
"http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#FASTAFormatforNucleotideSequences">
<tt>http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#FASTAFormatforNucleotideSequences</tt></a>
for detailed examples of each of the various input data
formats.</p>

<p>Prepare your sequence data files using a text editor, and save
in ASCII text format (plain text).
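</p>

<p>As a rough illustration of the plain-text requirement, the Python
sketch below scans the bytes of a sequence file for anything outside the
ASCII range, which is a common sign that the file was saved from a word
processor. The function name <tt>find_non_ascii</tt> and the sample data
are hypothetical and are not part of Sequin.</p>

```python
def find_non_ascii(data: bytes):
    """Return (line, column, byte value) for every non-ASCII byte in data."""
    problems = []
    for line_no, line in enumerate(data.splitlines(), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:  # ASCII covers only 0x00 through 0x7F
                problems.append((line_no, col, byte))
    return problems


if __name__ == "__main__":
    # Hypothetical file contents with a non-breaking space (0xC2 0xA0)
    # hidden inside the sequence line.
    sample = b">ABC-1 [organism=Saccharomyces cerevisiae]\nATTGCGTT\xc2\xa0ATGG\n"
    for line_no, col, byte in find_non_ascii(sample):
        print(f"line {line_no}, column {col}: byte 0x{byte:02X}")
```

<p>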
If your nucleotide sequence
encodes one or more protein products, Sequin expects two files, one
for the nucleotides and one for the proteins.</p>

<a name="DefinitionLine" id="DefinitionLine"></a>
<h3>Definition Lines</h3>

<p>FASTA format is simply the raw sequence preceded by a definition
line. The definition line begins with a > sign and is followed
immediately by a name for the sequence (your own local
identification code, or sequence ID) and a title. During the
submission process, indexing staff at the database to which you are
submitting will change your sequence ID to an Accession number. You
can embed other important information in the title, and Sequin uses
this information to construct a record. Specifically, you can enter
organism and strain or clone information in the nucleotide
definition line and gene and protein information in the protein
definition line using name-value pairs surrounded by square
brackets. Example: [organism=Drosophila melanogaster]
[strain=Oregon R]</p>

<p>Some modifier names have restricted values or formats.</p>

<ul>
<li><b>organism</b> should use the unabbreviated scientific name.
Example: [organism=Drosophila melanogaster]</li>
<li><b>molecule</b> should use either "DNA" or "RNA". Example:
[molecule=DNA]</li>

<li><b>moltype</b> should use one of the following values. Example:
[moltype=genomic]
<ul>
<li>genomic</li>
<li>precursor RNA</li>
<li>mRNA</li>
<li>rRNA</li>
<li>tRNA</li>
<li>snRNA</li>
<li>scRNA</li>
<li>other-genetic</li>
<li>cRNA</li>
<li>snoRNA</li>
<li>transcribed RNA</li>
</ul>
</li>

<li><b>location</b> should use one of the following values.
Example: [location=mitochondrion]
<ul>
<li>genomic</li>
<li>chloroplast</li>
<li>kinetoplast</li>
<li>mitochondrion</li>
<li>plastid</li>
<li>macronuclear</li>
<li>extrachromosomal</li>
<li>plasmid</li>
<li>cyanelle</li>
<li>proviral</li>
<li>virion</li>
<li>nucleomorph</li>
<li>apicoplast</li>
<li>leucoplast</li>
<li>proplastid</li>
<li>endogenous-virus</li>
<li>hydrogenosome</li>
</ul>
</li>

<li><b>collection-date</b> should be in the form YYYY or Mmm-YYYY
or DD-Mmm-YYYY. Example: [collection-date=2005] or
[collection-date=Oct-2005] or
[collection-date=25-Oct-2005]</li>
</ul>

<p>The following modifiers should use only TRUE or FALSE. Example:
[transgenic=TRUE].</p>

<ul>
<li><b>environmental-sample</b></li>
<li><b>germline</b></li>
<li><b>metagenomic</b></li>
<li><b>rearranged</b></li>
<li><b>transgenic</b></li>
</ul>

<p>This is the list of the remaining modifier names that you can
include in your definition lines for nucleotide files:</p>

<table id="sourcemods" summary="remaining modifiers">
<tr>
<td valign="top">
<ul>
<li>acronym</li>
<li>anamorph</li>
<li>authority</li>
<li>bio-material</li>
<li>biotype</li>
<li>biovar</li>
<li>breed</li>
<li>cell-line</li>
<li>cell-type</li>
<li>chemovar</li>
<li>chromosome</li>
<li>clone</li>
<li>clone-lib</li>
<li>collected-by</li>
<li>common</li>
<li>country</li>
<li>cultivar</li>
<li>culture-collection</li>
<li>dev-stage</li>
<li>ecotype</li>
<li>endogenous-virus-name</li>
</ul>
</td>
<td valign="top">
<ul>
<li>forma</li>
<li>forma-specialis</li>
<li>fwd-pcr-primer-name</li>
<li>fwd-pcr-primer-seq</li>
<li>genotype</li>
<li>group</li>
<li>haplotype</li>
<li>identified-by</li>
<li>isolate</li>
<li>isolation-source</li>
<li>lab-host</li>
<li>lat-lon</li>
<li>map</li>
<li>metagenome-source</li>
<li>metagenomic</li>
<li>note</li>
<li>pathovar</li>
<li>plasmid-name</li>
<li>plastid-name</li>
<li>pop-variant</li>
<li>rev-pcr-primer-name</li>
</ul>
</td>
<td valign="top">
<ul>
<li>rev-pcr-primer-seq</li>
<li>segment</li>
<li>serogroup</li>
<li>serotype</li>
<li>serovar</li>
<li>sex</li>
<li>specific-host</li>
<li>specimen-voucher</li>
<li>strain</li>
<li>sub-species</li>
<li>subclone</li>
<li>subgroup</li>
<li>substrain</li>
<li>subtype</li>
<li>synonym</li>
<li>teleomorph</li>
<li>tissue-lib</li>
<li>tissue-type</li>
<li>type</li>
<li>variety</li>
</ul>
</td>
</tr>
</table>

<p>Example: [strain=BALB/c]</p>

<p>Some population studies are a mixture of integrated provirus and
excised virion. These can be indicated by molecule and location
qualifiers, e.g., [molecule=DNA] [location=proviral] or
[molecule=RNA] [location=virion]. You can also embed
[moltype=genomic] or [moltype=mRNA] to indicate from what source
the molecule was isolated. If you're unsure of which modifier to
use, use [note=...], and database staff will determine the
appropriate modifier to use.</p>

<p>This is the list of modifier names that you can include in your
definition lines for protein files:</p>

<ul>
<li><b>gene</b></li>
<li><b>protein</b></li>
<li><b>prot_desc</b></li>
</ul>

<p>A coding region feature will be created on the nucleotide
sequence indicating where the protein sequence is encoded.
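</p>

<p>The bracketed name-value pairs described in this section can be
pulled out of a definition line with a short script. The Python sketch
below is only an illustration of the convention, not Sequin's own
parser; the function name <tt>parse_definition_line</tt> and its return
shape are assumptions made for this example.</p>

```python
import re

# One [name=value] pair; neither part may itself contain square brackets.
MODIFIER_RE = re.compile(r"\[([^\[\]=]+)=([^\[\]]*)\]")


def parse_definition_line(line: str):
    """Split a FASTA definition line into (sequence ID, modifiers, title)."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line must begin with '>'")
    body = line[1:].strip()
    if body.startswith("["):
        # No sequence ID, only modifiers.
        seq_id, rest = "", body
    else:
        seq_id, _, rest = body.partition(" ")
    modifiers = {name.strip(): value.strip()
                 for name, value in MODIFIER_RE.findall(rest)}
    # Whatever is left after removing the bracketed pairs is the title.
    title = " ".join(MODIFIER_RE.sub(" ", rest).split())
    return seq_id, modifiers, title


if __name__ == "__main__":
    seq_id, mods, title = parse_definition_line(
        ">4E-I [gene=eIF4E] [protein=eukaryotic initiation factor 4E-I]"
    )
    print(seq_id, mods, title)
```

<p>The sketch does not check modifier names or values against the lists
above; it only separates the three parts of the line.</p>

<p>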
If you
specify "gene" in the protein sequence definition line, a gene that
covers the coding region will be created with a locus specified by
the value of "gene".</p>

<p>The product name for the coding region will be the "protein" value
specified in the protein sequence definition line, if supplied. The
product description for the coding region will be the "prot_desc"
value specified in the protein sequence definition line, if
supplied.</p>

<p>Note that the [ and ] brackets actually appear in the text.
(Brackets are sometimes used in computer documentation to denote
optional text. This convention is not followed here.) The bracketed
information will be removed from the definition line for each
sequence. Sequin can also calculate a new definition line by
computing on features in the annotated record (see "<a href=
"#autodefline">Generating the Definition Line</a>").</p>

<p>The ability to embed this information in the definition line is
provided as a convenience to the submitter. If these annotations
are not present, they can be entered in subsequent forms. Sequin is
designed to use this information, and that provided in the initial
forms, to build a properly structured record. <b>In many cases,
the final submission can be completely prepared from these data, so
that no additional manual annotation is necessary once the record
is displayed.</b></p>

<p><b>It is much easier to produce the final submission if you
let Sequin work for you in this manner.</b></p>

<p>In this example we show alternative splicing, where a single
gene produces multiple messenger RNAs that encode two similar but
distinct protein products. Examples for the definition lines for
the nucleotide and protein files are shown here:</p>

<pre>
Nucleotide Sequence:

>eIF4E [organism=Drosophila melanogaster] [strain=Oregon R] Drosophila ...
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCA ...

Protein Sequences:

>4E-I [gene=eIF4E] [protein=eukaryotic initiation factor 4E-I]
MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGN ...
>4E-II [gene=eIF4E] [protein=eukaryotic initiation factor 4E-II]
MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGNTATTTAPAGDD ...
</pre>

<p>Also, please note that there must be a line break (carriage
return) between the definition line and the first line of sequence.
Some word processors will break a single line onto two lines
without actually adding a carriage return. (This feature is known
as "word wrapping".) If you are unsure whether there is a carriage
return, you can either set up your word processor so it shows
invisible characters like carriage returns, or view the file in a
text editor that does not create artificial line breaks. <b>The
definition line itself must not have a line break within it,
because the second line would then be misinterpreted as the
beginning of the sequence data.</b> The actual sequence is usually
broken every 50 to 80 characters, but this is not necessary for
Sequin to be able to read it.</p>

<a name="FASTAformat" id="FASTAformat"></a>
<h3>FASTA Format</h3>

<p>There are three types of sequences that may be represented using
the FASTA format: single contiguous sequences, segmented sequences,
and gapped sequences.</p>

<a name="SingleSequence" id="SingleSequence"></a>
<h4>Single Sequence</h4>

<p>This is the definition line followed by the sequence data.
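</p>

<p>The layout rules above (one unbroken definition line, then sequence
data that may be wrapped at any width) can be sketched as a small
reader. The Python function below, with the hypothetical name
<tt>read_fasta</tt>, is a minimal illustration for plain FASTA files and
is not how Sequin itself imports data; it makes no attempt to handle
the bracket lines or gap lines used by the other formats in this
guide.</p>

```python
def read_fasta(text: str):
    """Return a list of (definition line, sequence) pairs from FASTA text."""
    records = []
    defline, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:], []
        elif defline is None:
            raise ValueError("sequence data found before any definition line")
        else:
            # Sequence lines may be wrapped at any width; join them back up.
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">ABC-1 [organism=Saccharomyces cerevisiae]\nATTGCG\nTTATGG\n"
    for defline, sequence in read_fasta(example):
        print(defline, len(sequence))
```

<p>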
A
sample single sequence file is shown here:</p>

<pre>
>ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCATTGA
TGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT
</pre>

<a name="SegmentedSequences" id="SegmentedSequences"></a>
<h4>Segmented Nucleotide Sequences</h4>

<p>A segmented nucleotide entry is an earlier method for capturing
a set of non-contiguous sequences that has a defined order and
orientation. For example, a genomic DNA segmented set could include
encoding exons along with fragments of their flanking introns. An
example of an mRNA segmented pair of records would be the 5' and 3'
ends of an mRNA, where the middle region has not been sequenced. To
import nucleotides in a segmented set, each individual sequence
must be in FASTA format with an appropriate definition line, and
all sequences should be in the same file. Organism information
should only be included in the definition line for the first
segment. Notice that there is a square open bracket on a line by
itself before the first segment and a square close bracket on a
line by itself after the last segment. These square brackets are
required if you are importing multiple segmented sequences, but may
be omitted if you are importing a file that contains all of the
segments and using the "segmented sequence" format. Sequin will
also generate an additional sequence to represent the combination
of the segments, and that sequence will have a distinct sequence
ID. A sample segmented sequence file is shown here:</p>

<pre>
[
>m_gagei_seg1 [organism=Mansonia gagei] Mansonia gagei NADH dehydrogenase ...
ATGGAGCATACATATCAATATTCATGGATCATACCGTTTGTGCCACTTCCAATTCCTATTTTAATAGGAA
TTGGACTCCTACTTTTTCCGACGGCAACAAAAAATCTTCGTCGTATGTGGGCTCTTCCCAATATTTTATT
GTTAAGTATAGTTATGATTTTTTCGGTCGATCTGTCCATTCAGCAAATAAATAAAAGTTCTATCTATCAA
TATGTATGGTCTTGGACCATCAATAATGATTTTTCTTTCGAGTTTGGCTACTTTATTGATTCGCTTACCT
AGTTCGAATTTGATACAAATTTATATTTTTTGGGAATTAGTTGGAATGTGTTCTTATCTATTAATAGGGT
TTTGGTTCACACGACCCGCTGCGGCAAACGCCTGTCAAAAAGCATTTGTAACTAATCGGATAGGCGATTT
TGGTTTATTATTAGGAATCTTAGGTTTTTATTGGATAACGGGAAGTTTCGAATTTCAAGATTTGTTCGAA
ATATTTAATAACTTGATTTATAATAATGAGGTTCAGTTTTTATTTGTTACTTTATGTGCCTCTTTATTA
>m_gagei_seg2
GGTATAATAACAGTATTATTAGGGGCTACTTTAGCTCTTGC
TCAAAAAGATATTAAGAGGGGTTTAGCCTATTCTACAATGTCCCAACTGGGTTATATGATGTTAGCTCTA
GGTATGGGGTCTTATCGAGCCGCTTTATTTCATTTGATTACTCATGCTTATTCGAAGGCATTGTTGTTTT
TAGGATCCGGATCCGTTATTCATTCCATGGAAGCTATTGTTGGATATTCTCCAGATAAAAGCCAGAATAT
GGTTTTTATGGGCGGTTTAAGAAAGCATGTGCCAATTACACAAATTGCTTTTTTAGTGGGTACACTTTCT
CTTTGTGGTATTCCACCCCTTGCTTGTTTTTGGTCCAAAGATGAAATTCTTAGTGACAGCTGGTTGT
>m_gagei_seg3
TCAATAAAACTATGGGGTAAAGAAGAACAAAAAATAATTAACAGAAATTTTCGTTTATCTCCTTTATTAA
TATTAACGATGAATAATAATGAGAAGCCATATAGAATTGGTGATAATGTAAAAAAAGGGGCTCTTATTAC
TATTACGAGTTTTGGCTACAAGAAGGCTTTTTCTTATCCTCATGAATCGGATAATACTATGCTATTTCCT
ATGCTTATATTGGCTCTATTTACTTTTTTTGTTGGAGCCATAGCAATTCCTTTTAATCAAGAAGGACTAC
ATTTGGATATATTATCCAAATTATTAACTCCATCTATAAATCTTTTACATCAAAATTCAAATGATTTTGA
GGATTGGTATCAATTTTTAACAAATGCAACTCTTTCAGTGAGTATAGCCTGTTTCGGAATATTTACAGCA
TTCCTTTTATATAAGCCTTTTTATTCATCTTTACAAAATTTGAACTTACTAAATTTATTTTCGAAAGGGG
GTCCTAAAAGAATTTTTTTGGATAAAATAATATACTTGATATACGATTGGTCATATAATCGTGGTTACAT
AGATACGTTTTATTCAGTATCCTTAACAAAAGGTATAAGAGGATTGGCCGAACTAACTCATTTTTTTGAT
AGGCGAGTAATCGATGGAATTACAAATGGAGTACGCATCACAAGTTTTTTTATAGGCGAAGGTATCAAAT
ATT
]
</pre>

<a name="GappedSequences" id="GappedSequences"></a>
<h4>Gapped Sequences</h4>

<p>A gapped sequence represents a newer method for describing
non-contiguous sequences, but only requires a single sequence
identifier. A gap is represented by a line that starts with >?
and is immediately followed by either a length (for gaps of known
length) or "unk100" for gaps of unknown length. For example,
">?200". The next sequence segment continues on the next line,
with no separate definition line or identifier. The difference
between a gapped sequence and a segmented sequence is that the
gapped sequence uses a single identifier and can specify known
length gaps. Gapped sequences are preferred over segmented
sequences. A sample gapped sequence file is shown here:</p>

<pre>
>m_gagei [organism=Mansonia gagei] Mansonia gagei NADH dehydrogenase ...
ATGGAGCATACATATCAATATTCATGGATCATACCGTTTGTGCCACTTCCAATTCCTATTTTAATAGGAA
TTGGACTCCTACTTTTTCCGACGGCAACAAAAAATCTTCGTCGTATGTGGGCTCTTCCCAATATTTTATT
GTTAAGTATAGTTATGATTTTTTCGGTCGATCTGTCCATTCAGCAAATAAATAAAAGTTCTATCTATCAA
TATGTATGGTCTTGGACCATCAATAATGATTTTTCTTTCGAGTTTGGCTACTTTATTGATTCGCTTACCT
AGTTCGAATTTGATACAAATTTATATTTTTTGGGAATTAGTTGGAATGTGTTCTTATCTATTAATAGGGT
TTTGGTTCACACGACCCGCTGCGGCAAACGCCTGTCAAAAAGCATTTGTAACTAATCGGATAGGCGATTT
TGGTTTATTATTAGGAATCTTAGGTTTTTATTGGATAACGGGAAGTTTCGAATTTCAAGATTTGTTCGAA
ATATTTAATAACTTGATTTATAATAATGAGGTTCAGTTTTTATTTGTTACTTTATGTGCCTCTTTATTA
>?200
GGTATAATAACAGTATTATTAGGGGCTACTTTAGCTCTTGC
TCAAAAAGATATTAAGAGGGGTTTAGCCTATTCTACAATGTCCCAACTGGGTTATATGATGTTAGCTCTA
GGTATGGGGTCTTATCGAGCCGCTTTATTTCATTTGATTACTCATGCTTATTCGAAGGCATTGTTGTTTT
TAGGATCCGGATCCGTTATTCATTCCATGGAAGCTATTGTTGGATATTCTCCAGATAAAAGCCAGAATAT
GGTTTTTATGGGCGGTTTAAGAAAGCATGTGCCAATTACACAAATTGCTTTTTTAGTGGGTACACTTTCT
CTTTGTGGTATTCCACCCCTTGCTTGTTTTTGGTCCAAAGATGAAATTCTTAGTGACAGCTGGTTGT
>?unk100
TCAATAAAACTATGGGGTAAAGAAGAACAAAAAATAATTAACAGAAATTTTCGTTTATCTCCTTTATTAA
TATTAACGATGAATAATAATGAGAAGCCATATAGAATTGGTGATAATGTAAAAAAAGGGGCTCTTATTAC
TATTACGAGTTTTGGCTACAAGAAGGCTTTTTCTTATCCTCATGAATCGGATAATACTATGCTATTTCCT
ATGCTTATATTGGCTCTATTTACTTTTTTTGTTGGAGCCATAGCAATTCCTTTTAATCAAGAAGGACTAC
ATTTGGATATATTATCCAAATTATTAACTCCATCTATAAATCTTTTACATCAAAATTCAAATGATTTTGA
GGATTGGTATCAATTTTTAACAAATGCAACTCTTTCAGTGAGTATAGCCTGTTTCGGAATATTTACAGCA
TTCCTTTTATATAAGCCTTTTTATTCATCTTTACAAAATTTGAACTTACTAAATTTATTTTCGAAAGGGG
GTCCTAAAAGAATTTTTTTGGATAAAATAATATACTTGATATACGATTGGTCATATAATCGTGGTTACAT
AGATACGTTTTATTCAGTATCCTTAACAAAAGGTATAAGAGGATTGGCCGAACTAACTCATTTTTTTGAT
AGGCGAGTAATCGATGGAATTACAAATGGAGTACGCATCACAAGTTTTTTTATAGGCGAAGGTATCAAAT
ATT
</pre>

<a name="AlignmentFormats" id="AlignmentFormats"></a>
<h3>Alignment Formats</h3>

<p>Once you have created your alignment file, be sure to note the
characters used to indicate ambiguous bases, bases that match the
master sequence, and gaps in the alignment. Be aware that some
alignment formats use different characters to indicate gaps used to
pad sequences at the beginning, middle, and end of the alignment.
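</p>

<p>To check which padding an aligned sequence actually contains, the
gap characters at the beginning, in the middle, and at the end can be
counted separately. The Python sketch below is a hypothetical helper
(<tt>classify_gaps</tt> is not a Sequin function); the
<tt>gap_chars</tt> argument is an assumption that lets you list
whichever characters your alignment program uses.</p>

```python
def classify_gaps(aligned: str, gap_chars: str = "-"):
    """Count gap characters at the beginning, middle, and end of a sequence."""
    without_leading = aligned.lstrip(gap_chars)
    leading = len(aligned) - len(without_leading)
    core = without_leading.rstrip(gap_chars)
    trailing = len(without_leading) - len(core)
    # Anything still matching a gap character lies between real residues.
    middle = sum(1 for ch in core if ch in gap_chars)
    return {"beginning": leading, "middle": middle, "end": trailing}


if __name__ == "__main__":
    print(classify_gaps("---ATT--GC-"))
    # Treat both '?' and '-' as padding characters.
    print(classify_gaps("??AC-GT", gap_chars="?-"))
```

<p>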
You will be able to specify these characters separately before
importing the alignment file.</p>

<a name="FASTAplusGAP" id="FASTAplusGAP"></a>
<h4>FASTA+GAP</h4>

<pre>
>ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
---ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCAT
TGATGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT
>ABC-2 [organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
GATATTGCTTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCAT
TGATGCACCTGGACACAGAAATTTCATCAAGAACATGATCACTGGTACTT
>ABC-3 [organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
---ATTGCTTTATGGAAATTCGAAACTGCCAAATACTATGTTA-------
TGATGCACCTGGACACAGAGATTTCATCAAAAACATGATCACTGGTACTT
</pre>

<a name="PHYLIPformat" id="PHYLIPformat"></a>
<h4>PHYLIP</h4>

<pre>
3 100
ABC-1 ---ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC-2 GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC-3 ---ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------

TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT

>[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
</pre>

<a name="NEXUSInterleaved" id="NEXUSInterleaved"></a>
<h4>NEXUS Interleaved</h4>

<pre>
#NEXUS

begin data;
dimensions ntax=3 nchar=100;
format datatype=dna missing=?
gap=- interleave ;
matrix

[ 1 50]
ABC_1 ???ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC_2 GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC_3 ???ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------

[ 51 100]
ABC_1 TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_2 TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_3 TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
;
END;

begin ncbi;
sequin
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
;
end;
</pre>

<a name="NEXUSContiguous" id="NEXUSContiguous"></a>
<h4>NEXUS Contiguous</h4>

<pre>
#NEXUS

begin data;
dimensions ntax=3 nchar=100;
format datatype=dna missing=? gap=- ;
matrix

ABC_1
???ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_2
GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_3
???ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------
TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
;
END;

begin ncbi;
sequin
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
;
end;
</pre>

<a name="SetsOfSegmentedSequences" id="SetsOfSegmentedSequences"></a>
<h4>Sets of Segmented Sequences</h4>

<p>If the sequences in a phylogenetic study are really segmented
(e.g., exons 2 and 3 of a gene without intron 2), the individual
segments from a single organism can be grouped within square
brackets. Subsequent segments are detected by the presence of a
FASTA definition line.
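</p>

<p>The grouping rule just described can be sketched in a few lines of
Python: a line containing only [ opens a set, a line containing only ]
closes it, and each definition line inside starts a new segment. The
function name <tt>read_segmented_sets</tt> is hypothetical, and the
sketch is an illustration of the convention rather than Sequin's
importer.</p>

```python
def read_segmented_sets(text: str):
    """Return one list of (definition line, sequence) pairs per [ ... ] set."""
    sets, current, in_set = [], None, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "[":
            if in_set:
                raise ValueError("'[' found inside an open set")
            in_set, current = True, []
        elif line == "]":
            if not in_set:
                raise ValueError("']' found without a matching '['")
            sets.append([(d, s) for d, s in current])
            in_set, current = False, None
        elif line.startswith(">"):
            if not in_set:
                raise ValueError("definition line outside a bracketed set")
            current.append([line[1:], ""])  # a new segment begins here
        else:
            if not in_set or not current:
                raise ValueError("sequence data without a definition line")
            current[-1][1] += line  # extend the current segment
    if in_set:
        raise ValueError("missing closing ']'")
    return sets


if __name__ == "__main__":
    example = "[\n>seg1 [organism=Quercus rubra]\nCGAA\n>seg2\nCATC\n]\n"
    for group in read_segmented_sets(example):
        print([defline for defline, _ in group])
```

<p>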
For example:</p>

<pre>
[
>Qruex2 [organism=Quercus rubra]
CGAAAACCTGCACAGCAGAAACGACTCGCAAACTAGTAATAACTGACGGAGGACGGAGGG ...
>Qruex3
CATCATTGCCCCCCATCCTTTGGTTTGGTTGGGTTGGAAGTTCACCTCCCATATGTGCCC ...
]
[
>Qsuex2 [organism=Quercus suber]
CAAACCTACACAGCAGAACGACTCGAGAACTGGTGACAGTTGAGGAGGGCAAGCACCTTG ...
>Qsuex3
CATCGTTGCCCCCCTTCTTTGGTTTGGTTGGGTTGGAAGTTGGCCTTCCATATGTGCCCT ...
]
...
</pre>

<p>FASTA+GAP format can also use this convention for encoding sets
of aligned segmented sequences.</p>

<a name="CreatingASubmission" id="CreatingASubmission"></a>
<h2>Creating a Submission</h2>

<p>The sequence data we will use for this example is the genomic
sequence of the <span class="taxonomy">Drosophila melanogaster</span>
eukaryotic initiation factors 4E-I and 4E-II (GenBank Accession number
U54469).</p>

<a name="BasicSequinOrganization" id="BasicSequinOrganization"></a>
<h3>Basic Sequin Organization</h3>

<p>Sequin is organized into a series of forms for entering
submitting authors, entering organism and sequences, entering
information such as strain, gene, and protein names, viewing the
complete submission, and editing and annotating the submission. The
goal is to go quickly from raw sequence data to an assembled record
that can be viewed, edited, and submitted to your database of
choice.</p>

<p>Advance through the pages that make up each form by clicking on
labeled folder tabs or the <span class="buttonlabel">Next Page</span>
button. After the basic information forms have been completed and the
sequence data imported, Sequin provides a complete view of your
submission, in your choice of text or graphic format.
At this point,
any of the information fields can be easily modified by double-clicking
on any area of the record, and additional biological annotations can be
entered by selecting from a menu.</p>

<p>Sequin has an on-screen <span class="buttonlabel">Help</span> file
that is opened automatically when you start the program. Because it is
context sensitive, the <span class="buttonlabel">Help</span> text will
change and follow your steps as you progress through the program. A
"Find" function is also provided.</p>

<a name="WelcomeToSequinForm" id="WelcomeToSequinForm"></a>
<h3>Welcome to Sequin Form</h3>

<p><img class="figure" src="images/welcome.png" alt=
"Welcome to Sequin Form" /></p>

<p>Once you have finished preparing the sequence files, you are
ready to start the Sequin program. Sequin's first window asks you
to indicate the database to which the sequence will be submitted
and prompts you to start a new project or continue with an existing
one. Once you choose a database, Sequin will remember it in
subsequent sessions. In general, each sequence submission should be
entered as a separate project. However, segmented DNA sequences,
gapped sequences, population studies, phylogenetic studies, and
mutation studies should be submitted together as one project. This
feature also eliminates the need to save Sequin information
templates for each sequence.</p>

<p>To begin creating your submission, click the <span
class="buttonlabel">Start New Submission</span> button.</p>

<a name="SubmittingAuthorsForm" id="SubmittingAuthorsForm"></a>
<h3>Submitting Authors Form</h3>

<p>The pages in the <span class="dialoglabel">Submitting Authors</span>
form ask you to provide the release date, a working title, names and
contact information of submitting authors, and affiliation information.
770 To create a personal template for use in future submissions, use the 771 <span class="menulabel">File->Export</span> menu item after 772 completing each page of this form.</p> 773 774 <a name="SubmissionPage" id="SubmissionPage"></a> 775 <h4>Submission Page</h4> 776 777 <p><img class="figure" src="images/submit.png" alt= 778 "Submission Page" /></p> 779 780 <p>The <span class="folderlabel">Submission</span> page asks for a 781 tentative title for a manuscript describing the sequence and will 782 initially mark the manuscript as being unpublished. When the article is 783 published, the database staff will update the sequence record with the 784 new citation. This page also lets you indicate that a record should be 785 held confidential by the database until a specified date, although the 786 preferred policy is to release the record immediately into the public 787 databases.</p> 788 789 <a name="ContactPage" id="ContactPage"></a> 790 <h4>Contact Page</h4> 791 792 <p><img class="figure" src="images/contact.png" alt= 793 "Contact Page" /></p> 794 795 <p>The <span class="folderlabel">Contact</span> page asks for the name, 796 phone number, and email address of the person responsible for making 797 the submission. Database staff members will contact this person if 798 there are any questions about the record.</p> 799 800 <p>The Sfx (suffix) popup is used to enter personal name suffixes 801 (e.g., Jr., Sr., or III), not a person's academic degrees (e.g., MD 802 or PhD). Also, it is not necessary to type periods after 803 initials.</p> 804 805 <a name="AuthorsPage" id="AuthorsPage"></a> 806 <h4>Authors Page</h4> 807 808 <p><img class="figure" src="images/authors.png" alt= 809 "Authors Page" /></p> 810 811 <p>In the <span class="folderlabel">Authors</span> page, enter the 812 names of the people who should get scientific credit for the sequence 813 presented in this record. 
These will become the authors for the initial 814 (unpublished) manuscript.</p> 815 816 <p>Authors are entered in a spreadsheet. As soon as anything is 817 typed in the last row, a new (blank) row is added below it. Use the 818 tab key to move between fields. Tabbing from the last column 819 automatically moves to the First Name column in the next row.</p> 820 821 <a name="AffiliationPage" id="AffiliationPage"></a> 822 <h4>Affiliation Page</h4> 823 824 <p><img class="figure" src="images/affil.png" alt= 825 "Affiliation Page" /></p> 826 827 <p>The <span class="folderlabel">Affiliation</span> page asks for the 828 institutional affiliation of the primary author.</p> 829 830 <a name="SequenceFormatForm" id="SequenceFormatForm"></a> 831 <h3>Sequence Format Form</h3> 832 833 <p><img class="figure" src="images/format.png" alt= 834 "Format Form" /></p> 835 836 <p>With Sequin, the actual sequence data are imported from an 837 outside data file. So before you begin, prepare your sequence data 838 files using a text editor, perhaps one associated with your 839 laboratory sequence analysis software (see "<a href= 840 "#BeforeYouBegin">Before you Begin</a>").</p> 841 842 <a name="SubmissionType" id="SubmissionType"></a> 843 <h4>Submission Type</h4> 844 845 <p>If you have sequence data from a single source, choose from one of 846 the following submission types:</p> 847 848 <ul> 849 <li><span class="buttonlabel">Single Sequence</span> if you have a 850 single contiguous mRNA or genomic DNA sequence.</li> 851 <li><span class="buttonlabel">Segmented Sequence</span> if you have a 852 single collection of non-overlapping, non-contiguous sequences that 853 cover a specified genetic region from a single source. A standard 854 example is a set of genomic DNA sequences that encode exons from a gene 855 along with fragments of their flanking introns.</li> 856 <li><span class="buttonlabel">Gapped Sequence</span> if you have a 857 single non-contiguous mRNA or genomic DNA sequence. 
A gapped sequence 858 contains specified gaps of known or unknown length where the exact 859 nucleotide sequence has not been determined.</li> 860 </ul> 861 862 <p>See <a href="#BeforeYouBegin">Before You Begin</a> if you have 863 questions about how to format your files or about the differences 864 between these formats.</p> 865 866 <p>If you have a set of single sequences, segmented sequences, or 867 gapped sequences or a mixture of these types of sequences, you will 868 need to choose one of the following submission types:</p> 869 870 <ul> 871 <li><span class="buttonlabel">Population Study</span> for a set derived 872 by sequencing the same gene from different isolates of the same 873 organism.</li> 874 <li><span class="buttonlabel">Phylogenetic Study</span> for a set 875 derived by sequencing the same gene from different organisms.</li> 876 <li><span class="buttonlabel">Mutation Study</span> for a set derived 877 by sequencing multiple mutations of a single gene.</li> 878 <li><span class="buttonlabel">Environmental Samples</span> for a set 879 derived by sequencing the same gene from a population of unclassified 880 or unknown organisms.</li> 881 <li><span class="buttonlabel">Batch Submission</span> for a set that is 882 not a population study, mutation study, phylogenetic study, or 883 environmental samples. The sequences should be related in some way, 884 such as coming from the same publication or organism. 
You should plan 885 that all sequences will be released to the public on the same date.</li> 886 </ul> 887 888 <a name="SequenceDataFormat" id="SequenceDataFormat"></a> 889 <h4>Sequence Data Format</h4> 890 891 <p>If you have chosen <span class="buttonlabel">Single Sequence</span>, 892 <span class="buttonlabel">Segmented Sequence</span>, <span 893 class="buttonlabel">Gapped Sequence</span>, or <span 894 class="buttonlabel">Batch Submission</span> for the submission type, 895 you will only be able to select <span class="buttonlabel">FASTA (no 896 alignment).</span></p> 897 898 <p>If you have chosen one of the other submission types, you may import 899 the sequences in FASTA format, or you may choose to import the 900 sequences using an alignment file by selecting 901 <span class="buttonlabel">Alignment (FASTA+GAP, NEXUS, PHYLIP, etc.)</span>. 902 See <a href= "#AlignmentFormats">Alignment Formats</a> for an 903 explanation of the available formats for alignment files.</p> 904 905 <a name="SubmissionCategory" id="SubmissionCategory"></a> 906 <h4>Submission Category</h4> 907 908 <p>Choose <span class="buttonlabel">Original Submission</span> if you 909 have directly sequenced the nucleotide sequence in your laboratory.</p> 910 911 <p>Choose <span class="buttonlabel">Third Party Annotation</span> if 912 you have downloaded or assembled sequence from GenBank and modified it 913 with your own annotations. 
See <a href= 914 "http://www.ncbi.nih.gov/Genbank/TPA.html"> 915 <tt>http://www.ncbi.nih.gov/Genbank/TPA.html</tt></a> for more information 916 about Third Party Annotation rules.</p> 917 918 <a name="OrganismAndSequencesForm" id="OrganismAndSequencesForm"></a> 919 <h3>Organism and Sequences Form</h3> 920 921 <p>The <span class="dialoglabel">Organism and Sequences</span> form has 922 been enhanced with a number of Assistants that allow entry or editing 923 of sequence and source information.</p> 924 925 <a name="NucleotidePage" id="NucleotidePage"></a> 926 <h4>Nucleotide Page</h4> 927 928 <p>The <span class="folderlabel">Nucleotide</span> page will have one of 929 three appearances, based on whether you have chosen to import a single 930 sequence, a set of sequences, or an alignment.</p> 931 932 <a name="NucleotidePageSingleSequence" id="NucleotidePageSingleSequence"></a> 933 <h5>Importing Nucleotide FASTA for a Single Sequence</h5> 934 935 <p><img class="figure" src="images/nucsing1.png" alt= 936 "Single Sequence Page" /></p> 937 938 <p>To import a single sequence, click on <span 939 class="buttonlabel">Import Nucleotide FASTA</span> and enter the name 940 of the file that contains your FASTA sequence. See 941 <a href="http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#BeforeYouBegin"> 942 Before You Begin</a> for information on how to format your FASTA file. 
943 In addition to importing from a file, sequences can also be read by 944 pasting from the computer's "clipboard" using the <span 945 class="menulabel">Edit->Paste</span> menu item or by using the <span 946 class="buttonlabel">Add/Modify Sequences</span> button.</p> 947 948 <a name="NucleotidePageSequenceSet" id="NucleotidePageSequenceSet"></a> 949 <h5>Importing Nucleotide FASTA for a Sequence Set</h5> 950 951 <p><img class="figure" src="images/nucset.png" alt= 952 "Sequence Set Page" /></p> 953 954 <p>To import a set of sequences, click on <span 955 class="buttonlabel">Import Nucleotide FASTA</span> and enter the name 956 of the file that contains some or all of your FASTA sequences. See 957 <a href="http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#BeforeYouBegin"> 958 Before You Begin</a> for information on how to format your FASTA file. 959 You may click on <span class="buttonlabel">Import Additional Nucleotide 960 FASTA</span> to import additional files if your sequences are in more 961 than one file. In addition to importing from a file, sequences can also 962 be read by pasting from the computer's "clipboard" using the <span 963 class="menulabel">Edit->Paste</span> menu item or by using the <span 964 class="buttonlabel">Add/Modify Sequences</span> button.</p> 965 966 <p>If you would like to create an alignment for your set of sequences, 967 check <span class="buttonlabel">Create Alignment</span> on this page.</p> 968 969 <a name="NucleotidePageAlignment" id="NucleotidePageAlignment"></a> 970 <h5>Importing an Alignment</h5> 971 972 <p><img class="figure" src="images/nucaln.png" alt= 973 "Importing an Alignment" /></p> 974 975 <p>See <a href= 976 "http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#BeforeYouBegin"> 977 Before You Begin</a> for information on how to format your 978 alignment file. 
Before importing your alignment, choose which 979 characters in the alignment file represent gaps, ambiguous or 980 unknown nucleotides, and "matches".</p> 981 982 <p>Some data files distinguish between gaps at the beginning, in the 983 middle, and at the end of a sequence. These characters can be 984 entered separately if needed, or you may specify the same character 985 for all three kinds of gaps if appropriate.</p> 986 987 <p><span class="textlabel">Ambiguous/Unknown</span> characters 988 represent nucleotides that are present in the sequence but were not 989 sequenced. Usually this is "N". <span class="textlabel">Match</span> 990 characters are characters in a sequence other than the first that match 991 the character at that alignment position in the first sequence. When 992 match characters are used, they are usually specified as ".". Because 993 "." is also frequently used as a gap character when match characters 994 are not used, ":" is supplied as the default instead.</p> 995 996 <p>You may specify more than one character for each of these 997 categories. When you have filled out the character information, click 998 on <span class="buttonlabel">Import Nucleotide Alignment</span> and 999 enter the name of your alignment file.</p> 1000 1001 <a name="AfterImporting" id="AfterImporting"></a> 1002 <h5>After Importing Files</h5> 1003 1004 <p><img class="figure" src="images/nucsing2.png" alt= 1005 "After Importing Files" /></p> 1006 1007 <p>When the sequence file or alignment file import is complete, a box 1008 will appear showing the number of nucleotide segments imported, the 1009 total length in nucleotides of the sequences entered, and the sequence 1010 ID(s) you designated. The actual sequence data are <b>not</b> 1011 shown. 
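</p>

<p>The same three figures (number of sequences, total length, and
sequence IDs) can be tallied independently from the FASTA file for
comparison with the box. The following Python sketch is an illustration
only and is not part of Sequin; the function name
<tt>summarize_fasta</tt> and the sample sequences are hypothetical. It
assumes that the sequence ID is the first token after the ">" and that
lines consisting of a lone "[" or "]" are set brackets.</p>

<pre>
```python
def summarize_fasta(text):
    """Return (number of sequences, total letters, list of IDs) for FASTA text."""
    ids = []
    total = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line in ("[", "]"):
            # Skip blank lines and the brackets that group sequence sets.
            continue
        if line.startswith(">"):
            # Assumption: the ID is the first token on the definition line.
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
        else:
            # Count letters only, so spaces, digits, and gap symbols are ignored.
            total += sum(1 for ch in line if ch.isalpha())
    return len(ids), total, ids


# Made-up sample data, for illustration only.
sample = """>Seq1 [organism=Quercus rubra]
CGAAAACCTG
CACAG
>Seq2
CATCATTG
"""

count, length, names = summarize_fasta(sample)
print(count, length, names)
```
</pre>

<p>A difference between these figures and those reported in the box
usually points to a formatting problem in the file.</p>

<p>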
If any of this information is missing or incorrect, check the 1012 file containing the sequence data for proper FASTA format, click on the 1013 <span class="buttonlabel">Clear Sequences</span> button, then reimport 1014 the sequence(s).</p> 1015 1016 <p>If the imported nucleotide sequence or sequences or alignment 1017 have any problems, such as colliding local identifiers in a set or 1018 mismatched brackets in the definition line, an Assistant dialog 1019 appears to help correct the problems. Severe problems must be fixed 1020 before you can continue with the Sequin submission.</p> 1021 1022 <a name="OrganismPage" id="OrganismPage"></a> 1023 <h4>Organism Page</h4> 1024 1025 <p><img class="figure" src="images/organism.png" alt= 1026 "Organism Page" /></p> 1027 1028 <p>The second page of the <span class="folderlabel">Organism and 1029 Sequences</span> form requests information regarding the scientific 1030 name of the organism from which the sequence was derived, if it was not 1031 already encoded in the nucleotide FASTA file. There are Assistants for 1032 manually adding organism name information or adding source 1033 qualifiers.</p> 1034 1035 <p>Sequin has extracted the organism and strain names from the FASTA 1036 definition line in this example, eliminating the need to manually enter 1037 information in the <span class="folderlabel">Organism</span> page.</p> 1038 1039 <a name="ProteinPage" id="ProteinPage"></a> 1040 <h4>Proteins Page</h4> 1041 1042 <p><img class="figure" src="images/protein1.png" alt= 1043 "Proteins Page" /></p> 1044 1045 <p>If your sequence or sequences encode one or more proteins, you can 1046 enter the sequences of the protein products in this page. To import the 1047 amino acid sequences, click on the <span 1048 class="folderlabel">Proteins</span> folder tab and click on the <span 1049 class="buttonlabel">Import Protein FASTA</span> button. 
You may import 1050 more than one file by clicking the button again after importing the 1051 first file. See <a href="#BeforeYouBegin">Before You Begin</a> for 1052 information on how to format your protein files.</p> 1053 1054 <p><img class="figure" src="images/protein2.png" alt= 1055 "Proteins Example" /></p> 1056 1057 <p>In this example, we imported two protein sequences. These are 1058 the alternative splice products of the same gene. Both protein 1059 sequences were in the same data file, but each had its own 1060 definition line.</p> 1061 1062 <p>Sequin has extracted the gene and protein names from the FASTA 1063 definition lines, and will use these to construct the initial 1064 sequence record.</p> 1065 1066 <a name="AnnotationPage" id="AnnotationPage"></a> 1067 <h4>Annotation Page</h4> 1068 1069 <p><img class="figure" src="images/annot.png" alt= 1070 "Annotation Page" /></p> 1071 1072 <p>The <span class="folderlabel">Annotation</span> page allows you to 1073 add an rRNA or CDS feature to the entire length of all sequences in the 1074 set. In addition, you can add a title to any sequences that didn't 1075 obtain them from a FASTA definition line. It is much easier to add 1076 these in bulk at this step than to add individual rRNA or CDS features 1077 to each sequence after the record is constructed.</p> 1078 1079 <p>It is customary in a nucleotide record to format titles for 1080 sequences containing coding region features in the following 1081 way:</p> 1082 1083 <p>Genus species protein name (gene symbol) mRNA/gene, 1084 complete/partial cds.</p> 1085 1086 <p>The choice of "mRNA" or "gene" depends upon the molecule type (use 1087 "mRNA" for mRNA or cDNA, and "gene" for genomic DNA). Use "partial" for 1088 incomplete features. 
The proper organism name in a phylogenetic study 1089 can be added to the beginning of each title automatically by checking 1090 the <span class="buttonlabel">Prefix title with organism name</span> 1091 box.</p> 1092 1093 <p>However, for records containing CDS, rRNA, or tRNA features, 1094 Sequin can generate the definition line automatically by computing 1095 on the features (see "<a href="#autodefline">Generating the 1096 Definition Line</a>").</p> 1097 1098 <p>More complex situations, such as a population study of HIV 1099 sequences, can include multiple CDS features in each sequence. In this 1100 case, do not use the <span class="folderlabel">Annotation</span> page 1101 to create features. (You can still use it for a common title, however.) 1102 After the initial submission has been created, you would manually 1103 annotate features onto one of the sequences. If you are submitting an 1104 alignment, or if you are submitting a set of sequences and you have 1105 checked <span class="buttonlabel">Create Alignment</span> on the 1106 <span class="folderlabel">Nucleotide</span> page, you will be able to 1107 use feature propagation to annotate the same features at the equivalent 1108 aligned locations on the remaining sequences.</p> 1109 1110 <a name="viewing" id="viewing"></a> 1111 <h2>Viewing Your Submission</h2> 1112 1113 <a name="GenBankView" id="GenBankView"></a> 1114 <h3>GenBank View</h3> 1115 1116 <p>After you have completed importing the data files, Sequin will 1117 display your full submission information in the GenBank format (or 1118 EMBL format if you chose EMBL as the database for submission in the 1119 first form).</p> 1120 1121 <p><img class="figure" src="images/genbank.png" alt= 1122 "GenBank Format" /></p> 1123 1124 <p>On the basis of the information provided in your DNA and amino 1125 acid sequence files, any coding regions will be automatically 1126 identified and annotated for you. 
The figure shows only the top 1127 portion of the GenBank record, but you can see the first of two 1128 coding region (CDS) features. The vertical bar to the left of the 1129 paragraph indicates that the CDS has been selected by clicking with 1130 the computer's mouse.</p> 1131 1132 <p>You may now make changes to the coding region, publication, source, 1133 and other features in the record by double clicking on the appropriate 1134 paragraphs in the GenBank display format. You may also use the <span 1135 class="menulabel">Annotate->Generate Definition Line</span> menu 1136 item to <a href="#autodefline">compute a definition line</a> for the 1137 annotated features in the record.</p> 1138 1139 <a name="GraphicalView" id="GraphicalView"></a> 1140 <h3>Graphical View</h3> 1141 1142 <p><img class="figure" src="images/graphic.png" alt= 1143 "Graphic Format" /></p> 1144 1145 <p>To get a graphical view, change the <span 1146 class="popuplabel">Format</span> popup menu from <span 1147 class="menulabel">GenBank</span> to <span 1148 class="menulabel">Graphic</span>. Reviewing your submission in Graphic 1149 format allows you to visually confirm expected location of exons, 1150 introns, and other features in multiple interval coding regions. The 1151 Graphic view in our eukaryotic initiation factor example illustrates 1152 how the coding region intervals for the two protein products are 1153 spatially related to each other.</p> 1154 1155 <p>The <span class="menulabel">File->Duplicate View</span> menu item 1156 will launch a second viewer on the record. The display format on each 1157 viewer can be independently set, allowing you to see a graphical view 1158 and a GenBank text report simultaneously. 
This is useful for getting an 1159 overall view of the features and seeing the details of annotation.</p> 1160 1161 <a name="SequenceView" id="SequenceView"></a> 1162 <h3>Sequence View</h3> 1163 1164 <p><img class="figure" src="images/sequence.png" alt= 1165 "Sequence Format" /></p> 1166 1167 <p>Sequence view is a static version of the sequence and alignment 1168 editor. It shows the actual nucleotide sequence, with feature 1169 intervals annotated directly on the sequence. Protein translations 1170 of CDS features are also shown, as are all features shown in the 1171 graphical view.</p> 1172 1173 <a name="editing" id="editing"></a> 1174 <h2>Editing and Annotating Your Submission</h2> 1175 1176 <p>At this point, Sequin could process your entry based on what you 1177 have entered so far, and you could send it to your nucleotide database 1178 of choice (as set in the initial form). However, to optimize the 1179 usefulness of your entry for the scientific community, you may want to 1180 provide additional information to indicate biologically significant 1181 regions of the sequence. But first, save the entry so that if you make 1182 any unwanted changes during the editing process you can revert to the 1183 original copy.</p> 1184 1185 <p>Additional information may be in the form of Descriptors or 1186 Features. Descriptors are annotations that apply to an entire 1187 sequence or set of sequences. They are used to remove redundant 1188 information in a record. 
Features are annotations that apply to a 1189 specific sequence interval.</p> 1190 1191 <p>Sequin provides two methods to modify your entry: (1) to edit 1192 existing information, double click on the text or graphic area you want 1193 to modify, and Sequin will display forms requesting needed information; 1194 or (2) to add new information, use the <span 1195 class="menulabel">Annotate</span> menu and select from the list of 1196 available annotations.</p> 1197 1198 <a name="SequenceEditor" id="SequenceEditor"></a> 1199 <h3>Sequence Editor</h3> 1200 1201 <p>Additional sequence data can be added using Sequin's sequence 1202 editor, which can be launched using the <span 1203 class="menulabel">Edit->Edit Sequence</span> menu item. Sequin will 1204 automatically adjust feature intervals as you edit the sequence. Prior 1205 to Sequin, it was usually easier to reannotate everything from scratch 1206 when the sequence changed. An even easier way to update sequences 1207 is described in the following section.</p> 1208 1209 <a name="UpdatingTheSequence" id="UpdatingTheSequence"></a> 1210 <h3>Updating the Sequence</h3> 1211 1212 <p>Sequin can also read in a replacement sequence, or an 1213 overlapping sequence extension, and perform the alignment and 1214 feature propagation calculations necessary to adjust feature 1215 intervals, even though the individual editing operations were not 1216 done with the sequence editor.</p> 1217 1218 <p>The <span class="menulabel">Edit->Update Sequence</span> submenu 1219 has several choices. These are for use by the original submitter of a 1220 record.</p> 1221 1222 <p>You can read a FASTA file or raw sequence file. This can be a 1223 replacement sequence, or it can overlap the original sequence at 1224 the 5' or 3' end. 
After Sequin aligns the two sequences, and you 1225 select optional parameters, the sequence in your record is updated, 1226 with all feature intervals adjusted properly.</p> 1227 1228 <p>You can also update with an existing sequence record that 1229 contains features. This can be obtained from a file, or retrieved 1230 from Entrez via an Accession number. The latter choice 1231 requires the <a href= 1232 "http://www.ncbi.nlm.nih.gov/Sequin/netaware.html">network-aware</a> 1233 version of Sequin. Once it gets the new record, Sequin aligns the 1234 two sequences as before. This is typically used either to merge two 1235 records that overlap, or to copy features from database records 1236 onto a new large contig.</p> 1237 1238 <p><img class="figure" src="images/update.png" alt= 1239 "Update Sequence Form" /></p> 1240 1241 <p>The first panel shows how the two sequences align to each other. 1242 In this case, the new sequence is a 5' extension of the existing sequence: 1243 400 bases are new, 70 bases overlap the old sequence, and 30 1244 bases of vector on the new sequence do not align to the old 1245 sequence and will be trimmed off.</p> 1246 1247 <p>The second panel shows details of the 70-base aligned region. 1248 There is a single one-base gap in each sequence. The total number of 1249 sequence letters plus gap characters is the alignment length, 71 in 1250 this example. (This number was shown between the sequence figures 1251 in the first panel.) Mismatched bases are indicated by vertical red 1252 lines between the two sequences.</p> 1253 1254 <p>The third panel shows the actual sequence letters in the aligned 1255 region. Clicking on a gap or mismatch in the second panel scrolls 1256 to the appropriate place in this panel.</p> 1257 1258 <p>Before pressing <span class="buttonlabel">Update Sequence</span>, 1259 you may set the optional parameters. 
The alignment relationship is 1260 calculated by Sequin, but in some cases you may want to replace or 1261 patch rather than extend the existing sequence.</p> 1262 1263 <a name="autodefline" id="autodefline"></a> 1264 <h3>Generating the Definition Line</h3> 1265 1266 <p>The <span class="menulabel">Annotate->Generate Definition 1267 Line</span> menu item can make the appropriate titles once the record 1268 has been annotated with features. The general format for sequences 1269 containing coding region features is:</p> 1270 1271 <p>Genus species protein name (gene symbol) mRNA/gene, 1272 complete/partial cds.</p> 1273 1274 <p>Exceptional cases, where this automatic function is unable to 1275 generate a reasonable definition line, will be edited by the 1276 database staff to conform to the style conventions.</p> 1277 1278 <p>The new definition line will replace any previous title, 1279 including that originally on the FASTA definition line.</p> 1280 1281 <a name="Validation" id="Validation"></a> 1282 <h3>Record Validation</h3> 1283 1284 <p>Once you are satisfied that you have entered all the relevant 1285 information, save your file! Then select the <span 1286 class="menulabel">Search->Validate</span> menu item. You will either 1287 receive a message that the validation test succeeded or see a screen 1288 listing the validation errors and warnings. Just double click on an 1289 error item to launch the appropriate editor for making corrections. 
The 1290 validator includes checks for such things as missing organism 1291 information, incorrect coding region lengths, internal stop codons in 1292 coding regions, inconsistent genetic codes, mismatched amino acids, and 1293 non-consensus splice sites.</p> 1294 1295 <p><img class="figure" src="images/validate.png" alt= 1296 "Record Validator Form" /></p> 1297 1298 <a name="SubmittingTheEntry" id="SubmittingTheEntry"></a> 1299 <h3>Submitting the Entry</h3> 1300 1301 <p>When the entry is properly formatted and error-free, click the <span 1302 class="buttonlabel">Done</span> button or select the <span 1303 class="menulabel">File->Prepare Submission</span> menu item. You 1304 will be prompted to save your entry and email it to the database you 1305 selected. The address for GenBank is <tt>gb-sub@ncbi.nlm.nih.gov</tt>. 1306 The address for EMBL is <tt>datasubs@ebi.ac.uk</tt>. The address for 1307 DDBJ is <tt>ddbjsub@ddbj.nig.ac.jp</tt>.</p> 1308 1309 <a name="Advanced" id="Advanced"></a> 1310 <h2>Advanced Topics</h2> 1311 1312 <a name="FeatureEditorDesign" id="FeatureEditorDesign"></a> 1313 <h3>Feature Editor Design</h3> 1314 1315 <p>Sequin uses a common structure for all feature editor forms, with 1316 (usually) three top-level folder tabs. One folder tab page is specific 1317 to the given feature type (biological source and publications have 1318 more). The <span class="folderlabel">Properties</span> and <span 1319 class="folderlabel">Location</span> pages are common to all features. 1320 Some of these pages may have subpages, accessed by a secondary set of 1321 smaller folder tabs. This organization allows editors for complex data 1322 structures to fit in a reasonably small window size. 
The most important 1323 information in a given section is always presented in the first 1324 subpage.</p> 1325 1326 <a name="CodingRegionPage" id="CodingRegionPage"></a> 1327 <h4>Coding Region Page</h4> 1328 1329 <p><img class="figure" src="images/cds_edit.png" alt= 1330 "Coding Region Page" /></p> 1331 1332 <p>The coding region editor is perhaps the most complicated form in 1333 Sequin. Within the <span class="folderlabel">Coding Region</span> page, 1334 the <span class="folderlabel">Product</span> subpage lets you predict 1335 the coding region intervals from the protein sequence or translate the 1336 protein sequence from the location. (Importing a protein sequence from 1337 a file will also interpret the [gene=...] and [protein=...] definition 1338 line information and automatically attempt to predict the coding region 1339 intervals.) It also displays the genetic code used for translation and 1340 the reading frame. (Please note that there are currently 17 different 1341 genetic codes present in Sequin. For more information on these, see <a 1342 href= "http://www.ncbi.nlm.nih.gov/Taxonomy/"> 1343 <tt>http://www.ncbi.nlm.nih.gov/Taxonomy/</tt></a>.)</p> 1344 1345 <p>The <span class="folderlabel">Protein</span> subpage lets you set 1346 the name (or, if not known, a description) of the protein product. The 1347 <span class="folderlabel">Exceptions</span> subpage allows you to 1348 indicate translation exceptions to the normal genetic code, such as 1349 insertion of selenocysteine, suppression of terminator codons by a 1350 suppressor tRNA, or completion of a stop codon by poly-adenylation of 1351 an mRNA.</p> 1352 1353 <p>Additional annotation on the protein product might include a leader 1354 peptide, transmembrane regions, disulfide bonds, or binding sites. 1355 These can be added after setting the <span class="popuplabel">Target 1356 Sequence</span> popup on the sequence viewer to the desired protein 1357 sequence. 
You can also launch a duplicate view, already targeted to the 1358 appropriate protein, from the <span class="folderlabel">Protein</span> 1359 subpage.</p> 1360 1361 <a name="PropertiesPage" id="PropertiesPage"></a> 1362 <h4>Properties Page</h4> 1363 1364 <p><img class="figure" src="images/props_pg.png" alt= 1365 "Properties Page" /></p> 1366 1367 <p>All features have a number of fields in common. The <span 1368 class="buttonlabel">Partial</span> box will be checked if either the 5' 1369 partial or the 3' partial box on the <span 1370 class="folderlabel">Location</span> page is selected. <span 1371 class="buttonlabel">Exception</span> means that the sequence of the 1372 protein product doesn't match the translation of the DNA sequence 1373 for a known biological reason (e.g., RNA editing). The <span 1374 class="popuplabel">Evidence</span> popup is now deprecated in favor of the <span 1375 class="folderlabel">Evidence</span> subpage.</p> 1376 1377 <p>In addition, nucleotide features (other than genes themselves) 1378 can reference a gene feature. This is frequently done by overlap. 1379 (The overlapping gene will show up on the feature as a /gene 1380 qualifier in GenBank format.) Extension of the feature location 1381 will automatically extend the gene that is selected in the editor. 1382 In rare cases, you may want to set a gene by cross-reference.</p> 1383 1384 <p>The <span class="folderlabel">Comment</span> subpage allows text to 1385 be associated with a feature. In GenBank format, this appears as a 1386 /note qualifier. The <span class="folderlabel">Citations</span> subpage 1387 attaches citations to the feature. (The citations should first be added 1388 to the record using items in the <span 1389 class="menulabel">Annotate->Publication</span> submenu, whereupon they 1390 will appear in the REFERENCE section.) For example, an article that 1391 justifies a non-obvious or controversial biological conclusion would be 1392 cited here. 
In GenBank format, for example, if the publication is 1393 listed as Reference 2, the feature citation appears as /citation=[2]. 1394 <span class="folderlabel">Cross-Refs</span> are cross-references to 1395 other databases. The contents of this subpage may only be changed by 1396 the GenBank, EMBL, or DDBJ database staff. <span 1397 class="folderlabel">Evidence</span> has experiment and inference 1398 qualifier fields. The experiment qualifier must include details of the 1399 experiment used to justify the annotation.</p> 1400 1401 <a name="LocationPage" id="LocationPage"></a> 1402 <h4>Location Page</h4> 1403 1404 <p><img class="figure" src="images/loc_page.png" alt= 1405 "Location Page" /></p> 1406 1407 <p>All features are required to have a location, i.e., one or more 1408 intervals on a sequence coordinate. The <span 1409 class="folderlabel">Location</span> page provides a spreadsheet for 1410 entering and editing this information. An arbitrary number of lines can 1411 be entered. In this coding region example, the intervals correspond to 1412 the exons. For an mRNA, the intervals would be the exons and UTRs. 
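</p>

<p>The interval list in the spreadsheet corresponds to a location string
in the GenBank flatfile. The following Python sketch is an illustration
only, not Sequin's own code; the function names are hypothetical. It
formats the exon intervals of this guide's coding region example and
then recomputes them for the reverse complement of the sequence,
assuming a sequence length of 2881 nucleotides and 1-based, inclusive
coordinates.</p>

<pre>
```python
def join_location(intervals):
    """Format 1-based inclusive (start, stop) pairs as a location string."""
    parts = ["%d..%d" % (start, stop) for start, stop in intervals]
    if len(parts) == 1:
        return parts[0]
    return "join(" + ",".join(parts) + ")"


def reverse_complemented(intervals, seq_length):
    """Map intervals onto the reverse complement of the sequence.

    Position p becomes seq_length + 1 - p, so each interval swaps its
    ends and the order of the intervals is reversed.
    """
    return [(seq_length + 1 - stop, seq_length + 1 - start)
            for start, stop in reversed(intervals)]


exons = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]

# Location on the original (plus) strand.
print(join_location(exons))

# Location after reverse complementing a 2881-nucleotide sequence.
flipped = reverse_complemented(exons, 2881)
print("complement(" + join_location(flipped) + ")")
```
</pre>

<p>The sketch leaves out partial locations, which add a marker in front
of the first or last coordinate.</p>

<p>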
The 5' Partial and 3' Partial check boxes will show up as
&lt; or &gt; in front of a feature coordinate in the GenBank flatfile,
indicating partial locations.</p>

<p>The GenBank flatfile view of this location would be:</p>

<pre>
join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
</pre>

<p>If the <span class="buttonlabel">5' Partial</span> and <span
class="buttonlabel">3' Partial</span> boxes were both checked, &lt; and &gt;
symbols would appear at the appropriate ends of the join statement:</p>

<pre>
join(&lt;201..224,1550..1920,1986..2085,2317..2404,2466..&gt;2629)
</pre>

<p>If the sequence was reverse complemented (based on a length of 2881
nucleotides), the <span class="popuplabel">Strand</span> popups would
all indicate <span class="popuplabel">Minus</span>, and the join
statement for the resulting feature location would be as follows:</p>

<pre>
complement(join(253..416,478..565,797..896,962..1332,2658..2681))
</pre>

<a name="NCBIDesktop" id="NCBIDesktop"></a>
<h3>NCBI DeskTop</h3>

<p><img class="figure" src="images/desktop.png" alt=
"NCBI DeskTop Window" /></p>

<p>The NCBI DeskTop is a window that directly displays the internal
structure of the record being viewed in Sequin. It can be
understood as a Venn diagram.</p>

<p>As with other views on a record, the DeskTop indicates selected
items and lets you select items by clicking.</p>

<p>In this example, Sequin was given the genomic nucleotide and protein
sequences for <span class="taxonomy">Drosophila melanogaster</span>
eukaryotic initiation factor 4E. It then determined the coding region
intervals and built an initial structure.
The organism (BioSource
descriptor) is attached to the nuc-prot set and thus applies to both the
nucleotide and protein sequences.</p>

<a name="AdditionalInformation" id="AdditionalInformation"></a>
<h3>Additional Information</h3>

<p>The Sequin homepage <a href=
"http://www.ncbi.nlm.nih.gov/Sequin/">
<tt>http://www.ncbi.nlm.nih.gov/Sequin/</tt></a>
has a Frequently Asked Questions section and more detailed
instructions on using the capabilities of network-aware Sequin.</p>

<a name="Reference" id="Reference"></a>
<h2>Reference</h2>

<a name="NetworkConfiguration" id="NetworkConfiguration"></a>
<h3>Network Configuration</h3>

<p><img class="figure" src="images/net_cfg.png" alt=
"Network Configuration Form" /></p>

<p>When first downloaded, Sequin runs in stand-alone mode, without
access to the network. However, the program can also be configured
to exchange information with the NCBI (GenBank) over the Internet.
The network-aware mode of Sequin works the same way as the stand-alone
mode, but it offers some additional useful options.</p>

<p>Sequin can only function in its network-aware mode if the
computer on which it resides has a direct Internet connection.
Electronic mail access to the Internet is insufficient. In general,
if you can install and use a WWW browser on your system, you should
be able to install and use network-aware Sequin.
Check with your
system administrator or Internet provider if you are uncertain as
to whether you have direct Internet connectivity.</p>

<p>To launch the configuration form, select Net Configure under the
Misc menu, from either the initial Welcome to Sequin form or from a
viewer on an existing sequence record.</p>

<p>If you are not behind a firewall, set the <span
class="buttonlabel">Connection</span> control to <span
class="buttonlabel">Normal</span>. If you also have a Domain Name
Server (DNS) available, you can now simply press <span
class="buttonlabel">Accept</span>.</p>

<p>If DNS is not available, uncheck the <span
class="buttonlabel">Domain Name Server</span> box. If you are behind
a firewall, set the <span class="buttonlabel">Connection</span> control
to <span class="buttonlabel">Firewall</span>. The <span
class="buttonlabel">HTTP Proxy Server</span> box then becomes active.
If you also use a proxy server, type in its address. (If you have
access to DNS, it will be of the form
<tt>www.myproxy.myuniversity.edu</tt>. If you do not have DNS, you
should use the numerical IP address of the form <tt>127.45.23.6</tt>.)
Once you type something in the <span class="buttonlabel">HTTP Proxy
Server</span> box, the <span class="buttonlabel">Port</span> box
becomes active and can be filled in or changed as appropriate. (By
default the <span class="buttonlabel">Non-transparent Proxy
Server</span> box is empty, indicating a CERN-like proxy.) Ask your
network administrator for advice on the proper settings to use.</p>

<p>If you are in the United States, the default <span
class="textlabel">Timeout</span> of 30 seconds should suffice.
From countries with a poor Internet connection to the U.S., you can
select a timeout of up to 5 minutes.</p>

<p>Finally, you will need to quit and restart Sequin for the
network-aware settings to take effect.</p>

<p>If you are behind a firewall, it must be configured correctly to
access NCBI services. Your network administrator may have done this
already. If not, please have them contact NCBI for further
instructions on setting up firewalls to work with NCBI
services.</p>

<p><b>The following section is intended for network
administrators:</b></p>

<p>Using NCBI services from behind a security firewall requires
opening ports in your firewall. Please consult <a href=
"http://www.ncbi.nlm.nih.gov/IEB/ToolBox/NETWORK/firewall.html">
<tt>http://www.ncbi.nlm.nih.gov/IEB/ToolBox/NETWORK/firewall.html</tt></a>
for the list of current hosts and ports that have the firewall
daemon configured.</p>

<p>If your firewall is not transparent, the firewall port number
should be mapped to the same port number on the external host.</p>

<p>Note: Old NCBI clients used different application configuration
settings and ports than those listed above. If you need to support such
clients, which are becoming obsolete, please contact <a href=
"mailto:info@ncbi.nlm.nih.gov"><tt>info@ncbi.nlm.nih.gov</tt></a>
for further information.</p>

<a name="FeatureTableFormat" id="FeatureTableFormat"></a>
<h3>Feature Table Format</h3>

<p>Sequin can now annotate features by reading in a tab-delimited
table. This is most often used by genome centers that store feature
interval information in relational databases or spreadsheets.
For
most submitters, it is usually better to supply protein sequences
in FASTA format with gene and protein names embedded in the
definition line.</p>

<p>The feature table specifies the location and type of each feature,
and Sequin processes the feature intervals and translates any CDSs. The
table is read in the record viewer (after the sequence has been
imported) using the <span class="menulabel">File-&gt;Open</span> menu
item. The table must follow a defined format. The first line starts
with &gt;Feature, a space, and then the Sequence ID of the sequence you
are annotating. In the example below, eIF4E is the Sequence ID, and it
is a local identifier.</p>

<p>The table is composed of five columns: start, stop, feature key,
qualifier key, and qualifier value. The columns are separated by
tabs. The first row for any given feature has start, stop, and
feature key. Additional feature intervals just have start and stop.
The qualifiers follow on lines starting with three tabs.</p>

<p>For example, a table that looks like this:</p>

<pre>
&gt;Features lcl|eIF4E
80	2881	gene
			gene	eIF4E

201	224	CDS
1550	1920
1986	2085
2317	2404
2466	2629
			product	eukaryotic initiation factor 4E-II

1402	1458	CDS
1550	1920
1986	2085
2317	2404
2466	2629
			product	eukaryotic initiation factor 4E-I
			note	encoded by two messenger RNAs

80	224	mRNA
1550	1920
1986	2085
2317	2404
2466	2881
			product	eukaryotic initiation factor 4E-II

80	224	mRNA
892	1458
1550	1920
1986	2085
2317	2404
2466	2881
			product	eukaryotic initiation factor 4E-I

80	224	mRNA
1129	1458
1550	1920
1986	2085
2317	2404
2466	2881
			product	eukaryotic initiation factor 4E-I
</pre>

<p>will result in a GenBank flatfile that contains this:</p>

<pre>
     mRNA            join(80..224,1129..1458,1550..1920,1986..2085,2317..2404,
                     2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-I"
     mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
                     2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-I"
     mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-II"
     gene            80..2881
                     /gene="eIF4E"
     CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
                     /gene="eIF4E"
                     /codon_start=1
                     /product="eukaryotic initiation factor 4E-II"
                     /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETG
                     EPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTV
                     EDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDN
                     LWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDAL
                     RLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
     CDS             join(1402..1458,1550..1920,1986..2085,2317..2404,
                     2466..2629)
                     /gene="eIF4E"
                     /note="encoded by two messenger RNAs"
                     /codon_start=1
                     /product="eukaryotic initiation factor 4E-I"
                     /translation="MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKD
                     VKPKEDPQETGEPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWED
                     MQNEITSFDTVEDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVIT
                     LNKSSKTDLDNLWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAA
                     LEIGHKLRDALRLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
</pre>

<p>Note that if the gene feature spans the intervals of the CDS and
mRNA features for that gene, you don't need to include gene
"qualifiers" in those features, because they will be picked up by
overlap.</p>

<p>Features that are on the complementary strand are indicated by
reversing the interval locations. For example, the table:</p>

<pre>
&gt;Features lcl|dna2
5284	5202	tRNA
			product	tRNA-Glu
</pre>

<p>will result in a GenBank flatfile containing:</p>

<pre>
     tRNA            complement(5202..5284)
                     /product="tRNA-Glu"
</pre>

<p>More instructions on using the feature table format for
submitting large genomic records are available at<br />
<a href="http://www.ncbi.nlm.nih.gov/Sequin/table.html">
<tt>http://www.ncbi.nlm.nih.gov/Sequin/table.html</tt></a>.</p>

<hr />

<div id="footer">
<b>Questions or Comments?</b><br />
Write to the <a href="mailto:info@ncbi.nlm.nih.gov">NCBI Service Desk</a>
<br />
<br />
Revised August 21, 2007<br />
</div>

<!-- end of content -->
</body>
</html>