Sequences

Add Sequences

The Add Sequences function adds new sequences to an existing project. For example, suppose the chromosomal sequences of a bacterial genome already exist in a project, but the plasmid sequences were not included. Use Add Sequences to add the plasmids to the project.

Sequences Open Objects Menu

The default file format setting is FASTA sequence files. This can be changed to NCBI ASN.1 files or FASTA alignment files. Use the browse function to find files to open, or simply drag and drop the files in. Recently used files appear in the Recently used Files box for easy access.

If a sequence in the imported file has the same sequence identifier as an existing sequence, an error will occur:

Sequences Error

Select OK and a dialog box will open:

Sequences Id Problems Dialog

The top portion of the window displays all Sequence IDs from the existing sequences. The bottom portion displays the new Sequence IDs. In the example, the new Sequence ID display indicates that the problem is a duplicate ID. Type a new ID in the New Sequence ID box and click the recheck problems button.

Sequences Id Problems Corrected

When everything is correct, the Problems column (the last column) will be empty. At this point, click the accept button. The new sequences will be added to the end of the file. To end the process at any point without making changes, click the cancel button.
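The duplicate-ID check described above can also be approximated before importing. The following is a minimal sketch in Python, assuming plain FASTA text in which the sequence ID is the first whitespace-delimited token after ">". The function names `fasta_ids` and `find_duplicate_ids` are hypothetical and are not part of Genome Workbench.

```python
def fasta_ids(fasta_text):
    """Return the sequence IDs (first token after '>') in file order."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


def find_duplicate_ids(existing_text, new_text):
    """Return new IDs that collide with an existing ID or with an earlier new ID."""
    seen = set(fasta_ids(existing_text))
    duplicates = []
    for seq_id in fasta_ids(new_text):
        if seq_id in seen:
            duplicates.append(seq_id)
        seen.add(seq_id)
    return duplicates


# Illustrative data only: the plasmid file reuses the ID "chr1".
existing = ">chr1 chromosome\nACGT\n"
incoming = ">chr1 plasmid\nGGCC\n>pA\nTTAA\n"
print(find_duplicate_ids(existing, incoming))
```

Any ID reported by such a check would need to be renamed, which is what the New Sequence ID box is for.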

Edit Sequence

Edit Sequence 1

The Edit Sequence dialog is a useful tool for viewing the sequence and features and editing the sequence content and feature locations.

Edit Sequence 2

The Show menu controls the information displayed. For example, the user could choose to view reading frames or display the sequence complement below the sequence. The user could also choose to display features as labeled lines below the sequence or hide them. For coding regions, the On-the-fly option shows the protein translation calculated using the sequence underlying the coding region and the frame of the coding region. The protein currently associated with the coding region is also displayed, and the user can choose Mismatch to highlight the positions on the protein sequence that do not match the calculated translation.

Edit Sequence 3

The Edit->Find menu item launches a dialog that allows the user to search for sequence characters in the nucleotide sequence, the reverse complement, or in translated frames.

The “Go to:” text box enables the user to set the position of the cursor. This is useful for navigating to a specific position without scrolling. This control also allows the user to search for sequence characters; for example, searching for “atg” will move the cursor to the next instance of this codon.

The “Select:” text box enables the user to select a range of sequence without using the mouse to drag the cursor. Selecting a sequence range is useful because the Annotate menu can be used to create a feature for the selected location.

The user can edit the sequence directly by clicking on the sequence and typing characters or using backspace or delete. When features are displayed, their locations can be adjusted by dragging the endpoints of their intervals.

Note that when making changes in the Edit Sequence dialog, it is necessary to use the Commit button to apply changes to the data before adding new features with the Features menu or retranslating coding regions with the Retranslate button. The Cancel button will exit the dialog and undo any edits that have been made in the dialog.
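To illustrate what searching "the next instance" on either strand means, here is a minimal Python sketch. It assumes that a match on the reverse complement strand is reported at the forward-strand position of its leftmost base; that convention, and the names `reverse_complement` and `find_next`, are assumptions made for illustration and do not describe the application's internals.

```python
COMPLEMENT = str.maketrans("acgtnACGTN", "tgcanTGCAN")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_next(seq, pattern, start=0, reverse_strand=False):
    """Return the 1-based position of the next match at or after the
    0-based offset `start`, or None if there is no match.

    A pattern on the reverse strand is found by searching the forward
    strand for the reverse complement of the pattern.
    """
    target = reverse_complement(pattern) if reverse_strand else pattern
    hit = seq.lower().find(target.lower(), start)
    return None if hit < 0 else hit + 1


seq = "ccatggatgc"
print(find_next(seq, "atg"))                       # first forward match
print(find_next(seq, "atg", start=3))              # next forward match
print(find_next(seq, "atg", reverse_strand=True))  # match on the reverse strand
```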

Update Sequence

WORK IN PROGRESS

Remove Sequences

Remove Sequences opens a dialog that allows you to remove one or more sequences from the current group of sequences. Sequences that are not chosen remain in the current group and can continue to be processed.

GWB Sequences Remove Sequences Dialog

In the open dialog, the left window at the top lists the sequences in the current group in the form Filename/SequenceID SequenceID (sequence length). For example, a 4,321-nucleotide sequence with a SequenceID of SEQ1 would appear as: Filename1/SEQ1 SEQ1 (4321)

By clicking on a sequence listed in the left window to highlight it and then clicking on the [>>>] choice, you can move the chosen sequence to the right window. Clicking [Accept] at the bottom of the dialog will then remove the chosen sequence(s) from the current group.

GWB Sequences Remove Sequences Dialog Choose Seq

GWB Sequences Remove Sequences Dialog Remove Seq

If a sequence was incorrectly chosen and should be moved back to the left window to stay in the current group, highlight the sequence in the right window and click the [<<<] choice to move the sequence back.

Multiple sequences can be chosen and removed from the current group by using the options below the [<<<] and [>>>] choices:

  1. One or more SequenceIDs can be chosen according to whether they include or do not include specific text. Choose a match type from the ‘Seq-id’ pull-down menu (Is one of, Contains, Does not Contain, Equals, etc.) and enter the desired SequenceID text in the free text box. The accompanying choices control whether upper and lower case and spaces are used or ignored when matching SequenceIDs. The [Clear Constraint] choice removes all choices from this option.

  2. Sequences of certain lengths can be chosen by using the next two options: ‘Select sequences longer than’ and/or ‘Select sequences less than’ and entering a sequence length as number of nucleotides in the corresponding boxes.

  3. The buttons at the bottom of the dialog function as described:

    • Select - Moves the sequences identified from the current group to the list to be removed from the current group.
    • Select All - Moves all of the sequences in the current group to the list to be removed from the current group.
    • Unselect All - Moves all of the sequences listed to be removed back to the current group.
    • Accept - Confirms the choices of sequences to be removed from the current group, removes them, and closes the dialog.
    • Cancel - Closes the dialog without taking any action.
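The selection options above amount to filtering a list of (SequenceID, length) pairs. The sketch below is one possible reading in Python. Two points are assumptions rather than documented behavior: that an ID constraint and the length limits combine with AND, and that ‘Is one of’ takes a comma-separated list. The names `id_matches` and `select` are hypothetical.

```python
def normalize(value, ignore_case, ignore_space):
    """Apply the optional case and whitespace handling to one value."""
    if ignore_case:
        value = value.lower()
    if ignore_space:
        value = "".join(value.split())
    return value


def id_matches(seq_id, mode, text, ignore_case=False, ignore_space=False):
    """Test one SequenceID against a constraint such as 'contains'."""
    sid = normalize(seq_id, ignore_case, ignore_space)
    if mode == "is one of":
        # Assumption: alternatives are given as a comma-separated list.
        options = [normalize(t.strip(), ignore_case, ignore_space)
                   for t in text.split(",")]
        return sid in options
    txt = normalize(text, ignore_case, ignore_space)
    if mode == "contains":
        return txt in sid
    if mode == "does not contain":
        return txt not in sid
    if mode == "equals":
        return sid == txt
    raise ValueError(f"unsupported mode: {mode}")


def select(records, mode=None, text="", longer_than=None, shorter_than=None,
           **flags):
    """Return the IDs that satisfy every constraint that was supplied."""
    chosen = []
    for seq_id, length in records:
        if mode is not None and not id_matches(seq_id, mode, text, **flags):
            continue
        if longer_than is not None and length <= longer_than:
            continue
        if shorter_than is not None and length >= shorter_than:
            continue
        chosen.append(seq_id)
    return chosen


# Illustrative records only.
records = [("SEQ1", 4321), ("seq2", 150), ("CTG_F1", 900)]
print(select(records, mode="contains", text="seq", ignore_case=True))
print(select(records, shorter_than=1000))
print(select(records, mode="contains", text="F", longer_than=500))
```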

Reverse Complement Sequences by Sequence ID

The orientation of an individual contig, plasmid or chromosome does not matter to GenBank. Submitters, however, may prefer that sequences be in a particular orientation, for example so that all contigs are on the plus strand or so that certain genes are first in the genome. If a sequence needs to be reverse complemented, that can be done with the Reverse Complement by SeqID task.

Sequences Reverse Complement SequencesDialog

The sequence or sequences to be reverse complemented can be selected in the top window. Hold down the Ctrl key when selecting multiple sequences. Alternatively, sequence IDs can be entered in the constraint box to select specific sequences. For example, in this demonstration record, a constraint for sequence IDs that contain F selects three sequences. Once the constraint has been entered, hit the select button to mark those records to be changed. If all sequences should be reverse complemented, simply hit the select all button to highlight all sequences. If something was selected incorrectly, simply hit the unselect all button and start over.

When the list to change is correct, hit accept to perform the action. This will create a nucleotide sequence that is reverse complemented but will not make any other changes. If there are features on the sequence and the Reverse features box is checked, the features follow the reverse complement action, so the last gene will now be first, etc. If the features and sequence are on opposite strands, only one of the two boxes needs to be checked. In all cases, check the sequence afterward to ensure that the changes were incorporated.
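The effect of reverse complementing a sequence together with its features can be sketched as a coordinate transformation. The Python example below is a minimal illustration, assuming 1-based inclusive feature intervals with a "+" or "-" strand. The names `reverse_complement` and `flip_feature` are hypothetical and do not describe how Genome Workbench stores features.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def flip_feature(start, end, strand, seq_len):
    """Map a 1-based inclusive interval onto the reverse-complemented
    sequence and swap its strand."""
    new_start = seq_len - end + 1
    new_end = seq_len - start + 1
    new_strand = "-" if strand == "+" else "+"
    return new_start, new_end, new_strand


seq = "ATGCCCTTAA"
flipped = reverse_complement(seq)
feature = flip_feature(1, 3, "+", len(seq))
print(flipped)
print(feature)
```

A feature that started the original sequence ends the flipped one, which is why the last gene becomes the first. Reverse complementing the bases under the new interval recovers the original feature sequence, here `ATG`.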

Trim Terminal Ns

Trim Terminal Ns removes one or more unknown nucleotides (represented in the sequence as ‘n’) from the 5’ or 3’ ends of all sequences in the current group.

Terminal Ns are sometimes added by tools that create an alignment of multiple sequences to fill gaps at the ends of sequences when all of the sequences are not the same length. However, GenBank prefers not to include terminal Ns in sequences as they are considered ambiguous data that add no additional value to the sequence.

NNNNNNNNNNNNTGCGGGATTATTCATACCGTCCAACCATCGGGCGTACCTATGTGTACGACAATAAATTGGGTTGTGTTATCAAAAACGCCAAGCGCAAGAAGCACCTAGTCGA …

Using Trim Terminal Ns will remove the 5’ terminal Ns and result in:

TGCGGGATTATTCATACCGTCCAACCATCGGGCGTACCTATGTGTACGACAATAAATTGGGTTGTGTTATCAAAAACGCCAAGCGCAAGAAGCACCTAGTCGA…

When Trim Terminal Ns is used to remove 5’ or 3’ n’s, a pop-up dialog will report the sequence(s) and number of n’s removed.

GWB Sequences Trim Terminal Ns Results

If terminal Ns are not removed from a sequence to be submitted, they will be removed when the sequence is processed by GenBank.
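The trimming step and its report can be sketched in a few lines of Python. This is an illustration of the idea only, not the application's code, and the function name `trim_terminal_ns` is hypothetical. Internal n's are left untouched.

```python
def trim_terminal_ns(seq):
    """Strip leading and trailing N/n characters.

    Returns the trimmed sequence plus the number of bases removed from
    the 5' end and from the 3' end.
    """
    without_5 = seq.lstrip("nN")
    removed_5 = len(seq) - len(without_5)
    trimmed = without_5.rstrip("nN")
    removed_3 = len(without_5) - len(trimmed)
    return trimmed, removed_5, removed_3


trimmed, n5, n3 = trim_terminal_ns("NNNNTGCGGnATn")
print(trimmed, n5, n3)
```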

Expand Known Gaps to Include Flanking Ns

After a file that contains gaps is submitted, the foreign contamination screen may detect contamination near a gap boundary. To keep the coordinate system the same, the contamination is often replaced by N’s in the sequence. However, these N’s then need to be incorporated into the neighboring gap. Rather than removing all gaps and adding them back, use the Expand Known Gaps to Include Flanking N’s button to adjust the gaps in the submission file to include this additional sequence. Always check that the estimated length of the gap has changed accordingly. Note that this task only works when the gap is of estimated length; it cannot be used for gaps of unknown length.
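Conceptually, expanding a gap means growing its interval outward for as long as the adjacent base is an N. The following is a minimal Python sketch, assuming the gap is given as a 1-based inclusive interval on the sequence; the function name `expand_gap` is hypothetical.

```python
def expand_gap(seq, start, end):
    """Grow a 1-based inclusive gap interval over adjacent N's on both
    sides and return the new (start, end)."""
    left = start - 1   # 0-based index of the first gap base
    right = end        # 0-based index just past the last gap base
    while left > 0 and seq[left - 1] in "nN":
        left -= 1
    while right < len(seq) and seq[right] in "nN":
        right += 1
    return left + 1, right


# Illustrative data: a gap annotated at 8..12 inside a run of ten N's (5..14).
seq = "ACGT" + "N" * 10 + "ACGT"
new_start, new_end = expand_gap(seq, 8, 12)
print(new_start, new_end, new_end - new_start + 1)
```

The last printed value is the new estimated gap length, which is the number to verify after running the task.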

Add Linkage Evidence to All Gaps

WORK IN PROGRESS

Add Assembly Gaps to Sequence

Add Assembly Gaps to Sequence opens a dialog that allows the identification and description of gaps in the current sequence.

GWB Sequences Add Assembly Gaps To Sequence

What is a sequencing gap? A section of unknown sequence between sections of known sequence. The unknown sequence can be of known (estimated) length, based on alignment or other biological evidence, or of unknown length.

All unknown length gaps in a sequence should be represented by a string of exactly 100 internal N’s. The first choice in the dialog defaults to this description and will convert those gaps when Accept is clicked.

GWB Sequences Add Assembly Gaps To Sequence Unknown

All gaps (internal strings of N’s) of 101 N’s or longer will be converted to gaps of known length equal to the number of N’s. This is also a default setting.

GWB Sequences Add Assembly Gaps To Sequence Known

Also by default, CDSs with intervals that include gaps will be adjusted when the gaps are converted.

GWB Sequences Add Assembly Gaps To Sequence Adjust CDS

The Add linkage information to gaps option allows Gap Type, Linkage and Linkage evidence to be set. Choose the appropriate Type from the pull-down menu and the corresponding Linkage and Evidence when required.

GWB Sequences Add Assembly Gaps To Sequence Linkage Choices

Choose Accept to apply the gaps that have been described or Cancel to close the dialog without any action.
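The default conversion rules described above (exactly 100 internal N’s means unknown length, 101 or more means known length) can be sketched as a scan for runs of N’s. This Python example is an illustration under stated assumptions: runs shorter than 100 and runs touching either end of the sequence are skipped, and positions are reported as 1-based inclusive. The function name `find_assembly_gaps` and its parameters are hypothetical.

```python
import re


def find_assembly_gaps(seq, unknown_length=100, min_known=101):
    """Return (start, end, kind) for each internal run of N's that the
    default settings would convert to an assembly gap."""
    gaps = []
    for match in re.finditer(r"[nN]+", seq):
        if match.start() == 0 or match.end() == len(seq):
            continue  # terminal N's are not internal gaps
        run_length = match.end() - match.start()
        if run_length == unknown_length:
            kind = "unknown"
        elif run_length >= min_known:
            kind = "known"
        else:
            continue  # assumption: shorter runs are left as sequence
        gaps.append((match.start() + 1, match.end(), kind))
    return gaps


# Illustrative data: a 100-N run, a 150-N run and a 10-N run.
seq = "ACGT" + "N" * 100 + "GG" + "N" * 150 + "TT" + "N" * 10 + "AC"
for gap in find_assembly_gaps(seq):
    print(gap)
```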

Remove Gap Features

Remove Gap Features removes all of the assembly gap features in the submission file. If gaps were included in the submission file using Add Assembly Gaps to Sequence in Genome Workbench, or using the programs table2asn_gff or tbl2asn, this function can be used to remove them. Note that this does not change the underlying sequence; it only reverts the sequence to its pre-gapped state. Errors regarding the presence of the N's will be reported if assembly gap features are not added back. Gaps can be added back using Add Assembly Gaps to Sequence with new settings.

For more information please see the full documentation for NCBI Genome Workbench Editing Package.

Last updated: 2019-09-12T15:19:13Z