Working with Non-Public Data

Step 1: Introduction

This tutorial demonstrates two different ways to manage private data in Genome Workbench.

  • You have created your own sequence and want to work with it in Genome Workbench
  • You want to view your own data/annotation on a publicly available sequence

We will demonstrate how to use some of the Genome Workbench tools on data that is not found in the NCBI databases.

It is recommended that you complete the Basic Operation tutorial first.

Sample data needed to complete this tutorial: BX530088_BX572102.

Step 2: Getting Started

For the first exercise we are going to do the following:

  • Load a user-generated AGP file (download sample)
  • SPLIGN some mRNAs on that AGP sequence
  • Create a FASTA file from the AGP
  • BLAST that FASTA sequence to see what is related to it
  • WindowMask that FASTA sequence (or part of it) to look for repetitive regions

Genome Workbench starts up and displays the main screen. Choose File=>Open from the main menu, select File on the left side of the dialog, and click the ... button on the right to point to the file location. Genome Workbench understands many different file formats. For this step, choose BX530088_BX572102.comp.agp from the downloaded data files. Click Next, and then Next again to accept the defaults. Then click Finish to add the data file to a new project.

Now that your data is loaded, you can view it by selecting the data in the project tree, right-clicking, and choosing Open New View. Then choose Graphical View. At the default zoom level the view shows little detail, but you can zoom in to see the sequence.

Tiling path
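To make the AGP input less of a black box, here is a minimal Python sketch of how a tiling path like the one shown could be read outside Genome Workbench. It assumes the standard nine-column, tab-separated AGP layout, where rows with component type N or U describe gaps. The class name AgpRow, the helper functions, and the sample rows are all made up for illustration; they are not taken from BX530088_BX572102.comp.agp and are not part of Genome Workbench.

```python
from dataclasses import dataclass
from typing import List, Optional

GAP_TYPES = {"N", "U"}  # AGP component types that denote gaps


@dataclass
class AgpRow:
    """One non-comment line of an AGP file (hypothetical helper class)."""
    obj: str                              # name of the assembled object
    obj_beg: int                          # 1-based start on the object
    obj_end: int                          # 1-based inclusive end on the object
    part_number: int
    component_type: str
    component_id: Optional[str] = None    # set only for sequence components
    comp_beg: Optional[int] = None
    comp_end: Optional[int] = None
    orientation: Optional[str] = None
    gap_length: Optional[int] = None      # set only for gap rows

    @property
    def is_gap(self) -> bool:
        return self.component_type in GAP_TYPES


def parse_agp(text: str) -> List[AgpRow]:
    """Parse AGP text into rows, skipping blank lines and '#' comments."""
    rows: List[AgpRow] = []
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue
        fields = raw.split("\t")
        if len(fields) < 9:
            raise ValueError(f"expected 9 tab-separated columns: {raw!r}")
        obj, ctype = fields[0], fields[4]
        beg, end, part = int(fields[1]), int(fields[2]), int(fields[3])
        if ctype in GAP_TYPES:
            # For gap rows, column 6 holds the gap length.
            rows.append(AgpRow(obj, beg, end, part, ctype,
                               gap_length=int(fields[5])))
        else:
            # For component rows, columns 6-9 hold id, begin, end, orientation.
            rows.append(AgpRow(obj, beg, end, part, ctype,
                               component_id=fields[5],
                               comp_beg=int(fields[6]),
                               comp_end=int(fields[7]),
                               orientation=fields[8]))
    return rows


def object_length(rows: List[AgpRow], obj: str) -> int:
    """Length of an object, taken as the largest object end coordinate."""
    return max((r.obj_end for r in rows if r.obj == obj), default=0)


# Made-up rows for illustration only (not the tutorial's sample data).
SAMPLE_AGP = "\n".join([
    "# hypothetical example",
    "\t".join(["my_object", "1", "1000", "1", "F", "compA.1", "1", "1000", "+"]),
    "\t".join(["my_object", "1001", "1100", "2", "N", "100", "scaffold", "yes", "paired-ends"]),
    "\t".join(["my_object", "1101", "1600", "3", "F", "compB.1", "501", "1000", "-"]),
])

if __name__ == "__main__":
    for row in parse_agp(SAMPLE_AGP):
        label = f"gap of {row.gap_length}" if row.is_gap else row.component_id
        print(row.obj_beg, row.obj_end, label)
```

Each component row corresponds to one segment of the tiling path, and each gap row to a break between segments.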

Step 3: Apply the tool to private data

Now let us align an mRNA to our sequence using the SPLIGN tool. SPLIGN (SPLiced aLIGNer) is a global alignment tool used in NCBI's annotation pipeline. Open the NM_020137.3 RID from the Data from GenBank database (File=>Open) and add it to the project.

Open RID

Click Next and Finish. Both entries are now shown in the data folder.

Select both entries (SHIFT+left click on both MS Windows and Mac OS). With both entries selected, click Tools=>Run Tool to open the Tools dialog, choose SPLIGN, and click Next. On some systems you will be taken to the next screen without having to click Next.

Select BX530088... for the Genomic Sequence and NM_020137.3 for the Transcript Sequence. If you do not see both sections of the dialog, drag the lower border of the dialog box down.

Select genomic and transcript sequences

Click Next.

Add the results to the existing project and click Finish.

Your private data alignment will be displayed.

Private data alignment

Step 4: Export a FASTA file

In the Project Tree View, select the data file we loaded previously. Right-click (Control-click on Mac OS) the selected data and choose Export. Select FASTA as the format, select a location, and give the file a name.

Export fasta

Click Finish.

Now open the FASTA file you have just created. Choose File=>Open. Select the file and click Next. Accept the default settings and click Next again. Choose to create a new project and click Finish.

Select the FASTA data in the Project Tree View and double-click it. From the Open View menu choose Graphical View.

Fasta view
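If you want to sanity-check the exported file outside Genome Workbench, the following Python sketch shows one way to read FASTA text, write it back out, and pull out a sub-range by 1-based coordinates (the same kind of range that a region selection in the Graphical View describes). The function names and the example identifier my_seq are hypothetical, and the 70-character line width is only a common convention, not something Genome Workbench is confirmed to use.

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Return {sequence id: sequence} from FASTA-formatted text."""
    chunks: Dict[str, List[str]] = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is taken as the first whitespace-delimited token
            # of the definition line.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            chunks[seq_id] = []
        elif seq_id is None:
            raise ValueError("sequence data found before any '>' line")
        else:
            chunks[seq_id].append(line)
    return {key: "".join(value) for key, value in chunks.items()}


def format_fasta(seq_id: str, seq: str, width: int = 70) -> str:
    """Format one record as FASTA text with fixed-width sequence lines."""
    if width <= 0:
        raise ValueError("width must be positive")
    lines = [">" + seq_id]
    lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def subsequence(seq: str, start: int, stop: int) -> str:
    """Extract a 1-based, inclusive range, as shown on a sequence ruler."""
    if not 1 <= start <= stop <= len(seq):
        raise ValueError("range must satisfy 1 <= start <= stop <= len(seq)")
    return seq[start - 1:stop]


if __name__ == "__main__":
    text = format_fasta("my_seq", "ACGT" * 5, width=8)  # made-up sequence
    print(text, end="")
    print(subsequence(read_fasta(text)["my_seq"], 2, 4))
```

Reading the file back and comparing the sequence length with the object length in the AGP file is a quick way to confirm that the export covered the whole sequence.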

Step 5: Alignment

From the Graphical View of the FASTA sequence use region selection to select the entire sequence. Click and drag in the number line at the top of the view to begin the selection.

Region select

Once you have a region selected, click on the edges and stretch it to the boundaries of the view.

Region select complete

With the entire region selected, choose Run Tool, either from the main menu (Tools=>Run Tool) or from the context menu (right-click, or Control-click on Mac OS). From the Run Tool dialog choose BLAST Search.

Run tool blast

Click Next.

In the BLAST Search dialog, ensure that you have selected the Nucleotide option, Nucleotide-Nucleotide (MegaBLAST) from the Program menu, and nt (All GenBank+EMBL+DDBJ+PDB sequences) from the Database menu. Enter the search string biomol mrna[prop] in the Entrez Query field.

Blast parameters

Click Next.

In the next dialog, accept the general parameters, check the Filter low complexity regions option, and select Human from the Species specific repeats for: menu.

Blast filter parameters

Then click Next. In the next screen choose to add the results to the existing project (New Project (1)) and click Finish.

It can take some time for the analysis to return and present the results.

Analysis results

Step 6: WindowMasker

In this step we will use WindowMasker on the FASTA sequence to look for repetitive regions. First let us download the mask data. Select Tools=>WindowMasker Data. In the dialog that appears, choose the location (path) where the mask will be downloaded and choose human.tar.gz as the mask.

WindowMasker data

Click OK. The mask data will be downloaded to the selected location.

The FASTA file should still be available in the Project Tree View. Select it, double-click it, and open a Graphical View. Select a region by clicking in the number line and dragging across the part of the sequence you want to analyze.

Region selection

Choose Tools=>Run Tool from the main menu.

Windowmasker selection

Select Search/Find Repetitive Sequences with WindowMasker and click Next (in some systems you might have to only click the tool without having to click Next).

Ensure that our sequence (BX530088...) is selected, and select 9606 Homo sapiens from the Mask using parameters for menu.

Windowmasker parameters

Click Next. Choose a project to add the results to and click Finish. It can take some time for the job to complete.

The result is a histogram showing regions of repeats. You can scroll and zoom just like you would any other view.

If the histogram does not appear automatically, select the content menu at the bottom of the graphical view and choose Repeat Region.

Show repeat regions
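To give a rough idea of what a repeat histogram summarizes, the Python sketch below bins a list of masked intervals along a sequence and reports the fraction of each bin that is masked. This is only an illustration of how such a histogram can be derived from intervals; it is not WindowMasker's algorithm, and it is not how Genome Workbench is confirmed to draw the track. The interval values, the bin size, and the function names are all hypothetical.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, stop)


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def masked_fraction_per_bin(intervals: Iterable[Interval],
                            seq_length: int,
                            bin_size: int) -> List[float]:
    """Fraction of masked positions in each consecutive bin."""
    if seq_length <= 0 or bin_size <= 0:
        raise ValueError("seq_length and bin_size must be positive")
    n_bins = (seq_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, stop in merge_intervals(intervals):
        # Clamp to the sequence so out-of-range input is ignored.
        pos = max(start, 1)
        stop = min(stop, seq_length)
        while pos <= stop:
            b = (pos - 1) // bin_size
            bin_end = min((b + 1) * bin_size, seq_length)
            chunk_end = min(stop, bin_end)
            counts[b] += chunk_end - pos + 1
            pos = chunk_end + 1
    fractions = []
    for b, count in enumerate(counts):
        # The last bin may be shorter than bin_size.
        bin_len = min((b + 1) * bin_size, seq_length) - b * bin_size
        fractions.append(count / bin_len)
    return fractions


if __name__ == "__main__":
    # Made-up masked intervals on a made-up 250-base sequence.
    example = [(1, 50), (40, 60), (151, 200)]
    for i, frac in enumerate(masked_fraction_per_bin(example, 250, 100)):
        print(f"bin {i}: {'#' * round(frac * 20)} {frac:.2f}")
```

Taller bars in such a plot mark stretches where a larger share of the bases fall inside masked regions.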

Step 7: Conclusion

There are many ways to use Genome Workbench, and this tutorial shows only a few simple examples. It should give you enough background to start exploring your own data in new ways, keeping that data private while still giving you access to public data.

Current Version is 2.11.10 (released March 10, 2017)


Last updated: 2014-02-18T12:43:51-05:00