dbGaP Cloud Access

Introduction

Files stored on cloud storage providers have been made available to users in their original format. These files are accessioned by SRA in the same accession ID series (SRR#) as data submitted for processing at NCBI. Any index files for BAM, CRAM, and VCF data that were provided in the submission will be distributed alongside the source data under the same accession. Please note that the content of the source data files may not be verified by NCBI, and some data types may require additional files or software to access. Users are responsible for their own cloud storage, compute, and billing accounts. Contact your institution for assistance in setting these up, and note that not all data sets are available through all cloud providers.

Log in to Authorized Access and download Repository Key

  • Log in to the Authorized Access Portal using an eRA Commons or NIH account login.
  • On the far right of the "My Projects" page, click "get dbGaP repository key" for the project you intend to work on in the cloud.
  • Save this file (it will have an .ngc extension) for use later on the cloud platform.
Get dbGaP Repository Key

Select Data Using Run Selector

Data files located on a cloud provider can be found using the DATASTORE provider, region, and filetype facets in the SRA Run Selector. For most data sets, the files are only accessible to compute instances using the same DATASTORE provider and DATASTORE region that the cloud storage bucket is hosted in.

To choose a list of runs available on cloud storage:

  • On the "My Projects" page click "run selector" to open a window to select data included in your project. Run Selector may take a minute or more to open for projects with a large data set.
  • In the Run Selector "Facets" menu (left side of the screen), click the checkbox next to "DATASTORE provider."
  • In the DATASTORE provider menu that appears, check the box of the desired cloud provider. (gs = Google Cloud Storage, s3 = Amazon Simple Storage Service)
  • Note the DATASTORE region(s) the files are stored in. You will need to use a compute instance that is compatible with the region. Or you can use the DATASTORE region filter to select runs stored in the same region you will use for access. (STEP 1 in image below.)
  • Use any additional facets to reduce the list of runs to the specific data you would like to work with in the cloud.
  • Either select the runs individually using check boxes for the runs or click the button to add all the runs filtered by the current facets to your "Selected" list. (STEP 2 in image below.)
  • Click the "RunInfo Table" button on the Selected row to download a comma separated table selected runs. Save this file either directly on your cloud storage or to your computer for later upload to your cloud compute/storage. (STEP 3 in image below.)
Run Selector

Getting an Accession List

Currently, Run Selector from Authorized Access can provide either a list of ALL the runs in the project or a cart file (.krt) for selected runs in the project. It also provides a RunInfo Table for either all the runs or the selected list of runs. The RunInfo Table is a comma-delimited table that can be opened in Excel or any software that handles comma-separated data. The accessions needed for use with fusera are found in the Run column of the RunInfo Table; the desired accessions can be copied and stored in a txt file.
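For example, the Run column can be pulled out of the RunInfo Table on the command line. This is a minimal sketch that assumes the table was saved as SraRunTable.txt and that no field contains an embedded comma; adjust the file name to match your download:

# Print the Run (SRR) column of the comma-delimited RunInfo Table into an accession list
awk -F',' 'NR==1 {for (i=1; i<=NF; i++) if ($i=="Run") col=i; next} {print $col}' SraRunTable.txt > SRR_Acc_List.txt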

If you have an existing cart file (.krt), you can also generate the necessary accession list from it by using the SRA Toolkit program prefetch with the -l option to list the contents of the file, and then using Unix text-processing tools to extract just the accessions. For example:
prefetch -l <cart file> | cut -d '|' -f 3 | grep SRR > SRR_Acc_List.txt

Launch Cloud Compute Instance

Details will vary by provider.
An SRA tutorial for first-time users is available here: Download SRA sequence data using Amazon Web Services (AWS)

Pay careful attention to the region you launch the compute instance in so that the data you are interested in will be in the same region.

For Amazon users there is a public AMI (Amazon Machine Image), ami-0d57e13fe62227b2e, that includes basic tools such as Samtools, fusera, the SRA Toolkit, and BWA-MEM. The script used to create this AMI is provided later in this document. The AMI also includes a README.txt and the keys to access 1000 Genomes Project data from a cloud location. You will need to launch this EC2 instance from us-east-1 (N. Virginia) for the instructions in the README.txt to work correctly.
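As one example, an instance can be launched from this AMI with the AWS CLI. This is a sketch only; the instance type, key pair, and security group below are placeholders to replace with your own values:

# Launch an EC2 instance from the public AMI in us-east-1 (N. Virginia)
aws ec2 run-instances \
    --region us-east-1 \
    --image-id ami-0d57e13fe62227b2e \
    --instance-type m5.xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-0123456789abcdef0 \
    --count 1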

For other dbGaP studies, a copy of the repository key (.ngc) from dbGaP Authorized Access must be available on your cloud instance. You will also need a list of SRR accessions for fusera to mount.
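For example, the key and accession list can be copied from a local machine to a running instance with scp. This is a sketch; the key pair file, .ngc file name, user name, and host name are placeholders:

# Copy the dbGaP repository key and the accession list to the instance's home directory
scp -i my-key-pair.pem prj_12345.ngc SRR_Acc_List.txt ec2-user@ec2-203-0-113-25.compute-1.amazonaws.com:~/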

Mount the Cloud Storage Data

  1. Mount fusera

fusera mount -t token.file.ngc -a accessions.txt /path/to/mountpoint > output.log 2>&1 &

-t, --token string - a path to one of the various security tokens used to authorize access to accessions in dbGaP.
Examples: [local/token/file | https://<bucket>.<region>.s3.amazonaws.com/<token/file>]
NOTE: If using an s3 URL, the proper AWS credentials need to be in place on the machine (see the configuration sketch after these option descriptions).
Environment Variable: $DBGAP_TOKEN

-a, --accession string - a list of accessions to mount or path to accession file.
Examples: ["SRR123,SRR456" | local/accession/file | https://<bucket>.<region>.s3.amazonaws.com/<accession/file>]
NOTE: If using an s3 URL, the proper AWS credentials need to be in place on the machine.
Environment Variable: $DBGAP_ACCESSION

/path/to/mountpoint - location the data will be mounted on the filesystem. Must be an existing but empty directory.

> output.log - this redirects stdout to a file named output.log. If you don't want the output, use > /dev/null instead.

2>&1 - this redirects stderr to stdout so it is caught in output.log (or /dev/null) as well.

& - runs this process in the background so you can continue using the shell.

  2. Run fusera mount as a background process (as in the example above) or disown it from the current session; see the sketch after this list.
  3. Make sure to save any output or results in directories outside the fusera mount directory.
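A minimal sketch of keeping the mount alive after the current shell session ends, using the standard shell tools nohup and disown (the token file, accession list, and mount point are placeholders):

# Start fusera in the background, detached from the terminal, and keep its output in fusera.log
nohup fusera mount -t token.file.ngc -a SRR_Acc_List.txt /path/to/mountpoint > fusera.log 2>&1 &
# Remove the job from the shell's job table so it is not sent SIGHUP at logout
disown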

Access the data files normally

Once the runs are mounted with fusera, standard software such as samtools can read the data files as if they were stored in the mounted directory. If there were any errors in mounting a run, there will be an error message in the fusera log. Be careful to save any output or results in directories outside the fusera mount directory.
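For example, once a run is mounted, samtools can read a BAM file directly from the mount point. In this sketch the accession and file name are placeholders for whatever data was actually mounted:

# View the header of a BAM file served through the fusera mount
samtools view -H /path/to/mountpoint/SRR1234567/SRR1234567.bam
# Write derived results to a directory outside the mount point
mkdir -p ~/results
samtools flagstat /path/to/mountpoint/SRR1234567/SRR1234567.bam > ~/results/SRR1234567.flagstat.txt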

Remember to unmount the directory when finished using the data to reduce CPU usage. Example:
fusera unmount /path/to/mountpoint

Additional Resources

Access by Fusera

https://github.com/mitre/fusera

MITRE has built software that lets users access files through a mounted directory on a cloud instance without copying or storing the data files on the user's own storage. The data files appear in directories named by Run (SRR) accession.

To use fusera, first install the package from https://github.com/mitre/fusera. Currently there is data available and support for Google and Amazon cloud services, but data is only accessible in the region and provider used for storage.
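The same install script used in the AMI setup script below can be run on its own. This sketch assumes a yum-based Linux instance with curl available; use your distribution's package manager for the FUSE dependency:

# Install fusera with the project's install script and make sure FUSE is present
bash <(curl -fsSL https://raw.githubusercontent.com/mitre/fusera/master/install.sh)
sudo yum install -y fuse
# Confirm the binary is on the PATH
fusera --help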

Access by URL

Alternatively, dbGaP data can be accessed via signed URLs. Public data can be accessed with a standard URL; signed URLs are only issued for dbGaP protected data and require the project's repository key (.ngc) to be provided when requesting the URL. The SRA Data Locator (SDL) at https://www.ncbi.nlm.nih.gov/Traces/sdl/1/ can be used to find the data location and get URLs to access data in the cloud. Signed URLs are valid for a limited period and will need to be refreshed. The fusera program uses this service internally to generate signed URLs for read access.
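As an illustration only, a location query for a public run might look like the request below. The retrieve endpoint and acc parameter shown here are assumptions, so check the SDL documentation for the current request format; protected data additionally requires the repository key:

# Ask the SRA Data Locator where a run is stored and for URLs to access it (endpoint and parameter are assumptions)
curl "https://www.ncbi.nlm.nih.gov/Traces/sdl/1/retrieve?acc=SRR000001"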

Script for AMI

cd /root
echo Started : `date` > tools-setup.log
 
yum install -y mc gcc zlib-devel bzip2-devel xz-devel libcurl-devel openssl-devel
 
echo "----->" Install SRA toolkit
 
wget -q https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.9.4/sratoolkit.2.9.4-centos_linux64.tar.gz
 
tar xzf sratoolkit.2.9.4-centos_linux64.tar.gz
rm -f sratoolkit.2.9.4-centos_linux64.tar.gz
mv ./sratoolkit.2.9.4-centos_linux64 /opt
echo -e "export PATH=/opt/sratoolkit.2.9.4-centos_linux64/bin:\$PATH" > /etc/profile.d/sratoolkit.sh
chmod 755 /etc/profile.d/sratoolkit.sh
 
mkdir /etc/ncbi
 
echo "----->" Install SAM tools
 
wget -q https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
 
tar xjf samtools-1.9.tar.bz2
rm -f samtools-1.9.tar.bz2
cd samtools-1.9
./configure --without-curses >/dev/null
make --quiet
make --quiet install
cd ..
rm -rf ./samtools-1.9
 
echo "----->" Install Fusera
 
bash <(curl -fsSL https://raw.githubusercontent.com/mitre/fusera/master/install.sh)
yum install -y fuse.x86_64
 
echo "----->" Install BWA-MEM
 
wget -q https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.17.tar.bz2
 
tar xjf bwa-0.7.17.tar.bz2
rm -f bwa-0.7.17.tar.bz2
cd bwa-0.7.17
make --quiet
cd ..
mv ./bwa-0.7.17 /opt
echo -e "export PATH=/opt/bwa-0.7.17:\$PATH" > /etc/profile.d/bwakit.sh
chmod 755 /etc/profile.d/bwakit.sh
 
echo Finished : `date` >> tools-setup.log

Wiki Page for fusera

https://github.com/mitre/fusera/wiki


Contact SRA

Contact SRA staff for assistance at sra@ncbi.nlm.nih.gov


Last updated: 2019-07-19T16:01:57Z