Submitting Sequence Data for a dbGaP project

Introduction

The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies investigating the interaction of genotype and phenotype in humans.

Find all dbGaP studies in SRA: "cluster dbgap"[Properties].

Genomic sequence data can potentially identify a donor even after deidentification. Access to dbGaP data is therefore restricted to:

  • Researchers who request data sets for specific research uses
  • Institutional signing officials from the PI's home organization who certify and submit such requests
  • NIH staff who review and process requests

Researchers can apply for access to dbGaP data at the Authorized Access Portal.

Important: DO NOT submit sequence data for a dbGaP study through the SRA Submission Portal.

Note: All submissions that require controlled access must be submitted through dbGaP. The consent status of the human subjects in your study must be established prior to data transfer. If patients have not explicitly consented to the public release of their genomic data, it will be archived behind a controlled-access firewall. If you are unsure whether your patients' data should be stored in the authorized access archive, we recommend that you contact your institution's IRB to determine where your data should be deposited. The funding agency supporting your study may also explicitly specify whether the data must be deposited in an authorized access archive or can be stored in a public repository (e.g. the NIH GDS policy (https://gds.nih.gov/) requires NIH studies funded after January 25th, 2015 to demonstrate consent by providing institutional certification to the NIH Institutes and Centers Genomic Program Administrators, who then register the study in dbGaP).

Submission Overview

Register a study and subjects with dbGaP

An interactive overview of dbGaP submission can be found here.

The dbGaP submission documentation package can be downloaded here.

All questions pertaining to this stage of submission should be directed to dbgap-help@ncbi.nlm.nih.gov.

Submit sequencing metadata through the dbGaP submission portal

You will receive an email with an attached submission spreadsheet once your study and samples have been registered in dbGaP.

The sample column of the sequencing metadata spreadsheet will be pre-filled with your dbGaP identifiers (the study's accession and the dbGaP sample IDs). You will need to complete the spreadsheet with the required technical details and the names and MD5 checksums of the sequence files you will be uploading.
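The filename and MD5 values for the spreadsheet can be gathered in one pass. The following is a minimal sketch, not an official NCBI tool: the function name md5_manifest is hypothetical, and it assumes either md5sum (typical on Linux) or md5 (macOS) is installed. It prints the file's base name, without any path, followed by a tab and the checksum.

```shell
# md5_manifest FILE...: print "<filename><TAB><md5>" for each file.
# Hypothetical helper; assumes md5sum (Linux) or md5 (macOS) is available.
md5_manifest() {
  for f in "$@"; do
    if command -v md5sum >/dev/null 2>&1; then
      # md5sum prints "<checksum>  <path>"; keep only the checksum
      sum=$(md5sum "$f" | awk '{print $1}')
    else
      # macOS: -q prints the checksum alone
      sum=$(md5 -q "$f")
    fi
    # basename drops the directory part, as the filename column requires
    printf '%s\t%s\n' "$(basename "$f")" "$sum"
  done
}
```

For example, md5_manifest ./*.bam > checksums.tsv writes one line per BAM file that can be pasted into the filename and MD5_checksum columns.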

Description of the columns in the spreadsheet

Name (Column) - Instructions

  • phs_accession (A) - This will be filled in for you in the spreadsheet you receive. It will contain the phs accession of the dbGaP study without the study version numbers.
  • sample_ID (B) - The sample IDs will be filled in when you receive the spreadsheet. If you need to submit more than one library per sample, you can copy and paste the same sample name in a new row. Be careful not to edit or change the sample names. They must match the names submitted to dbGaP exactly. Any sample name changes that need to be made must be made through the dbGaP registration process.
  • library_ID (C) - The library_ID is primarily a unique identifier for the sequencing library. Each library_ID must be unique within the submission and across all submissions for the same study in dbGaP. This value can be an internal identifier, or it can be just the sample name repeated if you do not have an additional identifier for the sequencing library. Please note that if you are submitting more than one sequencing library per sample, you will need to make sure the library names do not repeat.
  • title/short description (D) - The title should be treated as a name or title that will help a user briefly identify what data was in the sequencing library and should be no longer than a single sentence.
  • library_strategy (E) - [Controlled Vocabulary] The library strategy must be selected from the list of possible values. These are provided both as a drop-down menu in the spreadsheet and through a clickable link in the column title to the Terms sheet, where each option is described in more detail. This field is used by users searching for data, so please choose the closest option and use the design description to detail any nuances this list doesn't include.
  • library_source (F) - [Controlled Vocabulary] The library source must be selected from the list of possible values. These are provided both as a drop-down menu in the spreadsheet and through a clickable link in the column title to the Terms sheet, where each option is described in more detail. This field is used by users searching for data, so please choose the closest option and use the design description to detail any nuances this list doesn't include.
  • library_selection (G) - [Controlled Vocabulary] The library selection must be selected from the list of possible values. These are provided both as a drop-down menu in the spreadsheet and through a clickable link in the column title to the Terms sheet, where each option is described in more detail. This field is used by users searching for data, so please choose the closest option and use the design description to detail any nuances this list doesn't include.
  • library_layout (H) - [Controlled Vocabulary] Select either 'single' or 'paired' from the list.
  • platform (I) - [Controlled Vocabulary] Select the sequencing platform manufacturer from the list of possible platforms. You must select this before selecting the instrument model.
  • instrument_model (J) - [Controlled Vocabulary] Select the model of instrument for the platform. You must select a platform first. After you select the platform, the list of possible models for that platform will appear in the drop-down menu.
  • design_description (K) - The design description should be treated like a materials and methods description explaining how this library was prepared and sequenced. Please provide the design description as single line text without newlines or special characters and make the description long enough (at least 3 sentences) so that a user can understand what the contents of any sequencing data files will be. Include kit name and version and part number if you have it for any kits. Avoid including information like the sequencing platform unless it is necessary to describe unique features of the library or process.
  • reference genome as assembly or accession (L) - [Aligned Data Only] The reference genome used in the alignment. Leave this blank if submitting unaligned files such as FASTQ. In most cases only the base reference genome is needed; use a single name or accession only. For example, "GRCh38" is the preferred way to enter the assembly, while "GRCh38/hg38" will likely cause delays in processing.
  • alignment_software (M) - [Aligned Data Only] Provide the alignment software that was used to generate the alignment in the data. Please include the software version if known.
  • filetype (N/Q) - [Controlled Vocabulary] Select the filetype from the list of options for the data being submitted.
  • filename (O/R) - The exact name of the file that will be uploaded. Include all extensions but do not include the full or relative path information on your storage.
  • MD5_checksum (P/S) - The checksum of the file, generated using the MD5 algorithm. Used to ensure that the upload process did not introduce any errors.
    The spreadsheet contains space for two files, each with a filetype, filename, and MD5 checksum required. BAM submissions will typically have only a single file per library. Paired FASTQ data will typically have two files, but a sequencing library will sometimes have more than two FASTQ files. In those cases additional columns of filetype, filename, and MD5 checksum can be added using the same column titles.
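Because library_ID values must not repeat, it can help to check a completed spreadsheet before returning it. The sketch below makes several assumptions that are not part of the dbGaP instructions: the spreadsheet has been exported as tab-delimited text, the first row is the header, and library_ID is the third column (column C). The function name duplicate_library_ids is hypothetical. It only checks within one file, so it cannot detect clashes with earlier submissions for the same study.

```shell
# duplicate_library_ids FILE: list library_ID values that occur more than once.
# Assumes a tab-delimited export with a header row and library_ID in column 3.
duplicate_library_ids() {
  awk -F '\t' '
    NR > 1 && $3 != "" { count[$3]++ }   # skip the header row and empty cells
    END {
      for (id in count)
        if (count[id] > 1) print id
    }
  ' "$1" | sort
}
```

An empty result means no library_ID is repeated within the file.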

SRA: Transfer sequence files to the protected SRA account

An SRA Curator will assist with your file upload once your sequencing metadata spreadsheet has been completed and returned.

Important: Upload the data files only to the upload account provided to you after completing the metadata spreadsheet.

You will need the Aspera transfer software to upload your files. This program is free to use for submitters transmitting data to and from NCBI. Check with your local networking team to ensure UDP transfer is enabled for the following IP ranges: 130.14.*.* and 165.112.*.*. The firewall must also allow outbound SSH traffic to NCBI.

Aspera key pairs

Submitters will need to generate a key pair to use the Aspera upload account (see the instructions below for creating a key pair on different operating systems).

Send only the public key to the SRA Curator assisting you in order to receive access to the asp-sra@gap-submit.ncbi.nlm.nih.gov upload account.

Linux/Unix

Linux/Unix and OS X users can use the command line ssh-keygen utility.

The following command line creates a private key (mykey) and a public key (mykey.pub) in the current working directory:

ssh-keygen -f mykey

macOS/OS X

Submitters using macOS can also use the ssh-keygen command line tool, but will need to open the Terminal program installed in /Applications/Utilities/. Once the program opens, the following command generates a key pair (a private key, key-file-name, and a public key, key-file-name.pub) in the PEM format to ensure compatibility with Aspera Connect.

ssh-keygen -t rsa -m PEM -f key-file-name

Windows

Windows does not have an easy-to-access key pair generator built in. However, a free software tool, PuTTYgen, can be used to make key pairs. Instructions for downloading and using PuTTYgen for key pair generation can be found in the more detailed guide here: Aspera Keys.

Aspera command line usage

Once your key has been added and you have been granted access you can upload your files via ascp.

An example ascp command for dbGaP uploads:

ascp -i <key file> -Q -l 200m -k 1 <file(s) to transfer> asp-sra@gap-submit.ncbi.nlm.nih.gov:<directory>

Where:

  • <directory> is either test or protected.
  • <key file> is a private key file (full pathname must be used).

Aspera uploads frequently fail when wildcards (e.g. *.bam) are used in the transfer command to send multiple files. Use a loop to transfer multiple files instead; this results in a much higher upload success rate.

Example upload loops

Bash and macOS

# Upload each BAM file in the current directory, one transfer per file
for F in ./*.bam
do
    ascp -i <key file> -l 200m -k 1 "$F" asp-sra@gap-submit.ncbi.nlm.nih.gov:<directory>
done
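Since individual transfers can still fail inside a loop, one option is to retry each file a few times before giving up. The following is a sketch of that idea and not part of the NCBI instructions: the retry function and the RETRY_DELAY variable are hypothetical names, and the number of attempts and the delay are arbitrary choices.

```shell
# retry MAX CMD...: run CMD until it succeeds or MAX attempts have been made.
# Hypothetical helper; RETRY_DELAY (seconds between attempts) defaults to 30.
retry() {
  max=$1
  shift
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "failed after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "${RETRY_DELAY:-30}"
  done
}

# Usage with the upload loop (placeholders as in the example above):
# for F in ./*.bam
# do
#     retry 3 ascp -i <key file> -l 200m -k 1 "$F" asp-sra@gap-submit.ncbi.nlm.nih.gov:<directory> || break
# done
```

Stopping at the first file that fails all of its attempts (the || break) makes it easier to see where the upload stopped before contacting your SRA Curator.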

Windows

FOR %f IN (*.bam) DO C:\install\directory\ascp.exe -i <key file> -l 200m -k 1 %f asp-sra@gap-submit.ncbi.nlm.nih.gov:<directory>

Confirm data receipt

Once all files and metadata have been uploaded, please confirm with your SRA Curator that the SRA portion of your dbGaP submission is complete. The curator can provide a report of the files that were loaded. A nightly per-sample report is also provided on the dbGaP website. To view the report for your study, change the accession phs000000 in the address below to your study's accession. For the report in XML format, change rettype=html to rettype=xml.

https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetSampleStatus.cgi?study_id=phs000000&rettype=html
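The substitution described above can be scripted. This is a small sketch in which the function name sample_status_url is hypothetical; the address and the two query parameters are taken from the URL above. The second argument is optional and defaults to html.

```shell
# sample_status_url ACCESSION [RETTYPE]: print the nightly sample report URL.
# RETTYPE is "html" (default) or "xml", as described in the text.
sample_status_url() {
  study=$1
  rettype=${2:-html}
  printf 'https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetSampleStatus.cgi?study_id=%s&rettype=%s\n' "$study" "$rettype"
}

# Example (phs000000 is the placeholder accession used above):
# curl -s "$(sample_status_url phs000000 xml)" -o sample_status.xml
```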

XML Submission

XML submissions of sequencing metadata are recommended only for submitters who will be regularly submitting sequencing data for multiple dbGaP studies. If you would like to set up an account to upload metadata via XML, please notify an SRA Curator and they will assist you. For each study, the submitter will need to upload three XML files packaged together in a tarball:

  • submission.xml
  • experiment.xml
  • run.xml

Submitters will create one entry for each library in the experiment.xml, and an entry for each BAM or production run in the run.xml. These XML files will be stored in a single tar archive and uploaded to an account at NCBI for the submitting center. The XML schemas are available here.
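Packaging the three files might look like the sketch below. The function name package_submission and the archive name are hypothetical, and an uncompressed tar archive is assumed; confirm with your SRA Curator whether a compressed archive or a particular file name is expected.

```shell
# package_submission DIR OUT: bundle the three XML files in DIR into the tar archive OUT.
# Hypothetical helper; refuses to build the archive if any of the three files is missing.
package_submission() {
  dir=$1
  out=$2
  for f in submission.xml experiment.xml run.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing $dir/$f" >&2
      return 1
    fi
  done
  # -C stores the files without the leading directory path
  tar -cf "$out" -C "$dir" submission.xml experiment.xml run.xml
}

# Example: package_submission ./phs000000_xml phs000000_submission.tar
```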

Not all possible combinations of XML are shown here. Please contact sra@ncbi.nlm.nih.gov if you need additional help formatting your XML.

Linking to a Registered dbGaP Study

In the <EXPERIMENT> XML:

<STUDY_REF accession="phs000000"/>

Linking to Registered dbGaP Samples

In the <EXPERIMENT> XML:

<SAMPLE_DESCRIPTOR refcenter="phs000000" refname="submitted_sample_id"/>

Contact SRA staff

Contact SRA staff for assistance at sra@ncbi.nlm.nih.gov.

Last updated: 2022-07-20T19:42:58Z