dbGaP Study Submission Guide

You must register your study before submitting data.

Register study --> Prepare files for submission --> Check files before submission --> Submit --> dbGaP curators process --> Receive signal and submit high throughput sequences: BAM, CRAM, FASTQ --> Preview and Approve --> Release

What's new?

  • High throughput sequence metadata should now be uploaded to the dbGaP Submission Portal under section "Sequence metadata" instead of through email. (May 2021)
  • Pre-validation tools (GaPTools) are now available so that you can check your data on your own system before submitting to dbGaP. (February 2021)
  • The study config can now be filled out online in your study's Submission Portal. (October 2020)
  • Automated Preprocessing Validation Checks are being run on all studies submitting PLINK or VCF files. This system will provide feedback within a few days of submission on ID errors and inconsistencies between PLINK, VCF, Subject Consent, SSM, and Pedigree datasets (DS). For active studies pre-dating this new system, curators will work with you to update your files, so that this automated check can be run. (July 2019)
  • Biological sex is required in the Subject Consent files in order to run the Automated Preprocessing Validation Checks. (July 2019)
  • SAMPLE_USE is discontinued from the Subject Sample Mapping files. Please remove before submitting. (April 2018)

Use the questions below to jump to relevant sections or use your browser's find function to search for keywords.

Prepare Files for Submission

1. What files do I need to submit to dbGaP?

When a study is registered by a Genomic Program Administrator (GPA) in the dbGaP Submission System (SS), the GPA indicates what data is expected to be submitted. This may be verified by the Program Officer (PO) who oversees the study funding. The submitter will separately complete a Submission Portal (SP) Questionnaire that summarizes the data that will be uploaded to the SP. The SS and SP Questionnaire must match in order for dbGaP to know what data should be processed for release.

File Submission Checklist

All studies must complete the Study Config web form. This will populate the public study report page and a dbGaP study accession (phs######.v#.p#) will be provided.

For the remaining data, please submit only the files that correspond to the data agreed upon in the SS and SP questionnaire. To determine which files are applicable, go through the File Applicability section immediately following this list.

For faster processing time, submit to the dbGaP Submission Portal by uploading all files in one submission. Do not submit BAM, CRAM, FASTQ files until notified.

File Applicability

Phenotype Dataset (DS) and Data Dictionaries (DD)

  1. Studies that have consented subjects must submit a Subject Consent DS and DD.
  2. Studies that have individual level phenotype data (demographic, clinical, exposure, metabolomic, proteomic, etc) should submit 1 or more Subject Phenotypes DS and DD.
  3. Studies that have molecular data (array, methylation, called variants, sequence, etc) must submit a Subject Sample Mapping DS and DD and 1 or more Sample Attributes DS and DD.
  4. Studies that have self-reported or known genetic relationships and monozygotic twins must submit a Pedigree DS and DD.
  5. Studies that have samples submitted to another NCBI database (GEO, GenBank, Trace, or public SRA) must provide a Mapping DS and DD between the study samples and the other NCBI database sample accession.

Molecular Data

Any GWAS, SNP array, imputations, transcriptomic, epigenomic, gene expression, variant calls from WGS, WXS, and targeted sequencing data. This does not include raw sequencing data and alignment information, which is submitted separately¹.

Association Analyses

Any aggregated genomic level data

Study Documents

Any consent forms, protocols, questionnaires, etc. that correspond to the data.

Medical Images

Any CT scans, eye images, etc.


¹Sequence data (e.g. BAM, CRAM, FASTQ) should be submitted only after both of the following have happened: 1) You have received an email with an attached sequence metadata file containing the registered subject and sample IDs and their consents. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. 2) The sequence metadata has been processed and a sequence curator has contacted you to upload data.

2. Where can I download dbGaP Submission Guide Templates to generate the files I need to submit?

Download all Submission Templates: dbGaP_Submission_Package_20200930.zip

Download individual Submission Templates: https://ftp.ncbi.nlm.nih.gov/dbgap/dbGaP_Submission_Guide_Templates/Individual_Submission_Templates/

Study Config

3. What is the Study Config?

The Study Config is a web form that collects a description of the study data, methods, and findings, inclusion/exclusion, study history, references, attributions, and terms that will be indexed to enable users to search for your study in dbGaP Advanced Search. The study config must be submitted in order to have a dbGaP study accession (phs######.v#.p#) that can be published in dbGaP and used in journal publications. Here is an example of the study report page populated by the information in the study config: (https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000001.v3.p1).
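If you handle accessions in your own scripts, the phs######.v#.p# pattern can be split into its numeric parts. The following is a minimal sketch, not an official dbGaP tool. The function name parse_accession is hypothetical, and the pattern assumes that each # stands for a digit: six after "phs", and one or more after "v" and "p".

```python
import re

# Assumed reading of phs######.v#.p#: six digits, then the v and p numbers.
ACCESSION = re.compile(r"phs(\d{6})\.v(\d+)\.p(\d+)")

def parse_accession(accession):
    """Split a study accession into (study number, v number, p number)."""
    match = ACCESSION.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a phs######.v#.p# accession: {accession!r}")
    study, v_number, p_number = (int(part) for part in match.groups())
    return study, v_number, p_number

print(parse_accession("phs000001.v3.p1"))  # (1, 3, 1)
```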

To fill out the study config, go to your study's dbGaP Submission Portal (https://submit.ncbi.nlm.nih.gov/dbgap/).

  • Click on "Create" if newly filling out the study config or click on "Edit" to modify an existing study config.
  • Once done, press "Submit" and you will be taken back to the study's Submission Portal page.
  • To preview the study config, click on "Preview Study Report Page".

If you would like to see in advance what items will be collected in the web form, open 1_StudyConfig.docx.

Study Participant De-identification

4. What is a dbGaP Subject?

A dbGaP Subject is defined as a single human person/individual/patient that arises from a single germline. Each subject should be submitted with a single, unique, de-identified subject ID. Subjects submitted to dbGaP must be consented to submit to a public database. Subject IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the subject ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SUBJECT_ID in one file and INDIVIDUAL_ID in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP subject ID that will be included in the final dump files along with the submitted subject ID. Subjects that are known to be the same person across dbGaP studies will be assigned the same dbGaP subject ID.
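The ID rules above can be checked on your own system before submission. The following is a minimal sketch under stated assumptions, not a dbGaP tool: the function name id_problems is hypothetical, and matching "dbGaP" without regard to case is an assumption.

```python
import re

# Characters the guide allows: English letters, Arabic numerals, . - _ @ #
ALLOWED_ID = re.compile(r"[A-Za-z0-9._@#-]+")

def id_problems(value):
    """Return the rule violations found in one submitted ID (empty list if none)."""
    problems = []
    if not ALLOWED_ID.fullmatch(value):
        problems.append("characters outside A-Z a-z 0-9 . - _ @ #")
    # An all-digit ID that starts with 0 (other than "0" itself) is zero padded.
    if value.isdigit() and len(value) > 1 and value.startswith("0"):
        problems.append("zero-padded integer")
    if "dbgap" in value.lower():
        problems.append('contains "dbGaP"')
    return problems

for candidate in ["AB-12_x", "007", "A 1", "dbGaP_12"]:
    print(candidate, id_problems(candidate))
```

The same check applies to sample IDs, which follow the same character rules.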

5. What is a dbGaP Sample?

A dbGaP Sample is defined as the ID of the final preps submitted to dbGaP by a genotyping center, a sequencing group, or to an NCBI resource, such as GEO or GenBank. A single subject may be mapped to multiple samples, but a single sample should not be mapped to multiple subjects unless the samples are pooled.* For example, if one subject (SUBJECT_ID) provided one sample, and that sample was processed to generate 2 sequencing runs or 1 sequencing and 1 genotyping array run, the data file would show two rows, both using the same subject ID, but having 2 unique sample IDs.

*Please inquire about pooled samples if applicable. This would only apply to pooled samples that belong to consented subjects. If the samples are pooled from controls that are publicly available, there is no need for marking the pooled samples, and a single sample ID may be assigned.

Each sample should be submitted with a single, unique, de-identified sample ID. Sample IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the sample ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SAMPLE_ID in one file and SAMPLE_NAME in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP sample ID that will be included in the final dump files along with the submitted sample ID.

6. What do I need to know about protecting study participants' privacy, HIPAA, and subject de-identification for dbGaP data submissions?

To comply with HIPAA, personally identifying information must be removed from all data, e.g. names, cities, dates, telephone numbers, social security numbers, and any other potentially identifying information, characteristic, or code.

A 2-Step de-identification is required for all IDs submitted in dbGaP data files.

Example: Two step removal of identifiers

Step one: Personal Information → Remove identifiers → Create Study person ID

Step two: Study person ID → Create Subject ID submitted to dbGaP.

Subject IDs submitted to dbGaP may be randomly assigned or may be consecutive numbers without any identifying information (i.e., the submitted Subject ID should not be based on the study person ID or any personal identifiers such as subject's birth date, health record number, or name). The same applies to sample IDs.
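Step two can be sketched as follows. This is one possible approach, not a dbGaP requirement: the function name assign_submitted_ids and the example study person IDs are hypothetical. The submitted ID is a consecutive number taken from a randomly shuffled order, so it carries no information from the study person ID. The resulting mapping should stay with the submitter.

```python
import random

def assign_submitted_ids(study_person_ids):
    """Map each study person ID to a consecutive number assigned in random order."""
    people = list(study_person_ids)
    if len(set(people)) != len(people):
        raise ValueError("study person IDs must be unique")
    # SystemRandom draws from the operating system's entropy source, so the
    # order cannot be reproduced from a seed.
    random.SystemRandom().shuffle(people)
    # The submitted ID depends only on the shuffled position, never on the
    # study person ID itself.
    return {person: str(position) for position, person in enumerate(people, start=1)}

mapping = assign_submitted_ids(["STUDY-A17", "STUDY-B02", "STUDY-C33"])
print(sorted(mapping.values()))  # ['1', '2', '3']
```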

Click here to see the algorithm dbGaP uses to find HIPAA sensitive dates: HIPAA

Phenotype Dataset (DS) and Data Dictionary (DD) Files

This set of files is referred to as phenotype datasets and data dictionaries because it is curated by the phenotype curator.

7. What is a Phenotype Dataset (DS) File?

A Dataset (DS) file is a rectangular table of data values, subject/sample IDs, and variables, to be submitted either in .txt or .xlsx format, with .txt being the preferred format. There are 5 types of datasets required for submission:

  1. Subject Consent (SC) DS – 1 file only per study. This is a list of subjects (person), their consents, and biological sex.
  2. Subject Sample Mapping (SSM) DS – 1 file only per study. This is a list of subjects (person) mapped to their samples submitted as molecular data and high throughput sequencing data.
  3. Pedigree DS – 1 file only per study if there are self-reported or known genetic relationships.
  4. Subject Phenotypes DS – 1 or more files per study. This is person-level phenotypes.
  5. Sample Attributes DS – 1 or more files per study. This is sample-level attributes.

Required if applicable: Sample Mapping to other NCBI databases (e.g. Trace, GEO, GenBank, public SRA) – 1 or more files per study.

Each column represents a single phenotypic variable. Row # 1 (column headers) of a data file will contain only the variable names.

Each row contains phenotypes of one Subject or attributes of one Sample. Following the first row (column headers), each subsequent row will reflect data of one subject or sample, depending on the type of file.

8. What is a Phenotype Data Dictionary (DD) File?

A Data Dictionary (DD) file is a table that defines and describes the variables in the corresponding dataset file (DS). It should be submitted in either .txt or .xlsx format, with .xlsx being the preferred format. Each dataset (DS) file must be submitted with a corresponding DD file. You may review a complete list of data dictionary descriptions and specifications, including those required in your DD file in the APPENDIX.

The required columns and specifications for a DD File are:

Column 1: VARNAME – variable name. Best if the varname reflects the measurements taken (e.g. HDL_am, ALCOHOL_day, TREATMENT_tamoxi). Do not use "dbGaP" in the variable name.

Column 2: VARDESC – variable description. Be specific so that it is clear what you have measured. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" is more informative. Alternatively, submit study documents with details of data collection — dbGaP will link appropriate document sections to variables. For the AFFECTION_STATUS, please fill the disease name in the VARDESC.

Column 3: UNITS – units of measurement. If there are no units, leave the entry blank. If none of the variables have units, the UNITS column may be omitted.

Last set of columns: VALUES – encoded values with definitions to describe the codes used in the DS. Fill single value in one cell; no compound values in one cell. See VALUES in APPENDIX for full requirement details.

Example:

Last column with header | Leave header blank   | Leave header blank  | Leave header blank
VALUES                  |                      |                     |
10=Elementary           | 20=High School       | 40=College          | 4=Graduate School
1=2-4 drinks per day    | 2=5-7 drinks per day | 3=>7 drinks per day |
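Because the DD describes the variables in its DS, a quick local check is to confirm that every DS column header has a VARNAME row in the DD. The following is a minimal sketch: the function name undocumented_variables is hypothetical, and it assumes both files are tab-delimited .txt files that have already been read into strings.

```python
import csv
import io

def undocumented_variables(ds_text, dd_text):
    """Return the DS column headers that have no VARNAME row in the DD."""
    ds_header = next(csv.reader(io.StringIO(ds_text), delimiter="\t"))
    dd_rows = csv.DictReader(io.StringIO(dd_text), delimiter="\t")
    documented = {row["VARNAME"] for row in dd_rows}
    return [name for name in ds_header if name not in documented]

# Hypothetical file contents for illustration only.
ds_text = "SUBJECT_ID\tAGE\tHDL_am\n1\t35\t50\n"
dd_text = (
    "VARNAME\tVARDESC\tUNITS\tVALUES\n"
    "SUBJECT_ID\tSubject ID\t\t\n"
    "AGE\tAge at enrollment\tyears\t\n"
)
print(undocumented_variables(ds_text, dd_text))  # ['HDL_am']
```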

Study Meta DS and DD Files: Subject Consent, Subject Sample Mapping (SSM), and Pedigree

9. How do I create Subject Consent (SC) DS and DD files?

The Subject Consent (SC) DS contains a comprehensive list of all unique de-identified subject IDs, their assigned consent group, and biological sex value. Open the templates under Phenotype_Data:
2a_SubjectConsent_DS.txt
2b_SubjectConsent_DD.xlsx

The 2 variables required for the DS File are SUBJECT_ID and CONSENT.

Column 1: SUBJECT_ID

The first column must be the IDs of the subjects. Enter a single de-identified subject ID for each person, and preferably use "SUBJECT_ID" as the subject ID header. If necessary, you may use another variable name (but be consistent in all study files). Please do not use "dbGaP" in the variable name or the ID itself. See SUBJECT_ID in Glossary for full requirement details.

IDs listed in the SUBJECT_ID column must include:

  1. All consented de-identified subject IDs with submitted phenotype
  2. All consented de-identified subject IDs with molecular data (e.g. genotypes, high throughput sequences, GEO)
  3. Unconsented pedigree members used for linking purposes only (without submitted data)
  4. Unconsented HapMap subjects used as controls or other publicly available controls in genotype data

Column 2: CONSENT

The second column must be the consents of the subjects. Enter a single consent value for each person using an integer (1,2,3…) encoded in the DD. The DD consents must match the consents registered in the Submission System (SS). If they do not match, we cannot process your study. If you are a submitter and do not have access to the SS, you can see the consent groups in the dbGaP Submission Portal for your study by clicking "View consent group" in the box on the upper right. For questions regarding the registered consent group and DUL, please contact your GPA. For unconsented pedigree linking members and HapMap controls, set CONSENT=0. See CONSENT in Glossary for full requirement details.

In the corresponding DD, dbGaP will automatically code 0=Subjects used as genotyping controls and/or pedigree linking members (i.e. subject IDs without any submitted phenotype and/or molecular data), so that 0 does not need to be included in the DD. For all other consent groups > 0, use the format: code=Consent Group's Title (Consent Group's Abbreviation). For example, here is what a study with 2 consent groups might look like in the DD.

Last column with header                | Leave header blank
VALUES                                 |
1=General Research Use (NPU) (GRU-NPU) | 2=Health/Medical/Biomedical (GSO) (HMB-GSO)
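To compare the DD consent values with the consent groups registered in the SS, it can help to split each VALUES cell into its code, title, and abbreviation. The following is a minimal sketch that assumes the code=Title (Abbreviation) format described above, with the abbreviation in the last pair of parentheses. The function name parse_consent_value is hypothetical.

```python
import re

# Assumed cell shape: integer code, "=", title, then "(abbreviation)" at the end.
CONSENT_VALUE = re.compile(r"(\d+)=(.+) \(([^()]+)\)")

def parse_consent_value(cell):
    """Split one VALUES cell into (code, title, abbreviation)."""
    match = CONSENT_VALUE.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"not in code=Title (Abbreviation) format: {cell!r}")
    code, title, abbreviation = match.groups()
    return int(code), title, abbreviation

print(parse_consent_value("1=General Research Use (NPU) (GRU-NPU)"))
# (1, 'General Research Use (NPU)', 'GRU-NPU')
```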

Column 3: SEX

Provide the biological sex value of the person listed in the SUBJECT_ID column. To speed up study processing through the dbGaP auto-pipeline, sex values have been restricted to M/Male/1 or F/Female/2 or UNK/Unknown/NULL, and should match the sex values entered into the Pedigree DS if a pedigree DS is applicable. All other values will require a resubmission.

Aliases or Overlapping Subjects between Studies

Include the variables SUBJECT_SOURCE and SOURCE_SUBJECT_ID ONLY IF the following applies: dbGaP aims to label a single person with the same dbGaP assigned subject ID, even though the submitted subject IDs for that person might be different. This is so that users who download multiple studies will not double count a person who has been included in multiple studies. For dbGaP to assign the same dbGaP subject ID, include the two variables, SUBJECT_SOURCE and SOURCE_SUBJECT_ID. This is required for Coriell HapMap subjects, subjects in public repositories (RUCDR, NRGR, NINDS Repository, etc.), and subjects that have been or will be submitted to another dbGaP study. Please avoid a SUBJECT_SOURCE that is very general coupled with a SOURCE_SUBJECT_ID that is a simple integer. For example, SUBJECT_SOURCE=University of California and SOURCE_SUBJECT_ID=1. There is a potential for unintended subject collision; that is, two different people are assigned the same source and ID across studies. There are many University of Californias and there are many studies that use 1 as an ID.

Column 4 and 5: SUBJECT_SOURCE and SOURCE_SUBJECT_ID (Submit both variables. We are unable to process SOURCE_SUBJECT_ID without a SUBJECT_SOURCE).

Name the public repository, consortium, institute, or study. Provide the de-identified subject ID used in the source. SOURCE_SUBJECT_IDs should follow guidelines for SUBJECT_ID.

  1. For referencing HapMap subjects from Coriell, the SUBJECT_SOURCE value should be written as "Coriell." The SOURCE_SUBJECT_ID should be written as the de-identified subject ID assigned by Coriell.
  2. For referencing dbGaP assigned subject IDs, the SUBJECT_SOURCE value should be written as "dbGaP." The SOURCE_SUBJECT_ID should be written as the dbGaP assigned Subject ID.
  3. For linking subjects to a study already in dbGaP, please contact us, and we will provide you with more information on how to map subject IDs.
  4. The SUBJECT_ID and SOURCE_SUBJECT_ID can have identical or different IDs.
  5. For Subject IDs that map to more than one alias, add a numbered pair of columns for each additional alias. For example, if SUBJECT_ID 101 is known as NA1111 (Coriell) and 45678 (NHGRI), use SUBJECT_SOURCE=Coriell and SOURCE_SUBJECT_ID=NA1111 for the first alias. For the second alias, create two additional columns titled SUBJECT_SOURCE2 and SOURCE_SUBJECT_ID2, with SUBJECT_SOURCE2=NHGRI and SOURCE_SUBJECT_ID2=45678. If there are more aliases, continue with SUBJECT_SOURCE3 and SOURCE_SUBJECT_ID3, and so forth. Note that the first SUBJECT_SOURCE and SOURCE_SUBJECT_ID do not have the number "1" in the header names.
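The header naming for multiple aliases can be sketched as a small helper. This is illustrative only, and the function name alias_columns is hypothetical. It takes a subject's aliases in order and returns the column headers and cell values, leaving the first pair unnumbered and numbering the rest from 2.

```python
def alias_columns(aliases):
    """Turn [(source, source_id), ...] into SUBJECT_SOURCE / SOURCE_SUBJECT_ID cells."""
    row = {}
    for position, (source, source_id) in enumerate(aliases, start=1):
        # The first pair has no number; later pairs are numbered 2, 3, ...
        suffix = "" if position == 1 else str(position)
        row[f"SUBJECT_SOURCE{suffix}"] = source
        row[f"SOURCE_SUBJECT_ID{suffix}"] = source_id
    return row

print(alias_columns([("Coriell", "NA1111"), ("NHGRI", "45678")]))
```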

Example of Subject Consent DS File

SUBJECT_ID | CONSENT | SEX | SUBJECT_SOURCE     | SOURCE_SUBJECT_ID
1          | 1       | 1   |                    |
2          | 1       | 1   | NRGR               | 1012
3          | 1       | 1   | NINDS              | NDS00008
4          | 1       | 2   |                    |
5          | 1       | 2   | Example Consortium | 1284yA8-B
6          | 1       | UNK |                    |
7          | 1       | UNK |                    |
8          | 1       | 1   | dbGaP              | 12
9          | 1       | 1   | dbGaP              | 13
10         | 1       | 2   |                    |
1001       | 0       | 2   |                    |
1002       | 0       | 1   |                    |

Example of Subject Consent DD File

VARNAME           | VARDESC                                  | TYPE          | VALUES
SUBJECT_ID        | Subject ID                               | string        |
CONSENT           | Consent group as determined by DAC       | encoded value | 1=General Research Use (GRU)
SEX               | Biological sex                           | encoded value | 1=Male | 2=Female | UNK=Unknown
SUBJECT_SOURCE    | Source repository where subjects originate | string      |
SOURCE_SUBJECT_ID | Subject ID used in the Source Repository | string        |
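Several of the Subject Consent rules can be checked locally before you upload. The following is a minimal sketch, not the dbGaP Automated Preprocessing Validation Checks. It assumes a tab-delimited .txt file that uses the preferred SUBJECT_ID header, and it assumes sex values must match the listed strings exactly, including case. Whether a blank cell counts as NULL is not stated in this guide, so only the literal strings are accepted here. The function name check_subject_consent is hypothetical.

```python
import csv
import io

# Exact, case-sensitive matching against the listed values is an assumption.
ALLOWED_SEX = {"M", "Male", "1", "F", "Female", "2", "UNK", "Unknown", "NULL"}

def check_subject_consent(ds_text):
    """Return a list of problems found in a tab-delimited Subject Consent DS."""
    problems = []
    seen = set()
    reader = csv.DictReader(io.StringIO(ds_text), delimiter="\t")
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        subject = row.get("SUBJECT_ID") or ""
        if subject in seen:
            problems.append(f"line {line_number}: SUBJECT_ID {subject} is repeated")
        seen.add(subject)
        if not (row.get("CONSENT") or "").isdigit():
            problems.append(f"line {line_number}: CONSENT is not an integer code")
        if row.get("SEX") not in ALLOWED_SEX:
            problems.append(f"line {line_number}: SEX value is not in the allowed set")
        if row.get("SOURCE_SUBJECT_ID") and not row.get("SUBJECT_SOURCE"):
            problems.append(f"line {line_number}: SOURCE_SUBJECT_ID without SUBJECT_SOURCE")
    return problems

# Hypothetical file contents: the second data row breaks two rules.
example = (
    "SUBJECT_ID\tCONSENT\tSEX\tSUBJECT_SOURCE\tSOURCE_SUBJECT_ID\n"
    "1\t1\t1\t\t\n"
    "2\t1\tX\t\t1012\n"
)
for problem in check_subject_consent(example):
    print(problem)
```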

10. How do I create Subject Sample Mapping (SSM) DS and DD files?

The SSM is a mapping of SUBJECT_IDs (consented subjects and their phenotype data) to SAMPLE_IDs. This list of SAMPLE_IDs is an assertion of the samples that will be submitted in the molecular data. Open the templates under Phenotype_Data:
3a_SSM_DS.txt
3b_SSM_DD.xlsx

The required variables are SUBJECT_ID and SAMPLE_ID.

Column 1: SUBJECT_ID

The first column must be the IDs of the subjects. Enter only SUBJECT_IDs that are linked to SAMPLE_IDs with submitted molecular data. Each subject listed in the SUBJECT_ID column must either be consented with CONSENT>0 or be a publicly available control with CONSENT=0 in the Subject Consent DS. For SUBJECT_IDs with multiple types of molecular data (e.g. SNP array data, RNA expression data, sequencing data), use multiple rows with identical subject ID, but distinct sample IDs. See SUBJECT_ID in Glossary for full requirement details.

Column 2: SAMPLE_ID

The second column must be the IDs of the samples. The de-identified SAMPLE_IDs in this column must be identical to those used in the molecular data (PLINK, VCFs, etc) and sequence metadata. Different sample runs or aliquots of the same sample should be identified by different SAMPLE_IDs, but the same SUBJECT_IDs. Likewise, intended duplicates should also be identified by different SAMPLE_IDs, but the same SUBJECT_IDs. Sample IDs mapping to a public NCBI resource (GEO, GenBank, public SRA) should also be included. The SAMPLE_ID column should not have any repeating IDs. See SAMPLE_ID in Glossary for full requirement details.

Can the SAMPLE_ID be the same as the SUBJECT_ID?

Yes, the SAMPLE_ID can be the same as the SUBJECT_ID, as long as samples that belong to the same person share the same SUBJECT_ID. Please submit an SSM even if each person only has 1 sample and the IDs are identical for the SUBJECT_ID and SAMPLE_ID.

Example of SSM DS File

SUBJECT_ID SAMPLE_ID
1 S1
2 S2
3 S3
4 S4
5 S5
6 S6
6 S7
7 S8
7 S9
7 S10
8 S11
8 S12

Example of SSM DD File

VARNAME    | VARDESC    | TYPE   | VALUES
SUBJECT_ID | Subject ID | string |
SAMPLE_ID  | Sample ID  | string |
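Two of the SSM rules lend themselves to a quick local check: SAMPLE_IDs must not repeat, and every SUBJECT_ID must appear in the Subject Consent DS. The following is a minimal sketch with a hypothetical function name, check_ssm. It assumes the rows have already been read into (SUBJECT_ID, SAMPLE_ID) pairs and that the consents are held in a dictionary keyed by SUBJECT_ID.

```python
def check_ssm(ssm_rows, consent_by_subject):
    """Return problems found in (SUBJECT_ID, SAMPLE_ID) pairs from an SSM DS."""
    problems = []
    seen_samples = set()
    for subject, sample in ssm_rows:
        if sample in seen_samples:
            problems.append(f"SAMPLE_ID {sample} is repeated")
        seen_samples.add(sample)
        if subject not in consent_by_subject:
            problems.append(f"SUBJECT_ID {subject} is missing from the Subject Consent DS")
    return problems

# One subject may map to several samples, as with subject 6 here.
ssm_rows = [("6", "S6"), ("6", "S7"), ("7", "S8")]
consent_by_subject = {"6": 1, "7": 1}
print(check_ssm(ssm_rows, consent_by_subject))  # []
```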

11. How do I create Pedigree DS and DD files?

The Pedigree DS lists the genealogical relationships of subjects within a study. If there are no known relationships, this file does not need to be submitted. However, if dbGaP finds that there are possible relationships between subjects after reviewing the genetic data (with the GRAF [Genetic Relationship and Fingerprinting] software), dbGaP will request a pedigree DS or include a README file with the results of IBD and/or dbGaP GRAF. If the IBD or pedigree information should not be released because of data sharing limitations, please let dbGaP know in writing. See GRAF in the Glossary for more information. Open the templates under Phenotype_Data:
4a_Pedigree_DS.txt
4b_Pedigree_DD.xlsx

The required variables are FAMILY_ID, SUBJECT_ID, FATHER, MOTHER, and SEX.

MZ_TWIN_ID is required if applicable.

Column 1: FAMILY_ID

FAMILY_IDs are de-identified and should be the same for members of the same family.

Column 2: SUBJECT_ID

SUBJECT_IDs should include any person with familial relationships relevant to the study. The SUBJECT_ID column should also include FATHER and MOTHER IDs. All SUBJECT_IDs of the pedigree file should be included in the Subject Consent (SC) DS, where the study subjects have CONSENT >=1 and linking pedigree SUBJECT_IDs have CONSENT=0. See SUBJECT_ID in Glossary for full requirement details.

Columns 3 and 4: FATHER and MOTHER

List FATHER IDs in Column 3 and MOTHER IDs in Column 4. FATHER and MOTHER IDs should be unique and de-identified. Each FATHER ID and MOTHER ID should be included in the SUBJECT_ID column of both the Pedigree DS and the Subject Consent (SC) DS. For SUBJECT_IDs that do not have parents, the FATHER and MOTHER IDs should be filled with 0 or left blank. Dummy IDs should be created for the FATHER and MOTHER IDs if no ID is known and it is necessary to indicate sibling or avuncular relationships.

Column 5: SEX

Provide the biological sex value of the person listed in the SUBJECT_ID column. To speed up study processing through the dbGaP auto-pipeline, sex values have been restricted to M/Male/1 or F/Female/2 or UNK/Unknown/NULL, and should match the sex values entered into the Subject Consent DS. All other values will require a resubmission.

Column 6: MZ_TWIN_ID

De-identified monozygotic twin IDs should indicate monozygotic twins and multiples of the same family. The MZ_TWIN_ID column should distinguish sample duplicates from samples of monozygotic twins. Monozygotic twins and multiples should be assigned the same MZ_TWIN_ID, FATHER, and MOTHER values, but different SUBJECT_IDs. For dizygotic twins and all other individuals, the MZ_TWIN_ID column should be left blank. If you wish to identify dizygotic twins, an additional variable may be included in the subject phenotypes DS.

How should I list families with half siblings?

You may list families with half siblings using either example below; Example 1 is preferred. Please remember to include the SEX column and, if applicable, the MZ_TWIN_ID column.

  • Example 1:

    FAMILY_ID SUBJECT_ID FATHER MOTHER
    1 A C D
    1 B C E
    1 C 0 0
    1 D 0 0
    1 E 0 0
  • Example 2:

    FAMILY_ID SUBJECT_ID FATHER MOTHER
    1 A C D
    1 B C E
    1 C 0 0
    1 D 0 0
    2 E 0 0

Example of a Pedigree DS File

FAMILY_ID SUBJECT_ID FATHER MOTHER SEX MZ_TWIN_ID
100 1001 0 0 2
100 1002 0 0 1
100 1 1002 1001 1 1
100 2 1002 1001 1 1
101 1011 0 0 2
101 1012 0 0 1
101 3 1012 1011 1
102 1022 0 0 2
102 1023 0 0 1
102 4 1023 1022 2

Example of a Pedigree DD File

VARNAME    | VARDESC | TYPE | VALUES
FAMILY_ID  | Family ID | string |
SUBJECT_ID | Subject ID | string |
FATHER     | Father's Subject ID | string |
MOTHER     | Mother's Subject ID | string |
SEX        | Biological sex | encoded value | 1=Male | 2=Female | UNK=Unknown
MZ_TWIN_ID | Twin ID for monozygotic twins and multiples. An MZ_TWIN_ID is not provided for dizygotic twins or multiples. | string |
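The pedigree rules above can be checked locally before submission. The following is a minimal sketch with a hypothetical function name, check_pedigree. It assumes each row has been read into a dictionary keyed by the column headers, and that a blank or 0 parent value means "no parent listed". It checks three things: every subject is in the Subject Consent DS, every listed parent has its own SUBJECT_ID row, and members of one MZ twin group list the same parents.

```python
def check_pedigree(rows, consent_subject_ids):
    """Return problems found in Pedigree DS rows (dicts keyed by column header)."""
    problems = []
    pedigree_ids = {row["SUBJECT_ID"] for row in rows}
    parents_by_twin_group = {}
    for row in rows:
        subject = row["SUBJECT_ID"]
        if subject not in consent_subject_ids:
            problems.append(f"{subject}: missing from the Subject Consent DS")
        for column in ("FATHER", "MOTHER"):
            parent = row.get(column) or "0"  # blank is treated like 0
            if parent != "0" and parent not in pedigree_ids:
                problems.append(f"{subject}: {column} {parent} has no SUBJECT_ID row")
        twin_id = row.get("MZ_TWIN_ID") or ""
        if twin_id:
            group = (row["FAMILY_ID"], twin_id)
            parents = (row.get("FATHER"), row.get("MOTHER"))
            parents_by_twin_group.setdefault(group, set()).add(parents)
    for (family, twin_id), parents in parents_by_twin_group.items():
        if len(parents) > 1:
            problems.append(f"family {family}: MZ_TWIN_ID {twin_id} members list different parents")
    return problems

columns = ("FAMILY_ID", "SUBJECT_ID", "FATHER", "MOTHER", "SEX", "MZ_TWIN_ID")
data = [
    ("100", "1001", "0", "0", "2", ""),
    ("100", "1002", "0", "0", "1", ""),
    ("100", "1", "1002", "1001", "1", "1"),
    ("100", "2", "1002", "1001", "1", "1"),
]
rows = [dict(zip(columns, values)) for values in data]
print(check_pedigree(rows, {"1001", "1002", "1", "2"}))  # []
```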

Subject Phenotypes and Sample Attributes DS and DD Files

12. What data must be included in the Subject Phenotypes and Sample Attributes?

Metadata around the experiment or study and annotations that are necessary to reproduce any published table or analysis must be included with genomic data submissions. In particular, data pertinent to the interpretation of genomic data -- such as associated phenotype data (e.g. clinical information), exposure data, relevant metadata, and descriptive information (e.g. protocols or methodologies used) -- are expected to be shared. To avoid user questions, make sure to include self-reported RACE and relevant dates (e.g., birth, diagnosis, sample collection) written as years or normalized to a set point in time, along with any phenotypes, measured or collected data that are described in your Study Description. For the Subject Phenotypes, it would be data relevant to the individual person. For the Sample Attributes, it would be data relevant to the sample derived from the person. For instance, do not list the RACE variable in the Sample Attributes, since RACE is stable for a person across samples. However, for variables like TREATMENT, if the person was only treated once, and data was collected, then TREATMENT could belong in the Subject Phenotypes table. However, if TREATMENT was completed multiple times, and each time a sample was extracted, then it would be better for TREATMENT to be tracked in the Sample Attributes table.

13. How do I create Subject Phenotypes DS and DD files?

The Subject Phenotypes DS file includes measured and/or descriptive traits per individual person. The primary ID in this file is the SUBJECT_ID. Open the templates under Phenotype_Data:
5a_SubjectPhenotypes_DS.txt
5b_SubjectPhenotypes_DD.xlsx

Column 1: SUBJECT_ID

Each SUBJECT_ID needs to be unique and should be linked to only 1 row of data in the DS. All SUBJECT_IDs included in this file must be found in the subject consent (SC) DS with CONSENT > 0. No CONSENT=0 SUBJECT_IDs should appear in the Subject Phenotypes DS. CONSENT=0 subjects are not permitted to have individual level data. See SUBJECT_ID in Glossary for full requirement details.

All other Column Headers: VARNAMES (variable names)

Submit the following types of variables:

  1. The data described in question 12, "What data must be included in the Subject Phenotypes and Sample Attributes?"
  2. Affection status or case/control status of the disease/phenotype. Fill in the phenotypic term in the variable description. Do not include this column if it is not applicable to the study type.
  3. Race/ethnicity/ancestry/heritage
  4. Relevant dates (e.g., birth, diagnosis) written as years or normalized to a set point in time. Do not include month and days directly tied to the person, which are considered HIPAA sensitive. Click here to see the algorithm dbGaP uses to find HIPAA sensitive dates: HIPAA
  5. Sex is already required in the Subject Consent DS, so it does not need to be resubmitted in the Subject Phenotypes DS. However, if it is already part of your data, you may leave it in rather than removing it.
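The SUBJECT_ID requirements for this file can be checked against the Subject Consent DS before you submit. The following is a minimal sketch with a hypothetical function name, check_subject_phenotypes. It assumes a file with one row per subject (not a longitudinal table) and consents held in a dictionary of integers keyed by SUBJECT_ID.

```python
def check_subject_phenotypes(phenotype_subject_ids, consent_by_subject):
    """Return problems with the SUBJECT_ID column of a Subject Phenotypes DS."""
    problems = []
    seen = set()
    for subject in phenotype_subject_ids:
        if subject in seen:
            problems.append(f"SUBJECT_ID {subject} appears on more than one row")
        seen.add(subject)
        consent = consent_by_subject.get(subject)
        if consent is None:
            problems.append(f"SUBJECT_ID {subject} is missing from the Subject Consent DS")
        elif consent == 0:
            problems.append(f"SUBJECT_ID {subject} has CONSENT=0 and cannot have individual level data")
    return problems

consent_by_subject = {"1": 1, "2": 1, "1001": 0}
print(check_subject_phenotypes(["1", "2"], consent_by_subject))  # []
```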

Can I submit multiple subject phenotypes DS files?

You may submit multiple subject phenotypes DS/DD. Subject phenotypes files can be split by race/ethnicity, cohort, collection period, etc. The file name should indicate how the multiple subject phenotypes are split. The primary ID in each subject phenotypes file should be the SUBJECT_ID.

How do I submit data that has been measured serially or longitudinally?

If each SUBJECT_ID has a series of measurements or the data are longitudinal, below are the formatting options for this data:

  1. The first subject phenotypes DS may include all the variables that are stable through events, e.g. sex, race, prior history. The second subject phenotypes DS may include all the variables that correspond to the various events per person. This table will include any other variables that change along with the event type and number. In this case, this table may have a SUBJECT_ID listed multiple times. We would treat this as a longitudinal dataset, where SUBJECT_ID + event number + event type (choose any variable name) are the variables that make the row unique. Mark an "X" under the UNIQUEKEY column for these 3 variables in the corresponding DD.
  2. A variation of this would be a single subject phenotypes file with unique keys, and then the stable variables would be repeated in every row. This is if you want the table denormalized.
  3. Alternatively, you could create a single subject phenotypes DS, but have your table stretch in columns, where each event and number is a variable, such as mi_event1, mi_event2, stk_event1, stk_event2, etc., and the value would be binary. In this model, each SUBJECT_ID would only be listed once. You'd also need mi_event1_dayssinceoccurance, weight_@_mi_event_1, etc. We have received both types of submissions. We prefer option 1.
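For option 1, the variables marked in the UNIQUEKEY column must together identify exactly one row. The following is a minimal sketch of that check. The function name duplicate_keys and the variable names EVENT_TYPE and EVENT_NUMBER are hypothetical, since the guide leaves the choice of variable names to you.

```python
from collections import Counter

def duplicate_keys(rows, key_columns):
    """Return the key tuples that appear on more than one row."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return [key for key, count in counts.items() if count > 1]

# EVENT_TYPE and EVENT_NUMBER are hypothetical variable names.
key_columns = ("SUBJECT_ID", "EVENT_TYPE", "EVENT_NUMBER")
rows = [
    {"SUBJECT_ID": "1", "EVENT_TYPE": "mi", "EVENT_NUMBER": "1", "WEIGHT": "180.2"},
    {"SUBJECT_ID": "1", "EVENT_TYPE": "mi", "EVENT_NUMBER": "2", "WEIGHT": "178.0"},
    {"SUBJECT_ID": "1", "EVENT_TYPE": "stk", "EVENT_NUMBER": "1", "WEIGHT": "176.5"},
]
print(duplicate_keys(rows, key_columns))  # []
```

Here SUBJECT_ID 1 appears three times, but each combination of the three key variables is unique, so the table is acceptable as a longitudinal dataset.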

Example of a Subject Phenotypes DS File

SUBJECT_ID | AFFECTION_STATUS | RACE             | EDUCATION | AGE  | AGE_ONSET | HEIGHT | WEIGHT | KRAS
1          | 1                | African American | 4         | 35   | 25        | 67     | 180.2  | yes
2          | 2                | Asian            | 20        | 56   | 54        | 67     | 201.5  | no
3          | 2                | European         | 40        | 1000 | 45        | 60     | 160.5  | yes
4          | 1                | Latin American   | 20        | 37   | 35        | 75     | 99.5   | no
5          | 2                | Asian            | 10        | 46   | 40        | 61     | 315.2  | no

Example of a Subject Phenotypes DD File - this is one table, but has been split into two for viewing purposes. For details about each column header in the DD, see the APPENDIX.

VARNAME VARDESC DOCFILE TYPE UNITS MIN MAX RESOLUTION COMMENT1 COMMENT2
SUBJECT_ID Subject ID string
AFFECTION_STATUS Case control status of the subject for [please fill in phenotypic term] Diagnosis.pdf encoded value
RACE Self-reported race Main_exam.pdf string
EDUCATION Level of education Main_exam.pdf encoded value
AGE Subject age at enrollment Diagnosis.pdf integer, encoded value years 0 >89
AGE_ONSET Disease onset age Diagnosis.pdf integer years 0 >89
HEIGHT Height measured at enrollment Diagnosis.pdf decimal inches
WEIGHT Subject's weight Diagnosis.pdf decimal, encoded value pounds 1
KRAS Somatic mutation in KRAS (Entrez GeneID: 3845) Cancer.docx string
VARIABLE_SOURCE SOURCE_VARIABLE_ID VARIABLE_MAPPING UNIQUEKEY COLLINTERVAL ORDER VALUES
NCI Subject ID X Collected in Exam 1
Collected in Exam 1 1=Control 2=Case 3=Other
MSH Race Collected in Exam 1
MSH Educational Status Collected in Exam 1 99=NA 10=Elementary 20=High School 40=College 4=Graduate School
PhenX PX010101020000 Identical Collected in Exam 1 List 9999=Missing 1000=Not assessed INTEGERS
MSH Age of Onset Collected in Exam 1
MSH Body Height Collected in Exam 1
MSH Body Weight Collected in Exam 1, 2, 3 List 1000=Not assessed DECIMALS 9999=Unknown
LNC KRAS gene mutations tested for in Blood or Tissue by Molecular genetics method Nominal Collected in Exam 3

14. How do I create Sample Attributes DS and DD files?

The Sample Attributes DS includes measured and/or descriptive traits per individual sample (not person). A person may be represented by multiple samples. Therefore, the primary id in this file is the SAMPLE_ID. Open the templates under Phenotype_Data:
6a_SampleAttributes_DS.txt
6b_SampleAttributes_DD.xlsx

Column 1: SAMPLE_ID

Only include SAMPLE_IDs that are listed in the subject sample mapping (SSM) DS and belong to SUBJECT_IDs that have CONSENT>0 in the subject consent (SC) DS. SAMPLE_IDs belonging to CONSENT=0 SUBJECT_IDs should not appear in the Sample Attributes DS file. The SAMPLE_ID should use the exact same syntax used for the SAMPLE_ID listed in the SSM. For example, '0AB12' is not the same as 'AB12', nor is '123-1' the same as '123_1'. Each SAMPLE_ID should be represented by 1 row of data in the DS. See SAMPLE_ID in Glossary for full requirement details.
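
The SAMPLE_ID rule above can be checked locally before upload. The following is a minimal Python sketch, not a dbGaP tool: it assumes the three DS files are tab-delimited, that the SSM has SUBJECT_ID and SAMPLE_ID columns, and that the SC has SUBJECT_ID and CONSENT columns. The function names are hypothetical.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-delimited text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def unexpected_sample_ids(sample_attr_rows, ssm_rows, consent_rows):
    """Return SAMPLE_IDs in the Sample Attributes DS that are missing from
    the SSM or that map to a subject without CONSENT > 0."""
    consented = {r["SUBJECT_ID"] for r in consent_rows if int(r["CONSENT"]) > 0}
    # Exact string comparison: '0AB12' and 'AB12' are different IDs.
    sample_to_subject = {r["SAMPLE_ID"]: r["SUBJECT_ID"] for r in ssm_rows}
    problems = []
    for row in sample_attr_rows:
        sample_id = row["SAMPLE_ID"]
        # An ID absent from the SSM maps to None, which is never consented.
        if sample_to_subject.get(sample_id) not in consented:
            problems.append(sample_id)
    return problems


consent = read_tsv("SUBJECT_ID\tCONSENT\n1\t1\n2\t0\n")
ssm = read_tsv("SUBJECT_ID\tSAMPLE_ID\n1\tS1\n2\tS2\n")
attrs = read_tsv("SAMPLE_ID\tBODY_SITE\nS1\tSkin\nS2\tLung\ns1\tSkin\n")
print(unexpected_sample_ids(attrs, ssm, consent))  # ['S2', 's1']
```

Here S2 is flagged because its subject has CONSENT=0, and s1 is flagged because its syntax differs from the S1 listed in the SSM.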

Columns 2-5: NCBI BioSample variables included in the Sample Attributes DS

The following four sample attributes should be included. The four sample attributes, along with the subject’s sex value, will be displayed on the NCBI BioSample page: https://www.ncbi.nlm.nih.gov/biosample/.

  1. BODY_SITE – the collection site of the sample (ex. skin, breast, peripheral blood, inner oral cavity). If the sample is from a xenograft, you may rename the variable.
  2. ANALYTE_TYPE – the analyte type of the sample (ex. DNA, RNA). If the same sample ID was used for both DNA and RNA aliquots, the value should be "DNA/RNA" instead of listing the sample twice. The BioSample database does not allow multiple values for the same sample ID.
  3. HISTOLOGICAL_TYPE – the type of cell or tissue type/subtype of the sample (ex. melanocytes, keratinocytes, buccal cells, embryonic stem cells). For tumor samples, carcinoma, sarcoma, myeloma, leukemia, lymphoma, and mixed types can be used or a type of greater specificity. If the histological type is not known, the column should be left out completely.
  4. IS_TUMOR – the tumor status of the sample. The values can be binary such as yes/no or encoded 1=yes and 2=no. For non-cancer studies, the values in IS_TUMOR should be "no" or "unknown."
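
As an illustration of the ANALYTE_TYPE rule above, the sketch below collapses rows that share a SAMPLE_ID into one row with a combined value such as "DNA/RNA". It is only a sketch: the function name is hypothetical, and it assumes that all other columns can be taken from the first row seen for each sample, which you should verify for your own data.

```python
def merge_analyte_types(rows):
    """Collapse rows that share a SAMPLE_ID, joining ANALYTE_TYPE with '/'."""
    merged = {}
    for row in rows:
        sample_id = row["SAMPLE_ID"]
        if sample_id not in merged:
            merged[sample_id] = dict(row)  # first row supplies other columns
            continue
        types = merged[sample_id]["ANALYTE_TYPE"].split("/")
        if row["ANALYTE_TYPE"] not in types:
            types.append(row["ANALYTE_TYPE"])
        merged[sample_id]["ANALYTE_TYPE"] = "/".join(types)
    return list(merged.values())


rows = [
    {"SAMPLE_ID": "S1", "ANALYTE_TYPE": "DNA"},
    {"SAMPLE_ID": "S1", "ANALYTE_TYPE": "RNA"},
    {"SAMPLE_ID": "S2", "ANALYTE_TYPE": "DNA"},
]
for row in merge_analyte_types(rows):
    print(row["SAMPLE_ID"], row["ANALYTE_TYPE"])
# S1 DNA/RNA
# S2 DNA
```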

All other Column Headers: VARNAMES (variable names)

Most institutes request all data pertinent to the interpretation of genomic data, such as clinical information, exposure data, and relevant metadata pertaining to the sample. Please note that the template (6a_SampleAttributes_DS.txt) provided is based on a cancer study and the variables listed may be useful for cancer studies. However, if your study is not a cancer study, please do not include the cancer variables. Instead, submit additional sample attribute variables that will provide greater understanding of the study. For example: sample collection date, sample extraction method and date; batch and center effects, sample plate or well number; sample run date, sample QA results; and sample affection status (ex. psoriatic skin sample vs. non-psoriatic skin sample from a case subject who has psoriasis). Relevant dates (e.g., sample collection date) that are directly tied to a person should be written as years or normalized to a set point in time. Do not include month and days directly tied to the person, which are considered HIPAA sensitive. Click here to see the algorithm dbGaP uses to find HIPAA sensitive dates: HIPAA.
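
The two date options above (years only, or dates normalized to a set point in time) can be sketched in Python as follows. This is an illustration only and is not the algorithm dbGaP uses to find HIPAA sensitive dates. The function names are hypothetical, and using an enrollment date as the reference point is an assumption.

```python
from datetime import date


def year_only(d):
    """Keep only the year of a date tied to a person."""
    return d.year


def days_from_reference(d, reference):
    """Express a date as the number of days elapsed since a reference date."""
    return (d - reference).days


collection = date(2015, 3, 14)
enrollment = date(2015, 1, 1)  # assumed reference point
print(year_only(collection))                        # 2015
print(days_from_reference(collection, enrollment))  # 72
```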

Can I submit multiple sample attributes DS files?

You may submit multiple sample attributes DS/DD. You may split out sample attributes files to separate them by race/ethnicity, cohort, collection period, etc. Each of the sample attributes files should have SAMPLE_ID as the primary id. The BioSample required variables should appear only once per SAMPLE_ID, and the values for the BioSample required variables should not conflict. For example, a SAMPLE_ID cannot be marked as both TUMOR and non-TUMOR. In this case, we would request that an additional SAMPLE_ID be created. If this is not possible, please contact the dbGaP phenotype curator.
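
When sample attributes are split across several files, conflicting BioSample values can be caught with a check like the one sketched below. It assumes each file has already been read into a list of row dictionaries. The function name is hypothetical. It reports conflicts only and does not check that each BioSample variable appears just once per SAMPLE_ID.

```python
from collections import defaultdict

BIOSAMPLE_VARS = ("BODY_SITE", "ANALYTE_TYPE", "HISTOLOGICAL_TYPE", "IS_TUMOR")


def biosample_conflicts(*tables):
    """Return {(sample_id, variable): sorted values} for every BioSample
    variable with more than one distinct value across the given tables."""
    seen = defaultdict(set)
    for rows in tables:
        for row in rows:
            for var in BIOSAMPLE_VARS:
                value = row.get(var)
                if value:  # skip variables absent from this file
                    seen[(row["SAMPLE_ID"], var)].add(value)
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


file_a = [{"SAMPLE_ID": "S1", "IS_TUMOR": "Y", "BODY_SITE": "Skin"}]
file_b = [{"SAMPLE_ID": "S1", "IS_TUMOR": "N"}, {"SAMPLE_ID": "S2", "IS_TUMOR": "N"}]
print(biosample_conflicts(file_a, file_b))  # {('S1', 'IS_TUMOR'): ['N', 'Y']}
```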

How do I submit data that has been measured serially or longitudinally?

If each SAMPLE_ID has a series of measurements, or the data is longitudinal, the table may list a SAMPLE_ID multiple times. We would treat this as a longitudinal dataset, where SAMPLE_ID + [variable] are the variables that make the row unique. Mark an "X" under the UNIQUEKEY column for these variables in the corresponding DD. In this case, we recommend submitting the BioSample required variables in a separate sample attributes DS/DD.
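
To confirm that the variables marked as UNIQUEKEY really make each row unique, a check along these lines may help. It is a sketch under assumptions: the rows are already loaded as dictionaries, the function name is hypothetical, and VISIT is a made-up second key variable used only for the example.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return the key tuples that occur on more than one row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


rows = [
    {"SAMPLE_ID": "S1", "VISIT": "1", "VALUE": "5.2"},
    {"SAMPLE_ID": "S1", "VISIT": "2", "VALUE": "5.9"},
    {"SAMPLE_ID": "S2", "VISIT": "1", "VALUE": "4.8"},
]
print(duplicate_keys(rows, ["SAMPLE_ID"]))           # [('S1',)]
print(duplicate_keys(rows, ["SAMPLE_ID", "VISIT"]))  # []
```

An empty result for the chosen key columns means those columns together identify each row.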

Example of a Sample Attributes DS File - this is one table, but has been split into two for viewing purposes.

SAMPLE_ID BODY_SITE ANALYTE_TYPE IS_TUMOR HISTOLOGICAL_TYPE COLLECTION_AGE
S1 Skin DNA Y Melanoma 25
S2 Lung RNA Y Liposarcoma 54
S3 Buccal DNA N Buccal cells 45
S4 Skin RNA N Skin 35
S5 Skin RNA N Keratinocytes 40
PRIMARY_METASTATIC_TUMOR PRIMARY_TUMOR_LOCATION TUMOR_STAGE TUMOR_GRADE TUMOR_TREATMENT
Primary Skin II G3 Chemotherapy and biological therapy
Primary Peritoneal cavity III G2 Radiation
N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A

Example of a Sample Attributes DD File - For additional options for the DD, see the APPENDIX.

VARNAME VARDESC TYPE UNITS MIN MAX UNIQUEKEY VALUES
SAMPLE_ID Sample ID string X
BODY_SITE Body site where sample was collected string
ANALYTE_TYPE Analyte Type string
IS_TUMOR Tumor status encoded value Y=Is Tumor N=Is not a tumor
HISTOLOGICAL_TYPE Cell or tissue type or subtype of sample string
COLLECTION_AGE Age sample was collected integer years 0 >89
PRIMARY_METASTATIC_TUMOR Primary tumor, metastasis, or transformed cell line string
PRIMARY_TUMOR_LOCATION Primary tumor location string
TUMOR_STAGE Tumor stage of sample string
TUMOR_GRADE Tumor grade of sample string
TUMOR_TREATMENT Type of tumor treatment for sample string

Medical Images

15. How do I submit Medical Images and in what format?

De-identified medical image files of any type or format may be submitted. No validation or QC is run on images submitted to dbGaP. When submitting multiple files, bundle them into zip or tar files smaller than 1 TB each.

Also, create a mapping of SUBJECT_IDs to the image files. Open the templates under Medical_Images:
SubjectImageMappingDS.txt
SubjectImageMappingDD.xlsx

Column 1: SUBJECT_ID

All SUBJECT_IDs included in this file must be found in the subject consent (SC) DS with CONSENT>0. No CONSENT=0 SUBJECT_IDs should appear in the Subject Image Mapping DS. See SUBJECT_ID in Glossary for full requirement details.

Columns 2-5: IMAGE_TYPE, BODY_SITE, FILENAME, FILE_TYPE

Include the following four variables for image data.

  1. IMAGE_TYPE – the type of image (ex. CT scan, photograph, MRI).
  2. BODY_SITE – the body site of the image (ex. brain, chest, eye).
  3. FILENAME - the filename including the file extension.
  4. FILE_TYPE – the file type (ex. jpg, dng, tif).

All other Column Headers: VARNAMES (variable names)

Any other relevant information related to the image can be included as additional columns.
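
Because dbGaP runs no validation on images, a local consistency check of the mapping file can be useful. The sketch below is hypothetical: it assumes the mapping rows are already loaded as dictionaries, and it treats a mismatch between FILE_TYPE and the filename extension as a problem, which is a heuristic of this example rather than a stated dbGaP rule.

```python
from pathlib import PurePath


def image_mapping_problems(rows, uploaded_filenames):
    """Return human-readable problems found in Subject Image Mapping rows."""
    problems = []
    uploaded = set(uploaded_filenames)
    for row in rows:
        name = row["FILENAME"]
        extension = PurePath(name).suffix.lstrip(".").lower()
        if not extension:
            problems.append(f"{name}: FILENAME has no file extension")
        elif extension != row["FILE_TYPE"].lower():
            problems.append(
                f"{name}: FILE_TYPE {row['FILE_TYPE']!r} does not match the file extension"
            )
        if name not in uploaded:
            problems.append(f"{name}: not found among the files to upload")
    return problems


rows = [
    {"SUBJECT_ID": "1", "FILENAME": "fundus01a.jpg", "FILE_TYPE": "jpg"},
    {"SUBJECT_ID": "6", "FILENAME": "chest06.tif", "FILE_TYPE": "jpg"},
]
for problem in image_mapping_problems(rows, ["fundus01a.jpg"]):
    print(problem)
```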

Example of a Subject Image Mapping DS File

SUBJECT_ID IMAGE_TYPE BODY_SITE FILENAME FILE_TYPE
1 photograph fundus fundus01a.jpg jpg
1 photograph fundus fundus01b.jpg jpg
4 photograph fundus fundus04a.jpg jpg
4 photograph fundus fundus04b.jpg jpg
6 CT scan chest chest06.tif tif
7 CT scan chest chest07.tif tif

Example of a Subject Image Mapping DD File

VARNAME VARDESC TYPE VALUES
SUBJECT_ID Subject ID string
IMAGE_TYPE Image type string
BODY_SITE Body site of image string
FILENAME Filename including the file extension string
FILE_TYPE File type string

16. How do I verify that my DS and DD Files will pass dbGaP's phenotype quality control (QC) tests?

Go through this list prior to submission. This list will help you eliminate the most common errors detected in formatting and data consistency. You can also check your Subject Consent DS, Subject Sample Mapping (SSM) DS, and Pedigree DS against your Genotype data (PLINK and VCF) on your system using GaPTools.

  • Each DS and DD must be submitted as a separate file. Please do not submit multiple worksheets per file.
  • Submit tab-delimited .txt and .xlsx files only. Tab-delimited txt files are preferable for the DS. Excel (.xlsx) format is preferable for the DD. The final dump files provided to Authorized Users of the study will be in the tab-delimited txt format.
  • The DS should be a rectangular table. Column headers should not exceed columns of values. Column headers should not be missing. Primary IDs should not be missing for the row. Remove empty rows or columns between data values or above the headers.
  • File names should not contain special characters, spaces, hyphens, brackets, periods, or forward (/) or backward slashes (\).
  • Check formatting and spelling of the DS and DD. Remove non-ascii characters, new line feeds or carriage return characters (they sometimes may appear like a square or a question mark in a box), unintended quotes (""").
  • All IDs are two-step de-identified.
  • Check that "dbGaP" is not used in any of the variable names or the IDs. "dbGaP" is reserved for dbGaP generated items that are included in the study release.
  • Variable names between DS and its corresponding DD must be identical in syntax. For example, "day_ enrollment" is not the same as "day_enrollment" or "Day_Enrollment."
  • Variable names and variable descriptions need to be distinct within a dataset.
  • The same variable name must be used for the ID columns. For example, do not use SUBJECT_ID in a dataset, and then use Patient_ID in another dataset to refer to the primary subject ID column. If you use SUBJECT_ID as the primary subject ID variable name, then use SUBJECT_ID as the variable name in every dataset that lists out the subjects. Likewise, keep the primary sample ID variable name identical throughout all the datasets.
  • All SAMPLE_IDs listed in the Subject Sample Mapping (SSM) dataset must match the SAMPLE_IDs in the molecular data and high throughput sequences. The syntax must be identical. For example, SAMPLE_ID "1034_abc.20" is not the same as SAMPLE_ID "1034-abc.20" or "1034_abc.2".
  • Remove HIPAA sensitive data, such as patient's name, doctor's name, months and days from dates directly tied to the subject, etc. Year is acceptable. Click here to see the algorithm dbGaP uses to find HIPAA sensitive dates: HIPAA
  • Some HIPAA sensitive data are permissible, such as age > 89 for studies that focus on older populations, or geographic locations, etc. Please work with the dbGaP curator to make sure that the public summaries are correctly hidden.
  • Define encoded values in the DD, one per single cell.
  • Remove repeating IDs in the SUBJECT_ID column of the subject consent and pedigree DS files and SAMPLE_ID column of the subject sample mapping file (SSM).
  • Remove repeating IDs in the SUBJECT_ID column of the subject phenotypes DS and SAMPLE_ID column of the sample attributes DS files, unless they represent repeat measurements per subject/sample and UNIQUEKEYS are clearly defined in the DDs.
  • Remove completely identical rows and empty rows.
  • Check that all subject IDs found in the subject phenotypes DS have CONSENT>0 and all sample IDs in the sample attributes DS belong to subjects that have CONSENT>0. In other words, CONSENT=0 (pedigree linking members and HapMap controls) and unconsented IDs should not be in any individual-level subject phenotypes or sample attributes DS.
  • If there are multiple sex variables captured for the same person, verify that all sex values are consistent among the phenotype data as well as the sex determined by the genotypes.
  • If there are multiple case control variables captured for the same person, verify that all case control values are consistent for the same individuals.
  • Double check for data consistency!
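
A few of the formatting items in the list above lend themselves to a quick scripted check. The following Python sketch covers only a subset (file names, non-ASCII characters, rectangular shape, reserved "dbGaP" names, and DS/DD variable name agreement). It is not GaPTools and does not reproduce dbGaP's own QC. The function names are hypothetical, and restricting file names to letters, digits, and underscores before the extension is an assumed safe reading of the file name rule.

```python
import re


def check_filename(name):
    """Return True if the name, ignoring its final extension, uses only
    letters, digits, and underscores (an assumed safe subset)."""
    stem = name.rsplit(".", 1)[0]
    return re.fullmatch(r"[A-Za-z0-9_]+", stem) is not None


def check_dataset(ds_text, dd_varnames):
    """Return a list of problems found in a tab-delimited DS."""
    problems = []
    if not ds_text.isascii():
        problems.append("non-ASCII characters present")
    lines = ds_text.splitlines()
    header = lines[0].split("\t")
    # Every data row needs as many cells as the header; empty rows fail too.
    for number, line in enumerate(lines[1:], start=2):
        if len(line.split("\t")) != len(header):
            problems.append(f"line {number}: expected {len(header)} columns")
    if any("dbgap" in name.lower() for name in header):
        problems.append("'dbGaP' used in a variable name")
    if len(set(header)) != len(header):
        problems.append("duplicate variable names")
    # Exact comparison: case and stray spaces count as differences.
    if set(header) != set(dd_varnames):
        problems.append("DS headers and DD VARNAMEs differ")
    return problems


print(check_filename("6a_SampleAttributes_DS.txt"))  # True
print(check_filename("my-study data.txt"))           # False
print(check_dataset("SUBJECT_ID\tAGE\n1\t35\n2\n", ["SUBJECT_ID", "Age"]))
# ['line 3: expected 2 columns', 'DS headers and DD VARNAMEs differ']
```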

Review the descriptions of variables in the APPENDIX for specific instructions on labeling header columns and file-naming conventions. Also read the Glossary for definitions of variables. To see the QC checks that dbGaP completes for each study, see section "What happens once I submit my core data files and phenotype files?".

Study Documents

17. What type of Study Documents may I submit and in what format?

Submit any document that describes study methods and data collection (e.g., protocols, questionnaires, manuals of procedures and operations, and consents) and that can be published on the public dbGaP page. The preferred file format is pdf, though Word and Excel documents will be accepted. Please submit tabular images in Excel.

The study documents may be annotated by the phenotype curator or submitted with annotations using variable or dataset names. These annotations can be added directly to the document or to a DD under the "DOCFILE" column. The annotations link text segments to corresponding variables and/or datasets. The final annotations will be visible on the public dbGaP pages. Click on the 2 links below to see how to go from the Variable Summary page to the Study Document page and vice versa.

Variable Summary page

Study Document page

18. What should I know about editing, proofreading, and copyright?

Proofreading and Editing – Please proofread and edit your documents thoroughly before submission — they will be posted to the public dbGaP web pages.

dbGaP will not perform any copyediting or proofreading. Any content changes require submission of a new version of the document. Documents that contain potential HIPAA rule violations will not be processed and need to be resubmitted following redactions.

Copyright – Previously Published Work – If you submit a published work (article, review, book chapter, questionnaires, etc.) for dbGaP posting, please include documentation that authorizes the public posting on the dbGaP website. If you are unsure about the copyright status of a document, contact the publisher or owner of the work.

NIH does not claim copyright of any submitted documents. However, NIH must be given nonexclusive rights to freely distribute all documents on the dbGaP site.

Molecular Data

19. How do I submit Molecular Data to dbGaP?

No BAM, CRAM, and FASTQ files should be submitted as "Molecular Data" type to the dbGaP Submission Portal. High throughput human sequence data and alignment information should be submitted through a separate process: High throughput sequencing submission instructions.

Molecular data that is not high throughput sequencing data should be submitted to the dbGaP Submission Portal under the section "Other files" with type "Molecular Data". It should be submitted along with the phenotype data or as early as possible so that it enters a dbGaP genotype curator's queue. To compress and bundle files, zip first then tar. This enables dbGaP to run qc checks quickly and report any errors back to you. Do not tar first then zip, as this will significantly delay processing.
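
One way to read "zip first then tar" is to compress each data file individually and then bundle the compressed files into a single uncompressed tar archive. The sketch below follows that interpretation using Python's standard zipfile and tarfile modules. The function name and the file names are hypothetical, and the choice of the zip format rather than another compressor is an assumption.

```python
import tarfile
import zipfile
from pathlib import Path


def zip_then_tar(files, bundle_path):
    """Compress each file into its own .zip, then bundle the zips into one
    uncompressed .tar archive. Returns the path of the tar file."""
    bundle_path = Path(bundle_path)
    zipped = []
    for file in map(Path, files):
        # Assumes no two inputs share a stem, e.g. geno.txt and geno.csv.
        zip_path = file.with_suffix(".zip")
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.write(file, arcname=file.name)
        zipped.append(zip_path)
    with tarfile.open(bundle_path, "w") as tar:  # mode "w": no extra compression
        for zip_path in zipped:
            tar.add(zip_path, arcname=zip_path.name)
    return bundle_path
```

For example, zip_then_tar(["geno_batch1.txt", "geno_batch2.txt"], "molecular_data.tar") would produce a tar file containing geno_batch1.zip and geno_batch2.zip.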

Essential requirement: Sample IDs must be de-identified. Every sample ID found in an individual level Molecular Data file must be mapped to a consented subject in the Subject Sample Mapping (SSM) dataset. See SAMPLE_ID in Glossary for full requirement details. Sample IDs that do not follow the requirements will not be processed.

  • The sample ID is ideally the final aliquot used for a sequencing run or well on an array plate. A person with a given subject ID can have many samples.
  • If a sample ID is a technical control such as Coriell HapMap sample or a publicly available control, it must be mapped to a subject ID in the Subject Sample Mapping (SSM) dataset and that subject ID must be explicitly marked as CONSENT=0 in the Subject Consent (SC) dataset.
  • Single cells or multiplexed single cells should each be given a unique sample ID.
  • Sample IDs in sequence derived genotypes (VCFs) must be identical to the sample IDs used in the corresponding sequence data (BAMs).
  • Include a File Sample Mapping (FSM) file to map sample IDs to single sample data files.
  • Include README to describe content of data files and QC anomalies especially if the content is not in one of the formats listed below and fits into the "Other" category.
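
For VCF files, the requirement that every sample ID map to the SSM can be checked from the VCF header alone. The sketch below relies on the standard VCF layout, in which sample names follow the nine fixed columns of the #CHROM header line. The function names are hypothetical, and the SSM sample IDs are assumed to have been read into a list already.

```python
def vcf_sample_ids(vcf_lines):
    """Return the sample IDs from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # Columns 1-9 are fixed (CHROM ... FORMAT); samples follow.
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def unmapped_samples(vcf_lines, ssm_sample_ids):
    """Return VCF sample IDs that do not appear verbatim in the SSM."""
    known = set(ssm_sample_ids)
    return [s for s in vcf_sample_ids(vcf_lines) if s not in known]


lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
]
print(unmapped_samples(lines, ["S1"]))  # ['S2']
```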

See Molecular Data for guidelines, common errors, dbGaP qc checks, and where to submit molecular data. Click below for the specific data types applicable to your study.

20. How do I submit High Throughput Sequencing data and alignment information?

dbGaP accepts high throughput human sequence data in BAM, CRAM, and FASTQ formats. Choose one data storage option below. Existing studies may have a combination of the options but all new submissions should follow a single option.

  1. NCBI Data Storage: Both sequence metadata and sequence files are submitted to NCBI and available for download from NCBI servers OR direct cloud access through Google Cloud and AWS (Amazon).

  2. Cloud Data Storage: The sequence metadata is submitted to NCBI with details of sequence file cloud storage locations. This option requires sponsoring institutes to configure your study with an NIH data repository. Sequence files will be accessed either through the cloud storage provider using dbGaP credentials via Authorized Access or through an NIH data repository platform if available.

Jump to public SRA if you would like to link publicly available metagenomic sequences free of human sequence contaminants to controlled access subjects or samples in a dbGaP study.

Steps to submitting Human Sequence to a dbGaP study

Option 1: NCBI Data Storage

  1. Update or verify that your study is configured for sequence data submission by selecting Yes to #3 "Will Next-Generation Sequencing (NGS) data be submitted?" in the Submission Portal Questionnaire.
  2. Submit Subject Consent (SC) and the Subject Sample Mapping (SSM) files. A dbGaP phenotype curator will validate and load the submitted IDs and consents in the dbGaP database, and each sample ID will be assigned an NCBI BioSample ID (SAMN#). This process instantiates IDs and verifies that sequences submitted for samples belong to consented subjects. This may take a few days.
  3. dbGaP Submission Portal sends email with a sequence metadata spreadsheet attached with your registered sample IDs already entered.
  4. Complete and Submit the sequence metadata spreadsheet to the dbGaP Submission Portal for only sequencing data you plan to submit for this version of the study. Sequence data that have previously been submitted to the study (for example in an earlier version) do not need to be entered in the spreadsheet. Remove sample IDs that do not have sequence data. Take care to not edit the spreadsheet column headers and only use the controlled vocabulary options in fields with a selection menu to ensure that the sequence metadata will pass automated checks. A sequence curator will contact you via email with data upload instructions or questions regarding the sequence metadata spreadsheet within a few days of upload.
  5. You will need a public/private key pair to upload the sequence data files using Aspera Connect.
  6. All sequencing data must be processed before a study can be released through Authorized Access.

Option 2: Cloud Data Storage

  1. Update or verify that your study is configured for sequence data submission by selecting Yes to #3 "Will Next-Generation Sequencing (NGS) data be submitted?" in the Submission Portal Questionnaire.
  2. Submit Subject Consent (SC) and the Subject Sample Mapping (SSM) files. A dbGaP phenotype curator will validate and load the submitted IDs and consents in the dbGaP database, and each sample ID will be assigned an NCBI BioSample ID (SAMN#). This process instantiates IDs and verifies that sequences submitted for samples belong to consented subjects. This may take a few days.
  3. dbGaP Submission Portal sends email with a sequence metadata spreadsheet attached with your registered sample IDs already entered, along with the additional columns necessary for cloud data submissions.
  4. Complete and Submit the sequence metadata spreadsheet to the dbGaP Submission Portal for only sequencing data you plan to submit for this version of the study. Sequence data that have previously been submitted to the study (for example in an earlier version) do not need to be entered in the spreadsheet. Remove sample IDs that do not have sequence data. Take care to not edit the spreadsheet column headers and only use the controlled vocabulary options in fields with a selection menu to ensure that the sequence metadata will pass automated checks. A sequence curator will process your sequence metadata and verify that files in the archive can be accessed. You will need to grant access to NCBI operated accounts for this process to occur.

Additional instructions can be found here: https://www.ncbi.nlm.nih.gov/sra/docs/submitdbgap/.

For questions, contact sra@ncbi.nlm.nih.gov.

Tracking samples

A link to the Sample Status Telemetry Report (SSTR) will be provided when the IDs and consents have been loaded. The SSTR includes a complete list of samples, subjects, consents, dbGaP assigned IDs and study repository, BioSample variables, and sra_data_details.

21. How do I submit Copy Number Variation (CNV) data?

CNV is coordinated with NCBI dbVar. Individual-level CNV data should be submitted to dbGaP and released via controlled access. Summary-level (probe/primer and other assay and frequency information) copy number variation data should be submitted to dbVar and released by the public dbVar. Please click on dbVar Submission Guide if your study includes CNV data.

Create a mapping of SAMPLE_IDs to the accessions used in the applicable databases.

Trace: repository of DNA sequence chromatograms (traces, base calls, and quality estimates of single-pass reads from various large-scale sequencing projects). Open the templates under Sample_NCBI_DB_Mapping:
SampleTraceMappingDS.txt
SampleTraceMappingDD.xlsx

GEO: repository of high-throughput gene expression data and hybridization arrays, chips, microarrays. Open the templates under Sample_NCBI_DB_Mapping:
SampleGEOMappingDS.txt
SampleGEOMappingDD.xlsx

GenBank: genetic sequence database comprising an annotated collection of all publicly available DNA sequences. Open the templates under Sample_NCBI_DB_Mapping:
SampleGenBankMappingDS.txt
SampleGenBankMappingDD.xlsx

SRA (public): archive of raw sequencing data and alignment information from high-throughput sequencing platforms of non-human data. Open the templates under Sample_NCBI_DB_Mapping:
SamplePublicSRAMappingDS.txt
SamplePublicSRAMappingDD.xlsx

Column 1: SAMPLE_ID

Eliminate extra work. If additional sample IDs need to be created and/or added to the SSM to account for the sample to NCBI database accession mapping, use the subject IDs in the Subject Consent DS instead and create a mapping of subject IDs to the NCBI database accession. Column 1 will list SUBJECT_IDs and Column 2 will list the corresponding NCBI database accession. Otherwise, use the SAMPLE_ID found in the SSM DS.

The de-identified SAMPLE_ID should be the same as the SAMPLE_IDs listed in the subject sample mapping file (SSM) DS. A sample ID can be listed multiple times if it has multiple accessions (such as GEO accessions) derived from the same sample. See SAMPLE_ID in Glossary for full requirement details.

Column 2: NCBI database accession (i.e. TRACE_ID, GEO_ACCESSION, GENBANK_ACCESSION, SRA_ACCESSION)

The accessions of the various NCBI databases should be linked to the corresponding sample ID. This column should have distinct IDs.
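
Since a SAMPLE_ID may repeat in a mapping file but each accession should be distinct, a short check like the following can flag repeated accessions before submission. It is a sketch with a hypothetical function name and assumes the mapping DS is already loaded as a list of row dictionaries.

```python
from collections import Counter


def repeated_accessions(rows, accession_column):
    """Return accessions that appear on more than one row of a mapping DS."""
    counts = Counter(row[accession_column] for row in rows)
    return sorted(acc for acc, n in counts.items() if n > 1)


rows = [
    {"SAMPLE_ID": "S2", "TRACE_ID": "20394760"},
    {"SAMPLE_ID": "S2", "TRACE_ID": "20394761"},
    {"SAMPLE_ID": "S10", "TRACE_ID": "20394761"},
]
print(repeated_accessions(rows, "TRACE_ID"))  # ['20394761']
```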

Example of Trace Mapping DS and DD File

SAMPLE_ID TRACE_ID
S2 20394760
S2 20394761
S2 20394762
S2 20394763
S10 20394764
S10 20394765
S10 20394766
VARNAME VARDESC TYPE VALUES
SAMPLE_ID Sample ID string
TRACE_ID Trace ID string

Example of GEO Mapping DS and DD File

SAMPLE_ID GEO_ACCESSION
S2 GSM18467693
S2 GSM18467694
S2 GSM18467695
S10 GSM18467696
S10 GSM18467697
S10 GSM18467698
VARNAME VARDESC TYPE VALUES
SAMPLE_ID Sample ID string
GEO_ACCESSION GEO accession ID string

Example of GenBank Mapping DS and DD File

SAMPLE_ID GENBANK_ACCESSION
S2 HM258784
S2 HM258785
S2 HM258786
S10 HM258787
S10 HM258788
S10 HM258789
VARNAME VARDESC TYPE VALUES
SAMPLE_ID Sample ID string
GENBANK_ACCESSION GenBank accession ID string

Example of SRA (public) Mapping DS and DD File

SAMPLE_ID SRA_ACCESSION
S2 SRS2506412
S10 SRS2506420
S13 SRS2506432
S14 SRS2506433
S15 SRS2506434
S16 SRS2506435
VARNAME VARDESC TYPE VALUES
SAMPLE_ID Sample ID string
SRA_ACCESSION SRA sequence accession ID (SRS#) string

Association Analyses

23. What are Association Analysis Data Files and how should they be formatted?

Association analyses are Genomic Summary Results (GSR) and do not include individual level data. They are from genomic association studies and include linkage and burden testing on genotypic and phenotypic traits. They vary on trait, variant type, frequency, and analytic method. To facilitate data sharing, we have created a unified guideline for Minimum Information Required for Association Data (MIRAD). It includes four essential data elements.

  1. Locus Identifier: The identifier includes the locus ID and location. It may be, but is not limited to, an rs#, gene ID, or SV# for a SNP, gene, or structural variant, respectively. Identifiers can be mapped to the current genome build and can evolve with future reference genome assemblies and NCBI annotations.
  2. Variation summary: This contains information about alleles, allele frequencies, sample size, and genotype counts per sample group within each locus. To limit the ability of unauthorized parties to infer the identity of individual participants, data like counts and frequency are only accessible to users who have been approved for Authorized Access.
  3. Statistical significance and Effect size: The p-value and/or FDR come either from univariate testing on variants from a single locus or from burden testing on a set of rare variants from a target region provided by sequencing projects. The effect size includes odds ratio, regression coefficient, relative risk, etc., on the effect allele. These data not only help users to find causal variants and haplotypes, but can also be used to estimate the locus contribution to the heredity of the trait or disease(s).
  4. Phenotype Definition and Analysis Metadata: The main trait or disease analyzed should be defined based on a controlled vocabulary such as UMLS, HPO, etc. Descriptions of the analysis and method, including phenotypic covariates, parameters, and ancestry of participants, are needed for reproducing the result set once the individual data are fully available.

Reasoning: Sharing these data elements allows other researchers to evaluate supporting evidence and independently verify discoveries with different samples and data models. If individual level genotypes are inaccessible, researchers can use these summary results directly for meta-analysis to increase statistical power or for the development of hypotheses. The data, like locus info, effect allele and effect size, can provide valuable information for genomic medicine.

Our practice: Using MIRAD, dbGaP has developed several templates for data submission and genome browser display. You are welcome to join the discussion, make suggestions, and comment on the MIRAD proposal. The dbGaP team is committed to bringing new discoveries to the public and research communities and is happy to work with researchers to promote data sharing within the scientific community.

See the instructions in Association_Analysis.xlsx for Case-Controls (Worksheet 1) or Others (Worksheet 2). Each analysis metadata sheet is given a separate analysis accession (pha#.v#) and will need to have a unique name. If GWAS results are submitted as outputs of the software, please give brief descriptions of the column headers, indicating the linking-columns and/or relationships when several files are involved.

The GSR will be posted on the public FTP site, unless the study investigator and GPA specify that the data is sensitive in the dbGaP Submission System and needs to be restricted under dbGaP Authorized Access. Additionally, there is the option to add a study with analyses to CADA. CADA stands for the Compilation of Aggregate Genomic Data and is a collection of analyses across many dbGaP studies that can be accessed with a single Data Access Request.

Submitting Files

24. Who can submit files to dbGaP?

A dbGaP study must be registered in the dbGaP Submission System before data can be submitted. Please click on "How to Submit" for the overall schema. The study investigator and the person designated by the study investigator (PI Submitter) will be able to submit along with any other individuals they add as a submitter.

25. Where do I submit my dbGaP files?

Submit all files through the dbGaP Submission Portal. Go to https://submit.ncbi.nlm.nih.gov/dbgap/. To safeguard study participants' privacy, dbGaP will not accept individual-level data via email. Once the study is registered, a Submission Portal account is provided to the study investigator and anyone that the study investigator lists as a submitter. To obtain access to the Submission Portal account, please accept the email invitation as soon as you receive it. The email invitation will expire in 7 days. Once accepted, you may submit your files any time thereafter. Individuals with "manager" roles in the dbGaP Submission Portal can also add additional submitters.

Additional guidance for files to upload under "Phenotype data" and "Other files":

  • Phenotype data - upload all Subject Phenotypes DS and DD and any mapping files of study samples to other NCBI databases
  • Other files
    • Molecular Data - select type "Molecular Data". No high throughput sequencing data (FASTQ, BAM, and CRAM) should be submitted here.
    • Study Documents - select type "Document: Phenotype" if the document can be made available on the public webpage. Some READMEs, genotype qc results, etc., are not appropriate for public distribution and should be submitted under type "Molecular Data" instead and packaged for Authorized Access only.

26. What if there are errors or updates in the data and I need to resubmit?

If you must resubmit your files, please follow these instructions:

  • Notify the study curator(s) what type of update is made if not discussed prior.
  • Submit only new or updated files. Do not resubmit all files or we will need to compare every old and new file, which will add significantly to the processing time.
  • Resubmit data through the dbGaP Submission Portal (https://submit.ncbi.nlm.nih.gov/dbgap/) by replacing files, so that we have a formal record of your submission. Do not submit individual-level data through email.
  • Keep resubmitted filenames the same or add the date to the existing filename, i.e. yyyymmdd (ex. 20190101). Do not submit filenames used two versions or more ago. dbGaP will crosscheck the latest file submission against the previous submission and report any unexpected changes.
  • Double check the submission by going through the checklist of common errors: Quality Control
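
If you choose to add the date to a resubmitted filename, the sketch below shows one way to build the name in yyyymmdd form. The function name is hypothetical, and joining the date with an underscore is an assumption chosen to avoid hyphens, spaces, and extra periods in the file name.

```python
from datetime import date
from pathlib import PurePath


def dated_filename(filename, on=None):
    """Append _yyyymmdd to the stem of filename, keeping the extension."""
    on = on or date.today()
    path = PurePath(filename)
    return f"{path.stem}_{on:%Y%m%d}{path.suffix}"


print(dated_filename("6a_SampleAttributes_DS.txt", date(2019, 1, 1)))
# 6a_SampleAttributes_DS_20190101.txt
```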

dbGaP Processing and Release

27. What happens once I submit my core data files and phenotype files to the dbGaP database?

dbGaP curators work through the study queue in the order in which studies are submitted to the dbGaP Submission Portal. Study submissions should be complete, which may include all phenotype files, molecular (non-sequence¹) data, study documents, analyses, and imputation data. Completed study submissions can be released as soon as:

  1. dbGaP has finished processing the study;
  2. If there are high throughput human sequence data, all sequences appear ready/public in the Sample Status Telemetry Report (SSTR);
  3. The registration information is consistent with the submitted data and the study registration in the Submission System is marked "Completed by GPA";
  4. The study investigator or PI assistant has given permission to release the study. If you have additional files to submit for the release, then the study submission is incomplete and will not have priority in the processing queue.

You can track your study's progress through the Study Status Report (SSR).

QC Checks

We are offering pre-validation tools for you to check your data before submitting to dbGaP on your system using GaPTools.

dbGaP will run several quality control (qc) checks upon submission.

  1. Automated preprocessing checks will immediately be run after submission for studies with PLINK or VCFs, Subject Consents DS, Subject Sample Mapping (SSM) DS, and Pedigree DS. The automated system will email all submitters with results from the five types of files. If one of the five types is resubmitted, the automated system will be re-run. Here is a web page showing errors and warnings that the automated system may detect: https://www.ncbi.nlm.nih.gov/gap/public_utils/messages/.
  2. Manual and scripted qc checks will be completed by the dbGaP curators of your study. The phenotype curator and genotype curator will separately report back errors detected, since the processing occurs at different times depending on the queue and the errors can be complex within each component.

    • Phenotype Curation: The phenotype curator coordinates the entire study release and processes the information in the Submission System registration (SS), Submission Portal (SP), Study Config, DS and DD (Subject Consent, SSM, Pedigree, Subject Phenotypes, Sample Attributes, and Sample to NCBI Database Mapping), Study Documents, and Medical Images. All individual level data are split by consents. The manual portion includes reconciling the SS registration against the SP Questionnaire, validating consents, and checking for incongruent phenotypic values and summaries. Scripted qc checks look for inconsistencies between files and between all dbGaP studies, formatting errors that make loading of the datasets (DS) and data dictionaries (DD) into the dbGaP database impossible, inconsistencies between DS and DD with regard to subject consent, sex, affection status, and potential HIPAA violations. See Question 16 for common errors we encounter.
    • Genotype Curation: The genotype curator processes all molecular data EXCEPT for high throughput sequencing data. Molecular data may include SNP array, methylation, expression/epigenetic data, CNV, VCF, MAF, imputation, eQTL, and other formats. QC checks include sex checks, pedigree checks, and checks for unintended duplications. For data where these checks are not relevant, the data is packaged and split by consents. BAM, FASTQ, and CRAM files are not processed by the dbGaP genotype curator, but by sequence curators.
    • Combined Curation: Inconsistencies between molecular data sample IDs and phenotype sample IDs, unintended data duplications, incorrect pedigree information, and subject relationships will be further checked using the dbGaP software GRAF (Genetic Relationship and Fingerprinting).

GSR, GRAF-pop, and ALFA

dbGaP subjects that have genomic data and have been designated "non-sensitive" for release of Genomic Summary Results (GSR) in the dbGaP Submission System will also be analyzed using GRAF-pop and included in the ALFA (Allele Frequency Aggregator) project. Studies may be contacted to correct the submitted data or provide a README if:

  1. They contain allele frequencies that deviate from the expected range of known allele frequencies for the 12 diverse populations and/or
  2. The submitted ancestry or population deviates from the computed ancestry for a large number of samples.

Careful adherence to this submission guide and the emailed error reports can eliminate the need for resubmission and quicken the schedule for release.

Splitting Files by Consents

dbGaP will assign dbGaP-generated subject IDs and sample IDs and split the final individual level datasets (both phenotypes and genotypes) for release by consent, with the exception of the three meta study DS (Subject Consent, SSM and Pedigree). Subject IDs that have been marked as aliases will be assigned the same dbGaP subject ID. The dbGaP-generated IDs will appear in the final dump files, NCBI BioSample website, and the Sample Status Telemetry Report (SSTR).
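
To see what splitting by consent means for a dataset, here is a minimal sketch that groups rows by each subject's consent group. It is not dbGaP's internal implementation; the function name, the column name SUBJECT_ID, and the example rows are assumptions chosen for illustration.

```python
from collections import defaultdict


def split_by_consent(rows, consent_by_subject, id_column="SUBJECT_ID"):
    """Group dataset rows by the consent group of each row's subject."""
    groups = defaultdict(list)
    for row in rows:
        subject = row[id_column]
        if subject not in consent_by_subject:
            # Every subject must have a consent group before data can be split.
            raise KeyError(f"{subject} has no consent group")
        groups[consent_by_subject[subject]].append(row)
    return dict(groups)


# Hypothetical example data.
consents = {"S1": "GRU", "S2": "HMB", "S3": "GRU"}
phenotypes = [
    {"SUBJECT_ID": "S1", "AGE": 40},
    {"SUBJECT_ID": "S2", "AGE": 55},
    {"SUBJECT_ID": "S3", "AGE": 31},
]
for consent, rows in split_by_consent(phenotypes, consents).items():
    print(consent, [r["SUBJECT_ID"] for r in rows])
```

Running a check like this before submission shows quickly whether any row belongs to a subject without a consent assignment.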

Preview

Prior to posting your study, dbGaP will provide you with access to a preview site of your study that shows study content as it might appear on the final public dbGaP page: https://www.ncbi.nlm.nih.gov/gap. Once all the study components have been processed and you have reviewed the preview site, dbGaP will send an email to request the study investigator's or PI assistant's approval to release the study.


¹Sequence data (e.g. BAM, CRAM, FASTQ) should be submitted only after: 1) you have received an email with an attached sequence metadata file containing the registered subject and sample IDs, and consents. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. 2) The sequence metadata has been processed and a sequence curator contacts you to upload data.

28. When and what will be released?

The release occurs approximately 6-8 weeks following receipt of final datasets that are without error. If there are errors, the processing time will increase. The study registration in the Submission System must be marked "Completed by GPA". Once the study investigator or PI assistant and dbGaP approve of the posting of the study, it will be released in 2-3 business days to the following sites.

Public dbGaP page (https://www.ncbi.nlm.nih.gov/gap) – includes a study report page, public summary phenotype variables and datasets, molecular data summary, study documents, analyses browser, and indexing of various study terms for users to search and filter for studies. When your study becomes publicly available, the URL will appear like https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs00####.v#.p#, where the last part of the URL is the study accession number.

Public FTP site (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/) – features the study manifest (a list of all released files), study configuration (a list of how the study is configured in the Authorized Access system), release notes (summarizes the data that has been released and any changes since the last version), summary statistics of phenotype variables, phenotype data dictionaries, study documents, and analyses aka genomic summary results (truncated, gene-level, and/or summary level).

Authorized Access portal (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) – this is the management portal for individual-level data. This site can be used to submit a data access request, manage access requests, and download approved datasets.

What if I have a paper publication or must meet a specific release date?

If you need to schedule a study release to coincide with a publication (e.g. hold the study until a certain date, try to complete study processing by a certain date), communicate to dbGaP the specific date and/or at least a general time frame as soon as you know it. dbGaP will work with you to accommodate your release schedule whenever possible.

How often can dbGaP release my study?

A dbGaP study can be released quarterly at most. Finalized data must be submitted 6-8 weeks in advance for qc checks and processing. Please contact us if we need to work out a release schedule.

What should I do if I need my study accession public before the data has been processed?

If a publication requires that the study is public on the dbGaP page, please let us know and we can release the study report page in advance. The phenotype and molecular data can then be released at a later time.

Can an embargo date be applied?

There are no longer publication embargo dates. See https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing-faqs/. However, if you need dbGaP to postpone a study from release until a certain date, please confer with the PO and GPA assigned to your dbGaP study to agree on a date of release (only weekdays). Once a date has been decided, please email the dbGaP phenotype curator along with the PO and GPA to let us know the agreed upon date.

29. Whom may I contact with questions about my dbGaP data submission?

General dbGaP questions and Authorized Access questions: dbgap-help@ncbi.nlm.nih.gov

dbGaP Submission Portal questions: dbgap-sp-help@ncbi.nlm.nih.gov

Phenotype and molecular data questions, please contact the assigned study curator(s).

  • Phenotype curators: study config, IDs, consents, phenotype data, study documents, medical images, study release schedules, etc.
  • Genotype curators: all molecular data EXCEPT for high throughput sequencing related questions
  • Sequence curators: high throughput sequence data. Contact sra@ncbi.nlm.nih.gov

dbGaP Team Lead: Michael Feolo feolo@ncbi.nlm.nih.gov

dbGaP Versions

30. How can I submit additional data after my study is released?

Once your study is released, it is a historical record in the dbGaP database. If you would like to submit new data or update existing data (correct, remove, or add rows or columns of data), dbGaP will create a new version of your study. This means that the study accession of your study will be updated, e.g., phs001000.v1.p1 to phs001000.v2.p?, where the version number (v#) will increment by one and the participant set number (p#) will increment by one if subjects have been retired or have moved from one consent group to another. If only new subjects have been added, the p# will not be incremented. Once a new version of a study is released, the prior version will no longer be available for download. The new version will encompass all files from the previous version and any newly submitted data. Please be sure to specify if a file needs to be replaced with a new file or if a file needs to be retired and removed from this new version release.
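
The versioning rule above can be sketched as a small function. This is only an illustration of the rule as described in this section: the function name and the boolean flag are hypothetical, and dbGaP assigns the actual accession.

```python
import re

# Study accession as described in this guide: phs######.v#.p#
ACCESSION = re.compile(r"^(phs\d{6})\.v(\d+)\.p(\d+)$")


def next_accession(current: str, participant_set_changed: bool) -> str:
    """Return the accession expected for the next released version.

    participant_set_changed is True when subjects were retired or moved
    between consent groups; adding new subjects alone leaves p# unchanged.
    """
    match = ACCESSION.match(current)
    if match is None:
        raise ValueError(f"not a study accession: {current!r}")
    study = match.group(1)
    version = int(match.group(2))
    participant_set = int(match.group(3))
    if participant_set_changed:
        participant_set += 1
    return f"{study}.v{version + 1}.p{participant_set}"


print(next_accession("phs001000.v1.p1", participant_set_changed=False))
print(next_accession("phs001000.v1.p1", participant_set_changed=True))
```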

For new versions of a study, we ask users to continue to follow the guidelines in this Submission Guide. Repeated formatting errors will increase processing time. More importantly, if the data are inconsistent (for example, IDs do not match, counts between files do not match, or reported sex values do not match genotyped sex values), each iteration of the new version will take substantially longer to process. Double check the submission by going through the checklist of common errors: Quality Control

  1. Submitters should continue to submit data using the dbGaP Submission Portal. Email dbgap-sp-help@ncbi.nlm.nih.gov if you have questions regarding the Portal. Email the assigned phenotype and genotype curator for any data specific questions.
  2. Only new or updated files should be submitted.
    1. Do not submit files that have been submitted previously and are unchanged.
    2. If the study config is updated, it should be cumulative and describe all versions of the study.
    3. The Subject Consent (SC) files, Subject Sample Mapping (SSM) files, and Pedigree files should always be cumulative, i.e. all subjects and samples used in version 1 should be included in the version 2 SC, SSM and pedigree files. If a subject or sample is not included, dbGaP will mark the subject or sample as retired and the data will no longer be available in the new version. High throughput sequences belonging to retired samples will also be removed.
    4. For subject phenotypes and sample attributes files, only new and updated datasets (DS) and data dictionaries (DD) should be submitted. If the DS and DD have not changed from the prior version, do not resubmit. If a dataset needs to be retired, please notify your study curator. dbGaP will not concatenate multiple datasets into a single dataset, so please submit the datasets according to how the user might best use the data. Updated datasets should include all the data that were previously submitted plus any additional changes. Updated datasets will retain the same pht (phenotype table accession) and have an incremented version number. New datasets will be assigned a new pht accession. There can be any number of subject phenotypes and sample attributes. For more guidance on whether to update a previously submitted dataset or add a brand new dataset, see "Can I submit multiple subject phenotypes DS files?" and "Can I submit multiple sample attributes DS files?"
    5. For molecular data, only new and updated molecular data should be submitted. Existing molecular data will be promoted to the next version, unless otherwise indicated. If consents have been updated, genotype curators will re-split the molecular data files according to the new consents, so that you will not need to resubmit for consent updates.
    6. Prior to submission, please check that the files you submit contain the expected number of subjects and samples and the appropriate consent information.
  3. Retain the format and corrections that were made in the previous version following the Submission Guide. Remaking the same changes will take additional time.
    1. Check that variable names in the Dataset and the matching Data Dictionary are identical in spelling, i.e. have the same number of spaces, same case, etc.
    2. Check that every variable has a variable description. Check that coded values in the Dataset have code meanings listed in the Data Dictionary.
    3. Check that the sex of a subject remains consistent throughout a single study. If the sex has been changed as a result of a correction, please let dbGaP know via email.
    4. Check that the case control status of a subject remains consistent throughout a single study.
    5. Check that all subjects have been assigned a consent group.
    6. Check that the existing subject and sample ID mappings remain the same between versions, unless there is an error and an ID needs to be remapped. In case of ID remapping, please let dbGaP know which IDs need to be remapped.
    7. Check that all samples are mapped to a subject and therefore to a consent group.
    8. Check that the data files contain the values you expect. Check for truncated values. Compare new files to the final files submitted for the previous version to check for differences and to make sure all changes are intended. If you need more information regarding which files were incorporated into the final release of the previous version, please make a request to dbGaP.
  4. To help us better understand the new version, please let your phenotype curator know:
    1. Have new subjects been added? How many?
    2. Have any subjects been deleted or changed consent groups? How many? To protect subject identity, if only 1 subject (person) is being deleted, we ask that either additional subjects are added or 1 additional subject is retired. This makes it harder for public users to compare variable summaries between versions and identify the phenotypes of that 1 person.
    3. Have new samples been added? How many?
    4. Have any samples been deleted? How many?
    5. Have any samples been remapped to different subjects? How many?
    6. Have any samples and subjects been renamed? If yes, provide a 4 column table with the column headers: Old Sample, New Sample, Old Subject, New Subject. If only the samples are being renamed, then provide only the first 2 columns. Submit to the dbGaP Submission Portal under "Other files" with Type "Special".
    7. Have new datasets been added? What are their filenames?
    8. Have new variables been added or deleted in existing datasets? How many?
    9. Have variables been renamed between versions? If so, list the old and new names so that we can retain the existing variable accessions and increment the variable version.
    10. Have molecular data been added or do released data need to be replaced or removed?
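
Several of the questions above (subjects added, subjects deleted) can be answered by comparing the ID lists of the previous and new Subject Consent files. The following is a minimal sketch with hypothetical IDs and a hypothetical function name; it assumes the IDs have already been read from both files.

```python
def compare_versions(previous_ids, new_ids):
    """Report IDs added in the new version and IDs missing from it."""
    previous, new = set(previous_ids), set(new_ids)
    return {
        "added": sorted(new - previous),
        # IDs missing from a cumulative file would be marked as retired.
        "missing": sorted(previous - new),
    }


# Hypothetical subject IDs from version 1 and version 2.
v1_subjects = ["S1", "S2", "S3"]
v2_subjects = ["S1", "S3", "S4", "S5"]

report = compare_versions(v1_subjects, v2_subjects)
print(report)
if len(report["missing"]) == 1:
    print("Only one subject would be retired; see the note on protecting subject identity.")
```

The same comparison works for sample IDs from the Subject Sample Mapping file.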

GLOSSARY OF TERMS

Authorized Access (AA)

Authorized Access, (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login), is the management portal for individual-level data. This site can be used to submit a Data Access Request (DAR), manage access requests, and download approved datasets.

ALFA

NCBI's Allele Frequency Aggregator (ALFA) pipeline computes allele frequencies for variants in dbGaP across approved unrestricted studies and provides the data as open-access to the public through dbSNP. Access ALFA at https://www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/

Consent

Study participant consents are determined by your institution's IRB. When filling out the Submission Certification, the consents should then be matched to the NIH Standard Data Use Limitation consent groups. Each person should belong to a single consent group. If a subject belongs to two or more consent groups in your study, pick the most stringent of those consent groups, so that each person belongs to a single consent group. The consent groups will then be registered in the dbGaP Submission System by your study's Genomic Program Administrator (GPA). If you are the study investigator, you can see the consent groups in the dbGaP Submission System. If you are a submitter, you can see the consent groups in the dbGaP Submission Portal for your study by clicking "View consent group" in the box on the upper right. dbGaP Authorized Access users request studies by consent group. For questions regarding the registered consent group and DUL, please contact your GPA.

See the NIH Guidance for Consents under the GDS Policy: https://osp.od.nih.gov/wp-content/uploads/NIH_Guidance_on_Elements_of_Consent_under_the_GDS_Policy_07-13-2015.pdf

See NIH Standard Data Use Limitations: https://osp.od.nih.gov/wp-content/uploads/standard_data_use_limitations.pdf

A study should be designated with at least one NIH consent group title.

Consent Group Titles                          Consent Group Abbreviations
General Research Use                          GRU
Health/Medical/Biomedical                     HMB
Disease-Specific (Disease/Trait/Exposure)     DS-xxx
Other                                         Customized text

Additional limitations can be added if applicable.

Consent Group Limitations     Consent Group Abbreviations
IRB approval required         IRB
Publication required          PUB
Collaboration required        COL
Not-for-profit use only       NPU
Methods                       MDS
Related disorders             RD
Genetic studies only          GSO

For example, a study might have two consent groups: 1) General Research Use with IRB approval and Not-for-profit use and 2) General Research Use. Therefore, a subset of the subjects would have the GRU-IRB-NPU designation, while the remaining subjects would be GRU. There should be no overlapping subjects between the two consent groups.
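
As an illustration of how a title and its limitations combine into a designation such as GRU-IRB-NPU, and of the rule that consent groups must not share subjects, here is a minimal sketch. The function names are hypothetical, the Disease-Specific and Other titles are left out because they need study-specific text, and the order of limitations is simply the order given by the caller.

```python
# Abbreviations taken from the tables above.
TITLES = {
    "General Research Use": "GRU",
    "Health/Medical/Biomedical": "HMB",
}
LIMITATIONS = {
    "IRB approval required": "IRB",
    "Publication required": "PUB",
    "Collaboration required": "COL",
    "Not-for-profit use only": "NPU",
    "Methods": "MDS",
    "Related disorders": "RD",
    "Genetic studies only": "GSO",
}


def consent_abbreviation(title, limitations=()):
    """Join the title abbreviation and limitation abbreviations with hyphens."""
    parts = [TITLES[title]] + [LIMITATIONS[name] for name in limitations]
    return "-".join(parts)


def overlapping_subjects(groups):
    """Return subjects that appear in more than one consent group."""
    seen, overlap = set(), set()
    for subjects in groups.values():
        members = set(subjects)
        overlap |= seen & members
        seen |= members
    return sorted(overlap)


print(consent_abbreviation("General Research Use",
                           ["IRB approval required", "Not-for-profit use only"]))
print(overlapping_subjects({"GRU-IRB-NPU": ["S1", "S2"], "GRU": ["S2", "S3"]}))
```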

Data Access Committee (DAC)

What is the DAC? See https://osp.od.nih.gov/scientific-sharing/data-access-request-dar-approvals-and-disapprovals-by-data-access-committee-dac/

DAC Chairs and Emails: https://osp.od.nih.gov/wp-content/uploads/NIH_DACs_Chairs.pdf

Genomic Data Sharing (GDS): https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing/

DAC Processing Time of DARs: https://osp.od.nih.gov/scientific-sharing/data-access-committee-dac-processing-time-of-data-access-requests-dars/

Data Access Request (DAR)

DAR Approvals and Disapprovals by the DAC: https://osp.od.nih.gov/scientific-sharing/data-access-request-dar-approvals-and-disapprovals-by-data-access-committee-dac/

The Signing Official should confirm that the individual listed as the IT Director has a background in computer security, has institutional (not just departmental) authority, can confirm that your institution has the capacity to protect shared data, and will comply with the NIH Genomic Data Sharing Policy.

Data Use Limitations (DUL)

https://osp.od.nih.gov/wp-content/uploads/standard_data_use_limitations.pdf

See Consents for examples.

dbGaP Accession Numbers

Study Accession Number - Once the study config is loaded, a study accession is assigned: phs######.v#.p#. The study accession is a unique, stable, and versioned identifier (ID) that can be used in publications. It is prefixed by "phs," indicating a phenotype study.

The version number (.v#) and participant set number (.p#) do not change during iterations within a release cycle. They change only after a release, once existing data have been changed or new data have been added. The study v# is always incremented, while the v# of each component is incremented only when that specific component changes. The p# is incremented when subjects in an existing study set change consent status. The p# is never incremented when only new subjects are added and existing subjects have not changed consents.

Dataset Accession Number - Each phenotype table (SC, SSM, pedigree, subject phenotypes, and sample attributes) is assigned a pht######.v#.

Variable Accession Number - Each variable in a phenotype table (SC, SSM, pedigree, subject phenotypes, and sample attributes) is assigned a phv########.v#.

Document Accession Number - Each study document (e.g. protocols, questionnaires, manuals of procedures and operations) is assigned a phd######.#, where .# is the version number.

Molecular Data Accession Number - Each grouping of molecular data is assigned a phg######.v#.

Analysis Accession Number - Each analysis is assigned a pha#######.v#.
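
The accession prefixes above can be told apart programmatically. The sketch below is deliberately lenient about the number of digits and is not an official validator; the function and dictionary names are hypothetical.

```python
import re

# Prefix meanings as listed in this glossary entry.
ACCESSION_TYPES = {
    "phs": "study",
    "pht": "dataset",
    "phv": "variable",
    "phd": "document",
    "phg": "molecular data",
    "pha": "analysis",
}

# Lenient pattern: any digit count, optional "v" before the version
# (document accessions are written as phd######.#), optional .p#.
PATTERN = re.compile(r"^(ph[stvdga])(\d+)\.(v?)(\d+)(?:\.p(\d+))?$")


def parse_accession(text):
    """Split an accession into its type, number, version, and participant set."""
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a dbGaP accession: {text!r}")
    prefix, number, _, version, participant_set = match.groups()
    return {
        "type": ACCESSION_TYPES[prefix],
        "number": number,
        "version": int(version),
        "participant_set": int(participant_set) if participant_set else None,
    }


print(parse_accession("phs001000.v2.p1"))
print(parse_accession("phd000123.1"))
```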

Dummy IDs

Dummy IDs are IDs created by the submitter to fill in unknown mother and father IDs when establishing a sibling relationship in the pedigree file. It is important that the dummy ID for the mother and father ID be unique. It is assumed that the dummy mother ID and father ID are identical for full sibling pairs.
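
One way to generate dummy parent IDs is sketched below: each group of full siblings gets one shared dummy mother ID and one shared dummy father ID, and the dummy IDs are checked against the real subject IDs. The function name, the DUMMY prefix, and the column names SUBJECT_ID, MOTHER, and FATHER are assumptions for illustration.

```python
def add_dummy_parents(sibships, existing_ids=(), prefix="DUMMY"):
    """Give each full-sibling group one shared dummy mother and father ID."""
    taken = set(existing_ids)
    rows = []
    for index, siblings in enumerate(sibships, start=1):
        mother = f"{prefix}_M{index}"
        father = f"{prefix}_F{index}"
        # Dummy IDs must be unique, so they may not collide with real IDs.
        for dummy in (mother, father):
            if dummy in taken:
                raise ValueError(f"dummy ID {dummy} is already in use")
            taken.add(dummy)
        for subject in siblings:
            rows.append({"SUBJECT_ID": subject, "MOTHER": mother, "FATHER": father})
    return rows


# Two hypothetical sibling pairs with unknown parents.
for row in add_dummy_parents([["S1", "S2"], ["S3", "S4"]],
                             existing_ids=["S1", "S2", "S3", "S4"]):
    print(row)
```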

Dump Files

Dump files is the term used to describe the individual-level phenotype data (SC, SSM, pedigree, subject phenotypes, and sample attributes) generated and distributed through controlled access. Dump file names have the study accession (phs), table accession (pht), a short dataset name, and consent designations. Each file has variable accessions and dbGaP-assigned subject IDs and/or sample IDs in addition to data submitted. The SSM dataset dump file also has BioSample IDs.

Genomic Program Administrator (GPA)

https://osp.od.nih.gov/wp-content/uploads/IC_GPAs.pdf

Genomic Summary Results (GSR)

https://osp.od.nih.gov/wp-content/uploads/What_are_Genomic_Summary_Results.pdf

GRAF

GRAF (Genetic Relationship and Fingerprinting) is a C++ program that quickly finds closely related subjects using SNP genotype data. Access GRAF at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi

HIPAA - algorithm to detect HIPAA sensitive dates

  1. Two 1 or 2-digit numbers and a 2 or 4-digit number, in this order, separated by "/", "-" or ".", e.g., "3/5/1994" or "12-28-03".
  2. One 4-digit number and two 1 or 2-digit numbers separated by "/", "-" or ".", e.g., "1994.2.13".
  3. A 1 or 2-digit number and a 4-digit number starting with 19 or 20 separated by "/", e.g., "10/1994" (but not "10.1994").
  4. A 1 or 2-digit number followed by a "/" and a 2-digit number starting with 0, e.g., "3/04" (but not "3/94").
  5. A month name or abbreviation and a 1, 2, or 4-digit number, in either order, separated by some non-letter, non-number characters or not separated, e.g., "JAN '93", "FEB64", "May 3rd" (but not "May be 14").
  6. A 6-digit number is considered to be a potential date value if its first four digits make a valid date in mmdd format (i.e., first two digits read as month and second two as day of the month). For example, 112876 is considered to be a potential date value since 1128 is a valid date (Nov. 28) in mmdd format; 231208 or 113198 is not a potential date since 23/12 or 11/31 is not a valid date in month/day format. If all of the values, or the first 10 values, of a variable are 6-digit potential dates, this variable together with its potential date values will be reported by the scripts.
  7. An 8-digit number is considered to be a potential date value if it makes a valid date in the 20th or 21st century in either mmddyyyy or yyyymmdd format. For example, "19940822" is considered to be a potential date since it can be read as 1994/08/22 (Aug. 22, 1994). "10312005" is a potential date value since it can be read as 10/31/2005 (Oct. 31, 2005). "19080230" is not considered to be a potential date since neither 1908/02/30 nor 19/08/0230 is a valid date in the 20th or 21st century. If all of the values or the first 10 values of a variable are 8-digit numbers of potential date values, the variable will be reported as containing potential HIPAA violations.

In addition to date values, the QC scripts also report data values that look like social security numbers (e.g., "123-45-6789" or "123456789"), phone numbers (e.g., "321-456-7890" or "(301)456-7890"), zip codes (e.g., "MD 20892"), etc. A few cases of this kind of sensitive information have been detected by the QC scripts. However, other cases like names of people are not reported by the QC scripts. A few cases of names of patients and providers have been detected by visual inspection.
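
To pre-screen your own files for date-like values, the rules above can be approximated as follows. This sketch is not the dbGaP QC script. It covers rules 1 to 4, 6, and 7 for a single value, omits rule 5 (month names) and the column-level "all or first 10 values" logic, and makes two assumptions: Feb. 29 counts as a valid mmdd date, and "20th or 21st century" means years 1900 to 2099.

```python
import re
from datetime import date

SEP = r"[/.\-]"
RULES = [
    # Rule 1: two 1-2 digit numbers, then a 2 or 4 digit number.
    re.compile(rf"\b\d{{1,2}}{SEP}\d{{1,2}}{SEP}(?:\d{{4}}|\d{{2}})\b"),
    # Rule 2: a 4 digit number, then two 1-2 digit numbers.
    re.compile(rf"\b\d{{4}}{SEP}\d{{1,2}}{SEP}\d{{1,2}}\b"),
    # Rule 3: 1-2 digit number, "/", 4 digit number starting with 19 or 20.
    re.compile(r"\b\d{1,2}/(?:19|20)\d{2}\b"),
    # Rule 4: 1-2 digit number, "/", 2 digit number starting with 0.
    re.compile(r"\b\d{1,2}/0\d\b"),
]


def _valid(year, month, day):
    """True if year-month-day is a real calendar date."""
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False


def looks_like_date(value: str) -> bool:
    """Return True if a single value matches one of the sketched rules."""
    text = value.strip()
    if any(rule.search(text) for rule in RULES):
        return True
    if re.fullmatch(r"\d{6}", text):
        # Rule 6: first four digits read as mmdd (leap year used so 0229 passes).
        return _valid(2000, int(text[:2]), int(text[2:4]))
    if re.fullmatch(r"\d{8}", text):
        # Rule 7: yyyymmdd or mmddyyyy with a year from 1900 to 2099.
        ymd = (int(text[:4]), int(text[4:6]), int(text[6:]))
        mdy = (int(text[4:]), int(text[:2]), int(text[2:4]))
        return any(1900 <= y <= 2099 and _valid(y, m, d) for y, m, d in (ymd, mdy))
    return False


for example in ["3/5/1994", "10.1994", "3/94", "112876", "113198", "19080230"]:
    print(example, looks_like_date(example))
```

A value that passes this screen may still be flagged by the dbGaP scripts, so treat the result as a first check only.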

Institutional Certification (Institution Cert)

https://osp.od.nih.gov/scientific-sharing/institutional-certifications/

IT Director

The IT Director is a person who has the institutional (and not just a department) authority and can confirm that your institution has the capacity to protect shared data, and will comply with NIH Genomic Data Sharing Policy. The IT Director should have a background in computer security and should not be the same person as the PI, any of the collaborators, the Signing Official, or the IRB review board. For example, your Chief Information Officer would be appropriate.

PI Assistant

The study investigator may designate an individual to be the PI Assistant in the dbGaP Submission System. This individual will have "manager" and "submitter" permissions in the dbGaP Submission Portal and will be the primary contact for dbGaP. This individual will be able to provide final approval for the study release.

SAMPLE_ID

A dbGaP Sample is defined as the ID of the final preps submitted to dbGaP by a genotyping center, runs from high throughput sequencing by a sequencing group, or data submitted to an NCBI resource, such as GEO or GenBank. A single subject may be mapped to multiple samples, but a single sample should not be mapped to multiple subjects unless the samples are pooled.* For example, if one subject (SUBJECT_ID) provided one sample, and that sample was processed to generate 2 sequencing runs or 1 sequencing and 1 genotyping array run, the data file would show two rows, both using the same subject ID, but having 2 unique sample IDs.

*Please inquire about pooled samples if applicable. This would only apply to pooled samples that belong to consented subjects. If the samples are pooled from controls that are publicly available, there is no need for marking the pooled samples, and a single sample ID may be assigned.
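
The mapping rule (one subject may have many samples, but a non-pooled sample should map to only one subject) can be checked before submission. The sketch below assumes the Subject Sample Mapping rows have been read as (subject ID, sample ID) pairs; the function name and example IDs are hypothetical.

```python
from collections import defaultdict


def samples_with_multiple_subjects(ssm_rows):
    """Return sample IDs that are mapped to more than one subject ID."""
    subjects_by_sample = defaultdict(set)
    for subject_id, sample_id in ssm_rows:
        subjects_by_sample[sample_id].add(subject_id)
    return {
        sample: sorted(subjects)
        for sample, subjects in subjects_by_sample.items()
        if len(subjects) > 1
    }


# One subject with two samples is fine; one sample with two subjects is not.
rows = [
    ("SUBJ1", "SAMP1"),
    ("SUBJ1", "SAMP2"),
    ("SUBJ2", "SAMP3"),
    ("SUBJ3", "SAMP3"),
]
print(samples_with_multiple_subjects(rows))
```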

Each sample should be submitted with a single, unique, de-identified sample ID. Sample IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the sample ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SAMPLE_ID in one file and SAMPLE_NAME in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP sample ID that will be included in the final dump files along with the submitted sample ID.
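
The character and zero-padding rules above can be checked with a short script. This is a minimal sketch with a hypothetical function name; it checks ID values only and does not check the variable name chosen for the ID column. The same rules are stated for subject IDs.

```python
import re

# Letters, digits, period, hyphen, underscore, at symbol, pound sign.
ALLOWED = re.compile(r"[A-Za-z0-9.\-_@#]+")


def id_problems(value: str) -> list:
    """Return a list of rule violations for one submitted ID value."""
    problems = []
    if not value:
        problems.append("ID is empty")
    elif not ALLOWED.fullmatch(value):
        problems.append("contains characters other than letters, digits, . - _ @ #")
    if re.fullmatch(r"0[0-9]+", value):
        problems.append("integer ID is zero padded")
    return problems


for candidate in ["NA12878", "sample_01@lab#2", "007", "A 1"]:
    print(repr(candidate), id_problems(candidate))
```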

Sample Status Telemetry Report (SSTR)

The SSTR includes a complete list of sample IDs, subject IDs, consents, dbGaP assigned IDs and study repository, NCBI BioSample accessions (SAMN#), SRA sample accessions (SRS#), and sra_data_details. This report is provided to submitters submitting high throughput sequence data once the IDs and consents have been loaded into the dbGaP database and provided to NCBI BioSample. It allows submitters to track when the sequence metadata has been accepted and SRA sample accessions have been assigned. In the sra_data_details column, it allows submitters to see if there are errors with the submitted sequence data or if the sequence data is ready for release or is already public. Submitters should verify that the number of samples with sequences matches what they expect the count to be.

Study Logo

A Study Logo is a high-quality study image at least 200px by 200px in size. Study logos appear on the bottom of the study report page in the attribution section.

Study Registration

https://osp.od.nih.gov/scientific-sharing/study-registration-and-data-submission-to-an-nih-designated-controlled-access-data-repository/

Study Status Report (SSR)

A Study Status Report (SSR) is used to track the progress of your study processing, and includes contact emails for your phenotype curator, genotype curator, Program Officer (PO), and GPA. There is a link to your study's SSR from the Submission System, Submission Portal, preview site instructions, and preview site.

SUBJECT_ID

A dbGaP Subject is defined as a single human person/individual/patient that arises from a single germline. Each subject should be submitted with a single, unique, de-identified subject ID. Subject IDs should be an integer or string value. Integers should not have zero padding. IDs should not have spaces. Specifically, only the following characters can be included in the ID: English letters, Arabic numerals, period (.), hyphen (-), underscore (_), at symbol (@), and the pound sign (#). Once a variable name for the subject ID has been chosen, please use the same variable name throughout all the phenotype files for consistency. For example, please do not use SUBJECT_ID in one file and INDIVIDUAL_ID in another file. Please also do not use "dbGaP" in your submitted ID name, since dbGaP will assign a dbGaP subject ID that will be included in the final dump files along with the submitted subject ID. Subjects that are known to be the same person across dbGaP studies will be assigned the same dbGaP subject ID.

Submission Portal (SP)

The Submission Portal (SP) link is https://submit.ncbi.nlm.nih.gov/dbgap/. Login using the same email address that was used to accept the SP invitation. The SP is a secure way to upload and track study data to dbGaP. The files accepted in the SP are: Study Config, Subject Consents, Subject Sample Mapping, Pedigree, Subject Phenotypes, Sample Attributes, Logos, Documents: Phenotypes, Molecular Data¹, Medical Images, Association Analyses (Genomic Summary Results), special files requested by the study curator, and Exchange Area files. Do not submit sequence data (BAM, CRAM, or FASTQ) through the SP. The SP can be accessed by submitters who have been sent an invitation to submit, and have accepted the invitation within 7 days. Initially, the study investigator and the PI submitter are sent invitations. Any person with a "Manager" role in the SP can add additional submitters. The SP is not the same as the dbGaP Submission System (SS).

Submission Portal Questionnaire (SP Questionnaire)

The Submission Portal Questionnaire is filled out by the submitter and informs dbGaP curators what data is expected for the study. This is matched against the Submission System registration, which is entered by the GPA in consultation with the study investigator and Program Officer (funding). A few of the common questions we have after comparing the SP Questionnaire against the SS are whether VCFs called from sequence data will be submitted and whether expression counts from RNASeq will be submitted. Please resolve with the GPA before submitting.


¹Sequence data (e.g. BAM, CRAM, FASTQ) should be submitted only after: 1) you have received an email with an attached sequence metadata file containing the registered subject and sample IDs, and consents. This process ensures that submitted sequences are tied to sample IDs that belong to consented subjects. 2) The sequence metadata has been processed and a sequence curator contacts you to upload data.

Submission System (SS) aka Registration System

The dbGaP Submission System (SS) is also known as the registration system. The link is https://dbgap.ncbi.nlm.nih.gov/dbgap/ss/dbgapss.cgi?login. The GPA works with the study investigator to determine the following: study principal investigator (PI), study project officer (PO), NIH administration and funding, target data delivery date, target public release date, release type, types of data submission expected, inclusion in CADA (Compilation of Aggregate Genomic Data - a collection of analyses across many dbGaP studies that can be accessed with a single Data Access Request), estimated study participants, SRA submission expected, and PI assistant for study submissions. The GPA will upload the Submission Certification, Institutional Certifications, and Data Use Certification, which specifies the Data Use Limitations (DUL). The DULs form the consent groups that will be used to parse the study data, and also determine which Data Access Requests (DAR) can be approved through dbGaP Authorized Access. BioProjects are created for each new study registered in the SS. The SS is only accessible by the GPA, PO, and PI. The SS is not the same as the dbGaP Submission Portal (SP). To make changes to the registration entry in the Submission System, contact your GPA.

Variable

A dbGaP Variable is defined as the variable name and associated column of data in a phenotype table (SC, SSM, pedigree, subject phenotypes, and sample attributes). The variable's metadata, such as the variable name, description, units, type, and encoded values, is defined in the Data Dictionary file for that phenotype table. The variable accession has the form phv########.v#.p#, where the version number (.v#) is incremented when the data column (phenotype values) changes following a release.
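The accession pattern described above can be split into its parts with a regular expression. The following is a minimal sketch, not an official dbGaP tool: it assumes that each # stands for one digit in the phv part (eight digits in total) and that the .v and .p parts may have one or more digits. The function name and the sample accession are hypothetical.

```python
import re

# Assumption: "phv########" means exactly eight digits; ".v#" and ".p#"
# are allowed to have one or more digits.
ACCESSION_RE = re.compile(r"^phv(\d{8})\.v(\d+)\.p(\d+)$")


def parse_variable_accession(accession):
    """Split a variable accession into its numeric parts.

    Returns a dict with the variable id, the version (.v#), and the
    number that follows ".p". Raises ValueError if the text does not
    match the phv########.v#.p# pattern.
    """
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        raise ValueError(f"not a variable accession: {accession!r}")
    variable_id, version, p_number = (int(group) for group in match.groups())
    return {"id": variable_id, "version": version, "p": p_number}


# Hypothetical accession, used only to illustrate the format.
parts = parse_variable_accession("phv00000123.v2.p1")
print(parts)
```

Comparing the "version" field of two accessions that share the same "id" shows whether the data column changed between releases.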

APPENDIX for Data Dictionary (DD) File Descriptions and Specifications

(*indicates required)

Column Headers Description
VARNAME* Variable name. The VARNAME must not contain forward (/) or backward slashes (\) or commas (,). Do not use "dbGaP" in the variable name.
VARDESC* Variable description. The description should be understandable and enable users to replicate the variable. For example, "blood pressure" is useful, but "brachial blood pressure while sitting" provides more context. Alternatively, study documents with detail are also acceptable.
DOCFILE Study document name associated with the variable. To list multiple documents, add a semicolon (;) between documents. Please list only study document filenames that are submitted to dbGaP.
TYPE Data value type: integer (1,2,3,4,…), encoded value (integers or strings are coded for non-numerical meaning, ex. 1=Control; 2=Case, see VALUES), decimal (0.5,2.5,…), string (African American, Asian, Caucasian, Hispanic, Non-Hispanic). For mixed values (any combination of string, integers, decimals and/or encoded values) in a single data column, list all types present.
UNITS* Units of measurement of the variable.
MIN The logical minimum value of the variable. If a separate code such as -1 is used for a missing field, this should not be considered as the MIN value.
MAX The logical maximum value for the variable. If a separate code such as 9999 is used for a missing field, this should not be considered as the MAX value.
RESOLUTION Measurement resolution – the number of decimal places to which a measured value is presented in the data. For example, in 54.321 the resolution is 3.
COMMENT1, COMMENT2 Additional information not included in the VARDESC that will further define the variable. If additional comments are needed beyond COMMENT2, insert new columns (COMMENT3, COMMENT4, etc.) before the column "ORDER."
VARIABLE_SOURCE Source of controlled vocabularies. Ex. PhenX, MeSH, SNOMED, NCI. If there is no match, leave blank. (Must be submitted as a group with SOURCE_VARIABLE_ID and VARIABLE_MAPPING).
SOURCE_VARIABLE_ID A unique identifier from the VARIABLE_SOURCE or a unique text concept/term from various controlled vocabularies. (Must be submitted as a group with VARIABLE_SOURCE and VARIABLE_MAPPING).
VARIABLE_MAPPING For example, a variable from the source could be Identical, Related, or Comparable. (Must be submitted as a group with VARIABLE_SOURCE and SOURCE_VARIABLE_ID).
UNIQUEKEY Unique key is a combination of variables that is designed to uniquely identify a row in a longitudinal dataset or rows that have repeating SUBJECT_IDs or SAMPLE_IDs. Mark "X" for variables that constitute the unique keys, and leave other values blank. Ex. SUBJECT_ID and VISIT_NUMBER. UNIQUEKEYs can only be used in the subject phenotypes file and some cases of the sample attributes file. The SC, SSM, and pedigree files should never have UNIQUEKEYs marked, since there should be a unique identifier appearing once in each file.
COLLINTERVAL Collection interval is the time frame in which the data for the variable or dataset was collected.
ORDER The order in which VALUES appear on the variable summary report page. If VALUES of a single variable/column of data are integers or decimals, leave blank. If VALUES are encoded values, string, or mixed, define the order. VALUES can be ordered by Frequency (highest to lowest frequency of VALUES) or by List (user specifies order through placement in VALUES columns). For mixed values within a single variable/column of data, see examples: "age" and "weight" in example file 5b_SubjectPhenotypes_DD.xlsx.
VALUES* List of all unique values and/or descriptions of all encoded values, one value per cell. Encoded values are defined as a value and its meaning. For example, if a data file contains a variable named "EDUCATION" and its data values are "1, 2, 3, and 99," these coded values will need to be defined in the data dictionary. The format of an encoded value is VALUE=MEANING. Therefore, in the data dictionary, there should be 4 separate data cells filled out with the following: 1=Completed High School, 2=Completed College, 3=Completed Graduate School, 99=Unknown. The "VALUES" header must be the last column header (farthest right in the table). It should appear only in the column above the first encoded value that is listed. The remaining column header cells should be left blank. The script will identify the first code meanings and continue right until there are no more code meanings. For example, if the variable "SEX" has 3 encoded values: 1=Male, 2=Female and 3=Unknown, the column header "VALUES" will appear only above the cell that contains 1=Male. 1=Male, 2=Female and 3=Unknown will be listed in three separate cells next to each other. The header column cells above "2=Female" and "3=Unknown" should be left blank.
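Several of the rules in the table above can be checked mechanically before a Data Dictionary is submitted. The sketch below is illustrative only and is not part of GaPTools or any dbGaP software. All function names are hypothetical, each Data Dictionary row is assumed to have been read into a Python dict keyed by column header, and the "dbGaP" check is assumed to be case-insensitive.

```python
from collections import Counter
from decimal import Decimal

FORBIDDEN_VARNAME_CHARS = ("/", "\\", ",")
MAPPING_GROUP = ("VARIABLE_SOURCE", "SOURCE_VARIABLE_ID", "VARIABLE_MAPPING")


def varname_problems(varname):
    """Return a list of VARNAME rule violations (empty if there are none)."""
    problems = []
    for ch in FORBIDDEN_VARNAME_CHARS:
        if ch in varname:
            problems.append(f"contains forbidden character {ch!r}")
    # Assumption: the ban on "dbGaP" is applied regardless of letter case.
    if "dbgap" in varname.lower():
        problems.append('contains "dbGaP"')
    return problems


def mapping_group_ok(row):
    """True if the three mapping columns are all filled or all blank."""
    filled = [bool(str(row.get(col, "")).strip()) for col in MAPPING_GROUP]
    return all(filled) or not any(filled)


def resolution(value_text):
    """Number of decimal places in a value as written, e.g. '54.321' -> 3."""
    exponent = Decimal(value_text).as_tuple().exponent
    return max(0, -exponent)


def duplicate_keys(rows, key_columns):
    """Return the UNIQUEKEY combinations that occur in more than one data row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Made-up rows from a longitudinal subject phenotypes file.
visits = [
    {"SUBJECT_ID": "S1", "VISIT_NUMBER": 1},
    {"SUBJECT_ID": "S1", "VISIT_NUMBER": 2},
    {"SUBJECT_ID": "S1", "VISIT_NUMBER": 1},
]
print(varname_problems("dbGaP_ID/1"))
print(resolution("54.321"))
print(duplicate_keys(visits, ["SUBJECT_ID", "VISIT_NUMBER"]))
```

An empty result from duplicate_keys means the columns marked "X" under UNIQUEKEY identify each row exactly once.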

Example of VALUES:

Last column with header | Leave header blank | Leave header blank | Leave header blank
VALUES | (blank) | (blank) | (blank)
10=Elementary | 20=High School | 40=College | 4=Graduate School
1=2-4 drinks per day | 2=5-7 drinks per day | 3=>7 drinks per day |
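The left-to-right reading of VALUES cells described above can be sketched as follows. This is a hedged illustration, not the script dbGaP runs: the function names are hypothetical, rows are assumed to be lists of cell strings, and a cell without "=" is assumed to be a plain (non-encoded) value. Only the first "=" is treated as the separator, so a meaning such as ">7 drinks per day" stays intact.

```python
def parse_value_cell(cell):
    """Split one VALUES cell into (value, meaning).

    Encoded values have the form VALUE=MEANING and are split on the
    first "=" only. A cell without "=" is treated as a plain value and
    returned with a meaning of None (an assumption of this sketch).
    """
    value, sep, meaning = cell.partition("=")
    if not sep:
        return cell.strip(), None
    return value.strip(), meaning.strip()


def parse_values_row(header, row):
    """Read cells from the VALUES column rightward until a blank cell.

    `header` is the list of column headers, in which "VALUES" appears
    once above the first value; `row` is one Data Dictionary row.
    """
    start = header.index("VALUES")
    codes = {}
    for cell in row[start:]:
        if not cell.strip():
            break
        value, meaning = parse_value_cell(cell)
        codes[value] = meaning
    return codes


# Header with VALUES as the last named column and blank headers after it.
header = ["VARNAME", "VARDESC", "UNITS", "VALUES", "", ""]
row = ["SEX", "Biological sex", "", "1=Male", "2=Female", "3=Unknown"]
print(parse_values_row(header, row))
print(parse_value_cell("3=>7 drinks per day"))
```

Because the reader stops at the first blank cell, an accidental empty cell between two encoded values would hide every value to its right, which is one reason to keep the cells contiguous.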

Last updated: 2021-05-26T16:40:09Z