SOFT submission instructions for high-throughput sequencing data


Overview


Simple Omnibus Format in Text (SOFT) is designed for rapid batch submission (and download) of data. SOFT is a simple, line-based, plain-text format, so SOFT files can be readily generated from common spreadsheet and database applications. A single SOFT file can hold both data tables and accompanying descriptive information for multiple Sample and/or Series records.

Raw data files and processed data files should be provided: bundle them together with the SOFT file into a .zip, .rar, or .tar archive, using a program like WinZip, at the time of submission. Due to the size of the files, high-throughput sequencing submissions should be transferred by FTP following the instructions on the main high-throughput sequence data submission page.
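As an illustration of the bundling step, the following is a minimal Python sketch that writes a SOFT file and its data files into a single .tar archive using the standard tarfile module. The function name bundle_submission and the file names in the usage note are hypothetical and are not part of the GEO instructions.

```python
import tarfile
from pathlib import Path


def bundle_submission(archive_path, soft_file, data_files):
    """Write the SOFT file and its data files into one .tar archive."""
    with tarfile.open(archive_path, "w") as archive:
        for path in [soft_file, *data_files]:
            # Store each file under its bare name so that the names match
            # the file names referenced inside the SOFT file (an assumption
            # about how the references are written).
            archive.add(path, arcname=Path(path).name)
    return archive_path
```

For example, bundle_submission("submission.tar", "my_series.soft", ["run1.fastq", "run1_peaks.bed"]) would produce one archive containing all three files.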

There is a validate-only section on the Direct Deposit page which allows you to validate and debug your SOFT files. For validation tests, you should upload the SOFT file only (omit the processed and raw data files). You can ignore the resulting validation errors about missing files.

SOFT submission templates are available as guidelines for SOFT file structure and preparation. The templates are a good starting point for understanding SOFT.


SOFT format structure


The following section explains the components and structure of a SOFT submission.



Line-type characters: SOFT recognizes four different types of line. The presence of any one of three characters in the first character position of a line indicates three of the line types, and the absence of all of these indicates the fourth line type. The line-type characters and descriptions of what they indicate are:

Symbol | Description | Line type
^ | caret lines | entity indicator line
! | bang lines | entity attribute line
# | hash lines | data table header description line
n/a | data lines | data table row

For simplicity, these lines are referred to as caret lines, bang lines, hash lines, and data lines, respectively.
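The rule above can be sketched in a few lines of Python. This is only an illustration of the first-character test; the function name line_type is hypothetical.

```python
def line_type(line):
    """Return the SOFT line type based on the first character of the line."""
    if line.startswith("^"):
        return "caret"  # entity indicator line
    if line.startswith("!"):
        return "bang"   # entity attribute line
    if line.startswith("#"):
        return "hash"   # data table header description line
    return "data"       # none of the three characters: data table row
```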


Label-value pairs: Lines are generically organized as label-value pairs. Data lines are the only line type not organized in label-value pairs. Label-value pairs have the form:

[line-type character][label]=[value]
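A minimal Python sketch of splitting a line of this form into its parts is shown below. The function name parse_label_value is hypothetical, and trimming the whitespace around the label and the value is an assumption based on the spacing used in the example lines of this document.

```python
def parse_label_value(line):
    """Split a caret, bang, or hash line into (line-type character, label, value)."""
    if not line or line[0] not in "^!#":
        raise ValueError("data lines are not organized as label-value pairs")
    # Split on the first '=' only, so values may themselves contain '='.
    label, _, value = line[1:].partition("=")
    # Assumption: surrounding whitespace is not significant.
    return line[0], label.strip(), value.strip()
```

A line without a value, such as !Sample_table_begin, yields an empty string as its value.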




Entity types (caret lines): Entity type and its unique identifier are indicated as a label-value pair on the caret lines. The entity's unique ID is any string of characters different from any other entity ID within the document (i.e., locally unique). High-throughput sequencing submitters should supply entity types SAMPLE and SERIES.

Entity type | Example entity indicator line
Sample | ^SAMPLE = my_sample_name
Series | ^SERIES = my_series_name
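Because entity IDs must be locally unique, it can be useful to check a SOFT file for repeated IDs before uploading. The following Python sketch does this under the assumption that the ID is everything after the first '=' on a caret line; the function name duplicate_entity_ids is hypothetical.

```python
from collections import Counter


def duplicate_entity_ids(soft_text):
    """Return the entity IDs that appear on more than one caret line."""
    ids = []
    for line in soft_text.splitlines():
        if line.startswith("^"):
            _, _, value = line.partition("=")
            ids.append(value.strip())
    # IDs are compared across all entity types, since an ID must differ
    # from any other entity ID within the document.
    return sorted(id_ for id_, n in Counter(ids).items() if n > 1)
```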



Attributes (bang lines):

Entity attributes are contained in bang lines and immediately follow caret lines or other bang lines.

See list of attributes.
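Since bang lines belong to the caret line they follow, a SOFT file can be read into a list of entities with their attributes. The sketch below is an illustration only, not GEO's parser; the function name group_attributes and the returned structure are assumptions.

```python
def group_attributes(soft_text):
    """Group bang-line attributes under the caret line that precedes them."""
    entities = []
    for line in soft_text.splitlines():
        if line.startswith("^"):
            kind, _, entity_id = line[1:].partition("=")
            entities.append((kind.strip(), entity_id.strip(), {}))
        elif line.startswith("!") and entities:
            label, _, value = line[1:].partition("=")
            # A label may occur more than once, so values are kept in a list.
            entities[-1][2].setdefault(label.strip(), []).append(value.strip())
    return entities
```

Each entry of the result is a tuple of entity type, entity ID, and a dictionary mapping attribute labels to their values in file order.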


Sample tables (hash lines and data lines): Processed data tables can be included in the SOFT files.

Please see processed data guidelines.


Attribute content guidelines


Important:
For all studies involving human subjects, it is the submitter's responsibility to ensure that the data and files supplied to GEO protect participant privacy in accordance with all applicable laws, regulations and institutional policies. Make sure to remove any direct personal identifiers from your submission. These identifiers are listed in http://privacyruleandresearch.nih.gov/research_repositories.asp, footnote 1.


The second column in the table indicates the 'number of allowed values' per attribute:
    '1' indicates required, only one value allowed
    '1 or more' indicates required, one or more values allowed
    '0 or more' indicates not required, zero or more values allowed



Label | Number of allowed labels | Allowed values and constraints | Content guidelines
^SAMPLE | 1 | any, must be unique within local file | Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records.
!Sample_type | 1 | SRA | !Sample_type = SRA
!Sample_title | 1 | string of length less than 120 characters, must be unique within local file and over all previously submitted Samples for that submitter | Provide a unique title that describes this Sample. We suggest that you use the system [biomaterial]-[condition(s)]-[replicate number], e.g., Muscle_exercised_60min_rep2.
!Sample_supplementary_file_n | 1 or more | name of processed data file, or 'none' | See processed data guidelines for additional instructions.
!Sample_supplementary_file_checksum_n | 0 or more | any | MD5 checksum of the processed file or name of the MD5 file. This helps us verify that the file transfer was complete and didn't corrupt your file.
!Sample_supplementary_file_build_n | 0 or more | any | UCSC or NCBI genome build number (e.g., hg18, mm9, human build 36, etc.). Required when submitting data files which include chromosome position information (e.g., BED, WIG, GFF, etc.).
!Sample_raw_file_n | 1 or more | name of raw data file | See raw data guidelines for additional instructions.
!Sample_raw_file_type_n | 1 or more | Illumina options: srf, fastq, or Illumina_native; 454 options: sff; AB SOLiD options: srf | The type of raw data files. See raw data guidelines for additional instructions.
!Sample_raw_file_checksum_n | 0 or more | | MD5 checksum of the raw file, or name of the MD5 file. This helps us verify file integrity and that file transfer is complete.
!Sample_source_name | 1 | any | Briefly identify the biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
!Sample_organism | 1 or more | use standard NCBI Taxonomy nomenclature | Identify the organism(s) from which the biological material was derived.
!Sample_characteristics | 1 or more | 'Tag: Value' format | Describe all available characteristics of the biological source, including factors not necessarily under investigation. Provide in 'Tag: Value' format, where 'Tag' is a type of characteristic (e.g., "gender", "strain", "tissue", "developmental stage", "tumor stage", etc.), and 'Value' is the value for each tag (e.g., "female", "129SV", "brain", "embryo", etc.). Include as many characteristics fields as necessary to thoroughly describe your Samples.
!Sample_biomaterial_provider | 0 or more | any | Specify the name of the company, laboratory or person that provided the biological material.
!Sample_treatment_protocol | 0 or more | any | Describe any treatments applied to the biological material prior to extract preparation. You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
!Sample_growth_protocol | 0 or more | any | Describe the conditions that were used to grow or maintain organisms or cells prior to extract preparation. You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
!Sample_molecule | 1 | total RNA, polyA RNA, cytoplasmic RNA, nuclear RNA, genomic DNA, protein, or other | Specify the type of molecule that was extracted from the biological material.
!Sample_extract_protocol | 1 or more | any | Describe the protocol used to isolate the extract material. Describe the library construction protocol, i.e., the protocols used to extract and prepare the material to be sequenced. You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
!Sample_library_strategy | 1 or more | See list of library strategy values below | Sequencing technique for this library.
!Sample_library_source | 1 or more | genomic, non-genomic, synthetic, viral RNA or other | Type of source material that is being sequenced.
!Sample_library_selection | 1 or more | See list of library selection values below | Describes whether any method was used to select and/or enrich the material being sequenced.
!Sample_instrument_model | 1 or more | Illumina Genome Analyzer; Illumina Genome Analyzer II; 454 GS; 454 GS 20; 454 GS FLX; AB SOLiD System; AB SOLiD System 2.0; Other: specify | Select an instrument model from the list.
!Sample_data_processing | 1 | any | Provide details of how data were generated and calculated. For example, what software was used, how and to what were the reads aligned, what filtering parameters were applied, how were peaks calculated, etc. Include a separate 'data processing' attribute for each file type described.
!Sample_barcode | 0 or 1 | any | For multiplexed/barcode experiments, provide the barcode and/or adapter sequences necessary to interpret the raw data files.
!Sample_description | 0 or more | any | Include any additional information not provided in the other fields, or paste in broad descriptions that cannot be easily dissected into the other fields.
!Sample_geo_accession | 0 or 1 | a valid Sample accession number (GSMxxx) | Only use for performing updates to existing GEO records.
!Sample_table_begin | 0 or 1 | no content required | Indicates the start of the data table.
!Sample_table | 0 or 1 | name of tab-delimited file to be used as data table | Tab-delimited table file name: if it is convenient for you to generate, you can reference the name of an external tab-delimited table file (see format) in this field, rather than include the table in the !Sample_table_begin section. All external files should be zipped or tarred together with the SOFT file at time of submission.
!Sample_table_end | 0 or 1 | no content required | Indicates the end of the data table.
^SERIES | 1 | any, must be unique within local file | Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records.
!Series_title | 1 | string of length 1-120 characters, must be unique within local file and over all previously submitted Series for that submitter | Provide a unique title that describes the overall study.
!Series_summary | 1 or more | any | Summarize the goals and objectives of this study. The abstract from the associated publication may be suitable. You can include as much text as you need to thoroughly describe the study.
!Series_overall_design | 1 | any | Provide a description of the experimental design. Indicate how many Samples are analyzed, if replicates are included, are there control and/or reference Samples, dye-swaps, etc.
!Series_pubmed_id | 0 or more | an integer | Specify a valid PubMed identifier (PMID) that references a published article describing this study. Most commonly, this information is not available at the time of submission - it can be added later once the data are published.
!Series_web_link | 0 or more | valid URL | Specify a Web link that directs users to supplementary information about the study. Please restrict to Web sites that you know are stable.
!Series_contributor | 0 or more | each value in the form 'firstname,middleinitial,lastname' or 'firstname,lastname': firstname must be at least one character and cannot contain spaces; middleinitial, if present, is one character; lastname is at least two characters and can contain spaces. | List all people associated with this study.
!Series_variable_[n] | 0 or more | dose, time, tissue, strain, gender, cell line, development stage, age, agent, cell type, infection, isolate, metabolism, shock, stress, temperature, specimen, disease state, protocol, growth protocol, genotype/variation, species, individual, or other | Indicate the variable type(s) investigated in this study, e.g., '!Series_variable_1 = age' and '!Series_variable_2 = age'. NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.
!Series_variable_description_[n] | 0 or more | any | Describe each variable, e.g., '!Series_variable_description_1 = 2 months' and '!Series_variable_description_2 = 12 months'. NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.
!Series_variable_sample_list_[n] | 0 or more | each value a valid reference to a ^SAMPLE identifier, or all | List which Samples belong to each group, e.g., '!Series_variable_sample_list_1 = samA, samB' and '!Series_variable_sample_list_2 = samC, samD'. NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.
!Series_repeats_[n] | 0 or more | biological replicate, technical replicate - extract, or technical replicate - labeled-extract | Indicate the repeat type(s), e.g., '!Series_repeats_1 = biological replicate' and '!Series_repeats_2 = biological replicate'. NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.
!Series_repeats_sample_list_[n] | 0 or more | each value a valid reference to a ^SAMPLE identifier, or all | List which Samples belong to each group, e.g., '!Series_repeats_sample_list_1 = samA, samB' and '!Series_repeats_sample_list_2 = samC, samD'. NOTE - this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records.
!Series_sample_id | 1 or more | valid Sample identifiers | Reference the Samples that make up this experiment. Reference the Sample accession numbers (GSMxxx) if the Samples already exist in GEO, or reference the ^SAMPLE identifiers if they are being submitted in the same file.
!Series_geo_accession | 0 or 1 | a valid Series accession number (GSExxx) | Only use for performing updates to existing GEO records.
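Some of the constraints in the table above are mechanical enough to check before submission. The following Python sketch covers four of them (Sample title length, Series title length, the 'Tag: Value' format, and the contributor format). It is an illustrative pre-check under our own reading of the constraints, not the validation that GEO performs; the function name check_attribute and the regular expression are assumptions.

```python
import re

# firstname: at least one character, no spaces; optional one-character
# middle initial; lastname: at least two characters, spaces allowed.
CONTRIBUTOR = re.compile(r"^[^\s,]+,(?:[^\s,],)?[^,]{2,}$")


def check_attribute(label, value):
    """Return a list of problems found for one bang-line label and value."""
    problems = []
    if label == "Sample_title" and not len(value) < 120:
        problems.append("Sample title must be shorter than 120 characters")
    if label == "Series_title" and not 1 <= len(value) <= 120:
        problems.append("Series title must be 1-120 characters long")
    if label == "Sample_characteristics":
        tag, sep, val = value.partition(":")
        if not (sep and tag.strip() and val.strip()):
            problems.append("characteristics must be in 'Tag: Value' format")
    if label == "Series_contributor" and not CONTRIBUTOR.match(value):
        problems.append("contributor must be 'firstname,middleinitial,lastname' "
                        "or 'firstname,lastname'")
    return problems
```

An empty list means that none of the checked constraints was violated; uniqueness of titles across previous submissions cannot be checked locally.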


Allowed values for !Sample_library_strategy attribute

VALUE | DEFINITION
ChIP-Seq | Direct sequencing of chromatin immunoprecipitates
MNase-Seq | Direct sequencing following MNase digestion
EST | Single pass sequencing of cDNA templates
FL-cDNA | Full-length sequencing of cDNA templates
CTS | Concatenated Tag Sequencing
BARCODE | Sequencing of products that have been tagged with a short identifying sequence (barcode)
CLONEEND | Clone end (5', 3', or both) sequencing
WGS | Whole genome shotgun
WCS | Whole chromosome (or other replicon) shotgun
CLONE | Genomic clone based (hierarchical) sequencing
POOLCLONE | Shotgun of pooled clones (usually BACs and Fosmids)
AMPLICON | Sequencing of overlapping or distinct PCR or RT-PCR products
FINISHING | Sequencing intended to finish (close) gaps in existing coverage
OTHER | Library strategy not listed



Allowed values for !Sample_library_selection attribute

VALUE | DEFINITION
ChIP | Chromatin immunoprecipitation
MNase | Micrococcal Nuclease (MNase) digestion
RANDOM | Random shearing only
MF | Methyl Filtrated
MSLL | Methylation Spanning Linking Library
HMPR | Hypo-methylated partial restriction digest
cDNA | complementary DNA
PCR | Source material was selected by designed primers
RANDOM PCR | Source material was selected by randomly generated primers
RT-PCR | Source material was selected by reverse transcription PCR
CF-S | Cot-filtered single/low-copy genomic DNA
CF-M | Cot-filtered moderately repetitive genomic DNA
CF-H | Cot-filtered highly repetitive genomic DNA
CF-T | Cot-filtered theoretical single-copy genomic DNA
other |
unspecified |
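The two lists of allowed values lend themselves to a simple membership check before submission. The Python sketch below copies the values from the lists above; the function name is_allowed is hypothetical, and treating the comparison as case-sensitive is an assumption.

```python
LIBRARY_STRATEGY = {
    "ChIP-Seq", "MNase-Seq", "EST", "FL-cDNA", "CTS", "BARCODE", "CLONEEND",
    "WGS", "WCS", "CLONE", "POOLCLONE", "AMPLICON", "FINISHING", "OTHER",
}
LIBRARY_SELECTION = {
    "ChIP", "MNase", "RANDOM", "MF", "MSLL", "HMPR", "cDNA", "PCR",
    "RANDOM PCR", "RT-PCR", "CF-S", "CF-M", "CF-H", "CF-T",
    "other", "unspecified",
}
ALLOWED = {
    "Sample_library_strategy": LIBRARY_STRATEGY,
    "Sample_library_selection": LIBRARY_SELECTION,
}


def is_allowed(label, value):
    """Check a value against the allowed list for the given attribute label."""
    # Assumption: values must match the listed spelling and case exactly.
    return value in ALLOWED[label]
```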



Raw data guidelines


  • The raw data files should be the original short read format sequence and quality files. These files are required.
  • Accepted file types are listed in the main high-throughput sequence data submission page.
  • It is very important to provide raw data files with your submission. These files will be uploaded to NCBI's Short Read Archive sequence database which has tools to help users view, query, browse and download sequence data. Also, without raw data your submission may not meet the requirements of the journal you are publishing with. We understand that the volumes of raw data can be very large and difficult to transfer - please contact us if you need advice with this matter.
  • Barcode data: At this time, we prefer that submitters split run files so that each barcoded sample ends up with a dedicated run file based on the barcode sequences.
  • !Sample_raw_file_* attributes have _n suffixes where n indicates the file number. For example:

    !Sample_raw_file_1 =
    !Sample_raw_file_type_1 =
    !Sample_raw_file_checksum_1 =
    !Sample_raw_file_2 =
    !Sample_raw_file_type_2 =
    !Sample_raw_file_checksum_2 =
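The numbered attribute lines, including the optional MD5 checksums, can be generated rather than typed by hand. The following Python sketch uses the standard hashlib module; the function names md5_of and raw_file_attributes are hypothetical, and writing the bare file name (without its directory) is an assumption.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def raw_file_attributes(files):
    """Build numbered !Sample_raw_file_* lines for (path, file_type) pairs."""
    lines = []
    for n, (path, file_type) in enumerate(files, start=1):
        lines.append(f"!Sample_raw_file_{n} = {Path(path).name}")
        lines.append(f"!Sample_raw_file_type_{n} = {file_type}")
        lines.append(f"!Sample_raw_file_checksum_{n} = {md5_of(path)}")
    return lines
```

Reading the file in chunks keeps memory use constant, which matters for raw sequencing files that can be very large.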



Processed data guidelines


  • Requirements for processed data files are not yet fully standardized and will depend on the nature of the experiment. Multiple types and levels of processed data files per Sample can be accepted. For example, a ChIP-seq Sample would typically have alignment files and peak files, and a miRNA profiling experiment would typically have filtered, unique sequence reads with counts and mappings. The file names should be referenced as appropriate in the Metadata spreadsheet. Please consider including a 'readme' file with your submission detailing the content of each of the columns in the processed data table files.
  • !Sample_supplementary_file_* attributes have _n suffixes where n indicates the file number. For example:

    !Sample_supplementary_file_1 =
    !Sample_supplementary_file_checksum_1 =
    !Sample_supplementary_file_build_1 =
    !Sample_supplementary_file_2 =
    !Sample_supplementary_file_checksum_2 =
    !Sample_supplementary_file_build_2 =

  • Processed data consisting of sequences and counts may also be supplied as data tables included within the SOFT file. See the template for a Sample submission with internal data table.
    • Data table header descriptions are contained in hash lines and immediately follow caret lines, bang lines, or other hash lines. Hash lines take the label-value pair form. Hash lines are used to provide a description of the headers named in the header line of the data table.
    • A Sample data table should lie between the !Sample_table_begin and !Sample_table_end attributes (unless supplying external text files, see !Sample_table attribute description).
    • Data tables must be in plain text (ASCII) tab-delimited format.
    • The first row in the table must be a header line that identifies the content of each column.
    • The two required columns are SEQUENCE and COUNT.
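The data table rules listed above can be sketched as a short Python function that builds an inline Sample data table from sequence counts. The function name sample_table_lines and the wording of the two hash-line descriptions are assumptions made for illustration.

```python
def sample_table_lines(counts):
    """Build an inline Sample data table from a {sequence: count} mapping."""
    lines = [
        # Hash lines describe the headers named in the table's header line.
        "#SEQUENCE = Sequence of the read",
        "#COUNT = Number of times the sequence was observed",
        "!Sample_table_begin",
        # The first row of the table is the tab-delimited header line.
        "SEQUENCE\tCOUNT",
    ]
    for sequence, count in counts.items():
        lines.append(f"{sequence}\t{count}")
    lines.append("!Sample_table_end")
    return lines
```

Joining the returned lines with newline characters gives a block that can be placed after the bang lines of the corresponding Sample.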



SOFT submission templates



The following templates can be used to help prepare SOFT submissions:



Batch updates in SOFT


Batch updates are easy to perform in SOFT format - just include the attribute "!Sample_GEO_accession = GSMxxx" where GSMxxx indicates the accession number of the record to be updated (similarly, use !Series_GEO_accession = GSExxx). You can provide the entire SOFT record with the necessary revisions. Alternatively, you can simply provide the revised attributes and/or data table. Data table updates are 'all-or-nothing', e.g., to revise data in one column of a table, it is necessary to update the whole table.

Note that it is possible to perform SOFT updates on data that were submitted via any other submission route, such as the Web or GEOarchive. Likewise, it is possible to perform Web updates on individual records that were originally uploaded in SOFT format.

Submit your SOFT update file by selecting the 'SOFT' option on the Direct Deposit page. Make sure to check the 'Update' box. Successful updates will be reflected immediately on your GEO records.



SOFT download


SOFT format is used not only for batch uploads and updates of data, but also for batch download. The only difference between SOFT input and output is a few additional attributes in the output, including:

_geo_accession
_status
_submission_date
_last_update_date
_row_count
_contact_name
_contact_email
_contact_institute
_contact_department
_contact_city
_contact_phone
_contact_fax
_contact_web_link
Series_type

All GEO data are available for download in SOFT format from our anonymous FTP site.




