SOFT submission instructions
Overview
Data prepared in SOFT format can be uploaded directly to GEO by selecting the 'SOFT' option on the
Direct Deposit page.
Any supplementary or external files should be bundled together with the SOFT file into a .zip, .rar, or .tar archive using a program like WinZip;
do not include any sub-directories or sub-folders in the archive. Incomplete submissions will result in processing delays.
There is a 'validate only' section on the Direct Deposit page that allows you to validate and debug your SOFT files.
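As a rough illustration of the bundling rule above, the following Python sketch writes a SOFT file and its supplementary files into a flat .zip archive with no sub-folders. The function name `bundle_flat` is hypothetical and not part of any GEO tool; only the standard-library `zipfile` module is used.

```python
import os
import zipfile


def bundle_flat(archive_path, file_paths):
    """Write the given files into a .zip archive with no sub-folders."""
    seen = set()
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in file_paths:
            # Store each file under its base name only, so the archive stays flat.
            name = os.path.basename(path)
            if name in seen:
                raise ValueError(f"duplicate file name in archive: {name}")
            seen.add(name)
            archive.write(path, arcname=name)
    return sorted(seen)
```

Rejecting duplicate base names is a design choice of this sketch: two files with the same name in different folders would otherwise collide once the folder structure is dropped.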
Simple Omnibus Format in Text (SOFT) is
designed for rapid batch submission (and download) of data.
SOFT is a simple line-based, plain text format, meaning that SOFT files may be readily generated from common
spreadsheet and database applications.
A single SOFT file can hold both data tables and accompanying descriptive
information for multiple, concatenated Platforms, Samples, and/or Series records.
Affymetrix CHP files and supplementary data files like
Affymetrix CEL or GenePix GPR scan files can also be provided - just bundle them together with the SOFT file
into a .zip, .rar, or .tar archive using a program like WinZip at the time of submission.
SOFT format supports MIAME-compliant data submissions.
TIP: Use the 'command window' on your computer to concatenate multiple individual SOFT files into one large file for easy batch submission.
PC instructions:
To open 'Command Prompt', click 'Start', click 'Run', type 'cmd', and then click 'OK'.
To concatenate individual files (fileA.txt, fileB.txt and fileC.txt) into one large file (largefile.txt), type 'copy fileA.txt + fileB.txt + fileC.txt largefile.txt'
Mac instructions:
Open a 'Terminal' window.
To concatenate individual files (fileA.txt, fileB.txt and fileC.txt) into one large file (largefile.txt), type 'cat fileA.txt fileB.txt fileC.txt > largefile.txt'.
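If a scripted, platform-independent alternative to the commands above is preferred, the same concatenation can be sketched in Python. The function name `concatenate_soft` is hypothetical; the sketch assumes the input files are plain text and adds a line break between files when one is missing, so that the first line of the next file is not glued to the last line of the previous one.

```python
def concatenate_soft(input_paths, output_path):
    """Concatenate several SOFT text files into one file, in the order given."""
    with open(output_path, "w", encoding="utf-8", newline="\n") as out:
        for path in input_paths:
            with open(path, "r", encoding="utf-8") as src:
                text = src.read()
            out.write(text)
            # Make sure the next file starts on a fresh line.
            if text and not text.endswith("\n"):
                out.write("\n")
```

For example, `concatenate_soft(["fileA.txt", "fileB.txt", "fileC.txt"], "largefile.txt")` mirrors the command-line examples above.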
Examples of SOFT submission files and templates are available as guidelines for SOFT file structure and preparation.
Your final GEO records will look something like the following:
| |
|
Summary of required information:
 |
Text description of your array
Provide complete descriptions of the array type and manufacturing protocols using the Platform attribute fields. |
Platform submission is not necessary if your array is already in GEO (e.g., Affymetrix chips).
In this case, all you need is a reference to the Platform accession number (GPLxxx).
Use the FIND PLATFORM tool to locate arrays.
|
 |
Text tab-delimited table of the array template
Guidelines for Platform data tables. |
 |
Text description of a biological sample
Provide complete descriptions of the biological source, treatment protocols, and technical protocols using the Sample attribute fields. |
|
 |
Text tab-delimited table of processed hybridization result
(may optionally include raw data columns)
Guidelines for Sample data tables.
|
For Affymetrix data, a .CHP file may be supplied instead of a tab-delimited table using the
!Sample_table attribute. |
 |
Original raw data file
External raw data files, e.g., Affymetrix .CEL files or GenePix GPR files, should be referenced within the SOFT file using the !Sample_supplementary_file attribute.
Guidelines for Raw data.
|
|
 |
Text description of the overall experiment
Provide complete descriptions of the overall experiment aim, design, and conclusions using the Series attribute fields. |
|
|
SOFT format structure and content
|
|
The following section explains the components and structure of a SOFT submission, as well as guidelines for content.
Line-type characters:
There are four different types of line that are recognized in SOFT.
The presence of any one of three characters in the first character
position in the line indicates three of the line types, and the
absence of any of these indicates the fourth line type. The four
line types and the characters that indicate them are:
| Symbol | Description | Line type |
| ^ | caret lines | entity indicator line |
| ! | bang lines | entity attribute line |
| # | hash lines | data table header description line |
| n/a | data lines | data table row |
For simplicity, these lines are referred to as caret lines, bang lines,
hash lines, and data lines, respectively.
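The first-character rule described above can be sketched in a few lines of Python. The function name `classify_line` is hypothetical; the returned strings are the line-type names from the table above.

```python
def classify_line(line):
    """Return the SOFT line type based on the first character of the line."""
    if line.startswith("^"):
        return "entity indicator line"
    if line.startswith("!"):
        return "entity attribute line"
    if line.startswith("#"):
        return "data table header description line"
    # No line-type character in the first position: a data line.
    return "data table row"
```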
Label-value pairs:
Label-value pairs are the generic way that lines are organized. Data lines
are the only line types that are not organized in label-value pairs. Label-value pairs
have the form:
| [line-type character] | [label] | = | [value] |
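A minimal Python sketch of reading a label-value pair follows. The function name `parse_label_value` is hypothetical, and two details are assumptions of this sketch rather than documented rules: the line is split at the first '=' only (so that values containing '=', such as URLs, stay intact), and surrounding whitespace is trimmed.

```python
def parse_label_value(line):
    """Split a caret, bang, or hash line into (line-type character, label, value)."""
    marker = line[:1]
    if marker not in ("^", "!", "#"):
        raise ValueError("data lines are not label-value pairs")
    # Split at the first '=' only; the value may itself contain '='.
    label, _, value = line[1:].partition("=")
    # Lines such as !Sample_table_begin carry no value; value is then empty.
    return marker, label.strip(), value.strip()
```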
Entity types (caret lines):
Entity type and its unique identifier are indicated as a label-value pair on
the caret lines. The entity's unique ID is any string of
characters different from any other entity ID within the document (i.e., locally
unique). As described in the Overview, submitters supply three
entity types: PLATFORM, SAMPLE and SERIES.
| Entity type | Example entity indicator line |
| Platform | ^PLATFORM = my_array_name |
| Sample | ^SAMPLE = my_sample_name |
| Series | ^SERIES = my_series_name |
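Because every entity ID must differ from every other entity ID in the same document, a quick pre-submission check can be sketched as follows. The function name `find_duplicate_entity_ids` is hypothetical; the sketch assumes IDs are compared exactly as written after trimming whitespace.

```python
def find_duplicate_entity_ids(lines):
    """Return entity IDs that appear on more than one caret line."""
    seen = set()
    duplicates = []
    for line in lines:
        if not line.startswith("^"):
            continue
        # The entity ID is the value part of the caret line.
        _, _, value = line[1:].partition("=")
        entity_id = value.strip()
        if entity_id in seen and entity_id not in duplicates:
            duplicates.append(entity_id)
        seen.add(entity_id)
    return duplicates
```

Note that the check spans all entity types, so a Platform and a Sample sharing one ID are also reported.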
Attributes (bang lines):
Entity attributes are contained in bang lines and immediately follow caret lines or other bang lines.
The second column in the table indicates the 'number of allowed values' per attribute:
'1' indicates required, only one value allowed
'1 or more' indicates required, one or more values allowed
'0 or more' indicates not required, zero or more values allowed
Several Sample attributes have _[n] where [n] indicates the channel number.
For example, !Sample_label_ch[2]=Cy3 indicates that Cy3 was the label in one of the channels of a two-color experiment.
If your experiment is single channel, _[n] may be omitted from the attribute.
Important: For all studies involving human subjects, it is the submitter's responsibility
to ensure that the data and files supplied to GEO protect participant privacy in accordance with
all applicable laws, regulations and institutional policies. Make sure to remove any direct personal
identifiers from your submission. These identifiers are listed in
http://privacyruleandresearch.nih.gov/research_repositories.asp, footnote 1.
|
| Label |
Number of allowed labels |
Allowed values and constraints |
Content guidelines |
| ^PLATFORM |
1 |
any, must be unique within local file |
Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records. |
| !Platform_title |
1 |
string of length 1-120 characters, must be unique within local file and over all previously submitted Platforms for that submitter |
Provide a unique title that describes your Platform. We suggest that you use the system [institution/lab]-[species]-[number of features]-[version], e.g. "FHCRC Mouse 15K v1.0". |
| !Platform_distribution |
1 |
commercial, non-commercial, custom-commercial, or virtual |
Microarrays are 'commercial', 'non-commercial', or 'custom-commercial' in accordance with how the array was manufactured. Use 'virtual' only if creating a virtual definition for MPSS, SARST, or RT-PCR data. |
| !Platform_technology |
1 |
spotted DNA/cDNA, spotted oligonucleotide, in situ oligonucleotide, antibody, tissue, SARST, RT-PCR, or MPSS |
Select the category that best describes the Platform technology. |
| !Platform_organism |
1 or more |
use standard NCBI Taxonomy nomenclature |
Identify the organism(s) from which the features on the Platform were designed or derived. |
| !Platform_manufacturer |
1 |
any |
Provide the name of the company, facility or laboratory where the array was manufactured or produced. |
| !Platform_manufacture_protocol |
1 |
any |
Describe the array manufacture protocol. Include as much detail as possible, e.g., clone/primer set
identification and preparation, strandedness/length, arrayer hardware/software, spotting protocols.
You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
|
| !Platform_catalog_number |
0 or more |
any |
Provide the manufacturer catalog number for commercially-available arrays. |
| !Platform_web_link |
0 or more |
valid URL |
Specify a Web link that directs users to supplementary information about the array. Please
restrict to Web sites that you know are stable. |
| !Platform_support |
0 or 1 |
any |
Provide the surface type of the array, e.g., glass, nitrocellulose, nylon, silicon, unknown. |
| !Platform_coating |
0 or 1 |
any |
Provide the coating of the array, e.g., aminosilane, quartz, polysine, unknown. |
| !Platform_description |
0 or more |
any |
Provide any additional descriptive information not captured in another field, e.g.,
array and/or feature physical dimensions, element grid system. |
| !Platform_contributor |
0 or more |
each value in the form 'firstname,middleinitial,lastname' or 'firstname,lastname':
firstname must be at least one character and cannot contain spaces; middleinitial,
if present, is one character; lastname is at least two characters and can contain spaces. |
List all people associated with this array design. |
| !Platform_pubmed_id |
0 or more |
an integer |
Specify a valid PubMed identifier (PMID) that references a published article that describes the array. |
| !Platform_geo_accession |
0 or 1 |
a valid Platform accession number (GPLxxx) |
Only use for performing updates to existing GEO records. |
| !Platform_table_begin |
1 |
no content required |
Indicates the start of the data table. |
| !Platform_table_end |
1 |
no content required |
Indicates the end of the data table. |
| ^SAMPLE |
1 |
any, must be unique within local file |
Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records. |
| !Sample_title |
1 |
string of length 1-120 characters, must be unique within local file and over all previously submitted Samples for that submitter |
Provide a unique title that describes this Sample. We suggest that you use the system [biomaterial]-[condition(s)]-[replicate number], e.g., Muscle_exercised_60min_rep2. |
| !Sample_supplementary_file |
1 or more |
name of supplementary file, or 'none' |
Examples of supplementary file types include original Affymetrix CEL and EXP files, GenePix GPR files, and TIFF image files.
Supplementary files should be zipped or tarred together with the SOFT file at time of submission (do not include any sub-directories or sub-folders in your zip/tar archive).
Provision of supplementary raw data files facilitates the unambiguous interpretation of data and potential verification of
conclusions as set forth in the MIAME guidelines.
|
| !Sample_table |
0 or 1 |
name of external CHP or tab-delimited file to be used as data table |
- Affymetrix CHP file name:
If your processed data are CHP files, you can reference the CHP file name in this field.
If your manuscript discusses data processed by
RMA or another algorithm, we recommend providing those values in the table section.
There is no need to specify the !Sample_platform_id when CHP files are supplied.
All external files should be zipped or tarred together with the SOFT file at time of submission.
- Tab-delimited table file name:
If it is convenient for you to generate, you can reference the name of an external tab-delimited table file (see format) in this field, rather than include the table in the !Sample_table_begin section.
All external files should be zipped or tarred together with the SOFT file at time of submission.
|
| !Sample_source_name_ch[n] |
1 per channel |
any |
Briefly identify the biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min. |
| !Sample_organism_ch[n] |
1 or more |
use standard NCBI Taxonomy nomenclature |
Identify the organism(s) from which the biological material was derived. |
| !Sample_characteristics_ch[n] |
1 or more |
'Tag: Value' format |
Describe all available characteristics of the biological source, including factors not necessarily under investigation.
Provide in 'Tag: Value' format, where 'Tag' is a type of characteristic (e.g. "gender", "strain", "tissue", "developmental stage", "tumor stage", etc), and 'Value' is the value for each tag (e.g. "female", "129SV", "brain", "embryo", etc). Include as many characteristics fields as necessary to thoroughly describe your Samples.
|
| !Sample_biomaterial_provider_ch[n] |
0 or more |
any |
Specify the name of the company, laboratory or person that provided the biological material. |
| !Sample_treatment_protocol_ch[n] |
0 or more |
any |
Describe any treatments applied to the biological material prior to extract preparation.
You can include as much text as you need to thoroughly describe the protocol;
it is strongly recommended that complete protocol descriptions are provided within your submission.
|
| !Sample_growth_protocol_ch[n] |
0 or more |
any |
Describe the conditions that were used to grow or maintain organisms or cells prior to extract preparation.
You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
|
| !Sample_molecule_ch[n] |
1 per channel |
total RNA, polyA RNA, cytoplasmic RNA, nuclear RNA, genomic DNA, protein, or other |
Specify the type of molecule that was extracted from the biological material. |
| !Sample_extract_protocol_ch[n] |
1 or more |
any |
Describe the protocol used to isolate the extract material.
You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
|
| !Sample_label_ch[n] |
1 per channel |
any |
Specify the compound used to label the extract e.g., biotin, Cy3, Cy5, 33P. |
| !Sample_label_protocol_ch[n] |
1 or more |
any |
Describe the protocol used to label the extract.
You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
|
| !Sample_hyb_protocol |
1 or more |
any |
Describe the protocols used for hybridization, blocking and washing, and any post-processing steps such as staining.
You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
|
| !Sample_scan_protocol |
1 or more |
any |
Describe the scanning and image acquisition protocols, hardware, and software.
You can include as much text as you need to thoroughly describe the protocol; it is strongly recommended that complete protocol descriptions are provided within your submission.
|
| !Sample_data_processing |
1 |
any |
Provide details of how data in the VALUE column of your table were generated and calculated, i.e., normalization method, data selection procedures and parameters, transformation algorithm (e.g., MAS5.0), and scaling parameters.
You can include as much text as you need to thoroughly describe the processing procedures. |
| !Sample_description |
1 or more |
any |
Include any additional information not provided in the other fields, or paste in broad descriptions that cannot be easily dissected into the other fields. |
| !Sample_platform_id |
1 |
a valid Platform identifier |
Reference the Platform upon which this hybridization was performed. Reference the Platform accession number (GPLxxx)
if the Platform already exists in GEO, or reference the ^Platform identifier if the Platform record is being batch submitted
within the same SOFT file. To identify the accession number of an existing commercial Platform in GEO, use the FIND PLATFORM tool. |
| !Sample_geo_accession |
0 or 1 |
a valid Sample accession number (GSMxxx) |
Only use for performing updates to existing GEO records. |
| !Sample_anchor |
1 |
SAGE enzyme anchor, usually NlaIII or Sau3A |
Use for SAGE submissions only. |
| !Sample_type |
1 |
SAGE |
Use for SAGE submissions only. |
| !Sample_tag_count |
1 |
sum of tags quantified in SAGE library |
Use for SAGE submissions only. |
| !Sample_tag_length |
1 |
base pair length of the SAGE tags, excluding anchor sequence |
Use for SAGE submissions only. |
| !Sample_table_begin |
1 |
no content required |
Indicates the start of the data table. |
| !Sample_table_end |
1 |
no content required |
Indicates the end of the data table. |
| ^SERIES |
1 |
any, must be unique within local file |
Provide an identifier for this entity. This identifier is used only as an internal reference within a given file. The identifier will not appear on final GEO records. |
| !Series_title |
1 |
string of length 1-120 characters, must be unique within local file and over all previously submitted Series for that submitter |
Provide a unique title that describes the overall study. |
| !Series_summary |
1 or more |
any |
Summarize the goals and objectives of this study. The abstract from the associated publication may be suitable. You can include as much text as you need to thoroughly describe the study. |
| !Series_overall_design |
1 |
any |
Provide a description of the experimental design. Indicate how many Samples are analyzed, whether replicates are included, and whether there are control and/or reference Samples, dye-swaps, etc. |
| !Series_pubmed_id |
0 or more |
an integer |
Specify a valid PubMed identifier (PMID) that references a published article describing this study. Most
commonly, this information is not available at the time of
submission - it can be added later once the data are published. |
| !Series_web_link |
0 or more |
valid URL |
Specify a Web link that directs users to supplementary information about the study. Please restrict to
Web sites that you know are stable. |
| !Series_contributor |
0 or more |
each value in the form 'firstname,middleinitial,lastname' or 'firstname,lastname': firstname must be at least one character and cannot contain spaces; middleinitial, if present, is one character; lastname is at least two characters and can contain spaces. |
List all people associated with this study. |
| !Series_variable_[n] |
0 or more |
dose, time, tissue, strain, gender, cell line, development stage, age, agent, cell type, infection, isolate, metabolism, shock, stress, temperature, specimen, disease state, protocol, growth protocol, genotype/variation, species, individual, or other |
Indicate the variable type(s) investigated in this study, e.g., '!Series_variable_1 = age' and '!Series_variable_2 = age'. NOTE: this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records. |
| !Series_variable_description_[n] |
0 or more |
any |
Describe each variable, e.g., '!Series_variable_description_1 = 2 months' and '!Series_variable_description_2 = 12 months'. NOTE: this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records. |
| !Series_variable_sample_list_[n] |
0 or more |
each value a valid reference to a ^SAMPLE identifier, or all |
List which Samples belong to each group, e.g., '!Series_variable_sample_list_1 = samA, samB' and '!Series_variable_sample_list_2 = samC, samD'. NOTE: this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records. |
| !Series_repeats_[n] |
0 or more |
biological replicate, technical replicate - extract, or technical replicate - labeled-extract |
Indicate the repeat type(s), e.g., '!Series_repeats_1 = biological replicate' and '!Series_repeats_2 = biological replicate'. NOTE: this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records. |
| !Series_repeats_sample_list_[n] |
0 or more |
each value a valid reference to a ^SAMPLE identifier, or all |
List which Samples belong to each group, e.g., '!Series_repeats_sample_list_1 = samA, samB' and '!Series_repeats_sample_list_2 = samC, samD'. NOTE: this information is optional and does not appear in Series records or downloads, but will be used to assemble corresponding GEO DataSet records. |
| !Series_sample_id |
1 or more |
valid Sample identifiers |
Reference the Samples that make up this experiment. Reference the Sample accession numbers (GSMxxx) if the Samples already exist in GEO, or reference the ^Sample identifiers if they are being submitted in the same file. |
| !Series_geo_accession |
0 or 1 |
a valid Series accession number (GSExxx) |
Only use for performing updates to existing GEO records. |
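Two of the constraints in the table above lend themselves to a simple pre-submission check: the 1-120 character limit on titles and the 'firstname,middleinitial,lastname' contributor format. The Python sketch below is illustrative only. The function names are hypothetical, and the sketch assumes that values are compared exactly as written (no whitespace is trimmed around commas) and that last names do not themselves contain commas.

```python
def is_valid_title(title):
    """Check the 1-120 character limit that applies to title attributes."""
    return 1 <= len(title) <= 120


def is_valid_contributor(value):
    """Check 'firstname,middleinitial,lastname' or 'firstname,lastname'."""
    parts = value.split(",")
    if len(parts) == 2:
        first, last = parts
        middle = None
    elif len(parts) == 3:
        first, middle, last = parts
    else:
        return False
    # firstname: at least one character, no spaces.
    if len(first) < 1 or " " in first:
        return False
    # middleinitial, if present: exactly one character.
    if middle is not None and len(middle) != 1:
        return False
    # lastname: at least two characters, spaces allowed.
    return len(last) >= 2
```

The 'validate only' option on the Direct Deposit page remains the authoritative check; a local test like this only catches the most common formatting slips early.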
Data table header description lines (hash lines):
Data table header descriptions are contained in hash lines and immediately follow
caret lines, bang lines, or other hash lines. Hash lines take the label-value pair
form. Hash lines are used to provide a description of the headers
named in the header line of the data table.
A linking command may be given in the hash lines. Linking commands are used to
automatically link individual cell contents. Linking
commands may occur only in the value portion of the hash lines.
There are three linking commands currently defined, as follows:
| Command | Description | Example |
| LINK_PRE | prefix string | LINK_PRE:"http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Search&db=Nucleotide&term=" |
| LINK_SUF | suffix string | LINK_SUF:".html" |
| DELIMIT | delimiter string | DELIMIT:"," |
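One plausible reading of these commands, sketched in Python below, is that each cell's content is placed between the LINK_PRE prefix and the LINK_SUF suffix to form a link, and that DELIMIT splits a cell holding several items so that each item is linked separately. This interpretation, the function name `build_links`, and the example.org URL in the usage note are assumptions of the sketch, not a description of how GEO implements linking.

```python
def build_links(cell, link_pre="", link_suf="", delimit=None):
    """Build one link per item found in a data table cell."""
    # Without a delimiter the whole cell is treated as a single item.
    items = cell.split(delimit) if delimit else [cell]
    # Skip empty items so that blank cells produce no links.
    return [link_pre + item.strip() + link_suf for item in items if item.strip()]
```

For instance, `build_links("U12345,U67890", "http://example.org/", ".html", ",")` would yield one link for each of the two accessions.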
| Platform data table guidelines |
|
- A Platform data table should lie between the !Platform_table_begin and !Platform_table_end attributes.
- Data tables must be in plain text (ASCII) tab-delimited format.
- It may not be necessary to submit a Platform record if your experiments are performed using commercial arrays (e.g., Affymetrix GeneChips).
Official versions of many commercial array templates have already been deposited with GEO.
To locate a commercial array, use the FIND PLATFORM tool
and reference the appropriate Platform accession number (GPLxxx) in the !Sample_platform_id line.
If you use a commercial array, but cannot locate its template in GEO, please proceed with Platform
submission. If we can verify the content of the commercial Platform you submit, the
contact information presented on that record will be changed from yours to that of the vendor
so that other users may easily locate and submit Sample data corresponding to that Platform.
- The Platform data table should only contain information that pertains to the content
and design of the array. No expression measurement or hybridization intensity data should be included in the Platform table.
- Each row of the Platform table must be represented by its own unique identifier (ID). Keep in mind that the ID column you
provide in your Platform table corresponds to
the ID_REF column you provide in accompanying Sample data tables - there should be a 1:1 correspondence. Sample data tables
should contain normalized data. This means, for example, if your normalization strategy requires taking the
average of replicate array features, or removing control spots, your Platform table should reflect the condensed template.
In this case, please e-mail or FTP the full array design file to us and we will attach it to your Platform record as a supplementary file -
this ensures that your submission remains in compliance with
MIAME standards.
- The Platform table must include meaningful, trackable, sequence identifiers (e.g. GenBank/RefSeq accessions,
locus tags, clone IDs, oligo sequences, chromosome locations, etc - see table below for full list).
This information enables other users to comprehensively interpret your data in compliance with
MIAME standards, and allows GEO to retrieve up-to-date
annotation for your Platform when incorporated into our downstream data query tools.
References to in-house databases or top BLAST hits are not sufficient.
- The principal reason many journals require deposit of microarray data to a public repository
is so that the scientific community has the ability to comprehensively evaluate or reanalyze the entire dataset.
While we understand the various reasons and difficulties
some researchers have with sharing data and array designs, the demand from users and journal editors together
with our need to maintain a useful and transparent database has led to our policy of only accepting
well-annotated datasets.
If you have any questions or concerns regarding this issue, please e-mail us.
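The ID uniqueness rule and the 1:1 correspondence between the Platform ID column and the Sample ID_REF column can be checked before submission with a short script. The Python sketch below assumes the two columns have already been read into lists of strings; the function name `check_ids` and the keys of the returned dictionary are hypothetical.

```python
from collections import Counter


def check_ids(platform_ids, sample_id_refs):
    """Report duplicate Platform IDs and mismatches against Sample ID_REF values."""
    counts = Counter(platform_ids)
    return {
        # IDs that occur more than once in the Platform table.
        "duplicate_platform_ids": sorted(i for i, n in counts.items() if n > 1),
        # ID_REF values with no matching Platform ID.
        "missing_from_platform": sorted(set(sample_id_refs) - set(platform_ids)),
        # Platform IDs with no matching ID_REF in the Sample table.
        "missing_from_sample": sorted(set(platform_ids) - set(sample_id_refs)),
    }
```

All three lists being empty is consistent with the 1:1 correspondence described above.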
Standard Platform Headers
The first row in the Platform table must be a header line that identifies the content of each column.
Column headers may be standard or non-standard. It is expected that at least one standard column
(other than ID) will be supplied with each Platform submission.
In addition to these standard columns, your data table may include any number of non-standard columns.
Examples of non-standard columns include array coordinate information, gene symbol or description, gene
ontology terms, quality indicators, etc. Columns may appear in any order after the ID column.
In this way, GEO is a flexible and open system, allowing you to provide all information necessary to
thoroughly annotate your array.
Standard column headers and their content are as follows:
| HEADER | CONTENT |
| ID | (Required) An identifier that unambiguously identifies
each row on your Platform table. Each ID within a Platform table must be unique.
This column heading should appear first and may be used only once in the data table.
Keep in mind that the ID column you provide in your Platform data table corresponds with
the ID_REF column you provide in accompanying Sample data tables. Sample data tables
should contain normalized data. If your normalization strategy requires taking the
average of replicate array features, your Platform should reflect the condensed template.
In this case, please e-mail or FTP the full template file to us and we will attach it to your
Platform record as a supplementary file. |
| SEQUENCE | The nucleotide sequence of each oligo, clone or PCR product. |
| GB_ACC | GenBank accession - identifies a
biological sequence through the GenBank sequence accession number assigned
to the sequence, or the representative GenBank or RefSeq accession number upon
which your sequence was designed. It is recommended that you include the version number of the accessions upon which your sequences were designed
(e.g., NM_022975.1 rather than NM_022975).
This is particularly important for RefSeq accessions which are updated frequently.
GenBank accessions representing the top BLAST hits for your sequences are not acceptable. Also,
chromosome, genome and contig accession numbers are generally not acceptable as they are not specific enough
to accurately identify the portion of the sequence printed on arrays (use GB_RANGE instead). |
| GB_LIST | GenBank accession list - as for GB_ACC, but allows more than one GenBank accession number to be presented. For example,
your sequences may have GenBank accession numbers representing both the 5' and 3' ends of your clones.
Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one GB_ACC column may be supplied. |
| GB_RANGE | GenBank accession range - specifies a particular sequence position within a GenBank accession number.
Use format ACCESSION.VERSION[start..end]. Useful for tiling arrays. |
| RANGE_GB | Use format ACCESSION.VERSION. Should be used in conjunction with RANGE_START and RANGE_END. Useful for tiling arrays. |
| RANGE_START | Use in conjunction with RANGE_GB. Indicates the start position (relative to the RANGE_GB accession). Useful for tiling arrays. |
| RANGE_END | Use in conjunction with RANGE_GB. Indicates the end position (relative to the RANGE_GB accession). Useful for tiling arrays. |
| RANGE_STRAND | Use in conjunction with RANGE_GB. Indicates the strand represented. Use + or - or empty. Useful for tiling arrays. |
| GI | GenBank identifier - as for GB_ACC, but specify the GenBank identifier number rather than the GenBank accession number. |
| GI_LIST | GenBank identifier list - as for GI, but allows more than one GenBank identifier to be presented.
Multiple GIs should be separated using commas or spaces. Alternatively, more than one GI column may be supplied. |
| GI_RANGE | GenBank identifier range - specifies a particular sequence position on a GenBank identifier number. Use format GI[start..end]. |
| CLONE_ID | Clone identifier - identifies a biological sequence
through a standard clone identifier. Only CLONE_IDs that can be used to identify
the sequence through an NCBI or other public-database
query should be provided in this column. Examples include FlyBase IDs,
RIKEN clone IDs and IMAGE clone numbers. |
| CLONE_ID_LIST | CLONE_ID list - as for CLONE_ID, but allows more than one clone identifier to be presented.
Multiple Clone IDs should be separated using commas or spaces. Alternatively, more than one CLONE_ID column may be supplied. |
| ORF | Open reading frame designator - identifies a biological sequence through an experimentally or
computationally derived open reading frame identifier. The ORF designator is
intended to represent a known or predicted DNA coding region or locus_tag identified
in NCBI's Entrez Genomes division.
It may be appropriate to include a GENOME_ACC column to reference the GenBank accession from which the ORF names are derived. |
| ORF_LIST | ORF list - as for ORF, but allows more than one open reading frame designator to be presented.
Multiple ORFs should be separated using commas or spaces. Alternatively, more than one ORF column may be supplied. |
| GENOME_ACC | Genome accession number - specifies the GenBank or RefSeq genome accession number from which ORF identifiers are derived. It is
important to include the version number of the genome accession upon which your sequences were generated (e.g., NC_004721.1 rather than NC_004721) because updates to the
genome sequence may render your ORF designations incorrect. |
| SNP_ID | SNP identifier - typically specifies a dbSNP refSNP ID with format rsXXXXXXXX. |
| SNP_ID_LIST | SNP identifier list - as for SNP_ID, but allows more than one SNP_ID to be presented.
Multiple SNP_IDs should be separated using commas or spaces. Alternatively, more than one SNP_ID column may be supplied. |
| miRNA_ID | microRNA identifier - typically has a format such as hsa-let-7a or MIRNLET7A2. |
| miRNA_ID_LIST | microRNA identifier list - as for miRNA_ID, but allows more than one miRNA_ID to be presented.
Multiple miRNA_IDs should be separated using commas or spaces. Alternatively, more than one miRNA_ID column may be supplied. |
| SPOT_ID | Alternative spot identifier - use only when no identifier or sequence tracking information is available.
This column is useful for designating control and empty features. |
| ORGANISM | The organism source of each feature on your array.
This is most useful when your array contains sequences derived from multiple organisms. |
| PT_ACC | Protein accession - identifies any GenBank or RefSeq protein accession number. Protein accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays. |
| PT_LIST | Protein accession list - as for PT_ACC, but allows more than one protein accession number to be presented.
Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one PT_ACC column may be supplied. Protein accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays. |
| PT_GI | Protein GenBank or RefSeq identifier. Protein identifiers should only be supplied for protein arrays or proteomic mass
spectrometry Platforms. Nucleotide identifiers should be supplied for nucleotide arrays. |
| PT_GI_LIST | Protein identifier list - as for PT_GI, but allows more than one protein identifier to be presented.
Multiple identifiers should be separated using commas or spaces. Alternatively, more than one PT_GI column may be supplied. Protein identifiers
should only be supplied for protein arrays. Nucleotide identifiers should be
supplied for nucleotide arrays. |
| SP_ACC | SwissProt accession. SwissProt accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays. |
| SP_LIST | SwissProt accession list - as for SP_ACC, but allows more than one SwissProt accession number to be presented.
Multiple accession numbers should be separated using commas or spaces. Alternatively, more than one SP_ACC column may be supplied. SwissProt accession numbers
should only be supplied for protein arrays. Nucleotide accession numbers should be
supplied for nucleotide arrays. |
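The "_LIST" columns above accept several identifiers in one cell, separated by commas or spaces. A minimal Python sketch of how such a cell might be split when checking a Platform table before submission follows. The function name split_id_list is hypothetical and is not part of any GEO tool.

```python
import re


def split_id_list(cell: str) -> list[str]:
    """Split one *_LIST cell on commas and/or whitespace, dropping empty tokens."""
    return [token for token in re.split(r"[,\s]+", cell.strip()) if token]


# Mixed separators are handled the same way.
print(split_id_list("P12345, Q67890 O43521"))
```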
| Sample data table guidelines |
|
- A Sample data table should lie between the !Sample_table_begin and !Sample_table_end attributes
(unless supplying Affymetrix CHP files or external text files, see !Sample_table attribute description).
- Data tables must be in plain text (ASCII) tab-delimited format
(unless supplying Affymetrix CHP files, see !Sample_table attribute description).
- Normalized values should be included in the table.
- The Sample data table should only contain information that pertains to the quantification measurements. With the exception of the ID information,
no annotation data that can be found on the reference Platform should be included in the Sample record.
- Complete data tables must be provided; it is not sufficient
to present only significantly regulated genes. The principal reason many journals require deposit of microarray data to a public repository
is so that the scientific community has the ability to comprehensively evaluate or reanalyze the entire dataset.
- If your hybridization data are available in a matrix format (data from multiple hybridizations are side by side in one table),
consider submitting using the GEOarchive format.
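As a rough illustration of the plain text (ASCII) and tab-delimited guidelines above, the following Python sketch reports obvious formatting problems in a data table. The function name check_table_text is hypothetical, and the rule that every row must have the same number of columns as the header is an assumption made for this sketch, not a documented GEO validation rule.

```python
def check_table_text(text: str) -> list[str]:
    """Return a list of formatting problems found in a tab-delimited data table."""
    problems = []
    if not text.isascii():
        problems.append("table contains non-ASCII characters")

    # Ignore completely blank lines; row numbers below count non-blank rows only.
    rows = [line for line in text.splitlines() if line.strip()]
    if not rows:
        return problems + ["table is empty"]

    width = len(rows[0].split("\t"))
    if width < 2:
        problems.append("header has fewer than two tab-separated columns")

    # Assumption: each data row should match the header's column count.
    for number, row in enumerate(rows[1:], start=2):
        if len(row.split("\t")) != width:
            problems.append(f"row {number} has a different column count than the header")
    return problems


print(check_table_text("ID_REF\tVALUE\nprobe_1\t1.5\nprobe_2\t\n"))
```

An empty list means none of these particular checks failed; it does not replace the validate-only option on the Direct Deposit page.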
Sample data table headers and content
The first row in the file must be a header line that identifies the content of each column.
The two required columns are listed below. In addition to the required columns, submitters are encouraged to supply
any number of auxiliary non-standard columns describing, for example, supporting measurements and calculations, quality evaluations or flags.
Columns may appear in any order after the ID_REF column. In this way, GEO is a flexible and open
system, allowing you to provide all information
necessary to thoroughly describe your hybridization results.
- ID_REF: (Required) Identifier reference - these should match the
unique identifiers given in the identifier (ID) column of the corresponding Platform data table.
- VALUE: (Required) These values should be the final, normalized quantification measurements that are comparable across rows and Samples,
and preferably processed as described in any accompanying manuscript.
Values that should be discarded (e.g., background higher than count, or otherwise flagged as 'bad')
should either be left blank or labeled as "null".
- For single channel data, this column should contain normalized (scaled) signal count data.
- For dual channel data, this column should contain normalized log ratio data (preferably test/reference).
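The ID_REF and VALUE requirements can be sketched in Python as shown below. This is only an illustration under stated assumptions: the function names are hypothetical, the log ratio uses base 2 (the text does not specify a base), non-positive intensities are treated as values to discard, and four decimal places is an arbitrary formatting choice.

```python
import math


def log_ratio(test, reference):
    """Return log2(test/reference), or None when the ratio is undefined.

    Assumption: base-2 logs; non-positive intensities are treated as discarded.
    """
    if test <= 0 or reference <= 0:
        return None
    return math.log2(test / reference)


def format_sample_table(rows):
    """Render (id_ref, value) pairs as a Sample data table.

    A value of None is written as "null", as described for discarded values.
    """
    lines = ["!Sample_table_begin", "ID_REF\tVALUE"]
    for id_ref, value in rows:
        cell = "null" if value is None else f"{value:.4f}"
        lines.append(f"{id_ref}\t{cell}")
    lines.append("!Sample_table_end")
    return "\n".join(lines) + "\n"


# Hypothetical probe identifiers; they must match the Platform ID column.
rows = [("probe_1", log_ratio(200.0, 100.0)), ("probe_2", log_ratio(0.0, 150.0))]
print(format_sample_table(rows))
```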
All submitters are now required to provide raw data with their submissions.
Raw data facilitates the unambiguous interpretation of the data
and potential verification of the conclusions as described in the MIAME guidelines.
A raw data file typically represents the original scan file, for example, a GenePix GPR file or
Affymetrix CEL file. Original TIFF image files may also be optionally supplied.
Raw data should be supplied in the form of external files and referenced within the SOFT file using the !Sample_supplementary_file attribute.
Supplementary files should be bundled together with the SOFT file into a .zip, .rar, or .tar archive using a program like WinZip;
do not include any sub-directories or sub-folders in the archive.
Alternatively, where possible, it is acceptable to include raw data columns alongside the normalized value data columns within the
Sample data table. However, your submission will be
enhanced by provision of the original quantified data files as these can often be read directly by data analysis
software packages and may contain additional information that is useful for reanalysis.
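One way to produce an archive without sub-directories is to store every file under its base name only. The Python sketch below does this with the standard zipfile module and also generates the matching !Sample_supplementary_file lines. The function names and the duplicate-name check are assumptions made for illustration; GEO does not provide or require this script.

```python
import os
import zipfile


def flat_archive_names(file_paths):
    """Map each input path to the base name it will have inside the archive."""
    names = {}
    for path in file_paths:
        name = os.path.basename(path)
        # Two files with the same base name would collide in a flat archive.
        if name in names.values():
            raise ValueError(f"duplicate file name in flat archive: {name}")
        names[path] = name
    return names


def bundle_flat(archive_path, file_paths):
    """Write the files into a .zip with no sub-directories; return stored names."""
    names = flat_archive_names(file_paths)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path, name in names.items():
            archive.write(path, arcname=name)
    return list(names.values())


def supplementary_lines(file_names):
    """Build the SOFT attribute lines that reference the raw data files."""
    return [f"!Sample_supplementary_file = {name}" for name in file_names]


print(supplementary_lines(["sample1.CEL"]))
```

For example, bundle_flat("submission.zip", ["submission.soft.txt", "raw/sample1.CEL"]) would store both files at the top level of the archive (the file names here are hypothetical).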
|
SOFT submission example and template files
|
|
The following examples (data tables truncated at 20 rows) represent valid GEO SOFT submissions:
The following templates can be used to help prepare SOFT submissions:
Batch updates are easy to perform in SOFT format - just include the attribute "!Sample_GEO_accession = GSMxxx" where
GSMxxx indicates the accession number of the record to be updated
(similarly, use !Platform_GEO_accession = GPLxxx, and !Series_GEO_accession = GSExxx).
You can provide the entire SOFT record with the necessary revisions.
Alternatively, you can simply provide the revised attributes and/or data table.
Data table updates are 'all-or-nothing'; that is,
to revise data in one column of a table, it is necessary to update the whole table.
Note that it is possible to perform SOFT updates on data that were submitted via any other submission route, such as the Web or GEOarchive. Likewise, it is possible to perform Web updates on individual records that were originally uploaded in SOFT format.
Submit your SOFT update file by selecting the 'SOFT' option on the Direct Deposit page. Make sure to check the 'Update' box.
Successful updates will be reflected immediately on your GEO records.
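A partial Sample update file of the kind described above might be assembled as in the following Python sketch. The !Sample_GEO_accession attribute comes from the text; the leading ^SAMPLE entity line and its value are assumptions based on the general SOFT record layout and should be checked against the official update examples. The function name and the attribute values shown are hypothetical.

```python
import re


def build_sample_update(accession, revised_attributes):
    """Build a minimal SOFT update for one Sample.

    accession: existing record accession, e.g. "GSM123" (hypothetical number).
    revised_attributes: dict of attribute name -> new value, names without "!".
    """
    if not re.fullmatch(r"GSM\d+", accession):
        raise ValueError(f"not a Sample accession: {accession}")

    # Assumption: the record starts with a ^SAMPLE entity line.
    lines = [f"^SAMPLE = {accession}", f"!Sample_GEO_accession = {accession}"]
    for name, value in revised_attributes.items():
        lines.append(f"!{name} = {value}")
    return "\n".join(lines) + "\n"


print(build_sample_update("GSM123", {"Sample_title": "Revised title"}))
```

Because data table updates are all-or-nothing, a revised table would need to be included in full rather than column by column.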
The following examples (data tables, if any, truncated at 20 rows) represent valid SOFT update files:
SOFT format is used not only for batch uploads and updates of data, but also for batch download.
The only difference between SOFT input and output is that the output carries a few additional attributes, including:
_geo_accession
_status
_submission_date
_last_update_date
_row_count
_contact_name
_contact_email
_contact_institute
_contact_department
_contact_city
_contact_phone
_contact_fax
_contact_web_link
Sample_channel_count
Series_type
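Since downloaded SOFT files use the same line-based layout as submissions, the additional attributes can be read with a few lines of code. The Python sketch below collects "!name = value" lines into a dictionary and skips data table rows. The function name is hypothetical, the sample text is made up for illustration, and the sketch assumes that attribute lines start with "!" and that tables are bracketed by "_table_begin" and "_table_end" lines.

```python
def read_soft_attributes(text):
    """Collect '!name = value' lines into a dict of lists, ignoring table rows."""
    attributes = {}
    in_table = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("!"):
            lowered = line.lower()
            if lowered.endswith("_table_begin"):
                in_table = True
                continue
            if lowered.endswith("_table_end"):
                in_table = False
                continue
        if in_table or not line.startswith("!"):
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        name, separator, value = line[1:].partition("=")
        if separator:
            attributes.setdefault(name.strip(), []).append(value.strip())
    return attributes


# Made-up record for illustration only.
example = (
    "^SAMPLE = GSM1\n"
    "!Sample_geo_accession = GSM1\n"
    "!Sample_channel_count = 1\n"
    "!Sample_row_count = 1\n"
    "!Sample_table_begin\n"
    "ID_REF\tVALUE\n"
    "probe_1\t1.0\n"
    "!Sample_table_end\n"
)
print(read_soft_attributes(example)["Sample_geo_accession"])
```

Values are kept as lists because the same attribute name may appear on more than one line in a record.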
All GEO data are available for download in SOFT format from our anonymous
FTP site.