
   
Web Deposit Guide

  1. Introduction
  2. Common submission errors
  3. Your GEO account
  4. Submit Platform
  5. Submit Sample
  6. Submit Series
  7. Submit updates
  8. Notes for Microsoft Excel users
  9. Future plans

Introduction

The Web deposit guide provides detailed information about submitting various kinds of data to GEO using the Web deposit submission route. The Web submission process comprises a set of interactive Web forms that provide a simple step-by-step procedure for deposit of individual, MIAME-compliant records.

The Web submission process described here is most useful for the quick and easy deposit of individual records by occasional submitters, or for small experiments. If your data are already in a database or if you have many samples to submit, you may prefer to make a batch deposit of data via Direct Deposit using one of our batch deposit formats. Regardless of the submission method you choose, the final GEO records will look the same and contain equivalent information.

A schematic overview of the GEO Web submission procedure is shown below. Submitters provide their data in three sections: Platform, Samples, and Series.


  Schematic overview of GEO data submission.


Detailed information for each step of the submission process is described in the sections below. If you have read this guide, and are still having difficulty making a submission, please contact us at geo@ncbi.nlm.nih.gov and we'll be happy to provide further assistance.


Common submission errors

The following list is intended to draw attention to common submission errors. More details on these issues may be found in relevant sections later in this document. Please note that making these errors can lead to delays in processing, or non-approval of your submissions. If you require clarification on any of these matters, please do not hesitate to contact us at geo@ncbi.nlm.nih.gov.
  1. Raw data not provided. All submitters are required to provide raw data with their submissions. Raw data facilitates the unambiguous interpretation of the data and potential verification of the conclusions as set forth in the MIAME guidelines. Raw data may be supplied in the form of external files, e.g., Affymetrix CEL or GenePix GPR scan files. Where appropriate, it is also acceptable to include raw hybridization measurements, together with processed measurements, within the Sample record data tables.
  2. No Series submission. All submitters should complete their GEO deposits by submitting a Series record. A Series record links together a group of related Samples, and provides a focal point and description of the experiment as a whole. Submission of a Series record indicates to GEO staff that you have completed your deposits and that we can begin processing your submissions.
  3. Too many Series submissions. A Series record links together a group of related Samples, and provides a focal point and description of the experiment as a whole. Please do not submit multiple Series records for one experiment. You can provide a breakdown of how the Samples in your Series are grouped and classified by defining Series subsets.
  4. Data submitted as the wrong type. There are three separate types of records within the GEO database: Platform, Sample, and Series. A Platform describes the features of the array (gene names and other annotation), a Sample contains corresponding value measurements derived from one hybridization, and a Series links a group of related Sample records. Thus, it may be necessary to reformat your data tables so that array annotation is separated from hybridization measurements.
  5. Incomplete data tables. Many journals require deposit of microarray data to a public repository so that the scientific community has the ability to comprehensively evaluate or reanalyze the dataset. To be able to do this, it is important that complete data tables are made available. It is not sufficient to supply a summarized list of genes that you determine to be significantly regulated - complete tables must be supplied. The appropriate place to present summary information is as a Series data table.
  6. Insufficient sequence tracking information. It is important that meaningful, trackable sequence identifier information is provided for each feature in the Platform data table. This enables other users to comprehensively interpret your data. In cases where identifier information is not available for the majority of features (e.g., clone identity is proprietary or only selected clones have been sequenced), GEO policy is to approve GEO accession numbers only if the data are to be published and the GEO accession numbers quoted in the manuscript (i.e., the lack of sequence tracking information is acceptable to reviewers/editors). These GEO records will not be released to the public until the manuscript is published.
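As an illustration of item 4, the following is a minimal Python sketch of how a single combined spreadsheet export might be separated into a Platform table (annotation) and a Sample table (measurements). The input column names (spot, gene_name, signal) and the values are hypothetical; only the ID, ID_REF, and VALUE headers come from this guide, and the Platform guidelines page remains the authority on which annotation headers are acceptable.

```python
import csv
import io


def split_combined_table(text, id_column, annotation_columns, value_column):
    """Split one tab-delimited table into Platform rows and Sample rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Platform table: identifier plus array annotation only.
    platform_rows = [["ID"] + list(annotation_columns)]
    # Sample table: identifier reference plus the measurement only.
    sample_rows = [["ID_REF", "VALUE"]]
    for row in reader:
        platform_rows.append([row[id_column]] + [row[c] for c in annotation_columns])
        sample_rows.append([row[id_column], row[value_column]])
    return platform_rows, sample_rows


def to_tsv(rows):
    """Render rows as tab-delimited text."""
    return "\n".join("\t".join(r) for r in rows) + "\n"


# Hypothetical combined export mixing annotation and measurements.
combined = (
    "spot\tgene_name\tsignal\n"
    "1\tgeneA\t1520.5\n"
    "2\tgeneB\t87.2\n"
)
platform, sample = split_combined_table(combined, "spot", ["gene_name"], "signal")
print(to_tsv(platform))
print(to_tsv(sample))
```

The same identifier appears in both outputs, which is what lets a Sample row refer back to its Platform feature.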

Your GEO account

First-time submitters must create a GEO account before depositing data with GEO. You will be asked to choose a UserID and password, and to enter some information about how we, and other interested parties, may contact you about your data. This information also allows the source of data to be properly referenced - the contact details you supply will be displayed on the GEO records.

Contact information need only be supplied once. Your UserID and password may be used to log in to your GEO account at any time to make additional submissions, or update existing records. You may log in to your GEO account using the "Depositors only" section at the foot of the GEO home page, or you will be prompted to log in if you attempt to access the Web deposit/update page.

It is only necessary to log in if you are submitting data to GEO. All released data are public and distributed freely. Browsing, querying, viewing, and downloading information does not require that we know who you are and, therefore, does not require login.

To make edits to your contact information, log in to your GEO account and click the 'View your account' box on the GEO home page. This will take you to your account details and provides an 'EDIT' option. The revised contact information you supply will be displayed immediately on all your GEO records (old and new).

To create a new GEO account, click the 'Create new account' link on the GEO home page and enter all the required fields(*) and as many of the optional fields as you want.

 

USER ID: Provide a user name. Since GEO is not a completely secure site, please do not choose a UserID that you use for other important accounts. The UserID must differ from every other UserID tracked by GEO. If a UserID clash occurs, you will be asked to select another one. UserIDs are case-sensitive and must contain only letters and numbers.
PASSWORD: Provide a password. Since GEO is not a completely secure site, please do not choose a password that you use for other important accounts. Passwords are case-sensitive and must contain only letters and numbers.
RE-PASSWORD: Confirm your password.
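The character rules above can be checked before filling in the form. The sketch below is only an illustration: the function name is hypothetical, it assumes "letters and numbers" means ASCII letters and digits, and it cannot detect a UserID clash, which only GEO can report.

```python
def check_new_account(user_id, password, re_password):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    for label, value in (("UserID", user_id), ("password", password)):
        # Assumption: only ASCII letters and digits are accepted.
        if not (value.isascii() and value.isalnum()):
            problems.append(f"{label} must contain only letters and numbers")
    # Comparison is case-sensitive, as stated for UserIDs and passwords.
    if password != re_password:
        problems.append("password and confirmation differ")
    return problems


print(check_new_account("Lab42", "Secret9", "Secret9"))
print(check_new_account("lab_42", "Secret9", "secret9"))
```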

Contact information to be displayed on record: In this section, provide details about the person to whom the data are attributable and who is primarily responsible for the overall study. The names of other persons associated with the study can be provided later during the data submission process. These contact details will be displayed on GEO records.

FIRST NAME: Contact's 'given' name - provide a full name or initial.
MIDDLE NAME: Contact's middle name or initial.
LAST NAME: Contact's 'family' name - provide a full name.
E-MAIL: E-mail address that GEO staff and other interested parties can use to contact the person responsible for the data. GEO staff will send all correspondence regarding submission problems and final GEO accession approval notices to this address. If another person should be contacted regarding these issues (e.g., microarray facility personnel responsible for submitting the data), please fill in the section at the bottom of the form. You may choose not to display this information on GEO records.
PHONE: Business phone number where the contact can be reached by GEO staff and other interested parties. You may choose not to display this information on GEO records.
FAX: Business fax number where the contact can be reached by GEO staff and other interested parties. You may choose not to display this information on GEO records.
URL: World Wide Web link to a Web page that further describes the contact or his/her lab.
ORGANIZATION: Name of the organization or institute to which the contact belongs.
DEPARTMENT: Name of the organization or institute department to which the contact belongs.
LAB: Laboratory or group name to which the contact belongs.
STREET ADDRESS: Street address where the contact can be reached by post or mail.
CITY: City name of the contact's business address.
STATE/PROVINCE: State or province name of the contact's business address.
ZIP/POSTAL CODE: Zip or postal code of the contact's business address.
COUNTRY: Country name of the contact's business address.

Person to contact about the submission (if different from above): In this section, provide details about the person responsible for submitting the data to GEO (e.g., microarray facility personnel). GEO staff will contact this person should any submission problems arise. These contact details will not be displayed on GEO records.

NAME: First and last name
E-MAIL: E-mail address
PHONE: Business phone number
ORGANIZATION/FACILITY: Organization or facility


Submit Platform

A Platform record describes the array used in your experiments. Platform information is supplied in two sections:
  1. A data table template listing the features (e.g., cDNAs, oligonucleotides, antibodies) present on the array, together with sequence or molecule tracking information and annotation.
  2. General array descriptive information including title, organism from which the features on the array are derived, design and manufacture protocols.

Commercial arrays

It may not be necessary to submit a Platform record if your experiments are performed using commercial arrays (e.g., Affymetrix GeneChips). Official versions of many commercial array templates have already been deposited with GEO. To locate a commercial array, use the FIND PLATFORM tool. If you use a commercial array, but cannot locate its template in GEO, please proceed with Platform submission. If we can verify the content of the commercial Platform you submit, the contact information presented on that record will be changed from yours to that of the vendor, so that other users may easily locate and submit Sample data corresponding to that Platform.

A Platform record need only be submitted once. Many Samples, even Samples from unrelated experiments and submitters, may reference the same Platform accession number.

No Platform submission is required for SAGE libraries - to deposit SAGE data, proceed directly to the SAGE sample section.

To begin Platform submission, first log in to your GEO account, then go to the Web deposit page, check the 'Platform' box, and select 'NEW'.

 

On the next page, you will be asked to select one of the following Platform types. These types are based on how the array is manufactured and distributed.
  • Commercial: array is manufactured and purchased from a commercial company and is available to anyone (before proceeding with commercial platform submission, please read the commercial arrays section above)
  • Custom-commercial: array is manufactured by a commercial company specifically for your project
  • Non-commercial: array is manufactured by a non-commercial institution
  • Virtual: not a physical array, but a list of elements that are detected and quantified in your experiment

Technology type

On the next page you will be asked to specify a technology type for your Platform. If you are submitting an array-based Platform, you will be presented with the following choices:
  • in situ oligonucleotide: oligonucleotides manufactured using in situ methods such as photolithography (e.g., Affymetrix) or chemical synthesis (e.g., Agilent)
  • spotted oligonucleotide: oligonucleotides are spotted directly on the array
  • spotted DNA/cDNA: cDNA (e.g., a PCR of a cDNA clone) or genomic DNA (e.g., BACs) is spotted directly on the array
  • antibody: antibodies are spotted directly on the array
  • tissue: tissue spotted directly on the array
If you are submitting a virtual Platform, you will be presented with the following choices:
  • SARST: serial analysis of ribosomal sequence tags
  • MPSS: massively parallel signature sequencing
  • RT-PCR: real time reverse transcriptase PCR (only large-scale high-throughput RT-PCR studies may be accepted; contact us at geo@ncbi.nlm.nih.gov if you require more details)
GEO has a flexible design that can accommodate many styles of high-throughput data. If these platform technology choices do not reflect your data type, please contact GEO staff at geo@ncbi.nlm.nih.gov for advice on how to proceed.


Platform data table guidelines

A 'data file' box is provided in which you must supply a local file that contains your Platform data table. Use the adjacent "Browse" button to specify the file location and name.

Refer to the Platform guidelines page for detailed instructions on the content and format of Platform tables.

Validation of the data table is performed as the file is transferred when the "Next" button is selected. Validation messages are printed on the next page after a delay. The length of this delay is primarily due to file size and data transfer speed and not the validation process. Error messages will generally provide a line number and column heading where the error(s) occurred.

After your platform data table has passed validation, you will be prompted to provide short descriptions of each column in your data table.

 



Platform descriptive fields

On the next page, specify a release date for your Platform record, and enter Platform descriptive information.

 

DATA RELEASE DATE: A release date must be specified for your Platform record. An upper limit of one year from the time of submission is permitted. However, if publication takes longer than anticipated, you may delay the release date either by e-mailing GEO staff at geo@ncbi.nlm.nih.gov, who will update the release date on your behalf, or by logging in to your GEO account and checking the "Update" box on the GEO Web deposit/update page.
TITLE: Provide a short title specific to your Platform. The title must be less than 120 characters and unique over all your previously submitted GEO Platforms. You will be prompted to edit your Platform title if a name clash occurs or if your title is too long. We suggest that you use the system [institution/lab]-[species]-[number of features]-[version], e.g. "FHCRC Mouse 15K v1.0".
ORGANISM(S): Specify the organism from which the molecules on your Platform are derived or designed. Select one or more relevant organisms from this menu. Multiple selections can be made by holding down the Ctrl key. If your organism is not present on the list, select OTHER and provide the organism name in the Other organism(s) box below. At least one selection is required.
OTHER ORGANISM(S): Use to specify an organism that is not listed in the Organism menu above. Multiple organism names can be provided as a comma-delimited list. Text is required if OTHER was selected from the menu above.
MANUFACTURER: Provide the name of the company, facility, or laboratory where the array was manufactured or produced.
MANUFACTURE PROTOCOL: Describe or reference the array manufacture protocol. Include as much detail as possible, e.g., clone/primer set identification and preparation, strandedness/length, arrayer hardware/software, spotting protocols.
CATALOG NUMBER: Provide the manufacturer catalog number for commercially-available arrays.
SUPPORT: Specify the surface type of the array.
COATING: Specify the coating of the array.
DESCRIPTION: Provide any additional descriptive information not captured in another field, e.g., array and/or feature physical dimensions, element grid system.
WEB LINK: Provide a World Wide Web link to reference a Web page that further describes your Platform.
PUBMED ID: A PubMed ID (PMID) references a publication that describes your Platform. A PMID is a numeric value that you may obtain from the PubMed record. It is likely that the PMID can only be provided at a later date, once your Platform has been published. To add a PMID to an existing Platform record, either e-mail GEO staff at geo@ncbi.nlm.nih.gov, who will update your record with PMID links accordingly, or log in to your GEO account and check the "Update" box on the GEO Web deposit/update page.
CONTRIBUTORS: List all people associated with this array design.
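The title and release date constraints described above can be checked locally before submission. This is a minimal sketch under stated assumptions: the function names are hypothetical, "one year" is approximated as 365 days, and title uniqueness is tested by exact, case-sensitive comparison against a list you supply. GEO's own form validation is authoritative.

```python
from datetime import date, timedelta

MAX_TITLE_LENGTH = 120  # titles must be shorter than this many characters


def check_title(title, existing_titles):
    """Return a list of problems with a proposed record title."""
    problems = []
    if len(title) >= MAX_TITLE_LENGTH:
        problems.append("title must be less than 120 characters")
    # Assumption: uniqueness is an exact, case-sensitive match.
    if title in existing_titles:
        problems.append("title clashes with a previously submitted record")
    return problems


def check_release_date(release, submitted):
    """True if the release date is within one year after submission."""
    latest = submitted + timedelta(days=365)  # assumption: one year = 365 days
    return submitted <= release <= latest


print(check_title("FHCRC Mouse 15K v1.0", ["FHCRC Mouse 15K v1.0"]))
print(check_release_date(date(2007, 6, 1), date(2007, 1, 1)))
```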

Selecting the 'Next' button will take you to a page that allows you to review the data you provided. Use your browser's 'Back' button to go back and make corrections if necessary.

Selecting the 'Submit' button will submit the data to GEO. A successful submission will return a provisional GEO Platform accession number (GPLxxx). At this stage, you may proceed with Sample submission; you do not need to wait for authorization of your Platform record. Do not quote GEO accession numbers in manuscripts until you have received an approval notice e-mail from GEO staff. Note that in most cases it is most appropriate to quote Series accession numbers (GSExxx), not Platform accessions, in manuscripts that describe your experiment (see linking and citing).

 



Submit Sample

A Sample record describes the biological material under examination and the quantification measurements derived from those samples. Sample information is supplied in two sections:
  1. A data table that includes normalized quantification measurements from a hybridization or SAGE library.
  2. Descriptive information regarding the biological source material, and the protocols performed in the experiment. This can include organism, strain, age, specifics on experimental variables, treatments and handling, as well as labeling, hybridization, scanning, quantification and normalization protocols.
Sample submission requires explicit reference to a GEO Platform accession number, so the Platform (i.e., the array template) must exist in GEO before you can supply Sample data (see Platform submission). The exception to this rule is the submission of SAGE data. SAGE data submitters are not required to reference a Platform - an implicit reference to a virtual SAGE Platform is written into the SAGE Sample submission process.


To begin Sample submission go to the Web deposit page, check the 'Sample' box, and select 'NEW'.

 



On the next page, you will be asked to specify the type of Sample, the number of channels, and the parent Platform.

 


Type:
Choose one of the following options from the drop-down menu:
  • array-based: any solid support, hybridization-based array experiment
  • SAGE : serial analysis of gene expression
  • SARST : serial analysis of ribosomal sequence tags
  • MPSS : massively parallel signature sequencing
  • RT-PCR : real time reverse transcriptase PCR
GEO has a flexible design that can accommodate many styles of high-throughput data. If these Sample choices do not reflect your data type, please contact GEO staff at geo@ncbi.nlm.nih.gov for advice on how to proceed.

Channel(s):
Select or specify the number of channels represented in this submission:
  • a single channel Sample generates molecular abundance measurements from one source, e.g., a typical Affymetrix hybridization
  • a dual channel Sample generates molecular abundance measurements from two sources, e.g., a Cy3/Cy5 hybridization that compares gene expression in test and reference
  • if material from more than two sources was analyzed simultaneously on the same array, select the 'More' option and specify the number of channels
Platform:
Specify the parent Platform used to derive this Sample's data. All Platforms you have supplied previously will be listed in the 'your platforms' drop-down menu. If you know the Platform accession number (GPLxxx), enter it in the box provided. If the Platform is commercial and/or you don't know the accession of the Platform you used, use the FIND PLATFORM tool to help locate your Platform. This feature prompts you to select the organism and company name or title keywords of the array you used. If you still have problems locating your Platform, please e-mail GEO staff at geo@ncbi.nlm.nih.gov for assistance. If you are supplying Affymetrix CHP files, there is no need to specify a Platform; just select the "Submitting Affymetrix CHP files" option.


The next section depends on whether you are submitting an array-based single channel Sample, an array-based dual channel Sample, or a SAGE library. Please skip to the relevant section.


Single channel

A 'single channel' Sample represents a hybridization in which cDNA derived from one biosource is hybridized with the array. This method is commonly used for high density oligonucleotide arrays with fluorescent labels (e.g., Affymetrix GeneChips) and membrane (filter) arrays with radionucleotide labels. This experiment type generates gene expression measurements that are represented as scaled/normalized signal count values.


Single channel Sample data table guidelines

  • A valid Sample data table file is a tab-delimited text file.
  • The Sample data table should only contain information that pertains to the quantification measurements; with the exception of the ID information, no annotation data that can be found on the reference Platform should be included in the Sample record.
  • Complete data tables must be provided. The principal reason many journals require deposit of microarray data to a public repository is so that the scientific community has the ability to comprehensively evaluate or reanalyze the dataset. It is not sufficient to present only significantly regulated genes - the appropriate place to present this information is as a Series data table.

  • Single channel Sample data table headers and content
    The first row in the file must be a header line that identifies the content of each column. Required columns are listed below. In addition to the required columns, submitters are encouraged to supply any number of auxiliary non-standard columns describing, for example, supporting measurements and calculations, quality evaluations or flags. Columns may appear in any order after the ID_REF column. In this way, GEO is a flexible and open system, allowing you to provide all information necessary to thoroughly describe your hybridization results.

  • ID_REF: (Required) Identifier reference - references the unique identifiers given in the identifier (ID) column of the corresponding Platform data table.
  • VALUE: (Required) For single channel data, this column should contain normalized (scaled) signal count data that are comparable across rows and Samples. If you want to supply ratio data that compare the values of two Samples, please supply these separately in a Series data table. Values that should be disregarded (e.g., background higher than count, or otherwise flagged as 'bad') may either be left blank or labeled as "null".


  • A typical single channel Sample data table may look as follows:

     

    Validation of the data table is performed as the file is transferred when the "Next" button is selected. Validation messages (i.e., errors and notes) are printed on the next page after a delay. The length of this delay is primarily due to file size and data transfer speed and not the validation process. Error messages will generally provide a line number and column heading where the error(s) occurred. After your data table has passed validation, you will be prompted to provide short descriptions of each column in your data table. When providing a description of your VALUE column, please supply information regarding the normalization/transformation procedure used.
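To catch the most common table problems before upload, a local pre-check along the following lines may help. It is a sketch only and does not reproduce GEO's validator: the function name and messages are hypothetical, and it assumes that ID_REF must be the first column (a reading of "columns may appear in any order after the ID_REF column") and that VALUE entries are numeric, blank, or the literal text null.

```python
import csv
import io


def validate_sample_table(text, platform_ids):
    """Return a list of (line_number, column, message) problems."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or not rows[0]:
        return [(1, "", "header line is missing")]
    header = rows[0]
    problems = []
    if header[0] != "ID_REF":
        problems.append((1, "ID_REF", "ID_REF must be the first column"))
    if "VALUE" not in header:
        problems.append((1, "VALUE", "required column is missing"))
    if problems:
        return problems
    value_col = header.index("VALUE")
    for line_number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append((line_number, "", "wrong number of columns"))
            continue
        # Every ID_REF must match an ID in the Platform data table.
        if row[0] not in platform_ids:
            problems.append((line_number, "ID_REF", "identifier not found on Platform"))
        value = row[value_col].strip()
        # Blank or "null" marks a value to be disregarded.
        if value and value != "null":
            try:
                float(value)
            except ValueError:
                problems.append((line_number, "VALUE", "not a number"))
    return problems


# Illustrative table with made-up identifiers and values.
table = "ID_REF\tVALUE\n1\t1520.5\n2\tnull\n3\t\n4\tabc\n"
for problem in validate_sample_table(table, {"1", "2", "3"}):
    print(problem)
```

Reporting a line number and column for each problem mirrors the form of the error messages described above, which makes it easier to compare local results with what the Web form reports.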


    Single channel Sample descriptive fields

    On the next page, specify a release date for your Sample record, and enter descriptive information about the biological material, how it was handled and processed.
    Complete all fields to comply with MIAME guidelines.

    Important:
    For all studies involving human subjects, it is the submitter's responsibility to ensure that the data and files supplied to GEO protect participant privacy in accordance with all applicable laws, regulations and institutional policies. Make sure to remove any direct personal identifiers from your submission. These identifiers are listed in http://privacyruleandresearch.nih.gov/research_repositories.asp, footnote 1.


     

    DATA RELEASE DATE: A release date must be specified for your Sample record. An upper limit of one year from the time of submission is permitted. However, if publication takes longer than anticipated, you may delay the release date either by e-mailing GEO staff at geo@ncbi.nlm.nih.gov, who will update the release date on your behalf, or by logging in to your GEO account and checking the "Update" box on the GEO Web deposit/update page.
    TITLE: Choose a short title specific to your Sample. The title must be less than 120 characters and unique over all your previously submitted GEO Samples. You will be prompted to edit your Sample title if a name clash occurs or if your title is too long. We suggest that you use the system [biomaterial]-[condition(s)]-[replicate number], e.g., Muscle_exercised_60min_rep2.
    SUPPLEMENTARY FILE/ARCHIVE: Upload one or more supplementary files for this submission. A supplementary file typically represents the original scan file, for example, a GenePix GPR file, or Affymetrix CEL and EXP files. If you need to upload several files, you can pack them into one archive (zip or tar). Original TIFF image files may also be supplied. If your data type has no supplementary files, or if you prefer to supply supplementary files later, check the relevant box.
    SOURCE NAME: Briefly identify the biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
    ORGANISM(S): Specify the organism from which the biological source was derived. Select one or more relevant organisms from this menu. Multiple selections can be made by holding down the Ctrl key. If your organism is not present on the list, select OTHER and provide the organism name in the Other organism(s) box below. At least one selection is required.
    OTHER ORGANISM(S): Use to specify an organism that is not listed in the Organism menu above. Multiple organism names can be provided as a comma-delimited list. Text is required if OTHER was selected from the menu above.
    CHARACTERISTICS: List all available characteristics of the biological source in 'Tag: Value' format. Include as many Characteristics lines as necessary to thoroughly describe your Sample, e.g.,
    Strain: C57BL/6
    Gender: female
    Age: 45 days
    Tissue: bladder tumor
    Tumor stage: Ta
    BIOMATERIAL PROVIDER: Specify the name of the company, laboratory or person that provided the biological material.
    TREATMENT PROTOCOL: Describe any treatments applied to the biological material prior to extract preparation. It is strongly recommended that complete protocol descriptions are provided within your submission. This field may hold very large volumes of text in which to thoroughly describe protocols.
    GROWTH PROTOCOL: Describe or reference the conditions that were used to grow or maintain organisms or cells prior to extract preparation. It is strongly recommended that complete protocol descriptions are provided within your submission. This field may hold very large volumes of text in which to thoroughly describe protocols.
    MOLECULE: Specify the type of molecule that was extracted from the biological material.
    EXTRACT PROTOCOL: Describe or reference the protocol used to isolate the extract material. It is strongly recommended that complete protocol descriptions are provided within your submission. This field may hold very large volumes of text in which to thoroughly describe protocols.
    LABEL: Specify the compound used to label the extract, e.g., biotin, Cy3, Cy5, 33P.
    LABEL PROTOCOL: Describe or reference the protocol used to label the extract. It is strongly recommended that complete protocol descriptions are provided within your submission. This field may hold very large volumes of text in which to thoroughly describe protocols.
    HYBRIDIZATION PROTOCOL: Describe or reference the protocols used for hybridization, blocking and washing, and any post-processing steps such as staining. It is strongly recommended that complete protocol descriptions are provided within your submission. This field may hold very large volumes of text in which to thoroughly describe protocols.
    SCAN PROTOCOL: Describe or reference the scanning and image acquisition protocols, hardware, and software. It is strongly recommended that complete protocol descriptions are provided within your submission. This field may hold very large volumes of text in which to thoroughly describe protocols.
    DESCRIPTION: Include any additional information not provided in the other fields, or paste in broad descriptions that cannot be easily dissected into the other fields.
    DATA PROCESSING: Provide details of how data in the VALUE column of your table were generated and calculated, i.e., normalization method, data selection procedures and parameters, transformation algorithm (e.g., MAS5.0, scaled to 500). It is strongly recommended that complete protocol descriptions are provided within your submission. This field may hold very large volumes of text in which to thoroughly describe protocols.
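When many Samples are prepared at once, the suggested title system and the 'Tag: Value' layout of the CHARACTERISTICS field can be generated and checked with a few helper functions. The sketch below is illustrative; the function names are hypothetical, and joining title parts with underscores simply follows the Muscle_exercised_60min_rep2 example given above.

```python
def sample_title(biomaterial, conditions, replicate):
    """Build a title as [biomaterial]-[condition(s)]-[replicate number]."""
    return "_".join([biomaterial, *conditions, f"rep{replicate}"])


def format_characteristics(pairs):
    """Render (tag, value) pairs as one 'Tag: Value' line each."""
    return "\n".join(f"{tag}: {value}" for tag, value in pairs)


def parse_characteristics(text):
    """Parse 'Tag: Value' lines; raise ValueError on a malformed line."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        tag, sep, value = line.partition(":")
        if not sep or not tag.strip() or not value.strip():
            raise ValueError(f"not in 'Tag: Value' format: {line!r}")
        pairs.append((tag.strip(), value.strip()))
    return pairs


print(sample_title("Muscle", ["exercised", "60min"], 2))
print(format_characteristics([("Strain", "C57BL/6"), ("Age", "45 days")]))
```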

    Selecting the 'Next' button will take you to a page that allows you to review the data you provided. Use your browser's 'Back' button to go back and make corrections if necessary.

    A successful submission will return a provisional GEO Sample accession number (GSMxxx). If you have more Samples to submit, select the "Submit Next" button. Some of the descriptive fields in subsequent Sample submissions will be automatically filled to facilitate fast deposit of multiple Sample records. Please ensure that these autofill fields correctly describe the biosource in each new Sample record, and edit if necessary. After you have submitted all your Sample records, please complete your GEO deposits with a Series submission; you do not need to wait for authorization of your Sample records. Do not quote GEO accession numbers in manuscripts until you have received an approval notice e-mail from GEO staff. Note that in most cases it is most appropriate to quote Series accession numbers (GSExxx), not Sample accessions, in manuscripts that describe your experiment (see linking and citing).
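Because Platform, Sample, and Series records carry different accession prefixes (GPL, GSM, GSE), a short helper can tell which kind of record an accession refers to, for example when checking that a manuscript quotes a Series accession. This is a sketch with a hypothetical function name; it assumes an accession is one of these prefixes followed only by digits, and it says nothing about whether the record exists or has been approved.

```python
import re

ACCESSION_TYPES = {"GPL": "Platform", "GSM": "Sample", "GSE": "Series"}
# Assumption: prefix followed by one or more digits, nothing else.
ACCESSION_PATTERN = re.compile(r"^(GPL|GSM|GSE)(\d+)$")


def accession_type(accession):
    """Return 'Platform', 'Sample', or 'Series', or None if unrecognized."""
    match = ACCESSION_PATTERN.match(accession.strip())
    if match is None:
        return None
    return ACCESSION_TYPES[match.group(1)]


for candidate in ("GPL96", "GSM1234", "GSE567", "ABC1"):
    print(candidate, accession_type(candidate))
```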

     


    Notes for Affymetrix users

    We highly recommend batch deposit using our simple spreadsheet submission option.

    Platform submission is not required for most Affymetrix submissions since official versions of many commercial array templates have already been deposited with GEO (use the FIND PLATFORM tool). If you cannot find the relevant commercial Affymetrix Platform in GEO, please contact GEO staff at geo@ncbi.nlm.nih.gov for assistance. If you use a custom Affymetrix chip, it will be necessary to upload the Platform template as described in the Submit Platform section.

    GEO accepts condensed probeset data as the primary Sample data type for standard gene expression experiments. Data may be transformed/condensed by any algorithm you choose (MAS5, RMA, dCHIP, etc.); it is recommended to present your data as it was processed in the accompanying manuscript. It is possible to supply native CHP files - just choose "Submitting Affymetrix CHP files" at the Platform selection stage. Original .CEL files from which the condensed data were derived should be supplied as Supplementary files. We generally do not accept .CEL files alone, because many users do not have the software or expertise to process these files.

    To submit Affymetrix hybridization Sample data using the Web deposit pages, go to the Web deposit page, and select the Sample option. On the next page select 'Type: array-based' and 'Channel: single' (all Affymetrix data must be supplied with Signal count values - if you want to supply ratio comparison data, please supply these separately in a Series data table).

    If you are supplying CHP data, select the "Submitting Affymetrix CHP files" option. If you are submitting other file types such as RMA or dCHIP, use the FIND PLATFORM tool feature to locate the GEO accession number (GPLxxx) of the GeneChip used in your study.

    On the next page, specify the native CHP file for that Sample. Alternatively, if not providing CHP files, specify the data table file location and name. Data tables must be supplied in text tab-delimited format. Data may be transformed/condensed by any algorithm you choose (RMA, dCHIP, etc.); it is recommended to present your data as it was processed in the accompanying manuscript. Column headers must be edited so that they include an ID_REF column (Probe Set Name) and a VALUE column (Signal). A typical Affymetrix Sample data table may look as follows (the ABS_CALL and DETECTION P-VALUE columns are optional):

     

    After your Sample data table has passed validation, you will be prompted to provide short descriptions of each column in your data table. When providing a description of your VALUE column, it is important to supply information regarding the transformation procedure used (MAS5, RMA, dCHIP, etc.).
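
    The header edits described above can be scripted before upload. The following is a minimal sketch, not a GEO tool: the function name and the source header names on the left of HEADER_MAP are assumptions about a typical export and may differ in your files, and the sample row is made up.

```python
# Assumed mapping from export headers to GEO headers; the left-hand names
# are illustrative only and should be adjusted to match your own files.
HEADER_MAP = {
    "Probe Set Name": "ID_REF",
    "Signal": "VALUE",
    "Detection": "ABS_CALL",
    "Detection p-value": "DETECTION P-VALUE",
}


def rename_headers(text, header_map=HEADER_MAP):
    """Return tab-delimited text with the header row renamed for GEO."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty table")
    header = [header_map.get(h.strip(), h.strip()) for h in lines[0].split("\t")]
    # ID_REF and VALUE are the two columns the guide requires.
    for required in ("ID_REF", "VALUE"):
        if required not in header:
            raise ValueError(f"missing required column: {required}")
    return "\n".join(["\t".join(header)] + lines[1:]) + "\n"


if __name__ == "__main__":
    # Made-up example row, for illustration only.
    source = (
        "Probe Set Name\tSignal\tDetection\tDetection p-value\n"
        "1007_s_at\t1234.5\tP\t0.0002\n"
    )
    print(rename_headers(source), end="")
```

    Only the header row is rewritten; data rows pass through unchanged.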

    Sample descriptive information and raw data files must be provided in the next section. See the descriptive fields section above.
    Complete all fields to comply with MIAME guidelines.

    Dual channel

    A 'dual channel' Sample represents a hybridization in which cDNAs derived from two biosources are differentially labeled and hybridized to the same array. This method is commonly used for spotted cDNA microarrays with fluorescent labels. This experiment type generates gene expression measurements that are represented as normalized log ratio values.

    NOTE: To reduce costs and the number of arrays used, some researchers perform their experiments technically as dual channel (e.g., Cy3 and Cy5-labeled samples hybridized to the same array) but process the results as though they are single channel (Cy3 and Cy5 signals are treated independently; Cy3/Cy5 ratios are not calculated). In this case, it is usually more appropriate to submit your data as single channel Samples. This also enables us to better incorporate your data into GEO's data display features. If you require clarification on this matter, please do not hesitate to contact us at geo@ncbi.nlm.nih.gov.


    Dual channel Sample data table guidelines

  • A valid Sample data table file is a tab-delimited text file.
  • The Sample data table should only contain information that pertains to the quantification measurements. Repetition of any sequence information or annotation found in the referenced platform is not necessary and will be removed from your records.
  • Complete data tables must be provided. The principal reason many journals require deposit of microarray data to a public repository is so that the scientific community has the ability to comprehensively evaluate or reanalyze the dataset. It is not sufficient to present only significantly regulated genes - the appropriate place to present this information is as a Series data table.

  • Dual channel Sample data table headers and content
    The first row in the file must be a header line that identifies the content of each column. Standard, required columns are listed below. In addition to the required columns, submitters are encouraged to supply any number of auxiliary non-standard columns describing, for example, supporting measurements and calculations, quality evaluations or flags. For dual channel hybridizations, it is highly recommended to include quantification measurements for each channel. Columns may appear in any order after the ID_REF column. In this way, GEO is a flexible and "open" system, allowing you to provide all information necessary to thoroughly describe your hybridization results.

  • ID_REF: (Required) Identifier reference - references one of the unique identifiers given in the identifier (ID) column of the corresponding Platform data table.
  • VALUE: (Required) For dual channel data, this column should contain normalized log ratio data (preferably test/reference) that are comparable across rows and Samples, processed as described in any accompanying manuscript. Values that should be disregarded (e.g., background higher than count, or otherwise flagged as 'bad') may either be left blank or labeled as "null".


  • A typical dual channel Sample data table may look as follows:

     

    Validation of the data table is performed as the file is transferred when the "Next" button is selected. Validation messages are printed on the next page after a delay. The length of this delay is primarily due to file size and data transfer speed and not the validation process. Error messages will generally provide a line number and column heading where the error(s) occurred. After your data table has passed validation, you will be prompted to provide short descriptions of each column in your data table. When providing a description of your VALUE column, please supply information regarding the normalization/transformation procedure used.
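
    A local pre-check can catch the most common table problems before upload. The sketch below is an approximation based only on the rules stated in this guide (header row, ID_REF first, a VALUE column, blank or "null" for disregarded values); it is not GEO's validator, and the function name and message wording are assumptions.

```python
def check_dual_channel_table(text):
    """Return a list of problems found in a tab-delimited Sample table."""
    lines = text.splitlines()
    if not lines:
        return ["table is empty"]
    problems = []
    header = lines[0].split("\t")
    if header[0] != "ID_REF":
        problems.append("first column must be ID_REF")
    if "VALUE" not in header:
        problems.append("missing required VALUE column")
        return problems
    value_idx = header.index("VALUE")
    # Data rows start on line 2 of the file.
    for line_no, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != len(header):
            problems.append(
                f"line {line_no}: found {len(cells)} columns, expected {len(header)}"
            )
            continue
        value = cells[value_idx].strip()
        if value in ("", "null"):
            continue  # blank or "null" marks a value to be disregarded
        try:
            float(value)
        except ValueError:
            problems.append(f"line {line_no}, column VALUE: not a number: {value!r}")
    return problems
```

    An empty list means no problems were found by these checks; auxiliary columns are not inspected.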


    Dual channel Sample descriptive fields

    On the next page, you will be asked to specify a release date for your Sample record, and to enter descriptive information about the biological material. You will be asked to provide the same information as described in the descriptive fields section above, but for both channel 1 and channel 2.
    Complete all fields to comply with MIAME guidelines.

    Selecting the 'Next' button will take you to a page that allows you to review the data you provided. Use your browser's 'Back' button to go back and make corrections if necessary.

    A successful submission will return a provisional GEO Sample accession number (GSMxxx). If you have more Samples to submit, select the "Submit another" button. Some of the descriptive fields in subsequent Sample submissions will be automatically filled to facilitate fast deposit of multiple Sample records. Please ensure that these autofill fields correctly describe the biosource in each new Sample record, and edit if necessary. After you have submitted all your Sample records, please complete your GEO deposits with a Series submission; you do not need to wait for authorization of your Sample records. Do not quote GEO accession numbers in manuscripts until you have received an approval notice e-mail from GEO staff. Note that in most cases it is most appropriate to quote Series accession numbers (GSExxx), not Sample accessions, in manuscripts that describe your experiment (see linking and citing).



    SAGE

    SAGE (serial analysis of gene expression) libraries and accompanying biological material descriptive information may be submitted to GEO. SAGE data submitted to GEO are later incorporated into the SAGEmap website.

    After selecting SAGE on the new Sample submission page, you are presented with the following options:

     

  • Protocol: select the enzyme protocol employed (NlaIII, Sau3A, RsaI, or other)
  • Other Protocol: specify the enzyme protocol used if 'other' was selected above
  • Tag Length: specify the length of the tags (minus the anchor sequence)
  • Data File: provide a local file that contains your SAGE Sample data table - use the adjacent "Browse" button to specify the file location and name


    SAGE data table guidelines

  • A valid SAGE Sample data table is a tab-delimited text file.
  • SAGE data are represented by a paired list of oligomer "tags" and a measure of abundance.
  • The first row in the file must be a header line that identifies the content of each column.
  • Please provide complete data tables - do include data for tags where the count = 1.

    Standard SAGE column headers and their content are as follows:
    • TAG: (Required) oligomer tag sequence - identifies the tag sequence that is being counted. Each tag must be unique in any given data table. Include tags which have a count = 1. Do not include the anchor enzyme sequence, e.g., "CATG" for NlaIII. This header may be used only once in the table.
    • COUNT: (Required) tag count - specifies the number of times each tag is detected in that Sample. The contents of this column must be a whole number. This header may be used only once in the table.

    A typical SAGE Sample data table may look as follows:

     

    Validation of the data table is performed as the file is transferred when the "Next" button is selected. Validation messages (i.e., errors and notes) are printed on the next page after a delay. The length of this delay is primarily due to file size and data transfer speed and not the validation process. Error messages will generally provide a line number and column heading where the error(s) occurred.
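
    The SAGE table rules above (TAG and COUNT headers used once, unique tags, whole-number counts) can be checked locally before upload. This is a minimal sketch under those stated rules only, not GEO's validator; the function name, the optional tag_length argument, and the message wording are assumptions.

```python
def check_sage_table(text, tag_length=None):
    """Return a list of problems found in a tab-delimited SAGE table."""
    lines = text.splitlines()
    if not lines:
        return ["table is empty"]
    problems = []
    header = [h.strip() for h in lines[0].split("\t")]
    for name in ("TAG", "COUNT"):
        if header.count(name) != 1:
            problems.append(f"header {name} must appear exactly once")
    if problems:
        return problems
    tag_idx, count_idx = header.index("TAG"), header.index("COUNT")
    seen = set()
    for line_no, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != len(header):
            problems.append(
                f"line {line_no}: found {len(cells)} columns, expected {len(header)}"
            )
            continue
        tag = cells[tag_idx].strip().upper()
        count = cells[count_idx].strip()
        if tag in seen:
            problems.append(f"line {line_no}, column TAG: duplicate tag {tag}")
        seen.add(tag)
        # Optional check, assuming every tag has the declared Tag Length.
        if tag_length is not None and len(tag) != tag_length:
            problems.append(f"line {line_no}, column TAG: expected length {tag_length}")
        if not (count.isascii() and count.isdigit()):
            problems.append(f"line {line_no}, column COUNT: not a whole number: {count!r}")
    return problems
```

    Tags with a count of 1 pass these checks and should be kept in the table.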


    SAGE descriptive fields

    On the next page, specify a release date for your Sample record, and enter descriptive information about the biosource.

    Important:
    For all studies involving human subjects, it is the submitter's responsibility to ensure that the data and files supplied to GEO protect participant privacy in accordance with all applicable laws, regulations and institutional policies. Make sure to remove any direct personal identifiers from your submission. These identifiers are listed in http://privacyruleandresearch.nih.gov/research_repositories.asp, footnote 1.


     

    DATA RELEASE DATE: A release date must be specified for your Sample record. An upper limit of one year from the time of submission is permitted. However, if publication takes longer than anticipated, you may delay the release date either by e-mailing GEO staff at geo@ncbi.nlm.nih.gov, who will update the release date on your behalf, or by logging in to your GEO account and checking the "Update" box on the GEO Web deposit/update page.
    TITLE: Choose a short title specific to your Sample. The title must be fewer than 120 characters and unique among all your previously submitted GEO Samples. You will be prompted to edit your Sample title if a name clash occurs or if your title is too long. We suggest that you use the system [biomaterial]-[condition(s)]-[replicate number], e.g., Muscle_exercised_60min_rep2.
    TAG COUNT: Total number of tags extracted from the library. A whole, non-zero number is required. The reciprocal of this number is used for SAGE library normalization.
    SOURCE NAME: Briefly identify the biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.
    ORGANISM(S): Specify the organism from which the biological source was derived. Select one or more relevant organisms from this menu. Multiple selections can be made by holding down the Ctrl key. If your organism is not present on the list, select OTHER and provide the organism name in the Other organism(s) box below. At least one selection is required.
    OTHER ORGANISM(S): Use to specify an organism that is not listed in the Organism menu above. Multiple organism names can be provided as a comma-delimited list. Text is required if OTHER was selected from the menu above.
    CHARACTERISTICS: List all available characteristics of the biological source in 'Tag: Value' format. Include as many Characteristics lines as necessary to thoroughly describe your Sample, e.g.,
    Strain: C57BL/6
    Gender: female
    Age: 45 days
    Tissue: bladder tumor
    Tumor stage: Ta
    BIOMATERIAL PROVIDER: Specify the name of the company, laboratory or person that provided the biological material.
    TREATMENT PROTOCOL: Describe any treatments applied to the biological material prior to extract preparation.
    GROWTH PROTOCOL: Describe or reference the conditions that were used to grow or maintain organisms or cells prior to extract preparation.
    MOLECULE: Specify the type of molecule that was extracted from the biological material.
    EXTRACT PROTOCOL: Describe or reference the protocol used to isolate the extract material.
    DESCRIPTION: Include any additional information not provided in the other fields, or paste in broad descriptions that cannot be easily dissected into the other fields.
    DATA PROCESSING: Provide details of how data in the VALUE column of your table were generated and calculated.
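
    The TAG COUNT field above states that the reciprocal of the total tag count is used for SAGE library normalization. As a rough illustration of what that means, the sketch below scales each tag count by 1/total; the function name and the tags-per-million scale are assumptions for the example, not a description of GEO's or SAGEmap's actual processing.

```python
def normalize_counts(tag_counts, total_tags, scale=1_000_000):
    """Scale raw tag counts by the reciprocal of the library's total tag count.

    The default scale (tags per million) is an assumption for illustration.
    """
    if not isinstance(total_tags, int) or total_tags <= 0:
        raise ValueError("TAG COUNT must be a whole, non-zero number")
    factor = scale / total_tags
    return {tag: count * factor for tag, count in tag_counts.items()}
```

    For example, with a library total of 50,000 tags, a tag counted 5 times is scaled to 100 tags per million, which makes libraries of different sizes comparable.
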
    Selecting the 'Next' button will take you to a page that allows you to review the data you provided. Use your browser's 'Back' button to go back and make corrections if necessary.

    A successful submission will return a provisional GEO Sample accession number (GSMxxx). If you have more Samples to submit, select the "Submit another" button. Some of the descriptive fields in subsequent Sample submissions will be automatically filled to facilitate fast deposit of multiple Sample records. Please ensure that these autofill fields correctly describe the biosource in each new Sample record, and edit if necessary. After you have submitted all your Sample records, please complete your GEO deposits with a Series submission; you do not need to wait for authorization of your Sample records. Do not quote GEO accession numbers in manuscripts until you have received an approval notice e-mail from GEO staff. Note that in most cases it is most appropriate to quote Series accession numbers (GSExxx), not Sample accessions, in manuscripts that describe your experiment (see linking and citing).


    Submit Series

    All submitters are required to supply a Series record that links together a group of related Samples and provides a focal point and description of the study as a whole. All Samples you submit must be incorporated into a Series record. However, please do not create multiple small Series records for one study - you can provide a breakdown of how the Samples in your Series are grouped and classified by defining Series subsets. In a Series record you can describe the overall experimental aim, design, and conclusions, and Samples may be grouped according to experimental variables, e.g., age, time points, tissues, etc., and repeat types. Summary tables of significant genes or analyses may also be presented. All data associated with your study may be accessed from your Series record, so it is usually appropriate and sufficient to quote a Series accession number in papers that discuss your study.

    To begin Series submission, go to the Web deposit page, check the 'Series' box, and select 'NEW'.

     

    On the next page, all your previously-submitted Sample records are listed. Use the adjacent checkboxes to select which Samples you want to include in your Series. See the "Third-party reanalysis" section below if you need to create a Series based on Samples submitted by another user.

     

    On the next page, specify a release date for your Series record, and enter descriptive information about the overall study.

     

    DATA RELEASE DATE: A release date must be specified for your Series record. An upper limit of one year from the time of submission is permitted. However, if publication takes longer than anticipated, you may delay the release date either by e-mailing GEO staff at geo@ncbi.nlm.nih.gov, who will update the release date on your behalf, or by logging in to your GEO account and checking the "Update" box on the GEO Web deposit/update page.
    TITLE: Provide a succinct title that describes your overall experiment. The title must be fewer than 120 characters and unique among all your previously submitted GEO Series. You will be prompted to edit your Series title if a name clash occurs or if your title is too long.
    PUBMED ID: A PubMed ID (PMID) references a publication that describes this study. A PMID is a numeric value that you may obtain from the PubMed record. It is likely that the PMID can only be provided at a later date, once your study has been published. To add a PMID to an existing Series record, either e-mail GEO staff at geo@ncbi.nlm.nih.gov, who will update your record with PMID links accordingly, or log in to your GEO account and check the "Update" box on the GEO Web deposit/update page.
    WEB LINK: Provide a World Wide Web link to reference a Web page that further describes this study.
    SUMMARY: Describe your study in as much detail as you want. Use this field to provide information on the experimental aim, design, background, and conclusions. The abstract from the associated publication may be suitable. Include any information not captured in any other field.
    OVERALL DESIGN: Provide a description of the experimental design. Indicate how many Samples are analyzed, whether replicates are included, and whether there are control and/or reference Samples, dye-swaps, etc.
    CONTRIBUTORS: List all people associated with this study.
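
    The title and release-date limits described in the fields above can be checked before you start the form. The following is a minimal sketch; the function name is hypothetical, and treating "one year" as 365 days is an assumption made to keep the example short.

```python
from datetime import date, timedelta

MAX_TITLE_LENGTH = 120  # titles must be shorter than this


def check_series_fields(title, release_date, submitted_on, existing_titles=()):
    """Return a list of problems with a Series title and release date."""
    problems = []
    if not title.strip():
        problems.append("title is empty")
    if len(title) >= MAX_TITLE_LENGTH:
        problems.append("title must be less than 120 characters")
    if title in existing_titles:
        problems.append("title clashes with a previously submitted Series")
    # One year is approximated as 365 days here (an assumption).
    if release_date > submitted_on + timedelta(days=365):
        problems.append("release date is more than one year after submission")
    return problems
```

    The same checks apply to Sample titles if existing_titles holds your previously submitted Sample titles instead.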

    Define Series subsets and repeats
    Experimental design can be defined by listing and describing the variables under examination in your study. Use this section to provide a breakdown of how the Samples in your Series are grouped and classified. The subset information you supply will not appear on your Series record, but will be used by GEO staff in the creation of GEO DataSet records. A DataSet record (for example see GDS255) represents a collection of biologically comparable GEO samples and forms the basis of GEO's data display and analysis tools.

    Since many experiments investigate more than one variable, this system enables presentation of multiple and overlapping subset types.

    STEP 1

    From the pull-down menu under 'Variable' select the most appropriate term for the first subset variable.

    STEP 2

    In the adjacent 'Description' box, supply a brief text description defining that subset variable.

    STEP 3

    Enter the GSM accession numbers of the Samples that fall into that group, either as a comma-separated list or as a range, e.g., 4702-4706. For ease of entry, hold down the 'Ctrl' key and click on all the relevant GSM accessions from the Sample list provided on that page. Next, paste that list into the relevant subset box.

    STEP 4

    Repeat for all other subsets.

    In the example depicted below, two factors are investigated - cocaine administration and brain tissue region. Thus, two subset 'Variables' are represented - agent (cocaine or saline) and tissue (amygdala, caudate putamen, nucleus accumbens, prefrontal cortex or ventral tegmental area). Since Sample GSM4696 represents cocaine-treated amygdala, '4696' is entered into two subsets, agent:cocaine and tissue:amygdala. It is possible for a subset to consist of only one sample.
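
    To double-check subset entries before pasting them into the form, the comma and range notation from STEP 3 can be expanded into explicit accessions. This is a minimal sketch with hypothetical helper names; accepting an optional "GSM" prefix is a convenience of this example, not a statement about what the Web form accepts.

```python
def _number(token):
    """Parse '4696' or 'GSM4696' into the integer 4696."""
    token = token.strip().upper()
    if token.startswith("GSM"):
        token = token[3:]
    if not token.isdigit():
        raise ValueError(f"not a GSM accession: {token!r}")
    return int(token)


def expand_accessions(entry):
    """Expand an entry such as '4696, 4702-4706' into GSM accessions."""
    numbers = []
    for part in entry.split(","):
        if not part.strip():
            continue
        if "-" in part:
            first, last = (_number(t) for t in part.split("-", 1))
            if last < first:
                raise ValueError(f"range runs backwards: {part.strip()!r}")
            numbers.extend(range(first, last + 1))
        else:
            numbers.append(_number(part))
    return [f"GSM{n}" for n in numbers]


def subsets_containing(sample, subsets):
    """List the (variable, description) subsets whose entry includes a Sample."""
    return [key for key, entry in subsets.items() if sample in expand_accessions(entry)]
```

    With subsets = {("agent", "cocaine"): "4696", ("tissue", "amygdala"): "4696"}, calling subsets_containing("GSM4696", subsets) returns both keys, matching the overlapping subsets described above.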

    In the 'repeats' section, you have the opportunity to define the type of replicates performed in your study.
    • biological replicate: Samples derived from independent biosources that represent the same biological condition
    • technical replicate - extract: Samples derived from the same RNA extraction (taken from a single biosource)
    • technical replicate - labeled-extract: Samples derived from the same labeled-extract (taken from a single RNA extraction)

     


    Series data tables
    Series records may include supplementary data table(s) listing, for example, significantly regulated genes, RT-PCR validation results, or containing upper-level analysis data. The Web submission pages do not currently allow for provision of Series tables, so please e-mail such table(s) to GEO staff at geo@ncbi.nlm.nih.gov along with a table title, and we will attach it to your Series record on your behalf. There is no standard format for Series tables other than they have to be text, tab-delimited files, and the first row must be a header line that identifies the content of each column.

    Third party reanalysis Series
    If you have performed reanalysis of Sample data that were submitted to GEO by researchers other than yourself, you can create a Series record that links these Samples; it is not necessary for you to resubmit the Sample data. Such third party reanalysis Series records can only be accepted if all the referenced data (including raw data, like .CEL files) are available in GEO, and the intent is to publish the findings and Series accession number in a manuscript that describes the reanalysis. At this time, the Web submission pages are not designed for submission of third party reanalysis Series, so you must either create a SOFT-formatted file (see example) and submit using the Direct Deposit slip, or contact GEO staff (geo@ncbi.nlm.nih.gov) and they will be happy to assist with submission of your third-party reanalysis Series.


    A successful submission will return a provisional GEO Series accession number (GSExxx).

     


    At this stage, GEO staff assume that you have completed depositing your data, and will begin processing your submissions. If format or content problems are identified, you will be contacted by e-mail explaining how to address the issue(s). Once your records pass review, you will receive an e-mail confirming your GEO accession numbers and their release dates. Processing normally takes approximately 2-5 business days after completion of submission. If you need approval of your GEO accession numbers to be expedited, please e-mail us at geo@ncbi.nlm.nih.gov. Do not quote GEO accession numbers in manuscripts until you have received an approval notice e-mail from GEO staff. Note that in most cases it is most appropriate to quote Series accession numbers (GSExxx) in manuscripts that describe your experiment (see linking and citing).



    Submit updates

    You may perform updates and edits at any time to any previous submission or your contact information using the 'Update' section on the Web deposit/update page or using the 'Update' button at the head of your GEO records. This procedure leads you through a series of Web forms much like those you used through the initial submission process, enabling you to make edits to any section of your submissions. Updates will be reflected immediately on your GEO records. If global edits are required for multiple records, for example, bringing forward the release date or editing a data table header, simply e-mail the details to us at geo@ncbi.nlm.nih.gov, and a batch edit will be processed on your behalf.


    Notes for Microsoft Excel users

    Many researchers manage their high-throughput data using Microsoft Excel spreadsheets. The following notes are provided to help submitters format their data correctly using Excel, and to draw attention to common Excel-related problems.

    • GEO accepts tab-delimited text files. To generate a tab-delimited text file from an Excel spreadsheet, save your spreadsheet using File -> Save as -> Type: Text(Tab delimited)(*.txt).
    • When an Excel spreadsheet is saved as text, unwanted double quotation characters (") are often inserted into the text file (Excel does this so that text enclosed within the quotation marks is maintained in a single cell when the file is reopened using Excel). Depending on where they occur, these characters can interfere with GEO's validation processes. To remove these characters from the text file, it is necessary to open the file using a text editor such as WordPad, and use the Edit feature to globally replace unwanted quotation characters.
    • When saving a spreadsheet as text, Excel often does not insert tab character(s) in rows where the last column(s) are empty. This can lead to submission difficulties since GEO's validation system recognizes the table as being incomplete and returns an error report, e.g., "Too few columns in the row: found 4, expected 8". To circumvent this problem, select the whole worksheet and Format -> Cells -> Number -> Text (the default is "General"). When the spreadsheet is saved as a text tab-delimited file, all the expected tab characters will be present. Alternatively, a 'dummy string' can be entered into all blank cells of the final column in Excel, and later removed from the text file.
    • Be aware that Excel may automatically apply irreversible formatting to your data. According to Microsoft support:
      - If a number contains a slash mark (/) or hyphen (-), it may be converted to a date format.
      - If a number contains a colon (:), or is followed by a space and the letter A or P, it may be converted to a time format.
      - If a number contains the letter E (in uppercase or lowercase letters; for example, 10e5), or the number contains more characters than can be displayed based on the column width and font, the number may be converted to scientific notation, or exponential, format.
      - If a number contains leading zeros, the leading zeros are dropped.
      Certain clone identifiers, gene names, and plate coordinates are particularly susceptible to this phenomenon. To avoid the problem, make sure to first select the whole spreadsheet and Format -> Cells -> Number -> Text when pasting data into Excel (the default is "General"). For more information, see http://www.biomedcentral.com/1471-2105/5/80.
    • If you Format -> Cells -> Number -> Text as described above, very long data strings (e.g., sequence data) may be converted to hash (#) characters. If this occurs, it is necessary to switch these cells back to "General" format.
    • Some versions of Excel will not recognize or open a text file that has "ID" as the first text in the first column/row (get error "SYLK: file format is not valid"). To reopen such a text file in Excel, it is necessary to open the file in a text editor such as WordPad and edit "ID" to something else.
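
    Two of the problems listed above, stray double quotation characters and missing trailing tabs, can also be repaired with a short script instead of a text editor. This is a minimal sketch with a hypothetical function name; it removes every double quote in the file, so do not use it if your data legitimately contain that character.

```python
def clean_excel_export(text):
    """Strip double quotes and pad short rows in tab-delimited text."""
    lines = text.replace('"', "").splitlines()
    if not lines:
        return ""
    # The header row defines how many columns every row should have.
    width = len(lines[0].split("\t"))
    cleaned = []
    for line in lines:
        cells = line.split("\t")
        cells += [""] * (width - len(cells))  # restore missing trailing tabs
        cleaned.append("\t".join(cells))
    return "\n".join(cleaned) + "\n"
```

    Rows that are already complete are left unchanged, and rows with more columns than the header are not truncated, so genuine layout errors are still reported by validation.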



    Future plans

    We appreciate your patience in reading this document. Given the great variability and complexity of high-throughput molecular abundance data, we have attempted to design user-friendly submission procedures and interfaces. These procedures and interfaces continue to be developed. If you have any questions, suggestions, or concerns about GEO or the submission process, please do not hesitate to contact us at geo@ncbi.nlm.nih.gov.



