Trace Archive Frequently Asked Questions

  1. Are there other trace repositories?
  2. How can I obtain a trace for one or just a few sequences?
  3. What files are available on the ftp site?
  4. How often is the ftp site updated?
  5. Can I view the chromatogram on your web site?
  6. What is a mate-pair?
  7. How can I download large data sets?
  8. How can I obtain and run the NCBI Java Chromatogram viewer?
  9. What is the RCF format?
  10. Where can I find field requirements for submitting data?
  11. What are the available common fields in the submission file?
  12. Can submitted data be kept confidential in advance of publication?
  1. Are there other trace repositories?

    Yes. The Trace Archive and Ensembl Trace Server are collaborating to store all of the traces. The servers make an effort to synchronize their data, but there are times when a trace may exist in one archive and not in the other.

  2. How can I obtain a trace for one or just a few sequences?

    The actual trace file for a particular sequence can be obtained from the web site. When you have retrieved the reads in which you are interested, use the check boxes at the top to select the information you wish to save, and then click the "Save" button.

  3. What files are available on the ftp site?

    fasta.organism.XXX.gz: these are FASTA format files.

    qual.organism.XXX.gz: these are quality scores, in FASTA format.

    clip.organism.XXX.gz: these are quality clip values for a read as provided by the sequencing center. If the center provided no clip values, then the start and stop of the sequence are used.

    anc.organism.XXX.gz: these are tab-delimited files containing the ancillary information for each read. Note: the tabular form of the ancillary file has been dropped because of the increased complexity and multiple self-dependencies of the ancillary data.

    xml.organism.XXX.gz: these are ancillary data provided in xml format.
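
    As a quick sanity check after downloading, the number of reads in one of the FASTA-format files can be counted without unpacking it to disk. The following is a minimal sketch; the function name count_fasta_reads is hypothetical and not part of any NCBI tool, and it assumes that each read starts with a ">" header line, as is conventional for FASTA.

```shell
# Count the reads in a gzip-compressed FASTA file by counting header lines.
# Usage: count_fasta_reads FILE.gz
# (count_fasta_reads is a hypothetical helper, not an NCBI-provided command.)
count_fasta_reads() {
    # gzip -dc streams the decompressed data to stdout; no file is written.
    # grep -c exits non-zero when the count is 0, hence "|| true".
    gzip -dc "$1" | grep -c '^>' || true
}
```

    For example, count_fasta_reads fasta.organism.XXX.gz prints the number of header lines in that file. Because the qual files are also in FASTA format, the same check should apply to them.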

  4. How often is the ftp site updated?

    A full dump of the data in the Trace Archive is placed on the ftp site weekly. Updates are made to the ftp site daily, if the database has been updated.

  5. Can I view the chromatogram on your web site?

    Yes. On the results page, select "Trace" from the "Show" pull down menu.

  6. What is a mate-pair?

    Most of the sequence in the Trace Archive is derived from Whole Genome Shotgun (WGS) sequencing. WGS involves generating libraries of discrete size and sequencing both ends of the clones in the library. Sequences derived from the two ends of the same clone are called mate-pairs. If the average insert size of the library is known, this information can be used to infer the distance between the two reads of a mate-pair.

  7. How can I download large data sets?

    The number of records that can be obtained in a single request is limited; currently the limit is 40,000. To download more records, you need to place several requests. Although it is generally possible to download all needed data with a browser, the best approach is to use our Perl script query_tracedb. After copying the script, don't forget to make it executable. Every record in the archive is assigned a unique identifier, the TI, so you first need to obtain all identifiers that match your query. Using these identifiers, you can then retrieve the actual data. Here is how this works on a real example (please note that this page is static, and the numbers shown in the example may not reflect the current status of the archive):

    1. The first step is to count all available records:
      query_tracedb "query count species_code='AEDES AEGYPTI'"
      122116
    2. Since 122116 / 40000 is slightly more than 3, retrieving all records takes 4 requests (pages 0 through 3), so let's obtain the identifiers. Please note that the identifiers are in network (BIG ENDIAN) format:
      query_tracedb "query page_size 40000 page_number 0 binary species_code='AEDES AEGYPTI'" > page1.bin
      query_tracedb "query page_size 40000 page_number 1 binary species_code='AEDES AEGYPTI'" > page2.bin
      ...
      query_tracedb "query page_size 40000 page_number 3 binary species_code='AEDES AEGYPTI'" > page4.bin
    3. You can now retrieve the data in the submission form (tarball):
      (echo -n "retrieve_tgz all 0b"; cat page1.bin) | query_tracedb > data1.tgz
      ...
      (echo -n "retrieve_tgz all 0b"; cat page4.bin) | query_tracedb > data4.tgz
      The above will retrieve all files from the archive: fasta, quality scores, chromatograms in scf format, mate_pairs, and ancillary files.
    4. Note: steps 2 and 3 can be done at the same time:
      (echo -n "retrieve_tgz all 0b"; query_tracedb "query page_size 40000 page_number 0 binary species_code='AEDES AEGYPTI'") | query_tracedb > data1.tgz
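
    The page arithmetic in the steps above can be sketched in shell. This is only an illustration: the function names pages_needed and print_page_commands are hypothetical, and the second function merely prints one command per page in the combined form from step 4 so that the commands can be reviewed before anything is run.

```shell
# Number of requests needed to fetch TOTAL records at PAGE_SIZE records per
# request (integer ceiling division).
# Usage: pages_needed TOTAL PAGE_SIZE
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Print (do not run) one retrieval command per page, in the combined form
# shown in step 4. Both function names are hypothetical helpers.
# Usage: print_page_commands TOTAL PAGE_SIZE QUERY
print_page_commands() {
    total=$1
    page_size=$2
    query=$3
    pages=$(pages_needed "$total" "$page_size")
    page=0
    while [ "$page" -lt "$pages" ]; do
        n=$((page + 1))   # output files are numbered from 1, pages from 0
        printf '(echo -n "retrieve_tgz all 0b"; query_tracedb "query page_size %s page_number %s binary %s") | query_tracedb > data%s.tgz\n' \
            "$page_size" "$page" "$query" "$n"
        page=$((page + 1))
    done
}
```

    With the count from step 1, pages_needed 122116 40000 prints 4, and print_page_commands 122116 40000 "species_code='AEDES AEGYPTI'" prints four commands, the first of which is the command shown in step 4.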

    For more information, run 'query_tracedb help' for available data formats and 'query_tracedb usage' for usage examples.

    If you need to save only TI numbers for future reference, you might want to obtain them in text form:

    query_tracedb "query page_size 40000 page_number 0 text species_code='AEDES AEGYPTI'" > page1.txt
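
    If you already have a binary page file and want to inspect it, the big-endian identifiers can be decoded with standard tools. The sketch below is an assumption-laden illustration: this page does not state how many bytes each identifier occupies, so the width is a parameter that you would need to confirm (for example via 'query_tracedb help'), and decode_be_ints is a hypothetical helper name.

```shell
# Decode a file of fixed-width big-endian unsigned integers, one per line.
# Usage: decode_be_ints WIDTH_IN_BYTES FILE
# WIDTH_IN_BYTES is an assumption to verify; this page does not specify it.
decode_be_ints() {
    width=$1
    file=$2
    # od prints every byte as an unsigned decimal number; awk then combines
    # each group of WIDTH bytes, most significant byte first.
    od -An -v -tu1 "$file" | awk -v w="$width" '
        {
            for (i = 1; i <= NF; i++) {
                value = value * 256 + $i
                if (++count == w) {
                    printf "%.0f\n", value
                    value = 0
                    count = 0
                }
            }
        }'
}
```

    awk stores numbers as double-precision floats, so this sketch is exact only for values below 2^53. Trailing bytes that do not fill a whole group are silently ignored, which can hide a wrong width guess.
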
  8. How can I obtain and run the NCBI Java Chromatogram viewer?

    The package is now available for download from the public ftp site as Java Applet. It consists of a ready-to-use compiled Java application and the actual sources of the viewer.

    To run the standalone Java application, you need a Java runtime, version 1.8 or higher, accessible from your computer. Then all you have to do is pick a TI of interest and supply it as a parameter to the application:

    java -jar trace.jar trace=TI
  9. What is the RCF format?

    RCF stands for Relieved Compress Format and represents the data exactly as it resides on the server. To minimize disk space usage as well as computation time, it was decided after thorough tests that the originally supplied data would be reprocessed and recompressed on the fly during the data loading process. Thus all chromatograms are kept in a proprietary format called RCF. RCF is a combination of two simple computation algorithms, derivation and Huffman encoding, which yield significant data compression while remaining simple and not requiring much computation power.

    Downloading data in RCF format typically takes much less time because the files are smaller. The data can then be converted into SCF format locally. We strongly encourage you to do this, since it relieves pressure on the server while also saving you some waiting time. The converter can be obtained from the public ftp site: rcf2scf
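
    To give a feel for why such a scheme compresses well, here is a small sketch of difference (delta) coding. It rests on an assumption: that "derivation" means storing the differences between successive sample values, which this page does not confirm. It is not the RCF implementation, it omits the Huffman stage entirely, and the function names are hypothetical. Smoothly varying signals produce many small, repeated differences, which an entropy coder such as Huffman coding can then represent with short codes.

```shell
# Illustration only: NOT the RCF format or the rcf2scf converter.
# Input and output are one integer per line.

# Delta-encode: print the first value, then each value minus its predecessor.
delta_encode() {
    awk 'NR == 1 { print $1; prev = $1; next } { print $1 - prev; prev = $1 }'
}

# Delta-decode: a running sum restores the original values.
delta_decode() {
    awk '{ sum += $1; print sum }'
}
```

    For example, the sample values 100, 102, 105, 105, 101 encode to 100, 2, 3, 0, -4, and piping that result through delta_decode restores the original sequence.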

  10. Where can I find field requirements for submitting data?

    Check the requirements in the Validation Table (Excel format) for specific combinations of STRATEGY and TRACE_TYPE_CODE.

  11. What are the available common fields in the submission file?

    See the list of common fields here

  12. Can submitted data be kept confidential in advance of publication?

    If you need this feature, please contact us before loading (trace@ncbi.nlm.nih.gov). As soon as data have been loaded, they become public.