Send to

Choose Destination
IEEE/ACM Trans Comput Biol Bioinform. 2017 Nov-Dec;14(6):1228-1236. doi: 10.1109/TCBB.2016.2568186. Epub 2016 May 16.

Benchmark Dataset for Whole Genome Sequence Compression.


The research in DNA data compression lacks a standard dataset to test out compression tools specific to DNA. This paper argues that the current state of achievement in DNA compression is unable to be benchmarked in the absence of such scientifically compiled whole genome sequence dataset and proposes a benchmark dataset using multistage sampling procedure. Considering the genome sequence of organisms available in the National Centre for Biotechnology and Information (NCBI) as the universe, the proposed dataset selects 1,105 prokaryotes, 200 plasmids, 164 viruses, and 65 eukaryotes. This paper reports the results of using three established tools on the newly compiled dataset and show that their strength and weakness are evident only with a comparison based on the scientifically compiled benchmark dataset.


The sample dataset and the respective links are available @

[Indexed for MEDLINE]

Supplemental Content

Full text links

Icon for IEEE Engineering in Medicine and Biology Society
Loading ...
Support Center