atSNPInfrastructure, a case study for searching billions of records while providing significant cost savings over cloud providers

IEEE Int Symp Parallel Distrib Process Workshops Phd Forum. 2018 May:2018:497-506. doi: 10.1109/IPDPSW.2018.00086. Epub 2018 Aug 6.

Abstract

We explore the feasibility of a database storage engine housing up to 307 billion genetic Single Nucleotide Polymorphisms (SNPs) for online access. We evaluate database storage engines and implement a solution, weighing factors such as dataset size, information gain, cost, and hardware constraints. Our solution provides a full-featured functional model for scalable storage and queryability for researchers exploring the SNPs in the human genome. We address the scalability problem by building physical infrastructure and comparing the final costs to those of a major cloud provider.

Keywords: Big Data; Billion Records; Cassandra; Data Reduction; Distributed Computing; Economical Computing; Edge Computing; Elasticsearch; Genomics; MySQL; NoSQL; PWM; SNP.