Database (Oxford). 2014 Sep 29;2014. pii: bau093. doi: 10.1093/database/bau093. Print 2014.

The Cancer Genomics Hub (CGHub): overcoming cancer through the power of torrential data.

Author information

Biomolecular Engineering, School of Engineering, University of California Santa Cruz, Santa Cruz, CA, USA; Center for Biomolecular Science and Engineering, School of Engineering, University of California Santa Cruz (UCSC), Santa Cruz, CA 95064, USA; Annai Systems Inc., 2100 Palomar Airport Road, Suite 210, Carlsbad, CA 92011, USA; Cardinal Peak, LLC, 1380 Forest Park Circle, Suite 202, Lafayette, CO 80026, USA; and Information Technology Services, University of California Santa Cruz (UCSC), Santa Cruz, CA 95064, USA. Email: cwilks@soe.ucsc.edu

Abstract

The Cancer Genomics Hub (CGHub) is the online repository of the sequencing programs of the National Cancer Institute (NCI), including The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE) and the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) projects, with data from 25 different types of cancer. CGHub currently contains >1.4 PB of data, has grown at an average rate of 50 TB a month and serves >100 TB per week. The architecture of CGHub is designed to do three things: support bulk searching and downloading through a Web-accessible application programming interface, enforce patient genome confidentiality in data storage and transmission, and optimize for efficiency in access and transfer. In this article, we describe the design of these three components, present performance results for our transfer protocol, GeneTorrent, and report on the growth of the system in terms of data stored and transferred, including estimated limits on the current architecture. Our experience-based estimates suggest that centralizing storage and computational resources is more efficient than wide distribution across many satellite labs.

Database URL: https://cghub.ucsc.edu.

PMID: 25267794
PMCID: PMC4178372
DOI: 10.1093/database/bau093
[Indexed for MEDLINE]
Free PMC Article
