Natl Sci Rev. 2014 Jun;1(2):293-314.

Challenges of Big Data Analysis.

Author information

1. Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; jqfan@princeton.edu.
2. Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205, USA; fhan@jhsph.edu.
3. Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544, USA; hanliu@princeton.edu.

Abstract

Big Data bring new opportunities to modern society and new challenges to data scientists. On one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities, which is not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This article gives an overview of the salient features of Big Data and of how these features drive paradigm changes in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated because of incidental endogeneity. When these assumptions fail, the result can be wrong statistical inferences and, consequently, wrong scientific conclusions.
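Spurious correlation, one of the challenges named above, can be illustrated with a small simulation. The following Python sketch is not taken from the article; the function name `max_abs_spurious_corr`, the sample size, and the dimensions are illustrative assumptions. It draws a response and p predictors that are all mutually independent, then reports the largest absolute sample correlation between the response and any single predictor. With the sample size held fixed, this maximum tends to grow as p grows, even though no true relationship exists.

```python
import numpy as np


def max_abs_spurious_corr(n, p, seed=0):
    """Largest absolute sample correlation between an independent
    response and p independent predictors (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # predictors, independent of y
    y = rng.standard_normal(n)        # response, independent of X

    # Center the columns and the response, then compute Pearson correlations.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.max(np.abs(corr)))


if __name__ == "__main__":
    # Illustrative settings only: small sample, increasing dimensionality.
    for p in (10, 1000, 10000):
        print(p, round(max_abs_spurious_corr(n=60, p=p), 3))
```

A variable selected because it has the highest sample correlation with the response may therefore be pure noise when the dimensionality is large relative to the sample size.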

KEYWORDS:

Big Data; data storage; high dimensional data; incidental endogeneity; large-scale optimization; massive data; massively parallel data processing; noise accumulation; random projection; scalability; spurious correlation
