Studies in the use of data mining, prediction algorithms, and a universal exchange and inference language in the analysis of socioeconomic health data

Comput Biol Med. 2019 Sep:112:103369. doi: 10.1016/j.compbiomed.2019.103369. Epub 2019 Jul 25.

Abstract

While clinical and biomedical information in digital form has been escalating, it is socioeconomic factors that are important determinants of health on the national and global scale. We show how the collective use of data mining and prediction algorithms to analyze socioeconomic population health data can stand beside classical correlation analysis in routine data analysis. The underlying theoretical basis is the Dirac notation and algebra, which is a scientific standard but unusual outside the physical sciences, combined with a theory of expected information first developed for analyzing sparse data but still largely confined to bioinformatics. The latter was important here because the records analyzed (which are for US counties and equivalents, not patients) are very few by contemporary data mining standards. The approach is very unlikely to be familiar to socioeconomic researchers, so the theory and the advantages of our inference nets over the Bayes Net are reviewed here, mostly using socioeconomic examples. While our expertise and focus are in novel analytical methods rather than socioeconomics per se, a significant negative (countertrending) relationship between population health and equity was initially surprising, at least to the present authors. This encouraged deeper exploration, including of the relationship between our data mining methods and traditional Pearson's correlation. The latter can give wrong conclusions when Simpson's paradox applies, that is, when a trend seen within subgroups reverses or vanishes once the subgroups are pooled, so this is also investigated. Also discussed is that, even for very few records, associative data mining can still demand significant computational resources because of a combinatorial explosion.
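The susceptibility of Pearson's correlation to Simpson's paradox can be illustrated with a minimal sketch. The numbers below are invented purely for illustration and are not taken from the county data analyzed in the paper; the names `pearson`, `group_a`, and `group_b` are hypothetical. Within each group the two variables rise together, yet pooling the groups yields a negative coefficient.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented (x, y) pairs for two hypothetical subgroups of records.
# Inside each subgroup, y increases with x.
group_a = [(1, 10), (2, 11), (3, 12)]
group_b = [(6, 3), (7, 4), (8, 5)]


def split(pairs):
    """Turn a list of (x, y) pairs into separate x and y lists."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return xs, ys


r_a = pearson(*split(group_a))               # positive within group A
r_b = pearson(*split(group_b))               # positive within group B
r_pooled = pearson(*split(group_a + group_b))  # negative once pooled

print(f"group A: {r_a:+.3f}")
print(f"group B: {r_b:+.3f}")
print(f"pooled:  {r_pooled:+.3f}")
```

In this toy case the sign flips because the grouping variable, which the pooled calculation ignores, shifts both x and y between the groups. Any analysis that reports only the pooled coefficient would suggest the opposite trend from the one present in every subgroup.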

Keywords: Bayes net; Data analytics; Data mining; Decision support; Hyperbolic Dirac net; Inference net; Population health; Socioeconomic; Sparse data.

MeSH terms

  • Algorithms*
  • Data Mining*
  • Humans
  • Language*
  • Socioeconomic Factors