Clinical and pharmacogenomic data mining: 4. The FANO program and command set as an example of tools for biomedical discovery and evidence based medicine

Barry Robson

doi:10.1021/pr800204f

J Proteome Res. 2008 Sep;7(9):3922-47. doi: 10.1021/pr800204f. Epub 2008 Aug 13.

Author

Barry Robson¹

Affiliation

¹ IBM Global Pharmaceutical and Life Sciences, Somers, NY 10589, USA.

PMID: 18698807
DOI: 10.1021/pr800204f

Abstract

This paper describes the culmination of the methodology explored and developed in the preceding three papers, in terms of the FANO program (also known as CliniMiner) and specifically its contemporary command set for data mining. It gives a more detailed account of how strategies were implemented in applications described elsewhere, in the previous papers in the series, and in a paper on the analysis of 667 000 patient records. Although it is not customary to think of a command set as the output of research, it represents the elements and strategies for mining biomedical and clinical data with many parameters, that is, in a high-dimensional space that requires skilful navigation. The intent is not to promote FANO per se, but to report its science and methodologies. Typical rules from traditional data mining are that A and B and C associate, or IF A & B THEN C. Clinical data need rules of much higher complexity, especially when proteomics and genomics are included. FANO's specific goal is to extract routinely, from clinical record repositories and other data, not only the complex rules required for biomedical research and the clinical practice of evidence-based medicine, but also a quantification of their uncertainty, that is, their essentially probabilistic nature. The underlying information-theoretic and number-theoretic basis described previously is less of an issue here, being "under the hood", although the fundamental role of the Incomplete (generalized) Riemann Zeta Function as a general surprise measure is highlighted, along with its covariance or multivariance analogue, as it appears to be a unique and powerful feature. Another characteristic described is the very general tactic of the metadata operator ':='. It allows decomposition of diverse data types such as trees, spreadsheets, biosequences, sets of objects, amorphous data collections with repeating items, XML structures, and so forth into universally atomic data items with or without metadata, and assists in reconstructing ontology from the associations and numerical correlations so mined.
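
The two ideas named in the abstract, decomposition into atomic 'metadata:=value' items and a zeta-function surprise measure, can be illustrated with a minimal Python sketch. It is not FANO code and does not reproduce the FANO command set. The function names `atomize`, `partial_zeta`, and `pair_surprise`, the dotted path used for nested metadata, and the example records are all hypothetical. The score used here, the finite sum 1 + 2^-s + ... + n^-s evaluated at the observed count minus the same sum at the expected count, is an assumed form of the measure, and rounding the expected count to an integer is a further simplification.

```python
from collections import Counter
from itertools import combinations


def atomize(record, prefix=""):
    """Flatten a nested dict/list record into 'metadata:=value' strings.

    Hypothetical helper: nested keys are joined with '.', which is an
    illustrative choice and not taken from FANO.
    """
    items = []
    if isinstance(record, dict):
        for key, value in record.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            items.extend(atomize(value, path))
    elif isinstance(record, (list, tuple, set)):
        # Repeating items share the metadata of their container.
        for value in record:
            items.extend(atomize(value, prefix))
    else:
        # Atomic item, with metadata if any is available.
        items.append(f"{prefix}:={record}" if prefix else str(record))
    return items


def partial_zeta(n, s=1.0):
    """Finite sum 1 + 2**-s + ... + n**-s for a non-negative integer n."""
    return sum(k ** -s for k in range(1, n + 1))


def pair_surprise(records, s=1.0):
    """Score each co-occurring item pair as zeta(s, observed) - zeta(s, expected).

    Assumption: the expected count is the independence estimate
    count(a) * count(b) / N, rounded to an integer.
    """
    item_sets = [set(atomize(r)) for r in records]
    n = len(item_sets)
    singles = Counter(item for items in item_sets for item in items)
    pairs = Counter(
        pair for items in item_sets for pair in combinations(sorted(items), 2)
    )
    scores = {}
    for (a, b), observed in pairs.items():
        expected = round(singles[a] * singles[b] / n)
        scores[(a, b)] = partial_zeta(observed, s) - partial_zeta(expected, s)
    return scores
```

Calling `pair_surprise` on a list of dictionaries returns a mapping from sorted item pairs to scores. A positive score means the pair occurs more often than the independence estimate, and a negative score means it occurs less often. Pairs are keyed in sorted order, so ("Dx:=COPD", "Smoker:=yes") is a valid key while the reversed tuple is not.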

MeSH terms

Evidence-Based Medicine*
Information Storage and Retrieval*
Multivariate Analysis
Pharmacogenetics*