BMC Bioinformatics. 2005; 6: 86.

# HDBStat!: A platform-independent software suite for statistical analysis of high dimensional biology data

Prinal Trivedi,^{1} Jode W Edwards,^{1,3} Jelai Wang,^{1} Gary L Gadbury,^{1,2} Vinodh Srinivasasainagendra,^{1} Stanislav O Zakharkin,^{1} Kyoungmi Kim,^{1} Tapan Mehta,^{1} Jacob PL Brand,^{1,4} Amit Patki,^{1} Grier P Page,^{1} and David B Allison^{1,*}

^{1}Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA

^{2}Department of Mathematics and Statistics, University of Missouri-Rolla, Rolla, MO 65409, USA

^{3}Department of Agronomy, Iowa State University, Ames, IA 50011, USA

^{4}Pennington Biomedical Research Center, 6400 Perkins Rd., Baton Rouge, LA 70808, USA

^{*}Corresponding author.

Prinal Trivedi: PatelHP/at/uab.edu; Jode W Edwards: Jode/at/iastate.edu; Jelai Wang: JelaiW/at/uab.edu; Gary L Gadbury: GadburyG/at/umr.edu; Vinodh Srinivasasainagendra: Vinodh/at/uab.edu; Stanislav O Zakharkin: Stas/at/uab.edu; Kyoungmi Kim: Kyoungmi/at/uab.edu; Tapan Mehta: Tapan/at/uab.edu; Jacob PL Brand: BrandJP/at/pbrc.edu; Amit Patki: APatki/at/uab.edu; Grier P Page: GPage/at/uab.edu; David B Allison: DAllison/at/uab.edu

Received November 29, 2004; Accepted April 6, 2005.

Copyright © 2005 Trivedi et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## Abstract

### Background

Many efforts in microarray data analysis are focused on providing tools and methods for the qualitative analysis of microarray data. HDBStat! (High-Dimensional Biology-Statistics) is a software package designed for the analysis of high dimensional biology data such as microarray data. It was initially developed for the analysis of microarray gene expression data, but it can also be used for some applications in proteomics and other aspects of genomics. HDBStat! provides statisticians and biologists with a flexible and easy-to-use interface for analyzing complex microarray data using a variety of methods for data preprocessing, quality control analysis and hypothesis testing.

### Results

Results generated by the data preprocessing, quality control analysis and hypothesis testing methods are output as Excel CSV tables, graphs and an HTML report summarizing the data analysis.

## Background

One of the most critical tasks in the field of biology is identifying which genes interact with each other under different conditions, and how. Until a few years ago, researchers were able to accomplish this task for only a limited number of genes because the traditional methods in molecular biology allowed them to assess only one gene at a time. The advent of microarray technology has given investigators the opportunity to assess the expression levels of thousands of genes simultaneously. Microarrays also generate a large amount of data in a short period of time. Extracting statistically valid and biologically relevant information from such massive data sets is a major challenge. HDBStat! is a user-friendly and platform-independent software package designed for the statistical analysis of microarray data using well-validated methods for quality control of experiments and the identification of differentially expressed genes.

## Implementation

Data analysis in HDBStat! is divided into four steps: data import, data preprocessing, quality control and hypotheses testing (Figure ).

Data analysis in HDBStat! is divided into four steps – data import, data preprocessing, quality control and hypotheses testing. At each step, user input is required and in return, the results are displayed in the interface and/or output to a file. **...**

### Data import

Data are imported into HDBStat! from two files, a gene expression data file (Figure ) and a chip level information file (Figure ), both of which must be in Microsoft Excel '97 or more recent format (.xls) or Comma Separated Values (.csv) format [see Additional file 1 and Additional file 2]. The gene expression data file contains the output from chip image processing software, such as MAS 5.0, Bioconductor, or GenePix. The chip level information file contains experimental variables such as treatment, time, experiment and, if appropriate, pairing variables for the chips. Upon import, descriptive statistics for the raw data are generated automatically and displayed in graphical and tabular formats: Pearson's correlations between chips, and the mean, standard deviation, minimum and maximum of the gene expression levels on each chip.

Screenshot of gene expression data file in Excel format

Screenshot of chip level information file in Excel format
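The descriptive statistics generated at import can be illustrated with a minimal Python sketch. It is not HDBStat!'s Java code; the function names, chip names and expression values are hypothetical, and the sample (n - 1) standard deviation is an assumption.

```python
import math
import statistics

def chip_summary(values):
    """Mean, standard deviation, minimum and maximum for one chip."""
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1), assumed
        "min": min(values),
        "max": max(values),
    }

def pearson(x, y):
    """Pearson correlation between two chips measured on the same genes."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression values: one list per chip, one entry per gene.
chips = {
    "chip_A": [120.0, 340.0, 95.0, 410.0, 60.0],
    "chip_B": [130.0, 355.0, 90.0, 430.0, 72.0],
}
summaries = {name: chip_summary(values) for name, values in chips.items()}
r_ab = pearson(chips["chip_A"], chips["chip_B"])
```

A low correlation between one chip and the rest of its group is the kind of pattern these summaries are meant to surface before any preprocessing is chosen.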

### Data preprocessing

Optionally, normalization and/or transformation methods can be applied prior to the primary statistical analyses. Normalization is a procedure intended to remove variability among chips that is unrelated to the treatment conditions of interest. HDBStat! offers Chip Mean normalization, which divides each observation by the chip mean, and Quantile-Quantile normalization, which ranks each observation on the chip by expression value and then replaces it with the deviate expected from the standard normal distribution for an observation of that rank. Quantile-quantile normalization yields data from each chip with a mean of zero and a standard deviation of approximately 1.0. Transformation is the process of applying a mathematical function to every observation in a data set in order to better satisfy the assumptions of the statistical models used for analysis. HDBStat! offers logarithmic transformation on three scales: base-2, base-e, or base-10. Combinations of normalizations and transformations may be selected.
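The preprocessing options described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the HDBStat! implementation: the function names are hypothetical, the plotting position (rank - 0.5)/n used to map ranks to normal deviates is an assumption, and ties are broken by input order.

```python
import math
from statistics import NormalDist, fmean

def chip_mean_normalize(values):
    """Divide every observation on a chip by the chip mean."""
    m = fmean(values)
    return [v / m for v in values]

def quantile_normalize(values):
    """Replace each observation by the standard normal deviate of its rank.

    Assumption: rank r of n maps to the quantile (r - 0.5) / n.
    Ties are broken by input order.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf((rank - 0.5) / n)
    return scores

# The three logarithmic scales named in the text.
LOGS = {"2": math.log2, "e": math.log, "10": math.log10}

def log_transform(values, base="2"):
    """Apply a base-2, base-e or base-10 logarithm to positive values."""
    return [LOGS[base](v) for v in values]

# Hypothetical raw intensities for one chip.
raw = [30.0, 10.0, 20.0]
scores = quantile_normalize(raw)
logged = log_transform(chip_mean_normalize(raw), base="2")
```

Because the normal scores depend only on ranks, every chip with the same number of genes receives the same set of values after quantile-quantile normalization; only their assignment to genes differs.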

### Quality control

HDBStat! provides a unique quality control procedure based upon Deleted Residuals (DR). Deleted residuals have traditionally been used in the statistical analysis of data when the number of observations in a group is small or the observations may be influenced by outliers, as is the case in microarrays. In HDBStat!, the deleted residual for each gene on each chip is calculated by taking the observed value of the gene on the chip, subtracting the mean for the gene across all other chips in that group, and dividing by the standard deviation of the mean for the gene across all the other chips in that group. For a single gene, the deleted residual follows a Student's t-distribution with n-2 degrees of freedom, where n is the number of chips in the treatment group. If we assume that the genes across a chip are independent and identically distributed (IID), the distribution of the deleted residuals across the chip should approximate a Student's t-distribution with n-2 degrees of freedom. The departure of the observed distribution from the expected t-distribution is illustrated graphically (Figure ), and its significance is tested using a Kolmogorov-Smirnov test. If the distribution for a chip differs significantly from the t-distribution, this may indicate that the chip is an outlier compared with the other chips in the group. The user then has the opportunity to remove chip(s) from the analysis and re-analyze the data.
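A minimal Python sketch of the deleted-residual calculation follows. The exact denominator is not spelled out in the text, so the sketch assumes the prediction standard error s·sqrt(1 + 1/(n-1)), where s is the standard deviation over the other n-1 chips; this is the scaling under which normal data give a t-distribution with n-2 degrees of freedom. The function names and data are hypothetical. The example uses four chips per group because the t-distribution with 2 degrees of freedom has a simple closed-form CDF. Only the Kolmogorov-Smirnov statistic is computed, not its p-value.

```python
import math
from statistics import fmean, stdev

def deleted_residuals(values):
    """Deleted residual of each chip's value for one gene within one group.

    values: expression of a single gene on the n chips of one group (n >= 3).
    The other chips must not all be equal, or the scale would be zero.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 chips per group")
    residuals = []
    for i, x in enumerate(values):
        others = values[:i] + values[i + 1:]
        # Assumed scaling: prediction standard error of a new observation.
        se = stdev(others) * math.sqrt(1.0 + 1.0 / (n - 1))
        residuals.append((x - fmean(others)) / se)
    return residuals

def t2_cdf(x):
    """Closed-form CDF of Student's t with 2 degrees of freedom."""
    return 0.5 + x / (2.0 * math.sqrt(x * x + 2.0))

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D against a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, k / n - f, f - (k - 1) / n)
    return d

# Hypothetical group of 4 chips (so n - 2 = 2 degrees of freedom), 3 genes.
genes = [
    [7.1, 6.9, 7.3, 7.0],
    [5.2, 5.6, 5.1, 5.4],
    [9.0, 8.7, 9.4, 8.8],
]
per_gene = [deleted_residuals(g) for g in genes]
chip0 = [r[0] for r in per_gene]   # residuals of the first chip across genes
d0 = ks_statistic(chip0, t2_cdf)   # distance from the expected t-distribution
```

In practice the comparison would run over thousands of genes per chip, and a large D for one chip relative to the others in its group would flag it for closer inspection.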

### Hypotheses testing

Currently, HDBStat! performs a series of pairwise comparison tests. Based on the information provided by the user in the chip level information file, all possible pairwise hypotheses are displayed in the user interface. The user must select at least one hypothesis in order to perform two-group comparisons.
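As a small illustration, the set of candidate two-group hypotheses can be derived from the group labels in a chip level information file. The sketch below assumes a hypothetical mapping from chip name to treatment label; it is not taken from the HDBStat! source.

```python
from itertools import combinations

def pairwise_hypotheses(chip_groups):
    """List every unordered pair of distinct group labels."""
    labels = sorted(set(chip_groups.values()))
    return list(combinations(labels, 2))

# Hypothetical chip level information: chip name -> treatment group.
chip_groups = {
    "chip1": "control", "chip2": "control",
    "chip3": "drugA", "chip4": "drugA",
    "chip5": "drugB", "chip6": "drugB",
}
hypotheses = pairwise_hypotheses(chip_groups)
# -> [("control", "drugA"), ("control", "drugB"), ("drugA", "drugB")]
```

With g groups there are g(g-1)/2 such comparisons, which is why the user selects a subset rather than testing all of them by default.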

HDBStat! includes parametric and non-parametric methods for estimating the significance of changes in gene expression between groups. For Student's t-test, the user can choose an equal-variance t-test, which uses a variance pooled across treatments, or Welch's t-test, which allows unequal variances between the two treatment groups [11]. Another method, the Chebby Checker, is based on Chebyshev's inequality; it is extremely robust against departures from normality and equality of variance between treatment groups, but it also has very low power [2]. The Chebby Checker is useful for identifying genes that are almost certainly differentially expressed without relying on distributional assumptions. In addition, a bootstrap resampling method [6,8] is implemented. One can conduct either an exact bootstrap (all possible permutations) or a random bootstrap (a user-specified number of permutations). The bootstrap procedures implement both pivoting and smoothing in order to calculate the significance more accurately. Because the exact bootstrap is more accurate than the random bootstrap, it is preferred when computationally feasible, but once the n per group exceeds 6 it becomes difficult to implement.
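The two t statistics, and the difference between exhaustive and random resampling, can be sketched in Python. The resampling function is a plain relabeling test using the Welch statistic, swapped in for illustration. It does not reproduce the pivoting and smoothing of the HDBStat! bootstrap, and the Chebby Checker is not sketched. All function names and parameters are hypothetical. The sketch assumes each group has at least two chips and that no relabeling produces two constant groups.

```python
import math
import random
from itertools import combinations
from statistics import fmean, variance

def pooled_t(a, b):
    """Equal-variance t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (fmean(a) - fmean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def resampling_p(a, b, n_random=None, seed=0):
    """Two-sided p-value from relabeling chips between the two groups.

    n_random=None enumerates every relabeling (exact); otherwise that many
    random relabelings are drawn with a seeded generator.
    """
    pooled = list(a) + list(b)
    size, na = len(pooled), len(a)
    observed = abs(welch_t(a, b)[0])
    if n_random is None:
        splits = combinations(range(size), na)
    else:
        rng = random.Random(seed)
        splits = (rng.sample(range(size), na) for _ in range(n_random))
    hits = total = 0
    for idx in splits:
        chosen = set(idx)
        g1 = [pooled[i] for i in chosen]
        g2 = [pooled[i] for i in range(size) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling always counts as a hit.
        hits += abs(welch_t(g1, g2)[0]) >= observed - 1e-12
    return hits / total

# Hypothetical log expression values for one gene in two groups of 3 chips.
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]
t_pooled, df_pooled = pooled_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
p_exact = resampling_p(group_a, group_b)               # all 20 relabelings
p_random = resampling_p(group_a, group_b, n_random=200)
```

With three chips per group there are only 20 relabelings, so the smallest attainable two-sided p-value is 2/20 = 0.1. The number of relabelings grows quickly with group size (924 for six chips per group), which illustrates why exhaustive enumeration becomes costly as n increases.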

Because of the large number of hypotheses tested simultaneously in high dimensional biology experiments, an adjustment for multiple testing is appropriate in order to avoid falsely calling too many genes significant. In HDBStat!, several multiplicity control methods are available and can be applied to any hypothesis testing method. The available multiplicity control adjustments are Bonferroni [4], Sidak [10], two False Discovery Rate (FDR) estimation methods [3,5], and a method based on mixture modeling of the observed p-values [1], referred to as the "Mix-o-matic" (Figure ) in the HDBStat! software. The Bonferroni and Sidak methods provide experiment-wise (or family-wise) type I error control. The FDR methods are designed to control the proportion of false positives among all genes declared differentially expressed. The mixture modeling method allows Bayesian estimation of the probability that each gene is a false positive or a false negative, and this approach is also convenient for projecting power estimates for future studies [9].
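The family-wise and FDR adjustments can be illustrated with a short Python sketch. It computes adjusted p-values for the Bonferroni and Sidak corrections and for the Benjamini-Hochberg [3] and Benjamini-Yekutieli [5] step-up procedures. Reporting results as adjusted p-values is an assumption about presentation, not a description of HDBStat! output. The mixture model is not sketched, and all function names are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def sidak(pvals):
    """Sidak-adjusted p-values: 1 - (1 - p)^m."""
    m = len(pvals)
    return [1.0 - (1.0 - p) ** m for p in pvals]

def _step_up(pvals, scale):
    """Shared step-up adjustment: walk from the largest p-value down,
    keeping a running minimum of p * scale * m / rank."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order
        running_min = min(running_min, pvals[idx] * scale * m / rank)
        adjusted[idx] = running_min
    return adjusted

def benjamini_hochberg(pvals):
    """FDR adjustment of Benjamini and Hochberg."""
    return _step_up(pvals, 1.0)

def benjamini_yekutieli(pvals):
    """FDR adjustment of Benjamini and Yekutieli, valid under dependency."""
    m = len(pvals)
    return _step_up(pvals, sum(1.0 / k for k in range(1, m + 1)))

# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_bonf = bonferroni(raw_p)
adj_sidak = sidak(raw_p)
adj_bh = benjamini_hochberg(raw_p)
adj_by = benjamini_yekutieli(raw_p)
```

For any set of p-values the Sidak adjustment is never larger than the Bonferroni one, and the Benjamini-Yekutieli adjustment is never smaller than the Benjamini-Hochberg one, which reflects the extra protection it buys under dependency.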

For the planning of future studies, HDBStat! implements the method of Gadbury et al. to extrapolate power from pilot data [9]. HDBStat! allows calculation of the expected discovery rate (EDR) and the posterior true positive and posterior true negative rates for sample sizes larger and smaller than those entered as pilot data (Figure ).

If an investigator is interested in empirically comparing the sizes of the observed differences in gene expression, an Empirical Bayes method provides shrinkage estimators of the true differences in gene expression [7]. In addition, group means and fold changes in expression are calculated and output to the results directory specified by the user.

### Programming details

HDBStat! is implemented in the Java programming language using various licensed and open source libraries such as Visual Numerics JMSL, Jakarta POI, Velocity, and JFreeChart. Extensive software testing is performed using the JUnit library.

## Results

After the deleted residuals have been calculated and a hypothesis has been analyzed, results are output to a date/time-stamped directory within the user-specified directory. Chip level statistics, preprocessed data, deleted residuals, standard outliers, various pairwise comparison tests (Table ), mix-o-matic and power analysis results are output as Excel CSV files. Graphs generated from chip level statistics, deleted residuals, mix-o-matic and power analysis results are output as .png image files. HDBStat! also generates an HTML file that provides a summary of the analysis, including the hypotheses tested and the chips in each group. This mechanism of outputting results gives the user an opportunity to view quality control results and modify hypotheses, preprocessing methods, and/or chip selections before proceeding to the next step.

Hypothesis testing results
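The output convention described above, a date/time-stamped directory holding CSV tables, might look like the following Python sketch. The directory name format, file name, column names and values are all hypothetical; HDBStat! itself is written in Java and its actual naming scheme is not given here.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

def write_results(base_dir, rows, header, now=None):
    """Write one CSV table into a date/time-stamped subdirectory."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    out_dir = Path(base_dir) / stamp
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "hypothesis_tests.csv"  # hypothetical file name
    with out_file.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return out_file

# Usage with a temporary directory standing in for the user-specified one.
with tempfile.TemporaryDirectory() as tmp:
    path = write_results(
        tmp,
        rows=[("gene1", 2.5, 0.03), ("gene2", -0.4, 0.71)],
        header=["gene", "t", "p"],
        now=datetime(2005, 4, 6, 12, 0, 0),
    )
    written = path.read_text().splitlines()
```

Stamping each run's directory keeps the results of successive re-analyses, for example after dropping a chip, from overwriting one another.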

## Discussion

The goal of HDBStat! is to help researchers analyze microarray data to extract valid inferences, estimates and interpretations via a flexible and user-friendly graphical interface. It allows the user to skip preprocessing and quality control methods by simply not selecting those methods. After previewing the preliminary results for raw data, preprocessed data or deleted residuals, the user has the flexibility to drop a chip by simply un-checking a checkbox in the user interface. This feature allows the user to design any number of possible comparisons while the analysis is in progress.

To assist novice users with HDBStat!, video clips demonstrating how to analyze paired and unpaired data, examples of how to set up input files for paired and unpaired data analyses, screen shots, and an FAQ are available on our website. A detailed description of the methods, as well as additional explanations of the output files, is also available in PDF format on our website.

Additional statistical methods and features are added on an ongoing basis. Support for data import from a text file and results output to a text file will be made available for large data sets. In the current version, only single channel or common reference design microarray data can be analyzed, using two-group comparisons. In the near future, we will add the capability to analyze two channel data and support for ANOVA and GLM.

There are many software programs available to analyze microarray data, each offering various features and functions. In Table , we have compared the features and functions of HDBStat! to SAM, BRB Array Tools and TM4.

Comparison of HDBStat! with other software packages. All these software packages are still in active development and new functions will undoubtedly be added over time.

## Availability and requirements

System requirements for an end-user are the Java Runtime Environment (JRE 1.4.2 or higher), at least 256 MB RAM and 25 MB hard disk space. Using Java Web Start technology, HDBStat! can be easily downloaded from our website at http://www.soph.uab.edu/ssg_content.asp?id=1164.

## Authors' contributions

JWE, JPLB, KK and GPP wrote the implementation specification documents for various methods. GLG and DBA developed the mix-o-matic method, developed a prototype implementation in S-Plus and wrote the implementation specification document. GPP and JWE developed the deleted residuals approach. DBA and JWE developed a prototype implementation of the Empirical Bayes estimates. JWE and PT developed a prototype implementation of the statistical methods in SAS. JW, PT, VS and TM implemented and tested the Java code. JWE, PT, JW, SOZ, GPP, VS, TM and AP tested the software. GPP and JWE directed the content of HDBStat! All authors read and approved the manuscript.

## Acknowledgements

This work is supported by grants from the UAB HSF GEF, NSF 0217651 and NIH U54CA100949.

## References

1. Allison DB, Gadbury GL, Heo M, Fernandez JR, Lee C, Prolla TA, Weindruch R. A mixture model approach for the analysis of microarray gene expression data. Comp Statist & Data Anal. 2002;39:1–20. doi: 10.1016/S0167-9473(01)00046-9.
2. Beasley TM, Page GP, Brand JPL, Gadbury GL, Mountz JD, Allison DB. Chebyshev's inequality for non-parametric testing with small *N* and α in microarray research. J R Statist Soc C. 2004;53:95–108. doi: 10.1111/j.1467-9876.2004.00428.x.
3. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995;57:289–300.
4. Bland JM, Altman DG. Multiple significance tests: the Bonferroni method. BMJ. 1995;310:170.
5. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Statist. 2001;29:1165–1188. doi: 10.1214/aos/1013699998.
6. Davison AC, Hinkley DV. Bootstrap methods and their application. Cambridge University Press, United Kingdom; 1997.
7. Edwards JW, Page GP, Gadbury G, Heo M, Kayo T, Weindruch R, Allison DB. Empirical Bayes estimation of gene-specific effects in micro-array research. Funct Integr Genomics. 2005;5:32–9. doi: 10.1007/s10142-004-0123-0.
8. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall, New York; 1993.
9. Gadbury GL, Page GP, Edwards JW, Kayo T, Prolla TA, Weindruch R, Permana PA, Mountz J, Allison DB. Power and Sample Size Estimation in High Dimensional Biology. Stat Meth Med Res. 2004;13:325–338.
10. Sidak Z. Rectangular confidence regions for the means of the multivariate normal distributions. J Am Stat Assoc. 1967;62:626–633.
11. Welch BL. The significance of the difference between two means when the population variances are unequal. Biometrika. 1938;29:350–362.