Biostatistics. 2001 Dec;2(4):445-61.

Gene expression analysis with the parametric bootstrap.

Author information

Division of Biostatistics, University of California, Earl Warren Hall 7360, Berkeley, CA 94720-7360, USA.


Recent developments in microarray technology make it possible to capture the gene expression profiles for thousands of genes at once. With these data, researchers are tackling problems ranging from the identification of 'cancer genes' to the formidable task of adding functional annotations to our rapidly growing gene databases. Specific research questions suggest patterns of gene expression that are interesting and informative: for instance, genes with large variance or groups of genes that are highly correlated. Cluster analysis and related techniques are proving to be very useful. However, such exploratory methods alone do not provide the opportunity to engage in statistical inference. Given the high dimensionality (thousands of genes) and small sample sizes (often <30) encountered in these datasets, an honest assessment of sampling variability is crucial and can prevent the over-interpretation of spurious results. We describe a statistical framework that encompasses many of the analytical goals in gene expression analysis; our framework is completely compatible with many of the current approaches and, in fact, can increase their utility. We propose the use of a deterministic rule, applied to the parameters of the gene expression distribution, to select a target subset of genes that are of biological interest. In addition to subset membership, the target subset can include information about relationships between genes, such as clustering. This target subset is itself a parameter of interest, which we can estimate by applying the same rule to the sample statistics of microarray data. The parametric bootstrap, based on a multivariate normal model, is used to estimate the distribution of these estimated subsets, and relevant summary measures of this sampling distribution are proposed. We focus on rules that operate on the mean and covariance.
Using Bernstein's inequality, we obtain consistency of the subset estimates, under the assumption that the sample size grows to infinity faster than the logarithm of the number of genes. We also provide a conservative sample size formula guaranteeing that the sample mean and sample covariance matrix are uniformly within a distance epsilon > 0 of the population mean and covariance. The practical performance of the method with a cluster-based subset rule is illustrated in a simulation study, and the method is then applied to a publicly available leukemia data set.
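
The procedure summarized above (apply a deterministic rule to the sample statistics, then resample from a fitted multivariate normal model to see how the selected subset varies) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the rule `select_subset` (genes whose mean exceeds a threshold), the function names, the default of 200 bootstrap replicates, and the use of the maximum-likelihood covariance (divisor n) are all hypothetical choices made for this sketch. The per-gene reappearance frequency it returns is one possible summary of the sampling distribution, not necessarily one of those proposed in the paper.

```python
import random
from collections import Counter
from statistics import fmean


def column_means(data):
    """Per-gene mean of an n x p list of rows (rows are arrays, columns are genes)."""
    return [fmean(row[j] for row in data) for j in range(len(data[0]))]


def select_subset(mean, threshold):
    """Hypothetical deterministic rule: indices of genes whose mean exceeds threshold."""
    return frozenset(j for j, m in enumerate(mean) if m > threshold)


def draw_bootstrap_sample(mean, centered, rng):
    """Draw n arrays from N(mean, S), where S = centered^T centered / n.

    If z ~ N(0, I_n), then mean + centered^T z / sqrt(n) has covariance S,
    so the p x p matrix never has to be factorized. This also works when
    S is singular, which is the usual case when n is smaller than p.
    """
    n, p = len(centered), len(mean)
    scale = n ** -0.5
    sample = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sample.append([
            mean[j] + scale * sum(z[i] * centered[i][j] for i in range(n))
            for j in range(p)
        ])
    return sample


def bootstrap_subsets(data, threshold, n_boot=200, seed=0):
    """Return the estimated subset and, per gene, the fraction of
    parametric bootstrap replicates in which the rule selects it."""
    rng = random.Random(seed)
    mean = column_means(data)
    centered = [[x - m for x, m in zip(row, mean)] for row in data]
    estimate = select_subset(mean, threshold)

    counts = Counter()
    for _ in range(n_boot):
        boot = draw_bootstrap_sample(mean, centered, rng)
        # Re-apply the same rule to the statistics of the resampled data.
        counts.update(select_subset(column_means(boot), threshold))

    freq = [counts[j] / n_boot for j in range(len(mean))]
    return estimate, freq
```

Calling `bootstrap_subsets(data, threshold)` on an n x p list of expression values gives the subset estimated from the data together with a reappearance frequency for every gene. Genes in the estimated subset with low frequency are candidates for being spurious selections. A rule that also uses the covariance, such as the cluster-based rule mentioned in the abstract, would replace `select_subset` and would need the sample covariance of each bootstrap replicate as well.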
