Bayesian Weibull tree models for survival analysis of clinico-genomic data

a Department of Epidemiology and Public Health, Leonard M. Miller School of Medicine, University of Miami, Miami, FL 33136, USA
b Department of Statistical Science, Duke University, Durham, NC 27705, USA
* Corresponding author. Tel.: +1 604 628 9831; fax: +1 604 628 9831. E-mail address: JClarke/at/med.miami.edu (J. Clarke).

Abstract

An important goal of research involving gene expression data for outcome prediction is to establish the ability of genomic data to define clinically relevant risk factors. Recent studies have demonstrated that microarray data can successfully cluster patients into low- and high-risk categories. However, the need exists for models which examine how genomic predictors interact with existing clinical factors and provide personalized outcome predictions. We have developed clinico-genomic tree models for survival outcomes which use recursive partitioning to subdivide the current data set into homogeneous subgroups of patients, each with a specific Weibull survival distribution. These trees can provide personalized predictive distributions of the probability of survival for individuals of interest. Our strategy is to fit multiple models; within each model we adopt a prior on the Weibull scale parameter and update this prior via Empirical Bayes whenever the sample is split at a given node. The decision to split is based on a Bayes factor criterion. The resulting trees are weighted according to their relative likelihood values and predictions are made by averaging over models. In a pilot study of survival in advanced stage ovarian cancer we demonstrate that clinical and genomic data are complementary sources of information relevant to survival, and we use the exploratory nature of the trees to identify potential genomic biomarkers worthy of further study.
Keywords: Survival analysis, Weibull, Recursive partitioning, Gene expression, Bayes factor, Variable selection, Ovarian cancer, Clustering 1. Introduction Genomic information, in the form of microarray or gene expression signatures, has an established capacity to define clinically relevant risk factors in disease prognosis. Recent studies have generated such signatures related to disease recurrence and survival in ovarian cancer [62,2] as well as in numerous other disease contexts [72,58,48]. Analyses involving gene expression signatures have focused on clustering or classification to associate such signatures or patterns with ‘low-risk’ versus ‘high-risk’ survival prognoses. The clustering of tumors based on expression levels into multiple subgroups has been performed using various methods including support vector machines [72], k-NN models [62], PLS [45] and hierarchical clustering [47]. A more formal discussion and comparison of various tumor discrimination methods with gene expression data can be found in [19]. The application of gene expression signatures to the prediction of disease outcome is a research area distinct from clustering applications. Less attention has been focused on prediction to date, although single genes or gene signatures have been studied for the prediction of tumor classification related to ‘good’ versus ‘poor’ survival prognoses [70,21]. However, the ‘signature’ approaches to prediction of cancer outcome with microarrays have been shown to be highly unstable and strongly dependent on the selection of patients in the training sets [43]. This has been attributed to inadequate validation leading to overoptimistic results, but also reflects the heterogeneity of complex disease. From the perspective of the individual patient a sharper, more specialized approach to prediction is needed. Bayesian regression tree models, described in Section 2, form the basis of one such approach. 
The conventional binary regression tree associated with CART [9] has been used successfully for prediction in various modeling contexts [5,63] as have the Bayesian versions of CART [12, 16], other Bayesian binary trees [49], and the relative risk trees of Ishwaran, Blackstone, Pothier, and Lauer [32]. A review of tree-based methods for survival can be found in [73]. Research has shown that the prediction accuracy of such models can be improved through Bayesian model averaging [25], bagging [5], boosting [60], and related methods [13]. This holds true for trees whether the model search is stochastic [67,7,35] or deterministic [10,44]. Our approach is a reflection of this finding and of the recent emphasis in the literature on ensemble methods for prediction [14,26,27]. A key aspect in our approach is the averaging of predictions over multiple candidate models, which we discuss in Section 3. Note that Bayesian model averaging not only improves predictive performance but the posterior parameter estimates and standard deviations directly incorporate model uncertainty [51]. In this article we discuss the development of Bayesian tree models that allow the use of clinical, histopathological, and genomic data in the prediction of disease-related survival outcomes. These regression tree models have ability to discover and evaluate interactions of multiple predictor variables, and define flexible, non-linear predictive tools [49]. Specifically, our method allows the direct evaluation of the relative importance of clinical and genomic predictors. Our approach is demonstrated in the context of prediction of survival after surgery for ovarian cancer patients. We stress the utility of such tree models in the exploration of genomic data, and the resulting identification of genes plausibly associated with clinical endpoints, as well as for prediction. 2. 
Regression trees Our focus is the development of regression trees that recursively generate binary partitions of the covariate space, based upon specific clinical and genomic variables, and within each partition accurately model a continuous survival time response variable. One key advantage of such trees is their interpretability: the entire feature space can be explained by a single tree and the prediction for any given individual can be interpreted as a conjunction of simple logical expressions [17, 24]. Regression tree models serve as tools for prediction as well as for exploratory data analysis by discovering simple combinations of covariates that correlate with a particular outcome. In the case of genomic data these combinations can then serve as a basis for further biological study. Recent additions to the survival tree modeling literature, including [26,27] and [33], reflect the importance of survival trees as an analytic technique for data sets with complex structure. In the remainder of this section we discuss model construction and model inference. We begin with a brief overview of recursive partitioning models (Section 2.1) and the use of the Exponential and Weibull distributions to model the conditional distribution of the response variable (Section 2.3). Then we discuss the splitting criterion based on Bayes factors and inference via Empirical Bayes methods (Section 2.2) and posterior distributions and predictive distributions (Section 2.4). The generation of predictive distributions by model averaging is discussed in Section 3. Although our models can be applied to censored data (under the assumption of non-informative censoring) [15,48], we confine our discussion to the fully observed case. 2.1. Recursive partitioning We assume a continuous survival time response variable Y and a p-dimensional vector of covariates X. Each covariate Xj, j = 1, . . . , p, may be categorical or continuous. 
We assume that the distribution of Y|X can be expressed as Y|g(X) where g is a recursive binary partitioning or splitting of the covariate space into disjoint subspaces. Each binary split is defined by a rule which assigns an observation in the current partition to one of two partition subspaces based upon a predictor Xj and a threshold value τ. The choice of the pair (Xj, τ) is made by finding the pair which reorganizes the data in the current partition into two subgroups whose survival distributions are most different, as assessed by a splitting criterion (see Section 2.2.1). A split is performed if the value of this criterion exceeds a specified threshold of significance. This splitting process continues in a recursive fashion until the existing model cannot be improved. The result is a tree model M(Y, X) in which the terminal nodes or leaves represent a partition of the covariate space in which the distribution of Y is distinct. For a given node and predictor it is possible that any of several threshold values would yield a significant split. The ability to generate multiple trees at a node may be advantageous. In problems with many predictors, this naturally leads to the generation of many trees, often with small changes from one to the next, and the consequent need to develop inference and prediction in the context of multiple trees generated this way. The use of ‘forests of trees’ and similar ensemble methods has been urged by Breiman [8] as well as others [26] and our perspective endorses this. The involvement of multiple trees in our analyses is supported by the viewpoint that the splitting of nodes is based on the selection of (predictor, threshold) pairs which we view as parameters of the overall tree model. Any single tree is formed by selecting specific values for these parameters and the uncertainty in these parameters is reflected in the variability among trees. The resulting models generate predictions via model averaging. 
This process is discussed in more detail in the following Section and in Section 3. 2.2. Tree generation We employ a forward-selection process to generate tree models. If the data in a node of a single tree is a candidate for splitting, we find the (predictor,threshold) pair that maximizes the splitting criterion (see below) for a split at the given node. The node is split if the value of this criterion is sufficiently large. Given a current tree the splitting process continues until either the existing model cannot be improved, i.e., the splitting criterion is not sufficiently large for any choice of (predictor,threshold) at any node, or until all of the remaining candidate terminal nodes have very few observations (usually less than 5 observed survival times). Our strategy is unlike other tree-growing methods (including CART), which purposely overgrow a tree and then prune back, due primarily to our focus on prediction in settings of low signal to noise. We want to limit adaptivity and avoid overfitting, at the possible cost of missing an association of moderate significance. 2.2.1. Bayes factors The choice of splitting criterion is based on the association between the outcome variable Y (survival time) and the covariates X in subsamples. Splitting variables and splitting thresholds are selected based on their ability to strengthen this association. With data y1, . . . , yn in a given node and a specified threshold τ on a given predictor Xj, our test of association is based on assessing whether the data are more consistent with a single exponential distribution (with exponential parameter μτ) or with two separate exponential distributions (with parameters μ0,τ and μ1,τ) defined by the specified partition. In our Bayesian approach we adopt the standard conjugate Gamma prior model on the Exponential parameter [61]; the prior is Gamma(a, b) where b = a/m and m is the mean of the Gamma prior. 
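As an illustration of the test of association just described, the following is a minimal sketch of a Bayes factor for one candidate split, assuming that μ is an Exponential rate, that the Gamma(a, b) prior is parameterized by shape a and rate b = a/m, and that a is held fixed rather than estimated by Empirical Bayes. The function names (`log_marginal`, `log2_bayes_factor`, `best_threshold`) and the midpoint grid of candidate thresholds are hypothetical choices made for this example only.

```python
import math

def log_marginal(r, y_sum, a, b):
    """Log marginal likelihood of r exponential survival times with sum y_sum,
    with the rate integrated out under a Gamma(shape=a, rate=b) prior:
    b^a * Gamma(a + r) / (Gamma(a) * (b + y_sum)^(a + r))."""
    return (a * math.log(b) + math.lgamma(a + r)
            - math.lgamma(a) - (a + r) * math.log(b + y_sum))

def log2_bayes_factor(times, x, tau, a, m):
    """log2 Bayes factor of 'two separate rates' (x <= tau vs x > tau)
    against 'one common rate' for the survival times in a node."""
    b = a / m  # prior rate chosen so that the prior mean is m (assumption: a fixed)
    left = [t for t, xi in zip(times, x) if xi <= tau]
    right = [t for t, xi in zip(times, x) if xi > tau]
    if not left or not right:
        return float("-inf")  # not a genuine split
    log_h1 = (log_marginal(len(left), sum(left), a, b)
              + log_marginal(len(right), sum(right), a, b))
    log_h0 = log_marginal(len(times), sum(times), a, b)
    return (log_h1 - log_h0) / math.log(2.0)

def best_threshold(times, x, a, m):
    """Scan midpoints between sorted distinct predictor values and return
    (tau, log2 Bayes factor) for the best one, or None if x is constant."""
    values = sorted(set(x))
    candidates = [(lo + hi) / 2.0 for lo, hi in zip(values, values[1:])]
    if not candidates:
        return None
    scored = [(tau, log2_bayes_factor(times, x, tau, a, m)) for tau in candidates]
    return max(scored, key=lambda pair: pair[1])
```

A node would then be split only if the best value exceeds the chosen threshold. Under a Weibull model with known shape, the same code applies to the transformed survival times.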
We specify a fixed global prior mean but treat the scale parameter a as uncertain and node specific; a is estimated via Empirical Bayes (EB). In brief, suppose a node has rz individuals with observed survival times and Yz is the sum of all survival times (here z = 0, 1 identifies the node as one of the two child nodes of a parent node). Assuming μ0,τ ≠ μ1,τ we take μ0,τ and μ1,τ to be independent with common prior Gamma(aτ, bτ) with mean aτ/bτ. Under the null hypothesis μ0,τ = μ1,τ the common value has the same Gamma prior. Let the parameters of the current prior Gamma(aτ, bτ) be expressed as aτ = cτ and bτ = cτ/m where m is the prior mean. The empirical Bayes approximation to the distribution of μz,τ | (rz, Yz, μ|1−z|,τ) is a Gamma distribution. A candidate split of a given node organizes the data into the two subgroups z = 0, 1, each summarized by rz, Yz, and nz, where nz is the total number of survival times in subgroup z (in the uncensored case rz = nz).

The splitting criterion or test of association is based on assessing the Bayes factor Bτ [37] comparing the null hypothesis H0 : μ0,τ = μ1,τ (with common value μτ) with the alternative H1 : μ0,τ ≠ μ1,τ. The Bayes factor Bτ in favor of the alternative over the null hypothesis is simply the ratio of the marginal likelihood of the data under H1 to that under H0. In comparing predictors the Bayes factor can be evaluated for each predictor across a range of predictor-specific thresholds. For a given predictor this generates values of Bτ as a function of τ, which may suggest promising threshold values.

2.3. Weibull transformation

Suppose that a node of a given tree is to be split on a predictor xj at the (threshold) value τ. Let yzi and rz be as defined in Section 2.2.1 where i denotes the ith individual in subgroup z, i = 1, . . . , nz, and yzi ~ Exp(μz,τ). The data density follows from this Exponential model. In our parameterization of the Weibull the scale parameter has been incorporated into the definition of μz,τ. As the value of μz,τ varies across different nodes of a tree so does the scale parameter.
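For readers who want the transformation written out, one standard parameterization that matches the survival function S(y|x) = exp(−yμx) used in Section 4.1 is given below. The symbol T for the untransformed survival time, the conventional Weibull scale λ, and the exact form shown are assumptions made for illustration and are not taken from the text.

```latex
\begin{align*}
S_T(t) &= \exp\!\left(-\mu_{z,\tau}\, t^{\alpha}\right), &
f_T(t) &= \alpha\, \mu_{z,\tau}\, t^{\alpha-1} \exp\!\left(-\mu_{z,\tau}\, t^{\alpha}\right), \\
Y = T^{\alpha} \;&\Longrightarrow\; P(Y > y) = P\!\left(T > y^{1/\alpha}\right) = \exp\!\left(-\mu_{z,\tau}\, y\right), &
Y &\sim \mathrm{Exp}(\mu_{z,\tau}), \\
S_T(t) &= \exp\!\left(-(t/\lambda)^{\alpha}\right) \;\Longleftrightarrow\; \mu_{z,\tau} = \lambda^{-\alpha}.
\end{align*}
```

Under this reading, fixing the shape α and raising the survival times to the power α reduces the Weibull model to the Exponential model, and the last line shows how a conventional scale λ would be absorbed into μz,τ.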
Since the splitting criterion for the trees is based on a significance test of the value of μz,τ, the scale parameter is implicitly, although not directly, incorporated into the splitting criterion and hence used for growing the tree. The current model could be reparameterized to address the scale parameter directly; however, this would require an entirely different Bayesian analysis as the interpretation of μz,τ is essential to the current conjugate analysis (see [61]). 2.4. Inference and prediction Inference and prediction at a terminal node or leaf of a given tree involve the calculation of branch probabilities and the posterior predictive distributions which underlie the predictive probabilities for new cases. To calculate the branch probabilities for a leaf we must follow the path or sequence of nodes of the tree that connect the root node with the specified leaf. We consider the kth node of the tree and suppose that it is split on the pair (xjk, τjk), where the notation of Section 2.2 has been extended to include the node index. The data in node k can be divided into two groups based on the values of (xj, τj), where the sums of all of the survival times in the Xj ≤ τj and Xj > τj groups are Y0k and Y1k, respectively. The implied conditional probabilities P(Yzki > t | Z = z), i = 1, . . . , nzk, for some time t are the branch probabilities defined by this split (the dependence of these probabilities on the tree and the data are suppressed for clarity). From Sections 2.2 and 2.3 we know that these probabilities are based on Exponential distributions for yzki with parameter μz,τjk for z = 0, 1 and specified Gamma priors which we index by the parent node, i.e., Gamma(aτjk, bτjk). The use of EB to estimate aτjk has been described in Section 2.2 and will not be discussed here. 
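As a sketch of the leaf-level calculation in this section, the code below updates a Gamma prior with the data in a leaf and then estimates P(y > t | x) both in closed form and by Monte Carlo sampling. It assumes that μ is a rate with a Gamma(shape, rate) prior, that the Weibull shape α is known, and that `y_sum` is the sum of the transformed times in the leaf. It treats a single leaf posterior only, not the full sequence of branch distributions along a path, and all function names are hypothetical.

```python
import math
import random

def posterior_params(a, b, r, y_sum):
    """Conjugate update: Gamma(a, b) prior on the rate, r observed events,
    y_sum = sum of (transformed) survival times in the leaf."""
    return a + r, b + y_sum

def survival_closed_form(t, alpha, a_post, b_post):
    """Posterior predictive P(y > t) = E[exp(-mu * t^alpha)]
    for mu ~ Gamma(shape=a_post, rate=b_post)."""
    s = t ** alpha
    return (b_post / (b_post + s)) ** a_post

def survival_monte_carlo(t, alpha, a_post, b_post, n_draws=20000, seed=0):
    """Posterior sample of exp(-mu * t^alpha); returns the sample mean and a
    95% uncertainty interval from the sample quantiles."""
    rng = random.Random(seed)
    s = t ** alpha
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    draws = sorted(math.exp(-rng.gammavariate(a_post, 1.0 / b_post) * s)
                   for _ in range(n_draws))
    mean = sum(draws) / n_draws
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return mean, (lower, upper)
```

The closed form gives the plug-free predictive mean, while the sampled version also supplies the uncertainty interval for the survival probability.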
Assuming that node k is split, the resulting conditional posterior branch probability parameters would be independent with posterior Gamma distributions.

Let x be an observed vector of covariates for a new case and consider predicting the response P(y > t | x) for a given time t. The current tree will define a single path for this observation from the root node to a terminal node or leaf. Prediction requires that we follow x along its path down the tree to the implied leaf and construct the relevant posterior defined by the (x, τ) pairs at the splits that we encounter along the path. For example, suppose that our new case x has an implied path through nodes 1 and 2 terminating at node 5 (a leaf), where each tree split defines exactly 2 child nodes (node numbers increase from left to right within levels starting with the root node as node 1). This path is based on (predictor, threshold) pairs (x1, τ1) and (x2, τ2) and is a result of the predictor values of the new case relative to these thresholds. The prediction P(y > t | x) involves the posterior predictive distribution of future survival times for cases in node 5.

Prediction follows by estimating P(y > t | x) based on the sequence of conditionally-independent posterior distributions for the branch probabilities that define it. Simply plugging in the posterior conditional means of each μz,τ,j will lead to a plug-in estimate of P(y > t | x). Since each exponential mean follows a Gamma posterior, it is possible to draw Monte Carlo samples of the μz,τ,j and compute the corresponding values of P(y > t | x) to generate a posterior sample for summarization. In this way we can examine the simulation-based posterior means and uncertainty intervals for P(y > t | x) which represent predictions of the survival probabilities for the new case.

3. Generation and weighting of multiple trees

The use of forests of trees and similar ensemble methods has been urged by Breiman [8] as well as others [44,26] as previously noted.
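A minimal sketch of likelihood-based weighting of trees is given here, assuming that each tree is summarized by the log marginal likelihood components of its split nodes and by its own prediction of P(y > t | x). The function names and the `min_weight` cutoff argument are hypothetical, and the fallback used when every tree falls below the cutoff is a choice made for this example.

```python
import math

def tree_log_likelihood(split_node_log_components):
    """Overall log marginal likelihood of one tree: the sum of the log
    components of its split nodes (a product on the likelihood scale)."""
    return sum(split_node_log_components)

def relative_weights(tree_log_likelihoods, min_weight=0.0):
    """Normalize likelihoods to weights, drop trees whose weight is below
    min_weight, and renormalize the remaining weights to sum to one."""
    top = max(tree_log_likelihoods)  # subtract the maximum for stability
    raw = [math.exp(ll - top) for ll in tree_log_likelihoods]
    total = sum(raw)
    weights = [w / total for w in raw]
    kept = [w if w >= min_weight else 0.0 for w in weights]
    kept_total = sum(kept)
    if kept_total == 0.0:
        return weights  # no tree passes the cutoff; keep all (assumption)
    return [w / kept_total for w in kept]

def averaged_prediction(weights, tree_predictions):
    """Model-averaged prediction: weighted mean of the per-tree predictions."""
    return sum(w * p for w, p in zip(weights, tree_predictions))
```

For example, with relative likelihoods in the ratio 1 : 0.5 : 0.01 and a 1% cutoff, the third tree is dropped and the first two receive weights 2/3 and 1/3.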
In our analyses the (predictor,threshold) pairs are viewed as parameters of the overall tree model. Statistical learning about relevant trees requires the examination of aspects of the posterior distributions of these parameters (and of the branch probabilities). Our Bayesian approach to survival tree modeling allows us to properly address model uncertainty, as has been done in similar contexts by others [10,16,12]. Trees are known as unstable classifiers [9]; however, predictions may be improved by selecting a group of models instead of a single model and generating predictions by model averaging, as in [10,25]. Copies of the ‘current’ tree are made and the current node is split on a different significant (predictor,threshold) choice for each copy. Once a number of trees have been generated we can involve all or some of them in inference and prediction by weighting the contribution of each tree by its relative likelihood value. As a result of the current framework of forward generation of trees the likelihood values are easy to compute. For any single tree the overall marginal likelihood can be calculated by identifying the nodes which have been split and taking the product of the component marginal likelihoods defined by each split node. In other words (using the notation of Section 2.4), each split node k contributes one marginal likelihood component to this product.

4. Sensitivity and performance on simulated data

Like any method for statistical inference our modeling approach and results will depend on various assumptions. These include the choice of prior and the data likelihood. In this section we consider the sensitivity of our method to the assumed value of the Weibull shape parameter α (see Section 2.3) in a predictive context using simulated data. To aid in determining whether our method behaves as expected, we employ two other modeling approaches for comparison.
Our hope is that this assessment, although limited, will provide useful information concerning the strengths and weaknesses of our approach.

4.1. Setup

Our setup is similar to that of Hothorn et al. [28]. Five independent predictors X1, X2, . . ., X5 were generated from a uniform distribution on [0,1]. Survival times were generated from an Exponential distribution with conditional survival function S(y|x) = exp(−yμx) under three models (A, B, and C) for the logarithm of the hazard; model (A) is log(μx) = 0. The behavior of our models was compared to both a simple Kaplan–Meier curve and survival trees as implemented in the rpart package [66] in the R system for statistical computing [50]. Comparison to proportional hazards has been presented elsewhere [15]. The parameters for the rpart routine were set as in [28]. For our tree models the maximum number of trees allowed was 30, the minimum Bayes factor value required for a split was 2.5, and only nodes containing at least 3 observations were candidates for splitting. Numerous parameter combinations were tried with minimal impact on the results, if any. Trees with normalized likelihood values below 5% were removed from consideration. The mean integrated squared error was employed as a measure of the quality of the model predictions (computed by numerical integration). The learning sample contained 200 observations and the value of the predictions was evaluated on an independent sample of 100 observations.

4.2. Results

We have selected three representative runs to discuss: model A at α = 0.8, model B at α = 1.2, and model C at α = 1.5. Within each run the value of α assumed by the Weibull tree models (which we will refer to as αfix) takes each value from the set {0.5, 0.8, 1.0, 1.2, 1.5}. The median MIE result and the 95% confidence interval for the median MIE calculated for 100 replications of the learning and evaluation sets for these runs are displayed in Figs. 1
In Fig. 2
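To make the simulation design of Section 4.1 more concrete, the sketch below generates data under model (A) and computes an integrated squared error between two survival curves by the trapezoidal rule. Drawing the times from a Weibull with shape α by inverse transform, the integration range, and all function names are assumptions made for this example; models (B) and (C) are not reproduced.

```python
import math
import random

def weibull_survival(t, mu, alpha):
    """S(t) = exp(-mu * t^alpha); alpha = 1 gives the Exponential case."""
    return math.exp(-mu * t ** alpha)

def simulate_model_a(n, alpha, seed=0):
    """Five independent Uniform[0,1) predictors and survival times with
    log(mu_x) = 0, i.e. mu_x = 1 regardless of the predictors."""
    rng = random.Random(seed)
    xs = [[rng.random() for _ in range(5)] for _ in range(n)]
    # Inverse transform: if U is uniform on (0, 1], T = (-log U)^(1/alpha).
    ts = [(-math.log(1.0 - rng.random())) ** (1.0 / alpha) for _ in range(n)]
    return xs, ts

def integrated_squared_error(true_surv, pred_surv, t_max, n_grid=1000):
    """Trapezoidal approximation of the integral over [0, t_max] of
    (true_surv(t) - pred_surv(t))^2."""
    h = t_max / n_grid
    vals = [(true_surv(i * h) - pred_surv(i * h)) ** 2 for i in range(n_grid + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
```

Averaging this error over the cases in an evaluation sample would give a mean integrated squared error of the kind used to compare an assumed shape αfix with the shape used to generate the data.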
5. Analysis in ovarian cancer research Ovarian cancer is the deadliest of the gynecologic cancers and the fifth leading cause of cancer deaths among women today [1]. When making ovarian cancer diagnoses and prognoses clinicians rely on subjective interpretations of both clinical and histopathological information, which can be incomplete or unreliable [62]. Recent studies in ovarian cancer have demonstrated the potential of genomic data to improve our ability to predict patient survival and treatment response [62,2]. We chose to utilize Weibull trees to explore pilot data collected from 119 advanced stage ovarian cancer patients treated at either Duke University Medical Center or H. Lee Moffitt Cancer Center & Research Institute. The primary purpose of this analysis was to determine whether genomic data could demonstrate ability to predict survival that was not reflected in available clinical data such as disease-free interval (time between primary chemotherapy/disease relapse and disease recurrence) and, if so, to explore which genes may demonstrate such ability and whether a larger study would be of interest. Tissue samples were collected at the time of initial cytoreductive surgery and all patients received primary chemotherapy with a platinum-based regimen (usually including taxane) subsequent to surgery. Detailed clinical records of traditional risk factors (age, stage, grade, debulking status) and measurement of disease-free interval were available for 55 of the 119 patients and have been summarized in Table 1. Gene expression data was generated for each patient at the institution of sample origin from RNA extracted from banked tissue derived from primary tumor biopsies. This RNA was hybridized to Affymetrix Human U133A GeneChips according to standard Affymetrix protocol. The results were expression levels from over 22,000 genes and expressed sequence tags (ESTs) for each individual. 
The pre-processing of the gene expression data (normalization and screening) and the use of dimension reduction techniques to build composite genomic predictors prior to analysis are discussed in Sections 5.1 and 5.2. The overall survival time (time from diagnosis to patient death) was selected as the response variable.
The clinical characteristics of the Duke and Moffitt samples were not comparable (see Table 1) and hence we could not use one for training the model and one for validation. We excluded the possibility of using leave-one-out cross-validation due to its instability in model selection [6] and decided instead to divide the combined data set of 55 samples into a training set (60% of samples) and a test set (40% of samples). Although this may introduce bias in internal validation [52], the primary interest in terms of a possible future study is in external validation. Training and test sets were balanced for age, array location (Duke or Moffitt), debulking status, and response to platinum therapy. In order to account for possible assignment bias due to unknown factors we performed 10 runs; in each run the samples were split into different training and test sets and all steps of the analysis, including expression data pre-processing, were repeated. 5.1. Pre-processing of expression data The ovarian cancer data contained expression levels from over 22,000 genes and expressed sequence tags (ESTs) for each individual. We chose to use GeneChip RMA (GCRMA) as our measure of expression since it has been shown to balance accuracy and precision [31]. Our expression data were initially screened to exclude genes showing minimal variation across samples. We evaluated the remaining genes for consistency across both sets using integrative correlations as described in [46]. Across different runs an average of 6400 genes passed all screens (sd = 53.42 genes). Although individual genes could be used as predictors, we chose to create predictors from clusters of similar genes both to reduce dimension and to identify multiple underlying patterns of variation across samples. 5.2. 
Clustering and metagene selection The evaluation and summarization of large-scale gene expression data in terms of lower dimensional factors of some form are being increasingly utilized both to reduce dimension and to characterize the diversity of expression patterns evidenced in the full sample [39,23]. The idea is to extract multiple patterns as candidate predictors while reducing dimension and multiplicities and smoothing out gene-specific noise. Discussion of various factor model approaches appears in [71]. Considering the number of genes in our data set and the heterogeneity of the sample patients we first applied k-means correlation-based clustering to the genes and selected the dominant principal component (or metagene [29]) to represent each cluster. These metagene predictors are input to the tree model analysis, along with the clinical predictors, as a re-expression of the genomic information contained in the original microarray data. Although k-means was chosen for its ease of use and wide availability our approach is amenable to other clustering techniques. The k-means clustering algorithm was applied to the training data in each run, generating an average of 490 gene clusters (sd = 2.76 clusters). As the true number of clusters is unknown it was possible that some clusters did not represent subsets of related genes but were simply an artifact of the clustering algorithm. We identified such clusters by assessing the silhouette widths [38] of genes within clusters and removing clusters containing genes whose widths were not significant. This approach is similar to that of Dudoit and Fridlyand [18]. The significance of a width was determined by comparison to a permutation-based null distribution generated by randomly permuting the entries of each row of the observed gene expression matrix, clustering this permuted matrix using k-means as above, and calculating the silhouette values for the permuted genes. 
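As an illustration of the metagene construction, the following sketch computes the scores of the dominant principal component for one gene cluster by power iteration in plain Python. It assumes that the cluster is given as a list of gene rows over a common set of samples and that each gene is centred across samples; the k-means and silhouette steps are not shown, and the function name `metagene` is hypothetical.

```python
import math

def metagene(cluster, n_iter=200):
    """Return one score per sample on the dominant principal component of a
    gene cluster (rows = genes, columns = samples)."""
    n_genes, n_samples = len(cluster), len(cluster[0])
    # Centre each gene across samples.
    centred = []
    for row in cluster:
        mean = sum(row) / n_samples
        centred.append([v - mean for v in row])

    def project(w):
        # Sample scores: weighted sum of the centred genes.
        return [sum(w[g] * centred[g][s] for g in range(n_genes))
                for s in range(n_samples)]

    # Slightly uneven positive start so the iteration is not stuck at zero
    # for balanced inputs; the sign of the result is not identified.
    w = [1.0 + 0.01 * g for g in range(n_genes)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    for _ in range(n_iter):
        scores = project(w)
        # Multiply by the gene-by-gene cross-product matrix.
        w_new = [sum(centred[g][s] * scores[s] for s in range(n_samples))
                 for g in range(n_genes)]
        norm = math.sqrt(sum(v * v for v in w_new))
        if norm == 0.0:
            break  # no variation left to extract
        w = [v / norm for v in w_new]
    return project(w)
```

The returned vector would play the role of a single metagene predictor, with one value per patient, in the tree analysis.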
Only clusters whose genes had significant silhouette values (p < 0.05) were retained, leaving an average of 310 metagenes for analysis (sd = 20.87 clusters). The permutation null distribution and the gene silhouette values from the initial training/test run are displayed in Fig. 4
5.3. Predictive results Using the training data as a learning set we generated multiple trees under a variety of parameter settings using clinical predictors only, metagenes only, and both metagenes and clinical predictors. The parameter settings were as follows: Bayes factor thresholds of 2.0, 2.5, or 3 on the log base 2 scale, Weibull shape parameter values of 0.8, 1.0, or 1.2, Gamma prior parameters of α = 2 and β = 1/60 or 1/120, up to 20 splits (i.e., 20 new trees) at the root node and up to 3 at each second level node. The choice of Bayes factor threshold was based on frequentist properties: a Bayes factor of 3 is approximately equivalent to a p-value of 0.05. The Gamma prior parameters were chosen to roughly match the mean of the training data, i.e., αβ = μ. The Weibull shape parameter is unknown but values were selected based on the histogram of the training data. Any tree whose relative likelihood value exceeded 1% contributed to the generation of predictions via model averaging. The combination of parameter settings which produced the trees with the most accurate fitted values were retained and used to generate predictions for the validation set. A fitted value at time t for an individual was ‘accurate’ if the fitted probability of surviving for at least time t was greater than a specified cutoff if the recorded survival time for the individual is greater than time t, and vice versa. The specified cutoff was based on an ROC curve to balance specificity and sensitivity. The predictive accuracy of a fitted model was assessed by calculating the predicted auROC estimates at 3-, 4-, and 5-year survival endpoints [11]. 
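The accuracy measures described above can be sketched as follows, assuming that each individual has a predicted probability of surviving beyond time t and a recorded indicator of whether the survival time exceeded t. The rank-based form of the area under the ROC curve and the function names are choices made for this example, not necessarily the procedure of [11].

```python
def au_roc(pred_survival, survived):
    """Area under the ROC curve as the probability that a randomly chosen
    individual who survived past t has a higher predicted survival
    probability than one who did not (ties count one half)."""
    pos = [p for p, s in zip(pred_survival, survived) if s]
    neg = [p for p, s in zip(pred_survival, survived) if not s]
    if not pos or not neg:
        raise ValueError("both outcome groups must be non-empty")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def fitted_accuracy(pred_survival, survived, cutoff):
    """Fraction of individuals whose fitted value is 'accurate': predicted
    probability above the cutoff exactly when the individual survived past t."""
    correct = sum((p > cutoff) == bool(s)
                  for p, s in zip(pred_survival, survived))
    return correct / len(pred_survival)
```

Evaluating `au_roc` separately at t = 3, 4, and 5 years would give one estimate per survival endpoint.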
As can be seen in Table 2 the predictive results varied across the runs with a validation auROC for the median predictions of the clinical only (C), genomic only (G), and clinico-genomic (CG) tree models of 78.96%, 81.27%, and 84.28% at 3-year survival; 79.94%, 81.19%, and 83.55% at 4-year survival; and 76.93%, 77.92%, and 81.11% at 5-year survival. For C models an average of 4 trees had appreciable relative likelihood and contributed to the predictions in any given run. For the G and CG models the average number of contributing trees was 35 and 36, respectively, although only an average of 4.2 and 2.4 trees, respectively, had relative likelihoods above 5%. Note that in several runs the genomic predictors did not improve upon the predictive ability of the clinical data, and in one run (run #8) none of the models demonstrated the ability to predict, but the additional predictive ability provided by the genomic variables is evident when looking across all runs.
A high likelihood tree from run 1 is shown in Fig. 6
Fig. 8
A posterior sample of predictions for each individual can be generated via Monte Carlo sampling of the μz,τ,j and computation of the corresponding values of P(y > t | x), as described in Section 2.4. This provides simulation-based posterior means and uncertainty intervals which are critical in determining the importance of a prediction in clinical decision making. To illustrate this, we selected three individuals from the data set and displayed their predicted survival curves from the CG models in the panels of Fig. 9
In some cases the results using clinical data alone are better than those using both clinical and genomic data (see, for example, Run 3 in Table 2). We suppose that this is due to heterogeneity in the patient subsamples, as no specific gene or metagene was found to be relevant to all samples. It is possible that the clinico-genomic trees could be improved further in these cases by altering the hyperparameter values, such as the Bayes factor threshold, but given the limited amount of data available we chose not to vary the parameter settings across different runs. Given more data the specific tuning of model parameters can be explored in more depth. 5.4. Biological relevance As mentioned in Section 5.3 the metagene predictors vary with each run but we did identify genes which appear in the key metagenes of several runs and for which potentially very relevant biological connections can be made. This demonstrates the power of our approach for exploratory data analysis as well as prediction. We mention a few examples here; a more complete list of metagenes which appeared in predictive trees and their component genes are given in Table 3 [65].
First, the tree in Fig. 6 The examination of subjects whose predictions improve significantly upon the inclusion of genomic data can also yield potentially informative genes. The ACK1 gene mentioned in Section 5.3 was discovered by this strategy. The amplification of the ACK1 gene in primary tumors has been shown to correlate with poor prognosis and the overexpression of ACK1 in cancer cell lines can increase the invasive phenotype of these cells both in vitro and in vivo [69]. In our data set the expression of ACK1 was found to be negatively correlated (not significantly) with survival; however, in the complete data set of 119 individuals this correlation was significant (see Fig. 10
The identification of multiple genes with predictive ability and potential biological relevance to tumor development, reflective of the heterogeneity of the patient sample and the complexity of the underlying disease, is a key finding and suggestive of plausible directions for biological investigation.

6. Discussion

We have presented a Bayesian approach to tree analysis in the specific context of a survival time response and both clinical and genomic predictors. Survival times are assumed to follow a Weibull distribution, and tree construction is based on forward selection, where a split on a (predictor, threshold) pair is performed if the evidence for or against a difference in survival distributions between the resulting subgroups is significant, as assessed by the associated Bayes factor. By averaging predictions across trees with the relative likelihood values as weights we will tend to improve predictions by respecting, and properly accounting for, tree model uncertainty [25]. We note that although averaging predictions across trees does improve model performance, it also decreases the interpretability of the model. This is an important trade-off: predictive ability versus model interpretability. We advocate model averaging because of its improved predictions and because in high-dimensional data settings model uncertainty can be substantial. By building multiple tree models we can explore the covariate space and attempt to address model uncertainty. We understand that in the interpretation of tree models (in terms of prediction accuracy as well as variable selection) it is important that the parameter estimates be unbiased. This has been stressed in the recent tree literature, e.g., [41,40,27]. Our models are not unbiased in the sense that variables with more splitting values are more likely to be selected in model building. To address this bias we have chosen a metric for model accuracy based on predictive accuracy.
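The split criterion and the likelihood-weighted averaging described in this discussion can be sketched as follows. This is only an illustration under stated assumptions: the same Weibull model with known shape `alpha` and a Gamma(a, b) prior (shape a, rate b) on the scale parameter, with one fixed prior used for parent and child nodes. The Empirical Bayes updating of the prior at each split is not reproduced here, and all function names are hypothetical.

```python
import math


def log_marginal(times, events, alpha=1.5, a=1.0, b=1.0):
    """Log marginal likelihood of one node under the assumed model.

    The factor prod(alpha * t**(alpha - 1)) over observed events is omitted
    because it is identical for a parent and its two children combined, so
    it cancels in the split Bayes factor.
    """
    d = sum(events)                         # number of observed events
    s = sum(t ** alpha for t in times)      # sum of t_i**alpha
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + d) - (a + d) * math.log(b + s))


def log_bayes_factor(times, events, x, threshold, alpha=1.5, a=1.0, b=1.0):
    """Log Bayes factor for splitting a node at x <= threshold versus no split."""
    left = [(t, e) for t, e, xi in zip(times, events, x) if xi <= threshold]
    right = [(t, e) for t, e, xi in zip(times, events, x) if xi > threshold]
    if not left or not right:
        return float("-inf")                # degenerate split: never accept
    lt, le = zip(*left)
    rt, re_ = zip(*right)
    return (log_marginal(lt, le, alpha, a, b)
            + log_marginal(rt, re_, alpha, a, b)
            - log_marginal(times, events, alpha, a, b))


def model_weights(log_likelihoods):
    """Normalized tree weights proportional to exp(log-likelihood)."""
    m = max(log_likelihoods)                # subtract max for numerical stability
    w = [math.exp(ll - m) for ll in log_likelihoods]
    total = sum(w)
    return [wi / total for wi in w]


def averaged_prediction(per_tree_survival, log_likelihoods):
    """Weighted average of per-tree survival probabilities for one subject."""
    weights = model_weights(log_likelihoods)
    return sum(w * p for w, p in zip(weights, per_tree_survival))
```

A split would be accepted when `log_bayes_factor(...)` exceeds the log of a chosen threshold. The threshold itself is a tuning parameter whose value is not given in this section.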
This predictive-accuracy metric will help us to identify and remove from consideration ‘fluke’ models which fit the data well but have poor predictive performance. We concede that this approach is not computationally efficient, but it does allow for model exploration, which is critical at this point of our analysis. Of course, as more data are collected we suspect that computational expense will increase but model uncertainty will decrease, at which point we may focus on averaging over fewer models or employing an alternate method which places more emphasis on unbiasedness and model estimation.

We implemented our survival tree modeling in the analysis of pilot data from a study of advanced stage ovarian cancer. Multiple, related patterns of gene expression in combination with clinical data provided strong and predictively valid associations with survival. The models delivered predictive survival assessments together with measures of uncertainty about the predictions. As a result of tree spawning and model averaging, these measures of uncertainty reflected within-tree variability as well as the variability resulting from the sensitivity of the Bayes factor to specific predictor choices and small changes in threshold values. An examination of genes which demonstrate predictive ability across various training and test sets revealed several genes with biologically plausible relevance to carcinogenesis, warranting further investigation.

We chose to use a conjugate Gamma prior in our analysis, although a non-informative prior, such as a Jeffreys' prior, could have been employed. The Jeffreys' prior is a Gamma(a, b) where a = b = 0 [53]. This prior would put relatively more weight on extreme survival values; we felt it was more appropriate to choose values of a and b based on the observed survival times. However, for large sample sizes there should be little difference in the results under the conjugate versus the Jeffreys' prior.
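The claim about sample size can be checked with a small calculation. The sketch below assumes, as an illustration only, that the posterior of the Weibull scale parameter is Gamma(a + d, b + Σ t_i^α) for d observed events, so that the Jeffreys' case corresponds to a = b = 0. It compares posterior means in a small node and a large node. The conjugate values a = 2, b = 1 and the toy data (every subject dies at t = 2) are hypothetical.

```python
def posterior_mean_scale(n_events, sum_t_alpha, a, b):
    """Posterior mean of the scale parameter under a Gamma(a, b) prior."""
    shape = a + n_events
    rate = b + sum_t_alpha
    if shape <= 0 or rate <= 0:
        # With a = b = 0 the posterior is improper unless an event was observed.
        raise ValueError("posterior is improper")
    return shape / rate


def relative_gap(n, t=2.0, alpha=1.5, a=2.0, b=1.0):
    """Relative difference between conjugate and Jeffreys' posterior means
    for a toy node of n subjects who all die at time t."""
    s = n * t ** alpha
    conjugate = posterior_mean_scale(n, s, a, b)
    jeffreys = posterior_mean_scale(n, s, 0.0, 0.0)
    return abs(conjugate - jeffreys) / jeffreys
```

Under these assumptions `relative_gap(5)` is far larger than `relative_gap(500)`, which matches the expectation that the choice of prior matters mainly in small, lower-level nodes.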
Thus we expect little difference in the results from each prior at the root and upper level nodes. In small sample sizes, e.g., lower level nodes, we may see some differences in the models, but the prior parameters are being updated by previous tree splits, which will mitigate any differences. These suppositions were confirmed when we repeated a subset of the simulations from Section 4 using the Jeffreys' prior. The MIE values increased under the Jeffreys' prior relative to the results under the conjugate prior, and the ability to capture the correct model decreased, but qualitatively the results did not change.

In anticipation of future studies we intend to perform further comparisons with existing methods [27,33] and further simulations to examine the impact of tuning parameters and prior assumptions on model performance. Our current approach to missing values is to perform imputation prior to modeling; however, we are considering adjusting our method to deal with missing values directly, as these are common in realistic data analysis contexts. In this study our models were built on 6400 genes and 310 metagenes; it is possible that information from normal tissue samples could be employed to perform further variable selection. Finally, although some progress has been made in developing stochastic simulation methods for Bayesian trees [54], the topic remains a very challenging research area, both conceptually and computationally, particularly in the context of more than a few predictors. We believe that in problems where the number of predictors is very large, properly addressing the issue of stochastic search will require the development of a formal, conceptual foundation before such methods become practicable. The development of such ideas is a focus of our current research.

Acknowledgements

J.C. was supported by NCI grant 5K25CA111636. The authors wish to thank the following persons for their invaluable contributions: Jeffrey R.
Marks, Department of Experimental Surgery, Duke University Medical Center; John Lancaster, Divisions of Gynecologic Surgical Oncology and Cancer Prevention and Control, H. Lee Moffitt Cancer Center and Research Institute; Holly Dressman and Joe Nevins, Institute for Genome Sciences and Policy, Duke University Medical Center; Bertrand Clarke, Department of Statistics, University of British Columbia; Torsten Hothorn, Institut für Medizininformatik, Biometrie und Epidemiologie, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Ed Iversen, Institute of Statistics and Decision Sciences, Duke University.

References

1. American Cancer Society. Cancer Facts and Figures 2006. American Cancer Society; 2006.
2. Berchuck A, Iversen E, Lancaster J, Pittman J, Luo J, Lee P, Murphy S, Dressman H, Febbo P, West M, Nevins J, Marks J. Patterns of gene expression that characterize long-term survival in advanced stage serous ovarian cancers. Clinical Cancer Research. 2005;11:3686–3696.
3. Berger J. Statistical Decision Theory and Bayesian Analysis. 2nd ed. Springer Verlag Inc.; 1993.
4. Bernards R, Weinberg R. Metastasis genes: A progression puzzle. Nature. 2002;418:823.
5. Breiman L. Bagging predictors. Machine Learning. 1996;24:123–140.
6. Breiman L. Heuristics of instability and stabilization in model selection. Annals of Statistics. 1996b;24:2350–2383.
7. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
8. Breiman L. Statistical modeling: The two cultures. Statistical Science. 2001;16:199–225.
9. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Chapman & Hall/CRC Press; 1984.
10. Buntine W. Learning classification trees. Statistics and Computing. 1992;2:63–73.
11. Cawley G. Miscellaneous MATLAB software, data, tricks and demonstrations. 2004. Online: http://theoval.sys.uea.ac.uk/matlab/default.html.
12. Chipman H, George E, McCulloch R. Bayesian CART model search (with discussion). Journal of the American Statistical Association. 1998;93:935–960.
13. Chipman H, George E, McCulloch R. Managing multiple models. In: Jaakkola T, Richardson T, editors. Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics; 2001. pp. 11–18.
14. Chipman H, George E, McCulloch R. Bayesian treed models. Machine Learning. 2002;48:299–320.
15. Clarke J, Horng C-F, Tsou M-H, Huang A, Nevins J, West M, Cheng S. Modeling of clinical information in breast cancer for personalized prediction of disease outcomes. Technical Report, Department of Biostatistics and Bioinformatics. Duke University; Durham: 2006.
16. Denison D, Mallick B, Smith A. A Bayesian CART algorithm. Biometrika. 1998;85:363–377.
17. Duda R, Hart P, Stork D. Pattern Classification. 2nd ed. Wiley; 2001.
18. Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology. 2002;3:research0036.1–0036.21.
19. Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association. 2002;97:77–87.
20. Esufali S, Bapat B. Cross-talk between Rac1 GTPase and dysregulated Wnt signaling pathway leads to cellular redistribution of beta-catenin and TCF/LEF-mediated transcriptional activation. Oncogene. 2004;23:8260–8271.
21. Glinsky G, Higashiyama T, Glinskii A. Classification of human breast cancer using gene expression profiling as a component of the survival predictor algorithm. Clinical Cancer Research. 2004;10:2272–2283.
22. Grant S, Thorleifsson G, Reynisdottir I, Benediktsson R, Manolescu A, Sainz J, Helgason A, Stefansson H, Emilsson V, Helgadottir A, Styrkarsdottir U, Magnusson K, Walters G, Palsdottir E, Jonsdottir T, Gudmundsdottir T, Gylfason A, Saemundsdottir J, Wilensky R, Reilly M, Rader D, Bagger Y, Christiansen C, Gudnason V, Sigurdsson G, Thorsteinsdottir U, Gulcher J, Kong A, Stefansson K. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nature Genetics. 2006;38:320–323.
23. Hastie T, Tibshirani R. Efficient quadratic regularization for expression arrays. Biostatistics. 2004;5:329–340.
24. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag Inc.; 2001.
25. Hoeting J, Madigan D, Raftery A, Volinsky C. Bayesian model averaging: A tutorial (with discussion). Statistical Science. 1999;14:382–401.
26. Hothorn T, Bühlmann P, Dudoit S, Molinaro A, van der Laan M. Survival ensembles. Biostatistics. 2006;7:355–373.
27. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics. 2006;15:651–674.
28. Hothorn T, Lausen B, Benner A, Radespiel-Tröger M. Bagging survival trees. Statistics in Medicine. 2004;23:77–91.
29. Huang E, Cheng S, Dressman H, Pittman J, Tsou M-H, Horng C-F, Bild A, Iversen E, Liao M, Chen C-M, West M, Nevins J, Huang A. Gene expression predictors of breast cancer outcomes. The Lancet. 2003;361:1590–1596.
30. Ibrahim J, Chen M-H, Sinha D. Bayesian Survival Analysis. Springer-Verlag Inc.; 2001.
31. Irizarry R, Wu Z, Jaffee H. Comparison of Affymetrix GeneChip expression measures. Bioinformatics. 2005;1:1–7.
32. Ishwaran H, Blackstone E, Pothier C, Lauer M. Relative risk forests for exercise heart rate recovery as a predictor of mortality. Journal of the American Statistical Association. 2004;99:591–600.
33. Ishwaran H, Kogalur U. Random survival forests. R News. 2006;7/2:25–31.
34. Johnson N, Kotz S, Balakrishnan N. Continuous Univariate Distributions. 2nd ed. Wiley; 1994.
35. Jordan M, Jacobs R. Hierarchical mixtures of experts and the EM algorithm. Neural Computation. 1994;6:181–214.
36. Kang H, Kim H, Kim S, Barouki R, Cho C, Khanna K, Rosen E, Bae I. BRCA1 modulates xenobiotic stress-inducible gene expression by interacting with ARNT in human breast cancer cells. Journal of Biological Chemistry. 2006 (epub ahead of print March 27).
37. Kass R, Raftery A. Bayes factors and model uncertainty. Journal of the American Statistical Association. 1995;90:773–795.
38. Kaufman L, Rousseeuw P. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley; 1990.
39. Li L, Li H. Dimension reduction methods for microarrays with application to censored survival data. Bioinformatics. 2004;20:3406–3412.
40. Loh W-Y. Regression trees with unbiased variable selection and interaction detection. Statistica Sinica. 2002;12:361–386.
41. Loh W-Y, Shih Y-H. Split selection methods for classification trees. Statistica Sinica. 1997;7:815–840.
42. Matsui T, Katsuno Y, Inoue T, Fujita F, Joh T, Niida H, Murakami H, Itoh M, Nakanishi M. Negative regulation of Chk2 expression by p53 is dependent on the CCAAT-binding transcription factor NF-Y. Journal of Biological Chemistry. 2004;279:25093–25100.
43. Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: A multiple random validation strategy. The Lancet. 2005;365:488–492.
44. Oliver J, Hand D. On pruning and averaging decision trees. Proceedings of the Twelfth International Conference on Machine Learning. Morgan Kaufmann; 1995. pp. 430–437.
45. Park P, Tian L, Kohane I. Linking gene expression data with patient survival times using partial least squares. Bioinformatics. 2002;18:S120–S127.
46. Parmigiani G, Garrett-Mayer E, Anbazhagan R, Gabrielson E. A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clinical Cancer Research. 2004;10:2922–2927.
47. Perou C, Sørlie T, Eisen M, van de Rijn M, Jeffrey S, Rees C, Pollack J, Ross D, Johnsen H, Akslen L, Fluge O, Pergamenschikov A, Williams C, Zhu S, Lønning P, Børresen Dale A-L, Brown P, Botstein D. Molecular portraits of human breast tumours. Nature. 2000;406:747–752.
48. Pittman J, Huang E, Dressman H, Horng C-F, Cheng S, Tsou M-H, Chen C-M, Bild A, Iversen E, Huang A, Nevins J, West M. Clinico-genomic models for personalized prediction of disease outcomes. Proceedings of the National Academy of Sciences. 2004;101:8431–8436.
49. Pittman J, Huang E, Nevins J, West M. Bayesian analysis of binary prediction tree models for retrospectively sampled outcomes. Biostatistics. 2004;5:587–601.
50. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2007. ISBN: 3-900051-07-0.
51. Raftery A, Madigan D, Hoeting J. Bayesian model averaging for linear regression models. Journal of the American Statistical Association. 1997;92:179–191.
52. Ransohoff D. Bias as a threat to the validity of cancer molecular-marker research. Nature Reviews Cancer. 2005;5:142–149.
53. Ren C, Sun D, Dey D. Bayesian and frequentist estimation and prediction for exponential distributions. Journal of Statistical Planning and Inference. 2006;136:2873–2897.
54. Rigat F. Parallel hierarchical sampling: a practical multiple-chains sampler. Bayesian Analysis. 2007 (submitted for publication).
55. Sasaki M, Tanaka Y, Kaneuchi M, Sakuragi N, Dahiya R. CYP1B1 gene polymorphisms have higher risk for endometrial cancer and positive correlations with estrogen receptor alpha and estrogen receptor beta expressions. Cancer Research. 2003;63:3913–3918.
56. Sato Y, Suzuki T, Hidaka K, Sato H, Ito K, Ito S, Sasano H. Immunolocalization of nuclear transcription factors, DAX-1 and COUP-TF II, in the normal human ovary: correlation with adrenal 4 binding protein/steroidogenic factor-1 immunolocalization during the menstrual cycle. Journal of Clinical Endocrinology and Metabolism. 2003;88:3415–3420.
57. Sellke T, Bayarri M, Berger J. Calibration of p-values for testing precise null hypotheses. The American Statistician. 2001;55:62–71.
58. Seo D, Dressman H, Hergerick E, Iversen E, Dong C, Vata K, Milano C, Rigat F, Pittman J, Nevins J, West M, Goldschmidt-Clermont P. Gene expression phenotypes of atherosclerosis. Arteriosclerosis, Thrombosis, and Vascular Biology. 2004;24:1922–1927.
59. Serebriiskii I, Estojak J, Sonoda G, Testa J, Golemis E. Association of Krev-1/rap1a with Krit1, a novel ankyrin repeat-containing protein encoded by a gene mapping to 7q21-22. Oncogene. 1997;15:1043–1049.
60. Schapire R, Freund Y, Bartlett P, Lee W. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics. 1998;26:1651–1686.
61. Soland R. Bayesian analysis of the Weibull process with unknown scale parameter and its application to acceptance sampling. IEEE Transactions on Reliability. 1968;R-17:84–90.
62. Spentzos D, Levine D, Ramoni M, Joseph M, Gu X, Boyd J, Libermann T, Cannistra S. Gene expression signature with independent prognostic significance in epithelial ovarian cancer. Journal of Clinical Oncology. 2004;22:4648–4658.
63. Sun X. Pitch accent prediction using ensemble machine learning. Proceedings of ICSLP; 2002. pp. 953–956.
64. Takamoto N, Kurihara I, Lee K, Demayo F, Tsai M, Tsai S. Haploinsufficiency of chicken ovalbumin upstream promoter-transcription factor II in female reproduction. Molecular Endocrinology. 2005;19:2299–2308.
65. The Gene Ontology Consortium. Gene Ontology: Tool for the unification of biology. Nature Genetics. 2000;25:25–29.
66. Therneau T, Atkinson E. An introduction to recursive partitioning using the rpart routine. Technical Report 61, Section of Biostatistics. Mayo Clinic; Rochester: 1997.
67. Tibshirani R, Knight K. Model search by bootstrap ‘bumping’. Journal of Computational and Graphical Statistics. 1999;8:671–686.
68. Uramoto H, Hackzell A, Wetterskog D, Ballagi A, Izumi H, Funa K. pRb, Myc and p53 are critically involved in SV40 large t antigen repression of PDGF beta-receptor transcription. Journal of Cell Science. 2004;117:3855–3865.
69. van der Horst E, Degenhardt Y, Strelow A, Slavin A, Chinn L, Orf J, Rong M, Li S, See L, Nguyen K, Hoey T, Wesche H, Powers S. Metastatic properties and genomic amplification of the tyrosine kinase gene ACK1. Proceedings of the National Academy of Sciences USA. 2005;102:15901–15906.
70. van 't Veer L, Dai H, van de Vijver M, He Y, Hart A, Mao M, Peterse H, van der Kooy K, Marton M, Witteveen A, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend S. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536.
71. West M. Bayesian factor regression models in the ‘large p, small n’ paradigm. Bayesian Statistics 7. Oxford University Press; 2003. pp. 723–732.
72. Yeoh E-J, Ross M, Shurtleff S, Williams W, Patel D, Mahfouz R, Behm F, Raimondi S, Relling M, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui C-H, Evans W, Naeve C, Wong L, Downing J. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1:133–143.
73. Zhang H, Singer B. Recursive Partitioning in the Health Sciences. Statistics for Biology and Health. Vol. 12. Springer Verlag Inc.; 1999.