Comparing the Characteristics of Gene Expression Profiles Derived by Univariate and Multivariate Classification Methods

*German Cancer Research Centre, Email: manuela.zucknick03/at/imperial.ac.uk
†Imperial College London, Email: sylvia.richardson/at/imperial.ac.uk
‡Imperial College London, Email: e.stronach/at/imperial.ac.uk

The publisher's final edited version of this article is available free at Stat Appl Genet Mol Biol.

Abstract

One application of gene expression arrays is to derive molecular profiles, i.e., sets of genes, which discriminate well between two classes of samples, for example between tumour types. Users are confronted with a multitude of classification methods of varying complexity that can be applied to this task. To help decide which method to use in a given situation, we compare important characteristics of a range of classification methods, including simple univariate filtering, penalised likelihood methods and the random forest. Classification accuracy is an important characteristic, but the biological interpretability of molecular profiles is also important. This implies both parsimony and stability, in the sense that profiles should not vary much when there are slight changes in the training data. We perform a random resampling study to compare these characteristics between the methods and across a range of profile sizes. We measure stability by adopting the Jaccard index to assess the similarity of resampled molecular profiles. We carry out a case study on five well-established cancer microarray data sets, for two of which we have the benefit of being able to validate the results in an independent data set. The study shows that those methods which produce parsimonious profiles generally result in better prediction accuracy than methods which do not include variable selection.
For very small profile sizes, the sparse penalised likelihood methods tend to result in more stable profiles than univariate filtering while maintaining similar predictive performance. Keywords: microarrays, molecular signature, classification, multivariate analysis, penalised likelihood 1 Introduction Gene expression microarray technologies allow the study of the simultaneous mRNA expression of thousands of genes and their comparison between different samples and under varying conditions. Of special interest is the construction of gene expression profiles for classification, for example of tumours, or to predict pathological characteristics and clinical outcomes of complex diseases such as cancer. A wide variety of classification methods have been applied to this task (for an overview see Dudoit et al. 2002, Dupuy and Simon 2007) and biologists and clinicians are confronted with a multitude of methods of different complexities. Thus, it is a difficult task to decide which method is best in a given context. To help with this decision, we here compare the most important characteristics of a range of classification methods, from simple univariate filtering to penalised likelihood methods and state-of-the-art machine learning algorithms. We restrict ourselves to the binary classification problem, but all methods can be generalised to situations where one wants to discriminate between more than two classes. It is an additional complicating factor that often there are several potentially contrasting aims involved. One very important goal is to construct gene expression profiles which have a good prediction accuracy, i.e. which are able to classify new samples with a small misclassification error. An additional aim is that the expression profiles can be interpreted in biological terms and provide insight into the data structure (e.g. Dudoit et al. 2002, Somorjai et al. 2003, Díaz-Uriarte and Alvarez de Andrés 2006). 
This implies parsimony, that is that profiles should contain only a relatively small number of genes, which can then be followed up by literature searches and functional experiments to determine their role in the biological processes influencing the phenotype of interest. A third desirable property, that is also related to the interpretability of profiles, is their stability in the sense that the set of genes selected into the molecular profile and the associated predictive performance should not vary much when the set of samples used for training is altered slightly (e.g. Díaz-Uriarte and Alvarez de Andrés 2006). A basic principle of statistical analysis is to provide measures of uncertainty for all estimates. This also applies to molecular profiles derived from microarray gene expression data (Michiels et al. 2005, Ein-Dor et al. 2005, Simon 2006). This means that the uncertainty of genes associated with their inclusion in the profile should be assessed as well as the probability for a particular profile to be selected relative to other possible solutions. Here we address this issue by estimating the instability in molecular profiles using a resampling setup, where the data are repeatedly randomly split into training and validation data and the molecular profiles associated with each of the splits are compared. A perfectly stable profile would contain the same genes for all training/validation splits. We propose to use the Jaccard similarity measure to assess how similar the profiles for all data splits are. The assessment of uncertainty is particularly important for data coming from high-throughput technologies such as gene expression microarrays, because for these data sources a large amount of instability is expected for two reasons. Firstly, such data are associated with large technical and biological variation leading to low signal-to-noise ratio. 
And secondly, the data are high-dimensional and usually comprise many more variables (p) than samples (n), which introduces multi-collinearity in the input data matrix leading to instability in the estimation procedure. This is known as the “large p, small n” problem. One way of solving this problem is by using a univariate filtering method to reduce the number of variables (e.g. Golub et al. 1999, Dudoit et al. 2002, van’t Veer et al. 2002). However, expression levels are often quite highly correlated between genes, because genes are co-regulated or act in the same biological pathways. Univariate approaches do not take the correlation structure into account, in contrast to multivariate methods. Multivariate approaches that are capable of handling p >> n data sets include penalised likelihood methods where a penalty term added to the log-likelihood function enforces unique parameter estimates. Here, we employ and compare the L1- and L2-penalties which correspond to the lasso (Tibshirani 1996) and ridge (Hoerl and Kennard 1970) logistic regression models, as well as the elastic net which combines both penalties (Zou and Hastie 2005). Another approach comes from the machine learning community, where ensemble methods have been used extensively. These methods build powerful classifiers from many weak simple classifiers. Here, we apply random forests as a representative from this class of methods, since they have been shown to perform very well in the context of microarray data, especially when the method is combined with an additional variable selection step (Breiman 2001, Díaz-Uriarte and Alvarez de Andrés 2006). In the following section, the classification methods to derive molecular profiles are described, the resampling setup used to assess the stability of profiles is specified and measures for quantifying stability are characterised. 
The methods are applied to five publicly available microarray gene expression data sets which are introduced in Section 3; for two of these independent data are available for validation. The results are presented in Section 4 and the paper concludes with a discussion.

2 Methods

2.1 Classification methods

We compare several binary classification methods, and use the logistic regression model for all methods except the random forest: $\operatorname{logit} P(y_i = 1 \mid x_i) = \beta_0 + x_i^{T}\beta$ for $i = 1, \dots, n$, where $x_i$ is the vector of covariates of sample $i$, $\beta$ is the vector of regression coefficients and $y \in \{0, 1\}^n$ is the binary response vector. For all methods the amount of shrinkage and thus the sizes of the molecular profiles and their prediction performances depend on tuning parameters. For the univariate method this is simply the number of genes p* chosen to be included in a profile, while for the penalised regression methods they are the penalty parameters for the L1 and L2 norms, λ1 and λ2. Several parameters can be tuned for the random forest methods, the most important one being the number of variables to be considered for node splits in the decision trees. However, Díaz-Uriarte and Alvarez de Andrés (2006) perform an extensive sensitivity analysis and come to the conclusion that the performance of random forests is quite insensitive to the choice of tuning parameter values, and we follow their suggestions for the choice of parameter values for microarray data analyses.

For most analyses the statistical computing package R (R Development Core Team 2006) was used, in particular the affy library for data pre-processing of the Affymetrix data sets and the glm function for univariate logistic regression analyses. The R library glmpath was used for the elastic net and the randomForest and varSelRF libraries for random forest without and with variable selection, respectively. Ridge and lasso regression analyses were carried out with the BBR software by Genkin et al. (2007).
2.1.1 Univariate filtering

Univariate filtering methods select a small number of gene variables based on univariate statistics assessing the potential of individual genes for class prediction. Here, we use the gene effects estimated by logistic regression models fitted for each gene variable separately plus optionally any clinical covariates. The p* “best” genes with the largest absolute effects are selected into the molecular profile.

Nearest-centroid classification (NC)

The simple nearest-centroid classification rule has often been applied successfully to gene expression data (e.g. van’t Veer et al. 2002, Michiels et al. 2005). First, centroids, i.e. mean profiles, are constructed for each class based on the training data available for the selected genes in the molecular profile. New samples are then assigned to the class whose centroid is closer to the sample based on a similarity (or distance) measure, here Pearson’s correlation r. That is, for two classes 0 and 1, a sample with gene expression profile x = (x1, ..., xp*) is assigned to class 1 iff $r(x, \bar{x}_1) > r(x, \bar{x}_0)$, where $\bar{x}_k$ ($k \in \{0, 1\}$) is the mean expression vector (centroid) in the training samples of class k.

Diagonal linear discriminant analysis (DLDA)

Dudoit et al. (2002) compared various classification rules for the univariate filtering approach. They selected between 10 and 200 variables in several microarray data sets and found that simple classification methods generally outperformed more complex methods in this context. In particular, diagonal linear discriminant analysis was found to perform very well. DLDA is similar to the nearest-centroid method, except that here the sample variances are taken into account. Sample x = (x1, ..., xp*) is assigned to class 1 rather than class 0 iff $\sum_{j=1}^{p^*} (x_j - \bar{x}_{1j})^2 / s_j^2 < \sum_{j=1}^{p^*} (x_j - \bar{x}_{0j})^2 / s_j^2$, where $s_j^2$ is the sample variance of gene j in the training data.
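The two classification rules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the DLDA rule is assumed to take its usual form (squared distances to each centroid, scaled gene-wise by a pooled within-class variance), and all variances are assumed to be non-zero.

```python
import math

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def centroid(samples):
    """Gene-wise mean over a list of expression vectors."""
    return [sum(col) / len(samples) for col in zip(*samples)]

def nearest_centroid(x, train0, train1):
    """Return 1 iff x correlates more strongly with the class-1 centroid."""
    return int(pearson(x, centroid(train1)) > pearson(x, centroid(train0)))

def pooled_variances(train0, train1):
    """Gene-wise pooled within-class variance (assumed choice of s_j^2)."""
    c0, c1 = centroid(train0), centroid(train1)
    dof = len(train0) + len(train1) - 2
    return [
        (sum((s[j] - c0[j]) ** 2 for s in train0)
         + sum((s[j] - c1[j]) ** 2 for s in train1)) / dof
        for j in range(len(c0))
    ]

def dlda(x, train0, train1):
    """Return 1 iff the variance-scaled distance to centroid 1 is smaller."""
    c0, c1 = centroid(train0), centroid(train1)
    var = pooled_variances(train0, train1)
    d0 = sum((xj - m) ** 2 / v for xj, m, v in zip(x, c0, var))
    d1 = sum((xj - m) ** 2 / v for xj, m, v in zip(x, c1, var))
    return int(d1 < d0)
```

Both functions take the new sample and the training samples of each class, restricted to the genes of the profile, so the same selected gene list can be passed to either rule.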
2.1.2 Multivariate penalised regression

Ridge regression

The ridge estimator maximises the log-likelihood minus a penalty on the L2 norm of the regression coefficient vector β, which is proportional to a tuning parameter λ2 > 0. For logistic regression the log-likelihood is given as

Lasso regression

Lasso regression (Tibshirani 1996) is similar to ridge regression, with the only difference being that here the L1 norm of the regression coefficient vector is penalised, with a penalty proportional to a tuning parameter λ1 > 0.

Elastic net (ENet)

The naïve elastic net simply uses both L1 and L2 penalty terms in the penalised log-likelihood function:
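As a compact summary, the three penalised objectives can be sketched in generic notation. Only β, λ1 and λ2 are taken from the text; the linear predictor $\eta_i$, the intercept $\beta_0$ and the exact scaling of the penalty terms are assumptions (some implementations, for instance, halve the L2 term), so this should be read as an illustration and not as the exact form used in the study.

```latex
\begin{align*}
l(\beta) &= \sum_{i=1}^{n} \left[ y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right],
  \qquad \eta_i = \beta_0 + x_i^{T}\beta, \\
\hat{\beta}^{\text{ridge}} &= \arg\max_{\beta}
  \left\{ l(\beta) - \lambda_2 \lVert \beta \rVert_2^2 \right\}, \\
\hat{\beta}^{\text{lasso}} &= \arg\max_{\beta}
  \left\{ l(\beta) - \lambda_1 \lVert \beta \rVert_1 \right\}, \\
\hat{\beta}^{\text{ENet}} &= \arg\max_{\beta}
  \left\{ l(\beta) - \lambda_1 \lVert \beta \rVert_1 - \lambda_2 \lVert \beta \rVert_2^2 \right\}.
\end{align*}
```

Setting λ1 = 0 in the last line recovers ridge regression and setting λ2 = 0 recovers the lasso.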
The L1-penalty has the advantage of automated variable selection over the L2-penalty. This implies that for the lasso and the elastic net, the estimated effect of most variables will be shrunk to zero, effectively excluding them from the set of relevant covariates. Note that for the lasso method there is a practical restriction on the maximum number of variables which can be selected, which depends on the sample size n and number of variables p: min(n − 1, p) (Zou and Hastie 2005). This restriction does not apply to the elastic net. All the estimated penalised regression models can be directly used for class prediction, since the logistic regression model provides probability estimates for class membership. A sample is predicted as belonging to a class, if the estimated class probability is larger than 1/2. 2.1.3 Random forest (RF) and varSelRF The random forest classifier (Breiman 2001) is an example of the class of ensemble classification algorithms, which combine the outputs of many “weak” classifiers, in this case classification trees, to produce a powerful ensemble. The random forest can be successful in dealing with the multi-collinearity of “large p, small n” applications, because it combines two ideas to help find as many of the multiple best solutions as possible: firstly, it uses repeated bootstraps, that is each tree is grown using a different bootstrap sample of the data, and secondly it also employs random subspace selection, i.e. it only uses a random subset of all available variables to grow each tree. Because of this, for p >> n data it is likely that most or all variables will get used in node splits for some of the trees. The final classification is the mode of the classifications of all trees: the random forest chooses the class that has been decided by the majority of trees. 
While random forests can deal with p >> n data, it has been found that the classification performance can be improved if the classifier is combined with a variable selection step so that only a small number of variables get used in the entire forest, see Díaz-Uriarte and Alvarez de Andrés (2006). There, the performance of random forests is compared to the varSelRF method, which implements variable selection by iteratively fitting random forests and discarding the variables which get used as nodes least often.

2.2 Multiple random validation study setup

We employ a multiple random validation setup (e.g. Michiels et al. 2005), where the data are repeatedly randomly divided into training data and validation data. We perform 50 random samplings, each with a 2:1 ratio of training to validation sample sizes. The resampling scheme is outlined in Table 1.
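The splitting step of this scheme can be sketched in Python as follows. This is an illustrative sketch only: the function name, the seed and the rounding of the training-set size are assumptions, and no stratification by class is applied.

```python
import random

def random_splits(n_samples, n_splits=50, train_fraction=2 / 3, seed=0):
    """Return n_splits (train, validation) index pairs with a 2:1 size ratio."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_train = round(n_samples * train_fraction)
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_splits):
        shuffled = rng.sample(indices, n_samples)  # random permutation
        train = sorted(shuffled[:n_train])
        validation = sorted(shuffled[n_train:])
        splits.append((train, validation))
    return splits
```

Each classification method would then be fitted on the training indices of a split and evaluated on the corresponding validation indices.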
2.3 Assessing the instability of molecular profiles We view stability of gene expression profiles in terms of whether the same genes get selected for different training data sets. Naturally, this concept does not apply to those classifiers that use all the gene variables. Hence, we only assess the stability for those methods that do incorporate feature selection. In the microarray literature most attempts to evaluate the stability of gene expression profiles for classification have focussed on resampling setups such as bootstrapping (Díaz-Uriarte and Alvarez de Andrés 2006) or repeated splits into training and validation subsets (Michiels et al. 2005). Examples include the approach taken by Díaz-Uriarte and Alvarez de Andrés (2006), Davis et al. (2006), Ma et al. (2006) and others, who argue that if gene sets are stable then the majority of genes will be included in most sets. Consequently, they use the inclusion frequencies to derive a single measure of stability, e.g. by averaging over the frequencies of all genes that get included at least once. Another approach is to use the size of the intersection between gene sets. For example, Ein-Dor et al. (2005) consider all To reflect this, Blangiardo and Richardson (2007) propose the ratio of observed to expected size of intersection in a situation where gene sets are independent. However, in our resampling setup the gene sets are not independent since the various training subsets partly overlap, and in the absence of independence the expected size of intersection is difficult to obtain without computationally demanding data-dependent permutation studies. In addition, relying on the size of the intersection between sets as a measure of similarity between the sets is not satisfactory in itself, because the intersection size does not fulfill several criteria (outlined in the next section) that are desirable in this context. 
2.3.1 Similarity indices

There is a wide variety of similarity measures available for the comparison of sets (see Simpson 1960, Hazel 1970, Sokal and Sneath 1973, and others). Measures ρ(Z1, Z2) for the comparison of two discrete sets Z1 and Z2 are usually based on the two-by-two table counting the presences and absences in both sets (Table 2).
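The counts in such a two-by-two table can be obtained with a few set operations. In the sketch below the labels are an assumption based on the usual convention: a counts genes present in both sets, b and c count genes present in only one of the sets, and d counts genes absent from both. The function name and gene identifiers are hypothetical.

```python
def presence_absence_counts(z1, z2, n_genes):
    """Return (a, b, c, d) for two gene sets drawn from n_genes genes."""
    z1, z2 = set(z1), set(z2)
    a = len(z1 & z2)          # present in both sets
    b = len(z1 - z2)          # present in z1 only
    c = len(z2 - z1)          # present in z2 only
    d = n_genes - a - b - c   # absent from both sets
    return a, b, c, d

counts = presence_absence_counts({"g1", "g2", "g3"}, {"g2", "g3", "g4", "g5"}, 100)
```

With parsimonious profiles d dominates the table, which is why measures that ignore d are attractive here.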
Generally, the gene expression profiles will be parsimonious, so that the number of present genes will be much smaller than the number of absent genes. Because of this we consider the presence of a gene in two profiles to contribute more to the similarity of these profiles than its absence and prefer measures which are independent of the value of d. In addition, the measure should also have a number of other desirable properties (e.g. Janson and Vegelius 1981, Sepkoski Jr. 1974), in particular
Note that the simple intersection size a, which seems an obvious choice and has been proposed often (see the previous section), does not fulfill the homogeneity and boundedness criteria. This means that one cannot compare the values of a(Z1, Z2) and a(Z3, Z4) computed for different pairs of sets, especially if the sets are of different sizes and hence the maximum possible value of a is different for both cases. Three popular measures that do fulfill the requirements and are better suited for comparisons are:
The indices assess the similarity between pairs of sets, so in order to compare m > 2 sets we compute the indices of all possible m(m − 1)/2 pairs of sets.

3 Data

We apply the resampling scheme to five publicly available gene expression data sets, which are summarised in Table 3, and assess the predictive accuracy and stability of the molecular profiles derived from the methods described in the previous section. In the resampling scheme, the samples are assigned randomly to either the training or the validation subset without restriction. An exception is the ovarian cancer data set, where only 17% of all samples belong to the less frequent class. Here, to ensure that all subsets contain samples from both classes, the class proportions in all training and validation subsets are fixed to be the same as in the complete data.
In addition, for two of the data sets (breast and ovarian cancer) independent validation data sets are available. The breast cancer validation data (van de Vijver et al. 2002) have been generated by the same centre using the same platform and protocols as in the original study by van’t Veer et al. (2002). We only include samples in the validation set that were not part of the original study. In addition, we restrict our validation samples to lymph-node negative patients only, since that was an inclusion criterion for the original study. In the instance of the ovarian cancer validation data the validation samples have been collected and processed by a different team in a study conducted independently from the original study by Schwartz et al. (2002). However, both study groups have very similar clinical characteristics and outcome data (Lu et al. 2004) and it is reasonable to assume that both groups come from comparable populations. All gene expression data were generated using Affymetrix oligoarrays (although with several different chip types), except the breast cancer expression data, which were generated with Agilent two-colour arrays. All Affymetrix data are pre-processed and normalised in the same way using RMA background-correction (Irizarry et al. 2003) and loess regression for array normalisation (Cleveland 1979). An exception is the ALL/AML data set where the pre-processed data provided by the R package golubEsets were used. All Affymetrix data are centered and scaled to zero mean and unit variance for all gene variables in the binary classification analysis. The Agilent data are normalised in the same way as described in the original paper (van’t Veer et al. 2002). After filtering and removing the small proportion of genes with missing values, 4770 genes remain for analysis. Note that for the breast cancer data set, clinical data, which are known predictive factors for breast cancer progression, are available in addition to the gene expression data. 
These are patient age, tumour grade, tumour diameter and angioinvasion. Since the interest is in developing molecular profiles which can improve on predictive accuracy on top of known clinical factors, the clinical data are included in all classification methods; their effects are neither shrunk by the penalised likelihood methods nor removed from the active variable set in any of the other methods.

4 Results

Each of the five data sets is randomly split into training and validation subsets m = 50 times. For each of the data sets all classification methods are fitted to each of the 50 training subsets for a range of tuning parameter values. The tuning parameter values are chosen to cover a wide range of models and model sizes. For univariate filtering the number of variables to be selected is $p^* \in \{5, 10, 50, 100, 500\}$, since we assume that the inclusion of more than 500 gene variables will not result in further improvement of the predictive accuracy. For the Affymetrix data sets the ridge and lasso penalty parameters were chosen so that the corresponding prior variances τ range from 0.01 to 100 for lasso and from $10^{-5}$ to 1 for ridge regression (both on the log10 scale). For lasso regression, the smallest profiles with τ = 0.01 usually contain fewer than 5 genes (an exception is the AML/karyotype data set). Since the data are normalised to have unit variance, the choice τ = 100 reflects an extremely large prior variance compared to the sample variances and induces very little shrinkage. Recall that in ridge regression no sparsity is induced, and hence the profile sizes cannot be used to determine the range of values for the penalty parameter. Instead, preliminary test runs were performed to ensure that the range of models with best prediction accuracy is covered for all data sets.
Since the Agilent data are pre-processed and normalised in a different way to the Affymetrix data, slightly different penalty parameter values had to be chosen to cover the entire range of models ($\tau_1 \in \{10^{-3}, 10^{-2.5}, 10^{-2}, 10^{-1.5}, 10^{-1}\}$ for lasso and $\tau_2 \in \{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ for ridge). Note that for the elastic net two tuning parameters exist, which regulate the size of the L1- and L2-penalty terms, respectively. Here, we use a fixed penalty for the L2-term (λ2 = 1) and vary the size of the L1-penalty parameter λ1 from zero, resulting in the largest possible profiles, which are of comparable sizes to the largest observed lasso profiles, to a value large enough to induce maximum sparsity, i.e. so that no genes are included in the model.

4.1 Prediction accuracy

The fitted models can be applied to the corresponding validation subsets, where we record the proportion of misclassified samples as a measure of prediction accuracy, resulting in 50 misclassification error values per data set and fitted model, which are summarised as boxplots in Figure 1.
Note that because we want to compare characteristics of molecular profiles of different sizes, we do not tune the classification methods to optimise their performances in terms of minimal misclassification errors, i.e. we do not attempt to choose “optimal” tuning parameter values among those values presented alongside each other in Figure 1.

The predictive accuracies that can be achieved by gene expression profiles vary widely between data sets. The first three data sets (ovarian cancer, ALL/AML, and prostate cancer) are easily separable, and the minimum median error rates correspond to as few as one or two misclassified samples for the ovarian and prostate cancer data. Classification is not so easy for the last two data sets (breast cancer and AML/karyotype), where the best median error rates achieved are about 30%. This reflects the idea that different types of outcome data are related to gene expression to different degrees. For example, tissue or tumour type can be explained to a large degree by gene expression, as we observe for the first three data sets. On the other hand, making a prognosis for cancer survival is a much more complex problem which is influenced to a large degree by environmental factors, as well as additional genetic factors other than just gene expression levels. This is in part reflected by the fact that for our breast cancer data, where clinical covariates are available, the use of these clinical variables alone for classification between patients with favourable and unfavourable survival prognosis already achieves a median misclassification proportion of 39%. So the additional inclusion of gene expression data as predictors only reduces the median error rate from 39% to about 30%.

In general, prediction performances of the lasso, elastic net, and univariate filtering methods all seem to be comparable in the sense that for all data sets we observe similar minimum values of the median error rates between these methods.
There is not much difference between the two sparse penalised likelihood methods (lasso and elastic net); the additional L2-penalty term introduced in the elastic net does not seem to improve the prediction accuracy. Remember that the two univariate filtering methods only differ in the final classifier applied to the selected genes; the gene lists themselves are identical. Despite this, the prediction performances can somewhat differ. In particular, DLDA achieves smaller median error rates than the nearest centroid (NC) methods for the prostate cancer and AML/karyotype data. For the classification methods that do not perform automatic variable selection, i.e. ridge regression and the random forest (RF), the prediction performances are comparable to the other methods for some data sets (with equal or slightly larger median error rates), but in some cases they perform substantially worse. For example, both ridge regression and random forest have higher misclassification errors in the ovarian and breast cancer data, and in addition the random forest has higher error rates when applied to the prostate cancer data. The varSelRF method (random forest with variable selection) tends to have smaller prediction errors than the random forest without variable selection, except in the ALL/AML data set. 4.2 Ranking genes by their profile inclusion frequencies Classification methods with inherent variable selection produce parsimonious gene expression profiles consisting of a small number of genes. In our case these methods are univariate filtering, lasso, elastic net and varSelRF. Since for these methods not all genes are selected into all profiles corresponding to the resampled training data subsets, we can rank the genes by their frequency of selection, giving a measure for the relative importance of a gene for class prediction (e.g. Michiels et al. 2005, Díaz-Uriarte and Alvarez de Andrés 2006). 
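The ranking by inclusion frequency can be sketched in a few lines of Python. The function name and the gene identifiers in the usage lines are hypothetical, and each profile is assumed to be given as a collection of gene identifiers, one profile per resampled training subset.

```python
from collections import Counter

def selection_frequencies(profiles):
    """Rank genes by the fraction of resampled profiles that include them."""
    m = len(profiles)
    # set(profile) guards against a gene being counted twice in one profile
    counts = Counter(gene for profile in profiles for gene in set(profile))
    ranked = [(gene, count / m) for gene, count in counts.items()]
    # most frequently selected genes first, ties broken alphabetically
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

ranking = selection_frequencies([{"A", "B"}, {"A", "C"}, {"A", "B", "D"}, {"B"}])
```

Genes that never enter any profile do not appear in the ranking, which corresponds to a selection frequency of zero.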
We will be more certain that a gene is relevant if it is selected most of the time. As an example, the selection frequencies are shown for the ovarian cancer data in Figure 2. For univariate filtering, only the profiles with $p^* \in \{5, 10, 50\}$ are shown to avoid plot overcrowding.
Lasso regression always selects the same five genes into more than half of the m = 50 resamples, for all penalty values λ1 > 0.01. The same five genes also get chosen very often into elastic net profiles. Two of these (M82809 and U11862) are the only genes that get selected into more than half of the varSelRF profiles, which are generally very small with a median profile size of three. While we observe this good agreement for the multivariate methods, different genes are included most frequently by univariate filtering. Only one of the five genes always found by lasso is also part of more than half of the univariate profiles with p* ≤ 50 genes (X65614). This is reflective of the fact that variables get selected independently in the univariate filtering approach, while multivariate methods take into account the correlation structure between genes. Note that when the profile sizes become larger, either by increasing p* or by decreasing the penalty parameter λ1, genes that had been included often in smaller profiles are generally included in the larger profiles as well and rarely get dropped.

4.3 Profile stability

In addition to assessing the frequencies of individual genes, we can use the resampling setup to evaluate the molecular profiles as a whole in terms of their stability. We compute the Jaccard similarity indices for all pairs of non-empty gene sets, which are summarised in Figure 3.
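The pairwise computation can be sketched as follows, assuming the standard definition of the Jaccard index as the size of the intersection of two sets divided by the size of their union. The function names and gene identifiers are hypothetical.

```python
from itertools import combinations

def jaccard(z1, z2):
    """Jaccard index of two non-empty sets: |intersection| / |union|."""
    z1, z2 = set(z1), set(z2)
    return len(z1 & z2) / len(z1 | z2)

def pairwise_jaccard(profiles):
    """Jaccard indices for all pairs of non-empty profiles."""
    non_empty = [set(p) for p in profiles if p]  # drop empty gene sets
    return [jaccard(p, q) for p, q in combinations(non_empty, 2)]

values = pairwise_jaccard([{"g1", "g2"}, {"g2", "g3"}, set(), {"g1", "g2"}])
```

For m non-empty profiles this yields m(m − 1)/2 values, whose distribution can then be summarised, for example by its mean.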
In general, the observed Jaccard index distributions of elastic net and lasso are similar. The mean Jaccard values are largest for the smallest profiles which contain fewer than about ten genes. They then decline roughly monotonically with decreasing values of the penalty parameter λ1 (which is equivalent to increasing profile sizes). The similarity values observed for the molecular profiles from univariate filtering follow a different pattern. They vary less across profile sizes and are largest for the very large profiles containing 100 or 500 genes, with the exception of the AML/karyotype data.

Note that this increase in similarity for very large univariate profiles can at least in part be explained by a spurious effect due to how the resampling study is designed. Because each of the m = 50 training subsets consists of two-thirds of the complete data set, the expected intersection between any two training subsets is considerable: 4/9 of the samples in the complete data set. Because of the overlap in samples, one expects a certain size of intersection between any two selected gene lists, even in cases where the gene expression data have no predictive power for the response of interest at all, for example because the response data have been randomised. This overlap can hence not be attributed to the classification method’s ability to produce stable profiles for classification. This is illustrated in the top-right plot of Figure 3.

Among the very parsimonious profiles (≤ 5 genes), lasso and elastic net molecular profiles tend to be equally or more stable than the univariate profiles with respect to our stability measures. An exception is the AML/karyotype data set where the univariate method has much larger similarity values across all profile sizes than all other methods. The random forest with variable selection is comparable to the penalised likelihood methods.
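The 4/9 overlap quoted above follows because a given sample falls into each of two independently drawn training subsets with probability 2/3, and hence into both with probability (2/3)² = 4/9. A small Python check, with hypothetical function names and an arbitrary sample size, seed and number of simulated pairs:

```python
import random
from fractions import Fraction

def expected_overlap_fraction(train_fraction):
    """Expected share of all samples that lie in both of two random subsets."""
    return train_fraction * train_fraction

def simulated_overlap_fraction(n_samples, n_train, n_pairs=2000, seed=1):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    population = range(n_samples)
    total = 0
    for _ in range(n_pairs):
        first = set(rng.sample(population, n_train))
        second = set(rng.sample(population, n_train))
        total += len(first & second)
    return total / (n_pairs * n_samples)

exact = expected_overlap_fraction(Fraction(2, 3))   # Fraction(4, 9)
estimate = simulated_overlap_fraction(90, 60)       # close to 0.444
```

Expressed relative to the size of one training subset instead of the complete data, the same overlap corresponds to two-thirds of the training samples.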
For some applications, the computed similarity indices are very small for all but the smallest profiles; this applies in particular to the breast cancer data for all methods and to the AML/karyotype data for the penalised likelihood methods. After accounting for the effect of the resampling study design described above, the remaining stability that can be attributed to the classification method itself is even smaller.

We find that the overall stability patterns across the range of tuning parameter values are quite different between data sets. In particular, the similarities between profiles from univariate filtering vary widely. The profiles are least stable for the breast cancer data and most stable for the prostate cancer data. These differences in similarity distributions must be due to the differences in data structure, for example in the correlation structure between genes which are related to the response and get selected into classification profiles. Hence, for each profile we compute all pairwise correlations between all genes in the profile and record the mean of the absolute values of these correlations as a summary measure for the strength of correlations in that profile. The distributions of these mean absolute correlations across all m = 50 resamples are illustrated by their means and standard deviations in Figure 4.
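The within-profile correlation summary described above can be sketched as follows. Pearson's correlation is assumed as the correlation measure, the function names are hypothetical, and the profile is assumed to contain at least two genes with non-constant expression.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def mean_absolute_correlation(expression):
    """Mean |r| over all gene pairs of one profile.

    expression maps each gene in the profile to its expression values
    across the samples (same sample order for every gene).
    """
    pairs = list(combinations(expression.values(), 2))
    return sum(abs(pearson(u, v)) for u, v in pairs) / len(pairs)
```

Applying this function to each of the m resampled profiles gives the distribution whose mean and standard deviation are summarised per method and tuning parameter value.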
Note that the shapes of the median absolute correlations plotted against median profile sizes in Figure 4

One big difference between univariate filtering and multivariate classification methods is that in univariate filtering, variables are selected individually without taking the correlation structure between variables into account. Imagine two highly positively correlated variables. If these variables are also highly related to the response, then they would likely both be included by univariate filtering, even though, given that one variable is already included, the other adds little to the explanatory power of the profile and might be quite unnecessary. In contrast, the L1-penalty term in the lasso and elastic net methods discourages the inclusion of both variables together, because the increase in the likelihood achieved by including both is likely to be outweighed by the increase in the penalty term. In a resampling study, one of the two variables might be selected into most of the resampled lasso and elastic net profiles, but they will rarely be selected together. This affects the Jaccard index for larger profiles, as indeed we have observed earlier. On the other hand, most resampled univariate filtering profiles will contain both variables, resulting in both a larger within-profile correlation and a larger Jaccard similarity measure between the univariate profiles.

However, one can argue that two highly correlated variables in two different profiles do contribute to the similarity of these two profiles, since they can replace each other without much loss of information. In a first attempt to reflect this in the similarity measurements, we extend the pairwise Jaccard index by adding a term to the numerator that summarises the contributions of genes which are present in one set but not the other, and which have large correlations with genes of the other set.
The approach is outlined in Appendix B and the results for one possible way of extending the Jaccard index to incorporate correlation are shown in Figure 6.

4.4 Validation on independent data sets

For the breast and ovarian cancer data sets, we have independent validation data available, which represent populations that are comparable to those of the original studies in terms of clinical and phenotypical characteristics and disease outcome. This allows us to assess how well the predictive abilities of the gene expression profiles translate to new data, that is, whether the gene sets we found earlier using the data sets by Schwartz et al. (2002) and van’t Veer et al. (2002) are still predictive for the binary response in new validation data (Lu et al. 2004, van de Vijver et al. 2002). Because we are interested in how well the performances of parsimonious molecular profiles translate, we focus on those classification methods which incorporate variable selection, i.e. univariate filtering, lasso, elastic net and varSelRF. We choose tuning parameter values that result in very small profiles of comparable sizes with small misclassification errors observed for the original data. On average the profiles contain between 5 and 11 genes for the ovarian cancer data and between 5 and 18 genes for the breast cancer data.

We employ logistic regression models with all genes from the molecular profile of interest included as covariates. Note that of the four clinical covariates used in the analysis of the van’t Veer et al. (2002) breast cancer data, only three are available for the validation data (patient age, tumour grade and tumour diameter, but not angioinvasion), so only these three can be included here. This potentially compromises the predictive abilities of the molecular profiles and of course reflects a common problem with validation studies.
The validation results are listed in Table 4, which shows the median misclassification rates across all profiles derived from the 50 training subsets of the Schwartz et al. (2002) and van’t Veer et al. (2002) data, respectively. In order to assess whether the predictive accuracies achieved by the molecular profiles are better than expected of randomly produced gene sets, one-sided permutation tests are performed. This is done by randomly sampling, with replacement, 1000 gene sets from the data, each containing the same number k of genes as the real profile, and comparing the error rates observed for the real profiles with the distributions of error rates obtained for the random gene sets. We report the proportion among the m = 50 profiles for each method and data set that have significantly low misclassification rates, i.e. where the error rates for the real profiles are smaller than the 10%-quantiles of the random distributions.
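The comparison against random gene sets can be sketched as below. Several details are assumptions made for illustration: `error_fn` is a hypothetical callable that fits the validation model for a given list of gene indices and returns its misclassification rate, genes within one random set are taken to be distinct, and the 10%-quantile is computed as a simple order statistic.

```python
import random


def random_set_test(real_error, k, n_genes, error_fn,
                    n_random=1000, quantile=0.10, seed=0):
    """One-sided test of a profile's error rate against random gene sets.

    real_error : misclassification rate of the real profile
    k          : number of genes in the real profile
    n_genes    : total number of genes available on the array
    error_fn   : hypothetical callable, gene indices -> error rate
    Returns (is_significant, cutoff).
    """
    rng = random.Random(seed)
    random_errors = sorted(
        error_fn(rng.sample(range(n_genes), k)) for _ in range(n_random)
    )
    # Empirical quantile taken as an order statistic (assumed convention).
    cutoff = random_errors[int(quantile * n_random)]
    return real_error < cutoff, cutoff


def proportion_significant(results):
    """Share of the m resampled profiles that pass the test."""
    return sum(1 for significant, _ in results if significant) / len(results)
```

Calling `random_set_test` once for each of the m = 50 profiles of a method and passing the results to `proportion_significant` yields a proportion of the kind reported in Table 4.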
For both data sets, the median error rates are always smaller than the baseline error, which is the minimum misclassification error achievable if the gene expression data were not taken into account. For ovarian cancer this is the proportion of samples that would be misclassified if all samples were simply assigned to the most frequent class. Since for the breast cancer data additional clinical data are available, the baseline error rate is the error achieved by a logistic regression model only containing the clinical covariates.

In both examples, the univariate filtering approach results in the highest misclassification error rates. The elastic net performs best in the sense that it provides the smallest misclassification rates in both applications. However, this could be linked to the larger sizes of these profiles: the elastic net profiles have on average slightly more genes than those of the other methods. This is supported by the fact that in terms of the proportion of results which are significantly better than expected at random, the elastic net performs less well than the other methods. Only 21 (ovarian cancer) and 29 (breast cancer) out of all 50 results are significant for the elastic net, while for the other three methods the proportions of significant results range between 31/50 and 33/50 for the ovarian cancer application and are even higher for the breast cancer data, ranging from 43/50 to 50/50.

In summary, these results show that it is possible to generate molecular profiles for binary classification such that the predictive abilities translate well to new data. It is hard to come to a conclusion on which classification method performs best on a new data set based on our two examples only.

5 Discussion

It has been pointed out (e.g. Ein-Dor et al. 2005, Michiels et al. 2005) that molecular profiles derived from gene expression microarray data can be highly unstable, i.e. which genes get selected into a profile depends strongly on the choice of training data. Hence, there is a need for careful validation of results (Simon et al. 2003, Dupuy and Simon 2007) and it is important to assess the uncertainty associated with molecular profiles. To do this we employed a resampling approach and compared important characteristics of gene expression profiles derived using several univariate and multivariate methods for binary classification. We applied these methods to five publicly available gene expression microarray data sets. In particular, we compared the classification methods in terms of the predictive accuracy of the resulting gene expression profiles and how well the predictive ability translates to new data, as well as profile stability and how parsimonious the profiles are.

Results vary between the different data sets and depend strongly on the data structure, e.g. the correlation structure between those genes which are related to the response variable and which get selected into the molecular profiles. But for all data sets, the best predicting gene expression profiles are small (between 3 and 80 genes). In terms of predictive ability, we observe comparable performances between those methods that incorporate variable selection, in particular univariate filtering, lasso and elastic net. In contrast, the methods that employ most or all genes for classification, i.e. ridge regression and random forest, often performed worse, sometimes substantially. This reflects the p >> n situation of gene expression data, but also conforms with the idea that usually only a small number of genes are expected to influence any particular biological condition or disease.

It is particularly interesting to note that the prediction performance of simple univariate filtering methods is comparable to that of more complex multivariate methods, even though they do not take the correlation structure of gene expression data into account.
This has also been observed in a recent study by Lai et al. (2006). An explanation for this is the small sample size in most available microarray data sets, due to which the correlation structure in the data cannot be estimated accurately enough, so that multivariate methods cannot profit sufficiently from the correlation structure. Note, however, that if the response data to be fitted are continuous (censored or not censored) rather than binary, then the sample size needed to give multivariate methods an edge over univariate methods can be smaller. This is because a vector of continuous data contains more information than a binary vector of the same length, which makes a perfect model fit harder to achieve. In that situation, methods which can use more information, in particular the correlation structure between covariates, gain more of an advantage than in the binary classification situation. In a recent study comparing several methods applied to Cox proportional hazards models for predicting survival from microarray data (Bøvelstad et al. 2007), the authors found that all multivariate methods performed clearly better than univariate approaches.

The stability of the molecular profiles was assessed by the distribution of pairwise similarity between the profiles resulting from the resampled training sets. Similarity was measured by the Jaccard index and by an extended Jaccard index that accounts for highly correlated gene variables in different profiles. We found that both indices lead to similar results in terms of comparisons between classification methods. In all data sets, for the multivariate methods lasso and elastic net, the stability depends strongly on the number of genes in the molecular profiles and decreases with increasing profile sizes. The stabilities observed for univariate filtering profiles, on the other hand, are not influenced much by an increase in the number of genes included in the profiles.
For very parsimonious profiles with p* ≤ 5 genes, both lasso regression and elastic net are found to be more stable than the univariate methods, except in one data set. The very small profiles are often those we are most interested in, because of their good interpretability and because we found that small profiles generally perform better in terms of prediction accuracy.

For two of the data sets independent data were available for validation. We applied parsimonious gene expression profiles constructed with the original data (using those classification methods which incorporate variable selection) to the new data, using the genes as covariates in logistic regression models. We found that all profiles translated reasonably well in terms of the predictive accuracy achieved on both new data sets.

Note that the question of whether the predictive accuracy of existing profiles translates to new data is different from the question of whether the same genes would be found in a new analysis on the new data, as has been pointed out e.g. by Somorjai et al. (2003), Roepman et al. (2006) and Simon (2006). This will generally not be the case, in part due to the correlated nature of gene expression data because of underlying biological processes and also due to the “large p, small n” nature of microarray data, which implies multicollinearity and non-uniqueness of solutions. In this context it is often noted that there is a difference between the use of gene expression profiles as prognostic factors to predict disease outcome etc. in new samples and their use for detecting genes which are causal to the disease. Nonetheless, gene expression profiles are often used as a starting point for the exploration of causality and underlying biological processes. In such a situation, parsimony and stability of profiles are important properties, because concise and stable gene lists are easier to interpret and provide a more clear-cut starting point for further exploration than e.g. a list of several hundred genes where the distinction between their individual contributions is less clear. Also, for the direct use of gene expression profiles as prognostic factors, small profiles consisting of a few genes only are much cheaper and easier to apply on a large scale than large gene lists. In addition, the expression of a small number of genes can readily and accurately be tested by using, for example, quantitative real time PCR rather than high-throughput microarrays, thus improving the signal-to-noise ratio while also being more cost effective.

In summary, we find that binary classification methods which produce parsimonious gene expression profiles generally result in profiles which have better prediction accuracy than methods which do not include variable selection. Multivariate sparse penalised likelihood methods like the lasso and elastic net might have a slight edge compared to univariate filtering in terms of prediction performance and how well the predictive ability of profiles translates to new data, although the differences are not large. Their performance is likely to improve further relative to univariate methods when sample sizes increase in the future. For very small molecular profiles containing only 5 genes or fewer, the sparse penalised likelihood methods have an additional advantage, as they tend to produce profiles which are more stable than univariate filtering profiles while maintaining similar or better predictive performance.

Appendix

A Similarity indices: results for Schwartz et al. (2002)

B Extension of Jaccard index to incorporate correlation

Gene expression data often contain gene variables which are highly correlated, for example because the genes are co-regulated or act together in the same biological pathway. Suppose gene i with expression xi is present in profile Z1 but not in profile Z2. Assume further that this gene i is highly correlated with gene j with expression xj, which is present in profile Z2 but not in Z1.
In this situation one can argue that genes i and j contribute to the similarity of the molecular profiles Z1 and Z2 with a weight that is proportional to their (absolute) correlation Rxi,xj. In order to illustrate the possible effect this could have on the similarity index, we define the contribution of a gene, which is present in Z1 but not in Z2, to the similarity by its (thresholded) mean absolute correlation with all genes which are in profile Z2 but not in Z1. In order to reduce the influence of spurious small correlations, only absolute correlations larger than a threshold t are taken into consideration, which we take to be data dependent and set to the median absolute correlation between genes in Z1 and Z2:
{0.25, 0.5, 0.75}). In conclusion, the choice of the threshold did not influence the results much as long as t was not too close to zero. We also replaced the thresholded mean absolute correlations by equivalent thresholded median absolute correlations, which had very little impact on the results.
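The construction described in this appendix can be sketched as follows. Because the exact formula is not reproduced in the text here, the way the contributions are combined is an assumption: each gene present in only one profile contributes its thresholded mean absolute correlation with the genes present only in the other profile, these contributions are added to the size of the intersection, and the sum is divided by the size of the union. The function name, the argument layout and the default threshold (median absolute correlation over all pairs of genes from Z1 and Z2) are likewise illustrative choices.

```python
import numpy as np


def extended_jaccard(z1, z2, R, t=None):
    """Correlation-extended Jaccard index (illustrative sketch).

    z1, z2 : iterables of gene indices (the two profiles)
    R      : square correlation matrix over all genes
    t      : threshold; absolute correlations <= t are set to zero.
             If None, an assumed data-dependent default is used.
    """
    z1, z2 = set(z1), set(z2)
    union = z1 | z2
    if not union:
        return 1.0  # assumed convention for two empty profiles
    only1, only2 = sorted(z1 - z2), sorted(z2 - z1)
    extra = 0.0
    if only1 and only2:
        A = np.abs(np.asarray(R, dtype=float))
        if t is None:
            t = float(np.median(A[np.ix_(sorted(z1), sorted(z2))]))
        block = A[np.ix_(only1, only2)]
        block = np.where(block > t, block, 0.0)
        # Row means: contributions of genes only in z1;
        # column means: contributions of genes only in z2.
        extra = float(block.mean(axis=1).sum() + block.mean(axis=0).sum())
    return (len(z1 & z2) + extra) / len(union)
```

Under these assumptions the index reduces to the ordinary Jaccard index when no correlation exceeds the threshold, and it cannot exceed 1 because each contribution is at most 1.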
Footnotes

*The authors thank the reviewers for their helpful comments and suggestions. We also thank Prof. Hani Gabra for helpful discussions and Dr. David Cameron for advice on the choice of clinical predictors for breast cancer. The work of Manuela Zucknick was carried out at the Centre of Biostatistics at the Department of Epidemiology and Public Health, Imperial College London, and was funded through a PhD studentship by the Wellcome Trust (UK).