AMIA Annu Symp Proc. 2009; 2009: 406–410.
Published online 2009 Nov 14.
PMCID: PMC2815476
PMID: 20351889

Measuring Stability of Feature Selection in Biomedical Datasets

Abstract

An important step in the analysis of high-dimensional biomedical data is feature selection. Typically, a feature subset selected by a feature selection method is evaluated for relevance towards a task such as prediction or classification. Another important property of a feature selection method is stability that refers to robustness of the selected features to perturbations in the data. In biomarker discovery, for example, domain experts prefer a parsimonious subset of features that are relatively robust to slight changes in the data. We present a stability measure called the adjusted stability measure that computes robustness of a feature selection method with respect to random feature selection. This measure is useful for comparing the robustness of feature selection methods and is superior to similar measures that do not account for random feature selection. We demonstrate the application of this measure on a biomedical dataset.

Introduction

High-throughput platforms such as DNA microarrays, mass spectrometry and single nucleotide polymorphism microarrays generate high-dimensional transcriptomic, proteomic and genomic data. The analysis of such data typically includes development of classification models to discriminate, for example, between disease and health, and discovery of biomarkers of disease which involves the selection of features that are highly associated with disease. In both situations, an important step in the analysis is the reduction in the dimensionality of the data that is typically accomplished by feature selection. Feature selection, also known as feature subset selection, is the process of selecting a subset of features that optimizes a certain criterion such as classification accuracy.

In addition to classifier performance, another important criterion for evaluating a feature selection method is its stability. Stability is the sensitivity exhibited by a feature selection method to perturbations in the data. Stability is typically more important for knowledge discovery as in biomarker discovery, than in constructing accurate classifiers. In biomarker discovery, domain experts prefer a parsimonious subset of features that are relatively robust to slight changes in the data.

In this paper we present a stability measure that can be used to evaluate and compare feature selection methods. We demonstrate the application of this measure to several classifier based feature selection methods on a biomedical dataset.

Background

A stability measure typically requires defining a similarity measure that assesses the commonality of a pair of feature subsets. Given an appropriate similarity measure, stability is computed as the average similarity over all pairs of feature subsets. Several stability measures that utilize a similarity measure have been described in the literature. Given two feature subsets s_i and s_j, Kalousis et al.1 measure similarity between the two subsets as:

$$S_S(s_i, s_j) = 1 - \frac{|s_i| + |s_j| - 2\,|s_i \cap s_j|}{|s_i| + |s_j| - |s_i \cap s_j|} = \frac{|s_i \cap s_j|}{|s_i \cup s_j|} \qquad (1)$$

where |s_i ∪ s_j| is the cardinality of the union of s_i and s_j, and |s_i ∩ s_j| is the cardinality of the intersection of s_i and s_j. The unadjusted stability measure USM for a total of c feature subsets (that may be obtained, for example, from c runs of a feature selection method) is defined as the average of the pairwise similarity for all pairs of feature subsets:

$$USM = \frac{2}{c(c-1)} \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} S_S(s_i, s_j). \qquad (2)$$

The reason for the term unadjusted in USM will become clear later on. Jordan et al.2 developed an analogous similarity and stability measure to compare methods of feature selection for gene expression data.
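
As a concrete illustration, both the Kalousis similarity S_S and the USM can be computed directly from the selected feature subsets. The following minimal Python sketch is ours, not part of the original paper; the function names, the example subsets, and the convention of returning 0 for a pair of empty subsets are illustrative assumptions.

```python
from itertools import combinations

def similarity_ss(s_i, s_j):
    """Kalousis et al. similarity (Equation 1): |intersection| / |union|."""
    s_i, s_j = set(s_i), set(s_j)
    if not s_i and not s_j:
        return 0.0  # our convention for two empty subsets, where Equation 1 is undefined
    return len(s_i & s_j) / len(s_i | s_j)

def usm(subsets):
    """Unadjusted stability measure (Equation 2): average pairwise similarity."""
    pairs = list(combinations(subsets, 2))
    return sum(similarity_ss(a, b) for a, b in pairs) / len(pairs)

# Example: feature subsets selected in three runs of a feature selection method.
subsets = [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f3", "f7"}]
print(usm(subsets))  # average Jaccard similarity over the three pairs
```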

Dunne et al.3 measure similarity as the relative Hamming distance between the two subsets s_i and s_j:

$$S_H(s_i, s_j) = 1 - \frac{|s_i \setminus s_j| + |s_j \setminus s_i|}{n}, \qquad (3)$$

where ‘\’ is the set-minus operation and n is the total number of features.
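
A sketch of this similarity in the same style follows (again our own illustration; note that n, the total number of features, must be supplied because the measure is normalized by it):

```python
def similarity_sh(s_i, s_j, n):
    """Dunne et al. similarity (Equation 3): 1 minus the relative Hamming distance."""
    s_i, s_j = set(s_i), set(s_j)
    symmetric_difference = len(s_i - s_j) + len(s_j - s_i)
    return 1.0 - symmetric_difference / n

print(similarity_sh({"f1", "f2"}, {"f1", "f3"}, n=70))  # 1 - 2/70
```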

One deficiency in the above similarity measures (and the corresponding stability measures) is that they are not adjusted for the commonality of a pair of feature subsets that may be obtained purely by chance. These measures do not correct for the artificial increase in the score that occurs with increasingly larger feature subsets. For example, a pair of feature subsets that each contains every feature in the domain attains the maximum score, which is not informative. An improved similarity measure would indicate how much better or worse a feature selection method is than one that selects features at random. Such a measure, which incorporates the effect of chance, was developed by Kuncheva4 and is given by:

$$I_C(s_i, s_j) = \frac{r - \frac{k^2}{n}}{k - \frac{k^2}{n}} = \frac{rn - k^2}{k(n-k)}, \qquad (4)$$

where k = |s_i| = |s_j|, r = |s_i ∩ s_j|, n is the total number of features, and k²/n is the expected value of |s_i ∩ s_j| that is obtained by modeling the cardinality of the intersection of two equal-sized subsets as a hypergeometric distribution. However, one limitation of this measure is that it is applicable only to feature selection methods that generate feature subsets of fixed cardinality. Feature ranking methods can be adapted to generate feature subsets of a fixed size; however, feature selection methods typically generate feature subsets of varying cardinality.

The Adjusted Stability Measure

We now derive an adjusted similarity measure S_A that accounts for chance and accommodates subsets of differing cardinalities, and then define ASM at the end of this section. Let s_i and s_j be two subsets of features with cardinalities k_i and k_j respectively, obtained from a dataset containing n features. Let r be the cardinality of the intersection s_i ∩ s_j of the two subsets. We derive the expected value of r that would be obtained by chance alone for fixed k_i, k_j and n. Using combinatorics, the probability of obtaining an intersection of cardinality r where the feature subsets are of cardinality k_i and k_j is given by:

$$P(X = r \mid k_i, k_j, n) = \frac{\binom{n}{r}\binom{n-r}{k_i-r}\binom{n-k_i}{k_j-r}}{\binom{n}{k_i}\binom{n}{k_j}}. \qquad (5)$$

After algebraic manipulation, Equation 5 is rewritten as:

$$P(X = r \mid k_i, k_j, n) = \frac{\binom{k_i}{r}\binom{n-k_i}{k_j-r}}{\binom{n}{k_j}}. \qquad (6)$$

This expression defines the hypergeometric distribution of a random variable X with parameters n, k_i and k_j. Consider a set of n objects in which k_i are defective. The hypergeometric distribution describes the probability that exactly r objects are defective in a sample of k_j distinct objects drawn from the set. There are a total of $\binom{n}{k_j}$ possible ways to obtain a sample of k_j objects from n objects, there are $\binom{k_i}{r}$ ways to obtain r defective objects in a sample of size k_j, and there are $\binom{n-k_i}{k_j-r}$ ways to obtain the non-defective objects in the sample. The probability is positive when r is between max(0, k_i + k_j − n) and min(k_i, k_j), where the former expression gives the smallest possible value for r and the latter gives the largest. The expected value of r is given by:

$$E(X) = \frac{k_i \, k_j}{n}. \qquad (7)$$

Note that when k = k_i = k_j, the expected value for r as given by Expression 7 reduces to k²/n, which is used in Kuncheva’s similarity measure. We define the similarity for a pair of feature subsets as:

$$S_A(s_i, s_j) = \frac{r - \frac{k_i k_j}{n}}{\min(k_i, k_j) - \max(0, \, k_i + k_j - n)}. \qquad (8)$$

In the numerator, we subtract the expected value of r from the observed value of r, and in the denominator we normalize the score by the range of r. This measure has a range of (−1, 1]. A value of 0 is what would be expected from random feature selection; a positive value indicates that the feature selection method is more stable than random feature selection, and a negative value indicates that it is less stable than random feature selection. When s_i or s_j or both are empty sets, or when s_i or s_j or both contain every feature in the domain, Equation 8 is undefined, and we assign it a value of 0. Figure 1 shows the maximum values attained by S_A at different values of k_i and k_j for a total of 100 features. The maximum value varies with k_i and k_j; maximum values close to 1 are obtained when both k_i and k_j are either very small or very close to n, and maximum values close to 0 are obtained when one of k_i or k_j takes on a small value and the other takes on a value close to n. A similar plot is obtained for the minimum values attained by S_A at different values of k_i and k_j, except that it is shifted down along the vertical axis, as shown in Figure 2.

Figure 1. The maximum adjusted similarity S_A plotted for various cardinalities k_1 and k_2 of a pair of feature subsets.

Figure 2. The minimum adjusted similarity S_A plotted for various cardinalities k_1 and k_2 of a pair of feature subsets.
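
A minimal sketch of the adjusted similarity S_A, together with a small simulation that illustrates the chance-expected intersection k_i k_j / n of Equation 7, is given below. The code is our own illustration under the conventions stated above (a value of 0 when a subset is empty or contains all features); it is not taken from the paper.

```python
import random

def similarity_sa(s_i, s_j, n):
    """Adjusted similarity (Equation 8): observed intersection minus its chance
    expectation, normalized by the range of possible intersection sizes."""
    s_i, s_j = set(s_i), set(s_j)
    k_i, k_j = len(s_i), len(s_j)
    if k_i in (0, n) or k_j in (0, n):
        return 0.0  # Equation 8 is undefined in these cases; assume 0
    r = len(s_i & s_j)
    expected_r = k_i * k_j / n
    return (r - expected_r) / (min(k_i, k_j) - max(0, k_i + k_j - n))

# Empirical check of Equation 7: random subsets of sizes k_i and k_j drawn from
# n features share about k_i * k_j / n features on average.
n, k_i, k_j = 70, 10, 20
features = range(n)
mean_r = sum(
    len(set(random.sample(features, k_i)) & set(random.sample(features, k_j)))
    for _ in range(20000)
) / 20000
print(mean_r, k_i * k_j / n)  # both values are close to 2.86
```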

The ASM for c feature subsets is obtained by averaging the pairwise similarity for all pairs of feature subsets, and is given by:

$$ASM = \frac{2}{c(c-1)} \sum_{i=1}^{c-1} \sum_{j=i+1}^{c} S_A(s_i, s_j). \qquad (9)$$

The ASM (Equation 9) differs from the USM (Equation 2) in that the similarity measure S_A used in ASM is adjusted for the commonality of a pair of feature subsets that may be obtained purely by chance, while the similarity measure S_S used in USM is unadjusted for the effect of chance.
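
The ASM can then be sketched in direct parallel with the USM sketch above; this snippet is our own illustration and reuses the similarity_sa and usm functions defined in the earlier code.

```python
from itertools import combinations

def asm(subsets, n):
    """Adjusted stability measure (Equation 9): average pairwise adjusted similarity."""
    pairs = list(combinations(subsets, 2))
    return sum(similarity_sa(a, b, n) for a, b in pairs) / len(pairs)

subsets = [{"f1", "f2", "f3"}, {"f1", "f2"}, {"f1", "f3", "f7"}]
print(usm(subsets), asm(subsets, n=70))  # unadjusted vs. chance-adjusted stability
```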

Application to a Biomedical Dataset

We applied the adjusted and the unadjusted stability measures, ASM and USM, to a proteomic dataset obtained from the University of Pittsburgh Cancer Institute. This dataset contains 240 samples, a total of 70 features (protein probes) and a binary class variable (cancer vs. healthy). We evaluated three different classifier based feature selection methods: Support Vector Machines (SVM) with a linear kernel, which has been shown to perform well on high-dimensional biomedical datasets7, Logistic Regression (LR) and Naïve Bayes (NB). For the experiments, we utilized the classifiers implemented in the Waikato Environment for Knowledge Analysis (WEKA, version 3.6) set to their default parameter values.

For each classifier, we performed greedy forward search for feature selection, starting with an empty set of features and adding, in each iteration, the single feature that most improved the overall classifier performance. The search was terminated when none of the remaining candidate features was able to improve the accuracy of the classifier.
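
A hedged sketch of this wrapper procedure is shown below. It uses scikit-learn rather than WEKA, so it only approximates the paper's setup; the classifier factory, the inner cross-validation used to score candidate features, and the majority-class baseline are our own illustrative choices.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_forward_selection(X, y, make_classifier, cv=5):
    """Greedy forward wrapper search: repeatedly add the single feature that most
    improves cross-validated accuracy; stop when no remaining feature improves it.
    X is a samples-by-features NumPy array, y the class labels."""
    y_list = list(y)
    best_score = max(y_list.count(c) for c in set(y_list)) / len(y_list)  # majority-class baseline
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {
            f: cross_val_score(make_classifier(), X[:, selected + [f]], y,
                               cv=cv, scoring="accuracy").mean()
            for f in remaining
        }
        best_feature, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:  # no candidate improves accuracy -> terminate
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = score
    return selected, best_score

# Usage (illustrative):
# selected, acc = greedy_forward_selection(X, y, lambda: LogisticRegression(max_iter=1000))
```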

We performed 10 runs of 10-fold stratified cross-validation, and computed the mean stability measures and the mean classifier accuracies from a total of 100 folds. While bootstrap sampling simulates the sampling distribution underlying the data better, we opted to perform N-fold stratified cross-validation since it is more commonly used for evaluating classifier performance8.
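
One plausible way to tie the pieces together, assuming the functions from the earlier sketches are available, is to run the feature selection on every training fold of repeated stratified cross-validation and score the resulting collection of subsets; scikit-learn's StratifiedKFold stands in here for WEKA's stratified folds.

```python
from sklearn.model_selection import StratifiedKFold

def stability_over_cv(X, y, make_classifier, n_repeats=10, n_splits=10, seed=0):
    """Collect the feature subset selected on each training fold of repeated
    stratified cross-validation and report USM and ASM over all subsets."""
    subsets = []
    for repeat in range(n_repeats):
        folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + repeat)
        for train_idx, _ in folds.split(X, y):
            selected, _ = greedy_forward_selection(X[train_idx], y[train_idx], make_classifier)
            subsets.append(set(selected))
    n_features = X.shape[1]
    return usm(subsets), asm(subsets, n_features)
```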

Results

Figure 3 compares the adjusted and the unadjusted stability measures, ASM and USM, for the three classifier based feature selection methods, namely, SVM, LR and NB. Based on the USM, NB had better stability than both SVM and LR. However, the ASM shows that when adjusted for chance NB has worse stability than the other two classifiers. This indicates that the good performance of NB on the unadjusted measure is mainly due to its larger feature subsets (average of 54 features) as compared to LR (average of 32 features) and SVM (average of 38 features). On examination of the feature subsets, NB indeed generated subsets of widely varying cardinalities, ranging from 0 to 70. Figure 3 also shows that the effect of the correction incorporated by ASM is larger for larger feature subsets, as seen in the difference between ASM and USM for NB, while at smaller feature subset sizes the two stability measures generate similar values, as can be seen in the case of SVM and LR.

Figure 3. Comparison of ASM vs. USM for the three classifier based feature selection methods.

The average accuracies of the three classifiers learned from the optimal subset of features selected by each classifier based feature selection method are reported in Table 1. Overall, SVM had the highest average accuracy and NB had the lowest, with LR in between. Stability and accuracy are not necessarily related, though in our example SVM simultaneously attains the highest accuracy and stability.

Table 1.

Average accuracy with standard error of the mean (SEM) obtained from 10 × 10-fold cross-validation for the three classifiers. The classifiers were learned from the best feature subset chosen by the wrapper method.

Classifier       LR        SVM       NB
Avg. Accuracy    76.84%    87.24%    73.18%
SEM              3.68      2.04      2.89

Discussion

Typically, classifier based feature selection methods are evaluated and compared on the quality of the subset of features selected, such as the ability of the feature subset to predict a target variable of interest. However, an additional measure to evaluate the stability of feature selection methods is needed, especially in tasks such as biomarker discovery. Such a measure should have a low value when there are large variations in the sizes of the feature subsets; lower stability values may indicate either that the feature selection method is not robust or that the data contains many correlated, redundant or noisy features. Of note, measures that evaluate the quality of the selected features are unrelated to measures of stability. Specifically, a stability measure provides no information on the quality of the selected features, and a quality measure such as classifier accuracy provides no information on the stability of the selected features. In the domain of biomarker discovery these measures provide complementary information to domain experts.

We have developed a stability measure called the ASM that can be applied to feature selection and feature ranking methods. A positive value on this measure indicates that the feature selection method is more stable than random feature selection, and a negative value indicates that it is less stable than random feature selection. We demonstrated on an example dataset that, compared to an unadjusted stability measure, the ASM is more robust. Thus, the ASM has the advantage that it can be applied to directly compare methods that generate feature subsets of different cardinalities, while the unadjusted measures cannot be applied directly.

The ASM can be used as a secondary measure for selecting between classifier based and other feature selection methods that have similar performance on a quality measure such as classifier accuracy. Some attempts at combining stability and classifier performance have been described in the literature. For example, Davis et al.9 describe a weighted combination of stability and classification performance for selecting reliable gene signatures for microarray classification. Classification and feature selection methods that are able to jointly optimize stability and quality may be more useful than those that optimize only one or the other.

Future Work

In future work, we plan to evaluate the ASM extensively on several biomedical datasets to compare the stability of the commonly used feature selection methods. We also plan to develop a measure that combines ASM with a quality measure such as classification accuracy for evaluating feature subsets.

Acknowledgments

This research was funded by the National Library of Medicine (T15-LM007059) and the National Institute of General Medical Sciences (GM 071951). We thank Dr. William Bigbee, who heads the Molecular Biomarkers Group at the University of Pittsburgh Cancer Institute, for the lung cancer dataset.

References

1. Kalousis A, Prados J, Hilario M. Stability of feature selection algorithms: A study on high-dimensional spaces. Knowledge and Information Systems. 2007;12(1):95–116.
2. Jordan R, Patel S, Hu H, et al. Efficiency analysis of competing tests for finding differentially expressed genes in lung adenocarcinoma. Cancer Informatics. 2008;6:389–421.
3. Dunne K, Cunningham P, Azuaje F. Solutions to instability problems with sequential wrapper-based approaches to feature selection. 2002.
4. Kuncheva LI. A stability index for feature selection. In: Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications. Innsbruck, Austria: ACTA Press; 2007.
5. Lustgarten JL, Visweswaran S, Grover H, et al. An evaluation of discretization methods for learning rules from biomedical datasets. In: Proceedings of the International Conference on Bioinformatics and Computational Biology (BIOCOMP-08); Las Vegas, NV; 2008.
6. Blum A, Langley P. Selection of relevant features and examples in machine learning. Artificial Intelligence. 1997;97(1–2):245–271.
7. Hauskrecht M, Pelikan R, Malehorn DE, et al. Feature selection for classification of SELDI-TOF-MS proteomic profiles. Applied Bioinformatics. 2005;4(4):227–246.
8. Kim J-H. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational Statistics & Data Analysis. 2009;53(11):3735–3745.
9. Davis CA, Gerick F, Hintermair V, et al. Reliable gene signatures for microarray classification: Assessment of stability and performance. Bioinformatics. 2006;22(19):2356–2363.
