A Parametric k-Means Algorithm

Wright State University, Department of Mathematics and Statistics, Dayton, Ohio.

The publisher's final edited version of this article is available at Comput Stat.

Summary

The k points that optimally represent a distribution (usually in terms of a squared error loss) are called the k principal points. This paper presents a computationally intensive method that automatically determines the principal points of a parametric distribution. Cluster means from the k-means algorithm are nonparametric estimators of principal points. A parametric k-means approach is introduced for estimating principal points by running the k-means algorithm on a very large simulated data set from a distribution whose parameters are estimated using maximum likelihood. Theoretical and simulation results are presented comparing the parametric k-means algorithm to the usual k-means algorithm, and an example on determining sizes of gas masks is used to illustrate the parametric k-means algorithm.

Keywords: Cluster analysis, finite mixture models, principal component analysis, principal points

1 Introduction

One of the classic statistical problems is to find a set of points that optimally represents a distribution or to determine an optimal partition of a distribution. Applications related to this problem include: optimal grouping (Cox 1957, Connor 1972), optimal stratification (Dalenius 1950, Dalenius and Gurney 1951), signal processing and quantization (e.g. see the March 1982 issue of the IEEE Transactions on Information Theory, which is devoted to the subject), optimal sizing of clothing and equipment (Fang and He 1982, Flury 1990, 1993), selective assembly and optimal binning (Mease et al. 2004), and representative response profiles in clinical trials (Tarpey et al. 2003).
The problem of determining and estimating an optimal representation of a distribution by a set of points has been studied by many authors (Eubank 1988, Gu and Mathew 2001, Flury and Tarpey 1993, Iyengar and Solomon 1983, Li and Flury 1995, Graf and Luschgy 2000, Luschgy and Pagés 2002, Pötzelberger and Felsenstein 1994, Rowe 1996, Stampfer and Stadlober 2002, Su 1997, Tarpey 1994, 1995, 1997, 1998, Tarpey et al. 1995, Yamamoto and Shinozaki 2000a,b, Zoppé 1995, 1997). This paper presents a very simple but computer-intensive approach to solving this problem based on the k-means clustering algorithm (e.g. MacQueen 1967, Hartigan 1975, Hartigan and Wong 1979).

The single point that best approximates the distribution of a random variable X in terms of mean squared error is the mean μ: E||X − μ||² ≤ E||X − m||² for any m. The framework for determining a set of points that optimally represents a distribution in terms of mean squared error is to generalize the mean from one point to several points as follows. Let X denote a p-dimensional random vector. For a given set of k points {y1, y2, …, yk} with yj ∈ ℜ^p, denote the set of points in ℜ^p closer to yj than to the other yi as Dj = {x ∈ ℜ^p : ||x − yj||² ≤ ||x − yi||², i ≠ j}. Define a k-point approximation Y to X as

$$Y = \sum_{j=1}^{k} y_j \, I\{X \in D_j\},$$

where I{·} denotes the indicator function.
The k points are called self-consistent points, or equivalently, Y is called self-consistent for X, if E[X|Y] = Y (Flury 1993, Tarpey and Flury 1996). The mean is the “center-of-gravity” of a distribution, and k self-consistent points represent a k-point generalization of the center-of-gravity from one to many points because each self-consistent point yj is the conditional mean of X over Dj. If Y is the optimal k-point approximation to X in terms of mean squared error (i.e., E||X − Y||² ≤ E||X − Y′||² for any other k-point approximation Y′ to X), then the points yj in the support of Y are called the k principal points of X (Flury 1990). Flury (1990) showed that a set of principal points must be self-consistent points. Thus, the set of principal points for X can be determined by finding the optimal set of k self-consistent points. This definition of principal points is given in terms of a squared error loss, but other loss functions could be considered as well.

In applications of designing clothing or equipment, a single size may be based on the mean of the distribution, but multiple sizes (e.g. small, medium and large) can be based on the principal points of the distribution. Section 6 provides an example of determining optimal sizes and shapes of gas masks. In functional data analysis applications (Ramsay and Silverman 1997), when the data consist of curves, principal point methodology can be used to determine a small set of curves that represent the primary modes of variation (e.g. see Flury and Tarpey 1993). For instance, using principal points to estimate a set of representative longitudinal response curves from a clinical trial can be used to describe various patient types such as non-responders, drug responders, placebo responders, and drug/placebo responders (Tarpey et al. 2003).

In signal processing and digital communication the term quantization is used when a signal is represented by a finite set of values.
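The self-consistency condition E[X|Y] = Y can be checked numerically. The following is a minimal sketch, not taken from the paper, that uses only the Python standard library: it takes the known k = 2 principal points of N(0, 1), ±√(2/π), and verifies by Monte Carlo that each point is close to the conditional mean of X over its region Dj. The function names (`nearest_index`, `conditional_means`, `approximation_mse`) and the simulation size are illustrative assumptions.

```python
import math
import random


def nearest_index(x, points):
    """Index j of the point y_j closest to x, i.e. the region D_j containing x."""
    return min(range(len(points)), key=lambda j: (x - points[j]) ** 2)


def conditional_means(sample, points):
    """Monte Carlo estimate of E[X | X in D_j] for each region D_j."""
    sums = [0.0] * len(points)
    counts = [0] * len(points)
    for x in sample:
        j = nearest_index(x, points)
        sums[j] += x
        counts[j] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


def approximation_mse(sample, points):
    """Monte Carlo estimate of E||X - Y||^2 for the k-point approximation Y."""
    total = sum((x - points[nearest_index(x, points)]) ** 2 for x in sample)
    return total / len(sample)


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(50_000)]

c = math.sqrt(2.0 / math.pi)   # the k = 2 principal points of N(0, 1) are -c and +c
candidate = [-c, c]
cond = conditional_means(sample, candidate)

print("candidate points :", candidate)
print("conditional means:", cond)  # close to the candidates when they are self-consistent
print("MSE with k = 2   :", approximation_mse(sample, candidate))
print("MSE with the mean:", approximation_mse(sample, [0.0]))
```

The two-point approximation should show a clearly smaller mean squared error than the single point at the mean, which is the sense in which principal points generalize the mean.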
The solution to finding the set of values that minimizes the loss of information due to quantization is mathematically equivalent to determining principal points.

When applied to data, the k-means algorithm converges to a set of k self-consistent points for the empirical distribution. The k-means algorithm seeks a partition of the data that minimizes the within-cluster sum of squares, and hence cluster means from the k-means algorithm provide nonparametric estimators for the principal points of the distribution. More efficient methods of estimating principal points, in terms of a lower mean squared error, are available if certain distributional assumptions hold (e.g. Stampfer and Stadlober 2002, Tarpey 1997). Defining parametric estimators of principal points often requires knowledge of the principal points of theoretical distributions. However, analytically determining principal points of theoretical distributions is extremely difficult, particularly for multivariate distributions and mixture models.

In the next section, we define a parametric k-means algorithm that produces maximum likelihood estimators of principal points automatically without requiring knowledge of the principal points of the underlying population. The asymptotic performance of the parametric k-means algorithm is provided in Section 3, and simulation results comparing the parametric k-means algorithm to the usual nonparametric k-means algorithm are provided in Section 4. The performance of the parametric k-means algorithm is examined for finite mixture distributions in Section 5. The method is illustrated on a problem of fitting gas masks in Section 6, and the paper is concluded in Section 7.

2 The Parametric k-Means Algorithm

The idea behind the parametric k-means algorithm is very simple. The goal is to estimate the k principal points of a distribution based on a sample x1, …, xn from the distribution. One approach is simply to run the k-means algorithm on the raw data.
If a larger sample size were available, then the principal point estimators would be more stable. The idea of the parametric k-means algorithm is to run the k-means algorithm, not on the raw data, but on a simulated data set with a very large sample size. The key is to simulate data from a distribution that is parametrically estimated. The idea behind the parametric k-means algorithm is similar in spirit to the Monte Carlo EM algorithm (Wei and Tanner 1990), where the (typically intractable) analytical computation of the E-step in the EM algorithm is replaced by an average obtained from a simulated data set.

The following is a description of the parametric k-means algorithm. Let x1, …, xn denote a sample from a population with distribution F(·; θ), where the parameter θ can be one-dimensional or a vector.

1. Compute the maximum likelihood estimate θ̂ of θ from the sample.
2. Simulate a very large sample of size ns from the distribution F(·; θ̂).
3. Run the k-means algorithm on the simulated sample.
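The parametric k-means algorithm (estimate the parameters, simulate a very large sample from the fitted distribution, run k-means on the simulated sample) can be sketched for a univariate normal model as follows. This is an illustrative sketch under stated assumptions rather than the author's implementation: it uses only the Python standard library, a simple Lloyd-type k-means for one-dimensional data in place of the Hartigan and Wong routine, and hypothetical names (`kmeans_1d`, `parametric_kmeans_normal`) and defaults.

```python
import math
import random
import statistics
from bisect import bisect_right
from itertools import accumulate


def kmeans_1d(data, k, rng, max_iter=1000):
    """Lloyd-type k-means for 1-D data; returns (sorted centers, within-cluster SS)."""
    xs = sorted(data)
    s1 = [0.0] + list(accumulate(xs))                  # prefix sums of x
    s2 = [0.0] + list(accumulate(x * x for x in xs))   # prefix sums of x^2
    centers = sorted(rng.sample(xs, k))                # random start from the data
    for _ in range(max_iter):
        # In 1-D each cluster is an interval bounded by midpoints of adjacent centers.
        mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        cuts = [0] + [bisect_right(xs, m) for m in mids] + [len(xs)]
        new = [(s1[hi] - s1[lo]) / (hi - lo) if hi > lo else c
               for lo, hi, c in zip(cuts, cuts[1:], centers)]
        if new == centers:                             # partition no longer changes
            break
        centers = new
    wss = sum((s2[hi] - s2[lo]) - (s1[hi] - s1[lo]) ** 2 / (hi - lo)
              for lo, hi in zip(cuts, cuts[1:]) if hi > lo)
    return centers, wss


def parametric_kmeans_normal(data, k, n_sim=100_000, n_starts=5, seed=0):
    """Parametric k-means assuming a univariate normal model."""
    rng = random.Random(seed)
    # Step 1: maximum likelihood estimates of the normal parameters.
    mu_hat = statistics.fmean(data)
    sigma_hat = statistics.pstdev(data)                # ML estimate divides by n
    # Step 2: simulate a very large sample from the fitted distribution.
    simulated = [rng.gauss(mu_hat, sigma_hat) for _ in range(n_sim)]
    # Step 3: run k-means on the simulated sample, keeping the best of several starts.
    runs = (kmeans_1d(simulated, k, rng) for _ in range(n_starts))
    return min(runs, key=lambda run: run[1])[0]


data_rng = random.Random(1)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]  # small observed sample
estimate = parametric_kmeans_normal(data, k=2)
print("parametric k-means estimate:", estimate)
```

For k = 2 under normality the estimate can be compared with the sample mean plus or minus √(2/π) times the estimated standard deviation; for other models only the estimation and simulation steps would change.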
The cluster means from step (3) are then used as estimators of the principal points of the underlying distribution. High speed computing is readily available and thus it is very easy to implement the parametric k-means algorithm. Several of the references in Section 1 deal with the problem of determining principal points for theoretical distributions. Steps 2 and 3 of the parametric k-means algorithm provide a solution to this problem. As with the usual nonparametric k-means algorithm, the parametric k-means algorithm may converge to a local instead of a global optimum solution. Thus, it is generally a good idea to run the k-means algorithm on the simulated data many times with different initial values when searching for the globally optimal solution (i.e. the principal points). Hand and Krzanowski (2005) have proposed an iterative refinement method based on simulated annealing that generally offers an improvement over a “best of 20 random starts” approach. It should be pointed out that these methods do not guarantee that the globally optimal solution will be found. Note that implementing the parametric k-means algorithm requires that the user specify the distribution F(·; θ). Exploratory data analysis and goodness-of-fit tests can be used to determine a reasonable distribution to use for the parametric k-means algorithm. In order to use the k-means algorithm, the number k of cluster means needs to be specified. In many clustering applications, the number of clusters may be well-defined (e.g. male/female clusters, different species of animals). However, for continuous distributions, there exists a set of k principal points for all positive integers k. There is no right or wrong value for k. Instead, the appropriate choice for k in principal point applications depends on the particular application and needs to be determined by the investigator. 
The choice of k often depends on economic factors as well as the desired degree to which the k principal points approximate the continuous distribution. For instance, when manufacturing clothing or equipment (pants, shirts, gloves, helmets, goggles etc.), k corresponds to the number of sizes to produce and can vary from k = 1 (one size fits all) to k → ∞ (tailor an outfit for each individual). In these types of applications, a balance must be struck between the extra cost of producing many different sizes and making sure enough size choices exist to guarantee a good fit for everyone. In other applications such as optimal stratification (Dalenius 1950, Dalenius and Gurney 1951) or optimal grouping for testing trends in categorical data (Connor 1972), the value of k may be chosen to achieve a desired efficiency relative to estimators that do not use grouping.

One could ask why, if the data are sampled from a known parametric family of distributions, the principal points of the distribution are not simply computed directly without using the k-means algorithm. As noted above, analytical determination of principal points is usually very difficult, often requiring numerical integration over complicated high dimensional regions (e.g. Tarpey 1998). The parametric k-means algorithm, on the other hand, produces the results automatically by allowing the computer to do all the work.

3 Asymptotics

Suppose a random sample of size n is obtained from a distribution F(·; θ). Let θ̂ denote an asymptotically normal estimator of θ from the sample:

$$\sqrt{n}\,(\hat{\theta} - \theta) \xrightarrow{d} N(0, \Psi), \qquad (2)$$
where Ψ is the covariance matrix. Let ξ(θ) denote the k principal points for the distribution F(·; θ) and let ξn(θ) denote the principal points of the empirical distribution that can be obtained by running the k-means algorithm on the sample data. Pollard (1981) proved strong consistency of the k-means algorithm estimators and showed in Pollard (1982) that the k-means algorithm estimators are asymptotically normal provided certain regularity conditions are satisfied (such as finite second moments, a continuous density, a unique set of k principal points and a couple of other conditions typically satisfied by most common distributions). Let ns denote the simulation sample size for the parametric k-means algorithm with ns ≫ n. A sample of size ns is simulated from the distribution F(·; θ̂) and the k-means algorithm is applied to this simulated data, yielding cluster means denoted by ξns(θ̂). By the strong consistency and asymptotic normality results for k-means clustering, we can write

$$\xi_{n_s}(\hat{\theta}) = \xi(\hat{\theta}) + \varepsilon_{n_s}, \qquad (3)$$

where $\varepsilon_{n_s} = O_p(n_s^{-1/2})$.

Suppose that ξ(θ) is a continuously differentiable function of θ. Then using a Taylor series expansion of ξ(θ̂) about θ, one can write

$$\xi(\hat{\theta}) = \xi(\theta) + H(\hat{\theta} - \theta) + o_p(n^{-1/2}), \qquad (4)$$

where H is the matrix of partial derivatives of ξ(θ) with respect to the parameters in θ. Combining (3) and (4) gives

$$\sqrt{n}\,\bigl(\xi_{n_s}(\hat{\theta}) - \xi(\theta)\bigr) = H\sqrt{n}\,(\hat{\theta} - \theta) + o_p(1),$$

provided ns > n². Therefore, from (2), the parametric k-means estimators will be asymptotically normal:

$$\sqrt{n}\,\bigl(\xi_{n_s}(\hat{\theta}) - \xi(\theta)\bigr) \xrightarrow{d} N(0, H \Psi H').$$

If θ̂ is the maximum likelihood estimator of θ, then we have just shown that the parametric k-means estimators are asymptotically equivalent to maximum likelihood estimators of principal points.

The following simple example illustrates the results. The k = 2 principal points of a N(μ, σ²) distribution are

$$\xi(\mu, \sigma^2) = \bigl(\mu - \sigma\sqrt{2/\pi},\; \mu + \sigma\sqrt{2/\pi}\bigr)'.$$

Letting x̄ and s² denote the sample mean and variance of the original data set, we have

$$\sqrt{n}\,\bigl((\bar{x}, s^2)' - (\mu, \sigma^2)'\bigr) \xrightarrow{d} N\bigl(0, \operatorname{diag}(\sigma^2, 2\sigma^4)\bigr).$$

Applying the result above with θ̂ = (x̄, s²), it follows that the k = 2 parametric k-means estimators for the normal distribution ξns(x̄, s²) are asymptotically normal with mean ξ(μ, σ²) and covariance matrix

$$\frac{\sigma^2}{n}\begin{pmatrix} 1 + 1/\pi & 1 - 1/\pi \\ 1 - 1/\pi & 1 + 1/\pi \end{pmatrix}.$$
In this simple example, the principal points ξ(θ) are known and maximum likelihood estimators can be computed using ξ(θ̂) by invoking the invariance principle. However, in most practical situations, the function ξ(θ) will be unknown. In fact, in only a few relatively simple cases (small k and low dimension) have the principal points ξ(θ) been determined, and these often require iterative searches and/or numerical integration (see the references in Section 1). If one knows the distribution F(·; θ), then the parametric k-means algorithm allows us to avoid these difficulties.

4 Simulation Results

In this section, the k-means algorithm (Hartigan and Wong 1979) applied to the raw data will be referred to as the nonparametric k-means algorithm in order to distinguish it from the parametric k-means algorithm described in Section 2. This section presents simulation results comparing the nonparametric and parametric k-means algorithms in a variety of situations. The simulation results presented here were obtained using the R software (R Development Core Team 2003). The actual principal points for the distribution were determined from known results or from extensive simulations whereby a very large sample (usually of size ns = 500,000 or 1,000,000) is simulated from the given distribution and the k-means algorithm is then applied to the simulated sample several times (usually 20 to 25) with random initial seeds. By the strong convergence of k-means clustering (Pollard 1981), the difference between the values obtained for the principal points and the true principal points is of the order $O_p(n_s^{-1/2})$.
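The nonparametric and parametric k-means algorithms will be compared in terms of a mean squared error (MSE) between the estimated principal points and the actual principal points:

$$\mathrm{MSE} = E\Bigl[\sum_{j=1}^{k} \lVert \hat{\xi}_j - \xi_j \rVert^2\Bigr], \qquad (7)$$

where ξj and ξ̂j, j = 1, …, k, are the k principal points and the k estimated principal points respectively of the underlying distribution. To compute the MSE, the expectation in (7) is estimated by averaging the summed squared distances between the estimated and actual principal points over the simulated data sets.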
As an illustrative first example, two principal points were estimated from a N(0, 1) distribution using the usual k-means algorithm and the parametric k-means algorithm. The two principal points of N(0, 1) are ±√(2/π) ≈ ±0.7979. For each simulated data set, the parametric k-means algorithm was implemented by (i) computing the sample mean x̄ and standard deviation s from the simulated data set, (ii) simulating 100,000 random variates from a N(x̄, s²) distribution, and (iii) running the k-means algorithm on this larger simulated data set. The results are shown in Figure 1.
The next illustration is again for the standard normal distribution, except this time k = 5 principal points will be estimated. The results are shown in Figure 2. For the parametric k-means algorithm, only the mean x̄ and standard deviation s need to be estimated regardless of the number k of cluster means specified. Thus, as the number k of principal points increases, the nonparametric k-means algorithm will deteriorate in terms of efficiency compared to the parametric k-means algorithm.
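The kind of comparison described in this section can be sketched as a small Monte Carlo experiment for k = 2 principal points of N(0, 1). This is an illustrative sketch, not the R code used for the paper: it relies only on the Python standard library, uses a simple Lloyd-type one-dimensional k-means with a single random start, and the names (`kmeans_1d`, `compare_mse`), the sample size n = 50, the simulation size ns = 5,000 and the number of replicates are assumptions chosen so the example runs quickly.

```python
import math
import random
import statistics
from bisect import bisect_right
from itertools import accumulate


def kmeans_1d(data, k, rng, max_iter=1000):
    """Lloyd-type k-means for 1-D data; returns the sorted cluster means."""
    xs = sorted(data)
    s1 = [0.0] + list(accumulate(xs))                  # prefix sums of x
    centers = sorted(rng.sample(xs, k))                # random start from the data
    for _ in range(max_iter):
        # Clusters are intervals bounded by midpoints of adjacent centers.
        mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        cuts = [0] + [bisect_right(xs, m) for m in mids] + [len(xs)]
        new = [(s1[hi] - s1[lo]) / (hi - lo) if hi > lo else c
               for lo, hi, c in zip(cuts, cuts[1:], centers)]
        if new == centers:
            break
        centers = new
    return centers


def compare_mse(n=50, n_sim=5_000, replicates=200, seed=0):
    """Estimate the MSE of nonparametric and parametric k-means for N(0, 1), k = 2."""
    rng = random.Random(seed)
    c = math.sqrt(2.0 / math.pi)
    truth = [-c, c]                                    # known principal points
    sse_nonpar = sse_par = 0.0
    for _ in range(replicates):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Nonparametric estimator: k-means on the raw data.
        nonpar = kmeans_1d(sample, 2, rng)
        # Parametric estimator: k-means on data simulated from the fitted normal.
        m, s = statistics.fmean(sample), statistics.pstdev(sample)
        simulated = [rng.gauss(m, s) for _ in range(n_sim)]
        par = kmeans_1d(simulated, 2, rng)
        sse_nonpar += sum((a - b) ** 2 for a, b in zip(nonpar, truth))
        sse_par += sum((a - b) ** 2 for a, b in zip(par, truth))
    return sse_nonpar / replicates, sse_par / replicates


mse_nonpar, mse_par = compare_mse()
print("n x MSE, nonparametric k-means:", 50 * mse_nonpar)
print("n x MSE, parametric k-means   :", 50 * mse_par)
```

When the normal model is correct, the parametric estimate should show the smaller MSE, in line with the efficiency argument of Section 3.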
In order to illustrate the performance of the parametric k-means algorithm versus the nonparametric k-means algorithm for multivariate data, k = 2 principal points were estimated for a bivariate normal distribution with mean zero and a diagonal covariance matrix diag(σ², 1). The two principal points for this distribution lie along the first principal component axis (Tarpey et al. 1995) and, when σ² > 1 so that the first coordinate axis is the first principal component axis, are given by

$$\bigl(\pm\,\sigma\sqrt{2/\pi},\; 0\bigr)'.$$

Letting x̄ and S denote the sample mean and covariance matrix, the parametric k-means algorithm is run by simulating data from a N(x̄, S) distribution.

Because the two principal points must lie along the first principal component axis, we can modify the parametric k-means algorithm by constraining the cluster means to lie along the first sample principal component axis. The constrained principal point estimators are given by x̄ + β̂1 ŷj for j = 1, …, k, where x̄ is the sample mean, β̂1 is the eigenvector of the sample covariance matrix associated with the largest eigenvalue, and ŷ1, …, ŷk are cluster means from the parametric k-means algorithm applied to the first principal component scores. The results are shown in Figure 3.
The performance of the parametric k-means algorithm depends on the validity of the parametric assumptions. For instance, suppose the parametric k-means algorithm is implemented assuming the data are from a bivariate normal distribution as above when in fact the true distribution is a bivariate t (Fang et al. 1990, page 85). A small-scale simulation was performed to evaluate the performance of the misspecified parametric k-means algorithm in this situation for sample sizes ranging from 50 to 200, degrees of freedom ranging from 5 to 50, and for k = 2 and 5 principal points. In each case the parametric k-means algorithm based on the erroneous normality assumption performed much better than the nonparametric k-means algorithm, with n × MSE being about 2 to 3 times greater for the nonparametric k-means algorithm than for the parametric k-means algorithm, even for low degrees of freedom.

Another non-normal simulation was run using the chi-square distribution. Here k = 2 principal points were estimated for samples of sizes n = 25, 50, 75 and 100 using the nonparametric and parametric k-means algorithms. For the parametric k-means algorithm, a random sample of 100,000 was simulated from the correctly specified chi-square distribution with degrees of freedom equal to the sample mean. A parametric k-means algorithm was also run by misspecifying a normal distribution even though the true underlying distribution is chi-square. The simulation results are shown in Figure 4.
5 Parametric k-Means Applied to Finite Mixtures

This section illustrates the parametric k-means algorithm in the setting of finite mixture distributions (see e.g. McLachlan and Krishnan 1997, Titterington et al. 1985), where closed form expressions do not typically exist for parameter estimates. Very little is known about principal points for mixture distributions. Yamamoto and Shinozaki (2000b) studied two principal points for the specialized case of a mixture of two spherically symmetric distributions. Determining principal points analytically for a mixture distribution is very difficult because of the wide variety of ways mixture distributions can be parameterized in terms of number of mixture components, mixing proportions, means and covariance structures of the mixture components.

In order to apply the parametric k-means algorithm, maximum likelihood estimates of the parameters of the mixture distribution are obtained via the EM algorithm (Dempster et al. 1977). Next, a very large sample is simulated from a mixture distribution with parameters equal to the maximum likelihood estimates. In practice, one needs to specify the number of mixture components in order to run the EM algorithm. There have been numerous studies on determining the number of “groups” in a data set. A promising and simple approach is proposed by Sugar and James (2003), who also provide a review of many other methods of determining the number of groups in a data set.

In principal point applications involving finite mixtures, the number of principal points required will often differ from the number of mixture components. For example, suppose helmets are to be used for men and women and k different sizes need to be determined. The full population consists of two mixture components (males and females), but more than two sizes may be needed. In these types of applications, the number of mixture components is known and does not need to be determined.
In fact, if the data identify who is male and who is female, then the EM algorithm is not needed to estimate the parameters of the two mixture components. On the other hand, consider a clinical trial where quadratic curves are used to model longitudinal responses and the shape of the curve is clinically meaningful. Suppose the population is a mixture of two components: those who do and do not experience a placebo response. Membership in the two mixture components is not directly observed. Even if the distribution within each mixture component is homogeneous (e.g. normal), there will often be more than one representative response curve for individual components. For instance, if the degree of the responses (weak to strong) and the timing of the responses (immediate to delayed) vary according to a normal distribution, then the resulting response curves can take a variety of different shapes (e.g. see Tarpey et al. 2003).

Our first illustration is to estimate k = 4 principal points of a univariate normal mixture consisting of two components:

$$f(x) = \pi_1 f_1(x) + \pi_2 f_2(x),$$

where the prior probabilities are both set equal to a half, π1 = π2 = 1/2, and f1 and f2 are univariate normal densities with unit variance and means μ1 and μ2 respectively. The results are shown in Figure 5. The EM algorithm was used to estimate the mixture parameters, including the prior probabilities π̂1 and π̂2, and the parametric simulation sample size (ns = 100,000) was split in two according to these estimated prior probabilities.
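The simulation step for a two-component mixture can be sketched as follows. This is a minimal illustration using only the Python standard library, not the paper's code: the EM step is not implemented, so the parameter values standing in for EM output (`pi1_hat`, `mu1_hat`, `mu2_hat`) are hypothetical, as are the function names and the choice of 10 random starts.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def kmeans_1d(data, k, rng, max_iter=1000):
    """Lloyd-type k-means for 1-D data; returns (sorted centers, within-cluster SS)."""
    xs = sorted(data)
    s1 = [0.0] + list(accumulate(xs))                  # prefix sums of x
    s2 = [0.0] + list(accumulate(x * x for x in xs))   # prefix sums of x^2
    centers = sorted(rng.sample(xs, k))
    for _ in range(max_iter):
        # Clusters are intervals bounded by midpoints of adjacent centers.
        mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        cuts = [0] + [bisect_right(xs, m) for m in mids] + [len(xs)]
        new = [(s1[hi] - s1[lo]) / (hi - lo) if hi > lo else c
               for lo, hi, c in zip(cuts, cuts[1:], centers)]
        if new == centers:
            break
        centers = new
    wss = sum((s2[hi] - s2[lo]) - (s1[hi] - s1[lo]) ** 2 / (hi - lo)
              for lo, hi in zip(cuts, cuts[1:]) if hi > lo)
    return centers, wss


def simulate_two_component_mixture(pi1, mu1, mu2, n_sim, rng):
    """Split n_sim between two unit-variance normal components according to pi1."""
    n1 = round(pi1 * n_sim)
    n2 = n_sim - n1
    return ([rng.gauss(mu1, 1.0) for _ in range(n1)]
            + [rng.gauss(mu2, 1.0) for _ in range(n2)])


rng = random.Random(0)
# Hypothetical values standing in for EM estimates of the mixture parameters.
pi1_hat, mu1_hat, mu2_hat = 0.48, 0.1, 2.9
simulated = simulate_two_component_mixture(pi1_hat, mu1_hat, mu2_hat, 100_000, rng)

# Several random starts, keeping the solution with the smallest within-cluster SS,
# because k-means may stop at a local optimum.
points, wss = min((kmeans_1d(simulated, 4, rng) for _ in range(10)),
                  key=lambda run: run[1])
print("estimated k = 4 principal points:", points)
```

Splitting the simulated sample deterministically by the estimated prior probabilities, rather than drawing component labels at random, follows the description in the text.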
Next, the nonparametric and parametric k-means algorithms are compared for a bivariate normal mixture consisting of two components. For this illustration, k = 4 principal points are estimated from a mixture of two bivariate normal distributions with equal prior probabilities. The first component is centered at the origin with covariance matrix diag(2, 1). The second component is centered at the point (7, 1) and its covariance matrix has eigenvalues 2 and 1, as for the first component, but the distribution has been rotated by π/3 radians. The results are shown in Figure 6.
6 Example

Flury (1990) coined the term principal points in the problem of determining optimal sizes and shapes of protection masks for men in the Swiss army. For example, estimating k = 3 principal points would be useful for determining a small, medium and large size mask. A total of p = 6 head dimension variables were measured on a sample of n = 200 men: minimal frontal breadth (MFB), breadth of angulus mandibulae (BAM), true facial height (TFH), length from glabella to apex nasi (LGAN), length from tragion to nasion (LTN), and length from tragion to gnathion (LTG) (Flury 1997, page 8). The Swiss head dimension data appear consistent with a multivariate normal distribution (Flury 1990), and thus the parametric k-means algorithm will be used assuming the distribution is multivariate normal. Without using the parametric k-means algorithm, it would be extremely difficult to determine k > 4 principal points of a p = 6 dimensional multivariate normal distribution.

In order to compare the performance of the nonparametric and parametric k-means algorithms, a leave-one-out prediction error was computed for different values of k. The leave-one-out prediction error was computed by leaving out a single data point and then estimating the k principal points using both the nonparametric and parametric k-means algorithms. The squared distance from the left-out point to the nearest estimated cluster mean was then computed. This was repeated for all observations and the average squared error, called the Prediction Mean Squared Error (PMSE), was computed. For each leave-one-out iteration, estimated cluster centers from the full data set were used as starting seeds for the k-means algorithm so that the problem of multiple stationary points was minimized. A total of 500,000 simulated observations were used in the parametric k-means algorithm. The results are summarized in Figure 8.
The computation time required to implement the parametric k-means algorithm for this example is modest. It took about 3 seconds to implement the parametric k-means algorithm using a simulated sample size of 500,000 in this example for k = 2 and about 15 seconds for k = 8 on a Pentium 1.79GHz machine. The time required for the leave-one-out PMSE computations can be computed by multiplying these times by the sample size n, in this case n = 200. More complicated models, like finite mixture models, would require additional time to get the initial parameter estimates for the parametric k-means simulation.

A Self-Consistency Test

For the Swiss head data we assumed the distribution is multivariate normal when implementing the parametric k-means algorithm. The validity of the normality assumption can be assessed by evaluating the self-consistency of the parametric k-means solution. From the definition of self-consistent points in Section 1 it follows that each estimated principal point from the parametric k-means algorithm should be approximately equal to the average of the original data points that are closest to the principal point. To illustrate, see Figure 9. Let ξ̂j denote an estimated principal point (j = 1, …, k) and let x̄j denote the average of the original Swiss head data points closest to ξ̂j. The x̄j are plotted by the large open circles in Figure 9. If the normality assumption is reasonable, then ξ̂j ≈ x̄j for j = 1, …, k. If the data deviate strongly from normality, then the parametric k-means estimators of principal points should fail to be self-consistent points for the data. In order to assess the self-consistency condition, we can form a test statistic T² that measures the discrepancy between the ξ̂j and the x̄j, j = 1, …, k.

An approximate test of significance can be performed by evaluating T² under the null hypothesis that the data are from a normal distribution by the following steps:

1. Simulate a data set of size n from N(x̄, S), where x̄ and S are the sample mean and covariance matrix of the original data.
2. Run the parametric k-means algorithm on the simulated data set and compute T².
3. Repeat steps 1 and 2 N times and estimate the p-value by the proportion of simulated T² values that are at least as large as the observed T².
This self-consistency test was run on the Swiss head data using k = 6. The p-value (estimated using N = 100) is p = 0.51, indicating that there is no evidence against self-consistency of the k = 6 principal point estimates from the parametric k-means algorithm solution for the Swiss head data. The small solid circles in Figure 9 show the corresponding x̄j for N = 100 data sets simulated from N(x̄, S). It is evident from Figure 9 that the discrepancy (between ξ̂j and x̄j) is consistent with the variability one would expect to see with actual normal data, since the large open circles fall within the cloud of points formed by the small solid circles. The self-consistency test was also run for other values of k = 2, …, 8 for the Swiss head data (using N = 100), yielding p-values p = 0.81, 0.83, 0.47, 0.35, 0.51, 0.63, and 0.70 respectively, indicating that there is no evidence against self-consistency of the principal point estimates from the parametric k-means algorithm.

R-code for implementing the parametric k-means algorithm for a multivariate normal distribution and the self-consistency test can be found at:

Head dimension data were also available for n = 59 females. Combining the male and female data would allow for determining sizes that could be used by both sexes. This is analogous to a mixture distribution, except individual data points were identified as male or female, so the EM algorithm was not needed to estimate parameters for males and females separately. The nonparametric and parametric k-means algorithms were compared again in terms of PMSE for the combined data, and both methods performed essentially the same. The reason the parametric k-means algorithm did not perform consistently better than the nonparametric k-means algorithm in this example is that the distribution of some of the variables for the females deviated from normality. This illustrates that the optimality of the parametric k-means algorithm depends on the validity of the parametric assumptions.
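The simulation-based self-consistency test described earlier in this section can be sketched for a univariate normal model. This is a rough illustration only, using the Python standard library, and it rests on two loudly stated assumptions: the exact form of the T² statistic is not given in the text above, so the statistic used here (cluster sizes times squared differences between each estimated principal point and the average of its nearest data points) is a hypothetical stand-in, and the univariate setting replaces the six-dimensional Swiss head data. All function names and defaults are illustrative.

```python
import random
import statistics
from bisect import bisect_right
from itertools import accumulate


def kmeans_1d(data, k, rng, max_iter=1000):
    """Lloyd-type k-means for 1-D data; returns the sorted cluster means."""
    xs = sorted(data)
    s1 = [0.0] + list(accumulate(xs))
    centers = sorted(rng.sample(xs, k))
    for _ in range(max_iter):
        mids = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        cuts = [0] + [bisect_right(xs, m) for m in mids] + [len(xs)]
        new = [(s1[hi] - s1[lo]) / (hi - lo) if hi > lo else c
               for lo, hi, c in zip(cuts, cuts[1:], centers)]
        if new == centers:
            break
        centers = new
    return centers


def parametric_points(data, k, rng, n_sim=10_000):
    """Parametric k-means under an assumed univariate normal model."""
    m, s = statistics.fmean(data), statistics.pstdev(data)
    return kmeans_1d([rng.gauss(m, s) for _ in range(n_sim)], k, rng)


def t2_statistic(data, points):
    """ASSUMED statistic: sum over j of n_j * (xi_hat_j - xbar_j)^2."""
    sums = [0.0] * len(points)
    counts = [0] * len(points)
    for x in data:
        j = min(range(len(points)), key=lambda i: (x - points[i]) ** 2)
        sums[j] += x
        counts[j] += 1
    return sum(c * (p - s / c) ** 2 for p, s, c in zip(points, sums, counts) if c)


def self_consistency_pvalue(data, k, n_boot=100, seed=0):
    """Approximate p-value from data sets simulated under the fitted normal model."""
    rng = random.Random(seed)
    observed = t2_statistic(data, parametric_points(data, k, rng))
    m, s = statistics.fmean(data), statistics.pstdev(data)
    exceed = 0
    for _ in range(n_boot):
        boot = [rng.gauss(m, s) for _ in range(len(data))]
        if t2_statistic(boot, parametric_points(boot, k, rng)) >= observed:
            exceed += 1
    return exceed / n_boot


data_rng = random.Random(2)
data = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
p_value = self_consistency_pvalue(data, k=3)
print("approximate p-value:", p_value)
```

A small p-value would suggest that the estimated principal points are not self-consistent for the data, which is the signal of a misspecified parametric model that the test is meant to detect.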
7 Discussion

Before powerful computing was readily available, determining optimal point representations or optimal partitions of theoretical distributions was severely limited. The computer-intensive parametric k-means algorithm illustrated in this paper provides an almost effortless method of determining and estimating principal points with maximum likelihood efficiency. The simulation results in Section 4 demonstrate that the parametric k-means algorithm can be considerably more efficient than the usual nonparametric k-means algorithm.

The performance of the parametric k-means algorithm depends on the validity of the parametric assumptions. Section 4 reported some preliminary results on the performance of the parametric k-means algorithm when the specified distribution (e.g. normal) differs from the true distribution. It would be useful to perform an in-depth investigation into the robustness of the parametric k-means algorithm in the presence of outliers and other deviations from parametric assumptions. It would also be useful to evaluate the performance of the parametric k-means algorithm for high dimensional data. However, for moderate values of k that tend to be used in practice, the space spanned by k principal points will often tend to be low dimensional (Tarpey et al. 1995).

The Swiss head dimension data were used to illustrate the parametric k-means algorithm in Section 6 since that was the original example that motivated the term principal points. However, the motivation for this current work was the problem of clustering functional data (Abraham et al. 2003, James and Sugar 2003, Luschgy and Pagés 2002, Tarpey and Kinateder 2003), where different cluster means can be used to identify representative curve shapes in the data. It is anticipated that the parametric k-means algorithm will be very useful in these types of applications due to the high dimensional nature of the data.
Acknowledgments

The author would like to thank the referees and the Editors for their constructive comments and suggestions for improving this article. This work was supported by NIMH grant R01 MH68401.