Linear Transformations and the k-Means Clustering Algorithm: Applications to Clustering Curves

* Thaddeus Tarpey is Professor, Department of Mathematics and Statistics, Wright State University, Dayton, Ohio.

Abstract

Functional data can be clustered by plugging estimated regression
coefficients from individual curves into the k-means algorithm.
Clustering results can differ depending on how the curves are fit to the data.
Estimating curves using different sets of basis functions corresponds to
different linear transformations of the data. k-means
clustering is not invariant to linear transformations of the data. The optimal
linear transformation for clustering will stretch the distribution so that the
primary direction of variability aligns with actual differences in the clusters.
It is shown that clustering the raw data will often give results similar to
clustering regression coefficients obtained using an orthogonal design matrix.
Clustering functional data using an L2 metric on
function space can be achieved by clustering a suitable linear transformation of
the regression coefficients. An example where depressed individuals are treated
with an antidepressant is used for illustration.

Keywords: Allometric extension, canonical discriminant analysis, orthogonal design matrix, principal component analysis

1 Introduction

Functional data applications, where each data point corresponds to a curve,
have come to play a prominent role in statistical practice (e.g. Ramsay and Silverman, 1997, 2002). The curves in a functional data set often have a
variety of distinctive shapes that can have important interpretations.
Representative curve shapes can be found by clustering the curves (e.g. Heckman and Zamar, 2000; Abraham et al., 2003; James and Sugar, 2003; Luschgy and Pagés, 2002; Tarpey and Kinateder, 2003). The k-means clustering
algorithm (e.g. Forgy, 1965; Hartigan and Wong, 1979; MacQueen, 1967) has been and remains one of the most popular tools for
clustering data. When applied to functional data, k-means
clustering results vary depending on how the curves are fit to the data. Ultimately,
the problem of k-means clustering of functional data boils down to
the behavior of the k-means algorithm for different linear
transformations of the data, which is the focus of this paper.

Let $y_1(t), y_2(t), \ldots, y_n(t)$ denote a sample of functional responses. In most
applications the functions are only observed at a finite number of time points along
with a random error. Thus, a regression model can be used to estimate the function:

$$y_i = X b_i + \epsilon_i, \qquad (1)$$

where $y_i = (y_i(t_1) + \epsilon_{i1}, y_i(t_2) + \epsilon_{i2}, \ldots, y_i(t_{m_i}) + \epsilon_{i m_i})'$,
$\epsilon_i$ is a vector of random errors, $b_i$ is the $p \times 1$ vector of regression
coefficients for the ith function, and X is a design matrix determined by the choice of
basis functions used to represent the functions (e.g. Ramsay and Silverman,
1997, Section 3.2). The estimated regression coefficients can be obtained
using least-squares:

$$\hat{b}_i = (X'X)^{-1} X' y_i. \qquad (2)$$
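As a minimal sketch of the least-squares step, the following pure-Python example fits a quadratic power-series basis to two noise-free curves observed at common time points. The basis, the time points, the curves, and the helper names (`design_matrix`, `solve`, `least_squares`) are illustrative assumptions and are not taken from the original analysis.

```python
def design_matrix(times, degree):
    """Power-series design matrix X: the row for time t is [1, t, ..., t**degree]."""
    return [[float(t) ** k for k in range(degree + 1)] for t in times]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares(x, y):
    """Return the least-squares coefficients (X'X)^{-1} X'y via the normal equations."""
    p = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical example: two curves observed at the same seven time points.
times = [0, 1, 2, 3, 4, 5, 6]
X = design_matrix(times, degree=2)
curves = [
    [1.0 + 0.5 * t - 0.1 * t ** 2 for t in times],
    [3.0 - 0.2 * t + 0.05 * t ** 2 for t in times],
]
# One coefficient vector per curve; these vectors are what k-means would cluster.
coefficients = [least_squares(X, y) for y in curves]
```

Because the simulated curves contain no error, the fitted coefficients reproduce the generating coefficients up to rounding. With noisy data and a poorly conditioned basis, a QR or SVD based solver would be a safer choice than the normal equations used here.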
A natural way to cluster the functions is to apply the
k-means algorithm to the estimated regression coefficients $\hat{b}_i$.

Figure 1

Figure 2

The results from clustering shown in panels A, B and D in Figure 2

If A denotes an arbitrary square invertible matrix, the regression model (1) can be expressed as

$$y_i = Z a_i + \epsilon_i, \qquad (3)$$

where $Z = XA$ and $a_i = A^{-1} b_i$. Although the fitted values from (1) and (3) are
identical, k-means clustering of the estimated coefficients $\hat{a}_i$ and $\hat{b}_i$
can give different results.

It is interesting to note that in the Prozac example above, clustering
an appropriate linear transformation of the Fourier coefficients produces results
almost identical to clustering the B-spline coefficients shown in
panel B of Figure 2, where the transformation matrix is $T = (X_B' P_F X_B)^{-1} X_B' X_F$.

The remainder of the paper is organized as follows: Section 2 shows that
clustering the raw data will often produce results very similar to clustering
estimated regression coefficients from an orthogonal regression; Section 3 shows
that the k-means algorithm using an L2
function space metric is equivalent to clustering regression coefficients after an
appropriate linear transformation; Section 4 discusses optimal linear
transformations of the data for k-means clustering and provides a
simple illustration for clustering linear functions. The paper is concluded in
Section 5.

2 Clustering the Raw Data

In the Prozac example of Section 1, it turns out that clustering regression
coefficients obtained using an orthogonal design matrix produces cluster mean curves
that are essentially indistinguishable from those produced by clustering the raw
data shown in panel A of Figure 2.

First, certain linear transformations have no effect on
k-means clustering. In particular, clustering
p-dimensional observations $y_i$ and the transformed data

$$cH(y_i - \mu),$$

where $\mu \in \mathbb{R}^p$, $c \in \mathbb{R}^1$ with $c \neq 0$, and H is an orthogonal $p \times p$ matrix,
yield identical results.

Let $X = UDV'$ denote the singular value decomposition of the design matrix X in (1).
Then $X_o = XVD^{-1} = U$ is an orthogonal design matrix and $b_o = DV'b$ is the vector
of associated regression coefficients. The least squares estimator of $b_o$ is

$$\hat{b}_o = X_o' y = U' y.$$

The raw data, after rotating by H′, has two orthogonal parts: the estimated orthogonal regression
coefficients $\hat{b}_o$ and a pure error component. Thus, if the error variance is zero, clustering the raw data is exactly
equivalent to clustering the regression coefficients from an orthogonal design
matrix. That is, both methods will produce identical clusters. Since the (rotated)
raw data has a pure error component that presumably contains no information on the true clusters, one would
expect that clustering the estimated coefficients

3 Clustering Functional Data with an L2 Metric

In standard applications of k-means clustering, data points
in $\mathbb{R}^p$ are assigned to clusters based on minimal Euclidean distance to the cluster
centers. If the data are functions, then an L2 metric in
function space may be a more appropriate metric to use for clustering. If
y(t) is a functional observation and
ξ(t) is a functional cluster mean, then the squared
L2 distance between these two functions on an
interval [T1, T2] is

$$\int_{T_1}^{T_2} \left( y(t) - \xi(t) \right)^2 \, dt.$$
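The squared L2 distance can be checked numerically. The sketch below assumes both functions are quadratics in the power basis 1, t, t², for which the matrix W of basis inner products used in this section has a closed form. It compares the coefficient form (β − γ)′W(β − γ) with a midpoint-rule integral. The coefficient values are hypothetical, and a Cholesky factor of W is swapped in for the symmetric square root W1/2: the two differ only by an orthogonal rotation, so Euclidean distances between transformed coefficients are the same.

```python
import math


def gram_matrix(t1, t2, degree):
    """W[l][m] = integral over [t1, t2] of t**l * t**m, for the basis 1, t, ..., t**degree."""
    size = degree + 1
    return [[(t2 ** (l + m + 1) - t1 ** (l + m + 1)) / (l + m + 1)
             for m in range(size)] for l in range(size)]


def quadratic_form(beta, gamma, w):
    """Squared L2 distance in coefficient form: (beta - gamma)' W (beta - gamma)."""
    d = [b - g for b, g in zip(beta, gamma)]
    return sum(d[l] * w[l][m] * d[m] for l in range(len(d)) for m in range(len(d)))


def numeric_l2_squared(beta, gamma, t1, t2, steps=20000):
    """Midpoint-rule approximation of the integral of (y(t) - xi(t))**2 over [t1, t2]."""
    h = (t2 - t1) / steps
    total = 0.0
    for i in range(steps):
        t = t1 + (i + 0.5) * h
        diff = sum((b - g) * t ** k for k, (b, g) in enumerate(zip(beta, gamma)))
        total += diff * diff * h
    return total


def cholesky(w):
    """Lower-triangular factor L with W = L L' (W must be positive definite)."""
    n = len(w)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = w[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def transform(coef, low):
    """Return L' coef; Euclidean distances between such vectors are L2 distances."""
    n = len(coef)
    return [sum(low[i][j] * coef[i] for i in range(j, n)) for j in range(n)]


T1, T2 = 0.0, 6.0
beta = [1.0, 0.5, -0.1]    # hypothetical curve y(t)
gamma = [0.0, 1.0, 0.0]    # hypothetical cluster mean xi(t)
W = gram_matrix(T1, T2, degree=2)
L = cholesky(W)

exact = quadratic_form(beta, gamma, W)
approx = numeric_l2_squared(beta, gamma, T1, T2)
euclid = sum((a - b) ** 2 for a, b in zip(transform(beta, L), transform(gamma, L)))
```

For these values the coefficient form equals 39.552, and both the numerical integral and the squared Euclidean distance between the transformed coefficient vectors agree with it up to rounding and quadrature error.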
Suppose functions y(t) are represented
using a regression relation

$$y(t) = \sum_{l=0}^{p} \beta_l u_l(t).$$

For instance, in a quadratic regression we would have
$u_0(t) = 1$, $u_1(t) = t$, $u_2(t) = t^2$. Alternatively, the $u_l(t)$ could be
orthogonal polynomials or, in the case of a
Fourier expansion, trigonometric functions. Denote the expansion of a cluster mean
ξ(t) by

$$\xi(t) = \sum_{l=0}^{p} \gamma_l u_l(t).$$

Then

$$\int_{T_1}^{T_2} \left( y(t) - \xi(t) \right)^2 dt = (\beta - \gamma)' W (\beta - \gamma) = \left\| W^{1/2} (\beta - \gamma) \right\|^2,$$

where β and γ are the vectors of regression coefficients for
y(t) and ξ(t) respectively, W is the symmetric (p + 1) ×
(p + 1) matrix with elements

$$W_{lm} = \int_{T_1}^{T_2} u_l(t) \, u_m(t) \, dt,$$

and $W^{1/2}$ is the symmetric square root of W. Thus, if one wishes to cluster functional data using an
L2 metric, then one can simply plug the
transformed regression coefficients $W^{1/2}\beta$ into a standard k-means algorithm. This transformation was
used in the Prozac example of Section 1 to obtain an L2
metric clustering of the estimated power series basis coefficients (see panel D of
Figure 2).

4 A Canonical Transformation for Clustering

The k-means algorithm may fail to find true clusters in a
data set if there is substantial variability in the data unrelated to differences in
clusters. In fact, there is nothing inherent in the k-means
algorithm that guarantees that true clusters will be discovered. Instead the
k-means algorithm tends to place sample cluster means where
maximal variation occurs in the data. Thus, clustering functional data using the
k-means algorithm will perform best if the linear
transformations used to fit the curves stretch the data in a direction that
corresponds to true cluster differences.

Basically, the k-means algorithm begins with an initial set
of k cluster means and then assigns individual data points to
clusters depending on which cluster center the individual points are nearest. The
cluster means are then updated based on the assignment of points to clusters and the
algorithm continues to iterate until no more points are reassigned to clusters.
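The iteration just described can be sketched in a few lines of pure Python. This is a minimal illustration with user-supplied initial means, not a production implementation: the function names and the toy data are assumptions, and an empty cluster simply keeps its previous mean.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_means, max_iter=100):
    """Alternate nearest-mean assignment and mean updates until no label changes."""
    means = [list(m) for m in initial_means]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(means)), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_labels == labels:  # no point was reassigned: stop
            break
        labels = new_labels
        for j in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = [sum(c) / len(members) for c in zip(*members)]
    return means, labels


def within_sum_of_squares(points, means, labels):
    """Total squared distance from each point to its assigned cluster mean."""
    return sum(squared_distance(p, means[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, labels = k_means(points, initial_means=[(0, 0), (0, 1)])
wss = within_sum_of_squares(points, means, labels)
```

Even with both initial means placed inside the first group, the iteration separates the two groups here. In less clear-cut data the result depends on the starting means, which is why several random starts are commonly used.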
Because the algorithm iterates by assigning points to the cluster whose center is
closest, the optimization achieved by the algorithm is to find groupings that
minimize the within group sum-of-squares, or equivalently, to maximize the between
group sum-of-squares.

We will assume that differences between clusters lie in the random
regression coefficients b in (1) and not in the random
error ε. Let $\mu_j$ and $\Psi_j$ denote the mean and covariance matrix respectively of the random
regression coefficient b for the jth cluster and let $\pi_j$ denote the proportion of the population in cluster j
= 1, 2, . . . , k. The covariance matrix Ψ for b can be decomposed as

$$\Psi = \Psi_W + \Psi_B,$$

where

$$\Psi_W = \sum_{j=1}^{k} \pi_j \Psi_j \quad \text{and} \quad \Psi_B = \sum_{j=1}^{k} \pi_j (\mu_j - \mu)(\mu_j - \mu)'$$

are the within cluster and the between cluster covariance matrices
respectively and where $\mu = \sum_{j=1}^{k} \pi_j \mu_j$.

Let $\Psi_W^{-1/2} \Psi_B \Psi_W^{-1/2} = \Gamma D \Gamma'$ denote a spectral decomposition, with Γ orthogonal
and D diagonal. The transformed coefficients $\Gamma' \Psi_W^{-1/2} b$ then have covariance matrix

$$I + D. \qquad (8)$$

In order to accent the between cluster variability and diminish the
contribution of the within cluster variability, one can further transform using a
canonical transformation for clustering

$$C \Gamma' \Psi_W^{-1/2} b, \qquad (9)$$

where $C = \mathrm{diag}(c_1, c_2, \ldots, c_p)$ and the $c_j \geq 0$ are appropriately chosen constants. From (8), the covariance matrix for the canonically
transformed coefficients in (9) is $C^2 + C^2 D$. Thus, choosing large values of $c_j$ corresponding to eigenvalues in D greater than one inflates the between cluster variability relative to the
within cluster variability of the canonically transformed coefficients and setting
cj = 0 for eigenvalues between zero and one minimizes the
contribution of the within cluster variability. For instance, suppose the cluster
means lie on a line. Then multiplying the positive eigenvalue
λ1 in D by a large value of c1 transforms the
coefficient distribution by stretching it in the direction of the line containing
the cluster means. Consequently, the k-means algorithm will place
cluster means along this line for large values of c1. If
the cluster means lie approximately in a q-dimensional plane, then
one would choose $c_1, \ldots, c_q$ to be large and the remaining $c_j$ to be small. An interesting problem is to determine the optimal settings
for the $c_j$ in order to optimize the k-means algorithm according to
minimizing a mean squared error or a classification error rate.

The canonical transformation of the regression coefficients in (9) can be adjusted for the random
error ε in a regression model. The least-squares estimator $\hat{b}$ has covariance matrix equal to
the covariance matrix of b plus $\sigma^2 (X'X)^{-1}$,
where $\sigma^2$ is the error variance and
we have assumed the error components are independent. The canonical transformation
for the estimated coefficients therefore uses the within cluster covariance matrix inflated by $\sigma^2 (X'X)^{-1}$.

Example: A Simple Illustration of a Canonical Transformation

Consider k = 3 clusters of random linear
functions $y(t) = b_0 + b_1 t + \epsilon$. The y-intercepts and slopes
were simulated from a three component normal mixture with mean values of
y-intercept and slopes equal to (0, 1), (2, 1), and (3, 3)
in the three clusters and a common within cluster covariance matrix. The proportion of the population in each of the three clusters is taken
to be $\pi_1 = \pi_2 = \pi_3 = 1/3$. The error variance is
$\sigma^2 = 0.25$. Regression coefficients were
estimated via least-squares. In addition, regression coefficients were
also estimated using an orthogonal design matrix $X_o$ where $X_o' X_o = I$. Finally, a canonical transformation for clustering (9) was also used. If
c1 in C of (9) is too large
relative to c2, the sample cluster means from the
k-means algorithm for the canonically transformed data will
lie along a line which will not be optimal because the true cluster means
defined above are not collinear. Testing different values of
c1 with simulated data indicated that setting C = diag(3.5, 1) appears to be a nearly optimal canonical
transformation in terms of minimizing the average squared difference between the
estimated cluster means and the true cluster means. Figure 3
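The following sketch shows one way to build a canonical transformation for this kind of two-coefficient example. It uses the cluster means, the equal proportions, and C = diag(3.5, 1) stated above. The within cluster covariance matrix `within` is an assumed value chosen purely for illustration, and the construction (whiten by the inverse square root of the within cluster covariance, rotate to the eigenvectors of the whitened between cluster covariance, then rescale by C) is an assumption consistent with a transformed covariance of C² + C²D rather than a confirmed reproduction of the original computation.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sym_eig_2x2(m):
    """Eigenvalues (descending) and orthonormal eigenvectors (columns) of a symmetric 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mid, half = (a + d) / 2, math.hypot((a - d) / 2, b)
    lam1, lam2 = mid + half, mid - half
    if abs(b) > 1e-12:
        v1 = (b, lam1 - a)
    else:
        v1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])  # orthogonal to v1
    return (lam1, lam2), [[v1[0], v2[0]], [v1[1], v2[1]]]


def inverse_sqrt_2x2(m):
    """Symmetric inverse square root of a positive definite symmetric 2x2 matrix."""
    (l1, l2), v = sym_eig_2x2(m)
    scale = [[1 / math.sqrt(l1), 0.0], [0.0, 1 / math.sqrt(l2)]]
    return matmul(matmul(v, scale), transpose(v))


means = [(0.0, 1.0), (2.0, 1.0), (3.0, 3.0)]  # (intercept, slope) means from the example
props = [1 / 3, 1 / 3, 1 / 3]
within = [[0.5, 0.2], [0.2, 0.5]]             # ASSUMED within cluster covariance
c = [3.5, 1.0]                                # C = diag(3.5, 1)

mu = [sum(p * m[i] for p, m in zip(props, means)) for i in range(2)]
between = [[sum(p * (m[i] - mu[i]) * (m[j] - mu[j]) for p, m in zip(props, means))
            for j in range(2)] for i in range(2)]

w_inv_sqrt = inverse_sqrt_2x2(within)
d, rotation = sym_eig_2x2(matmul(matmul(w_inv_sqrt, between), w_inv_sqrt))
whiten_rotate = matmul(transpose(rotation), w_inv_sqrt)
canonical = [[c[i] * whiten_rotate[i][j] for j in range(2)] for i in range(2)]

# Population covariance of the canonically transformed coefficients.
total = [[within[i][j] + between[i][j] for j in range(2)] for i in range(2)]
cov = matmul(matmul(canonical, total), transpose(canonical))
```

The matrix `cov` is diagonal with entries c_j²(1 + d_j), which matches C² + C²D. Multiplying an estimated coefficient vector by `canonical` before running k-means applies the transformation to data.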
In order to compare the performance of the three cases shown in Figure 3

The main point of this section is that one should not blindly throw
regression coefficients into a clustering algorithm and expect the results to
coincide with actual clustering in the data. In particular, as the simulation
example above illustrates, the performance of the k-means
algorithm for clustering functional data can vary considerably depending on how the
functional data are transformed prior to clustering.

The optimal canonical transformation for clustering (9) requires knowing the true within and
between covariance matrices, which in practice are unknown. Unfortunately the
sample between covariance matrix

In situations where cluster means lie in a common hyperplane, Bock (1987) proposes a projection
pursuit clustering algorithm. This algorithm iterates by estimating
the common hyperplane using the subspace spanned by the eigenvectors associated with
the largest eigenvalues of the between group sums-of-squares-and-products matrix and then applying the
k-means algorithm to the data projected onto this
hyperplane. Bolton and Krzanowski (2003)
note that Bock’s algorithm tends to find groups in the direction of
the data corresponding to the largest variance and they propose a slightly
different projection pursuit index to avoid this problem.

When actual cluster means do indeed lie along the major axis of
variation (i.e. the first principal component), the k-means
algorithm should perform quite well. This phenomenon occurs frequently in
morphometric studies of growth and is called allometric extension (Hills, 1982; Bartoletti et al., 1999; Tarpey and Ivey, 2006). Let μ1 and μ2 denote the means of the two populations and suppose that the
eigenvector associated with the largest eigenvalue of the covariance matrices in
both populations is the same, call it β1. Then the allometric extension model states that μ2 − μ1 = δβ1 where δ is a constant (Flury, 1997, page 630). The allometric extension
model may be reasonable in cases where two (or more) closely related species
follow a common growth pattern where one species evolved to a larger overall
size. If the first principal component accounts for a large proportion of the
overall variance, then the k-means algorithm will tend to place
estimated cluster means along the first principal component axis where the true
cluster means reside. Thus, one would not want to automatically standardize the
data before clustering in these cases because it may hurt the
k-means algorithm's ability to correctly determine groupings
along the primary axis.

In a functional data analysis context, suppose that clustering
curves produces roughly “parallel” cluster mean curves
with the same shape. Parallel cluster mean curves occur quite often in practice
when the variability in the intercepts of the curves overwhelms other modes of
variation. In these cases, the first principal component variable will tend to
coincide with the intercept approximately. Consequently all the cluster mean
curves have basically the same shape as the overall mean curve and differ only
in their intercepts. This is fine if the actual clusters differ in terms of
their intercepts only. However, if curve shapes differ among groups, then the
data need to be transformed to minimize the variability of the intercept and
allow the k-means algorithm to find distinct curve shapes. Two
possible solutions are to drop the intercept term when clustering
or to cluster the derivatives of the estimated functions; see Tarpey and Kinateder (2003).

5 Discussion

An appealing aspect of functional data is that the observations are not just
ordinary points in Euclidean space, but they are curves with distinct shapes.
Clustering functional data is a useful way of determining representative curve
shapes in a functional data set. However, the results from clustering curves depend
on how the curves are fit to the data. The k-means clustering
algorithm will perform best if the linear transformation used to fit the curves
stretches the data in the direction corresponding to true cluster differences.
Unfortunately, the optimal transformations for clustering require knowing the
true cluster means. A promising approach to solving this problem is to use
projection pursuit clustering (Bolton and Krzanowski,
2003).

It has been assumed that the error in the regression model contains no
information on the underlying clusters. This assumption may not always hold if the
error variances in different clusters differ. In addition, if the wrong model is fit
to the data producing non-random structure in the residuals, then this structure
could contain information on clusters.

Clustering functional data by applying the k-means
algorithm to the estimated coefficients is very easy and fast. There are two
substantial disadvantages of the k-means algorithm: (i) the
algorithm chops up the data into non-overlapping clusters, whereas in practice
distinct groups in the data will often overlap; (ii) the k-means
algorithm is completely nonparametric and does not take advantage of any valid
parametric assumptions. Finite mixture models do not suffer from these two
weaknesses and provide a useful alternative to the k-means
algorithm. A simple approach is to plug the estimated coefficients into the EM
algorithm for estimating parameters of a finite mixture. A computationally more
complicated but highly flexible approach is to express the cluster/mixture model as
a random effects model with a latent categorical variable for cluster membership and
then estimate the parameters using maximum likelihood via the EM algorithm (James and Sugar, 2003; Muthén and Shedden, 1999).

Footnotes

I am grateful to Minwei Li for programming assistance and to Eva Petkova for
helpful discussions related to this work. I wish to thank the referees, an
Associate Editor and the Editor for their comments and suggestions which have
improved this paper. This work was supported by NIMH grant R01 MH68401-01A2.