An index for cancer clustering.

This paper generalizes the index for temporal clustering proposed by Tango in two ways: it allows for nonuniform population distributions across the study period and it is applicable to the detection of disease clustering in space where there are variations in population distribution among categories of the confounding factor such as age and sex. Applications are illustrated with 1833 cases of mortality from uterine cancer in the Tokyo metropolitan area during 1978-1982.


Introduction
The investigation of disease clustering in space, in time, or in both is an important aspect of epidemiological studies because it can provide clues to the causative mechanism of the disease in question. For example, evidence of space-time clustering suggests that individual cases of disease are closely related in both space and time, as is often the case with infectious diseases. It has been stated on several occasions, in many studies, that childhood leukemia occurs in clusters in both space and time, which indicates the possibility of a viral etiology. Therefore, tests for the detection of space-time clustering have been the subject of considerable research in recent years (1)(2)(3)(4)(5)(6).
In the study of chronic diseases such as cancer, on the other hand, those tests for space-time clustering may not be adequate: cases of chronic disease may be close in space, but they are unlikely to be close in time because of the long and variable periods between exposure and diagnosis. Thus, tests for space clustering may be more appropriate in this case. However, previous tests for space clustering (6)(7)(8) have been derived under the unrealistic assumption that the population at risk is fairly uniform across the region. Therefore, direct use of those tests would produce spurious evidence of space clustering.
This paper presents a test statistic for the detection of disease clustering in space or in time as an extension of the index C for temporal clustering proposed by Tango (9). The extended statistic can adjust for differences in population distribution among categories of a confounding factor such as age and sex. Recently, Whittemore et al. (10) proposed a test capable of adjusting for variations in population distribution among demographic subgroups at different disease risk. However, their procedure, based on a statistic that is essentially identical in form to the index C, is shown to be less adequate than the method proposed in this paper.
*Division of Theoretical Epidemiology, Department of Epidemiology, The Institute of Public Health, 6-1 Shirokanedai 4-chome, Minato-ku, Tokyo 108, Japan.
An Index for Time Clustering
Tango (9) proposed an index C for disease clustering in time,

$$C = \mathbf{r}^t A \mathbf{r} \tag{1}$$

where $n\mathbf{r}^t = (n_1, \ldots, n_m)$, $n = n_1 + \cdots + n_m$, denotes a vector of observed frequencies in m successive time intervals, which is assumed to be a random sample from the uniform multinomial distribution. Hence, asymptotically,

$$\sqrt{n}\,(\mathbf{r} - m^{-1}\mathbf{1}) \sim N(\mathbf{0},\; m^{-2} V(m\mathbf{1})) \tag{2}$$

where

$$V(\mathbf{x}) = \Delta(\mathbf{x}) - \mathbf{1}\mathbf{1}^t \tag{3}$$

and $\Delta(\mathbf{x})$ is the m x m diagonal matrix based on the vector $\mathbf{x}$ and $\mathbf{1}$ is the m-dimensional vector of ones. The entries $a_{ij}$ of the m x m symmetric matrix A are arbitrary known measures of closeness between the ith and jth intervals, with the property that $a_{ii} = 1$ and $a_{ij}$ is a monotonically nonincreasing function of $d_{ij}$, the time distance between the ith and jth intervals. This index attains its maximum value of 1 if and only if $n_i = n$ for some i and $n_j = 0$ for $j \neq i$. A natural choice for the form of the distance $d_{ij}$ may be

$$d_{ij} = |i - j|. \tag{4}$$

Although the choice of the form of $a_{ij}$ may vary depending on the situation, an exponential form

$$a_{ij} = \exp(-d_{ij}) \tag{5}$$

has been considered.
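As an illustration of Eqs. (1), (4), and (5), the index C can be computed in a few lines. The following Python sketch is not the author's program; the function names `closeness_matrix` and `clustering_index` and the example counts are hypothetical and chosen only for illustration.

```python
import math

def closeness_matrix(m):
    """Closeness a_ij = exp(-|i - j|) between intervals i and j (Eqs. 4 and 5)."""
    return [[math.exp(-abs(i - j)) for j in range(m)] for i in range(m)]

def clustering_index(counts, closeness=None):
    """Index C = r^t A r, where r_i = n_i / n are the relative frequencies (Eq. 1)."""
    m = len(counts)
    n = sum(counts)
    a = closeness if closeness is not None else closeness_matrix(m)
    r = [c / n for c in counts]
    return sum(r[i] * a[i][j] * r[j] for i in range(m) for j in range(m))

# Hypothetical counts in five successive intervals (12 cases in each example).
concentrated = clustering_index([0, 0, 12, 0, 0])  # all cases in one interval
adjacent = clustering_index([6, 6, 0, 0, 0])       # cases in neighbouring intervals
separated = clustering_index([6, 0, 0, 0, 6])      # cases in distant intervals
```

When all cases fall in a single interval the index reaches its maximum of 1, and it decreases as the same number of cases is spread over intervals that are farther apart.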
The asymptotic distribution function of the index C under the hypothesis of no clustering in time was first derived using an expansion in a series of central chi-square distributions (9):

$$\Pr\{C < c\} = \sum_{j=0}^{\infty} a_j \Pr\{\chi^2_{m-1+2j} < (c - h)/\beta\} \tag{6}$$

where $\chi^2_g$ denotes the chi-square variable with g degrees of freedom. We shall omit the details on the parameters $a_j$, h, and $\beta$ here. However, this formula was not easy to use in more general cases. Recently, Tango (11) suggested that a better approximation for the distribution of C may be obtained by standardizing C with

$$T = \frac{C - E(C)}{\sqrt{\mathrm{Var}(C)}} \tag{7}$$

and approximating it with one central chi-square distribution; that is, the p-value for the observed value c of the index C can be approximated by a chi-square probability with $\nu$ degrees of freedom, computed from the incomplete gamma function, where

$$E(C) = m^{-2}\{\mathbf{1}^t A \mathbf{1} + n^{-1}\,\mathrm{tr}[A V(m\mathbf{1})]\},$$

$\nu$ is the degrees of freedom of the approximating chi-square distribution and is determined by the skewness $\sqrt{\beta_1(C)}$ of the index C, which is given by

$$\sqrt{\beta_1(C)} = \frac{8\{3\,\mathbf{1}^t (A V(m\mathbf{1}))^2 A \mathbf{1} + n^{-1}\,\mathrm{tr}[(A V(m\mathbf{1}))^3]\}}{\sqrt{n}\,\{4\,\mathbf{1}^t A V(m\mathbf{1}) A \mathbf{1} + 2 n^{-1}\,\mathrm{tr}[(A V(m\mathbf{1}))^2]\}^{1.5}} \tag{13}$$

For convenience in practical applications, the approximated upper 100$\alpha$ percentiles $T_\alpha$ of the standardized clustering index T are given in Table 1 as a function of the skewness value $\sqrt{\beta_1(C)}$.

Extension of the Index
In this section we shall extend the index C so that it is applicable to disease clustering in time or in space where the overall population at risk is not uniform across the region or where there are differences in population distributions among categories of confounding factors such as age.
Let m denote the number of points in time or in space, called regions. Let $n_i$ and $E_i$ (i = 1, ..., m) denote the observed number of cases and the expected number of cases in the ith region, respectively. Then, as a proper index which can measure the relative intensity of disease in the ith region, we can consider the ratio $n_i / E_i$. One example of this quantity is the well-known SMR (standardized mortality ratio), which is frequently used in epidemiological studies. Using the above quantity, an extended index can be introduced:

$$G = \sum_{i=1}^{m} \sum_{j=1}^{m} \frac{n_i}{E_i}\,\frac{n_j}{E_j}\, a_{ij} = \mathbf{q}^t A \mathbf{q} \tag{14}$$

where $\mathbf{q}^t = (n_1/E_1, \ldots, n_m/E_m)$, $a_{ij}$ has the same form as defined by Eq. (5), and $d_{ij}$ may be the properly scaled Euclidean distance between the ith region and the jth region in the case of the space clustering problem. $E_i$ can be computed by combining all the regions (i.e., taking the standard population to be the entire population being studied). In fact, when $E_i = E_j$ for all i, j, then $E_i = n/m$ and

$$G = m^2 C. \tag{16}$$

Therefore, it can be said that the index C is reasonably extended to G, which can accommodate variations in the confounding factor distributions over the region.
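The relation between the extended index and the original index can be checked numerically. The Python sketch below computes G as the quadratic form in the ratios $n_i/E_i$ with $a_{ij} = \exp(-d_{ij})$ for Euclidean distances, and compares it with $m^2 C$ when all expected counts are equal. The region coordinates, the counts, and the function names are hypothetical and serve only as an illustration.

```python
import math

def spatial_closeness(coords):
    """a_ij = exp(-d_ij), with d_ij the Euclidean distance between regions i and j."""
    return [[math.exp(-math.dist(p, q)) for q in coords] for p in coords]

def quadratic_form(x, a):
    """x^t A x for a vector x and a square matrix A stored as nested lists."""
    m = len(x)
    return sum(x[i] * a[i][j] * x[j] for i in range(m) for j in range(m))

def index_c(observed, a):
    """Original index C = r^t A r with r_i = n_i / n."""
    n = sum(observed)
    return quadratic_form([c / n for c in observed], a)

def index_g(observed, expected, a):
    """Extended index G = q^t A q with q_i = n_i / E_i."""
    return quadratic_form([o / e for o, e in zip(observed, expected)], a)

# Hypothetical data: five regions with made-up centroids and case counts.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
observed = [4, 1, 0, 7, 3]
a = spatial_closeness(coords)
m, n = len(observed), sum(observed)

# Equal expected counts E_i = n / m: G should reduce to m^2 C.
g_uniform = index_g(observed, [n / m] * m, a)
c_value = index_c(observed, a)

# Unequal expected counts (same total): the index now adjusts for them.
g_adjusted = index_g(observed, [1.5, 1.5, 3.0, 6.0, 3.0], a)
```

With unequal expected counts, a region contributes to G according to its excess over expectation rather than its raw count, which is the sense in which G adjusts for a nonuniform population.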
Second, consider the problem of disease clustering in space where the population size is, of course, different over the region with variations in the distributions of the confounding factor such as age.
Let K denote the number of categories in the confounding factor and let $\xi_{ik}$ and $n_{ik}$ denote the population size and the observed number of cases, respectively, for the ith region and the kth category of the confounding factor. Under the hypothesis that no clustering occurs and that the disease incidence rate changes across the categories of the confounding factor, the vector of observed frequencies $(n_{1k}, \ldots, n_{mk})$ for the kth category of the confounding factor can be regarded as a random sample of size $n_{+k} = n_{1k} + \cdots + n_{mk}$ from a nonuniform multinomial distribution with parameter $\mathbf{p}_k = (p_{1k}, \ldots, p_{mk})$.
Needless to say, we can use Table 1 to read the approximated upper 100$\alpha$ percentiles of T for the extended index G. On the other hand, Whittemore et al. (10) proposed a test statistic identical in form to the unadjusted index C even for the above-stated situation and approximated it with a normal distribution. Clearly, the statistic C itself cannot be a standardized measure. Furthermore, their test has poorer power than the test based on the index G since they have used the matrix A as a measure of distance (11). The normal approximation to the asymptotic distribution of the index G should also be used with caution because the distribution almost always has a substantial amount of positive skewness. This was examined by Tango (11) for the detection of time clustering; it will be investigated in detail for the detection of space clustering by a simulation study in the next section.
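To make the standardization and the chi-square approximation concrete, the following Python sketch evaluates the null mean, variance, and skewness of the index C for the uniform case and converts a standardized value T into an approximate p-value. It is an illustration under stated assumptions rather than the author's program. The variance expression is an assumption taken from the standard cumulants of a quadratic form in normal variables; the choice $\nu = 8/\beta_1(C)$ is an assumption obtained by matching the skewness $\sqrt{8/\nu}$ of a chi-square variable; and the p-value is taken as the upper-tail probability $\Pr\{\chi^2_\nu > \nu + T\sqrt{2\nu}\}$. All function names and the example counts are hypothetical.

```python
import math

def mat_mul(x, y):
    """Product of two square matrices stored as nested lists."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def trace(x):
    return sum(x[i][i] for i in range(len(x)))

def sum_all(x):
    """1^t X 1, i.e. the sum of all entries of X."""
    return sum(sum(row) for row in x)

def null_moments(a, n):
    """Mean, variance and skewness of C under the uniform null hypothesis."""
    m = len(a)
    v = [[(m if i == j else 0) - 1.0 for j in range(m)] for i in range(m)]  # V(m1)
    av = mat_mul(a, v)
    av2 = mat_mul(av, av)
    av3 = mat_mul(av2, av)
    mean = (sum_all(a) + trace(av) / n) / m ** 2
    spread = 4.0 * sum_all(mat_mul(av, a)) + 2.0 * trace(av2) / n
    variance = spread / (n * m ** 4)  # assumed form, see the text above
    skewness = (8.0 * (3.0 * sum_all(mat_mul(av2, a)) + trace(av3) / n)
                / (math.sqrt(n) * spread ** 1.5))
    return mean, variance, skewness

def upper_gamma_q(s, x):
    """Regularized upper incomplete gamma function Q(s, x)."""
    if x <= 0.0:
        return 1.0
    prefactor = math.exp(-x + s * math.log(x) - math.lgamma(s))
    if x < s + 1.0:
        # Power series for the lower function P(s, x); Q = 1 - P.
        term = total = 1.0 / s
        k = s
        for _ in range(1000):
            k += 1.0
            term *= x / k
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * prefactor
    # Continued fraction for Q(s, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - s
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - s)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * prefactor

def approx_p_value(t, skewness):
    """Upper-tail chi-square approximation for a standardized index value t."""
    nu = 8.0 / skewness ** 2  # assumed: matches the chi-square skewness sqrt(8/nu)
    return upper_gamma_q(nu / 2.0, (nu + t * math.sqrt(2.0 * nu)) / 2.0)

# Hypothetical example: 12 monthly counts with a_ij = exp(-|i - j|).
counts = [2, 1, 3, 2, 9, 11, 8, 2, 1, 3, 2, 1]
m, n = len(counts), sum(counts)
a = [[math.exp(-abs(i - j)) for j in range(m)] for i in range(m)]
r = [c / n for c in counts]
c_obs = sum(r[i] * a[i][j] * r[j] for i in range(m) for j in range(m))
mean, variance, skewness = null_moments(a, n)
t_obs = (c_obs - mean) / math.sqrt(variance)
p_obs = approx_p_value(t_obs, skewness)
```

Because the skewness is positive, the chi-square tail is heavier than the normal tail at the same value of T, which is why a normal approximation tends to overstate significance.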

Simulation
To investigate the goodness of the approximation by the chi-square distribution, we performed the following Monte Carlo simulation. The situation considered here is one in which the overall population size differs across the region, i.e., K = 1.
Step 0: As an entire population $\Omega$, we shall consider the set of 400 points in two-dimensional space. Then, repeat the following procedure, Step 1 to Step 3, 100 times.
Step 2: Take m random numbers from $N(10, 2^2)$, say $(r_1, \ldots, r_m)$. Then the value $r_i$ is assigned to $\xi_i$ (i = 1, ..., m), the population size for the ith region $x_i$, and compute
Table 2 shows that the asymptotic distribution of the index has a substantial amount of positive skewness and that the chi-square approximation is fairly good.
Therefore, the p-value of T = 1.657 is slightly greater than 0.05, indicating weak but approximately significant evidence of clustering (p ≈ 0.05). Results for several values of X, summarized in Table 4, are very similar to one another. Therefore, we can infer that some kind of space clustering may have occurred in mortality from uterine cancer during 1978 to 1982 in metropolitan Tokyo. Visual inspection of the map of SMR illustrated in Figure 1 suggests that clustering occurs in the east of Tokyo, in wards such as Arakawa (SMR = 124), Taitoh (SMR = 119), Sumida (SMR = 122), Kohtoh (SMR = 122), and Edogawa (SMR = 118). The result might provide a motivation for further investigation of etiologic clues that may explain the clustering of uterine cancer in this area.
Computing these statistics required about 4 min of CPU time on an NEC PC 9801 (VX 21) using a BASIC computer program that is available from the author upon request.
The author thanks S. Hashimoto for his collaboration in providing data on the population and the latitude and longitude of each of the 23 wards of metropolitan Tokyo. This work was supported in part by a grant in aid for scientific research (grant no. 62530019) from the Ministry of Education, Science and Culture of Japan.
T. TANGO