![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||
Copyright © 2009 Vieira et al; licensee BioMed Central Ltd. Spatial analysis of bladder, kidney, and pancreatic cancer on upper Cape Cod: an application of generalized additive models to case-control data 1Department of Environmental Health, Boston University School of Public Health, 715 Albany Street, Boston, MA 02118, USA 2Department of Biostatistics, Boston University School of Public Health, 715 Albany Street, Boston, MA 02118, USA 3Department of Epidemiology, Boston University School of Public Health, 715 Albany Street, Boston, MA 02118, USA Corresponding author.Verónica Vieira: vmv/at/bu.edu; Thomas Webster: twebster/at/bu.edu; Janice Weinberg: janicew/at/bu.edu; Ann Aschengrau: aaschen/at/bu.edu Received November 7, 2008; Accepted February 10, 2009. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Background In 1988, elevated cancer incidence in upper Cape Cod, Massachusetts prompted a large epidemiological study of nine cancers to investigate possible environmental risk factors. Positive associations were observed, but explained only a portion of the excess cancer incidence. This case-control study provided detailed information on individual-level covariates and residential history that can be spatially analyzed using generalized additive models (GAMs) and geographical information systems (GIS). Methods We investigated the association between residence and bladder, kidney, and pancreatic cancer on upper Cape Cod. We estimated adjusted odds ratios using GAMs, smoothing on location. A 40-year residential history allowed for latency restrictions. We mapped spatially continuous odds ratios using GIS and identified statistically significant clusters using permutation tests. Results Maps of bladder cancer are essentially flat ignoring latency, but show a statistically significant hot spot near known Massachusetts Military Reservation (MMR) groundwater plumes when 15 years latency is assumed. The kidney cancer map shows significantly increased ORs in the south of the study area and decreased ORs in the north. Conclusion Spatial epidemiology using individual level data from population-based studies addresses many methodological criticisms of cluster studies and generates new exposure hypotheses. Our results provide evidence for spatial clustering of bladder cancer near MMR plumes that suggest further investigation using detailed exposure modeling. Background In 1988, elevated cancer incidence in the upper Cape Cod region of Massachusetts (Figure (Figure1)1
Methods for mapping point-based epidemiologic data have received less attention than mapping areal data [12]. Generalized additive models (GAMs), a type of statistical model that combines smoothing with the ability to analyze binary outcome data and adjust for covariates, provide a useful framework for examining point data [13-18]. Using individual-level information and location in a generalized additive model, we calculated the crude and adjusted odds ratios for bladder, kidney, and pancreatic cancers on upper Cape Cod. These analyses, unlike many registry-based maps, have the advantage of controlling for spatial confounders, examining the effect of latency, and allowing for hypothesis testing for the significance of location in the disease maps. The objectives of the present analyses are to identify exposure hypotheses for further investigation and to demonstrate spatial epidemiology using generalized additive models. Methods Study Population We investigated the association between residence and kidney, pancreatic, and bladder cancer on upper Cape Cod, Massachusetts using data from a population-based case-control study [1]. Spatial analyses of breast, lung, and colorectal cancer were previously reported in Vieira et al. [17]. The Massachusetts Cancer Registry was used to identify incident cancer cases diagnosed from 1983–1986. Participants were restricted to permanent residents (living on upper Cape Cod at least 6 months of the year) with complete residential histories. A total of 62 bladder cancer cases, 35 kidney cancer cases, and 37 pancreatic cancer cases were included. There were 885 bladder cancer controls, 803 kidney cancer controls, and 651 pancreatic cancer controls. A large number of controls were available for the present analyses because the parent study investigated nine cancer types including breast, lung, and colorectal cancers with larger numbers of cases [1]. See earlier papers [2,3] for a detailed description of the methods used to define the study population, including the rationale for the method of control selection. Briefly, controls were chosen to represent the underlying population that gave rise to all the cancer cases from the large epidemiologic study, i.e., permanent residents of upper Cape Cod during the same time period. Controls were frequency matched to all cancer cases on age, gender, and vital status. It is important to note that controls were not matched on town of residence. Because many of the cases were deceased or elderly, three different sources of controls were used: (1) random digit dialing for living controls less than 65 years of age; (2) Centers for Medicare and Medicaid Services (formerly the Health Care Financing Administration) for living controls 65 years of age or older; and (3) death certificates for controls who had died from 1983–1986. Participants or their next-of-kin completed an extensive interview, providing information on demographics (age, sex, marital status, and education), a forty-year residential history, and potential confounders. "Index years" were randomly assigned to controls in a distribution similar to that of diagnosis years for cases. We used the diagnosis and index years to estimate timing and duration of environmental exposure among case and controls, respectively. The Institutional Review Board of Boston University Medical Center approved the research. Geographical Information System (GIS) All residential addresses reported by participants in the upper Cape Cod area over the forty-year period prior to the diagnosis or index year were eligible for the spatial analysis. Many participants lived at more than one address during their residential history on upper Cape Cod. We excluded all addresses where residency time began after the diagnosis date for cases and index date for controls. The bladder cancer data set included 95 case locations and 1,382 control locations. The kidney cancer data set included 54 case locations and 1,220 control locations. The pancreatic cancer data set included 49 case locations and 1,005 control locations. Locations of participant residences were geocoded using the Massachusetts State Plane Coordinate System with North American Datum 1983 (NAD1983). Geocoding, the process where map coordinates are assigned to each street address, was done without knowledge of case status, and the final data were checked for accuracy. GIS allows us to map the coordinates of participants and link the location to individual interview data and environmental information. Figure Figure22
Generalized Additive Models (GAMs) We estimated local disease odds using generalized additive models, a form of non-parametric or semi-parametric regression with the ability to analyze binary and continuous outcome data while adjusting for covariates [13,18]. We modeled location, a potential surrogate measure of exposure, using a bivariate smooth (S) of spatial coordinates (x1) and (x2)
where the left-hand side is the log of the disease odds at location (x1, x2), z is a vector of covariates, and γ is a vector of parameters. Throughout this paper, we will refer to (x1) and (x2) as longitude and latitude, although (x1) and (x2) are technically measures of distance and not degrees. The model is semiparametric because it has the nonparametric smooth but the covariates are modeled parametrically. Without the smooth function, S(x1, x2), the model becomes an ordinary logistic regression on the covariates. Spatial confounding occurs when risk factors for a disease are not evenly distributed. For example, a cluster of lung cancer may be due to an increased density of smokers. A group of core confounders, chosen a priori based on the current scientific literature or study design, was included in all adjusted cancer analyses: vital status at interview, gender, race, age at diagnosis or index year, and usual number of cigarettes smoked. In the kidney and bladder cancer analyses, we also controlled for history of urinary tract infection or stone. In the pancreatic cancer analysis, we also controlled for usual alcohol consumption. We dropped other covariates from the model because they did not change the appearance of the map, including education level, prior medical treatment with irradiation, and occupational exposure to solvents. The covariate with the largest percentage of missing data was education (7% of the participants). Since education was not a confounder in the final model, we did not exclude these participants. We used a loess smooth which adapts to changes in population density [13]. The amount of smoothing, referred to as the span size, depends on the percentage of the data points in the neighborhood. As a result, the geographic extent of the neighborhood is smaller in densely populated areas and larger in areas with sparse population. We determined the optimal amount of smoothing for each map by minimizing the Akaike's Information Criterion (AIC). Small span sizes produce bumpier surfaces and larger span sizes produce smoother surfaces. As the span size increases, the amount of bias in the fit increases and the variance decreases [13]. We created a rectangular grid covering the study area using the minimum and maximum longitude and latitude from the study subjects (The GAM does not predict locations beyond where the subjects live). Grid points lying outside the outline map of the study area were clipped, as were areas where people cannot reside (e.g., ocean or conservation land). We converted from log odds to odds ratios (ORs) using the odds of disease in the whole study area as the reference. When controls are appropriately sampled from the population giving rise to the cases, the case-control ratio (disease odds) in a subset of the area should be proportional to the disease incidence rate and the odds ratio estimates rate ratio. In order to make maps visually comparable, we mapped all results using the same dark blue to dark red continuous (unclassified) color scale and range of odds ratios, 0.25–2.50. This range covers most but not all of the ORs observed in our analyses, preventing maps from being washed out by an area of extremely high or low ORs. We determined the presence of spatial confounding by visually comparing crude and adjusted maps. If their optimal span sizes differ, we also compared maps using a common span size, allowing us to distinguish between changes due to confounder adjustment and changes due to span size. GAMs also provide a framework for hypothesis testing. We first tested the null hypothesis that case status does not depend on the smooth term using a permutation test based on the difference of the deviances of model (1) with and without the smooth term. We condition on the number of cases and controls, preserving the relationship between case/control status and covariates, and randomly assign individuals to locations. We carry out 999 permutations of location in addition to the original. For each permutation, we ran the GAM using the optimal span of the original data and computed the deviance statistic. We do this to preserve the spatial resolution of the original map; the test is thus conditional on the span size. We divide the rank of the observed value by 1000 to obtain a p-value. We used a p-value cut off of 0.05 as a screening tool for possibly meaningful associations. We discuss results as "significant" if the associated p-values are less than 0.05, but acknowledge that some results may be due to chance. If the global deviance test indicates that the map is unlikely to be flat, we next want to locate areas of the map that exhibit unusually high or low disease odds. We examined point-wise departures from the null hypothesis using permutation tests if the global statistic indicated that location was significant at the 0.05 level. We obtained a distribution of the log odds at every point using the same set of permutations we used for calculating the global statistics. We defined areas of significantly decreased odds ("cold spots") to include all points that ranked in the lower 2.5% of the point-wise permutation distributions and areas of elevated odds ("hot spots") to include all points that ranked in the upper 2.5% of the point-wise permutation distributions. By drawing the 2.5% and 97.5% contour lines, we mapped areas of significantly decreased and increased risk. Webster et al. [16] provides a detailed discussion of the statistical methods, analyses using synthetic data, and a comparison with the kernel method of Kelsall and Diggle [14]. We used S-Plus [19] to perform the generalized additive modeling and ArcGIS [20] to map the results of our analyses. Sample program code is available at http://www.cireeh.org/pmwiki.php/Main/SpatialEpidemiology. Residential History Our initial, no-latency analyses included all eligible residences with complete address information to allow for geocoding. Therefore, exposures occurring up to diagnosis were assumed to contribute to the risk of disease. By including all addresses, the disease outcome is replicated with the same covariates but different residence for each participant. However, solid cancers initiated by exposure to carcinogens typically take more than a decade to develop. For cancers with sufficient case numbers, we performed a fifteen-year latency analysis by restricting inclusion to the residences occupied by participants at least fifteen years prior to the diagnosis or index year (Residences within the fifteen year window were excluded because geographical location within that window was assumed not relevant to the outcome). Because the inclusion of multiple residences for the same individual may bias our statistical model, we also performed an analysis of residence of longest duration. Results Participants were predominantly white and over 60 years of age (Table 1). Cases were more likely to be smokers. A larger proportion of bladder and kidney cancer cases than controls were male and less educated. Pancreatic cancer cases and controls were predominantly female. Prior medical radiation was more common among pancreatic cases than controls, and history of urinary tract infection or stone was more common among bladder and kidney cancer cases than controls.
Bladder Cancer Assuming no latency, location was not statistically significant at the 0.05 level in the crude (not shown) and adjusted maps (Table 2 and Figure Figure3a).3a
We also restricted the adjusted 15-year latency analysis to residences of longest duration. Using the same span size as before (0.30), Figure Figure3d3d Kidney Cancer The crude and adjusted analyses predicted similar results (Table 3 and Figures 4a, b
The AIC curve for the adjusted kidney cancer model indicates two local minima at span sizes of 0.15 and 0.40 before reaching the global minimum (and optimal span) of 0.90. We repeated the adjusted kidney analysis using spans of 0.40 (Figure (Figure4c)4c Pancreatic Cancer The variation of risk was smaller in the crude analysis (Table 4 and Figure Figure5a)5a
We examined potential bias from multiple residences in the pancreatic cancer analysis by restricting participants' addresses to the one of longest duration. The spatial patterns of the resulting map differ from the map with all residences suggesting a bias may exist (not shown). The cluster in the center of the study area (Figure (Figure5b)5b Discussion In our analyses, bladder cancer on upper Cape Cod displayed an area of increased risk when we considered latency, and location became a significant predictor. Many geographic analyses based on cancer registry data only use address at diagnosis. The greater spatial variation in bladder cancer with increased latency is consistent with misclassification of geographically associated risk factors, including environmental exposures. If population movement is random with respect to disease status, ignoring latency should increase nondifferential exposure misclassification and tend to make maps flatter. Results of the kidney cancer analysis indicated that there was latitudinal variation causing the odds ratios to tilt significantly in magnitude from north to south. The AIC curve resulted in two additional minima of 0.15 and 0.40. Choice of span size is one of the most important issues in smoothing [13]. We used the Akaike Information Criterion, a computationally feasible method for choosing an "optimal" span based on the tradeoff between bias and variance of the smooth. There are, however, problems with automatic selection procedures. Selecting the span that optimizes the bias-variance tradeoff is not necessarily the same as understanding the importance of map features. Rather than using a single span, there may be important aspects of the data at different scales. New methods are needed to address this issue, e.g., [21]. The adjusted pancreatic cancer map showed more pronounced hot and cold spots compared to the crude analysis. Rather than creating disease clusters as is often assumed, spatial confounding was partially hiding areas of increased risk. Unlike cancer registry maps which contain limited data on covariates, our analyses controlled for many covariates available in the case-control study questionnaire. A number of epidemiologic studies have examined cancer and environmental exposures on Cape Cod [1-11]. Previous studies investigated the association between kidney, pancreatic and bladder cancer and tetrachloroethylene (PCE) in drinking water from vinyl-lined asbestos cement distribution pipes [2,3] and cranberry cultivation [5]. Increased relative risks were found for bladder and pancreatic cancer in the highest PCE-exposed individuals. No evidence was found for increased risks of bladder, kidney, and pancreatic cancer associated with living within 2600 ft of cranberry bogs. Adding PCE exposure and proximity to cranberry bogs to the current spatial models had no effect on the appearance of the maps. Our analysis located a significant bladder cancer "hot spot" to the southwest of the Massachusetts Military Reservation (MMR, Figure Figure6a).6a
Overlaying maps of odds ratios with maps of pollution sources can generate hypotheses about exposure. Caution is needed, however, because many geographic features overlap. The Massachusetts online repository of geographically coded features for shape files potentially related to environmental exposure [23] was explored to generate hypotheses for further investigation. Groundwater plumes were of particular interest because of our earlier analyses of breast cancer and pollution of drinking water [24]. With no prior knowledge of any geographic relationship to bladder cancer, we compared the two data sets (Figure (Figure6),6 Case-control studies are one of the standard epidemiologic tools for investigating associations between disease and exposure. By combining such data with advanced statistical techniques, we were able to address many criticisms of spatial studies. Cancer cases were ascertained from a registry and cancer types were studied separately. Point-based data from a region were used, avoiding aggregation within arbitrary political boundaries. Controls provided an estimate of the underlying, non-uniform population density. Our analyses controlled for many covariates not available using registry data alone. Residential history information allowed us to take latency into account, potentially quite important for diseases like cancer. However, there were only sufficient numbers of cases to perform a latency analysis for bladder cancer. Had there been a larger number of cases, the residential histories would have allowed for space-time analyses using GAMs. In a prior study, we successfully illustrated the ability to generate hypotheses for location and timing of exposures using breast cancer data [25]. Our results have a number of potential limitations. While areas of increased or decreased risk may theoretically be caused by non-uniform control selection, sampling of controls within the study area did not depend on geography. Use of residential history allows analysis of latency (when sample size is sufficient), but it produces multiple residences, a potential source of bias. Because residences were analyzed, an apparent cluster may be caused by a few diseased people moving within a small area. To examine the effect of multiple residences, our analyses were restricted to residences of longest duration. The spatial pattern of risk was similar for bladder and kidney cancers, with little difference in the location and magnitude of hot and cold spots, but the map of pancreatic cancer differed when we restricted to residence of longest duration. This suggests that the inclusion of multiple residences did not bias the bladder and kidney cancers analyses but there may be a bias in the pancreatic cancer analysis. Improved methods for analyzing data with multiple residences are needed; weighting by residence time has been suggested [26]. We computed global and pointwise p-values, but many epidemiologists prefer confidence intervals when evaluating the precision of point estimates [27]. It should be possible to compute variance bands (also known as confidence bands) for our maps [13]. We performed permutation tests that conditioned on the span size of the observed data which may be smaller than the span size for a permuted dataset under the null hypothesis of a flat map. This could possibly lead to a larger deviance statistic under the null hypothesis and the global permutation p-value would then be too large (conservative). The effect of conditioning on the original span size is a topic of future research. Pointwise tests were only conducted if the global deviance test indicated that the map was unlikely to be flat, but performing multiple testing at each location may result in an increase in the type I error rate. Although spatial analyses are useful for generating new hypotheses, the location of significant hot and cold spots should be considered exploratory. Conclusion Using generalized additive models and GIS, we generated maps of bladder, kidney, and pancreatic cancer risk. When available, population-based case-control studies provide extensive data on potential risk factors and residential histories that address many methodological criticisms of cluster studies. The results of the current analysis illustrate the usefulness of GAM methods in generating hypotheses for further investigation. We identified a significant hot spot of bladder cancer that coincides with groundwater plumes. Groundwater in the study area is currently being modeled to explore this possible association between drinking water contamination and cancer. Abbreviations AIC: Akaike's Information Criterion; GAM: generalized additive model; GIS: geographical information systems; MMR: Massachusetts Military Reservation; OR: odds ratio Competing interests The authors declare that they have no competing interests. Authors' contributions VV conducted the spatial analyses and drafted the manuscript. TW collaborated on all analytical and editorial decisions. JW provided statistical support and consulted on analytical and editorial issues. AA provided the data and assisted in epidemiologic analysis and editing. All authors read and approved the final manuscript. Acknowledgements This work was supported by grant number 5 P42 ES007381 from the National Institute of Environmental Health (NIEHS), NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIEHS, NIH. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||
Biostatistics. 2004 Apr; 5(2):177-91.
[Biostatistics. 2004]Environ Health. 2005 Jun 14; 4():11.
[Environ Health. 2005]Arch Environ Health. 1993 Sep-Oct; 48(5):284-92.
[Arch Environ Health. 1993]Environ Health Perspect. 1999 Apr; 107(4):265-71.
[Environ Health Perspect. 1999]Int J Health Geogr. 2006 Jun 9; 5():26.
[Int J Health Geogr. 2006]Biostatistics. 2004 Apr; 5(2):177-91.
[Biostatistics. 2004]Arch Environ Health. 1993 Sep-Oct; 48(5):284-92.
[Arch Environ Health. 1993]Environ Health Perspect. 1999 Apr; 107(4):265-71.
[Environ Health Perspect. 1999]Am J Public Health. 1996 Sep; 86(9):1289-96.
[Am J Public Health. 1996]Biostatistics. 2004 Apr; 5(2):177-91.
[Biostatistics. 2004]Environ Health. 2005 Jun 14; 4():11.
[Environ Health. 2005]Int J Health Geogr. 2008 Aug 13; 7():46.
[Int J Health Geogr. 2008]Soc Sci Med. 2000 Apr; 50(7-8):1121-37.
[Soc Sci Med. 2000]