
# BME Estimation of Residential Exposure to Ambient PM_{10} and Ozone at Multiple Time Scales

^{1}Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei, Taiwan;

^{2}Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA;

^{3}Department of Geography, San Diego State University, San Diego, California, USA;

^{4}Department of Environmental Health, University of California at Berkeley, Berkeley, California, USA

## Abstract

### Background

Long-term human exposure to ambient pollutants can be an important contributing or etiologic factor of many chronic diseases. Spatiotemporal estimation (mapping) of long-term exposure at residential areas based on field observations recorded in the U.S. Environmental Protection Agency’s Air Quality System often suffers from missing data issues because of the sparse monitoring network across space and the inconsistent recording periods at different monitors.

### Objective

We developed and compared two upscaling methods: UM1 (data aggregation followed by exposure estimation) and UM2 (exposure estimation followed by data aggregation) for the long-term PM_{10} (particulate matter with aerodynamic diameter ≤ 10 μm) and ozone exposure estimations and applied them in multiple time scales to estimate PM and ozone exposures for the residential areas of the Health Effects of Air Pollution on Lupus (HEAPL) study.

### Method

We used Bayesian maximum entropy (BME) analysis for the two upscaling methods. We performed spatiotemporal cross-validations at multiple time scales by UM1 and UM2 to assess the estimation accuracy across space and time.

### Results

Compared with the kriging method, the integration of soft information by the BME method can effectively increase the estimation accuracy for both pollutants. The spatiotemporal distributions of estimation errors from UM1 and UM2 were similar. The cross-validation results indicated that UM2 is generally better than UM1 in exposure estimations at multiple time scales in terms of predictive accuracy and lack of bias. For yearly PM_{10} estimations, both approaches have comparable performance, but the implementation of UM1 is associated with much lower computation burden.

### Conclusion

BME-based upscaling methods UM1 and UM2 can assimilate core and site-specific knowledge bases of different formats for long-term exposure estimation. This study shows that UM1 can perform reasonably well when the aggregation process does not alter the spatiotemporal structure of the original data set; otherwise, UM2 is preferable.

**Keywords:** Bayesian, BME, environment, exposure, spatiotemporal, stochastic

Many human exposure and epidemiologic studies have investigated associations between pollutant exposure and disease risk and their potential consequences (Aickin 2002; Chen et al. 2004). For studies on ambient air pollutants, because of the substantial cost and logistic constraints, personal exposure monitoring can be used only for a small number of study participants, thus resulting in low statistical power to detect small effects (Ozkaynak et al. 1996). Most air pollution epidemiologic investigations use individual health data sets at nationwide or regional scales to assess the subtle risks of pollution exposure. In these cases, ambient air-quality monitoring networks, such as the Air Quality System (AQS) operated by the U.S. Environmental Protection Agency (EPA), constitute important and useful environmental data sources concerning the acute and chronic effects of ambient pollutants (U.S. EPA 2002).

Although these environmental monitoring data sources provide useful information to estimate human exposure across space and time, environmental epidemiologists and exposure scientists still face several practical and methodologic challenges in analyzing and modeling the environmental data (Li et al. 2008; Mutshinda et al. 2008).

One challenge is the geographic coverage of the region of interest. Ideally, if the pollution-monitoring stations are located near the residences of the study participants, a participant’s exposure could be easily estimated from neighboring pollutant observations (Maxwell and Kastenberg 1999; Wu et al. 2004). Unfortunately, the AQS monitoring network is relatively scarce compared with the number and geographic distribution of participants considered in large epidemiologic studies, and the geographic locations with direct ambient observations are often at large distances from the places where the study participants reside. To address this issue while assessing individual-level exposures, the geocoding of the subjects’ residential addresses is usually combined with some form of interpolation of likely pollution levels between monitoring locations. Spatial interpolation techniques can be used to estimate large-region pollutant exposures, including deterministic inverse distance schemes (Michelozzi et al. 2002), Monte Carlo methods (Kentel and Aral 2005), and kriging techniques (Christakos and Thesing 1993; Liao et al. 2006; Rushton et al. 1996). Kriging techniques, in particular, have been applied with increasing frequency in large-scale epidemiologic studies, including long-term exposure assessment (Brauer et al. 2003; Hoek et al. 2001). However, because of their inherent constraints (estimator linearity, probabilistic normality, and limited interpretive features that cannot consider highly relevant qualitative knowledge), the mainstream kriging techniques do not always address successfully important human exposure issues, including the integration of composite space–time dependencies and the assimilation of soft (uncertain) information sources that are prevalent in most human exposure studies.

The second issue relates to the limited sampling frequency of environmental monitoring networks. For example, the current AQS monitoring database includes particulate matter (PM) data sampled at 1-, 2-, 3-, 6-, and 9-day cycles (note that most data are sampled at a 6-day cycle). As a result, even if the residences of the study participants are very close to ambient air monitoring stations, some considerable pollution events likely occur during times when the local monitoring stations are not operating. To address this issue, which is often a significant concern to time-series analyses and epidemiologic panel studies of acute health effects, a smoothing technique is often applied to estimate ambient air pollutant levels that were missing during the times of interest (Conceicao et al. 2001; Sagiv et al. 2005). Nevertheless, neither spatial nor temporal analyses have fully accounted for and taken advantage of the exposure variability generated in a composite space–time dependence domain. Remarkably, the temporal domain of AQS air pollution monitoring is considerably more extensive than its spatial domain. This suggests that, especially in studies where exposures at multiple time scales need to be estimated, extending purely spatial or purely temporal interpolation techniques in a composite space–time context would improve considerably the quality of the information used in exposure estimation (Wang et al. 2008). Not surprisingly, several case studies have explicitly demonstrated that ignoring space–time cross-effects can lead to larger errors in pollution estimation (Christakos and Serre 2000; Christakos and Vyas 1998; Christakos et al. 2001; Vyas and Christakos 1997).

The third major issue is how to aggregate data and estimate exposure at time scales that are relevant to the study outcome. To study acute effects, the exposure is often assessed at small time scales (e.g., hourly or daily) (Stallard and Whitehead 1995; Tamborini et al. 1990). In chronic disease studies, such as lung cancer or cardiovascular diseases, average exposures at large time scales (e.g., monthly or yearly) are often used to represent the cumulative long-term exposures (Katsouyanni and Pershagen 1997; Nyberg et al. 2000). A desirable feature in defining the exposure time scale is to align or reference the estimated exposure values to the timing of the study outcome, because such an approach allows epidemiologists to explore the temporal relationships between index exposure and event occurrence, while accounting for the presence of induction time or latency period. To achieve this goal in large time-scale pollution estimation, aggregation of exposure data at small time scales is needed because daily exposure information is not always available from the existing air monitoring networks. Then, one may first aggregate the environmental monitoring data at small time scales and then apply an interpolation technique to estimate exposure at the large time scale of interest. Alternatively, one may first interpolate the individual-level exposure (e.g., residential exposure) at small time scales, followed by the aggregation of all estimated exposure values from small time scales to derive exposure values at large time scales. Because the existing environmental data with aggregated yearly exposure from air monitoring network are only indexed to calendar years, both approaches offer the advantages of avoiding misalignment between estimated exposures at large time scales and the occurrence of study outcomes. They may also be appealing to researchers interested in differentiating the acute health effects from those related to long-term exposures. 
Remarkably, the relative performance of these two approaches in the upscaling of environmental exposure data used in health and epidemiologic studies has not been evaluated and compared.

In view of the above considerations, in this study we evaluated and compared the relative performance of two upscaling methods in the analysis and estimation of environmental exposure data at multiple time scales. We also compared the two approaches in the spatiotemporal estimation of long-term exposure to ambient air pollutants in the context of the HEAPL (Health Effects of Air Pollution on Lupus) study. In particular, we considered exposures to PM_{10} (PM with aerodynamic diameter ≤ 10 μm) and ozone ambient concentrations. We used the spatiotemporal Bayesian maximum entropy (BME) reasoning and quantitative techniques (Christakos et al. 2005), because they account for the aforementioned issues of individual-level exposure estimation in a mathematically rigorous and interpretively meaningful manner. Numerical implementation of BME in real-world applications is made possible by means of the publicly available SEKS-GUI (Spatiotemporal Epistematics Knowledge Synthesis Model-Graphic User Interface) computer software library (Yu et al. 2007; Kolovos et al. 2006). This software library (SEKS-GUI 2007) was used to analyze the extant AQS data sets in the present study and to derive PM_{10} and ozone exposure estimates across space–time.

## Methods

### Air pollution data processing

The residential locations of the HEAPL study participants are in the Carolinas (states of North and South Carolina), and the time period considered in this analysis is 1995–2002. We obtained PM_{10} and ozone observations for this time period and geographic locations from the AQS database. Each of the raw AQS data sets provides information about the spatial coordinates, collection time, sampling duration, sampling frequency, and data duplication indicators (U.S. EPA 2005).

The PM_{10} (micrograms per cubic meter) and ozone (parts per billion) databases in the study region contained nonuniform data formats and data collection times. A total of 87 PM_{10} monitoring stations were available during the specified study period (1995–2002). Among them, 75 stations generated observations in terms of 24-hr averages every 6 days, whereas the remaining stations recorded hourly; however, only 15 out of 75 daily and 6 out of 12 hourly monitoring stations were in constant operation during the entire study period. In contrast, all of the 77 ozone monitoring stations obtained hourly observations, but only 11 stations operated constantly throughout the study period. Figure 1 shows the spatial distribution of the monitoring stations for both pollutants (PM_{10} and ozone) and the geographic locations of the residences of the study participants.

### Residential data source

HEAPL used extant residential data collected from 620 participants in the Carolina Lupus Study (Cooper et al. 2002). We collected the residential data used in the present analyses from the baseline interview that took place from early 1997 to mid-1998 as well as the subsequent interview in 2001. Most participants lived in the eastern and central parts of the Carolinas, as shown in Figure 2. To obtain the coordinates (longitudes and latitudes), all study participants’ residential addresses during this period were geocoded by a specialist at the Cecil G. Sheps Center for Health Services Research at the University of North Carolina at Chapel Hill following the standard procedure (Bonner et al. 2003; Ward et al. 2005). The HEAPL study protocols have been approved by the Institutional Review Board of the University of North Carolina at Chapel Hill.

### BME analysis

The BME theory was introduced in geostatistics and space–time statistics by Christakos (2000). BME was later considered in a general epistematics context and applied in the solution of real-world problems in environmental health fields (Choi et al. 2003; Christakos 2009; Law et al. 2006; Savelieva et al. 2005; Serre et al. 2003). BME analysis can incorporate nonlinear exposure estimators and non-Gaussian probability laws, and it can integrate core knowledge (epidemiologic laws, scientific models, theoretical space–time dependence models, etc.) with multisourced, site-specific information at various scales (including aggregated variables and empirical relationships). Central elements of the BME method are described below.

A human exposure attribute (e.g., pollutant concentration) is represented as a spatiotemporal random field (RF) $X_{\boldsymbol{p}} = X_{\boldsymbol{s},t}$ (Christakos and Hristopulos 1998), where the vector $\boldsymbol{p} = (\boldsymbol{s}, t)$ denotes a spatiotemporal point ($\boldsymbol{s}$ is the geographic location and $t$ is the time). The RF model is viewed as the collection of all physically possible realizations of the exposure attribute we seek to represent mathematically. It offers a general and mathematically rigorous framework to investigate human exposure that enhances predictive capability in a composite space–time domain. The RF model is fully characterized by its probability density function (pdf) $f_{KB}$, which is defined as

$$P_{KB}\left[\chi_1 \le X_{\boldsymbol{p}_1} \le \chi_1 + d\chi_1, \ldots, \chi_k \le X_{\boldsymbol{p}_k} \le \chi_k + d\chi_k\right] = f_{KB}(\chi_1, \ldots, \chi_k)\, d\chi_1 \cdots d\chi_k, \tag{1}$$

where the subscript KB denotes the “knowledge base” used to construct the pdf.

We considered two major knowledge bases: the core (or general) knowledge base, denoted by G-KB, which includes physical and biological laws, primitive equations, scientific theories, and theoretical models of space–time dependence; and the specificatory (or site-specific) knowledge base, S-KB, which includes exact numerical values (hard data) across space–time, intervals (of possible values), and probability functions (e.g., the datum at the specified location has the form of a probability distribution). The total knowledge base is denoted by *K* = *G* ∪ *S*; that is, it includes both the core and the site-specific knowledge bases.

The fundamental BME equation is as follows (for technical details, see Christakos et al. 2005):

$$\int d\boldsymbol{\chi}\, (\boldsymbol{g} - \bar{\boldsymbol{g}})\, e^{\boldsymbol{\mu}^{T}\boldsymbol{g}} = 0, \qquad \int d\boldsymbol{\chi}\, \xi_S\, e^{\boldsymbol{\mu}^{T}\boldsymbol{g}} - A\, f_K(\boldsymbol{p}) = 0, \tag{2}$$

where $\boldsymbol{g}$ is a vector of $g_\alpha$ functions (α = 1, 2, …) that represents stochastically the G-KB under consideration (the bar denotes statistical expectation), $\boldsymbol{\mu}$ is a vector of $\mu_\alpha$-coefficients that depends on the space–time coordinates and is associated with $\boldsymbol{g}$ (i.e., $\mu_\alpha$ expresses the relative significance of each $g_\alpha$ function in the composite solution sought), $\xi_S$ represents the S-KB available, $A$ is a normalization parameter, and $f_K$ is the pollutant or exposure pdf at each space–time point (the subscript $K$ means that $f_K$ is based on the total knowledge base that is the blending of the core and site-specific knowledge bases). The vectors $\boldsymbol{g}$ and $\xi_S$ are inputs in Equation 2, whereas the unknowns are $\boldsymbol{\mu}$ and $f_K$ across space–time.

The G-KB refers to the entire $\boldsymbol{p}$ domain of interest, which consists of the space–time point vector $\boldsymbol{p}_k$ where exposure estimates are sought and the point vector $\boldsymbol{p}_{\text{data}}$ where site-specific information is available. The G-KB may include theoretical space–time dependence models (mean, covariance, variogram, generalized covariance, multiple-point statistics, and continuity orders) of the exposure attribute represented by the RF $X_{\boldsymbol{p}}$. Most commonly, however, only the mean and the covariance (or variogram) are used in geostatistics studies of human exposure. In addition, the exposure variables of interest are often log-normally distributed. However, there are serious concerns about the biased estimation of the arithmetic mean on the basis of the log-normal assumption (Parkhurst 1998). In our study, we applied the normal score transformation (Deutsch and Journel 1998) to all PM_{10} and ozone data sets, thus relaxing the log-normal assumption and ensuring that the transformed data set is normally distributed.
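The normal score transformation mentioned above can be illustrated with a minimal rank-based sketch. This is a simplified version written for illustration, not the routine used in the study: the function name, the `(rank - 0.5) / n` plotting position, and the sample concentrations are all assumptions.

```python
from statistics import NormalDist


def normal_score_transform(values):
    """Map each value to the standard-normal quantile of its rank-based
    cumulative probability (tied values share their average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # (rank - 0.5) / n keeps every probability strictly inside (0, 1).
    return [std_normal.inv_cdf((r - 0.5) / n) for r in ranks]


pm10 = [12.0, 35.5, 18.2, 90.1, 22.4]  # made-up concentrations
scores = normal_score_transform(pm10)
```

The transformed scores preserve the ordering of the raw values while following a standard normal distribution by construction, which is why the skewness of the raw concentrations no longer matters at the estimation stage. Estimates would then be back-transformed through the inverse mapping, which is not shown here.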

For practical purposes, the data point vector $\boldsymbol{p}_{\text{data}}$ consists of the hard data point vector $\boldsymbol{p}_{\text{hard}}$ (where exact measurements are available) and the soft data point vector $\boldsymbol{p}_{\text{soft}}$ (where qualitative/incomplete yet valuable information may be available). For illustration, assume that 32 exact PM_{10} observations are available at the space–time points $\boldsymbol{p}_{\text{hard}} = (\boldsymbol{p}_1, \ldots, \boldsymbol{p}_{32})$, that is, $X_{\boldsymbol{p}_1} = 5.1, \ldots, X_{\boldsymbol{p}_{32}} = 9.3$ (in suitable units); and that 55 uncertain PM_{10} data are available at the points $\boldsymbol{p}_{\text{soft}} = (\boldsymbol{p}_{33}, \ldots, \boldsymbol{p}_{87})$, say, of the interval form $3.2 < X_{\boldsymbol{p}_{33}} < 4.1, \ldots, 5.2 < X_{\boldsymbol{p}_{87}} < 6.4$ (in suitable units). This sort of site-specific information is mathematically expressed by $P_S[X_{\boldsymbol{p}_1} = 5.1, \ldots, X_{\boldsymbol{p}_{32}} = 9.3] = 1$ and $P_S[3.2 < X_{\boldsymbol{p}_{33}} < 4.1, \ldots, 5.2 < X_{\boldsymbol{p}_{87}} < 6.4] = 1$, respectively. More generally, assume that at point $\boldsymbol{p}_{24}$ the uncertain datum is expressed by the density function $f_S(\boldsymbol{p}_{24})$; then, $P_S[X_{\boldsymbol{p}_{24}} < \chi] = \int_{-\infty}^{\chi} d\chi\, f_S(\boldsymbol{p}_{24})$. For several other examples, see Yu et al. (2007).

By incorporating the total K-KB into exposure analysis, the derived pdf $f_K$ in Equation 2 describes the distribution of exposure values at each estimation point $\boldsymbol{p}_k$. Given the $f_K$ at $\boldsymbol{p}_k$, different exposure estimates (most probable, error minimizing, etc., estimates) can be calculated at each spatiotemporal node of the appropriate mapping grid, depending on the objectives of the study. As mentioned above, in this work the BME method is implemented by means of the publicly available SEKS-GUI software library (Kolovos et al. 2006; Yu et al. 2007).

### Multiple time-scale exposure

In the context of HEAPL, we considered air pollution exposure at multiple long time scales (including weekly, monthly, trimonthly, six-monthly, and yearly averages). As described above, the available data sets, which contain either hourly observations or combined daily and hourly observations, are regarded as a realization of the spatiotemporal RF $X_{\boldsymbol{p}}$ representing the ambient pollutant, and the space–time dependence of the pollutant is characterized by the joint pdf (Equation 1) of $X_{\boldsymbol{p}}$. To estimate long-term mean exposure, the available short-time-scale data (hourly and daily) should be upscaled to the larger time scale (monthly, yearly, etc.). Spatiotemporal characteristics at short time scales can be also upscaled to represent long-term exposure characteristics that will be incorporated into the BME framework, as discussed further below. Spatial and/or temporal upscaling has been discussed in several environmental health studies (Choi et al. 2003; Christakos and Hristopulos 1998; Gotway and Young 2002).

In the present study, to estimate air pollution exposures at large time scales, we examined two different upscaling methods: daily data aggregation followed by BME estimation at longer time scales (UM1) and daily BME estimation followed by aggregation at longer time scales (UM2).
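The difference in the order of operations between UM1 and UM2 can be made concrete with a toy sketch. Inverse-distance weighting is used here purely as a stand-in interpolator; it is not BME, and the function names, station layout, and values are hypothetical.

```python
def idw(target, stations, values, power=2.0):
    """Inverse-distance-weighted estimate at `target` from the stations
    that reported a value (None marks a missing observation)."""
    num = den = 0.0
    for (x, y), v in zip(stations, values):
        if v is None:
            continue
        d2 = (x - target[0]) ** 2 + (y - target[1]) ** 2
        if d2 == 0.0:
            return v  # target coincides with a reporting station
        w = d2 ** (-power / 2.0)
        num += w * v
        den += w
    return num / den  # ZeroDivisionError if no station reported


def um1(target, stations, daily):
    """Aggregate first: average each station over its observed days,
    then interpolate the station averages."""
    means = []
    for s in range(len(stations)):
        obs = [day[s] for day in daily if day[s] is not None]
        means.append(sum(obs) / len(obs) if obs else None)
    return idw(target, stations, means)


def um2(target, stations, daily):
    """Estimate first: interpolate every day separately, then average
    the daily estimates."""
    estimates = [idw(target, stations, day) for day in daily]
    return sum(estimates) / len(estimates)


stations = [(0.0, 0.0), (2.0, 0.0)]
home = (1.0, 0.0)
complete = [[10.0, 20.0], [30.0, 40.0]]  # rows are days, columns are stations
gappy = [[10.0, 20.0], [30.0, None]]     # second station missed day 2
```

With complete records and a linear interpolator, the two orders give the same answer (25.0 for `complete`). With a gap they diverge (20.0 for UM1 versus 22.5 for UM2 on `gappy`), because UM1 averages away the information about which day was missed. This is only an intuition for why the two methods can differ when monitoring records are incomplete.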

#### G-KB

To obtain long-term exposure estimates at the area of interest in terms of the UM1, we first upscaled the data available from the short time scale of observation, $(\boldsymbol{s}, t)$, to the long-time-scale domain, $(\boldsymbol{s}, T)$, $T > t$; we then generated estimates of the upscaled pollutant exposure. Consider the pollutant RF $X_{\boldsymbol{p}} = X_{\boldsymbol{s},t}$ with covariance $c_X(\boldsymbol{p}_i, \boldsymbol{p}_j) = c_X(\boldsymbol{s}_i, t_i; \boldsymbol{s}_j, t_j)$ at the $(\boldsymbol{s}, t)$ scale. The temporally upscaled RF and the corresponding covariance at the $(\boldsymbol{s}, T)$ scale are expressed by, respectively,

$$X_{\boldsymbol{s},T} = \frac{1}{|T|} \int_{T} X_{\boldsymbol{s},t}\, dt \tag{3}$$

and

$$c_X(\boldsymbol{s}_i, T_i; \boldsymbol{s}_j, T_j) = \frac{1}{|T_i|\,|T_j|} \int_{T_i} \int_{T_j} c_X(\boldsymbol{s}_i, t_i; \boldsymbol{s}_j, t_j)\, dt_j\, dt_i, \tag{4}$$

where *T* denotes the time intervals of the upscaled domain within which the original, short-time-scale RF is averaged. Equations 3 and 4 belong to the G-KB of the pollutant. The change of covariance function under a change of support as shown above in spatial analysis is also known as regularization theory (Journel and Huijbregts 1978).
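A discrete analogue of this regularization can be sketched by averaging a daily covariance model over all pairs of days in two averaging blocks. The separable exponential model and its parameters (`sill`, `a_s`, `a_t`) are hypothetical placeholders, not the covariance fitted in the study.

```python
import math


def cov_daily(h, tau, sill=1.0, a_s=50.0, a_t=10.0):
    """Hypothetical separable exponential covariance for spatial lag h
    and temporal lag tau (days)."""
    return sill * math.exp(-3.0 * h / a_s) * math.exp(-3.0 * abs(tau) / a_t)


def cov_upscaled(h, start_i, start_j, block_len):
    """Mean of the daily covariance over all day pairs drawn from two
    averaging blocks of `block_len` days starting at start_i and start_j."""
    total = 0.0
    for ti in range(start_i, start_i + block_len):
        for tj in range(start_j, start_j + block_len):
            total += cov_daily(h, ti - tj)
    return total / (block_len * block_len)


# Variance of a 30-day average at one location versus the daily variance.
monthly_variance = cov_upscaled(0.0, 0, 0, 30)
daily_variance = cov_daily(0.0, 0)
```

Under this toy model the variance of the monthly average is smaller than the daily variance, which is the usual effect of a change of support. With `block_len=1` the function reduces to the daily covariance.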

To obtain long-term exposure estimates at the $(\boldsymbol{s}, T)$ region of interest in terms of the UM2, we first use the BME technique to generate exposure estimates for all locations of interest at the small time scale $(\boldsymbol{s}, t)$, and then obtain the upscaled estimates from the aggregation of the short-time-scale estimates. In the UM2 context, the G-KB consists of the mean trend and covariance functions at the short-term time scale $(\boldsymbol{s}, t)$.

#### S-KB

Daily or hourly observations were aggregated into the multiple-time-scale exposure knowledge base. This upscaled uncertain knowledge base of pollutant concentration is represented in terms of a complete probability distribution rather than a single value. As mentioned above, the sampling frequency generally varies among the monitoring stations. In the raw AQS data set used in this study, both daily and hourly PM_{10} observations were available, whereas hourly data were primarily used in the case of ozone. According to the AQS ambient pollutant manual (U.S. EPA 2004), daily observations can be estimated in terms of the arithmetic mean of hourly observations only if at least 18 hourly observations are available (i.e., ≥ 75% of intended samples); otherwise, we treated them as missing data. It is not always easy to ensure that the long-term exposure information satisfies this 75% criterion. In fact, the total number of observation days is often less than half the long-term period of interest. Instead of ignoring the scarce observations, as done by the previous methods, in the present study we considered two different avenues toward quantification of the uncertainty of the long-term exposure estimates: (*a*) for the 25–75% sampling period, data pdfs of various shapes were constructed on the basis of the observation histograms; and (*b*) for the < 25% sampling period, uniform distributions were generated on the basis of the arithmetic mean. The ranges of the upscaled exposure data were between 0.25 and 1.75 times the arithmetic mean. If daily and hourly observations coexisted at the same location, the observed daily value was assigned to each of the corresponding 24 hours. If daily and hourly observations were collocated, the daily information was considered to be hard data. In this way, BME was able to account for uncertain yet valuable exposure information.
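The completeness rules above can be sketched as a small classifier for one station over one averaging period. Several details are assumptions made for illustration: periods with at least 75% coverage are treated as hard data, the 0.25 to 1.75 range is applied to the uniform case, the histogram uses five equal-width bins, and all function names are hypothetical.

```python
def histogram_probs(obs, n_bins=5):
    """Equal-width histogram of the observations, returned as a list of
    (bin_low, bin_high, probability) tuples."""
    lo, hi = min(obs), max(obs)
    if hi == lo:
        return [(lo, hi, 1.0)]  # all observations identical
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in obs:
        k = min(int((v - lo) / width), n_bins - 1)  # put the max in the last bin
        counts[k] += 1
    return [(lo + k * width, lo + (k + 1) * width, c / len(obs))
            for k, c in enumerate(counts)]


def upscaled_datum(daily_values):
    """Classify one station-period by sampling completeness.
    `daily_values` has one entry per day; None marks an unsampled day."""
    obs = [v for v in daily_values if v is not None]
    if not obs:
        return ("missing", None)
    coverage = len(obs) / len(daily_values)
    mean = sum(obs) / len(obs)
    if coverage >= 0.75:
        return ("hard", mean)                       # assumed: exact value
    if coverage >= 0.25:
        return ("soft_histogram", histogram_probs(obs))
    return ("soft_uniform", (0.25 * mean, 1.75 * mean))


# A monitor on a 6-day cycle samples about 5 of 30 days (roughly 17%).
six_day_cycle = [20.0 if d % 6 == 0 else None for d in range(30)]
kind, payload = upscaled_datum(six_day_cycle)
```

Under these assumptions, a monitor that samples every sixth day falls into the uniform case for a monthly average, so its information enters as an interval rather than being discarded.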

There were 87 (PM_{10}) and 77 (ozone) monitoring stations, but the full network of pollution monitors was never in operation at the same time during the 2,922 days of the study period. In fact, the mean (median) number of stations operating on any given day was 15 (8) for PM_{10} and 41 (55) for ozone. The maximum (minimum) number of stations per day was 66 (3) for PM_{10} and 69 (7) for ozone. Moreover, most of the PM_{10} stations obtained observations with a 6-day frequency.

### Spatiotemporal exposure estimation and cross-validation

Daily estimation is the smallest temporal estimation unit in this study. The performance of the BME method in daily PM_{10} and ozone exposure estimation was assessed by cross-validation, using all AQS data available during the study period. Cross-validation allows assessment of the estimation accuracy in different space–time domains and can avoid the potentially biased interpretation of the estimation results induced by purely spatial correlations or purely temporal trends. Therefore, we randomly selected approximately 1,000 observations across space–time to be the estimation points for cross-validation purposes. This selection is based on the objective of achieving a balance between three factors: the desirable size of spatiotemporal clusters, the number of clusters (968 for PM_{10} and 996 for ozone), and the need to reduce the computation burden of the cross-validation of BME estimates at both the daily and the large time scales. The differences between real observations and BME estimates within each randomly selected spatiotemporal cluster were pooled and assessed across all monitors. For comparison, simple kriging with the same spatiotemporal structure as the BME method (i.e., the same mean trend and covariance) was also applied in the daily-scale cross-validation.

We also applied the cross-validation of large-scale exposure estimation to assess and compare the predictive accuracy of the two upscaling methods, UM1 and UM2. In UM1, the exposure data were first transformed to the time scale of interest, and then the BME technique was applied to the upscaled data, which can be hard or soft, as discussed above, to generate upscaled exposure estimates. In UM2, on the other hand, the daily exposure G-KB and S-KB were processed as discussed above, daily estimates were generated by the BME technique, and the daily estimates were then upscaled to the time scale of interest.

In order to produce the long-term exposure estimates, the daily estimates were aggregated as follows:

$$\hat{X}_{\boldsymbol{s},T} = \frac{1}{n} \sum_{i=1}^{n} \hat{X}_{\boldsymbol{s},t_i} \tag{5}$$

and

$$\sigma_X^2(\boldsymbol{s}, T) = \frac{1}{n^2} \left[ \sum_{i=1}^{n} \sigma_X^2(\boldsymbol{s}, t_i) + \sum_{i=1}^{n} \sum_{j \ne i} c_X(\boldsymbol{s}, t_i; \boldsymbol{s}, t_j) \right], \tag{6}$$

where $n$ is the number of daily estimates within the period $T$, $\sigma_X^2(\boldsymbol{s}, T)$ and $\sigma_X^2(\boldsymbol{s}, t_i)$ are the variances of $X_{\boldsymbol{s},T}$ and $X_{\boldsymbol{s},t_i}$, respectively, and $c_X(\boldsymbol{s}, t_i; \boldsymbol{s}, t_j)$ is the covariance between $(\boldsymbol{s}, t_i)$ and $(\boldsymbol{s}, t_j)$. Note that in this study the choice of the exposure estimation period ($T$) is different from that in many epidemiologic studies that followed the calendar temporal units. Instead, we define the exposure period in this study as the period that starts at the time of the epidemiologic survey of the participants and retrospectively defines a specified period of interest, making the exposure time window temporally aligned with the timing of collecting health data during the survey.
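The aggregation of daily estimates into a long-term estimate, together with its variance, can be sketched as follows. The sketch assumes that the long-term estimate is the arithmetic mean of the daily estimates and that its variance follows the standard variance-of-a-mean identity with temporal covariances included. The function name and the numbers are illustrative only.

```python
def aggregate_estimates(daily_estimates, cov):
    """Return the mean of n daily estimates and the variance of that mean.
    cov[i][j] is the covariance between days i and j at one location;
    cov[i][i] is the estimation variance of day i."""
    n = len(daily_estimates)
    mean = sum(daily_estimates) / n
    # Summing the full matrix covers the n variances and all cross terms.
    var = sum(cov[i][j] for i in range(n) for j in range(n)) / (n * n)
    return mean, var


estimates = [10.0, 12.0, 14.0, 16.0]
independent = [[4.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
fully_correlated = [[4.0] * 4 for _ in range(4)]

mean_ind, var_ind = aggregate_estimates(estimates, independent)
mean_cor, var_cor = aggregate_estimates(estimates, fully_correlated)
```

The two covariance matrices bracket the possible outcomes: with uncorrelated days the variance of the average shrinks by a factor of n, whereas with perfectly correlated days averaging brings no reduction. Real pollutant series fall between these extremes, which is why the temporal covariance terms cannot be dropped.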

In the case of multiple-time-scale exposure, we also conducted two additional cross-validation exercises (one for UM1 and one for UM2) to compare the relative performance of the two upscaling methods at large time scales. The idea of cross-validation is to assess estimation accuracy by comparing the exposure estimates with true exposure observations. However, the latter are not directly available at long time scales. To overcome this difficulty, statistical hypothesis tests were implemented to detect if the generated soft exposure data are significantly close to the BME exposure estimates. The “distance” between the pdfs of soft data and the BME estimates was assessed in terms of the relative entropy measure:

$$D(p \,\|\, q) = \sum_{k} p_k \ln \frac{p_k}{q_k}, \tag{7}$$

where $p_k$ and $q_k$ represent the pdfs of the exposure observations and the BME estimates, respectively. The goodness-of-fit test is usually applied to verify if the two pdfs come from the same random variable. Chi-square distribution with $n - 1$ degrees of freedom can be used in the relative entropy measure tests (Bedford and Cooke 2001). The significance criterion for the tests was set as 95%. Cross-validation for the UM1 and UM2 methods at long time scales was performed at the same temporally-referenced points as in the case for the cross-validation of daily BME estimation.
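The relative entropy comparison can be sketched for two distributions discretized on the same bins. The test statistic used here, twice the sample size times the relative entropy, is the standard likelihood-ratio form and is an assumption about how the test was carried out; the sample size, bin probabilities, and function names are likewise hypothetical.

```python
import math


def relative_entropy(p, q):
    """Sum of p_k * ln(p_k / q_k) over bins; terms with p_k = 0 contribute 0."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same bins")
    total = 0.0
    for pk, qk in zip(p, q):
        if pk > 0.0:
            if qk <= 0.0:
                return math.inf  # p puts mass where q has none
            total += pk * math.log(pk / qk)
    return total


def passes_entropy_test(p, q, sample_size, critical_value):
    """Accept the null hypothesis (same distribution) when the statistic
    2 * N * D(p || q) does not exceed the chi-square critical value."""
    return 2.0 * sample_size * relative_entropy(p, q) <= critical_value


CHI2_95_DF4 = 9.488  # 95th percentile of chi-square with 5 - 1 = 4 degrees of freedom

soft_datum = [0.2, 0.2, 0.2, 0.2, 0.2]
close_estimate = [0.2, 0.2, 0.2, 0.2, 0.2]
far_estimate = [0.6, 0.1, 0.1, 0.1, 0.1]

same = passes_entropy_test(soft_datum, close_estimate, 30, CHI2_95_DF4)
different = passes_entropy_test(soft_datum, far_estimate, 30, CHI2_95_DF4)
```

In this toy case the identical pair passes and the strongly shifted pair is rejected. The share of cross-validation points that pass such a test is the kind of percentage summarized for each time scale.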

Finally, we applied both UM1 and UM2 to estimate PM_{10} and ozone exposures at multiple time scales for all the residential locations of the HEAPL study. The correlation coefficients for each BME estimate at different time scales were computed for the UM1 and UM2 methods and compared accordingly. We also examined the distribution of the differences between the UM1 and UM2 estimates at different time scales.

## Numerical Results and Plots

Table 1 presents the cross-validation results for the daily PM_{10} and ozone data by BME and kriging methods. The exposure estimation error at each test point is defined as error = estimate − observation. In general, both the error mean and median are close to zero, suggesting that the error distribution is approximately symmetric around zero. Tables 2 and 3 compare the average exposures at multiple time scales from real observations with the BME estimates obtained by UM1 and UM2. Table 2 summarizes simple statistics of the estimation errors given by UM1 and UM2 for both PM_{10} and ozone, and Table 3 gives the corresponding comparison of relative entropy at each indicated time scale, showing the percentage of the spatiotemporal estimates that passed the chi-square tests of the null hypothesis that the two pdfs (data and estimates) are the same.
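The error definition above translates directly into a short summary routine. The function name and the sample values are made up for illustration and do not reproduce any entry of the tables.

```python
from statistics import mean, median


def error_summary(estimates, observations):
    """Summarize cross-validation errors, with error = estimate - observation."""
    errors = [e - o for e, o in zip(estimates, observations)]
    return {"mean": mean(errors), "median": median(errors)}


estimates = [21.0, 18.5, 30.0, 25.5]      # hypothetical estimates
observations = [20.0, 19.5, 28.0, 27.5]   # hypothetical held-out observations
summary = error_summary(estimates, observations)
```

A mean and median that are both near zero, as in this toy case, indicate little systematic over- or underestimation. They say nothing about the spread of the errors, which has to be judged separately.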

Figures 3–6 show the spatial and temporal distributions of the average estimation errors of the yearly exposure estimates obtained by UM1 and UM2. Figures 3 and 4 show the PM_{10} estimation performance by means of UM1 and UM2, respectively. Similarly, Figures 5 and 6 show the average error distributions of ozone estimation obtained by UM1 and UM2, respectively.

Table 4 presents the summary statistics for the calculated differences in UM1–UM2 that were tabulated, respectively, for PM_{10} and ozone exposure at each indicated time scale. Figure 7 shows the histograms of these differences for both methods. Table 5 shows the correlation coefficients between the PM_{10} and ozone exposure estimates obtained by UM1 and UM2 within each temporal scale at the study residences.

[Figure caption, beginning lost in extraction: … PM_{10} (*A*) and ozone (*B*) at multiple time scales.]

[Caption, beginning lost in extraction: … PM_{10} and ozone given for all residential locations.]

## Discussion

Scale laws and scaling behaviors at multiple time scales are encountered in many human exposure scenarios, although very often such laws are found empirically, because of the lack of theories that would allow them to be derived from fundamental principles (Christakos and Hristopulos 1998). In the case of chronic diseases, the arithmetic mean of long-term (large time scale) participant exposure rather than the on-site exposure is often considered as the appropriate indicator (Ackermann-Liebrich et al. 1997; Pope et al. 2002). For regulatory purposes, the National Ambient Air Quality Standards (NAAQS) proposed by U.S. EPA are also based on the arithmetic mean exposure at different time scales, which range from hourly to annual exposure (U.S. EPA 2006). Many studies have focused on long-term arithmetic mean exposure estimates based on small time scale (short-term) observations, assuming a lognormal RF to model exposure distributions (Clayton et al. 1999; Wallace and Williams 2005). In general, these studies do not consider important spatiotemporal dependencies between short-term observations and cross-dependencies between short- and long-term exposures.

In this article, we present two upscaling methods and compare them for the estimation of arithmetic average exposures at different temporal scales. As described in the introductory remarks, previous data analyses often did not consider the uncertainty of the exposure analysis (e.g., by relying on purely spatial or purely temporal analysis or on linear assumptions). For the upscaling problem considered here, this uncertainty may be a significant factor in many human exposure situations; for example, in the case of PM_{10} data with a distinct trend and a large number of missing values (because most monitors record only every 6 days), the estimation of the long-term exposure averages can be seriously biased.
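The effect of missing data on a long-term arithmetic mean can be illustrated with a small synthetic example. The sketch below is not the study's data or method: the series, its trend and seasonal amplitude, and the two sampling schedules are all hypothetical values chosen only to show how an incomplete recording period, combined with a trend, shifts the naive average of the available observations away from the true long-term mean.

```python
import math

def daily_series(n_days=360):
    """Synthetic daily concentrations (hypothetical): linear trend plus seasonal cycle."""
    return [20.0 + 0.05 * t + 5.0 * math.sin(2 * math.pi * t / 360)
            for t in range(n_days)]

def mean(values):
    return sum(values) / len(values)

series = daily_series()
true_mean = mean(series)

# Case 1: a monitor that records every 6th day over the whole year.
every_sixth = series[::6]

# Case 2: the same 1-in-6-day schedule, but the monitor operated only
# during the first half of the year (an inconsistent recording period).
first_half_only = series[:180:6]

bias_regular = mean(every_sixth) - true_mean
bias_partial = mean(first_half_only) - true_mean

print(f"true yearly mean:            {true_mean:.2f}")
print(f"bias, 1-in-6 all year:       {bias_regular:+.2f}")
print(f"bias, 1-in-6 first half only: {bias_partial:+.2f}")
```

In this toy setting the regular 1-in-6-day schedule stays close to the true mean, whereas the truncated record is visibly biased because it samples only part of the trend and of the seasonal cycle.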

As mentioned above, the AQS manual suggests that when a large proportion of the data is missing, the accuracy of the upscaled exposure is in doubt, in which case the rest of the observed information should be ignored. Accordingly, mainstream statistical and geostatistical techniques usually treat incomplete information (qualitative knowledge, uncertain secondary records, etc.) as missing data to avoid potentially misleading estimation results. The BME method used in this study, on the other hand, has the significant feature that it can rigorously incorporate uncertain information of various kinds and at different scales with a minimum number of theoretical assumptions. In other words, the BME method can always express incomplete information in terms of soft site-specific data that can take the form, for example, of probability functions with arbitrary shapes. In addition, BME can incorporate empirical relations and charts as well as core knowledge in the form of epidemiologic laws and scientific human exposure models, whenever available (Christakos et al. 2005). Because of the abundance of missing data, uncertain (soft) information is available for both PM_{10} and ozone BME predictions at all time scales considered in this study. Table 1 provides the cross-validation results of daily PM_{10} and ozone estimations by the BME and kriging methods and shows that the BME estimation errors are more tightly concentrated and more symmetric around zero. The improvement in estimation accuracy gained by integrating soft data in the BME method becomes more pronounced as the amount of missing data increases, as in the case of PM_{10}.

Concerning the accuracy of the two upscaling approaches, the cross-validation results (Tables 2 and 3) show that UM2 is generally better than UM1 in terms of smaller mean and median errors and higher success rates in passing the chi-square tests of uncertain information. Table 2 shows that the standard deviation of the differences between observations and estimates decreases as the estimation time scale increases (for both PM_{10} and ozone). This is because the aggregated hard and soft data (which emerge as the time scale increases) can reduce the estimation uncertainty and provide more informative exposure estimates. In the case of the PM_{10} data set, for example, about 5,000 more spatiotemporal data are compiled in the yearly database than in the weekly database during the study period of interest. The UM1 and UM2 methods generally underestimate the real PM_{10} levels. The preferential sampling of high PM_{10} values can partially contribute to the biased estimations. In addition, some extremely high values in the PM_{10} data set can bias the estimations during the normal score transform.
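One way to see why extreme values interact poorly with a normal score transform is that a rank-based transform depends only on the ordering of the data, not on their magnitudes. The sketch below uses a generic rank-based transform with plotting positions (rank − 0.5)/n; this specific form, the function name `normal_scores`, and the sample values are assumptions made for illustration and may differ from the transform actually used in the study.

```python
from statistics import NormalDist

def normal_scores(values):
    """Rank-based normal score transform (illustrative form).

    Each value is mapped to the standard-normal quantile of its plotting
    position (rank - 0.5) / n. Ties are not handled in this sketch.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return scores

# Two hypothetical samples that differ only in their largest value.
moderate = [12.0, 18.0, 25.0, 31.0, 40.0, 55.0, 90.0]
extreme = [12.0, 18.0, 25.0, 31.0, 40.0, 55.0, 400.0]

scores_moderate = normal_scores(moderate)
scores_extreme = normal_scores(extreme)

# Both samples receive identical scores: the size of the extreme value
# is invisible in normal-score space.
print(scores_moderate == scores_extreme)
print(round(max(scores_extreme), 3))
```

Because the magnitude of the largest observation is carried only by the back-transform, estimates formed in normal-score space can fall short of such extremes once they are mapped back to concentrations.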

Geostatistical techniques generate estimates in terms of spatial and temporal interpolation schemes, which rely on linearity and normality assumptions and tend to generate rather smooth PM_{10} estimates. On the other hand, the UM1 and UM2 use the BME approach that does not make any linearity or normality assumption (nonlinear estimators and non-Gaussian distributions are automatically incorporated) and can rigorously process uncertain yet valuable data sources (e.g., soft data of various forms), thus providing more informative estimates than the geostatistical techniques.

In the case of highly uncertain data, some extremely high observations may not be completely reproduced. Even though both upscaling methods underestimate the actual PM_{10} exposures, UM2 performs better than UM1, yielding lower estimation errors. In the case of ozone, the performance of UM1 is significantly different from that of UM2. UM1 tends to overestimate the long-term exposure level, and the situation worsens as the estimation scale becomes larger. Remarkably, the UM2 exposure estimates are not biased, whereas the biased UM1 estimation is likely due to the aggregation of the ozone data set. Because of the seasonal ozone pattern, the distribution of daily ozone data during the study period is positively skewed, ranging from 0 to 70 ppb. However, when temporal aggregation was applied, the mean of the upscaled data generally rose to the annual mean level at each spatial location, which may distort the original spatiotemporal ozone pattern at the smaller time scales. As shown in Figure 8, the distribution of the mean of the aggregated ozone data varies significantly with the degree of upscaling, which is not the case for the PM_{10} estimation. Moreover, UM2 does not depend on any distorted upscaled data, so more accurate results are obtained. Despite the significant changes in data structure during aggregation, the rigorous consideration of data uncertainty by BME alleviates such effects to produce better quality estimates (Table 2). Table 3 shows that the estimates are generally better for ozone than for PM_{10}. This is because most PM_{10} monitors performed air sampling every 6 days, in which case the resulting upscaled long-term exposure is less informative of the exposure situation, especially at the short time scales. Therefore, the shorter the upscaling period considered (e.g., weekly), the less informative the uncertain data against which the estimates are compared.
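The way temporal aggregation pulls a seasonal series toward its annual mean can be sketched with synthetic data. The series below is a hypothetical sinusoid in ppb, not the study's ozone record (it is symmetric rather than positively skewed), and `block_means` simply averages consecutive non-overlapping windows; both are assumptions made for illustration.

```python
import math

def daily_ozone(n_days=360):
    """Synthetic seasonal series in ppb (hypothetical, purely illustrative)."""
    return [30.0 + 25.0 * math.sin(2 * math.pi * (t - 90) / 360)
            for t in range(n_days)]

def block_means(series, window):
    """Arithmetic means over consecutive, non-overlapping windows."""
    return [sum(series[i:i + window]) / window
            for i in range(0, len(series) - window + 1, window)]

def spread(values):
    """Range (max - min) of a list of values."""
    return max(values) - min(values)

series = daily_ozone()
windows = (1, 30, 90, 360)  # daily, roughly monthly, seasonal, yearly
spreads = {w: spread(block_means(series, w)) for w in windows}

for w in windows:
    means = block_means(series, w)
    print(f"window={w:3d} days  min={min(means):5.1f}  "
          f"max={max(means):5.1f}  spread={spreads[w]:5.1f}")
```

As the window grows, the lowest aggregated values rise and the highest fall until only the annual mean remains, so the aggregated data no longer carry the seasonal structure present at the daily scale.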

Figures 3–6 plot the spatial and temporal distributions of the UM1 and UM2 results. In the PM_{10} case, the spatial and temporal patterns of the error distributions obtained by the UM1 and UM2 methods are very similar. These plots offer a better understanding of the performance of the proposed approach in space–time. The conclusion drawn from Table 2 concerning long-term PM_{10} underestimation is also illustrated by the temporal error distributions plotted in Figures 3–6. In the case of ozone estimation, these figures also support the conclusion drawn from Table 2 (i.e., UM1 tends to overestimate the long-term ozone levels). It is noteworthy that the spatial locations where the estimates exhibit higher discrepancies from the data values (for both PM_{10} and ozone) are mostly close either to the boundary between regions of considerable data availability and data scarcity or to metropolitan areas, where high variability of PM pollutants and ozone generated from traffic or local industrial emissions may be present.

The mean and median of the differences between the UM1 and UM2 estimates at the HEAPL residential locations are, at most time scales, close to each other and do not depart much from zero for both pollutants (Table 4), except in the case of long-term ozone estimation. In that case the estimates obtained by UM1 are biased, so UM1 generates higher ozone levels than UM2, which can be seen more clearly in the histograms at the bottom of Figure 7. In general, the UM1 and UM2 estimates should get closer to each other as the time scale increases, provided that the aggregated data are unbiased. As the time scale increases, the number of daily values increases for both upscaling methods (i.e., more data become available for aggregation purposes in the case of UM1, whereas more estimates are generated for integration purposes in the case of UM2). As a consequence, based on the central limit theorem, the exposure mean is optimally calculated at the longer time scale by both upscaling methods (Figure 8), as shown in the case of PM_{10} estimation. However, the exposure estimation accuracy may also decrease if the data uncertainty resulting from a large proportion of missing data or from biased aggregated data is large, which is the case for ozone estimation at long time scales. Thus, the mean and median of the differences between the UM1 and UM2 estimation results can increase slightly with time scale.

In this study, numerical analysis showed that UM2 generally performs slightly better than UM1 in terms of accuracy. UM2 can also be preferable in theory. Instead of aggregating the data and the spatiotemporal dependence at small scales, the BME analysis in UM2 incorporates G-KB and S-KB, including detailed local spatiotemporal associations and the original short-term observations. In UM1, on the other hand, both general and specific knowledge are upscaled, so the BME estimation uses more uncertain information. However, despite the better performance of UM2, in practice UM1 may sometimes be preferable because of its efficiency. The difference in computational burden between the two approaches increases substantially as the estimation time scale increases. As the exposure estimation at residential locations shows, UM1 can generate biased estimates in the case of ozone but not in the case of PM_{10}. This suggests a criterion for choosing between UM1 and UM2 in long-term exposure estimation. UM1 is preferable as long as the aggregation process does not change the original data structure, that is, the mean trend and the variance/covariance of the data. In such cases, the loss of information during the data aggregation in UM1 can be neglected compared with the increase in computation time required by UM2; otherwise, UM2 is preferable. In this study, because of the strong seasonal ozone trend, an aggregation period exceeding 3 months can distort the spatiotemporal data structure.
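The difference between the two orders of operation can be sketched with a deliberately simple stand-in estimator. The code below does not implement BME: it replaces the estimator with a fixed-weight average over the monitors that have data, and the functions `weighted_estimate`, `um1`, and `um2`, the weights, and the daily values are all hypothetical. It only shows that aggregating first and estimating first agree when the record is complete and the estimator is linear, and can diverge once observations are missing.

```python
def weighted_estimate(values_by_site, weights):
    """Stand-in estimator (NOT BME): weighted average over sites with data.

    A value of None marks a missing observation; weights are renormalized
    over the available sites. Assumes at least one site has data.
    """
    pairs = [(w, v) for w, v in zip(weights, values_by_site) if v is not None]
    total = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total

def um1(daily, weights):
    """Aggregate each site over time first, then estimate once."""
    site_means = []
    for s in range(len(weights)):
        obs = [day[s] for day in daily if day[s] is not None]
        site_means.append(sum(obs) / len(obs) if obs else None)
    return weighted_estimate(site_means, weights)

def um2(daily, weights):
    """Estimate every day first, then aggregate the daily estimates."""
    estimates = [weighted_estimate(day, weights)
                 for day in daily if any(v is not None for v in day)]
    return sum(estimates) / len(estimates)

weights = [0.75, 0.25]  # hypothetical weights of two monitors for one residence

complete = [[10, 30], [20, 40], [30, 50], [40, 60]]
gappy = [[10, 30], [None, 40], [None, 50], [40, 60]]  # site 0 misses two days

print(um1(complete, weights), um2(complete, weights))
print(um1(gappy, weights), um2(gappy, weights))
```

In this toy, `um2` calls the estimator once per day whereas `um1` calls it once per aggregation period, which mirrors the computational trade-off described above; which of the two results is closer to the truth depends on the data and on the estimator, and the example makes no claim about that.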

## Conclusions

To estimate residential levels of exposure to ambient air pollution in a community-based study, we presented and compared in this article two BME-based temporal upscaling methods (UM1: data aggregation followed by BME estimation; and UM2: BME estimation followed by aggregation). BME's flexibility allowed the assimilation of G-KB and S-KB of different formats; for example, BME exposure analysis can process scarce and uncertain data sets in a probabilistic way instead of neglecting them, as is the case with most existing quantitative exposure methods. In the context of residential long-term exposure estimation, we showed that the UM1 and UM2 methods produce accurate space–time estimates. We studied the relative performance of the two upscaling methods at different time scales by means of cross-validation tests. We found UM2 to be generally better than UM1, in the sense that the UM2 estimates were unbiased, the differences between the UM2 estimates and the true long-term exposures were smaller, and UM2 exhibited better test-passing rates than UM1. On the other hand, UM1 can perform reasonably well when the aggregation process does not alter the spatiotemporal structure of the original data set.

## Footnotes

The research was supported by grants from the National Institute of Environmental Health Sciences (P30ES10126), the California Air Resources Board (55245A), and Taiwan National Science Council (NSC97-2313-B-002-002-MY2).

## References

- Ackermann-Liebrich U, Leuenberger P, Schwartz J, Schindler C, Monn C, Bolognini C, et al. Lung function and long term exposure to air pollutants in Switzerland. Am J Respir Crit Care Med. 1997;155(1):122–129. [PubMed]
- Aickin M. Causal Analysis in Biomedicine and Epidemiology: Based on Minimal Sufficient Causation. New York: Marcel Dekker; 2002.
- Bedford T, Cooke RM. Probabilistic Risk Analysis: Foundations and Methods. Cambridge, UK: Cambridge University Press; 2001.
- Bonner MR, Han D, Nie J, Rogerson P, Vena JE, Freudenheim AL. Positional accuracy of geocoded addresses in epidemiologic research. Epidemiology. 2003;14(4):408–412. [PubMed]
- Brauer M, Hoek G, van Vliet P, Meliefste K, Fischer P, Gehring U, et al. Estimating long-term average particulate air pollution concentrations: application of traffic indicators and geographic information systems. Epidemiology. 2003;14(2):228–239. [PubMed]
- Chen CC, Wu KY, Chang MJW. A statistical assessment on the stochastic relationship between biomarker concentrations and environmental exposures. Stoch Environ Res Risk Assess. 2004;18(6):377–385.
- Choi KM, Serre ML, Christakos G. Efficient mapping of California mortality fields at different spatial scales. J Expo Anal Environ Epidemiol. 2003;13(2):120–133. [PubMed]
- Christakos G. Modern Spatiotemporal Geostatistics. New York: Oxford University Press; 2000.
- Christakos G. Epistematics: An Evolutionary Framework of Real World Problem-Solving. New York: Springer; 2009.
- Christakos G, Hristopulos DT. Spatiotemporal Environmental Health Modelling: A Tractatus Stochasticus. Boston: Kluwer Academic; 1998.
- Christakos G, Olea RA, Serre ML, Yu H-L, Wang L. Interdisciplinary Public Health Reasoning and Epidemic Modelling: The Case of Black Death. New York: Springer; 2005.
- Christakos G, Serre ML. BME analysis of spatiotemporal particulate matter distributions in North Carolina. Atmos Environ. 2000;34(20):3393–3406.
- Christakos G, Serre ML, Kovitz JL. BME representation of particulate matter distributions in the State of California on the basis of uncertain measurements. J Geophys Res Atmos. 2001;106(D9):9717–9731.
- Christakos G, Thesing GA. The intrinsic random-field model in the study of sulfate deposition processes. Atmos Environ A Gen. 1993;27(10):1521–1540.
- Christakos G, Vyas VM. A composite space/time approach to studying ozone distribution over Eastern United States. Atmos Environ. 1998;32(16):2845–2857.
- Clayton CA, Pellizzari ED, Rodes CE, Mason RE, Piper LL. Estimating distributions of long-term particulate matter and manganese exposures for residents of Toronto, Canada. Atmos Environ. 1999;33(16):2515–2526.
- Conceicao GMS, Miraglia SGEK, Kishi HS, Saldiva PHN, Singer JM. Air pollution and child mortality: a time-series study in Sao Paulo, Brazil. Environ Health Perspect. 2001;109:347–350. [PMC free article] [PubMed]
- Cooper GS, Dooley MA, Treadwell EL, St Clair EW, Gilkeson GS. Hormonal and reproductive risk factors for development of systemic lupus erythematosus: results of a population-based, case-control study. Arthritis Rheum. 2002;46(7):1830–1839. [PubMed]
- Deutsch CV, Journel AG. GSLIB Geostatistical Software Library and User’s Guide [CD] New York: Oxford University Press; 1998.
- Gotway CA, Young LJ. Combining incompatible spatial data. J Am Stat Assoc. 2002;97(458):632–648.
- Hoek G, Fischer P, Van den Brandt P, Goldbohm S, Brunekreef B. Estimation of long-term average exposure to outdoor air pollution for a cohort study on mortality. J Expo Anal Environ Epidemiol. 2001;11(6):459–469. [PubMed]
- Journel AG, Huijbregts CJ. Mining Geostatistics. New York: Academic Press; 1978.
- Katsouyanni K, Pershagen G. Ambient air pollution exposure and cancer. Cancer Causes Control. 1997;8(3):284–291. [PubMed]
- Kentel E, Aral MM. 2D Monte Carlo versus 2D fuzzy Monte Carlo health risk assessment. Stoch Environ Res Risk Assess. 2005;19(1):86–96.
- Kolovos A, Yu H-L, Christakos G. SEKS-GUI v.0.6. User’s Manual-06 Ed. San Diego, CA: Department of Geography, San Diego State University; 2006.
- Law DCG, Bernstein KT, Serre ML, Schumacher CM, Leone PA, Zenilman JM, et al. Modeling a syphilis outbreak through space and time using the Bayesian maximum entropy approach. Ann Epidemiol. 2006;16(11):797–804. [PubMed]
- Li HL, Huang GH, Zou Y. An integrated fuzzy-stochastic modeling approach for assessing health-impact risk from air pollution. Stoch Environ Res Risk Assess. 2008;22(6):789–803.
- Liao DP, Peuquet DJ, Duan YK, Whitsel EA, Dou JW, Smith RL, et al. GIS approaches for the estimation of residential-level ambient PM concentrations. Environ Health Perspect. 2006;114:1374–1380. [PMC free article] [PubMed]
- Maxwell RM, Kastenberg WE. A model for assessing and managing the risks of environmental lead emissions. Stoch Environ Res Risk Assess. 1999;13(4):231–250.
- Michelozzi P, Capon A, Kirchmayer U, Forastiere F, Biggeri A, Barca A, et al. Adult and childhood leukemia near a high-power radio station in Rome, Italy. Am J Epidemiol. 2002;155(12):1096–1103. [PubMed]
- Mutshinda CM, Antai I, O’Hara RB. A probabilistic approach to exposure risk assessment. Stoch Environ Res Risk Assess. 2008;22(4):441–449.
- Nyberg F, Gustavsson P, Jarup L, Bellander T, Berglind N, Jakobsson R, et al. Urban air pollution and lung cancer in Stockholm. Epidemiology. 2000;11(5):487–495. [PubMed]
- Ozkaynak H, Xue J, Spengler J, Wallace L, Pellizzari E, Jenkins P. Personal exposure to airborne particles and metals: results from the particle team study in Riverside, California. J Expo Anal Environ Epidemiol. 1996;6(1):57–78. [PubMed]
- Parkhurst DF. Arithmetic versus geometric: means for environmental concentration data. Environ Sci Technol. 1998;32(3):92a–98a.
- Pope CA, Burnett RT, Thun MJ, Calle EE, Krewski D, Ito K, et al. Lung cancer, cardiopulmonary mortality, and long-term exposure to fine particulate air pollution. JAMA. 2002;287(9):1132–1141. [PMC free article] [PubMed]
- Rushton G, Krishnamurthy R, Krishnamurti D, Lolonis P, Song H. The spatial relationship between infant mortality and birth defect rates in a US city. Stat Med. 1996;15(17–18):1907–1919. [PubMed]
- Sagiv SK, Mendola P, Loomis D, Herring AH, Neas LM, Savitz DA, et al. A time-series analysis of air pollution and preterm birth in Pennsylvania, 1997–2001. Environ Health Perspect. 2005;113:602–606. [PMC free article] [PubMed]
- Savelieva E, Demyanov V, Kanevski M, Serre M, Christakos G. BME-based uncertainty assessment of the Chernobyl fallout. Geoderma. 2005;128(3–4):312–324.
- SEKS-GUI. Spatiotemporal Epistematics Knowledge Synthesis Model - Graphic User Interface. 2007. [accessed 10 March 2009]. Available: http://homepage.ntu.edu.tw/~hlyu/software/SEKSGUI/SEKSHome.html.
- Serre ML, Kolovos A, Christakos G, Modis K. An application of the holistochastic human exposure methodology to naturally occurring arsenic in Bangladesh drinking water. Risk Anal. 2003;23(3):515–528. [PubMed]
- Stallard N, Whitehead A. The fixed-dose procedure and the acute-toxic-class method: a mathematical comparison. Hum Exp Toxicol. 1995;14(12):974–990. [PubMed]
- Tamborini P, Sigg H, Zbinden G. Acute toxicity testing in the nonlethal dose range—a new approach. Regul Toxicol Pharmacol. 1990;12(1):69–87. [PubMed]
- U.S. EPA (Environmental Protection Agency). Technology Transfer Network: Air Quality System (AQS). 2002. [accessed 1 March 2007]. Available: http://www.epa.gov/ttn/airs/airsaqs/
- U.S. EPA. AQS Raw Data Summary Formulas Draft. Washington, DC: U.S. Environmental Protection Agency; 2004.
- U.S. EPA. AQS Data Coding Manual v212. Washington, DC: U.S. Environmental Protection Agency; 2005.
- U.S. EPA. National Ambient Air Quality Standards (NAAQS) Washington, DC: U.S. Environmental Protection Agency; 2006.
- Vyas VM, Christakos G. Spatiotemporal analysis and mapping of sulfate deposition data over eastern USA. Atmos Environ. 1997;31(21):3623–3633.
- Wallace L, Williams R. Validation of a method for estimating long-term exposures based on short-term measurements. Risk Anal. 2005;25(3):687–694. [PubMed]
- Wang JF, Christakos G, Han WG, Meng B. Data-driven exploration of “spatial pattern-time process-driving forces” associations of SARS epidemic in Beijing, China. J Public Health (Oxf) 2008;30(3):234–244. [PMC free article] [PubMed]
- Ward MH, Nuckols JR, Giglierano J, Bonner MR, Wolter C, Airola M, et al. Positional accuracy of two methods of geocoding. Epidemiology. 2005;16(4):542–547. [PubMed]
- Wu J, Wang J, Meng B, Chen G, Pang L, Song X, et al. Exploratory spatial data analysis for the identification of risk factors to birth defects. BMC Public Health. 2004;4:23. doi: 10.1186/1471-2458-4-23. [Online 18 June 2004] [PMC free article] [PubMed] [Cross Ref]
- Yu HL, Kolovos A, Christakos G, Chen JC, Warmerdam S, Dev B. Interactive spatiotemporal modelling of health systems: the SEKS-GUI framework. Stoch Environ Res Risk Assess. 2007;21(5):555–572.
