Comput Stat Data Anal. Author manuscript; available in PMC Dec 1, 2013.
Published in final edited form as:
Comput Stat Data Anal. Dec 1, 2012; 56(12): 4180–4189.
Published online May 7, 2012.
PMCID: PMC3462467
NIHMSID: NIHMS381678

Model-Based Estimation of the Attributable Risk: A Loglinear Approach

Christopher Cox, Ph.D. and Xiuhong Li, M.A.S.

Abstract

This paper considers model-based methods for estimation of the adjusted attributable risk (AR) in both case-control and cohort studies. An earlier review discussed approaches for both types of studies, using the standard logistic regression model for case-control studies, and for cohort studies proposing the equivalent Poisson model in order to account for the additional variability in estimating the distribution of exposures and covariates from the data. In this paper we revisit case-control studies, arguing for the equivalent Poisson model in this case as well. Using the delta method with the Poisson model, we provide general expressions for the asymptotic variance of the AR for both types of studies. This includes the generalized AR, which extends the original idea of attributable risk to the case where the exposure is not completely eliminated. These variance expressions can be easily programmed in any statistical package that includes Poisson regression and has capabilities for simple matrix algebra. In addition, we discuss computation of standard errors and confidence limits using bootstrap resampling. For cohort studies, use of the bootstrap allows binary regression models with link functions other than the logit.

Keywords: adjusted attributable risk, case-control study, cohort study, Poisson regression, delta method, model-based estimate, bootstrap methods

1. Introduction

The attributable risk (AR) represents the relative amount by which the prevalence of a disease D would be reduced if an exposure E were eliminated, taking account of both the relative risk and the prevalence of the exposure. As noted by a reviewer, the data on which an estimate is based must be representative of the population to which the results will be applied. A refinement of the basic definition is to adjust for the effects of covariates. Adjustment is based on the prevalence of disease in a population as a function of a risk factor E with I levels, usually ordered (with reference level E = 1 representing no exposure), and a set of categorical covariates xj (1 ≤ j ≤ J), which typically represent a compound index generated by the combined levels of two or more factors (Benichou, 2001). The adjusted AR is defined as follows (Basu and Landis, 1995; Eide and Gefeller, 1995; Lehnert-Batar, Pfahlberg and Gefeller, 2006).

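$AR = 1 - \frac{P(D \mid E=1)}{P(D)} = 1 - \frac{\sum_{j=1}^{J} P(x_j)\,P(D \mid E=1, x_j)}{P(D)} = \frac{\sum_j \sum_i P(E=i, x_j, D) - \sum_j P(x_j)\,P(D \mid E=1, x_j)}{P(D)} = \frac{\sum_j \sum_{i>1} P(E=i, x_j)\,\{P(D \mid E=i, x_j) - P(D \mid E=1, x_j)\}}{P(D)}$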
(1)

The final expression in (1) is used to define the AR for the jth level of the covariates or, reversing the order of summation, the ith level of exposure (i > 1). Eide and Gefeller (1995) and Eide and Heuch (2006) refer to the latter as components of the AR due to a particular level of exposure and note that these components sum to the total AR.
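As a small numerical illustration of equation (1), the following sketch computes the total adjusted AR and its components by exposure level directly from a joint distribution. The probabilities, the array names `p_joint` and `p_dis`, and the helper `adjusted_ar` are all invented for this example and do not come from the paper.

```python
import numpy as np

# Hypothetical inputs (not from the paper): I = 3 exposure levels (rows,
# row 0 is the reference level E = 1) and J = 2 covariate strata (columns).
p_joint = np.array([[0.30, 0.20],
                    [0.15, 0.15],
                    [0.10, 0.10]])   # P(E = i, x_j), sums to 1
p_dis = np.array([[0.05, 0.10],
                  [0.08, 0.15],
                  [0.12, 0.20]])     # P(D | E = i, x_j)

def adjusted_ar(p_joint, p_dis):
    """Return (total AR, components by exposure level) from equation (1)."""
    p_d = np.sum(p_joint * p_dis)                 # P(D)
    excess = p_joint * (p_dis - p_dis[0])         # P(E=i,x_j){P(D|i,j) - P(D|1,j)}
    components = excess.sum(axis=1) / p_d         # sum over strata j
    return components.sum(), components

ar_total, ar_by_level = adjusted_ar(p_joint, p_dis)

# Second form in (1): 1 - sum_j P(x_j) P(D | E=1, x_j) / P(D)
ar_check = 1 - np.sum(p_joint.sum(axis=0) * p_dis[0]) / np.sum(p_joint * p_dis)
```

The component for the reference level is zero by construction, and the remaining components add up to the total, which is the property noted by Eide and Gefeller (1995).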

We consider model-based approaches for estimation of the adjusted attributable risk. Model-based methods use a regression model to estimate the probabilities in equation (1). For both cohort and case-control studies the standard approach is based on a logistic regression model for disease status as a function of exposure and covariates. For cohort studies Cox (2006) previously proposed the loglinear model equivalent to the standard logistic regression model for the case of cross-sectional sampling, in order to account for the additional variability in estimating the joint distribution of exposure and covariates (Basu and Landis, 1995). The Poisson regression model can also be used for stratified cohort studies when additional data are available to estimate the exposure distribution.

The adjusted AR (1) is based on a comparison of disease risk among exposed individuals to that in the unexposed (E = 1) population. A generalization (Drescher and Becher (1997), Eide and Heuch (2001), and references therein) is to allow comparison to a population in which the exposure is not entirely absent (present only at the lowest level), but rather has a nondegenerate distribution, which is different from that in the original population. An example is when the exposure is reduced but not eliminated as the result of an intervention or education program. The definition of the generalized attributable risk (generalized impact fraction) given by Drescher and Becher (1997) for such an alternative distribution, Pr*(E = i, xj), can be written as follows.

$gAR = \frac{\sum_j \sum_i \{\Pr(E=i, x_j)\,\Pr(D \mid E=i, x_j) - \Pr^{*}(E=i, x_j)\,\Pr(D \mid E=i, x_j)\}}{\Pr(D)}$
(2)

Although defined in more general terms, the alternative distribution would typically involve only the levels of the risk factor. For each value of the covariates it is defined as a re-weighting of the original exposure probabilities by a specified probability density function g(i|k) (1 ≤ i ≤ I), defined for each level of exposure, k (1 ≤ k ≤ I).

$\Pr^{*}(E=i, x_j) = \sum_{k} g(i \mid k)\,\Pr(E=k, x_j)$
(3)

This definition has the intuitively appealing property that P*(xj) = P(xj); the special case of the standard AR corresponds to g(1|k) = 1 for every k. To illustrate this idea we will use an example considered by Drescher and Becher (1997). In this case I = 4, and we assume that a proportion q1 (0 < q1 < 1) of subjects change to the lowest risk level, while an additional proportion q2 (0 < q1 ≤ q1 + q2 ≤ 1) change from the current level to the next lower level, with subjects already at the lowest level of risk remaining where they are. The two-parameter family of density functions specified in (3) is given in the following table, which we will use to illustrate the generalized AR.

g(i|k)    k = 1    k = 2          k = 3          k = 4
i = 1     1        q1+q2          q1             q1
i = 2     0        1−(q1+q2)      q2             0
i = 3     0        0              1−(q1+q2)      q2
i = 4     0        0              0              1−(q1+q2)
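The table can be written as a 4 × 4 matrix whose columns each sum to one, and equation (3) then becomes a matrix product applied within each covariate stratum. The sketch below is only an illustration: the function name `shift_matrix` and the joint distribution `p_joint` are hypothetical, while q1 = 0.8 and q2 = 0.2 are the values used in the illustrations of Section 4.

```python
import numpy as np

def shift_matrix(q1, q2):
    """G[i, k] = g(i+1 | k+1) for the four-level example (columns sum to 1)."""
    r = 1.0 - (q1 + q2)
    return np.array([[1.0, q1 + q2, q1,  q1],
                     [0.0, r,       q2,  0.0],
                     [0.0, 0.0,     r,   q2],
                     [0.0, 0.0,     0.0, r]])

G = shift_matrix(0.8, 0.2)

# Hypothetical joint distribution Pr(E = k, x_j): rows are exposure levels,
# columns are covariate strata (these numbers are invented for the sketch).
p_joint = np.array([[0.20, 0.10],
                    [0.15, 0.15],
                    [0.10, 0.10],
                    [0.05, 0.15]])

p_star = G @ p_joint         # equation (3), applied stratum by stratum
```

Because every column of G sums to one, the column totals of `p_star` equal those of `p_joint`, which is the property P*(xj) = P(xj) mentioned in the text.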

In this paper we consider both cohort and case-control studies. For cohort studies we employ the Poisson regression model. As is standard practice the parameters of the model are estimated by the method of maximum likelihood. For this model we first provide expressions for the large sample variance of the model-based AR for cohort studies, using the delta method (Cox, 1998), which is the standard method for finding asymptotic variances for functions of the original parameters of the model estimated by maximum likelihood. These expressions can be easily programmed in statistical packages having matrix capabilities, such as R.

For case-control studies we propose the equivalent loglinear model as well, again in order to account for estimation of the distribution of exposure and covariates. Expressions for the variance of the adjusted AR based on the loglinear model and the delta method are provided. We include a discussion of the generalized AR for both types of studies. We also consider bootstrap methods for computing standard errors and confidence intervals. An advantage of the bootstrap is that for cohort studies, link functions other than the logit can be employed in the binary regression model, and we provide an illustration using a discrete survival model. The approach is not difficult to implement in packages that facilitate resampling methods.

The delta method requires expressions for the partial derivatives of various nonlinear functions of the parameters. There are many ways to write the required vectors of partial derivatives; in addition to ordinary matrix multiplication, we use the Schur product (element-wise multiplication) to simplify the notation. This requires either that both dimensions of the two matrix operands are identical, or that one of the two matrices is a vector whose length equals one of the two dimensions of the other. We will denote this operation with an asterisk; a simple relation that is used repeatedly is $(a \ast b)'c = b'(c \ast a) = \sum_i a_i b_i c_i$ for three m-vectors a, b and c.
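In array software the Schur product is ordinary element-wise multiplication. The following sketch, with made-up vectors, checks the identity above and shows the vector-by-matrix case, which is how a term such as μ * W is formed in the later sections. The names `mu` and `W` here are toy stand-ins, not fitted quantities.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
c = np.array([7.0, 8.0, 9.0])

lhs = (a * b) @ c            # (a*b)'c
mid = b @ (c * a)            # b'(c*a)
rhs = np.sum(a * b * c)      # sum_i a_i b_i c_i

# Vector-by-matrix case: each row k of W is scaled by mu[k].
mu = np.array([2.0, 3.0])
W = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
mu_W = mu[:, None] * W       # Schur product mu * W, shape (2, 3)
```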

2. Cohort studies

Consider a logistic regression model with binary outcome D = d (0/1), denoted by a disease indicator, zd, and m × p model (design) matrix Z, which includes an intercept. The model matrix consists of one or more exposure variables (indicators for categories of exposure) and categorical covariates, specified by (E = i, xj | 1 ≤ i ≤ I, 1 ≤ j ≤ J). The usual case for exposure is an ordinal, categorical exposure E having I levels, with a saturated (I − 1 parameters) model. We assume that each row $z_k'$ of Z, together with the corresponding element of zd, specifies a unique combination of the disease outcome, exposure and covariates, so that m = 2IJ. Note that in this formulation we require separate rows for subjects with and without disease, each with the same pattern of exposure and covariates. Let nk ≥ 0 (some of which may be zero) denote the total number of observations with a given pattern specified by $z_k'$ and the corresponding element of zd, and define the vector n = (nk). For a given pattern of exposure and covariates, the sums of the separate totals for diseased and non-diseased subjects are fixed in the likelihood for the logistic regression model, although the individual totals themselves are not (they could be used as frequency weights in fitting the logistic model). Denote the maximum likelihood estimates of the unknown parameters by $\hat\beta_z$, so that the vector of predicted probabilities of disease based on the model is $(\hat P(D \mid E=i, x_j)) = \hat p = p(\hat\beta_z) = \mathrm{expit}(Z\hat\beta_z) = (\mathrm{expit}(z_k'\hat\beta_z))$, where expit(z) = 1/[1 + exp(−z)]. (Note that each predicted probability is repeated twice, once for diseased and once for non-diseased subjects having the same pattern of exposure and covariates.) To compute the adjusted AR we define the matrix Z1 as the matrix Z with all exposure columns set to zero (so that in every case the exposure is the reference value, E = 1), in which case $\hat p_1 = p_1(\hat\beta_z) = \mathrm{expit}(Z_1\hat\beta_z) = (\hat P(D \mid E=1, x_j))$. The adjusted AR (1) is then (Greenland and Drescher, 1993)

$AR = 1 - \frac{n'\hat p_1}{n'\hat p} = \frac{n'(\hat p - \hat p_1)}{n'\hat p}$
(4)

Because the predicted probabilities are repeated for diseased and non-diseased subjects, the AR depends only on the sums of the corresponding totals, not the totals themselves. Since these combined totals are fixed in the logistic regression, a variance computed from (4) using the delta method will not take account of variation in these totals, which determine the joint distribution of exposure and covariates. Similarly, $p_1(\hat\beta_z)$ is independent of the particular level of exposure in a given row of Z, so that the values of nk corresponding to different levels of exposure for a given pattern of covariates are effectively summed over all levels of exposure, as well as over disease status. For this reason, $n'\hat p_1 = n_{+}\sum_{j=1}^{J} P(x_j)\,\hat P(D \mid E=1, x_j)$, where the plus notation denotes summation. Although the standard model involves logistic regression, the formulation (4) can clearly be used with other link functions. For an arbitrary link function g we can compute $p(\hat\beta_z) = g^{-1}(Z\hat\beta_z)$ and $p_1(\hat\beta_z) = g^{-1}(Z_1\hat\beta_z)$, and then use (4).
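A minimal sketch of equation (4) with a pluggable inverse link is given here. It assumes the coefficients have already been estimated elsewhere; the function `model_based_ar`, its arguments, and the 2 × 2 counts are hypothetical and chosen so that the answer can be checked by hand (the saturated-model estimates are the sample proportions 0.1 and 0.3).

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def inv_cloglog(z):
    # inverse of the complementary log-log link g(p) = log(-log(1 - p))
    return 1.0 - np.exp(-np.exp(z))

def model_based_ar(Z, exposure_cols, beta, n, inv_link=expit):
    """Equation (4): AR = 1 - n'p1 / n'p for a fitted binary regression."""
    Z1 = Z.copy()
    Z1[:, exposure_cols] = 0.0          # everyone moved to the reference level
    p = inv_link(Z @ beta)
    p1 = inv_link(Z1 @ beta)
    return 1.0 - (n @ p1) / (n @ p)

# Hypothetical 2x2 data: rows are (d=0,E=1), (d=0,E=2), (d=1,E=1), (d=1,E=2).
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
n = np.array([90.0, 70.0, 10.0, 30.0])

logit = lambda p: np.log(p / (1.0 - p))
cloglog = lambda p: np.log(-np.log(1.0 - p))

beta_logit = np.array([logit(0.1), logit(0.3) - logit(0.1)])
beta_cll = np.array([cloglog(0.1), cloglog(0.3) - cloglog(0.1)])

ar_logit = model_based_ar(Z, [1], beta_logit, n)
ar_cll = model_based_ar(Z, [1], beta_cll, n, inv_link=inv_cloglog)
```

With a saturated model both links reproduce the same fitted probabilities, so both calls give AR = 1 − 20/40 = 0.5; with non-saturated models the two links would generally differ.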

Generalizing the approach of Walter (1976) for the two-by-two table, Cox (2006) proposed a Poisson regression model for the case of cross-sectional sampling, in order to account for estimation of the joint distribution of exposure and covariates. As described below, the loglinear model is equivalent to the standard logistic regression model in the sense that the corresponding regression coefficients and variance matrix for exposure and covariates are identical to those from the logistic model. The Poisson regression model uses the counts nk ≥ 0 as the dependent variable and has augmented model matrix W = [Zd, U], where Zd = zd * Z (zd is the indicator for disease status), and U is a model matrix saturated in the covariates and exposure (corresponding to a categorical variable denoting all possible combinations of levels in the data), including the overall intercept. That is, Zd represents the interaction between the outcome of disease status and the Z terms in the original logistic model, and U contains Z and all distinct pairwise products of columns of Z (with the exception of the intercept).

For example, consider a logistic model with four categories of exposure and one binary covariate. One possible way to set up the required matrices is the following. Columns 2–4 of Z denote the level of exposure and column 5 is the covariate.

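$z_d = (0,0,0,0,1,1,1,1,0,0,0,0,1,1,1,1)'$

$Z = \begin{bmatrix} 1&0&0&0&0 \\ 1&1&0&0&0 \\ 1&0&1&0&0 \\ 1&0&0&1&0 \\ 1&0&0&0&0 \\ 1&1&0&0&0 \\ 1&0&1&0&0 \\ 1&0&0&1&0 \\ 1&0&0&0&1 \\ 1&1&0&0&1 \\ 1&0&1&0&1 \\ 1&0&0&1&1 \\ 1&0&0&0&1 \\ 1&1&0&0&1 \\ 1&0&1&0&1 \\ 1&0&0&1&1 \end{bmatrix}$

$Z_d = \begin{bmatrix} 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&0&0&0 \\ 1&1&0&0&0 \\ 1&0&1&0&0 \\ 1&0&0&1&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&0&0&1 \\ 1&1&0&0&1 \\ 1&0&1&0&1 \\ 1&0&0&1&1 \end{bmatrix}$

$U = \begin{bmatrix} 1&0&0&0&0&0&0&0 \\ 1&1&0&0&0&0&0&0 \\ 1&0&1&0&0&0&0&0 \\ 1&0&0&1&0&0&0&0 \\ 1&0&0&0&0&0&0&0 \\ 1&1&0&0&0&0&0&0 \\ 1&0&1&0&0&0&0&0 \\ 1&0&0&1&0&0&0&0 \\ 1&0&0&0&1&0&0&0 \\ 1&1&0&0&1&1&0&0 \\ 1&0&1&0&1&0&1&0 \\ 1&0&0&1&1&0&0&1 \\ 1&0&0&0&1&0&0&0 \\ 1&1&0&0&1&1&0&0 \\ 1&0&1&0&1&0&1&0 \\ 1&0&0&1&1&0&0&1 \end{bmatrix}$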

The first 5 columns of U replicate Z, and the last three are interactions between the covariate and each of the three indicator variables for exposure. To obtain Z1 we set columns 2–4 of Z to zero. The matrix W = [Zd, U] has 16 rows and 13 columns.
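The matrices in this example can be generated programmatically rather than typed in. The sketch below is one possible construction in NumPy; the variable names and the row ordering (exposure within disease within covariate stratum) are choices made here to match the display above.

```python
import numpy as np

I_levels, J_strata = 4, 2
rows = [(x, d, e) for x in range(J_strata)      # covariate stratum
                  for d in range(2)             # disease status
                  for e in range(I_levels)]     # exposure level (0 = reference)
x, d, e = (np.array(col, dtype=float) for col in zip(*rows))

z_d = d                                              # disease indicator
E = np.column_stack([(e == i) for i in range(1, I_levels)]).astype(float)
Z = np.column_stack([np.ones(len(rows)), E, x])      # intercept, E2-E4, covariate
Z_d = z_d[:, None] * Z                               # Schur product z_d * Z
U = np.column_stack([Z, E * x[:, None]])             # saturated in E and x
W = np.column_stack([Z_d, U])                        # Poisson model matrix
Z1 = Z.copy()
Z1[:, 1:4] = 0.0                                     # exposure set to reference
```

The resulting W has 16 rows and 13 linearly independent columns, so the Poisson model leaves 3 residual degrees of freedom, corresponding to the omitted three-way interaction.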

It is well known that for this loglinear model the estimated marginal totals corresponding to the independent variables are equal to the observed totals (McCullagh and Nelder, 1989, Section 6.4.2). The parameter vector may also be partitioned, β = (βz, βu), so that the vector βz corresponds to the parameters of the logistic model, and the m.l.e., $\hat\beta_z$, and corresponding sub-matrix of the estimated variance matrix, $\Sigma(\hat\beta_z)$, are identical to the estimates from the logistic regression. The remaining parameters account for variation in the marginal totals for exposure and covariates. These parameters are used to estimate the joint distribution of exposure and covariates, and sufficient data must be available to estimate this distribution if (1) is to be applied. Denote the vector of predicted means from the Poisson regression by $\hat\mu = \hat n = \exp(W\hat\beta) = (\exp(w_j'\hat\beta))$, where $w_j'$ denotes the jth row of W. Because of the form of the likelihood equations for the Poisson model (Agresti, 2002, Section 4.4.7), the number of diseased subjects can be estimated directly by $\hat\mu' z_d$, which actually equals the observed number.

To compute the adjusted AR we define the matrix V1 = [Z1, 0], having the same dimensions as W but replacing Zd by Z1 and U by 0, so that $\hat p_1 = p_1(\hat\beta) = \mathrm{expit}(V_1\hat\beta)$. Then because the estimated marginal totals for the exposure and covariates equal the observed, it follows from (4) that (Cox, 2006),

$AR = 1 - \frac{\hat\mu'\hat p_1}{\hat\mu' z_d} = \frac{\hat\mu'(z_d - \hat p_1)}{\hat\mu' z_d}$
(5)

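To apply the delta method we have $\partial\hat\mu/\partial\beta = \hat\mu \ast W$, so that

$\frac{\partial}{\partial\beta}\,\hat\mu' z_d = z_d'(\hat\mu \ast W)$

and similarly

$\frac{\partial}{\partial\beta}\,\hat\mu'\hat p_1 = \hat p_1'(\hat\mu \ast W) + \hat\mu'\{\hat p_1 \ast (1_m - \hat p_1) \ast V_1\}$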

The standard approach is now to use the delta method to compute the asymptotic variances and covariance of the numerator and denominator of the AR (or 1 − AR), and then a second application yields the approximate variance for a ratio (Greenland and Drescher, 1993). To avoid possible numerical difficulties, however, we take a direct approach based on the quotient rule.

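$\frac{\partial}{\partial\beta}AR = -\left\{\frac{\partial}{\partial\beta}(\hat\mu'\hat p_1) - (1 - AR)\,\frac{\partial}{\partial\beta}(\hat\mu' z_d)\right\} / \hat\mu' z_d$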
(6)

The delta method then gives the estimated variance.

$\widehat{var}(AR) = \frac{\partial AR'}{\partial\hat\beta}\,\Sigma(\hat\beta)\,\frac{\partial AR}{\partial\hat\beta}$
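The steps from the Poisson fit to the delta-method standard error can be sketched as follows. This is not the authors' S-Plus code: the Newton-Raphson fitter `fit_poisson`, the helper `ar_and_gradient`, and the counts `n` are all hypothetical stand-ins written for this illustration, and any package that returns the Poisson estimates and their covariance matrix could replace the fitter.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_poisson(W, n, iters=50):
    """Newton-Raphson (IRLS) for a log-link Poisson model; returns beta, cov."""
    beta = np.linalg.lstsq(W, np.log(n + 0.5), rcond=None)[0]   # starting value
    for _ in range(iters):
        mu = np.exp(W @ beta)
        info = W.T @ (mu[:, None] * W)            # Fisher information
        step = np.linalg.solve(info, W.T @ (n - mu))
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    mu = np.exp(W @ beta)
    return beta, np.linalg.inv(W.T @ (mu[:, None] * W))

def ar_and_gradient(beta, W, V1, z_d):
    """AR from equation (5) and its gradient by the quotient rule."""
    mu = np.exp(W @ beta)
    p1 = expit(V1 @ beta)
    muW = mu[:, None] * W                         # d mu / d beta = mu * W
    num, den = mu @ p1, mu @ z_d                  # mu'p1 and mu'z_d
    d_den = z_d @ muW
    d_num = p1 @ muW + (mu * p1 * (1 - p1)) @ V1
    ar = 1.0 - num / den
    grad = -(d_num - (1.0 - ar) * d_den) / den    # derivative of AR itself
    return ar, grad

# Design matrices for the 4-exposure, 1-binary-covariate example
x, d, e = (np.array(c, dtype=float) for c in zip(
    *[(x, d, e) for x in range(2) for d in range(2) for e in range(4)]))
E = np.column_stack([(e == i) for i in (1, 2, 3)]).astype(float)
Z = np.column_stack([np.ones(16), E, x])
W = np.column_stack([d[:, None] * Z, Z, E * x[:, None]])    # [Z_d, U]
Z1 = Z.copy()
Z1[:, 1:4] = 0.0
V1 = np.column_stack([Z1, np.zeros((16, 8))])               # [Z1, 0]

# Hypothetical counts (invented for this sketch, not the NHANES II data)
n = np.array([80, 60, 50, 40, 8, 9, 10, 12, 70, 55, 45, 35, 10, 12, 14, 16.0])

beta, cov = fit_poisson(W, n)
ar, grad = ar_and_gradient(beta, W, V1, d)
se = np.sqrt(grad @ cov @ grad)                   # delta-method standard error
```

The sign of the gradient does not affect the variance, since it enters as a quadratic form; it matters only if the gradient is reused for other purposes, such as the variance of log(1 − AR).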

We can also apply the delta method to obtain an approximation for var[log(1 − AR)], a transformation that has been recommended to produce a more normally distributed estimate. Alternative transformations have also been proposed, including the logistic (Lehnert-Batar, Pfahlberg and Gefeller, 2006).

The simplest case is that of a binary exposure and no covariates, considered in detail by Walter (1976). Assuming cross-sectional sampling, let (nij) denote the cell counts in the cross-classified two-by-two table, where i = 1,2 denotes exposure and j = 1,2 disease, with i = 2 indicating exposure and j = 2 indicating disease. The logistic model is based on the conditional distributions $n_{i2} \sim b(n_{i+}, p_i)$, i = 1,2, where pi = P(D|E = i), which are independent, and the maximum likelihood estimates of the probabilities are the sample proportions. The prevalence of exposure is P(E) = n2+/n++, which is a fixed quantity in the conditional likelihood, while the probability of disease is P(D) = n+2/n++, which of course is not. The attributable risk is estimated as follows.

$AR = 1 - \frac{\hat P(D \mid E=1)}{\hat P(D)} = 1 - \frac{\hat p_1}{(n_{1+}\hat p_1 + n_{2+}\hat p_2)/n_{++}} = 1 - \frac{n_{12}/n_{1+}}{n_{+2}/n_{++}}$

To compute a large sample variance, we would calculate $\partial AR/\partial p_i$ and apply the delta method using the 2×2 diagonal covariance matrix of $(\hat p_1, \hat p_2)$ with diagonal elements [p1(1 − p1)/n1+, p2(1 − p2)/n2+], substituting the maximum likelihood estimates for the true parameter values (Lehnert-Batar, Pfahlberg and Gefeller, 2006). For the Poisson model, the 4×4 model matrix W has linearly independent columns and so the model may be reparameterized as four jointly independent Poisson random variables, nij ~ Po(μij); the maximum likelihood estimates of the parameters of this model are the cell counts. The estimate of the exposure prevalence is the same as for the conditional binomial likelihood, but is now subject to random variation. The estimate of the AR is also identical.

$AR = 1 - \frac{\hat P(D \mid E=1)}{\hat P(D)} = 1 - \frac{\hat\mu_{12}/\hat\mu_{1+}}{\hat\mu_{+2}/\hat\mu_{++}} = 1 - \frac{n_{12}/n_{1+}}{n_{+2}/n_{++}}$

To calculate the asymptotic variance we would compute $\partial AR/\partial\mu_{ij}$ and apply the delta method using the 4×4 diagonal covariance matrix diag(μ11, μ12, μ21, μ22), again with the parameters replaced by their maximum likelihood estimates.
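For the two-by-two table both delta-method variances have simple closed forms, which makes the contrast between the binomial and Poisson models easy to see numerically. The counts below are invented for illustration, and the gradient expressions were derived here from the formulas for the AR given above rather than taken from the paper.

```python
import numpy as np

# Hypothetical 2x2 table n[i, j]: row i = exposure (0 = unexposed),
# column j = disease (1 = diseased). Counts are invented for illustration.
n = np.array([[90.0, 10.0],
              [70.0, 30.0]])
n_row, n_col, n_tot = n.sum(axis=1), n.sum(axis=0), n.sum()

ratio = (n[0, 1] / n_row[0]) / (n_col[1] / n_tot)   # 1 - AR
ar = 1.0 - ratio

# Binomial (logistic) model: parameters p_i, row totals fixed.
p = n[:, 1] / n_row
s = n_row @ p                                       # expected number diseased
grad_p = -n_tot * n_row[1] * np.array([p[1], -p[0]]) / s**2   # dAR/dp_i
var_binom = np.sum(grad_p**2 * p * (1 - p) / n_row)

# Poisson model: parameters mu_ij, estimated by the cell counts.
dlog = np.array([[1/n_tot - 1/n_row[0],
                  1/n[0, 1] + 1/n_tot - 1/n_row[0] - 1/n_col[1]],
                 [1/n_tot,
                  1/n_tot - 1/n_col[1]]])           # d log(1 - AR) / d mu_ij
grad_mu = -ratio * dlog                             # dAR/dmu_ij
var_pois = np.sum(grad_mu**2 * n)                   # covariance is diag(mu_hat)
```

For these counts the AR is 0.5 under both models, while the Poisson variance (0.01625) exceeds the binomial variance (0.0159375); the difference is the contribution from treating the exposure totals as random.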

To compute components of the AR for each level of exposure we define $1_i$ as the indicator variable for E = i > 1 (the corresponding column of Z), and from (1) we have

$AR_i = \frac{(\hat\mu \ast 1_i)'(z_d - \hat p_1)}{\hat\mu' z_d}$
(7)

Again, because the Poisson model is saturated in the exposure and covariates, we have from (1) that $\sum_{i>1}(\hat\mu \ast 1_i)'(z_d - \hat p_1) = \hat\mu'(z_d - \hat p_1)$, so that $AR = \sum_{i>1} AR_i$. The component of the attributable risk for the jth stratum of the covariates is defined similarly. To apply the delta method using (6) we need only the derivative of the numerator of (7),

$\frac{\partial}{\partial\beta}(\hat\mu \ast 1_i)'(z_d - \hat p_1) = (z_d - \hat p_1)'(1_i \ast \hat\mu \ast W) - (\hat\mu \ast 1_i)'\{\hat p_1 \ast (1_m - \hat p_1) \ast V_1\}$

Finally we consider the generalized AR (2). As discussed above, this is defined by means of an I × I matrix G = [gik] = [g(i|k)] of probabilities (0 ≤ g(i|k) ≤ 1), satisfying $1_I' G = 1_I'$, which is used to define the alternative distribution of exposure at each level of the covariates (3). We also need the predicted probabilities for the logistic model, given by $\hat p = p(\hat\beta) = \mathrm{expit}(V\hat\beta)$, using a similar notation for the matrix V = [Z, 0]. Depending on how the rows of the data set are arranged, the matrix G can be expanded into an m × m matrix G* so that for $\hat\mu^{*} = G^{*}\hat\mu$,

$gAR = 1 - \frac{\sum_j \sum_i \{P^{*}(E=i, x_j)\,P(D \mid E=i, x_j)\}}{P(D)} = 1 - \frac{\hat\mu^{*\prime}\hat p}{\hat\mu' z_d}$
(8)

If the rows of Z are sorted by exposure within disease within covariate categories, then, since we include cases where nj = 0 so that m = 2JI, we have $G^{*} = I_{2J} \otimes G$ (a Kronecker product). In our example, I = 4 and J = 2, so that G* = diag[G, G, G, G]. To apply the delta method using (6) we need only the derivative of the numerator of (8).

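$\frac{\partial}{\partial\beta}\,\hat\mu^{*\prime}\hat p = \hat p' G^{*}(\hat\mu \ast W) + \hat\mu^{*\prime}\{\hat p \ast (1_m - \hat p) \ast V\}$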
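The expansion of G into the block-diagonal matrix G* is a Kronecker product with an identity matrix. The sketch below illustrates this for the running example with q1 = 0.8 and q2 = 0.2; the vector `mu_hat` is a hypothetical placeholder for fitted means, ordered by exposure within disease within covariate stratum.

```python
import numpy as np

q1, q2 = 0.8, 0.2
r = 1.0 - (q1 + q2)
G = np.array([[1.0, q1 + q2, q1,  q1],
              [0.0, r,       q2,  0.0],
              [0.0, 0.0,     r,   q2],
              [0.0, 0.0,     0.0, r]])

J = 2                                   # covariate strata in the example
G_star = np.kron(np.eye(2 * J), G)      # block diagonal: diag[G, G, G, G]

# Hypothetical fitted means, ordered by exposure within disease within stratum
mu_hat = np.arange(1.0, 17.0)
mu_star = G_star @ mu_hat               # exposure shifted within each block
```

Each block of four rows shares one disease status and one covariate stratum, so the shift redistributes the fitted means across exposure levels while leaving each block total unchanged.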

3. Case-control studies

For case-control studies we can use Bayes' theorem to write the adjusted AR (1) as (Bruzzi et al., 1985; Eide and Heuch, 2001),

$AR = 1 - \sum_{i=1}^{I}\sum_{j=1}^{J} P(E=i, x_j \mid D)\,RR_{i \mid j}^{-1} = \sum_{i>1}\sum_{j} P(E=i, x_j \mid D)\,(1 - RR_{i \mid j}^{-1})$
(9)

where RRi|j = P(D|E = i, xj)/P(D|E = 1, xj) is the relative risk, which will, in general, depend on both the level of exposure and the covariates (Greenland and Drescher, 1993). To compute a model-based estimate, a standard approach is to use a logistic regression model with m × p model matrix Z and dependent variable denoting case (d = 1) or control (d = 0) status, and estimate the relative risks by the corresponding odds ratios. We again assume that each row of Z, together with the corresponding element of the outcome vector, specifies a unique pattern of the independent and dependent variables, and let $\hat\beta$ denote the estimated parameters. The m-vector of predicted probabilities, each one repeated twice, is denoted by $\hat p = p(\hat\beta) = \mathrm{expit}(Z\hat\beta)$. If n = (nk) again denotes the vector of counts (frequencies), then $n'\hat p$ is a model-based estimate of (actually equal to) the total number of cases, whose variability depends only on the estimated parameters (Cox, 2006). (As before, it is true that the numbers of cases and controls are random according to the logistic likelihood, but because the values of $\hat p$ depend only on exposure and covariates, they are the same for both cases and controls having the same pattern, so that the estimate $n'\hat p$ depends only on the estimated probabilities and the combined (case + control) totals for each distinct combination of exposure and covariates, and these totals are fixed in the logistic model.)
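Equation (9) needs only the distribution of exposure and covariates among the cases and the stratum-specific relative risks. A short numerical sketch with invented values follows; the arrays `p_cases` and `rr` are hypothetical and are not estimates from any of the studies discussed.

```python
import numpy as np

# Hypothetical inputs: I = 3 exposure levels (rows, first is the reference)
# and J = 2 covariate strata (columns); numbers are invented for illustration.
p_cases = np.array([[0.20, 0.15],
                    [0.20, 0.15],
                    [0.10, 0.20]])   # P(E = i, x_j | D), sums to 1
rr = np.array([[1.0, 1.0],
               [2.0, 1.5],
               [4.0, 2.5]])          # RR_{i|j}, equal to 1 at the reference level

ar_first = 1.0 - np.sum(p_cases / rr)                    # first form of (9)
ar_second = np.sum(p_cases[1:] * (1.0 - 1.0 / rr[1:]))   # second form, i > 1 only
```

Both forms give 0.345 here; the reference level drops out of the second form because its relative risk is one.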

Now let Ex denote an m × p matrix, each of whose rows is an indicator for the level of exposure in the corresponding row of the matrix Z, with the exception of the reference level, where the entries are all zero, and zero for the covariates as well. In other words Ex corresponds to the matrix Z with all columns except those indicating the various levels of exposure set to zero. In the previous example in Section 2, Ex would be the matrix Z with the first and last columns set to zero. The relative risks are estimated by the odds ratios, $\widehat{RR} = \exp(E_x\hat\beta)$. Then from (9) we have (Greenland and Drescher, 1993; Cox, 2006)

$AR = 1 - \frac{(n \ast \hat p)'\{\exp(-E_x\hat\beta)\}}{n'\hat p} = \frac{(n \ast \hat p)'\{1_m - \exp(-E_x\hat\beta)\}}{n'\hat p}$
(10)

With this formulation the variance of the AR depends only on the variability of the estimated parameters. Thus, although this approach does provide model-based estimates of the probabilities Pr(E = i, xj| D), it does not allow variation in the totals, which are fixed in the logistic model, and in this sense does not fully account for the distribution of the exposures and covariates.

In contrast, the expression for the variance of an equivalent expression given by Greenland and Drescher (1993, equation 5) involves the multinomial covariance matrices of the observed counts for cases and controls. Using their notation, we let $z_j'\ (1 \le j \le k)$ denote the unique patterns of only the independent variables, and n1j ≥ 0 denote the number of cases and n0j ≥ 0 the number of controls with a given pattern, with totals nj = n1j + n0j. Thus multinomial variation is assumed for the counts (n0j) and (n1j). The estimated counts are based on the model rather than the data, and so are model-based. The variance matrix for the totals is then the sum of the two multinomial covariance matrices, reflecting the independent sampling of cases and controls. While this is an interesting approach, it also seems somewhat inconsistent since the total numbers of cases and controls are treated as random by the logistic model but as fixed constants in the multinomial covariance matrices. Overall, however, the approach appears useful, and has been implemented in the post-estimation command aflogit in Stata (Brady, 1998). This command calculates the AR and component ARi for the individual exposure categories, as well as their standard errors and asymmetric confidence intervals based on the log(1 − ·) transformation using the approach of Greenland and Drescher (1993).

Rather than pursue this approach, we turn to the following expression for the “case-control likelihood” from Greenland and Drescher (1993, equation 1),

$\prod_j \frac{\exp(z_j'\hat\beta)^{n_{1j}}}{\{1 + \exp(z_j'\hat\beta)\}^{n_j}}\;\pi(z_j)^{n_{1j}+n_{0j}}\;\mu^{n_{1\cdot}}(1-\mu)^{n_{0\cdot}}$

where $\pi(z_j)^{n_j}$ is proportional to the multinomial likelihood for the totals, and $\mu = n_{1\cdot}/n_{\cdot\cdot}$ is a constant. The parameter estimates, $\hat\pi = (n_1, \ldots, n_k)/n_{\cdot\cdot}$, are based on the saturated model. This is an actual likelihood, which is in fact proportional to the likelihood for the Poisson model, also saturated in the exposures and covariates, corresponding to the logistic model. We propose to use this Poisson model as an alternative to the approach of Greenland and Drescher (1993). The expression for the model-based AR is similar to (10), but the partial derivatives are different since the model now includes k − 1 additional parameters for the totals (nj). The advantage of this approach is that it is based on the (Poisson) likelihood; the disadvantage is that the totals $n_{1\cdot}$ for cases and $n_{0\cdot}$ for controls are (consistent with the model) treated as random, which, however, may not be consistent with the sampling design. This might lead one to suspect that the variance calculated from the Poisson model would be larger than the estimate proposed by Greenland and Drescher (1993). Examples show that this is sometimes, but not always, the case.

We now consider the AR based on the loglinear model. To discuss this model we can use the same framework as for cohort studies; that is, the m × p model matrix Z for the logistic model includes rows for both cases and controls. The Poisson model matrix W = [Zd, U] is constructed as before using the case indicator zd. The predicted probabilities are denoted $\hat p = p(\hat\beta) = \mathrm{expit}(V\hat\beta)$ as before and the predicted counts for cases and controls by $\hat\mu = \exp(W\hat\beta)$. For computation of the odds ratios, the matrix Ex must include additional columns of zeroes for the additional parameters, i.e., Ex is based on W rather than Z, with the columns corresponding to U all set to zero. From (10) we then have

$AR = 1 - \frac{\{\hat\mu \ast \hat p\}'\exp(-E_x\hat\beta)}{\hat\mu'\hat p} = \frac{\{\hat\mu \ast \hat p\}'\{1_m - \exp(-E_x\hat\beta)\}}{\hat\mu'\hat p}$
(11)

For the loglinear model the totals also depend on the parameters, so that the partial derivatives for the numerator and denominator of (11) that are needed for (6) are more complicated than for the logistic model.

$\frac{\partial}{\partial\beta}\,\hat\mu'\hat p = \hat p'(\hat\mu \ast W) + \hat\mu'\{\hat p \ast (1_m - \hat p) \ast V\}$

and

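$\frac{\partial}{\partial\beta}(\hat\mu \ast \hat p)'\{1_m - \exp(-E_x\hat\beta)\} = [\hat p \ast \{1_m - \exp(-E_x\hat\beta)\}]'(\hat\mu \ast W) + \hat\mu'[\{1_m - \exp(-E_x\hat\beta)\} \ast \hat p \ast (1_m - \hat p) \ast V + \hat p \ast \exp(-E_x\hat\beta) \ast E_x]$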

A simpler expression can be obtained by working with 1 − AR (9).
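Hand-derived gradients of this kind are easy to get wrong, so it is prudent to compare them with finite differences before using them in the delta method. The sketch below does this for the numerator of (11), differentiating the three factors term by term; the design matrices follow the four-exposure, one-covariate example, and the parameter vector `beta` is an arbitrary hypothetical value used only for the check.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Design matrices for the 4-exposure, 1-binary-covariate example
x, d, e = (np.array(c, dtype=float) for c in zip(
    *[(x, d, e) for x in range(2) for d in range(2) for e in range(4)]))
E = np.column_stack([(e == i) for i in (1, 2, 3)]).astype(float)
Z = np.column_stack([np.ones(16), E, x])
W = np.column_stack([d[:, None] * Z, Z, E * x[:, None]])   # [Z_d, U]
V = np.column_stack([Z, np.zeros((16, 8))])                # [Z, 0]
Ex = np.zeros_like(W)
Ex[:, 1:4] = E                                             # exposure columns only

def numerator(beta):
    """(mu * p)'{1 - exp(-Ex beta)}, the numerator of (11)."""
    mu, p = np.exp(W @ beta), expit(V @ beta)
    return (mu * p) @ (1.0 - np.exp(-Ex @ beta))

def numerator_gradient(beta):
    mu, p = np.exp(W @ beta), expit(V @ beta)
    r = np.exp(-Ex @ beta)
    return ((p * (1 - r)) @ (mu[:, None] * W)
            + (mu * (1 - r) * p * (1 - p)) @ V
            + (mu * p * r) @ Ex)       # d/dbeta of 1 - exp(-Ex beta) is +r * Ex

# Arbitrary parameter vector (hypothetical, for checking derivatives only)
beta = np.linspace(-0.5, 0.5, 13)
h = 1e-6
numeric = np.array([(numerator(beta + h * u) - numerator(beta - h * u)) / (2 * h)
                    for u in np.eye(13)])
max_abs_error = np.max(np.abs(numeric - numerator_gradient(beta)))
```

The same pattern (a function for the quantity, a function for its analytic gradient, and a central-difference comparison) can be reused for the components (12) and the generalized AR (13).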

To compute the component of the AR specific to the ith level of exposure, we need exposure matrices Ei, which are similar to Ex, but denote only the occurrences of the ith level of exposure, so that $E_x = \sum_{i>1} E_i$ and $1_m - \exp(-E_x\hat\beta) = \sum_{i>1}\{1_m - \exp(-E_i\hat\beta)\}$. Thus,

$AR_i = \frac{(\hat\mu \ast \hat p)'\{1_m - \exp(-E_i\hat\beta)\}}{\hat\mu'\hat p}$
(12)

The partial derivatives for (12) are similar to those for the AR (11).

For the generalized AR we again need an expanded matrix G*, which is based on the transpose of the matrix G (Cox, 2006, equation 12)

$gAR = 1 - \frac{\{\hat\mu \ast \hat p \ast \exp(-E_x\hat\beta)\}'\,G^{*}\exp(E_x\hat\beta)}{\hat\mu'\hat p}$
(13)

If the rows of the data matrix are sorted by exposure within disease within covariate categories, then we have $G^{*} = I_{2J} \otimes G^{T}$; in our previous example, G* = diag[G^T, G^T, G^T, G^T]. To apply the delta method using (6) we only need the partial derivatives for the numerator of (13).

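$\frac{\partial}{\partial\beta}\{\hat\mu \ast \hat p \ast \exp(-E_x\hat\beta)\}'\,G^{*}\exp(E_x\hat\beta) = \{\hat p \ast \exp(-E_x\hat\beta) \ast G^{*}\exp(E_x\hat\beta)\}'(\hat\mu \ast W) + \{G^{*}\exp(E_x\hat\beta)\}'\{\hat\mu \ast \exp(-E_x\hat\beta) \ast \hat p \ast (1_m - \hat p) \ast V - \hat\mu \ast \hat p \ast \exp(-E_x\hat\beta) \ast E_x\} + \{\hat\mu \ast \hat p \ast \exp(-E_x\hat\beta)\}'\,G^{*}\{\exp(E_x\hat\beta) \ast E_x\}$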

4. Illustrations

To compute model-based estimates and their variances, we must be able to fit the Poisson regression model and store the parameter estimates and their covariance matrix for further calculations. Cox (2006) described one approach, involving a general purpose program for maximum likelihood estimation. Here we consider an approach based on statistical software with the ability to perform matrix calculations, so that the computations described above can be programmed directly. Computations for our examples were performed in S-Plus. To compare our results with the approach of Greenland and Drescher (1993) for case-control studies, we include results from the aflogit post-estimation command in Stata. For cohort studies Greenland and Drescher used a logit model that does not account for estimation of the covariate distribution (Basu and Landis, 1995), so that the standard errors will be too small.

The bootstrap provides an alternative to the delta method for the computation of variances. An attractive feature of bootstrap resampling is that the computation of the AR for both types of studies can be based on the logistic regression model, which simplifies the calculations, particularly since values of nj = 0 can be omitted. In addition, for cohort studies other link functions besides the logit can be used, as illustrated in Section 4.1.2. Computation of the AR for the bootstrap samples can be based on equation (4) for cohort studies, and for case-control studies equation (10) can be used. In our analyses bootstrap computations were performed in SAS 9.2 using the LOGISTIC procedure, which allows a limited number of other link functions, and also has an additional SCORE statement that facilitates computation of the probabilities p1 (z) = g−1(Z1z) required for (4).
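The resampling idea can be sketched in a few lines for the simplest setting, a two-by-two table with no covariates, where the fitted probabilities are the sample proportions and no regression fit is needed inside the loop. This is a deliberately simplified illustration and not the SAS procedure described above: the counts, the seed, the function `ar_2x2`, and the choice of a percentile interval are all assumptions made here. With covariates, the body of the loop would instead refit the binary regression and evaluate (4) or (10).

```python
import numpy as np

def ar_2x2(n):
    """AR for a 2x2 table n[exposure, disease] with unexposed in row 0."""
    return 1.0 - (n[0, 1] / n[0].sum()) / (n[:, 1].sum() / n.sum())

# Hypothetical cross-sectional counts (invented for illustration)
n = np.array([[450, 50],
              [350, 150]])
N = int(n.sum())
cell_probs = n.ravel() / N

rng = np.random.default_rng(2012)          # fixed seed for reproducibility
B = 2000
# Resampling N subjects with replacement is a multinomial draw on the cells.
boot = np.array([ar_2x2(rng.multinomial(N, cell_probs).reshape(2, 2))
                 for _ in range(B)])

ar_hat = ar_2x2(n)
se_boot = boot.std(ddof=1)                            # bootstrap standard error
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])    # percentile interval
```

For a cross-sectional cohort all subjects are resampled together, as here; for a case-control design one would presumably resample cases and controls separately to respect the sampling scheme.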

Our first example for cohort studies uses a data set considered by Basu and Landis (1995) who presented data (Table 1) from NHANES II, involving 966 women classified by two racial categories and four categories of exposure based on body mass index, with diastolic blood pressure above the 90th percentile as the outcome. We consider both the original data and a reduced data set with five (rows 1, 5 and 14–16) of the 16 counts set to zero, for a total sample size of 671. A second example used pooled binary regression with the complementary log-log link (g(p) = log[−log(1 − p)]) to analyze time to seroconversion data from the Multicenter AIDS Cohort Study (MACS), a longitudinal cohort study of the natural history of HIV infection among MSM (men who have sex with men) in the United States (Ostrow et al., 2009). Briefly, during the follow-up period (1998–2008) there were 57 HIV seroconverters among 1,667 initially HIV-seronegative men. The regression model included all 8 combinations of three sex-drugs (inhaled nitrites or “poppers”, stimulants, and EDDs, erectile dysfunction drugs) used at the current or previous semi-annual visit, adjusting for other risk factors including sexual behavior, alcohol and other drugs used, and depression. For additional details see Ostrow et al. (2009). For case-control studies we consider the Ille-et-Vilaine case-control study of esophageal cancer. The data are taken from Appendix I of Breslow and Day (1980), and were discussed extensively by Benichou (1991). There are 200 cases and 775 controls; the exposure variable is alcohol consumption, having four ordered categories with the lowest level of consumption as the reference. Covariates are age and smoking, each with three levels.

Values of the total and component attributable risks and their standard errors (second row) from a cohort study with four exposure categories and one binary covariate, using Poisson and logistic regression models. The left hand columns are for the full ...

4.1 Cohort studies

4.1.1 Example 1

This is the example discussed in Section 2, having four categories of exposure and a single binary covariate. Results using the Poisson model to estimate the AR and the components ARi for the three categories of exposure above the reference are shown in the first column of Table 1. Bootstrap standard errors based on 2000 samples, in the second column, were consistent with the large sample standard errors from the Poisson model and the delta method. Values from the general purpose MLE program used by Cox (2006, Section 5.2.2) were nearly identical, and are included for reference in the third column. The logistic regression model (using Stata) also gave identical estimates and similar standard errors. As expected the standard errors are smaller than those from the Poisson model, although the differences are slight for this example. We also computed the generalized AR for parameter values of q1 = .80, q2 =.20, which determine the matrix G. The result was gAR (SE) = .286 (.063), which is identical to the penultimate row of Table 3 of Cox (2006), based on the MLE program with the Poisson model.

For a further comparison of the Poisson and logistic regression approaches we consider a modification of this example, in which counts for five of the 16 rows in the data were set to zero. Results for this rather extreme example are shown in the second half of Table 1 (Example 1b), and more clearly illustrate the differences between the two approaches, confirming that the Poisson model gives reasonable standard errors compared to the bootstrap, while those based on the logistic model are too small.

4.1.2 Example 2

Results of the pooled binary regression analysis are summarized in Table 2, which includes the total AR as well as the component AR values for all seven active drug combinations. Standard errors were calculated using the bootstrap. In addition we computed a total AR (SE) of .7408 (.1063) when four categories of frequency of unprotected receptive anal sex partners (URASP) were added to the exposure model. In this case the model for the exposure categories was not saturated.

Table 2. Component and Total AR values for the risk of seroconversion from categories of recreational drug use using a pooled, binary regression model with the complementary log-log link. Standard errors were calculated using the bootstrap.

4.2 Case-control studies

4.2.1 Example 1

The first example has four categories of alcohol exposure and no covariates. The results from the Poisson model in the first column of Table 3 are consistent with the bootstrap in the second column, and with the results from the logistic model using the approach of Greenland and Drescher (1993), but not with the results based solely on the logistic model. The results are also consistent with those obtained using the general purpose MLE program (Cox, 2006, Section 5.1.1). For comparison we also computed the generalized AR for the same parameter values of q1 = .80, q2 = .20, giving the result gAR (SE) = .6809 (.0487).
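When there are no covariates, the point estimate for a case-control study can be illustrated with the formula of Bruzzi et al. (1985), AR = 1 − Σi ρi / RRi, where ρi is the proportion of cases at exposure level i and the relative risk RRi is approximated by the odds ratio. The Python sketch below uses hypothetical counts rather than the Ille-et-Vilaine data, assumes all cells are nonzero, and takes the component for level i to be ρi(1 − 1/ORi), which is an assumption about how the total is apportioned.

```python
def case_control_ar(cases, controls):
    """Total and component ARs for one exposure factor and no covariates.

    cases[i], controls[i]: counts at exposure level i (i = 0 is the
    reference). Odds ratios stand in for relative risks; all counts are
    assumed to be nonzero.
    """
    n_cases = sum(cases)
    comps = []
    for a_i, b_i in zip(cases[1:], controls[1:]):
        odds_ratio = (a_i * controls[0]) / (cases[0] * b_i)
        rho_i = a_i / n_cases                # Pr(E = i | D)
        comps.append(rho_i * (1 - 1 / odds_ratio))
    return sum(comps), comps

# Hypothetical counts for four ordered exposure levels.
hyp_cases = [30, 70, 60, 40]
hyp_controls = [400, 250, 100, 50]

ar, components = case_control_ar(hyp_cases, hyp_controls)
print("total AR:", round(ar, 4))
print("component ARs:", [round(c, 4) for c in components])
```

In this unadjusted setting the sum simplifies to 1 − Pr(E = 1 | D) / Pr(E = 1 | control), which gives a quick check on the calculation.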

Table 3. Values of the total and component attributable risks and their standard errors from two different models for a case-control study, each with four exposure categories, from Poisson and logistic models. For the third and seventh columns, a general purpose ...

4.2.2 Example 2

In this case we have a binary exposure to any alcohol as well as an interaction between alcohol and age; we consider exposure and the interaction as four categories of exposure. The covariates are age, smoking and their interaction. The results are shown in the right-hand side of Table 3. Of interest is the fact that the bootstrap standard errors are quite consistent with the results from the Poisson model; the results using the approach of Greenland and Drescher (1993) look reasonable as well, although the bootstrap results agree somewhat better with those from the loglinear model.

5. Discussion

The primary goal of this paper is to discuss methods for computing the model-based estimate of the adjusted AR and its standard error using a Poisson regression model and the delta method. The Poisson model is equivalent to the standard logistic model, so that the estimate of the attributable risk is the same, although the standard error is larger. We have argued that the Poisson model is more appropriate since it accounts for estimation of the joint distribution of exposure and covariates using the data. It is hoped that the formulas that were provided will facilitate the use of the loglinear model. For anyone interested in software, we have provided an electronic supplement containing S-Plus programs for the first examples for both cohort and case-control studies; these programs reproduce selected results in Tables 1 and 3 and in the text.
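As a small illustration of the delta method with a Poisson model, consider the simplest cohort setting: one binary exposure, no covariates and a saturated loglinear model, so that the fitted means equal the observed counts and the variance of each count is estimated by the count itself. The Python sketch below is a simplification under those assumptions and is not the general matrix expression given earlier; the counts are hypothetical and the gradient is obtained by central differences rather than analytically.

```python
import math

def ar_2x2(a, b, c, d):
    """AR for a binary exposure: 1 - Pr(D | unexposed) / Pr(D).

    a, b: exposed with / without disease; c, d: unexposed with / without.
    """
    n = a + b + c + d
    return 1 - (c / (c + d)) / ((a + c) / n)

def delta_se_poisson(a, b, c, d, h=1e-4):
    """Delta-method SE treating the four counts as independent Poisson.

    Var(AR) is approximated by sum_k (dAR/dcount_k)^2 * count_k, with each
    partial derivative computed by a central difference of width 2h.
    """
    counts = [a, b, c, d]
    var = 0.0
    for k in range(4):
        up = list(counts)
        down = list(counts)
        up[k] += h
        down[k] -= h
        grad_k = (ar_2x2(*up) - ar_2x2(*down)) / (2 * h)
        var += grad_k ** 2 * counts[k]
    return math.sqrt(var)

# Hypothetical 2 x 2 cohort table.
hyp_counts = (40, 160, 30, 570)
print("AR:", round(ar_2x2(*hyp_counts), 4))
print("delta-method SE:", round(delta_se_poisson(*hyp_counts), 4))
```

Because the AR depends on the counts only through ratios, the same variance is obtained whether the total sample size is treated as fixed or as Poisson, which is one way to see why the loglinear formulation is convenient.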

In most of the examples in the previous section the model for exposure was saturated, so that the number of exposure parameters was one less than the number of categories. For example, two different binary exposures would yield four exposure categories and three exposure parameters. Although this is frequently the case in applications, it is not necessary, as shown by Example 4.1.2. The subsequent calculations are the same, but the results are different; in particular, the component ARs no longer sum to the total.

Graubard and Fears (2005) discussed model-based estimates of the adjusted AR in the sample survey context. They assumed known sampling probabilities and proposed weighted estimates based on a logistic model. Standard errors were calculated using the method of Taylor deviates, which is closely related to the delta method. For case-control studies they showed that their variance estimate is approximately the same as that proposed by Benichou and Gail (1990), which uses the sample proportions to estimate the probabilities Pr(E = i, xj|D) and gave similar results to those obtained by Cox (2006) using (logistic) model-based estimates. This is consistent with the fact that the estimator in equation 3 of their paper involves a weighted average of the binary variable denoting case status. For cohort studies, they also used the logistic model, and the estimator in their equation 4 depends only on the estimated regression coefficients from the model; for this reason one would anticipate results similar to the approach of Greenland and Drescher (1993). Of course their approach could be used with the Poisson model as well, provided that one has a sampling design.

In summary we have proposed a unified approach for both cohort and case-control studies based on the Poisson model and the delta method, which will hopefully encourage the use of model-based estimates. For those so inclined, the bootstrap provides an alternate approach to the computation of the standard error or confidence limits that is in some respects simpler and that appears to work quite well. In fact in our examples, which involved fairly large sample sizes, results using the bootstrap agreed with those from the delta method applied to the Poisson model.

Acknowledgments

Contract/grant sponsor: National Institute of Allergy and Infectious Diseases; contract/grant numbers: UO1-AI-35042, UO1-AI-35043, UO1-AI-35039, UO1-AI-35040, UO1-AI-35041, UO1-AI-35004, UO1-AI-31834, UO1-AI-34994, UO1-AI-34989, UO1-AI-34993, UO1-AI-42590.

Contract/grant sponsor: National Institute of Child Health and Human Development: U01-CH-32632.

Comments and additional references from an anonymous reviewer considerably improved the presentation of the proposed method.

References

• Agresti A. Categorical Data Analysis. 2nd ed. New York: John Wiley & Sons; 2002.
• Basu S, Landis JR. Model-based estimation of population attributable risk under cross-sectional sampling. American Journal of Epidemiology. 1995;142:1338–1343.
• Benichou J. Methods of adjustment for estimating the attributable risk in case-control studies: a review. Statistics in Medicine. 1991;10:1753–1773.
• Benichou J. A review of adjusted estimators of attributable risk. Statistical Methods in Medical Research. 2001;10:195–216.
• Benichou J, Gail MH. Variance calculations and confidence intervals for estimates of attributable risk based on logistic models. Biometrics. 1990;46:991–1003.
• Brady AR. sbe21: Adjusted population attributable fractions from logistic regression. Stata Technical Bulletin. 1998;42:8–12.
• Breslow N, Day NE. Statistical Methods in Cancer Research. Vol. 1: The Analysis of Case-control Studies. Scientific Publications No. 32. Lyon: International Agency for Research on Cancer; 1980.
• Bruzzi P, Green SB, Byar DP, Brinton LA, Schairer C. Estimating the population attributable risk for multiple risk factors using case-control data. American Journal of Epidemiology. 1985;122:904–914.
• Cox C. Delta Method. In: Armitage P, Colton T, editors. Encyclopedia of Biostatistics. New York: John Wiley & Sons; 1998. pp. 1125–1127.
• Cox C. Model-based estimation of the attributable risk in case-control and cohort studies. Statistical Methods in Medical Research. 2006;15:611–625.
• Drescher K, Becher H. Estimating the generalized impact fraction from case-control data. Biometrics. 1997;53:1170–1176.
• Eide GE, Gefeller O. Sequential and average attributable fractions as aids in the selection of preventive strategies. Journal of Clinical Epidemiology. 1995;48:645–655.
• Eide GE, Heuch I. Attributable fractions: fundamental concepts and their visualization. Statistical Methods in Medical Research. 2001;10:159–193.
• Eide GE, Heuch I. Average attributable fractions: A coherent theory for apportioning excess risk to individual risk factors and subpopulations. Biometrical Journal. 2006;48:820–837.
• Graubard BI, Fears TR. Standard errors for attributable risk for simple and complex sample designs. Biometrics. 2005;61:847–855.
• Greenland S, Drescher K. Maximum likelihood estimation of the attributable fraction from logistic models. Biometrics. 1993;49:865–872.
• Lehnert-Batar A, Pfahlberg A, Gefeller O. Comparison of confidence intervals for adjusted attributable risk estimates under multinomial sampling. Biometrical Journal. 2006;48:805–819.
• McCullagh P, Nelder JA. Generalized Linear Models. 2nd ed. London: Chapman and Hall; 1989.
• Ostrow DG, Plankey MW, Cox C, Li X, Shoptaw S, Jacobson LP, Stall RC. Specific sex drug combinations contribute to the majority of recent HIV seroconversions among MSM in the MACS. Journal of Acquired Immune Deficiency Syndromes. 2009;51:349–355.
• Walter SD. The estimation and interpretation of attributable risk in health research. Biometrics. 1976;32:829–849.
