J R Stat Soc Ser C Appl Stat. Author manuscript; available in PMC 2016 Aug 1.
PMCID: PMC4495770
NIHMSID: NIHMS641826
PMID: 26166904

Optimal retesting configurations for hierarchical group testing

SUMMARY

Hierarchical group testing is widely used to test individuals for diseases. This testing procedure works by first amalgamating individual specimens into groups for testing. Groups testing negatively have their members declared negative. Groups testing positively are subsequently divided into smaller subgroups and are then retested to search for positive individuals. In our paper, we propose a new class of informative retesting procedures for hierarchical group testing that acknowledges heterogeneity among individuals. These procedures identify the optimal number of groups and their sizes at each testing stage in order to minimize the expected number of tests. We apply our proposals in two settings: 1) HIV testing programs that currently use three-stage hierarchical testing and 2) chlamydia and gonorrhea screening practices that currently use individual testing. For both applications, we show that substantial savings can be realized by our new procedures.

Keywords: Classification, HIV, Infertility Prevention Project, Informative retesting, Pooled testing, Retesting

1. INTRODUCTION

Screening for diseases, such as HIV, West Nile Virus, and hepatitis C, has benefited greatly from the use of group testing (also known as “pooled testing”). For these applications, specimens (e.g., blood, urine) from separate individuals are amalgamated into a single specimen. Individuals within negative testing groups are declared negative. Individuals within positive testing groups are retested in some predetermined manner to decode the positive individuals from the negative individuals. As long as group sizes are judiciously chosen, group testing can significantly reduce the overall number of tests required which, as a consequence, reduces costs. Group testing has been successfully applied in numerous settings, including chlamydia and gonorrhea testing as part of the Infertility Prevention Project (Lewis et al. 2012); blood-donor screening for HIV, hepatitis B, and hepatitis C (American Red Cross 2014); and testing known HIV-positive individuals to detect antiretroviral treatment failure (Smith et al. 2009).

Group testing methods are generally divided into hierarchical and non-hierarchical categories; we focus on hierarchical methods in this paper. Hierarchical methods divide positive groups into two or more non-overlapping subgroups, which are then tested. If any of these subgroups tests positively, additional stages of dividing are used until individual testing takes place at the last stage. For any hierarchical application, the number of subgroups, their sizes, and the number of stages are all important choices for minimizing the number of tests and the corresponding costs.

This paper proposes new hierarchical group testing methods by taking advantage of recent advances in group testing collectively known as “informative retesting” (Bilder et al. 2010, McMahan et al. 2012a, McMahan et al. 2012b, Black et al. 2012). Succinctly put, informative retesting algorithms incorporate individual probabilities of positivity into the testing process. To obtain these probabilities, binary regression models are estimated using individual disease statuses and risk factors from a training data set. By exploiting the heterogeneity among the estimated probabilities, we propose a new class of three-stage and higher hierarchical methods that reduce the number of tests needed in comparison to those hierarchical methods which do not use informative retesting. Our procedures identify the optimal number of subgroups and their sizes at each stage, while allowing for multiple stages. No other informative retesting procedure is as flexible in its application.

Hierarchical group testing is already applied in a large number of disease testing settings. In particular, we examine publicly funded HIV testing programs at testing sites throughout the United States. Sherlock et al. (2007) summarize the algorithms in use at some of these testing sites. For example, the site in Los Angeles initially pools specimens into groups of size 90. If a group tests positively, it is divided into 9 subgroups of size 10. If any of these subgroups tests positively, individual testing is then used in the third and final stage. When discussing the different ways that group testing is implemented, Sherlock et al. (2007) remark that “the most efficient approach remains to be determined.” With this as motivation, we apply our methods to determine if they could reduce the number of tests needed to diagnose all individuals.

Our second example focuses on chlamydia and gonorrhea screening activities sponsored by the Centers for Disease Control and Prevention and the Office of Population Affairs in the United States. These activities, which were part of the Infertility Prevention Project (IPP) until 2013, involve testing urine and swab specimens at laboratories across the country. Due to the high volume of specimens, many states use group testing to save money. With each specimen tested, risk factor information, such as gender, sexual history, and clinician observations, is available. This has prompted at least one state, Idaho, to implement the “threshold optimal Dorfman” informative retesting procedure proposed in McMahan et al. (2012a) (Lewis et al. 2012). As we demonstrate herein, our new methods are typically better and are only minimally more complex to implement.

An outline of our paper is as follows. In Section 2, we derive the expected number of tests and measures of classification accuracy for hierarchical group testing when individuals have different probabilities of positivity. Using these derivations, we develop new hierarchical methods that attempt to minimize the expected number of tests. In Section 3, we investigate our methods in a controlled setting to examine the effects of heterogeneity on the expected number of tests and classification accuracy. In Sections 4 and 5, we apply our methods to the HIV and IPP examples, respectively. Finally, in Section 6, we summarize our work and discuss future research.

2. HIERARCHICAL GROUP TESTING

2.1 Expected number of tests

Consider a group of $I$ individuals that are to be screened for a disease using group testing. Define $G_{sj}$ as a binary random variable denoting the test status for group (or subgroup) $j$ at the $s$th stage, where 0 denotes a negative test result and 1 denotes a positive test result. For example, $G_{11}$ denotes the outcome for the initial group test. The number of individuals within the group corresponding to $G_{sj}$ is defined as $I_{sj}$, where $I_{11} = I$. If $G_{sj} = 0$, all individuals within the corresponding group are declared negative. If $G_{sj} = 1$, the corresponding group is divided into $m_{sj}$ subgroups for the next stage of testing. Define $c_s$ as the total possible number of subgroups tested at stage $s$, where $c_1 = 1$ and $c_s = \sum_{j=1}^{c_{s-1}} m_{s-1,j}$ for $s = 2, \ldots, K$.

To help explain this notation, consider the three-stage algorithm described in Section 1 for HIV testing in Los Angeles. Specimens from individuals are placed into initial groups of size $I = 90$. If $G_{11} = 0$, all individuals are diagnosed as negatives. If $G_{11} = 1$, the initial group is divided into $m_{11} = 9$ non-overlapping subgroups of size $I_{21} = \cdots = I_{29} = 10$ for stage 2 testing. For any subgroup at stage 2 that tests negatively, i.e., $G_{2j} = 0$, the corresponding individuals are declared negative. Any subgroup with $G_{2j} = 1$ is further divided into $m_{2j} = 10$ subgroups of size one. The maximum number of subgroups tested at stage 3 is $c_3 = 90$ because individual testing occurs. We provide additional examples of other hierarchical algorithms in the on-line supplementary materials.

Choosing the number of stages K, the number of subgroups msj, and the subgroup sizes Isj are all important decisions. Our goal is to develop methods that determine these quantities by minimizing the expected number of tests for a potential application. The expected number of tests for an initial group of I individuals is

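$$E(T) = 1 + \sum_{s=1}^{K-1} \sum_{j=1}^{c_s} m_{sj}\, P\Bigl(\bigcap_{(s',j'):\, B_{s'j'} \supseteq B_{sj}} \{G_{s'j'} = 1\}\Bigr) \qquad (1)$$

over $K$ stages, where $T$ is the number of tests and $B_{sj}$ denotes the set of individuals in group $j$ at stage $s$, so that the intersection runs over group $(s, j)$ and every earlier-stage group that contains it. This expression arises by noting that $m_{sj}$ more tests are performed whenever the $j$th group of size $I_{sj} > 1$ tests positively at stage $s$. The probability in Equation (1) is a joint probability representing a succession of groups testing positively up to and including $G_{sj} = 1$. For example, $E(T)$ for the Los Angeles application is found as

$$E(T) = 1 + m_{11} P(\{G_{11} = 1\}) + m_{21} P(\{G_{11} = 1\} \cap \{G_{21} = 1\}) + m_{22} P(\{G_{11} = 1\} \cap \{G_{22} = 1\}) + \cdots + m_{29} P(\{G_{11} = 1\} \cap \{G_{29} = 1\}).$$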

To find the joint probability in Equation (1), we need to re-express it as a function of the true group statuses $\tilde{G}_{sj}$ to account for testing error. Suppose a single, properly calibrated assay is used to test groups so that the sensitivity $S_e = P(G_{sj} = 1 \mid \tilde{G}_{sj} = 1)$ and the specificity $S_p = P(G_{sj} = 0 \mid \tilde{G}_{sj} = 0)$ are constant for all group sizes to be considered. The joint probability in Equation (1) is

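$$P\Bigl(\bigcap_{(s',j'):\, B_{s'j'} \supseteq B_{sj}} \{G_{s'j'} = 1\}\Bigr) = (1 - S_p)^s \Bigl\{\prod_{i=1}^{I} (1 - p_i)\Bigr\} + \sum_{a=1}^{s-1} S_e^{a} (1 - S_p)^{s-a} \Bigl\{\prod_{i \in B_{a+1,j}} (1 - p_i)\Bigr\} \Bigl\{1 - \prod_{i \in B'_{a+1,j}} (1 - p_i)\Bigr\} + S_e^{s} \Bigl\{1 - \prod_{i \in B_{sj}} (1 - p_i)\Bigr\}, \qquad (2)$$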

where $p_i$ is the probability that individual $i$ is truly positive. The notation “$i \in B_{sj}$” is understood to mean those individuals who belong to the $j$th ordered group at the $s$th stage, and “$i \in B'_{sj}$” denotes the set of individuals within the parent group of $B_{sj}$ excluding those in $B_{sj}$ itself. For example, consider again the application in Los Angeles. The notation $i \in B_{29}$ denotes indices of those individuals who belong to the ninth subgroup formed at stage 2. If individuals were placed into groups by numerical order, individuals 81 to 90 would correspond to $i \in B_{29}$, and individuals 1 to 80 would correspond to $i \in B'_{29}$.

The right-hand side of Equation (2) is written as the sum of three distinguishable components. The first component represents $P\bigl(\bigcap_{(s',j'):\, B_{s'j'} \supseteq B_{sj}} \{G_{s'j'} = 1\} \cap \{\tilde{G}_{s'j'} = 0\}\bigr)$, i.e., the probability that all groups included in the intersection test positively when they are all truly negative. The third component is the same as the first except that all groups are truly positive. The middle component includes terms within the summand that correspond to $a$ truly positive groups and $s - a$ truly negative groups, for $a = 1, \ldots, s - 1$. One will notice that Equation (2) is written the same way as Equation (2) in Black et al. (2012), which examined the special case where positive groups are halved. This equivalence arises because of our $B_{sj}$ and $B'_{sj}$ notation. However, unlike Black et al. (2012), we attempt to find the specific hierarchical algorithm which provides the smallest expected number of tests, as described next.
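Equations (1) and (2) can be evaluated directly for any nested configuration. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names and the representation of a configuration as one list of ordered subgroup sizes per stage (the first list being the initial group, each later list refining the previous one) are illustrative choices.

```python
from math import prod

def intervals(sizes):
    """Turn ordered group sizes into half-open index ranges (start, end)."""
    out, start = [], 0
    for n in sizes:
        out.append((start, start + n))
        start += n
    return out

def joint_positive_prob(p, chain, se, sp):
    """Equation (2): probability that every group in a nested chain tests positive.
    chain[0] is the initial group; chain[-1] is the group of interest."""
    def all_negative(lo, hi):
        return prod(1.0 - p[i] for i in range(lo, hi))

    s = len(chain)
    total = (1.0 - sp) ** s * all_negative(*chain[0])      # all groups truly negative
    for a in range(1, s):
        (plo, phi), (lo, hi) = chain[a - 1], chain[a]
        rest_negative = all_negative(plo, lo) * all_negative(hi, phi)
        # first a groups truly positive, remaining s - a groups truly negative
        total += (se ** a * (1.0 - sp) ** (s - a)
                  * all_negative(lo, hi) * (1.0 - rest_negative))
    total += se ** s * (1.0 - all_negative(*chain[-1]))    # all groups truly positive
    return total

def expected_tests(p, stages, se, sp):
    """Equation (1) for a configuration given as one list of sizes per stage."""
    levels = [intervals(sizes) for sizes in stages]
    total = 1.0
    for s in range(len(levels) - 1):
        for lo, hi in levels[s]:
            if hi - lo == 1:
                continue  # a single individual is not retested
            m = sum(1 for a, b in levels[s + 1] if lo <= a and b <= hi)
            chain = [next((a, b) for a, b in levels[t] if a <= lo and hi <= b)
                     for t in range(s + 1)]
            total += m * joint_positive_prob(p, chain, se, sp)
    return total
```

For instance, `expected_tests([0.1] * 4, [[4], [2, 2], [1, 1, 1, 1]], 0.95, 0.95)` evaluates a three-stage configuration that halves an initial group of four; the probabilities in `p` are assumed to be listed in the order in which individuals are placed into subgroups.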

2.2 Optimal retesting configurations

Before an application of group testing begins, we will not necessarily know the retesting configuration (i.e., number of subgroups, their sizes, and their members at each stage) that would result in the smallest number of tests. However, we can examine potential retesting configurations before testing begins and choose the one that minimizes E(T). We refer to this configuration as “optimal.” Of course, E(T) depends on the individual probabilities pi that are likely unknown. In practice, these probabilities would be estimated, and a chosen configuration would then minimize estimates of E(T). For now, we assume all individual probabilities are known. We address estimation in Section 5.

To find the retesting configuration that minimizes $E(T)$, we first order individuals by their probability of positivity within an initial group that tests positively. This helps to isolate those individuals with small and large probabilities while also decreasing the number of possible configurations that need to be examined. Let $p_{(1)} \le p_{(2)} \le \cdots \le p_{(I)}$ denote these ordered individual probabilities. Whenever a group tests positively, we assume that individuals are assigned to subgroups successively by this ordering. For example, a group of size $I = 6$ could be divided into $m_{11} = 3$ subgroups of size $I_{21} = 3$, $I_{22} = 2$, and $I_{23} = 1$. The first subgroup in this retesting configuration contains those individuals having probabilities $p_{(1)}$, $p_{(2)}$, and $p_{(3)}$; the second subgroup contains those individuals having probabilities $p_{(4)}$ and $p_{(5)}$; and the third subgroup contains the individual with probability $p_{(6)}$. Ordering in this manner is intuitive because it allows larger (smaller) subgroups to be formed among the low-probability (high-probability) individuals. This, in turn, can help lead to a reduction in the number of tests. For a group of size $I > 1$, we define the optimal retesting configuration (ORC) as the configuration which minimizes $E(T)$ when ordered individuals are successively placed into subgroups.

2.2.1 All possible configurations

The most direct approach to find the ORC is to calculate $E(T)$ for all possible configurations and to choose the configuration that minimizes $E(T)$. For a three-stage procedure, the number of configurations equates to the number of ways to choose subgroups at stage 2. Splitting the individuals into $b$ (say) subgroups, while respecting the ordering $p_{(1)} \le p_{(2)} \le \cdots \le p_{(I)}$, is equivalent to placing $b - 1$ “partitions” in the $I - 1$ “spaces” between the ordered probabilities; there are $\binom{I-1}{b-1}$ ways to do this. Because we could choose $b$ to be any number between 1 and $I$, the total number of possible configurations for three stages is $\sum_{i=0}^{I-1} \binom{I-1}{i} = 2^{I-1}$.

To illustrate for a simple case, consider an initial group of size $I = 4$. There are $2^{4-1} = 8$ possible configurations of subgroups at stage 2 with sizes: [4], [3,1], [2,2], [1,3], [2,1,1], [1,2,1], [1,1,2] or [1,1,1,1], where the notation “[·]” denotes possible subgroup configurations. For instance, [3,1] means there are two subgroups with individuals corresponding to $p_{(1)}$, $p_{(2)}$, $p_{(3)}$ in the first subgroup and the individual corresponding to $p_{(4)}$ in the second subgroup. Of course, configurations like [3,1] and [1,3] are different because subgroups are formed based on ordered individual probabilities. If needed, a third stage for positive testing subgroups of size two or more leads to individual testing. Note that the enumeration above contains configurations that would not typically be implemented, such as [4] (retest the entire group again), and those that would not allow for a third stage, such as [1,1,1,1]. Therefore, the number of configurations that actually would be considered in practice for three stages in general is less than $2^{I-1}$.
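The enumeration described above can be sketched in a few lines of Python. The function name `stage2_configurations` is hypothetical; the sketch simply places $b - 1$ partitions in the $I - 1$ spaces for every $b$, and it deliberately keeps configurations such as [4] and [1,1,1,1] that would be filtered out in practice.

```python
from itertools import combinations

def stage2_configurations(I):
    """Yield every ordered list of stage-2 subgroup sizes summing to I."""
    for b in range(1, I + 1):                          # b subgroups
        for cuts in combinations(range(1, I), b - 1):  # b - 1 partitions in I - 1 spaces
            edges = (0, *cuts, I)
            yield [edges[k + 1] - edges[k] for k in range(b)]

configs = list(stage2_configurations(4))
print(len(configs))  # number of configurations for I = 4
```

Each yielded list can be passed to any routine that evaluates $E(T)$, and the minimizer over the list is the ORC for three stages.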

Similarly, the number of all configurations involving four stages (the number of ways to choose subgroups at stages 2 and 3) is

$$\sum_{c_2=1}^{I} \binom{I-1}{c_2-1} \sum_{c_3=0}^{c_2-1} \binom{c_2-1}{c_3} = \sum_{c_2=1}^{I} \binom{I-1}{c_2-1} 2^{c_2-1} = 3^{I-1}.$$

For five stages, there are $4^{I-1}$ possible configurations. Through binomial expansion, this pattern can be shown to continue so that a $K$-stage procedure has $(K-1)^{I-1}$ possible configurations (see on-line supplementary materials). While the number of configurations that would actually be considered in practice is somewhat less than this value, it can still be computationally very time-consuming to list those that are possible. For example, calculation times for $E(T)$ with $I = 12$ are less than one second for $K = 3$, approximately 2.4 minutes for $K = 4$, and approximately one hour for $K = 5$ using R 2.15.0 (R Development Core Team, 2012) and a 2.40 GHz core of a processor. For very large $I$, such as $I = 90$ in the Los Angeles example, computational time will be impractical even for $K = 3$.
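One way to see why the pattern continues is sketched below; it is offered as an illustration and is not necessarily the argument given in the supplementary materials. Label each of the $I - 1$ spaces either by the first stage (2, …, $K - 1$) at which it becomes a subgroup boundary or as never becoming a boundary before individual testing. If $i$ spaces are boundaries by stage $K - 1$, each of them can first appear at any of $K - 2$ stages, so that

```latex
\sum_{i=0}^{I-1} \binom{I-1}{i} (K-2)^{i}\, 1^{\,I-1-i}
  = \{(K-2) + 1\}^{I-1}
  = (K-1)^{I-1}.
```

Setting $K = 3$ and $K = 4$ recovers the counts $2^{I-1}$ and $3^{I-1}$ given above.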

It is important to note that we have limited the possible configurations to those constructed sequentially with ordered individual probabilities. In addition to this being intuitive, past research has shown that ordering is the preferred choice. For halving algorithms, Black et al. (2012) proved that ordering always produces E(T) as small or smaller than when not ordering. McMahan et al. (2012a) used ordering in applications of two-stage hierarchical group testing. Also for two stages, Hwang (1975) showed that groups with a larger number of individuals should always have smaller probabilities than groups with fewer individuals.

2.2.2 Steepest descent search algorithm

For some applications, the maximum allowable initial group size will be small due to the size of the overall prevalence, laboratory constraints, or fear that large groups will dilute positive specimens below a detection threshold. In those cases, the all-possible-configuration approach of Section 2.2.1 will work fine. For other applications, these limitations may not be present or not as severe, allowing for large initial groups such as in the Los Angeles example. Examining all possible configurations then may not be possible. To consider these applications, we formulate the optimality problem as an integer program and use the method of steepest descent in an attempt to find the retesting configuration that minimizes E(T). We call this retesting configuration the candidate retesting configuration (CRC) to distinguish it from the ORC.

The method of steepest descent begins by first choosing a starting configuration for a specified number of subgroups at each stage. For each possible subgroup pair, we alter the starting configuration by adding one member to a subgroup and subtracting one member from a different subgroup. We then choose a “better” retesting configuration that has the lowest E(T) among the new ones created. We continue this same process, keeping the number of subgroups the same at each stage, until no other configuration can be found with a smaller E(T). We then repeat this process for all other possible numbers of subgroups. The configuration that minimizes E(T) overall is the CRC. Details regarding the application of this algorithm are given in the on-line supplementary materials.
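The search just described can be sketched for the three-stage case as follows. This is a hedged illustration only: the function names are hypothetical, the near-equal starting sizes are an assumption (the authors' starting values and other details are in the supplementary materials), and the objective is Equations (1) and (2) specialized to three stages with individual testing at the last stage.

```python
from math import prod

def expected_tests_3stage(p, sizes, se, sp):
    """E(T) for three stages, given ordered stage-2 subgroup sizes."""
    p = list(p)
    q_all = prod(1.0 - x for x in p)
    total = 1.0 + len(sizes) * ((1 - sp) * q_all + se * (1 - q_all))
    start = 0
    for n in sizes:
        if n > 1:  # subgroups of size one are resolved at stage 2
            q_in = prod(1.0 - x for x in p[start:start + n])
            q_out = prod(1.0 - x for x in p[:start] + p[start + n:])
            total += n * ((1 - sp) ** 2 * q_all
                          + se * (1 - sp) * q_in * (1 - q_out)
                          + se ** 2 * (1 - q_in))
        start += n
    return total

def steepest_descent(p, b, se, sp):
    """Descend over subgroup sizes while keeping the number of subgroups fixed at b."""
    I = len(p)
    sizes = [I // b + (1 if k < I % b else 0) for k in range(b)]  # assumed start
    best = expected_tests_3stage(p, sizes, se, sp)
    while True:
        candidates = []
        for give in range(b):
            for take in range(b):
                if give != take and sizes[give] > 1:
                    trial = sizes.copy()
                    trial[give] -= 1   # remove one member from one subgroup
                    trial[take] += 1   # and add one member to another
                    candidates.append((expected_tests_3stage(p, trial, se, sp), trial))
        if not candidates:
            return best, sizes
        value, trial = min(candidates)
        if value >= best:
            return best, sizes         # no neighbor improves E(T)
        best, sizes = value, trial

def candidate_configuration(p, se, sp):
    """Repeat the descent for every number of subgroups and keep the best result."""
    p = sorted(p)
    return min(steepest_descent(p, b, se, sp) for b in range(1, len(p) + 1))
```

Calling `candidate_configuration(p, 0.95, 0.95)` returns a pair containing the smallest $E(T)$ found and the corresponding stage-2 sizes. Because each descent stops at the first configuration with no improving neighbor, the result may be a local rather than a global minimum.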

The CRC and ORC will coincide when E(T) is convex upward as a function of the subgroup sizes. Unfortunately, this is not always the case, which could lead to a local minimum being found rather than a global minimum. We give special cases in the on-line supplementary materials where convexity fails. However, despite the absence of convexity in general, we show in Sections 3 and 5 that the CRC results in an E(T) which is the same as or very close to that resulting from the ORC for the cases examined. Therefore, we regard the CRC as a convenient computational alternative to the ORC when all possible configurations cannot be easily enumerated.

2.3 Accuracy measures

In addition to the expected number of tests, the accuracy of correctly classifying truly positive and negative individuals is also important. Define $Y_i = 1\,(0)$ as the positive (negative) diagnosed status of the $i$th individual ($i = 1, \ldots, I$), and define $\tilde{Y}_i = 1\,(0)$ similarly to denote the true status. The probability of a correct positive diagnosis, the pooling sensitivity, is $PS_e(i) = P(Y_i = 1 \mid \tilde{Y}_i = 1)$ for individual $i$. Similarly, the pooling specificity is $PS_p(i) = P(Y_i = 0 \mid \tilde{Y}_i = 0)$ for a correct negative diagnosis. We also define the pooling positive predictive value and the pooling negative predictive value as $PPPV(i) = P(\tilde{Y}_i = 1 \mid Y_i = 1)$ and $PNPV(i) = P(\tilde{Y}_i = 0 \mid Y_i = 0)$, respectively. These predictive values can be useful once a diagnosis has been made.

For the $i$th individual to be diagnosed as positive ($Y_i = 1$), the initial group and all subsequent subgroups containing the individual, including the last subgroup which contains only the $i$th individual, need to test positively as well. Define $j^*$ and $L$ ($L \le K$) as the subgroup and stage at which individual $i$ could be tested individually with respect to the configuration. The pooling sensitivity for this individual is

$$PS_e(i) = P(Y_i = 1 \mid \tilde{Y}_i = 1) = P\Bigl(\bigcap_{(s',j'):\, B_{s'j'} \supseteq B_{Lj^*}} \{G_{s'j'} = 1\} \Bigm| \bigcap_{(s',j'):\, B_{s'j'} \supseteq B_{Lj^*}} \{\tilde{G}_{s'j'} = 1\}\Bigr) = S_e^{L},$$

where we use the standard assumption that test results are conditionally independent once the true status is known (see Litvak et al. 1994, p. 425, p. 430, and Kim and Hudgens 2009, p. 904, for more information). Thus, the pooling sensitivity is the same for each individual testing positively within L stages. Interestingly, this is the same as when all individual probabilities are homogeneous (pi = p for i = 1, …, I); see Kim et al. (2007).

The remaining accuracy measures are found similarly. The pooling specificity is

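$$PS_p(i) = 1 - P(Y_i = 1 \mid \tilde{Y}_i = 0) = 1 - P(\tilde{Y}_i = 0)^{-1} \bigl[P(Y_i = 1) - P(Y_i = 1 \mid \tilde{Y}_i = 1) P(\tilde{Y}_i = 1)\bigr] = 1 - (1 - p_i)^{-1} \Bigl[P\Bigl(\bigcap_{(s',j'):\, B_{s'j'} \supseteq B_{Lj^*}} \{G_{s'j'} = 1\}\Bigr) - S_e^{L} p_i\Bigr],$$

where $P\bigl(\bigcap_{(s',j'):\, B_{s'j'} \supseteq B_{Lj^*}} \{G_{s'j'} = 1\}\bigr)$ is given by Equation (2). Predictive values follow from Bayes’ rule: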

$$PPPV(i) = \frac{p_i\, PS_e(i)}{p_i\, PS_e(i) + (1 - p_i)\{1 - PS_p(i)\}}$$

and

$$PNPV(i) = \frac{(1 - p_i)\, PS_p(i)}{(1 - p_i)\, PS_p(i) + p_i \{1 - PS_e(i)\}}.$$

Because PPPV(i) and PNPV(i) are individual specific, these measures could be useful if confirmatory testing was performed to check individual diagnoses (e.g., an individual with a small PNPV(i) could be tested again to ensure they are truly negative).
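The four measures can be computed together once the joint probability from Equation (2) is available. The sketch below is illustrative: the function name and argument names are assumptions, and `joint_prob` stands for the probability that the initial group and every subgroup containing individual i, down to the individual test at stage L, test positively.

```python
def accuracy_measures(p_i, joint_prob, se, L):
    """Pooling accuracy measures for one individual tested individually at stage L."""
    pse = se ** L                                        # pooling sensitivity
    psp = 1.0 - (joint_prob - pse * p_i) / (1.0 - p_i)   # pooling specificity
    pppv = p_i * pse / (p_i * pse + (1.0 - p_i) * (1.0 - psp))
    pnpv = (1.0 - p_i) * psp / ((1.0 - p_i) * psp + p_i * (1.0 - pse))
    return {"PSe": pse, "PSp": psp, "PPPV": pppv, "PNPV": pnpv}
```

As a check on the formulas, individual testing (L = 1) has joint probability Se·p_i + (1 − Sp)(1 − p_i), and the function then returns the assay's own sensitivity and specificity.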

3. MEAN AND ACCURACY COMPARISONS

In this section, we simultaneously examine how E(T) and measures of classification accuracy are affected by the initial group size, the number of stages, and heterogeneity among the individual probabilities. In addition, we examine how often the ORC and CRC coincide. We continue to assume that pi is known for this investigation, postponing estimation until Section 5.

Suppose $p_i \sim \mathrm{beta}(\alpha, \alpha(1 - p)/p)$ for $i = 1, \ldots, I$, where $\alpha > 0$ and $0 < p < 1$, so that $E(p_i) = p$ and $\mathrm{Var}(p_i) = p^2(1 - p)/(\alpha + p)$. In this family of distributions, note that $p$ represents the overall prevalence for a population. We consider different combinations of $p$ and $\alpha$ to examine how heterogeneity affects $E(T)$ and classification accuracy. Some of these combinations are motivated by the IPP data example in Section 5. Other combinations are chosen because they lead to extreme cases. For example, as $\alpha \to \infty$, the variance for $p_i$ approaches 0 so that individual probabilities become homogeneous ($p_i = p$ for $i = 1, \ldots, I$). Conversely, as $\alpha$ decreases, the variance for $p_i$ increases, which induces more heterogeneity. In fact, one can show that the limiting distribution as $\alpha \to 0$ is Bernoulli with mean $p$ (McMahan et al. 2012b); that is, $p_i = 1$ with probability $p$ and $p_i = 0$ with probability $1 - p$. While this specific distribution is unlikely in application, it is still useful to consider because it maximizes the amount of heterogeneity within a group.

Figure 1 shows values of E(T)/I, the expected number of tests per individual, for CRCs under different combinations of α, p, K, and I when Se = Sp = 0.95. The on-line supplementary materials list the numerical values for E(T)/I along with the group sizes at each stage. We replaced pi with E(p(i)), the expected value of the ordered individual probability, when calculating E(T) in order to obtain an “average” assessment without simulation error. These values of E(p(i)) are calculated in the same manner as described in Black et al. (2012) or through Monte Carlo simulation for the larger initial group sizes. When constructing Figure 1, we found that the ORC and the CRC always were the same for three stages. For four stages, the ORC and CRC always were the same when I ≤ 14; we did not calculate the ORC for I > 14 when K = 4 due to the computational reasons outlined in Section 2.2.1.
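The Monte Carlo route to the expected ordered probabilities can be sketched as follows. This is a minimal illustration using Python's standard library, not the authors' code: the function name, the number of replications, and the seed are assumptions.

```python
import random

def expected_ordered_probs(alpha, p, group_size, reps=20000, seed=1):
    """Monte Carlo estimate of E(p_(1)), ..., E(p_(I)) for beta-distributed p_i."""
    rng = random.Random(seed)
    beta = alpha * (1.0 - p) / p          # second shape parameter, so that E(p_i) = p
    sums = [0.0] * group_size
    for _ in range(reps):
        draw = sorted(rng.betavariate(alpha, beta) for _ in range(group_size))
        for k, value in enumerate(draw):
            sums[k] += value              # accumulate the k-th smallest probability
    return [s / reps for s in sums]
```

The returned list is nondecreasing and its average is close to p, so it can be used in place of the individual probabilities when evaluating E(T).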

Figure 1

Expected number of tests per individual for the CRC using E(p(i)) from beta distributions when Se = Sp = 0.95. Lines for K = 3 have twice the thickness as lines for K = 4. Note that the vertical axis scales are not the same for each plot.

Within Figure 1, we first note that four stages generally reduce E(T)/I when compared to three stages for the same values of α and p. Exceptions can arise when individual probabilities are homogeneous with a large p or with an initial group size I that is small. Second, larger initial group sizes can greatly reduce E(T)/I when the prevalence is very small (e.g., p = 0.01). This finding is consistent with previous research, but it is interesting to note that it persists for the CRC across different levels of heterogeneity and number of stages. Finally, and most importantly, E(T)/I can be greatly reduced as the variability among the pi values increases (α decreases). This figure clearly shows that our procedures can adeptly exploit differences among individuals as populations become more heterogeneous.

Figure 2 displays the classification accuracy measures described in Section 2.3 for p = 0.05 and Se = Sp = 0.95. Plots for p = 0.01, 0.10, and 0.15 are in the on-line supplementary materials. When calculating the accuracy measures, we used the same E(p(i)) substitution as mentioned above. The most important finding from these plots is that the accuracy measures tend to increase as the variability among the pi values increases (α decreases). This is consistent across the different initial group sizes. At the same level of heterogeneity (same levels of α), PSp(i) and PPPV(i) are larger for four stages than for three stages, while the opposite is generally true for PSe(i) and PNPV(i). The reason why PSp(i) and PPPV(i) are larger is that positive diagnoses occur only after multiple positive tests. Thus, a more stringent criterion is needed to be diagnosed as positive with four stages. Conversely, PSe(i) and PNPV(i) tend to be larger for three stages because it takes only one negative test to produce a negative diagnosis.

Figure 2

Accuracy measures for the CRC calculated using E(p(i)) from beta distributions when p = 0.05 and Se = Sp = 0.95. Lines for K = 3 have twice the thickness as lines for K = 4. Note that the vertical axis scales are not the same for each plot.

4. HIV TESTING

Sherlock et al. (2007) describe the use of three-stage hierarchical group testing for HIV testing at different locations in the United States. Our goal is to examine whether their implementations could be more efficient. Table 1 lists the locations with group sizes used at each stage and the overall prevalences observed during the study period. Note that testing in San Francisco was performed using both two and three stages, but we consider this location to use only three stages for demonstration purposes.

Table 1

Three-stage hierarchical group testing for HIV testing. The prevalences, initial group sizes, and actual second stage group sizes are obtained from Table 1 of Sherlock et al. (2007). The optimal second stage group sizes are calculated assuming every individual has a probability of positivity equal to the prevalence (i.e., homogeneity) at a particular location. The last four columns present the percentage reduction in E(T) obtained by using the CRC rather than the optimal second stage under homogeneity, where α is a beta distribution parameter controlling the amount of heterogeneity. All third stages involved individual testing.

| Location | Prevalence (p) | Initial group size | Actual second stage | Optimal second stage (under homogeneity) | Reduction in E(T) from CRC, α = 2 | α = 1 | α = 0.5 | α = 0.1 |
|---|---|---|---|---|---|---|---|---|
| North Carolina | 0.0021 | 90 | 9 groups of size 10 | 10 groups of size 9 | 4.2% | 7.7% | 13.6% | 33.1% |
| Los Angeles | 0.0045 | 90 | 9 groups of size 10 | 10 groups of size 9 | 4.4% | 8.6% | 15.2% | 36.8% |
| San Francisco | 0.0175 | 50 | 5 groups of size 10 | 2 groups of size 7; 6 groups of size 6 | 4.7% | 8.4% | 15.1% | 37.4% |
| Seattle-King County | 0.0164 | 30 | 3 groups of size 10 | 6 groups of size 5 | 3.5% | 7.2% | 12.2% | 32.0% |
| Atlanta | 0.0030 | 48 | 6 groups of size 8 | 6 groups of size 7; 1 group of size 6 | 3.2% | 6.5% | 11.1% | 27.3% |

There are two parts of our evaluation. First, we find the retesting configuration that minimizes E(T) for each location by using the same initial group size and by treating the observed prevalence as the true probability of positivity for each individual. This would have been the ideal way to determine the retesting configuration in the absence of our research. We refer to this configuration as being “optimal” under the assumption of homogeneity. Second, we find the CRC and the corresponding E(T) by taking into account potential heterogeneity among individuals at each location. Because the amount of heterogeneity is unknown, we use beta distributions to describe it as done in Section 3. Our choices for α are motivated by the IPP example in Section 5 and by referee suggestions. Because multiple assays are sometimes used at the same location and because different assays are used across locations, we assume Se = Sp = 0.99 for simplicity.

Table 1 presents the results. First, we see that no locations used the optimal second stage under a homogeneity assumption (although some were very close). Second, among different levels of variability and across locations, we compute the percentage reduction in E(T) from using the CRC rather than what would have been optimal assuming homogeneity. For example, Los Angeles has a reduction of 8.6% for α = 1 when applying our CRC rather than using 10 groups of size 9 at stage 2. The corresponding CRCs for each location are included in the on-line supplementary materials. Overall, we generally see large reductions in E(T), where the level of reduction increases as the heterogeneity increases. For example, if α = 0.1, the reduction is 36.8% for Los Angeles.

While heterogeneity among the individuals being tested would certainly be expected, the actual amount is unknown. However, the levels of variability incorporated into this example are certainly not extreme. For example, for Los Angeles with α = 0.1, the minimum and maximum values of E(p(i)) are slightly larger than 0 and approximately 0.086, respectively.

5. INFERTILITY PREVENTION PROJECT

To further assess how well the ORC and CRC work in application, we examine a database of previously diagnosed individuals for chlamydia and gonorrhea in Nebraska for the IPP. Based on these data, we perform Monte Carlo simulations to estimate the number of tests that our own and other group testing methods would use in application. We also estimate measures of accuracy, which will help practitioners understand the classification characteristics of each method.

5.1 Data and models

We begin by estimating the probability of disease positivity for each individual using logistic regression. As would be done in practice, these estimated probabilities are used in place of the true probabilities when implementing informative retesting. The covariates used in the regression models describe clinical observations (symptoms, cervical friability, pelvic inflammatory disease, cervicitis, urethritis), demographic variables (age, race), and risk behavior (multiple partners, new partner in the last 90 days, contact with an individual who has a sexually transmitted disease). Each covariate is binary except for age. The response variable is the observed disease diagnosis. We treat diagnoses as the true disease statuses when fitting regression models so that we can assess accuracy.

Separate models are estimated for each disease (chlamydia or gonorrhea), gender (male or female), and specimen type (urine or swab) combination. We estimate the models based on a training data set that includes 23,146 individuals who were tested in 2008 (parameter estimates are given in the on-line supplementary materials). These models are then applied to a test data set, which includes 27,521 individuals screened in 2009, in order to obtain individual estimates of positivity. Table 2 summarizes these individuals. Group testing is then applied to the individuals from 2009, where individuals are placed into initial groups chronologically by specimen date.

Table 2

Summary statistics for chlamydia and gonorrhea testing in 2009. The overall observed prevalence is denoted by p. The maximum likelihood estimates (MLE) for α and p are found by fitting beta distributions to all estimated individual probabilities.

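| Disease | Gender | Specimen | Count | p | Se | Sp | MLE of α | MLE of p |
|---|---|---|---|---|---|---|---|---|
| Chlamydia | Female | Swab | 14,503 | 0.069 | 92.8% | 96.0% | 2.5 | 0.067 |
| Chlamydia | Female | Urine | 4,970 | 0.080 | 80.5% | 96.0% | 1.1 | 0.087 |
| Chlamydia | Male | Swab | 1,909 | 0.157 | 92.5% | 95.0% | 1.0 | 0.149 |
| Chlamydia | Male | Urine | 6,139 | 0.081 | 93.0% | 95.0% | 1.8 | 0.090 |
| Gonorrhea | Female | Swab | 14,503 | 0.013 | 96.6% | 98.0% | 0.5 | 0.011 |
| Gonorrhea | Female | Urine | 4,970 | 0.017 | 84.9% | 98.0% | 0.5 | 0.018 |
| Gonorrhea | Male | Swab | 1,909 | 0.070 | 98.5% | 96.0% | 0.4 | 0.077 |
| Gonorrhea | Male | Urine | 6,139 | 0.021 | 97.0% | 96.0% | 0.2 | 0.014 |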

Because we would like to obtain measurements of accuracy, all group, subgroup, and individual responses that would be observed are simulated, assuming the actual observed individual responses as the true statuses. These binary responses are simulated from a Bernoulli distribution with either Se or 1 − Sp, as appropriate, as the probability parameter. For instance, if a group of size four has all negative responses in the test data set, we simulate a response for the group from a Bernoulli(1 − Sp) distribution. If instead at least one of the four was positive, we simulate from a Bernoulli(Se) distribution. The values for Se and Sp are given in Table 2; these are for the assay used in Nebraska during this time period.
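The simulation rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_group_response`, the seed, and the example groups are hypothetical, and the Se and Sp values are those listed in Table 2 for chlamydia testing of female swabs.

```python
import random


def simulate_group_response(true_statuses, se, sp, rng):
    """Simulate one binary test response for a group (hypothetical helper).

    true_statuses: iterable of 0/1 true statuses for the group members.
    A group with at least one true positive tests positive with
    probability se; an all-negative group tests positive with
    probability 1 - sp. An individual test is a group of size one.
    """
    prob_positive = se if any(true_statuses) else 1.0 - sp
    return 1 if rng.random() < prob_positive else 0


rng = random.Random(2009)      # arbitrary seed, for reproducibility only
se, sp = 0.928, 0.960          # chlamydia, female swab (Table 2)
negative_group = [0, 0, 0, 0]  # all four members truly negative
positive_group = [0, 1, 0, 0]  # at least one member truly positive

n_runs = 10_000
false_positive_rate = sum(
    simulate_group_response(negative_group, se, sp, rng) for _ in range(n_runs)
) / n_runs
true_positive_rate = sum(
    simulate_group_response(positive_group, se, sp, rng) for _ in range(n_runs)
) / n_runs
print(false_positive_rate, true_positive_rate)
```

Over many repetitions, the two empirical rates should be close to 1 − Sp and Se, respectively.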

5.2 Group testing methods

We compare our proposed methods in this paper to other hierarchical group testing methods. First, we include Dorfman’s (1943) original two-stage procedure (hierarchical group testing with K = 2 assuming equal individual probabilities) because it is the easiest method to apply. We also include the informative retesting procedure known as pool-specific optimal Dorfman (PSOD) proposed by McMahan et al. (2012a). PSOD is also a two-stage procedure that tries to minimize the estimated expected number of tests by strategically placing individuals within first-stage groups. We do not include the threshold optimal Dorfman method mentioned in Section 1 because McMahan et al. (2012a) showed it did not perform as well as PSOD.

We also include competing three- and four-stage hierarchical methods. First, we apply the informative retesting procedure known as “ordered halving” proposed by Black et al. (2012). It is limited by only being able to divide positive testing groups in half. Individuals are placed into subgroups that maximize one subgroup’s probability of testing positively, while minimizing the other subgroup’s probability of testing positively. We abbreviate ordered halving by OH3 and OH4 for three and four stages, respectively. Second, similar to Section 4, we find the retesting configuration that minimizes the estimated expected number of tests assuming homogeneity among the individuals. These hierarchical methods are abbreviated as H3 and H4 for three and four stages, respectively.

With respect to our proposed methods, there are two ways that the ORC or CRC can be found in practice. First, one can find a retesting configuration separately for each positive testing initial group. We refer to this as an “adaptive” procedure (A-ORC, A-CRC). Second, a simpler approach is to estimate one overall retesting configuration from a training data set and then apply it to new specimens. One can implement this approach by first finding $\hat{p}_{(1)h}, \ldots, \hat{p}_{(I)h}$, where the subscript “(i)h” allows us to denote ordered probabilities within some group h, for all possible groups of the same size I in the training data (we cap I at 20 in this section). The ORC and CRC are found using the probabilities averaged over all groups, $\bar{\hat{p}}_{(1)}, \ldots, \bar{\hat{p}}_{(I)}$, say, for the training data. Because only one configuration is applied to new specimens (those in our test data here), we refer to this as a “non-adaptive” procedure (N-ORC, N-CRC). In the results that follow, we denote the maximum number of stages by appending K to the end of each acronym; e.g., three-stage, non-adaptive ORC is denoted by N-ORC3. Note that we did not include N-ORC4 or any four-stage adaptive procedures in our study because of excessive computing time.
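The averaging step of the non-adaptive approach can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the function name is hypothetical, groups are assumed to be formed from consecutive individuals, probabilities are assumed to be sorted in ascending order within each group, and a final incomplete group is simply dropped.

```python
def averaged_ordered_probabilities(probs, group_size):
    """Average the ordered estimated probabilities across groups.

    probs: estimated individual probabilities in grouping order.
    Consecutive blocks of `group_size` individuals form the groups
    (assumption); a trailing incomplete block is ignored (assumption).
    Returns a list whose i-th entry is the mean of the i-th smallest
    probability over all groups.
    """
    n_groups = len(probs) // group_size
    if n_groups == 0:
        raise ValueError("need at least one complete group")
    sums = [0.0] * group_size
    for h in range(n_groups):
        group = sorted(probs[h * group_size:(h + 1) * group_size])
        for i, p in enumerate(group):
            sums[i] += p
    return [s / n_groups for s in sums]


# Toy training probabilities (made up for illustration).
training_probs = [0.10, 0.02, 0.05, 0.01, 0.20, 0.03, 0.07]
averaged = averaged_ordered_probabilities(training_probs, group_size=3)
print(averaged)
```

The resulting vector plays the role of the averaged ordered probabilities from which a single retesting configuration would then be determined.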

To ensure a fair comparison, we use the optimal initial group size for each method by minimizing the estimated expected number of tests in the training data for group sizes 5, 6, …, 20 and then use that as the initial group size for the test data. While groups only up to size 10 have been used for chlamydia and gonorrhea screening (see the references within the review paper of Mund et al. 2008), we decided to increase our maximum size to 20 in order to determine if benefits may occur from a larger group size. We would like to emphasize that the estimated expected numbers of tests are found only when working with the training data, as would likely be done in practice. Note that PSOD does not have one overall initial group size due to the nature of the method itself.
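As an illustration of this group-size search, the sketch below uses the simplest case, Dorfman's two-stage procedure with a common probability p for all individuals. The expected number of tests per individual is taken to be 1/I plus the probability that a group of size I tests positive, Se(1 − (1 − p)^I) + (1 − Sp)(1 − p)^I. The function names are hypothetical, and the Table 2 prevalence is used in place of a training-data estimate purely for illustration.

```python
def dorfman_tests_per_individual(p, group_size, se, sp):
    """Expected tests per individual for two-stage (Dorfman) testing.

    Assumes every individual has the same probability p of being
    positive. One group test is always used; all members are retested
    individually whenever the group tests positive.
    """
    all_negative = (1.0 - p) ** group_size
    p_group_positive = se * (1.0 - all_negative) + (1.0 - sp) * all_negative
    return 1.0 / group_size + p_group_positive


def best_initial_group_size(p, se, sp, sizes=range(5, 21)):
    """Return the candidate size minimizing expected tests per individual."""
    return min(sizes, key=lambda size: dorfman_tests_per_individual(p, size, se, sp))


# Chlamydia, female swab values from Table 2 (illustrative inputs only).
p, se, sp = 0.069, 0.928, 0.960
best_size = best_initial_group_size(p, se, sp)
per_individual = dorfman_tests_per_individual(p, best_size, se, sp)
print(best_size, round(per_individual, 4))
```

The same search pattern applies to the other methods once their own expected-number-of-tests function is substituted for the Dorfman expression.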

5.3 Monte Carlo simulations

To account for the variation introduced by simulating responses, we repeat the application of group testing over 500 data sets. Tables 3 and 4 provide summaries of our results for chlamydia and gonorrhea testing, respectively. Within the tables, the mean column provides the average number of tests used across the simulation runs, while the SD column provides the standard deviation for the number of tests. The PSe, PSp, PPPV, and PNPV columns provide the corresponding observed accuracy measures (Altman and Bland 1994a, 1994b) as averaged over the simulated data sets. For example, the observed PSe for one simulated data set is the observed proportion of true positives that are diagnosed to be positive by a group testing method.
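The four observed accuracy measures for a single simulated data set can be computed from the true statuses and the final diagnoses, as in the sketch below. The function name and the toy vectors are hypothetical; a measure with a zero denominator is returned as NaN, which is a choice made here for illustration.

```python
def observed_accuracy(true_statuses, diagnoses):
    """Compute observed PSe, PSp, PPPV, and PNPV for one data set.

    true_statuses, diagnoses: equal-length sequences of 0/1 values.
    """
    tp = fp = tn = fn = 0
    for truth, diagnosis in zip(true_statuses, diagnoses):
        if truth and diagnosis:
            tp += 1
        elif truth and not diagnosis:
            fn += 1
        elif not truth and diagnosis:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined proportions are reported as NaN (illustrative choice).
        return num / den if den > 0 else float("nan")

    return {
        "PSe": ratio(tp, tp + fn),   # true positives diagnosed positive
        "PSp": ratio(tn, tn + fp),   # true negatives diagnosed negative
        "PPPV": ratio(tp, tp + fp),  # positive diagnoses that are correct
        "PNPV": ratio(tn, tn + fn),  # negative diagnoses that are correct
    }


truth = [1, 1, 0, 0, 0, 0, 1, 0]
diagnosed = [1, 0, 0, 0, 1, 0, 1, 0]
measures = observed_accuracy(truth, diagnosed)
print(measures)
```

Averaging each measure over the simulated data sets gives values of the kind reported in the accuracy columns of the tables.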

Table 3

Mean number of tests, standard deviation (SD) for the number of tests, and mean accuracy measures for Nebraska IPP chlamydia screening based on 500 simulated data sets. The total number of individuals screened is given in Table 2.

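Female/Urine

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 5 | 2,495 | 44 | 65.0% | 99.0% | 84.6% | 97.0% |
| PSOD | NA | 2,436 | 45 | 65.1% | 99.0% | 85.2% | 97.0% |
| H3 | 12 | 1,996 | 59 | 52.1% | 99.4% | 88.3% | 96.0% |
| OH3 | 9 | 2,059 | 55 | 52.3% | 99.3% | 87.2% | 96.0% |
| N-CRC3 | 16 | 2,031 | 65 | 52.4% | 99.4% | 88.2% | 96.0% |
| N-ORC3 | 16 | 2,026 | 66 | 52.2% | 99.4% | 88.2% | 96.0% |
| A-CRC3 | 16 | 2,049 | 64 | 52.8% | 99.3% | 86.7% | 96.0% |
| A-ORC3 | 16 | 2,046 | 66 | 52.7% | 99.3% | 86.9% | 96.0% |
| H4 | 20 | 1,801 | 74 | 41.9% | 99.6% | 90.0% | 95.1% |
| OH4 | 19 | 1,820 | 69 | 42.1% | 99.4% | 86.8% | 95.2% |
| N-CRC4 | 20 | 1,747 | 69 | 42.0% | 99.6% | 90.7% | 95.2% |

Male/Urine

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 5 | 3,418 | 42 | 86.6% | 98.5% | 83.2% | 98.8% |
| PSOD | NA | 3,197 | 39 | 87.4% | 98.8% | 86.1% | 98.9% |
| H3 | 9 | 3,039 | 40 | 80.5% | 99.2% | 90.3% | 98.3% |
| OH3 | 7 | 3,029 | 40 | 80.3% | 99.2% | 89.5% | 98.3% |
| N-CRC3 | 8 | 2,886 | 41 | 82.2% | 99.2% | 89.6% | 98.4% |
| N-ORC3 | 8 | 2,890 | 41 | 82.3% | 99.2% | 89.5% | 98.5% |
| A-CRC3 | 9 | 2,907 | 42 | 82.2% | 99.1% | 89.4% | 98.5% |
| A-ORC3 | 9 | 2,913 | 42 | 82.3% | 99.2% | 89.5% | 98.5% |
| H4 | 18 | 2,994 | 55 | 75.0% | 99.3% | 90.6% | 97.8% |
| OH4 | 13 | 2,939 | 54 | 75.0% | 99.3% | 90.5% | 97.8% |
| N-CRC4 | 16 | 2,862 | 52 | 76.9% | 99.3% | 90.7% | 98.0% |

Female/Swab

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 5 | 7,292 | 60 | 86.1% | 99.0% | 86.2% | 99.0% |
| PSOD | NA | 7,014 | 57 | 86.4% | 99.1% | 87.6% | 99.0% |
| H3 | 9 | 6,366 | 62 | 79.9% | 99.5% | 92.2% | 98.5% |
| OH3 | 7 | 6,388 | 58 | 79.8% | 99.5% | 91.5% | 98.5% |
| N-CRC3 | 8 | 6,237 | 58 | 81.6% | 99.4% | 91.0% | 98.7% |
| N-ORC3 | 8 | 6,235 | 57 | 81.6% | 99.4% | 91.1% | 98.7% |
| A-CRC3 | 9 | 6,110 | 58 | 81.1% | 99.5% | 92.0% | 98.6% |
| A-ORC3 | 9 | 6,111 | 61 | 81.1% | 99.5% | 92.0% | 98.6% |
| H4 | 18 | 6,262 | 88 | 74.2% | 99.6% | 92.4% | 98.1% |
| OH4 | 13 | 6,171 | 71 | 74.1% | 99.5% | 92.3% | 98.1% |
| N-CRC4 | 16 | 5,970 | 80 | 76.3% | 99.5% | 92.4% | 98.3% |

Male/Swab

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 4 | 1,403 | 22 | 85.5% | 98.0% | 88.7% | 97.3% |
| PSOD | NA | 1,299 | 22 | 87.2% | 98.2% | 90.1% | 97.6% |
| H3 | 9 | 1,359 | 31 | 79.0% | 98.7% | 91.7% | 96.2% |
| OH3 | 7 | 1,371 | 27 | 79.1% | 98.5% | 90.9% | 96.2% |
| N-CRC3 | 9 | 1,295 | 30 | 81.9% | 98.4% | 90.6% | 96.7% |
| N-ORC3 | 9 | 1,292 | 28 | 81.8% | 98.4% | 90.7% | 96.7% |
| A-CRC3 | 8 | 1,286 | 28 | 82.1% | 98.5% | 90.8% | 96.8% |
| A-ORC3 | 8 | 1,286 | 30 | 82.1% | 98.5% | 90.9% | 96.8% |
| H4 | 20 | 1,363 | 49 | 73.1% | 98.8% | 92.1% | 95.2% |
| OH4 | 17 | 1,258 | 44 | 73.1% | 98.6% | 90.4% | 95.2% |
| N-CRC4 | 20 | 1,252 | 45 | 76.4% | 98.8% | 92.1% | 95.8% |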


Table 4

Mean number of tests, standard deviation (SD) for the number of tests, and mean accuracy measures for Nebraska IPP gonorrhea screening based on 500 simulated data sets. The total number of individuals screened is given in Table 2.

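Female/Urine

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 8 | 1,238 | 36 | 71.8% | 99.8% | 84.4% | 99.5% |
| PSOD | NA | 1,185 | 42 | 72.3% | 99.8% | 84.5% | 99.5% |
| H3 | 20 | 810 | 35 | 61.4% | 99.9% | 91.1% | 99.4% |
| OH3 | 13 | 861 | 32 | 61.0% | 99.9% | 89.7% | 99.3% |
| N-CRC3 | 20 | 769 | 32 | 61.2% | 99.9% | 92.6% | 99.3% |
| N-ORC3 | 20 | 766 | 34 | 61.5% | 99.9% | 92.6% | 99.4% |
| A-CRC3 | 20 | 783 | 33 | 62.5% | 99.9% | 91.6% | 99.4% |
| A-ORC3 | 20 | 783 | 34 | 62.0% | 99.9% | 91.6% | 99.4% |
| H4 | 20 | 721 | 33 | 52.5% | 100.0% | 94.2% | 99.2% |
| OH4 | 19 | 722 | 33 | 52.3% | 99.9% | 92.6% | 99.2% |
| N-CRC4 | 20 | 699 | 31 | 52.3% | 100.0% | 95.8% | 99.2% |

Male/Urine

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 10 | 1,958 | 50 | 94.1% | 99.2% | 71.5% | 99.9% |
| PSOD | NA | 1,862 | 55 | 94.5% | 99.2% | 73.0% | 99.9% |
| H3 | 20 | 1,417 | 32 | 91.3% | 99.6% | 84.8% | 99.8% |
| OH3 | 15 | 1,523 | 30 | 91.2% | 99.5% | 80.0% | 99.8% |
| N-CRC3 | 20 | 1,407 | 35 | 92.1% | 99.6% | 82.6% | 99.8% |
| N-ORC3 | 20 | 1,410 | 34 | 92.0% | 99.6% | 82.5% | 99.8% |
| A-CRC3 | 20 | 1,445 | 37 | 92.2% | 99.5% | 80.9% | 99.8% |
| A-ORC3 | 20 | 1,447 | 37 | 92.2% | 99.5% | 80.6% | 99.8% |
| H4 | 20 | 1,307 | 27 | 88.4% | 99.8% | 90.5% | 99.8% |
| OH4 | 19 | 1,293 | 25 | 88.5% | 99.7% | 88.0% | 99.8% |
| N-CRC4 | 20 | 1,265 | 28 | 90.2% | 99.7% | 87.6% | 99.8% |

Female/Swab

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 10 | 3,385 | 52 | 93.4% | 99.8% | 83.6% | 99.9% |
| PSOD | NA | 3,012 | 62 | 93.5% | 99.8% | 86.2% | 99.9% |
| H3 | 20 | 2,316 | 34 | 90.1% | 99.9% | 92.4% | 99.9% |
| OH3 | 15 | 2,543 | 33 | 90.3% | 99.9% | 89.1% | 99.9% |
| N-CRC3 | 20 | 2,119 | 35 | 91.0% | 99.9% | 92.5% | 99.9% |
| N-ORC3 | 20 | 2,116 | 35 | 90.9% | 99.9% | 92.5% | 99.9% |
| A-CRC3 | 20 | 2,074 | 36 | 91.2% | 99.9% | 93.5% | 99.9% |
| A-ORC3 | 20 | 2,072 | 37 | 91.0% | 99.9% | 93.5% | 99.9% |
| H4 | 20 | 2,165 | 29 | 87.3% | 99.9% | 95.2% | 99.8% |
| OH4 | 19 | 2,170 | 32 | 87.2% | 99.9% | 93.8% | 99.8% |
| N-CRC4 | 20 | 1,963 | 27 | 90.1% | 99.9% | 93.9% | 99.9% |

Male/Swab

| Method | I | Mean | SD | PSe | PSp | PPPV | PNPV |
|---|---|---|---|---|---|---|---|
| Dorfman | 5 | 1,035 | 17 | 97.1% | 98.8% | 86.2% | 99.8% |
| PSOD | NA | 733 | 24 | 97.8% | 99.2% | 90.2% | 99.8% |
| H3 | 9 | 935 | 14 | 95.6% | 99.4% | 92.3% | 99.7% |
| OH3 | 7 | 861 | 13 | 95.5% | 99.5% | 93.4% | 99.7% |
| N-CRC3 | 13 | 734 | 19 | 96.5% | 99.4% | 92.3% | 99.7% |
| N-ORC3 | 13 | 734 | 19 | 96.4% | 99.4% | 92.3% | 99.7% |
| A-CRC3 | 13 | 730 | 19 | 96.3% | 99.4% | 92.4% | 99.7% |
| A-ORC3 | 13 | 729 | 18 | 96.4% | 99.4% | 92.3% | 99.7% |
| H4 | 9 | 945 | 15 | 95.2% | 99.5% | 93.0% | 99.6% |
| OH4 | 15 | 818 | 15 | 94.1% | 99.5% | 93.5% | 99.6% |
| N-CRC4 | 20 | 681 | 15 | 95.9% | 99.5% | 93.8% | 99.7% |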


Within the tables, we first note that all three- and four-stage methods provide dramatic reductions in the average number of tests in comparison to Dorfman-based methods. The one exception is for PSOD when testing male swabs, where the average number of tests is similar to our proposed methods. Second, with respect to the accuracy measures, Dorfman-based methods have higher PSe and slightly higher PNPV than other methods but lower values of PSp and PPPV. These findings are not new in the comparison of hierarchical procedures (e.g., see Kim et al. 2007 and McMahan et al. 2012a), and they are simply due to the number of stages as discussed in Section 3. Third, the standard deviations given in the tables can be used to obtain a rough range for the number of tests observed with each simulated data set (e.g., Mean ± 3SD). Additional summary measures (minimum, maximum, and percentiles) are given in the on-line supplementary materials. Although there is variability in the number of tests, we did not find a pattern that would suggest certain methods are consistently more precise.
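As a worked instance of the rough range mentioned above, consider the N-CRC4 entry for chlamydia testing of female urine specimens in Table 3 (mean 1,747, SD 69). The arithmetic is shown only to illustrate the Mean ± 3SD rule of thumb.

```latex
\[
1{,}747 \pm 3(69) = 1{,}747 \pm 207
\quad\Longrightarrow\quad
[\,1{,}540,\; 1{,}954\,].
\]
```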

Specifically when compared to H3, ORC3 and CRC3 generally reduce the average number of tests, where reductions can be as much as 22.0% (testing males for gonorrhea with swabs). In the two cases where H3 has a smaller average, it is by a relatively small amount. In comparison to H4, CRC4 always reduces the average number of tests, where the largest reduction is 27.9% (testing males for gonorrhea with swabs). With respect to accuracy, the ORC and CRC almost always provide a larger PSe and PNPV than those for H3 and H4. In contrast, the ORC and CRC almost always result in a smaller PSp and PPPV. When compared to ordered halving, the ORC and CRC reduce the average number of tests with similar or larger accuracy in almost all cases.
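The two largest reductions quoted above can be reproduced from the male swab panel of Table 4, taking the smallest three-stage mean among our procedures (729, for A-ORC3) against H3 (935), and N-CRC4 (681) against H4 (945):

```latex
\[
\frac{935 - 729}{935} = \frac{206}{935} \approx 0.220,
\qquad
\frac{945 - 681}{945} = \frac{264}{945} \approx 0.279 .
\]
```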

Interestingly, the differences in the average number of tests among corresponding adaptive and non-adaptive, three-stage procedures are small and are not consistently positive or negative (largest relative difference is 2.7% when testing males for gonorrhea using urine). The accuracy levels are similar too. These results suggest that the non-adaptive approach may be preferred due to its ease in application. Furthermore, there are only very small differences between corresponding applications of the ORC and CRC. This suggests that the CRC may be preferred to the ORC when the ORC takes an excessive amount of time to compute.

6. DISCUSSION

We have demonstrated that the ORC and CRC can significantly reduce the number of tests for hierarchical group testing while maintaining classification accuracy. We also showed two ways that our proposed methods can be implemented, either adaptively or non-adaptively, and we obtained similar results for each. This is especially important because implementing the non-adaptive approach allows practitioners to use only one retesting configuration throughout the screening process. Of course, our evaluations in Sections 3, 4, and 5 are limited to the specific cases we considered. However, we have purposely used real-life scenarios in our evaluations, and we expect similar conclusions to hold at least in other disease-testing applications.

Finding the ORC over three stages is closely related to the PSOD procedure described in Section 5. Essentially, the last two stages of a three-stage ORC resemble PSOD. For a positive testing group at stage two of the ORC procedure (or stage one of PSOD), one allocates individuals to subgroups to minimize the expected number of remaining tests. Stage three for ORC (or stage two of PSOD) concludes with individual testing. The advantage that ORC has over PSOD is that ORC’s first stage can immediately diagnose all I individuals as negative. If the initial group tests positively for ORC, PSOD will always result in an expected number of tests greater than or equal to the remaining tests of ORC (this is not necessarily true for CRC). This is because finding the ORC requires an examination of all possible configurations, whereas the greedy search algorithm used by PSOD is not guaranteed to find the optimal configuration.

While the ORC and CRC often result in the same expected number of tests, future research could explore other ways to find the optimal configuration, especially for large initial group sizes. In particular, the enumeration of all possible configurations falls into a setting commonly known as “embarrassingly parallel,” so gains in computational time potentially could be achieved through high performance computing. Also, a genetic search algorithm could potentially be used to search among “good” configurations that lead to new generations of “better” configurations. Finally, as suggested by a referee, Hwang (1981) and Hwang and Rothblum (2012) present examples of how three-stage hierarchical testing, in the absence of testing error, can be envisioned as an ordered optimal partition problem. This could potentially lead to determining an optimal configuration in our context.
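To make the idea of enumerating configurations concrete, the sketch below scores every way of splitting an ordered group into contiguous second-stage subgroups for a three-stage procedure. It is deliberately simplified and is not our R implementation: it assumes a perfect assay (Se = Sp = 1), assumes subgroups are contiguous blocks of the ordered individuals, and uses hypothetical function names and made-up probabilities. Because each configuration is scored independently of the others, the loop over configurations is the part that could be distributed across processors.

```python
from itertools import combinations
from math import prod


def contiguous_configurations(n):
    """Yield subgroup-size tuples splitting n ordered items into >= 2 blocks."""
    for n_cuts in range(1, n):
        for cuts in combinations(range(1, n), n_cuts):
            bounds = (0,) + cuts + (n,)
            yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def expected_tests_three_stage(probs, sizes):
    """Expected number of tests under a perfect assay (assumption).

    One initial group test; if it is positive, every subgroup is tested;
    each positive subgroup with more than one member is then tested
    individually. Subgroups of size one need no further testing.
    """
    p_group_positive = 1.0 - prod(1.0 - p for p in probs)
    total = 1.0 + p_group_positive * len(sizes)
    start = 0
    for size in sizes:
        if size > 1:
            block = probs[start:start + size]
            p_block_positive = 1.0 - prod(1.0 - p for p in block)
            total += p_block_positive * size
        start += size
    return total


ordered_probs = [0.01, 0.02, 0.05, 0.30]  # made-up, sorted ascending
scored = [
    (expected_tests_three_stage(ordered_probs, sizes), sizes)
    for sizes in contiguous_configurations(len(ordered_probs))
]
best_value, best_sizes = min(scored)
print(best_sizes, round(best_value, 4))
```

The number of such configurations is 2^(I − 1) − 1 for a group of size I, which grows quickly and is one reason large initial group sizes are computationally demanding.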

R functions to compute the ORC and CRC are provided at www.chrisbilder.com/grouptesting/BBT. We include a program at this web address showing how the computations in Sections 3 and 4 can be performed. Our methodology can be applied in practice by simply inputting the estimated individual probabilities within a group into our functions along with Se, Sp, and K.

Supplementary Material

Supp Material

ACKNOWLEDGEMENTS

This research was supported by Grant R01 AI067373 from the National Institutes of Health. We thank the Editor, Associate Editor, and an anonymous referee for their comments which helped us to improve this paper considerably.

Contributor Information

Michael S. Black, Department of Mathematics, University of Wisconsin-Platteville, Platteville, WI 53818, USA, blackmi@uwplatt.edu.

Christopher R. Bilder, Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE 68583, USA.

Joshua M. Tebbs, Department of Statistics, University of South Carolina, Columbia, SC 29208, USA, tebbs@stat.sc.edu.

REFERENCES

  • Altman D, Bland J. Diagnostic tests 1: Sensitivity and specificity. BMJ. 1994a;308:1552.
  • Altman D, Bland J. Diagnostic tests 2: Predictive values. BMJ. 1994b;309:102.
  • American Red Cross. Blood testing. 2014. http://www.redcrossblood.org/learn-about-blood/what-happens-donated-blood/blood-testing, retrieved Sept. 27, 2014.
  • Bilder C, Tebbs J, Chen P. Informative retesting. Journal of the American Statistical Association. 2010;105:942–955.
  • Black M, Bilder C, Tebbs J. Group testing in heterogeneous populations by using halving algorithms. Journal of the Royal Statistical Society: Series C. 2012;61:277–290.
  • Dorfman R. The detection of defective members of large populations. Annals of Mathematical Statistics. 1943;14:436–440.
  • Hwang F. A generalized binomial group testing problem. Journal of the American Statistical Association. 1975;70:923–926.
  • Hwang F. Optimal partitions. Journal of Optimization Theory and Applications. 1981;34:1–10.
  • Hwang F, Rothblum U. Partitions: Optimality and Clustering: Single-Parameter. World Scientific; Singapore: 2012.
  • Kim H, Hudgens M, Dreyfuss J, Westreich D, Pilcher C. Comparison of group testing algorithms for case identification in the presence of test error. Biometrics. 2007;63:1152–1163.
  • Kim H, Hudgens M. Three-dimensional array-based group testing algorithms. Biometrics. 2009;65:903–910.
  • Lewis J, Lockary V, Kobic S. Cost savings and increased efficiency using a stratified specimen pooling strategy for Chlamydia trachomatis and Neisseria gonorrhoeae. Sexually Transmitted Diseases. 2012;39:46–48.
  • Litvak E, Tu X, Pagano M. Screening for the presence of a disease by pooling sera samples. Journal of the American Statistical Association. 1994;89:424–434.
  • McMahan C, Tebbs J, Bilder C. Informative Dorfman screening. Biometrics. 2012a;68:287–296.
  • McMahan C, Tebbs J, Bilder C. Two-dimensional informative array testing. Biometrics. 2012b;68:793–804.
  • Mund M, Sander G, Potthoff P, Schicht H, Matthias K. Introduction of Chlamydia trachomatis screening for young women in Germany. Journal der Deutschen Dermatologischen Gesellschaft. 2008;6:1032–1037.
  • R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2012. http://www.R-project.org. ISBN 3-900051-07-0.
  • Sherlock M, Zetola N, Klausner J. Routine detection of acute HIV infection through RNA pooling: Survey of current practice in the United States. Sexually Transmitted Diseases. 2007;34:314–316.
  • Smith D, May S, Perez-Santiago J, Strain M, Ignacio C, Haubrich R, Richman D, Benson C, Little S. The use of pooled viral load testing to identify antiretroviral treatment failure. AIDS. 2009;23:2151–2158.