Optimal individualized decision rules from a multi-arm trial: A comparison of methods and an application to tailoring inter-donation intervals among blood donors in the UK

There is growing interest in precision medicine, where individual heterogeneity is incorporated into decision-making and treatments are tailored to individuals to provide better healthcare. One important aspect of precision medicine is the estimation of the optimal individualized treatment rule (ITR), the rule that optimizes the expected outcome. Most methods developed for this purpose are restricted to the setting with two treatments, while clinical studies with more than two treatments are common in practice. In this work, we summarize methods to estimate the optimal ITR in the multi-arm setting and compare their performance in large-scale clinical trials via simulation studies. We then illustrate their use with a case study based on data from the INTERVAL trial, which randomly assigned over 20,000 male blood donors from England to one of three inter-donation intervals (12-week, 10-week, or 8-week) over two years. We estimate the optimal individualized donation strategies under three different objectives. Our findings are fairly consistent across the five approaches applied: when we target the maximization of the total units of blood collected, almost all donors are assigned to the 8-week inter-donation interval, whereas if we aim to minimize the rate of low hemoglobin deferrals, almost all donors are assigned to donate every 12 weeks. However, when the goal is to maximize a utility score that “discounts” the total units of blood collected by the incidence of low hemoglobin deferrals, we observe some heterogeneity in the optimal inter-donation interval across donors, and the optimal donor assignment strategy is highly dependent on the trade-off parameter in the utility function.


Web Appendix A: quantitative interactions and qualitative interactions
In Section 4.1 of the main paper, we discuss the distinction between quantitative and qualitative interactions. These two types of interactions are plotted in Supplementary Figure 1.
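As a simple two-arm illustration (ours, not taken from the main paper): let ∆1(X) and ∆2(X) denote the effects of treatments 1 and 2 as functions of a single covariate X. If ∆2(X) − ∆1(X) = 1 + X^2, the contrast varies with X but never changes sign, so treatment 2 is optimal for everyone; this is a quantitative interaction and the optimal ITR reduces to a "one-size-fits-all" rule. If instead ∆2(X) − ∆1(X) = X, the sign of the contrast changes at X = 0, so treatment 2 is optimal when X > 0 and treatment 1 when X < 0; this qualitative interaction is what makes an individualized rule valuable.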

B.1 Scenarios with correlated covariates
We conduct additional simulation studies under scenarios with correlated covariates when n = 20000. To demonstrate the idea, we pick settings 1, 2, 3, and 6 from the main paper (settings 1, 2, and 3 correspond to scenarios with tree-type, linear, and nonlinear qualitative treatment-covariate interactions, while setting 6 considers the situation where there is no qualitative interaction and the true optimal treatment is the same for all individuals). Five correlated covariates, X1, . . . , X5, are generated, with pairwise correlation coefficient 0.3. We obtain similar comparative conclusions across methods as with independent covariates: except in the case of linear decision boundaries (setting 2), BART performs better in all other settings. In addition, when the true optimal is the "one-size-fits-all" rule (setting 6), all methods recover the true optimal ITR with no misclassification.
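For concreteness, a minimal sketch (ours, not the authors' code) of how five covariates with pairwise correlation 0.3 could be generated, assuming a multivariate normal design; the marginal distributions used in the original simulations may differ.

```python
import numpy as np

# Sketch: draw n = 20000 observations of five covariates X1,...,X5 with
# pairwise correlation 0.3, assuming a multivariate normal design.
n, p, rho = 20000, 5, 0.3
rng = np.random.default_rng(2024)

# Compound-symmetric correlation matrix: 1 on the diagonal, rho elsewhere.
Sigma = np.full((p, p), rho)
np.fill_diagonal(Sigma, 1.0)

X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)
print(np.round(np.corrcoef(X, rowvar=False), 2))  # empirical correlations, all close to 0.3
```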

B.2 Scenarios with 20 and 50 covariates
In this section, we conduct simulation studies under scenarios with more covariates and more prognostic factors than those considered in the main paper. We fix n at 20000. As in Section B.1, we pick settings 1, 2, 3, and 6 as representative settings for illustration. In the main paper, we examine the case with p = 5 covariates. Here, we run additional simulations with (a) p = 20 covariates, among which 10 covariates are involved in the main effect, m(X) = 1 + 0.1X1 − 0.2X2 − 0.3X3 + 0.4X4 − 0.5X5 + X6 + X7 + X8 − X9 − X10, and (b) p = 50 covariates, among which 20 covariates are involved in the main effect. For both (a) and (b), the treatment-covariate interaction effects ∆(X, A) remain the same as those in the original settings (described in Table 1 of the main paper). Results are summarized in the corresponding Supplementary Table. Findings from these moderate-dimensional scenarios are similar to those from the low-dimensional scenarios: when decision boundaries are tree-type (setting 1), ACWL and BART perform much better than the other competing methods, while when decision boundaries are linear (setting 2), l1-PLS-HGL, l1-PLS-GL, and D-learning perform similarly and, as expected, significantly better than ACWL. BART outperforms all the other methods under nonlinear decision boundaries (setting 3) due to its flexibility. When the true optimal treatment is the same for everyone (setting 6), all methods perform perfectly with no misclassification. We note that, except for BART, all the other methods perform intrinsic variable selection when estimating the optimal ITR (l1-PLS-HGL, l1-PLS-GL, and D-learning perform LASSO-type variable selection, while ACWL performs variable selection via the node-splitting process). However, BART still performs reasonably well for p = 20 and p = 50 (when n = 20000) since we still have n ≫ p. The performance of BART can be further improved for large p using the variable selection method proposed by Linero,1 at a slightly higher computational cost.
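As an illustration of scenario (a), a brief sketch (ours) that generates p = 20 covariates and evaluates the stated main effect m(X); for concreteness the covariates are drawn as standard normal, which is an assumption rather than the original design.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20000, 20
X = rng.standard_normal((n, p))  # assumed covariate distribution (illustration only)

# Main effect for scenario (a): the first 10 of the 20 covariates are prognostic,
# m(X) = 1 + 0.1*X1 - 0.2*X2 - 0.3*X3 + 0.4*X4 - 0.5*X5 + X6 + X7 + X8 - X9 - X10.
coefs = np.zeros(p)
coefs[:10] = [0.1, -0.2, -0.3, 0.4, -0.5, 1.0, 1.0, 1.0, -1.0, -1.0]
m = 1.0 + X @ coefs
```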

B.3 Variations of setting 6 (true optimal is the one-size-fits-all rule)
In this section, we focus on simulation settings in which the true optimal treatment for everyone is the same (i.e., one-size-fits-all rule) and we examine some variations of setting 6 (of the main paper).
We first consider the case where we increase the noise or reduce the signal in setting 6. We simulate data in which the functional forms of m(X), ∆1(X), ∆2(X), and ∆3(X) are the same as those in setting 6, which implies that treatment 1 is the optimal treatment for everyone. In setting 6, we set t = 0.5 and σ = 1; here we vary the values of t (which scales the treatment-covariate interactions, i.e., the signal) and σ (the noise level), with the sample size n fixed at 20000. The results are shown in Supplementary Table 3.
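A sketch of this sensitivity design, assuming (consistent with the roles of t and σ described above, though the generating equation is not reproduced here) that the outcome is drawn as Y = m(X) + t·∆A(X) + σ·ε with ε ~ N(0, 1):

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_outcome(m_x, delta_ax, t, sigma):
    """Assumed generating model: Y = m(X) + t * Delta_A(X) + sigma * eps, eps ~ N(0, 1).
    Here t scales the treatment-covariate interactions (signal) and sigma the noise level."""
    eps = rng.standard_normal(len(m_x))
    return m_x + t * delta_ax + sigma * eps

# Sensitivity grid: setting 6 uses t = 0.5 and sigma = 1; this appendix also
# considers a weaker signal (t = 0.1) and a larger noise level (sigma = 5).
grid = [(t, sigma) for t in (0.5, 0.1) for sigma in (1.0, 5.0)]
```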
We observe that when the noise level increases to σ = 5, all methods still perform perfectly in the setting where the "one-size-fits-all" rule is the optimal treatment regime. When we fix σ and reduce the value of t from 0.5 to 0.1, the perfect classification results remain. These sensitivity analysis results are not surprising, since the sample size we consider is sufficiently large.
In setting 6, the quantitative interactions are tree-type. We also consider other types of quantitative interactions, such as linear additive and nonlinear (Supplementary Table 4).
According to Supplementary Table 4, when the true optimal treatment is the same for all subjects (the "trivial" decision rule that assigns everyone to the marginally best treatment), all methods perform perfectly with no misclassification, regardless of whether the underlying quantitative interactions are linear, nonlinear, or tree-type.
Supplementary Table 4. Simulation results based on 100 replicates (n = 20000) under the setting where the optimal ITR is the "one-size-fits-all" rule with different types of quantitative interactions: mean (sd) of misclassification rates and value functions. Methods under comparison include the l1-penalized least squares with hierarchical group LASSO variable selection (l1-PLS-HGL), l1-penalized least squares with group LASSO variable selection (l1-PLS-GL), adaptive contrast weighted learning (ACWL), direct learning (D-learning), and Bayesian additive regression trees (BART). The smallest misclassification rates and the largest value functions for each setting are in bold.

Web Appendix C: covariate balance after data cleaning
In this section, we examine covariate balance after data cleaning. Donors' baseline characteristics by randomized group (inter-donation interval) after data cleaning are presented in Supplementary Table 5 for continuous covariates and in Supplementary Table 6 for categorical covariates. Based on these results, we conclude that the data cleaning process does not distort the balance of baseline covariates across randomized groups.
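For reference, a minimal sketch (ours, with hypothetical data frame, column names, and group labels) of one common balance diagnostic, the standardized mean difference between two randomized arms for a continuous covariate:

```python
import numpy as np
import pandas as pd

def standardized_mean_diff(df, covariate, group_col="group", g1="8-week", g2="12-week"):
    """Standardized mean difference for one continuous covariate between two arms.
    The data frame, column names, and group labels are hypothetical placeholders."""
    x1 = df.loc[df[group_col] == g1, covariate]
    x2 = df.loc[df[group_col] == g2, covariate]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
    return (x1.mean() - x2.mean()) / pooled_sd

# Example usage (hypothetical column name):
# standardized_mean_diff(donors, "baseline_hb", g1="8-week", g2="12-week")
```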

Web Appendix D: additional analysis results based on bootstrap samples
In Tables 2-4 of the main paper, standard deviation estimates are obtained via cross-validation; in this appendix, we report the corresponding results with standard deviation estimates based on 500 bootstrap samples.

Supplementary Table 7. Sample estimates based on the INTERVAL data and standard deviation estimates based on 500 bootstrap samples (in parentheses) of assignment proportions in % and empirical ITR effects on donation and deferral outcomes. ITR effects measure the difference in the average outcome between donors whose assigned inter-donation intervals in the trial are optimal (with respect to the method used to estimate the ITR) and those whose assigned inter-donation intervals are non-optimal. A larger ITR effect on donation and a smaller ITR effect on deferral are more desirable. The first four and last four rows correspond to the target being maximizing the total units of blood collected by the blood service and minimizing the rate of low Hb deferrals, respectively.

Supplementary Table 8. Sample estimates based on the INTERVAL data and standard deviation estimates based on 500 bootstrap samples (in parentheses) of empirical ITR effects of three non-personalized rules on donation and deferral outcomes. ITR effects measure the difference in the average outcome between donors whose assigned inter-donation intervals in the trial are the same as the one specified in the non-personalized rule and those whose assigned inter-donation intervals are different from that specified in the non-personalized rule. A larger ITR effect on donation and a smaller ITR effect on deferral are more desirable.

Supplementary Table 8 shows empirical ITR effects of the three non-personalized rules on both the donation and the deferral outcomes.
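To make the reported quantities concrete, a sketch (ours, with hypothetical variable names) of the empirical ITR effect defined above and a bootstrap standard deviation; for simplicity the estimated optimal intervals are held fixed across resamples, whereas the full analysis may re-estimate the ITR within each bootstrap sample.

```python
import numpy as np

def itr_effect(outcome, assigned, optimal):
    """Difference in mean outcome between donors whose randomized inter-donation
    interval equals the estimated optimal interval and those whose interval differs.
    Inputs are numpy arrays of equal length (one entry per donor)."""
    match = assigned == optimal
    return outcome[match].mean() - outcome[~match].mean()

def bootstrap_sd(outcome, assigned, optimal, B=500, seed=0):
    """SD of the ITR effect over B bootstrap resamples of donors (optimal held fixed)."""
    rng = np.random.default_rng(seed)
    n = len(outcome)
    effects = [itr_effect(outcome[idx], assigned[idx], optimal[idx])
               for idx in (rng.integers(0, n, n) for _ in range(B))]
    return np.std(effects, ddof=1)
```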

Supplementary Table 9. Applications of the l1-penalized least squares with hierarchical group LASSO variable selection (l1-PLS-HGL), l1-penalized least squares with group LASSO variable selection (l1-PLS-GL), adaptive contrast weighted learning (ACWL), and direct learning (D-learning) to data from male donors in the INTERVAL trial assuming the target is to maximize the utility. The trade-off parameter b in the utility function varies from 1 to 5 in increments of 1. Sample estimates based on the INTERVAL data and standard deviation estimates based on 500 bootstrap samples (in parentheses) of assignment proportions in % and empirical ITR effects on donation, deferral, and utility are reported. ITR effects measure the difference in the average outcome between donors whose assigned inter-donation intervals in the trial are optimal (with respect to the method used to estimate the ITR) and those whose assigned inter-donation intervals are non-optimal. A larger ITR effect on donation/utility and a smaller ITR effect on deferral are more desirable. Results on assignment percentages and ITR effects indicate that the different methods perform similarly. Not surprisingly, bootstrap-based standard deviation estimates, which reflect the uncertainty of the observed dataset, are much larger than the cross-validation-based standard deviation estimates reported in the main paper.

Supplementary Table 10. Sample estimates based on the INTERVAL data and standard deviation estimates based on 500 bootstrap samples (in parentheses) of ITR effects of three non-personalized rules on the utility outcome. The trade-off parameter b in the utility function varies from 1 to 5 in increments of 1. ITR effects measure the difference in the average outcome between donors whose assigned inter-donation intervals in the trial are the same as the one specified in the non-personalized rule and those whose assigned inter-donation intervals are different from that specified in the non-personalized rule. A larger ITR effect on utility is more desirable.
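For reference when reading Supplementary Tables 9 and 10: the description in the abstract suggests a utility of the form utility = donation outcome − b × deferral outcome (the precise definition is given in the main paper), so larger values of b place more weight on avoiding low Hb deferrals.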


E.1 The first measure of agreement: percent agreement
The first measure is a simple "agreement proportion" that assesses the amount of "overlap" of the optimal decisions made according to each method (l1-PLS-HGL, l1-PLS-GL, ACWL, D-learning, and BART). We calculate the "observed proportion of agreement", i.e., the percentage of male donors who receive the same recommendation on the optimal inter-donation interval from all five methods. In the case where recommendations are not consistent across all five methods but three or four out of five methods "agree", a majority voting rule can be used to determine the optimal individualized inter-donation interval. Therefore, we also calculate the percentage of male donors for whom at least four methods "agree" on the optimal inter-donation interval. Results are presented in Supplementary Table 11.
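A sketch (ours, with hypothetical variable names) of both agreement proportions, assuming a matrix of recommendations with one column per method:

```python
import numpy as np

def agreement_percentages(recs):
    """recs: integer array of shape (n_donors, 5); each entry is the recommended
    inter-donation interval (e.g., coded 8, 10, or 12) from one of the five methods."""
    n, k = recs.shape
    # For each donor, the size of the largest group of methods that agree.
    max_agree = np.array([np.bincount(row).max() for row in recs.astype(int)])
    all_five = np.mean(max_agree == k) * 100       # all five methods agree
    at_least_four = np.mean(max_agree >= 4) * 100  # at least four agree (majority voting)
    return all_five, at_least_four
```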
We observe that when the target outcome is donation, deferral, or utility with b = 1 or 2, the levels of agreement among the five methods are high (> 80%). As the trade-off parameter b in the utility function increases, the level of agreement decreases.

Supplementary Table 11. Percentages of male donors who receive the same recommendation on the optimal inter-donation interval by all five methods, by at least four methods, or by at least three methods, when the target outcome is donation, deferral, and utility, respectively. Methods used to estimate the optimal individualized inter-donation interval include the l1-penalized least squares with hierarchical group LASSO variable selection, l1-penalized least squares with group LASSO variable selection, adaptive contrast weighted learning, direct learning, and Bayesian additive regression trees.


E.2 The second measure of agreement: pairwise B statistics
One limitation of the "percent agreement" measure is that it does not account or correct for agreement by chance. Chance-corrected measures have been proposed for assessing inter-rater agreement, for example, Cohen's kappa for two raters and Fleiss' kappa for more than two raters.3,4 However, kappa is not particularly suitable in the current setting because its performance depends on the marginal distributions, and in our case the marginal distributions are highly symmetrically imbalanced, especially when the target outcome is donation, deferral, or utility with small values of b. This phenomenon has been well documented in the literature and is commonly referred to as the "high agreement but low kappa paradox".5-7 Improved measures that still account for chance agreement but address the kappa paradox have been proposed. For example, the B statistic proposed by Bangdiwala8 has been shown to be robust to different marginal distributions.7,9 However, the B statistic has not been extended to handle more than two raters. We therefore report pairwise B statistics as the second type of agreement measure and use the guidelines suggested by Munoz and Bangdiwala10 for the interpretation of the B statistic (Supplementary Table 12).
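A sketch (ours, not the authors' code) of the pairwise B statistic for two methods' recommendations, using its standard definition as the ratio of the summed squared diagonal counts of the k × k agreement table to the summed products of the corresponding row and column margins:

```python
import numpy as np
import pandas as pd

def bangdiwala_B(rec_a, rec_b):
    """Bangdiwala's B for two methods: sum_i n_ii^2 / sum_i (n_i+ * n_+i),
    computed from the k x k cross-tabulation of their recommendations."""
    table = pd.crosstab(pd.Series(rec_a), pd.Series(rec_b))
    cats = sorted(set(table.index) | set(table.columns))  # force a square table
    counts = table.reindex(index=cats, columns=cats, fill_value=0).to_numpy(dtype=float)
    diag = np.diag(counts)
    return np.sum(diag ** 2) / np.sum(counts.sum(axis=1) * counts.sum(axis=0))
```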
Supplementary Figure 2 presents the pairwise B statistics among the five methods when the donation and deferral outcomes are analyzed separately, and Supplementary Figure 3 shows the corresponding results when the utility score is the target outcome. Different colors indicate the extent of agreement according to the pairwise B statistics, with yellow or light yellow representing situations where the agreement is "almost perfect" or "perfect". We observe that all pairwise B statistics suggest at least "almost perfect" agreement when the target outcome is donation, deferral, or utility with b = 1 or 2. As b gets larger, the degree of pairwise agreement decreases. For b = 3 or 4, pairwise agreements never fall below "substantial", whereas when the target outcome is the utility score with b = 5, several pairwise B statistics indicate only "moderate" agreement. In general, pairwise B statistics between l1-PLS-HGL and the other methods appear to be slightly higher than the rest of the pairwise comparisons.