James Willard, Shirin Golchi, Erica E M Moodie, Bruno Boulanger, Bradley P Carlin, Bayesian optimization for personalized dose-finding trials with combination therapies, Journal of the Royal Statistical Society Series C: Applied Statistics, Volume 74, Issue 2, March 2025, Pages 373–390, https://doi.org/10.1093/jrsssc/qlae058
Abstract
Identification of optimal dose combinations in early-phase dose-finding trials is challenging, due to the trade-off between precisely estimating the many parameters required to flexibly model the possibly nonmonotonic dose-response surface, and the small sample sizes in early-phase trials. This difficulty is even more pertinent in the context of personalized dose-finding, where patient characteristics are used to identify tailored optimal dose combinations. To overcome these challenges, we propose the use of Bayesian optimization for finding optimal dose combinations in standard (one size fits all) and personalized multi-agent dose-finding trials. Bayesian optimization is a method for estimating the global optima of expensive-to-evaluate objective functions. The objective function is approximated by a surrogate model, commonly a Gaussian process, paired with a sequential design strategy to select the next point via an acquisition function. This work is motivated by an industry-sponsored problem, where the focus is on optimizing a dual-agent therapy in a setting featuring minimal toxicity. To compare the performance of the standard and personalized methods under this setting, simulation studies are performed for a variety of scenarios. Our study concludes that taking a personalized approach is highly beneficial in the presence of heterogeneity.
1 Introduction
Early-phase clinical trials are designed to assess the safety and efficacy profiles of first-in-human doses of an experimental drug. Traditional adaptive dose-finding designs fall into three major families: algorithmic designs (e.g. Storer, 1989), model-assisted designs (e.g. Yuan et al., 2016), and model-based designs (e.g. O’Quigley et al., 1990). Here, our interest lies in model-based designs, which have been extended to the dual-agent dose combination setting (e.g. Wages & Conaway, 2014; K. Wang & Ivanova, 2005). Some of these designs (e.g. Houede et al., 2010; Z. Wang et al., 2023) propose using flexible models to handle possible nonmonotonicities in the dose-efficacy/dose-toxicity surfaces, recognizing that monotonicity depends on the type of drug being administered and may not hold in general (Li et al., 2017). Additionally, existing methods often restrict dose-finding to a small set of preselected dose combinations, which can fail to identify the optimal dose combination (Hirakawa et al., 2015). Adaptive dose insertion designs have been proposed to allow for the evaluation of additional dose combinations if warranted by the data (Cai et al., 2014; Lyu et al., 2019).
Most existing dose-finding designs, including those mentioned above, seek to find optimal dose combinations without consideration of additional covariate information (we refer to this as standard dose-finding). These designs ignore the possibility that patient responses may be variable across different subgroups within the population. With recent advances in molecular biology, specifically the identification of novel biomarkers, interest has grown in personalized (or precision) medicine. Personalized dose-finding aims to find optimal dose combinations based on individual patient characteristics. When utilizing parametric models, as in many of the standard dose-finding methods described above, extension to the personalized setting is challenging due to the limited sample sizes and potentially large number of dose-covariate interaction terms that are required to be estimated. Personalized dose-finding for monotherapies has been investigated using a variety of methods, including dimension-reduction techniques (e.g. Guo & Zang, 2022), Bayesian hierarchical models (e.g. Morita et al., 2017), and Bayesian model averaging (e.g. Psioda et al., 2021). The increased dimensionality in the combination therapy setting makes estimation even more challenging. Mozgunov et al. (2022) have considered personalized dose-finding for a dual-agent combination therapy where the patient-specific dose of one of the agents is selected externally by clinicians, but multi-agent personalized dose-finding designs where all dose dimensions are explored during optimization are still needed.
Our work is motivated by an industry-sponsored problem, where a sponsor is interested in a design for the development of an intraocular implant that combines two topical agents. Each agent has been in use and is well tolerated. No drug-related adverse events are expected, and we anticipate minimal toxicity, if any. Interest lies in obtaining the optimal dose combination of these two agents using a continuous efficacy measure, which need not be monotonic with the doses of each agent. Additionally, response heterogeneity is expected to exist with respect to a key binary covariate, a particular characteristic of the lens of the eye, and we seek a design that accommodates this feature. We allow for exploration of the dose combination region, which has been defined using data from previous observational and randomized studies, without consideration of formal dose escalation/de-escalation rules. As each new dose combination investigated carries engineering costs, early stopping is desirable if warranted by the data.
Considering this motivating example, we propose novel methodology for multi-agent dose-finding which meets several criteria. First, the model estimating the response surface is parsimonious, which is suitable for small sample sizes, yet remains flexible enough to model both monotonic and nonmonotonic dose-response surfaces. Next, the design sequentially explores the entire dose combination space in a principled way, and allows for early stopping if warranted. Finally, our methodology facilitates extension to personalized dose-finding. Specifically, we propose modelling the response surface with a Gaussian process (GP) and using Bayesian optimization for efficient standard and personalized multi-agent dose-finding, assuming minimal toxicity. Bayesian optimization has been previously proposed for monotherapy in a standard dose-finding setting (Takahashi & Suzuki, 2021a, 2021b). Our contributions here differ from this previous work by utilizing a different sequential search policy, by extending the approach to combination therapies and to the personalized dose-finding setting, and by permitting early stopping.
The remainder of this manuscript is organized in the following manner. We first introduce Bayesian optimization, propose methods for standard and personalized multi-agent dose-finding, and describe an early stopping rule. A simulation study is then performed to compare the standard and personalized dose-finding methods for a dual-agent therapy under a variety of scenarios without early stopping. We then consider the development of the previously described intraocular implant and compare several dose-finding designs which include early stopping rules. We conclude with a discussion and possible directions for future work.
2 Bayesian optimization for dose-finding
Standard dose-finding. Let $f(\mathbf{x})$ be a continuous dose-response surface of interest, defined as a function of dose combinations $\mathbf{x} = (x_1, \ldots, x_D)$, which are comprised of $D$ dosing agents and which lie in a continuous dose combination region $\mathcal{X} \subset \mathbb{R}^D$. In this setting, $f$ may be the dose-efficacy surface or the dose-utility surface, where utility is defined using both efficacy and toxicity outcomes. For the remainder of the manuscript, we assume $f$ is the dose-efficacy surface, which has been transformed such that smaller values are more desirable. We assume minimal toxicity over $\mathcal{X}$. Thus, dose escalation/de-escalation rules are not considered, and it is ethically permissible to select future dose combinations throughout the design space. Our interest is in obtaining the optimal dose combination $\mathbf{x}_{\mathrm{opt}}$, defined as that which minimizes $f$:
$$\mathbf{x}_{\mathrm{opt}} = \arg\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}).$$
To do so, we use Bayesian optimization, which is a derivative-free optimization method that estimates the global optima of expensive-to-evaluate objective functions (Garnett, 2023). It is a sequential design strategy that approximates the objective function with a stochastic surrogate model (commonly a GP), and selects the next point by optimizing an acquisition function. Bayesian optimization is commonly used in engineering, machine learning, and computer experiments (Gramacy, 2020; Murphy, 2023). In this work, we model the efficacy function with a GP, and select the next dose combination as that which maximizes an acquisition function. This process repeats until optimality criteria are satisfied or sample size limits are reached.
To begin, a GP prior is placed on $f$. This prior is defined by its mean function, $\mu(\mathbf{x})$, and its correlation function (kernel) $k(\mathbf{x}, \mathbf{x}')$, which determines the correlation between two responses as a function of their corresponding dose combinations, $\mathbf{x}$ and $\mathbf{x}'$ (Williams & Rasmussen, 2006). This correlation function is multiplied by a scale parameter $\nu$, which determines the variability of the efficacy function throughout the dose combination space and yields the covariance function $\nu\, k(\mathbf{x}, \mathbf{x}')$. Since the method’s flexibility is largely determined by the choice of kernel function, constant mean GPs can be used, and in this work, we set $\mu(\mathbf{x})$ equal to an unknown constant. However, information about the response surface (e.g. pharmacokinetic/pharmacodynamic models) can be incorporated through the mean function if desired. Furthermore, we utilize a separable anisotropic squared exponential kernel,
$$k(\mathbf{x}, \mathbf{x}') = \exp\left\{-\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{2\ell_d^2}\right\}.$$
This kernel is parameterized by characteristic length-scales $\ell_1, \ldots, \ell_D$, which control the rate of decay in the correlation between two dose combinations with respect to each dosing dimension and which are estimated using available data. Additionally, it is an example of a stationary kernel, which assumes that the degree of correlation between two dose combinations depends only on their distance and not on their locations in the dose combination space. We assume stationary dose-response surfaces throughout this work, but comment on relaxing this assumption in the discussion.
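For illustration, a minimal R sketch of this kernel is given below; the length-scale and scale values in the example calls are hypothetical and serve only to show how correlation decays at different rates across dosing dimensions (in practice these hyperparameters are estimated from the data).

```r
# Separable anisotropic squared exponential kernel (illustrative sketch).
# x1, x2: dose combinations (numeric vectors, one entry per dosing agent)
# ell:    characteristic length-scale per dosing dimension
# nu:     scale parameter controlling the variability of the efficacy function
sqexp_kernel <- function(x1, x2, ell, nu = 1) {
  nu * exp(-sum((x1 - x2)^2 / (2 * ell^2)))
}

# Correlation decays faster along the dimension with the smaller length-scale:
sqexp_kernel(c(0.2, 0.2), c(0.4, 0.2), ell = c(0.1, 0.5))  # move along agent 1
sqexp_kernel(c(0.2, 0.2), c(0.2, 0.4), ell = c(0.1, 0.5))  # move along agent 2
```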
To optimize $f$, $r$ patients are assigned to each of $c$ initial dose combinations, yielding $n = rc$ patient responses which are treated as noisy observations $y_i = f(\mathbf{x}_i) + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma_\epsilon^2)$ for $i = 1, \ldots, n$. This yields the observed data $\mathcal{D}_n = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where we denote the vector of observations by $\mathbf{y}_n$. Specifying the GP prior above induces a multivariate normal distribution on the observations, $\mathbf{y}_n \sim N_n\big(\boldsymbol{\mu},\ \nu(\mathbf{K}_n + g\mathbf{I}_n)\big)$, where $\mathbf{K}_n$ has entries $k(\mathbf{x}_i, \mathbf{x}_j)$ and $g$ is a noise parameter. After observing $\mathcal{D}_n$, we adopt an empirical Bayes approach towards the kernel hyperparameters $(\nu, g, \ell_1, \ldots, \ell_D)$, replacing them with their maximum-likelihood estimates (Gramacy, 2020). The posterior distribution of the efficacy function at a new dose combination, denoted by $\tilde{\mathbf{x}}$, is then normal (Binois & Gramacy, 2021), such that
$$f(\tilde{\mathbf{x}}) \mid \mathcal{D}_n \sim N\big(\mu_n(\tilde{\mathbf{x}}),\ \sigma_n^2(\tilde{\mathbf{x}})\big), \quad (2)$$
where $\mu_n(\tilde{\mathbf{x}})$ is $\hat{\mu} + \mathbf{k}_n(\tilde{\mathbf{x}})^\top(\mathbf{K}_n + \hat{g}\mathbf{I}_n)^{-1}(\mathbf{y}_n - \hat{\mu}\mathbf{1}_n)$, $\sigma_n^2(\tilde{\mathbf{x}})$ is $\hat{\nu}\big[k(\tilde{\mathbf{x}}, \tilde{\mathbf{x}}) - \mathbf{k}_n(\tilde{\mathbf{x}})^\top(\mathbf{K}_n + \hat{g}\mathbf{I}_n)^{-1}\mathbf{k}_n(\tilde{\mathbf{x}})\big]$ with $\mathbf{k}_n(\tilde{\mathbf{x}})$ the vector of kernel evaluations between $\tilde{\mathbf{x}}$ and the evaluated dose combinations, and $\hat{\mu}$ is a plug-in estimate for the mean, each of which is calculated using the maximum-likelihood estimates of the hyperparameters. A derivation of (2) is provided in the online supplementary material.
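A minimal R sketch of this plug-in posterior computation follows, using the kernel function sketched above and treating the hyperparameters (`ell`, `nu`, and the noise parameter `g`) as already replaced by their maximum-likelihood estimates; the constant mean is estimated by the simple sample mean here, which is a simplification of the plug-in estimator described above.

```r
# Plug-in GP posterior mean and standard deviation at new dose combinations
# (illustrative sketch; assumes sqexp_kernel() from the previous snippet).
gp_posterior <- function(Xnew, X, y, ell, nu, g) {
  # X: n x D matrix of evaluated dose combinations; y: length-n response vector
  # Xnew: m x D matrix of candidate dose combinations; g: noise (nugget) parameter
  kfun <- function(A, B) {
    K <- matrix(0, nrow(A), nrow(B))
    for (i in seq_len(nrow(A)))
      for (j in seq_len(nrow(B)))
        K[i, j] <- sqexp_kernel(A[i, ], B[j, ], ell, nu)
    K
  }
  K  <- kfun(X, X) + diag(nu * g, nrow(X))  # covariance of the observed responses
  Ks <- kfun(Xnew, X)                       # cross-covariance with the candidates
  mu <- mean(y)                             # simple plug-in estimate of the constant mean
  post_mean <- mu + Ks %*% solve(K, y - mu)
  post_var  <- pmax(nu - rowSums(Ks * t(solve(K, t(Ks)))), 0)
  list(mean = drop(post_mean), sd = sqrt(post_var))
}

# Toy usage: three observed dose combinations, two candidate dose combinations
X <- rbind(c(0.1, 0.1), c(0.5, 0.5), c(0.9, 0.2))
y <- c(1.2, 0.4, 0.9)
gp_posterior(rbind(c(0.3, 0.3), c(0.7, 0.7)), X, y, ell = c(0.3, 0.3), nu = 1, g = 0.1)
```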
The next dose combination for evaluation, denoted by $\mathbf{x}_{n+1}$, is then selected as the candidate dose combination which maximizes an acquisition function, denoted by $a_n(\mathbf{x})$:
$$\mathbf{x}_{n+1} = \arg\max_{\mathbf{x} \in \mathcal{X}} a_n(\mathbf{x}).$$
One acquisition function commonly paired with a GP is the expected improvement (EI), defined as
$$\mathrm{EI}_n(\mathbf{x}) = \mathbb{E}\big[\max\{f_{\min} - f(\mathbf{x}),\, 0\}\big],$$
and available in closed form (Jones et al., 1998),
$$\mathrm{EI}_n(\mathbf{x}) = \big(f_{\min} - \mu_n(\mathbf{x})\big)\,\Phi\!\left(\frac{f_{\min} - \mu_n(\mathbf{x})}{\sigma_n(\mathbf{x})}\right) + \sigma_n(\mathbf{x})\,\phi\!\left(\frac{f_{\min} - \mu_n(\mathbf{x})}{\sigma_n(\mathbf{x})}\right),$$
where $f_{\min}$ denotes the value of the current observed optimum, $\Phi$ and $\phi$ denote the cdf and pdf of a standard normal random variable, respectively, and where $\mu_n(\mathbf{x})$ and $\sigma_n(\mathbf{x})$ are the posterior mean and standard deviation of the efficacy function evaluated at $\mathbf{x}$, respectively. The EI balances between exploiting regions that have desirable values of $\mu_n(\mathbf{x})$ (first term) and exploring regions in the dose combination space that are imprecisely estimated (second term). Under a noisy setting, $f_{\min}$ is not observed, and so plug-in estimates have been proposed, for example, the minimum of the posterior mean, $\min_{\mathbf{x}} \mu_n(\mathbf{x})$ (Picheny et al., 2013). A different acquisition function, called the augmented expected improvement (AEI; Huang et al., 2006), has been shown to offer better performance than EI under higher noise settings (Picheny et al., 2013). The AEI uses $\mu_n(\mathbf{x}^{**})$, the posterior mean of $f$ at the current ‘effective best solution’, $\mathbf{x}^{**}$, which can be defined as the point which minimizes a posterior $\beta$-quantile. We follow the recommendation in Huang et al. (2006) and define $\mathbf{x}^{**}$ as the point which minimizes the posterior quantile $\mu_n(\mathbf{x}) + \sigma_n(\mathbf{x})$, which corresponds to setting $\beta = \Phi(1) \approx 0.84$. The AEI is then defined as:
$$\mathrm{AEI}_n(\mathbf{x}) = \mathbb{E}\big[\max\{\mu_n(\mathbf{x}^{**}) - f(\mathbf{x}),\, 0\}\big]\left(1 - \frac{\sigma_\epsilon}{\sqrt{\sigma_n^2(\mathbf{x}) + \sigma_\epsilon^2}}\right).$$
The multiplicative term $1 - \sigma_\epsilon/\sqrt{\sigma_n^2(\mathbf{x}) + \sigma_\epsilon^2}$ serves to promote exploration by penalizing dose combinations that have small posterior variance (Picheny et al., 2013). In this work, we use the AEI since we find it offers a moderate improvement over the EI.
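The two acquisition functions can be written compactly in R; the sketch below assumes a Gaussian posterior with mean `mu` and standard deviation `sigma` at each candidate dose, with `f_best` the plug-in best value (for EI) or the posterior mean at the effective best solution (for AEI), and `sigma_eps` the noise standard deviation.

```r
# Expected improvement for minimization (closed form).
expected_improvement <- function(mu, sigma, f_best) {
  z <- (f_best - mu) / sigma
  ifelse(sigma > 0, (f_best - mu) * pnorm(z) + sigma * dnorm(z), 0)
}

# Augmented expected improvement: EI evaluated at the effective best solution,
# multiplied by a factor that down-weights points with small posterior variance.
augmented_ei <- function(mu, sigma, f_best, sigma_eps) {
  expected_improvement(mu, sigma, f_best) *
    (1 - sigma_eps / sqrt(sigma^2 + sigma_eps^2))
}

# Effective best solution: minimizer of the posterior quantile mu + sigma, e.g.
# idx_eff <- which.min(post$mean + post$sd); f_best <- post$mean[idx_eff]
```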
After evaluation of $\mathbf{x}_{n+1}$ in $r$ new patients, the data are updated, $\mathcal{D}_{n+r} = \mathcal{D}_n \cup \{(\mathbf{x}_{n+1}, y_i)\}_{i=n+1}^{n+r}$. The GP model is refit to obtain a new posterior distribution of the efficacy function. Then samples from this posterior are obtained to yield $S$ samples from the posterior of the optimal dose combination as
$$\mathbf{x}^{(s)}_{\mathrm{opt}} = \arg\min_{\mathbf{x} \in \mathcal{X}} f^{(s)}(\mathbf{x}), \quad s = 1, \ldots, S,$$
where $f^{(s)}$ denotes the $s$th joint posterior sample of the efficacy function. This procedure continues until the sample size limit is reached or an early stopping rule, which we denote by $\tau_n$, is satisfied.
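A minimal sketch of this posterior summary in R is shown below: given the joint posterior mean vector and covariance matrix of $f$ over a candidate grid (the object names here are hypothetical), each joint posterior draw of the surface contributes one draw from the posterior of the optimal dose combination.

```r
library(mvtnorm)

# Draw S samples from the posterior of the optimal dose combination by taking
# the grid point minimizing each joint posterior draw of the efficacy surface.
sample_opt_dose <- function(grid, post_mean, post_cov, S = 1000) {
  draws <- rmvnorm(S, mean = post_mean, sigma = post_cov)  # S x nrow(grid)
  grid[apply(draws, 1, which.min), , drop = FALSE]         # one minimizer per draw
}
```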
Algorithm 1 The standard optimization algorithm

Require: $r$ patient responses at each of $c$ initial dose combinations; maximum sample size $N$
1: $n \leftarrow rc$
2: $\mathcal{D}_n \leftarrow \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$; fit GP to $\mathcal{D}_n$
3: Obtain $\mu_n(\mathbf{x})$ and $\sigma_n(\mathbf{x})$ using fitted GP ⊳ Obtain posteriors
4: $\mathbf{x}^{**} \leftarrow \arg\min_{\mathbf{x} \in \mathcal{X}} \{\mu_n(\mathbf{x}) + \sigma_n(\mathbf{x})\}$ ⊳ Obtain effective best point
5: $f^{**} \leftarrow \mu_n(\mathbf{x}^{**})$ ⊳ Obtain best value of f
6: Calculate $\mathrm{AEI}_n(\mathbf{x})$ for $\mathbf{x} \in \mathcal{X}$ ⊳ Compute AEI
7: while $n < N$ and $\tau_n = 0$ do
8:  $\mathbf{x}_{\mathrm{new}} \leftarrow \arg\max_{\mathbf{x} \in \mathcal{X}} \mathrm{AEI}_n(\mathbf{x})$ ⊳ Obtain next dose
9:  for $j = 1, \ldots, r$ do
10:   Evaluate $y_{n+j}$ at $\mathbf{x}_{\mathrm{new}}$ ⊳ Observe outcomes
11:  end for
12:  $n \leftarrow n + r$ ⊳ Update n
13:  $\mathcal{D}_n \leftarrow \mathcal{D}_{n-r} \cup \{(\mathbf{x}_{\mathrm{new}}, y_{n-r+j})\}_{j=1}^{r}$ ⊳ Update data
14:  Refit GP; obtain $\mu_n(\mathbf{x})$ and $\sigma_n(\mathbf{x})$ ⊳ Obtain posteriors
15:  $\mathbf{x}^{**} \leftarrow \arg\min_{\mathbf{x} \in \mathcal{X}} \{\mu_n(\mathbf{x}) + \sigma_n(\mathbf{x})\}$ ⊳ Obtain effective best point
16:  $f^{**} \leftarrow \mu_n(\mathbf{x}^{**})$ ⊳ Obtain best value of f
17:  Calculate $\mathrm{AEI}_n(\mathbf{x})$ for $\mathbf{x} \in \mathcal{X}$ ⊳ Compute AEI
18:  Update $\tau_n$ using (4) ⊳ Update stopping rule
19: end while
Return: posterior distribution of the optimal dose combination, $p(\mathbf{x}_{\mathrm{opt}} \mid \mathcal{D}_n)$
One possible early stopping rule proposed in Huang et al. (2006) is to allow stopping only after the maximum of the AEI over the dose combination space, $\max_{\mathbf{x} \in \mathcal{X}} \mathrm{AEI}_n(\mathbf{x})$, falls below a threshold $\delta$. In this case, the algorithm terminates only if there is little improvement to be gained over $\mu_n(\mathbf{x}^{**})$ across the dose combination space. Under a noisy setting, the authors suggest this be satisfied for $d_1$ consecutive algorithm iterations before termination. This yields the following stopping rule:
$$\tau_n = \mathbb{1}\left\{\max_{\mathbf{x} \in \mathcal{X}} \mathrm{AEI}_m(\mathbf{x}) < \delta \ \text{for}\ m = n - d_1 + 1, \ldots, n\right\}. \quad (4)$$
We note that $\delta$ is a tuning parameter that controls the performance of the algorithms. Its value can be determined through sensitivity analysis using several candidate values obtained through Monte Carlo simulation. For example, the Monte Carlo distributions of $\max_{\mathbf{x} \in \mathcal{X}} \mathrm{AEI}_n(\mathbf{x})$ can be obtained at each iteration, and different values of $\delta$ can be selected as summary statistics of these distributions (e.g. the median). The performance of these values can then be compared, with smaller values of $\delta$ implying more stringent stopping criteria and larger values of $\delta$ permitting earlier stopping.
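Read this way, the stopping check reduces to a few lines of R; the sketch below assumes the maximum AEI over the candidate grid has been recorded at each iteration, and the number of consecutive iterations required, `d1`, is a design choice set here to an illustrative value.

```r
# Stop once the maximum AEI has stayed below delta for d1 consecutive iterations.
stop_early <- function(aei_max_history, delta, d1 = 2) {
  n_iter <- length(aei_max_history)
  n_iter >= d1 && all(tail(aei_max_history, d1) < delta)
}

stop_early(c(0.9, 0.4, 0.08, 0.05), delta = 0.1, d1 = 2)  # TRUE in this toy example
```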
The sequential procedure and stopping rule described above suggest Algorithm 1, referred to as the standard optimization algorithm. In many applications of Bayesian optimization, it is standard to collect the initial data and then continue with $r = 1$ at each iteration of the algorithm. This permits maximal exploration of the domain, as a novel input point from the design space can be evaluated each time. In certain dose-finding applications, such as the motivating example of this work, engineering costs may prohibit the creation of a novel dose combination for each new patient, and some patients may necessarily be assigned to the same dose.
Personalized dose-finding. Personalized medicine recognizes that response heterogeneity may exist within the population of interest; personalized dose-finding incorporates covariate information to account for this. Consider a set of $P$ discrete covariates $\mathbf{z} = (z_1, \ldots, z_P)$. The Cartesian product of the levels of these $P$ covariates defines $K$ strata, with covariate patterns $\mathbf{z}_1, \ldots, \mathbf{z}_K$. Personalized dose-finding seeks to find the optimal dose combinations, denoted by $\mathbf{x}_{\mathrm{opt},k}$, across the continuous dose combination space for each of the $K$ strata:
$$\mathbf{x}_{\mathrm{opt},k} = \arg\min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}, \mathbf{z}_k), \quad k = 1, \ldots, K.$$
The efficacy function is modelled using a single GP fit to the data $\mathcal{D}_n = \{(\mathbf{x}_i, \mathbf{z}_i, y_i)\}_{i=1}^{n}$, where $y_i = f(\mathbf{x}_i, \mathbf{z}_i) + \epsilon_i$ with $\epsilon_i \sim N(0, \sigma_\epsilon^2)$, which allows information to be borrowed across strata. One possible method of incorporating the additional covariates into the GP model is through the kernel function. We use a (stationary) separable anisotropic squared exponential kernel function that includes the additional covariates of interest,
$$k\big((\mathbf{x}, \mathbf{z}), (\mathbf{x}', \mathbf{z}')\big) = \exp\left\{-\sum_{d=1}^{D} \frac{(x_d - x_d')^2}{2\ell_d^2} - \sum_{p=1}^{P} \frac{(z_p - z_p')^2}{2\lambda_p^2}\right\}. \quad (5)$$
As before, we use an empirical Bayes approach for estimating the hyperparameters. In the case where $z_p$ is a binary variable representing the levels of covariate $p$, the correlation between two patient responses is reduced by a factor of $\exp\{-1/(2\lambda_p^2)\}$ if they belong to different strata with respect to covariate $p$.
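A minimal R sketch of this extended kernel is given below; the dose length-scales `ell` and covariate length-scales `lambda` are illustrative values (estimated from the data in practice), and the example verifies the constant reduction factor for a binary covariate.

```r
# Separable anisotropic squared exponential kernel with additional covariates.
sqexp_kernel_cov <- function(x1, x2, z1, z2, ell, lambda, nu = 1) {
  nu * exp(-sum((x1 - x2)^2 / (2 * ell^2)) - sum((z1 - z2)^2 / (2 * lambda^2)))
}

# For a binary covariate, responses from different strata at the same dose have
# their correlation reduced by the constant factor exp(-1 / (2 * lambda^2)):
sqexp_kernel_cov(c(0.2, 0.2), c(0.2, 0.2), z1 = 0, z2 = 1,
                 ell = c(0.1, 0.1), lambda = 0.8)
exp(-1 / (2 * 0.8^2))  # identical value
```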
Since the efficacy function may exhibit different behaviour within each stratum, each stratum may have a unique best efficacy function value $f^{**}_k$. Similar to the standard case, $f^{**}_k$ is estimated as the posterior mean at the effective best point within stratum $k$, $\mathbf{x}^{**}_k$. To account for this heterogeneity, the sequential selection is performed within each stratum, but with the GP fit utilizing data from all strata. This proceeds by modifying the AEI acquisition function to use $f^{**}_k$ when conditioned on being in stratum $k$, denoted by $\mathrm{AEI}_{n,k}(\mathbf{x})$. The sequential procedure continues until sample size limits are reached or until an early stopping rule is satisfied for each stratum. Since optimal doses in some strata may be easier to identify than in others, stratum-specific early stopping should be employed. One possible stratum-specific stopping rule, which we denote by $\tau_{n,k}$, replaces $\mathrm{AEI}_n(\mathbf{x})$ in (4) with $\mathrm{AEI}_{n,k}(\mathbf{x})$. Upon termination of the algorithm, the posterior distribution of the optimal dose combination for each stratum is returned, denoted by $p(\mathbf{x}_{\mathrm{opt},k} \mid \mathcal{D}_n)$. These modifications suggest Algorithm 2, the personalized optimization algorithm.
Algorithm 2 The personalized optimization algorithm

Require: $r$ patient responses at each of $c$ initial dose combinations per stratum; maximum sample size $N$
1: $n \leftarrow rcK$; $\mathcal{D}_n \leftarrow \{(\mathbf{x}_i, \mathbf{z}_i, y_i)\}_{i=1}^{n}$; fit GP to $\mathcal{D}_n$
2: Obtain $\mu_n(\mathbf{x}, \mathbf{z})$ and $\sigma_n(\mathbf{x}, \mathbf{z})$ using fitted GP ⊳ Obtain posterior of f
3: for $k = 1, \ldots, K$ do
4:  Obtain $\mu_{n,k}(\mathbf{x})$ and $\sigma_{n,k}(\mathbf{x})$ ⊳ Obtain posterior of f within stratum k
5:  $\mathbf{x}^{**}_k \leftarrow \arg\min_{\mathbf{x} \in \mathcal{X}} \{\mu_{n,k}(\mathbf{x}) + \sigma_{n,k}(\mathbf{x})\}$ ⊳ Obtain effective best point
6:  $f^{**}_k \leftarrow \mu_{n,k}(\mathbf{x}^{**}_k)$ ⊳ Obtain best value
7:  Calculate $\mathrm{AEI}_{n,k}(\mathbf{x})$ for $\mathbf{x} \in \mathcal{X}$ ⊳ Compute AEI
8:  $\tau_{n,k} \leftarrow 0$
9: end for
10: while $n < N$ and $\tau_{n,k} = 0$ for at least one stratum $k$ do
11:  for $k = 1, \ldots, K$ do
12:   if $\tau_{n,k} = 0$ then
13:    $\mathbf{x}_{\mathrm{new},k} \leftarrow \arg\max_{\mathbf{x} \in \mathcal{X}} \mathrm{AEI}_{n,k}(\mathbf{x})$ ⊳ Obtain next dose
14:    for $j = 1, \ldots, r$ do
15:     Evaluate $y$ at $(\mathbf{x}_{\mathrm{new},k}, \mathbf{z}_k)$ ⊳ Observe outcomes
16:    end for
17:    $n_k \leftarrow n_k + r$ ⊳ Update stratum sample size
18:   end if
19:  end for
20:  $n \leftarrow \sum_k n_k$ ⊳ Update n
21:  $\mathcal{D}_n \leftarrow \mathcal{D}_n \cup \{\text{new observations}\}$ ⊳ Update data
22:  Refit GP; obtain $\mu_n(\mathbf{x}, \mathbf{z})$ and $\sigma_n(\mathbf{x}, \mathbf{z})$ ⊳ Obtain posterior of f
23:  for $k = 1, \ldots, K$ do
24:   Obtain $\mu_{n,k}(\mathbf{x})$ and $\sigma_{n,k}(\mathbf{x})$ ⊳ Obtain posterior of f within stratum k
25:   if $\tau_{n,k} = 0$ then
26:    $\mathbf{x}^{**}_k \leftarrow \arg\min_{\mathbf{x} \in \mathcal{X}} \{\mu_{n,k}(\mathbf{x}) + \sigma_{n,k}(\mathbf{x})\}$ ⊳ Obtain effective best point
27:    $f^{**}_k \leftarrow \mu_{n,k}(\mathbf{x}^{**}_k)$ ⊳ Obtain best value
28:    Calculate $\mathrm{AEI}_{n,k}(\mathbf{x})$ for $\mathbf{x} \in \mathcal{X}$ ⊳ Compute AEI
29:    Update $\tau_{n,k}$ using (4) ⊳ Update stopping rule
30:   end if
31:  end for
32: end while
Return: posterior distributions of the stratum-specific optimal dose combinations, $p(\mathbf{x}_{\mathrm{opt},k} \mid \mathcal{D}_n)$, $k = 1, \ldots, K$
3 Simulation study
Below, we perform a simulation study to compare the performance of the standard and personalized algorithms (Algorithms 1 and 2, respectively) under three scenarios with no early stopping (i.e. the stopping rule $\tau_n$ in (4) is never invoked). Early stopping will be investigated in the next section. Scenarios 1 and 2 consider a single binary covariate $z \in \{0, 1\}$. We index the true optimal dose combinations, $\mathbf{x}_{\mathrm{opt},z}$, and the true optimal values of the efficacy function, $f^*_z$, using the values of $z$. That is, when $z = 0$ we use $\mathbf{x}_{\mathrm{opt},0}$ and $f^*_0$. Scenario 3 considers two binary covariates, $z_1$ and $z_2$, and the $\mathbf{x}_{\mathrm{opt}}$ and $f^*$ are indexed similarly. For example, when $z_1 = 0$ and $z_2 = 1$, we use $\mathbf{x}_{\mathrm{opt},01}$ and $f^*_{01}$. To make the simulations in this manuscript more computationally feasible, we modify Algorithms 1 and 2 to return a point estimate of $\mathbf{x}_{\mathrm{opt}}$ rather than the entire posterior distribution. The point estimate is defined as the minimizer of the posterior mean surface, $\hat{\mathbf{x}}_{\mathrm{opt}} = \arg\min_{\mathbf{x} \in \mathcal{X}} \mu_n(\mathbf{x})$.
We utilize dose combinations $\mathbf{x} = (x_1, x_2)$, assumed to be standardized so that $(0, 0)$ corresponds to the combination using the lowest doses of interest for each agent and $(1, 1)$ corresponds to the combination using the highest doses of interest for each agent. The point estimates, $\hat{\mathbf{x}}_{\mathrm{opt}}$, and the next dose combinations for evaluation, $\mathbf{x}_{n+1}$, are set, respectively, as the minimizers of the posterior mean and the maximizers of the AEI. The AEI is evaluated across an evenly spaced grid on the standardized dose combination region. The grid is incremented by a fixed step in each dimension, reflecting the degree of precision to which the drug maker can manufacture a particular dose combination. As a result, some dose combinations may be suggested more than once in the algorithm. While the proposed method is capable of optimizing over the continuous dose combination space, it is important to incorporate any manufacturing constraints into the optimization procedure to avoid suggesting doses that are not feasible to engineer.
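The candidate grid and the two grid searches can be sketched in a few lines of R; the 0.05 increment used below is illustrative only and stands in for the sponsor's actual manufacturing precision, and `post$mean` and `aei_values` are hypothetical objects holding the posterior mean and AEI evaluated over the grid.

```r
# Candidate dose grid on the standardized dose combination region.
dose_grid <- expand.grid(agent1 = seq(0, 1, by = 0.05),
                         agent2 = seq(0, 1, by = 0.05))
nrow(dose_grid)  # 21 x 21 = 441 candidate dose combinations

# Point estimate and next dose over the grid:
# x_hat  <- dose_grid[which.min(post$mean), ]   # minimizer of the posterior mean
# x_next <- dose_grid[which.max(aei_values), ]  # maximizer of the AEI
```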
As before, we define $f$ to be a continuous efficacy surface and assume we are in a minimal toxicity setting where it is ethically permissible to select future doses anywhere in the dose combination space. The efficacy functions under the considered scenarios are built from bivariate normal densities, each parameterized by a mean vector and covariance matrix and evaluated at $\mathbf{x}$, as summarized in Table 1.
The data-generating mechanism for each scenario is $y = f(\mathbf{x}, \mathbf{z}) + \epsilon$, where $\epsilon \sim N(0, \sigma_E^2)$, with the specification of $f$ and $\sigma_E$ included in the first panel of Table 1 (rows labelled ‘Simulation Study’) and plotted in Figures 1a–3a. The values of $\sigma_E$ are chosen to ensure specific standardized effect sizes, defined as the absolute value of the optimal efficacy function value divided by the noise standard deviation, $|f^*|/\sigma_E$. We consider several standardized effect sizes drawn from a meta-analysis of dose-responses for a large drug development portfolio at a pharmaceutical company (Thomas et al., 2014). We focus on standardized effect sizes for drugs that had laboratory-confirmed endpoints, which is the type of endpoint used in our motivating example. Selected percentiles of these standardized effect sizes define what we refer to as small, medium, and large effect sizes.
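A minimal R sketch of a data-generating mechanism of this form is shown below; the mean vector, covariance matrix, and noise level are illustrative and are not the values used in the paper's scenarios.

```r
library(mvtnorm)

# Efficacy surface built from a bivariate normal density (smaller values better),
# observed with Gaussian noise.
efficacy <- function(x) {
  -dmvnorm(x, mean = c(0.5, 0.5), sigma = diag(0.05, 2))
}
simulate_response <- function(x, sigma_E = 1) {
  efficacy(x) + rnorm(1, sd = sigma_E)
}

simulate_response(c(0.4, 0.6))
abs(efficacy(c(0.5, 0.5))) / 1  # standardized effect size |f*| / sigma_E
```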

Figure 3. Scenario 3: (a) efficacy function, with white stars denoting $\mathbf{x}_{\mathrm{opt},k}$; (b) expected dose units from the optimal dose combination, as defined in (6), by iteration; (c) average RPSEL, as defined in (7), by iteration; and (d) average absolute deviation of the posterior mean estimate by iteration. In (b), the plot for the $(z_1, z_2) = (0, 0)$ stratum is empty as no optimal dose combination exists.
Table 1. True optimal dose combinations, optimal efficacy function values, standardized effect sizes, and monotonicity of the dose-efficacy surface for the simulation study and implant scenarios

| Setting | Scenario | σ_E | z₁ | z₂ | x_opt | f* | ses | Monotone |
|---|---|---|---|---|---|---|---|---|
| Simulation Study | 1 | 2.015 | 0 | — | | 1.592 | 0.79 | Yes |
| | | | 1 | — | | 1.592 | 0.79 | Yes |
| | 2 | 0.319 | 0 | — | | 1.203 | 3.77 | No |
| | | | 1 | — | | 1.203 | 3.77 | No |
| | 3 | 1 | 0 | 0 | None | None | 0 | No |
| | | | 0 | 1 | | 1 | 1 | No |
| | | | 1 | 0 | | 3.77 | 3.77 | No |
| | | | 1 | 1 | | 0.79 | 0.79 | Yes |
| Implant | 1 | 5 | 0 | — | | 5 | 1 | No |
| | | | 1 | — | | 10 | 2 | No |

Note. The data-generating mechanism for each scenario is $y = f(\mathbf{x}, \mathbf{z}) + \epsilon$, where $\epsilon \sim N(0, \sigma_E^2)$. The table columns contain the noise standard deviation (σ_E), the covariate values (a single binary covariate, listed under z₁, for the simulation study scenarios 1 and 2 and the implant scenario; z₁ and z₂ for scenario 3), the location of the optimal dose combination (x_opt, marked by white stars in panel (a) of the corresponding figures), the optimal value of the efficacy function (f*), the standardized effect size (ses), and whether or not the dose-efficacy surface is monotonically increasing with respect to each dosing dimension (Monotone). The efficacy functions are built from bivariate normal densities with specific mean vectors and covariance matrices evaluated at $\mathbf{x}$, as defined in the text. The subtraction of 2 in the implant efficacy function corresponds to a base level of drug response outside the regions of optimality.
Scenario 1 considers the case of no response heterogeneity across a binary covariate and includes a small standardized effect size with a dose-efficacy surface which is monotonically increasing with respect to each dosing dimension. Both the locations of the optimal dose combinations and the optimal values of the efficacy function are the same across the strata. That is, $\mathbf{x}_{\mathrm{opt},0} = \mathbf{x}_{\mathrm{opt},1}$ and $f^*_0 = f^*_1$. Scenario 2 considers response heterogeneity across $z$, where the locations of the optimal dose combinations differ. Under this scenario, $\mathbf{x}_{\mathrm{opt},0} \neq \mathbf{x}_{\mathrm{opt},1}$ but $f^*_0 = f^*_1$. This scenario considers large standardized effect sizes with efficacy surfaces that are nonmonotone with respect to each dosing dimension. An example of a single trial performed using the personalized algorithm for this scenario is provided in Figure E.1 in the online supplementary material. Scenario 3 considers heterogeneity across two binary covariates, $z_1$ and $z_2$, where both the locations of the optimal dose combinations and the optimal values of the efficacy function differ across the strata. This scenario includes a zero stratum, $(z_1, z_2) = (0, 0)$, which represents the covariate pattern of those who do not respond to the drug. We note that in this stratum, $\mathbf{x}_{\mathrm{opt},00}$ and $f^*_{00}$ do not exist, but we consider the standardized effect size to be 0. This scenario includes small, medium, and large standardized effect sizes as well as both monotone and nonmonotone dose-efficacy surfaces.
For each scenario, the standard and personalized algorithms are run using a maximum sample size of 80 participants. While this number is larger than many early-phase trials might be in practice, our goal is to investigate the algorithms’ performance characteristics as the sample size increases. We defer the use of early stopping rules to the following section, but note that these will permit a reduction in the expected sample size. We assign participants to dose combinations such that the total sample size of each algorithm is equal at each iteration, which allows a comparison of their performance. For the standard algorithm under scenarios 1 and 2, $r = 4$ participants are assigned to each dose combination on an initial dose matrix comprised of $c = 5$ dose combinations, which are selected via a two-dimensional quasi-random Sobol sequence (Morgan-Wall, 2022). This sequence serves as a space-filling design and seeks to spread out the initial dose combinations in a more uniform manner than is typically accomplished via random sampling. A comparison to other space-filling designs is provided in the online supplementary material. This yields 20 initial participants, which represents 25% of the total sample size, following a recommendation of using 20%–50% of the data for the initial design (Picheny & Ginsbourger, 2014). More than one patient is assigned per dose combination to control the cost associated with producing novel dose combinations, a financial constraint from our motivating problem. At each iteration of the algorithm, $r = 4$ additional participants are assigned to each proposed dose combination. This yields a total sample size of 80 after 15 iterations. We note that $r$ is a tuning parameter, and that reducing it will lead to more unique doses being explored for the same fixed sample size. We explore this in the next section.
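The initial space-filling design can be generated as sketched below; here the Sobol points come from the randtoolbox package (the paper uses a different package for this step), and the snapping to a 0.05 grid is an illustrative stand-in for the manufacturing precision.

```r
library(randtoolbox)

c_init <- 5  # number of initial dose combinations
r      <- 4  # patients per dose combination

X0 <- sobol(n = c_init, dim = 2)                      # quasi-random points in [0, 1]^2
X0 <- round(X0 / 0.05) * 0.05                         # snap to an illustrative dose grid
init_design <- X0[rep(seq_len(c_init), each = r), ]  # replicate each dose for r patients
nrow(init_design)                                     # 20 initial patient assignments
```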
For the personalized algorithm, $c = 5$ initial dose combinations per stratum are selected in the same manner as above. To achieve the same total sample size by iteration as the standard algorithm, only $r = 2$ participants are assigned to each dose combination within each stratum. This yields 10 participants per stratum and so 20 initial participants, as in the standard algorithm. At each iteration within each stratum, $r = 2$ additional participants are assigned to each proposed dose combination. This yields a total sample size of 80 after 15 iterations. We note that this equal allocation of patients across strata represents an idealized trial, which assumes that both the prevalence and enrolment of the specific subgroups are the same throughout the course of the trial. We consider departures from these assumptions in the discussion section. The same setup is used for scenario 3. However, since the number of strata is doubled, the number of participants evaluated at each dose combination is halved for the personalized algorithm. Thus, the standard algorithm still evaluates $r = 4$ participants per dose combination, but the personalized algorithm evaluates $r = 1$ participant per dose combination within each stratum, yielding a total sample size of 80 after 15 iterations. All computing is performed in the statistical programming language R (R Core Team, 2022). The efficacy functions are modelled using constant mean GP models, which utilize the anisotropic separable squared exponential kernels previously described, with the hyperparameters being jointly optimized by maximizing the marginal log-likelihood of the data (i.e. empirical Bayes GP; Binois & Gramacy, 2021). A set of parametric dose-finding designs was also considered, but yielded performance that was, in general, worse than the standard and personalized algorithms, and so these are not included here. The simulation details and results are provided in the online supplementary material.
Algorithm performance is compared using several criteria, which are estimated via Monte Carlo simulation. The expected number of dosing units from the optimal dose combination is used to assess how close the recommended dose combination is to $\mathbf{x}_{\mathrm{opt}}$ at each iteration. This measure is defined as the expected value of the Euclidean distance between $\hat{\mathbf{x}}_{\mathrm{opt}}$ and $\mathbf{x}_{\mathrm{opt}}$, divided by the precision $\epsilon_{\mathrm{dose}}$ to which the sponsor can manufacture doses (the grid increment used in our simulations):
$$\mathbb{E}\left[\frac{\lVert \hat{\mathbf{x}}_{\mathrm{opt}} - \mathbf{x}_{\mathrm{opt}} \rVert_2}{\epsilon_{\mathrm{dose}}}\right]. \quad (6)$$
We utilize this single-number summary to facilitate visual comparison between the algorithms. We also report different selection metrics, including the probability of correctly selecting the optimal dose, the probability of selecting within one dose of the optimal dose, and the probability of selecting within one dose along the diagonal of the optimal dose, in the online supplementary material. We utilize the average root posterior squared error loss (RPSEL) to assess how well the true efficacy function value at the recommended dose is estimated by the pointwise posterior distribution of the efficacy function at the recommended dose, where $f^{(s)}(\hat{\mathbf{x}}_{\mathrm{opt}})$ denotes a single posterior sample out of $S$ posterior samples:
$$\mathrm{RPSEL} = \mathbb{E}\left[\sqrt{\frac{1}{S}\sum_{s=1}^{S}\Big(f^{(s)}(\hat{\mathbf{x}}_{\mathrm{opt}}) - f(\hat{\mathbf{x}}_{\mathrm{opt}})\Big)^2}\,\right]. \quad (7)$$
We employ $f(\hat{\mathbf{x}}_{\mathrm{opt}})$ here rather than $f(\mathbf{x}_{\mathrm{opt}})$ to understand how well the algorithms capture $f$ at the recommended dose combination even if the recommended dose combination is not optimal. This is important for later phase studies, which may utilize estimates of $f$ obtained at the recommended dose combination for sample size and power calculations. In a similar manner, we present the average absolute deviation of the posterior mean point estimates from the true value of $f(\hat{\mathbf{x}}_{\mathrm{opt}})$ at each iteration. Note that the standard algorithm ignores the strata and so yields only a single recommended dose, posterior distribution, and posterior mean per iteration, which are used for every stratum. See panels b–d of Figures 1–3 for these criteria by iteration.
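For a single simulated trial, the two criteria can be computed as in the sketch below; the manufacturing precision passed to the first function is illustrative, and the Monte Carlo averages in (6) and (7) are then taken over simulated trials.

```r
# Number of dosing units between the recommended and the true optimal dose (6).
dose_units_from_opt <- function(x_hat, x_opt, precision = 0.05) {
  sqrt(sum((x_hat - x_opt)^2)) / precision
}

# Root posterior squared error loss at the recommended dose (7):
# post_samples is a vector of S posterior draws of f evaluated at x_hat.
rpsel <- function(post_samples, f_true_at_xhat) {
  sqrt(mean((post_samples - f_true_at_xhat)^2))
}
```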
Scenario 1 considers the case of no response heterogeneity across a single binary covariate for small standardized effect sizes with monotonically increasing dose-efficacy surfaces. Under this scenario, both algorithms converge to the locations of $\mathbf{x}_{\mathrm{opt},z}$, coming within one dosing unit of the optimum (Figure 1b). Estimation of $f$ is challenging for the small standardized effect size (i.e. higher level of noise), but by termination, the algorithms come within 0.4 units of the true $f$ at the recommended dose combination (Figure 1c–d). The personalized algorithm is slightly less efficient than the standard algorithm, however, and takes longer to converge. This is likely the result of the GP model used in the personalized algorithm needing to estimate the additional length-scale parameter for $z$, $\lambda_1$.
Scenario 2 considers response heterogeneity across $z$ for large standardized treatment effect sizes with nonmonotonic dose-efficacy surfaces. Under this scenario, the personalized algorithm converges to the locations of $\mathbf{x}_{\mathrm{opt},z}$ and estimates the true value of $f$ at the recommended dose combination well, whereas the standard algorithm does not (Figure 2b–d). The standard algorithm typically explores the area near only a single optimum or attempts to explore both optima (not shown). This results from the marginal efficacy function surface being bimodal, since it is a mixture comprised of the equally weighted strata of $z$, which are displayed in Figure 2a. Note that even if the bimodality of the marginal surface is properly identified and explored, patients cannot be optimally treated without consideration of $z$. That is, without this additional covariate information, the standard algorithm cannot determine which mode should be used to treat patients with $z = 0$ versus $z = 1$. This is possible using the personalized approach, however.
Scenario 3 considers heterogeneity across two binary covariates, $z_1$ and $z_2$, and includes zero, small, medium, and large standardized effect sizes with both monotone and nonmonotone dose-efficacy surfaces. Recall that the $(z_1, z_2) = (0, 0)$ stratum corresponds to those patients who do not respond to the drug. Thus, there are no optima in this stratum, and the corresponding plot in Figure 3b is empty. Figure 3c,d is not empty for this stratum, however, since $f = 0$ everywhere, and so it is of interest to see how well the algorithms estimate this value. Under this scenario, the personalized algorithm converges to the locations of $\mathbf{x}_{\mathrm{opt},k}$, coming within 1–1.5 dosing units of the optimum depending on the standardized effect size, whereas the standard algorithm does not (Figure 3b). Since the standard algorithm targets the global optimum, it performs best in the stratum where the standardized effect size is the largest. Accurate estimation of $f$ is challenging under this scenario (Figure 3c,d). The personalized algorithm shows evidence of convergence towards the true values of $f$ (i.e. RPSEL and absolute deviation of the posterior mean estimates decreasing to 0), whereas the standard algorithm does not. The standard algorithm yields only a single estimate of $f$ and so must split the difference among the different efficacy function values across the strata. This scenario suggests that as the number of strata grows, and thus also the likelihood for some degree of response heterogeneity to be present, the performance of the standard algorithm will be further degraded.
In summary, when heterogeneity exists across the strata, the personalized algorithm is superior both in identifying the locations of the optimal dose combinations and in estimating $f$. When no heterogeneity exists, the standard algorithm is slightly more efficient. Additionally, the proposed methods have performed well for both monotonic and nonmonotonic dose-efficacy surfaces, and have done so without utilizing strong prior information.
4 Dose-finding design for an intraocular implant
In this section, we focus on the intraocular implant example. The goal is to develop an intraocular implant with an optimal dose combination of two agents which reduce intraocular pressure (IOP), a laboratory-confirmed measurement. The normal range of IOP is 12–21 mmHg, with 21–30 mmHg considered elevated. Elevated IOP is a risk factor for ocular hypertension and glaucoma, and is strongly associated with increased vision loss (Leske et al., 2003). The implant seeks to reduce elevated IOP by combining two agents, with doses $x_1$ and $x_2$, each of which has been in use individually in topical form for many years. The agents are well tolerated and we do not expect any drug-related adverse events. We are interested in obtaining the optimal dose combination of these two agents using the reduction in IOP from baseline as a continuous efficacy measure. It is hypothesized that higher doses do not necessarily imply greater efficacy, which is expected to plateau or even decrease at higher levels of agent concentration. Additionally, we expect response heterogeneity to exist with respect to a particular characteristic of the lens of the eye, which we treat as a binary covariate $z$, and are interested in a design that permits the identification of potentially different optimal dose combinations according to this patient characteristic.
We allow the dose-finding algorithm to explore the dose combination region, assumed to be standardized, subject to the manufacturing precision constraint on standardized dosing units, and deem it ethically permissible to proceed without safety-related dose escalation/de-escalation rules. It is hypothesized that the implant can reduce elevated IOP by 5 mmHg in individuals with $z = 0$, but may be even more effective in individuals with $z = 1$, leading to reductions as high as 10 mmHg. To assess the cost and size of a hypothetical trial, we are interested in comparing the standard and personalized dosing approaches under different stopping rule specifications for the scenario described above. The final design is then selected as the one that balances good performance against expected cost. Costs are measured in terms of enrolled participants and also the number of unique dose combinations, since there are engineering costs associated with the production of novel doses.
The goal is to minimize the efficacy function. The data-generating mechanism is $y = f(\mathbf{x}, z) + \epsilon$, where $\epsilon \sim N(0, \sigma_E^2)$, with the specification of $f$ and $\sigma_E$ included in the second panel of Table 1 (rows labelled ‘Implant’) and plotted in Figure 4a. We use the same indexing for $\mathbf{x}_{\mathrm{opt},z}$ and $f^*_z$ as described previously. The value of $\sigma_E$ ensures medium standardized effect sizes of 1 and 2 for $z = 0$ and $z = 1$, respectively, across nonmonotonic dose-efficacy surfaces.

Figure 4. Intraocular implant scenario: (a) efficacy function, with white stars denoting $\mathbf{x}_{\mathrm{opt},z}$; (b) expected sample size and expected number of unique doses evaluated; (c) expected dose units from the optimal dose combination, as defined in (6), by iteration; (d) average RPSEL, as defined in (7), by iteration; and (e) average absolute deviation of the posterior mean estimate by iteration. On the x-axis in (b), P stands for personalized and S for standard, followed by a number referring to high replication at a smaller number of doses (1) or low replication at a larger number of doses (2).
Two settings of the algorithms are compared for a maximum sample size of 80: one setting includes a higher number of replications at a smaller number of doses, and the other includes a smaller number of replications at a larger number of doses. We denote the standard and personalized algorithms under the first setting as S1 and P1, and under the second setting as S2 and P2. Under the first setting, S1 and P1 are run under the same specifications described in the previous section, with $r = 4$ and $r = 2$ participants per dose combination for the standard and personalized algorithms, respectively. Under the second setting, S2 and P2 are run with smaller values of $r$ and correspondingly more initial dose combinations, again selected via Sobol sequences, so that more unique dose combinations are evaluated for the same total sample size.
As the sponsor is concerned about the cost and size of the trial, early stopping is permitted using the rule defined in (4). Early stopping is investigated by choosing values of $\delta$, as previously described, such that there is a moderately high chance of stopping after roughly 40 or 60 total participants are enrolled in the trial (indicated in Figure 4); a separate value of $\delta$ is obtained for each of the four algorithm settings. Since we are considering a dual-agent dose combination, we require the stopping criterion in (4) to be satisfied on consecutive iterations before stopping early.
For the personalized algorithm, we permit stratum-specific early stopping. Importantly, should the exploration of one stratum stop early, we allocate the remaining budget to the recruitment of participants in the other stratum. This assumes the sponsor can target recruitment specifically for this group. This reallocation enables resources to be utilized in strata that are harder to optimize, and so may increase performance. We compare this to no early stopping, where the numbers of participants enrolled in each stratum are equal. When combined with the two settings for each algorithm, 12 unique designs are defined: each of the four algorithm settings is paired with three stopping rules (targeting roughly 40 participants, targeting roughly 60 participants, or no early stopping). All computing and inference is performed as previously described. The performance of the designs is compared using the previously defined criteria, which are estimated via 1,000 Monte Carlo replicates.
The expected sample size and the expected number of unique doses evaluated by each design for the scenario described above are included in Figure 4b. The standard and personalized designs have approximately equal expected sample sizes within each stopping rule, but the personalized designs are expected to evaluate more unique doses on average. The performance of the personalized designs with respect to expected dosing units from $\mathbf{x}_{\mathrm{opt},z}$ (Figure 4c) is comparable within each stopping rule, with differences of around 0.1 standardized dosing units, which are too small to be practically meaningful.
There is some suggestion of a possible increase in the expected number of dosing units from $\mathbf{x}_{\mathrm{opt},z}$ under no early stopping when compared with early stopping. To investigate this, an additional simulation was performed that compared designs with maximum sample sizes of 40 and 60 under no early stopping to those that permit early stopping at roughly these same numbers of participants. The simulation suggests that the increase mentioned above results from the equal allocation of participants across strata under no early stopping. When early stopping is permitted, a larger proportion of evaluations is allocated to the stratum that is more difficult to optimize, providing an improved model fit at each iteration of the algorithm and thereby improving the dose-finding overall. This difference in proportions can be observed in Figure 4b by noting that higher proportions of the expected sample sizes under the early stopping rules come from the $z = 0$ stratum, which has a smaller standardized effect size and is thus harder to optimize. Regardless, under the current scenario, the observed difference in expected dosing units from $\mathbf{x}_{\mathrm{opt},z}$ between the stopping rules for the personalized designs is too small to be meaningful. However, future work should more fully investigate how equal versus unequal allocation of participants across the strata at each algorithm iteration impacts design performance.
The personalized design with the higher degree of replication (P1) estimates $f$ best, supporting findings in the literature that suggest higher degrees of replication can be beneficial for estimation under noisy settings (Binois et al., 2018). This difference is most apparent when comparing the personalized designs P1 and P2 in Figure 4d,e. The standard algorithms perform poorly across all performance metrics, recommending doses that are on average farther away from $\mathbf{x}_{\mathrm{opt},z}$ and poorly estimating the efficacy function $f$ at the recommended dose combinations (Figure 4c–e). This poor performance is expected since response heterogeneity is present in the true data-generating mechanism. If response heterogeneity were not present, we would expect the standard algorithms to be slightly more efficient than their personalized counterparts, as was observed in the simulation study of the previous section.
To suggest a final design to the sponsor, we use Figure 4 as a visual aid. Since response heterogeneity is expected a priori, the poor performance of the standard designs under this scenario renders them inappropriate. Instead, we select the personalized design with the higher degree of replication under the less aggressive early stopping rule. For roughly the same expected sample size but fewer unique dose evaluations, this design yields final dose suggestions that are as close to $\mathbf{x}_{\mathrm{opt},z}$ as its lower-replication counterpart, and it also offers a mild improvement in the estimation of $f$. Choosing the less aggressive stopping rule over the more aggressive one does come with additional cost, however: the design under the more aggressive rule expects to evaluate 13 unique doses and enrol approximately 44 participants, whereas the selected design expects to evaluate about 15 unique dose combinations (a 15% increase) and enrol approximately 58 participants (a 32% increase). The sponsor would need to weigh the increased engineering and enrolment costs against the increase in performance.
5 Discussion
In this manuscript, we proposed the use of Bayesian optimization for early-phase multi-agent dose-finding trials in a minimal-toxicity setting. We showed the benefit of taking a personalized approach for dual-agent trials when heterogeneity exists across a set of prespecified subgroups. As expected, under no response heterogeneity, the personalized approach is slightly less efficient. As noted in the introduction, parametric models may suffer from the curse of dimensionality when transitioning from standard to personalized dose-finding, as they require terms for all higher-order dose-covariate interactions. By using the anisotropic squared exponential GP kernel in the Bayesian optimization methods proposed here, however, only a single additional parameter per covariate is required (the additional length-scale parameter corresponding to that covariate). Thus, the proposed methods highlight the benefit and feasibility of adopting a personalized approach towards early-phase multi-agent dose-finding trials for both monotonic and nonmonotonic dose-response surfaces.
The proposed approach is not without limitations. First, the methods proposed in this work assumed no meaningful toxicity across the dose combination space. Extension to higher-grade toxicity settings through incorporation of dose escalation/de-escalation remains as future work. Second, the personalized approach was demonstrated by considering dual-agent dose combinations in predefined subgroups only. Furthermore, the subgroups were defined using binary covariates only, reflecting the focus of the design that motivated this work. Extension to dose combinations with more than two agents and to categorical covariates is straightforward, though other kernel functions may be used if appropriate. Extension to continuous covariates without categorization is not trivial and could be the subject of future investigations. Under this setting, the models may extrapolate into regions of the covariate space that are unobserved in the sample data, and so this should be considered for any future approach. Another direction for future development is to extend the proposed approach to binary and ordinal outcomes where the proposed response models may be defined over a latent continuous surface.
In this manuscript, Bayesian optimization is utilized as a global optimizer. While other global optimization methods exist (e.g. genetic algorithms and simulated annealing), they require many function evaluations and are thus not appropriate for early-phase dose-finding trials where evaluations are expensive. We employed the AEI acquisition function under a GP surrogate model. The performance of the algorithms under additional acquisition functions, surrogate models, and/or kernel functions should be investigated. Different kernel functions may be required to incorporate informative prior information, e.g. specific relationships or orderings among patient subgroups. The (stationary) separable anisotropic kernel used in this work assumes that all strata have the same correlation structure and that the covariance between points in different strata is changed by a multiplicative factor only, which may not be true in general. Indeed, under simulation scenario 3, which included two binary covariates, the efficacy function is zero everywhere for the $(z_1, z_2) = (0, 0)$ stratum, so this assumption does not hold in that case. The efficacy function values in this stratum are perfectly correlated, whereas those corresponding to dose combinations in other strata are not. Future work should investigate relaxing the assumption of stationarity by using kernels that are nonseparable (e.g. including dose-covariate interaction terms in the kernel function (5)), or even nonstationary kernels or deep GPs (Sauer et al., 2023). Finally, if the number of included covariates is large and there is reason to believe that only low-level interactions between the drug combinations and covariates exist, different surrogate models could be employed, such as additive GPs or Bayesian additive regression trees (Chipman et al., 2010).
The simulations performed in this manuscript assumed that the subgroups represented by each stratum were equally prevalent and had the same enrolment rates throughout the trial. This represents an idealized situation and may not be reasonable to assume in practice. In the extreme scenario of one stratum having zero patients, the algorithm would simply base recommendations for this stratum on prior information only. Furthermore, due to the borrowing of information across the strata, the estimation for sparse strata may be dominated by strata with many patients. For these reasons, Zhang et al. (2024) suggest subgroup-specific dose optimization only be performed for predefined subgroups that have large enough sample sizes in the trial, a recommendation that supports the simulation scenarios evaluated in the present work.
Finally, we adopted an empirical Bayes approach towards the GP hyperparameter estimation to decrease the computational burden of the simulations. Likelihood methods can yield poor results when the sample sizes are small, as in early-phase dose-finding trials, and so full Bayesian inference may be preferred. Unfortunately, the Markov chain Monte Carlo methods typically used to perform full Bayesian inference are prohibitively expensive for the algorithms proposed here, and so a sequential Monte Carlo approach may be a less computationally demanding alternative (Gramacy & Polson, 2011).
Acknowledgments
J.W. acknowledges the support of a doctoral training scholarship from the Fonds de recherche du Québec—Nature et technologies (FRQNT) and a research exchange internship with PharmaLex Belgium.
Funding
S.G. acknowledges the support by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC; RGPIN-2020-04115) and a Chercheurs-boursiers (Junior 1) award from the Fonds de recherche du Québec - Santé (FRQS; #313252). E.E.M.M. acknowledges support from a Discovery Grant from NSERC (RGPIN-2019-04230). E.E.M.M. is a Canada Research Chair (Tier 1) in Statistical Methods for Precision Medicine and acknowledges the support of a chercheur de mérite career award from the FRQS (#309780). This research was enabled in part by support provided by Calcul Québec and the Digital Research Alliance of Canada.
Data availability
No new data were created or analysed in this study. The R scripts used for the simulations and graphics can be found on a public GitHub repository at https://github.com/jjwillard/bayesopt_pers_df.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series C.
References
Author notes
Conflicts of interest: None declared.