Ethan M Alt, Matthew A Psioda, Joseph G Ibrahim, Bayesian multivariate probability of success using historical data with type I error rate control, Biostatistics, Volume 24, Issue 1, January 2023, Pages 17–31, https://doi.org/10.1093/biostatistics/kxab050
Summary
In clinical trials, it is common to have multiple clinical outcomes (e.g., coprimary endpoints or a primary and multiple secondary endpoints). It is often desirable to establish efficacy in at least one of multiple clinical outcomes, which leads to a multiplicity problem. In the frequentist paradigm, the most popular methods to correct for multiplicity are typically conservative. Moreover, despite guidance from regulators, it is difficult to determine the sample size of a future study with multiple clinical outcomes. In this article, we introduce a Bayesian methodology for multiple testing that asymptotically guarantees type I error control. Using a seemingly unrelated regression model, correlations between outcomes are specifically modeled, which enables inference on the joint posterior distribution of the treatment effects. Simulation results suggest that the proposed Bayesian approach is more powerful than the method of Holm (1979), which is commonly utilized in practice as a more powerful alternative to the ubiquitous Bonferroni correction. We further develop multivariate probability of success, a Bayesian method to robustly determine sample size in the presence of multiple outcomes.
1. Introduction
Clinical trials fail when there is a lack of statistical evidence to suggest that treatment is efficacious. One reason for this is that the sample size is too small to establish statistical significance in the treatment effect when, in actuality, treatment is efficacious, that is, the study is underpowered. Sample size determination becomes more difficult in the presence of multiple hypotheses (e.g., coprimary endpoints, multiple primary endpoints, or primary and secondary endpoints), which are common in clinical trials. In practice, sample sizes are typically determined based solely on establishing efficacy in the primary endpoint(s). However, if multiple hypotheses are being considered, it is recommended to determine a sample size that is likely to provide substantial evidence of efficacy across all hypotheses (FDA, 2018). Unfortunately, this is seldom done in practice, in part due to a lack of statistical tools that can be used to robustly determine sample size in the presence of multiple, potentially correlated endpoints.
Due to safety concerns, potential lack of efficacy, cost, and the high failure rate of Phase III clinical trials, it has become common for clinical trial sponsors to make go/no-go decisions on the basis of data from Phase II trials and to quantify the likelihood of a successful Phase III clinical trial. Typically, decisions involved in study planning are made on the basis of classical hypothesis testing, controlling type I and type II error rates by defining prespecified criteria for the success or futility of the treatment. Sample sizes are often calculated based on an assumed treatment effect size and its variance, with success being confirmed if a significant |$p$|-value (say, |$p$|-value |$< 0.05$|) is obtained. However, uncertainty in the treatment effect size and its variance is not incorporated in this traditional approach.
As discussed by O’Hagan and others (2005), Chuang-Stein (2006), and Ibrahim and others (2015b), probability of success (POS) can be interpreted as a weighted average of power across possible effect sizes, that is, POS is the expectation of power with respect to some predictive distribution for future data. O’Hagan and others (2005) referred to POS as assurance, and it is also known as Bayesian expected power or simply Bayesian power (Psioda and Ibrahim, 2019).
The literature for computing POS in the no-covariate setting is vast, with most methods using Bayesian techniques to sample future clinical data while using frequentist approaches to determine success for the simulated data sets. Wang and others (2013), studying oncology trials, proposed to use the posterior distribution of the treatment effect to generate data for future trials, computing whether or not each simulated trial yielded a significant |$p$|-value. Ren and Oakley (2014) developed methods to compute POS for a single time-to-event endpoint for both parametric and nonparametric models. Ciarleglio and others (2016) and Ciarleglio and Arendt (2017) utilized a hybrid no-covariates method to compute POS using Bayesian conditional expected power (BCEP), which corresponds to POS conditional on a nonnull treatment effect, for superiority trials with normally distributed and binary endpoints, respectively.
In contrast, methods for computing POS in the presence of covariates are sparse. Ibrahim and others (2015b) developed a fully Bayesian approach to compute POS for a single normally distributed endpoint in the presence of covariates. Shieh (2017) showed that, under certain conditions, so-called multiplier methods, where multiples of the sample variance are used to determine sample size, are equivalent to POS for the normal linear model.
Recently, some methods have been developed for computing POS for correlated outcomes or when multiple trials must be conducted (e.g., multiple Phase III or Phase II and III trials). Zhang and Zhang (2013) developed a method to compute POS when multiple successful trials are required, noting that two successful Phase III trials are needed for approval and that computing POS using common observed data from earlier studies yields correlated posterior treatment effects. Yin (2017) developed a Bayesian approach for determining go/no-go decisions for Phase II clinical trials based on Phase III efficacy and POS requirements. Zhou and others (2018) developed closed-form equations for POS methods for interim monitoring in clinical trials with longitudinal outcomes. Saint-Hilary and others (2019) developed a method to compute POS using surrogate endpoints, which are common in oncology trials. However, none of these methods consider regression settings.
In this article, we develop a new approach to testing unions of hypotheses for clinical trials with multiple continuous outcomes that may reasonably be modeled by a multivariate normal distribution. The proposed method is applicable to a wide range of clinical trials whose primary and key secondary outcomes are continuous, such as ongoing trials in diabetes (e.g., NCT03739125) and Alzheimer’s Disease (e.g., NCT03443973 and NCT02008357). We show that, under the null hypothesis, the asymptotic distribution of the posterior probability of a union of alternative hypotheses is equal in distribution to the multivariate normal cumulative distribution function (CDF) transform of a multivariate normal random variable, unifying single and multiple hypothesis testing. Using this result, we develop an approach that asymptotically guarantees frequentist type I error control for testing unions of hypotheses. In an extensive simulation study, we compare type I error rates and power to testing unions of hypotheses using the Holm procedure, finding that our proposed method is more powerful across various correlations between outcomes while maintaining type I error control. While there are other methods for multiplicity control in the frequentist paradigm, typically based on fitting marginal models to each outcome separately, they are not recommended by regulators due to being dependent on the order of testing or due to assumptions regarding the correlations between outcomes (FDA, 2018).
We further develop a multivariate extension of POS. The method proposed includes those of Ibrahim and others (2015b) and Chuang-Stein (2006) as special cases. This fully Bayesian approach does not rely on |$p$|-values, hybrid approaches (e.g., Chuang-Stein, 2006), or hierarchical testing procedures (e.g., Holm, 1979).
The rest of the article is organized as follows. In Section 2, we develop the proposed Bayesian approach to ascertain whether treatment is efficacious for at least one of several clinical outcomes with type I error control. In Section 3, we introduce a multivariate version of POS. In Section 4, we present simulation results that compare frequentist type I error and a Bayesian version of power for our proposed approach to the Holm procedure. Section 5 applies the POS method using data from a real clinical trial. In Section 6, we close with some discussion.
2. Joint analysis of multiple clinical outcomes
In clinical trials, it is often of interest to know whether an investigational new drug is efficacious in at least one of several outcomes. For example, trials frequently have one primary and multiple secondary endpoints, and investigators wish to “hit” (i.e., reject the null hypothesis) on at least one of the secondary endpoints. With multiple chances to hit, a multiplicity problem arises. In this section, we introduce a novel Bayesian method for testing hypotheses where one wishes to know if there is substantial evidence of efficacy for at least one of several outcomes. We begin by introducing the Bayesian seemingly unrelated regression (SUR) model, which allows one to jointly model continuous outcomes. The Bayesian SUR model is particularly useful as it enables inference on the joint posterior distribution of the treatment effects. We then introduce the proposed method and provide heuristic and formal arguments of its type I error rate control.
2.1. The SUR model
2.2. Type I error control for unions
We now introduce the proposed method for testing whether there is substantial evidence of treatment efficacy on at least one of several outcomes. We assume without loss of generality that larger values of treatment effects are more indicative of efficacy. Let |$\Omega$| denote the set that defines the success criteria for the trial. For example, if the investigators wish to know whether treatment is efficacious in at least one of the outcomes, we may express |$\Omega = \{ \boldsymbol{\theta} : \cup_{j=1}^J (\beta_{1j} > \tau_j) \}$|, where |$\tau_j$| is a “target value,” that is, the minimum clinically important difference. We assume that covariates and treatment assignment are fixed and known. Note |$\Omega$| is defined irrespective of prior information.
Let |${\pi}^{(v)}_{\Omega_0}(\boldsymbol{\theta})$| denote a proper validation prior (Ibrahim and others, 2015b) with respect to the parameter space under the null hypothesis |$\boldsymbol{\theta} \in \Omega_0$|, where |$\Omega_0 = \{ \boldsymbol{\theta} : \cap_{j=1}^J \{\beta_{1j} = \tau_j\} \}$|. Validation priors are also referred to as sampling priors (Wang and Gelfand, 2002) or design priors (O’Hagan and others, 2005) and are generally specified to be nonzero over some subspace of the parameter space. We assume here that |${\pi}^{(v)}_{\Omega_0}(\boldsymbol{\theta}) \propto \tilde{\pi}^{(v)}(\boldsymbol{\theta}) 1\{ \boldsymbol{\theta} \in \Omega_0\}$|, where |$\tilde{\pi}^{(v)}$| is a proper prior. In order to compute type I error rates, we may generate from the prior predictive distribution of the validation prior, which is given by |$ f(\boldsymbol{y}) = \int f(\boldsymbol{y} | \boldsymbol{\theta}) \pi^{(v)}_{\Omega_0}(\boldsymbol{\theta}) {\rm d}\boldsymbol{\theta}. $| In the Bayesian paradigm, the type I error rate is controlled if |$E[1\{ P(\boldsymbol{\theta} \in \Omega | \boldsymbol{y}) \ge \gamma^* \}] \le 1 - \gamma$|, where |$\gamma^*$| is an evidence threshold, |$1 - \gamma$| is the desired type I error rate, and the expectation is taken with respect to |$f(\boldsymbol{y})$|. It is well known (and we show below) that under the validation prior |$\pi^{(v)}_{\Omega_0}$|, |$U_j \equiv P(\beta_{1j} > \tau_j | D)$| has an asymptotic standard uniform distribution, |$j = 1, \ldots, J$|. Hence, for a single outcome, we have |$\gamma^* = 0.95$| to control the type I error rate at |$0.05$|. However, when |$\Omega$| consists of unions of events, it is not immediately clear how to choose |$\gamma^*$|. For example, let |$\Omega = \cup_{j=1}^2 \{ \beta_{1j} > \tau_j \}$| so that |$P(\Omega | \boldsymbol{y}) = U_1 + U_2 - U_{12}$|, where |$U_{12} = P(\{\beta_{11} > \tau_1\} \cap \{\beta_{12} > \tau_2\} | D)$|. Clearly, |$P(\Omega | \boldsymbol{y})$| will only be asymptotically standard uniform if |$U_{12}$| is equal to |$U_1$| or |$U_2$| with probability 1. In fact, the asymptotic distribution of a union posterior probability may be derived as a multivariate normal CDF transformation of a multivariate normal random variable, which we state formally for a general setting in the following theorem:
Theorem 2.1. Let |$\boldsymbol{\nu} = (\boldsymbol{\alpha}', \boldsymbol{\gamma}')'$|, where |$\boldsymbol{\alpha}$| consists of the |$J$| parameters of interest and |$\boldsymbol{\gamma}$| is a vector containing nuisance parameters. Let |$\pi_{A_0}^{(v)}(\boldsymbol{\nu}) \propto \pi(\boldsymbol{\nu}) 1\{ \boldsymbol{\alpha} \in A_0 \}$|, where |$A_0 = \cap_{j=1}^J \{ \alpha_j = \tau_j \}$| and where |$\pi(\boldsymbol{\nu})$| is a proper prior. We may suppose without loss of generality that |$\tau_j = 0$| for |$j = 1, \ldots, J$|. Let the data generation process be given by |$f\left(\boldsymbol{y} \bigg| \pi^{(v)}_{A_0}\right) = \int f(\boldsymbol{y} | \boldsymbol{\nu}) \pi^{(v)}_{A_0}(\boldsymbol{\nu}) {\rm d}\boldsymbol{\nu}$|, that is, |$f(\boldsymbol{y})$| is the prior predictive distribution induced by |$\pi^{(v)}_{A_0}(\boldsymbol{\nu})$|. Let |$\pi^{(f)}(\boldsymbol{\nu})$| denote a prior on |$\boldsymbol{\nu}$| that results in a proper posterior density. Finally, let |$A = \cup_{j=1}^J \{ \alpha_j > 0 \}$| denote the set that defines clinical success. Then |$ P(\boldsymbol{\alpha} \in A | D, \pi^{(f)}) \overset{\text{d}}{\to} 1 - \Phi_J(\boldsymbol{Z} | \boldsymbol{0}, \boldsymbol{\Sigma}^*), $| where |$\boldsymbol{Z} \sim N_J(\boldsymbol{0}, \boldsymbol{\Sigma}^*)$|, |$\boldsymbol{\Sigma}^*$| is the limiting posterior covariance matrix of |$\boldsymbol{\alpha}$| with respect to the sampling distribution |$f\left(\boldsymbol{y}\bigg|\pi^{(v)}_{A_0}\right)$|, |$N_J(\cdot, \cdot)$| is the |$J$|-dimensional multivariate normal distribution, and |$\Phi_J(\cdot | \boldsymbol{m}, \boldsymbol{C})$| is the |$J$|-dimensional multivariate normal CDF with mean |$\boldsymbol{m}$| and covariance matrix |$\boldsymbol{C}$|.
The proof of Theorem 2.1 is provided in Appendix C of the Supplementary material available at Biostatistics online. In words, Theorem 2.1 says that, under the null hypothesis, the posterior probability of a union of events converges in distribution to the multivariate normal CDF transform of a multivariate normal random variable. Note that if |$J = 1$|, then |$\boldsymbol{Z} = Z$| is a scalar and therefore |$1 - \Phi_1(Z | 0, \sigma^2) \sim U(0, 1)$|, that is, the posterior probability based on an individual outcome is asymptotically standard uniform under the null hypothesis, a well-known result.
Using Theorem 2.1, we may find the value |$\gamma^*$| such that under the validation prior |$\pi^{(v)}$|, |$E\left[ 1\{ P(\Omega | D) \ge \gamma^* \} \right] = 1 - \gamma$| via simulation. In words, |$\gamma^*$| is an adjusted evidence threshold such that the type I error rate is asymptotically controlled for a union hypothesis. Specifically, we may generate |$\{\boldsymbol{Z}^{(m)}, m = 1, \ldots, M \}$| from |$N_J(\boldsymbol{0}, \boldsymbol{\Sigma}^*)$| and compute |$\{U^{(m)}, m = 1, \ldots, M\}$| via |$U^{(m)} = 1 - \Phi_{J}(\boldsymbol{Z}^{(m)} | \boldsymbol{0}, \boldsymbol{\Sigma}^*)$|. We then obtain |$\gamma^*$| as the solution to |$M^{-1} \sum_{m=1}^M 1\{ U^{(m)} > \gamma^* \} = 1 - \gamma$|. In practice, we may replace |$\boldsymbol{\Sigma}^*$| with its consistent estimator |$\hat{\boldsymbol{\Sigma}}^*$|, the posterior covariance matrix of |$\boldsymbol{\alpha}$| approximated via Markov chain Monte Carlo (MCMC).
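To make this concrete, the following minimal sketch (in Python with NumPy and SciPy; the function name and interface are ours, not part of the accompanying software) computes |$\gamma^*$| by Monte Carlo as the |$\gamma$|-quantile of the simulated |$U^{(m)}$|, with the estimated posterior covariance |$\hat{\boldsymbol{\Sigma}}^*$| passed in for |$\boldsymbol{\Sigma}^*$|:

```python
import numpy as np
from scipy.stats import multivariate_normal

def adjusted_threshold(Sigma_star, gamma=0.95, M=10_000, seed=1):
    """Monte Carlo approximation of the adjusted evidence threshold gamma*.

    By Theorem 2.1, under the null the union posterior probability behaves
    asymptotically like U = 1 - Phi_J(Z | 0, Sigma*), Z ~ N_J(0, Sigma*).
    Taking gamma* as the gamma-quantile of U gives P(U >= gamma*) = 1 - gamma.
    """
    rng = np.random.default_rng(seed)
    J = Sigma_star.shape[0]
    Z = rng.multivariate_normal(np.zeros(J), Sigma_star, size=M)
    U = 1.0 - multivariate_normal(np.zeros(J), Sigma_star).cdf(Z)
    return np.quantile(U, gamma)

# Example: two moderately correlated treatment effects. The resulting
# threshold exceeds the single-outcome value of 0.95.
Sigma_star = np.array([[1.0, 0.3],
                       [0.3, 1.0]])
gamma_star = adjusted_threshold(Sigma_star)
```

In practice, a larger value of |$M$| may be used; the multivariate normal CDF evaluation dominates the cost.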
This method enables practitioners to discover if treatment is efficacious for at least one outcome with type I error control, but it does not tell the practitioner which outcome. As the individual posterior probabilities, |$\{P(\beta_{1j} > \tau_j | D), j = 1, \ldots, J\}$|, factor into the computation of |$P(\Omega | D)$|, one way to discover the main contributors to the total evidence represented by the posterior probability for the union is to compare the marginal posterior probabilities for each treatment effect in the union. Larger values of the marginal posterior probabilities of the treatment effects may be interpreted as having comparatively more evidence. In Appendix F of the Supplementary material available at Biostatistics online, we illustrate via an example how one may ascertain which of the treatment effects has the greatest evidence of efficacy given substantial evidence for the overall union. In particular, if the posterior probability of the union exceeds its evidence threshold, then it is reasonable to conclude that treatment is efficacious for the outcomes for which the associated individual posterior probabilities are large.
3. Bayesian POS for multivariate linear regression models
In this section, we introduce the concept of multivariate POS. Specifically, we extend the methods of Ibrahim and others (2015b) to accommodate multiple outcomes using the SUR model. We then discuss computational development.
3.1. The general methodology
Suppose we wish to determine the sample size of a future study. Let |$\Omega$| denote the success criterion of the future study. For example, if we are only interested in a single primary endpoint, we may specify |$\Omega := \{ \boldsymbol{\theta} : \beta_{11} > \tau_1 \}$| for some target value |$\tau_1$|. In settings with coprimary endpoints, we may specify |$\Omega = \{ \boldsymbol{\theta} : \cap_{j=1}^K \{ \beta_{1j} > \tau_j \} \}$|, that is, a trial is declared successful when there is substantial evidence that the drug is efficacious in all |$K$| clinical outcomes. A third important specification of |$\Omega$| arises when it is of interest to determine whether a treatment improves at least one of several outcomes, which we may write as |$\Omega = \{ \boldsymbol{\theta} : \cup_{j=1}^K \{ \beta_{1j} > \tau_j \} \}$|.
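Given posterior draws of the treatment effects, each of these success criteria reduces to a simple Monte Carlo average. A minimal sketch (our own illustration; the function and argument names are hypothetical):

```python
import numpy as np

def success_probabilities(beta1_draws, tau):
    """Estimate P(Omega | D) for the three success criteria above.

    beta1_draws: (M, K) array of posterior draws of the treatment effects.
    tau: length-K array of target values.
    """
    exceed = beta1_draws > tau                       # M x K indicators
    return {
        "primary": exceed[:, 0].mean(),              # beta_11 > tau_1
        "coprimary": exceed.all(axis=1).mean(),      # intersection over K
        "at_least_one": exceed.any(axis=1).mean(),   # union over K
    }
```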
In words, (3.4) is the marginal probability that a future trial satisfies the success criteria with respect to the prior predictive distribution for the future data based on the validation prior.
3.2. Using historical data
We now discuss POS when one or two historical data sets are available. For example, a sponsor planning an initial Phase III study is typically in possession of a single historical data set (e.g., from a Phase II study). In general, more than two historical data sets may be available, but the methodology developed in this section is easily extended to handle such a case.
Let the historical data sets be represented by |$D_{0k} = \{(y_{0kij}, \boldsymbol{x}_{0ki}, z_{0ki}), i = 1, \ldots, n_{0k}; j = 1, \ldots, J; k = 1, 2\}$|, where |$z_{0ki} = 1$| if the |$i{\rm th}$| subject in study |$k$| received the experimental treatment and 0 otherwise, |$\boldsymbol{x}_{0ki}$| represents the vector of unique covariates for subject |$i$| in study |$k$|, |$y_{0kij}$| is the response variable for individual |$i$| and outcome |$j$| in study |$k$|, |$k = 1,2$|, |$j = 1, \ldots, J$|, |$i = 1, \ldots, n_{0k}$|. We assume without loss of generality that the data sets are ordered chronologically, so that |$D_{01}$| is the older study and |$D_{02}$| is the more recently completed study.
We assume |$\boldsymbol{y}_{0k} = \boldsymbol{X}_{0k} \boldsymbol{\beta} + \boldsymbol{\epsilon}_{0k}$|, where |$\boldsymbol{y}_{0k} = (\boldsymbol{y}_{0k1}', \ldots, \boldsymbol{y}_{0kJ}')'$| is the |$n_{0k}J$|-dimensional vector of all responses in study |$k$|, |$\boldsymbol{X}_{0k} = \mathop{\mathrm{blkdiag}}\nolimits\{ \boldsymbol{X}_{0k1}, \ldots, \boldsymbol{X}_{0kJ} \}$| is a block diagonal matrix of design matrices for each outcome in study |$k$| with |$\boldsymbol{X}_{0kj} = (\boldsymbol{z}_{0kj}, \boldsymbol{X}_{0kj2})$|, and |$\boldsymbol{\epsilon}_{0k} = (\boldsymbol{\epsilon}_{0k1}', \ldots, \boldsymbol{\epsilon}_{0kJ}')'$| is an |$n_{0k}J$|-dimensional error vector with |$\boldsymbol{\epsilon}_{0k} \sim \mathop{\mathrm{N}}\nolimits_{n_{0k}J}(\boldsymbol{0}_{n_{0k}J}, \boldsymbol{\Sigma} \otimes \boldsymbol{I}_{n_{0k}})$|.
The quantities |$b_{0j}, j = 1, 2$|, are hyperparameters that downweight the likelihood of the historical data. Typically, we choose values of |$b_{0j}$| closer to 0 when characteristics of the study participants in the planned future study are expected to be more different than those in the completed trial and values closer to 1 when the characteristics are similar. Note that if we are in possession of only one historical data set, we may set |$b_{01} = 0$| in (3.6).
Since the covariates are dependent, it may be difficult to specify |$f(\boldsymbol{x}_{0ki} | \boldsymbol{\alpha})$|. One way to do this is to utilize a factorization of the joint distribution informed by temporal relationships between the variables. For example, suppose we have three covariates |$(x_{01}, x_{02}, x_{03})'$|, where |$x_{01}$| is a binary variable indicating gender, |$x_{02}$| is weight measured in kilograms, and |$x_{03}$| counts the number of tumors at baseline. It is impossible for weight or number of tumors to cause gender, so we might first generate |$x_{01}$|. Gender likely has an effect on both weight and number of tumors, but there may not be an obvious way to order |$x_{02}$| and |$x_{03}$|. The two permutations may be compared, and a choice may be selected using model selection criteria (e.g., the deviance information criterion). Suppose such an analysis suggests generating |$x_{02}$| after |$x_{01}$|. We can therefore generate from the joint distribution via |$f(x_{01}, x_{02}, x_{03}) = f(x_{01}) f(x_{02} | x_{01}) f(x_{03} | x_{01}, x_{02})$|.
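A sketch of this sequential generation scheme follows (the parameter values are made up, standing in for a draw of |$\boldsymbol{\alpha}$| from the validation prior):

```python
import numpy as np

def sample_covariates(n, rng):
    """Draw covariates from f(x1) f(x2 | x1) f(x3 | x1, x2).

    The coefficients below are hypothetical placeholders for a draw of
    alpha; each conditional is a generalized linear model.
    """
    x1 = rng.binomial(1, 0.5, size=n)                # gender
    x2 = rng.normal(75.0 - 8.0 * x1, 12.0, size=n)   # weight (kg) | gender
    eta = -1.0 + 0.2 * x1 + 0.01 * x2                # log mean tumor count
    x3 = rng.poisson(np.exp(eta), size=n)            # tumors | gender, weight
    return np.column_stack([x1, x2, x3])

X_cov = sample_covariates(200, np.random.default_rng(0))
```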
In general, suppose there are |$L$| unique covariates |$\{x_{0i1}, \ldots, x_{0iL}, i = 1, \ldots, n_0 \}$|. Without loss of generality, suppose that they are ordered so that we want to generate samples of |$x_{01}$| first, |$x_{02}$| second, and so on, lastly generating samples from |$x_{0L}$|. Let |$\boldsymbol{\alpha}_l$| denote the parameter vector for |$f(x_{0l} | x_{01}, \ldots, x_{0,l-1} )$|. We may rewrite the validation prior (3.6) as |$ \pi_2^{(v)}(\boldsymbol{\alpha} | D_0, b_0) \propto \prod_{l=1}^L \prod_{k=1}^2 \left[ \prod_{i = 1}^{n_{0k}} f(x_{0il} | x_{0i1}, \ldots, x_{0i,l-1}, \boldsymbol{\alpha}_l) \right]^{b_{0k}} \pi_{20}^{(v)}(\boldsymbol{\alpha}). $| We assume each covariate is generated by a generalized linear model. For the purposes of our simulations and data analysis, which contain only binary and continuous data, we assume that the continuous data are normally distributed, and we take |$\pi_{20}^{(v)}(\boldsymbol{\alpha}) \propto 1$|. We note that multiple continuous covariates may alternatively be generated via a multivariate normal distribution. In general, more complicated covariate distributions may be utilized, such as the beta-binomial distribution for binary variables or the negative binomial distribution for count data.
The marginal posterior distribution of the regression coefficients based on (3.8) is not available in closed form except in special cases. However, it can be shown that the full conditional densities of |$\boldsymbol{\beta}$| and |$\boldsymbol{\Sigma}$| are normal and inverse-Wishart, respectively, so that an efficient Gibbs sampling algorithm may be utilized. We state this formally in the following theorem:
Theorem 3.1. Let |$D = \{ (\boldsymbol{y}_i, \boldsymbol{x}_i, z_i), i = 1, \ldots, n \}$| denote the observed data with likelihood from the SUR model (2.2). Let the fitting prior be given by (3.7), where |$\pi_0^{(f)}(\boldsymbol{\theta}) \propto \lvert \boldsymbol{\Sigma} \rvert^{-d_0/2}$|. Finally, define |$\tilde{\boldsymbol{\beta}} = (\boldsymbol{I}_p - \boldsymbol{\Lambda}) \hat{\boldsymbol{\beta}} + \boldsymbol{\Lambda} \hat{\boldsymbol{\beta}}_0$|, where |$\hat{\boldsymbol{\beta}} = [\boldsymbol{X}'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_n) \boldsymbol{X}]^{-1} \boldsymbol{X}'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_n) \boldsymbol{y}$| and |$\hat{\boldsymbol{\beta}}_0$| is analogously defined, |$\tilde{\boldsymbol{\Omega}} = \boldsymbol{X}'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_n) \boldsymbol{X} + a_0 \boldsymbol{X}_0' (\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_{n_0}) \boldsymbol{X}_0$|, |$\boldsymbol{\Lambda} = \tilde{\boldsymbol{\Omega}}^{-1} a_0 \boldsymbol{X}_0'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_{n_0}) \boldsymbol{X}_0$|, |$\tilde{\boldsymbol{V}} = \boldsymbol{A}(\boldsymbol{\beta}) + a_0 \boldsymbol{A}_0(\boldsymbol{\beta})$|, and |$\tilde{\nu} = n + a_0 n_0 + d_0 - J - 1$|, where |$\boldsymbol{A}(\boldsymbol{\beta})$| and |$\boldsymbol{A}_0(\boldsymbol{\beta})$| are the matrices of dot product residuals described in Section 2 for the current and historical data, respectively. Then |$ \boldsymbol{\beta} | D, \boldsymbol{\Sigma} \sim N_p(\tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{\Omega}}^{-1}) $| and |$ \boldsymbol{\Sigma} | D, \boldsymbol{\beta} \sim IW_J(\tilde{\nu}, \tilde{\boldsymbol{V}} ), $| where |$IW_J$| denotes the |$J$|-dimensional inverse-Wishart distribution.
The proof of Theorem 3.1 is provided in Appendix A of the Supplementary material available at Biostatistics online. It is worth mentioning that sampling from the posterior (3.8) is computationally expensive because, given |$\boldsymbol{\Sigma}$|, the posterior mean of |$\boldsymbol{\beta}$| depends on the generalized least squares estimator |$\hat{\boldsymbol{\beta}}$|, which involves computing large, sparse matrix products. An efficient algorithm for computing these products is provided in Appendix A of the Supplementary material available at Biostatistics online.
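A minimal sketch of the resulting Gibbs sampler is given below, assuming responses are stacked outcome by outcome as in Section 3.2 and that scipy's inverse-Wishart parameterization matches the theorem's; the function name and interface are ours. Dense Kronecker products are used here for readability, whereas the sparse algorithm in the appendix should be preferred in practice:

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_sur_power_prior(y, X, y0, X0, a0, d0, J, num_draws, rng):
    """Sketch of the Gibbs sampler implied by Theorem 3.1.

    y, y0: current/historical responses stacked outcome by outcome;
    X, X0: corresponding block diagonal design matrices;
    a0: power prior parameter; d0: initial prior hyperparameter.
    """
    n, n0, p = len(y) // J, len(y0) // J, X.shape[1]
    Sigma = np.eye(J)                         # initial value
    draws = np.empty((num_draws, p))
    for m in range(num_draws):
        # beta | Sigma, D ~ N_p(beta_tilde, Omega_tilde^{-1})
        Si = np.linalg.inv(Sigma)
        W, W0 = np.kron(Si, np.eye(n)), np.kron(Si, np.eye(n0))
        Omega = X.T @ W @ X + a0 * (X0.T @ W0 @ X0)
        Omega_inv = np.linalg.inv(Omega)
        beta = rng.multivariate_normal(
            Omega_inv @ (X.T @ W @ y + a0 * (X0.T @ W0 @ y0)), Omega_inv)
        # Sigma | beta, D ~ IW_J(nu_tilde, A(beta) + a0 * A0(beta))
        E = (y - X @ beta).reshape(J, n).T    # n x J residual matrix
        E0 = (y0 - X0 @ beta).reshape(J, n0).T
        V = E.T @ E + a0 * (E0.T @ E0)
        Sigma = invwishart.rvs(df=n + a0 * n0 + d0 - J - 1, scale=V,
                               random_state=rng)
        draws[m] = beta
    return draws
```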
Under the special case that all the design matrices are the same, it can be shown that the posterior distribution of the regression coefficients is a matrix |$t$| distribution. In particular, it may be shown that the posterior distribution of the treatment effects is a multivariate |$t$| distribution, and there are algorithms available for efficiently and accurately computing probabilities from the multivariate |$t$| distribution. Details are presented in Appendix B of the Supplementary material available at Biostatistics online. For our data analysis and simulations, the design matrices are the same, and we utilize the multivariate |$t$| result for posterior inference.
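When the multivariate |$t$| result applies, posterior probabilities of unions can also be estimated directly. A small illustration (Monte Carlo rather than the deterministic CDF algorithms referenced above; the posterior location, shape, and degrees of freedom below are hypothetical values):

```python
import numpy as np
from scipy.stats import multivariate_t

# Hypothetical multivariate t posterior for two treatment effects
m = np.array([0.1, 0.2])                     # posterior location
S = np.array([[0.040, 0.010],
              [0.010, 0.090]])               # posterior shape matrix
nu = 50                                      # posterior degrees of freedom

draws = multivariate_t(loc=m, shape=S, df=nu).rvs(size=100_000,
                                                  random_state=1)
p_union = (draws > 0.0).any(axis=1).mean()   # P(beta_11 > 0 or beta_12 > 0)
```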
In our development, the validation prior was discounted only for the covariates and not the response variables. The reason for doing so is that the validation prior for |$\boldsymbol{\theta}$| represents belief about the effects to be studied; it does not have any impact on the analysis of a future data set. If the validation prior for |$\boldsymbol{\theta}$| is discounted, it becomes more likely that unrealistic values for |$\boldsymbol{\theta}$| are generated, which may provide misleading estimates of POS. We allow discounting for the covariate parameters |$\boldsymbol{\alpha}$| in the validation prior because the covariates and characteristics of the trial participants in a future trial could be different from those in the historical data. On the other hand, the fitting prior for |$\boldsymbol{\theta}$| is discounted because the fitting prior is the prior utilized in the analysis of the future data, and it will generally be the case that one would want the future data to have more weight than the historical data. The weight parameter |$a_0$| in the fitting prior (3.7) can be treated as either fixed or random. For computational reasons, we consider |$a_0$| as fixed in this article. Ibrahim and others (2015a) give a thorough review of the power prior and a discussion of treating |$a_0$| as fixed and random.
3.3. POS algorithm
The algorithm to compute POS is as follows:
- Step 1:
Specify |$n$|, |$\{\Omega_k\}_{k=1}^K$| (the trial success criteria), |$q$| (the proportion of subjects randomized to the test drug group), |$M$| (the number of samples to draw), and |$B$| (the number of future data sets).
- Step 2:
Generate |$\boldsymbol{\theta}$| and |$\boldsymbol{\alpha}$| from the validation prior |$\pi^{(v)}(\boldsymbol{\theta}, \boldsymbol{\alpha})$|.
- Step 3:
Using the value of |$\boldsymbol{\alpha}$| from Step 2, generate |$\boldsymbol{x}_i$| from |$f(\boldsymbol{x}_i | \boldsymbol{\alpha})$| and randomly assign the |$i$|th subject to one of the two treatment arms by generating |$z_i \sim \text{Bernoulli}(q)$| independently for |$i = 1, \ldots, n$|, and set |$\boldsymbol{X}$| to be the resulting |$nJ \times (p + J)$| design matrix.
- Step 4:
Using the value of |$\boldsymbol{\theta}$| from Step 2 and the value of |$\boldsymbol{X}$| from Step 3, generate a future response vector |$\boldsymbol{y}$| from |$\mathop{\mathrm{N}}\nolimits_{nJ}(\boldsymbol{X\beta}, \boldsymbol{\Sigma} \otimes \boldsymbol{I}_n)$| to obtain |$D = (\boldsymbol{y}, \boldsymbol{X})$|.
- Step 5:
Using |$D$| from Step 4, sample from the posterior distribution using the fitting prior to generate |$\left\{ \boldsymbol{\theta}^{(m)}, m = 1, \ldots, M \right\}$|.
- Step 6:
Compute |$I^{(b)} = \prod_{k=1}^K 1\{ P(\Omega_k | D) \ge \gamma^*_k \}$| using |$\boldsymbol{\theta}^{(m)}$|, |$m = 1, \ldots, M$|, from Step 5, where |$\gamma^*_k$| is computed as described in Section 2.
- Step 7:
Repeat Step 2 to Step 6 |$B$| times to obtain |$I^{(1)}, \ldots, I^{(B)}$|. The POS is given by |$\mathop{\mathrm{POS}}\nolimits = \frac{1}{B} \sum_{b=1}^B I^{(b)}$|.
Steps 5 and 6 reduce to multivariate |$t$| CDF calculations when the design matrices are identical across outcomes, as described in Section 3.2.
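In code form, the loop structure of the algorithm might look as follows (a sketch of the steps above; the callable names are placeholders for the model-specific pieces, not an API from the accompanying software):

```python
import numpy as np

def pos(B, success_probs, thresholds, draw_validation,
        gen_design, gen_response, fit_posterior):
    """Skeleton of the POS algorithm above.

    draw_validation()       -> (theta, alpha) from the validation prior
    gen_design(alpha)       -> design matrix X (covariates + randomization)
    gen_response(theta, X)  -> future response vector y
    fit_posterior(y, X)     -> M posterior draws of theta (fitting prior)
    success_probs[k](draws) -> estimate of P(Omega_k | D)
    thresholds[k]           -> adjusted evidence threshold gamma*_k
    """
    hits = np.empty(B)
    for b in range(B):                            # Step 7 repeats Steps 2-6
        theta, alpha = draw_validation()          # Step 2
        X = gen_design(alpha)                     # Step 3
        y = gen_response(theta, X)                # Step 4
        draws = fit_posterior(y, X)               # Step 5
        hits[b] = float(all(p(draws) >= g         # Step 6
                            for p, g in zip(success_probs, thresholds)))
    return hits.mean()                            # POS estimate
```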
4. Simulation study
The Comprehensive Post-Acute Stroke Services (COMPASS) study (Duncan and others, 2017) was a two-arm, cluster-randomized pragmatic trial to evaluate whether a novel care model (the COMPASS intervention) improved a variety of patient-centered outcomes at 90 days compared to usual care (UC). We base our simulations on summary statistics from a subset of five high-enrolling intervention arm hospitals from the COMPASS study. Details on the COMPASS study are provided in Section 5. The complete historical data consisted of |$n_0 = 981$| participants and |$J = 3$| endpoints. We fix all of the power prior hyperparameters to 1 and assume the following model to generate the historical data: |$\boldsymbol{y}_{0i} = \boldsymbol{\beta}_1 z_{0i} + \boldsymbol{X}_{0i} \boldsymbol{\beta}_2 + \boldsymbol{\epsilon}_{0i}, i = 1, 2, \ldots, 981,$| where |$\boldsymbol{y}_{0i} = (y_{0i1}, y_{0i2}, y_{0i3})'$| is a vector of responses corresponding to the |$J=3$| endpoints, |$\boldsymbol{\beta}_1 = (\beta_{11}, \beta_{12}, \beta_{13})'$| is the vector of treatment effects, |$\boldsymbol{X}_{0i} = \mathop{\mathrm{blkdiag}}\nolimits\{ \boldsymbol{x}_{0i1}', \boldsymbol{x}_{0i2}', \boldsymbol{x}_{0i3}' \}$| is a |$3 \times p$| block diagonal matrix of covariates for endpoints 1, 2, and 3, |$\boldsymbol{\beta}_2 = (\boldsymbol{\beta}_{21}', \boldsymbol{\beta}_{22}', \boldsymbol{\beta}_{23}')'$| is a |$p$|-dimensional vector of effects for the covariates of each endpoint, and |$\boldsymbol{\epsilon}_{0i} \sim N_3(\boldsymbol{0}, \boldsymbol{\Sigma})$|, where |$\boldsymbol{\Sigma}$| is constructed to be positive definite. All three endpoints shared the same seven covariates and each had an intercept term (|$p = 24$|), and we thus utilize the matrix |$t$| posterior of the regression coefficients described in Section 3. The vector of treatment effects was assumed to be |$\boldsymbol{\beta}_1 = (0.0333, 0.1667, 0.5980)'$|. The standard deviations for endpoints 1, 2, and 3 were, respectively, 0.193, 0.748, and 7.422. To reduce computation time in the large-scale simulation studies, we reduced the variance of each outcome by a factor of 2, lowering the sample size required to have a high probability of clinical success in the primary endpoint. Several different correlation matrices were considered. Let |$\boldsymbol{\rho} = (\rho_{12}, \rho_{13}, \rho_{23})'$|, where each |$|\rho_{jk}| < 1$| and |$\rho_{jk} = \mathop{\mathrm{Corr}}\nolimits(y_{0ij}, y_{0ik})$| for |$i = 1, \ldots, n_0$| and |$1 \le j < k \le 3$|. The pairwise correlations considered were |$\boldsymbol{\rho}_{\text{HN}} = (-0.3, -0.4, -0.7)'$|, |$\boldsymbol{\rho}_{\text{LN}} = (-0.05, -0.1, -0.2)'$|, |$\boldsymbol{\rho}_{\text{ind}} = (0, 0, 0)'$|, |$\boldsymbol{\rho}_{\text{LP}} = (0.05, 0.1, 0.2)'$|, and |$\boldsymbol{\rho}_{\text{HP}} = (0.3, 0.4, 0.7)'$|, which correspond to high negative, low negative, independent, low positive, and high positive correlations, respectively.
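For concreteness, one way to assemble a simulation covariance matrix |$\boldsymbol{\Sigma}$| from the (halved-variance) standard deviations and a pairwise correlation vector is sketched below, using the values above:

```python
import numpy as np

# SDs for endpoints 1-3, with variances halved as described in the text
sd = np.array([0.193, 0.748, 7.422]) / np.sqrt(2.0)

def make_sigma(rho, sd):
    """Build Sigma from pairwise correlations rho = (rho12, rho13, rho23)."""
    R = np.array([[1.0,    rho[0], rho[1]],
                  [rho[0], 1.0,    rho[2]],
                  [rho[1], rho[2], 1.0]])
    return np.outer(sd, sd) * R

Sigma_HP = make_sigma([0.3, 0.4, 0.7], sd)   # high positive correlation
```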
We focus on type I error control and BCEP for unions of hypotheses. We utilize a type I error rate of 0.05. In order to explore the performance of the proposed method, we consider all two-way unions of the endpoints, that is, we consider the success sets |$\Omega_{j|k} = \{ \boldsymbol{\beta} : \beta_{1j} > 0 \text{ or } \beta_{1k} > 0 \}$| for |$1 \le j < k \le 3$|. In words, success is declared when there is evidence of efficacy for the treatment on at least one of two outcomes. We compare the proposed method outlined in Section 2 against a frequentist multiplicity adjustment (the Holm procedure), where success is declared if the minimum of the two treatment-effect |$p$|-values is less than 0.025.
The validation priors utilized to obtain type I error and BCEP are, respectively, |$ \pi_{\overline{\Omega}}^{(v)}(\boldsymbol{\theta}) \propto \pi_{1}^{(v)}(\boldsymbol{\theta}) 1\left\{ \boldsymbol{\theta} \in \overline{\Omega} \right\} $| and |$ \pi_{\Omega}^{(v)}(\boldsymbol{\theta}) \propto \pi_{1}^{(v)}(\boldsymbol{\theta}) 1\left\{ \boldsymbol{\theta} \in \Omega \right\}\!, $| where |$\Omega$| is the set that defines success and |$\overline{\Omega}$| is the boundary of the null hypothesis, where the type I error rate is at its maximum, that is, |$\overline{\Omega}_{j|k} = \{ \boldsymbol{\theta} : \beta_{1j} = 0, \beta_{1k} = 0 \}$|.
The first column of Figure 1 depicts the type I error rate when the set that defines success is |$\Omega_{1|2}$|, that is, the trial is declared successful when there is sufficient evidence that treatment is efficacious for outcome 1 or outcome 2. The type I error rate is controlled for both the proposed Bayesian method and the Holm procedure. The second column of Figure 1 shows BCEP based on the two methods. Across all correlation scenarios, power was higher for the proposed Bayesian method compared to the Holm method. The lower the correlation between the outcomes, the more substantial the power increase over the frequentist approach. This is intuitive since, as the correlation between the two outcomes approaches 1, |$\Omega_{1|2} \to \Omega_{1} \cap \Omega_{2}$|, where |$\Omega_{j} = \{ \boldsymbol{\beta} : \beta_{1j} > 0 \}$|. For the set |$\Omega_{1} \cap \Omega_{2}$|, no multiplicity adjustment is needed, and |$\gamma^* = 0.95$| controls the type I error rate for this set.

Fig. 1. Type I error rate and BCEP of the adjusted and unadjusted fully Bayesian procedures and the hybrid Holm procedure, where |$\Omega_{1|2}$| defines the set that determines success and the nominal type I error rate is 0.05.
The type I error and power results for other unions are presented in Appendix E of the Supplementary material available at Biostatistics online. For both |$\Omega_{1|3}$| and |$\Omega_{2|3}$|, shown in Figures 2 and 3 of the Supplementary material available at Biostatistics online, type I error is controlled for both the proposed Bayesian and frequentist methods, but the power continues to be higher for the proposed Bayesian method across all correlation scenarios. In summary, for all of the two-way unions, the proposed Bayesian approach controlled type I error and had higher power across all outcome pairs and across all correlation scenarios.
5. Data application
In Phase 1 of the COMPASS study, a diverse collection of 40 North Carolina, US hospitals were randomized in a 1:1 allocation scheme to administer the COMPASS intervention or UC to their patients. Hospitals randomized to the intervention arm in Phase 1 were asked to continue providing the intervention in Phase 2 with minimal resources. Hospitals randomized to the UC arm in Phase 1 were allowed to transition to providing the intervention in Phase 2.
The primary outcome was the Stroke Impact Scale (SIS-16), a measure of physical function ranging from 0 to 100. There were several secondary outcomes in the study. The continuous outcomes we utilize are self-reported general health and physical health. Self-reported general health was measured on a semicontinuous 5-point Likert scale. Physical health was measured using the PROMIS Global Health Scale (Hays and others, 2009), which is a continuous measure of overall physical health, mental health, social health, pain, fatigue, and overall perceived quality of life, and was collected only in Phase 2 of the study. As a result, we only consider the Phase 2 data in our analysis. For the purposes of demonstrating our method, we partition the full data between the study’s vanguard site where the intervention was developed and piloted (considered the older historical data |$D_{01}$|) and Phase 2 data from a subset of five high-enrolling intervention arm hospitals that elected to participate in Phase 2 (considered the newer historical data |$D_{02}$|).
We assumed the following model for each outcome: |$y_{ij} = \beta_{0j} + \beta_{1j} z_i + \boldsymbol{x}_{i}' \boldsymbol{\beta}_{2j} + \epsilon_{ij}$|, where |$\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \epsilon_{i2}, \epsilon_{i3})'$| are independent and identically distributed (i.i.d.) |$\mathop{\mathrm{N}}\nolimits_3(\boldsymbol{0}, \boldsymbol{\Sigma}),$| where |$\boldsymbol{\Sigma}$| is a |$3 \times 3$| covariance matrix assumed to be positive definite, |$z_i = 1$| if subject |$i$| received the eCare plan (a key component of the COMPASS intervention) and 0 otherwise, and |$\boldsymbol{x}_{i}$| is a vector of control variables, which include history of stroke, history of transient ischemic attack (TIA), age, race (white or nonwhite), severity of stroke, and a binary variable indicating whether the patient was hospitalized due to a stroke or a TIA. We applied a log transformation to the SIS-16 score to reduce skewness. For the future study, we are interested in three success criteria, namely, |$\Omega_{1} = \{ \boldsymbol{\beta} : \beta_{11} > 0 \}$| (success in the primary endpoint), |$\Omega_{2|3} = \cup_{j=2}^3 \{ \boldsymbol{\beta} : \beta_{1j} > 0 \}$| (success in at least one of the two secondary endpoints), and |$\Omega_{1(2|3)} = \Omega_{1} \cap \Omega_{2|3}$| (success in the primary endpoint and at least one of the two secondary endpoints).
In order to provide intuition for how sensitive the POS is to hyperparameters, we studied the behavior of POS for combinations of |$(b_{01}, b_{02}) \in \{ (0, 1), (1, 1) \}$| (corresponding to full borrowing from one and both of the data sets for the covariate distribution, respectively), |$a_0 \in \{ 0.0, 0.5, 1.0\}$| (corresponding to, in the fitting prior, a uniform improper prior, half borrowing from the historical data, and full borrowing from the historical data), and future sample sizes |$n \in \{ 1000, 1050, 1100, \ldots, 2000 \}$|.
Figure 2 shows the probabilities of success for the sets |$\Omega_{1}$|, |$\Omega_{2|3}$|, and |$\Omega_{1(2|3)}$| defined above. For the primary endpoint without borrowing from the historical data (|$a_0 = 0$|), a future sample size of |$2000$| yields a POS just below 80%. Conversely, under half or full borrowing of the historical data (|$a_0 = 0.5$| and |$a_0 = 1.0$|, respectively), the same sample size yields a POS of 85% and 90%, respectively. Thus, utilizing the historical data increases the POS for the primary endpoint.

Fig. 2. Probability of success curves for the COMPASS data. The quantities |$b_{01}$| and |$b_{02}$| denote the power prior hyperparameters for the older and newer historical data sets, respectively, when fitting the validation prior for the covariates. The hyperparameter |$a_0$| is the power prior parameter for the older historical data set in the fitting prior. The solid lines indicate the hyperparameter |$a_0 = 0$| and the dashed lines indicate |$a_0 = 1$|. Power prior parameters of 0 indicate that the historical data set was unused.
However, utilizing the historical data decreased the POS for |$\Omega_{2|3}$|, the set corresponding to treatment efficacy in at least one of the two secondary outcomes. For example, a sample size of |$2000$| results in a POS of just over 70% for |$\Omega_{2|3}$| under no borrowing of the historical data, while the power prior hyperparameters |$0.5$| and |$1.0$| yield a POS of approximately 68% and 65%, respectively. The reduction in the POS is a result of the high degree of uncertainty surrounding the coefficient corresponding to treatment for the third outcome. Conversely, as compared to no borrowing from the historical data (i.e., when |$a_0 = 0$|), utilizing the historical data marginally raised the POS when success was defined to be treatment efficacy for the primary and at least one of the secondary outcomes, that is, when the set that defines success is |$\Omega_{1(2|3)}$|. This indicates that, using the power prior, the increase in power for the primary outcome outweighs the loss in power for the secondary outcomes.
Figure 2 also indicates that the power prior hyperparameters for the covariates do not have a large effect on the resulting POS. For a fixed level of |$a_0$|, POS is similar between the two covariate power priors. In general, it is advisable to try multiple hyperparameter values for the power priors, or the maximal range of possibilities (e.g., 0 and 1), to understand how sensitive POS is with respect to these hyperparameters. For this application, POS is quite sensitive to the power prior parameter for the fitting prior and less sensitive to that for the validation prior of the covariates. However, the choice of |$a_0$| should not depend on the resulting POS, as the choice of |$a_0$| must be defended to regulators and should reflect the practitioner’s beliefs about how applicable the historical data are to the future study. While the sample sizes required for a POS of 80% are large, we note that POS is primarily a function of three components: the effect size in the historical data, the sample size of the historical data, and the sample size of the future data. The COMPASS data used for our illustration suggest modest effects for the outcomes of interest.
6. Discussion
There are several advantages of our approach compared to previous POS approaches (e.g., Ibrahim and others, 2015b; Chuang-Stein, 2006). First, the SUR model allows clinicians to have different covariates across endpoints. For example, a practitioner may wish to control for the baseline value of the outcome of interest, but not the baseline value of other outcomes. Second, testing a single or a union of hypotheses may be cast in the same framework.
For practitioners, it is worth asking when an analysis consisting of marginal regressions should be conducted as opposed to a joint analysis. Berry and Hochberg (1999) argue that it is more realistic to view parameters as dependent in multiple testing situations. In particular, knowledge of the value of one parameter should influence beliefs about the values of other parameters when the study data are correlated, and they argue it seldom makes sense to analyze the data independently. Moreover, we have established a Bayesian approach that asymptotically guarantees control of the type I error rate regardless of the posterior correlation between the parameters. Although the type I error control is an asymptotic result, we note that the vast majority of Phase III trials with multiple endpoints will have a large sample size, making errors in the approximation minimal. It is important to differentiate our method from a posterior model probability, although the two are related. While Scott and Berger (2010) argue that the only way to control for multiplicity from a Bayesian perspective is via prior model probabilities, our approach does not rely on prior model probabilities, which are typically subjective; nor does it require prior propriety.
The proposed method may also be applied when the historical data sets are small by retaining the |$q$|th highest posterior density (HPD) region for the treatment effects in the validation prior. For details and simulation results, see Appendix D of the Supplementary material available at Biostatistics online. We apply these validation prior adjustments to a simulated historical data set. A comparison of their effects is presented in Figure 1 of the Supplementary material available at Biostatistics online, which suggests that adjusting the validation prior using HPD regions tends to increase the POS when there is high power to detect an effect.
We emphasize that the proposed method is useful regardless of whether a Bayesian or classical frequentist analysis will ultimately be used to analyze the future study. If several outcomes are recorded for each individual in a clinical trial, the outcomes will be correlated. Thus, it would be counterintuitive to generate future data sets by using a marginal model, even if marginal models will ultimately be utilized in the future clinical trial. While the focus of this article has been on instances where all outcomes could be reasonably modeled as multivariate normal, the method can be extended to outcomes of possibly mixed types and/or time-to-event endpoints, and we are actively exploring these areas.
Software and data
A Gibbs sampler for the SUR power prior is available at github.com/ethan-alt/multivarpos. A simulated data set based on the COMPASS study is available at https://figshare.com/articles/dataset/compass_simulated_rds/17155655, with an analysis example provided in Appendix F of the Supplementary material available at Biostatistics online.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Acknowledgments
Conflict of interest. None declared.
Funding
This work was supported in part by the National Institute of Environmental Health Sciences (T32ES007018).
References
FDA. (