Summary

In clinical trials, it is common to have multiple clinical outcomes (e.g., coprimary endpoints or a primary and multiple secondary endpoints). It is often desirable to establish efficacy in at least one of multiple clinical outcomes, which leads to a multiplicity problem. In the frequentist paradigm, the most popular methods to correct for multiplicity are typically conservative. Moreover, despite guidance from regulators, it is difficult to determine the sample size of a future study with multiple clinical outcomes. In this article, we introduce a Bayesian methodology for multiple testing that asymptotically guarantees type I error control. Using a seemingly unrelated regression model, correlations between outcomes are specifically modeled, which enables inference on the joint posterior distribution of the treatment effects. Simulation results suggest that the proposed Bayesian approach is more powerful than the method of Holm (1979), which is commonly utilized in practice as a more powerful alternative to the ubiquitous Bonferroni correction. We further develop multivariate probability of success, a Bayesian method to robustly determine sample size in the presence of multiple outcomes.

1. Introduction

Clinical trials fail when there is a lack of statistical evidence to suggest that treatment is efficacious. One reason for this is that the sample size is too small to establish statistical significance of the treatment effect when, in actuality, treatment is efficacious, that is, the study is underpowered. Sample size determination becomes more difficult in the presence of multiple hypotheses (e.g., coprimary endpoints, multiple primary endpoints, or primary and secondary endpoints), which are common in clinical trials. In particular, sample sizes are typically determined based solely on establishing efficacy in the primary endpoint(s). However, if multiple hypotheses are being considered, it is recommended to determine a sample size that is likely to provide substantial evidence of efficacy across all hypotheses (FDA, 2018). Unfortunately, this is seldom done in practice, in part due to a lack of statistical tools that can be used to robustly determine sample size in the presence of multiple, potentially correlated endpoints.

Due to safety concerns, potential lack of efficacy, cost, and the high failure rate of Phase III clinical trials, it has become common for clinical trial sponsors to make go/no-go decisions on the basis of data from Phase II trials that quantify the likelihood of a successful Phase III clinical trial. Typically, decisions involved in study planning are made on the basis of classical hypothesis testing, controlling type I and type II error rates by defining prespecified criteria for the success or futility of the treatment. Sample sizes are often calculated based on an assumed treatment effect size and its variance, with success being confirmed if a significant |$p$|-value (say, |$p$|-value |$< 0.05$|) is obtained. However, uncertainty in the treatment effect size and its variance is not incorporated in this traditional approach.

As discussed by O’Hagan and others (2005), Chuang-Stein (2006), and Ibrahim and others (2015b), probability of success (POS) can be interpreted as a weighted average of power across possible effect sizes, that is, POS is the expectation of power with respect to some predictive distribution for future data. O’Hagan and others (2005) referred to POS as assurance, and it is also known as Bayesian expected power or simply Bayesian power (Psioda and Ibrahim, 2019).

The literature for computing POS for the no-covariate setting is vast, with most methods using Bayesian techniques to sample future clinical data while using frequentist approaches to determine success for the simulated data sets. Wang and others (2013), studying oncology trials, proposed to use the posterior distribution of the treatment effect to generate data for future trials, computing whether or not the trial had a significant |$p$|-value. Ren and Oakley (2014) developed methods to compute POS for a single time-to-event endpoint for both parametric and nonparametric models. Ciarleglio and others (2016) and Ciarleglio and Arendt (2017) utilized a hybrid no-covariates method to compute POS using Bayesian conditional expected power (BCEP), which corresponds to POS conditioning on a nonnull treatment effect, for superiority trials for normally distributed and binary endpoints, respectively.

In contrast, methods for computing POS in the presence of covariates are sparse. Ibrahim and others (2015b) developed a fully Bayesian approach to compute POS for a single normally distributed endpoint in the presence of covariates. Shieh (2017) showed that, under certain conditions, so-called multiplier methods, where multiples of the sample variance are used to determine sample size, are equivalent to POS for the normal linear model.

Recently, some methods have been developed for computing POS for correlated outcomes or when multiple trials must be conducted (e.g., multiple Phase III trials or Phase II and III trials). Zhang and Zhang (2013) developed a method to compute POS when multiple successful trials are required, noting that two successful Phase III trials are typically needed for approval and that computing POS using common observed data from earlier studies yields correlated posterior treatment effects. Yin (2017) developed a Bayesian approach for making go/no-go decisions in Phase II clinical trials based on Phase III efficacy and POS requirements. Zhou and others (2018) developed closed-form expressions for POS for interim monitoring in clinical trials with longitudinal outcomes. Saint-Hilary and others (2019) developed a method to compute POS using surrogate endpoints, which are common in oncology trials. However, none of these methods consider regression settings.

In this article, we develop a new approach to testing unions of hypotheses for clinical trials in which multiple continuous outcomes may reasonably be modeled by a multivariate normal distribution. The proposed method is applicable to a wide range of clinical trials whose primary and key secondary outcomes are continuous, such as ongoing trials in diabetes (e.g., NCT03739125) and Alzheimer’s Disease (e.g., NCT03443973 and NCT02008357). We show that, under the null hypothesis, the asymptotic distribution of the posterior probability of a union of alternative hypotheses is equal in distribution to the multivariate normal cumulative distribution function (CDF) transform of a multivariate normal random variable, unifying single and multiple hypothesis testing. Using this result, we develop an approach that asymptotically guarantees frequentist type I error control for testing unions of hypotheses. In an extensive simulation study, we compare type I error rates and power to testing unions of hypotheses using the Holm procedure, finding that our proposed method is more powerful across various correlations between outcomes while maintaining type I error control. While there are other methods for multiplicity control in the frequentist paradigm, typically based on fitting marginal models to each outcome separately, they are not recommended by regulators due to being dependent on the order of testing or due to assumptions regarding the correlations between outcomes (FDA, 2018).

We further develop a multivariate extension of POS. The method proposed includes those of Ibrahim and others (2015b) and Chuang-Stein (2006) as special cases. This fully Bayesian approach does not rely on |$p$|-values, hybrid approaches (e.g., Chuang-Stein, 2006), or hierarchical testing procedures (e.g., Holm, 1979).

The rest of the article is organized as follows. In Section 2, we develop the proposed Bayesian approach to ascertain whether treatment is efficacious for at least one of several clinical outcomes with type I error control. In Section 3, we introduce a multivariate version of POS. In Section 4, we present simulation results that compare frequentist type I error and a Bayesian version of power for our proposed approach to the Holm procedure. Section 5 applies the POS method using data from a real clinical trial. In Section 6, we close with some discussion.

2. Joint analysis of multiple clinical outcomes

In clinical trials, it is often of interest to know whether an investigational new drug is efficacious in at least one of several outcomes. For example, trials frequently have one primary and multiple secondary endpoints, and investigators wish to “hit” (i.e., reject the null hypothesis) on at least one of the secondary endpoints. With multiple chances to hit, a multiplicity problem arises. In this section, we introduce a novel Bayesian method for testing hypotheses where one wishes to know if there is substantial evidence of efficacy for at least one of several outcomes. We begin by introducing the Bayesian seemingly unrelated regression (SUR) model, which allows one to jointly model continuous outcomes. The Bayesian SUR model is particularly useful as it enables inference on the joint posterior distribution of the treatment effects. We then introduce the proposed method and provide heuristic and formal arguments of its type I error rate control.

2.1. The SUR model

We now review the SUR model. Let |$z_i$| denote the treatment indicator, such that |$z_i = 1$| if the |$i{\rm th}$| patient receives treatment and |$z_i = 0$| otherwise. Further, let |$\boldsymbol{x}_{2ij}$| denote a |$p_j$|-dimensional covariate vector for the |$i{\rm th}$| individual and |$j{\rm th}$| endpoint, |$i = 1, \ldots, n$| and |$j = 1, \ldots, J$|⁠, which may include an intercept term. Let the |$j{\rm th}$| outcome variable for subject |$i$| be represented by |$y_{ij}$|⁠. We consider a two-arm randomized controlled trial and let |$D = \{ \left( y_{ij}, z_i, \boldsymbol{x}_{2i} \right), i = 1, \ldots, n; j = 1, \ldots, J \}$| represent the study data, where |$\boldsymbol{x}_{2i}$| is a vector consisting of the unique covariates for subject |$i$|⁠. We assume a linear regression model of the form
(2.1) |$\quad y_{ij} = \beta_{1j} z_i + \boldsymbol{x}_{2ij}' \boldsymbol{\beta}_{2j} + \epsilon_{ij}, \qquad i = 1, \ldots, n, \ j = 1, \ldots, J,$|
where |$\beta_{1j}$| is the treatment effect for outcome |$j$|⁠, |$\boldsymbol{\beta}_{2j}$| is a |$p_j$|-dimensional parameter vector corresponding to |$\boldsymbol{x}_{2ij}$|⁠, and |$\epsilon_{ij}$| is a random error term. We assume that |$\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \ldots, \epsilon_{iJ})' \sim \mathop{\mathrm{N}}\nolimits_J(\boldsymbol{0}_J, \boldsymbol{\Sigma})$|⁠, where |$\boldsymbol{0}_J$| is a |$J$|-dimensional vector of zeros and |$\boldsymbol{\Sigma}$| is a covariance matrix assumed to be positive definite. We see from the model (2.1) that each individual outcome |$y_{ij}$| marginally has a linear regression model, but the |$J$| outcomes are correlated with one another through the covariance matrix |$\boldsymbol{\Sigma}$|⁠. The SUR model is particularly useful for clinical trials, where multiple outcomes are obtained from the same experimental units (i.e., the trial participants) and hence are correlated, but where one might wish to control only for the baseline level of the outcome being considered, so that the design matrix for each marginal regression model is different.
Let |$\boldsymbol{y}_j$| and |$\boldsymbol{X}_j$| denote the |$n$|-dimensional vector of response variables for outcome |$j$| and the design matrix associated with the |$j{\rm th}$| regression model, respectively. We write |$\boldsymbol{X}_j = (\boldsymbol{z}, \boldsymbol{X}_{2j})$|, where |$\boldsymbol{z}$| is the |$n$|-dimensional vector of treatment indicators and |$\boldsymbol{X}_{2j}$| is the matrix of covariates for regression model |$j$| (which may include an intercept term). Let |$\boldsymbol{y} = (\boldsymbol{y}_1', \ldots, \boldsymbol{y}_J')'$| denote the |$nJ$|-dimensional vector of response variables and let |$\boldsymbol{X} = \mathop{\mathrm{blkdiag}}\nolimits\{ \boldsymbol{X}_1, \ldots, \boldsymbol{X}_J \}$|. Let |$\boldsymbol{\theta} = (\boldsymbol{\beta}, \boldsymbol{\Sigma})$|, where |$\boldsymbol{\beta} = (\boldsymbol{\beta}_1', \ldots, \boldsymbol{\beta}_J')'$| and |$\boldsymbol{\beta}_j = (\beta_{1j}, \boldsymbol{\beta}_{2j}')'$| is the coefficient vector for outcome |$j$|. We may write the likelihood for the model (2.1) as
(2.2) |$\quad L(\boldsymbol{\beta}, \boldsymbol{\Sigma} \,|\, D) \propto \lvert \boldsymbol{\Sigma} \rvert^{-n/2} \exp\left\{ -\frac{1}{2} (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta})' \left( \boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_n \right) (\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta}) \right\} = \lvert \boldsymbol{\Sigma} \rvert^{-n/2} \exp\left\{ -\frac{1}{2} \mathop{\mathrm{tr}}\left( \boldsymbol{\Sigma}^{-1} \boldsymbol{A} \right) \right\},$|
where |$\otimes$| is the Kronecker product operator and |$\boldsymbol{A}$| is a |$J \times J$| matrix with |${(\boldsymbol{A})}_{kl} = (\boldsymbol{y}_{k} - \boldsymbol{X}_{k} \boldsymbol{\beta}_{k})'(\boldsymbol{y}_{l} - \boldsymbol{X}_{l} \boldsymbol{\beta}_{l})$|. Bayesian analysis of the model is completed by specifying a prior for |$\boldsymbol{\theta}$|, which we take to be |$\pi^{(f)}(\boldsymbol{\theta}) \propto \lvert \boldsymbol{\Sigma} \rvert^{-d_0/2}$| for some |$d_0 \ge 0$|. For example, if |$d_0 = 0$|, then |$\pi^{(f)}$| is an improper uniform prior. If |$d_0 = J+1$|, then |$\pi^{(f)}$| is the Jeffreys prior for the SUR model. The hyperparameter |$d_0$| may be chosen in such a way that the posterior mean of the covariance matrix of |$\boldsymbol{\theta}$| is similar to the point estimate of the covariance matrix in a frequentist analysis. For large samples, the choice of |$d_0$| does not have a large impact on the posterior distribution.
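
To make the structure of the likelihood (2.2) concrete, the following sketch (ours, not part of the original methodology; it assumes Python with NumPy) simulates data from a two-outcome SUR model with possibly different design matrices and evaluates the log-likelihood in its trace form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 200, 2                                              # subjects, outcomes
z = rng.binomial(1, 0.5, size=n)                           # treatment indicator z_i
X1 = np.column_stack([z, np.ones(n), rng.normal(size=n)])  # outcome 1: treatment, intercept, baseline
X2 = np.column_stack([z, np.ones(n), rng.normal(size=n)])  # outcome 2 may use different covariates
beta = [np.array([0.5, 1.0, 0.3]), np.array([0.2, -1.0, 0.8])]
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])                             # error covariance across outcomes

eps = rng.multivariate_normal(np.zeros(J), Sigma, size=n)  # correlated errors epsilon_i
y = [X1 @ beta[0] + eps[:, 0], X2 @ beta[1] + eps[:, 1]]

def sur_loglik(beta_list, Sigma, y_list, X_list):
    """Log-likelihood (2.2) up to a constant: -(n/2) log|Sigma| - tr(Sigma^{-1} A)/2."""
    J, n = len(y_list), len(y_list[0])
    resid = [y_list[j] - X_list[j] @ beta_list[j] for j in range(J)]
    A = np.array([[resid[k] @ resid[l] for l in range(J)] for k in range(J)])
    return -0.5 * n * np.linalg.slogdet(Sigma)[1] - 0.5 * np.trace(np.linalg.solve(Sigma, A))

print(sur_loglik(beta, Sigma, y, [X1, X2]))
```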

2.2. Type I error control for unions

We now introduce the proposed method for testing whether there is substantial evidence of treatment efficacy on at least one of several outcomes. We assume without loss of generality that larger values of the treatment effects are more indicative of efficacy. Let |$\Omega$| denote the set that defines the success criteria for the trial. For example, if the investigators wish to know whether treatment is efficacious in at least one of the outcomes, we may express |$\Omega = \{ \boldsymbol{\theta} : \cup_{j=1}^J (\beta_{1j} > \tau_j) \}$|, where |$\tau_j$| is a “target value,” that is, the minimum clinically important difference. We assume that covariates and treatment assignment are fixed and known. Note that |$\Omega$| is defined irrespective of prior information.

Let |${\pi}^{(v)}_{\Omega_0}(\boldsymbol{\theta})$| denote a proper validation prior (Ibrahim and others, 2015b) with respect to the parameter space under the null hypothesis |$\boldsymbol{\theta} \in \Omega_0$|, where |$\Omega_0 = \{ \boldsymbol{\theta} : \cap_{j=1}^J \{\beta_{1j} = \tau_j\} \}$|. Validation priors are also referred to as sampling priors (Wang and Gelfand, 2002) or design priors (O’Hagan and others, 2005) and are generally specified to be nonzero over some subspace of |$\Omega$|. We assume here that |${\pi}^{(v)}_{\Omega_0}(\boldsymbol{\theta}) \propto \tilde{\pi}^{(v)}(\boldsymbol{\theta}) 1\{ \boldsymbol{\theta} \in \Omega_0\}$|, where |$\tilde{\pi}^{(v)}$| is a proper prior. In order to compute type I error rates, we may generate from the prior predictive distribution of the validation prior, which is given by |$ f(\boldsymbol{y}) = \int f(\boldsymbol{y} | \boldsymbol{\theta}) \pi^{(v)}_{\Omega_0}(\boldsymbol{\theta}) {\rm d}\boldsymbol{\theta}. $| In the Bayesian paradigm, the type I error rate is controlled if |$E[1\{ P(\boldsymbol{\theta} \in \Omega | \boldsymbol{y}) \ge \gamma^* \}] \le 1 - \gamma$|, where |$\gamma^*$| is an evidence threshold, |$1 - \gamma$| is the desired type I error rate, and the expectation is taken with respect to |$f(\boldsymbol{y})$|. It is well known (and we show below) that under the validation prior |$\pi^{(v)}_{\Omega_0}$|, |$U_j \equiv P(\beta_{1j} > \tau_j | D)$| has an asymptotic standard uniform distribution, |$j = 1, \ldots, J$|. Hence, for a single outcome, we set |$\gamma^* = 0.95$| to control the type I error rate at |$0.05$|. However, when |$\Omega$| consists of unions of events, it is not immediately clear how to choose |$\gamma^*$|. For example, let |$\Omega = \cup_{j=1}^2 \{ \beta_{1j} > \tau_j \}$| so that |$P(\Omega | \boldsymbol{y}) = U_1 + U_2 - U_{12}$|, where |$U_{12} = P(\{\beta_{11} > \tau_1\} \cap \{\beta_{12} > \tau_2\} | D)$|. Clearly, |$P(\Omega | \boldsymbol{y})$| will only be asymptotically standard uniform if |$U_{12}$| is equal to |$U_1$| or |$U_2$| with probability 1. In fact, the asymptotic distribution of a union posterior probability may be derived as a multivariate normal CDF transformation of a multivariate normal random variable, which we state formally for a general setting in the following theorem:

 
Theorem 2.1

Let |$\boldsymbol{\nu} = (\boldsymbol{\alpha}', \boldsymbol{\gamma}')'$|, where |$\boldsymbol{\alpha}$| consists of the |$J$| parameters of interest and |$\boldsymbol{\gamma}$| is a vector containing nuisance parameters. Let |$\pi_{A_0}^{(v)}(\boldsymbol{\nu}) \propto \pi(\boldsymbol{\nu}) 1\{ \boldsymbol{\alpha} \in A_0 \}$|, where |$A_0 = \cap_{j=1}^J \{ \alpha_j = \tau_j \}$| and where |$\pi(\boldsymbol{\nu})$| is a proper prior. We may suppose without loss of generality that |$\tau_j = 0$| for |$j = 1, \ldots, J$|. Let the data generation process be given by |$f\left(\boldsymbol{y} \bigg| \pi^{(v)}_{A_0}\right) = \int f(\boldsymbol{y} | \boldsymbol{\nu}) \pi^{(v)}_{A_0}(\boldsymbol{\nu}) {\rm d}\boldsymbol{\nu}$|, that is, |$f(\boldsymbol{y})$| is the prior predictive distribution under |$\pi^{(v)}_{A_0}(\boldsymbol{\nu})$|. Let |$\pi^{(f)}(\boldsymbol{\nu})$| denote a prior on |$\boldsymbol{\nu}$| that results in a proper posterior density. Finally, let |$A = \cup_{j=1}^J \{ \alpha_j > 0 \}$| denote the set that defines clinical success. Then |$ P(\boldsymbol{\alpha} \in A | D, \pi^{(f)}) \overset{\text{d}}{\to} 1 - \Phi_J(\boldsymbol{Z} | \boldsymbol{0}, \boldsymbol{\Sigma}^*), $| where |$\boldsymbol{Z} \sim N_J(\boldsymbol{0}, \boldsymbol{\Sigma}^*)$|, |$\boldsymbol{\Sigma}^*$| is the limiting posterior covariance matrix of |$\boldsymbol{\alpha}$| with respect to the sampling distribution |$f\left(\boldsymbol{y}\bigg|\pi^{(v)}_{A_0}\right)$|, |$N_J(\cdot, \cdot)$| is the |$J$|-dimensional multivariate normal distribution, and |$\Phi_J(\cdot | \boldsymbol{m}, \boldsymbol{C})$| is the |$J$|-dimensional multivariate normal CDF with mean |$\boldsymbol{m}$| and covariance matrix |$\boldsymbol{C}$|.

The proof of Theorem 2.1 is provided in Appendix C of the Supplementary material available at Biostatistics online. In words, Theorem 2.1 says that, under the null hypothesis, the posterior probability of a union of events converges in distribution to the multivariate normal CDF transform of a multivariate normal random variable. Note that if |$J = 1$|⁠, then |$\boldsymbol{Z} = Z$| is a scalar and therefore |$1 - \Phi_1(Z | 0, \sigma^2) \sim U(0, 1)$|⁠, that is, the posterior probability based on an individual outcome is asymptotically standard uniform under the null hypothesis, a well-known result.

Using Theorem 2.1, we may find the value |$\gamma^*$| such that under the validation prior |$\pi^{(v)}$|⁠, |$E\left[ 1\{ P(\Omega | D) \ge \gamma^* \} \right] = 1 - \gamma$| via simulation. In words, |$\gamma^*$| is an adjusted evidence threshold such that the type I error rate is asymptotically controlled for a union hypothesis. Specifically, we may generate |$\{\boldsymbol{Z}^{(m)}, m = 1, \ldots, M \}$| from |$N_J(\boldsymbol{0}, \boldsymbol{\Sigma}^*)$| and compute |$\{U^{(m)}, m = 1, \ldots, M\}$| via |$U^{(m)} = 1 - \Phi_{J}(\boldsymbol{Z}^{(m)} | \boldsymbol{0}, \boldsymbol{\Sigma}^*)$|⁠. We then obtain |$\gamma^*$| as the solution to |$M^{-1} \sum_{m=1}^M 1\{ U^{(m)} > \gamma^* \} = 1 - \gamma$|⁠. In practice, we may replace |$\boldsymbol{\Sigma}^*$| with its consistent estimator |$\hat{\boldsymbol{\Sigma}}^*$|⁠, the posterior covariance matrix of |$\boldsymbol{\alpha}$| approximated via Markov chain Monte Carlo (MCMC).
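
As a concrete illustration of this calibration, the sketch below (ours; the plug-in value of |$\hat{\boldsymbol{\Sigma}}^*$| is hypothetical and would in practice be estimated from MCMC output) draws |$\boldsymbol{Z}^{(m)}$|, transforms each draw through the multivariate normal CDF, and takes |$\gamma^*$| as the |$\gamma$|-quantile of the resulting |$U^{(m)}$|.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
Sigma_star = np.array([[1.0, 0.3],
                       [0.3, 1.0]])   # plug-in estimate of the limiting posterior covariance (J = 2)
gamma = 0.95                          # 1 - gamma = 0.05 is the desired type I error rate
M = 10_000

Z = rng.multivariate_normal(np.zeros(2), Sigma_star, size=M)
# U^(m) = 1 - Phi_J(Z^(m) | 0, Sigma*): null distribution of the union posterior probability
U = 1.0 - multivariate_normal.cdf(Z, mean=np.zeros(2), cov=Sigma_star)
# gamma* solves M^{-1} sum_m 1{U^(m) > gamma*} = 1 - gamma, i.e., the gamma-quantile of U
gamma_star = np.quantile(U, gamma)
print(gamma_star)                     # exceeds 0.95, the single-outcome threshold
```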

This method enables practitioners to discover whether treatment is efficacious for at least one outcome with type I error control, but it does not tell the practitioner which outcome. As the individual posterior probabilities, |$\{P(\beta_{1j} > \tau_j | D), j = 1, \ldots, J\}$|, factor into the computation of |$P(\Omega | D)$|, one way to discover the main contributors to the total evidence represented by the posterior probability for the union is to compare the marginal posterior probabilities for each treatment effect in the union. Larger values of the marginal posterior probabilities of the treatment effects may be interpreted as comparatively more evidence. In Appendix F of the Supplementary material available at Biostatistics online, we illustrate via an example how one may ascertain which of the treatment effects has the greatest evidence of efficacy given substantial evidence for the overall union. In particular, if the posterior probability of the union exceeds its evidence threshold, then it is reasonable to conclude that treatment is efficacious for the outcomes whose associated individual posterior probabilities are large.

3. Bayesian POS for multivariate linear regression models

In this section, we introduce the concept of multivariate POS. Specifically, we extend the methods of Ibrahim and others (2015b) to accommodate multiple outcomes using the SUR model. We then discuss computational development.

3.1. The general methodology

Suppose we wish to determine the sample size of a future study. Let |$\Omega$| denote the success criterion of the future study. For example, if we are only interested in a single primary endpoint, we may specify |$\Omega = \{ \boldsymbol{\theta} : \beta_{11} > \tau_1 \}$| for some target value |$\tau_1$|. In situations with coprimary endpoints, we may specify |$\Omega = \{ \boldsymbol{\theta} : \cap_{j=1}^K \{ \beta_{1j} > \tau_j \} \}$|, that is, a trial is declared successful when there is substantial evidence that the drug is efficacious in all |$K$| clinical outcomes. A third important specification of |$\Omega$| arises when it is of interest to determine whether a treatment improves at least one of several outcomes, which we may write as |$\Omega = \{ \boldsymbol{\theta} : \cup_{j=1}^K \{ \beta_{1j} > \tau_j \} \}$|.

Let |$\pi^{(v)}(\boldsymbol{\theta}, \boldsymbol{\alpha})$| denote a validation prior, where |$\boldsymbol{\alpha}$| is a vector of parameters for the distribution of covariates |$f(\boldsymbol{x} | \boldsymbol{\alpha})$|. For the purposes of computing POS, the validation prior |$\pi^{(v)}(\boldsymbol{\theta}, \boldsymbol{\alpha})$| is used to generate data |$D = \{ y_{ij}, z_i, \boldsymbol{x}_{ij}, i = 1, \ldots, n, j = 1, \ldots, J \}$| based on the study designer’s a priori beliefs regarding the parameters of the data generation process for the future trial. Let |$f(z)$| denote the density for the randomization scheme. For example, in a two-arm trial with 50/50 probabilistic allocation, |$f(z) = 0.5$| for |$z = 0, 1$|. The density for the future trial data is obtained from the prior predictive distribution of the validation prior and the density for the randomization scheme
(3.3) |$\quad f(\boldsymbol{y}, \boldsymbol{x}, z) = \int \int f(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{x}, z) \, f(\boldsymbol{x} | \boldsymbol{\alpha}) \, f(z) \, \pi^{(v)}(\boldsymbol{\theta}, \boldsymbol{\alpha}) \, {\rm d}\boldsymbol{\theta} \, {\rm d}\boldsymbol{\alpha},$|
where |$f(\boldsymbol{x} | \boldsymbol{\alpha})$| is the joint density for the covariates given parameters |$\boldsymbol{\alpha}$| and |$f(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{x}, z)$| is the SUR likelihood function (2.2).
Let |$\pi^{(f)}(\boldsymbol{\theta})$| denote the “fitting prior,” also referred to as the “analysis prior.” The fitting prior is the prior that will be utilized in the analysis of the future data. The fitting prior may be noninformative or informative (e.g., a power prior (Ibrahim and others, 2015a) in the presence of historical data). Suppose that we wish to determine the sample size for a trial that satisfies |$K$| conditions, say, |$\{\Omega_k\}_{k=1}^K$|⁠. In many cases, |$K = 1$|⁠, such as when testing a single endpoint or a single union of endpoints. Often, however, we wish to test whether the treatment is efficacious for a primary outcome and at least one of a collection of secondary outcomes. In this case, |$K = 2$|⁠, |$\Omega_1 = \{ \boldsymbol{\beta} : \beta_{11} > 0 \}$|⁠, and |$\Omega_2 = \{ \boldsymbol{\beta} : \cup_{j=2}^J \{\beta_{1j} > 0 \} \}$|⁠. We declare the trial successful if each of the posterior probabilities corresponding to |$\Omega_k$| is sufficiently large, that is, if |$S(\boldsymbol{y}, \boldsymbol{x}, z | \pi^{(f)}) \equiv \prod_{k=1}^K 1\{ P(\Omega_k | \boldsymbol{y}, \boldsymbol{x}, \boldsymbol{z}, \pi^{(f)}) \ge \gamma^*_k \} = 1$|⁠, where |$\gamma^*_k$| is an evidence threshold associated with |$\Omega_k$|⁠. Mathematically, we define the POS as
(3.4) |$\quad \mathop{\mathrm{POS}}\nolimits = E\left[ S(\boldsymbol{y}, \boldsymbol{x}, z | \pi^{(f)}) \right] = \int S(\boldsymbol{y}, \boldsymbol{x}, z | \pi^{(f)}) \, f(\boldsymbol{y}, \boldsymbol{x}, z) \, {\rm d}(\boldsymbol{y}, \boldsymbol{x}, z).$|

In words, (3.4) is the marginal probability that a future trial satisfies the success criteria with respect to the prior predictive distribution for the future data based on the validation prior.
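
To illustrate how |$S(\cdot)$| is evaluated in practice, the fragment below (ours; the posterior draws are simulated stand-ins for MCMC output, and the thresholds are illustrative) computes the two posterior probabilities for the |$K = 2$| example above with |$J = 3$| outcomes and forms the success indicator.

```python
import numpy as np

# M x J matrix of posterior draws of (beta_11, beta_12, beta_13); stand-in for MCMC output
rng = np.random.default_rng(2)
draws = rng.multivariate_normal([0.10, 0.05, 0.20], 0.01 * np.eye(3), size=10_000)

p_primary = np.mean(draws[:, 0] > 0)                      # P(Omega_1 | y, x, z)
p_union = np.mean((draws[:, 1] > 0) | (draws[:, 2] > 0))  # P(Omega_2 | y, x, z)

gamma_1, gamma_2 = 0.95, 0.972   # gamma_2 calibrated as in Section 2 (value illustrative)
S = int(p_primary >= gamma_1 and p_union >= gamma_2)      # trial success indicator
print(p_primary, p_union, S)
```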

3.2. Using historical data

We now discuss POS when we are in possession of one or two historical data sets. For example, when planning an initial Phase III study, one is typically in possession of a single historical data set (e.g., from a Phase II study). In general, we may possess more than two historical data sets, but the methodology developed in this section is easily extended to handle such a case.

Let the historical data sets be represented by |$D_{0k} = \{(y_{0kij}, \boldsymbol{x}_{0ki}, z_{0ki}), i = 1, \ldots, n_{0k}; j = 1, \ldots, J; k = 1, 2\}$|⁠, where |$z_{0ki} = 1$| if the |$i{\rm th}$| subject in study |$k$| received the experimental treatment and 0 otherwise, |$\boldsymbol{x}_{0ki}$| represents the vector of unique covariates for subject |$i$| in study |$k$|⁠, |$y_{0kij}$| is the response variable for individual |$i$| and outcome |$j$| in study |$k$|⁠, |$k = 1,2$|⁠, |$j = 1, \ldots, J$|⁠, |$i = 1, \ldots, n_{0k}$|⁠. We assume without loss of generality that the data sets are ordered chronologically, so that |$D_{01}$| is the older study and |$D_{02}$| is the more recently completed study.

We assume |$\boldsymbol{y}_{0k} = \boldsymbol{X}_{0k} \boldsymbol{\beta} + \boldsymbol{\epsilon}_{0k}$|, where |$\boldsymbol{y}_{0k} = (\boldsymbol{y}_{0k1}', \ldots, \boldsymbol{y}_{0kJ}')'$| is an |$n_{0k}J$|-dimensional vector of all responses in study |$k$|, |$\boldsymbol{X}_{0k} = \mathop{\mathrm{blkdiag}}\nolimits\{ \boldsymbol{X}_{0k1}, \ldots, \boldsymbol{X}_{0kJ} \}$| is a block diagonal matrix of design matrices for each outcome in study |$k$| with |$\boldsymbol{X}_{0kj} = (\boldsymbol{z}_{0k}, \boldsymbol{X}_{0kj2})$|, and |$\boldsymbol{\epsilon}_{0k} = (\boldsymbol{\epsilon}_{0k1}', \ldots, \boldsymbol{\epsilon}_{0kJ}')'$| is an |$n_{0k}J$|-dimensional error vector with |$\boldsymbol{\epsilon}_{0k} \sim \mathop{\mathrm{N}}\nolimits_{n_{0k}J}(\boldsymbol{0}_{n_{0k}J}, \boldsymbol{\Sigma} \otimes \boldsymbol{I}_{n_{0k}})$|.

We assume that the parameters for the response variables and covariates are independent a priori, that is, |$\pi^{(v)}(\boldsymbol{\theta}, \boldsymbol{\alpha}) = \pi^{(v)}_1(\boldsymbol{\theta}) \pi^{(v)}_2(\boldsymbol{\alpha})$|⁠. The validation prior for |$\boldsymbol{\theta}$| may be specified as the posterior density of the historical data, that is,
(3.5) |$\quad \pi_1^{(v)}(\boldsymbol{\theta}) = \pi(\boldsymbol{\theta} | D_{02}) \propto f(\boldsymbol{y}_{02} | \boldsymbol{\theta}, \boldsymbol{X}_{02}) \, \pi_{10}^{(v)}(\boldsymbol{\theta}),$|
where the initial validation prior may be specified as |$\pi_{10}^{(v)}(\boldsymbol{\theta}) \propto \lvert \boldsymbol{\Sigma} \rvert^{-d_0/2}$| for some |$d_0 \ge 0$|⁠. Adopting the convention of Ibrahim and others (2015b), the validation prior for |$\boldsymbol{\theta}$| utilizes the most recently completed study, which will likely have a larger sample size and be more similar to the future study than the older study.
We elicit the validation prior for |$\boldsymbol{\alpha}$| via a power prior (Ibrahim and Chen, 2000). Specifically, we elicit |$\pi_2^{(v)}(\boldsymbol{\alpha})$| as
(3.6) |$\quad \pi_2^{(v)}(\boldsymbol{\alpha} | D_0, \boldsymbol{b}_0) \propto \prod_{k=1}^{2} \left[ \prod_{i=1}^{n_{0k}} f(\boldsymbol{x}_{0ki} | \boldsymbol{\alpha}) \right]^{b_{0k}} \pi_{20}^{(v)}(\boldsymbol{\alpha}),$|
where |$0 \le b_{0k} \le 1$| and |$\pi_{20}^{(v)}(\boldsymbol{\alpha})$| is an initial validation prior, which we may take as |$\pi_{20}^{(v)}(\boldsymbol{\alpha}) \propto 1$| for many distributions (e.g., normal, Poisson, and binomial). For some exponential family distributions with dispersion parameters, such as the gamma distribution, a proper prior on the dispersion parameter needs to be specified in order to guarantee propriety.

The quantities |$b_{0k}, k = 1, 2$|, are hyperparameters that downweight the likelihood of the historical data. Typically, we choose values of |$b_{0k}$| closer to 0 when the characteristics of the study participants in the planned future study are expected to differ from those in the completed trial and values closer to 1 when the characteristics are similar. Note that if we are in possession of only one historical data set, we may set |$b_{01} = 0$| in (3.6).

Since the covariates are dependent, it may be difficult to specify |$f(\boldsymbol{x}_{0ki} | \boldsymbol{\alpha})$|. One way to do this is to utilize a factorization of the joint distribution informed by temporal relationships between the variables. For example, suppose we have three covariates |$(x_{01}, x_{02}, x_{03})'$|, where |$x_{01}$| is a binary variable indicating gender, |$x_{02}$| is weight measured in kilograms, and |$x_{03}$| counts the number of tumors at baseline. It is impossible for weight or number of tumors to cause gender, so we might first generate |$x_{01}$|. Gender likely has an effect on both weight and number of tumors, but there may not be an obvious way to order |$x_{02}$| and |$x_{03}$|. The two permutations may be compared and a choice selected using model selection criteria (e.g., the deviance information criterion). Suppose that such an analysis indicates that |$x_{02}$| should be generated after |$x_{01}$| and before |$x_{03}$|. We can therefore generate from the joint distribution via |$f(x_{01}, x_{02}, x_{03}) = f(x_{01}) f(x_{02} | x_{01}) f(x_{03} | x_{01}, x_{02})$|, as sketched below.
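
A minimal sketch of this generation scheme for the three-covariate example (ours; the distributional forms and coefficient values are illustrative, and in the POS algorithm they would be governed by draws from the validation prior |$\pi_2^{(v)}$|):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

x1 = rng.binomial(1, 0.45, size=n)                   # f(x_01): gender, generated first
x2 = rng.normal(70.0 + 8.0 * x1, 12.0)               # f(x_02 | x_01): weight (kg), normal linear model
rate = np.exp(0.2 + 0.1 * x1 + 0.005 * (x2 - 70.0))
x3 = rng.poisson(rate)                               # f(x_03 | x_01, x_02): tumor count, Poisson log-linear
X = np.column_stack([x1, x2, x3])                    # covariates for one simulated future trial
```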

In general, suppose there are |$L$| unique covariates |$\{x_{0i1}, \ldots, x_{0iL}, i = 1, \ldots, n_0 \}$|. Without loss of generality, suppose that they are ordered so that we generate samples of |$x_{01}$| first, |$x_{02}$| second, and so on, lastly generating samples of |$x_{0L}$|. Let |$\boldsymbol{\alpha}_l$| denote the parameter vector for |$f(x_{0l} | x_{01}, \ldots, x_{0,l-1} )$|. We may rewrite the validation prior (3.6) as |$ \pi_2^{(v)}(\boldsymbol{\alpha} | D_0, \boldsymbol{b}_0) \propto \prod_{l=1}^L \prod_{k=1}^2 \left[ \prod_{i = 1}^{n_{0k}} f(x_{0il} | x_{0i1}, \ldots, x_{0i,l-1}, \boldsymbol{\alpha}_l) \right]^{b_{0k}} \pi_{20}^{(v)}(\boldsymbol{\alpha}). $| We assume each covariate is generated by a generalized linear model. For the purposes of our simulations and data analysis, which contain only binary and continuous data, we assume that the continuous data are normally distributed, and we take |$\pi_{20}^{(v)}(\boldsymbol{\alpha}) \propto 1$|. We note that multiple continuous covariates may alternatively be generated via a multivariate normal distribution. In general, more complicated covariate distributions may be utilized, such as the beta-binomial distribution for binary variables or the negative binomial distribution for count data.

Once the validation prior and treatment allocation distribution are specified, future trial data |$D = \{ (\boldsymbol{y}_i, \boldsymbol{x}_i, z_i), i = 1, \ldots, n \}$| may be generated via the prior predictive distribution (3.3). The Bayesian analysis is conducted by eliciting a fitting prior for the future trial data, which we may specify as a power prior based on the older historical data set, that is,
(3.7) |$\quad \pi^{(f)}(\boldsymbol{\theta} | D_{01}, a_0) \propto \left[ f(\boldsymbol{y}_{01} | \boldsymbol{\theta}, \boldsymbol{X}_{01}) \right]^{a_0} \pi_0^{(f)}(\boldsymbol{\theta}),$|
where |$0 \le a_0 \le 1$| and |$\pi_0^{(f)}(\boldsymbol{\theta})$| is an initial fitting prior, which we may specify as |$\pi_0^{(f)}(\boldsymbol{\theta}) \propto \lvert \boldsymbol{\Sigma} \rvert^{-d_0/2}$| for some |$d_0 \ge 0$|⁠. If we are in possession of only a single historical data set, we may set |$a_0 = 0$| so that the fitting prior is just the initial prior. Using the fitting prior (3.7) and the future trial data |$D$|⁠, the posterior distribution for |$\boldsymbol{\theta}$| is
(3.8) |$\quad \pi(\boldsymbol{\theta} | D, D_{01}, a_0) \propto f(\boldsymbol{y} | \boldsymbol{\theta}, \boldsymbol{X}) \, \pi^{(f)}(\boldsymbol{\theta} | D_{01}, a_0).$|

The marginal posterior distribution of the regression coefficients based on (3.8) is not available in closed form except in special cases. However, it can be shown that the full conditional densities of |$\boldsymbol{\beta}$| and |$\boldsymbol{\Sigma}$| are normal and inverse-Wishart, respectively, so that an efficient Gibbs sampling algorithm may be utilized. We state this formally in the following theorem.

 
Theorem 3.1

Let |$D = \{ (\boldsymbol{y}_i, \boldsymbol{x}_i, z_i), i = 1, \ldots, n \}$| denote the observed data with likelihood from the SUR model (2.2). Let the fitting prior be given by (3.7), where |$\pi_0^{(f)}(\boldsymbol{\theta}) \propto \lvert \boldsymbol{\Sigma} \rvert^{-d_0/2}$|. Finally, define |$\tilde{\boldsymbol{\beta}} = (\boldsymbol{I}_p - \boldsymbol{\Lambda}) \hat{\boldsymbol{\beta}} + \boldsymbol{\Lambda} \hat{\boldsymbol{\beta}}_0$|, where |$\hat{\boldsymbol{\beta}} = [\boldsymbol{X}'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_n) \boldsymbol{X}]^{-1} \boldsymbol{X}'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_n) \boldsymbol{y}$| and |$\hat{\boldsymbol{\beta}}_0$| is analogously defined for the historical data, |$\tilde{\boldsymbol{\Omega}} = \boldsymbol{X}'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_n) \boldsymbol{X} + a_0 \boldsymbol{X}_0' (\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_{n_0}) \boldsymbol{X}_0$|, |$\boldsymbol{\Lambda} = \tilde{\boldsymbol{\Omega}}^{-1} a_0 \boldsymbol{X}_0'(\boldsymbol{\Sigma}^{-1} \otimes \boldsymbol{I}_{n_0}) \boldsymbol{X}_0$|, |$\tilde{\boldsymbol{V}} = \boldsymbol{A}(\boldsymbol{\beta}) + a_0 \boldsymbol{A}_0(\boldsymbol{\beta})$|, and |$\tilde{\nu} = n + a_0 n_0 + d_0 - J - 1$|, where |$\boldsymbol{A}(\boldsymbol{\beta})$| and |$\boldsymbol{A}_0(\boldsymbol{\beta})$| are the matrices of dot product residuals described in Section 2 for the current and historical data, respectively. Then |$ \boldsymbol{\beta} | D, \boldsymbol{\Sigma} \sim N_p(\tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{\Omega}}^{-1}) $| and |$ \boldsymbol{\Sigma} | D, \boldsymbol{\beta} \sim IW_J(\tilde{\nu}, \tilde{\boldsymbol{V}} ), $| where |$IW_J$| denotes the |$J$|-dimensional inverse-Wishart distribution.

The proof of Theorem 3.1 is provided in Appendix A of the Supplementary material available at Biostatistics online. It is worth mentioning that sampling from the posterior (3.8) is computationally expensive because, given |$\boldsymbol{\Sigma}$|⁠, the posterior mean of |$\boldsymbol{\beta}$| depends on the generalized least squares estimator |$\hat{\boldsymbol{\beta}}$|⁠, which involves computing large, sparse matrix products. An efficient algorithm for computing these products is provided in Appendix A of the Supplementary material available at Biostatistics online.
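
A minimal sketch of one Gibbs scan implied by Theorem 3.1 (ours, not the original software; it assumes NumPy/SciPy, takes |$a_0 = 0$| so there is no historical borrowing, and relies on SciPy's inverse-Wishart parameterization matching the theorem):

```python
import numpy as np
from scipy.stats import invwishart

def gibbs_step(Sigma, y_list, X_list, d0, rng):
    """One scan of the Gibbs sampler from Theorem 3.1 with a0 = 0 (no historical borrowing)."""
    J, n = len(y_list), len(y_list[0])
    Sinv = np.linalg.inv(Sigma)
    p_j = [X.shape[1] for X in X_list]
    starts = np.cumsum([0] + p_j)          # block boundaries of beta = (beta_1', ..., beta_J')'
    p = starts[-1]

    # beta | Sigma, D ~ N_p(beta_hat, Omega^{-1}), where Omega = X'(Sigma^{-1} x I_n)X
    Omega, rhs = np.zeros((p, p)), np.zeros(p)
    for k in range(J):
        for l in range(J):
            Omega[starts[k]:starts[k+1], starts[l]:starts[l+1]] = Sinv[k, l] * (X_list[k].T @ X_list[l])
        rhs[starts[k]:starts[k+1]] = sum(Sinv[k, l] * (X_list[k].T @ y_list[l]) for l in range(J))
    beta_hat = np.linalg.solve(Omega, rhs)  # generalized least squares estimator
    beta = rng.multivariate_normal(beta_hat, np.linalg.inv(Omega))

    # Sigma | beta, D ~ IW_J(n + d0 - J - 1, A(beta)), A the dot product residual matrix
    resid = [y_list[k] - X_list[k] @ beta[starts[k]:starts[k+1]] for k in range(J)]
    A = np.array([[resid[k] @ resid[l] for l in range(J)] for k in range(J)])
    Sigma = invwishart.rvs(df=n + d0 - J - 1, scale=A, random_state=rng)
    return beta, Sigma
```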

Under the special case that all the design matrices are the same, it can be shown that the posterior distribution of the regression coefficients is a matrix |$t$| distribution. In particular, it may be shown that the posterior distribution of the treatment effects is a multivariate |$t$| distribution, and there are algorithms available for efficiently and accurately computing probabilities from the multivariate |$t$| distribution. Details are presented in Appendix B of the Supplementary material available at Biostatistics online. For our data analysis and simulations, the design matrices are the same, and we utilize the multivariate |$t$| result for posterior inference.
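
For example, under a common design matrix, a union posterior probability can be evaluated directly from the multivariate |$t$| posterior of the treatment effects. A sketch (ours, with hypothetical posterior quantities; `multivariate_t.cdf` requires SciPy 1.10 or later):

```python
import numpy as np
from scipy.stats import multivariate_t

# Hypothetical posterior quantities for J = 3 treatment effects
loc = np.array([0.03, 0.17, 0.60])       # posterior location of (beta_11, beta_12, beta_13)
shape = np.diag([0.02, 0.08, 0.50])      # posterior scale matrix
df = 950                                 # posterior degrees of freedom

# P(union_j {beta_1j > 0} | D) = 1 - P(all beta_1j <= 0 | D)
p_union = 1.0 - multivariate_t.cdf(np.zeros(3), loc=loc, shape=shape, df=df)
print(p_union)
```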

In our development, the validation prior is discounted only for the covariates and not for the response variables. The reason for this is that the validation prior for |$\boldsymbol{\theta}$| represents beliefs about the effects to be studied; it has no impact on the analysis of a future data set. If the validation prior for |$\boldsymbol{\theta}$| is discounted, it becomes more likely that unrealistic values of |$\boldsymbol{\theta}$| are generated, which may provide misleading estimates of POS. We allow discounting for the covariate parameters |$\boldsymbol{\alpha}$| in the validation prior because the covariates and characteristics of the trial participants in a future trial could differ from those in the historical data. On the other hand, the fitting prior for |$\boldsymbol{\theta}$| is discounted because the fitting prior is the prior utilized in the analysis of the future data, and it will generally be the case that one would want the future data to have more weight than the historical data. The weight parameter |$a_0$| in the fitting prior (3.7) can be treated as either fixed or random. For computational reasons, we treat |$a_0$| as fixed in this article. Ibrahim and others (2015a) give a thorough review of the power prior and a discussion of treating |$a_0$| as fixed and random.

3.3. POS algorithm

The algorithm to compute POS is as follows:

  • Step 1:

    Specify |$n$|⁠, |$\{\Omega_k\}_{k=1}^K$| (the trial success criteria), |$q$| (the proportion of subjects randomized to the test drug group), |$M$| (the number of samples to draw), and |$B$| (the number of future data sets).

  • Step 2:

    Generate |$\boldsymbol{\theta}$| and |$\boldsymbol{\alpha}$| from the validation prior |$\pi^{(v)}(\boldsymbol{\theta}, \boldsymbol{\alpha})$|⁠.

  • Step 3:

    Using the value of |$\boldsymbol{\alpha}$| from Step 2, generate |$\boldsymbol{x}_i$| from |$f(\boldsymbol{x}_i | \boldsymbol{\alpha})$| and randomly assign the |$i$|th subject to one of the two treatment arms by generating |$z_i \sim \text{Bernoulli}(q)$| independently for |$i = 1, \ldots, n$|, and set |$\boldsymbol{X}$| to be the resulting |$nJ \times (p + J)$| design matrix.

  • Step 4:

    Using the value of |$\boldsymbol{\theta}$| from Step 2 and the value of |$\boldsymbol{X}$| from Step 3, generate a future response vector |$\boldsymbol{y}$| from |$\mathop{\mathrm{N}}\nolimits_{nJ}(\boldsymbol{X\beta}, \boldsymbol{\Sigma} \otimes \boldsymbol{I}_n)$| to obtain |$D = (\boldsymbol{y}, \boldsymbol{X})$|.

  • Step 5:

    Using |$D$| from Step 4, sample from the posterior distribution using the fitting prior to generate |$\left\{ \boldsymbol{\theta}^{(m)}, m = 1, \ldots, M \right\}$|⁠.

  • Step 6:

    Compute |$I^{(b)} = \prod_{k=1}^K 1\{ P(\Omega_k | D) \ge \gamma^*_k \}$| using |$\boldsymbol{\theta}^{(m)}$|⁠, |$m = 1, \ldots, M$|⁠, from Step 5, where |$\gamma^*_k$| is computed as described in Section 2.

  • Step 7:

    Repeat Step 2 to Step 6 |$B$| times to obtain |$I^{(1)}, \ldots, I^{(B)}$|⁠. The POS is given by |$\mathop{\mathrm{POS}}\nolimits = \frac{1}{B} \sum_{b=1}^B I^{(b)}$|⁠.

Steps 5 and 6 reduce to multivariate |$t$| CDF calculations when the design matrices for all outcomes are identical (see Section 3.2).
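
Putting the steps together, the following self-contained sketch (ours) estimates POS for a union criterion in a deliberately simplified setting: a point-mass validation prior (so Step 2 reduces to fixing |$\boldsymbol{\theta}$|), a common intercept-plus-treatment design for |$J = 2$| outcomes with no additional covariates, |$a_0 = 0$|, |$d_0 = J + 1$|, and an illustrative evidence threshold. Replacing the fixed parameter values with draws from a genuine validation prior yields POS proper.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(4)
n, J, B, M, burn, q = 400, 2, 100, 400, 100, 0.5
beta_true = np.array([0.30, 1.0, 0.25, 0.5])   # (treatment, intercept) for outcome 1, then outcome 2
Sigma_true = np.array([[1.0, 0.3],
                       [0.3, 1.0]])
gamma_star = 0.972                             # union threshold calibrated as in Section 2 (illustrative)

successes = 0
for b in range(B):
    # Steps 2-4: generate one future trial (point-mass validation prior, no extra covariates)
    z = rng.binomial(1, q, size=n)
    X = np.column_stack([z, np.ones(n)])       # common design matrix for both outcomes
    eps = rng.multivariate_normal(np.zeros(J), Sigma_true, size=n)
    Y = np.column_stack([X @ beta_true[:2], X @ beta_true[2:]]) + eps

    # Step 5: Gibbs sampler from Theorem 3.1 with a0 = 0 and d0 = J + 1
    Sigma = np.eye(J)
    XtX, XtY = X.T @ X, X.T @ Y
    draws = np.empty((M, J))                   # retained treatment effects (beta_11, beta_12)
    for m in range(M + burn):
        Sinv = np.linalg.inv(Sigma)
        Omega = np.kron(Sinv, XtX)             # X'(Sigma^{-1} x I_n)X for a common design
        beta_hat = np.linalg.solve(Omega, (XtY @ Sinv).T.ravel())
        beta = rng.multivariate_normal(beta_hat, np.linalg.inv(Omega))
        resid = Y - X @ beta.reshape(J, -1).T
        Sigma = invwishart.rvs(df=n, scale=resid.T @ resid, random_state=rng)
        if m >= burn:
            draws[m - burn] = beta[[0, 2]]     # first coefficient of each outcome's block

    # Step 6: union success criterion Omega_{1|2}
    p_union = np.mean((draws[:, 0] > 0) | (draws[:, 1] > 0))
    successes += int(p_union >= gamma_star)

print(successes / B)                           # Step 7: Monte Carlo POS estimate
```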

4. Simulation study

The Comprehensive Post-Acute Stroke Services (COMPASS) study (Duncan and others, 2017) was a two-arm, cluster-randomized pragmatic trial to evaluate whether a novel care model (the COMPASS intervention) improved a variety of patient-centered outcomes at 90 days compared to usual care (UC). We base our simulations on summary statistics from a subset of five high-enrolling intervention arm hospitals from the COMPASS study. Details on the COMPASS study are provided in Section 5. The complete historical data consisted of |$n_0 = 981$| participants and |$J = 3$| endpoints. We fix all of the power prior hyperparameters to 1 and assume the following model to generate the historical data: |$\boldsymbol{y}_{0i} = \boldsymbol{\beta}_1 z_{0i} + \boldsymbol{X}_{0i} \boldsymbol{\beta}_2 + \boldsymbol{\epsilon}_{0i}, i = 1, 2, \ldots, 981,$| where |$\boldsymbol{y}_{0i} = (y_{0i1}, y_{0i2}, y_{0i3})'$| is a vector of responses corresponding to the |$J=3$| endpoints, |$\boldsymbol{\beta}_1 = (\beta_{11}, \beta_{12}, \beta_{13})'$| is the vector of treatment effects, |$\boldsymbol{X}_{0i} = \mathop{\mathrm{blkdiag}}\nolimits\{ \boldsymbol{x}_{0i1}', \boldsymbol{x}_{0i2}', \boldsymbol{x}_{0i3}' \}$| is a |$J \times p$| block diagonal matrix of covariates for endpoints 1, 2, and 3, |$\boldsymbol{\beta}_2 = (\boldsymbol{\beta}_{21}', \boldsymbol{\beta}_{22}', \boldsymbol{\beta}_{23}')'$| is a |$p$|-dimensional vector of effects for the covariates of each endpoint, and |$\boldsymbol{\epsilon}_{0i} \sim N_3(\boldsymbol{0}, \boldsymbol{\Sigma})$|, where |$\boldsymbol{\Sigma}$| is constructed to be positive definite. All three endpoints shared the same seven covariates and each had an intercept term (|$p = 24$|), and we thus utilize the matrix |$t$| posterior of the regression coefficients described in Section 3. The vector of treatment effects was assumed to be |$\boldsymbol{\beta}_1 = (0.0333, 0.1667, 0.5980)'$|. The standard deviations for endpoints 1, 2, and 3 were, respectively, 0.193, 0.748, and 7.422. To reduce computation time in our large-scale simulation studies, we reduced the variance of each outcome by a factor of 2, which lowers the sample size required for a high probability of clinical success in the primary endpoint. Several different correlation matrices were considered. Let |$\boldsymbol{\rho} = (\rho_{12}, \rho_{13}, \rho_{23})'$|, where each |$|\rho_{jk}| < 1$| and |$\rho_{jk} = \mathop{\mathrm{Corr}}\nolimits(y_{0ij}, y_{0ik})$| for |$i = 1, \ldots, n_0$| and |$1 \le j < k \le 3$|. The pairwise correlations considered were |$\boldsymbol{\rho}_{\text{HN}} = (-0.3, -0.4, -0.7)'$|, |$\boldsymbol{\rho}_{\text{LN}} = (-0.05, -0.1, -0.2)'$|, |$\boldsymbol{\rho}_{\text{ind}} = (0, 0, 0)'$|, |$\boldsymbol{\rho}_{\text{LP}} = (0.05, 0.1, 0.2)'$|, and |$\boldsymbol{\rho}_{\text{HP}} = (0.3, 0.4, 0.7)'$|, which correspond to high negative, low negative, independent, low positive, and high positive correlations, respectively.

We focus on type I error control and BCEP for unions of hypotheses. We utilize a type I error rate of 0.05. In order to explore the performance of the proposed method, we consider all two-way unions of the endpoints, that is, we consider the success sets |$\Omega_{j|k} = \{ \boldsymbol{\beta} : \beta_{1j} > 0 \text{ or } \beta_{1k} > 0 \}$| for |$1 \le j < k \le 3$|. In words, success is declared when there is evidence of efficacy for the treatment on at least one of two outcomes. We compare the proposed method outlined in Section 2 against a frequentist multiplicity adjustment procedure, where success is declared if the smaller of the two |$p$|-values for the treatment effects is less than 0.025 (the Holm procedure).

The validation priors utilized to obtain type I error and BCEP are, respectively, |$ \pi_{\overline{\Omega}}^{(v)}(\boldsymbol{\theta}) \propto \pi_{1}^{(v)}(\boldsymbol{\theta}) 1\left\{ \boldsymbol{\theta} \in \overline{\Omega} \right\} $| and |$ \pi_{\Omega}^{(v)}(\boldsymbol{\theta}) \propto \pi_{1}^{(v)}(\boldsymbol{\theta}) 1\left\{ \boldsymbol{\theta} \in \Omega \right\}\!, $| where |$\Omega$| is the set that defines success and |$\overline{\Omega}$| is the set defined by the boundary of the null hypothesis, where the type I error rate is at its maximum, that is, |$\overline{\Omega}_{j|k} = \{ \boldsymbol{\theta} : \beta_{1j} = 0 \cap \beta_{1k} = 0 \}$|⁠.

The first column of Figure 1 depicts the type I error rate when the set that defines success is |$\Omega_{1|2}$|, that is, the trial is declared successful when there is sufficient evidence that treatment is efficacious for outcome 1 or outcome 2. The type I error rate is controlled for both the proposed Bayesian method and the Holm procedure. The second column of Figure 1 shows BCEP based on the two methods. Across all correlation scenarios, power was higher for the proposed Bayesian method than for the Holm method. The lower the correlation between the outcomes, the more substantial the power increase over the frequentist approach. This is intuitive since, as the correlation between the two outcomes approaches 1, |$\Omega_{1|2} \to \Omega_{1} \cap \Omega_{2}$|, where |$\Omega_{j} = \{ \boldsymbol{\beta} : \beta_{1j} > 0 \}$|. For the set |$\Omega_{1} \cap \Omega_{2}$|, no multiplicity adjustment is needed, and |$\gamma^* = 0.95$| controls the type I error rate for this set.

Fig. 1.

Type I error rate and BCEP of adjusted and unadjusted fully Bayesian POS with a hybrid Holm procedure where |$\Omega_{1|2}$| defines the set that determines success and when type I error rate is 0.05.

The type I error and power results for other unions are presented in Appendix E of the Supplementary material available at Biostatistics online. For both |$\Omega_{1|3}$| and |$\Omega_{2|3}$|⁠, shown in Figures 2 and 3 of the Supplementary material available at Biostatistics online, type I error is controlled for both the proposed Bayesian and frequentist methods, but the power continues to be higher for the proposed Bayesian method across all correlation scenarios. In summary, for all of the two-way unions, the proposed Bayesian approach controlled type I error and had higher power across all outcome pairs and across all correlation scenarios.

5. Data application

In Phase 1 of the COMPASS study, a diverse collection of 40 North Carolina, US hospitals was randomized in a 1:1 allocation scheme to administer the COMPASS intervention or UC to their patients. Hospitals randomized to the intervention arm in Phase 1 were asked to continue providing the intervention in Phase 2 with minimal resources. Hospitals randomized to the UC arm in Phase 1 were allowed to transition to providing the intervention in Phase 2.

The primary outcome was the Stroke Impact Scale (SIS-16), a measure of physical function ranging from 0 to 100. There were several secondary outcomes in the study. The continuous outcomes we utilize are self-reported general health and physical health. Self-reported general health was measured on a 5-point Likert scale, which we treat as continuous. Physical health was measured using the PROMIS Global Health Scale (Hays and others, 2009), a continuous measure of overall physical health, mental health, social health, pain, fatigue, and overall perceived quality of life, which was collected only in Phase 2 of the study. As a result, we only consider the Phase 2 data in our analysis. For the purposes of demonstrating our method, we partition the full data between the study’s vanguard site where the intervention was developed and piloted (considered the older historical data |$D_{01}$|) and Phase 2 data from a subset of five high-enrolling intervention arm hospitals that elected to participate in Phase 2 (considered the newer historical data |$D_{02}$|).

We assumed the following model for each outcome: |$y_{ij} = \beta_{0j} + \beta_{1j} z_i + \boldsymbol{x}_{i}' \boldsymbol{\beta}_{2j} + \epsilon_{ij}$|, where |$\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \epsilon_{i2}, \epsilon_{i3})'$| are independent and identically distributed (i.i.d.) |$\mathop{\mathrm{N}}\nolimits_3(\boldsymbol{0}, \boldsymbol{\Sigma}),$| where |$\boldsymbol{\Sigma}$| is a |$3 \times 3$| covariance matrix assumed to be positive definite, |$z_i = 1$| if subject |$i$| received the eCare plan (a key component of the COMPASS intervention) and 0 otherwise, and |$\boldsymbol{x}_{i}$| is a vector of control variables, which include history of stroke, history of transient ischemic attack (TIA), age, race (white or nonwhite), severity of stroke, and a binary variable indicating whether the patient was hospitalized due to a stroke or a TIA. We applied a log transformation to the SIS score to reduce skewness. For the future study, we are interested in three success criteria, namely, |$\Omega_{1} = \{ \boldsymbol{\beta} : \beta_{11} > 0 \}$| (success in the primary endpoint), |$\Omega_{2|3} = \{ \boldsymbol{\beta} : \cup_{j=2}^3 \{ \beta_{1j} > 0 \} \}$| (success in at least one of the two secondary endpoints), and |$\Omega_{1(2|3)} = \Omega_{1} \cap \Omega_{2|3}$| (success in the primary endpoint and at least one of the two secondary endpoints).

In order to provide intuition for how sensitive the POS is to hyperparameters, we studied the behavior of POS for combinations of |$(b_{01}, b_{02}) \in \{ (0, 1), (1, 1) \}$| (corresponding to full borrowing from one and both of the data sets for the covariate distribution, respectively), |$a_0 \in \{ 0.0, 0.5, 1.0\}$| (corresponding to, in the fitting prior, a uniform improper prior, half borrowing from the historical data, and full borrowing from the historical data), and future sample sizes |$n \in \{ 1000, 1050, 1100, \ldots, 2000 \}$|⁠.

Figure 2 shows the probabilities of success for the sets |$\Omega_{1}$|, |$\Omega_{2|3}$|, and |$\Omega_{1(2|3)}$| defined above. For the primary endpoint without borrowing from the historical data (|$a_0 = 0$|), a future sample size of |$2000$| yields a POS just below 80%. In contrast, under half or full borrowing of the historical data (|$a_0 = 0.5$| and |$a_0 = 1.0$|, respectively), the same sample size yields a POS of 85% and 90%, respectively. Thus, utilizing the historical data increases the POS for the primary endpoint.

Fig. 2.

Probability of success curves for the COMPASS data. The quantities |$b_{01}$| and |$b_{02}$| denote the power prior hyperparameters for the older and newer historical data sets, respectively, when fitting the validation prior for the covariates. The hyperparameter |$a_0$| is the power prior parameter for the older historical data set in the fitting prior. The solid lines indicate the hyperparameter |$a_0 = 0$| and the dashed lines indicate |$a_0 = 1$|⁠. Power prior parameters of 0 indicate that the historical data set was unused.

However, utilizing the historical data decreased the POS for |$\Omega_{2|3}$|, the set corresponding to treatment efficacy in at least one of the two secondary outcomes. For example, a sample size of |$2000$| results in a POS of just over 70% for |$\Omega_{2|3}$| under no borrowing of the historical data, while the power prior hyperparameters |$0.5$| and |$1.0$| yield a POS of approximately 68% and 65%, respectively. The reduction in the POS is a result of the high degree of uncertainty surrounding the coefficient corresponding to treatment for the third outcome. Conversely, as compared to no borrowing from the historical data (i.e., when |$a_0 = 0$|), utilizing the historical data marginally raised the POS when success was defined to be treatment efficacy for the primary and at least one of the secondary outcomes, that is, when the set that defines success is |$\Omega_{1(2|3)}$|. This indicates that, using the power prior, the increase in power in the primary outcome outweighs the loss in power of the secondary outcomes.

Figure 2 also indicates that the power prior hyperparameters for the covariates do not have a large effect on the resulting POS. For a fixed level of |$a_0$|, POS is similar between the two covariate power priors. In general, it is advisable to try several different hyperparameter values for the power priors, or to try the extremes (e.g., 0 and 1), to understand how sensitive POS is to these hyperparameters. For this application, POS is quite sensitive to the power prior parameter in the fitting prior and less sensitive to the validation prior for the covariates. However, the choice of |$a_0$| should not depend on the resulting POS, as the choice of |$a_0$| must be defended to regulators and should reflect the practitioner’s beliefs about how comparable the current and historical data sets are. While the sample sizes required for a POS of 80% are large, we note that POS is primarily a function of three components: the effect size in the historical data, the sample size of the historical data, and the sample size of the future data. The COMPASS data used for our illustration suggest modest effects for the outcomes of interest.

6. Discussion

There are several advantages of our approach compared to previous POS approaches (e.g., Ibrahim and others, 2015b; Chuang-Stein, 2006). First, the SUR model allows clinicians to have different covariates across endpoints. For example, a practitioner may wish to control for the baseline value of the outcome of interest, but not the baseline value of other outcomes. Second, testing a single or a union of hypotheses may be cast in the same framework.

For practitioners, it is worth asking when an analysis consisting of marginal regressions should be conducted as opposed to a joint analysis. Berry and Hochberg (1999) argue that it is more realistic to view parameters as dependent in multiple testing situations. In particular, knowledge of the value of one parameter should influence beliefs about the values of other parameters when the study data are correlated, and they argue that it seldom makes sense to analyze the data independently. Moreover, we have established a Bayesian approach that asymptotically guarantees control of the type I error rate regardless of the posterior correlation between the parameters. Although the type I error control is an asymptotic result, we note that the vast majority of Phase III trials with multiple endpoints will have a large sample size, making errors in the approximation minimal. It is important to differentiate our method from a posterior model probability, although the two are related. While Scott and Berger (2010) argue that the only way to control for multiplicity from a Bayesian perspective is via prior model probabilities, our approach does not rely on prior model probabilities, which are typically subjective; nor does it require prior propriety.

The proposed method may also be applied when the historical data sets are small by retaining the |$q$|th highest posterior density (HPD) region for the treatment effects in the validation prior. For details and simulation results, see Appendix D of the Supplementary material available at Biostatistics online. We apply these validation prior adjustment methods to a simulated historical data set. A comparison of their effects is presented in Figure 1 of the Supplementary material available at Biostatistics online, which suggests that adjusting the validation prior using HPD regions tends to increase the POS when there is high power to detect an effect.

We emphasize that the proposed method is useful regardless of whether a Bayesian or classical frequentist analysis will ultimately be used to analyze the future study. If several outcomes are recorded for each individual in a clinical trial, the outcomes will be correlated. Thus, it would be counterintuitive to generate future data sets by using a marginal model, even if marginal models will ultimately be utilized in the future clinical trial. While the focus of this article has been on instances where all outcomes could be reasonably modeled as multivariate normal, the method can be extended to outcomes of possibly mixed types and/or time-to-event endpoints, and we are actively exploring these areas.

Software and data

A Gibbs sampler for the SUR power prior is available at github.com/ethan-alt/multivarpos. A simulated data set based on the COMPASS study is available at https://figshare.com/articles/dataset/compass_simulated_rds/17155655, with an analysis example provided in Appendix F of the Supplementary material available at Biostatistics online.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Acknowledgments

Conflict of interest. None declared.

Funding

The National Institute of Environmental Health Sciences (T32ES007018), in part.

References

Berry, D. A. and Hochberg, Y. (1999). Bayesian perspectives on multiple comparisons. Journal of Statistical Planning and Inference 82, 215–227.

Chuang-Stein, C. (2006). Sample size and the probability of a successful trial. Pharmaceutical Statistics 5, 305–309.

Ciarleglio, M. M. and Arendt, C. D. (2017). Sample size determination for a binary response in a superiority clinical trial using a hybrid classical and Bayesian procedure. Trials 18. https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-017-1791-0.

Ciarleglio, M. M., Arendt, C. D. and Peduzzi, P. N. (2016). Selection of the effect size for sample size determination for a continuous response in a superiority clinical trial using a hybrid classical and Bayesian procedure. Clinical Trials 13, 275–285.

Duncan, P. W., Bushnell, C. D., Rosamond, W. D., Jones Berkeley, S. B., Gesell, S. B., D’Agostino, R. B., Ambrosius, W. T., Barton-Percival, B., Bettger, J. P., Coleman, S. W., and others. (2017). The Comprehensive Post-Acute Stroke Services (COMPASS) study: design and methods for a cluster-randomized pragmatic trial. BMC Neurology 17.

Hays, R. D., Bjorner, J. B., Revicki, D. A., Spritzer, K. L. and Cella, D. (2009). Development of physical and mental health summary scores from the patient-reported outcomes measurement information system (PROMIS) global items. Quality of Life Research 18, 873–880.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65–70.

Ibrahim, J. G. and Chen, M.-H. (2000). Power prior distributions for regression models. Statistical Science 15, 46–60.

Ibrahim, J. G., Chen, M.-H., Gwon, Y. and Chen, F. (2015a). The power prior: theory and applications. Statistics in Medicine 34, 3724–3749.

Ibrahim, J. G., Chen, M.-H., Lakshminarayanan, M., Liu, G. F. and Heyse, J. F. (2015b). Bayesian probability of success for clinical trials using historical data. Statistics in Medicine 34, 249–264.

O’Hagan, A., Stevens, J. W. and Campbell, M. J. (2005). Assurance in clinical trial design. Pharmaceutical Statistics 4, 187–201.

Psioda, M. A. and Ibrahim, J. G. (2019). Bayesian clinical trial design using historical data that inform the treatment effect. Biostatistics 20, 400–415.

Ren, S. and Oakley, J. E. (2014). Assurance calculations for planning clinical trials with time-to-event outcomes. Statistics in Medicine 33, 31–45.

Saint-Hilary, G., Barboux, V., Pannaux, M., Gasparini, M., Robert, V. and Mastrantonio, G. (2019). Predictive probability of success using surrogate endpoints. Statistics in Medicine 38, 1753–1774.

Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics 38, 2587–2619.

Shieh, G. (2017). The equivalence of two approaches to incorporating variance uncertainty in sample size calculations for linear statistical models. Journal of Applied Statistics 44, 40–56.

Wang, F. and Gelfand, A. E. (2002). A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science 17, 193–208.

Wang, Y., Fu, H., Kulkarni, P. and Kaiser, C. (2013). Evaluating and utilizing probability of study success in clinical development. Clinical Trials 10, 407–413.

Yin, Y. (2017). A backward Bayesian method for determination of criteria for making go/no-go decisions in the early phases. Statistics in Biopharmaceutical Research 9, 153–159.

Zhang, J. and Zhang, J. J. (2013). Joint probability of statistical success of multiple phase III trials. Pharmaceutical Statistics 12, 358–365.

Zhou, M., Tang, Q., Lang, L., Xing, J. and Tatsuoka, K. (2018). Predictive probability methods for interim monitoring in clinical trials with longitudinal outcomes. Statistics in Medicine 37, 2187–2207.
