Abstract

We provide commentary on the paper by Willi Maurer, Frank Bretz, and Xiaolei Xun entitled, “Optimal test procedures for multiple hypotheses controlling for the familywise expected loss.” The authors provide an excellent discussion of the multiplicity problem in clinical trials and propose a novel approach based on a decision-theoretic framework that incorporates loss functions that can vary across multiple hypotheses in a family. We provide some considerations for the practical use of the authors' proposed methods as well as some alternative methods that may also be of interest in this setting.

The authors offer a bold re-imagining of the error control problem in clinical trials and should be congratulated on this thought-provoking paper! Motivated by the shortcomings of traditional approaches that treat all possible errors in a family of hypotheses as equal, they propose a decision-theoretic framework that substitutes losses and gains for type I errors and power, and suggest trials be designed to minimize the expected loss from false negative conclusions subject to a constraint on the expected loss from false positive conclusions, allowing losses to vary across the family of hypotheses. The advantage of a hypothesis-specific loss function lies in the ability to reflect the real-world consequences of the different errors that can occur when making decisions about multiple hypotheses at once. Multiplicity can occur along any of several trial design dimensions; the authors choose the subgroup problem for illustration.

The concept of family-wise error rates (FWERs) was introduced into the statistical literature in relation to any family of hypothesis tests (Tukey, 1953), but the FDA's embrace of the concept has centered on two settings: dose-finding trials and trials with multiple primary endpoints. For the former, members of the family are well-defined, corresponding to the dose levels under investigation in a trial, and the goal of the analysis is fairly consistent—to select an optimal dose for further study (typically selected in phase 2 for evaluation in phase 3 trials). The importance to regulators and drug developers alike of avoiding a type I error in this setting is reflected by the fact that an innovative statistical procedure for doing just that was qualified by the European Medicines Agency (EMA) and determined as fit for purpose by the FDA in recent years (FDA, 2016; Pinheiro et al., 2014). For the latter, primary and secondary endpoint families are pre-specified in most drug trials, and multiplicity both within and across families is the focus of the recently issued final guidance, Multiple Endpoints in Clinical Trials (FDA, 2022), as was the case in the earlier draft guidance of 2018. This guidance document provides the most detailed explanation of the agency's view on the importance of FWER control, and although the application is to endpoints, other applications are alluded to as well (multiple analysis populations, subgroups, etc.).

As the authors so clearly explain, the cost of making a type I error for one of many secondary endpoints will rarely be of similar magnitude to the cost of making an error on a primary endpoint. For a clinical trial conducted for product approval, successful secondary endpoints may be included in the product label, which has commercial and payor implications, but of a lesser magnitude than the decision to approve the drug. In trials conducted to guide medical practice or make payor decisions, attention shifts to the use of secondary endpoints to more fully describe the effect of an intervention and/or provide a comparison to certain features of other marketed drugs for the same indication, especially for endpoints that are important to patients or reflect patients' willingness to tolerate the intervention. In these and other settings, incorporating a loss function to reflect the real-world cost of making a decision error could be strategic.

Even within the multiple endpoint setting, there exist cases where determining the need for managing multiplicity, and the methods for doing so, is not straightforward. Two such cases come to mind. First, multiplicity adjustments are rarely made with respect to safety endpoints, where the error of concern is missing a safety signal, not over-stating the statistical significance of any signals that are observed. Second, the International Council for Harmonisation (ICH) guideline on multi-regional clinical trials (ICH, 2017) references the advantages of harmonizing endpoints across regulatory regions (e.g., US, EU, and Japan) in a clinical trial to avoid the need for separate trials conducted in different regions. But when this is not possible, because one agency requires one primary endpoint as the basis for product approval and a second agency requires another, there is general agreement that no multiplicity adjustment is required across the two primary endpoints. The regulatory decisions for product approval are unrelated, even though the endpoints themselves probably are related.

Situations where traditional multiplicity adjustment may not be appropriate arise in precision medicine, and this is the setting used for illustration here. When investigating a drug that targets a particular disease characteristic, such as a genetic mutation or disease subtype, and a biomarker is available for identifying patients with that characteristic, it is common to enroll more heavily among patients that are biomarker-positive than biomarker-negative in an early phase trial looking for an efficacy signal. Information about the drug's effect in both subgroups is important for understanding how the drug works and how it should be labeled, if approved. But, as the authors point out, the loss function associated with a type I error for the two groups could be very different.

Precision medicine advances have motivated the use of master protocols to study multiple interventions under the same protocol and trial infrastructure, with shared biomarker screening to assign treatments targeting specific disease characteristics more efficiently than trials run independently for each drug. Woodcock and LaVange (2017) emphasized the FDA's endorsement of these collaborations, arguing they were in the best interest of patients, but acknowledged the hurdle of convincing pharmaceutical companies to participate, as participation might run counter to commercial considerations. For late-phase studies, data from the master protocol comparing each drug to an appropriate control may be used to support a regulatory submission for approval of that drug. There is typically no concept of the protocol being successful if at least one of the drugs is successful, as the regulatory decisions are made independently for each drug (possibly at different times), even in cases where some portion of control patients is shared across the different comparative analyses. For this reason, as with the multi-regional trial example, multiplicity adjustments across drugs are, in our opinion, difficult to defend. Furthermore, a sponsor would be unlikely to view participation in a master protocol as advantageous if the type I error rate were controlled at a lower level than in a stand-alone trial of their drug, as that would remove the very advantages of these collaborative research efforts that the FDA intended to promote. The authors' mention of this application as an area for future research seems right on target.

In more standard situations where FWERs are of prime concern, the derivation of optimal decision rules under sensible loss functions is attractive, and the authors have made a significant contribution in this regard. Here, we offer some additional considerations for the use of such an approach.

The proposed method makes several assumptions that may not be tenable. First, the separability assumption on the decision rule may not be reasonable in general. While the authors focus on the case of subgroup analysis, where independence is a reasonable assumption, such an assumption is more difficult to defend in the context of multiple correlated endpoints (discussed further below). Second, loss or gain functions are hard to quantify and elicit in practice, and their specification is analogous to informative prior elicitation based on expert opinion. Third, decision-theoretic approaches may be computationally very intensive. Without general-purpose software packages that can accommodate an arbitrary loss function, the use of decision-theoretic approaches in practice remains limited. Even if such a software package were developed, finding an optimal rule when no analytical solution exists requires searching the decision space, which can be high-dimensional in the context of multiple testing.

An alternative strategy that may be of interest is the development and application of Bayesian methods that yield well-known frequentist test statistics and decision rules as special cases. Chen et al. (2011, 2014), Ibrahim et al. (2012), and Psioda and Ibrahim (2018, 2019) develop such procedures.

In their development, they define a Bayesian criterion for accepting one or multiple null or alternative hypotheses in superiority or non-inferiority testing, and then determine an appropriate sample size based on suitably defined Bayesian versions of the type I error rate (i.e., false positive rate) and power. To be more specific, they consider hypotheses for superiority testing given by

$$H_0: \gamma \ge \gamma_0 \quad \text{versus} \quad H_1: \gamma < \gamma_0,$$

where $\gamma$ is a treatment effect (e.g., a log-hazard ratio) and $\gamma_0$ is a prespecified design margin. The trial is successful if $H_1$ is accepted. Following the simulation-based framework of Wang and Gelfand (2002), Chen et al. (2011, 2014), and Ibrahim et al. (2012), Psioda and Ibrahim (2018, 2019) first specified the sampling prior, denoted by $\pi^{(s)}(\gamma)$, which is used to generate the data $D^{(n)}$ based on a sample of size $n$. The fitting prior, denoted by $\pi^{(f)}(\gamma)$, is then used to fit the model once the data are generated. Under the fitting prior, the posterior distribution of the parameter $\gamma$ given the data $D^{(n)}$ takes the form

$$\pi^{(f)}\left(\gamma \mid D^{(n)}\right) \propto L\left(\gamma \mid D^{(n)}\right) \pi^{(f)}(\gamma),$$

where $L(\gamma \mid D^{(n)})$ is the likelihood function. We note that $\pi^{(f)}(\gamma)$ may be improper as long as the resulting posterior $\pi^{(f)}(\gamma \mid D^{(n)})$ is proper. Further, we let $m^{(s)}(D^{(n)})$ denote the marginal distribution of $D^{(n)}$ induced by the sampling prior. That is, $m^{(s)}(D^{(n)})$ is the prior predictive distribution of $D^{(n)}$.

Then, Ibrahim and colleagues defined the key Bayesian design quantity as follows:

$$\beta^{(s)}_{n} = E^{(s)}\left[\mathbb{1}\left\{P\left(\gamma < \gamma_0 \mid D^{(n)}, \pi^{(f)}\right) \ge \phi_0\right\}\right], \tag{1}$$

where the indicator function $\mathbb{1}\{A\}$ is 1 if $A$ is true and 0 otherwise, the expectation is taken with respect to the prior predictive distribution of $D^{(n)}$ under the sampling prior $\pi^{(s)}$, and $\phi_0 \in (0, 1)$ is a prespecified quantity that controls the Bayesian type I error rate and power. The posterior probability given in Equation (1) above is computed with respect to the fitting prior $\pi^{(f)}(\gamma)$.

For a given $n$ and $\phi_0$, we then compute $\beta^{(s_0)}_{n}$ and $\beta^{(s_1)}_{n}$, where $\beta^{(s_0)}_{n}$ and $\beta^{(s_1)}_{n}$, corresponding to the sampling priors $\pi^{(s_0)}$ and $\pi^{(s_1)}$, are the Bayesian type I error and power, respectively. For example, to mimic a frequentist design in a superiority trial, $\pi^{(s_0)}$ and $\pi^{(s_1)}$ may be point mass distributions with mass corresponding to the null and alternative hypotheses, respectively. Then, the Bayesian sample size is given by

$$n_B = \min\left\{n : \beta^{(s_0)}_{n} \le \alpha_0 \text{ and } \beta^{(s_1)}_{n} \ge 1 - \beta_0\right\},$$

for prespecified bounds $\alpha_0$ on the Bayesian type I error and $1 - \beta_0$ on the Bayesian power. With point mass sampling priors, this definition of the Bayesian type I error reduces to the frequentist definition for large samples. For example, choosing $\phi_0 = 1 - \alpha$ corresponds to a type I error rate of $\alpha$ in large samples.
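To make this concrete, the simulation-based design calculation can be sketched in a few lines. This is a toy illustration under our own assumptions (a normal outcome with known variance, a flat fitting prior, and point-mass sampling priors), not the authors' code:

```python
import numpy as np
from scipy.stats import norm

# Toy sketch of the simulation-based Bayesian design calculation:
# normal outcomes with known sigma, flat fitting prior (so the posterior
# is gamma | D ~ N(ybar, sigma^2/n)), and point-mass sampling priors.
rng = np.random.default_rng(1)

def bayes_rejection_rate(gamma_true, n, gamma0=0.0, sigma=1.0,
                         phi0=0.975, n_sim=100_000):
    """Monte Carlo estimate of beta_n^(s): the proportion of data sets,
    generated under a point-mass sampling prior at gamma_true, for which
    the posterior probability P(gamma < gamma0 | D) reaches phi0."""
    ybar = rng.normal(gamma_true, sigma / np.sqrt(n), size=n_sim)
    post_prob = norm.cdf((gamma0 - ybar) / (sigma / np.sqrt(n)))
    return np.mean(post_prob >= phi0)

# Sampling prior at the null boundary gives the Bayesian type I error;
# a sampling prior at an assumed alternative (gamma = -0.3) gives power.
t1e = bayes_rejection_rate(gamma_true=0.0, n=100)
power = bayes_rejection_rate(gamma_true=-0.3, n=100)
print(f"Bayesian type I error: {t1e:.3f}")  # ~0.025 when phi0 = 0.975
print(f"Bayesian power:        {power:.3f}")
```

Iterating this calculation over $n$ and taking the smallest $n$ meeting the type I error and power targets yields the Bayesian sample size; consistent with the remark above, with point-mass sampling priors and $\phi_0 = 1 - \alpha$, the Bayesian type I error approaches the frequentist rate $\alpha$ in large samples.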

More recently, Alt et al. (2023) developed a fully Bayesian approach to design and analyze a study with multiple endpoints, providing a testing method that asymptotically attains the FWER of a specified level $\alpha$. To fix ideas, suppose there are $J$ hypotheses. Let $\theta = (\theta_1, \ldots, \theta_J)'$ denote the $J$-dimensional parameter vector (e.g., $J$ treatment effects) corresponding to the $J$ hypotheses. The sets corresponding to the null and alternative hypotheses for the $j$th test are denoted by $\Theta_{0j}$ and $\Theta_{1j}$. The global null hypothesis may then be denoted as $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$, where $\Theta_1 = \bigcap_{j=1}^{J} \Theta_{1j}$ and $\Theta_0$ is its complement. We reject the global null hypothesis if $P(\theta \in \Theta_1 \mid D^{(n)}) \ge \phi_0$. Note that if $J = 1$, then we may choose $\phi_0 = 1 - \alpha$ to control for the type I error rate at level $\alpha$, illustrating the intimate relationship between posterior probabilities and frequentist $p$-values (i.e., in large samples, both converge to a standard uniform distribution under the null). However, when $J > 1$ and the components of $\theta$ are correlated, it is not immediately clear how to choose $\phi_0$ to control for the FWER.

Under some mild regularity conditions, Alt et al. (2023) showed that, under the global null hypothesis $H_0$,

$$P\left(\theta \in \Theta_1 \mid D^{(n)}\right) \xrightarrow{d} 1 - \Phi_J\left(Z; \mathbf{0}, \Gamma\right), \tag{2}$$

where the notation $\xrightarrow{d}$ denotes convergence in distribution, $\Gamma$ is the limiting posterior correlation matrix of $\theta$ for data generated under the global null hypothesis, $Z$ is a zero-mean multivariate normal random variable with correlation matrix $\Gamma$, and $\Phi_J(\cdot\,; \mu, \Sigma)$ denotes the cumulative distribution function (CDF) of the multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$. In words, the relationship in Equation (2) says that, under the global null hypothesis, the posterior probability converges in distribution to one minus a multivariate normal CDF transformation of a multivariate normal random variable. Note that when $J = 1$, we obtain the well-known property that the asymptotic distribution of a single posterior probability under $H_0$ is standard uniform.
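As a sketch of how this result can be used, the following code (our own construction based on the limiting distribution above, not code from Alt et al., 2023) calibrates the threshold $\phi_0$ by Monte Carlo sampling from that distribution, assuming the limiting posterior correlation matrix $\Gamma$ is known or estimated:

```python
import numpy as np
from scipy.stats import multivariate_normal

def calibrate_phi0(Gamma, alpha=0.05, n_sim=20_000, seed=7):
    """Return the threshold phi0 such that, asymptotically under the
    global null, P(1 - Phi_J(Z; 0, Gamma) >= phi0) = alpha, where
    Z ~ N_J(0, Gamma)."""
    rng = np.random.default_rng(seed)
    J = Gamma.shape[0]
    Z = rng.multivariate_normal(np.zeros(J), Gamma, size=n_sim)
    mvn = multivariate_normal(mean=np.zeros(J), cov=Gamma)
    # Draws from the limiting law of the posterior probability
    U = 1.0 - mvn.cdf(Z)
    return np.quantile(U, 1.0 - alpha)

# Two correlated tests: the threshold adapts to the correlation in Gamma.
Gamma = np.array([[1.0, 0.5], [0.5, 1.0]])
phi0 = calibrate_phi0(Gamma)
print(f"phi0 for J = 2, rho = 0.5: {phi0:.4f}")
```

In the degenerate case of a single test ($J = 1$), this calibration recovers $\phi_0 = 1 - \alpha$, consistent with the standard uniform property noted above.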

Using the distribution in Equation (2), Alt et al. (2023) developed a multiple testing procedure that asymptotically attains a frequentist FWER of the specified value regardless of how the components of formula are correlated. In their simulations, their proposed approach was always more powerful than the procedure of Holm (1979), which is a recommended approach in the new FDA guidance on multiple comparisons (FDA, 2022).
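For readers less familiar with the comparator, Holm's step-down procedure is simple to state and implement; the sketch below uses illustrative p-values of our own choosing, not data from the paper:

```python
def holm_reject(pvalues, alpha=0.05):
    """Holm (1979) step-down procedure: compare the k-th smallest
    p-value (k = 0, 1, ...) to alpha/(m - k) and stop at the first
    non-rejection; this controls the FWER at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # all remaining hypotheses are retained
    return reject

# Three hypotheses: the middle p-value fails its adjusted threshold.
print(holm_reject([0.012, 0.060, 0.001]))  # [True, False, True]
```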

We note that methods for hypothesis testing relying on posterior probabilities cannot be claimed to be optimal. Indeed, the purpose of decision-theoretic arguments is to derive optimal rules under a loss function. Nevertheless, the loss functions that admit analytically derivable optimal rules can be overly simplistic, especially in a multiple testing situation. Put differently, an optimal rule for a simple loss function (e.g., one that assumes tests are independent) could be suboptimal, or even inadmissible, under a more complex loss function (e.g., one that accounts for correlations between the tests). On the other hand, the method proposed by Alt et al. (2023) amounts to adjusting evidence thresholds, which is an attractive alternative to a fully decision-theoretic treatment.

We appreciate the opportunity to comment on the authors' novel approach to error control in drug development clinical trials and believe the publication of their article and accompanying commentaries will promote a good discussion of the proposed methods, as well as alternatives such as those we mention above, with the end result of improving the way regulatory and healthcare decisions are made. Dr. Willi Maurer's long and distinguished career has produced many valuable contributions to the field of statistics and a wide range of statistical applications, and it is an honor to be able to comment on this paper as one of his final contributions.

Acknowledgments

We would like to thank the editors for inviting us to provide a commentary on such an impactful paper.

References

Alt, E.M., Psioda, M.A. & Ibrahim, J.G. (2023) Bayesian multivariate probability of success using historical data with type I error rate control. Biostatistics, 24, 17–31.

Chen, M.-H., Ibrahim, J.G., Lam, P., Yu, A. & Zhang, Y. (2011) Bayesian design of noninferiority trials for medical devices using historical data. Biometrics, 67, 1163–1170.

Chen, M.-H., Ibrahim, J.G., Zeng, D., Hu, K. & Jia, C. (2014) Bayesian design of superiority clinical trials for recurrent events data with applications to bleeding and transfusion events in myelodysplastic syndrome. Biometrics, 70, 1003–1013.

FDA. (2016) Drug development tools: fit-for-purpose initiative. Washington, DC: Food and Drug Administration.

FDA. (2022) Multiple endpoints in clinical trials: guidance for industry. Washington, DC: Food and Drug Administration.

Holm, S. (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70.

Ibrahim, J.G., Chen, M.-H., Xia, H.A. & Liu, T. (2012) Bayesian meta-experimental design: evaluating cardiovascular risk in new antidiabetic therapies to treat type 2 diabetes. Biometrics, 68, 578–586.

ICH. (2017) General principles for planning and design of multi-regional clinical trials E17. The International Council for Harmonisation.

Pinheiro, J., Bornkamp, B., Glimm, E. & Bretz, F. (2014) Model-based dose finding under model uncertainty using general parametric models. Statistics in Medicine, 33, 1646–1661.

Psioda, M.A. & Ibrahim, J.G. (2018) Bayesian design of a survival trial with a cured fraction using historical data. Statistics in Medicine, 37, 3814–3831.

Psioda, M.A. & Ibrahim, J.G. (2019) Bayesian clinical trial design using historical data that inform the treatment effect. Biostatistics, 20, 400–415.

Tukey, J.W. (1953) The problem of multiple comparisons. Unpublished manuscript, Princeton University.

Wang, F. & Gelfand, A.E. (2002) A simulation-based approach to Bayesian sample size determination for performance under a given model and for separating models. Statistical Science, 17, 193–208.

Woodcock, J. & LaVange, L.M. (2017) Master protocols to study multiple therapies, multiple diseases, or both. New England Journal of Medicine, 377, 62–70.
