L.M. LaVange, E.M. Alt, J.G. Ibrahim, Discussion of “Optimal Test Procedures for Multiple Hypotheses Controlling the Familywise Expected Loss” by Willi Maurer, Frank Bretz, and Xiaolei Xun, Biometrics, Volume 79, Issue 4, December 2023, Pages 2802–2805, https://doi.org/10.1111/biom.13910
Abstract
We provide commentary on the paper by Willi Maurer, Frank Bretz, and Xiaolei Xun entitled "Optimal test procedures for multiple hypotheses controlling the familywise expected loss." The authors provide an excellent discussion of the multiplicity problem in clinical trials and propose a novel approach based on a decision-theoretic framework that incorporates loss functions that can vary across multiple hypotheses in a family. We provide some considerations for the practical use of the authors' proposed methods as well as some alternative methods that may also be of interest in this setting.
The authors offer a bold re-imagining of the error control problem in clinical trials and should be congratulated on this thought-provoking paper! Motivated by the shortcomings of traditional approaches that treat all possible errors in a family of hypotheses as equal, they propose a decision-theoretic framework that substitutes losses and gains for type I errors and power, and they suggest trials be designed to minimize the expected loss from false negative conclusions subject to a constraint on the familywise expected loss from false positives, allowing the losses to vary across the family of hypotheses. The advantage of a hypothesis-specific loss function lies in its ability to reflect the real-world consequences of the different errors that can occur in making decisions about multiple hypotheses at once. Multiplicity can occur along any of several trial parameter dimensions; the authors choose the subgroup problem for illustration.
The concept of family-wise error rates (FWERs) was introduced into the statistical literature in relation to any family of hypothesis tests (Tukey, 1953), but the FDA's embrace of the concept has centered on two settings: dose-finding trials and trials with multiple primary endpoints. For the former, members of the family are well-defined, corresponding to the dose levels under investigation in a trial, and the goal of the analysis is fairly consistent—to select an optimal dose for further study (typically selected in phase 2 for evaluation in phase 3 trials). The importance to regulators and drug developers alike of avoiding a type I error in this setting is reflected by the fact that an innovative statistical procedure for doing just that was qualified by the European Medicines Agency (EMA) and determined as fit for purpose by the FDA in recent years (FDA, 2016; Pinheiro et al., 2014). For the latter, primary and secondary endpoint families are pre-specified in most drug trials, and multiplicity both within and across families is the focus of the recently issued final guidance, Multiple Endpoints in Clinical Trials (FDA, 2022), as was the case in the earlier draft guidance of 2018. This guidance document provides the most detailed explanation of the agency's view on the importance of FWER control, and although the application is to endpoints, other applications are alluded to as well (multiple analysis populations, subgroups, etc.).
As the authors so clearly explain, the cost of making a type I error for one of many secondary endpoints will rarely be of similar magnitude to the cost of making an error on a primary endpoint. For a clinical trial conducted for product approval, successful secondary endpoints may be included in the product label, which has commercial and payor implications, but of a lesser magnitude than the decision to approve the drug. In trials conducted to guide medical practice or make payor decisions, attention shifts to the use of secondary endpoints to more fully describe the effect of an intervention and/or provide a comparison to certain features of other marketed drugs for the same indication, especially for endpoints that are important to patients or reflect patients' willingness to tolerate the intervention. In these and other settings, incorporating a loss function to reflect the real-world cost of making a decision error could be strategic.
Even within the multiple endpoint setting, there exist cases where determining the need for managing multiplicity, and the methods for doing so, is not straightforward. Two such cases come to mind. First, multiplicity adjustments are rarely made with respect to safety endpoints, where the error of concern is missing a safety signal, not over-stating the statistical significance of any signals that are observed. Second, the International Council for Harmonisation (ICH) guideline on Multi-Regional Clinical Trials (ICH, 2017) refers to the advantages of harmonizing endpoints across regulatory regions (e.g., US, EU, and Japan) in a clinical trial to avoid the need for separate trials conducted in different regions. But when this is not possible, because one agency requires one primary endpoint as the basis for product approval and a second agency requires another, there is general agreement that no multiplicity adjustment is required across the two primary endpoints. The regulatory decisions for product approval are unrelated, even though the endpoints themselves probably are related.
Situations where traditional multiplicity adjustment may not be appropriate arise in precision medicine, and this is the setting used for illustration here. When investigating a drug that targets a particular disease characteristic, such as a genetic mutation or disease subtype, and a biomarker is available for identifying patients with that characteristic, it is common to enroll more heavily among patients that are biomarker-positive than biomarker-negative in an early phase trial looking for an efficacy signal. Information about the drug's effect in both subgroups is important for understanding how the drug works and how it should be labeled, if approved. But, as the authors point out, the loss function associated with a type I error for the two groups could be very different.
Precision medicine advances have motivated the use of master protocols to study multiple interventions under the same protocol and trial infrastructure, with shared biomarker screening to more efficiently assign treatments targeting specific disease characteristics compared to trials run independently for each drug. Woodcock and LaVange (2017) emphasized the FDA's endorsement of these collaborations, arguing they were in the best interest of patients, but acknowledged the hurdle of convincing pharmaceutical companies to participate, as their participation might be counter to commercial considerations. For late-phase studies, data from the master protocol comparing each drug to an appropriate control may be used to support a regulatory submission for approval of that drug. There is typically no concept of the protocol being successful if at least one of the drugs is successful, as the regulatory decisions are made independently for each drug (possibly at different times), even in cases where some portion of control patients are shared across the different comparative analyses. For this reason, as with the multi-regional trial example, multiplicity adjustments across drugs are, in our opinion, difficult to defend. Furthermore, a sponsor would not likely view participation in a master protocol to their advantage, if the type I error rate was controlled at a lower level than in a stand-alone trial of their drug, removing the advantages of these collaborative research efforts the FDA intended to promote. The authors' mention of this application as an area for future research seems right on target.
In more standard situations where FWERs are of prime concern, the derivation of optimal decision rules under sensible loss functions is attractive, and the authors have made a significant contribution in this regard. Here, we offer some additional considerations for the use of such an approach.
The proposed method makes several assumptions that may not be tenable. First, the separability assumption on the decision rule may not be a reasonable assumption in general. While the authors focus on the case of subgroup analysis, where independence is a reasonable assumption, such an assumption is more difficult to defend in the context of multiple correlated endpoints (discussed further below). Second, loss or gain functions are hard to quantify and elicit in practice, and their specification is analogous to informative prior elicitation based on expert opinion. Third, decision-theoretic approaches may be computationally very intensive. Without general purpose software packages that can accommodate any general loss function, the use of decision-theoretic approaches in practice remains limited. Even if such a software package were developed, finding an optimal rule involves iterating through the decision space when there is no analytical solution for such a rule, which can be high-dimensional in the context of multiple testing.
An alternative strategy that may be of interest is the development and application of Bayesian methods that yield well-known frequentist test statistics and decision rules as special cases. Chen et al. (2011, 2014), Ibrahim et al. (2012), and Psioda and Ibrahim (2018, 2019) develop such procedures.
In their development, they define a Bayesian criterion for accepting one or multiple null or alternative hypotheses in superiority or non-inferiority testing, and then determine an appropriate sample size based on suitably defined Bayesian versions of the type I error rate (i.e., false positive rate) and power. To be more specific, they consider hypotheses for superiority testing given by $H_0: \gamma \ge \gamma_0$ versus $H_1: \gamma < \gamma_0$, where $\gamma$ is a treatment effect (e.g., a log-hazard ratio) and $\gamma_0$ is a prespecified design margin. The trial is successful if $H_1$ is accepted. Following the simulation-based framework of Wang and Gelfand (2002), Chen et al. (2011, 2014), and Ibrahim et al. (2012), Psioda and Ibrahim (2018, 2019) first specified the sampling prior, denoted by $\pi^{(s)}(\theta)$, which is used to generate the data $D^{(n)}$ based on a sample of size $n$. The fitting prior, denoted by $\pi^{(f)}(\theta)$, is then used to fit the model once the data are generated. Under the fitting prior, the posterior distribution of the parameters $\theta$ given the data $D^{(n)}$ takes the form

$$\pi^{(f)}(\theta \mid D^{(n)}) \propto L(\theta \mid D^{(n)}) \, \pi^{(f)}(\theta),$$

where $L(\theta \mid D^{(n)})$ is the likelihood function. We note that $\pi^{(f)}(\theta)$ may be improper as long as the resulting posterior $\pi^{(f)}(\theta \mid D^{(n)})$ is proper. Further, we let $m^{(s)}(D^{(n)}) = \int L(\theta \mid D^{(n)}) \, \pi^{(s)}(\theta) \, d\theta$ denote the marginal distribution of $D^{(n)}$ induced from the sampling prior. That is, $m^{(s)}(D^{(n)})$ is the prior predictive distribution of $D^{(n)}$.

Then, Ibrahim and colleagues defined the key Bayesian design quantity as follows:

$$\beta_n^{(s)} = E_s\left[ 1\left\{ P\left(\gamma < \gamma_0 \mid D^{(n)}, \pi^{(f)}\right) \ge \phi_0 \right\} \right], \tag{1}$$

where the indicator function $1\{A\}$ is 1 if $A$ is true and 0 otherwise, the expectation is taken with respect to the prior predictive distribution of $D^{(n)}$ under the sampling prior $\pi^{(s)}$, and $\phi_0 \in (0, 1)$ is a prespecified quantity that controls the Bayesian type I error rate and power. The posterior probability given in Equation (1) above is computed with respect to the fitting prior $\pi^{(f)}$.

For a given null sampling prior $\pi_0^{(s)}$ (with support contained in the null parameter space) and alternative sampling prior $\pi_1^{(s)}$ (with support contained in the alternative parameter space), we then compute

$$\alpha_n = \beta_n^{(s_0)} \quad \text{and} \quad 1 - \beta_n = \beta_n^{(s_1)},$$

where $\beta_n^{(s_0)}$ and $\beta_n^{(s_1)}$, corresponding to $\pi_0^{(s)}$ and $\pi_1^{(s)}$, are the Bayesian type I error and power, respectively. For example, to mimic a frequentist design in a superiority trial, $\pi_0^{(s)}$ and $\pi_1^{(s)}$ may be point mass distributions with mass corresponding to the null and alternative hypotheses, respectively. Then, the Bayesian sample size is given by

$$n_B = \min\left\{ n : \alpha_n \le \bar{\alpha} \ \text{and} \ 1 - \beta_n \ge 1 - \bar{\beta} \right\},$$

where $\bar{\alpha}$ and $1 - \bar{\beta}$ are prespecified bounds on the Bayesian type I error rate and power. With point mass sampling priors, this algorithm for the Bayesian type I error reduces to the frequentist definition for large samples. For example, choosing $\phi_0 = 1 - \alpha$ corresponds to a type I error rate of $\alpha$ in large samples.
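To make the simulation-based design calculation concrete, the following minimal Python sketch (ours, not the authors' implementation) estimates the Bayesian type I error and power by Monte Carlo and then searches a grid of sample sizes. It assumes a normal outcome with known standard deviation, a flat (improper) fitting prior, and point-mass sampling priors, so that the posterior probability in Equation (1) has a closed form; the function names and numerical values are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm

def bayes_rejection_rate(n, gamma_true, gamma0, sigma, phi0, n_sims, rng):
    """Monte Carlo estimate of beta_n^(s) in Equation (1) for a point-mass sampling prior
    at gamma_true: the probability that P(gamma < gamma0 | D^(n)) >= phi0. Assumes a
    normal outcome with known sigma and a flat fitting prior, so that
    gamma | D^(n) ~ N(ybar, sigma^2 / n) and the posterior probability is a normal CDF."""
    ybar = rng.normal(gamma_true, sigma / np.sqrt(n), size=n_sims)   # simulated sample means
    post_prob = norm.cdf((gamma0 - ybar) * np.sqrt(n) / sigma)       # P(gamma < gamma0 | data)
    return np.mean(post_prob >= phi0)

def bayesian_sample_size(gamma0, gamma1, sigma, phi0, alpha_bar, power_bar,
                         n_grid, n_sims=20_000, seed=1):
    """Smallest n in n_grid whose Bayesian type I error is <= alpha_bar and whose
    Bayesian power is >= power_bar (point-mass sampling priors at gamma0 and gamma1)."""
    rng = np.random.default_rng(seed)
    for n in n_grid:
        alpha_n = bayes_rejection_rate(n, gamma0, gamma0, sigma, phi0, n_sims, rng)  # null sampling prior
        power_n = bayes_rejection_rate(n, gamma1, gamma0, sigma, phi0, n_sims, rng)  # alternative sampling prior
        if alpha_n <= alpha_bar and power_n >= power_bar:
            return n, alpha_n, power_n
    return None  # no n in the grid satisfies both constraints

# Illustration: gamma is a mean difference, margin gamma0 = 0, alternative gamma1 = -0.3,
# sigma = 1. With a flat fitting prior, phi0 = 1 - alpha gives a Bayesian type I error of
# exactly alpha here; we take phi0 = 0.98 so the constraint alpha_n <= 0.025 holds with
# some margin for Monte Carlo error.
print(bayesian_sample_size(gamma0=0.0, gamma1=-0.3, sigma=1.0, phi0=0.98,
                           alpha_bar=0.025, power_bar=0.80, n_grid=range(50, 501, 10)))
```

For more realistic models the closed-form posterior probability would be replaced by an MCMC estimate nested inside the Monte Carlo loop, which is where the computational burden noted above arises.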
More recently, Alt et al. (2023) developed a fully Bayesian approach to design and analyze a study with multiple endpoints, providing a testing method that asymptotically attains the FWER of a specified level $\alpha$. To fix ideas, suppose there are $J$ hypotheses. Let $\theta = (\theta_1, \ldots, \theta_J)'$ denote the $J$-dimensional parameter vector (e.g., $J$ treatment effects) corresponding to the $J$ hypotheses. The sets corresponding to the null and alternative hypotheses for the $j$th test are denoted by $\Theta_{0j}$ and $\Theta_{1j}$. The global null hypothesis may then be denoted as $H_0: \theta \in \bigcap_{j=1}^{J} \Theta_{0j}$ versus $H_1: \theta \in \bigcup_{j=1}^{J} \Theta_{1j}$. We reject the global null hypothesis if $P(\theta \in \bigcup_{j=1}^{J} \Theta_{1j} \mid D^{(n)}) \ge \phi_0$. Note that if $J = 1$, then we may choose $\phi_0 = 1 - \alpha$ to control for the type I error rate at level $\alpha$, illustrating the intimate relationship between posterior probabilities and frequentist p-values (i.e., in large samples, they both converge to a standard uniform distribution). However, when $J > 1$ and the components of $\theta$ are correlated, it is not immediately clear how to choose $\phi_0$ to control for the FWER.

Under some mild regularity conditions, Alt et al. (2023) showed that, under the global null hypothesis $H_0$,

$$P\left(\theta \in \bigcup_{j=1}^{J} \Theta_{1j} \,\Big|\, D^{(n)}\right) \ \xrightarrow{d}\ 1 - \Phi_J(Z; 0, \Gamma), \qquad Z \sim N_J(0, \Gamma), \tag{2}$$

where the notation $\xrightarrow{d}$ denotes convergence in distribution, $\Gamma$ is the limiting posterior correlation matrix of $\theta$ for data generated under the global null hypothesis, $Z$ is a zero-mean multivariate normal random variable with correlation matrix $\Gamma$, and $\Phi_J(\cdot\,; 0, \Gamma)$ denotes the multivariate cumulative distribution function (CDF) of the multivariate normal distribution with mean $0$ and covariance matrix $\Gamma$. In words, the relationship in Equation (2) says that, under the global null hypothesis, the posterior probability converges in distribution to one minus a multivariate normal CDF transformation of a multivariate normal random variable. Note that when $J = 1$, we obtain the well-known property that the asymptotic distribution of a single posterior probability under $H_0$ is standard uniform.
Using the distribution in Equation (2), Alt et al. (2023) developed a multiple testing procedure that asymptotically attains a frequentist FWER of the specified value regardless of how the components of $\theta$ are correlated. In their simulations, their proposed approach was always more powerful than the procedure of Holm (1979), which is a recommended approach in the new FDA guidance on multiple comparisons (FDA, 2022).
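To illustrate how the limiting distribution in Equation (2) might be used in practice, the following Python sketch (ours; it is not the authors' full procedure) calibrates the threshold $\phi_0$ by Monte Carlo so that the global null hypothesis is rejected with asymptotic probability $\alpha$, taking the limiting posterior correlation matrix $\Gamma$ as given. The correlation value and function name are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def calibrate_phi0(Gamma, alpha=0.05, n_draws=5_000, seed=1):
    """Monte Carlo calibration of the evidence threshold phi0 using the limiting null
    distribution in Equation (2): under the global null, the posterior probability of
    the global alternative behaves like 1 - Phi_J(Z; 0, Gamma) with Z ~ N_J(0, Gamma).
    Taking phi0 to be the (1 - alpha) quantile of this limit gives an asymptotic
    global-null rejection probability of alpha."""
    rng = np.random.default_rng(seed)
    J = Gamma.shape[0]
    Z = rng.multivariate_normal(np.zeros(J), Gamma, size=n_draws)
    mvn = multivariate_normal(mean=np.zeros(J), cov=Gamma)
    limit_draws = 1.0 - mvn.cdf(Z)   # draws from the limiting distribution of the posterior probability
    return float(np.quantile(limit_draws, 1.0 - alpha))

# Illustration: two endpoints whose limiting posterior correlation is 0.5. With J = 1 the
# calibrated threshold is simply 1 - alpha; correlation among the components shifts it.
Gamma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(calibrate_phi0(Gamma, alpha=0.05))
```

This sketch only targets the global null hypothesis; strong FWER control over the individual hypotheses requires the additional machinery developed by Alt et al. (2023).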
We note that methods for hypothesis testing relying on posterior probabilities cannot be claimed to be optimal. Indeed, the purpose of decision-theoretic arguments is to derive optimal rules under a loss function. Nevertheless, the loss functions that enable analytically derivable optimal rules can be overly simplistic, especially in a multiple testing situation. Put differently, an optimal rule for a simple loss function (e.g., one that assumes tests are independent) could be suboptimal, or even inadmissible, under a more complex loss function (e.g., one that considers correlations between the tests). On the other hand, the method proposed by Alt et al. (2023) is related to adjusting evidence thresholds, which is an attractive alternative to decision theory.
We appreciate the opportunity to comment on the authors' novel approach to error control in drug development clinical trials and believe the publication of their article and accompanying commentaries will promote a good discussion of the proposed methods, as well as alternatives such as those we mention above, with the end result of improving the way regulatory and healthcare decisions are made. Dr. Willi Maurer's long and distinguished career has produced many valuable contributions to the field of statistics and a wide range of statistical applications, and it is an honor to be able to comment on this paper as one of his final contributions.
Acknowledgments
We would like to thank the editors for inviting us to provide a commentary on such an impactful paper.
References