Optimal Test Procedures for Multiple Hypotheses Controlling the Familywise Expected Loss

TABLE 1

Classes of constrained rectangular decision rules and corresponding sets of valid and exhaustive rules with respect to additive loss functions.

Decision rule	Parameter constraints (C) and sets of valid and exhaustive rules (B)
Symmetric	C:
	B: ,
	with
Separable	C:
	B:
Asymmetric	C:
	B:

5 Application to Subgroup Analysis

We now illustrate the proposed approach using the subgroup analysis problem introduced in Section 1. Assume that the treatment effect could differ between two non-overlapping subgroups of patients defined by a biomarker. The primary objective of a new clinical trial is therefore to investigate the two subgroups separately and to decide whether a benefit of the new treatment over the control can be claimed in either subgroup. We test the null hypotheses H_i: θ_i ≤ 0 that the new treatment is no better than the control in subgroup i against the one-sided alternatives θ_i > 0, where θ_i denotes the treatment effect in subgroup i, for i = 1, 2. If H_i is rejected, a claim can be made for subgroup i. Let n_i denote the size of subgroup i, i = 1, 2. For a parallel-group design with an equal number of patients in each treatment group, the test statistic T_i for a continuous endpoint with known variance is normally distributed with unit variance and a mean that depends on the effect size θ_i and the subgroup size n_i, assuming a standard deviation of 1 without loss of generality. Furthermore, T₁ and T₂ are independent.
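As a minimal sketch of this setting, the following Python snippet computes the marginal power of a one-sided z-test in each subgroup. It assumes, for illustration only, that the subgroup size counts all patients in the subgroup, split equally between the two arms, so that the standardized mean difference has mean equal to the effect size times the square root of the subgroup size, divided by 2. The function name, sample sizes, effect size, and test level are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def marginal_power(theta, n_subgroup, level):
    """Power of a one-sided z-test in one subgroup (illustrative assumptions).

    Assumes n_subgroup patients split equally between two arms and a known
    standard deviation of 1, so the z-statistic has mean theta * sqrt(n) / 2
    and unit variance.
    """
    noncentrality = theta * sqrt(n_subgroup) / 2.0
    critical_value = STD_NORMAL.inv_cdf(1.0 - level)
    return 1.0 - STD_NORMAL.cdf(critical_value - noncentrality)


# Hypothetical numbers: subgroups of 100 and 300 patients, same effect size.
power_small = marginal_power(theta=0.3, n_subgroup=100, level=0.0125)
power_large = marginal_power(theta=0.3, n_subgroup=300, level=0.0125)

# Because the two test statistics are independent, the probability of
# rejecting both null hypotheses is the product of the marginal powers.
power_both = power_small * power_large
```

Under these assumptions the larger subgroup has the higher marginal power for the same effect size, which is one source of the asymmetry between the two tests.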

The choice of losses and gains should reflect the consequences of incorrect and correct decisions, respectively. We set the losses and gains proportional to the subgroup sizes to reflect the assumption that a larger subgroup suffers (benefits) more from a wrong (right) decision. In deriving the Bayes gain (13), we assume that the treatment effects are independent and assign each of them a marginal prior density to account for uncertainty. Such a density covers clinically relevant effect sizes, but alternative densities could be chosen based on relevant knowledge. We aim to maximize the Bayes gain while controlling the FWEL at level α for a given decision rule.
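To make the objective concrete, the sketch below evaluates a separable rule (reject the null hypothesis in subgroup i when its p-value is at most a threshold a_i) under one plausible reading of additive losses and gains. The definitions used here are assumptions, not a transcription of equations (13) or (20): a loss is incurred for each wrongly rejected hypothesis, a gain is earned for each correctly rejected hypothesis, the test statistic has the mean assumed in the comments, and the normal prior, subgroup sizes, and thresholds are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def rejection_prob(theta, n, threshold):
    """P(p-value <= threshold) for a one-sided z-test.

    Assumes the z-statistic has mean theta * sqrt(n) / 2 and unit variance.
    """
    return 1.0 - Z.cdf(Z.inv_cdf(1.0 - threshold) - theta * sqrt(n) / 2.0)


def fwel_separable(losses, thresholds):
    """Worst-case expected additive loss of a separable rule.

    The maximum is attained when both effects sit on the null boundary,
    where hypothesis i is wrongly rejected with probability a_i.
    """
    return sum(l * a for l, a in zip(losses, thresholds))


def bayes_gain_separable(gains, sizes, thresholds, prior, grid_points=2000):
    """Prior-averaged additive gain, computed with a midpoint rule.

    A gain g_i is counted only when theta_i > 0 and H_i is rejected.
    """
    lo = prior.mean - 8 * prior.stdev
    hi = prior.mean + 8 * prior.stdev
    step = (hi - lo) / grid_points
    total = 0.0
    for g, n, a in zip(gains, sizes, thresholds):
        for k in range(grid_points):
            theta = lo + (k + 0.5) * step
            if theta > 0:
                total += g * rejection_prob(theta, n, a) * prior.pdf(theta) * step
    return total


# Hypothetical example: weights proportional to subgroup sizes 100 and 300.
weights = (0.25, 0.75)
example_prior = NormalDist(0.3, 0.2)
example_fwel = fwel_separable(weights, (0.025, 0.025))
example_gain = bayes_gain_separable(weights, (100, 300), (0.025, 0.025), example_prior)
```

An optimizer would then search over the thresholds to maximize the gain subject to the loss bound; the non-separable classes need a two-dimensional integral over both p-values and are not covered by this sketch.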

Table 2 summarizes the optimal rules for the four classes considered in terms of their parameters, obtained with existing numerical optimization routines. For equal subgroup sizes, the optimal symmetric decision rule coincides with the optimal unconstrained rule, whereas the asymmetric rule has a parameter that is forced to be 1. The Bayes expected gains of the unconstrained and asymmetric rules, however, differ by only 0.001. This contrasts with the much larger difference of 0.052 in Bayes gain between the optimal unconstrained rule and the separable rule. The main reason for this difference is the much larger region A_{(1, 1)} for the unconstrained rule, where both hypotheses can be rejected if both p-values fall below a threshold that is larger than the corresponding threshold for the separable rule. The top row of Figure 2 visualizes the respective rejection areas.

TABLE 2

Optimal Bayes rules for different classes of rectangular decision rules with respect to additive loss functions and strong control of the familywise expected loss at level α.

Relative subgroup size	Decision rules	Bayes gain	Parameters
0.5	Unconstrained	0.821
	Symmetric	0.821
	Separable	0.769
	Asymmetric	0.820
0.25	Unconstrained	0.838
	Symmetric	0.800
	Separable	0.790
	Asymmetric	0.837

Figure 2

Graphical visualization of the optimal decision rules from Table 2. Top: equal subgroup sizes. Bottom: unequal subgroup sizes.

For unequal subgroup sizes, the losses assigned to the two subgroups differ. This asymmetry in the losses and in the marginal power of the two tests leads to different optimal rules for the different classes under consideration. The optimal asymmetric rule, with a Bayes gain of 0.837, turns out to be close to the optimal unconstrained rule, whose Bayes gain of 0.838 is only slightly larger. Nevertheless, the optimal asymmetric rule might be preferred in practice because some of its parameters are forced to 1, so that a possibly large p-value in one subgroup does not prevent the rejection of the null hypothesis in the other subgroup if its p-value is small. The asymmetry in a₁ and a₂ allows for a larger area where both hypotheses can be rejected than in the symmetric or separable case, which gives this type of rule a clear advantage. The bottom row of Figure 2 visualizes the optimal asymmetric and separable decision rules as examples.

Finally, Figure 3 displays the Bayes gain as a function of the relative subgroup size, ranging from 0.05 to 0.5, for the four classes of rectangular decision rules. We conclude that the asymmetric rule performs closest to the unconstrained rule, and both are superior to the symmetric and the separable rules. The maximum difference in Bayes gain between the four classes is about 0.05. The differences in Bayes gain, however, are diluted by cases where both treatment effects are either very small or very large and, consequently, lead to expected gains close to 0 or 1 irrespective of the rule. Therefore, we also plot in Figure 3 the expected gain (20) of the same optimal Bayes rules, where we fix the treatment effects such that the marginal power is roughly between 0.2 and 0.8. As seen from Figure 3, for part of the range the expected gain for the separable rule is about 0.07 smaller than for the other rules. Elsewhere, the differences in expected gain between the separable and unconstrained rules decrease somewhat and the symmetric rule is worse than the optimal separable rule.

Figure 3

(A) Bayes gain (13) of optimal Bayes rules with respect to additive loss functions within each class of rectangular decision rules. (B) Expected gain (20) of the optimal Bayes rules for fixed treatment effects. This figure appears in color in the electronic version of this paper.

Because the optimal asymmetric rule is so close to the optimal unconstrained rule, we further investigate a simplified asymmetric rule whose parameters are set directly rather than optimized. Such a rule turns out to be almost optimal under the current choice of the prior density and may serve as a rule of thumb, because it can be computed conveniently without any numerical optimization. The largest difference between this rule and the optimal unconstrained rule is less than 0.0017 over the whole range considered. If we instead assume a different prior density, the optimal unconstrained rule is virtually identical to the above rule of thumb, with expected gains between 0.84 and 0.88. As before, the difference in expected gain between both rules is negligible, at most 0.0002 over the whole range considered.

6 Discussion

Our approach includes several common multiple test procedures as special cases. For example, using separable decision rules and additive loss functions, our approach is equivalent to that of Benjamini and Hochberg (1997), who generalized earlier results of Spjotvoll (1972) to maximize the expected number of rejections while controlling the expected number of incorrectly rejected hypotheses. Moreover, using the binary loss function (4) leads to FWER-controlling procedures, as shown in Section 2.3. In Web Appendix C, we specialize the results in Section 4 to binary loss functions and provide connections to common multiple test procedures. For example, using separable decision rules and binary loss functions, our approach specializes to the Šidák (1967) test for subgroups of equal size.
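The Šidák special case can be checked numerically. The short sketch below uses the standard Šidák threshold for independent tests; the function names and the level of 0.025 are chosen here for illustration and are not part of the paper.

```python
def sidak_threshold(alpha, num_tests):
    """Per-test p-value threshold of the Sidak test for independent tests."""
    return 1.0 - (1.0 - alpha) ** (1.0 / num_tests)


def fwer_independent(threshold, num_tests):
    """Probability of at least one false rejection when all nulls are true,
    all p-values are independent and uniform, and each test uses `threshold`."""
    return 1.0 - (1.0 - threshold) ** num_tests


# Two independent subgroup tests at a familywise level of 0.025.
a = sidak_threshold(0.025, 2)
achieved = fwer_independent(a, 2)
bonferroni = 0.025 / 2
```

With two independent tests, the threshold `a` is slightly larger than the Bonferroni threshold, and the achieved familywise error rate equals the nominal level rather than falling below it.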

Our approach can be extended to applications other than subgroup analyses, although the assumptions and derivations may need modifications. For example, when testing a new drug for two unequally important variables, losses can be caused by, say, an erroneous drug approval based on the primary variable or, less severely, an erroneous drug label based on the secondary variable. The conventional practice of controlling the FWER, however, rates both null hypotheses as equally important and does not account for the different losses inflicted by inappropriate drug approval or labeling (Wang et al., 2015). As another example, several treatments may be evaluated in parallel in a master protocol trial when treatment options depend on one or more biomarkers within a single cancer type. If each treatment is evaluated in a single subgroup, the parallel sub-studies are essentially independent trials, each considering one treatment in its own distinct patient group, conducted together for administrative or operational reasons such as saving time, reducing costs, or facilitating recruitment (Woodcock & LaVange, 2017). In these applications, the test statistics will be stochastically dependent, and we leave the extension of the proposed methods to these settings for future research.

In applying the proposed approach, we need to set a value for the upper threshold α. As mentioned earlier, we standardize the maximum total loss at 1 unit. That is, in the setting of Section 5, the loss for erroneously rejecting the null hypothesis that there is no treatment effect in either subgroup is set to 1. The expected loss for a test in the overall population that disregards possible subgroups is then equal to the Type I error rate. A natural choice for α is therefore 0.025, in analogy to current clinical trial practice. We also need to assign actual values to the losses and gains. One choice is to set the losses and gains proportional to the prevalence of the two subgroups, as in Section 5. Also, we recommend using the same proportions among the losses and gains. This keeps the relative importance of the subgroups the same in evaluating expected losses and gains: if a right (wrong) decision is made for a larger subgroup, the induced gain (loss) will be larger as well, and vice versa.
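The standardization argument can be written out as a short worked equation. The symbols $\ell_1$ and $\ell_2$ for the subgroup losses and $\varphi$ for the rejection indicator of the overall test are notation assumed here for illustration. If the losses sum to one and an overall test that rejects with probability $\alpha$ under the null wrongly claims a benefit for both subgroups whenever it rejects, then

```latex
\ell_1 + \ell_2 = 1
\quad\Longrightarrow\quad
\mathrm{E}[\text{loss}]
  = (\ell_1 + \ell_2)\,\Pr(\varphi = 1)
  = 1 \cdot \alpha
  = \alpha .
```

This is why a threshold of 0.025 on the expected loss is directly comparable to a one-sided Type I error rate of 0.025.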

We also investigated optimal decision rules following the minimax principle to account for the intrinsic uncertainty in the treatment effects. Roughly speaking, these decision rules minimize the maximum regret (over all possible treatment effects) of not adopting the best rule. The advantage is that we do not have to specify a density for the treatment effects. However, the numerical optimization is more challenging because of the repeated minimization and maximization steps. The minimax approach is also more difficult to interpret because the regret is essentially a difference of expected gains, whereas the Bayes gain is an average on the same scale as the expected gains themselves. Our numerical studies indicated that the minimax and Bayes approaches do not differ much in practice, in the sense that their maximized gains are similar for a given class of rules.

Following Müller et al. (2004) and Sun and Cai (2009), we propose in this paper maximizing the Bayes gain while controlling the expected loss at a pre-specified threshold α. Alternatively, one could maximize a utility function of the difference between gain and loss, under the same constraint on the expected loss. Technically, both approaches lead to the same optimal decision rules and hence the same design. Practically, gains and losses may be measured on different scales. We therefore prefer the proposed approach because it is easier to interpret. Also, one could consider more general loss and gain functions that depend on the actual value of the treatment effect and evaluate decisions in a continuous fashion. We leave such possible extensions for future research.

Data Availability Statement

Data sharing not applicable to this paper as no datasets were generated or analyzed during the current study.

Acknowledgments

We are grateful to two reviewers and the associate editor for their constructive comments on an earlier version of this paper. We would like to express our sadness that our co-author Willi Maurer passed away while this manuscript was under preparation. We are indebted to him as our colleague, mentor, and friend. In this way, we want to thank him posthumously for his collaboration and guidance throughout many years.

References

Benjamini, Y. & Hochberg, Y. (1997) Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24, 407–418.

Brannath, W., Hillner, C. & Rohmeyer, K. (2023) The population-wise error rate for clinical trials with overlapping populations. Statistical Methods in Medical Research, 32, 334–352. https://doi.org/10.1177/09622802221135249

Cohen, A. & Sackrowitz, H.B. (2005) Decision theory results for one-sided multiple comparison procedures. The Annals of Statistics, 33, 126–144.

Gabriel, K.R. (1969) Simultaneous test procedures: some theory of multiple comparisons. The Annals of Mathematical Statistics, 40, 224–250.

Graf, A.C., Posch, M. & Koenig, F. (2015) Adaptive designs for subpopulation analysis optimizing utility functions. Biometrical Journal, 57, 76–89.

Hommel, G. & Bretz, F. (2008) Aesthetics and power considerations in multiple testing—a contradiction? Biometrical Journal, 50, 657–666.

Lehmann, E.L. (1957) A theory of some multiple decision problems, I. The Annals of Mathematical Statistics, 28, 1–25.

Lisovskaja, V. & Burman, C.-F. (2014) A decision theoretic approach to optimization of multiple testing procedures. Biometrical Journal, 57, 64–75.

Marcus, R., Peritz, E. & Gabriel, K.R. (1976) On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63, 655–660.

Müller, P., Parmigiani, G. & Rice, K. (2007) FDR and Bayesian multiple comparisons rules. In: Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A. & West, M. (Eds.) Bayesian Statistics, Oxford: Oxford University Press, pp. 349–370.

Müller, P., Parmigiani, G., Robert, C. & Rousseau, J. (2004) Optimal sample size for multiple testing. Journal of the American Statistical Association, 99, 990–1001.

Rosenblum, M., Fang, E.X. & Liu, H. (2020) Optimal, two-stage, adaptive enrichment designs for randomized trials, using sparse linear programming. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 749–772.

Rosenblum, M., Liu, H. & Yen, E.-H. (2014) Optimal tests of treatment effects for the overall population and two subpopulations in randomized trials, using sparse linear programming. Journal of the American Statistical Association, 109, 1216–1228.

Senn, S. & Bretz, F. (2007) Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6, 161–170.

Šidák, Z. (1967) Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626–633.

Simes, R.J. (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73, 751–754.

Spiegelhalter, D.J., Abrams, K.R. & Myles, J.P. (2004) Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons.

Spjotvoll, E. (1972) On the optimality of some multiple comparison procedures. The Annals of Mathematical Statistics, 43, 398–411.

Sun, W. & Cai, T. (2009) Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 393–424.