Willi Maurer, Frank Bretz, Xiaolei Xun, Optimal test Procedures for Multiple Hypotheses Controlling the Familywise Expected Loss, Biometrics, Volume 79, Issue 4, December 2023, Pages 2781–2793, https://doi.org/10.1111/biom.13907
Abstract
We consider the problem of testing multiple null hypotheses, where a decision to reject or retain must be made for each one and embedding incorrect decisions into a real-life context may inflict different losses. We argue that traditional methods controlling the Type I error rate may be too restrictive in this situation and that the standard familywise error rate may not be appropriate. Using a decision-theoretic approach, we define suitable loss functions for a given decision rule, where incorrect decisions can be treated unequally by assigning different loss values. Taking expectation with respect to the sampling distribution of the data allows us to control the familywise expected loss instead of the conventional familywise error rate. Different loss functions can be adopted, and we search for decision rules that satisfy certain optimality criteria within a broad class of decision rules for which the expected loss is bounded by a fixed threshold under any parameter configuration. We illustrate the methods with the problem of establishing efficacy of a new medicinal treatment in non-overlapping subgroups of patients.
1 Introduction
Consider the problem of comparing a new medicinal treatment with a control treatment in a given patient population S. Suppose that the treatment effects $\theta_1$ and $\theta_2$ in two non-overlapping subgroups $S_1$ and $S_2$, respectively, could be different from each other, where $S = S_1 \cup S_2$ and the split into subgroups is defined by predictive biomarkers, demographic factors, or any other classifier. We are then interested in testing the (one-sided) null hypothesis $H_i\colon \theta_i \le 0$ that the new treatment is not better than the control in subgroup $S_i$, $i = 1, 2$. A decision has to be made whether any of the hypotheses can be rejected to claim an advantage of the new treatment over the control in the respective subgroup.
Conventional multiple test procedures control the familywise error rate (FWER) strongly at a given significance level $\alpha$. The probability to incorrectly reject at least one true null hypothesis is then bounded by α under any configuration of true and false null hypotheses. When controlling the FWER, we adjust for multiplicity and test each hypothesis at, for example, level $\alpha/2$ (Bonferroni test). If, however, only the hypothesis $H_1$ for the first subgroup is rejected, the new treatment may be approved only for that subgroup $S_1$. The risk of a false positive decision for that subgroup is therefore controlled by testing $H_1$ at level α, as also noticed by Brannath et al. (2003). Since the same argument also holds for the other subgroup $S_2$, it then seems reasonable to test each individual hypothesis $H_1$ and $H_2$ at level α, although the FWER can be almost 2α.
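The quantitative claim above is easy to verify. The following sketch (ours, not part of the paper; the numeric level is a hypothetical example) compares the FWER of unadjusted level-α testing with the Bonferroni adjustment for two independent tests.

```python
# FWER for m independent true null hypotheses, each tested at its own level.
# Unadjusted testing at alpha gives FWER = 1 - (1 - alpha)^m (close to m*alpha),
# while Bonferroni tests each hypothesis at alpha/m and keeps the FWER below alpha.

def fwer_independent(level: float, m: int) -> float:
    """Probability of at least one false rejection among m independent tests."""
    return 1.0 - (1.0 - level) ** m

alpha, m = 0.025, 2
fwer_unadjusted = fwer_independent(alpha, m)      # each test at level alpha
fwer_bonferroni = fwer_independent(alpha / m, m)  # each test at level alpha/m

print(round(fwer_unadjusted, 6))  # 0.049375, almost 2*alpha
print(round(fwer_bonferroni, 6))  # 0.024844, below alpha
```

The unadjusted FWER of $1 - (1 - \alpha)^2$ makes precise the statement that testing each hypothesis at level α lets the FWER grow to almost 2α.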
If FWER control is used, the consequences of incorrectly rejecting either H1 or H2 are treated equally, regardless of whether one or both null hypotheses are rejected. Embedding incorrect decisions into a real-world context, however, may inflict different losses. Consider, for example, a clinical trial to treat advanced breast cancer patients with a new treatment. Results from previous trials indicate that patients with certain mutant cell lines are more sensitive to the new treatment. The primary objective of the new trial is then to investigate the new treatment separately in the two non-overlapping patient subgroups S1 (mutant cell lines) and S2 (non-mutant cell lines). In this example, we may attribute a loss of, say, 1 unit to the wrong decision that there is a treatment effect in both subgroups if in fact there is no effect in either one. It seems reasonable, however, to assume that an incorrect decision related to just one of the subgroups leads to a smaller loss $\ell_i < 1$. One possibility is to assume the loss $\ell_i$ to be proportional to the size of $S_i$ relative to the overall population S and to be additive, that is, $\ell_1 + \ell_2 = 1$.
In the following, we apply a decision-theoretic approach to such settings, while maintaining the principles of standard hypothesis testing. Decision theory lends itself naturally to multiple test problems. Lehmann (1957) formulated the problem as to make a choice from $2^m$ finite actions when there are m hypotheses, discussed a general class of multiple decision problems with an additive loss function over false rejections and acceptances for individual hypotheses, and proved that such procedures uniformly minimize the risk among all unbiased procedures. Cohen and Sackrowitz (2005), with the same formulation and loss function, further examined single-step, step-down and step-up procedures for their admissibility properties under a multivariate normal model. The finite actions formulation entails the specification of losses for each individual decision, which can be quite general.
Various authors investigated combined measures of false positives and false negatives using a single loss function for large-scale multiple test problems. For example, Müller et al. (2004) proposed linear combinations of posterior expected false positive and false negative rates and counts, or, alternatively, a bivariate loss function that explicitly acknowledges the competing goals; see also Müller et al. (2007). Sun and Cai (2009) minimized the false negative rate subject to a constraint on the false discovery rate (FDR), essentially the same as the bivariate loss function of Müller et al. (2004). The advantage of such an approach is that the FDR or its variants are controlled exactly at a pre-specified level. In contrast to the needs of our applications, however, it is practically infeasible to assign $2^m$ individual loss values for large-scale multiple testing and most likely such a fine granularity is not of interest.
Others applied decision-theoretic approaches to small-scale multiple test problems often encountered in clinical trials. For example, Rosenblum et al. (2014) investigated constrained Bayes and minimax optimization problems when testing the treatment effects for an overall population and two subgroups within the FWER framework; see also Rosenblum et al. (2020). Lisovskaja and Burman (2014) proposed a utility function to describe the perceived gain of rejecting a certain constellation of hypotheses. Graf et al. (2015) argued that losses and gains may be quantified differently by relevant stakeholders and proposed different utility functions that formalize the key decision-making components better than traditional power considerations. Nevertheless, the goal remains to control the conventional FWER.
Motivated by our applications, we argue that the FWER is not always fit for purpose and alternative error rates should be considered. Given that the number of hypotheses is relatively small in typical clinical trial applications, we adopt the finite actions perspective in order to adequately reflect the consequences of decisions in the real world. Furthermore, we propose to minimize a measure of false negatives subject to the constraint that a measure of false positives is bounded by a given threshold under any parameter configuration, similar in spirit to Müller et al. (2004) and Sun and Cai (2009). This is also in analogy to classical hypothesis testing where power is maximized while controlling the Type I error rate.
We formalize these ideas in Section 2 and define suitable loss functions for a given decision rule, where incorrect decisions can be treated unequally by assigning different loss values. Taking expectation with respect to the sampling distribution of the data allows us to control the familywise expected loss (FWEL) instead of the conventional FWER. That is, a given decision rule is said to control the FWEL at level α if the expected loss is bounded by α under any parameter value. Furthermore, we introduce relevant properties of decision rules related to FWEL control (validity, exhaustiveness, and admissibility). Next, we define gain functions for a given decision rule based on a suitable choice of gain values and introduce the familywise expected gain, in analogy to the traditional power concept. For composite alternatives, we introduce the Bayes gain to account for the intrinsic uncertainty in the parameter, similar to what is often done in conventional hypothesis testing. With this framework in place, we search in the remainder of this paper for optimal decision rules that maximize the familywise expected gain while controlling the FWEL. Technical results and proofs are deferred to the Web Appendix.
2 Decision Rules Controlling the Familywise Expected Loss
2.1 General Decision Rules
Let $\Theta$ denote the parameter space and $\theta \in \Theta$ the parameter vector of interest. We consider testing simultaneously m null hypotheses $H_i$ against the alternative hypotheses $K_i$, $i = 1, \ldots, m$. The combinations (intersections) of null and alternative hypotheses partition the parameter space Θ into $2^m$ disjoint sets $\Theta_\nu$ for $\nu = (\nu_1, \ldots, \nu_m) \in \{0, 1\}^m$. Let $\{0, 1\}^m$ denote the Cartesian product $\{0, 1\} \times \cdots \times \{0, 1\}$. Let further $\nu(\theta) = (\nu_1(\theta), \ldots, \nu_m(\theta))$ denote the indicator function for the partition sets of Θ, that is, $\nu_i(\theta) = 0$ if $\theta \in H_i$ and $\nu_i(\theta) = 1$ if $\theta \in K_i$, $i = 1, \ldots, m$. Based on observed data, a decision $d = (d_1, \ldots, d_m) \in \{0, 1\}^m$ is made whether to reject or retain each $H_i$, where $d_i = 0$ if $H_i$ is retained and 1 otherwise, $i = 1, \ldots, m$.
Let the data relevant for decision-making be summarized in form of p-values $p = (p_1, \ldots, p_m)$. We call a function $d\colon [0, 1]^m \to \{0, 1\}^m$, which maps observed values p on decisions $d(p)$, a decision rule. Note that a decision rule $d_i(p)$ to either reject or retain an individual null hypothesis $H_i$ may not only depend on the individual p-value $p_i$, but on the complete vector p. If the decision on each $H_i$ depends only on its associated p-value $p_i$, $i = 1, \ldots, m$, such that $d(p) = (d_1(p_1), \ldots, d_m(p_m))$, we call d a separable decision rule.
2.2 Loss and Gain Functions
We consider score functions that assign a non-negative value to each decision d with respect to a true parameter θ. Formally, loss functions are score functions that assign a positive value to an incorrect decision by quantifying the associated costs. Similarly, gain functions are score functions that assign a positive value to a correct decision by quantifying the associated benefits. This distinction between loss and gain functions resembles the general notion of incorrectly rejecting a null hypothesis (Type I error) and correctly rejecting a null hypothesis, respectively. Following the classical hypothesis testing framework, we thus evaluate decisions in a binary way (i.e., whether a decision is correct or not), regardless of the magnitude of the underlying parameter value. We therefore focus on step score functions $s(d, \nu(\theta))$, which depend on θ only through $\nu(\theta)$ and are thus constant on each of the $2^m$ disjoint sets $\Theta_\nu$, $\nu \in \{0, 1\}^m$.
Let $w_i$ denote a real-valued non-negative weight, $i = 1, \ldots, m$. We investigate additive step score functions $s(d, \nu)$, which can be written as
$$s(d, \nu) = \sum_{i=1}^{m} w_i\, b(d_i, \nu_i), \qquad (1)$$
where $w_i$ denotes the weight of a correct (or incorrect) rejection of $H_i$ and $b(d_i, \nu_i)$ a Boolean function depending on a decision $d_i$ and $\nu_i$.
For loss functions we consider the Boolean function $b^0(d_i, \nu_i) = d_i (1 - \nu_i)$, $i = 1, \ldots, m$, which takes value 1 if $d_i = 1$ and $\nu_i = 0$ (i.e., $H_i$ is incorrectly rejected) and 0 otherwise. We further attribute a loss $\ell_i > 0$ if $H_i$ is incorrectly rejected. Following Equation (1), we therefore assign to each d and ν a non-negative value through the additive step-loss function
$$\ell(d, \nu) = \sum_{i=1}^{m} \ell_i\, d_i (1 - \nu_i). \qquad (2)$$
For the remainder of this paper, we standardize the maximum total loss at 1 unit, that is, $\sum_{i=1}^{m} \ell_i = 1$. Web Lemma 1 shows that this standardization can be done without loss of generality with respect to the determination of optimal decision rules.
For gain functions, we consider the Boolean function $b^1(d_i, \nu_i) = d_i\, \nu_i$, $i = 1, \ldots, m$, which takes value 1 if $d_i = 1$ and $\nu_i = 1$ (i.e., $H_i$ is correctly rejected) and 0 otherwise. We further attribute a gain $g_i > 0$ if $H_i$ is correctly rejected. Following Equation (1), we therefore assign to each d and ν a non-negative value through the additive step gain function
$$g(d, \nu) = \sum_{i=1}^{m} g_i\, d_i\, \nu_i. \qquad (3)$$
We again standardize the maximum gain at 1 unit, that is, $\sum_{i=1}^{m} g_i = 1$ without loss of generality. Note that gains and losses can be measured on different scales and a loss of 1 unit may not be comparable to a gain of 1 unit.
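As a concrete illustration of Equations (2) and (3), the following sketch (ours, not from the paper; the weight values are hypothetical) evaluates the additive step loss and gain for a given decision vector d and partition indicator ν.

```python
from typing import Sequence

def additive_loss(d: Sequence[int], nu: Sequence[int], ell: Sequence[float]) -> float:
    """Additive step loss: sum of ell_i over incorrectly rejected hypotheses,
    i.e., those with d_i = 1 while nu_i = 0 (H_i true but rejected)."""
    return sum(l * di * (1 - ni) for di, ni, l in zip(d, nu, ell))

def additive_gain(d: Sequence[int], nu: Sequence[int], g: Sequence[float]) -> float:
    """Additive step gain: sum of g_i over correctly rejected hypotheses,
    i.e., those with d_i = 1 while nu_i = 1 (H_i false and rejected)."""
    return sum(gi * di * ni for di, ni, gi in zip(d, nu, g))

# Hypothetical weights standardized to sum to 1 (maximum total loss/gain = 1 unit),
# e.g., proportional to subgroup sizes.
ell = g = [0.75, 0.25]
nu = (0, 1)                            # H1 true, H2 false
print(additive_loss((1, 1), nu, ell))  # 0.75: only H1 is incorrectly rejected
print(additive_gain((1, 1), nu, g))    # 0.25: only H2 is correctly rejected
```

Rejecting both hypotheses when only the second is false thus incurs loss 0.75 and gain 0.25 at the same time, which is exactly the asymmetry the additive score functions are designed to express.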
We also consider maximum step score functions $s(d, \nu) = \max_{i} w_i\, b(d_i, \nu_i)$ with the same Boolean functions as in Equation (1), $i = 1, \ldots, m$. Applying equal weights $w_i = 1$ and $b = b^0$, $i = 1, \ldots, m$, we obtain a binary step-loss function
$$\ell(d, \nu) = \max_{i \in I_0(\nu)} d_i, \qquad (4)$$
where $I_0(\nu) = \{i\colon \nu_i = 0\}$ denotes the index set of true null hypotheses. Note that $\ell(d, \nu)$ is the indicator function of the familywise Type I error as it assigns a loss 1 if at least one true null hypothesis is rejected and 0 otherwise.
2.3 Familywise Expected Losses and Valid Decision Rules
Assume that p is a random vector with a distribution depending on . The score function
is then a random variable with a distribution parameterized by
. Because
is a discrete random variable defined in
, the expectation of the score function is

where ,
. For additive step score functions defined in Equation (1),

For loss functions defined in Equation (2), we have . That is, we assign a loss of 0 when retaining a null hypothesis, regardless of whether that null hypothesis is true or not. Therefore,
and hence
. Denote
by
, where
. For
, we then have because of Equation (6)

We define the FWEL as the expectation $E_\theta[\ell(d(p), \nu(\theta))]$ for a given decision rule d and a loss function ℓ, where the expectation is taken with respect to the sampling distribution of the data parameterized by θ. A decision rule d is said to control the FWEL at a given level $\alpha \in (0, 1)$ if $E_\theta[\ell(d(p), \nu(\theta))] \le \alpha$ for a given loss function ℓ. Furthermore, d is called valid if it controls the FWEL strongly at level α, that is, if the bound holds for all $\theta \in \Theta$. Since the maximum total loss is standardized at 1 unit according to Section 2.2, the same standardization also holds for the FWEL.
Note that the FWEL for additive step loss functions (2) is closely related to the per-family error rate (PFER), which is defined as the expected total number of false discoveries (Benjamini & Hochberg, 1997). If $\ell_i = 1$ for all i, the expectation in Equation (7) reduces to the PFER. When assigning unequal losses $\ell_i$ with $\sum_{i=1}^{m} \ell_i = 1$ after standardization, however, this relationship does not hold anymore and we rather obtain a weighted version of the PFER.
For binary step loss functions (4), $\ell(d, \nu) = 1$ if $d_i = 1$ for at least one $i \in I_0(\nu)$ and 0 otherwise. Therefore,
$$E_\theta[\ell(d(p), \nu)] = P_\theta\Big(\bigcup_{i \in I_0(\nu)} \{d_i(p) = 1\}\Big), \qquad (8)$$
which is precisely the FWER, that is, the probability that d erroneously rejects at least one true null hypothesis $H_i$, $i \in I_0(\nu)$; see Web Appendix C for more discussions.
Since $\Theta = \bigcup_{\nu \in \{0, 1\}^m} \Theta_\nu$ and $\nu(\theta)$ is constant on $\Theta_\nu$ for each index set $I_0(\nu)$, conditions (7) and (8) can be rewritten as Equations (9) and (10), respectively, in the following proposition.
Proposition 1. Let $\pi_i(\theta) = P_\theta(d_i(p) = 1)$ for $i = 1, \ldots, m$, and $I_0(\nu) = \{i\colon \nu_i = 0\}$ for $\nu \in \{0, 1\}^m$. A decision rule d is valid with respect to an additive step-loss function (2) if and only if
$$\sum_{i \in I_0(\nu)} \ell_i\, \pi_i(\theta) \le \alpha \quad \text{for all } \nu \in \{0, 1\}^m \text{ and } \theta \in \Theta_\nu. \qquad (9)$$
Furthermore, a decision rule is valid with respect to a binary step loss function (4) if and only if
$$P_\theta\Big(\bigcup_{i \in I_0(\nu)} \{d_i(p) = 1\}\Big) \le \alpha \quad \text{for all } \nu \in \{0, 1\}^m \text{ and } \theta \in \Theta_\nu. \qquad (10)$$
Condition (10) is closely related to closed test procedures defined by level-α tests for the intersection hypotheses $H_\nu = \bigcap_{i \in I_0(\nu)} H_i$, $\nu \in \{0, 1\}^m$ (Marcus et al., 1976), as shown in Corollary 1 below. In there, a closed test procedure is said to be equivalent with a decision rule if an individual null hypothesis $H_i$ rejected by one is also rejected by the other. Following Gabriel (1969), a closed test procedure is called consonant if a rejection of an intersection null hypothesis $H_\nu$ implies the rejection of at least one individual null hypothesis $H_i$ for $i \in I_0(\nu)$.
Corollary 1. The following assertions hold:
- (i)
For any closed test procedure there is an equivalent valid decision rule controlling strongly the FWER (i.e., the FWEL with respect to a binary loss function).
- (ii)
Any valid decision rule with respect to the binary loss function is equivalent to a consonant closed test procedure.
- (iii)
For any closed test procedure there is an equivalent consonant closed test procedure.
Despite the similarity of conditions (9) and (10), a similar result as Corollary 1 for decision rules that are valid with respect to additive loss functions is not possible. To see this, note that $\sum_{i \in I_0(\nu)} \ell_i\, d_i \le \max_{i \in I_0(\nu)} d_i$ because of the standardization $\sum_{i=1}^{m} \ell_i = 1$, and the same inequality holds also for the expected losses. Hence, a valid decision rule with respect to a binary loss is also valid with respect to any additive loss function. While a decision rule defined by a closed test procedure via rejection regions for the intersection hypotheses $H_\nu$ is valid for any additive loss, it is always strictly conservative, that is, it does not exhaust the level. However, Equation (9) allows one to construct such rules via the rejection sets in a stepwise fashion, starting with the configuration $\nu = (0, \ldots, 0)$ where all null hypotheses are true.
2.4 Optimal Decision Rules
In analogy to the FWEL, we define the familywise expected gain (FWEG) as the expectation $E_\theta[g(d(p), \nu(\theta))]$. For additive gain functions defined in Equation (3), this becomes
$$E_\theta[g(d(p), \nu)] = \sum_{i \in I_1(\nu)} g_i\, \pi_i(\theta), \qquad (11)$$
where $I_1(\nu) = \{i\colon \nu_i = 1\}$ denotes the index set of false null hypotheses. This is a weighted average power where we assign weight $g_i$ if $H_i$ is correctly rejected with probability $\pi_i(\theta)$. It can be interpreted as a weighted version of the expected number of true discoveries, similar to how the FWEL for additive step loss functions can be interpreted as a weighted version of the PFER (Section 2.3).
In the remainder of this section, we search for optimal valid decision rules that maximize the FWEG while controlling the FWEL at level α for a given loss function ℓ. We assume the distribution of p to be continuous and show in Proposition 2 that the search can be restricted to exhaustive decision rules d satisfying $\sup_{\theta \in \Theta_\nu} E_\theta[\ell(d(p), \nu)] = \alpha$ for at least one $\nu \in \{0, 1\}^m$. That is, any valid decision rule which does not exhaust α can be excluded from the search of optimal decision rules for given loss and gain functions.
We say that $d' \succ d$ for any two decisions $d'$ and d if $d_i' \ge d_i$ for all $i = 1, \ldots, m$, with at least one strict inequality. Further, we call a step score function s increasing if $d' \succ d$ implies $s(d', \nu) \ge s(d, \nu)$ for all $\nu \in \{0, 1\}^m$ and if in addition $s(d', \nu) > s(d, \nu)$ holds for at least one ν. We say that a valid decision rule $d'$ strictly dominates another valid rule d for a given gain function if the FWEG of $d'$ is not smaller than that of d for all $\theta \in \Theta$ and strictly larger for at least one θ. A valid decision rule d is called inadmissible if there exists a valid decision function that strictly dominates it. One can show that additive and maximum step score functions are increasing (Web Lemma 2) and that any non-exhaustive valid decision rule d is inadmissible if the step score functions are increasing (Web Lemma 3). This leads to the following proposition.
Proposition 2. A non-exhaustive, valid decision rule for a given additive or maximum step score function is inadmissible.
We now briefly discuss optimal decision rules for point alternatives before turning our attention to composite hypotheses. When testing $H_i$ against a point alternative $K_i\colon \theta_i = \theta_i^*$, $i = 1, \ldots, m$, the aim is to determine a decision rule that maximizes the expected gain (11) among all valid decision rules, given a fixed parameter $\theta^* = (\theta_1^*, \ldots, \theta_m^*)$. In conventional multiple testing, a common definition of power is given by the probability to reject at least one hypothesis. For example, the outcome of a clinical trial is often considered to be a success if at least one null hypothesis is rejected regardless of whether it is true or not, because the true state of nature remains unknown (Senn & Bretz, 2007). Applying this principle to additive gain functions then leads to the step gain function $g(d) = \sum_{i=1}^{m} g_i\, d_i$, which is independent of ν. According to Equation (11),
$$E_{\theta^*}[g(d(p))] = \sum_{i=1}^{m} g_i\, \pi_i(\theta^*). \qquad (12)$$
Note that using $g(d)$ instead of $g(d, \nu)$ often leads to very close optimal rules, as shown now. In many applications, it is reasonable to assume that losses and gains are proportional to each other. Due to Web Lemma 1 they can then be assumed to be equal, that is, $g_i = \ell_i$, $i = 1, \ldots, m$. Then, according to Equations (7), (11), and (12), $E_\theta[g(d(p))] - E_\theta[g(d(p), \nu)] = \sum_{i \in I_0(\nu)} \ell_i\, \pi_i(\theta)$. Because d is valid, this difference is at most α for any $\theta \in \Theta_\nu$ and we have $0 \le E_\theta[g(d(p))] - E_\theta[g(d(p), \nu)] \le \alpha$. The expected maximal loss α is generally much smaller than a targeted gain under a relevant alternative and the optimal rules are primarily influenced by the latter.
In the following, we call a valid decision rule optimal if its expected gain (12) is at least as large as for any other decision rule of the same type. In the special case of separable rules $d(p) = (d_1(p_1), \ldots, d_m(p_m))$, Benjamini and Hochberg (1997) have derived equations to determine optimal rules by generalizing earlier results of Spjøtvoll (1972) for equally weighted loss and gain functions; see Section 3 for further discussion. For non-separable decision rules, no analytic optimality results are available and numerical solutions are needed.
When testing against composite alternatives $K_i\colon \theta_i > 0$, we should account for the intrinsic uncertainty about the unknown parameter θ in the FWEG from Equation (11). Following what is often done in conventional hypotheses testing (see Spiegelhalter et al. 2004, among many others), we introduce the Bayes gain defined as the weighted average of the expected gain (12) with respect to a suitable density $f(\theta)$,
$$\bar{g} = \int_{\Theta} E_\theta[g(d(p))]\, f(\theta)\, d\theta. \qquad (13)$$
An optimal decision rule then maximizes the Bayes gain among all valid decision rules. Note that the Bayes gain always exists, as $0 \le E_\theta[g(d(p))] \le 1$.
Changing the order of integration, the Bayes gain (13) can be rewritten as
$$\bar{g} = E_{\bar{P}}[g(d(p))],$$
where $\bar{P}$ denotes the compound distribution of p obtained by integrating its sampling distribution over $f(\theta)$. Thus, a Bayes optimal decision rule can be determined by finding the optimal decision rule for a point hypothesis (Proposition 3), leading to considerable savings in computation time. Furthermore, statements on optimal rules that are true for a point hypothesis are also true for Bayes optimal rules.
Proposition 3. Given a step gain function $g(d)$, maximizing the Bayes gain $\bar{g}$ in Equation (13) among valid decision rules is equivalent to maximizing the expected gain $E_{\bar{P}}[g(d(p))]$ with respect to the compound distribution $\bar{P}$.
We conclude this section with two remarks. First, we note that optimal decision rules are invariant under linear transformations of the gain function for a given loss function and a given upper limit for the expected loss (Web Lemma 1). This justifies the standardization of the maximum loss and gain at 1 unit in Section 2.2. It is of practical importance, as it ensures that there is no need to quantify the relative importance of losses versus gains. This would not be true if, instead, the expected value of a utility function is maximized based on a linear combination of losses and gains. Second, we note that the distribution of p as a function of test statistics $T = (T_1, \ldots, T_m)$ is usually not available in closed form and computing expected gains based on the distribution of T is usually more convenient. Let t denote the observed values of T. Given a one-to-one monotone transformation between each $p_i$ and $t_i$, any decision rule in terms of p can equivalently be expressed in terms of t, so that expected losses and gains can be computed directly from the distribution of T. For example, consider m hypotheses each tested with independent z-tests, that is, $T_i \sim N(\delta_i, 1)$, where $\delta_i$ is the non-centrality parameter. Assuming that the $\delta_i$'s are also independently distributed with density $f_i(\delta_i)$, the density of the random vector T is the multivariate compound distribution with independent components (Web Lemma 6). Furthermore, compounding a normal distribution with a normal distribution for its mean parameter results in another normal distribution. Assuming $\delta_i \sim N(\mu_i, \sigma_i^2)$ and applying the laws of total expectation and total variance, we have $T_i \sim N(\mu_i, 1 + \sigma_i^2)$ for $i = 1, \ldots, m$.
3 Separable Decision Rules
Recall from Section 2 that a decision rule is separable if the rejection of $H_i$ depends only on $p_i$, $i = 1, \ldots, m$, that is, $d(p) = (d_1(p_1), \ldots, d_m(p_m))$. More specifically, let $d_i(p_i) = 1$ if $p_i \le c_i$ and 0 otherwise, where $c = (c_1, \ldots, c_m)$ is a vector of fixed constants $0 < c_i < 1$. Equivalently, $d_i = 1$ for $t_i \ge z_{1 - c_i}$ and 0 otherwise, where $z_{1 - c_i}$ denotes the $(1 - c_i)$-quantile of the distribution of $T_i$ under $H_i$. Separable decision rules are constructed in analogy to the classical Bonferroni or Šidák (1967) tests and offer certain advantages over other decision rules, as shown in the following.
Let $\pi_i^0 = P_{\theta_i \le 0}(d_i(p_i) = 1)$ and $\pi_i^1(\theta_i) = P_{\theta_i > 0}(d_i(p_i) = 1)$. We then obtain for additive step loss and gain functions ℓ and g the expressions $\sum_{i \in I_0(\nu)} \ell_i\, \pi_i^0$ and $\sum_{i \in I_1(\nu)} g_i\, \pi_i^1(\theta_i)$, respectively. If the individual tests are unbiased with $\pi_i^0 \le c_i$, it follows from Web Lemma 4 that a separable decision rule is valid and exhaustive for additive step loss functions if
$$\sum_{i=1}^{m} \ell_i\, c_i = \alpha. \qquad (14)$$
That is, validity, and thus strong FWEL control, of a separable decision rule follows immediately from the control of the expected loss for the global null hypothesis $H_1 \cap \cdots \cap H_m$. In general, this is neither true for non-separable decision rules nor for non-additive loss functions.
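To illustrate, here is a small Monte Carlo sketch (ours, with hypothetical weights and levels) for a separable rule whose per-hypothesis levels $c_i$ satisfy a weighted-level condition $\sum_i \ell_i c_i = \alpha$, an assumption consistent with this section: under the global null the expected additive loss is then α.

```python
import random

random.seed(2)
alpha = 0.025
ell = [0.75, 0.25]   # hypothetical standardized losses, ell_1 + ell_2 = 1
c = [0.02, 0.04]     # per-hypothesis levels: 0.75*0.02 + 0.25*0.04 = alpha

n = 400_000
total_loss = 0.0
for _ in range(n):
    # Under the global null, p-values of exact tests are independent Uniform(0,1).
    p = [random.random(), random.random()]
    d = [int(p_i <= c_i) for p_i, c_i in zip(p, c)]
    # All null hypotheses are true here, so every rejection contributes its loss.
    total_loss += sum(l * di for l, di in zip(ell, d))

print(round(total_loss / n, 3))  # close to alpha = 0.025
```

The simulation also shows why the condition is exhaustive under the global null: the expected loss equals $\ell_1 c_1 + \ell_2 c_2$ exactly, with no slack to redistribute.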
We discuss non-separable decision rules and non-additive loss and gain functions in later sections. Instead, we now focus on optimal separable decision rules investigated by Benjamini and Hochberg (1997) in the context of weighted tests, but embed them in the proposed decision-theoretic framework. We search for the vector c satisfying the validity condition (14) that maximizes the expected gain in Equation (12), more specifically
$$\max_{c}\ \sum_{i=1}^{m} g_i\, \pi_i^1(\theta_i^*) \quad \text{subject to} \quad \sum_{i=1}^{m} \ell_i\, c_i = \alpha. \qquad (15)$$
Assume that the test statistics are continuous, with densities $f_i^0$ and $f_i^1$ under $H_i$ and $K_i$, respectively. Assume further that the ratio $f_i^1(t) / f_i^0(t)$ is monotonically increasing in t, as is the case if, for example, $T_i \sim N(\delta_i, 1)$. This assumption is necessary to guarantee that the rejection region in function of p consists of just one interval [0, c]. It follows from Theorem 1 of Benjamini and Hochberg (1997) that the optimal decision rule is given by
$$d_i(t_i) = 1 \quad \text{if and only if} \quad \frac{g_i\, f_i^1(t_i)}{\ell_i\, f_i^0(t_i)} \ge y,$$
where y is determined by the validity condition (14). With $f_i(\delta_i)$ a marginal density of $\delta_i$ in the Bayes setting, let $\bar{f}_i^1(t) = \int f_i^1(t \mid \delta_i)\, f_i(\delta_i)\, d\delta_i$ denote the marginal compound density.
Proposition 4. For given densities $f_i^0$ and $f_i^1$ of the test statistics under the null and alternative point hypotheses, respectively, the optimal separable decision rule is given by $d_i(t_i) = 1$ if and only if $g_i\, f_i^1(t_i) / \{\ell_i\, f_i^0(t_i)\} \ge y$, where y satisfies the validity condition (14). The optimal Bayes separable rule is given by the same rule with $f_i^1$ replaced by the compound density $\bar{f}_i^1$, where y again satisfies Equation (14).
From Proposition 4, the likelihood ratio equals $f_i^1(t) / f_i^0(t) = \exp(\delta_i t - \delta_i^2 / 2)$ for normally distributed test statistics with variance 1 and non-centrality parameter $\delta_i$ under the alternative hypotheses.
Proposition 4 is a generalization of the Neyman–Pearson lemma to multiple hypotheses and applied to test statistics instead of the original observations. It implies that the optimal separable decision rules depend only on the marginal distributions of the $T_i$ under the null and alternative hypotheses. This is neither the case for more general decision rules nor for non-additive loss and gain functions. Note that if $g_i = \kappa\, \ell_i$ for $i = 1, \ldots, m$ and some constant κ with $\kappa > 0$, then $g_i / \ell_i = \kappa$ for all i and the optimal solution does not depend on the choice of the $\ell_i$ and $g_i$. If the marginal distributions under the alternative hypotheses are also the same and $g_i = \kappa\, \ell_i$, then the optimal solution is seen to be $c_i = \alpha$, $i = 1, \ldots, m$.
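The likelihood-ratio rule for independent z-tests can be computed numerically. The sketch below is ours, with hypothetical weights and non-centrality parameters: for $T_i \sim N(\delta_i, 1)$ the ratio at threshold t equals $\exp(\delta_i t - \delta_i^2/2)$, so a common ratio cutoff y translates into per-hypothesis p-value levels $c_i$, and y is tuned by bisection so that the weighted sum of levels equals α (the validity condition assumed here).

```python
from math import erf, log, sqrt

def norm_sf(x: float) -> float:
    """Upper-tail probability of the standard normal."""
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

def levels(y, ell, g, delta):
    """Per-hypothesis rejection levels c_i implied by ratio cutoff y:
    reject H_i when (g_i/ell_i) * exp(delta_i*t - delta_i^2/2) >= y,
    i.e., when t >= (log(y*ell_i/g_i) + delta_i^2/2) / delta_i."""
    return [norm_sf((log(y * l / gi) + d * d / 2.0) / d)
            for l, gi, d in zip(ell, g, delta)]

def solve_cutoff(alpha, ell, g, delta, lo=1e-6, hi=1e6, iters=200):
    """Geometric bisection for y so that sum_i ell_i * c_i(y) = alpha
    (the weighted sum of levels is decreasing in y)."""
    for _ in range(iters):
        mid = sqrt(lo * hi)
        if sum(l * c for l, c in zip(ell, levels(mid, ell, g, delta))) > alpha:
            lo = mid
        else:
            hi = mid
    return sqrt(lo * hi)

# Hypothetical inputs: equal weights, unequal non-centrality parameters.
ell = g = [0.5, 0.5]
delta = [2.0, 3.0]
y = solve_cutoff(0.05, ell, g, delta)
c = levels(y, ell, g, delta)
print([round(ci, 4) for ci in c])                     # unequal optimal levels
print(round(sum(l * ci for l, ci in zip(ell, c)), 6))  # 0.05 by construction
```

Note how the common cutoff y allocates unequal levels across the hypotheses, illustrating that the optimal separable rule is driven by the marginal likelihood ratios rather than by an equal split of α.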
4 Rectangular Decision Rules for m = 2
Motivated by conventional multiple test procedures (Simes, 1986), we now investigate decision rules for $m = 2$ that are defined by four non-overlapping rectangular rejection areas $A(d_1, d_2)$, $(d_1, d_2) \in \{0, 1\}^2$. Figure 1 visualizes the rectangular decision rules
$$A(1, 1) = \{p\colon p_1 \le b_1,\ p_2 \le b_2\}, \quad A(1, 0) = \{p\colon p_1 \le c_1,\ b_2 < p_2 \le \gamma_2\}, \quad A(0, 1) = \{p\colon b_1 < p_1 \le \gamma_1,\ p_2 \le c_2\} \qquad (16)$$
for given thresholds $0 < c_i \le b_i \le \gamma_i \le 1$, where A(0, 0) denotes the remaining area in the $(p_1, p_2)$-plane for which no hypothesis is rejected. To avoid undesirable test decisions, we assume the inequalities $c_i \le b_i$, $i = 1, 2$, which can be motivated as follows. If a single null hypothesis H is rejected because its associated p-value is sufficiently small, it is desirable to require that H should also be rejected for any smaller p-value. This monotonicity requirement can be extended as follows to two hypotheses H1 and H2: If $H_i$, $i = 1, 2$, is rejected by a multiple test procedure based on $p = (p_1, p_2)$, then it should also be rejected based on any $q = (q_1, q_2)$ with $q_i \le p_i$, $i = 1, 2$ (Hommel & Bretz, 2008). Decision rules in Figure 1 satisfy this monotonicity property only if $c_i \le b_i$, $i = 1, 2$. Choosing $\gamma_i < 1$ excludes the possibility of rejecting a null hypothesis when the test statistic for the other null hypothesis trends strongly in the opposite direction. Note that by setting $c_i < b_i$ or $\gamma_i < 1$ for at least one i the decision rule is not separable because then the test decisions depend on each other.

4.1 Familywise Expected Additive Losses and Gains
For rectangular decision rules with $m = 2$, we obtain from Equation (5) the FWEL
$$E_\theta[\ell(d(p), \nu)] = \ell_1 (1 - \nu_1)\, P_\theta\{p \in A(1, 0) \cup A(1, 1)\} + \ell_2 (1 - \nu_2)\, P_\theta\{p \in A(0, 1) \cup A(1, 1)\} \qquad (17)$$
and the FWEG
$$E_\theta[g(d(p), \nu)] = g_1\, \nu_1\, P_\theta\{p \in A(1, 0) \cup A(1, 1)\} + g_2\, \nu_2\, P_\theta\{p \in A(0, 1) \cup A(1, 1)\}. \qquad (18)$$
So far we did not make any assumption about the dependence between the p-values $p_1$ and $p_2$. If the p-values are correlated, so are the decisions $d_1(p)$ and $d_2(p)$, and hence their expectations will depend on these correlations. Motivated by the example in Section 1 investigating non-overlapping subgroups, we now assume that $p_1$ and $p_2$ are stochastically independent. As we restrict ourselves to rectangular decision rules in this section, the probabilities $P_\theta\{p \in A(d_1, d_2)\}$ can be calculated as the product of the marginal probabilities. We assume that the individual tests of $H_i$ against $K_i$ are unbiased, $i = 1, 2$. Let $F_{\theta_i}$ denote the cumulative distribution function of the random variable $p_i$ for a given parameter $\theta_i$, $i = 1, 2$, and $u \in [0, 1]$. Then, $F_{\theta_i}(u) \le u$ for $\theta_i \le 0$ and $F_{\theta_i}(u) \ge u$ for $\theta_i > 0$. Hence, for $\theta_i = 0$ we have $F_{\theta_i}(u) = u$ for exact tests, so that $p_i$ is uniformly distributed on [0, 1]. For $\theta_i > 0$, that is, $\theta_i \in K_i$, this probability depends on the parameter $\theta_i$ as well as the particular test and sample sizes being employed.
Under independence of $p_1$ and $p_2$ we finally obtain from Equation (17) the expected loss
$$E_\theta[\ell(d(p), \nu)] = \sum_{i \in I_0(\nu)} \ell_i \big[ c_i \{F_{\theta_j}(\gamma_j) - F_{\theta_j}(b_j)\} + b_i\, F_{\theta_j}(b_j) \big], \quad j \ne i, \qquad (19)$$
for point null hypotheses $\theta_i = 0$, $i \in I_0(\nu)$. In Web Lemma 5, we show that these become upper bounds for the expected losses in the more general cases $\theta_i < 0$, thus leading to valid decision rules also for composite null hypotheses. Finally, we use Equation (18) to calculate the expected gain for $\theta \in \Theta_{(1, 1)}$ as
$$E_\theta[g(d(p), \nu)] = \sum_{i=1}^{2} g_i \big[ F_{\theta_i}(c_i) \{F_{\theta_j}(\gamma_j) - F_{\theta_j}(b_j)\} + F_{\theta_i}(b_i)\, F_{\theta_j}(b_j) \big], \quad j \ne i. \qquad (20)$$
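The product form of these probabilities is straightforward to evaluate. The following sketch is ours and relies on an assumed parametrization of the rectangular regions (stated in the comments), with hypothetical threshold values; the marginal p-value CDFs are passed in as functions.

```python
# Assumed rectangular regions for m = 2, with thresholds c_i <= b_i <= gamma_i:
#   reject both:     p1 <= b1 and p2 <= b2
#   reject only H1:  p1 <= c1 and b2 < p2 <= gamma2
#   reject only H2:  b1 < p1 <= gamma1 and p2 <= c2

def rejection_probs(F1, F2, b, c, gamma):
    """Marginal rejection probabilities (pi_1, pi_2) under independence of the
    p-values, for p-value CDFs F1, F2 and thresholds b, c, gamma (pairs)."""
    pi1 = F1(c[0]) * (F2(gamma[1]) - F2(b[1])) + F1(b[0]) * F2(b[1])
    pi2 = F2(c[1]) * (F1(gamma[0]) - F1(b[0])) + F2(b[1]) * F1(b[0])
    return pi1, pi2

uniform = lambda u: u  # p-value CDF under a point null for an exact test
b, c, gamma = (0.04, 0.04), (0.02, 0.02), (0.9, 0.9)  # hypothetical thresholds
pi1, pi2 = rejection_probs(uniform, uniform, b, c, gamma)

ell = (0.5, 0.5)
expected_loss = ell[0] * pi1 + ell[1] * pi2  # both nulls true (global null)
print(round(expected_loss, 6))  # 0.0188
```

Replacing `uniform` by the p-value CDF under an alternative gives the corresponding expected gain terms in the same way.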
4.2 Admissible Rectangular Decision Rules
We now provide conditions in terms of $b_i$, $c_i$, and $\gamma_i$ for a rectangular decision rule d to be valid. As above, we assume independent p-values $p_1$ and $p_2$. According to Equation (9), validity holds if $\sum_{i \in I_0(\nu)} \ell_i\, \pi_i(\theta) \le \alpha$ for all $\nu \in \{0, 1\}^2$ and $\theta \in \Theta_\nu$. For $\nu = (0, 1)$ and $\nu = (1, 0)$ this implies the condition that the worst-case rejection probability of the single true null hypothesis must be bounded. Because $F_{\theta_j}(u) \le 1$ for rectangular decision rules as given in Figure 1, sufficient validity conditions for $\nu = (0, 1)$ and $\nu = (1, 0)$ are given by
$$\ell_i\, b_i \le \alpha, \quad i = 1, 2. \qquad (21)$$
For $\nu = (0, 0)$, that is, $\theta_1 = \theta_2 = 0$, we have with the first branch of Equation (19)
$$\ell_1 \{c_1 (\gamma_2 - b_2) + b_1 b_2\} + \ell_2 \{c_2 (\gamma_1 - b_1) + b_1 b_2\} \le \alpha. \qquad (22)$$
We denote decision rules (16) as unconstrained decision rules, because all six parameters $b_i$, $c_i$, and $\gamma_i$ for $i = 1, 2$ can be freely chosen under the above conditions. Exhaustive decision rules have one free parameter less since at least one of the inequalities in Equations (21) and (22) is sharp. Otherwise, the expected loss remains strictly below α for all $\theta \in \Theta$ and the respective rule is not exhaustive and hence inadmissible. Imposing constraints on the parameters $b_i$, $c_i$, and $\gamma_i$ reduces the original optimization problem to one- or two-dimensional problems and may reflect additional practical considerations. The three conceivable classes of rectangular decision rules introduced in Table 1 are investigated further in Section 5. The asymmetric rules are motivated by numerical results indicating that maximizing the thresholds $b_i$ may perform better, leading to the constraints $\ell_i\, b_i = \alpha$, $i = 1, 2$, and the additional restriction $\gamma_i = 1$. Table 1 also presents the sets of valid and exhaustive rules for these constrained rectangular decision rules; see Web Appendix B for the derivations.
Table 1. Classes of constrained rectangular decision rules and corresponding sets of valid and exhaustive rules with respect to additive loss functions.

| Decision rule | Parameter constraints (C) and sets of valid and exhaustive rules (B) |
| --- | --- |
| Symmetric | C: … B: … |
| Separable | C: … B: … |
| Asymmetric | C: … B: … |
5 Application to Subgroup Analysis
We now illustrate the proposed approach using the subgroup analysis problem introduced in Section 1. Assume that the treatment effect could be different in two non-overlapping subgroups of patients defined by a biomarker. Therefore, the primary objective of a new clinical trial is to investigate the two subgroups separately and decide whether a benefit of the new treatment over the control can be claimed in any of the two subgroups. We test the null hypotheses $H_i\colon \theta_i \le 0$ that the new treatment is no better than the control in subgroup $S_i$ against the alternative hypotheses $K_i\colon \theta_i > 0$, where $\theta_i$ denotes the treatment effect in subgroup $S_i$, for $i = 1, 2$. If $H_i$ is rejected, a claim can be made for subgroup $S_i$. Let $n_i$ denote the size of subgroup $S_i$, $i = 1, 2$. For a parallel-group design with equal number of patients in each treatment group, the test statistic $T_i$ for a continuous endpoint with known variance follows $N(\delta_i, 1)$ with non-centrality parameter $\delta_i = \theta_i \sqrt{n_i} / 2$, for a given effect size $\theta_i$ and assuming a standard deviation of 1 without loss of generality. Furthermore, $T_1$ and $T_2$ are independent.
The choice of losses and gains should reflect the consequences of incorrect and correct decisions, respectively. We set the losses and gains proportional to the subgroup sizes to reflect the assumption that a larger subgroup suffers (benefits) more from a wrong (right) decision. In deriving the Bayes gain (13), we assume independent θ_i's with a common marginal prior density to account for uncertainty. Such a density covers clinically relevant effect sizes, but alternative densities could be chosen based on relevant knowledge. We aim at maximizing the Bayes gain while controlling the FWEL at α = 0.025 for a given decision rule.
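The FWEL constraint can be checked by simulation. The sketch below estimates the familywise expected loss of a separable rule that rejects each H_i when its p-value falls below a common level α, under the assumptions used above (independent unit-variance normal statistics, additive losses summing to 1); the rule, losses, and parameter configuration are illustrative choices, not the paper's optimized rules:

```python
import random
from math import erf, sqrt

def one_sided_p(z):
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

def fwel_mc(theta, losses, alpha, n_sim=100_000, seed=1):
    """Monte Carlo estimate of the familywise expected loss of the
    separable rule 'reject H_i iff p_i <= alpha' at parameter
    configuration theta, with additive losses summing to 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        for th, loss in zip(theta, losses):
            t = rng.gauss(th, 1.0)      # independent normal test statistics
            if th <= 0 and one_sided_p(t) <= alpha:
                total += loss           # loss incurred only by a false rejection
    return total / n_sim

# Worst case for the expected loss: both null hypotheses true
est = fwel_mc(theta=(0.0, 0.0), losses=(0.5, 0.5), alpha=0.025)
```

With both nulls true, each test falsely rejects with probability α, so the expected loss is 0.5·α + 0.5·α = α, and the estimate should settle near 0.025, confirming that this separable rule controls the FWEL at the threshold.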
Using existing numerical optimization routines, we obtained the optimal rules for the four classes considered; Table 2 summarizes them in terms of their defining parameters. For equal subgroup sizes, the optimal symmetric decision rule coincides with the unconstrained rule, whereas for the asymmetric rule one of the boundary parameters is forced to be 1. Their Bayes gains, however, differ only by 0.001. This contrasts with the much larger difference of 0.052 in Bayes gain between the optimal unconstrained rule and the separable rule. The prime reason for this difference is the much larger region A(1, 1) for the unconstrained rule, where both hypotheses can be rejected if both p-values fall below a common, larger threshold, instead of the stricter individual thresholds of the separable rule. The top row of Figure 2 visualizes the respective rejection areas.
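The geometry of such rejection regions can be made concrete in code. The following is a schematic rectangular rule on the p-value square, with an illustrative parameterization (a relaxed joint threshold `c_both` for region A(1, 1) and a stricter single threshold `c_single`); it mimics the qualitative behavior described above but is not the paper's exact rule:

```python
def rectangular_rule(p1, p2, c_single, c_both):
    """Schematic rectangular decision rule on the p-value square.
    Reject both hypotheses when both p-values fall below the larger
    joint threshold (region A(1, 1)); otherwise reject H_i alone
    when p_i falls below the stricter single threshold."""
    if p1 <= c_both and p2 <= c_both:
        return (True, True)
    return (p1 <= c_single, p2 <= c_single)

# Both p-values are moderately small: the relaxed joint threshold applies
decision = rectangular_rule(0.01, 0.03, c_single=0.02, c_both=0.05)
```

Here p2 = 0.03 would not clear the single threshold on its own, yet both hypotheses are rejected because both p-values lie in the joint region, which is exactly what makes the unconstrained and asymmetric classes more powerful than the separable one.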
Table 2: Optimal Bayes rules for different classes of rectangular decision rules with respect to additive loss functions and strong control of the familywise expected loss at α = 0.025.

| Prevalence | Decision rule | Bayes gain |
|---|---|---|
| 0.5 | Unconstrained | 0.821 |
| 0.5 | Symmetric | 0.821 |
| 0.5 | Separable | 0.769 |
| 0.5 | Asymmetric | 0.820 |
| 0.25 | Unconstrained | 0.838 |
| 0.25 | Symmetric | 0.800 |
| 0.25 | Separable | 0.790 |
| 0.25 | Asymmetric | 0.837 |

Graphical visualization of the optimal decision rules from Table 2. Top: . Bottom:
.
For unequal subgroup sizes, the losses assigned to the two subgroups differ in proportion to their prevalences. This asymmetry in the losses and in the marginal power of the two tests leads to different optimal rules for the different classes under consideration. The optimal asymmetric rule with a Bayes gain of 0.837 turns out to be close to the optimal unconstrained rule, which has an only slightly larger Bayes gain of 0.838. Nevertheless, the optimal asymmetric rule might be preferred in practice because its boundary parameters are forced to 1, so that a possibly large p-value in one subgroup does not prevent the rejection of the null hypothesis in the other subgroup if its p-value is small. The asymmetry in a_1 and a_2 allows for a larger area where both hypotheses can be rejected than in the symmetric or separable case, and hence for a clear advantage of this type of rule. The bottom row of Figure 2 visualizes the optimal asymmetric and separable decision rules as examples.
Finally, Figure 3 displays the Bayes gain as a function of the prevalence of the first subgroup, ranging from 0.05 to 0.5, for the four classes of rectangular decision rules. We conclude that the asymmetric rule performs closest to the unconstrained rule, and both are superior to the symmetric and the separable rules. The maximum difference in Bayes gain between the four classes is about 0.05. The differences in Bayes gain, however, are diluted by cases where both effect sizes are either very small or very large and, consequently, lead to expected gains close to 0 or 1 irrespective of the rule. Therefore, we also plot in Figure 3 the expected gain (20) of the same optimal Bayes rules, with the effect sizes fixed such that the marginal power is roughly between 0.2 and 0.8. As seen from Figure 3, in parts of the range the expected gain for the separable rule is about 0.07 smaller than for the other rules. Elsewhere, the differences in expected gain between the separable and unconstrained rules decrease somewhat and the symmetric rule is worse than the optimal separable rule.
Considering that the optimal asymmetric rule is so close to the optimal unconstrained rule, we investigate further the simplified rule obtained by fixing the remaining free parameters of the asymmetric rule. Such a rule turns out to be almost optimal under the current choice of prior density and may serve as a rule of thumb, due to its convenient computation without any numerical optimization. The largest difference in expected gain between this rule and the optimal unconstrained rule is less than 0.0017 over the whole range of prevalences. If we instead assume a different prior density, the optimal unconstrained rule is virtually identical to the above rule of thumb, with expected gains between 0.84 and 0.88. As before, the difference in expected gain between both rules is negligible, at most 0.0002 over the whole range of prevalences.
6 Discussion
Our approach includes several common multiple test procedures as special cases. For example, using separable decision rules and additive loss functions, our approach is equivalent to that of Benjamini and Hochberg (1997), who generalized earlier results of Spjøtvoll (1972) to maximize the expected number of rejections while controlling the expected number of incorrectly rejected hypotheses. Moreover, using the binary loss function (4) leads to FWER-controlling procedures, as shown in Section 2.3. In Web Appendix C, we specialize the results in Section 4 with respect to binary loss functions and provide connections to common multiple test procedures. For example, using separable decision rules and binary loss functions, our approach specializes to the Šidák (1967) test for subgroups of equal size.
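The Šidák special case mentioned above is a one-liner. For m independent hypotheses, each is tested at the adjusted level below, which controls the FWER at α:

```python
def sidak_level(alpha: float, m: int) -> float:
    """Per-hypothesis significance level of the Sidak test for m
    independent hypotheses, controlling the FWER at alpha:
    1 - (1 - alpha)^(1/m)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

# Two subgroups of equal size at the one-sided level used in Section 5
level = sidak_level(0.025, 2)
```

For m = 2 and α = 0.025 this gives roughly 0.01258, marginally above the Bonferroni level of 0.0125, reflecting the exactness of the Šidák correction under independence.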
Our approach can be extended to applications other than subgroup analyses, although the assumptions and derivations may need modifications. For example, when testing a new drug for two unequally important variables, losses can be caused by, say, an erroneous drug approval based on the primary variable or, less severely, an erroneous drug label based on the secondary variable. Conventional practice of controlling the FWER, however, treats both null hypotheses as equally important and does not account for the different losses inflicted by an inappropriate drug approval or label (Wang et al., 2015). As another example, several treatments may be evaluated in parallel in a master protocol trial when treatment options depend on one or more biomarkers within a single cancer type. If each treatment is evaluated in a single subgroup, the parallel sub-studies are essentially independent trials, each considering one treatment in its own distinct patient group, conducted together for administrative or operational reasons such as saving time, reducing costs, or facilitating recruitment (Woodcock & LaVange, 2017). In these applications, the test statistics will be stochastically dependent, and we leave the extension of the proposed methods to these settings for future research.
In applying the proposed approach, we need to set a value for the upper threshold α. As mentioned earlier, we standardize the maximum total loss at 1 unit. That is, in the setting of Section 5, the loss for erroneously rejecting the null hypothesis that there is no treatment effect in either subgroup is set to 1. The expected loss for a test in the overall population that disregards possible subgroups is then equal to the Type I error rate. A natural choice for α is therefore 0.025, in analogy to current clinical trial practice. We also need to assign actual values to the losses and gains. One choice is to set the losses and the gains proportional to the prevalence of the two subgroups, as in Section 5. We also recommend using the same proportions among the losses and gains. This keeps the relative importance of the subgroups the same when evaluating expected losses and gains: if a right (wrong) decision is made for a larger subgroup, the induced gain (loss) will be larger as well, and vice versa.
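The prevalence-proportional choice of losses and gains is straightforward to compute. A small sketch, with the subgroup sizes chosen only for illustration:

```python
def prevalence_weights(n1: int, n2: int) -> tuple[float, float]:
    """Losses (and, in the same proportions, gains) proportional to
    subgroup prevalence, normalized so the maximum total loss is 1 unit."""
    n = n1 + n2
    return n1 / n, n2 / n

# Hypothetical subgroup sizes giving prevalences 0.25 and 0.75
l1, l2 = prevalence_weights(100, 300)
```

Because the weights sum to 1, the worst-case total loss of rejecting both nulls erroneously equals the standardized maximum of 1 unit, consistent with the normalization used to motivate α = 0.025.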
We also investigated optimal decision rules following the minimax principle to account for the intrinsic uncertainty in θ. Roughly speaking, these decision rules minimize the maximum regret (over all possible values of θ) of not adopting the best rule. The advantage is that we do not have to specify a prior density for θ. However, the numerical optimization is more challenging because of the repeated minimization/maximization steps. The minimax approach is also more difficult to interpret because the regret is essentially a difference of expected gains, whereas the Bayes gain is an average on the same scale as the expected gains themselves. Our numerical studies indicated that the minimax and Bayes approaches do not differ much in practice, in the sense that their maximized gains are similar for a given class of rules.
Following Müller et al. (2004) and Sun and Cai (2009), we suggest maximizing the Bayes gain while controlling the expected loss at a pre-specified threshold α. Alternatively, one could maximize a utility function of the difference between gain and loss, under the same constraint on the expected loss. Technically, both approaches lead to the same optimal decision rules, hence the same design. Practically, gains and losses may be measured on different scales. We therefore prefer the proposed approach because of its easier interpretation. One could also consider more general loss and gain functions that depend on the actual value of θ and evaluate decisions in a continuous fashion. We leave such possible extensions for future research.
Data Availability Statement
Data sharing not applicable to this paper as no datasets were generated or analyzed during the current study.
Acknowledgments
We are grateful to two reviewers and the associate editor for their constructive comments on an earlier version of this paper. We would like to express our sadness that our co-author Willi Maurer passed away while this manuscript was under preparation. We are indebted to him as our colleague, mentor, and friend. In this way, we want to thank him posthumously for his collaboration and guidance throughout many years.
Supplementary data
Web Appendices referenced in Sections 1–4, and 6 are available with this paper at the Biometrics website on Wiley Online Library.
Table C.1: Sets of valid and exhaustive constrained rectangular decision rules with respect to the binary loss function.
Table C.2: Optimal Bayes rules for different classes of rectangular decision rules with respect to the binary loss function and α = 0.025.