Abstract

We consider the problem of testing multiple null hypotheses, where a decision to reject or retain must be made for each one and embedding incorrect decisions into a real-life context may inflict different losses. We argue that traditional methods controlling the Type I error rate may be too restrictive in this situation and that the standard familywise error rate may not be appropriate. Using a decision-theoretic approach, we define suitable loss functions for a given decision rule, where incorrect decisions can be treated unequally by assigning different loss values. Taking expectation with respect to the sampling distribution of the data allows us to control the familywise expected loss instead of the conventional familywise error rate. Different loss functions can be adopted, and we search for decision rules that satisfy certain optimality criteria within a broad class of decision rules for which the expected loss is bounded by a fixed threshold under any parameter configuration. We illustrate the methods with the problem of establishing efficacy of a new medicinal treatment in non-overlapping subgroups of patients.

1 Introduction

Consider the problem of comparing a new medicinal treatment with a control treatment in a given patient population S. Suppose that the treatment effects θ1 and θ2 in two non-overlapping subgroups S1 and S2, respectively, could be different from each other, where formula, formula and formula is defined by predictive biomarkers, demographic factors, or any other classifier. We are then interested in testing the (one-sided) null hypothesis formula that the new treatment is not better than the control in subgroup formula, formula. A decision has to be made whether any of the hypotheses can be rejected to claim an advantage of the new treatment over the control in the respective subgroup.

Conventional multiple test procedures control the familywise error rate (FWER) strongly at a given significance level formula. The probability of incorrectly rejecting at least one true null hypothesis is then bounded by α under any configuration of true and false null hypotheses. When controlling the FWER, we adjust for multiplicity and test each hypothesis at, for example, level formula (Bonferroni test). If, however, only the hypothesis H1 for the first subgroup is rejected, the new treatment may be approved only for that subgroup S1. The risk of a false positive decision for that subgroup is therefore controlled by testing H1 at level α, as also noted by Brannath et al. (2003). Since the same argument also holds for the other subgroup S2, it then seems reasonable to test each individual hypothesis H1 and H2 at level α, although the FWER can be almost 2α.
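
The claim that the FWER can be almost 2α can be checked numerically in the simplest case. The sketch below assumes two independent tests under the global null hypothesis (an assumption made only for this illustration; the function name is hypothetical) and compares unadjusted testing at level α with Bonferroni testing at level α/2.

```python
def fwer_two_independent_tests(level: float) -> float:
    """P(at least one false rejection) when both nulls are true and the
    two tests are independent, each run at the given per-test level."""
    return 1.0 - (1.0 - level) ** 2


alpha = 0.025
unadjusted = fwer_two_independent_tests(alpha)      # each test at level alpha
bonferroni = fwer_two_independent_tests(alpha / 2)  # each test at level alpha / 2

print(f"FWER, each test at alpha:     {unadjusted:.6f}")
print(f"FWER, each test at alpha / 2: {bonferroni:.6f}")
```

Under independence the unadjusted FWER equals 2α − α², which is slightly below 2α, while the Bonferroni version stays below α.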

If FWER control is used, the consequences of incorrectly rejecting either H1 or H2 are treated equally, regardless of whether one or both null hypotheses are rejected. Embedding incorrect decisions into a real-world context, however, may inflict different losses. Consider, for example, a clinical trial to treat advanced breast cancer patients with a new treatment. Results from previous trials indicate that patients with certain mutant cell lines are more sensitive to the new treatment. The primary objective of the new trial is then to investigate the new treatment separately in the two non-overlapping patient subgroups S1 (mutant cell lines) and S2 (non-mutant cell lines). In this example, we may attribute a loss of, say, 1 unit to the wrong decision that there is a treatment effect in both subgroups if in fact there is no effect in either one. It seems reasonable, however, to assume that an incorrect decision related to just one of the subgroups formula leads to a smaller loss formula. One possibility is to assume the loss formula to be proportional to the size of formula relative to the overall population S and to be additive, that is, formula.

In the following, we apply a decision-theoretic approach to such settings, while maintaining the principles of standard hypothesis testing. Decision theory lends itself naturally to multiple test problems. Lehmann (1957) formulated the problem as a choice among 2^m possible actions when there are m hypotheses, discussed a general class of multiple decision problems with an additive loss function over false rejections and acceptances for individual hypotheses, and proved that such procedures uniformly minimize the risk among all unbiased procedures. Cohen and Sackrowitz (2005), with the same formulation and loss function, further examined single-step, step-down and step-up procedures for their admissibility properties under a multivariate normal model. The finite actions formulation entails the specification of losses for each individual decision, which can be quite general.

Various authors investigated combined measures of false positives and false negatives using a single loss function for large-scale multiple test problems. For example, Müller et al. (2004) proposed linear combinations of posterior expected false positive and false negative rates and counts, or, alternatively, a bivariate loss function that explicitly acknowledges the competing goals; see also Müller et al. (2007). Sun and Cai (2009) minimized the false negative rate subject to a constraint on the false discovery rate (FDR), essentially the same as the bivariate loss function of Müller et al. (2004). The advantage of such an approach is that the FDR or its variants are controlled exactly at a pre-specified level. In contrast to the needs of our applications, however, it is practically infeasible to assign 2^m individual loss values in large-scale multiple testing, and such fine granularity is most likely not of interest.

Others applied decision-theoretic approaches to small-scale multiple test problems often encountered in clinical trials. For example, Rosenblum et al. (2014) investigated constrained Bayes and minimax optimization problems when testing the treatment effects for an overall population and two subgroups within the FWER framework; see also Rosenblum et al. (2020). Lisovskaja and Burman (2014) proposed a utility function to describe the perceived gain of rejecting a certain constellation of hypotheses. Graf et al. (2015) argued that losses and gains may be quantified differently by relevant stakeholders and proposed different utility functions that formalize the key decision-making components better than traditional power considerations. Nevertheless, the goal remains to control the conventional FWER.

Motivated by our applications, we argue that the FWER is not always fit for purpose and alternative error rates should be considered. Given that the number of hypotheses is relatively small in typical clinical trial applications, we adopt the finite actions perspective in order to adequately reflect the consequences of decisions in the real world. Furthermore, we propose to minimize a measure of false negatives subject to the constraint that a measure of false positives is bounded by a given threshold under any parameter configuration, similar in spirit to Müller et al. (2004) and Sun and Cai (2009). This is also in analogy to classical hypothesis testing where power is maximized while controlling the Type I error rate.

We formalize these ideas in Section 2 and define suitable loss functions for a given decision rule, where incorrect decisions can be treated unequally by assigning different loss values. Taking expectation with respect to the sampling distribution of the data allows us to control the familywise expected loss (FWEL) instead of the conventional FWER. That is, a given decision rule is said to control the FWEL at level α if the expected loss is bounded by α under any parameter value. Furthermore, we introduce relevant properties of decision rules related to FWEL control (validity, exhaustiveness, and admissibility). Next, we define gain functions for a given decision rule based on a suitable choice of gain values and introduce the familywise expected gain, in analogy to the traditional power concept. For composite alternatives, we introduce the Bayes gain to account for the intrinsic uncertainty in the parameter, similar to what is often done in conventional hypothesis testing. With this framework in place, we search in the remainder of this paper for optimal decision rules that maximize the familywise expected gain while controlling the FWEL. Technical results and proofs are deferred to the Web Appendix.

2 Decision Rules Controlling the Familywise Expected Loss

2.1 General Decision Rules

Let formula denote the parameter space and formula the parameter vector of interest. We consider testing simultaneously m null hypotheses formula against the alternative hypotheses formula, formula. The combinations (intersections) of null and alternative hypotheses partition the parameter space Θ into 2^m disjoint sets formula for formula. Let formula denote the Cartesian product formula. Let further formula denote the indicator function for the partition sets of Θ, that is, formula if formula and formula if formula, formula. Based on observed data, a decision formula is made whether to reject or retain formula, where formula if formula is retained and 1 otherwise, formula.

Let the data relevant for decision-making be summarized in the form of p-values formula. We call a function formula, which maps observed values p to decisions formula, a decision rule. Note that a decision rule formula to either reject or retain an individual null hypothesis formula may depend not only on the individual p-value formula, but on the complete vector p. If the decision on each formula depends only on its associated p-value formula, formula, such that formula, we call formula a separable decision rule.
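
To make the distinction concrete, the following sketch contrasts a separable rule with a non-separable one. It uses the Bonferroni and Holm procedures purely as familiar illustrations; they are not the rules studied in this paper, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def bonferroni_rule(p: Sequence[float], alpha: float) -> Tuple[int, ...]:
    """Separable: the decision on H_i uses only p_i."""
    m = len(p)
    return tuple(int(p_i <= alpha / m) for p_i in p)


def holm_rule(p: Sequence[float], alpha: float) -> Tuple[int, ...]:
    """Non-separable: the decision on H_i depends on the whole vector p."""
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])  # indices by increasing p-value
    decision = [0] * m
    for rank, i in enumerate(order):
        if p[i] <= alpha / (m - rank):
            decision[i] = 1
        else:
            break  # step-down: stop at the first non-rejection
    return tuple(decision)


# p2 is the same in both vectors; only p1 changes.
for p in [(0.02, 0.03), (0.50, 0.03)]:
    print(p, "Bonferroni:", bonferroni_rule(p, 0.05), "Holm:", holm_rule(p, 0.05))
```

With p2 held fixed, the Bonferroni decision on H2 is unchanged across the two vectors, whereas the Holm decision on H2 changes with p1.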

2.2 Loss and Gain Functions

We consider score functions formula that assign a non-negative value to each decision formula with respect to a true parameter formula. Formally, loss functions are score functions that assign a positive value to an incorrect decision by quantifying the associated costs. Similarly, gain functions are score functions that assign a positive value to a correct decision by quantifying the associated benefits. This distinction between loss and gain functions resembles the general notion of incorrectly rejecting a null hypothesis (Type I error) and correctly rejecting a null hypothesis, respectively. Following the classical hypothesis testing framework, we thus evaluate decisions in a binary way (i.e., whether a decision is correct or not), regardless of the magnitude of the underlying parameter value. We therefore focus on step score functions formula which depend on formula only through formula and are thus constant on each of the 2^m disjoint sets formula, formula.

Let formula denote a real-valued function formula. We investigate additive step score functions formula, which can also be written as

1

where formula denotes the weight of a correct (or incorrect) rejection of formula and formula a Boolean function depending on a decision formula and formula.

For loss functions we consider the Boolean function formula, formula, which takes value 1 if formula and formula (i.e., formula is incorrectly rejected) and 0 otherwise. We further attribute a loss formula if formula is incorrectly rejected. Following Equation (1), we therefore assign to each formula and formula a non-negative value through the additive step-loss function

2

For the remainder of this paper, we standardize the maximum total loss at 1 unit, that is, formula. Web Lemma 1 shows that this standardization can be done without loss of generality with respect to the determination of optimal decision rules.
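
As an illustration of the additive step-loss function, the sketch below enumerates all decisions and all configurations of true and false null hypotheses for m = 2. The weights (0.3, 0.7) are hypothetical, chosen to sum to 1 so that the maximum total loss is 1 unit; they could, for instance, reflect relative subgroup sizes as suggested in Section 1. The function and variable names are assumptions made for this example.

```python
from itertools import product
from typing import Sequence


def additive_step_loss(decision: Sequence[int],
                       null_true: Sequence[bool],
                       weights: Sequence[float]) -> float:
    """Sum the weight a_i of every hypothesis that is rejected (d_i = 1)
    although its null hypothesis is true."""
    return sum(a for d, h0, a in zip(decision, null_true, weights)
               if d == 1 and h0)


weights = (0.3, 0.7)  # hypothetical loss weights, standardized to sum to 1

for null_true in product((True, False), repeat=2):
    for decision in product((0, 1), repeat=2):
        loss = additive_step_loss(decision, null_true, weights)
        print(f"nulls true: {null_true}, decision: {decision}, loss: {loss:.1f}")
```

The loss is largest when both null hypotheses are true and both are rejected, and it is zero whenever only false null hypotheses are rejected or nothing is rejected.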

For gain functions, we consider the Boolean function formula, formula, which takes value 1 if formula and formula (i.e., formula is correctly rejected) and 0 otherwise. We further attribute a gain formula if formula is correctly rejected. Following Equation (1), we therefore assign to each formula and formula a non-negative value through the additive step gain function

3

We again standardize the maximum gain at 1 unit, that is, formula without loss of generality. Note that gains and losses can be measured on different scales and a loss of 1 unit may not be comparable to a gain of 1 unit.

We also consider maximum step score functions with the same Boolean functions as in Equation (1), formula for formula. Applying equal weights formula and formula, formula, we obtain a binary step-loss function

4

where formula denotes the index set of true null hypotheses. Note that formula is the indicator function of the familywise Type I error as it assigns a loss 1 if at least one true null hypothesis is rejected and 0 otherwise.

2.3 Familywise Expected Losses and Valid Decision Rules

Assume that p is a random vector with a distribution depending on formula. The score function formula is then a random variable with a distribution parameterized by formula. Because formula is a discrete random variable defined in formula, the expectation of the score function is

5

where formula, formula. For additive step score functions defined in Equation (1),

6

For loss functions defined in Equation (2), we have formula. That is, we assign a loss of 0 when retaining a null hypothesis, regardless of whether that null hypothesis is true or not. Therefore, formula and hence formula. Denote formula by formula, where formula. For formula, we then have because of Equation (6)

7

We define the FWEL as the expectation formula for a given decision rule formula and a loss function formula, where the expectation is taken with respect to the sampling distribution of the data parameterized by formula. A decision rule formula is said to control the FWEL at a given level formula if formula for a given loss function formula. Furthermore, formula is called valid if it controls the FWEL strongly at level α, that is, for all formula. Since the maximum total loss is standardized at 1 unit according to Section 2.2, the same standardization also holds for the FWEL.

Note that the FWEL for additive step loss functions (2) is closely related to the per-family error rate (PFER), which is defined as the expected total number of false discoveries (Benjamini & Hochberg, 1997). If formula for all i, the expectation in Equation (7) reduces to the PFER. However, when unequal losses are assigned such that formula after standardization, this relationship no longer holds and we instead obtain a weighted version of the PFER.

For binary step loss functions (4), formula if formula and 0 otherwise. Therefore,

8

which is precisely the FWER, that is, the probability that formula erroneously rejects at least one true null hypothesis formula, formula; see Web Appendix C for more discussions.

Since formula and formula is constant on formula for each index set formula, conditions (7) and (8) can be rewritten as Equations (9) and (10), respectively, in the following proposition.

Proposition 1. Let formula for formula, and formula for formula, formula. A decision rule formula is valid with respect to an additive step-loss function (2) if and only if

9

Furthermore, a decision rule formula is valid with respect to a binary step loss function (4) if and only if

10

Condition (10) is closely related to closed test procedures defined by formula level-α tests for formula, formula (Marcus et al., 1976), as shown in Corollary 1 below. There, a closed test procedure is said to be equivalent to a decision rule if any individual null hypothesis formula rejected by one is also rejected by the other. Following Gabriel (1969), a closed test procedure is called consonant if a rejection of an intersection null hypothesis formula implies the rejection of at least one individual null hypothesis formula for formula.

Corollary 1. The following assertions hold:

  (i) For any closed test procedure there is an equivalent valid decision rule controlling strongly the FWER (i.e., the FWEL with respect to a binary loss function).

  (ii) Any valid decision rule with respect to the binary loss function is equivalent to a consonant closed test procedure.

  (iii) For any closed test procedure there is an equivalent consonant closed test procedure.

Despite the similarity of conditions (9) and (10), a result similar to Corollary 1 is not possible for decision rules that are valid with respect to additive loss functions. To see this, note that formula and the same inequality also holds for the expected losses. Hence, a valid decision rule with respect to a binary loss is also valid with respect to any additive loss function. While a decision rule defined by a closed test procedure via rejection regions formula for intersection hypotheses formula is valid for any additive loss, it is always strictly conservative, that is, it does not exhaust the level. However, Equation (9) allows one to construct such rules via the rejection sets formula in a stepwise fashion, starting with formula.
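
The pointwise inequality between the two loss functions can be verified by brute force. The sketch below assumes additive weights standardized to sum to 1 (Section 2.2) and checks every decision against every configuration of true and false null hypotheses; all names are hypothetical.

```python
from itertools import product
from typing import Sequence


def additive_loss(decision: Sequence[int], null_true: Sequence[bool],
                  weights: Sequence[float]) -> float:
    """Weighted sum over falsely rejected hypotheses."""
    return sum(a for d, h0, a in zip(decision, null_true, weights)
               if d == 1 and h0)


def binary_loss(decision: Sequence[int], null_true: Sequence[bool]) -> int:
    """1 if at least one true null hypothesis is rejected, else 0."""
    return int(any(d == 1 and h0 for d, h0 in zip(decision, null_true)))


def additive_never_exceeds_binary(weights: Sequence[float],
                                  tol: float = 1e-12) -> bool:
    """Check additive loss <= binary loss over all 2^m x 2^m combinations."""
    m = len(weights)
    return all(
        additive_loss(d, h, weights) <= binary_loss(d, h) + tol
        for h in product((True, False), repeat=m)
        for d in product((0, 1), repeat=m)
    )


print(additive_never_exceeds_binary((0.3, 0.7)))       # standardized weights
print(additive_never_exceeds_binary((0.2, 0.3, 0.5)))  # standardized, m = 3
print(additive_never_exceeds_binary((0.8, 0.8)))       # not standardized
```

The check succeeds whenever the weights sum to at most 1 and fails for the non-standardized example, which shows why the standardization matters for this comparison.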

2.4 Optimal Decision Rules

In analogy to the FWEL, we define the familywise expected gain (FWEG) as the expectation formula. For additive gain functions defined in Equation (3), this becomes

11

which is a weighted average power where we assign weight formula if formula is correctly rejected with probability formula. It can be interpreted as a weighted version of the expected number of true discoveries, similar to how the FWEL for additive step loss functions can be interpreted as a weighted version of the PFER (Section 2.3).

In the remainder of this section, we search for optimal valid decision rules that maximize the FWEG while controlling the FWEL at level α for a given loss function formula. We assume the distribution of p to be continuous and show in Proposition 2 that the search can be restricted to exhaustive decision rules formula satisfying formula for at least one formula. That is, any valid decision rule which does not exhaust α can be excluded from the search for optimal decision rules for given loss and gain functions.

We say that formula for any two decisions formula and d if formula, with at least one strict inequality. Further, we call a step score function formula increasing if formula implies formula for all formula and if in addition formula holds for at least one formula. We say that a valid decision rule formula strictly dominates another valid rule formula for a given gain function if the FWEG of formula is not smaller than that of formula for all formula and strictly larger for at least one formula. A valid decision rule formula is called inadmissible if there exists a valid decision function that strictly dominates it. One can show that additive and maximum step score functions are increasing (Web Lemma 2) and that any non-exhaustive valid decision rule formula is inadmissible if the step score functions are increasing (Web Lemma 3). This leads to the following proposition.

Proposition 2. A non-exhaustive, valid decision rule for a given additive or maximum step score function is inadmissible.

We now briefly discuss optimal decision rules for point alternatives before turning our attention to composite hypotheses. When testing formula against a point alternative formula, formula, the aim is to determine a decision rule that maximizes the expected gain (11) among all valid decision rules, given a fixed parameter formula. In conventional multiple testing, a common definition of power is given by the probability of rejecting at least one hypothesis. For example, the outcome of a clinical trial is often considered to be a success if at least one null hypothesis is rejected regardless of whether it is true or not, because the true state of nature remains unknown (Senn & Bretz, 2007). Applying this principle to additive gain functions then leads to the step gain function formula which is independent of formula. According to Equation (11),

12

Note that using formula instead of formula often leads to very similar optimal rules, as we now show. In many applications, it is reasonable to assume that losses and gains are proportional to each other. By Web Lemma 1, they can then be assumed to be equal, that is, formula. Then, according to Equations (7), (11), and (12), formula. Because formula is valid, formula for any formula and we have formula. The expected maximal loss α is generally much smaller than a targeted gain under a relevant alternative and the optimal rules are primarily influenced by the latter.

In the following, we call a valid decision rule optimal if its expected gain (12) is at least as large as that of any other decision rule of the same type. In the special case of separable rules formula, Benjamini and Hochberg (1997) have derived equations to determine optimal rules by generalizing earlier results of Spjøtvoll (1972) for equally weighted loss and gain functions; see Section 3 for further discussion. For non-separable decision rules, no analytic optimality results are available and numerical solutions are needed.

When testing formula against composite alternatives formula, we should account for the intrinsic uncertainty about the unknown parameter formula in the FWEG from Equation (11). Following what is often done in conventional hypothesis testing (see Spiegelhalter et al. 2004 among many others), we introduce the Bayes gain defined as the weighted average of the expected gain (12) with respect to a suitable density formula,

13

An optimal decision rule then maximizes the Bayes gain among all valid decision rules. Note that the Bayes gain always exists, as formula.

Changing the order of integration, the Bayes gain (13) can be rewritten as

where formula is a compound distribution. Thus, a Bayes optimal decision rule can be determined by finding the optimal decision rule for a point hypothesis (Proposition 3), leading to considerable savings in computation time. Furthermore, statements on optimal rules that are true for a point hypothesis are also true for Bayes optimal rules.

Proposition 3. Given a step gain function formula, maximizing the Bayes gain formula in Equation (13) among valid decision rules is equivalent to maximizing the expected gain formula with respect to the compound distribution formula.

We conclude this section with two remarks. First, we note that optimal decision rules are invariant under linear transformations of the gain function for a given loss function and a given upper limit for the expected loss (Web Lemma 1). This justifies the standardization of the maximum loss and gain at 1 unit in Section 2.2. It is of practical importance, as it ensures that there is no need to quantify the relative importance of losses versus gains. This would not be true if, instead, the expected value of a utility function formula were maximized based on a linear combination of losses and gains, e.g., formula. Second, we note that the distribution of p as a function of test statistics formula is usually not available in closed form and computing expected gains based on the distribution of formula is usually more convenient. Let t denote the observed values of formula. Given formula, formula and formula so that formula. Therefore, formula and formula. For example, consider m hypotheses each tested with independent z-tests, that is, formula, where formula is the non-centrality parameter. Assuming that the formula's are also independently distributed with density formula, then the density of the random vector formula is the multivariate compound distribution formula and has independent components (Web Lemma 6). Furthermore, compounding a normal distribution with a normal distribution for its mean parameter results in another normal distribution. Assuming formula and applying the laws of total expectation and total variance, we have formula for formula.
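
The normal-normal compounding step can be illustrated by simulation. The sketch below assumes a single test statistic that is normal with unit variance around a non-centrality parameter, which itself is drawn from a normal prior with mean `mu` and standard deviation `tau` (hypothetical names and values). By the laws of total expectation and total variance, the compound distribution should then have mean `mu` and variance `1 + tau**2`.

```python
import random


def simulate_compound_normal(mu: float, tau: float, n_draws: int,
                             seed: int = 0) -> list:
    """Draw delta ~ N(mu, tau^2), then T | delta ~ N(delta, 1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        delta = rng.gauss(mu, tau)            # non-centrality parameter
        draws.append(rng.gauss(delta, 1.0))   # test statistic given delta
    return draws


mu, tau = 2.0, 0.5  # hypothetical prior mean and standard deviation
draws = simulate_compound_normal(mu, tau, 200_000)

sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)

print(f"sample mean:     {sample_mean:.3f}  (theory: {mu:.3f})")
print(f"sample variance: {sample_var:.3f}  (theory: {1 + tau ** 2:.3f})")
```

The simulation is only a sanity check; in practice the closed-form normal distribution would be used directly when computing Bayes gains.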

3 Separable Decision Rules

Recall from Section 2 that a decision rule formula is separable if the rejection of formula depends only on formula, formula, that is, formula. More specifically, let formula if formula and 0 otherwise, where formula is a vector of fixed constants formula. Equivalently, formula for formula and 0 otherwise, where formula denotes the formula-quantile of the distribution of formula under formula. Separable decision rules are constructed in analogy to the classical Bonferroni or Šidák (1967) tests and offer certain advantages over other decision rules, as shown in the following.

Let formula and formula. We then obtain for additive step loss and gain functions formula and formula the expressions formula and formula, respectively. If the individual tests are unbiased with formula, it follows from Web Lemma 4 that a separable decision rule is valid and exhaustive for additive step loss functions if

14

That is, validity, and thus strong FWEL control, of a separable decision rule follows immediately from the control of the expected loss for the global null hypothesis formula. In general, this is neither true for non-separable decision rules nor for non-additive loss functions.
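
The following sketch illustrates why controlling the expected loss under the global null hypothesis suffices for separable rules with additive losses. It assumes point null hypotheses with p-values that are exactly uniform on [0, 1] whenever the null is true, so that a true null H_i contributes a_i · c_i to the expected loss; it further reads the validity condition as requiring this sum over all i to equal α. The names and numerical values are hypothetical.

```python
from itertools import product
from typing import Sequence


def expected_additive_loss(thresholds: Sequence[float],
                           weights: Sequence[float],
                           null_true: Sequence[bool]) -> float:
    """Expected loss of the rule 'reject H_i if p_i <= c_i' when p_i is
    uniform under a true null: only true nulls contribute a_i * c_i."""
    return sum(a * c for a, c, h0 in zip(weights, thresholds, null_true) if h0)


def worst_case_expected_loss(thresholds: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Maximum expected loss over all 2^m configurations of true nulls."""
    m = len(weights)
    return max(expected_additive_loss(thresholds, weights, h)
               for h in product((True, False), repeat=m))


alpha = 0.025
weights = (0.3, 0.7)         # hypothetical standardized losses
thresholds = (alpha, alpha)  # each hypothesis tested at level alpha

global_null = expected_additive_loss(thresholds, weights, (True, True))
worst_case = worst_case_expected_loss(thresholds, weights)
print(f"expected loss under the global null: {global_null:.4f}")
print(f"worst case over all configurations:  {worst_case:.4f}")
```

Because the expectation is linear, this calculation does not require independence of the p-values. With weights summing to 1, testing each hypothesis at level α exhausts the level exactly, in line with the argument given in Section 1.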

We discuss non-separable decision rules and non-additive loss and gain functions in later sections. Here, we focus on optimal separable decision rules, which were investigated by Benjamini and Hochberg (1997) in the context of weighted tests, and embed them in the proposed decision-theoretic framework. We search for the vector formula satisfying the validity condition (14) that maximizes the expected gain in Equation (12), more specifically

Assume that the test statistics formula are continuous, with densities formula and formula under formula and formula, respectively. Assume further that the ratio formula is monotonically increasing in formula, as is the case if, for example, formula. This assumption is necessary to guarantee that the rejection region as a function of p consists of just one interval [0, c]. It follows from Theorem 1 of Benjamini and Hochberg (1997) that the optimal decision rule formula is given by

15

where y is determined by the validity condition (14). With formula and formula a marginal density of formula in formula, let formula denote the marginal compound distribution.

Proposition 4. For given densities formula and formula of the test statistics under the null and alternative point hypotheses, respectively, the optimal separable decision rule is given by formula, where formula satisfies formula. The optimal Bayes separable rule is given by formula, where formula satisfies formula.

From Proposition 4, formula for normally distributed test statistics with variance 1 and non-centrality parameter formula under the alternative hypotheses.

Proposition 4 is a generalization of the Neyman–Pearson lemma to multiple hypotheses and applied to test statistics instead of the original observations. It implies that the optimal separable decision rules depend only on the marginal distributions of formula and formula. This is neither the case for more general decision rules nor for non-additive loss and gain functions. Note that if formula for formula and some constant κ with formula, then formula for all i and the optimal solution does not depend on the choice of formula and formula. If the marginal distributions under the alternative hypotheses are also the same and formula, then the optimal solution is seen to be formula.
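
As a numerical cross-check of the symmetric case, the sketch below searches for the best separable thresholds for m = 2 by brute force rather than through the closed-form characterization in Proposition 4. It assumes one-sided z-tests with unit variance, point alternatives with non-centrality parameters `deltas`, and a validity constraint of the form a1·c1 + a2·c2 = α; all names and values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided_z(c: float, delta: float) -> float:
    """P(p <= c) for a one-sided z-test whose statistic is N(delta, 1)."""
    return STD_NORMAL.cdf(delta + STD_NORMAL.inv_cdf(c))


def best_separable_thresholds(alpha, losses, gains, deltas, grid=2000):
    """Grid search over (c1, c2) on the line a1*c1 + a2*c2 = alpha,
    maximizing the additive expected gain b1*power1 + b2*power2."""
    a1, a2 = losses
    best = None
    for k in range(1, grid):
        c1 = (alpha / a1) * k / grid   # a1 * c1 runs through (0, alpha)
        c2 = (alpha - a1 * c1) / a2    # remaining level goes to H2
        if not (0 < c1 < 1 and 0 < c2 < 1):
            continue
        gain = (gains[0] * power_one_sided_z(c1, deltas[0])
                + gains[1] * power_one_sided_z(c2, deltas[1]))
        if best is None or gain > best[0]:
            best = (gain, c1, c2)
    return best


alpha = 0.025
gain, c1, c2 = best_separable_thresholds(
    alpha, losses=(0.5, 0.5), gains=(0.5, 0.5), deltas=(2.5, 2.5))
print(f"c1 = {c1:.4f}, c2 = {c2:.4f}, expected gain = {gain:.4f}")
```

With equal losses, equal gains, and identical alternatives, the search returns equal thresholds for both hypotheses, as symmetry suggests. Changing `losses`, `gains`, or `deltas` shows how the level is redistributed in asymmetric settings.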

4 Rectangular Decision Rules for m = 2

Motivated by conventional multiple test procedures (such as Simes 1986), we now investigate decision rules for formula that are defined by four non-overlapping rectangular rejection areas formula, formula. Figure 1 visualizes the rectangular decision rules

16

for given thresholds formula, where A(0, 0) denotes the area in the formula-plane for which no hypothesis is rejected. To avoid undesirable test decisions, we assume the inequalities formula, formula, which can be motivated as follows. If a single null hypothesis H is rejected because its associated p-value formula, it is desirable to require that H should also be rejected for any formula. This monotonicity requirement can be extended as follows to two hypotheses H1 and H2: If formula, formula, is rejected by a multiple test procedure based on formula, then it should also be rejected based on any formula with formula (Hommel & Bretz, 2008). Decision rules in Figure 1 satisfy this monotonicity property only if formula, formula. Choosing formula excludes the possibility of rejecting a null hypothesis when the test statistic for the other null hypothesis trends strongly in the opposite direction. Note that by setting formula or formula for at least one i the decision rule is not separable because then the test decisions depend on each other.

Figure 1

Rectangular decision rule formula in the (p1, p2)-plane, with formula for formula.

4.1 Familywise Expected Additive Losses and Gains

For rectangular decision rules with formula, we obtain from Equation (5) the FWEL

17

and the FWEG

18

So far we did not make any assumption about the dependence between the p-values p1 and p2. If the p-values formula are correlated, so are the decisions formula and hence their expectations will depend on these correlations. Motivated by the example in Section 1 investigating non-overlapping subgroups, we now assume that p1 and p2 are stochastically independent. As we restrict ourselves to rectangular decision rules in this section, the probabilities formula can be calculated as the product of the marginal probabilities. We assume that the individual tests of formula against formula are unbiased, formula. Let formula denote the cumulative distribution function of the random variable formula for a given parameter formula, formula, and formula. Then, formula for formula and formula. Hence, for formula we have formula for formula. For formula, that is, formula, this probability depends on the parameter formula as well as the particular test and sample sizes being employed.

Under independence of p1 and p2 we finally obtain from Equation (17) the expected loss

19

for point null hypotheses formula, formula. In Web Lemma 5, we show that these become upper bounds for the expected losses in the more general cases formula, thus leading to valid decision rules also for composite null hypotheses. Finally, we use Equation (18) to calculate the expected gain for formula as

20

4.2 Admissible Rectangular Decision Rules

We now provide conditions in terms of formula, and formula for a rectangular decision rule formula to be valid. As above, we assume independent p-values p1 and p2. According to Equation (9), validity holds if formula. For formula this implies the condition formula. Because formula for rectangular decision rules as given in Figure 1, sufficient validity conditions for formula, are given by

21

For formula, that is, formula, we have with the first branch of Equation (19)

22

We refer to decision rules (16) as unconstrained decision rules, because all six parameters formula, formula and formula for formula can be freely chosen under the above conditions. Exhaustive decision rules have one fewer free parameter since at least one of the inequalities in Equations (21) and (22) holds with equality. Otherwise, formula for all formula and the respective rule is not exhaustive and hence inadmissible. Imposing constraints on the parameters formula, and formula reduces the original optimization problem to one- or two-dimensional problems and may reflect additional practical considerations. The three conceivable classes of rectangular decision rules introduced in Table 1 are investigated further in Section 5. The asymmetric rules are motivated by numerical results indicating that maximizing the thresholds formula may perform better, leading to the constraints formula and the additional restriction formula. Table 1 also presents the sets of valid and exhaustive rules for these constrained rectangular decision rules; see Web Appendix B for the derivations.

TABLE 1

Classes of constrained rectangular decision rules and corresponding sets of valid and exhaustive rules with respect to additive loss functions.

Decision rule | Parameter constraints (C) and sets of valid and exhaustive rules (B)
Symmetric | C: formula; B: formula, with formula
Separable | C: formula; B: formula
Asymmetric | C: formula; B: formula formula

5 Application to Subgroup Analysis

We now illustrate the proposed approach using the subgroup analysis problem introduced in Section 1. Assume that the treatment effect could be different in two non-overlapping subgroups of patients defined by a biomarker. Therefore, the primary objective of a new clinical trial is to investigate the two subgroups separately and decide whether a benefit of the new treatment over the control can be claimed in any of the two subgroups. We test the null hypotheses formula that the new treatment is no better than the control in subgroup formula against the alternative hypotheses formula, where formula denotes the treatment effect in subgroup formula, for formula. If formula is rejected, a claim can be made for subgroup formula. Let formula denote the size of subgroup formula, formula. For a parallel-group design with equal number of patients in each treatment group, the test statistic formula for a continuous endpoint with known variance follows formula, for a given effect size formula and assuming a standard deviation of 1 without loss of generality. Furthermore, T1 and T2 are independent.
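As a minimal sketch of the testing setup just described (not the authors' code), the following Python snippet computes a one-sided p-value and the marginal power of a single subgroup test. It assumes that the subgroup size counts all patients in the subgroup, split equally between the two arms, and that the standard deviation is 1, so that the z-statistic has mean theta * sqrt(n) / 2 and unit variance. If the subgroup size is instead a per-arm count, the scaling changes accordingly. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def noncentrality(theta, n_subgroup):
    """Mean of the z-statistic for effect size theta.

    Assumption of this sketch: n_subgroup patients in total, n_subgroup / 2
    per arm, standard deviation 1, so the mean difference has standard
    deviation 2 / sqrt(n_subgroup).
    """
    return theta * sqrt(n_subgroup) / 2.0


def one_sided_p_value(t_statistic):
    """p-value for H: theta <= 0 against K: theta > 0."""
    return 1.0 - _STD_NORMAL.cdf(t_statistic)


def marginal_power(theta, n_subgroup, level):
    """Probability that the p-value falls below `level` for a given theta."""
    critical_value = _STD_NORMAL.inv_cdf(1.0 - level)
    shift = noncentrality(theta, n_subgroup)
    return 1.0 - _STD_NORMAL.cdf(critical_value - shift)
```

At theta = 0 the marginal power reduces to the test level, which is the point-null case used when bounding the expected loss.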

The choice of losses and gains should reflect the consequences of incorrect and correct decisions, respectively. We set formula to reflect the assumption that a larger subgroup suffers (benefits) more from a wrong (right) decision. In deriving the Bayes gain (13), we assume independent formula's and marginally formula to account for uncertainty. Such a density covers clinically relevant effect sizes, but alternative densities could be chosen based on relevant knowledge. We aim at maximizing the Bayes gain while controlling the FWEL at formula for a given decision rule.
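To make the notion of controlling the familywise expected loss concrete, the sketch below evaluates the expected loss of a separable rule, which rejects each hypothesis whenever its own p-value falls below its own threshold. It assumes an additive loss that sums the losses of all falsely rejected hypotheses, independent p-values that are uniform under point null hypotheses, and total loss standardized to 1. The thresholds, loss values, and function names are illustrative choices, not values taken from the paper.

```python
import random


def separable_rule(p_values, thresholds):
    """Reject H_i whenever p_i <= a_i, regardless of the other p-value."""
    return tuple(p <= a for p, a in zip(p_values, thresholds))


def additive_loss(rejections, null_is_true, losses):
    """Sum the losses of all hypotheses that were rejected although true."""
    return sum(
        loss
        for rejected, is_true, loss in zip(rejections, null_is_true, losses)
        if rejected and is_true
    )


def exact_expected_loss(thresholds, losses, null_is_true):
    """Expected additive loss of a separable rule under point nulls.

    Each true null is rejected with probability a_i, so the expected loss is
    the sum of l_i * a_i over the true null hypotheses.
    """
    return sum(
        loss * a
        for a, loss, is_true in zip(thresholds, losses, null_is_true)
        if is_true
    )


def simulated_expected_loss(thresholds, losses, n_sim=200_000, seed=1):
    """Monte Carlo check for the configuration in which both nulls are true."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        p_values = (rng.random(), rng.random())  # uniform under the null
        rejections = separable_rule(p_values, thresholds)
        total += additive_loss(rejections, (True, True), losses)
    return total / n_sim
```

Under these assumptions the expected loss is largest when both null hypotheses are true, so a separable rule is valid at level alpha whenever l1 * a1 + l2 * a2 does not exceed alpha. With equal losses of 0.5, this allows each hypothesis to be tested at the full level alpha.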

Table 2 summarizes the optimal rules, obtained with existing numerical optimization routines, for the four classes considered in terms of their parameters formula, formula (formula), and formula. For equal subgroup sizes (i.e., formula), the optimal symmetric decision rule is equal to the unconstrained rule with formula, whereas for the asymmetric rule formula is forced to be 1. Their Bayes expected gains, however, differ by only 0.001. This contrasts with the much larger difference of 0.052 in Bayes gain between the optimal unconstrained rule and the separable rule. The prime reason for this difference is the much larger region A(1, 1) for the unconstrained rule, where both hypotheses can be rejected if both p-values are smaller than formula, instead of just formula for the separable rule. The top row of Figure 2 visualizes the respective rejection areas.
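The optimization step can be illustrated for the simplest class only. The sketch below runs a grid search over separable rules that exhaust the constraint l1 * a1 + l2 * a2 = alpha and picks the one with the largest Bayes gain. It is not the routine used to produce Table 2: the prior is replaced by a hypothetical, equally weighted grid of positive effect sizes, the subgroup sizes are made up, the gain is assumed additive, and the power formula assumes total subgroup sizes split equally between arms with unit standard deviation.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power(theta, n_subgroup, level):
    """P(p <= level) for a one-sided z-test with mean theta * sqrt(n) / 2."""
    if level <= 0.0:
        return 0.0
    if level >= 1.0:
        return 1.0
    critical_value = _STD_NORMAL.inv_cdf(1.0 - level)
    return 1.0 - _STD_NORMAL.cdf(critical_value - theta * sqrt(n_subgroup) / 2.0)


def bayes_gain_separable(thresholds, sizes, gains, prior_grid):
    """Additive gain averaged over an equally weighted grid of effects."""
    total = 0.0
    for a, n, gain in zip(thresholds, sizes, gains):
        average_power = sum(power(t, n, a) for t in prior_grid) / len(prior_grid)
        total += gain * average_power
    return total


def best_separable_rule(alpha, losses, sizes, gains, prior_grid, steps=101):
    """Grid search over rules with l1 * a1 + l2 * a2 = alpha."""
    best_rule, best_value = None, float("-inf")
    for k in range(steps):
        share = k / (steps - 1)  # fraction of alpha spent on hypothesis 1
        rule = (
            min(1.0, share * alpha / losses[0]),
            min(1.0, (1.0 - share) * alpha / losses[1]),
        )
        value = bayes_gain_separable(rule, sizes, gains, prior_grid)
        if value > best_value:
            best_rule, best_value = rule, value
    return best_rule, best_value
```

For example, `best_separable_rule(0.025, (0.5, 0.5), (200, 200), (0.5, 0.5), (0.1, 0.2, 0.3, 0.4))` searches the symmetric setting with hypothetical sizes of 200 patients per subgroup. The richer rectangular classes have more parameters and would need a proper constrained optimizer rather than a one-dimensional grid.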

TABLE 2

Optimal Bayes rules for different classes of rectangular decision rules with respect to additive loss functions and strong control of the familywise expected loss at formula.

formula | Decision rule | Bayes gain | Parameters
0.5 | Unconstrained | 0.821 | formula for formula
0.5 | Symmetric | 0.821 | formula for formula
0.5 | Separable | 0.769 | formula for formula
0.5 | Asymmetric | 0.820 | formula for formula
0.25 | Unconstrained | 0.838 | formula
0.25 | Symmetric | 0.800 | formula for formula
0.25 | Separable | 0.790 | formula
0.25 | Asymmetric | 0.837 | formula
Figure 2

Graphical visualization of the optimal decision rules from Table 2. Top: formula. Bottom: formula.

For unequal subgroup sizes (i.e., formula), the losses assigned to the two subgroups are formula and formula. This asymmetry in the losses and in the marginal power of the two tests leads to different optimal rules for the different classes under consideration. The optimal asymmetric rule, with a Bayes gain of 0.837, turns out to be close to the optimal unconstrained rule, whose Bayes gain of 0.838 is only slightly larger. Nevertheless, the optimal asymmetric rule might be preferred in practice because the formula are forced to 1, so that a possibly large p-value in one subgroup does not prevent the rejection of the null hypothesis in the other subgroup if its p-value is small. The asymmetry in a1 and a2 allows for a larger area where both hypotheses can be rejected than in the symmetric or separable case, which gives this type of rule a clear advantage. The bottom row of Figure 2 visualizes the optimal asymmetric and separable decision rules as examples.

Finally, Figure 3(A) displays the Bayes gain as a function of formula ranging from 0.05 to 0.5 for the four classes of rectangular decision rules. We conclude that the asymmetric rule performs closest to the unconstrained rule, and that both are superior to the symmetric and the separable rules. The maximum difference in Bayes gain among the four classes is about 0.05. The differences in Bayes gain, however, are diluted by cases where both formula are either very small or very large and, consequently, lead to expected gains close to 0 or 1 irrespective of the rule. Therefore, we plot in Figure 3(B) the expected gain (20) of the same optimal Bayes rules as a function of formula, where we fix formula such that the marginal power is roughly between 0.2 and 0.8. As seen from Figure 3(B), for formula the expected gain for the separable rule is about 0.07 smaller than for the other rules. For formula, the differences in expected gain between the separable and unconstrained rules decrease somewhat and the symmetric rule is worse than the optimal separable rule.

Figure 3

(A) Bayes gain (13) of optimal Bayes rules with respect to additive loss functions within each class of rectangular decision rules. (B) Expected gain (20) of the optimal Bayes rule for formula. This figure appears in color in the electronic version of this paper.

Considering that the optimal asymmetric rule is so close to the optimal unconstrained rule, we investigate further the following rule obtained by setting formula for the asymmetric rule, formula. Such a rule turns out to be almost optimal under the current choice of formula and may serve as a rule of thumb, because it can be computed conveniently without any numerical optimization. The largest difference between this rule and the optimal unconstrained rule is less than 0.0017 for the whole range of formula. If we instead assume a density formula, the optimal unconstrained rule is virtually identical to the above rule of thumb (i.e., formula) with expected gains between 0.84 and 0.88. As before, the difference in expected gain between both rules is negligible, at most 0.0002 over the whole range of formula.

6 Discussion

Our approach includes several common multiple test procedures as special cases. For example, using separable decision rules and additive loss functions, our approach is equivalent to that of Benjamini and Hochberg (1997), who generalized earlier results of Spjotvoll (1972) to maximize the expected number of rejections while controlling the expected number of incorrectly rejected hypotheses. Moreover, using the binary loss function (4) leads to FWER-controlling procedures, as shown in Section 2.3. In Web Appendix C, we specialize the results in Section 4 with respect to binary loss functions and provide connections to common multiple test procedures. For example, using separable decision rules and binary loss functions, our approach specializes to the Šidák (1967) test for subgroups of equal size.
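The connection to the Šidák test can be checked with a few lines of arithmetic. The sketch below assumes independent, uniformly distributed p-values under the null hypotheses and a binary loss of 1 whenever at least one true null hypothesis is rejected, so that the expected loss of a separable rule equals its FWER. The function names are hypothetical.

```python
def fwer_separable(thresholds):
    """FWER of a separable rule when all nulls are true and p-values are independent.

    No false rejection occurs with probability prod(1 - a_i), so the FWER is
    one minus that product.
    """
    prob_no_false_rejection = 1.0
    for a in thresholds:
        prob_no_false_rejection *= 1.0 - a
    return 1.0 - prob_no_false_rejection


def sidak_threshold(alpha, n_hypotheses=2):
    """Common threshold a that solves 1 - (1 - a) ** n_hypotheses = alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_hypotheses)


alpha = 0.025
a = sidak_threshold(alpha)
fwer_at_sidak = fwer_separable((a, a))                # equals alpha up to rounding
fwer_at_full_level = fwer_separable((alpha, alpha))   # 1 - 0.975 ** 2
```

Testing each hypothesis at the unadjusted level alpha gives an FWER of 1 - (1 - alpha)^2, which is slightly below 2 * alpha, whereas the Šidák threshold lies between the Bonferroni level alpha / 2 and alpha.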

Our approach can be extended to applications other than subgroup analyses, although the assumptions and derivations may need modifications. For example, when testing a new drug for two unequally important variables, losses can be caused by, say, an erroneous drug approval based on the primary variable or, less severely, an erroneous drug label based on the secondary variable. The conventional practice of controlling the FWER, however, treats both null hypotheses as equally important and does not account for the different losses inflicted by inappropriate drug approval or label (Wang et al., 2015). As another example, several treatments may be evaluated in parallel in a master protocol trial when treatment options depend on one or more biomarkers within a single cancer type. If each treatment is evaluated in a single subgroup, the parallel sub-studies are essentially independent trials, each considering one treatment in its own distinct patient group, conducted together for administrative or operational reasons such as saving time, reducing costs or facilitating recruitment (Woodcock & LaVange, 2017). In these applications, the test statistics will be stochastically dependent and we leave the extension of the proposed methods to these settings for future research.

In applying the proposed approach, we need to set a value for the upper threshold α. As mentioned earlier, we standardize the maximum total loss at 1 unit. That is, in the setting of Section 5, the loss for erroneously rejecting the null hypothesis that there is no treatment effect in either subgroup is set to 1. The expected loss for a test in the overall population that disregards possible subgroups is then equal to the Type I error rate. A natural choice for α is therefore 0.025, in analogy to current clinical trial practice. We also need to assign actual values to the losses formula and gains formula, formula. One choice is to set formula and formula proportional to the prevalence of the two subgroups (i.e., formula in Section 5). Also, we recommend using the same proportions among the losses and gains. This keeps the relative importance of the subgroups the same in evaluating expected losses and gains: if a right (wrong) decision is made for a larger subgroup, the induced gain (loss) will be larger as well, and vice versa.

We also investigated optimal decision rules following the minimax principle to account for the intrinsic uncertainty in formula. Roughly speaking, these decision rules minimize the maximum regret (over all possible formula) of not adopting the best rule. The advantage is that we do not have to specify a density for formula. However, the numerical optimization is more challenging, because of the repeated minimization/maximization steps. The minimax approach is also more difficult to interpret because the regret is essentially a difference of expected gains, whereas the Bayes gain is an average on the same scale as the expected gains themselves. Our numerical studies indicated that minimax and Bayes approaches do not differ much in practice in the sense that their maximized gains are similar for a given class of rules.
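The minimax idea can be sketched for separable rules only. For each candidate rule, the regret at a given pair of effect sizes is the gap between the best expected gain achievable within the candidate set and the expected gain of that rule; the minimax rule minimizes the largest such gap over a grid of effect sizes. This is an illustration under simplifying assumptions, not the procedure used in our numerical studies: the candidate set is a one-dimensional grid of rules with l1 * a1 + l2 * a2 = alpha, the gain is additive, and the power formula assumes total subgroup sizes split equally between arms with unit standard deviation. All names and numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power(theta, n_subgroup, level):
    """P(p <= level) for a one-sided z-test with mean theta * sqrt(n) / 2."""
    if level <= 0.0:
        return 0.0
    if level >= 1.0:
        return 1.0
    critical_value = _STD_NORMAL.inv_cdf(1.0 - level)
    return 1.0 - _STD_NORMAL.cdf(critical_value - theta * sqrt(n_subgroup) / 2.0)


def expected_gain(thresholds, thetas, sizes, gains):
    """Additive gain: g_i counts only when H_i is false (theta_i > 0)."""
    return sum(
        gain * power(theta, n, a)
        for a, theta, n, gain in zip(thresholds, thetas, sizes, gains)
        if theta > 0.0
    )


def candidate_rules(alpha, losses, steps=21):
    """Separable rules that exhaust the constraint l1 * a1 + l2 * a2 = alpha."""
    rules = []
    for k in range(steps):
        share = k / (steps - 1)
        rules.append((
            min(1.0, share * alpha / losses[0]),
            min(1.0, (1.0 - share) * alpha / losses[1]),
        ))
    return rules


def minimax_regret_rule(rules, theta_grid, sizes, gains):
    """Return the rule with the smallest worst-case regret, and that regret."""
    worst_regret = [0.0] * len(rules)
    for thetas in theta_grid:
        values = [expected_gain(rule, thetas, sizes, gains) for rule in rules]
        best_value = max(values)  # best rule in the candidate set for these thetas
        for i, value in enumerate(values):
            worst_regret[i] = max(worst_regret[i], best_value - value)
    best_index = min(range(len(rules)), key=worst_regret.__getitem__)
    return rules[best_index], worst_regret[best_index]
```

The nested maximization over effect sizes and minimization over rules is what makes the minimax search more demanding than the Bayes search, even in this reduced setting.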

Following Müller et al. (2004) and Sun and Cai (2009), we suggest in this paper to maximize the Bayes gain while controlling the expected loss at a pre-specified threshold α. Alternatively, one could maximize a utility function of the difference between gain and loss, under the same constraint on the expected loss. Technically, both approaches lead to the same optimal decision rules, hence the same design. Practically, gains and losses may be measured on different scales. We therefore prefer the proposed approach because of the easier interpretation. Also, one could consider more general loss and gain functions that are dependent on the actual value of formula and evaluate decisions in a continuous fashion. We leave such possible extensions for future research.

Data Availability Statement

Data sharing is not applicable to this paper, as no datasets were generated or analyzed during the current study.

Acknowledgments

We are grateful to two reviewers and the associate editor for their constructive comments on an earlier version of this paper. We would like to express our sadness that our co-author Willi Maurer passed away while this manuscript was under preparation. We are indebted to him as our colleague, mentor, and friend. In this way, we want to thank him posthumously for his collaboration and guidance throughout many years.

References

Benjamini, Y. & Hochberg, Y. (1997) Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24, 407–418.

Brannath, W., Hillner, C. & Rohmeyer, K. (2023) The population-wise error rate for clinical trials with overlapping populations. Statistical Methods in Medical Research, 32, 334–352. https://doi.org/10.1177/09622802221135249

Cohen, A. & Sackrowitz, H.B. (2005) Decision theory results for one-sided multiple comparison procedures. The Annals of Statistics, 33, 126–144.

Gabriel, K.R. (1969) Simultaneous test procedures: some theory of multiple comparisons. The Annals of Mathematical Statistics, 40, 224–250.

Graf, A.C., Posch, M. & Koenig, F. (2015) Adaptive designs for subpopulation analysis optimizing utility functions. Biometrical Journal, 57, 76–89.

Hommel, G. & Bretz, F. (2008) Aesthetics and power considerations in multiple testing – a contradiction? Biometrical Journal, 50, 657–666.

Lehmann, E.L. (1957) A theory of some multiple decision problems, I. The Annals of Mathematical Statistics, 28, 1–25.

Lisovskaja, V. & Burman, C.-F. (2014) A decision theoretic approach to optimization of multiple testing procedures. Biometrical Journal, 57, 64–75.

Marcus, R., Peritz, E. & Gabriel, K.R. (1976) On closed testing procedures with special reference to ordered analysis of variance. Biometrika, 63, 655–660.

Müller, P., Parmigiani, G. & Rice, K. (2007) FDR and Bayesian multiple comparisons rules. In: Bernardo, J., Bayarri, M., Berger, J., Dawid, A., Heckerman, D., Smith, A. & West, M. (Eds.) Bayesian Statistics. Oxford: Oxford University Press, pp. 349–370.

Müller, P., Parmigiani, G., Robert, C. & Rousseau, J. (2004) Optimal sample size for multiple testing. Journal of the American Statistical Association, 99, 990–1001.

Rosenblum, M., Fang, E.X. & Liu, H. (2020) Optimal, two-stage, adaptive enrichment designs for randomized trials, using sparse linear programming. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 749–772.

Rosenblum, M., Liu, H. & Yen, E.-H. (2014) Optimal tests of treatment effects for the overall population and two subpopulations in randomized trials, using sparse linear programming. Journal of the American Statistical Association, 109, 1216–1228.

Senn, S. & Bretz, F. (2007) Power and sample size when multiple endpoints are considered. Pharmaceutical Statistics, 6, 161–170.

Šidák, Z. (1967) Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association, 62, 626–633.

Simes, R.J. (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73, 751–754.

Spiegelhalter, D.J., Abrams, K.R. & Myles, J.P. (2004) Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons.

Spjotvoll, E. (1972) On the optimality of some multiple comparison procedures. The Annals of Mathematical Statistics, 43, 398–411.

Sun, W. & Cai, T. (2009) Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 393–424.

Wang, S.-J., Bretz, F., Dmitrienko, A., Hsu, J., Hung, H.M.J., Koch, G., Maurer, W., Offen, W. & O'Neill, R. (2015) Multiplicity in confirmatory clinical trials: a case study with discussion from a JSM panel. Statistics in Medicine, 34, 3461–3480.

Woodcock, J. & LaVange, L.M. (2017) Master protocols to study multiple therapies, multiple diseases, or both. New England Journal of Medicine, 377, 62–70.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

Supplementary data