Summary

It is frequently observed in practice that the Wald statistic gives a poor assessment of the statistical significance of a variance component. This paper provides detailed analytic insight into the phenomenon by way of two simple models, which point to an atypical geometry as the source of the aberration. The latter can in principle be checked numerically to cover situations of arbitrary complexity, such as those arising from elaborate forms of blocking in an experimental context, or models for longitudinal or clustered data. The salient point, echoing Dickey (2020), is that a suitable likelihood-ratio test should always be used for the assessment of variance components.

1 Introduction

Faithful representation of a process generating data often entails specification of two or more sources of variability. In an experimental context, simple or elaborate forms of blocking induce a nested or crossed structure within the set of plots. Similar grouping arises in observational studies where, for instance, data may originate from different hospitals, regions or from several family groups, of no direct interest, but likely to generate structured correlation in the outcome.

As in certain other settings, inference based on the likelihood function for the full generative model is typically miscalibrated, sometimes seriously so, and should ideally be based on a suitable marginal or conditional likelihood. In the present context, appropriate preliminary reduction leads to residual maximum likelihood, REML, developed by Patterson & Thompson (1971), and closely connected to marginal likelihood (Bartlett, 1937).

Even when the REML likelihood is used, the Wald test routinely reported in software implementations is frequently found to be ineffectual for detecting components of variance when they are unambiguously present. A typical example occurs in McCullagh (2023, Ch. 4), where the likelihood-ratio statistic is more than eight times as large as the squared Wald ratio. The phenomenon has also been documented by Dickey (2020), who presented examples in which the REML Wald statistic is bounded. In at least one of the examples he considered, the natural estimate of standard error of the REML estimator is proportionate to the REML estimate itself, so that the role of the data in the Wald construction is negligible. The purpose of the present paper is to expose the matter in a form amenable to detailed analytic calculation, thereby revealing dependence on key aspects and indicating other settings in which the same phenomenon is inevitable. The insights of Dickey (2020) are reproduced and elucidated further. The source of the anomalous behaviour is not a failure of distributional approximations obtained under hypothetical regimes of sometimes questionable adequacy, but rather atypical geometry of the REML loglikelihood function, which induces a bounded Wald statistic even under a notional limiting operation in which the likelihood-ratio statistic for testing the same hypothesis is arbitrarily large. We show that a version of the score statistic is subject to the same aberration.

The implications for applied work are consequential, as undetected sources of variability typically result in estimated standard errors for regression coefficient estimators that are deceptively small, and therefore confidence intervals that are misleadingly narrow. The opposite situation can occasionally arise as well. For instance, in a randomized block design, omission of the block factor as a variance component has the effect of increasing the estimated variance of treatment effect estimators.

2 Variance-component models

In a typical variance-component model, the distribution of the response vector is specified by a mean vector μ=Xβ in the linear subspace X spanned by the columns of the model matrix X, and a covariance matrix Σ in the convex cone

spanned by given matrices V0,,Vs, which are positive definite or semidefinite. Usually, V0=In is the identity; the remaining matrices may be block factors or structured matrices associated with spatial or temporal dependence.

The residual likelihood is the likelihood function based on the residual UTY, where ker(UT)=X. Provided that the matrices UTVuU are linearly independent, the variance components θu may be estimated by maximizing the residual likelihood. In the subsequent discussion, reference to the loglikelihood function and its maximizer means the REML version unless otherwise specified.

There are compelling general arguments, notably invariance and permissibility of asymmetric confidence regions, for basing an assessment of the hypothesis H0:θs=0, say, on the likelihood-ratio statistic

where θ^ is the maximum likelihood estimator and θ^(0) is the constrained estimator under H0. Nonnegativity of the variance coefficients means that the subset of V defined by the constraint θs=0 is a boundary subcone, with the implication that the likelihood achieves its maximum on the boundary with positive probability, usually one half for sufficiently large sample size (Chernoff, 1954). When an exact F statistic exists, the boundary event occurs whenever F  1. On this event, θ^H0 coincides with θ^(0) and the loglikelihood ratio is exactly zero. Thus, with asymptotic probability one half, Λ = 0; otherwise, its distribution under H0 is χ12 under suitable limiting conditions. The realized value of Λ is to be calibrated against this distribution.

The nominal asymptotic variance of θ^s is iss(θ), the (s, s) component of the inverse Fisher information matrix. With asymptotic probability one half, the squared Wald ratio W2=θ^s2/iss(θ^s) is equal to zero under H0; otherwise, it is asymptotically equivalent to Λ by a standard asymptotic argument. However, it is frequently observed in practice that W2 gives a poor assessment of the statistical significance of the sth variance component, in the sense that it fails to reject H0 even when Λ does so unambiguously at the same significance level. If the likelihood is maximized on the boundary, both W2 and Λ are zero and there is no disagreement. Disagreement can only occur when both are positive, and it is that phenomenon that we address here.

Among the substantial literature on testing variance components is an early contribution by Wald (1947), notable in that it recommends an F test but fails to warn against use of the eponymous statistic in this context. Subsequent contributions have developed nonstandard asymptotic theory for the likelihood-ratio statistic and modifications thereof, relaxing an earlier assumption that the true parameter value belongs to the interior of the parameter space (e.g., Self & Liang, 1987; Geyer, 1994; Vu & Zhou, 1997). While the likelihood-ratio test and its variants have been the focus of theoretical development, common software implementations report Wald statistics as standard, without warning. Dickey (2020) appears to have been the first to emphasize the point at issue. The popularity of Wald-based inference perhaps stems from the convenience of its construction, requiring a single maximum likelihood fit in contrast with two for Λ, which facilitates the presentation of confidence statements.

The analysis of § 3 and § 6 is comparable to that of Dickey (2020), whose derivations cover situations in which the likelihood-ratio test coincides with an exact F test. The two papers illustrate the aberration under study in different ways and § 3.4 provides a comparison and synthesis. Sections 4 and 5 cover situations in which a fruitful formulation in terms of F may be infeasible, but for which the shared anomalous geometry can be checked by direct study of the loglikelihood function. Together, the present paper and that of Dickey (2020) provide a thorough explanation of a phenomenon of broad relevance and scientific consequence.

3 Analysis for a single block factor

3.1 Introduction

We consider in this section a simple Gaussian model with two variance components estimated from the sufficient statistic, which consists of the within-block mean square MS0 on f0 degrees of freedom and the between-block mean square MS1 on f1 degrees of freedom. Specifically, the outcome is Yji=μ+ηj+εji (j=1,,k,i=1,,b), where, for an arbitrary block index j, (Yj1,,Yjb)T has a Gaussian distribution of mean μ1b and covariance matrix Σ=σ02Ib+ση21b1bT. Define, in a standard notation for averaging over suffixes, Y¯j=iYji/b, and similarly for the double average. The between-block sum of squares

is distributed as σ12χf12, where f1=k1 and, in the usual variance-component parameterization with θ  0,

The within-block sum of squares j,i(YjiY¯j)2=f0MS0 is σ02χf02 distributed independently of the between-block sum of squares, where f0=k(b1). The constraint on θ means that the null hypothesis H0:θ=0 of equality of variances is on the boundary.

In the balanced one-way analysis of the variance structure of the above formulation, the REML loglikelihood is the marginal loglikelihood (θ,σ02) based on the joint density function of (MS0,MS1). While numerous generalizations may be considered, the simple version presented here isolates the point at issue in the most incisive form, free of secondary effects that complicate the analysis and interpretation.

3.2 Comparison of Wald and likelihood-ratio statistics

Maximization of (θ,σ02) without the constraint θ^  0 produces estimators θ^=(F1)/b and σ^02=MS0, where F=MS1/MS0 is Fisher’s F ratio, whose distribution depends only on the variance ratio θ. The maximum likelihood estimator θ^ has nominal asymptotic variance given by the relevant diagonal component of the inverse Fisher information matrix, namely,

(1)

where f=f1+f0. The squared Wald statistic for testing θ = 0 is therefore

to be compared with the likelihood-ratio statistic

where the pooled mean square MS=(f1MS1+f0MS0)/f is the maximum likelihood estimator of σ02 under the constraint θ = 0.

Although the statistics W2 and Λ are known functions of bθ^=(F1), simple approximations provide insight into the nature of the aberration described in § 1. A Taylor expansion of W2 and Λ around θ^=0 gives

or, in terms of Λ  0,

showing that they agree up to second order in θ^ and to first order in Λ, but not beyond. Typically, f0f1 is large, in which case the ratio is approximately

(2)

for small θ^ or Λ.

The squared Wald statistic W2 is justified based on standard asymptotic theory for maximum likelihood estimators. First- and second-moment theory suggests three further Wald statistics, for which the same anomalous behaviour is demonstrated in the Supplementary Material. For a similar discussion of two score statistics, see also the Supplementary Material.

In the simplest setting, at least, both W and Λ1/2sign(F1) are monotone functions of the mean-square ratio F, so their distributions are known exactly as a function of the variance-ratio parameter θ. The transformations Λ=g1(F) and W2=g2(F) are strictly monotone increasing for F > 1, and decreasing for F < 1. The discrepancies described here arise from the standard practice of treating Λ and W2 as if they were asymptotically identical statistics rather than equivalent statistics. The standard practice is justified only if g1 = g2, at least approximately for large samples, which is not the case in typical variance-components models.

One definition of Wald-detectability equates θ to Φ1(1α) times the estimated standard error of θ^. It can be shown that there does not exist a positive Wald-detectable value in this sense unless the number of blocks is large. It is arguably more natural to compute standard errors under the null, in which case the values at the borderline of Wald detectability are

at the 16% and 2% levels. The corresponding thresholds in terms of F are

With f0=102,f1=5 and b = 18, the ratio Λ/W2 is approximately 1.864 at θ16* and 2.727 at θ2*. With the same numbers, F16*=1.648 and F2*=3.296, to be compared with the 16% and 2% critical values of the F distribution on (f1, f0) degrees of freedom, which are 1.625 and 2.818, respectively. Thus, even when standard errors are computed under the null hypothesis, the observed value of the F statistic has to be 17% larger than the 2% critical value of the F distribution in order for the Wald test to reject at the same level. Equivalently, rejection at the 2% level using a Wald statistic with standard errors computed under the null requires an observed value of θ^ that is double that necessary for rejection at the same level using an F test. These latter conclusions do not involve any approximations, and are similar and complementary to those of Dickey (2020).

Geometric insight is obtained by noting that the Wald statistic W2 implicitly defines a quadratic approximation q(θ) to the profile REML loglikelihood function, (θ;σ^θ2) in a neighbourhood of θ^=(F1)/b. Specifically, for fixed θ, the maximum likelihood estimator of σ02 is σ^θ2=w0MS0+w1MS1/(1+bθ), where wr=fr/f and

(3)

Write (θ;σ^θ2)=^(θ)+f/2. The quadratic approximation to ^(θ) implicit in the Wald test is

where f0f1b2/(2fF2) is 1/iθθ(θ^) and

Since the two functions have the same value at θ^, the discrepancy between the Wald and likelihood-ratio statistics for testing θ = 0 is the difference in y intercepts:

(4)

with the expression for Λ in terms of F given by

For F and f1 fixed,

showing that the discrepancy is roughly linear in F for large f0, while for fixed f1 and f0, (4) converges to zero as F approaches unity and is unbounded for F arbitrarily large. Figure 1 graphs q(θ) and ^(θ) for different values of F.

Graphs of the profile REML loglikelihood function ℓ^(θ) (solid lines) and its quadratic approximation q(θ) (dashed lines) for f1=5, f0=102, b = 18, MS0=1 and F∈{2,3,4,5} (from top left to bottom right).
Fig. 1

Graphs of the profile REML loglikelihood function ^(θ) (solid lines) and its quadratic approximation q(θ) (dashed lines) for f1=5,f0=102, b = 18, MS0=1 and F{2,3,4,5} (from top left to bottom right).

3.3 Nonconstant Fisher information and anomalous geometry

From (3), ^(θ) has second derivative

whose value at θ^ is

(5)

showing that the curvature at the maximum-likelihood point is close to zero for large F, as depicted in Fig. 1. In other words, ^(θ) is arbitrarily well approximated by a horizontal asymptote at ^(θ^) in a neighbourhood of θ^ for arbitrarily large F. Equation (5) also shows that the discrepancy q(0)^(0) is attributable to the higher-order derivatives, since there is no error incurred by using 1/iθθ(θ^) in place of γ(θ^) in the Taylor series approximation to ^(θ). The effect of higher-order derivatives is encapsulated to a large extent in the considerable nonconstancy of iθθ(θ) as a function of θ over the range of interest.

From (1), the ratio of the nominal asymptotic variances of θ^ at arbitrary θ and at θ = 0 is iθθ(θ)/iθθ(0)=(1+bθ)2, and the range of primary interest is

the upper value being the nominal asymptotic standard deviation of θ^ under the null hypothesis, having conditioned on the event that θ^ is not on the boundary. Over this range, the asymptotic variance varies by a factor

which is large for typical values of f1. For example, f0=102,f1=5 gives a range of approximately 1  (1+bθ)2  2.72.

3.4 A synthesis with Dickey (2020)

The motivation for the present paper came from practical examples in McCullagh (2023, Ch. 4), where the Wald statistic was ineffectual at detecting variance components, the absence of which was strongly refuted by a likelihood-ratio test. Dickey (2020) exposed the same phenomenon. His exposition, aimed at practitioners, covers widely used models in the analysis of designed experiments for which exact F and Wald tests are available. The present paper is framed in terms of the loglikelihood-ratio statistic Λ, which is more generally available and points to geometric insights not recoverable from the moment-based Wald constructions. Three instances of the latter are discussed in the Supplementary Material, among which equation (S.1) coincides with equation (2) of Dickey (2020). For cases where F is available, the present paper and that of Dickey provide equivalent explanations from two points of view.

4 Two-sample problem in general scale families

Let Y be a random variable from a scale family with density function τ1g(y/τ),y,τ>0, where g is a known, continuous, density function on the positive real line. Let

Then the Fisher information for τ in an independent and identically distributed sample Y1,,Yn is nIg/τ2 and that for τ0,τ1 in a two-sample problem of sizes n0 and n1 is

(6)

The squared Wald statistic for testing equality of scale parameters is therefore

where θ^ is the estimated ratio τ^1/τ^0. The Wald statistic is bounded in both limits for θ^ arbitrarily large or small. Specifically,

The curvature of the profile loglikelihood function at θ^ is approximated by the expected Fisher information in the profile loglikelihood. The Fisher information transforms as

(7)

where ψ and ϕ are two parameterizations and we have used the convention that symbols appearing both as subscripts and superscripts in the same product are summed. The information about θ, having adjusted for estimation of τ0 is therefore, using (6) and (7),

showing that the two-sample problem in general scale families has the same anomalous geometry documented in § 3 for large θ, leading to the discrepancy between the likelihood-ratio and the Wald statistics.

5 Gaussian variance-component model

Consider a variance-component model in which YRn is normal with mean μX of dimension p < n, and covariance matrix Σ=σ2{In+V(θ)}, where V(θ) is a known matrix function of a vector parameter θ=(θ1,,θs)T with V(0)=0. This encompasses the linear model V(θ)=uθuVu from § 2. The subspace XRn is a group under addition, which implies that, for any matrix UT with kernel X, the normalized residual statistic

is maximal invariant under the affine group with action g(a,x):yay+x, with a > 0 and xX. For distributional calculations, it is convenient to take the columns of U to be an orthonormal basis in X, the orthogonal complement of X with respect to the standard inner product. In that case UUT=InX(XTX)1XT,UTU=Inp and the density function of Q=q(Y,U) is

(8)

where σ2Aθ=UTΣU. At θ = 0, Q is uniformly distributed on the (np)-dimensional unit sphere in Rn. The above exposition hybridizes Kariya (1980) and King (1980).

By construction, distribution (8) does not depend on β or σ2, and inference for θ is conveniently based on the marginal loglikelihood function

(9)

closely related to the REML loglikelihood function that uses the marginal distribution of UTY rather than Q. Since the transformation to Q eliminates the nuisance parameter σ2 as well as β, analysis based on (9) is more amenable to analytic calculation.

The density function in (8) is relative to a particular orthonormal basis in X, and the same basis is embedded in the likelihood function (9) in matrix Aθ. The conclusion in (9) is equivalent to equation (3.3) of McCullagh (2009), which evades the problem of selecting a basis by allowing singular matrices.

The single block factor setting of § 3 is a special case with n=(f1+1)b,Σ=σ2(In+θV),V=If1+11b1bT and X=1, so that UUT=In1n1nT/n. Suppose for an exact analytic calculation that f1+1 and b are both powers of two. The Supplementary Material shows that log|Aθ|=f1log(1+bθ) and

so that (9) becomes

(10)

The solution of the likelihood equation by differentiation of (10) is (1+bθ)=F and direct calculation shows that (θ)(θ)=^(θ)^(θ^) from (3). This is demonstrated empirically in Fig. 2 with the values of f0, f1 and b from the numerical example of § 6.1 so as to also illustrate the anomalous geometry. When an analysis of variance is feasible, it produces identical estimates to those based on maximization of (9), the only difference arising from the choice of basis for the orthogonal subspaces.

Graph of ℓ^(θ)−ℓ^(θ^) from § 3 (solid line) and ℓ⌣(θ)−ℓ⌣(θ⌣) from (10) (dashed line) for f1=22, f0=92, b = 5, MS0=1 and F = 4.
Fig. 2

Graph of ^(θ)^(θ^) from § 3 (solid line) and (θ)(θ) from (10) (dashed line) for f1=22,f0=92, b = 5, MS0=1 and F = 4.

More generally, in a model with a scalar parameter θ generating a variance component in the form Σ=σ2{In+V(θ)}, the likelihood equation for θ is

at θ=θ, where A˙θ=θAθ and, by the Woodbury identity,

The second derivative of (θ) at an arbitrary point is

where A¨θ=θθ2Aθ. A general analytic approximation to the curvature γ(θ) has not been ascertained. However, if V(θ) is of the form θV with V a known matrix, then A˙θ=UTVU,A¨θ=0 and the previous display simplifies. In particular,

and

so that limθθθ2(θ)=0. By consistency of the maximum likelihood estimator, the marginal loglikelihood function has zero curvature at the maximising point when the true value of θ is arbitrarily large.

6 Numerical illustrations

6.1 Covariance determined by split-plot nested blocking

The split-plot covariance σ02(In+θV) is a linear combination of the identity and a binary matrix such that Vij = 1 if observational units i, j belong to the same whole plot or block. Chapter 1 of McCullagh (2023) discusses an example of this type with l = 24 rats constituting the blocks, and s = 5 sites on each rat constituting the observational units. A two-level treatment was assigned at random to the rats, so the model formula site+treat for the expected value determines a subspace of dimension 6. This is not a split-plot design in the traditional sense because the split-plot effect is associated with a classification factor, sites ranging from anterior to caudal, not with a treatment factor.

The real experiment is a little more complicated because some components are missing. With this exception, the simulated data mirror that example, and illustrate the effect of increasing θ on the discrepancy between Λ and W2. Since θ^0.4 for the actual experiment, the range 0  θ  1 was used for simulation. For each of the 1000 Monte Carlo replications, outcomes were generated according to the model YN(μ,Σ), the particular point μX being immaterial for the REML likelihood. As it happens, the analysis of § 3 applies here with two modified mean squares, one for treatment and one for residuals eliminating additive row and column effects, namely,

where f1=l2,f0=(l1)(s1) and, for example, PtrtY is the projection of Y on the subspace spanned by the treatment basis. For the null hypothesis θ = 0, Fig. 3 compares the simulated values of Λ/W2 with the linear approximation 1+4sθ^/3 from (2).

Simulated Λ/W2 plotted against θ^ for the nested-block arrangement with l = 24 large blocks and s = 5 nested blocks.
Fig. 3

Simulated Λ/W2 plotted against θ^ for the nested-block arrangement with l = 24 large blocks and s = 5 nested blocks.

Maximization of the marginal loglikelihood function (θ) from (9) is equivalent. Both normed loglikelihood functions are graphed in Fig. 2 for θ^=θ=0.6, from which the anomalous geometry is apparent.

6.2 Covariance determined by Latin square blocking

The Latin square covariance Σ=σ02(In+θ1V1+θ2V2) is a linear combination of the identity and two binary matrices of the form (V1)ij=1 if observational units i and j share a column and (V2)ij=1 if they share a row. For the simulation, the variance-component parameters were taken as σ02=1,θ2=0.2, while θ1 was varied in the range 0θ11. The relevant mean squares on f0=(b1)(b2) and f1=(b1) degrees of freedom are

where σ12=σ02(1+bθ1). The analogous mean square MS2 for rows on f2 = f1 degrees of freedom is used in variance estimation. Specifically, the information about θ1, having adjusted for estimation of σ02 and θ2, is

(11)

In (11), θj is estimated as the positive part of (Fj1)/b with Fj=MSj/MS0. For f=f1+f0f1, which amounts to b being large, the adjustment is negligible provided that θ2 is not too large relative to θ1, and (11) is comparable to (1). The resulting Wald statistic for testing the hypothesis θ1=0 is to be compared with Λ, whose form is identical to that of § 3.

Figure 4 depicts qualitatively similar behaviour for the ratio Λ/W2 as that of Fig. 3, with additional variability attributable to randomness in θ^2 manifesting through the estimate of (11). The version with θ2 treated as known is depicted in Fig. 4(b).

Simulated Λ/W2 plotted against θ^1 for the Latin square with b = 6 rows, columns and treatments. (a): θ2 estimated in (11); (b): θ2 treated as known.
Fig. 4

Simulated Λ/W2 plotted against θ^1 for the Latin square with b = 6 rows, columns and treatments. (a): θ2 estimated in (11); (b): θ2 treated as known.

Acknowledgement

Section 3.4 was based on a detailed comparison supplied by one of the two referees, to whom we are grateful.

Supplementary material

The Supplementary Material provides more explicit derivations for some of the key results, together with a discussion of three further Wald statistics and two score statistics.

References

Bartlett
M. S
. (
1937
).
Properties of sufficiency and statistical tests
.
Proc. R. Soc. A
155
,
268
82
.

Chernoff
H
. (
1954
).
On the distribution of the likelihood ratio
.
Ann. Math. Statist.
25
,
573
8
.

Dickey
D. A.
(
2020
).
A warning about Wald tests
.
SAS Global Forum
, Paper
5088
-
2020
.

Geyer
C. J
. (
1994
).
On the asymptotics of constrained M-estimation
.
Ann. Statist.
22
,
1993
2010
.

Kariya
T
. (
1980
).
Locally robust tests for serial correlation in least squares regression
.
Ann. Statist.
8
,
1065
70
.

King
M. L.
(
1980
).
Robust tests for spherical symmetry and their application to least squares regression
.
Ann. Statist.
8
,
1265
71
.

McCullagh
P
. (
2009
).
Marginal likelihood for distance matrices
.
Statist. Sinica
19
,
631
49
.

McCullagh
P.
(
2023
).
Ten Projects in Applied Statistics
.
Cham
:
Springer
.

Patterson
H. D.
,
Thompson
R
. (
1971
).
Recovery of inter-block information when block sizes are unequal
.
Biometrika
58
,
545
54
.

Self
S. G.
,
Liang
K-Y
. (
1987
).
Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions
.
J. Am. Statist. Assoc
.
82
,
605
10
.

Vu
H. T. V.
,
Zhou
S
. (
1997
).
Generalization of likelihood ratio tests under nonstandard conditions
.
Ann. Statist.
22
,
1993
2010
.

Wald
A
. (
1947
).
A note on regression analysis
.
Ann. Math. Statist.
18
,
586
9
.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Supplementary data