An anomaly arising in the analysis of processes with more than one source of variability

Summary

It is frequently observed in practice that the Wald statistic gives a poor assessment of the statistical significance of a variance component. This paper provides detailed analytic insight into the phenomenon by way of two simple models, which point to an atypical geometry as the source of the aberration. The latter can in principle be checked numerically to cover situations of arbitrary complexity, such as those arising from elaborate forms of blocking in an experimental context, or models for longitudinal or clustered data. The salient point, echoing Dickey (2020), is that a suitable likelihood-ratio test should always be used for the assessment of variance components.

1 Introduction

Faithful representation of a process generating data often entails specification of two or more sources of variability. In an experimental context, simple or elaborate forms of blocking induce a nested or crossed structure within the set of plots. Similar grouping arises in observational studies where, for instance, data may originate from different hospitals, regions or from several family groups, of no direct interest, but likely to generate structured correlation in the outcome.

As in certain other settings, inference based on the likelihood function for the full generative model is typically miscalibrated, sometimes seriously so, and should ideally be based on a suitable marginal or conditional likelihood. In the present context, appropriate preliminary reduction leads to residual maximum likelihood, REML, developed by Patterson & Thompson (1971), and closely connected to marginal likelihood (Bartlett, 1937).

Even when the REML likelihood is used, the Wald test routinely reported in software implementations is frequently found to be ineffectual for detecting components of variance when they are unambiguously present. A typical example occurs in McCullagh (2023, Ch. 4), where the likelihood-ratio statistic is more than eight times as large as the squared Wald ratio. The phenomenon has also been documented by Dickey (2020), who presented examples in which the REML Wald statistic is bounded. In at least one of the examples he considered, the natural estimate of standard error of the REML estimator is proportionate to the REML estimate itself, so that the role of the data in the Wald construction is negligible. The purpose of the present paper is to expose the matter in a form amenable to detailed analytic calculation, thereby revealing dependence on key aspects and indicating other settings in which the same phenomenon is inevitable. The insights of Dickey (2020) are reproduced and elucidated further. The source of the anomalous behaviour is not a failure of distributional approximations obtained under hypothetical regimes of sometimes questionable adequacy, but rather atypical geometry of the REML loglikelihood function, which induces a bounded Wald statistic even under a notional limiting operation in which the likelihood-ratio statistic for testing the same hypothesis is arbitrarily large. We show that a version of the score statistic is subject to the same aberration.

The implications for applied work are consequential, as undetected sources of variability typically result in estimated standard errors for regression coefficient estimators that are deceptively small, and therefore confidence intervals that are misleadingly narrow. The opposite situation can occasionally arise as well. For instance, in a randomized block design, omission of the block factor as a variance component has the effect of increasing the estimated variance of treatment effect estimators.

2 Variance-component models

In a typical variance-component model, the distribution of the response vector is specified by a mean vector $μ = X β$ in the linear subspace $X$ spanned by the columns of the model matrix X, and a covariance matrix Σ in the convex cone

V = {Σ = \sum_{u = 0}^{s} θ_{u} V_{u} : θ_{u} ⩾ 0}

spanned by given matrices $V_{0}, \dots, V_{s}$, which are positive definite or semidefinite. Usually, $V_{0} = I_{n}$ is the identity; the remaining matrices may be block factors or structured matrices associated with spatial or temporal dependence.

The residual likelihood is the likelihood function based on the residual $U^{T} Y$, where $\ker (U^{T}) = X$. Provided that the matrices $U^{T} V_{u} U$ are linearly independent, the variance components $θ_{u}$ may be estimated by maximizing the residual likelihood. In the subsequent discussion, reference to the loglikelihood function and its maximizer means the REML version unless otherwise specified.

There are compelling general arguments, notably invariance and permissibility of asymmetric confidence regions, for basing an assessment of the hypothesis $H_{0} : θ_{s} = 0$, say, on the likelihood-ratio statistic

Λ = 2 {ℓ (\hat{θ}) - ℓ ({\hat{θ}}^{(0)})},

where $\hat{θ}$ is the maximum likelihood estimator and ${\hat{θ}}^{(0)}$ is the constrained estimator under H₀. Nonnegativity of the variance coefficients means that the subset of $V$ defined by the constraint $θ_{s} = 0$ is a boundary subcone, with the implication that the likelihood achieves its maximum on the boundary with positive probability, usually one half for sufficiently large sample size (Chernoff, 1954). When an exact F statistic exists, the boundary event occurs whenever $F ⩽ 1$ ⁠. On this event, $\hat{θ} \in H_{0}$ coincides with ${\hat{θ}}^{(0)}$ and the loglikelihood ratio is exactly zero. Thus, with asymptotic probability one half, Λ = 0; otherwise, its distribution under H₀ is $χ_{1}^{2}$ under suitable limiting conditions. The realized value of Λ is to be calibrated against this distribution.

The nominal asymptotic variance of ${\hat{θ}}_{s}$ is $i^{s s} (θ)$ ⁠, the (s, s) component of the inverse Fisher information matrix. With asymptotic probability one half, the squared Wald ratio $W^{2} = {\hat{θ}}_{s}^{2} / i^{s s} ({\hat{θ}}_{s})$ is equal to zero under H₀; otherwise, it is asymptotically equivalent to Λ by a standard asymptotic argument. However, it is frequently observed in practice that W² gives a poor assessment of the statistical significance of the sth variance component, in the sense that it fails to reject H₀ even when Λ does so unambiguously at the same significance level. If the likelihood is maximized on the boundary, both W² and Λ are zero and there is no disagreement. Disagreement can only occur when both are positive, and it is that phenomenon that we address here.

Among the substantial literature on testing variance components is an early contribution by Wald (1947), notable in that it recommends an F test but fails to warn against use of the eponymous statistic in this context. Subsequent contributions have developed nonstandard asymptotic theory for the likelihood-ratio statistic and modifications thereof, relaxing an earlier assumption that the true parameter value belongs to the interior of the parameter space (e.g., Self & Liang, 1987; Geyer, 1994; Vu & Zhou, 1997). While the likelihood-ratio test and its variants have been the focus of theoretical development, common software implementations report Wald statistics as standard, without warning. Dickey (2020) appears to have been the first to emphasize the point at issue. The popularity of Wald-based inference perhaps stems from the convenience of its construction, requiring a single maximum likelihood fit in contrast with two for Λ, which facilitates the presentation of confidence statements.

The analysis of § 3 and § 6 is comparable to that of Dickey (2020), whose derivations cover situations in which the likelihood-ratio test coincides with an exact F test. The two papers illustrate the aberration under study in different ways and § 3.4 provides a comparison and synthesis. Sections 4 and 5 cover situations in which a fruitful formulation in terms of F may be infeasible, but for which the shared anomalous geometry can be checked by direct study of the loglikelihood function. Together, the present paper and that of Dickey (2020) provide a thorough explanation of a phenomenon of broad relevance and scientific consequence.

3 Analysis for a single block factor

3.1 Introduction

We consider in this section a simple Gaussian model with two variance components estimated from the sufficient statistic, which consists of the within-block mean square ${MS}_{0}$ on f₀ degrees of freedom and the between-block mean square ${MS}_{1}$ on f₁ degrees of freedom. Specifically, the outcome is $Y_{j i} = μ + η_{j} + ε_{j i}$ (⁠ $j = 1, \dots, k, i = 1, \dots, b$ ⁠), where, for an arbitrary block index j, ${(Y_{j 1}, \dots, Y_{j b})}^{T}$ has a Gaussian distribution of mean $μ 1_{b}$ and covariance matrix $Σ = σ_{0}^{2} I_{b} + σ_{η}^{2} 1_{b} 1_{b}^{T}$ ⁠. Define, in a standard notation for averaging over suffixes, ${\bar{Y}}_{j •} = \sum_{i} Y_{j i} / b$ ⁠, and similarly for the double average. The between-block sum of squares

\sum_{j, i} {({\bar{Y}}_{j •} - {\bar{Y}}_{• •})}^{2} = b \sum_{j} {({\bar{Y}}_{j •} - {\bar{Y}}_{• •})}^{2} = f_{1} {MS}_{1}

is distributed as $σ_{1}^{2} χ_{f_{1}}^{2}$, where $f_{1} = k - 1$ and, in the usual variance-component parameterization with $θ = σ_{η}^{2} / σ_{0}^{2} ⩾ 0$,

σ_{1}^{2} = E ({MS}_{1}) = b σ_{η}^{2} + σ_{0}^{2} = σ_{0}^{2} (1 + b θ) .

The within-block sum of squares $\sum_{j, i} {(Y_{j i} - {\bar{Y}}_{j •})}^{2} = f_{0} {MS}_{0}$ is distributed as $σ_{0}^{2} χ_{f_{0}}^{2}$, independently of the between-block sum of squares, where $f_{0} = k (b - 1)$. The constraint on θ means that the null hypothesis $H_{0} : θ = 0$ of equality of variances is on the boundary.

In the balanced one-way analysis-of-variance structure of the above formulation, the REML loglikelihood is the marginal loglikelihood $ℓ (θ, σ_{0}^{2})$ based on the joint density function of $({MS}_{0}, {MS}_{1})$. While numerous generalizations may be considered, the simple version presented here isolates the point at issue in the most incisive form, free of secondary effects that complicate the analysis and interpretation.

3.2 Comparison of Wald and likelihood-ratio statistics

Maximization of $ℓ (θ, σ_{0}^{2})$ without the constraint $\hat{θ} ⩾ 0$ produces estimators $\hat{θ} = (F - 1) / b$ and ${\hat{σ}}_{0}^{2} = {MS}_{0}$ ⁠, where $F = {MS}_{1} / {MS}_{0}$ is Fisher’s F ratio, whose distribution depends only on the variance ratio θ. The maximum likelihood estimator $\hat{θ}$ has nominal asymptotic variance given by the relevant diagonal component of the inverse Fisher information matrix, namely,

i^{θ θ} (θ, σ_{0}^{2}) = i^{θ θ} (θ) = \frac{2 f {(1 + b θ)}^{2}}{f_{0} f_{1} b^{2}},

(1)

where $f = f_{1} + f_{0}$. The squared Wald statistic for testing θ = 0 is therefore

W^{2} = {\hat{θ}}^{2} {\frac{f_{0} f_{1} b^{2}}{2 f {(1 + b \hat{θ})}^{2}}} = {(\frac{F - 1}{F})}^{2} \frac{f_{0} f_{1}}{2 f},

to be compared with the likelihood-ratio statistic

\begin{matrix} Λ = f log MS - f_{1} log {MS}_{1} - f_{0} log {MS}_{0} \\ = f log (1 + f_{1} b \hat{θ} / f) - f_{1} log (1 + b \hat{θ}), \end{matrix}

where the pooled mean square $MS = (f_{1} {MS}_{1} + f_{0} {MS}_{0}) / f$ is the maximum likelihood estimator of $σ_{0}^{2}$ under the constraint θ = 0.
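As a numerical companion to the two displays above, the following minimal Python sketch evaluates W² and Λ as functions of F. The function names are illustrative and are not taken from the paper; both statistics are set to zero on the boundary event $F ⩽ 1$, as described in § 2.

```python
import math

def wald_squared(F, f0, f1):
    """Squared Wald statistic W^2 = ((F - 1) / F)^2 * f0 * f1 / (2 f)."""
    if F <= 1.0:  # boundary event: the constrained estimate of theta is zero
        return 0.0
    f = f0 + f1
    return ((F - 1.0) / F) ** 2 * f0 * f1 / (2.0 * f)

def lr_statistic(F, f0, f1):
    """Likelihood-ratio statistic Lambda = f log{1 + f1 (F - 1) / f} - f1 log F."""
    if F <= 1.0:
        return 0.0
    f = f0 + f1
    return f * math.log1p(f1 * (F - 1.0) / f) - f1 * math.log(F)

f0, f1 = 102, 5  # degrees of freedom of the numerical example in this section
for F in (1.5, 3.0, 10.0, 1000.0):
    print(f"F={F:8.1f}  W^2={wald_squared(F, f0, f1):7.3f}  "
          f"Lambda={lr_statistic(F, f0, f1):8.3f}")
```

Since $(F - 1) / F < 1$, the first function never exceeds $f_{0} f_{1} / (2 f)$, whereas the second grows without bound as F increases.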

Although the statistics W² and Λ are known functions of $b \hat{θ} = F - 1$, simple approximations provide insight into the nature of the aberration described in § 1. A Taylor expansion of W² and Λ around $\hat{θ} = 0$ gives

\begin{matrix} W^{2} = \frac{b^{2} {\hat{θ}}^{2} f_{0} f_{1}}{2 f} (1 - 2 b \hat{θ}) + O ({\hat{θ}}^{4}), \\ Λ = \frac{b^{2} {\hat{θ}}^{2} f_{0} f_{1}}{2 f} {1 - \frac{2 b \hat{θ} (f_{0} + 2 f_{1})}{3 f}} + O ({\hat{θ}}^{4}), \\ Λ / W^{2} = 1 + \frac{2 b \hat{θ} (2 f_{0} + f_{1})}{3 f} + \frac{4 b^{2} {\hat{θ}}^{2} (2 f_{0} + f_{1})}{3 f} + O ({\hat{θ}}^{3}), \end{matrix}

or, in terms of $Λ ⩾ 0$,

Λ / W^{2} = 1 + \frac{2 \sqrt 2 (2 f_{0} + f_{1}) Λ^{1 / 2}}{3 {(f f_{0} f_{1})}^{1 / 2}} + \frac{4 (2 f_{0} + f_{1}) (3 f + f_{1}) Λ}{3 f f_{0} f_{1}} + O (Λ^{3 / 2}),

showing that they agree up to second order in $\hat{θ}$ and to first order in Λ, but not beyond. Typically, $f_{0}$ is large relative to $f_{1}$, in which case the ratio is approximately

Λ / W^{2} ≃ 1 + \frac{4 b \hat{θ}}{3} = 1 + \frac{4 (F - 1)}{3} ≃ 1 + \frac{4 {(2 Λ / f_{1})}^{1 / 2}}{3}, Λ ⩾ 0,

(2)

for small $\hat{θ}$ or Λ.
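The quality of the linear approximation can be examined numerically. The sketch below, with hypothetical function names, compares the exact ratio Λ / W² with the first-order term of the expansion and with its large-$f_{0}$ form (2), writing $x = b \hat{θ} = F - 1$.

```python
import math

def exact_ratio(x, f0, f1):
    """Exact Lambda / W^2 as a function of x = b * theta_hat = F - 1 > 0."""
    f = f0 + f1
    lam = f * math.log1p(f1 * x / f) - f1 * math.log1p(x)
    w2 = (x / (1.0 + x)) ** 2 * f0 * f1 / (2.0 * f)
    return lam / w2

def first_order_ratio(x, f0, f1):
    """Expansion truncated after the linear term: 1 + 2 x (2 f0 + f1) / (3 f)."""
    return 1.0 + 2.0 * x * (2.0 * f0 + f1) / (3.0 * (f0 + f1))

def large_f0_ratio(x):
    """Approximation (2) for f0 much larger than f1: 1 + 4 x / 3."""
    return 1.0 + 4.0 * x / 3.0

f0, f1 = 102, 5
for x in (0.01, 0.05, 0.1, 0.5):
    print(x, exact_ratio(x, f0, f1), first_order_ratio(x, f0, f1), large_f0_ratio(x))
```

As expected of a Taylor expansion, the agreement is closest for small x.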

The squared Wald statistic W² is justified based on standard asymptotic theory for maximum likelihood estimators. First- and second-moment theory suggests three further Wald statistics, for which the same anomalous behaviour is demonstrated in the Supplementary Material. For a similar discussion of two score statistics, see also the Supplementary Material.

In the simplest setting, at least, both W and $Λ^{1 / 2} sign (F - 1)$ are monotone functions of the mean-square ratio F, so their distributions are known exactly as a function of the variance-ratio parameter θ. The transformations $Λ = g_{1} (F)$ and $W^{2} = g_{2} (F)$ are strictly monotone increasing for F > 1, and decreasing for F < 1. The discrepancies described here arise from the standard practice of treating Λ and W² as if they were asymptotically identical statistics rather than equivalent statistics. The standard practice is justified only if g₁ = g₂, at least approximately for large samples, which is not the case in typical variance-components models.

One definition of Wald-detectability equates θ to $Φ^{- 1} (1 - α)$ times the estimated standard error of $\hat{θ}$ ⁠. It can be shown that there does not exist a positive Wald-detectable value in this sense unless the number of blocks is large. It is arguably more natural to compute standard errors under the null, in which case the values at the borderline of Wald detectability are

θ_{16}^{*} = {(\frac{2 f}{f_{0} f_{1} b^{2}})}^{1 / 2}, θ_{2}^{*} = 2 θ_{16}^{*}

at the 16% and 2% levels. The corresponding thresholds in terms of F are

F_{16}^{*} = 1 + {(\frac{2 f}{f_{0} f_{1}})}^{1 / 2}, F_{2}^{*} = 2 F_{16}^{*} .

With $f_{0} = 102, f_{1} = 5$ and b = 18, the ratio $Λ / W^{2}$ is approximately 1.864 at $θ_{16}^{*}$ and 2.727 at $θ_{2}^{*}$ ⁠. With the same numbers, $F_{16}^{*} = 1.648$ and $F_{2}^{*} = 3.296$ ⁠, to be compared with the 16% and 2% critical values of the F distribution on (f₁, f₀) degrees of freedom, which are 1.625 and 2.818, respectively. Thus, even when standard errors are computed under the null hypothesis, the observed value of the F statistic has to be 17% larger than the 2% critical value of the F distribution in order for the Wald test to reject at the same level. Equivalently, rejection at the 2% level using a Wald statistic with standard errors computed under the null requires an observed value of $\hat{θ}$ that is double that necessary for rejection at the same level using an F test. These latter conclusions do not involve any approximations, and are similar and complementary to those of Dickey (2020).
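Some of the figures quoted in this paragraph can be reproduced in a few lines. The sketch below assumes that the quoted ratios Λ / W² were obtained from the large-$f_{0}$ approximation (2); this is our reading of the text, and the variable names are illustrative.

```python
import math

f0, f1, b = 102, 5, 18
f = f0 + f1

# Borderline of Wald detectability with standard errors computed under the null.
theta_16 = math.sqrt(2.0 * f / (f0 * f1 * b ** 2))
theta_2 = 2.0 * theta_16

# Threshold on the F scale at the 16% level, using F = 1 + b * theta.
F_16 = 1.0 + b * theta_16

# Approximation (2) to Lambda / W^2 at the two borderline values of theta.
ratio_16 = 1.0 + 4.0 * b * theta_16 / 3.0
ratio_2 = 1.0 + 4.0 * b * theta_2 / 3.0

print(round(F_16, 3), round(ratio_16, 3), round(ratio_2, 3))
# prints 1.648 1.864 2.727
```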

Geometric insight is obtained by noting that the Wald statistic W² implicitly defines a quadratic approximation $q (θ)$ to the profile REML loglikelihood function, $ℓ (θ; {\hat{σ}}_{θ}^{2})$ in a neighbourhood of $\hat{θ} = (F - 1) / b$ ⁠. Specifically, for fixed θ, the maximum likelihood estimator of $σ_{0}^{2}$ is ${\hat{σ}}_{θ}^{2} = w_{0} {MS}_{0} + w_{1} {MS}_{1} / (1 + b θ)$ ⁠, where $w_{r} = f_{r} / f$ and

2 ℓ (θ; {\hat{σ}}_{θ}^{2}) = - f [1 + log {f_{0} {MS}_{0} (1 + b θ) + f_{1} {MS}_{1}}] + f_{0} log (1 + b θ) .

(3)

Write $ℓ (θ; {\hat{σ}}_{θ}^{2}) = \hat{ℓ} (θ) - f / 2$. The quadratic approximation to $\hat{ℓ} (θ)$ implicit in the Wald test is

q (θ) = - (\frac{f_{0} f_{1} b^{2}}{2 f F^{2}}) {(θ - \hat{θ})}^{2} + \hat{ℓ} (\hat{θ}),

where $f_{0} f_{1} b^{2} / (2 f F^{2})$ is $1 / i^{θ θ} (\hat{θ})$ and

2 \hat{ℓ} (\hat{θ}) = - f log f - f_{1} log {MS}_{1} - f_{0} log {MS}_{0} .

Since the two functions have the same value at $\hat{θ}$, the discrepancy between the Wald and likelihood-ratio statistics for testing θ = 0 is the difference in y intercepts:

q (0) - \hat{ℓ} (0) = \frac{Λ}{2} - (\frac{f_{0} f_{1}}{2 f}) {(\frac{F - 1}{F})}^{2}

(4)

with the expression for Λ in terms of F given by

Λ = f log {1 + f_{1} (F - 1) / f} - f_{1} log F .

For F and f₁ fixed,

q (0) - \hat{ℓ} (0) = \frac{f_{1}}{2} {(F - 1) (\frac{F^{2} - F + 1}{F^{2}}) - log F} + O (f_{0}^{- 1}),

showing that the discrepancy is roughly linear in F for large f₀, while for fixed f₁ and f₀, (4) converges to zero as F approaches unity and is unbounded for F arbitrarily large. Figure 1 graphs $q (θ)$ and $\hat{ℓ} (θ)$ for different values of F.

Fig. 1. Graphs of the profile REML loglikelihood function $\hat{ℓ} (θ)$ (solid lines) and its quadratic approximation $q (θ)$ (dashed lines) for $f_{1} = 5, f_{0} = 102$, b = 18, ${MS}_{0} = 1$ and $F \in \{2, 3, 4, 5\}$ (from top left to bottom right).

3.3 Nonconstant Fisher information and anomalous geometry

From (3), $\hat{ℓ} (θ)$ has second derivative

γ (θ) = \frac{b^{2} f_{0}}{2} {\frac{f_{0} f {MS}_{0}^{2}}{{f_{1} {MS}_{1} + f_{0} {MS}_{0} (1 + b θ)}^{2}} - \frac{1}{{(1 + b θ)}^{2}}},

whose value at $\hat{θ}$ is

γ (\hat{θ}) = - \frac{b^{2} f_{0} f_{1}}{2 f F^{2}} = - \frac{1}{i^{θ θ} (\hat{θ})},

(5)

showing that the curvature at the maximum-likelihood point is close to zero for large F, as depicted in Fig. 1. In other words, $\hat{ℓ} (θ)$ is arbitrarily well approximated by a horizontal asymptote at $\hat{ℓ} (\hat{θ})$ in a neighbourhood of $\hat{θ}$ for arbitrarily large F. Equation (5) also shows that the discrepancy $q (0) - \hat{ℓ} (0)$ is attributable to the higher-order derivatives, since there is no error incurred by using $- 1 / i^{θ θ} (\hat{θ})$ in place of $γ (\hat{θ})$ in the Taylor series approximation to $\hat{ℓ} (θ)$ ⁠. The effect of higher-order derivatives is encapsulated to a large extent in the considerable nonconstancy of $i^{θ θ} (θ)$ as a function of θ over the range of interest.

From (1), the ratio of the nominal asymptotic variances of $\hat{θ}$ at arbitrary θ and at θ = 0 is $i^{θ θ} (θ) / i^{θ θ} (0) = {(1 + b θ)}^{2}$, and the range of primary interest is

0 ⩽ θ ⩽ {(2 f / b^{2} f_{0} f_{1})}^{1 / 2},

the upper value being the nominal asymptotic standard deviation of $\hat{θ}$ under the null hypothesis, having conditioned on the event that $\hat{θ}$ is not on the boundary. Over this range, the asymptotic variance varies by a factor

1 ⩽ {(1 + b θ)}^{2} ⩽ {1 + {(2 f / f_{0} f_{1})}^{1 / 2}}^{2},

which is large for typical values of f₁. For example, $f_{0} = 102, f_{1} = 5$ gives a range of approximately $1 ⩽ {(1 + b θ)}^{2} ⩽ 2.72$ ⁠.
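Two statements of this subsection can be checked numerically: that the curvature of $\hat{ℓ} (θ)$ at $\hat{θ}$ equals $- 1 / i^{θ θ} (\hat{θ})$, with $i^{θ θ}$ taken from (1), and that the nominal variance changes by a factor of about 2.72 over the range of primary interest. The sketch below uses a central finite difference on the profile loglikelihood (3); the choices F = 3 and step size h are illustrative assumptions, as are the function names.

```python
import math

f0, f1, b = 102, 5, 18
f = f0 + f1
MS0, F = 1.0, 3.0            # illustrative values; F = MS1 / MS0
MS1 = F * MS0
theta_hat = (F - 1.0) / b

def profile_loglik(theta):
    """Profile REML loglikelihood from (3), up to an additive constant."""
    return 0.5 * (-f * math.log(f0 * MS0 * (1.0 + b * theta) + f1 * MS1)
                  + f0 * math.log(1.0 + b * theta))

def inverse_information(theta):
    """Nominal asymptotic variance i^{theta theta}(theta) from (1)."""
    return 2.0 * f * (1.0 + b * theta) ** 2 / (f0 * f1 * b ** 2)

# Central finite-difference estimate of the second derivative at theta_hat.
h = 1e-4
curvature = (profile_loglik(theta_hat + h) - 2.0 * profile_loglik(theta_hat)
             + profile_loglik(theta_hat - h)) / h ** 2
print(curvature, -1.0 / inverse_information(theta_hat))

# Factor by which the nominal variance changes over the range of primary interest.
upper = math.sqrt(2.0 * f / (b ** 2 * f0 * f1))
variance_factor = inverse_information(upper) / inverse_information(0.0)
print(round(variance_factor, 2))
```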

3.4 A synthesis with Dickey (2020)

The motivation for the present paper came from practical examples in McCullagh (2023, Ch. 4), where the Wald statistic was ineffectual at detecting variance components, the absence of which was strongly refuted by a likelihood-ratio test. Dickey (2020) exposed the same phenomenon. His exposition, aimed at practitioners, covers widely used models in the analysis of designed experiments for which exact F and Wald tests are available. The present paper is framed in terms of the loglikelihood-ratio statistic Λ, which is more generally available and points to geometric insights not recoverable from the moment-based Wald constructions. Three instances of the latter are discussed in the Supplementary Material, among which equation (S.1) coincides with equation (2) of Dickey (2020). For cases where F is available, the present paper and that of Dickey provide equivalent explanations from two points of view.

4 Two-sample problem in general scale families

Let Y be a random variable from a scale family with density function $τ^{- 1} g (y / τ), y, τ > 0$ ⁠, where g is a known, continuous, density function on the positive real line. Let

I_{g} = \int_{0}^{\infty} \frac{{x g' (x) + g (x)}^{2}}{g (x)} d x .

Then the Fisher information for τ in an independent and identically distributed sample $Y_{1}, \dots, Y_{n}$ is $n I_{g} / τ^{2}$ and that for $τ_{0}, τ_{1}$ in a two-sample problem of sizes n₀ and n₁ is

I_{g} diag (\frac{n_{0}}{τ_{0}^{2}}, \frac{n_{1}}{τ_{1}^{2}}) .

(6)

The squared Wald statistic for testing equality of scale parameters is therefore

W^{2} = \frac{n_{0} n_{1} I_{g} {({\hat{τ}}_{1} - {\hat{τ}}_{0})}^{2}}{n_{0} {\hat{τ}}_{1}^{2} + n_{1} {\hat{τ}}_{0}^{2}} = \frac{n_{0} n_{1} I_{g} {(\hat{θ} - 1)}^{2}}{n_{0} {\hat{θ}}^{2} + n_{1}},

where $\hat{θ}$ is the estimated ratio ${\hat{τ}}_{1} / {\hat{τ}}_{0}$. The Wald statistic remains bounded as $\hat{θ}$ becomes arbitrarily large or arbitrarily small. Specifically,

\lim_{\hat{θ} \to 0} W^{2} = n_{0} I_{g}, \lim_{\hat{θ} \to \infty} W^{2} = n_{1} I_{g} .
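The two limits are easy to observe numerically. In the following sketch the function name and the sample sizes are hypothetical, and $I_{g} = 1$ is an assumed value: it is the value of the integral defining $I_{g}$ for the unit exponential density $g (x) = e^{- x}$.

```python
def wald_squared_scale(theta_hat, n0, n1, I_g):
    """Squared Wald statistic for equality of two scale parameters,
    expressed through the estimated ratio theta_hat = tau1_hat / tau0_hat."""
    return n0 * n1 * I_g * (theta_hat - 1.0) ** 2 / (n0 * theta_hat ** 2 + n1)

n0, n1 = 30, 20   # illustrative sample sizes
I_g = 1.0         # assumed: unit exponential density
for theta_hat in (1e-6, 0.1, 1.0, 10.0, 1e6):
    print(theta_hat, wald_squared_scale(theta_hat, n0, n1, I_g))
```

However extreme the estimated ratio, the statistic cannot exceed the larger of $n_{0} I_{g}$ and $n_{1} I_{g}$ by more than rounding error in this example.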

The curvature of the profile loglikelihood function at $\hat{θ}$ is approximated by the expected Fisher information in the profile loglikelihood. The Fisher information transforms as

i_{a b}^{(ψ)} (ψ) = \frac{\partial ϕ^{r}}{\partial ψ^{a}} \frac{\partial ϕ^{s}}{\partial ψ^{b}} i_{r s}^{(ϕ)} (ϕ),

(7)

where ψ and $ϕ$ are two parameterizations and we have used the convention that symbols appearing both as subscripts and superscripts in the same product are summed. The information about θ, having adjusted for estimation of τ₀ is therefore, using (6) and (7),

i_{θ θ . τ_{0}} = i_{θ θ} - i_{θ τ_{0}}^{2} / i_{τ_{0} τ_{0}} = \frac{n_{0} n_{1}}{θ^{2} (n_{0} + n_{1})} I_{g},

showing that the two-sample problem in general scale families has the same anomalous geometry documented in § 3 for large θ, leading to the discrepancy between the likelihood-ratio and the Wald statistics.

5 Gaussian variance-component model

Consider a variance-component model in which $Y \in R^{n}$ is normal with mean $μ \in X$ of dimension p < n, and covariance matrix $Σ = σ^{2} {I_{n} + V (θ)}$ ⁠, where $V (θ)$ is a known matrix function of a vector parameter $θ = {(θ_{1}, \dots, θ_{s})}^{T}$ with $V (0) = 0$ ⁠. This encompasses the linear model $V (θ) = \sum_{u} θ_{u} V_{u}$ from § 2. The subspace $X \subset R^{n}$ is a group under addition, which implies that, for any matrix $U^{T}$ with kernel $X$ ⁠, the normalized residual statistic

q (y, U) = U^{T} y / | | U^{T} y | | = U^{T} y / {(y^{T} U U^{T} y)}^{1 / 2}

is maximal invariant under the affine group with action $g (a, x) : y \mapsto a y + x$ ⁠, with a > 0 and $x \in X$ ⁠. For distributional calculations, it is convenient to take the columns of U to be an orthonormal basis in $X^{⊥}$ ⁠, the orthogonal complement of $X$ with respect to the standard inner product. In that case $U U^{T} = I_{n} - X {(X^{T} X)}^{- 1} X^{T}, U^{T} U = I_{n - p}$ and the density function of $Q = q (Y, U)$ is

\frac{Γ {(n - p) / 2} | A_{θ} |^{- 1 / 2} {(q^{T} A_{θ}^{- 1} q)}^{- (n - p) / 2}}{2 π^{(n - p) / 2}} d q_{1} \dots d q_{n - p},

(8)

where $σ^{2} A_{θ} = U^{T} Σ U$ ⁠. At θ = 0, Q is uniformly distributed on the $(n - p)$ -dimensional unit sphere in $R^{n}$ ⁠. The above exposition hybridizes Kariya (1980) and King (1980).

By construction, distribution (8) does not depend on β or $σ^{2}$ ⁠, and inference for θ is conveniently based on the marginal loglikelihood function

\overset{⌣}{ℓ} (θ) = - \frac{1}{2} log | A_{θ} | - \frac{n - p}{2} log (q^{T} A_{θ}^{- 1} q),

(9)

closely related to the REML loglikelihood function that uses the marginal distribution of $U^{T} Y$ rather than Q. Since the transformation to Q eliminates the nuisance parameter $σ^{2}$ as well as β, analysis based on (9) is more amenable to analytic calculation.

The density function in (8) is relative to a particular orthonormal basis in $X^{⊥}$ ⁠, and the same basis is embedded in the likelihood function (9) in matrix $A_{θ}$ ⁠. The conclusion in (9) is equivalent to equation (3.3) of McCullagh (2009), which evades the problem of selecting a basis by allowing singular matrices.

The single block factor setting of § 3 is a special case with $n = (f_{1} + 1) b, Σ = σ^{2} (I_{n} + θ V), V = I_{f_{1} + 1} \otimes 1_{b}^{} 1_{b}^{T}$ and $X = 1$ ⁠, so that $U U^{T} = I_{n} - 1_{n}^{} 1_{n}^{T} / n$ ⁠. Suppose for an exact analytic calculation that $f_{1} + 1$ and b are both powers of two. The Supplementary Material shows that $log | A_{θ} | = f_{1} log (1 + b θ)$ and

q^{T} A_{θ}^{- 1} q = 1 + \sum_{j = 1}^{f_{1}} q_{j b}^{2} {{(1 + b θ)}^{- 1} - 1},

so that (9) becomes

\begin{matrix} \overset{⌣}{ℓ} (θ) = - \frac{f_{1}}{2} log (1 + b θ) - \frac{f}{2} log {1 - \frac{b θ \sum_{j = 1}^{f_{1}} q_{j b}^{2}}{1 + b θ}} \\ = - \frac{f_{1}}{2} log (1 + b θ) - \frac{f}{2} log {1 - \frac{b θ f_{1} F}{(1 + b θ) (f_{0} + f_{1} F)}} . \end{matrix}

(10)

The solution of the likelihood equation by differentiation of (10) is $(1 + b \overset{⌣}{θ}) = F$ and direct calculation shows that $\overset{⌣}{ℓ} (θ) - \overset{⌣}{ℓ} (\overset{⌣}{θ}) = \hat{ℓ} (θ) - \hat{ℓ} (\hat{θ})$ from (3). This is demonstrated empirically in Fig. 2 with the values of f₀, f₁ and b from the numerical example of § 6.1 so as to also illustrate the anomalous geometry. When an analysis of variance is feasible, it produces identical estimates to those based on maximization of (9), the only difference arising from the choice of basis for the orthogonal subspaces.
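The claimed identity between the two normed loglikelihood functions can be verified directly from (3) and (10). The sketch below uses the values of Fig. 2; the function names and the grid of θ values are illustrative assumptions.

```python
import math

f1, f0, b = 22, 92, 5
f = f0 + f1
MS0, F = 1.0, 4.0
MS1 = F * MS0
theta_max = (F - 1.0) / b  # common maximizer, 1 + b * theta = F

def loglik_hat(theta):
    """Profile REML loglikelihood from (3), dropping the additive constant."""
    return 0.5 * (-f * math.log(f0 * MS0 * (1.0 + b * theta) + f1 * MS1)
                  + f0 * math.log(1.0 + b * theta))

def loglik_breve(theta):
    """Marginal loglikelihood (10) based on the normalized residual."""
    frac = b * theta * f1 * F / ((1.0 + b * theta) * (f0 + f1 * F))
    return -0.5 * f1 * math.log(1.0 + b * theta) - 0.5 * f * math.log(1.0 - frac)

def max_gap(thetas):
    """Largest absolute difference between the two normed loglikelihoods."""
    return max(abs((loglik_hat(t) - loglik_hat(theta_max))
                   - (loglik_breve(t) - loglik_breve(theta_max)))
               for t in thetas)

grid = [0.05 * k for k in range(61)]  # theta from 0 to 3
print(max_gap(grid))  # zero up to rounding error
```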

Fig. 2. Graph of $\hat{ℓ} (θ) - \hat{ℓ} (\hat{θ})$ from § 3 (solid line) and $\overset{⌣}{ℓ} (θ) - \overset{⌣}{ℓ} (\overset{⌣}{θ})$ from (10) (dashed line) for $f_{1} = 22, f_{0} = 92$, b = 5, ${MS}_{0} = 1$ and F = 4.

More generally, in a model with a scalar parameter θ generating a variance component in the form $Σ = σ^{2} {I_{n} + V (θ)}$ ⁠, the likelihood equation for θ is

\frac{tr (A_{θ}^{- 1} {\dot{A}}_{θ})}{n - p} = \frac{q^{T} (A_{θ}^{- 1} {\dot{A}}_{θ} A_{θ}^{- 1}) q}{q^{T} A_{θ}^{- 1} q}

at $θ = \overset{⌣}{θ}$, where ${\dot{A}}_{θ} = \nabla_{θ} A_{θ}$ and, by the Woodbury identity,

A_{θ}^{- 1} = I_{n - p} - U^{T} {V {(θ)}^{- 1} + U U^{T}}^{- 1} U .

The second derivative of $\overset{⌣}{ℓ} (θ)$ at an arbitrary point is

\begin{matrix} \overset{⌣}{γ} (θ) = \nabla_{θ θ}^{2} \overset{⌣}{ℓ} (θ) \\ = - \frac{1}{2} tr (A_{θ}^{- 1} {\ddot{A}}_{θ} - A_{θ}^{- 1} {\dot{A}}_{θ} A_{θ}^{- 1} {\dot{A}}_{θ}) \\ - \frac{n - p}{2} \{\frac{q^{T} A_{θ}^{- 1} (2 {\dot{A}}_{θ} A_{θ}^{- 1} {\dot{A}}_{θ} - {\ddot{A}}_{θ}) A_{θ}^{- 1} q}{q^{T} A_{θ}^{- 1} q} - \frac{\{q^{T} (A_{θ}^{- 1} {\dot{A}}_{θ} A_{θ}^{- 1}) q\}^{2}}{{(q^{T} A_{θ}^{- 1} q)}^{2}}\}, \end{matrix}

where ${\ddot{A}}_{θ} = \nabla_{θ θ}^{2} A_{θ}$ ⁠. A general analytic approximation to the curvature $\overset{⌣}{γ} (\overset{⌣}{θ})$ has not been ascertained. However, if $V (θ)$ is of the form θV with V a known matrix, then ${\dot{A}}_{θ} = U^{T} V U, {\ddot{A}}_{θ} = 0$ and the previous display simplifies. In particular,

\begin{matrix} \lim_{θ \to \infty} tr (A_{θ}^{- 1} {\ddot{A}}_{θ} - A_{θ}^{- 1} {\dot{A}}_{θ} A_{θ}^{- 1} {\dot{A}}_{θ}) = 0, \\ \lim_{θ \to \infty} A_{θ}^{- 1} {\dot{A}}_{θ} A_{θ}^{- 1} = 0, \end{matrix}

and

\lim_{θ \to \infty} A_{θ}^{- 1} (2 {\dot{A}}_{θ} A_{θ}^{- 1} {\dot{A}}_{θ} - {\ddot{A}}_{θ}) A_{θ}^{- 1} = 0,

so that $\lim_{θ \to \infty} \nabla_{θ θ}^{2} \overset{⌣}{ℓ} (θ) = 0$ ⁠. By consistency of the maximum likelihood estimator, the marginal loglikelihood function has zero curvature at the maximising point when the true value of θ is arbitrarily large.

6 Numerical illustrations

6.1 Covariance determined by split-plot nested blocking

The split-plot covariance $σ_{0}^{2} (I_{n} + θ V)$ is a linear combination of the identity and a binary matrix such that V_ij = 1 if observational units i, j belong to the same whole plot or block. Chapter 1 of McCullagh (2023) discusses an example of this type with l = 24 rats constituting the blocks, and s = 5 sites on each rat constituting the observational units. A two-level treatment was assigned at random to the rats, so the model formula site+treat for the expected value determines a subspace of dimension 6. This is not a split-plot design in the traditional sense because the split-plot effect is associated with a classification factor, sites ranging from anterior to caudal, not with a treatment factor.

The real experiment is a little more complicated because some components are missing. With this exception, the simulated data mirror that example, and illustrate the effect of increasing θ on the discrepancy between Λ and W². Since $\hat{θ} ≃ 0.4$ for the actual experiment, the range $0 ⩽ θ ⩽ 1$ was used for simulation. For each of the 1000 Monte Carlo replications, outcomes were generated according to the model $Y \sim N (μ, Σ)$ ⁠, the particular point $μ \in X$ being immaterial for the REML likelihood. As it happens, the analysis of § 3 applies here with two modified mean squares, one for treatment and one for residuals eliminating additive row and column effects, namely,

\begin{matrix} f_{0} {MS}_{0} = | | (I_{n} - P_{rat} - P_{site} + P_{0}) Y | |^{2} \sim σ_{0}^{2} χ_{f_{0}}^{2}, \\ f_{1} {MS}_{1} = | | (P_{rat} - P_{trt}) Y | |^{2} \sim σ_{1}^{2} χ_{f_{1}}^{2}, \end{matrix}

where $f_{1} = l - 2, f_{0} = (l - 1) (s - 1)$ and, for example, $P_{trt} Y$ is the projection of Y on the subspace spanned by the treatment basis. For the null hypothesis θ = 0, Fig. 3 compares the simulated values of $Λ / W^{2}$ with the linear approximation $1 + 4 s \hat{θ} / 3$ from (2).

Fig. 3. Simulated $Λ / W^{2}$ plotted against $\hat{θ}$ for the nested-block arrangement with l = 24 large blocks and s = 5 nested blocks.

Maximization of the marginal loglikelihood function $\overset{⌣}{ℓ} (θ)$ from (9) is equivalent. Both normed loglikelihood functions are graphed in Fig. 2 for $\hat{θ} = \overset{⌣}{θ} = 0.6$ ⁠, from which the anomalous geometry is apparent.
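A simulation in the spirit of this subsection can be sketched as follows. As a shortcut, and unlike the procedure described above, the sketch draws the two mean squares directly from their scaled chi-squared distributions rather than generating Y. The way in which the true θ is swept over [0, 1], the seed and the near-boundary cut-off are all assumptions made for illustration.

```python
import math
import random

def simulate_ratio(theta, l=24, s=5, sigma0_sq=1.0, rng=random):
    """Draw one pair of mean squares and return (theta_hat, Lambda / W^2),
    or None when F is at, or very close to, the boundary value 1."""
    f1, f0 = l - 2, (l - 1) * (s - 1)
    f = f0 + f1
    # A chi-squared variable on k degrees of freedom is Gamma(k / 2, scale 2).
    ms0 = sigma0_sq * rng.gammavariate(f0 / 2.0, 2.0) / f0
    ms1 = sigma0_sq * (1.0 + s * theta) * rng.gammavariate(f1 / 2.0, 2.0) / f1
    F = ms1 / ms0
    if F <= 1.001:  # ratio is 0/0 on the boundary and unstable very near it
        return None
    lam = f * math.log1p(f1 * (F - 1.0) / f) - f1 * math.log(F)
    w2 = ((F - 1.0) / F) ** 2 * f0 * f1 / (2.0 * f)
    return (F - 1.0) / s, lam / w2

rng = random.Random(2024)  # fixed seed for reproducibility
draws = []
for rep in range(1000):
    theta = rep / 999.0  # assumed design: true theta swept evenly over [0, 1]
    out = simulate_ratio(theta, rng=rng)
    if out is not None:
        draws.append(out)

# Compare a few simulated ratios with the linear approximation 1 + 4 s theta_hat / 3.
for theta_hat, ratio in draws[:5]:
    print(round(theta_hat, 3), round(ratio, 3), round(1.0 + 4.0 * 5 * theta_hat / 3.0, 3))
```

Plotting the pairs in `draws` against the line $1 + 4 s \hat{θ} / 3$ gives a picture of the same kind as Fig. 3, though not the same realization.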

6.2 Covariance determined by Latin square blocking

The Latin square covariance $Σ = σ_{0}^{2} (I_{n} + θ_{1} V_{1} + θ_{2} V_{2})$ is a linear combination of the identity and two binary matrices of the form ${(V_{1})}_{i j} = 1$ if observational units i and j share a column and ${(V_{2})}_{i j} = 1$ if they share a row. For the simulation, the variance-component parameters were taken as $σ_{0}^{2} = 1, θ_{2} = 0.2$ ⁠, while θ₁ was varied in the range $0 \leq θ_{1} \leq 1$ ⁠. The relevant mean squares on $f_{0} = (b - 1) (b - 2)$ and $f_{1} = (b - 1)$ degrees of freedom are

\begin{matrix} f_{0} {MS}_{0} = | | (I_{n} - P_{col} - P_{row} - P_{trt} + 2 P_{0}) Y | |^{2} \sim σ_{0}^{2} χ_{f_{0}}^{2}, \\ f_{1} {MS}_{1} = | | (P_{col} - P_{0}) Y | |^{2} \sim σ_{1}^{2} χ_{f_{1}}^{2}, \end{matrix}

where $σ_{1}^{2} = σ_{0}^{2} (1 + b θ_{1})$ ⁠. The analogous mean square ${MS}_{2}$ for rows on f₂ = f₁ degrees of freedom is used in variance estimation. Specifically, the information about θ₁, having adjusted for estimation of $σ_{0}^{2}$ and θ₂, is

\frac{b^{2} f_{1}}{2 {(1 + b θ_{1})}^{2}} {1 - \frac{f_{1}^{2} {(1 + b θ_{2})}^{2}}{f_{2} f {(1 + b θ_{1})}^{2}}} .

(11)

In (11), θ_j is estimated as the positive part of $(F_{j} - 1) / b$ with $F_{j} = {MS}_{j} / {MS}_{0}$ ⁠. For $f = f_{1} + f_{0} ≫ f_{1}$ ⁠, which amounts to b being large, the adjustment is negligible provided that θ₂ is not too large relative to θ₁, and (11) is comparable to (1). The resulting Wald statistic for testing the hypothesis $θ_{1} = 0$ is to be compared with Λ, whose form is identical to that of § 3.

Figure 4 depicts qualitatively similar behaviour for the ratio $Λ / W^{2}$ as that of Fig. 3, with additional variability attributable to randomness in ${\hat{θ}}_{2}$ manifesting through the estimate of (11). The version with θ₂ treated as known is depicted in Fig. 4(b).

Fig. 4. Simulated $Λ / W^{2}$ plotted against ${\hat{θ}}_{1}$ for the Latin square with b = 6 rows, columns and treatments. (a): θ₂ estimated in (11); (b): θ₂ treated as known.

Acknowledgement

Section 3.4 was based on a detailed comparison supplied by one of the two referees, to whom we are grateful.

Supplementary material

The Supplementary Material provides more explicit derivations for some of the key results, together with a discussion of three further Wald statistics and two score statistics.

References

Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proc. R. Soc. A 155, 268–.

Chernoff (1954). On the distribution of the likelihood ratio. Ann. Math. Statist., 573–.

Dickey, D. A. (2020). A warning about Wald tests. SAS Global Forum, Paper 5088-2020.

Geyer, C. J. (1994). On the asymptotics of constrained M-estimation. Ann. Statist., 1993–2010.

Kariya (1980). Locally robust tests for serial correlation in least squares regression. Ann. Statist., 1065–.

King, M. L. (1980). Robust tests for spherical symmetry and their application to least squares regression. Ann. Statist., 1265–.

McCullagh (2009). Marginal likelihood for distance matrices. Statist. Sinica, 631–.

McCullagh (2023). Ten Projects in Applied Statistics. Cham: Springer.

Patterson, H. D. & Thompson (1971). Recovery of inter-block information when block sizes are unequal. Biometrika, 545–.

Self, S. G. & Liang, K-Y. (1987). Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions. J. Am. Statist. Assoc., 605–.

Vu, H. T. V. & Zhou (1997). Generalization of likelihood ratio tests under nonstandard conditions. Ann. Statist., 1993–2010.

Wald (1947). A note on regression analysis. Ann. Math. Statist., 586–.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
