Abstract

The least trimmed squares (LTS) estimator is a popular robust regression estimator. It finds a subsample of h ‘good’ observations among n observations and applies least squares on that subsample. We formulate a model in which this estimator is maximum likelihood. The model has ‘outliers’ of a new type, where the outlying observations are drawn from a distribution with values outside the realized range of h ‘good’, normal observations. The LTS estimator is found to be h^{1/2}-consistent and asymptotically standard normal in the location-scale case. Consistent estimation of h is discussed. The model differs from the commonly used ϵ-contamination models and opens the door for statistical discussion on contamination schemes, new methodological developments on tests for contamination as well as inferences based on the estimated good data.

1 Introduction

The LTS estimator suggested by Rousseeuw (1984) is a popular robust regression estimator. It is defined as follows. Consider a sample with n observations, where some are thought to be ‘good’ and others are ‘outliers’. The user specifies that there are h ‘good’ observations. The LTS estimator finds the subsample of h observations with the smallest residual sum of squares. Rousseeuw (1984) developed LTS in the tradition of Huber (1964) and Hampel (1971), who were instrumental in formalizing robust statistics. Huber suggested a framework of i.i.d. errors from an ϵ-contaminated normal distribution. Hampel formalized robustness and breakdown points. The present work, with its focus on maximum likelihood, deviates from this tradition.

Specifically, we propose a model in which the LTS estimator is maximum likelihood, a new approach to robust statistics. In this model, we first draw h ‘good’ regression errors from a normal distribution. Conditionally on these ‘good’ errors, we draw n−h ‘outlier’ errors from a distribution with support outside the range of the drawn ‘good’ errors. The model is semi-parametric, so we apply a general notion of maximum likelihood. We provide an asymptotic theory for a location-scale model and find that the LTS estimator is h^{1/2}-consistent and asymptotically standard normal. More than 50% contamination can be allowed under mild regularity conditions on the distribution function for the ‘outliers’. Interestingly, the associated scale estimator does not require a consistency factor. In practice, h is typically unknown. We suggest a consistent estimator for the proportion of ‘good’ observations, h/n.

The approach of asking in which model the LTS estimator is maximum likelihood is similar to that taken by Gauss in 1809 (Hald, 2007, Section 5.5, 7.2). In the terminology of Fisher (1922), Gauss asked in which continuous i.i.d. location model is the arithmetic mean the maximum-likelihood estimator and found the answer to be the normal model. Maximum likelihood often brings a host of attractive features. The model provides interpretation and reveals the circumstances under which an estimator works well in terms of nice distributional results and optimality properties. Further, we have a framework for testing the goodness-of-fit, which leads to the possibility of first refuting and then improving a model.

To take advantage of these attractive features of the likelihood framework, we follow Gauss and suggest a model in which the LTS estimator is maximum likelihood. The model is distinctive in that the errors are not i.i.d. Rather, the h ‘good’ errors are i.i.d. normal, whereas the n−h ‘outlier’ errors are i.i.d., conditionally on the ‘good’ errors, with distributions assigning zero probability to the realized range of the ‘good’ errors. When h = n, we have a standard i.i.d. normal model, just as the LTS estimator reduces to the ordinary least squares (OLS) estimator. The model is semi-parametric, so we use an extension of traditional likelihoods, in which we compare pairs of probability measures (Kiefer & Wolfowitz, 1956) and consider probabilities of small hyper-cubes including the data point (Fisher, 1922; Scholz, 1980).

In practice, it is of considerable interest to develop a theory for inference for LTS. Within a framework of i.i.d. ϵ-contaminated errors, the asymptotic theory depends on the contamination distribution and the scale estimator requires a consistency correction (Butler, 1982; Čížek, 2005; Croux & Rousseeuw, 1992; Rousseeuw, 1984; Víšek, 2006). Since the contamination distribution is unknown in practice, inference is typically done using the asymptotic distribution of the LTS estimator derived as if there is no contamination. This seems fine for an infinitesimal deviation from the central normal model (Huber & Ronchetti, 2009, Section 12). However, these approaches would lead to invalid inference in case of stronger contamination—see Section 7.1. Within the present framework, the asymptotic theory is simpler. Specifically, we derive the asymptotic properties of the LTS estimator for a location-scale version of the presented model and find that the LTS estimator has the same asymptotic theory as the infeasible OLS estimator computed from the ‘good’ data, when it is known which data are ‘good’. As the asymptotic distribution does not depend on the contamination distribution, inference is much simpler.

In regression, a major concern is that the OLS estimator may be very sensitive to inclusion or omission of particular data points, referred to as ‘bad’ leverage points. LTS provides one of the first high-breakdown point estimators suggested for regression that also avoids the problem of ‘bad’ leverage. The presented model allows ‘bad’ leverage points.

Another key feature of the LTS model presented here is a separation of ‘good’ and ‘outlying’ errors, where the ‘outliers’ are placed outside the realized range of the ‘good’, normal observations. This way, the asymptotic error in estimating the set of ‘good’ observations is so small that it does not influence the asymptotic distribution of the LTS estimator. This throws light on the discussion regarding estimators’ ability to separate overlapping populations of ‘good’ and ‘outlying’ observations (Doornik, 2016; Riani et al., 2014).

LTS is widely used in its own right and often as a starting point for algorithms such as the MM estimator (Yohai, 1987) and the Forward Search (Atkinson et al., 2010). Many variants of LTS have been developed: nonlinear regression in time series (Čížek, 2005), algorithms for fraud detection (Rousseeuw et al., 2019) and sparse regression (Alfons et al., 2013).

In all cases, the practitioner must choose h, the number of ‘good’ observations. In our reading, this remains a major issue in robust statistics. We propose a consistent estimator for the proportion of ‘good’ observations, h/n, in a location-scale model and study its finite sample properties by simulation. We apply it in the empirical example in Section 8.

The least median of squares (LMS) estimator, also suggested by Rousseeuw (1984), is closely related to LTS. The LMS estimator seemed numerically more attractive than LTS until the advent of the fast LTS approximation to LTS (Rousseeuw & van Driessen, 2000). Nonetheless, LMS remains of considerable interest. Replacing the normal distribution in the LTS model with a uniform distribution gives a model in which LMS is maximum likelihood. In the Online supplementary material, we show that the LMS estimator is h-consistent and asymptotically Laplace in the location-scale case. This is at odds with the slow n^{1/3} consistency rate found in the context of i.i.d. models.

We start by presenting the LTS estimator in Section 2. The LTS regression model is given in Section 3. The general maximum likelihood concept and its application to LTS are presented in Section 4 with details in Appendix  A. Section 5 presents an asymptotic theory for LTS in the location-scale case with proofs in Appendix  B. Estimation of h is discussed in Section 6. Monte Carlo simulations are presented in Section 7. An empirical illustration is given in Section 8.

In a supplement, the LMS estimator is analysed and detailed derivations of some identities in the LTS proof are given. All codes for simulations and empirical illustration are available from https://users.ox.ac.uk/˜nuff0078/Discuss and as supplementary files to the paper.

Notation: Vectors are column vectors. The transpose of a vector v is denoted v'.

2 The LTS estimator

The available data are a scalar y_i and a p-vector of regressors x_i for i = 1,…,n. We consider a regression equation y_i = β'x_i + σε_i with regression parameter β and scale σ.

The least trimmed squares (LTS) estimator suggested by Rousseeuw (1984) is defined as follows. Given a value of β, the residuals are r_i(β) = y_i − β'x_i. The ordered squared residuals are denoted r_{(1)}^2(β) ≤ ⋯ ≤ r_{(n)}^2(β). The user chooses an integer h ≤ n. Given that choice, the sum of the h smallest squared residuals is obtained. Minimizing over β gives the LTS estimator

(2.1) \hat\beta = \arg\min_\beta \sum_{i=1}^{h} r_{(i)}^2(\beta).

The LTS minimization classifies the observations as ‘good’ or ‘outliers’. The set of indices of the h ‘good’ observations is an h-subset of (1,…,n). We let ζ̂ denote the set of indices i of the estimated ‘good’ observations satisfying r_i^2(β̂) ≤ r_{(h)}^2(β̂). Rousseeuw and van Driessen (2000) point out that β̂ is a minimizer over least-squares estimators, that is β̂ = β̂_ζ̂, where

(2.2) \hat\beta_\zeta = \Big(\sum_{i\in\zeta} x_i x_i'\Big)^{-1} \sum_{i\in\zeta} x_i y_i.

In the model proposed below, the maximum-likelihood estimator for the scale is the residual variance of the estimated ‘good’ observations

(2.3) \hat\sigma^2 = h^{-1} \sum_{i\in\hat\zeta} (y_i - \hat\beta' x_i)^2.

We show the consistency of σ̂² in the location-scale case. The problem of estimating the scale is intricately linked to the choice of model for the innovations ε_i. Previously, Croux and Rousseeuw (1992) considered σ̂² in the context of a regression model with i.i.d. innovations with distribution function F. For that model, they found that σ̂² should be divided by a consistency factor defined as the conditional variance of ε_i given that |F(ε_i) − 1/2| ≤ h/(2n). In practice, F is unknown, so the normal distribution is often chosen for the consistency factor.
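As an illustration of (2.1)–(2.3), the estimator can be computed exactly for a very small sample by enumerating all h-subsets and applying least squares to each. The following is a minimal base-R sketch with simulated, illustrative data; it ignores the no-repetition constraint discussed in Section 4 and is only feasible for small n.

```r
## Exact LTS by exhaustive search: run OLS on every h-subset and keep the
## subset with the smallest residual sum of squares (equivalent to (2.1)).
set.seed(1)
n <- 12; h <- 9
x <- cbind(1, rnorm(n))                      # regressors, including an intercept
y <- drop(x %*% c(1, 2)) + rnorm(n)
y[10:12] <- y[10:12] + 10                    # three contaminated observations

best <- list(rss = Inf)
for (idx in combn(n, h, simplify = FALSE)) {
  fit <- lm.fit(x[idx, , drop = FALSE], y[idx])
  rss <- sum(fit$residuals^2)
  if (rss < best$rss) best <- list(rss = rss, idx = idx, beta = fit$coefficients)
}
beta_hat   <- best$beta      # LTS estimator, the minimizer over subsets as in (2.2)
sigma2_hat <- best$rss / h   # residual variance (2.3), no consistency factor
zeta_hat   <- best$idx       # estimated 'good' observations
```

For realistic sample sizes, the fast LTS algorithm of Rousseeuw and van Driessen (2000), as implemented in ltsReg, replaces this exhaustive search.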

3 The LTS regression model

We formulate the LTS model where h ‘good’ errors are assumed i.i.d. normal, while n−h ‘outliers’ are located outside the realized range of the ‘good’ errors. This differs from the ϵ-contaminated normal model of Huber (1964), where i.i.d. errors are normal with probability 1−ϵ and otherwise drawn from a contamination distribution. The supports of the normal and the contamination distributions overlap in the ϵ-contamination case, whereas ‘good’ and ‘outlying’ observations are separated in the below LTS model.

LTS regression model

 
Model 3.1

Consider the regression model y_i = β'x_i + σε_i for data y_i, x_i with i = 1,…,n. Let h ≤ n be given. Let x_1,…,x_n be deterministic. Let ζ be a set with h elements from 1,…,n.

For i ∈ ζ, let ε_i be i.i.d. N(0,1) distributed.

For j ∉ ζ, let ξ_j be independent with distribution functions G_j(z) for z ∈ ℝ, where G_j are continuous at 0. The ‘outlier’ errors are defined by

(3.1) \varepsilon_j = \Big(\max_{i\in\zeta} \varepsilon_i + \xi_j\Big) 1(\xi_j > 0) + \Big(\min_{i\in\zeta} \varepsilon_i + \xi_j\Big) 1(\xi_j < 0).

The parameters are β ∈ ℝ^{dim x}, σ > 0, ζ which is any h-subset of 1,…,n, and G_j which are n−h arbitrary distributions on ℝ that are continuous at 0.

Finally, suppose Σ_{i∈ζ} x_i x_i' is invertible for all ζ.

The ‘outlier’ distributions G_j are parameters in the LTS model. They are constrained to be continuous at zero to avoid overlap of ‘good’ and ‘outlier’ errors. Although the regressors are deterministic in Model 3.1, randomness of x_1,…,x_n can easily be accommodated by a conditional model where the distributions G_j(z) are replaced by conditional distributions G_j(z|x_j). The likelihood below then becomes a conditional likelihood.
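A draw from Model 3.1 can be sketched in a few lines of R. In this illustrative sketch all centred ‘outlier’ distributions G_j are taken to be N(1,1), an arbitrary choice that is continuous at zero; the placement outside the realized range of the ‘good’ errors follows (3.1).

```r
## Simulate one sample from the LTS regression model (Model 3.1).
set.seed(2)
n <- 50; h <- 40
x <- cbind(1, runif(n)); beta <- c(1, 2); sigma <- 0.5
zeta <- sort(sample(n, h))                  # indices of the 'good' observations
eps  <- numeric(n)
eps[zeta] <- rnorm(h)                       # 'good' errors, i.i.d. N(0,1)
xi <- rnorm(n - h, mean = 1)                # centred 'outlier' draws, continuous at 0
eps[-zeta] <- ifelse(xi > 0,                # place 'outliers' outside the realized
                     max(eps[zeta]) + xi,   # range of the 'good' errors, as in (3.1)
                     min(eps[zeta]) + xi)
y <- drop(x %*% beta) + sigma * eps
```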

The LTS model allows leverage points, since Gj can vary with j. Informally, an observation is a ‘bad’ leverage point for the OLS estimator, if that estimator is very sensitive to inclusion or omission of that observation (Rousseeuw & Leroy, 1987, Section 1.1). An example is the star data shown in Figure 1 and used in the empirical illustration in Section 8.

Figure 1. Star data and fit by LTS with different values of h. Log light intensity against log temperature. Solid dots are estimated ‘good’ observations for h = 42.

A consequence of the model is that the ‘good’ errors must have consecutive order statistics. Randomly, according to the choice of Gj, some ‘outlier’ errors are to the left of the ‘good’ errors. The count of left ‘outlier’ errors is the random variable

(3.2) \delta = \sum_{j\notin\zeta} 1(\xi_j < 0).

Thus, the ordered errors satisfy

(3.3) \varepsilon_{(1)} < \cdots < \varepsilon_{(\delta)} < \varepsilon_{(\delta+1)} < \cdots < \varepsilon_{(\delta+h)} < \varepsilon_{(\delta+h+1)} < \cdots < \varepsilon_{(n)}.

The set ζ collects the indices of the observations corresponding to ε_(δ+1),…,ε_(δ+h). The difficulty in robust regression is of course that the errors are unknown. We note that the extreme ‘good’ errors, ε_(δ+1) and ε_(δ+h), are finite in any realization. As the ‘good’ errors are normal, the extremes grow at a (2 log h)^{1/2} rate for large h, see Example 5.1 below. Likewise, the ‘outlier’ errors are also finite, but located outside the realized range from ε_(δ+1) to ε_(δ+h).

4 Maximum-likelihood estimation in the LTS model

The LTS model is semi-parametric. We therefore start with a general maximum likelihood concept before proceeding to analysis of the LTS model.

4.1 A general definition of maximum likelihood

Traditional parametric maximum likelihood is defined in terms of densities, which are not well-defined here. Thus, we follow the generalization proposed by Scholz (1980), which has two ingredients. First, it uses pairwise comparison of measures, as suggested by Kiefer and Wolfowitz (1956), see Johansen (1978) and Gissibl et al. (2021) for applications. This way, a dominating measure is avoided. Second, it compares probabilities of small sets that include the data point, following the informality of Fisher (1922). This way, densities are not needed. Scholz’ approach is suited to the present situation, where the candidate maximum-likelihood estimator is known and we are searching for a model.

We consider data in ℝ^n and can therefore simplify the approach of Scholz. Let 𝒫 be a family of probability measures on the Borel sets of ℝ^n. Given a (data) point z ∈ ℝ^n and a distance ϵ > 0, define the hypercube C_z^ϵ = (z_1−ϵ, z_1] × ⋯ × (z_n−ϵ, z_n], which is a Borel set.

 
Definition 4.1

For P, Q ∈ 𝒫 write P <_z Q if limsup_{ϵ→0} {P(C_z^ϵ)/Q(C_z^ϵ)} < 1 and P ≤_z Q if limsup_{ϵ→0} {P(C_z^ϵ)/Q(C_z^ϵ)} ≤ 1, where by convention 0/0 = 1.

Following Scholz, define P, Q to be equivalent at z and write P =_z Q if P ≤_z Q and Q ≤_z P. We get that (i) P =_z Q if and only if lim_{ϵ→0} {P(C_z^ϵ)/Q(C_z^ϵ)} exists and equals 1; (ii) P <_z Q and Q <_z R imply P <_z R (transitivity); and (iii) P =_z P for all P ∈ 𝒫 (reflexivity).

 
Definition 4.2

Let 𝒫 = {P_θ | θ ∈ Θ} be a parametrized family of probability measures. Then L^ϵ(θ) = P_θ(C_y^ϵ) is the ϵ-likelihood at the data point y. Further, θ̂ is a maximum-likelihood estimator if P_θ ≤_y P_θ̂ for all θ ∈ Θ.

The maximum-likelihood estimator is unique if P <_y P_θ̂ for all P ≠ P_θ̂ in 𝒫. As for Scholz, traditional maximum likelihood is a special case. Further, the empirical distribution function is the maximum-likelihood estimator in a model of i.i.d. draws from an unknown distribution.

4.2 The LTS likelihood

To obtain the ϵ-likelihood for the LTS regression Model 3.1, we find the probability that the random n-vector of observations y_i belongs to an ϵ-cube C_z^ϵ. Throughout the argument, the regressors x_i are kept fixed. Conditioning ‘outliers’ on ‘good’ observations, we get

(4.1) \mathsf{P}(y \in C_z^\epsilon) = \prod_{i\in\zeta} \mathsf{P}\{y_i \in (z_i-\epsilon, z_i]\} \prod_{j\notin\zeta} \mathsf{P}\{y_j \in (z_j-\epsilon, z_j] \mid (y_i,\, i\in\zeta)\}.

For the first product for ‘good’ observations, we note that ε_i = (y_i − β'x_i)/σ is standard normal and define z_{iβσ} = (z_i − β'x_i)/σ and z_{iβσϵ} = z_{iβσ} − ϵ/σ. Then, the factors of the first product in (4.1) are Φ(z_{iβσ}) − Φ(z_{iβσϵ}), which we denote Δ_ϵΦ(z_{iβσ}).

For the second product, let y_{iβσ} = (y_i − β'x_i)/σ = ε_i and introduce

\tilde z_{j\beta\sigma} = \Big(z_{j\beta\sigma} - \max_{i\in\zeta}\varepsilon_i\Big) 1\Big(z_{j\beta\sigma} > \max_{i\in\zeta}\varepsilon_i\Big) + \Big(z_{j\beta\sigma} - \min_{i\in\zeta}\varepsilon_i\Big) 1\Big(z_{j\beta\sigma} < \min_{i\in\zeta}\varepsilon_i\Big),

and a similar expression z̃_{jβσϵ} for z_{jβσϵ}, noting that z̃_{jβσϵ} = 0 if z_{jβσϵ} falls within the range from min_{i∈ζ} ε_i to max_{i∈ζ} ε_i. A derivation in Appendix A shows that the factors of the second product in (4.1) are Δ_ϵG_j(z̃_{jβσ}) = G_j(z̃_{jβσ}) − G_j(z̃_{jβσϵ}). With this notation, the probability (4.1) of the ϵ-cube is

(4.2) \mathsf{P}(y \in C_z^\epsilon) = \prod_{i\in\zeta} \Delta_\epsilon\Phi(z_{i\beta\sigma}) \prod_{j\notin\zeta} \Delta_\epsilon G_j(\tilde z_{j\beta\sigma}).

The ϵ-likelihood arises from the probability (4.2) evaluated in the data point. That is, we replace the vector z by the data vector y. We define ỹ_{jβσ} and Δ_ϵG_j(ỹ_{jβσ}) in a similar fashion as before. Thus, we get the ϵ-likelihood

(4.3) L^\epsilon(\beta, \sigma, \zeta, G_j \text{ for } j\notin\zeta) = \prod_{i\in\zeta} \Delta_\epsilon\Phi(y_{i\beta\sigma}) \prod_{j\notin\zeta} \Delta_\epsilon G_j(\tilde y_{j\beta\sigma}).

We note that the ‘good’ errors cannot be repeated due to the assumptions that the ‘good’ observations are continuous and that the centred ‘outliers’ are continuous at zero. Thus, parameter values resulting in repetitions of the ‘good’ errors are to be ignored. As an example, suppose we have n = 10 observations and we have chosen a β so that the residuals y_i − β'x_i take the ordered values: 1, 1, 1, 2, 3, 6, 6, 7, 8, 9. The values 1 and 6 are repetitions and cannot be ‘good’. Thus, for h = 3, we can only select ζ as the index triplet 7, 8, 9. All other choices are assigned a zero likelihood. The issue is essentially the same as in OLS theory, where the least-squares estimator may be useful when there are repetitions, but it cannot be maximum likelihood, as repetitions happen with probability zero under normality.

4.3 Maximum-likelihood estimation

The two products in the ϵ-likelihood (4.3) resemble a standard normal likelihood and a product of n−h likelihoods, each with one observation from an arbitrary distribution. We will exploit those examples using profile likelihood arguments.

First, suppose β, σ, ζ are given. Then, the first product in the LTS ϵ-likelihood (4.3) is constant. In the second product, the jth factor only depends on G_j. Since the G_j functions vary in a product space, we maximize each factor separately. The maximum value for Δ_ϵG_j(ỹ_{jβσ}) is unity for any ϵ > 0 and regardless of the value of β, σ, ζ. The maximum is attained for any distribution function that is continuous at zero and that rises from zero to unity over the interval from ỹ_{jβσϵ} to ỹ_{jβσ}. This set of distribution functions includes the distribution with a point mass at ỹ_{jβσ}, which satisfies the requirement of continuity at zero, because the ‘outliers’ are separated from the ‘good’ observations. Moreover, the point mass distribution is unique in the limit for vanishing ϵ. Thus, maximizing (4.3) over G_j gives the profile ϵ-likelihood

(4.4) L_G^\epsilon(\beta, \sigma, \zeta) = \prod_{i\in\zeta} \Delta_\epsilon\Phi(y_{i\beta\sigma}).

Second, suppose ζ is given. We argue that the unique maximum-likelihood estimator in the sense of Definition 4.2 is given by

(4.5) \hat\beta_\zeta = \arg\min_\beta \sum_{i\in\zeta} (y_i - \beta' x_i)^2, \qquad \hat\sigma_\zeta^2 = h^{-1} \sum_{i\in\zeta} (y_i - \hat\beta_\zeta' x_i)^2.

We need to show that limsup_{ϵ→0} {L_G^ϵ(β, σ, ζ)/L_G^ϵ(β̂_ζ, σ̂_ζ, ζ)} < 1 for all (β, σ) ≠ (β̂_ζ, σ̂_ζ). Note that ϵ^{−h} L_G^ϵ(β, σ, ζ) converges to a standard likelihood for a regression with normal errors. Therefore, the condition is satisfied as long as Σ_{i∈ζ} x_i x_i' is invertible. Thus, maximizing (4.4) over (β, σ) gives a profile ϵ-likelihood for ζ satisfying, for ϵ → 0,

(4.6) \hat L_\zeta^\epsilon = \max_{\beta, \sigma, G} L^\epsilon(\beta, \sigma, \zeta, G) \sim \epsilon^h (2\pi e\, \hat\sigma_\zeta^2)^{-h/2}.
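To spell out the limit used in this argument, divide each ‘good’ factor in (4.4) by ϵ and let ϵ → 0:

\epsilon^{-h} L_G^{\epsilon}(\beta,\sigma,\zeta) = \prod_{i\in\zeta} \frac{\Phi(y_{i\beta\sigma}) - \Phi(y_{i\beta\sigma} - \epsilon/\sigma)}{\epsilon} \;\to\; \prod_{i\in\zeta} \sigma^{-1}\varphi(y_{i\beta\sigma}),

which is the Gaussian likelihood maximized by the least-squares estimators in (4.5).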

Third, the profile likelihood is maximized by choosing ζ so that σ^ζ is as small as possible. We note that the maximization is subject to the constraint that none of the ‘good’ residuals are repeated. Apart from the latter constraint, this is the LTS estimator described in (2.2). We summarize.

 
Theorem 4.1

The LTS regression Model 3.1 has ϵ-likelihood L^ϵ(β, σ, ζ, G_j for j ∉ ζ) defined in (4.3). The maximum-likelihood estimator is found as follows. For any h-subsample ζ, let β̂_ζ and σ̂_ζ be the least-squares estimators in (4.5). Let ζ̂ = argmin_ζ σ̂²_ζ subject to the constraint that ε̂_i ≠ ε̂_ℓ for i ∈ ζ, 1 ≤ ℓ ≤ n and i ≠ ℓ, where ε̂_i = y_i − β̂_ζ'x_i. Then β̂ = β̂_ζ̂ and σ̂ = σ̂_ζ̂.

5 Asymptotic theory for the location-scale model

We present an asymptotic theory of the LTS estimator in the special case of a location-scale model y_i = μ + σε_i. In this model, the observations y_i and the unknown errors ε_i have the same ranks. Thus, the ordering in (3.3) is observable and we only need to consider values of ζ corresponding to indices of observations of the form y_(δ+1),…,y_(δ+h) for some δ = 0,…,n−h. Following Rousseeuw and Leroy (1987, p. 171), the LTS estimators then reduce to μ̂ = μ̂_δ̂ and σ̂ = σ̂_δ̂, where

(5.1) \hat\mu_\delta = h^{-1} \sum_{i=\delta+1}^{\delta+h} y_{(i)}, \qquad \hat\sigma_\delta^2 = h^{-1} \sum_{i=\delta+1}^{\delta+h} \big(y_{(i)} - \hat\mu_\delta\big)^2, \qquad \hat\delta = \arg\min_{0 \le \delta \le n-h} \hat\sigma_\delta^2.
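In this case the estimator can be computed by a single pass over the n − h + 1 contiguous blocks of order statistics. A minimal sketch with illustrative data (separated right ‘outliers’) is:

```r
## Location-scale LTS: scan the contiguous blocks of h order statistics (5.1).
lts_loc_scale <- function(y, h) {
  ys <- sort(y); n <- length(y)
  best <- list(s2 = Inf)
  for (delta in 0:(n - h)) {
    block <- ys[(delta + 1):(delta + h)]
    m  <- mean(block)
    s2 <- mean((block - m)^2)               # divisor h, no consistency factor
    if (s2 < best$s2) best <- list(s2 = s2, mu = m, delta = delta)
  }
  best                                      # sigma2-hat, mu-hat and delta-hat
}
set.seed(3)
y <- c(rnorm(80), 6 + abs(rnorm(20)))       # 80 'good' observations, 20 right 'outliers'
lts_loc_scale(y, h = 80)
```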

5.1 Sequence of data generating processes

We consider a sequence of LTS models indexed by n, so that h_n → ∞ as n → ∞. In this sequence, the ‘outliers’ have a common distribution G, which is continuous at zero. If h_n = n, the LTS estimator reduces to the full sample least-squares estimator with standard asymptotic theory. Here, we choose

(5.2) h_n/n \to \lambda \quad \text{as } n \to \infty,

where λ is the asymptotic proportion of ‘good’ observations. The parameters ζ_n, δ_n vary with n, while μ, σ, G are constant in n.

We reparametrize G in terms of separate distributions for left and right ‘outliers’. Let

(5.3) \rho = G(0), \qquad \underline G(x) = \frac{G(0) - G(-x)}{G(0)}, \qquad \bar G(x) = \frac{G(x) - G(0)}{1 - G(0)} \quad \text{for } x \ge 0.

Thus, as ξ_j is G-distributed, we have that ε̲_j = −ξ_j is G̲-distributed when ξ_j < 0 and ε̄_j = ξ_j is Ḡ-distributed when ξ_j > 0. This leads to

\varepsilon_j = \Big(\max_{i\in\zeta} \varepsilon_i + \bar\varepsilon_j\Big) 1(\xi_j > 0) + \Big(\min_{i\in\zeta} \varepsilon_i - \underline\varepsilon_j\Big) 1(\xi_j < 0).

This way, the ‘outliers’, ε_j for j ∉ ζ_n, can be constructed through a binomial experiment as follows. Draw n−h independent Bernoulli(ρ) variables. If the jth variable is unity, then ε_j = min_{i∈ζ} ε_i − ε̲_j. If it is zero, then ε_j = max_{i∈ζ} ε_i + ε̄_j. Hence, the number of left ‘outliers’ is

(5.4) \delta_n = \sum_{j\notin\zeta_n} 1(\xi_j < 0), \qquad \text{with } \delta_n/(n - h_n) \to \rho = G(0) \ \text{a.s.}

by the Law of Large Numbers. In summary, the sequence of data generating processes is defined by μ, σ, ρ, G̲, Ḡ, and h_n, δ_n.

5.2 The main result

We show that the asymptotic distribution of the LTS estimator is the same as it would have been, if it had been known which observations were ‘good’. Here, we consider the case where the ‘good’ observations are normal and there are more ‘good’ observations than ‘outliers’. The result does not require any regularity conditions for the common ‘outlier’ distribution G.

 
Theorem 5.1
Consider the sequence of LTS location-scale models, where h_n/n → λ with 1/2 < λ < 1. Suppose that ε_i for i ∈ ζ_n are i.i.d. N(0,1) distributed. Then, for any η > 0,

We expect that Theorem 5.1 would generalize to the LTS regression Model 3.1. That is, the LTS regression estimator will have the same asymptotic theory as if we knew which observations were ‘good’. The proof of such a result is, however, an open problem.

Theorem 5.1 for the LTS location-scale model differs from the known results for i.i.d. models. Butler (1982) proved a general result for the model with i.i.d. errors, showing that n^{1/2}(μ̂ − μ) is asymptotically normal when the errors are symmetric with a strongly unimodal distribution. Further, the asymptotic variance involves an efficiency factor that differs from unity even in the normal case. He also gives an asymptotic theory for nonsymmetric errors. Further, Víšek (2006) analysed the regression estimator for i.i.d. errors. Johansen and Nielsen (2016b, Theorem 5) provide an asymptotic expansion of the scale estimator under normality.

A corresponding result for the LMS estimator is given in the supplement. The LMS estimator is maximum likelihood in a model where the ‘good’ observations are uniform. When there are no outliers, n = h, the LMS estimator is a Chebychev estimator. Knight (2017) provides asymptotic analysis of Chebychev estimators in regression models. Coinciding with that theory, Online Supplementary Material, Theorem C.2 shows that the LMS estimator is super-consistent at an n rate in that model, which contrasts with the well-known n^{1/3} rate for i.i.d. models. However, Online Supplementary Material, Theorem C.2 does require regularity conditions for the ‘outliers’ to give sufficient separation from the ‘good’ observations.

5.3 Extensions of the main result

5.3.1 Nonnormal errors

We start by relaxing the normality assumption. In the spirit of OLS asymptotics, we present sufficient conditions for the ‘good’ errors: 2+ω moments, symmetry, and exponential tails.

 
Assumption 5.1

Suppose ε_i for i ∈ ζ_n are i.i.d. with distribution F satisfying

  (i) E ε_i = 0 and E ε_i² = 1 for i ∈ ζ_n;

  (ii) E|ε_i|^{2+ω} < ∞ for i ∈ ζ_n and some ω > 0;

  (iii) F has infinite support: inf{x: F(x) > 0} = −∞ and sup{x: F(x) < 1} = ∞;

  (iv) ε_(δ_n+1)/ε_(δ_n+h_n) + 1 = o_P(1);

  (v) for all 0 < η < 1 there exists C_η < 1 such that ε_(δ_n+h_n^{1−η})/ε_(δ_n+1), ε_(δ_n+h_n−h_n^{1−η})/ε_(δ_n+h_n) ≤ C_η + o_P(1).

 
Example 5.1

The normal distribution satisfies Assumption 5.1.

For (iv) use that a_n{ε_(δ_n+h_n) − b_n} converges to a type I extreme value distribution, where a_n ∼ b_n ∼ (2 log h_n)^{1/2}, see Leadbetter et al. (1982, Theorem 1.5.3). Here, ∼ denotes asymptotic equivalence. In particular, ε_(δ_n+h_n)/b_n → 1 in probability.

For (v) use Mills' ratio: 1 − Φ(x) ∼ φ(x)/x for x → ∞. Take logarithms to see that Φ^{−1}(1 − 1/s_n) ∼ (2 log s_n)^{1/2} for s_n → ∞. We find C_η ∼ (2 log h_n^{1−η})^{1/2}/(2 log h_n)^{1/2} = (1−η)^{1/2} < 1.

 
Example 5.2

Assumption 5.1 is satisfied by the Laplace distribution and the double geometric distribution with probabilities (1−p)^{|x|} p/(2−p) for x ∈ ℤ. The latter is not in the domain of attraction of an extreme value distribution (Leadbetter et al., 1982, Theorem 1.7.13).

Assumption 5.1(iv) is not satisfied by t_d distributions with d ≥ 1 degrees of freedom. These are in the domain of attraction of extreme value distributions of type II. Gumbel and Keeney (1950) show that ε_(δ_n+1)/ε_(δ_n+h_n) converges to a nondegenerate distribution with median 1.

 
Theorem 5.2

Consider the sequence of LTS location-scale models. Let 1/2<λ<1 and suppose Assumption 5.1. Then, the limiting results in Theorem 5.1 apply.

5.3.2 More ‘outliers’ than ‘good’ observations

In this section, we allow for more ‘outliers’ than ‘good’ observations. Although in the traditional breakdown point analysis there have to be more ‘good’ observations than ‘outliers’ (Rousseeuw & Leroy, 1987, Section 3.4), in practice, LTS estimators are sometimes used with more ‘outliers’ than ‘good’ observations, as in the Forward Search algorithm (Atkinson et al., 2010), for instance. We show that, within the LTS model framework, it is possible to allow for more ‘outliers’ than ‘good’ observations, but in this case, we need some regularity conditions on the ‘outliers’ to make sure the ‘good’ observations can be found.

Let the proportion of ‘good’ observations satisfy 0 < λ < 1. Recall that δ_n/(n−h_n) → ρ = G(0) a.s. is the proportion of the ‘outliers’ that are to the left. The number of ‘outliers’ to the right is n̄ = n − h_n − δ_n. Define also the proportions of left and right ‘outliers’ relative to the number of ‘good’ observations through

\underline\omega = \lim_{n\to\infty} \delta_n/h_n = (1-\lambda)\rho/\lambda, \qquad \bar\omega = \lim_{n\to\infty} (n - h_n - \delta_n)/h_n = (1-\lambda)(1-\rho)/\lambda.

Regularity conditions are needed for the ‘outlier’ distribution when ω̲ ≥ 1 or ω̄ ≥ 1. Note that ω̲ = ω̄ < 1 when the proportion of ‘good’ observations is λ > 1/3 and the proportions of left and right ‘outliers’ are the same, so that ρ = 1/2. The following definition is convenient for analysing the empirical distribution function of the ‘outliers’, evaluated at a random quantile.

 
Definition 5.1

A distribution function H is said to be regular, if it is twice differentiable on an open interval S = ]s̲, s̄[ with s̲ < s̄ so that H(s̲) = 0 and H(s̄) = 1, and the density h and its derivative ḣ satisfy

  • sup_{x∈S} h(x) < ∞ and sup_{x∈S} H(x){1 − H(x)}|ḣ(x)|/{h(x)}² < ∞;

  • if lim_{x↓s̲} h(x) = 0, then h is nondecreasing on an interval to the right of s̲;

  • if lim_{x↑s̄} h(x) = 0, then h is nonincreasing on an interval to the left of s̄.

 
Example 5.3

The normal distribution is regular. Apply Mills' ratio to see this. The exponential distribution with H(x) = 1 − exp(−x) is also regular with S = ℝ₊ and with H(x){1 − H(x)}|ḣ(x)|/{h(x)}² = H(x) ≤ 1, while h(x) is decreasing as x → ∞.

The following assumption requires that the ‘outlier’ distribution is regular and that the ‘outliers’ are more spread out than the ‘good’ observations.

 
Assumption 5.2

Let q > 4. Let ε̲₁ be G̲-distributed and ε̄₁ be Ḡ-distributed, see (5.3).

  (i) If ω̄ ≥ 1, suppose Ḡ is regular with ∫₀^∞ x^q dḠ(x) < ∞ and v̄ = min_{ω̄^{−1} ≤ ς ≤ 1} Var{ε̄₁ | ς − ω̄^{−1} ≤ Ḡ(ε̄₁) ≤ ς} > 1.

  (ii) If ω̲ ≥ 1, suppose G̲ is regular with ∫₀^∞ x^q dG̲(x) < ∞ and v̲ = min_{ω̲^{−1} ≤ ς ≤ 1} Var{ε̲₁ | ς − ω̲^{−1} ≤ G̲(ε̲₁) ≤ ς} > 1.

 
Theorem 5.3

Consider the sequence of LTS location-scale models. Let 0<λ<1 and suppose Assumptions 5.1, 5.2. Then, the limiting results in Theorem 5.1 apply.

Given the novelty (and perhaps at first surprising content) of Theorem 5.3, a discussion of this result and its relationship with Theorem 5.1 seems in order. Recall that Theorem 5.1 has no regularity conditions on the ‘outliers’. The result exploits that there are more than 50% ‘good’ observations and the thin normal tails of these ‘good’ observations separate them sufficiently from the ‘outliers’. In Theorem 5.3, we allow less than 50% ‘good’ observations but require regularity conditions on the outliers to ensure sufficient separation. More specifically, the proof of Theorem 5.3 covers two situations. First, if the proportions of left and right ‘outliers’ are equal and more than one-third of the observations are ‘good’, then ω̄ = ω̲ < 1, so that Assumption 5.2 is nonbinding. Second, in general, Assumption 5.2(i, ii) implies that the ‘outliers’ are spread out more than the ‘good’ observations.

This has a potential parallel with the breakdown point analysis of Rousseeuw and Leroy (1987, Section 3.4), which shows that for a given dataset the LTS estimator remains bounded with less than 50% contamination of an arbitrary type. It may still be possible to obtain the same result allowing for more than 50% contamination that satisfies regularity conditions of the type considered in Theorem 5.3, but we leave this as an open question.

5.4 The OLS estimator in the LTS model

We show that the least-squares estimator μ̄ = n^{−1} Σ_{i=1}^n y_i can diverge in the sequence of LTS models. This implies that the least-squares estimator is not robust within the LTS model in the sense of Hampel (1971). We assume all ‘outliers’ are to the right, so that ρ = 0.

 
Theorem 5.4
Consider the sequence of LTS location-scale models. Let 0 < λ < 1 and ρ = 0. Suppose ε_i for i ∈ ζ_n are i.i.d. with E ε_i = 0, Var ε_i = 1, and infinite support (Assumption 5.1 i, iii). Suppose Ḡ has finite expectation μ_G = ∫₀^∞ {1 − Ḡ(x)} dx. Then, μ̄ diverges. Noting that ε_(h_n) → ∞ in probability, we have that

6 Estimating the proportion of ‘good’ observations

The LTS regression model takes the number of good observations, h, as given. In practice, an investigator has to choose h. Estimation of h is a difficult problem, and methods for estimating h are scarce in the literature. In this section, we start by reviewing a commonly used method for choosing h and discuss its asymptotic properties. We will then propose some new methods for consistently estimating the proportion λ = h/n of ‘good’ observations. The best of these methods is consistent at a log h rate. This is a slow rate and the method may be imprecise about the number h of ‘good’ observations even in large samples. Notwithstanding, to the best of our knowledge, this is the only consistent method available in the literature. Further research and improvements in this area are needed.

6.1 The index plot method

A common method for estimating h is to apply a high breakdown point LTS estimator selecting approximately n/2 observations, then compute scaled residuals for all observations and keep those observations for which the scaled residuals are less than 2.5 in absolute value. This is conveniently done using an index plot (Rousseeuw & Hubert, 1997; Rousseeuw & Leroy, 1987). More specifically, scaled residuals ε̂_{iς} are plotted against the index i. Here, the consistency factor ς = 0.615 is the conditional standard deviation of ε_i given that 1/4 < Φ(ε_i) < 3/4 (Croux & Rousseeuw, 1992). Bands at ±2.5 are displayed on the plot and observations corresponding to scaled residuals ε̂_{iς} beyond these bands are declared outliers. Hence, the estimator of the number of outliers is n − ĥ_IP = Σ_{i=1}^n 1(|ε̂_{iς}| > 2.5). We note that when scaling the residuals, the consistency factor is based on the assumption that all errors are i.i.d. normal.

We can analyse the asymptotic properties of the index plot method using Theorems 4, 5, 7 in Johansen and Nielsen (2016b). This will show that, in a clean, normal sample, the index plot method asymptotically finds γ̂_IP = (n − ĥ_IP)/n →_P γ = P(|ε_i| > 2.5) = 2{1 − Φ(2.5)} = 1.24% outliers. In the terminology of Castle et al. (2011) and Hendry and Santos (2010), γ is the asymptotic gauge of the procedure. Thus, the index plot method will on average estimate a nonzero proportion of outliers, even in uncontaminated samples.
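The gauge calculation is easy to illustrate in a clean sample. The sketch below sidesteps the robust fit and the consistency factor ς and simply uses full-sample standardized residuals, which behave like correctly scaled residuals when there is no contamination.

```r
## Index plot rule in a clean N(0,1) sample: flag |scaled residual| > 2.5.
set.seed(4)
n <- 100000
y <- rnorm(n)
scaled <- (y - mean(y)) / sd(y)             # stand-in for correctly scaled residuals
mean(abs(scaled) > 2.5)                     # observed proportion flagged
2 * (1 - pnorm(2.5))                        # asymptotic gauge, about 1.24%
```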

6.2 Methods based on the LTS model

We consider methods for consistent estimation of the proportion of ‘good’ observations in the LTS model. This is a semi-parametric model selection problem where the dimension of the parameter space increases with decreasing h, since the number of G_j functions increases when the number of ‘outliers’ increases. We consider three estimation methods. In all cases, suppose h ≥ h̲ where h̲ > dim x_i is a lower bound chosen by the user.

Maximum likelihood. We argue that ĥ = h̲ is the maximum-likelihood estimator for h. Let L^ϵ(β, σ, ζ, G_j, h) be the ϵ-likelihood given in (4.3) for each h. Let σ̂_h denote the maximum-likelihood estimator in Theorem 4.1 for each h. By (4.6), the profile likelihood for h is

We note that σ̂²_h > 0 for all h ≥ h̲, so that ϵ^{h̲−h} L̂_h^ϵ/L̂_{h̲}^ϵ is bounded uniformly in h. Thus, L̂_h^ϵ/L̂_{h̲}^ϵ → 0 as ϵ → 0 whenever h > h̲, so that ĥ = h̲.

An information criterion. Penalizing log σ̂²_h gives the information criterion

(6.1) IC_h = \log \hat\sigma_h^2 + f(n)\, (n - h)/n,

for a penalty f. Let ĥ_IC be the minimizer. Then λ̂_IC = ĥ_IC/n estimates the proportion of ‘good’ observations, λ say. Below, we argue, for the location-scale case, that λ̂_IC is consistent if the penalty is chosen so that f(n) → ∞ for increasing n, but f(n) ≤ (1 − λ)^{−1} log log n, where 0 < λ ≤ 1. If, for instance, it is expected that more than half of the observations are ‘good’, so that λ > 1/2, the penalty can be chosen as f(n) = 2 log log n. The intuition is as follows.

Consider data generating processes as in Section 5.1 and let h_n denote the number of ‘good’ observations, so that h_n/n → λ. We want to show that λ̂_IC = λ in the limit. Theorem 5.1 shows that σ̂²_{h_n} is consistent for σ² when h = h_n. Thus, we consider sequences h_n′ so that h_n′/n → λ′ ≠ λ and argue that IC_{h_n′} − IC_{h_n} ≈ log(σ̂²_{h_n′}/σ²) + f(n)(λ − λ′) has a positive limit, so that the minimizer satisfies λ̂_IC = λ in the limit.

Suppose λ′ < λ. In that case, one could expect that, asymptotically, the estimated set of ‘good’ observations is a subset of the good observations, so that ζ̂_{h′} ⊂ ζ. Further, σ̂²_{h′} converges to σ²_{λ′}, say, the variance of a truncated normal distribution as analysed by Butler (1982). Then, IC_{h_n′} − IC_{h_n} ≈ log(σ²_{λ′}/σ²) + f(n)(λ − λ′), which is positive in the limit when f(n) diverges. Thus, λ̂_IC ≠ λ′ in the limit.

Suppose λ′ > λ. Then IC_{h_n′} − IC_{h_n} ≈ log(σ̂²_{h_n′}/σ²) − f(n)(λ′ − λ), where the penalty term is negative, so that f(n) must be chosen carefully. With the normal LTS model, σ̂²_{h_n′} must diverge. The reason is that, for λ′ > λ, the estimated set of ‘good’ observations ζ̂_{h′} includes both ‘good’ observations and ‘outliers’. The ‘outliers’ diverge at a (2 log h_n)^{1/2} rate, see Example 5.1, so that σ̂²_{h_n′} must diverge at a 2 log h_n rate. Noting that h_n ∼ nλ, we get IC_{h_n′} − IC_{h_n} ≈ log log n − f(n)(λ′ − λ). When 1/2 < λ < λ′ ≤ 1, we have λ′ − λ < 1/2, and the log log n term dominates when f(n) ≤ 2 log log n. Thus, λ̂_IC ≠ λ′ in the limit.

Combining these arguments indicates that λ̂_IC = λ in the limit. Unfortunately, in simulations that are not reported here, we find that a 2 log log n penalty grows so slowly in n that for some specifications the consistency is only realized for extremely large samples.

Cumulant based normality test. A more useful estimator for the proportion of ‘good’ observations λ is the minimizer ĥ_T, say, of the normality test statistic

(6.2)

based on third and fourth cumulants of the estimated ‘good’ residuals. We argue that ĥ_T/n is consistent for λ. This is supported by a simulation study in Section 7.2.

The intuition of the consistency argument is similar to that for the information criterion. First, for h = h_n, Theorem 5.1 may be extended to show that T_{h_n} is asymptotically χ²₂. For h < h_n, the sample moments may converge to the moments of a truncated normal distribution, see Berenguer-Rico and Nielsen (2017). Thus, T_h would diverge as it is normalized using the normal distribution. For h > h_n, the estimated set of ‘good’ observations ζ̂_h contains both ‘good’ and ‘outlying’ observations, so that T_h diverges at a logarithmic rate, instead of the iterated logarithmic rate for the information criterion. A formal proof is left for future work.
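Since (6.2) is not reproduced above, the sketch below uses the familiar Bowman–Shenton/Jarque–Bera combination of squared skewness and squared excess kurtosis as a stand-in cumulant-based statistic (also asymptotically χ²₂ under normality) and minimizes it over h in the location-scale case.

```r
## Estimate lambda = h/n by minimizing a cumulant-based normality statistic
## computed from the estimated 'good' residuals of the location-scale LTS fit.
Th <- function(y, h) {
  ys <- sort(y); n <- length(y)
  s2 <- sapply(0:(n - h), function(d) {
    block <- ys[(d + 1):(d + h)]
    mean((block - mean(block))^2)
  })
  block <- ys[which.min(s2) - 1 + (1:h)]    # estimated 'good' observations
  e <- (block - mean(block)) / sqrt(mean((block - mean(block))^2))
  h * (mean(e^3)^2 / 6 + (mean(e^4) - 3)^2 / 24)   # stand-in for (6.2)
}
set.seed(5)
y <- c(rnorm(160), 5 + abs(rnorm(40)))      # 80% 'good', separated right 'outliers'
hs <- 120:180                               # search h between 60% and 90% of n = 200
h_hat <- hs[which.min(sapply(hs, function(h) Th(y, h)))]
h_hat / length(y)                           # estimate of lambda, typically near 0.8
```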

7 Simulations

7.1 Inference in the location-scale model

In the location-scale model y_i = μ + σε_i, we study the finite sample properties of tests for μ = 0 using four different test statistics and six different data generating processes, which we describe below. We consider sample sizes n = 25, 100, 400, 1600 and let h/n = λ = 0.8.

Tests. We consider four different estimators: LTS, LMS, OLS and SLTS (scale-corrected LTS estimator). For each estimator, s say, we compute t-type statistics, t_s = (μ̂_s − μ)/se_s, with associated asymptotic 95% quantiles q_s.

For the OLS estimator, we have μ̂_OLS = n^{−1} Σ_{i=1}^n y_i and se²_OLS = σ̂²_OLS/n with σ̂²_OLS = n^{−1} Σ_{i=1}^n (y_i − ȳ)². The asymptotic quantile q_OLS is standard normal.

For the LTS estimator, we apply the estimators μ̂_LTS and σ̂_LTS given in (5.1). By Theorem 5.2, we get that se²_LTS = σ̂²_LTS/h. The asymptotic quantile q_LTS is standard normal.

For the LMS estimator, we apply the estimators μ̂_LMS and σ̂_LMS given in Online Supplementary Material, (C.4). Online Supplementary Material, Theorem C.2 gives that se²_LMS = σ̂²_LMS/h². The asymptotic quantile q_LMS is standard Laplace.

Finally, SLTS uses the usual approach to LTS estimation with consistency and efficiency correction factors arising from truncation in a standard normal distribution, as outlined at the end of Section 2. Let ς²_{h/n} = ∫_{−c}^{c} x²φ(x) dx / ∫_{−c}^{c} φ(x) dx with c chosen so that ∫_{−c}^{c} φ(x) dx = h/n, which gives ς²_{0.8} = 0.438. Then, we have estimators μ̂_SLTS = μ̂_LTS and σ̂²_SLTS = σ̂²_LTS/ς²_{h/n}, while se²_SLTS = σ̂²_SLTS/(n ς²_{h/n}). The asymptotic quantile q_SLTS is standard normal.
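The truncated-normal factor is easy to verify numerically; a short check of ς²_{0.8} is:

```r
## Consistency factor for SLTS: variance of a standard normal truncated to its
## central h/n probability mass.
lambda <- 0.8
cc <- qnorm(0.5 + lambda / 2)                                   # central 80% lies in (-cc, cc)
integrate(function(x) x^2 * dnorm(x), -cc, cc)$value / lambda   # about 0.438
1 - 2 * cc * dnorm(cc) / lambda                                 # same value in closed form
```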

Data generating processes (DGPs). Table 1 gives an overview of the DGPs. The first three DGPs are examples of the LTS Model 3.1 and the LMS Online Supplementary Material, Model C.1. The proportion of ‘good’ observations is λ = 80%. The ‘good’ errors ε_i are i.i.d. normal N(0,1) in DGP1 and DGP2 and i.i.d. uniform U[−1,1] in DGP3. The ‘outlier’ errors are defined as in (3.1), so that ε_j = (max_{i∈ζ} ε_i + ξ_j)1(ξ_j > 0) + (min_{i∈ζ} ε_i + ξ_j)1(ξ_j < 0), where ξ_j − ν₊1(ξ_j > 0) + ν₋1(ξ_j < 0) are i.i.d. normal N(0,1) and ν₊ and ν₋ separate ‘good’ and ‘outlying’ observations. The separators are ν₊ = ν₋ = 0 in DGP1 and ν₊ = 3, ν₋ = 1 in DGP2 and DGP3.

Table 1.

Data generating processes

        Model             MLE    ‘good’ error   ‘outliers’
DGP1    LTS               LTS    N(0,1)         ν₊ = ν₋ = 0
DGP2    LTS               LTS    N(0,1)         ν₊ = 3, ν₋ = 1
DGP3    LMS               LMS    U(−1,1)        ν₊ = 3, ν₋ = 1
DGP4    ϵ-contamination   OLS    N(0,1)         N(0,1)
DGP5    ϵ-contamination          N(0,1)         N(0,3)
DGP6    ϵ-contamination          N(0,1)         N(2,1)

Note. The proportion of good errors is λ = 0.8. Columns 2 and 3 show the model and indicate which estimator is maximum likelihood. Columns 4 and 5 indicate how the ‘good’ and the ‘outlying’ errors are chosen. For DGP1–DGP3: λn ‘good’ observations; the ‘outliers’ have ξ_j − ν₊1(ξ_j > 0) + ν₋1(ξ_j < 0) distributed N(0,1). For DGP4–DGP6: the distribution is λN(0,1) + (1−λ)H.


The last three DGPs are examples of ϵ-contamination (Huber, 1964). We draw n observations from the distribution function 0.8Φ+0.2H, where Φ is standard normal and H represents contamination. In DGP4, H=Φ, giving a standard i.i.d. normal model. In DGP5 and DGP6, H is N(0,3) and N(2,1), giving symmetric and nonsymmetric mixtures, respectively.

We have different maximum-likelihood estimators for the different models. These are LTS for DGP1 and DGP2, LMS for DGP3, and OLS for DGP4. None of the considered estimators are maximum likelihood for DGP5 and DGP6.

Table 2 reports results from 10^6 repetitions. The Monte Carlo standard error is 0.001.

Table 2.

Simulated rejection frequencies for nominal 5% tests on intercept

                          OLS                                            LTS
n        DGP1   DGP2   DGP3   DGP4   DGP5   DGP6       DGP1   DGP2   DGP3   DGP4   DGP5   DGP6
25       0.092  0.084  0.084  0.067  0.066  0.359      0.255  0.081  0.072  0.371  0.337  0.388
100      0.083  0.129  0.199  0.054  0.054  0.887      0.180  0.058  0.055  0.383  0.345  0.518
400      0.100  0.321  0.664  0.051  0.051  1.000      0.110  0.052  0.051  0.389  0.349  0.827
1600     0.159  0.745  0.998  0.050  0.050  1.000      0.071  0.050  0.050  0.390  0.349  0.998

                          LMS                                            SLTS
25       0.720  0.489  0.070  0.641  0.631  0.702      0.011  0.000  0.000  0.034  0.027  0.063
100      0.961  0.785  0.054  0.836  0.831  0.901      0.002  0.000  0.000  0.041  0.028  0.116
400      0.999  0.936  0.051  0.931  0.929  0.982      0.000  0.000  0.000  0.047  0.031  0.373
1600     1.000  0.992  0.050  0.972  0.971  0.999      0.000  0.000  0.000  0.049  0.032  0.941

The OLS statistic is maximum likelihood with DGP4 and performs well. It performs equally well with the symmetric, i.i.d. DGP5. For DGP1, it is slowly diverging, possibly because the absolute sample mean is diverging with n. OLS performs poorly with the nonsymmetric DGP2, DGP3, and DGP6.

The LTS statistic is maximum likelihood with DGP1 and DGP2. The asymptotic theory also applies for DGP3. The convergence is slow for DGP1, where there is no separation. The LTS statistic does not perform well with ϵ-contamination in DGP4, DGP5, and DGP6.

The LMS statistic is maximum likelihood with DGP3 and performs well with that DGP, but poorly with all other DGPs. The SLTS statistic is not maximum likelihood for any of the considered models. It is calibrated to be asymptotically unbiased for DGP4 and performs well for that model, but poorly with all other DGPs.

Overall, we see that it is a good idea to apply maximum likelihood but this does require that the model specification is checked. In particular, the LTS estimator is best in DGP1–DGP3, although with some finite sample distortion with DGP1 where ‘good’ and ‘outlying’ observations are not well-separated. The LTS estimator does not work well for the ϵ-contaminated models, where a model dependent scale correction is needed. The usual approach of using the normal scale correction as in SLTS does not work well in general. All estimators are poor for asymmetric ϵ-contamination.

7.2 h estimation

Next, we study the finite sample properties of estimating h using the cumulant-based normality test statistic T_h in (6.2). Results are reported in Table 3, based on 10^3 repetitions.

Table 3.

Estimating h by minimizing the T_h statistic in (6.2)

                     DGP1                                     DGP2
            λ = 0.7           λ = 0.8              λ = 0.7           λ = 0.8
n        ĥ_bias   ĥ_sd   ĥ_r   ĥ_bias   ĥ_sd   ĥ_r   ĥ_bias   ĥ_sd   ĥ_r   ĥ_bias   ĥ_sd   ĥ_r
25         1.2     2.5          1.0      2.5          0.3      2.1          1.5      2.1
100        9.6     8.6          3.7      6.0          0.2      4.3          1.1      3.0
400       41.3    35.8          4.5     12.4          0.8      2.1          0.8      2.0
1600      92.8   143.1          0.4      3.5          0.8      2.0          0.8      2.0
12800      0.7     4.2          0.6      4.4          1.2      3.2          1.3      3.2

Note. Here, ĥ_bias and ĥ_sd report the simulated bias and standard deviation for ĥ, while ĥ_r is ticked if all simulated values of ĥ are interior to the range from 60% to 90%.


In each repetition, we compute the T_h statistic for each h in the range from 60% to 90% of n. The estimator of h is the minimizer of T_h over that range.

The data generating processes are DGP1 and DGP2 from above. These are examples of the LTS model, so that DGP1 has symmetric ‘outliers’ that are not separated from the ‘good’ observations, while DGP2 has asymmetric ‘outliers’ that are separated from the ‘good’ observations. For each DGP, we consider cases with 70% or 80% ‘good’ observations.

Table 3 reports three quantities for each of the DGPs. First, ĥ_bias and ĥ_sd are the Monte Carlo average and standard deviation of the estimation error ĥ − h. Further, ĥ_r is a binary variable, which is checked if ĥ is in the interior of the range from 60% to 90% of n for all 10^3 simulations. The theory suggests that the proportion h/n of ‘good’ observations is consistently estimated, whereas h is not consistently estimated. Thus, we would expect ĥ_bias and ĥ_sd to grow more slowly than linearly in n, but not to vanish.

For all four set-ups, the simulations confirm that ĥ/n is consistent for λ. In DGP1, where there is only little separation between ‘good’ and ‘outlying’ observations, the performance differs substantially between the cases λ = 0.7 and λ = 0.8. We do not have an explanation for this difference. Nonetheless, the estimation works much better for DGP2, with its separation between ‘good’ observations and ‘outliers’, in both cases λ = 0.7 and λ = 0.8, matching the theoretical discussion in the sections above.

We also considered the information criterion IC_h in (6.1), but omit the results. The performance of IC_h is poor, quite possibly due to the log log n rates involved.

8 Empirical illustrations

We provide two empirical illustrations using the stars data of Rousseeuw and Leroy (1987) and the state infant mortality data of Wooldridge (2015). Both analyses illustrate the estimation of h. The second case also illustrates inference in the LTS model. Therefore, we must determine empirically if the LTS model is appropriate. Indeed, there could be a contamination pattern that is consistent with the ϵ-contamination, with the LTS model, or with neither of those. For both illustrations, we study the source of the data to arrive at reasonable empirical models.

Throughout, we use R version 4.0.2 (R Core Team, 2020), estimating LTS using ltsReg from the robustbase package. Before each LTS call we apply set.seed(0). R code with embedded data is available as Supplementary material.

8.1 The stars data

For this empirical illustration, we consider the data on log light intensity and log temperature for the Hertzsprung–Russell diagram of the star cluster CYG OB1 containing n=47 stars as reported by Rousseeuw and Leroy (1987, Table 2.3). Figure 1 shows a cross plot of the variables, where the log temperature axis is reversed. The majority of observations follow a steep band called the main sequence. Rousseeuw and Leroy (1987) refer to the four stars to the top right of Figure 1 as ‘outliers’.

By consulting the original source of the data in Humphreys (1978), see also (Vansina & De Grève, 1982, Appendix  A), we found that the 4 stars to the right of Figure 1 are of M-type (observations 11, 20, 30, 34) and they are red supergiants. Further, the fifth star from the right is of F-type (observation 7, called 44 Cyg). The next 31 stars (1 doublet) from the right are of B-type and the remaining 11 stars (1 doublet) furthest to the left are of the O-type. The doublets are not exact doublets in Humphreys’ original data, so this in itself should not be seen as evidence against a normality assumption.

We fitted the linear model log light_i = β₁ + β₂ log temperature_i + σε_i. Table 4, left panel, shows LTS estimates for different h values. For h = n = 47, the LTS estimator is the full sample OLS estimator. The β estimates are the ‘raw coefficients’ found by ltsReg, while σ̂ is computed directly from (2.3) without any consistency correction. Figure 1 shows LTS fits for selected values of h. It is seen that the fits rotate when h increases. Table 4 also reports the T_h criterion as a function of h. It is minimized for h = 42, pointing at five ‘outliers’: the four M-stars and the F-star. Figure 1 indicates estimated ‘good’ observations and ‘outliers’ with solid and open dots.
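A sketch of how such a table can be produced is given below. It assumes the starsCYG data shipped with the robustbase package (variables log.Te and log.light for the same 47 stars), uses the raw ltsReg coefficients, and computes σ̂ from (2.3) without a consistency correction; the mapping from the alpha argument to the subset size used internally by ltsReg may differ slightly from h/n, and the T_h column is not reproduced.

```r
## LTS fits of log light intensity on log temperature for a range of h.
library(robustbase)
data(starsCYG)                                  # assumed to match the star data used here
n <- nrow(starsCYG)
for (h in c(25, 36:47)) {
  set.seed(0)                                   # as in the empirical illustrations
  fit <- ltsReg(log.light ~ log.Te, data = starsCYG, alpha = h / n)
  idx <- fit$best                               # estimated 'good' observations
  res <- starsCYG$log.light[idx] -
    cbind(1, starsCYG$log.Te[idx]) %*% fit$raw.coefficients
  cat(h, round(fit$raw.coefficients, 2), round(sqrt(mean(res^2)), 2), "\n")
}
```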

Table 4.

Estimates by LTS and T_h criterion for selecting h

              Full sample                        Sub sample
h        β̂₁       β̂₂      σ̂     T_h        β̂₁       β̂₂      σ̂     T_h
25     −13.62     4.22    0.18   1.72     −13.62     4.22    0.18   1.72
36     −11.49     3.71    0.27   1.98     −11.49     3.71    0.27   1.98
37      −9.00     3.16    0.28   2.49      −9.00     3.16    0.28   2.49
40      −8.58     3.07    0.31   2.13      −8.58     3.07    0.31   2.13
41      −8.50     3.05    0.33   1.26      −8.50     3.05    0.33   1.26
42      −7.40     2.80    0.37   0.39      −7.40     2.80    0.37   0.39
43      −4.06     2.05    0.40   0.69       7.88    −0.65    0.49   2.57
44       1.89     0.70    0.49   0.49       7.74    −0.62    0.51   2.76
45       7.34    −0.53    0.51   2.94       7.58    −0.59    0.53   2.73
46       6.92    −0.44    0.53   2.74       7.12    −0.49    0.55   2.83
47       6.79    −0.41    0.55   2.75

Note. Left panel has full sample. Right panel excludes the F-type star.


The estimation of h by the statistic T_h appears a little shaky in Table 4, left panel. The lowest value is obtained for h = 42, while the value for h = 44 is nearly as low. The slope coefficients β̂₂ change gradually for h > 42 with no obvious choice of h for h > 42. Table 4, right panel, shows corresponding results when dropping the F star from the sample. The results are the same for h ≤ 42. We see that h = 42 is now a clear minimum identifying the four M-type stars as ‘outliers’. The M stars appear to have a masking effect, where after their deletion, the F star emerges as very influential in the sense of Rousseeuw and Leroy (1987, p. 81). Perhaps for this reason, different conclusions are reached by different traditional methods. An LTS index plot points at 5 ‘outliers’, see Section 6.1, while an MM index plot (using lmrob in the R package robustbase) and LMS residuals (using lmsreg in the R package MASS) both point at 4 ‘outliers’, not detecting the F star as ‘outlying’.

Figure 2 shows kernel density plots for the scaled residuals for h = 25, 42, 47. The black, thin lines give kernel densities for the full sample. The red, dashed lines give kernel densities for the estimated ‘good’ observations. The standard normal distribution is shown with a blue, thick line. For h = 42, the red, blue, and part of the black lines coincide, which indicates the normality of the ‘good’ observations. The full sample kernel density has a probability mass in the right tail corresponding to the four giants. There is a slight discrepancy between the full sample and the ‘good’ kernel densities in the region from 2 to 4 corresponding to the F star. By construction this 43rd residual will be outside the range of the 42 ‘good’ residuals, but not by far. Kernel densities are very sensitive to the choice of kernel and bandwidth. For illustration, we chose a Gaussian kernel and bandwidth 1.5h^{−1/5} to get the best match of the red and blue curves for h = 42. With that choice, we can more clearly see discrepancies between the kernel density for the ‘good’ observations and the normal curve for h = 25, 47, that is, for the LTS with a high breakdown point and the OLS estimator, respectively.

Figure 2. Scaled LTS residuals for h = 25, 42, 47. Kernel densities for all residuals (thin) and for ‘good’ residuals (dashed) with fitted standard normal density (thick).

8.2 State infant mortality rates

We consider 1990 data for the United States on infant mortality rates by state, including Washington DC, which has a particularly high infant mortality rate (Wooldridge, 2015, p. 299). The data are available as infmrt in the R package wooldridge and are taken from U.S. Bureau of the Census (1994). We analyse two models.

The first model follows Wooldridge. It is a linear regression of the number of deaths within the first year per 1,000 live births, infmort, on the log of per capita income, lpcinc, the log of physicians per 100,000 members of the civilian population, lphysic, and the log of population in thousands, lpopul. In Figure 3, the graph using circle symbols shows the cumulant-based criterion T_h as a function of h. The OLS regression is obtained for h = n = 51 and has a rather large value of T_h (notice the gap in the T_h-axis), indicating that the full sample model is mis-specified. Choosing h = 50 would lead to one ‘outlier’, which is Washington DC. However, the T_h function is minimized at h = 45, indicating six ‘outliers’: Delaware, Washington DC, Georgia, Texas, California, and Alaska (U.S. Bureau of the Census, 1994, Table 123). The interpretation is not obvious and could be due to a missing regressor.

Figure 3. State infant mortality rates. T_h criterion as a function of h. Model for infmrt shown with circles. Model for loginfmrt shown with squares.

The second model differs in two respects. It applies logarithms to the regressand to stabilize rates close to zero as well as for Washington DC. It also includes a regressor for the log proportion of black people in the population, lblack (U.S. Bureau of the Census, 1992, Table 255), since infant mortality is quite different for white and black infants in most states (U.S. Bureau of the Census, 1994, Table 123). In Figure 3, the graph using square symbols shows T_h versus h for this model. The minimizer is h = 50 with Washington DC as the ‘outlier’. The minimum of 0.08 is small compared with a χ²₂ distribution, so there is no evidence against the LTS model. We note that the T_h function is also quite low for h in the left side of the plot, albeit not as low as for h = 50. The estimated LTS model for h = 50 is

(8.1)

The standard errors are the usual OLS standard errors as the LTS model appears to apply. We conclude that the variables lpcinc and lpopul are not significant.

Changing the regressand to be measured on the original scale introduces a second ‘outlier’, South Dakota, which has one of the lowest infant mortality rates for the black population.

9 Concluding remarks

Estimating the proportion of ‘good’ observations. We consider a few issues regarding the estimation of h and the adequacy of the model in the two empirical examples.

First, the sample sizes, n=47 and n=50, are rather small. The results in Section 6.2 indicate that the estimation of λ=h/n based on the T_h criterion is log h consistent; hence, it may be imprecise in small samples.

Second, the results in Section 6.2 focus on the location-scale LTS model. The empirical illustrations above consider models with regressors. We believe the estimation of λ based on the Th criterion extends to a well-specified multiple LTS regression model. However, as pointed out in Section 8.2, the omission of regressors could interfere with the estimation of λ. This aspect needs further investigation.

Third, the results in Section 6.2 refer to the LTS model, where, once ‘outliers’ have been removed, the remaining ‘good’ observations look normal. In practice, the adequacy of the model in a given data set has to be studied. A formal testing procedure is not yet available for the present model. Estimating λ by the Th criterion is unlikely to be consistent in a model where ‘outliers’ are generated by ϵ-contamination.

Fourth, notwithstanding the above issues, the suggested procedure for selecting h seems to work well in the two empirical illustrations in this section and helped in finding satisfactory LTS models for the data.

Other models of the LTS type. New models and estimators can be generated by replacing the normal assumption in the LTS models with some other distribution. For instance, the uniform leads to the least median of squares (LMS) estimator analysed in Online Supplementary Material, Appendix C, while the Laplace distribution leads to the least trimmed sum of absolute deviations (LTA) estimator (Dodge & Jurečková, 2000; Hawkins & Olive, 1999; Hössjer, 1994, Section 2.7). Ron Butler has suggested to us that the approach in this paper could also be applied to the minimum covariance determinant (MCD) estimator for multivariate location and scale (Butler et al., 1993; Rousseeuw, 1985).

Alternative models for ‘outliers’. The maximum likelihood argument in Section 4 would also work if the ‘outlier’ errors εj rather than ξj have distribution Gj. The analysis would be related to the trimmed likelihood argument of Gallegos and Ritter (2009). However, the resulting LTS estimator would have less attractive asymptotic properties in that model compared to those derived in this paper.

Inference requires a model for both 'good' and 'outlying' observations. In the presented theory, the 'good' and the 'outlying' observations are separated. The traditional approach, as advocated by Huber (1964), is to consider mixture distributions formed by mixing a reference distribution with a contamination distribution. Any subsequent inference on the regression parameter β would require a specific formulation of the ϵ-contaminated distribution. This is a nontrivial practical problem. Instead, the focus has been on showing that the bias of estimators is bounded under contamination (Huber & Ronchetti, 2009, Section 4), while inference is conducted using the asymptotic distribution that assumes all observations are i.i.d. normal, implying that the 'good' observations are truncated normal. This gives a different distribution theory for inference compared with the one presented here, and the simulations in Section 7.1 indicate that it will not control size in general. It is therefore of interest to formulate models allowing 'outliers' under which consistency can be proved. The LTS model in this paper does so. The presented simulations show that the two inferential theories are very different. In practice, LTS estimation should therefore be evaluated in the context of a particular model and inference should be conducted accordingly.

Alternative estimators of h. We have proposed consistent estimators for h/n, but it would be useful to investigate their performance further. In a regression context, it may be worth considering the Forward Search algorithm (Atkinson et al., 2010). Omitted regressors and 'outliers' may confound each other, so a simultaneous search over these may be useful, as in the Autometrics algorithm (Castle et al., 2021; Hendry & Doornik, 2014). Some asymptotic theory for these algorithms is provided in Johansen and Nielsen (2016a, 2016b).

Misspecification tests can be developed for the present model. The asymptotic theory developed here shows that standard normality tests can be applied to the set of estimated ‘good’ observations. Other tests could also be investigated, in particular those that are concerned with functional form or omission of regressors.
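For instance, a normality check of the estimated 'good' residuals can be carried out directly in base R; here res_good is a hypothetical vector holding the h estimated 'good' scaled residuals, and the tests shown are standard tools rather than procedures developed in this paper.

## Normality checks on the estimated 'good' residuals (hypothetical vector res_good).
shapiro.test(res_good)                # Shapiro-Wilk test of normality
qqnorm(res_good); qqline(res_good)    # normal quantile-quantile plot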

More ‘outliers’ than ‘good’ observations. This is allowed in the LTS model and supported by the asymptotic analysis under regularity conditions for the ‘outliers’. This lends some support to the practice of starting the Forward Search algorithm by an LTS estimator for fewer than half of the observations (Atkinson et al., 2010). Whether it makes more sense to model the ‘outliers’ or the ‘good’ observations as normally distributed in this situation must rest on a careful consideration of the data and the substantive context.

Acknowledgments

Comments from R. Butler, Y. Gao, O. Hao, C. Henning, referees, and the associate editors are gratefully acknowledged.

Data availability

All the data that we use are publicly available; the precise sources are cited in the article. The source code files in R are available as supplementary material.

Supplementary material

Supplementary material is available at Journal of the Royal Statistical Society: Series B online.

Appendix A. Derivation of the LTS likelihood

We prove that P(z_j − ϵ < y_j ≤ z_j | y_i for i ∈ ζ) = Δ_ϵ G_j(z̃_{jβσ}) = G_j(z̃_{jβσ}) − G_j(z̃^ϵ_{jβσ}), where z̃^ϵ_{jβσ} denotes z̃_{jβσ} evaluated with z_j replaced by z_j − ϵ.

First, we prove that P_0 = P(y_j ≤ z_j | ·) = G_j(z̃_{jβσ}), where the dot indicates the conditioning set. Since ε_j = (y_j − β′x_j)/σ and z_{jβσ} = (z_j − β′x_j)/σ, we have that P_0 = P(ε_j ≤ z_{jβσ} | ·).

Recall from (3.1) that the 'outlier' errors are ε_j = (max_{i∈ζ} y_{iβσ} + ξ_j)1(ξ_j > 0) + (min_{i∈ζ} y_{iβσ} + ξ_j)1(ξ_j < 0), where y_{iβσ} = ε_i. Further, ξ_j has distribution function G_j, which is continuous at zero. As a consequence, P(min_{i∈ζ} y_{iβσ} ≤ ε_j ≤ max_{i∈ζ} y_{iβσ} | ·) = 0 and P(ε_j ≤ min_{i∈ζ} y_{iβσ} | ·) = P(ξ_j < 0) = G_j(0). Thus, P_0 = P(ε_j ≤ z_{jβσ} | ·) = G_j(0) for min_{i∈ζ} y_{iβσ} ≤ z_{jβσ} ≤ max_{i∈ζ} y_{iβσ}.

Let P_0 = P_1 + P_2, where P_1 = P{(ε_j ≤ z_{jβσ}) ∩ (ξ_j < 0) | ·} and P_2 = P{(ε_j ≤ z_{jβσ}) ∩ (ξ_j > 0) | ·}.

By (3.1), on the event (ξ_j < 0) we have (ε_j ≤ z_{jβσ}) = (ξ_j ≤ z_{jβσ} − min_{i∈ζ} y_{iβσ}), so that P_1 can be written as P{ξ_j ≤ min(z_{jβσ} − min_{i∈ζ} y_{iβσ}, 0) | ·}. Hence, P_1 = G_j{min(z_{jβσ} − min_{i∈ζ} y_{iβσ}, 0)}.

Similarly, P_2 = P{(ξ_j ≤ z_{jβσ} − max_{i∈ζ} y_{iβσ}) ∩ (ξ_j > 0) | ·} by (3.1). If z_{jβσ} < max_{i∈ζ} y_{iβσ}, then the intersection is empty. If instead z_{jβσ} > max_{i∈ζ} y_{iβσ}, then the intersection is the set (0 < ξ_j ≤ z_{jβσ} − max_{i∈ζ} y_{iβσ}). Hence, P_2 = {G_j(z_{jβσ} − max_{i∈ζ} y_{iβσ}) − G_j(0)}1(z_{jβσ} > max_{i∈ζ} y_{iβσ}).

Note also that if z_{jβσ} < min_{i∈ζ} y_{iβσ}, then z_{jβσ} < max_{i∈ζ} y_{iβσ}. And, if z_{jβσ} > max_{i∈ζ} y_{iβσ}, then z_{jβσ} > min_{i∈ζ} y_{iβσ}. In combination, we have P_0 = G_j{min(z_{jβσ} − min_{i∈ζ} y_{iβσ}, 0)} + {G_j(z_{jβσ} − max_{i∈ζ} y_{iβσ}) − G_j(0)}1(z_{jβσ} > max_{i∈ζ} y_{iβσ}).

Recall the notation z̃_{jβσ} = (z_{jβσ} − min_{i∈ζ} ε_i)1(z_{jβσ} < min_{i∈ζ} ε_i) + (z_{jβσ} − max_{i∈ζ} ε_i)1(z_{jβσ} > max_{i∈ζ} ε_i). We then get P_0 = P(y_j ≤ z_j | ·) = G_j(z̃_{jβσ}).

Second, and similarly, P(y_j ≤ z_j − ϵ | ·) = G_j(z̃^ϵ_{jβσ}). Combining, the desired result follows.

Appendix B. Proofs of asymptotic theory for the LTS model

We start with some preliminary extreme value results, which allow analysis of the case with more 'good' observations than 'outliers'. Then follow some results on empirical processes, which are needed for the general case.

B.1 Extreme values

For a distribution function F, define the quantile function Q(ψ) = inf{c: F(c) ≥ ψ}.

 
Lemma B.1

Suppose F(c) = 0 for c < 0. Let ψ_n = o_P(1). Then Q_n(ψ_n) = O_P(1).

 
Proof.

Let a small ϵ > 0 be given. Then a finite x₀ exists so that f = F(x₀) ≥ 1 − ϵ. We show P_n = P(A_n) ≤ 2ϵ, where A_n = {Q_n(ψ_n) > x₀}. Applying F_n, we get A_n = {ψ_n > F_n(x₀)}. By the Law of Large Numbers, F_n(x₀) = f + o_P(1). Hence, if B_n = {F_n(x₀) > f − ϵ}, then P(B_n) > 1 − ϵ for large n. Since A_n = (A_n ∩ B_n) ∪ (A_n ∩ B_nᶜ), we have A_n ⊂ (A_n ∩ B_n) ∪ B_nᶜ. Here, P(B_nᶜ) ≤ ϵ by construction. Moreover, A_n ∩ B_n ⊂ (ψ_n > f − ϵ), where P(ψ_n > f − ϵ) ≤ ϵ for large n since ψ_n = o_P(1) by assumption. Thus, P_n ≤ 2ϵ. □

 
Lemma B.2

Suppose ε_1, …, ε_n are i.i.d. with E|ε_i|^q < ∞ for some q > 0. Then ε_(n) = o_P(n^{1/p}) for any p < q.

 
Proof.

We show P_n = P{ε_(n) ≥ ϵn^{1/p}} → 0 for any ϵ > 0. Write P_n = P{⋃_{i=1}^n (ε_i ≥ ϵn^{1/p})}. Boole's inequality gives P_n ≤ Σ_{i=1}^n P(ε_i ≥ ϵn^{1/p}) = nP(ε_1 ≥ ϵn^{1/p}). Markov's inequality gives P_n ≤ ϵ^{−q} n^{1−q/p} E|ε_1|^q, which vanishes for p < q.          □
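As an informal numerical illustration of Lemma B.2 (not part of the proof), the following R sketch draws errors with a finite third moment (so q = 3 in the lemma) and shows the scaled maximum ε_(n)/n^{1/2} shrinking for p = 2 < q; the error distribution is an illustrative choice.

## Illustration of Lemma B.2: for i.i.d. errors with E|eps|^3 finite,
## the maximum eps_(n) is o_P(n^(1/2)).
set.seed(1)
for (n in c(1e3, 1e4, 1e5, 1e6)) {
  eps <- abs(rt(n, df = 4))                       # finite moments up to order < 4
  cat("n =", n, " max(eps)/sqrt(n) =", max(eps) / sqrt(n), "\n")
}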

B.2 Some initial properties of the LTS estimator

We consider the LTS estimator under the sequence of data generating processes defined in Section 5.1. The main challenge is to show that δ̂ is close to the Binomial(n−h_n, ρ)-distributed variable δ_n = Σ_{j∉ζ_n} 1(ε_j < min_{i∈ζ_n} ε_i). In the following lemmas, we condition on the sequence δ_n, so that the randomness stems from the 'good' errors ε_(δ_n+1), …, ε_(δ_n+h_n) and the magnitudes of the 'outliers', ε̲_(δ_n+1−j) for j ≤ δ_n and ε̄_(j−δ_n−h_n) for j > δ_n + h_n. The unconditional statements in the theorems about δ̂ are then derived as follows. If P(δ̂ − δ_n ∈ I_n | δ_n) → 1 for an interval I_n and a sequence δ_n, then by the law of iterated expectations

(B1) P(δ̂ − δ_n ∈ I_n) = E{P(δ̂ − δ_n ∈ I_n | δ_n)} → 1,

due to the dominated convergence theorem, because P(δ̂ − δ_n ∈ I_n | δ_n) is bounded.

We give detailed proofs for the case δ̂ > δ_n, so that some of the small 'good' observations are considered left 'outliers' and some of the small right 'outliers' are considered 'good'. The case δ̂ < δ_n is analogous, since we can multiply all observations by −1 and relabel left and right. When considering δ̂ > δ_n, we note that δ̂ − δ_n ≤ n̄, the number of right 'outliers'. Recall ρ = G(0). Hence, if ρ = 1, then all outliers are to the left. In particular, due to the binomial construction of δ_n, we have n̄ = 0 a.s. when ρ = 1, so that the event δ̂ > δ_n is a null set. Thus, when analysing (δ̂ − δ_n)/h_n, it suffices to consider δ̂ > δ_n and ρ < 1.
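To fix ideas, a minimal R sketch of the location-scale data generating process of (3.1) and of the count δ_n of left 'outliers' follows. The choice of G (a two-sided exponential with mass ρ below zero) is purely illustrative and not taken from the paper.

## Location-scale LTS model: h 'good' N(0,1) errors, n - h 'outliers' outside their range.
set.seed(2)
n <- 100; h <- 60; rho <- 0.3
eps_good <- rnorm(h)
sgn <- ifelse(runif(n - h) < rho, -1, 1)      # P(xi_j < 0) = rho, G continuous at zero
xi  <- sgn * rexp(n - h)                      # illustrative 'outlier' magnitudes
eps_out <- ifelse(xi > 0, max(eps_good) + xi, min(eps_good) + xi)
delta_n <- sum(eps_out < min(eps_good))       # number of left 'outliers', Binomial(n - h, rho)
delta_n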

The next lemma concerns cases where ŝ = δ̂ − δ_n is small. We show that the LTS estimators are close to the infeasible OLS estimators for the 'good' observations. We note that, by Lemma B.2 and Assumption 5.1(ii), which requires E|ε_i|^{2+ω} < ∞, we can find a small η > 0 so that ε_(δ_n+1), ε_(δ_n+h_n) = o_P(h_n^{1/2−η}). We write ε_(δ̂+i)^p for {ε_(δ̂+i)}^p.

 
Lemma B.3

Consider the LTS estimator under Assumption 5.1(ii). Let δ̂ = δ_n + O_P(h_n^η) for some small η > 0 as defined above. Then, h_n^{1/2}(μ̂ − μ̂_{δ_n}) and σ̂² − σ̂²_{δ_n} are o_P(1).

 
Proof.
The estimators μ̂ and σ̂² are formed from the sample moments of ε_(δ̂+1), …, ε_(δ̂+h_n). Let E_n^p = Σ_{i=1}^{h_n} ε_(δ̂+i)^p − Σ_{i=1}^{h_n} ε_(δ_n+i)^p. We need to show that E_n^1 = o_P(h_n^{1/2}) and E_n^2 = o_P(h_n). As remarked above, we condition on δ_n and consider δ̂ > δ_n while assuming ρ < 1. Then
The first and the fourth sums cancel. In the second sum, change the summation index to j = i − h_n − δ_n + δ̂, so that δ̂ + i = δ_n + h_n + j, and replace ε_(δ_n+h_n+j) = ε_(δ_n+h_n) + ε̄_(j). This gives
Here, ε_(δ_n+h_n) is the maximum and ε_(δ_n+i) is the ith order statistic of the 'good' errors.
For p = 1, we find ε_(δ_n+i) ≥ ε_(δ_n+1) and ε̄_(i) ≤ ε̄_(δ̂−δ_n), so that
Here, ε̄_(δ̂−δ_n) is the (δ̂−δ_n)/n̄ empirical quantile of the Ḡ distribution. By assumption, δ̂ − δ_n = O_P(h_n^η) and n̄/h_n → ω̄ = (1−λ)(1−ρ)/λ > 0, so that (δ̂−δ_n)/n̄ = O_P(h_n^{η−1}) = o_P(1). Lemma B.1 then shows ε̄_(δ̂−δ_n) = O_P(1). Further, Lemma B.2, using Assumption 5.1(ii) that E|ε_i|^{2+ω} < ∞, shows that ε_(δ_n+1), ε_(δ_n+h_n) are o_P(h_n^{1/2−η}) for some η > 0. In combination, E_n^1 = O_P(h_n^η){o_P(h_n^{1/2−η}) + O_P(1)} = o_P(h_n^{1/2}).
For p = 2, we find similarly, using the inequality (x+y)² ≤ 2(x²+y²),
Apply the above bounds δ̂ − δ_n = O_P(h_n^η), ε_(δ_n+h_n) = o_P(h_n^{1/2−η}) and ε̄_(δ̂−δ_n) = O_P(1) to get that E_n^2 = O_P(h_n^η)[o_P{h_n^{2(1/2−η)}} + O_P(1)] = o_P(h_n).             □

The next lemma is the main ingredient in showing consistency of δ̂. It is convenient to define the sequences

(B2) s̲_n = h_n^η and s̄_n = |ε_(δ_n+h_n)|^{−1/2} h_n, for a small η > 0.

We note that, by Assumption 5.1(iii), ε_i has a distribution function with infinite support. Hence, the extreme value ε_(δ_n+h_n) diverges to infinity and ε_(δ_n+h_n)^{−1} = o_P(1). By Assumption 5.1(ii), Lemma B.2 applies, so that ε_(δ_n+h_n) = o_P(h_n^{1/2−η}) for some η > 0. Then, for large h_n and small η, 0 < s̲_n < s̄_n, where s̄_n/h_n = o_P(1).

 
Lemma B.4

Suppose Assumption 5.1 holds and let ρ < 1 and 0 < λ < 1. Then, for any η > 0, we have min_{s̲_n ≤ s ≤ h_n−s̄_n} h_n^{1−η}(σ̂²_{δ_n+s} − σ̂²_{δ_n}) → ∞ in probability.

 
Proof.
We condition on δ_n, the number of left 'outliers'. Recall that the ordered 'good' observations are ε_(δ_n+1), …, ε_(δ_n+h_n). Let ε_(δ_n+h_n+j) = ε_(δ_n+h_n) + ε̄_(j) for 1 ≤ j ≤ n̄. Expand, see Online Supplementary Material, Section D,
(B3)
with coefficients A_n = A_{n1} − A_{n2} + 2A_{n3} − 2A_{n4}, where
We find a lower bound for A_n. Notice that A_{n1}, A_{n3} ≥ 0. Further, A_{n2} ≤ h_n^{−1} Σ_{i=1}^s ε_(δ_n+i)² = B_{n2}, say. For A_{n4}, use Jensen's inequality, add further summands, and use the Law of Large Numbers under Assumption 5.1(i) for the unordered normal 'good' errors ε_{δ_n+i} to get
(B4)
Further, we have h_n^{−1} Σ_{i=1}^s {ε_(δ_n+h_n) − ε_(δ_n+i)} ≤ (s/h_n){ε_(δ_n+h_n) − ε_(δ_n+1)} = B_{n4}, say, so that |A_{n4}| ≤ B_{n4}{1 + o_P(1)}, where the remainder term from (B4) is uniform in s. In combination,
(B5)
where the o_P(1) term is uniform in s. We analyse the ranges s̄_n ≤ s ≤ h_n − s̄_n and s̲_n ≤ s ≤ s̄_n separately.

1. Consider s̄_n ≤ s ≤ h_n − s̄_n, where s̄_n/h_n = |ε_(δ_n+h_n)|^{−1/2}. We start by finding a lower bound to (s/h_n)(1 − s/h_n)ε_(δ_n+h_n)². The range of interest is s̄_n/h_n ≤ s/h_n ≤ 1 − s̄_n/h_n. As argued above, s̄_n/h_n = o_P(1). The function x(1−x) is concave with roots at x = 0 and x = 1. Then, for x₀ ≤ x ≤ 1 − x₀ with 0 < x₀ < 1/2, the function x(1−x) is minimized at the endpoints x₀ and 1 − x₀. Therefore, x(1−x) ≥ x₀(1−x₀) ≥ x₀/2. Thus, 2(s/h_n)(1 − s/h_n) ≥ s̄_n/h_n = |ε_(δ_n+h_n)|^{−1/2} on the considered range.

Next, we bound B_{n2} ≤ 1 + o_P(1) uniformly in s, as in (B4).

For B_{n4} = (s/h_n){ε_(δ_n+h_n) − ε_(δ_n+1)}, use s/h_n ≤ 1, so that B_{n4} ≤ ε_(δ_n+h_n) − ε_(δ_n+1).

Thus, (B5) reduces to 2S_s ≥ |ε_(δ_n+h_n)|^{−1/2}ε_(δ_n+h_n)² − 2{1 + 2ε_(δ_n+h_n) − 2ε_(δ_n+1)}{1 + o_P(1)}. Since ε_(δ_n+h_n) → ∞ in probability and ε_(δ_n+1)/ε_(δ_n+h_n) = −1 + o_P(1) by Assumption 5.1(iv), we get that min_{s̄_n≤s≤h_n−s̄_n} 2S_s ≥ |ε_(δ_n+h_n)|^{3/2}{1 + o_P(1)}. In particular, min_{s̄_n≤s≤h_n−s̄_n} h_n^{1−η}S_s → ∞ in probability.

2. Consider s̲_n ≤ s ≤ s̄_n, where s̲_n = h_n^η for any η > 0 and s̄_n = |ε_(δ_n+h_n)|^{−1/2} h_n, see (B2). We find bounds for the B_n terms in (B5).

First, B_{n2} = h_n^{−1} Σ_{i=1}^s ε_(δ_n+i)². Write B_{n2} = h_n^{−1}{Σ_{i=1}^{r_n} ε_(δ_n+i)² + Σ_{i=r_n+1}^s ε_(δ_n+i)²}, where r_n = h_n^{η/2}. In the left tail, the squared order statistics decrease with increasing index with large probability. Hence, we can bound B_{n2} ≤ h_n^{−1}{r_n ε_(δ_n+1)² + (s−r_n)ε_(δ_n+r_n)²}. Bounding (s−r_n) ≤ s and r_n/s ≤ r_n/s̲_n = h_n^{−η/2}, we get B_{n2} ≤ (s/h_n){h_n^{−η/2}ε_(δ_n+1)² + ε_(δ_n+r_n)²}. By Assumption 5.1(iv,v), ε_(δ_n+1)/ε_(δ_n+h_n) = −1 + o_P(1) and ε_(δ_n+r_n)/ε_(δ_n+1) ≤ C_η + o_P(1) for some C_η < 1. Then, B_{n2} ≤ (s/h_n)ε_(δ_n+h_n)²[h_n^{−η/2}{1 + o_P(1)} + {C_η² + o_P(1)}], which reduces to (s/h_n)ε_(δ_n+h_n)²{C_η² + o_P(1)}.

Second, B_{n4} = (s/h_n){ε_(δ_n+h_n) − ε_(δ_n+1)} by definition.

Insert the bounds for B_{n2}, B_{n4} into (B5), along with s/h_n ≤ s̄_n/h_n = |ε_(δ_n+h_n)|^{−1/2}, to get
Using that ε_(δ_n+1)/ε_(δ_n+h_n) = −1 + o_P(1) and ε_(δ_n+h_n)^{−1} = o_P(1) by Assumption 5.1(iii), we get that S_s ≥ (s/h_n)ε_(δ_n+h_n)²{1 − C_η² + o_P(1)}. Further, s ≥ s̲_n, where s̲_n/h_n = h_n^{η−1}. Thus, min_{s̲_n≤s≤s̄_n} h_n^{1−η}S_s → ∞ in probability.  □

B.3 Fewer ‘outliers’ than ‘good’ observations

 
Proof of Theorem 5.2.

First, we show that ŝ = δ̂ − δ_n = o_P(h_n^η) for any η > 0. It suffices to show ŝ = O_P(h_n^η) for each η.

As discussed above, we consider the case δ̂ > δ_n, so that some of the small 'good' observations are considered left 'outliers' and some of the small right 'outliers' are considered 'good'. The case δ̂ < δ_n is analogous. When considering δ̂ > δ_n, we note that ŝ = δ̂ − δ_n ≤ n̄, the number of right 'outliers'. Moreover, ŝ ≤ n̄ ≤ n − h_n, since there are n̄ right 'outliers' and n − h_n 'outliers' in total.

Choose s̲_n = h_n^η and s̄_n = |ε_(δ_n+h_n)|^{−1/2} h_n as in (B2). Since ŝ/h_n ≤ (n−h_n)/h_n converges to a value less than unity, then ŝ/h_n ≤ 1 − s̄_n/h_n for large n. Thus, if we show P(s̲_n ≤ ŝ ≤ h_n − s̄_n) → 0, then P(ŝ < s̲_n) → 1 as desired. Now, ŝ is the minimizer of S_s = σ̂²_{δ_n+s} − σ̂²_{δ_n}. Since S_0 = 0, then P(s̲_n ≤ ŝ ≤ h_n − s̄_n) → 0 if min_{s̲_n≤s≤h_n−s̄_n} h_n^{1−η}(σ̂²_{δ_n+s} − σ̂²_{δ_n}) → ∞ in probability. This follows from Lemma B.4 using Assumption 5.1.

Second, since δ̂ − δ_n = o_P(h_n^η), Lemma B.3 using Assumption 5.1(ii) shows that h_n^{1/2}(μ̂ − μ̂_{δ_n}) and σ̂² − σ̂²_{δ_n} are o_P(1).

Third, the i.i.d. Law of Large Numbers and Central Limit Theorem using Assumption 5.1(i) show that h_n^{1/2}(μ̂_{δ_n} − μ)/σ is asymptotically normal, while σ̂²_{δ_n} is consistent for σ².      □

B.4 Marked empirical processes evaluated at quantiles

We start with some preliminary results on marked empirical processes evaluated at quantiles.

Consider random variables ε_i for i = 1, …, n and define the marked empirical distribution and its expectation, for c ≥ 0, by

(B6) F_n^p(c) = n^{−1} Σ_{i=1}^n ε_i^p 1(ε_i ≤ c) and F̄^p(c) = E{ε_1^p 1(ε_1 ≤ c)}.

For p = 0, let F_n^0 = F_n. We also define the quantile function Q(ψ) = inf{c: F(c) ≥ ψ} and the empirical quantile function Q_n(ψ) = inf{c: F_n(c) ≥ ψ}. The first result follows from the theory of empirical quantile processes.
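For concreteness, a small R sketch of these objects follows, for illustrative positive errors; Fnp implements the marked empirical distribution F_n^p(c) as written in (B6) and Qn the empirical quantile Q_n(ψ), both hypothetical helper names.

## Marked empirical distribution and empirical quantile (illustrative errors).
Fnp <- function(eps, c, p = 0) mean(eps^p * (eps <= c))
Qn  <- function(eps, psi) quantile(eps, probs = psi, type = 1, names = FALSE)
set.seed(3)
eps <- rexp(1000)
Fnp(eps, c = 1, p = 2)   # quadratic marks, evaluated at c = 1
Qn(eps, 0.9)             # empirical 90% quantile, inf{c: F_n(c) >= psi}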

 
Lemma B.5

Suppose F is regular (Definition 5.1). Then, for all ζ > 0,

  (a) n^{1/2}[F_n{Q(ψ)} − ψ] converges in distribution on D[0,1] to a Brownian bridge;

  (b) sup_{0≤ψ≤1} |n^{1/2}[F{Q_n(ψ)} − ψ] + n^{1/2}[F_n{Q(ψ)} − ψ]| = o(n^{ζ−1/4}) a.s.

 
Proof.

(a) This is Billingsley (1968, Theorem 16.4).

(b) Let D(ψ) = f{Q(ψ)} n^{1/2}{Q_n(ψ) − Q(ψ)} and write the object of interest as the sum of n^{1/2}[F{Q_n(ψ)} − ψ] − D(ψ) and n^{1/2}[F_n{Q(ψ)} − ψ] + D(ψ). These two terms are o(n^{ζ−1/4}) a.s. by Csörgö (1983, Corollaries 6.2.1, 6.2.2), uniformly in ψ.  □

We need the following Glivenko–Cantelli result.

 
Lemma B.6

Let ε_i be i.i.d. continuous, positive random variables with E|ε_i|^p < ∞. Then sup_{c>0} |F_n^p(c) − F̄^p(c)| = o_P(1).

 
Proof.

We note that F̄^p is nondecreasing with F̄^p(∞) < ∞. Since F̄^p is continuous, for any δ > 0 there exist a finite integer K ∈ N and chaining points 0 = c_0 < c_1 < ⋯ < c_K = ∞ so that max_{1≤k≤K}{F̄^p(c_k) − F̄^p(c_{k−1})} ≤ δ.

Since F_n^p and F̄^p are both nondecreasing, we get for c_{k−1} < c ≤ c_k the bounds
In combination, |F_n^p(c) − F̄^p(c)| ≤ max_{j∈{k−1,k}} |F_n^p(c_j) − F̄^p(c_j)| + {F̄^p(c_k) − F̄^p(c_{k−1})}. The last term is bounded by δ, so that sup_{c>0} |F_n^p(c) − F̄^p(c)| ≤ max_{1≤k≤K} |F_n^p(c_k) − F̄^p(c_k)| + δ. For each k, F_n^p(c_k) − F̄^p(c_k) = o_P(1) by the Law of Large Numbers, requiring ε_i to be i.i.d. with E|ε_i|^p < ∞. Since K is finite, the maximum over k also vanishes in probability. By choosing δ sufficiently small, the overall bound is seen to be o_P(1).  □

The next result is inspired by Johansen and Nielsen (2016a, Lemma D.11).

 
Lemma B.7

Let p ∈ N₀. Suppose ε_i is positive, regular and Eε_i^q < ∞ for some q > 2p. Then, sup_{1/(n+1)≤ψ≤n/(n+1)} |F_n^p{Q_n(ψ)} − F̄^p{Q(ψ)}| = o_P(1).

 
Proof.
For p = 0, ϕ_n = n^{1/2}[F_n^p{Q_n(ψ)} − F̄^p{Q(ψ)}] satisfies ϕ_n = n^{1/2}[F{Q_n(ψ)} − ψ]. Lemma B.5 shows that ϕ_n converges in distribution to a Brownian bridge as a process in ψ. By Billingsley (1968, pp. 142–143), sup_{0≤ψ≤1} ϕ_n converges in distribution, so that sup_{0≤ψ≤1} ϕ_n = O_P(1) and the result follows. For p ∈ N, add and subtract F̄^p{Q_n(ψ)} to get
Term 1. This is o_P(1), since sup_{c>0} |F_n^p(c) − F̄^p(c)| = o_P(1) by Lemma B.6.
Term 2. Write S_n(ψ) = F̄^p{Q_n(ψ)} − F̄^p{Q(ψ)} as an integral. First, change variable u = F(c), du = f(c)dc, so that c = Q(u). Then apply the mean value theorem, so that
for an intermediate point ψ* such that Q(ψ*) belongs to the interval from Q(ψ) to Q_n(ψ). Since sup_{0≤ψ≤1} ϕ_n = O_P(1), we must show that {Q(ψ*)}^p = o_P(n^{1/2}). It suffices to show that Q(ψ) and Q_n(ψ) are o_P{n^{1/(2p)}} for ψ = n/(n+1).

Consider Q_n(ψ). Write Q_n(ψ) = max_{1≤i≤n} ε_i. Lemma B.2 shows Q_n(ψ) = o_P{n^{1/(2p)}} for 2p < q, since Eε_i^q < ∞ by assumption.

Consider Q(ψ). Bound |Q(ψ)| ≤ c_n, where c_n satisfies (n+1)^{−1} = P(ε_1 > c_n). We must show c_n = o{n^{1/(2p)}}. It suffices that c_n^q = O(n) for q > 2p. By the Markov inequality, P(|ε_1| > c_n) ≤ c_n^{−q} Eε_1^q, so that c_n^q ≤ (n+1) Eε_1^q = O(n).  □

B.5 More ‘outliers’ than ‘good’ observations

The next lemma is needed when more than half of the observations are 'outliers'. As σ̂²_δ is not diverging, additional regularity conditions are needed to ensure that σ̂²_δ > σ̂²_{δ_n}.

 
Lemma B.8

Suppose Assumption 5.2(i) holds. Let 1 ≤ ω̄ = (1−ρ)(1−λ)/λ < ∞. Recall s̄_n = |ε_(δ_n+h_n)|^{−1/2} h_n from (B2). Then, conditional on δ_n, an ϵ > 0 exists so that min_{h_n−s̄_n≤s≤n̄} (σ̂²_{δ_n+s} − σ̂²_{δ_n}) ≥ ϵ + o_P(1) for large n.

 
Proof.

The errors ε_(δ_n+i) are standard normal order statistics for 1 ≤ i ≤ h_n and ε_(δ_n+h_n+j) = ε_(δ_n+h_n) + ε̄_(j) for 1 ≤ j ≤ n̄, where ε̄_j is Ḡ-distributed. We let ε̄_(0) = 0.

It suffices to show that σ̂²_{δ_n+s}/σ² ≥ 1 + ϵ + o_P(1) uniformly in s, since σ̂²_{δ_n}/σ² = 1 + o_P(1) by the Law of Large Numbers using Assumption 5.1(i) applied to the sample variance of the 'good' errors. We consider separately the cases h_n ≤ s ≤ n̄ and h_n − s̄_n ≤ s < h_n.

1. Consider h_n ≤ s ≤ n̄. In this case, σ̂²_{δ_n+s} is the sample variance of ε_(δ_n+s+j) = ε_(δ_n+h_n) + ε̄_(s−h_n+j) for 1 ≤ j ≤ h_n. Sample variances are location invariant, so that σ̂²_{δ_n+s} = h_n^{−1} Σ_{j=1}^{h_n} ε̄_(s−h_n+j)² − {h_n^{−1} Σ_{j=1}^{h_n} ε̄_(s−h_n+j)}². Let A_{s/n̄} = {s/n̄ − ω̄^{−1} < Ḡ(ε̄_1) ≤ s/n̄}.

We argue that min_{h_n≤s≤n̄} σ̂²_{δ_n+s}/σ² ≥ v̄ + o_P(1), where v̄ = min_{ω̄^{−1}≤ς≤1} Var(ε̄_1 | A_ς). This suffices, as v̄ > 1 by Assumption 5.2(i). Write Σ_{j=1}^{h_n} ε̄_(s−h_n+j)^p = Σ_{k=1}^{n̄} ε̄_k^p 1{ε̄_(s−h_n) < ε̄_k ≤ ε̄_(s)} and let Ḡ_n^p(c) = n̄^{−1} Σ_{i=1}^{n̄} ε̄_i^p 1(ε̄_i ≤ c) and Ḡ_n^{−1}(ψ) = inf{c: Ḡ_n(c) ≥ ψ}, so that ε̄_(k) = Ḡ_n^{−1}(k/n̄). Then,
Apply Lemma B.7 with F = Ḡ and n = n̄, requiring moments of order larger than 4 and the regularity of Assumption 5.2(i), so that, uniformly in h_n ≤ s ≤ n̄,
Now, h_n/n̄ → ω̄^{−1}, where ω̄ ≥ 1 by construction. Then, Eε̄_1^p 1{s/n̄ − h_n/n̄ < Ḡ(ε̄_1) ≤ s/n̄} = Eε̄_1^p 1_{A_{s/n̄}} + o_P(1) uniformly in h_n ≤ s ≤ n̄. Noting that h_n/n̄ = n̄^{−1} Σ_{j=1}^{h_n} ε̄_(s−h_n+j)^0 = E1_{A_{s/n̄}} + o_P(1), we get
so that σ̂²_{δ_n+s}/σ² = E(ε̄_1² | A_{s/n̄}) + o_P(1) − {E(ε̄_1 | A_{s/n̄}) + o_P(1)}². Since Eε̄_1 1_{A_{s/n̄}} ≤ Eε̄_1 < ∞ and E1_{A_{s/n̄}} = ω̄^{−1} > 0 uniformly in s, we get E(ε̄_1 | A_{s/n̄}) ≤ ω̄ Eε̄_1 < ∞. Thus,
Since the remainder terms are uniform in s, we get min_{h_n≤s≤n̄} σ̂²_{δ_n+s}/σ² ≥ v̄ + o_P(1) as desired.
2. Consider h_n − s̄_n ≤ s < h_n, where s̄_n = (2 log h_n)^{−1/4} h_n, see (B2). In this case, among the h_n retained observations there are at least h_n − s̄_n 'outliers' and at most s̄_n 'good' observations. Expand,
(B7)
see Online Supplementary Material, Section D, where
We note that A_{n1}, A_{n2}, A_{n3}, A_{n4} ≥ 0. Therefore, σ̂²_{δ_n+s}/σ² ≥ A_{n2}.

We argue, as in part 1, that A_{n2} ≥ v̄ + o_P(1). Indeed, since 1 > s/h_n ≥ 1 − s̄_n/h_n → 1, we have n̄^{−1} Σ_{j=1}^{h_n−s̄_n} ε̄_(j)^p ≤ n̄^{−1} Σ_{j=1}^{s} ε̄_(j)^p ≤ n̄^{−1} Σ_{j=1}^{h_n} ε̄_(j)^p. Both bounds equal Eε̄_1^p 1_{A_{ω̄^{−1}}} + o_P(1), so that we can proceed as in part 1.    □

 
Proof of Theorem 5.3.

We will show that ŝ = δ̂ − δ_n = o_P(h_n^α) for any α > 0. It suffices to show that P(δ̂ − δ_n > h_n^α) → 0 for ρ < 1, as remarked in Section 5.1. We consider λ, ρ < 1, so that ω̄ = (1−ρ)(1−λ)/λ satisfies 0 < ω̄ < ∞. The case ω̄ < 1 was covered in the proof of Theorem 5.2. Thus, suppose ω̄ ≥ 1.

Recall s̲_n, s̄_n from (B2), and choose η < α. In particular, h_n^α > s̲_n for large n, so that P(ŝ > h_n^α) ≤ P(ŝ ≥ s̲_n). We show that the latter probability vanishes. Note that ŝ ≤ n̄.

We have that ŝ is the minimizer of σ̂²_{δ_n+s} − σ̂²_{δ_n}, which is zero for s = 0, that is, for δ = δ_n. Lemmas B.4 and B.8, using Assumption 5.2(i), show that σ̂²_{δ_n+s} − σ̂²_{δ_n} asymptotically has a uniform, positive lower bound on each of the intervals s̲_n ≤ s ≤ h_n − s̄_n and h_n − s̄_n ≤ s ≤ n̄. Thus, σ̂²_{δ_n+s} − σ̂²_{δ_n} is bounded away from zero for s ≥ s̲_n, so that P(ŝ ≥ s̲_n) → 0.

A similar argument applies for δ̂ − δ_n < −h_n^α using Assumption 5.2(ii). The limiting results for μ̂, σ̂² then follow as in the proof of Theorem 5.2.                      □

B.6 The OLS estimator in the LTS model

 
Proof of Theorem 5.4.
The sample average satisfies (μ̄ − μ)/σ = n^{−1} Σ_{i=1}^n ε_i. Since ρ = 0, there are only right 'outliers'. Separate the 'good' observations ε_i for i = 1, …, h_n, with maximum ε_(h_n), from the 'outliers' ε_{h_n+j} = ε_(h_n) + ε̄_j for j = 1, …, n − h_n, to get
(B8) (μ̄ − μ)/σ = n^{−1} Σ_{i=1}^{h_n} ε_i + {(n−h_n)/n} ε_(h_n) + n^{−1} Σ_{j=1}^{n−h_n} ε̄_j.
Under Assumption 5.1(i,iii), the first sum vanishes by the Law of Large Numbers, while ε_(h_n) → ∞ in probability. Further, (n−h_n)/n → 1 − λ. The second sum converges to (1−λ)μ_G by the Law of Large Numbers. Combine to see that |ε_(h_n)|^{−1}(μ̂_OLS − μ)/σ = 1 − λ + o_P(1).            □
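A quick numerical illustration of this limit follows (an informal check only, with an illustrative choice of Ḡ). For ρ = 0 and λ = 0.7, the scaled OLS location error should approach 1 − λ = 0.3, although slowly, since μ_G/ε_(h_n) vanishes only at a (2 log h_n)^{−1/2} rate.

## Theorem 5.4 illustration: |eps_(h)|^{-1} times the mean error is close to 1 - lambda.
set.seed(4)
n <- 1e5; lambda <- 0.7; h <- round(lambda * n)
eps_good <- rnorm(h)                          # 'good' errors
eps_out  <- max(eps_good) + rexp(n - h)       # right 'outliers' only (rho = 0); mu_G = 1
eps <- c(eps_good, eps_out)
mean(eps) / max(eps_good)                     # roughly 1 - lambda = 0.3 (slow convergence)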

References

Alfons, A., Croux, C., & Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics, 7(1), 226–248. https://doi.org/10.1214/12-AOAS575

Atkinson, A. C., Riani, M., & Cerioli, A. (2010). The forward search: Theory and data analysis (with discussion). Journal of the Korean Statistical Society, 39(2), 117–134. https://doi.org/10.1016/j.jkss.2010.02.007

Berenguer-Rico, V., & Nielsen, B. (2017). Marked and weighted empirical processes of residuals with applications to robust regressions (Discussion Paper 841). Oxford: Dept. Econ.

Billingsley, P. (1968). Convergence of probability measures. John Wiley & Sons.

Butler, R. W. (1982). Nonparametric interval and point prediction using data trimmed by a Grubbs-type outlier rule. The Annals of Statistics, 10(1), 197–204. https://doi.org/10.1214/aos/1176345702

Butler, R. W., Davies, P. L., & Jhun, M. (1993). Asymptotics for the minimum covariance determinant estimator. The Annals of Statistics, 21(3), 1385–1400. https://doi.org/10.1214/aos/1176349264

Castle, J. L., Doornik, J. A., & Hendry, D. F. (2011). Evaluating automatic model selection. Journal of Time Series Econometrics, 3(1), 1–33, Article 8. https://doi.org/10.2202/1941-1928.1097

Castle, J. L., Doornik, J. A., & Hendry, D. F. (2023). Robust discovery of regression models. Econometrics and Statistics, 26, 31–51. https://doi.org/10.1016/j.ecosta.2021.05.004

Čížek, P. (2005). Least trimmed squares in nonlinear regression under dependence. Journal of Statistical Planning and Inference, 136(11), 3967–3988. https://doi.org/10.1016/j.jspi.2005.05.004

Croux, C., & Rousseeuw, P. J. (1992). A class of high-breakdown scale estimators based on subranges. Communications in Statistics - Theory and Methods, 21(7), 1935–1951. https://doi.org/10.1080/03610929208830889

Csörgö, M. (1983). Quantile processes with statistical applications (Vol. 42, CBMS-NSF Regional Conference Series in Applied Mathematics). SIAM.

Dodge, Y., & Jurečková, J. (2000). Adaptive regression. Springer.

Doornik, J. A. (2016). An example of instability: Discussion of the paper by Søren Johansen and Bent Nielsen. Scandinavian Journal of Statistics, 43(2), 357–359. https://doi.org/10.1111/sjos.12207

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222(602), 309–368. https://doi.org/10.1098/rsta.1922.0009

Gallegos, M. T., & Ritter, G. (2009). Trimmed ML estimation of contaminated mixtures. Sankhyā: The Indian Journal of Statistics, Series A, 71(2), 164–220.

Gissibl, N., Klüppelberg, C., & Lauritzen, S. (2021). Identifiability and estimation of recursive max-linear models. Scandinavian Journal of Statistics, 48(1), 188–211. https://doi.org/10.1111/sjos.12446

Gumbel, E. J., & Keeney, R. D. (1950). The extremal quotient. The Annals of Mathematical Statistics, 21(4), 523–538. https://doi.org/10.1214/aoms/1177729749

Hald, A. (2007). A history of parametric statistical inference from Bernoulli to Fisher, 1713–1935. Springer.

Hampel, F. R. (1971). A general qualitative definition of robustness. The Annals of Mathematical Statistics, 42(6), 1887–1896. https://doi.org/10.1214/aoms/1177693054

Hawkins, D. M., & Olive, D. J. (1999). Improved feasible solution algorithms for high breakdown estimation. Computational Statistics & Data Analysis, 30(1), 1–11. https://doi.org/10.1016/S0167-9473(98)00082-6

Hendry, D. F., & Doornik, J. A. (2014). Empirical model discovery and theory evaluation: Automatic selection methods in econometrics. MIT Press.

Hendry, D. F., & Santos, C. (2010). An automatic test of super exogeneity. In M. V. Watson, T. Bollerslev, & J. Russell (Eds.), Volatility and time series econometrics. Oxford University Press.

Hössjer, O. (1994). Rank-based estimates in the linear model with high breakdown point. Journal of the American Statistical Association, 89(425), 149–158. https://doi.org/10.1080/01621459.1994.10476456

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73–101. https://doi.org/10.1214/aoms/1177703732

Huber, P. J., & Ronchetti, E. M. (2009). Robust statistics (2nd ed.). John Wiley & Sons.

Humphreys, R. M. (1978). Studies of luminous stars in nearby galaxies. I. Supergiants and O stars in the Milky Way. Astrophysical Journal Supplement Series, 38(4), 309–350. https://doi.org/10.1086/190559

Johansen, S. (1978). The product limit estimator as maximum likelihood estimator. Scandinavian Journal of Statistics, 5(4), 195–199.

Johansen, S., & Nielsen, B. (2016a). Analysis of the forward search using some new results for martingales and empirical processes. Bernoulli, 22(2), 1131–1183. Corrigendum (2019) 25, 3201. https://doi.org/10.3150/14-BEJ689

Johansen, S., & Nielsen, B. (2016b). Asymptotic theory of outlier detection algorithms for linear time series regression models (with discussion). Scandinavian Journal of Statistics, 43(2), 321–348. https://doi.org/10.1111/sjos.12174

Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27(4), 887–906. https://doi.org/10.1214/aoms/1177728066

Knight, K. (2017). On the asymptotic distribution of the l estimator in linear regression. Mimeo.

Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1982). Extremes and related properties of random sequences and processes. Springer.

R Core Team (2020). R: A language and environment for statistical computing. Vienna, Austria.

Riani, M., Atkinson, A. C., & Perrotta, D. (2014). A parametric framework for the comparison of methods of very robust regression. Statistical Science, 29(1), 128–143. https://doi.org/10.1214/13-STS437

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79(388), 871–880. https://doi.org/10.1080/01621459.1984.10477105

Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In W. Grossmann, G. Pflug, I. Vincze, & W. Wertz (Eds.), Mathematical statistics and applications (pp. 283–297). Reidel.

Rousseeuw, P. J., & Hubert, M. (1997). Recent developments in PROGRESS. In Y. Dodge (Ed.), L1-statistical procedures and related topics (Lecture Notes–Monograph Series, Vol. 31, pp. 201–214). Institute of Mathematical Statistics.

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. John Wiley & Sons.

Rousseeuw, P. J., Perrotta, D., Riani, M., & Hubert, M. (2019). Robust monitoring of time series with application to fraud detection. Econometrics and Statistics, 9, 108–121. https://doi.org/10.1016/j.ecosta.2018.05.001

Rousseeuw, P. J., & van Driessen, K. (2000). An algorithm for positive-breakdown regression based on concentration steps. In W. Gaul, O. Opitz, & M. Schader (Eds.), Data analysis: Scientific modeling and practical application (pp. 335–346). Springer Verlag.

Scholz, F. W. (1980). Towards a unified definition of maximum likelihood. Canadian Journal of Statistics, 8(2), 193–203. https://doi.org/10.2307/3315231

U.S. Bureau of the Census (1992). 1990 census of population: General population statistics, CP-1-1. Washington DC.

U.S. Bureau of the Census (1994). Statistical abstract of the United States: 1994. Washington DC.

Vansina, F., & De Grève, J. P. (1982). Close binary systems before and after mass transfer. III. Spectroscopic binaries. Astrophysics and Space Science, 87(1-2), 377–401. https://doi.org/10.1007/BF00648931

Víšek, J. A. (2006). The least trimmed squares; part III: Asymptotic normality. Kybernetika, 42(2), 203–224.

Wooldridge, J. M. (2015). Introductory econometrics (6th ed.). Cengage Learning.

Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. The Annals of Statistics, 15(2), 642–656. https://doi.org/10.1214/aos/1176350366

Author notes

Conflict of interest: None declared.
