Alexander Henzi, Johanna F. Ziegel, Tilmann Gneiting, Isotonic Distributional Regression, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 83, Issue 5, November 2021, Pages 963–993, https://doi.org/10.1111/rssb.12450
Abstract
Isotonic distributional regression (IDR) is a powerful non-parametric technique for the estimation of conditional distributions under order restrictions. In a nutshell, IDR learns conditional distributions that are calibrated, and simultaneously optimal relative to comprehensive classes of relevant loss functions, subject to isotonicity constraints in terms of a partial order on the covariate space. Non-parametric isotonic quantile regression and non-parametric isotonic binary regression emerge as special cases. For prediction, we propose an interpolation method that generalizes extant specifications under the pool adjacent violators algorithm. We recommend the use of IDR as a generic benchmark technique in probabilistic forecast problems, as it does not involve any parameter tuning nor implementation choices, except for the selection of a partial order on the covariate space. The method can be combined with subsample aggregation, with the benefits of smoother regression functions and gains in computational efficiency. In a simulation study, we compare methods for distributional regression in terms of the continuous ranked probability score (CRPS) and estimation error, which are closely linked. In a case study on raw and post-processed quantitative precipitation forecasts from a leading numerical weather prediction system, IDR is competitive with state of the art techniques.
1 Introduction
There is an emerging consensus in the transdisciplinary literature that regression analysis should be distributional, with Hothorn et al. (2014) arguing forcefully that
> [t]he ultimate goal of regression analysis is to obtain information about the conditional distribution of a response given a set of explanatory variables.
Distributional regression marks a clear break from the classical view of regression, which has focused on estimating the conditional mean of the response variable in terms of one or more explanatory variable(s) or covariate(s). Later extensions have considered other functionals of the conditional distributions, such as quantiles or expectiles (Koenker, 2005; Newey & Powell, 1987; Schulze Waltrup et al., 2015). However, the reduction of a conditional distribution to a single-valued functional results in tremendous loss of information. Therefore, from the perspectives of both estimation and prediction, regression analysis ought to be distributional.
In the extant literature, both parametric and non-parametric approaches to distributional regression are available. Parametric approaches assume that the conditional distribution of the response is of a specific type (e.g. Gaussian) with an analytic relationship between the covariates and the distributional parameters. Key examples include statistically post-processed meteorological and hydrologic forecasts, as exemplified by Gneiting et al. (2005), Schefzik et al. (2013) and Vannitsem et al. (2018). In powerful semi-parametric variants, the conditional distributions remain parametric, but the influence of the covariates on the parameter values is modelled non-parametrically, for example by using generalized additive models (Klein et al., 2015; Rigby & Stasinopoulos, 2005; Umlauf & Kneib, 2018) or modern neural networks (Gasthaus et al., 2019; Rasp & Lerch, 2018). In related developments, semiparametric versions of quantile regression (Koenker, 2005) and transformation methods (Hothorn et al., 2014) can be leveraged for distributional regression.
Non-parametric approaches to distributional regression include kernel or nearest neighbour methods that depend on a suitable notion of distance on the covariate space. Then, the empirical distribution of the response for neighbouring covariates in the training set is used for distributional regression, with possible weighting in dependence on the distance to the covariate value of interest. Kernel smoothing methods and mixture approaches allow for absolutely continuous conditional distributions (Dunson et al., 2007; Hall et al., 1999; Li & Racine, 2008). Classification and regression trees partition the covariate space into leaves, and assign constant regression functions on each leaf (Breiman et al., 1984). Linear aggregation via bootstrap aggregation (bagging) or subsample aggregation (subagging) yields random forests (Breiman, 2001), which are increasingly being used to generate conditional predictive distributions, as proposed by Hothorn et al. (2004) and Meinshausen (2006).
Isotonicity is a natural constraint in estimation and prediction problems. Consider, for example, post-processing techniques in weather forecasting, where the covariates stem from the output of numerical weather prediction (NWP) models, and the response variable is the respective future weather quantity. Intuitively, if the NWP model output indicates a larger precipitation accumulation, the associated regression functions ought to be larger as well. Isotonic relationships of this type hold in a plethora of applied settings. In fact, standard linear regression analysis rests on the assumption of isotonicity, in the form of monotonicity in the values of the covariate(s), save for changes in sign.
Concerning non-parametric regression for a conditional functional, such as the mean or a quantile, there is a sizable literature on estimation under the constraint of isotonicity. The classical work of Ayer et al. (1955), Bartholomew (1959a, b), Brunk (1955), van Eeden (1958), Miles (1959) is summarized in Barlow et al. (1972), de Leeuw et al. (2009), Robertson et al. (1988). Subsequent approaches include Bayesian and non-Bayesian smoothing techniques (e.g. Dette et al., 2006; Mammen, 1991; Neelon & Dunson, 2004; Shively et al., 2009), and reviews are available in Groeneboom & Jongbloed (2014) and Guntuboyina & Sen (2018).
In distributional regression, it may not be immediately clear what is meant by isotonicity, and the literature typically considers one ordinal covariate only (e.g. Davidov & Iliopoulos, 2012; El Barmi & Mukerjee, 2005; Hogg, 1965; Rojo & El Barmi, 2003), with a notable exception being the work of Mösching & Dümbgen (2020b), whose considerations allow for a real-valued covariate. In the general case of a partially ordered covariate space, which we consider here, it is unclear whether semi- or non-parametric techniques might be capable of handling monotonicity constraints, and suitable notions of isotonicity remain to be developed.
To this end, we assume that the response Y is real-valued, and equip the covariate space $\mathcal{X}$ with a partial order ≼. Our aim is to estimate the conditional distribution of Y given the covariate X, for short $\mathcal{L}(Y \mid X = x)$, on training data, in a way that respects the partial order, and we desire to use this estimate for prediction. Formally, a distributional regression technique generates a mapping from $x \in \mathcal{X}$ to a probability measure $F_x$, which serves to model the conditional distribution $\mathcal{L}(Y \mid X = x)$. This mapping is isotonic if $x \preccurlyeq x'$ implies $F_x \leq_{\mathrm{st}} F_{x'}$, where $\leq_{\mathrm{st}}$ denotes the usual stochastic order; that is, $G \leq_{\mathrm{st}} H$ if $G(y) \geq H(y)$ for all $y \in \mathbb{R}$, where we use the same symbols for the probability measures G, H and their associated conditional cumulative distribution functions (CDFs). Equivalently, $G \leq_{\mathrm{st}} H$ holds if $G^{-1}(\alpha) \leq H^{-1}(\alpha)$ for all α ∈ (0, 1), where $G^{-1}(\alpha) = \inf\{y : G(y) \geq \alpha\}$ is the standard quantile function (Shaked & Shanthikumar, 2007).
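To make the two equivalent characterizations concrete, here is a small numerical check in R; the Gaussian example is ours, purely for illustration, and not from the paper.

```r
# Numerical check of the usual stochastic order: G <=_st H holds if
# G(y) >= H(y) for all y, equivalently if the quantile functions satisfy
# G^{-1}(alpha) <= H^{-1}(alpha) for all alpha in (0, 1).
grid <- seq(-10, 10, by = 0.01)
all(pnorm(grid, mean = 0) >= pnorm(grid, mean = 1))      # TRUE: N(0,1) <=_st N(1,1)

alphas <- seq(0.01, 0.99, by = 0.01)
all(qnorm(alphas, mean = 0) <= qnorm(alphas, mean = 1))  # TRUE: same order via quantiles
```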
Useful comparisons of predictive distributions are in terms of proper scoring rules, of which the most prominent and most relevant instance is the continuous ranked probability score (CRPS; Gneiting & Raftery, 2007; Matheson & Winkler, 1976). We show that there is a unique isotonic distributional regression that is optimal with respect to the CRPS (Theorem 1), and refer to it as the isotonic distributional regression (IDR). As it turns out, IDR is a universal solution, in that the estimate is optimal with respect to a broad class of proper scoring rules (Theorem 2). Classical special cases such as non-parametric isotonic quantile regression and probabilistic classifiers for threshold-defined binary events are nested by IDR. Simultaneously, IDR avoids pitfalls commonly associated with non-parametric distributional regression, such as suboptimal partitions of the covariate space and level crossing (Athey et al., 2019, p. 1167).
Figure 1: Simulation example for a sample of size n = 600 from the distribution in (1): (a) true conditional CDFs (smooth) and IDR estimates (step functions) for selected values of the covariate; (b) IDR estimated conditional distributions. The shaded bands correspond to probability mass 0.10 each, with the darkest shade marking the central interval. Vertical strips indicate the cross-sections corresponding to the values of the covariate in panel (a).
The remainder of the paper is organized as follows. The methodological core of the paper is in Section 2, where we prove existence, uniqueness and universality of the IDR solution, discuss computational issues and asymptotic consistency, and propose strategies for prediction. In Section 3, we turn to the critical issue of the choice of a partial order on the covariate space. Section 4 reports on a comparative simulation study that addresses both prediction and estimation, and Section 5 is devoted to a case study on probabilistic quantitative precipitation forecasts, with covariates provided by the European Centre for Medium-Range Weather Forecasts (ECMWF) ensemble system. Precipitation accumulations feature unfavourable properties that challenge parametric approaches to distributional regression: The conditional distributions have a point mass at zero, and they are continuous and right skewed on the positive half-axis. In a comparison to state-of-the-art methods that have been developed specifically for the purpose, namely Bayesian model averaging (BMA; Sloughter et al., 2007), ensemble model output statistics (EMOS; Scheuerer, 2014) and heteroscedastic censored logistic regression (HCLR; Messner et al., 2014), the (out-of-sample) predictive performance of IDR is competitive, despite the method being generic, and being fully automatic once a partial order on the covariate space has been chosen.
We close the paper with a discussion in Section 6, where we argue that IDR provides a very widely applicable, competitive benchmark in probabilistic forecasting problems. The use of benchmark techniques has been called for across application domains (e.g. Basel Committee on Banking Supervision, 2016; Pappenberger et al., 2015; Rossi, 2013; Vogel et al., 2018), and suitable methods should be competitive in terms of predictive performance, while avoiding implementation decisions that may vary from user to user. IDR is well suited to this purpose, as it is entirely generic, does not involve any implementation decisions, other than the choice of the partial order, applies to all types of real-valued outcomes with discrete, continuous or mixed discrete-continuous distributions, and accommodates general types of covariate spaces.
2 Isotonic Distributional Regression
We proceed to introduce the IDR technique. To this end, we first review basic facts on proper scoring rules and notions of calibration. Then we define the IDR solution, prove existence, uniqueness and universality, and discuss its computation and asymptotic consistency. Thereafter, we turn from estimation to prediction and describe how IDR can be used in out-of-sample forecasting. Throughout, we identify a Borel probability measure on the real line with its cumulative distribution function (CDF), and we denote the extended real line by $\overline{\mathbb{R}} = [-\infty, \infty]$.
2.1 Preliminaries
Introduced by Matheson & Winkler (1976), the CRPS has become popular across application areas and methodological communities, both for the purposes of evaluating predictive performance and as a loss function in estimation; see, for example, Gasthaus et al. (2019), Gneiting et al. (2005), Hersbach (2000), Hothorn et al. (2014), Pappenberger et al. (2015) and Rasp & Lerch (2018). For a predictive CDF F and an observation y, the CRPS is given by $\mathrm{CRPS}(F, y) = \int_{\mathbb{R}} \left(F(z) - \mathbb{1}\{y \leq z\}\right)^2 \, \mathrm{d}z$. The CRPS is reported in the same unit as the response variable, and it reduces to the absolute error, |x − y|, if F is the point or Dirac measure in $x \in \mathbb{R}$.
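For an empirical predictive distribution, the CRPS admits the well-known kernel form (Gneiting & Raftery, 2007); a minimal R sketch, with an illustrative gamma ensemble that is not from the paper:

```r
# CRPS of the empirical CDF of x = (x_1, ..., x_m) at observation y, via
# CRPS(F, y) = mean |x_i - y| - (1/2) * mean |x_i - x_j|.
crps_empirical <- function(x, y) {
  mean(abs(x - y)) - mean(abs(outer(x, x, "-"))) / 2
}

set.seed(1)
ens <- rgamma(52, shape = 2)   # hypothetical 52-member ensemble
crps_empirical(ens, y = 1.5)
crps_empirical(3, y = 1.5)     # one-point ensemble: absolute error |3 - 1.5|
```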
2.2 Existence, uniqueness and universality
A partial order relation ≼ on a set $\mathcal{X}$ has the same properties as a total order, namely reflexivity, antisymmetry and transitivity, except that the elements need not be comparable, that is, there might be elements $x$ and $x'$ such that neither $x \preccurlyeq x'$ nor $x' \preccurlyeq x$ holds. A key example is the componentwise order on $\mathbb{R}^d$.
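A small illustration of comparability and incomparability under the componentwise order on $\mathbb{R}^2$:

```r
# Componentwise partial order on R^d: x precedes x' iff x_j <= x'_j for
# every coordinate j; pairs that fail in both directions are incomparable.
leq_comp <- function(x, xp) all(x <= xp)

leq_comp(c(1, 2), c(2, 3))   # TRUE
leq_comp(c(1, 3), c(2, 2))   # FALSE
leq_comp(c(2, 2), c(1, 3))   # FALSE: (1, 3) and (2, 2) are incomparable
```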
Let $I \subseteq \mathbb{R}$ be an interval, and let S be a proper scoring rule with respect to a class $\mathcal{P}$ of probability distributions on I that contains all distributions with finite support. Given training data in the form of a covariate vector $x = (x_1, \ldots, x_n) \in \mathcal{X}^n$ and response vector $y = (y_1, \ldots, y_n) \in I^n$, we may interpret any mapping from $\{x_1, \ldots, x_n\}$ to $\mathcal{P}$ as a distributional regression function. Throughout, we equip $\mathcal{P}$ with the usual stochastic order.
Definition 1 (S-based isotonic regression). An element $\hat{F} = (\hat{F}_1, \ldots, \hat{F}_n)$ of $\mathcal{P}^n$ is an S-based isotonic regression of $y$ on $x$ if it is a minimizer of the empirical loss $$\ell_S(F) = \frac{1}{n} \sum_{i=1}^{n} S(F_i, y_i)$$ over all $F = (F_1, \ldots, F_n)$ in $\mathcal{P}^n$ that are isotonic, in the sense that $x_i \preccurlyeq x_j$ implies $F_i \leq_{\mathrm{st}} F_j$.
In plain words, an S-based isotonic regression achieves the best fit in terms of the scoring rule S, subject to the conditional CDFs satisfying the partial order constraints induced by the covariate values $x_1, \ldots, x_n$. The definition and the subsequent results can be extended to losses of the form $\sum_{i=1}^{n} w_i\, S(F_i, y_i)$ with rational, strictly positive weights $w_1, \ldots, w_n$. The adaptations are straightforward and left to the reader.
Furthermore, the definition of an S-based isotonic regression as a minimizer of $\ell_S$ continues to apply when $\mathcal{X}$ is equipped with a pre- or quasiorder ≼ instead of a partial order. Preorders are not necessarily antisymmetric, and so there might be elements $x, x' \in \mathcal{X}$ such that $x \preccurlyeq x'$ and $x' \preccurlyeq x$ but $x \neq x'$. In this setting, we define x and $x'$ to be equivalent if $x \preccurlyeq x'$ and $x' \preccurlyeq x$, and order equivalence classes by setting $[x] \preccurlyeq [x']$ if representatives of the equivalence classes satisfy $x \preccurlyeq x'$. Then this binary relation defines a partial order on the set of equivalence classes, and the S-based isotonic regression with the equivalence classes as covariates and this partial order coincides with the original S-based isotonic regression.
In Supplementary Section S1 we prove the following result.
Theorem 1(existence and uniqueness). There exists a unique CRPS-based isotonic regression of y on x.
At first sight, the specific choice of the CRPS as a loss function may seem arbitrary. However, the subsequent result, which we prove in Supplementary Section S1, reveals that IDR is simultaneously optimal with respect to broad classes of proper scoring rules that include all relevant choices in the extant literature. The popular logarithmic score allows for the comparison of absolutely continuous distributions with respect to a fixed dominating measure only, and thus is not applicable here. Statements concerning calibration are with respect to the empirical distribution of the training data.
Theorem 2 (universality). The IDR solution $\hat{F}$ of y on x is threshold calibrated and has the following properties.

- (a) The IDR solution is an S-based isotonic regression of y on x under any scoring rule of the form (11) or (12), that is, any mixture of the elementary quantile scoring function (6) with respect to H, or of the elementary probability scoring rule (7) with respect to M, where H and M are locally finite Borel measures on (0, 1) × $\mathbb{R}$ and $\mathbb{R}$ × (0, 1), respectively.
- (b) For every α ∈ (0, 1), the vector $(\hat{F}_1^{-1}(\alpha), \ldots, \hat{F}_n^{-1}(\alpha))$ of α-quantiles is a minimizer of the empirical loss (13) over all isotonic quantile vectors, under any function s(q, y) which is left-continuous in both arguments and such that the induced score is a proper scoring rule for the α-quantile on I.
- (c) For every threshold value z ∈ I, the vector $(\hat{F}_1(z), \ldots, \hat{F}_n(z))$ is a minimizer of the empirical loss (14) over all ordered tuples of event probabilities, under any function s that is a proper scoring rule for binary events, is left-continuous in its first argument, and is real-valued, except possibly s(0, 1) = −∞ or s(1, 0) = −∞.
The quantile weighted and threshold weighted versions of the CRPS studied by Gneiting & Ranjan (2011) arise from (11) and (12) with H and M absolutely continuous with respect to the Lebesgue measure λ, with densities given by σ-finite Borel weight measures on (0, 1) and $\mathbb{R}$, respectively. If H and M are Lebesgue measures, we recover the mixture representations (3) and (4) of the CRPS. By results of Ehm et al. (2016), if H is concentrated on $\{\alpha\} \times \mathbb{R}$ and M is concentrated on $\{z\} \times (0, 1)$, these representations cover essentially all proper scoring rules that depend on the predictive distribution F via $F^{-1}(\alpha)$ or F(z) only, yielding the universal optimality statements in parts (b) and (c) of Theorem 2.
In particular, as a special case of (13), the IDR solution is a minimizer of the quantile loss under the asymmetric piecewise linear or pinball function (5) that lies at the heart of quantile regression (Koenker, 2005). Consequently, as the mixture representation (2) of the CRPS may suggest, IDR nests classical non-parametric isotonic quantile regression as introduced and studied by Casady & Cryer (1976) and Robertson & Wright (1975). In other words, part (b) of Theorem 2 demonstrates that, if we (hypothetically) perform non-parametric isotonic quantile regression at every level α ∈ (0, 1) and piece the conditional quantile functions together, we recover the IDR solution. However, the IDR solution is readily computable (Section 2.3), without invoking approximations or truncations, unlike brute force approaches to simultaneous quantile regressions. Loss functions of the form (13) also include the interval score (Gneiting & Raftery, 2007, equation (43); Winkler, 1972), which constitutes the most widely used proper performance measure for interval forecasts.
In the special case of a binary response variable, we see from (c) and (14) that the IDR solution is an S-based isotonic regression under essentially any applicable proper scoring rule S. Furthermore, threshold calibration is the strongest possible notion of calibration in this setting (Gneiting & Ranjan, 2013, Theorem 2.11), so the IDR solution is universal in every regard. In the further special case of a total order on the covariate space, the IDR and pool adjacent violators (PAV) algorithm solutions coincide, and the statement in (c) is essentially equivalent to Theorem 1.12 of Barlow et al. (1972). In particular, the IDR or PAV solution yields both the non-parametric maximum likelihood estimate and the non-parametric least squares estimate under the constraint of isotonicity. The latter suggests a computational implementation via quadratic programming, to which we turn now.
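To make the total-order special case concrete, here is a minimal R sketch of the pool adjacent violators algorithm for least-squares isotonic regression, together with its use at a fixed threshold; the toy data are ours, purely for illustration.

```r
# Pool adjacent violators (PAV) for least-squares isotonic regression of
# responses y on a totally ordered covariate (y supplied in covariate
# order): adjacent blocks whose means violate monotonicity are pooled.
pava <- function(y) {
  n <- length(y)
  means <- numeric(n)   # block means
  sizes <- integer(n)   # block sizes
  k <- 0                # current number of blocks
  for (i in seq_len(n)) {
    k <- k + 1
    means[k] <- y[i]
    sizes[k] <- 1L
    while (k > 1 && means[k - 1] > means[k]) {  # violation: pool blocks
      means[k - 1] <- (sizes[k - 1] * means[k - 1] + sizes[k] * means[k]) /
        (sizes[k - 1] + sizes[k])
      sizes[k - 1] <- sizes[k - 1] + sizes[k]
      k <- k - 1
    }
  }
  rep(means[seq_len(k)], sizes[seq_len(k)])  # expand block means to fits
}

pava(c(1, 3, 2, 4, 3, 5))  # 1.0 2.5 2.5 3.5 3.5 5.0

# IDR at a fixed threshold z under a total order: the CDF values F_i(z)
# form the decreasing least-squares fit to the indicators 1{y_i <= z},
# obtained from PAV by sign flips.
y <- c(0.2, 1.4, 0.9, 2.3, 1.7)
z <- 1
-pava(-as.numeric(y <= z))  # 1.0 0.5 0.5 0.0 0.0
```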
2.3 Computational aspects
Clearly, a challenge in the computational implementation of IDR with general partial orders is that the number of variables in the quadratic programming problem (15) grows quadratically in the size n of the training set. As a remedy, we propose subsample aggregation, much in the spirit of random forests that rely on bootstrap aggregated (bagged) classification and regression trees (Breiman, 1996, 2001). It was observed early on that random forests generate conditional predictive distributions (Hothorn et al., 2004; Meinshausen, 2006), and recent applications include the statistical post-processing of ensemble weather forecasts (Schlosser et al., 2019; Taillardat et al., 2016; Taillardat et al., 2019). Bühlmann & Yu (2002) and Buja & Stuetzle (2006) argue forcefully that subsample aggregation (subagging) tends to be equally effective as bagging, but at considerably lower computational cost.
In view of the superlinear computational costs of IDR, smart uses of subsample aggregation yield major efficiency gains, taking into account that the estimation on different subsamples can be performed in parallel. Isotonicity is preserved under linear aggregation, and the aggregated conditional CDFs can be inverted to generate isotonic conditional quantile functions, with the further benefit of smoother estimates in continuous settings. A detailed investigation of optimal subsample aggregation for IDR is a topic for future research. For illustration, Figure 2 returns to the simulation example in Figure 1, but now with a much larger training sample of size n = 10,000 from the distribution in (1). Linear aggregation based on 100 subsamples (drawn without replacement) of size 1000 each is superior to the brute force approach on the full training set in terms of estimation accuracy. The computation on the full dataset for this simulation example takes 11.7 s for the naive implementation, but only 1.1 s for the sequential algorithm of Henzi et al. (2020). Subagging gives computation times of 9.9 and 2.5 s, respectively, or 1.8 and 0.5 s when parallelized over eight cores.
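The aggregation step itself is simple: estimate on subsamples and average the fitted CDFs pointwise. A minimal R sketch follows, with the empirical CDF of the subsample responses standing in for a fitted IDR CDF (a simplifying assumption for brevity):

```r
# Linear (sub)aggregation of predictive CDFs: fit an estimator on B
# subsamples drawn without replacement and average the B estimated CDFs
# pointwise. The per-subsample "fit" below is just the empirical CDF of
# the subsample responses, a stand-in for the IDR estimate at a point.
subagged_cdf <- function(y, B = 100, m = length(y) %/% 2) {
  cdfs <- replicate(B, {
    idx <- sample(length(y), m)   # subsample without replacement
    ecdf(y[idx])                  # stand-in for a fitted predictive CDF
  }, simplify = FALSE)
  function(z) mean(vapply(cdfs, function(F) F(z), numeric(1)))
}

set.seed(1)
F_agg <- subagged_cdf(rgamma(10000, shape = 2))
F_agg(2)   # aggregated CDF evaluated at the threshold z = 2
```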
Figure 2: Simulation example for a sample of size n = 10,000 from the distribution in (1). The true conditional CDFs (smooth dashed graphs) are compared to IDR estimates (step functions) based on (a) the full training sample of size n = 10,000 and (b) linear aggregation of IDR estimates on 100 subsamples of size 1000 each.
2.4 Consistency
We proceed to prove uniform consistency of the IDR estimator. While strong consistency of non-parametric isotonic quantile regression for single quantiles was proved decades ago (Casady & Cryer, 1976; Robertson & Wright, 1975), uniform consistency and rates of convergence for the IDR estimator have been established only recently, and exclusively in the case of a total order, see El Barmi & Mukerjee (2005, Theorem 1) and Mösching & Dümbgen (2020b, Theorem 3.3).
In Supplementary Section S2, we establish the following consistency theorem, which covers key examples of partial orders and is based on strictly weaker assumptions than the results of Mösching & Dümbgen (2020b). However, in contrast to their work, we do not provide rates of convergence. The choice of $[0,1]^d$ as the covariate space merely serves to simplify the presentation: as IDR is invariant under strictly isotonic transformations, any covariate vector can be transformed to have support in $[0,1]^d$, and the componentwise partial order can be replaced by any weaker preorder. A key assumption uses the concept of an antichain in a partially ordered set, which is a subset A that does not admit comparisons, in the sense that u ≼ v for u, v ∈ A implies u = v. As we discuss subsequently, results of Brightwell (1992) imply that the respective distributional condition is mild.
Theorem 3 (uniform consistency). Let $[0,1]^d$ be endowed with the componentwise partial order and the maximum norm. Let further $(X_{n,i}, Y_{n,i})$, for $i = 1, \ldots, n$ and $n \in \mathbb{N}$, be a triangular array such that $(X_{n,1}, Y_{n,1}), \ldots, (X_{n,n}, Y_{n,n})$ are independent and identically distributed random vectors for each n, and let $\hat{F}_n$ denote the associated IDR solution. Assume that

- (a) for all non-degenerate rectangles $J \subseteq [0,1]^d$, there exists a constant $c_J > 0$ such that $\#\{i : X_{n,i} \in J\} \geq c_J\, n$ with asymptotic probability one, that is, if $A_n$ denotes the event that this inequality holds, then $\mathbb{P}(A_n) \to 1$ as n → ∞;
- (b) for some γ ∈ (0, 1), every antichain in $\{X_{n,1}, \ldots, X_{n,n}\}$ has cardinality at most $n^{\gamma}$, with asymptotic probability one.

Assume further that the true conditional CDFs $F_x(y) = \mathbb{P}(Y \leq y \mid X = x)$ satisfy

- (c) $F_x(y)$ is decreasing in $x \in [0,1]^d$ for all $y \in \mathbb{R}$;
- (d) for every $\varepsilon > 0$, there exists $\delta > 0$ such that $|F_x(y) - F_x(y')| \leq \varepsilon$ whenever $|y - y'| \leq \delta$, uniformly in $x \in [0,1]^d$.

Then for every ε > 0 and δ > 0, the probability that the IDR estimate deviates from the true conditional CDFs by more than ε, uniformly in the threshold and over covariate values at distance at least δ from the boundary of $[0,1]^d$, tends to zero as n → ∞, in the sense of equation (19).
Assumption (a) requires that the covariates are sufficiently dense in $[0,1]^d$, as is satisfied under strictly positive Lebesgue densities on $[0,1]^d$. In order to derive rates of convergence, the size of the rectangles J in (a) would need to decrease with n, as in condition (A.2) of Mösching & Dümbgen (2020b); we leave this type of extension as a direction for future work. Assumption (c) is the basic model assumption of IDR, while assumption (d) requires uniform continuity of the conditional distributions, which is weaker than the Hölder continuity in condition (A.1) of Mösching & Dümbgen (2020b).
Assumption (b), which is always satisfied in the case of a total order, calls for a more detailed discussion. In words, the maximal number of mutually incomparable elements needs to grow at a rate slower than the sample size n. Evidently, the easier elements can be ordered, the smaller the maximal antichain. Consequently, Theorem 3 continues to hold under the empirical stochastic order and the empirical increasing convex order on the covariates introduced in Sections 3.2 and 3.3, and indeed under any preorder that is weaker than the componentwise order. The key to understanding the distributional implications of (b) is Corollary 2 in Brightwell (1992), which states that for a sequence of independent random vectors from a uniform population on $[0,1]^d$ the size of a maximal antichain grows at a rate of $n^{1-1/d}$; see also the remark following the proof of Theorem 3 in Supplementary Section S2.
As comparability under the componentwise order is preserved under monotonic transformations, any covariate vector that can be obtained as a monotone transformation of a uniform random vector of arbitrary dimension guarantees (b). This includes, for example, all Gaussian random vectors with non-negative correlation coefficients. In this light, assumption (b) is rather weak, and well in line with the intuition that, for multivariate isotonic (distributional) regression to work well, there ought to be at least minor positive dependence between the covariates. In the context of our case study in Section 5, high positive correlations between the covariates are the rule, as exemplified by Table 3 in Raftery et al. (2005).
2.5 Prediction
As noted, the IDR solution is defined at the covariate values $x_1, \ldots, x_n$ only. Generally, if a (not necessarily optimal) distributional regression is available at these values, a key task in practice is to make a prediction at a new covariate value $x \in \mathcal{X}$ with $x \notin \{x_1, \ldots, x_n\}$. We denote the respective predictive CDF by F.
The difference between the bounds (if any) in (20) might be a useful measure of estimation uncertainty and could be explored as a promising avenue towards the quantification of ambiguity and generation of second-order probabilities (Ellsberg, 1961; Seo, 2009). In the context of ensemble weather forecasts, the assessment of ambiguity has been pioneered by Allen & Eckel (2012). Interesting links arise when the envelope in (20) is interpreted in the spirit of randomized predictive systems and conformal estimates as studied by Vovk et al. (2019); compare, for example, their Figure 5 with our Figure 4b below.
3 Partial Orders
The choice of a sufficiently informative partial order on the covariate space is critical to any successful application of IDR. In the extreme case of distinct, totally ordered covariate values $x_1 \prec \cdots \prec x_n$ and a perfect monotonic relationship to the response values $y_1 < \cdots < y_n$, the IDR distribution associated with $x_i$ is simply the point measure in $y_i$, for i = 1, …, n. The same happens in the other extreme, when there are no order relations at all. Hence, the partial order serves to regularize the IDR solution.
Thus far, we have simply assumed that the covariate space $\mathcal{X}$ is equipped with a partial order ≼, without specifying how the order might be defined. If $\mathcal{X} \subseteq \mathbb{R}^d$, the usual componentwise order will be suitable in many applications, and we investigate it in Section 3.1. For covariates that are ordinal and admit a ranking in terms of importance, a lexicographic order may be suitable.
If groups of covariates are exchangeable, as in our case study on quantitative precipitation forecasts, other types of order relations need to be considered. In Sections 3.2 and 3.3, we study relations that are tailored to this setting, namely, the empirical stochastic order and empirical increasing convex order. Proofs are deferred to Supplementary Section S3.
3.1 Componentwise order
This order becomes weaker as the dimension d of the covariate space increases: if $x = (x_1, \ldots, x_d)$ and $x' = (x'_1, \ldots, x'_d)$, then $(x_1, \ldots, x_{d-1}) \preccurlyeq (x'_1, \ldots, x'_{d-1})$ is a necessary condition for $x \preccurlyeq x'$. The following result is an immediate consequence of this observation and the structure of the optimization problem in Definition 1.
Proposition 1. Let the covariate vectors $x = (x_1, \ldots, x_n)$ and $\tilde{x} = (\tilde{x}_1, \ldots, \tilde{x}_n)$ have components $x_i \in \mathbb{R}^d$ and $\tilde{x}_i \in \mathbb{R}^{d+m}$ such that the first d components of $\tilde{x}_i$ equal $x_i$, for i = 1, …, n, and let S be a proper scoring rule. Then if $\mathbb{R}^d$ and $\mathbb{R}^{d+m}$ are equipped with the componentwise partial order, and $\hat{F}$ and $\tilde{F}$ denote S-based isotonic regressions of y on x and $\tilde{x}$, respectively, it is true that $\ell_S(\tilde{F}) \leq \ell_S(\hat{F})$.
In simple words, under the componentwise partial order, the inclusion of further covariates can only improve the in-sample fit. This behaviour resembles linear regression, where the addition of covariates can only improve the (unadjusted) R-square.
3.2 Empirical stochastic order
We now define a relation that is based on stochastic dominance and invariant under permutation.
Definition 2. Let $x = (x_1, \ldots, x_d)$ and $x' = (x'_1, \ldots, x'_d)$ denote elements of $\mathbb{R}^d$. Then x is smaller than or equal to $x'$ in empirical stochastic order, for short $x \preccurlyeq_{\mathrm{st}} x'$, if the empirical distribution of $x_1, \ldots, x_d$ is smaller than the empirical distribution of $x'_1, \ldots, x'_d$ in the usual stochastic order.
This relation is tailored to groups of exchangeable, real-valued covariates. The following result summarizes its properties and compares it to the componentwise order, which we denote by ≼.
Proposition 2. Let $x = (x_1, \ldots, x_d)$ and $x' = (x'_1, \ldots, x'_d)$ denote elements of $\mathbb{R}^d$ with order statistics $x_{(1)} \leq \cdots \leq x_{(d)}$ and $x'_{(1)} \leq \cdots \leq x'_{(d)}$.

- (a) The relation $x \preccurlyeq_{\mathrm{st}} x'$ is equivalent to $x_{(i)} \leq x'_{(i)}$ for i = 1, …, d.
- (b) If $x \preccurlyeq x'$ then $x \preccurlyeq_{\mathrm{st}} x'$.
- (c) If $x \preccurlyeq_{\mathrm{st}} x'$ and x and $x'$ are comparable in the componentwise partial order, then $x \preccurlyeq x'$.
- (d) If $x \preccurlyeq_{\mathrm{st}} x'$ and $x' \preccurlyeq_{\mathrm{st}} x$, then x and $x'$ are permutations of each other. Consequently, the relation $\preccurlyeq_{\mathrm{st}}$ defines a partial order on the set of equivalence classes of vectors in $\mathbb{R}^d$ that coincide up to permutation.
In a nutshell, the empirical stochastic order is equivalent to the componentwise order on the sorted elements, and this relation is weaker than the componentwise order. However, unlike the componentwise order, the empirical stochastic order does not degenerate as further covariates are added. To the contrary, empirical distributions of larger numbers of exchangeable variables become more informative and more easily comparable.
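By part (a) of Proposition 2, the empirical stochastic order can be checked by sorting; a minimal R sketch with illustrative vectors:

```r
# Empirical stochastic order on R^d: x <=_st x' iff the sorted vectors
# compare componentwise (Proposition 2a).
leq_st <- function(x, xp) all(sort(x) <= sort(xp))

leq_st(c(3, 1, 2), c(2, 4, 1))  # TRUE:  sorted (1,2,3) <= (1,2,4)
leq_st(c(3, 1, 2), c(0, 4, 5))  # FALSE: 1 > 0 in the first sorted position
```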
3.3 Empirical increasing convex order
In applications, the empirical stochastic order might be too strong, in the sense that it does not generate sufficiently informative constraints. In this light, we now define a weaker partial order on $\mathbb{R}^d$, which also is based on a partial order for probability measures. Specifically, let X and $X'$ be random variables with CDFs F and $F'$. Then F is smaller than $F'$ in increasing convex order if $\mathbb{E}\,\phi(X) \leq \mathbb{E}\,\phi(X')$ for all increasing convex functions ϕ such that the expectations exist (Shaked & Shanthikumar, 2007, Section 4.A.1).
Definition 3. Let $x = (x_1, \ldots, x_d)$ and $x' = (x'_1, \ldots, x'_d)$ denote elements of $\mathbb{R}^d$. Then x is smaller than or equal to $x'$ in empirical increasing convex order, for short $x \preccurlyeq_{\mathrm{icx}} x'$, if the empirical distribution of $x_1, \ldots, x_d$ is smaller than the empirical distribution of $x'_1, \ldots, x'_d$ in increasing convex order.
This notion provides another meaningful relation for groups of exchangeable covariates. The following result summarizes its properties and relates it to the empirical stochastic order.
Proposition 3. Let $x = (x_1, \ldots, x_d)$ and $x' = (x'_1, \ldots, x'_d)$ denote elements of $\mathbb{R}^d$ with order statistics $x_{(1)} \leq \cdots \leq x_{(d)}$ and $x'_{(1)} \leq \cdots \leq x'_{(d)}$.

- (a) The relation $x \preccurlyeq_{\mathrm{icx}} x'$ is equivalent to $\sum_{i=k}^{d} x_{(i)} \leq \sum_{i=k}^{d} x'_{(i)}$ for k = 1, …, d.
- (b) If $x \preccurlyeq_{\mathrm{st}} x'$ then $x \preccurlyeq_{\mathrm{icx}} x'$.
- (c) If $x \preccurlyeq_{\mathrm{icx}} x'$ then the inequality (21) holds, where g is the Gini mean difference (22).
- (d) If $x \preccurlyeq_{\mathrm{icx}} x'$ and $x' \preccurlyeq_{\mathrm{icx}} x$, then x and $x'$ are permutations of each other. Consequently, the relation $\preccurlyeq_{\mathrm{icx}}$ defines a partial order on the set of equivalence classes of vectors in $\mathbb{R}^d$ that coincide up to permutation.
Figure 3 illustrates the various types of relations for points in the positive quadrant of $\mathbb{R}^2$. As reflected by the nested character of the regions, the componentwise order is stronger than the empirical stochastic order, which in turn is stronger than the empirical increasing convex order. The latter is equivalent to weak majorization as studied by Marshall et al. (2011). In the special case of vectors with non-negative entries, their Corollary C.5 implies that x is dominated by $x'$ in empirical increasing convex order if, and only if, x lies in the convex hull of the points of the form $(\varepsilon_1 x'_{\pi(1)}, \ldots, \varepsilon_d x'_{\pi(d)})$, where π is a permutation and $\varepsilon_i \in \{0, 1\}$ for i = 1, …, d.
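By part (a) of Proposition 3, the relation reduces to comparisons of partial sums of the largest order statistics; a minimal R sketch with illustrative vectors:

```r
# Empirical increasing convex order (Proposition 3a, weak majorization):
# x <=_icx x' iff, for every k, the sum of the k largest entries of x is
# at most the sum of the k largest entries of x'.
leq_icx <- function(x, xp) {
  all(cumsum(sort(x, decreasing = TRUE)) <= cumsum(sort(xp, decreasing = TRUE)))
}

leq_icx(c(1, 3), c(0, 5))  # TRUE:  3 <= 5 and 1 + 3 <= 0 + 5
leq_icx(c(1, 3), c(2, 2))  # FALSE; the reverse holds: (2,2) <=_icx (1,3),
                           # since 2 <= 3 and 2 + 2 <= 1 + 3
```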
Figure 3: Regions of smaller, greater and incomparable elements in the positive quadrant of $\mathbb{R}^2$, as compared to the point (1, 3), for the (left) componentwise, (middle) empirical stochastic and (right) empirical increasing convex order. Coloured areas below (above) the point (1, 3) correspond to smaller (greater) elements, while blank areas contain elements incomparable to (1, 3) in the given partial order.
4 Simulation Study
Since we view IDR primarily as a tool for prediction, we compare it to other distributional regression methods in terms of predictive performance on continuous and discrete, univariate simulation examples, as measured by the CRPS. However, as noted below and formalized in Supplementary Section S4, the CRPS links asymptotically to estimation error, so under large validation samples, prediction and estimation are assessed simultaneously. A detailed comparative study on mixed discrete-continuous data with a multivariate covariate vector is given in the case study in the next section.
Implementation details for the various methods are given in Table S1 in Supplementary Section S5. Here we only note that QRF uses the grf package (Tibshirani et al., 2020) with a splitting rule that is tailored to quantiles (Athey et al., 2019). We see that, unlike IDR, its competitors rely on manual intervention and tuning. For example, QRFs perform poorly under the default value of 5 for the tuning parameter min.node.size, which we have raised to 40. Further improvement may arise when tuning parameters, such as honesty fraction and node size, are judiciously adjusted to the specific scenario and training sample size at hand. In contrast, IDR is entirely free of implementation decisions, except for the subagging variant, labelled 'IDR (subagging)' in Table 1, where we average predictions based on estimates on 100 subsamples of size n/2 each.
Table 1 shows the mean CRPS for the different methods and simulation scenarios. Scenario (23) is the same as in the introduction and illustrated in Figure 1. It has a smooth covariate–response relationship, and NP, SQR, and even the misspecified TRAM technique, which are tailored to this type of setting, outperform QRF and IDR. However, the assumption of continuity in the response is crucial, as the results under the discontinuous scenario (24) demonstrate, where IDR and its subagging variant perform best. In the non-isotonic scenario (25), IDR and its subagging variant retain acceptable performance, even though the key assumption is violated. Not surprisingly, SQR faces challenges in the Poisson scenario (26), where the conditional quantile functions are piecewise constant, and IDR is outperformed only by TRAM. Throughout, the simplistic subagging variant of IDR has slightly lower mean CRPS than the default variant that is estimated on the full training set, and it would be interesting to explore the relation to the super-efficiency phenomenon described by Banerjee et al. (2019).
Table 1: Mean CRPS in smooth (23), discontinuous (24), non-isotonic (25) and discrete (26) simulation scenarios with training sets of size n

| Method | Smooth (23), n = 500 | 1000 | 2000 | 4000 | Discontinuous (24), n = 500 | 1000 | 2000 | 4000 |
|---|---|---|---|---|---|---|---|---|
| NP | 3.561 | 3.542 | 3.532 | 3.525 | 3.614 | 3.582 | 3.562 | 3.549 |
| SQR | 3.571 | 3.543 | 3.530 | 3.524 | 3.647 | 3.619 | 3.606 | 3.600 |
| TRAM | 3.560 | 3.543 | 3.535 | 3.531 | 3.642 | 3.625 | 3.616 | 3.612 |
| QRF | 3.589 | 3.561 | 3.555 | 3.553 | 3.614 | 3.576 | 3.561 | 3.556 |
| IDR | 3.604 | 3.568 | 3.548 | 3.535 | 3.628 | 3.581 | 3.555 | 3.540 |
| IDR (subagging) | 3.595 | 3.561 | 3.543 | 3.532 | 3.620 | 3.577 | 3.551 | 3.537 |
| Ideal | 3.516 | 3.516 | 3.516 | 3.516 | 3.516 | 3.516 | 3.516 | 3.516 |

| Method | Non-isotonic (25), n = 500 | 1000 | 2000 | 4000 | Discrete (26), n = 500 | 1000 | 2000 | 4000 |
|---|---|---|---|---|---|---|---|---|
| NP | 3.564 | 3.544 | 3.534 | 3.527 | 1.136 | 1.131 | 1.128 | 1.126 |
| SQR | 3.574 | 3.546 | 3.533 | 3.527 | 1.129 | 1.121 | 1.116 | 1.114 |
| TRAM | 3.566 | 3.549 | 3.543 | 3.539 | 1.115 | 1.110 | 1.107 | 1.106 |
| QRF | 3.587 | 3.560 | 3.555 | 3.553 | 1.121 | 1.113 | 1.112 | 1.112 |
| IDR | 3.605 | 3.569 | 3.549 | 3.536 | 1.130 | 1.119 | 1.113 | 1.109 |
| IDR (subagging) | 3.597 | 3.564 | 3.545 | 3.534 | 1.128 | 1.118 | 1.112 | 1.109 |
| Ideal | 3.516 | 3.516 | 3.516 | 3.516 | 1.104 | 1.104 | 1.104 | 1.104 |
These results lend support to our belief that IDR can serve as a universal benchmark in probabilistic forecasting and distributional regression problems. For sufficiently large training samples, IDR offers competitive performance for any type of linearly ordered outcome, without reliance on tuning parameters or other implementation choices, except when subsampling is employed.
5 Case Study: Probabilistic Quantitative Precipitation Forecasts
The past decades have witnessed tremendous progress in the science and practice of weather prediction (Bauer et al., 2015). Arguably, the most radical innovation consists in the operational implementation of ensemble systems and an accompanying culture change from point forecasts to distributional forecasts (Leutbecher & Palmer, 2008). An ensemble system comprises multiple runs of numerical weather prediction (NWP) models, where the runs or members differ from each other in initial conditions and numerical-physical representations of atmospheric processes.
Ideally, one would like to interpret an ensemble forecast as a random sample from the conditional distribution of future states of the atmosphere. However, this is rarely advisable in practice, as ensemble forecasts are subject to biases and dispersion errors, thereby calling for some form of statistical post-processing (Gneiting & Raftery, 2005; Vannitsem et al., 2018). This is typically done by fitting a distributional regression model, with the weather variable of interest being the response variable, and the members of the forecast ensemble constituting the covariates, and applying this model to future NWP output, to obtain conditional predictive distributions for future weather quantities. State of the art techniques include Bayesian Model Averaging (BMA; Raftery et al., 2005; Sloughter et al., 2007), Ensemble Model Output Statistics (EMOS; Gneiting et al., 2005; Scheuerer, 2014), and Heteroscedastic Censored Logistic Regression (HCLR; Messner et al., 2014).
In this case study, we apply IDR to the statistical post-processing of ensemble forecasts of accumulated precipitation, a variable that is notoriously difficult to handle, due to its mixed discrete-continuous character, which requires both a point mass at zero and a right skewed continuous component on the positive half-axis. As competitors to IDR, we implement the BMA technique of Sloughter et al. (2007), the EMOS method of Scheuerer (2014), and HCLR (Messner et al., 2014), which are widely used parametric approaches that have been developed specifically for the purposes of probabilistic quantitative precipitation forecasting. In contrast, IDR is a generic technique and fully automatic, once the partial order on the covariate space has been specified.
5.1 Data
The data in our case study comprise forecasts and observations of 24 h accumulated precipitation from 6 January 2007 to 1 January 2017 at meteorological stations at airports in London, Brussels, Zurich and Frankfurt. As detailed in Table 2, data availability differs, and we remove days with missing entries station by station, so that all types of forecasts for a given station are trained and evaluated on the same data. Both forecasts and observations refer to the 24 h period from 6:00 UTC to 6:00 UTC on the following day. The observations are in the unit of millimetre and constitute the response variable in distributional regression. They are typically, but not always, reported in integer multiples of a millimetre.
Table 2: Meteorological stations at airports, with International Air Transport Association (IATA) airport code, World Meteorological Organization (WMO) station ID, and data availability in days (years)

| Station | IATA code | WMO ID | Data availability |
|---|---|---|---|
| Brussels, Belgium | BRU | 06449 | 3406 (9.3) |
| Frankfurt, Germany | FRA | 10637 | 3617 (9.9) |
| London, UK | LHR | 03772 | 2256 (6.2) |
| Zurich, Switzerland | ZRH | 06670 | 3241 (8.9) |
Statistical post-processing is both a calibration and a downscaling problem: Forecasts and observations are at different spatial scales, whence the unprocessed forecasts are subject to representativeness error (Wilks, 2019, Chapter 8.9). Indeed, if we interpret the predictive distribution from the raw ensemble (27) as the empirical distribution of all 52 members—a customary approach, which we adopt hereinafter—there is a strong bias in probability of precipitation forecasts: Days with exactly zero precipitation are predicted much less often at the NWP model grid box scale than they occur at the point scale of the observations.
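Under this convention, the raw-ensemble probability of precipitation is simply the fraction of the 52 members with positive accumulation; a minimal R sketch with hypothetical member values:

```r
# Probability of precipitation from the raw ensemble: the empirical
# distribution places mass 1/52 on each member, so PoP is the fraction
# of members with positive accumulation. Member values are illustrative.
members <- c(0, 0, 0.2, 1.4, 3.1, rep(0.5, 47))   # 52 hypothetical members
mean(members > 0)                                  # raw-ensemble PoP
```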
5.2 BMA, EMOS and HCLR
Before describing our IDR implementation, we review its leading competitors, namely, state of the art parametric distributional regression approaches that have been developed specifically for accumulated precipitation.
Heteroscedastic censored logistic regression (Messner et al., 2014) originates from the observation that conditional CDFs can be estimated by dichotomizing the random variable of interest at given thresholds and estimating the probability of threshold exceedance via logistic regression. The HCLR model used here assumes that square-root transformed precipitation follows a logistic distribution censored at zero, with a location parameter that is linear in the square-root transformed high-resolution and control forecasts and in the mean of the square-root transformed perturbed forecasts, and a scale parameter that is linear in the standard deviation of the square-root transformed perturbed forecasts. Like EMOS, HCLR can be interpreted within the GAMLSS framework of Rigby & Stasinopoulos (2005).
Code for BMA, EMOS and HCLR is available within the ensembleBMA, ensembleMOS and crch packages in R (Messner, 2018). Unless noted otherwise, we use default options in implementation decisions.
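For orientation, here is a hedged sketch of an HCLR-type fit with the crch package; the data frame train and its columns obs, hres, ctr, pert_mean and pert_sd are hypothetical stand-ins for the summary statistics described above, and the exact specification used in the study may differ.

```r
# Hedged sketch of an HCLR-type model with the crch package: square-root
# precipitation modelled as a logistic distribution censored at zero; the
# location covariates come before "|", the scale covariates after. The
# data frame and column names are hypothetical.
library(crch)
fit <- crch(
  sqrt(obs) ~ sqrt(hres) + sqrt(ctr) + pert_mean | pert_sd,
  data = train,
  dist = "logistic",
  left = 0                    # censoring at zero accumulation
)
# Predictive median of square-root precipitation at new covariate values:
predict(fit, newdata = test, type = "quantile", at = 0.5)
```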
5.3 Choice of partial order for IDR
IDR applies readily in this setting, without any need for adaptations to the mixed discrete-continuous character of precipitation accumulations, and without requiring data transformations or other types of implementation decisions. However, the partial order on the elements (27) of the covariate space, or on a suitable derived space, needs to be selected thoughtfully, considering that the perturbed members are exchangeable.
The second implementation uses the same variables and partial order, but combines them with a simple subagging approach: before applying IDR, the training data are split into the two disjoint subsamples of training observations with odd and even indices, and we average the predictions based on these two subsamples.
Henceforth, we refer to the three implementations based on the partial orders in (28) and (29) as IDRcw, IDRsbg and IDRicx. With reference to the discussion preceding Theorem 1, the relations (28) and (29) define preorders on the covariate space and partial orders on the respective sets of equivalence classes.
We have experimented with other options as well, for example, by incorporating the maximum of the perturbed members in the componentwise order in (28), with the motivation that the maximum might serve as a proxy for the spread of the ensemble, or by using the empirical stochastic order in lieu of the empirical increasing convex order in (29). IDR is robust to changes of this type, and the predictive performance remains stable, provided that the partial order honours the key substantive insights, in that the perturbed members are exchangeable, while the high-resolution forecast, due to its higher native resolution, is able to capture local information that is contained in neither the control forecast nor the perturbed members. Hence, the high-resolution forecast ought to play a pivotal role in the partial order.
5.4 Selection of training periods
The selection of the training period is a crucial step in the statistical post-processing of NWP output. Most post-processing methods, including the ones used in this analysis, assume that there is a stationary relationship between the forecasts and the observations. As Hamill (2018) points out, this assumption is hardly ever satisfied in practice: NWP models are updated, instruments at observation stations get replaced, and forecast biases may vary seasonally. These problems are exacerbated by the fact that quantitative precipitation forecasts require large training datasets in order to include sufficient numbers of days with non-zero precipitation and extreme precipitation events.
For BMA and EMOS, a training period over a rolling window of the latest available 720 days at the time of forecasting is (close to) optimal at all stations. This resembles choices made by Scheuerer & Hamill (2015), who used a training sample of about 900 past instances. Scheuerer (2014) used shorter temporal windows, but merged instances from nearby stations into the training sets, which is not possible here. In general, it would be preferable to select training data seasonally (e.g. data from the same month), but in our case the positive effect of using seasonal training data does not outweigh the negative effect of a smaller sample size.
As a non-parametric technique, IDR requires larger sets of training data than BMA or EMOS. As training data for IDR, we used all data available at the time of forecasting, which is about 2500 to 3000 days for the stations Frankfurt, Brussels and Zurich, and 1500 days for London Heathrow. The same training periods are also used for HCLR, where no positive effect of shorter, rolling training periods has been observed (Messner et al., 2014).
For evaluation, we use the years 2015 and 2016 (and 1 January 2017) for all post-processing methods and the raw ensemble. This test dataset consists of roughly 700 instances for each station and lead time.
5.5 Results
Before comparing the BMA, EMOS, HCLR, IDRcw, IDRsbg and IDRicx techniques in terms of out-of-sample predictive performance over the test period, we exemplify them in Figure 4, where we show predictive CDFs for accumulated precipitation at Brussels on 16 December 2015, at a prediction horizon of 2 days. In panel (a), the marks at the bottom correspond to the high-resolution forecast, the control forecast, the perturbed members and their mean. The observation at 4 mm is indicated by the vertical line. Under all of these techniques, the point mass at zero, which represents the probability of no precipitation, is vanishingly small. While the BMA, EMOS and HCLR CDFs are smooth and supported on the positive half-axis, the IDRcw, IDRsbg and IDRicx CDFs are piecewise constant with jump points at observed values in the training period. Panel (b) illustrates the hard and soft constraints on the IDRcw CDF that arise from (20) under the order relation (28), with the thinner lines representing the CDFs of direct successors and predecessors. In this example, the constraints are mostly hard, except for threshold values between 4 and 11 mm.
Figure 4: Distributional forecasts for accumulated precipitation at Brussels, valid 16 December 2015 at a prediction horizon of 2 days. (a) BMA, EMOS, IDRcw, IDRsbg and IDRicx predictive CDFs. The vertical line represents the observation. (b) IDRcw CDF along with the hard and soft constraints in (20) as induced by the order relation (28). The thin lines show the IDRcw CDFs at direct predecessors and successors.
We now use the mean CRPS over the test period as an overall measure of out-of-sample predictive performance. Figure 5 shows the mean CRPS of the raw and post-processed forecasts for all stations and lead times, with the raw forecast denoted ENS. While HCLR performs best in terms of the CRPS, the IDR variants attain scores of similar magnitude and outperform BMA in many instances. Figure S1 in Supplementary Section S5 shows the difference between the empirical cumulative distribution function (ECDF) of the PIT, defined at (8), and the bisector for the distributional forecasts. All three IDR variants show a PIT distribution close to uniform, and so do BMA, EMOS and HCLR, as opposed to the raw ensemble, which is underdispersed.
Figure 5: Mean CRPS over the test period for raw and post-processed ensemble forecasts of 24 h accumulated precipitation at prediction horizons of 1, 2, 3, 4 and 5 days. The lowest mean score for a given lead time and station is highlighted.
In Figure 6, we evaluate probability of precipitation forecasts by means of the Brier score (Gneiting & Raftery, 2007), and Figure S2 in Supplementary Section S5 shows reliability diagrams (Dimitriadis et al., 2021; Wilks, 2019). As opposed to the raw ensemble forecast, all distributional regression methods yield reliable probability forecasts. BMA, IDRcw, IDRsbg and IDRicx separate the estimation of the point mass at zero from that of the distribution for positive accumulations, and these four methods perform ahead of EMOS. HCLR is outperformed by BMA and the IDR variants at lead times of one or two days, but achieves a lower Brier score at the longest lead time of 5 days.
Figure 6. Mean Brier score over the test period for probability of precipitation forecasts at prediction horizons of 1, 2, 3, 4 and 5 days. The lowest mean score for a given lead time and station is indicated in green.
Interestingly, IDR tends to outperform EMOS and HCLR for probability of precipitation forecasts, but not for precipitation accumulations. We attribute this to the fact that parametric techniques are capable of extrapolating beyond the range of the training responses, whereas IDR is not: The highest precipitation amount judged feasible by IDR equals the largest observation in the training set. Furthermore, unlike EMOS and HCLR, IDR does not use information about the spread of the raw ensemble, which is inconsequential for probability of precipitation forecasts, but may impede distributional forecasts of precipitation accumulations.
In all comparisons, the forecast performance of IDRcw and IDRsbg is similar. However, in our implementation, the simple subagging method used in IDRsbg reduced the computation time by up to one half.
To summarize, our results underscore the suitability of IDR as a benchmark technique in probabilistic forecasting problems. Despite being generic and fully automated, IDR is remarkably competitive with state of the art techniques that have been developed specifically for the purpose. In fact, in the wide range of applied problems that lack sophisticated, custom-made distributional regression solutions, IDR might well serve as a ready-to-use, top-performing method of choice.
6 Discussion
Stigler (1975) gives a lucid historical account of the 19th century transition from point estimation to distribution estimation. In regression analysis, we may be witnessing what future generations might refer to as the transition from conditional mean estimation to conditional distribution estimation, accompanied by a simultaneous transition from point forecasts to distributional forecasts (Gneiting & Katzfuss, 2014).
IDR is a non-parametric technique for estimating conditional distributions that takes advantage of partial order relations within the covariate space. It can be viewed as a far-reaching generalization of classical approaches to isotonic (non-distributional) regression based on the pool adjacent violators (PAV) algorithm, is entirely generic and fully automated, and provides for a unified treatment of continuous, discrete and mixed discrete-continuous real-valued response variables. Code for the implementation of IDR within R (R Core Team, 2020) and Python (https://www.python.org/) is available via the isodistrreg package at CRAN (https://CRAN.R-project.org/package=isodistrreg) and on GitHub (https://github.com/AlexanderHenzi/isodistrreg; https://github.com/evwalz/isodisreg), with user-friendly functions for partial orders, estimation, prediction and evaluation.
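As an illustration of the interface, a minimal sketch with the isodistrreg R package follows; it uses simulated data with a single covariate, and the calls to idr, predict and crps reflect our reading of the package documentation rather than code shipped with the paper, so argument names should be checked against the current release.

```r
library(isodistrreg)

## Minimal usage sketch on simulated data with a single covariate.
## Function and argument names follow our reading of the isodistrreg
## documentation; check against the current release.
set.seed(1)
n <- 1000
x <- rgamma(n, shape = 2)                        # covariate, e.g. an ensemble mean
y <- rgamma(n, shape = 2, scale = 1 + 0.5 * x)   # response, stochastically increasing in x

fit  <- idr(y = y[1:800], X = data.frame(x = x[1:800]))    # fit on training set
pred <- predict(fit, data = data.frame(x = x[801:1000]))   # predictive CDFs
mean(crps(pred, y[801:1000]))                              # out-of-sample mean CRPS
```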
IDR relies on information supplied by order constraints, and the choice of the partial order on the covariate space is a critical decision prior to the analysis. Only variables that contribute to the partial order need to be retained, and the order constraints serve to regularize the IDR solution. Weak orders lead to larger numbers of comparable pairs of training instances and, thereby, to more regular predictive distributions. The choice of the partial order is typically guided and informed by substantive expertise, as illustrated in our case study, and it is a challenge for future research to investigate whether the selection of the partial order could be automated. Given that IDR gains information through order constraints, it is a valid concern whether the method is robust under misspecification of the partial order. There is evidence that this is indeed the case: IDR has guaranteed in-sample threshold calibration (Theorem 2) and therefore satisfies a minimal requirement for reliable probabilistic forecasts under any, even misspecified, partial order. Moreover, El Barmi & Mukerjee (2005, Theorem 7) show that in the special case of a discrete, totally ordered covariate, isotonic regression asymptotically has smaller estimation error than non-isotonic alternatives even under mild violations of the monotonicity assumptions, akin to the performance of IDR in the non-isotonic setting (25) in our simulation study.
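The effect of the order on the number of comparable pairs is easy to quantify. The following sketch, on simulated independent Gaussian covariates, reports the share of training pairs that are comparable under the componentwise partial order as the dimension grows, illustrating why weaker orders yield more regular fits; the data are synthetic and purely illustrative.

```r
## Share of pairs comparable under the componentwise partial order:
## x <= x' iff every coordinate of x is <= the corresponding one of x'.
## With independent Gaussian covariates the share decays like 2^(1 - d).
comparable_share <- function(X) {
  n <- nrow(X)
  cnt <- 0
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    dif <- X[i, ] - X[j, ]
    if (all(dif <= 0) || all(dif >= 0)) cnt <- cnt + 1
  }
  cnt / choose(n, 2)
}

set.seed(1)
n <- 200
for (d in c(1, 2, 5)) {
  X <- matrix(rnorm(n * d), nrow = n)
  cat("d =", d, " share comparable:", round(comparable_share(X), 3), "\n")
}
```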
Unlike other methods for distributional regression, which require users to make implementation decisions such as the specification of parametric distributions, link functions, estimation procedures and convergence criteria, IDR is fully automatic once the partial order and the training set have been identified. In this light, we recommend that IDR be used as a benchmark technique in distributional regression and probabilistic forecasting problems. With both computational efficiency and the avoidance of overfitting in mind, IDR can be combined with subsample aggregation (subagging) in the spirit of random forests, as sketched below. In our case study on quantitative precipitation forecasts, we used simplistic ad hoc choices for the size and number of subsamples. Future research on computationally efficient algorithmic implementations of IDR as well as optimal and automated choices of subsampling settings is highly desirable.
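To make the subagging idea concrete, here is a minimal sketch under the same assumed interface as above: IDR is fitted on b random subsamples of size m, and the resulting predictive CDFs are averaged on a grid of thresholds. The cdf accessor is assumed to follow the package documentation, and the defaults b = 10 and m = n/2 are illustrative choices, not the settings of our case study.

```r
library(isodistrreg)

## Sketch of IDR with subsample aggregation (subagging): fit IDR on b
## random subsamples of size m and average the predictive CDFs on a
## threshold grid. The cdf() accessor is assumed from the package
## documentation; b and m are ad hoc illustrative defaults.
subag_idr <- function(y, X, newdata, grid, b = 10,
                      m = floor(length(y) / 2)) {
  avg <- matrix(0, nrow = nrow(newdata), ncol = length(grid))
  for (k in seq_len(b)) {
    idx <- sample(length(y), m)            # subsample without replacement
    fit <- idr(y = y[idx], X = X[idx, , drop = FALSE])
    avg <- avg + cdf(predict(fit, data = newdata), thresholds = grid)
  }
  avg / b                                  # a pointwise average of CDFs is again a CDF
}
```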
A limitation of IDR in its present form is that we only consider the usual stochastic order on the space of the conditional distributions. Hence, IDR is unable to distinguish situations where the conditional distributions agree in location but differ in spread, shape or other regards. This restriction is of limited concern for response variables such as precipitation accumulation or income, which are bounded below and right skewed, but may impact the application of IDR to variables with symmetric distributions. In this light, we encourage future work on ramifications of IDR in which the space of conditional distributions is equipped with partial orders other than the stochastic order, including but not limited to the likelihood ratio order (Mösching & Dümbgen, 2020a). Similarly, the ‘spiking’ problem of traditional isotonic regression, which refers to unwarranted jumps of estimates at boundaries, arguably did not have adverse effects in our simulation and case studies. However, it might be of concern in other applications, where remedies of the type proposed by Wu et al. (2015) might yield improvement and warrant study.
Another promising direction for further research concerns generalizations of IDR to multivariate response variables. In weather prediction, this would allow simultaneous post-processing of forecasts for several variables, and an open question is which notions of multivariate stochastic dominance allow for efficient estimation in such settings.
Supporting Information
Additional supporting information may be found online in the Supporting Information section.
With Intel(R) Xeon(R) E5-2630 v4 2.20GHz CPUs, in R (R Core Team, 2020), using the doParallel package for parallelization. Times reported are averages over 100 replicates.
Acknowledgements
The authors are grateful to the editor, the associate editor, four anonymous referees, Sebastian Arnold, Lutz Dümbgen, Ed George, Stephan Hemri, Alexander Jordan, Kristof Kraus, Alexandre Mösching, Anja Mühlemann, Aaditya Ramdas, Nicholas Reich, Benedikt Schulz, Peter Vogel, Jonas Wallin and Eva-Maria Walz for providing code, assistance with the data handling, comments and discussion. Tilmann Gneiting acknowledges support by the Klaus Tschira Foundation, by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 257899354 – TRR 165, and via the Fellowship programme operated by the European Centre for Medium-Range Weather Forecasts (ECMWF). Alexander Henzi and Johanna F. Ziegel gratefully acknowledge financial support from the Swiss National Science Foundation. Calculations were mostly performed on UBELIX (http://www.id.unibe.ch/hpc), the HPC cluster at the University of Bern. Open Access Funding provided by Universitat Bern.