Abstract

Extremal graphical models are sparse statistical models for multivariate extreme events. The underlying graph encodes conditional independencies and enables a visual interpretation of the complex extremal dependence structure. For the important case of tree models, we develop a data-driven methodology for learning the graphical structure. We show that sample versions of the extremal correlation and a new summary statistic, which we call the extremal variogram, can be used as weights for a minimum spanning tree to consistently recover the true underlying tree. Remarkably, this implies that extremal tree models can be learned in a completely non-parametric fashion by using simple summary statistics and without the need to assume discrete distributions, existence of densities or parametric models for bivariate distributions.

1 INTRODUCTION

Extreme value theory provides essential statistical tools to quantify the risk of rare events such as floods, heatwaves or financial crises (e.g. Engelke et al., 2019; Katz et al., 2002; Poon et al., 2004). The univariate case is well understood, and the generalised extreme value and Pareto distributions describe the distributional tail with only a few parameters. In dimension d ≥ 2, the dependence between large values in the different components of a random vector X = (X_1, …, X_d) can become very complex. Estimating this dependence in higher dimensions is particularly challenging because the number k of extreme observations is by definition much smaller than the total number n of samples in a data set. Constructing sparse models for the multivariate dependence between marginal extremes is therefore crucial for obtaining tractable and reliable methods in multivariate extremes; see Engelke and Ivanovs (2021) for a review of recent advances.

One line of research aims at exploiting conditional independence structures (Dawid, 1979) and corresponding graphical models. In the setting of max-stable distributions, which arise as limits of component-wise block maxima of independent copies of X, Gissibl and Klüppelberg (2018) and Klüppelberg and Lauritzen (2019) study max-linear models on directed acyclic graphs. The distributions considered there do not have densities, and a general result by Papastathopoulos and Strokorb (2016) shows that there exists no non-trivial density factorisation of max-stable distributions on graphical structures.

A different perspective on multivariate extremes is given by threshold exceedances and the resulting class of multivariate Pareto distributions. Such distributions are the only possible limits that can arise from the conditional distribution of X given that it exceeds a high threshold (Rootzén et al., 2018; Rootzén & Tajvidi, 2006). For a d-dimensional random vector Y that follows a multivariate Pareto distribution, Engelke and Hitz (2020) introduce suitable notions of conditional independence and extremal graphical models with respect to a graph G. They further show that these notions are natural as they imply the factorisation of the density of Y through a Hammersley–Clifford type theorem. Extremal graphical models are also related to limits of regularly varying Markov trees studied in Segers (2020) and Asenova et al. (2021).

In most of the above work, the graphical structure G is assumed to be known a priori. It is either based on expert knowledge in the domain of application or it might be identified with an existing graph, as for instance a river network for discharge measurements. However, often no or insufficient domain knowledge on a prior candidate for a graphical structure is available, and a data-driven approach should be followed in order to detect conditional independence relations and to estimate a sensible graph structure. In this work we discuss structural learning for extreme observations.

An important sub-class of general graphs for which structure learning for extremes turns out to be possible in great generality is given by trees. A tree T = (V, E) with nodes V and edge set E is a connected undirected graph without cycles. Most structure learning approaches for trees are based on the notion of the minimum spanning tree. For a set of symmetric weights ρ_ij > 0 associated with any pair of nodes i, j ∈ V, i ≠ j, the latter is defined as the tree structure

T_mst = argmin_{T′ = (V, E′)} ∑_{{i,j} ∈ E′} ρ_ij    (1)

that minimises the sum of distances on that tree. Given the set of weights, there exist greedy algorithms that constructively solve this minimisation problem (Kruskal, 1956; Prim, 1957).
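Given the weights, Kruskal's greedy algorithm solves (1) by scanning the edges in increasing order of weight and accepting an edge whenever it joins two components that are not yet connected. The following minimal sketch (our own illustration, not code from the paper) implements this with a union-find structure:

```python
def minimum_spanning_tree(weights):
    """Kruskal's algorithm. `weights` maps frozenset({i, j}) -> rho_ij."""
    parent = {}

    def find(i):
        # Follow parent pointers to the component representative.
        while parent.setdefault(i, i) != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    tree = set()
    for edge in sorted(weights, key=weights.get):   # ascending rho_ij
        i, j = edge
        ri, rj = find(i), find(j)
        if ri != rj:                # the edge joins two distinct components
            parent[ri] = rj
            tree.add(edge)
    return tree
```

Prim's algorithm is an equally valid greedy choice; both return a spanning tree attaining the minimum in (1).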

The crucial ingredient for these algorithms is the set of weights ρ_ij between the nodes, and for statistical inference it is generally desirable to choose them in such a way that T_mst recovers the true underlying tree structure that represents the conditional independence relations. A common approach in graphical modelling is to use the Chow–Liu tree (Chow & Liu, 1968), which is the conditional independence structure that maximises the likelihood for a given parametric model (cf. Cowell et al., 2006, chapter 11). This method uses the negative mutual information as edge weights ρ_ij in (1), and in general this requires formulating parametric models for the bivariate marginal distributions. In the Gaussian case the weights simplify to ρ_ij = log(1 − r_ij²)/2, where r_ij are the correlation coefficients (cf. Drton & Maathuis, 2017).

For a multivariate Pareto distribution Y that is an extremal graphical model on a tree T, Engelke and Hitz (2020) proposed to use the negative maximised bivariate log-likelihoods as edge weights. This approach has two disadvantages. First, in higher dimensions d it may become prohibitively costly to compute the d² censored likelihood optimisations, and second, a set of parametric bivariate models has to be chosen in advance.

In this paper we study structure learning for extremal tree models in much greater generality. We show that a function of the extremal correlation χ_ij, a widely used coefficient to summarise the strength of extremal dependence between margins i, j ∈ V (e.g., Coles et al., 1999), can be used as weights ρ_ij in (1) to retrieve the underlying tree structure T as the minimum spanning tree under mild non-parametric assumptions. We further introduce a new summary coefficient for extremal dependence, the extremal variogram Γ_ij, which turns out to take a similar role in multivariate extremes as covariances in Gaussian models. More precisely, the extremal variogram of Y is shown to be an additive tree metric on the tree T and, as a consequence, it can likewise be used as weights ρ_ij of the minimum spanning tree to recover the true tree structure. Surprisingly, these results are stronger than for non-extremal tree structures, since we do not require any further parametric assumptions or the existence of densities. This phenomenon originates from the homogeneity of multivariate Pareto distributions and particularly nice stochastic representations of extremal tree models.

In practice, we usually observe n samples of X in the domain of attraction of Y, that is, the conditional distribution of X given that X exceeds a high threshold converges to the distribution of Y after proper scaling; see Section 2.1 for a formal definition. We then rely on estimators of the quantities χ_ij and Γ_ij to plug into (1). To take into account that X is only in the domain of attraction of Y, typical estimators in extreme value theory use only the most extreme observations. We use an existing estimator χ̂_ij of the extremal correlation and a new empirical estimator of the extremal variogram to show that the extremal tree structure can be estimated consistently in a non-parametric way, even when the dimension increases with the sample size.

The remaining paper is organised as follows. In Section 2 we revisit the notion of extremal graphical models and extend existing representations to the case where densities may not exist. The extremal variogram is introduced in Section 3 and its properties are discussed in detail. In Section 4 we prove the main results on the consistent recovery of extremal tree structures based on extremal correlations and extremal variograms, both on the population level and using empirical estimates. The simulation studies in Section 5 illustrate the finite sample behaviour of our structure estimators and show that extremal variogram based methods typically outperform methods working with the extremal correlation. We apply the new tools in Section 6 to a financial data set of foreign exchange rates. The Appendix and the Supplementary Material S1 contain the proofs and additional illustrations. The methods of this paper are implemented in the R package graphicalExtremes (Engelke et al., 2019).

2 EXTREMAL GRAPHICAL MODELS

2.1 Multivariate Pareto distributions

Let X = (X_i)_{i∈V} be a random vector with eventually continuous marginal distribution functions F_i. Extreme value theory studies marginal and joint tail properties of X. Univariate extreme value theory focuses on the behaviour of marginal components X_i, see for example Embrechts et al. (1997) and Coles et al. (1999). Multivariate extreme value theory is concerned with the dependence structure among different components of extreme observations from X; see Resnick (2008, chapter 5); de Haan and Ferreira (2006); Beirlant et al. (2004) or Engelke and Ivanovs (2021) for an introduction.

One way to describe such tail properties is based on threshold exceedances; here only observations that land above a high threshold are considered. Multivariate Pareto distributions arise as the limits of such high threshold exceedances and are thus natural models for extreme events (Rootzén & Tajvidi, 2006). To formally define threshold exceedances in dimension d > 1, we need to specify the notion of a high threshold in a multivariate setting. Throughout the paper, we consider multivariate exceedances of the random vector X as those realisations where at least one component of X exceeds a high marginal quantile. In order to guarantee the existence of the limit of the exceedance distribution, a regularity condition called multivariate regular variation (Resnick, 2008, Chapter 5) is needed. Intuitively, this assumption ensures that the dependence between different components of this conditional distribution stabilises if the marginal quantile is sufficiently large. More formally, this means that there exists a random vector Y supported on ℒ = {x ∈ [0, ∞)^d : ‖x‖_∞ > 1} such that for all continuity points x of the distribution function of Y we have

lim_{q→0} ℙ( q/(1 − F(X)) ≤ x | F(X) ≰ (1 − q)·1 ) = ℙ(Y ≤ x),    (2)

where we define F(x) = (F_1(x_1), …, F_d(x_d)). Note that the condition {F(X) ≰ (1 − q)·1} states that at least one component of X exceeds its marginal (1 − q)-quantile, explaining the terminology of threshold exceedances. Distributions of random vectors Y that can arise in the above limit are called multivariate Pareto distributions. We say that the random vector X is in the max-domain of attraction of the multivariate Pareto distribution Y = (Y_i)_{i∈V}.

The class of multivariate Pareto distributions is very general and contains many different parametric sub-families. Nevertheless, since the random vector Y arises as a limiting distribution, it has an important structural property called homogeneity:

ℙ(Y ∈ tA) = t^{-1} ℙ(Y ∈ A),  t ≥ 1,    (3)

where for any Borel subset A ⊆ ℒ we define tA = {tx : x ∈ A}. This explains the name multivariate Pareto distribution since it implies that for any i ∈ V we have ℙ(Y_i ≤ x | Y_i > 1) = 1 − 1/x for x ≥ 1, that is, Y_i | Y_i > 1 follows a standard Pareto distribution. Moreover, since F(X) has identically distributed margins, it follows from (2) that ℙ(Y_1 > 1) = ⋯ = ℙ(Y_d > 1). Conversely, if the latter holds and Y is homogeneous as in (3), then Y is a multivariate Pareto distribution; for a proof of this equivalence and the relationship to the limits appearing in Segers (2020) see Section S.5 of Supplementary Material S1.
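The standard Pareto margins follow from homogeneity by a one-line calculation: taking A = {y ∈ ℒ : y_i > 1}, so that xA = {y ∈ ℒ : y_i > x} for x ≥ 1, relation (3) gives

```latex
\mathbb{P}(Y_i > x \mid Y_i > 1)
  = \frac{\mathbb{P}(Y \in xA)}{\mathbb{P}(Y \in A)}
  = \frac{x^{-1}\,\mathbb{P}(Y \in A)}{\mathbb{P}(Y \in A)}
  = \frac{1}{x}, \qquad x \ge 1,
```

which is the survival function of the standard Pareto distribution.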

2.2 Extremal Markov structures

Since the support of multivariate Pareto distributions is not a product space, the definition of conditional independence is non-standard and relies on auxiliary random vectors derived from Y. For any m ∈ V, we consider the random vector Y^m defined as Y conditioned on the event {Y_m > 1}, which has support on the space ℒ^m = {x ∈ ℒ : x_m > 1}. For general random vectors X ∈ ℝ^d and ordered sets A ⊆ {1, …, d}, let X_A denote the sub-vector of X with indices in A. The notation Y^m_A will be used to denote the sub-vector of Y^m with indices in A.

With this notation we can state a definition of conditional independence for multivariate Pareto distributions that is more general than the one in Engelke and Hitz (2020), since we do not assume existence of densities.

Definition 1

For disjoint subsets A, B, C ⊆ V = {1, …, d}, we say that Y_A is conditionally independent of Y_C given Y_B if

Y^m_A ⊥⊥ Y^m_C | Y^m_B  for all m ∈ V.    (4)

In this case we write Y_A ⊥_e Y_C | Y_B.

The subscript e in ⊥_e indicates that this conditional independence notion is defined for extreme observations, which are described by the multivariate Pareto distribution Y according to (2).

We view the index set V as the set of nodes of a graph G = (V, E), with connections given by a set of edges E ⊆ V × V of pairs of distinct nodes. The graph is called undirected if for two nodes i, j ∈ V, (i, j) ∈ E if and only if (j, i) ∈ E. For notational convenience, for undirected graphs we sometimes represent edges as unordered pairs {i, j} ∈ E. When counting the number of edges, we count {i, j} ∈ E such that each edge is considered only once. For disjoint subsets A, B, C ⊆ V, B is said to separate A and C in G if every path from A to C contains at least one node in B. For an illustration of these definitions see Section S.1 in Supplementary Material S1.

The notion of an extremal graphical model is then naturally defined as a multivariate Pareto distribution that satisfies the global Markov property on the graph G with respect to the conditional independence relation ⊥_e, that is, for any disjoint subsets A, B, C ⊆ V such that B separates A from C in G,

Y_A ⊥_e Y_C | Y_B.    (5)

In line with the definition in the graphical models literature (Lauritzen, 1996, chapter 3), the definition allows for additional conditional independence relations that are not encoded by graph separation. This means that there are typically several graphs G that are consistent with the distribution of Y; for instance, any multivariate Pareto distribution is an extremal graphical model on the fully connected graph.

In the case of a decomposable graph G and if Y possesses a positive and continuous density fY, Engelke and Hitz (2020) show that this density factorises into lower-dimensional densities, and that the graph G is necessarily connected. If Y does not have a density, then the extremal graph can be disconnected and the connected components are mutually independent of each other (Engelke & Hitz, 2020, see Kirstin Strokorb's discussion contribution). Note that we require the global Markov property in the definition of extremal graphical models as opposed to the pairwise Markov property used in Engelke and Hitz (2020). Both properties are equivalent in the case of positive, continuous densities, but in general, the former implies the latter but not the other way around (see Lauritzen, 1996, chapter 3).

2.3 Extremal tree models

An important example of a sparse graph structure is a tree. A tree T = (V, E) is a connected undirected graph without cycles and thus |E| = |V| − 1. Equivalently, a tree is a graph with a unique path between any two nodes. If Y is an extremal graphical model satisfying the global Markov property (5) with respect to a tree T, we obtain a simple stochastic representation of Y^m. This stochastic representation will be the crucial building block for the results on tree learning given in the next section.

To this end we need to introduce the concept of extremal functions. Define the extremal function relative to coordinate m as the d-dimensional, non-negative random vector W^m with W^m_m = 1 almost surely that satisfies the stochastic representation

Y^m =_d P · W^m,    (6)

where P is a standard Pareto random variable, ℙ(P ≤ x) = 1 − 1/x, x ≥ 1, which is independent of W^m, and =_d stands for equality in distribution. Such a representation is possible by the homogeneity (3) of Y, which is inherited by Y^m. Indeed, given homogeneity of Y^m we see that Y^m_m follows a standard Pareto distribution. Moreover, writing W^m := Y^m / Y^m_m, homogeneity of Y^m and a simple calculation imply that W^m and Y^m_m are independent, resulting in the representation (6).

The representation (6) is an alternative way of describing the distribution of Y, and indeed, the set of the d extremal functions W^1, …, W^d uniquely determines the multivariate Pareto distribution. We refer to Dombry et al. (2013, 2016) for additional technical background on extremal functions.

Example 1

In the case d = 2, due to homogeneity, the bivariate Pareto distribution Y = (Y_1, Y_2) can essentially be characterised by a univariate distribution. Indeed, for any non-negative random variable W^1_2 with 𝔼[W^1_2] ≤ 1, the random vector W^1 = (1, W^1_2) is the extremal function relative to the first coordinate of a unique bivariate Pareto distribution Y. The extremal function relative to the second coordinate, W^2 = (W^2_1, 1), is obtained through the change of measure

ℙ(W^2_1 ∈ A) = 𝔼[ W^1_2 𝟙{1/W^1_2 ∈ A} ],  A ⊆ (0, ∞) Borel,    (7)

which implies that 𝔼(W^1_2) = 1 − ℙ(W^2_1 = 0) ≤ 1.

An elementary proof of (7) can be found in Section S.7 of Supplementary Material S1.
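The change of measure in (7) is easy to check numerically for a discrete angular law. The sketch below (illustrative numbers of our own, not from the paper) builds the law of the first coordinate of W^2 from a given law of the second coordinate of W^1 by mapping each atom w > 0 to the atom 1/w with reweighted probability, placing the missing mass at zero, and then verifies the duality stated in Example 1:

```python
# Toy discrete law for the second coordinate of W^1 (hypothetical numbers;
# any non-negative law with expectation <= 1 defines a valid bivariate model).
w12_law = {0.0: 0.5, 1.0: 0.25, 2.0: 0.25}           # value -> probability
mean_w12 = sum(w * p for w, p in w12_law.items())     # = 0.75 <= 1

# Change of measure (7): on (0, infinity) each atom w is sent to 1/w with
# probability w * P(atom w); the missing mass sits at zero.
w21_law = {1.0 / w: w * p for w, p in w12_law.items() if w > 0}
w21_law[0.0] = 1.0 - mean_w12

mean_w21 = sum(w * p for w, p in w21_law.items())
# Duality from Example 1: each mean equals one minus the other's mass at zero.
print(mean_w12, 1.0 - w21_law[0.0])   # both print 0.75
```

Applying the same reweighting to `w21_law` recovers `w12_law`, reflecting that the two extremal functions determine each other.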

We now proceed to a stochastic representation for Y^m that involves only univariate random variables. Define a new, directed tree T^m = (V, E^m) rooted at an arbitrary but fixed node m ∈ V. The edge set E^m consists of all edges e ∈ E of the tree T pointing away from node m. For the resulting directed tree we define a set {W_e : e ∈ E^m} of independent random variables, where for e = (i, j), the random variable W_e = W^i_j has the distribution of the jth coordinate of the extremal function of Y relative to coordinate i.

The following result generalises proposition 2 in Engelke and Hitz (2020) to extremal tree models with arbitrary edge distributions.

Proposition 1

Let Y be a multivariate Pareto distribution that is an extremal graphical model on the tree T = (V, E). Let P be a standard Pareto random variable, independent of {W_e : e ∈ E^m}. Then we have the joint stochastic representation for Y^m on ℒ^m

Y^m =_d P · ( ∏_{e ∈ ph(mi; T^m)} W_e )_{i ∈ V},    (8)

where ph(mi; T^m) denotes the set of edges on the unique path from node m to node i on the tree T^m, with the empty product for i = m equal to 1; see Figure 1 for an example with m = 2.

Conversely, for any set of independent random variables {W^i_j, W^j_i ; {i, j} ∈ E}, where W^i_j and W^j_i satisfy the duality (7), the construction (8) defines a consistent family of extremal functions W^1, …, W^d that corresponds to a unique d-dimensional Pareto distribution Y that is an extremal graphical model on T.

FIGURE 1

A tree T^2 rooted at node m = 2 with the extremal functions on the edges

The above result formally establishes the link of the conditional independence in Definition 1 to the limiting tail trees in Segers (2020); see also Proposition 1 in Supplementary Material S1 for details on this link. In this sense, the first part of Proposition 1 can be deduced from theorem 1 in Segers (2020).

Note that Y defined as in Proposition 1 can also be an extremal graphical model on a disconnected graph G; see the paragraph after (5). The representation (8) then remains true and some of the W_e are almost surely equal to zero.
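Representation (8) can be used directly for simulation. The sketch below is a hypothetical example of our own, assuming log-normal edge variables with unit mean (Hüsler–Reiss-type edges): it simulates Y^1 on the chain 1–2–3 and checks empirically that the variances of log-differences add up along the path, anticipating the tree metric property of Section 4.

```python
import math
import random
import statistics

random.seed(7)
s12, s23 = 0.8, 1.2    # hypothetical edge standard deviations

def sample_Y1():
    # Representation (8) on the chain 1 - 2 - 3 rooted at m = 1:
    # Y^1 = P * (1, W_a, W_a * W_b) with independent edge variables.
    p = 1.0 / (1.0 - random.random())                 # standard Pareto
    w_a = math.exp(random.gauss(-s12**2 / 2, s12))    # unit-mean log-normal
    w_b = math.exp(random.gauss(-s23**2 / 2, s23))    # unit-mean log-normal
    return p, p * w_a, p * w_a * w_b

d12, d13 = [], []
for _ in range(200_000):
    y1, y2, y3 = sample_Y1()
    d12.append(math.log(y1) - math.log(y2))
    d13.append(math.log(y1) - math.log(y3))

g12 = statistics.variance(d12)   # close to s12**2
g13 = statistics.variance(d13)   # close to s12**2 + s23**2 (additivity)
```

The radial part P cancels in every log-difference, which is why only the edge variables contribute to these variances.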

Remark 1

It is remarkable that for an extremal tree model Y, the distribution of its extremal functions, and therefore also of the multivariate Pareto distribution itself, is characterised by the set of univariate random variables {W^i_j, W^j_i ; {i, j} ∈ E}. This indicates that the probabilistic structure is simpler than in the non-extremal case, where in general both univariate and bivariate distributions are needed to describe a tree graphical model.

3 THE EXTREMAL VARIOGRAM

Covariance matrices play a central role in structure learning for Gaussian graphical models due to their connection to conditional independence properties. In multivariate extreme value theory, several summary statistics have been developed to measure the strength of dependence between the extremes of different variables. The most popular one is the extremal correlation, which for i, j ∈ V is defined as

χ_ij = lim_{q→0} ℙ( F_i(X_i) > 1 − q | F_j(X_j) > 1 − q ),    (9)

whenever the limit exists. It ranges between 0 and 1, where the boundary cases correspond to asymptotic independence and complete extremal dependence, respectively (cf. Coles et al., 1999; Schlather & Tawn, 2003). In particular, if X is in the max-domain of attraction of the multivariate Pareto distribution Y, then the extremal correlation always exists and χ_ij = ℙ(Y_i > 1 | Y_j > 1). There are many other coefficients for extremal dependence in the literature, including the madogram (Cooley et al., 2006) and a coefficient defined on the spectral measure introduced in Larsson and Resnick (2012) and used for dimension reduction in Cooley and Thibaud (2019) and Fomichov and Ivanovs (2022).
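A simple sample version of (9) is obtained by replacing F_i, F_j with empirical distribution functions and q with k/n for a number k of extreme observations. The following sketch is a standard rank-based empirical counterpart of the extremal correlation, not the paper's exact estimator:

```python
def chi_hat(x, y, k):
    """Rank-based empirical extremal correlation at level q = k/n: the
    fraction of the k largest y-values whose paired x-value is also
    among the k largest x-values."""
    n = len(x)
    rank_x = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: x[i]))}
    rank_y = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: y[i]))}
    top_x = {i for i in range(n) if rank_x[i] >= n - k}
    top_y = {i for i in range(n) if rank_y[i] >= n - k}
    return len(top_x & top_y) / k
```

Perfectly comonotone samples give the value 1, and samples whose large values never co-occur give the value 0, matching the two boundary cases above.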

While designed as summaries for extremal dependence, none of these coefficients has an obvious relation to conditional independence for multivariate Pareto distributions or density factorisation in extremal graphical models of Engelke and Hitz (2020). In this section we define a new coefficient that will turn out to take a similar role in multivariate extremes as covariances in non-extremal models.

3.1 Limiting extremal variogram

The variogram is a well-known object in geostatistics that measures the degree of spatial dependence of a random field (cf. Chilès & Delfiner, 2012; Wackernagel, 2013). It is similar to a covariance function, but instead of being positive definite, a variogram is conditionally negative definite; for details, see for instance Engelke and Hitz (2020, Appendix B). For Brown–Resnick processes, the seminal work of Kabluchko et al. (2009) has shown that negative definite functions play a crucial role in spatial extreme value theory. We define a variogram for general multivariate Pareto distributions.

Definition 2

For a multivariate Pareto distribution Y we define the extremal variogram rooted at node m ∈ V as the matrix Γ^{(m)} with entries

Γ^{(m)}_{ij} = Var( log Y^m_i − log Y^m_j ),  i, j ∈ V,    (10)

whenever the right-hand side exists and is finite.

We can interpret Γ^{(m)}_{ij} as a distance between the variables Y^m_i and Y^m_j that is large if their extremal dependence is weak, and vice versa.

Proposition 2

Let Y be a multivariate Pareto distribution.

  • (i)
    For m ∈ V, we can express the extremal variogram in terms of the extremal function relative to coordinate m,

    Γ^{(m)}_{ij} = Var( log W^m_i − log W^m_j ),  i, j ∈ V.
  • (ii)

    For m ∈ V, the matrix Γ^{(m)} is a variogram matrix, that is, it is conditionally negative definite.

  • (iii)

    Let Y_n be a sequence of multivariate Pareto distributions with extremal correlations χ_{n,im} between the ith and mth coordinates of Y_n satisfying χ_{n,im} → 0 as n → ∞ for some i, m ∈ V. Then the corresponding extremal variograms satisfy Γ^{(m)}_{n,im} → ∞ as n → ∞.

Part (iii) of the above proposition underlines the interpretation of the extremal variogram. When the variables become asymptotically independent, the extremal variogram grows and eventually diverges to +∞. Note that the converse statement is not true in general, since there are cases where Γ^{(m)}_{im} = ∞ but χ_im > 0. We proceed with several examples where the extremal variogram can be computed explicitly. Figure 2 shows the extremal variogram values for these models as a function of the corresponding extremal correlation.

FIGURE 2

Values of the extremal variogram Γ^{(1)}_{12} as a function of the extremal correlation χ_12 for the bivariate Hüsler–Reiss (blue), symmetric Dirichlet (orange) and logistic (green) models. Note that in all three cases we have W^2_1 =_d W^1_2 and therefore Γ^{(1)}_{12} = Γ^{(2)}_{12}.

Example 2

The extremal logistic distribution with parameter θ ∈ (0, 1) can be defined through its extremal functions (see Dombry et al., 2016)

W^m = (U_i/U_m)_{i∈V},

where U_1, …, U_d are independent and U_i, i ≠ m, follow Fréchet(1/θ, G(1−θ)^{−1}) distributions, and (G(1−θ)U_m)^{−1/θ} follows a Gamma(1−θ, 1) distribution; here G(x) is the Gamma function evaluated at x > 0. It turns out that for the logistic model we have

Γ^{(m)}_{ij} = θ²{ψ^{(1)}(1) + ψ^{(1)}(1−θ)} for m ∈ {i, j}, i ≠ j,  and  Γ^{(m)}_{ij} = 2θ²ψ^{(1)}(1) for i, j ≠ m,

where ψ^{(1)} is the trigamma function defined as the second derivative of the logarithm of the gamma function.

The corresponding extremal correlations have the form χ_ij = 2 − 2^θ, i, j ∈ V, i ≠ j.

The proof of this representation of the extremal variogram in the logistic model can be found in Section S.8 in Supplementary Material S1.

Example 3

The extremal Dirichlet distribution with parameters α_1, …, α_d (cf. Coles & Tawn, 1991) has extremal functions

W^m = (U_i/U_m)_{i∈V},

where U_1, …, U_d are independent and U_i, i ≠ m, follow Gamma(α_i, 1/α_i) distributions, and U_m follows a Gamma(α_m + 1, 1/α_m) distribution. By straightforward calculations,

Γ^{(m)}_{ij} = ψ^{(1)}(α_i + 𝟙{i = m}) + ψ^{(1)}(α_j + 𝟙{j = m}),  i ≠ j,

with ψ^{(1)} denoting the trigamma function as in Example 2.

The corresponding extremal correlations do not have a closed form but can be calculated numerically.

For the class of Hüsler–Reiss distributions the extremal variogram turns out to be very natural.

Example 4

The Hüsler–Reiss distribution is parameterised by a variogram matrix Γ ∈ ℝ^{d×d}; see Engelke and Hitz (2020) for details. For any d-variate centred normal random vector U with variogram matrix Γ, the extremal function relative to coordinate m ∈ V has the representation

W^m_i = exp( U_i − U_m − Γ_{im}/2 ),  i ∈ V,    (11)

see Dombry et al. (2016, prop. 4). The extremal variogram Γ^{(m)} for any m ∈ V is then equal to the variogram matrix Γ from the definition of the Hüsler–Reiss distribution and, in particular, it is independent of the root node, Γ^{(m)} = Γ for all m ∈ V.

The corresponding extremal correlations have the form χ_ij = 2 − 2Φ(√Γ_ij / 2), where Φ is the standard normal distribution function.
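The one-to-one link between Γ_ij and χ_ij in the Hüsler–Reiss family is straightforward to evaluate numerically; the helper below (our own illustration of the displayed relation) only needs the error function from the standard library:

```python
import math

def std_normal_cdf(z):
    # Phi(z) expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi_husler_reiss(gamma_ij):
    """chi_ij = 2 - 2 * Phi(sqrt(Gamma_ij) / 2) for the Husler-Reiss model."""
    return 2.0 - 2.0 * std_normal_cdf(math.sqrt(gamma_ij) / 2.0)
```

The map is strictly decreasing: Γ_ij = 0 gives complete dependence (χ_ij = 1), and χ_ij → 0 as Γ_ij → ∞, in line with Proposition 2 (iii).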

3.2 Pre-asymptotic extremal variogram

Similar to the extremal correlation in (9) we can define the extremal variogram as the limit of pre-asymptotic versions.

Definition 3

For a multivariate distribution X with continuous marginal distributions we define the pre-asymptotic extremal variogram at level q ∈ (0, 1) rooted at node m ∈ V as the matrix Γ^{(m)}(q) with entries

Γ^{(m)}_{ij}(q) = Var( log{1 − F_i(X_i)} − log{1 − F_j(X_j)} | F_m(X_m) > 1 − q ),  i, j ∈ V,

whenever the right-hand side exists and is finite.

Note that for q close to zero the conditional distribution of the terms log{1 − F_i(X_i)} given F_m(X_m) > 1 − q is, up to an additive constant, approximately that of −log Y^m_i, i ∈ V.
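A plug-in sample version of Definition 3 replaces the margins F_i by their empirical counterparts and conditions on the observations where the root coordinate is among its k = qn largest values. The sketch below is our own rank-based illustration, not the paper's exact estimator:

```python
import math
import statistics

def variogram_hat(data, m, i, j, k):
    """Rank-based pre-asymptotic extremal variogram at level q = k/n:
    empirical margins replace the unknown F, and the variance is taken
    over the k observations where coordinate m is largest."""
    n = len(data)
    cols = list(zip(*data))
    def scores(col):
        order = sorted(range(n), key=lambda t: col[t])
        u = [0.0] * n
        for r, t in enumerate(order):
            u[t] = (r + 1) / (n + 1)      # keeps scores strictly inside (0, 1)
        return u
    ui, uj, um = (scores(cols[c]) for c in (i, j, m))
    idx = [t for t in range(n) if um[t] > 1 - k / n]
    vals = [math.log(1 - ui[t]) - math.log(1 - uj[t]) for t in idx]
    return statistics.variance(vals)
```

For comonotone coordinates the log-differences vanish identically, so the estimate is exactly zero, mirroring complete extremal dependence.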

Next we provide conditions which ensure the convergence Γ^{(m)}_{ij}(q) → Γ^{(m)}_{ij} as q → 0. We introduce the following notation: for a vector x ∈ ℝ^d and I ⊆ {1, …, d}, let x_I denote the vector in ℝ^{|I|} with entries x_j, j ∈ I. For a distribution function F of a d-dimensional random vector X define F_I as the distribution function of the corresponding random vector X_I and let Y^{(I)} denote the limit obtained in relation (2) when F, X, x are replaced by F_I, X_I, x_I. Note that Y^{(I)} is not the same as Y_I, the sub-vector of Y with entries in I, because the latter is not supported on ℒ_I = {x ∈ [0, ∞)^{|I|} : ‖x‖_∞ > 1}. The distribution of Y^{(I)} can be obtained from that of Y_I by conditioning.

  • (B)
    There exist constants ξ > 0, K_B < ∞ such that for any I ⊆ V with |I| ∈ {2, 3} and all q ∈ (0, 1)
    sup_x | ℙ( q/(1 − F_I(X_I)) ≤ x | F_I(X_I) ≰ (1 − q)·1 ) − ℙ( Y^{(I)} ≤ x ) | ≤ K_B q^ξ.    (12)
  • (T)
    There exists a γ > 0 such that for any i, m ∈ V the extremal function satisfies
    𝔼 | log W^m_i |^{2+γ} < ∞.    (13)

Assumption (B) is a strengthening of (2) for bivariate and trivariate distributions, as it imposes that convergence to the limit takes place uniformly and at a certain rate. It is closely related to typical second order conditions on the stable tail dependence function that are fairly standard in the literature; see for instance Einmahl et al. (2012) and Fougères et al. (2015) among many others. Additional details on this matter are given in Section S.12.1 of Supplementary Material S1. Condition (T) is a mild assumption on the extremal functions W^m_i, which holds for all examples considered in the previous section. This condition prevents the distribution of W^m_i from putting too much mass close to zero.

Proposition 3

Under conditions (B) and (T), we have for any m, i, j ∈ V

lim_{q→0} Γ^{(m)}_{ij}(q) = Γ^{(m)}_{ij}.

We note that condition (T) already implies that Γ^{(m)}_{ij} ∈ [0, ∞) for any i, j, so the convergence above is always to a finite limit.

4 STRUCTURE LEARNING FOR EXTREMAL TREE MODELS

4.1 Extremal tree models

Extremal graphical models where the underlying graph is a tree were considered as a sparse statistical model in Engelke and Hitz (2020). As explained in the introduction, their approach of using a censored maximum-likelihood tree becomes prohibitively costly in higher dimension d and requires parametric assumptions on the bivariate distributions of the tree.

Ideally, one would like to have summary statistics, similar to the correlation coefficients r_ij in the Gaussian case, that can be estimated empirically and that are guaranteed to recover the true underlying tree structure when used as edge weights. The extremal variogram defined in Section 3 turns out to be a so-called tree metric, and as such a natural quantity to infer the conditional independence structure in extremal tree models. We underline that the extremal variogram Γ^{(m)} is defined for arbitrary multivariate Pareto distributions and that in the case of the Hüsler–Reiss distribution it coincides with the parameter matrix.

Proposition 4

Let Y be an extremal graphical model with respect to the tree T = (V, E) and suppose that the extremal variogram matrix Γ^{(m)} exists for all m ∈ V. Then we have that

Γ^{(m)}_{ij} = ∑_{{k,l} ∈ ph(ij; T)} Γ^{(m)}_{kl},  i, j ∈ V.    (14)

In other words, for any m ∈ V, the extremal variogram matrix Γ^{(m)} defines an additive tree metric.

Corollary 1

Let Y be an extremal graphical model with respect to the tree T = (V, E). Suppose that the extremal variogram matrix Γ^{(m)} exists and is finite for all m ∈ V and that ℙ(Y_i ≠ Y_j) > 0 for all i, j ∈ V, i ≠ j (or equivalently, Γ^{(m)}_{ij} > 0). For any m ∈ V, the minimum spanning tree with ρ_ij = Γ^{(m)}_{ij} is unique and satisfies

T_mst = T.
For extremal tree models, Corollary 1 shows that independently of any distributional assumption, the extremal variogram contains the conditional independence structure of the tree T. This result is quite surprising, since it is stronger than what is known in the classical, non-extremal theory of trees. Indeed, as discussed in the introduction, for Gaussian graphical models an analogous result holds for a minimum spanning tree with weights ρ_ij = log(1 − r_ij²)/2, where r_ij denotes the correlation between the ith and jth components of the Gaussian random vector under consideration. The assumption of Gaussianity is crucial and the result no longer holds outside this specific parametric class.

Beyond the world of Gaussian graphical models, there exists some literature on the non-parametric estimation of graphical models on tree structures, see Chow and Liu (1968) for an early contribution and Drton and Maathuis (2017, section 3.1) for an overview. However, one either needs to assume discrete distributions (Chow & Liu, 1968) or the existence of densities (Lafferty et al., 2012; Liu et al., 2011), and non-parametric density estimation is required in the latter case. To the best of our knowledge, multivariate Pareto distributions are the first example for a non-parametric sub-class of multivariate distributions where tree dependence structures can be learned using simple moment-based summary statistics without additional parametric assumptions. It is also remarkable that there is no need to assume the existence of densities and that the distributions we consider can simultaneously have continuous and discrete components.

The reason why such a strong result can hold can be explained by the homogeneity of the multivariate Pareto distribution Y. For trees, all cliques contain two nodes and therefore the density f_Y factorises into bivariate Pareto densities. Because of the homogeneity, such a bivariate density can be decomposed into independent radial and angular parts; see Example 1. Bivariate Pareto distributions only differ in terms of the angular distribution, whose support is a subset of a one-dimensional sphere with all coordinates positive. Consequently, an extremal tree model in d dimensions can essentially be reduced to d − 1 univariate angular distributions; see also Proposition 1. This provides an intuitive explanation why the result in Corollary 1 can hold.
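The recovery statement of Corollary 1 is easy to verify on a toy additive tree metric. The sketch below (hypothetical edge values of our own on a four-node chain) computes all pairwise distances by summing edge values along paths, as in (14), and checks that the spanning tree minimising the total weight is exactly the true chain:

```python
import itertools

# Hypothetical positive edge values on the true chain 0 - 1 - 2 - 3.
edge_val = {frozenset((0, 1)): 0.5, frozenset((1, 2)): 1.0,
            frozenset((2, 3)): 0.3}

def gamma(i, j):
    # Additive tree metric: sum the edge values on the unique path i..j.
    lo, hi = min(i, j), max(i, j)
    return sum(edge_val[frozenset((t, t + 1))] for t in range(lo, hi))

def spanning(edges):
    # Do these three edges connect all four nodes?
    reach = {0}
    for _ in range(3):                 # three merge rounds suffice here
        for e in edges:
            if reach & e:
                reach |= e
    return reach == {0, 1, 2, 3}

pairs = [frozenset(p) for p in itertools.combinations(range(4), 2)]
t_mst = min((t for t in itertools.combinations(pairs, 3) if spanning(t)),
            key=lambda t: sum(gamma(*sorted(e)) for e in t))
# t_mst recovers exactly the edges of the true chain
```

Any competing spanning tree must contain a non-edge pair, whose distance is a sum of several positive edge values, so its total weight is strictly larger than that of the true tree.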

We can go further and show that a linear combination of the matrices Γ(m), m∈V, which are possibly different from each other, still induces the true tree as the minimum spanning tree.

Corollary 2

Under the same assumptions as in Corollary 1, the minimum spanning tree with distances ρij = ∑m∈V wmΓij(m), given by a linear combination of the extremal variograms rooted at different nodes with coefficients wm ≥ 0, m∈V, maxm∈V wm > 0, is unique and satisfies Tmst = T.

The extremal correlation coefficients χij do not form a tree metric, that is, they are not additive according to the tree structure, unlike the extremal variogram in (A4). It is therefore a non-trivial question whether these coefficients can also be used as weights in a minimum spanning tree to infer the underlying conditional independence structure. Interestingly, the next result gives a partially affirmative answer.

Proposition 5

Let Y be an extremal graphical model on the tree T=(V,E). Then the extremal correlation coefficients satisfy for any h,l∈V with h≠l that

(15) χhl ≤ χij for all (i,j) ∈ ph(h↔l;T).

Under the additional assumption that this inequality is strict as soon as (i,j)≠(h,l), the minimum spanning tree corresponding to distances ρij = −log(χij) is unique and satisfies Tmst = T.
The assumption that χhl < χij for any (i,j) ∈ ph(h↔l;T) with (i,j)≠(h,l) is not satisfied for all tree models. Indeed, a counterexample (Segers, J. [Personal communication, 7th July 2022]) with index set V={1,2,3} and edges E={(1,2),(2,3)} is the following. Let the extremal function W21 ∼ Unif([1/2,3/2]) and let W32 have a discrete distribution with ℙ(W32=1/4)=4/5 and ℙ(W32=4)=1/5; both are valid extremal functions as in Example 1. In this case χ13=χ23=2/5 and the minimum spanning tree is not unique. We can then only guarantee that the true underlying tree T lies in the set of all possible minimum spanning trees; this follows from a close inspection of the proof of Proposition 5.

There are simple conditions to ensure that inequality (15) is strict for all (i,j)≠(h,l). For instance, a sufficient condition for this to hold is that all extremal functions Wji for (i,j)∈E have support equal to the whole space [0,∞); see Lemma 1 in Section S.10 in Supplementary Material S1. This covers many relevant examples such as the Hüsler–Reiss, the extremal logistic and the extremal Dirichlet distributions in Examples 2, 3 and 4, respectively. A weaker condition for strict inequality was recently obtained by Hu et al. (2022).

Remark 2

Both the extremal variogram Γij(m) and the extremal correlation χij contain information on the conditional independence structure of extremal tree models. The extremal correlation is defined for any model but needs additional assumptions to correctly recover the tree. The extremal variogram does not exist if Y has mass on lower-dimensional sub-faces of its support, but it is guaranteed to recover the underlying tree whenever all extremal variograms exist. When their sample versions are used (see Section 4.2), the probability of correctly identifying the underlying tree may differ between the two approaches even when both work on the population level; see Section 5.

4.2 Estimation

Throughout this section, assume that we observe independent copies X1,…,Xn of the d-dimensional random vector X, which is in the max-domain of attraction of a multivariate Pareto distribution Y, an extremal graphical model on the tree T according to (5). Our aim is to estimate T from the observations X1,…,Xn. Motivated by Proposition 5 and Corollaries 1 and 2, we propose to achieve this through a two-step procedure: we first construct estimators for the quantities χij and Γij(m), and then compute the minimum spanning trees corresponding to those estimators.

The empirical estimator for χij is defined as

χ̂ij := (1/k) ∑t=1,…,n 1{F̃i(Xti) > 1−k/n, F̃j(Xtj) > 1−k/n},

where k=kn is an intermediate sequence and F̃i denotes the empirical distribution function of X1i,…,Xni. Standard arguments imply that, under (2) and provided that k→∞ and k/n→q∈[0,1] as n→∞, we have for any i,j∈V

(16) χ̂ij → χij(q) in probability,

where χij(q) is defined in (9) and χij(0):=χij. In particular, if q=0 then χ̂ij is a consistent estimator of χij.
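As a minimal sketch of this estimator (function name hypothetical, not the paper's implementation): the empirical extremal correlation is the proportion of joint exceedances among the k largest values of each margin, with margins evaluated through their empirical distribution functions (ranks).

```python
import numpy as np

def empirical_chi(X, k):
    """Empirical extremal correlation matrix chi_hat (a sketch).

    X : (n, d) array of observations; k : number of exceedances.
    """
    n, d = X.shape
    # F_tilde_j(X_tj) via ranks; values in (0, 1]
    U = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / n
    exceed = U > 1 - k / n  # (n, d) marginal exceedance indicators
    chi = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            # number of joint exceedances divided by k
            chi[i, j] = np.sum(exceed[:, i] & exceed[:, j]) / k
    return chi
```

For comonotone margins all k exceedances coincide, so the estimator returns values close to one; for weak tail dependence the off-diagonal entries are close to k/n.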

The extremal variogram matrix Γ(m) for the sample Xt, t=1,…,n, is estimated by

Γ̂ij(m) := Var̂{ log(1−F̃i(Xti)) − log(1−F̃j(Xtj)) : F̃m(Xtm) > 1−k/n },

where Var̂ denotes the sample variance. Under the assumption k/n→q∈[0,1] as n→∞ and mild conditions on the underlying data generation, this estimator can be shown to be consistent for the pre-asymptotic version Γij(m)(q) introduced in Definition 3.
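A corresponding sketch of the empirical extremal variogram rooted at m, conditioning on the k largest values of component m (the n+1 denominator in the empirical distribution function, which avoids taking the logarithm of zero, is an implementation choice and not necessarily the paper's; function name hypothetical):

```python
import numpy as np

def empirical_extremal_variogram(X, k, m):
    """Empirical extremal variogram matrix rooted at node m (a sketch).

    Conditions on component m being among its k largest values and takes
    the sample variance of differences of marginal log-transforms.
    """
    n, d = X.shape
    # empirical cdf with n+1 in the denominator so that 1 - U > 0
    U = (np.argsort(np.argsort(X, axis=0), axis=0) + 1) / (n + 1)
    idx = U[:, m] > 1 - k / n          # observations extreme in component m
    L = -np.log(1 - U[idx])            # approximate standard Pareto log-scale
    G = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            G[i, j] = np.var(L[:, i] - L[:, j])  # sample variance
    return G
```

The resulting matrix is symmetric with zero diagonal, as required for a variogram.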

Theorem 1

Let assumptions (B), (T) hold and assume that k ≥ n^θ for some θ>0 and that k/n→q∈[0,1] as n→∞. Then we have for any m,i,j∈V

Γ̂ij(m) → Γij(m)(q) in probability,

where Γij(m)(0) := Γij(m).

The proof of this result turns out to be surprisingly technical; details are given in Section S.12.4 in Supplementary Material S1. The main challenge arises from the fact that in the definition of Γij(m) only the observations in component m are extreme, while observations in other components may be non-extreme. This is different from the setting that is typically considered in asymptotically dependent extreme value theory.

Remark 3

By choosing q=0, the above theorem implies consistency of the empirical extremal variogram Γ̂(m). This result is of independent interest, since it provides the first proof of consistency of the moment estimators Γ̂ij(m), which were introduced in Engelke et al. (2015) as estimators for the parameters of the Hüsler–Reiss distribution.

Remark 4

The assumption that the data X1,,Xn are independent was only made to keep the presentation simple. The consistency result in Theorem 1 continues to hold under a high-level assumption that allows for temporal dependence and is spelled out in detail at the beginning of Section S.12.4 in Supplementary Material S1.

Now we have all results that are needed for consistent estimation of the underlying tree structure. Given a general distance ρ with estimator ρ̂ on pairs (i,j)∈V×V, we consider plug-in procedures of the form

(17) T̂ρ = argmin over spanning trees T′=(V,E′) of ∑(i,j)∈E′ ρ̂ij,

with three cases of particular interest given by

ρ̂ij = −log χ̂ij, ρ̂ij = Γ̂ij(m), ρ̂ij = ∑m∈V wmΓ̂ij(m),

resulting in the estimators T̂χ, T̂Γ(m), T̂Γw, respectively. The special case of w1=⋯=wd=1/d is denoted by T̂Γ. We solve the minimum spanning tree problem (17) by Prim's algorithm, which is guaranteed to find a global optimiser of problem (17); the optimiser is unique if the distances ρ̂ij are distinct across pairs (Prim, 1957).
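Problem (17) can be solved with a direct implementation of Prim's algorithm on the complete graph with weights ρ̂ij; a minimal sketch (function name hypothetical):

```python
import numpy as np

def prim_mst(rho):
    """Prim's algorithm for the minimum spanning tree of a complete graph.

    rho : (d, d) symmetric matrix of pairwise distances rho_hat.
    Returns the sorted list of the d-1 edges of the minimum spanning tree.
    """
    d = rho.shape[0]
    in_tree = np.zeros(d, dtype=bool)
    in_tree[0] = True
    cost = rho[0].copy()               # best connection cost to the tree
    parent = np.zeros(d, dtype=int)    # node realising that cost
    edges = []
    for _ in range(d - 1):
        # cheapest node outside the current tree
        v = int(np.argmin(np.where(in_tree, np.inf, cost)))
        edges.append((int(min(parent[v], v)), int(max(parent[v], v))))
        in_tree[v] = True
        # relax connection costs through the newly added node v
        better = ~in_tree & (rho[v] < cost)
        cost[better] = rho[v][better]
        parent[better] = v
    return sorted(edges)
```

For distinct weights the returned tree is the unique global optimiser, matching the uniqueness condition mentioned above.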

Theorem 2

Assume that Y is an extremal graphical model on the tree T. Assume (2) holds and that the inequality in (15) is strict whenever (i,j)≠(h,l). If k→∞ as n→∞, then there exists q*>0 such that under the additional assumption k/n→q∈[0,q*] as n→∞,

ℙ(T̂χ = T) → 1.

If instead of (2) and strict inequality in (15), assumptions (B), (T) hold, ℙ(Yi≠Yj)>0 for all i,j∈V, i≠j (or equivalently, Γij(m)>0), and if k ≥ n^θ for some θ>0, then for any m∈V there exists q*m>0 such that for k/n→q∈[0,q*m] as n→∞, we have

ℙ(T̂Γ(m) = T) → 1.

The same is true for T̂Γw provided the weights wm satisfy wm ≥ 0, maxm wm > 0.

Remark 5

As pointed out by a referee, it would be of interest to find weights that maximise (asymptotically) the probability of correct tree structure recovery by T^Γw. This would require precise information on the joint asymptotic distribution of Γ^(m) for different m, which is currently an open question.

Remark 6

At first glance it might seem surprising that the tree structure can be estimated consistently even when kn/n does not converge to zero. The latter would be a classical minimal assumption in extreme value theory and would be required for consistent estimation of χij or Γij(m). We explain the intuition behind this result for the extremal correlation; the arguments for the extremal variogram are exactly the same. Assume that the inequality in (15) is strict whenever (i,j)≠(h,l), making the minimum spanning tree with respect to −log χij unique. The key insight is that even biased estimators of χij(q) can lead to the correct minimum spanning tree, since all we need is

∑(i,j)∈E′ (−log χij(q)) > ∑(i,j)∈E (−log χij(q))

for all trees T′=(V,E′)≠T, where T=(V,E) denotes the true underlying tree. Multivariate regular variation (2) implies that χij(q)→χij as q→0 for all i,j, so there exists q0>0 such that the above inequality is satisfied for all q<q0. Since in addition χ̂ij = χij(k/n) + oP(1) as n→∞ under the assumption k→∞, consistency follows.

Theorem 2 shows that the proposed procedures are able to consistently recover the tree structure under rather weak assumptions on the sequence k=kn. It is natural to wonder which choices of k correspond to higher probabilities of recovering the tree structure. Here we provide an indicative discussion of this issue for minimum spanning trees based on χij without going into technical details. Standard results from empirical process theory show that, under mild assumptions and for k/n→q∈[0,1] as n→∞, the √k(χ̂ij−χij(k/n)) converge jointly to a multivariate normal distribution with covariance matrix Σq. The latter matrices satisfy Σq→Σ0 as q→0. Combined with the delta method this implies that for any tree T′=(V,E′)≠T

∑(i,j)∈E′ ρ̂ij − ∑(i,j)∈E ρ̂ij = Δk,n − Zk,n/√k, with Δk,n := ∑(i,j)∈E′ ρij(k/n) − ∑(i,j)∈E ρij(k/n),

where ρij(k/n) := −log χij(k/n) and ρ̂ij := −log χ̂ij, and Zk,n is a weighted linear combination of the differences √k(χ̂ij−χij(k/n)) and thus approximately centred normal with variance σq². The probability that the sum of estimated distances over T′ is smaller than the sum over the true tree T is given by ℙ(Zk,n > √k Δk,n). Under the assumptions for asymptotic normality of χ̂ij, Δk,n converges to Δ(q) := ∑(i,j)∈E′ ρij(q) − ∑(i,j)∈E ρij(q). Combining all of the above approximations we find ℙ(Zk,n > √k Δk,n) ≈ ℙ(σq N(0,1) > √(nq) Δ(q)). Since σq→σ0>0 and Δ(q)→Δ(0)>0 as q→0, there exists q0>0 such that √q Δ(q)/σq < √q0 Δ(q0)/σq0 for all q<q0, and thus the probability of selecting T′ instead of the true tree T increases once the limit of k/n drops below q0. This suggests that an optimal value of k in terms of maximising the probability of estimating the true tree would satisfy k/n→q̃ as n→∞ for some q̃>0. Turning the above arguments into a formal proof would require many technicalities which are beyond the scope of the present paper, but the intuition obtained here is confirmed in the simulations in Section 5.

4.3 Estimation in growing dimensions

The consistency results in the previous section were derived for data of fixed dimension for sample size tending to infinity. Here we provide an extension of those results by adding non-asymptotic bounds on the probability of consistently estimating the true tree. Throughout this section, the underlying tree can change with the sample size n.

We start with discussing results for T̂χ. This requires the following additional notation. Assume that Y is an extremal graphical model on the tree T=(V,E) and define the corresponding extremal correlations χijY := ℙ(Yi>1 | Yj>1). Let

μχY := min over (h,l)∉E of min(i,j)∈ph(h↔l;T) ( log χijY − log χhlY ).

To gain some intuition on the reason for this definition, recall that in (15) in Proposition 5 we show that

χhlY ≤ χijY for all (i,j) ∈ ph(h↔l;T).

To ensure that the minimum spanning tree corresponding to −log χijY is unique, we need to rule out equality in the above statement whenever (i,j)≠(h,l), which follows from μχY>0. Thus the quantity μχY can be interpreted as a lower bound on the increase of the sum of distances over the edges when moving from the true tree T to any tree T′≠T. This is formalised in the proof of Theorem 3. We are now ready to state the first main result.
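The path inequality underlying this separation quantity can be checked numerically for a given tree and extremal correlation matrix. The sketch below uses zero-indexed nodes; the helper names and the value χ12=0.8 are hypothetical, while χ13=χ23=2/5 match the counterexample discussed after Proposition 5.

```python
import numpy as np
from itertools import combinations

def tree_paths(d, edges):
    """All paths between node pairs in a tree, as lists of (sorted) edges."""
    adj = {v: [] for v in range(d)}
    for i, j in edges:
        adj[i].append(j); adj[j].append(i)
    def path(h, l):
        prev, stack, seen = {h: None}, [h], {h}
        while stack:                       # depth-first search from h
            v = stack.pop()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w); prev[w] = v; stack.append(w)
        out, v = [], l                     # walk back from l to h
        while prev[v] is not None:
            out.append((min(prev[v], v), max(prev[v], v))); v = prev[v]
        return out
    return {(h, l): path(h, l) for h, l in combinations(range(d), 2)}

def check_path_inequality(chi, edges):
    """Check chi_hl <= chi_ij for all edges (i,j) on the path h <-> l."""
    paths = tree_paths(chi.shape[0], edges)
    return all(chi[h, l] <= min(chi[i, j] for i, j in paths[(h, l)]) + 1e-12
               for (h, l) in paths)
```

Strictness of the inequality for non-edges (h,l), which μχY>0 quantifies, is exactly what the check above does not enforce: equality on a non-edge makes the minimum spanning tree non-unique.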

Theorem 3

Assume that Y is an extremal graphical model on the tree T and that X1,…,Xn are independent copies of X, a random vector with continuous marginal distributions. Let χij(k/n) denote the pre-asymptotic extremal coefficients corresponding to X in the sense of (9) and define

δk/n := maxi,j∈V |χij(k/n) − χijY|.

Then there exists a universal constant K>0 such that

(18) ℙ(T̂χ ≠ T) ≤ K d² exp( −3k { [((μχY − 2δk/n)+)² / (4K²)] ∧ 1 } ).

Note that above we did not assume that X is in the max-domain of attraction of Y. A link between X and Y is implicitly provided through δk/n which measures the distance between χij(k/n) computed from X and the extremal coefficients χijY which correspond to Y.

Some comments on the implications of the above result are in order. On a high level, larger dimensions d, smaller values of μχY, and larger bias δk/n lead to a larger bound. The effects of dimension d and bias δk/n are intuitive: larger dimensions or more bias make the tree recovery problem more difficult. The effect of μχY is also expected because smaller values of μχY imply that, on population level, there exist trees that are closer to the true tree and estimation becomes more difficult.

For a more quantitative discussion, assume that (B) in Section 3.2 holds with constants KB, ξ independent of n,d. In this case δk/n ≤ KR(k/n)^ξ for a possibly different constant KR which is still independent of n,d,ξ; see (S.18) in Section S.13.3 in Supplementary Material S1. Note that the exponent can be bounded by −3n(k/n){[((μχY − 2KB(k/n)^ξ)+)²/(4K²)] ∧ 1}. Straightforward but tedious computations optimising this rate over k show that the largest achievable rate for this exponent is of order n(μχY)^{2+1/ξ} if we let k = cn(μχY)^{1/ξ} for a suitable constant c∈(0,∞) which depends on K, KB, ξ only. With this choice of k, consistent tree structure recovery is possible if log d = o(n(μχY)^{2+1/ξ}) as n→∞. If μχY stays bounded away from zero, this simplifies to log d = o(n) as n→∞, which allows the dimension to grow exponentially in n. In contrast, if the dimension d is fixed but we consider observations from a triangular array with the same tree but changing value of μχY, we require n(μχY)^{2+1/ξ} → ∞ as n→∞, provided that k is chosen as described above. This condition becomes more stringent for smaller ξ, which is intuitive since smaller ξ corresponds to more slowly decaying bias.

We now discuss tree structure recovery with T̂Γ(m) and T̂Γw. Key results here are concentration bounds on Γ̂ij(m). Such bounds are established in Engelke et al. (2021) and reproduced in the proof of Theorem 4 in Section S.13.3 in Supplementary Material S1. To state those bounds we need an additional assumption.

  • (D)
    For all I⊂V with |I|=2 the random variables Y(I) have densities fI. There exists an ε>0 such that for all β∈[ε,1−ε] there is a constant K(β) such that

This is equivalent to assumption 2 in Engelke et al. (2021); see the discussion around (S.61) in Section S.13.3 in Supplementary Material S1. Engelke et al. (2021) show that it holds, for instance, for Hüsler–Reiss distributions. This condition is implied by the simpler but stronger condition fI(x,2−x) ≤ Kr(x(2−x))^{−1+ε} for some ε>0 and all x∈(0,2); this follows from elementary calculations involving the homogeneity of fI, which is derived in (S.60) in Section S.13.3 in Supplementary Material S1.

Theorem 4

Assume that Y is an extremal graphical model with respect to the tree T and that X1,…,Xn are independent samples of X, a random vector with continuous marginal distributions. Assume that (B), (T), (D) hold and that k ≥ n^θ for some θ>0. Then there exist constants c,C,M>0 depending only on the constants from (B), (T), (D) and θ such that for all k≥1 and bk/n := (k/n)^κ (log(n/k))², where κ := γξ/(1+γ+ξ),

(19)

For wm ≥ 0 with ∑m=1,…,d wm = 1, the same bound holds for ℙ(T̂Γw ≠ T) with min(i,j)∈E Γij(m) replaced by min(i,j)∈E ∑m=1,…,d wmΓij(m).

We note that Assumption (D) can be dropped at the cost of an additional 1/(log n)^4 factor; details are provided in Section S.13.3 in Supplementary Material S1. Similarly to Theorem 3, we do not explicitly assume that X is in the domain of attraction of Y; Assumption (B) provides the link between X and Y in terms of their bivariate and trivariate distributions.

We briefly comment on the result in Theorem 4. Observe that the general structure of the bound is similar to the corresponding result for tree structure recovery based on χ. The fact that 1 in (18) is replaced by (log n)^8 in (19) is due to technical details in the derivation of tail bounds for Γ̂ij(m), which has a more complex structure than the simple estimator χ̂ij. Similarly to μχY in (18), min(i,j)∈E Γij(m) can be interpreted as measuring the minimal separation between the lengths of the shortest and second-shortest spanning trees. The quantity bk/n appearing in Theorem 4 stems from bounds on bias terms in estimating Γij(m) and plays a role similar to that of δk/n for χij. Comments on the fastest possible growth of the dimension d and minimal separation conditions that still allow for consistent tree structure recovery follow along the same lines as in the discussion following Theorem 3 and are omitted for the sake of brevity.

5 SIMULATIONS

The minimum spanning trees based on the empirical versions of the extremal variogram and extremal correlation both recover asymptotically the underlying extremal tree structure. In this section we study the finite sample behaviour of the different tree estimators on simulated data. The results and figures of Sections 5 and 6 can be reproduced with the code at https://github.com/sebastian-engelke/extremal_tree_learning.

Let T=(V,E) be a random tree structure that is generated by sampling uniformly d−1 edges and adding these to the empty graph under the constraint that no cycles are created. Throughout the whole section, we simulate n samples X1,…,Xn from a random vector X in the domain of attraction of a multivariate Pareto distribution Y that is an extremal graphical model on the tree T in dimension d=|V|. As the random vector X we take the corresponding max-stable distribution (e.g., de Haan, 1984), which is indeed in the domain of attraction of Y in the sense of (2). A common way to perturb the samples is to add lighter-tailed noise (e.g., Einmahl et al., 2016). More precisely,

(20) Xi = Zi + εi, i=1,…,n,

where Zi is a max-stable random vector with standard Fréchet margins associated to Y, and εi is a lighter-tailed noise vector independent of Zi. We consider two scenarios for the noise distribution, where in both cases the marginal distributions are transformed to a Fréchet distribution with ℙ(εij ≤ x) = exp(−1/x²), x ≥ 0, j∈V.

  • (N1)

    The noise vector εi has independent entries.

  • (N2)

    The noise vector εi in (20) is generated from an extremal tree model on a fixed tree Tnoise that is generally different from the true tree T.

Since the marginals of the noise vector have lighter tails, the limit of Xi in (2) is not altered by εi. The main difference between the two noise mechanisms lies in the type of bias they introduce for large k, and we observe that this has an interesting impact on the recovery of the tree structure underlying Y.
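For concreteness, noise margins with ℙ(ε ≤ x) = exp(−1/x²) can be sampled by inversion of this distribution function; a minimal sketch (function name hypothetical):

```python
import numpy as np

def frechet2_noise(rng, size):
    """Sample noise with P(eps <= x) = exp(-1/x^2) by inversion (a sketch).

    Inverting u = exp(-1/x^2) gives x = (-log u)^(-1/2).
    """
    U = rng.uniform(size=size)
    return (-np.log(U)) ** -0.5
```

This Fréchet distribution with shape parameter 2 has a lighter tail than the standard Fréchet margins of Zi, consistent with the requirement that the noise does not alter the limit in (2).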

We consider two different parametric classes of distributions for Y.

  • (M1)
    The Hüsler–Reiss tree model is a multivariate Pareto distribution that factorises on T=(V,E), where each bivariate distribution (Yi,Yj) for (i,j)E is Hüsler–Reiss with parameter Γij>0; see Example 4. The joint distribution is then also Hüsler–Reiss with parameter matrix Γ induced by the tree structure through (14). The coefficients on the edges are generated as
  • (M2)
    For the second model we let each bivariate distribution be given by the family of asymmetric Dirichlet distributions; see Example 3. We generate the two parameters of the bivariate Dirichlet models independently as
    Note that the resulting d-dimensional Pareto distribution is not in the family of Dirichlet distributions.
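For model (M1), assuming that (14) induces the full Hüsler–Reiss parameter matrix by adding the edge parameters along the unique paths of the tree (as is consistent with the tree additivity of the extremal variogram), the induced matrix can be computed by a shortest-path pass; on a tree, path sums and shortest-path distances coincide. Function name hypothetical:

```python
import numpy as np

def hr_tree_parameter_matrix(d, edge_gammas):
    """Full Husler-Reiss parameter matrix from edge parameters on a tree,
    assuming additivity along paths (a sketch of (14)).

    edge_gammas : dict {(i, j): gamma_ij} over the tree edges.
    """
    G = np.full((d, d), np.inf)
    np.fill_diagonal(G, 0.0)
    for (i, j), g in edge_gammas.items():
        G[i, j] = G[j, i] = g
    # Floyd-Warshall; on a tree this returns the path sums
    for m in range(d):
        G = np.minimum(G, G[:, [m]] + G[[m], :])
    return G
```

For example, a path 0–1–2 with edge parameters 1 and 2 yields the off-path entry Γ02 = 3.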

We compare four different estimators for the weights on the minimum spanning tree T^ρ=(V,Êρ) in (17):

  • (i)

    ρ^ij=logχ^ij, where χ^ij is the empirical extremal correlation;

  • (ii)

    ρ^ij=Γ^ij(m), the extremal variogram estimator for one fixed mV;

  • (iii)

    ρ^ij=Γ^ij, the combined extremal variogram estimator;

  • (iv)

    ρ^ij are the censored negative log-likelihoods of the bivariate Hüsler–Reiss model (Yi,Yj), evaluated at the optimiser.

The estimators (i)–(iii) were introduced in Section 4.2, where their consistency was derived. The estimator in (iv) is the one used in Engelke and Hitz (2020) to learn the structure of Hüsler–Reiss tree models; note that no theoretical justification is available for this estimator. As performance measures we choose the average proportion of wrongly estimated edges

(21) E[ |Êρ \ E| / (d−1) ]

and the probability of not recovering the correct tree structure

(22) E[ 1{Êρ ≠ E} ]

where the outer expectations signify that the tree T is randomly generated in each repetition. Each experiment is repeated 300 times in order to estimate these errors empirically. We report only the results on the structure recovery rate error (22) and provide the corresponding results on the wrong edge rate (21) in Section S.2 in Supplementary Material S1.
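For a single replication, the two performance measures can be sketched as follows; the normalisation of the wrong-edge count by the number of edges d−1 is an assumption, function name hypothetical, and averaging over the 300 repetitions (with freshly generated trees) yields the reported errors.

```python
def structure_errors(est_edges, true_edges, d):
    """Wrong-edge proportion and exact-recovery indicator for one run (a sketch)."""
    est = {tuple(sorted(e)) for e in est_edges}
    true = {tuple(sorted(e)) for e in true_edges}
    wrong_edge_rate = len(est - true) / (d - 1)  # proportion of wrong edges
    recovered = est == true                      # exact structure recovery
    return wrong_edge_rate, recovered
```

The exact-recovery indicators average to the structure recovery rate error; the wrong-edge rates average to the edge-level error.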

We first investigate the choice of the intermediate sequence k=kn, the number of exceedances used for estimation. We simulate from the Hüsler–Reiss tree model (M1) in dimension d=20 and consider the minimum spanning trees T̂Γ and T̂CL based on the combined extremal variogram and the censored likelihoods, respectively. Figure 3 shows the structure recovery rate error as a function of the exceedance probability k/n for different sample sizes n. Interestingly, the two noise patterns lead to qualitatively different results: while consistent recovery of the limiting tree seems possible even when k=n for noise model (N1), noise model (N2) with a dependence structure also introduces a bias in the corresponding minimum spanning tree, and the true tree cannot be recovered when the limit of k/n is too large. It is interesting to observe that the optimal exceedance probability k/n seems to converge to a positive value q, especially for noise (N1). This is consistent with the intuition given at the end of Section 4.2 in the paragraph after Remark 6, and it contrasts with classical asymptotic theory for consistent estimation in extremes, where k=o(n) is required to remove the approximation bias and therefore q=0.

FIGURE 3
Structure recovery rate error of trees from the Hüsler–Reiss model (M1) with independent noise (N1) (top) and tree noise (N2) (bottom) in dimension d=20, estimated based on empirical correlation (orange), extremal variogram with fixed m∈V (blue), combined empirical variogram (green) and censored maximum likelihood (yellow), as a function of the exceedance probability k/n; sample sizes n=500 (left column), n=1000 (center column) and n=2000 (right column)

Next we compare the performance of the different structure learning methods for varying sample size n. Since the value of q required for consistent estimation is unknown in practice, we choose k=n^0.8, which satisfies all assumptions of our theory. The results for dimension d=20 are shown in the top row of Figure 4 for the Hüsler–Reiss model (M1) and in the bottom row for the asymmetric Dirichlet model (M2). We observe that the two methods based on the extremal variogram perform consistently better than the extremal correlation-based method. Intuitively this can be explained by the fact that the extremal variogram is a tree metric for conditional independence of multivariate Pareto distributions. The additivity on the tree results in a bigger loss in the minimum spanning tree algorithm when choosing a wrong edge, and therefore it is easier to identify the true structure. The extremal correlation only satisfies the weaker relation (15) on the tree, which might be a reason for the higher error rate. Additionally, the empirical variogram uses information from the entire multivariate Pareto distribution, while the extremal correlation evaluates its distribution at a single point only.

A comparison with the censored maximum likelihood estimator (iv) yields several insights. First, this approach seems to lead to consistent estimation of the tree structure even in model (M2), where Y is not a Hüsler–Reiss distribution and the likelihood is thus misspecified. This might be explained by the fact that the strength of dependence is still estimated sufficiently well, and the minimum spanning tree only requires a correct ordering of the edge weights, which is much weaker than consistency of the estimated weights. Second, the two types of noise distribution in (N1) and (N2) lead to opposite orderings of the best method: whereas T̂CL has a slight advantage for noise (N2), T̂Γ performs substantially better under (N1). Notably, this is the case even in model (M1), where the likelihood is well specified. A possible explanation is that the likelihood is not exactly specified due to the added noise in the model and the use of ranks during estimation, so classical results on the asymptotic optimality of maximum likelihood methods do not apply here. Moreover, the added noise affects the biases of the estimators differently, which changes the performance ordering depending on the noise distribution.

FIGURE 4
Structure recovery rate error of trees from the Hüsler–Reiss model (M1) (top) and Dirichlet model (M2) (bottom) in dimension d=20, estimated based on empirical correlation (orange), extremal variogram with fixed m∈V (blue), combined empirical variogram (green) and censored maximum likelihood (yellow); independent noise (N1) (left) and tree noise (N2) (right)

For a given tree, the task of estimating the correct structure can differ substantially according to the strength of dependence of the multivariate Pareto distribution. We therefore conduct a simulation study where we fix n=500 and k=n^0.8 and illustrate the performance of the structure estimation methods for a varying strength of tail dependence. For the Hüsler–Reiss model, we randomly generate a tree T=(V,E) in dimension d=20 and for (i,j)∈E we fix all Γij=λ to some constant λ>0. Equivalently, this means that all neighbouring nodes have extremal correlation χij = 2 − 2Φ(√λ/2). The left panel of Figure 5 shows the results for varying strength of extremal dependence between neighbours, measured by the extremal correlation, under noise model (N1). Unsurprisingly, the performance of all methods deteriorates at the boundaries, which correspond to the non-identifiable cases of independence and complete dependence. In general, it seems that the empirical variogram-based estimators perform better under stronger dependence, which is probably due to the higher bias of the empirical extremal variogram under weak dependence. The same asymmetry can be observed for the censored maximum likelihood method, while the performance of the extremal correlation seems to be symmetric around χ=1/2. Comparing the performance of different methods, we observe that under noise (N1) the combined extremal variogram performs best uniformly in the values of χ, and the advantage over all other methods can be substantial. The same analysis with noise (N2) is shown in the right panel of Figure 5. In line with the results in Figure 4, the performance of T̂CL and T̂Γ is fairly similar, with a slight advantage for T̂CL at values of χ around 0.5 and the converse for χ closer to 0.2 and 0.8.
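The standard Hüsler–Reiss relation between the edge parameter and the extremal correlation, χ = 2 − 2Φ(√λ/2), and its inverse can be sketched as follows (function names hypothetical):

```python
import numpy as np
from scipy.stats import norm

def hr_chi_from_gamma(gamma):
    """Extremal correlation of a Husler-Reiss pair: chi = 2 - 2*Phi(sqrt(gamma)/2)."""
    return 2 - 2 * norm.cdf(np.sqrt(gamma) / 2)

def hr_gamma_from_chi(chi):
    """Inverse map: gamma = (2 * Phi^{-1}(1 - chi/2))**2."""
    return (2 * norm.ppf(1 - chi / 2)) ** 2
```

The map is strictly decreasing, with χ→1 as λ→0 (complete dependence) and χ→0 as λ→∞ (independence), matching the boundary behaviour discussed above.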

FIGURE 5
Structure recovery rate error for the Hüsler–Reiss model (M1) with noise model (N1) (left) and (N2) (right) in dimension d=20, as a function of the extremal dependence between neighbours measured by the extremal correlation χ; the different methods are based on empirical correlation (orange), extremal variogram with fixed m∈V (blue), combined empirical variogram (green) and censored maximum likelihood (yellow)

For the final set of comparisons we study the performance of the methods for growing dimension d∈{10,20,30,50,100,200,300}, where we fix the sample size n=1000 and the number of exceedances k=⌊n^0.8⌋=251. Figure 6 shows the structure recovery rate errors for the different methods. As expected, the errors increase with the dimension, but much more slowly for the combined extremal variogram than for the other methods. Theoretical pre-asymptotic bounds for the error rates can be found in Section 4.3. We remark that we were not able to run simulations in more than d=100 dimensions for the censored maximum likelihood estimator because of the prohibitive computational cost. For the same reason we have only included simulations for one model and one noise setting.

FIGURE 6
Structure recovery rate error for the Hüsler–Reiss model (M1) with noise model (N1) as a function of the number of nodes |V|=d of the tree; the different methods are based on empirical correlation (orange), extremal variogram with fixed m∈V (blue), combined empirical variogram (green) and censored maximum likelihood (yellow)

We close this section with some comments on computation times for the four estimators. The extremal correlation and variogram-based trees rely on empirical estimators and are very efficient to compute. The censored likelihood estimator, however, requires numerical optimisation for every weight ρ̂ij, i,j∈V, which becomes prohibitively costly in higher dimensions. Figure 7 shows the average computation times for the four estimators in the simulations in Figures 4 and 6. It can be seen that the censored likelihood method is several orders of magnitude slower than the empirical methods and, as seen in the right-hand panel of Figure 7, this quickly becomes prohibitive as the dimension grows.

FIGURE 7
Average computation times in seconds in the simulation studies in Section 5 of the four algorithms based on empirical correlation (orange), extremal variogram with fixed m∈V (blue), combined empirical variogram (green) and censored maximum likelihood (yellow). Left: for fixed dimension d=20; right: for fixed sample size n=1000

6 APPLICATION

We illustrate the proposed methodology on foreign exchange rates of d=26 currencies expressed in terms of the British Pound sterling; see Table A1 in Appendix A.1 for the three-letter abbreviations of the respective countries. The data are available from the website of the Bank of England.1 They consist of daily observations of spot foreign exchange rates in the period from 1 October 2005 to 30 September 2020, resulting in n=3790 observations.

In order to obtain time series without temporal dependence, we pre-process the data set. We first compute the daily log-returns Rij, i=1,…,n, j=1,…,d, from the original time series. To remove the serial dependence, we then filter the univariate series by ARMA-GARCH processes; see Hilal et al. (2014) for a similar approach, and Bollerslev et al. (1992) and Engle (1982) for background on financial time series modelling. The AIC suggests that an ARMA(0,2)-GARCH(1,1) model is the most appropriate for most of the univariate series. We derive the absolute values of the standardised filtered returns as

Xij = |(Rij − μ̂ij)/σ̂ij|,

where μ̂ij and σ̂ij are the estimated mean and SD of the ARMA-GARCH model. Taking absolute values means that we are interested in extremes in both directions.
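This standardisation step can be sketched in a few lines, assuming the ARMA-GARCH fit has already produced fitted means and standard deviations; the function and variable names below are illustrative, not the authors' code.

```python
import numpy as np

# Sketch of the pre-processing step: R holds the daily log-returns (shape (n, d)),
# mu_hat and sigma_hat the fitted conditional means and standard deviations from
# the per-series ARMA-GARCH models.
def standardised_abs_returns(R, mu_hat, sigma_hat):
    """Absolute standardised filtered returns X_ij = |(R_ij - mu_ij) / sigma_ij|."""
    return np.abs((R - mu_hat) / sigma_hat)

# Toy illustration with constant mean 0 and SD 1:
R = np.array([[0.5, -1.2], [-0.3, 2.0]])
X = standardised_abs_returns(R, np.zeros_like(R), np.ones_like(R))
```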

The data Xi=(Xi1,…,Xid) are approximately independent and identically distributed for i=1,…,n, and we will model their tail dependence using an extremal tree model. We first check whether the assumption of asymptotic dependence is satisfied by inspecting the behaviour of the function q ↦ χ̂ij(q) for values q=k/n close to 1. For most of the pairs this function seems to converge to a positive value, indicating fairly strong dependence in the tail between the filtered log-returns; see Figure S3 in Section S.4 of Supplementary Material S1 for some examples.
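The empirical pairwise quantity χ̂ij(q) can be computed from ranks alone. The following is a minimal sketch under the standard definition χ̂(q) as the joint exceedance probability of the empirical q-quantiles, normalised by 1−q; the function name is ours.

```python
import numpy as np

# Empirical extremal correlation: P(F_i(X_i) > q, F_j(X_j) > q) / (1 - q),
# with the marginal CDFs replaced by ranks.
def empirical_chi(x, y, q):
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 1) / n  # empirical CDF values (ranks / n)
    v = (np.argsort(np.argsort(y)) + 1) / n
    return np.mean((u > q) & (v > q)) / (1 - q)

# For a perfectly dependent pair the estimate equals 1 at any threshold:
x = np.arange(1000, dtype=float)
chi = empirical_chi(x, x, q=0.9)
```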

Before estimating the extremal tree structure for this data set, we discuss the choice of the number of exceedances k, or equivalently the exceedance probability q=k/n. This is an important practical issue and a long-standing problem in extreme value theory. In essence, it is a bias-variance trade-off, as illustrated in the simulations in Figure 3.

For tree structure estimation, we propose to leverage the specific structure of the tree learning problem. From (14) it follows that for an extremal graphical model on a tree T, the corresponding population quantity Γ = d−1 ∑m=1,…,d Γ(m) forms a tree metric on that tree. In the sequel, for a generic Γ and tree T, we denote the extremal variogram matrix completed on the tree T by

ΓijT = ∑(s,t)∈ph(ij;T) Γst,  i,j ∈ V,

where ph(ij;T) is the set of edges on the unique path between i and j in T.
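The completion of the variogram matrix on a tree by path additivity can be sketched as follows, assuming the edge values of Γ are given; all names are illustrative.

```python
import numpy as np
from collections import deque

# Completing the extremal variogram on a tree: Gamma^T_ij is the sum of the
# edge values over the unique path from i to j in T (path additivity, Eq. (14)).
def complete_on_tree(edges, gamma_edge, d):
    """edges: list of (i, j); gamma_edge: dict {(i, j): value}; d: number of nodes."""
    adj = {v: [] for v in range(d)}
    for (i, j) in edges:
        w = gamma_edge[(i, j)]
        adj[i].append((j, w))
        adj[j].append((i, w))
    G = np.zeros((d, d))
    for root in range(d):
        # BFS from the root, accumulating path sums
        dist = {root: 0.0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for (v, w) in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    queue.append(v)
        G[root] = [dist[v] for v in range(d)]
    return G

# Toy chain 0 - 1 - 2 with edge values 1.0 and 2.0:
G = complete_on_tree([(0, 1), (1, 2)], {(0, 1): 1.0, (1, 2): 2.0}, d=3)
```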

We propose to select k so as to minimise the deviation of the empirical values of Γ̂ from forming a tree metric on the estimated tree T̂. More precisely, define

Δ̂(k/n) = ∑i<j { χ̂ij − g(Γ̂ijT̂) }²,  (23)

where, as the function g: ℝ+ → [0,1], we choose the transformation from Γ to χ in Hüsler–Reiss models, that is, g(x) = 2 − 2Φ(√x/2); see Example 4. Here we indicate that these estimates depend on the exceedance probability q=k/n; note that the estimated tree T̂ also depends on k. Additional motivation for the form of Δ̂(k/n) and a literature review of classical approaches are given in Section S.3 in Supplementary Material S1.
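A minimal sketch of this tuning criterion, with g the Hüsler–Reiss transformation g(x) = 2 − 2Φ(√x/2) and the completed tree matrix taken as input; function names are ours.

```python
import math

# g maps a Huesler-Reiss variogram value to the implied extremal correlation.
def g(x):
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 - 2.0 * Phi(math.sqrt(x) / 2.0)

# Criterion (23): squared deviation between empirical extremal correlations and
# those implied by the tree metric, summed over all pairs i < j.
def tree_metric_error(chi_hat, gamma_tree):
    d = len(chi_hat)
    return sum((chi_hat[i][j] - g(gamma_tree[i][j])) ** 2
               for i in range(d) for j in range(i + 1, d))

# If the empirical correlations already agree with the tree metric, the error is 0:
gamma = [[0.0, 1.0], [1.0, 0.0]]
chi = [[1.0, g(1.0)], [g(1.0), 1.0]]
err = tree_metric_error(chi, gamma)
```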

Motivated by the simulations in the previous section, we estimate the extremal tree structure T̂Γ=(V,ÊΓ) non-parametrically using the combined empirical extremal variogram Γ̂. The left panel of Figure 8 shows the error Δ̂(k/n) as a function of q=k/n for the exchange rate data set. It can be seen that the error indeed seems to stabilise for values of 1−q above 0.97. We therefore choose q=0.03 in this application, which corresponds to k=114. The corresponding minimum spanning tree is shown in Figure 9; we note that the tree is very stable across different values of q close to 0.
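The minimum spanning tree step itself can be sketched with a plain Prim's algorithm on the matrix of estimated weights; this is an illustrative implementation, not the one used in the paper.

```python
# Prim's algorithm: repeatedly add the cheapest edge leaving the current tree,
# applied to a symmetric matrix of weights (e.g. the combined empirical
# extremal variogram).
def minimum_spanning_tree(weights):
    d = len(weights)
    in_tree = {0}
    edges = []
    while len(in_tree) < d:
        i, j = min(((u, v) for u in in_tree for v in range(d) if v not in in_tree),
                   key=lambda e: weights[e[0]][e[1]])
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Toy example: the chain 0 - 1 - 2 is cheaper than any edge skipping a node.
W = [[0.0, 1.0, 4.0],
     [1.0, 0.0, 1.5],
     [4.0, 1.5, 0.0]]
mst = minimum_spanning_tree(W)
```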

Left: The squared error Δ̂(q) defined in (23) between empirical extremal correlations and those implied by the tree metric structure for different values of the exceedance probability q=k/n. Right: Extremal correlations for the spot foreign exchange rate data implied by the fitted Hüsler–Reiss tree model against the empirical counterparts.
FIGURE 8

Minimum spanning tree T̂Γ of extremal dependence for the spot foreign exchange rate data based on the combined extremal variogram. The width of each edge (i,j)∈ÊΓ is proportional to the extremal correlation 2 − 2Φ(√(Γ̂ij)/2), and therefore wider edges indicate stronger extremal dependence.
FIGURE 9

The structure of the tree allows for a nice interpretation of extremal dependence. Extreme observations in the exchange rates with the Euro are strongly connected with extremes of other European currencies in Northern and Eastern countries. The graph suggests that extremes of exchange rates of these currencies are conditionally independent of exchange rates of other countries, given the value of the Euro exchange rate. The Malaysian ringgit, the Chinese yuan, the Hong Kong dollar and the Taiwan dollar are strongly pegged to the US dollar, and their closeness in the tree is therefore not surprising. Another branch of the tree contains several currencies of the Commonwealth. Finally, the connection between Japan and Switzerland is plausible because both currencies can be considered safe-haven currencies, which are popular investments in times of crises.

In order to assess the stability of the tree structure, we bootstrap our data B=100 times and fit the tree structure on each bootstrap sample. For generating each bootstrap sample, we draw n data points with replacement from the sample of filtered observations X1,…,Xn. This is a heuristic approach to assess the overall stability of our empirical conclusions under small perturbations of the data and does not have a formal theoretical justification at this point. Figure 10 shows the graph where the width of each edge is proportional to the number of times it has been selected in an extremal tree. Overall, the tree seems to be fairly stable since there is only a small number of dominant edges. Moreover, we can identify clear clusters that are connected in most of the trees, such as the European currencies. On the other hand, some currencies, such as the Russian ruble, do not have a dominant connection to any of these clusters. In future research, it could therefore be interesting to study structure estimation for forests, which allow for unconnected graphs whose connected components are trees (Liu et al., 2011).
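The bootstrap stability check described above can be sketched as follows; `estimate_tree` is a hypothetical placeholder for the full variogram-plus-minimum-spanning-tree pipeline.

```python
import random
from collections import Counter

# Resample the n filtered observations with replacement, re-fit the tree on
# each bootstrap sample and count how often each edge is selected.
def bootstrap_edge_counts(data, estimate_tree, B=100, seed=0):
    rng = random.Random(seed)
    n = len(data)
    counts = Counter()
    for _ in range(B):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        for edge in estimate_tree(sample):
            counts[frozenset(edge)] += 1
    return counts

# Toy run with a dummy estimator that always returns the same tree:
counts = bootstrap_edge_counts(list(range(50)), lambda s: [(0, 1), (1, 2)], B=10)
```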

Graph where the width of each edge is proportional to the number of times it has been selected in an extremal tree in the bootstrap procedure.
FIGURE 10

So far we have not assumed any specific model for the extremal dependence on the edges, since the methods from this paper estimate the tree structure fully non-parametrically. If we were only interested in the interpretation of the extremal graphical structure, we could stop our analysis here. If we require a model for rare event simulation or risk assessment, in a second step we can choose arbitrary bivariate Pareto models for each edge. For simplicity, we choose the Hüsler–Reiss model (see Example 4) for all edges, resulting in a Hüsler–Reiss tree. For this model, the bivariate parameter estimates Γ̂ij can be chosen directly as the empirical extremal variogram estimates for all {i,j} ∈ ÊΓ. Alternatively, we could estimate them by censored maximum likelihood. In both cases, the remaining entries of the Hüsler–Reiss parameter matrix can be obtained from the additivity of the extremal variogram on the tree in (14). We denote the corresponding parameter matrix completed on the tree T̂Γ by Γ̂T̂Γ. Recall the relation between Γ and χ for Hüsler–Reiss models from Example 4. The right panel of Figure 8 shows the extremal correlations implied by the fitted Hüsler–Reiss tree model, that is, χ̂ijT̂Γ = 2 − 2Φ(√(Γ̂ijT̂Γ)/2), against the empirical counterparts χ̂ij, i,j ∈ V. Even though the tree structure is a very sparse graph with only d−1 edges, the extremal dependence between all variables is well explained.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are openly available in extremal_tree_learning at https://github.com/sebastian-engelke/.

ACKNOWLEDGEMENTS

The authors would like to thank Jiaying Gu for pointing us to Hall's marriage theorem, and Johan Segers for pointing us to a mistake in an earlier version of Proposition 5. We further thank Nicola Gnecco, Adrien S. Hitz, Michaël Lalancette and Chen Zhou for helpful comments. We also thank the Associate Editor and two anonymous Referees for comments which helped us to improve the presentation and for encouraging us to consider results in growing dimensions. Sebastian Engelke was supported by an Eccellenza grant of the Swiss National Science Foundation and Stanislav Volgushev was partially supported by a discovery grant from NSERC of Canada.

REFERENCES

Asenova, S., Mazo, G. & Segers, J. (2021) Inference on extremal dependence in the domain of attraction of a structured Hüsler–Reiss distribution motivated by a Markov tree with latent variables. Extremes, 24, 461–500.
Beirlant, J., Goegebeur, Y., Teugels, J. & Segers, J. (2004) Statistics of extremes. Wiley Series in Probability and Statistics. Chichester: John Wiley & Sons, Ltd.
Bollerslev, T., Chou, R.Y. & Kroner, K.F. (1992) ARCH modeling in finance: a review of the theory and empirical evidence. Journal of Econometrics, 52(1), 5–59.
Chilès, J.-P. & Delfiner, P. (2012) Geostatistics: modeling spatial uncertainty. Wiley Series in Probability and Statistics, 2nd edition. Hoboken, NJ: John Wiley & Sons, Inc.
Chow, C. & Liu, C. (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462–467.
Coles, S., Heffernan, J. & Tawn, J. (1999) Dependence measures for extreme value analyses. Extremes, 2, 339–365.
Coles, S.G. & Tawn, J.A. (1991) Modelling extreme multivariate events. Journal of the Royal Statistical Society: Series B (Methodological), 53(2), 377–392.
Cooley, D., Naveau, P. & Poncet, P. (2006) Variograms for spatial max-stable random fields. In: Bertail, P., Soulier, P. & Doukhan, P. (Eds.) Dependence in probability and statistics. Lecture Notes in Statistics, Vol. 187. New York: Springer, pp. 373–390.
Cooley, D. & Thibaud, E. (2019) Decompositions of dependence for high-dimensional extremes. Biometrika, 106(3), 587–604.
Cowell, R.G., Dawid, P., Lauritzen, S.L. & Spiegelhalter, D.J. (2006) Probabilistic networks and expert systems: exact computational methods for Bayesian networks. New York: Springer.
Dawid, A.P. (1979) Conditional independence in statistical theory. Journal of the Royal Statistical Society: Series B (Methodological), 41, 1–31.
de Haan, L. (1984) A spectral representation for max-stable processes. The Annals of Probability, 12, 1194–1204.
de Haan, L. & Ferreira, A. (2006) Extreme value theory. New York: Springer.
Dombry, C., Engelke, S. & Oesting, M. (2016) Exact simulation of max-stable processes. Biometrika, 103, 303–317.
Dombry, C., Eyi-Minko, F. & Ribatet, M. (2013) Conditional simulation of max-stable processes. Biometrika, 100(1), 111–124.
Drton, M. & Maathuis, M.H. (2017) Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4(1), 365–393.
Einmahl, J.H., Krajina, A., Segers, J. et al. (2012) An M-estimator for tail dependence in arbitrary dimensions. The Annals of Statistics, 40(3), 1764–1793.
Einmahl, J.H.J., Kiriliouk, A., Krajina, A. & Segers, J. (2016) An M-estimator of spatial tail dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78, 275–298.
Embrechts, P., Klüppelberg, C. & Mikosch, T. (1997) Modelling extremal events: for insurance and finance. London: Springer.
Engelke, S., de Fondeville, R. & Oesting, M. (2019) Extremal behaviour of aggregated data with an application to downscaling. Biometrika, 106, 127–144.
Engelke, S. & Hitz, A. (2020) Graphical models for extremes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82, 871–932.
Engelke, S., Hitz, S.A. & Gnecco, N. (2019) graphicalExtremes: statistical methodology for graphical extreme value models. R package version 0.1.0. Available from: https://CRAN.R-project.org/package=graphicalExtremes
Engelke, S. & Ivanovs, J. (2021) Sparse structures for multivariate extremes. Annual Review of Statistics and Its Application, 8, 241–270.
Engelke, S., Lalancette, M. & Volgushev, S. (2021) Learning extremal graphical structures in high dimensions. arXiv preprint arXiv:2111.00840.
Engelke, S., Malinowski, A., Kabluchko, Z. & Schlather, M. (2015) Estimation of Hüsler–Reiss distributions and Brown–Resnick processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(1), 239–265.
Engle, R.F. (1982) Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987–1007.
Fomichov, V. & Ivanovs, J. (2022) Spherical clustering in detection of groups of concomitant extremes. Biometrika, asac020. https://doi.org/10.1093/biomet/asac020
Fougères, A.-L., de Haan, L., Mercadier, C. et al. (2015) Bias correction in multivariate extremes. The Annals of Statistics, 43(2), 903–934.
Gissibl, N. & Klüppelberg, C. (2018) Max-linear models on directed acyclic graphs. Bernoulli, 24, 2693–2720.
Hall, P. (1935) On representatives of subsets. The Journal of the London Mathematical Society, 10, 26–30.
Hilal, S., Poon, S.-H. & Tawn, J. (2014) Portfolio risk assessment using multivariate extreme value methods. Extremes, 17, 531–556.
Hu, S., Peng, Z. & Segers, J. (2022) Modelling multivariate extreme value distributions via Markov trees. Available from: https://arxiv.org/abs/2208.02627
Kabluchko, Z., Schlather, M. & de Haan, L. (2009) Stationary max-stable fields associated to negative definite functions. The Annals of Probability, 37, 2042–2065.
Katz, R.W., Parlange, M.B. & Naveau, P. (2002) Statistics of extremes in hydrology. Advances in Water Resources, 25, 1287–1304.
Klüppelberg, C. & Lauritzen, S. (2019) Bayesian networks for max-linear models. Cham: Springer International Publishing, pp. 79–97.
Kruskal, J.B., Jr. (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7, 48–50.
Lafferty, J., Liu, H. & Wasserman, L. (2012) Sparse nonparametric graphical models. Statistical Science, 27, 519–537.
Larsson, M. & Resnick, S.I. (2012) Extremal dependence measure and extremogram: the regularly varying case. Extremes, 15, 231–256.
Lauritzen, S.L. (1996) Graphical models. Oxford: Oxford University Press.
Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. & Wasserman, L. (2011) Forest density estimation. The Journal of Machine Learning Research, 12, 907–951.
Papastathopoulos, I. & Strokorb, K. (2016) Conditional independence among max-stable laws. Statistics & Probability Letters, 108, 9–15.
Poon, S.-H., Rockinger, M. & Tawn, J. (2004) Extreme value dependence in financial markets: diagnostics, models, and financial implications. The Review of Financial Studies, 17, 581–610.
Prim, R.C. (1957) Shortest connection networks and some generalizations. Bell System Technical Journal, 36, 1389–1401.
Resnick, S.I. (2008) Extreme values, regular variation and point processes. New York: Springer.
Rootzén, H., Segers, J. & Wadsworth, J.L. (2018) Multivariate peaks over thresholds models. Extremes, 21(1), 115–145.
Rootzén, H. & Tajvidi, N. (2006) Multivariate generalized Pareto distributions. Bernoulli, 12, 917–930.
Schlather, M. & Tawn, J.A. (2003) A dependence measure for multivariate and spatial extreme values: properties and inference. Biometrika, 90, 139–156.
Segers, J. (2020) One- versus multi-component regular variation and extremes of Markov trees. Advances in Applied Probability, 52(3), 855–878.
Wackernagel, H. (2013) Multivariate geostatistics: an introduction with applications. New York: Springer.

APPENDIX

A.1 Country codes used in the application in Section 6

Table A1 shows the three-letter country codes of the exchange rates into British Pound sterling.

TABLE A1

Three-letter country codes

Code  Foreign exchange rate (into GBP)   Code  Foreign exchange rate (into GBP)
AUS   Australian Dollar                  NOR   Norwegian Krone
CAN   Canadian Dollar                    POL   Polish Zloty
CHN   Chinese Yuan                       RUS   Russian Ruble
CZE   Czech Koruna                       SAU   Saudi Riyal
DNK   Danish Krone                       SGP   Singapore Dollar
EUR   Euro                               ZAF   South African Rand
HKG   Hong Kong Dollar                   KOR   South Korean Won
HUN   Hungarian Forint                   SWE   Swedish Krona
IND   Indian Rupee                       CHE   Swiss Franc
ISR   Israeli Shekel                     TWN   Taiwan Dollar
JPN   Japanese Yen                       THA   Thai Baht
MYS   Malaysian Ringgit                  TUR   Turkish Lira
NZL   New Zealand Dollar                 USA   US Dollar

A.2 Proof of Proposition 4

In order to show that the extremal variogram Γ(m) defines a tree metric on T, we recall the stochastic representation of Ym in Proposition 1. We compute

where for two sets A and B, A △ B denotes the symmetric difference. The second-to-last equality follows from the independence of the {We : e ∈ E}. Moreover, for the last equality we note that for two neighbouring nodes (s,t) ∈ Em in the directed tree Tm, by applying the same argument as above, we have Γst(s) = Var(log Wts) = Γst(m).

A.3 Proof of Corollary 1

We have to show that, for any tree T′=(V,E′) that differs from T in at least one edge, it holds that

∑(i,j)∈E′ Γij(m) − ∑(i,j)∈E Γij(m) > 0.  (A1)

The terms for (i,j) ∈ E ∩ E′ cancel directly between the two sums. For (i,j) ∈ E ∖ E′, the graph (V, E ∖ {(i,j)}) is disconnected with connected components, say, V1, V2 ⊂ V. Since T′ is connected, there must be an h ∈ V1 and an l ∈ V2 such that (h,l) ∈ E′. Since the path ph(hl;T) must contain the edge (i,j) and

Γhl(m) = ∑(s,t)∈ph(hl;T) Γst(m),  (A2)

this means that the first sum in (A1) contains Γij(m) as part of Γhl(m), which cancels the corresponding term in the second sum.

There are the same number of edges in E ∖ E′ as in E′ ∖ E, and every Γhl(m) for (h,l) ∈ E′ ∖ E is the sum of several terms in the decomposition (A2). Therefore, the difference on the left-hand side of (A1) is indeed strictly positive as long as none of the distances vanishes.

A.4 Proof of Corollary 2

For the true edge set E and any competing tree T′=(V,E′) ≠ T we observe

∑(i,j)∈E ρij = ∑m=1,…,d wm ∑(i,j)∈E Γij(m) < ∑m=1,…,d wm ∑(i,j)∈E′ Γij(m) = ∑(i,j)∈E′ ρij,

where the inequality follows from the uniqueness of the minimum spanning tree with weights Γij(m), m ∈ V. It follows that T=(V,E) must be the minimum spanning tree corresponding to the weights ρij = ∑m=1,…,d wmΓij(m).

A.5 Proof of Proposition 5

We begin by proving (15). To this end, note that we can write the extremal correlation χhl in the extremal tree model Y as

From (8) we have that

and therefore, by independence between P and {We : e ∈ ph(hl;Tm)} and since P follows a standard Pareto distribution,

by changing the order of integration. Observe that, for any two positive, independent random variables A and B with 𝔼A, 𝔼B ≤ 1, we have from Jensen's inequality, by concavity of x ↦ min(x,1),

𝔼 min(AB, 1) ≤ min{𝔼 min(A, 1), 𝔼 min(B, 1)}.  (A3)

Recall that we have 𝔼We ≤ 1 for all e ∈ E. Since (i,j) ∈ ph(hl;Tm), we can apply the above successively to obtain

Thus (15) follows.

We show that the minimal spanning tree is unique provided that the inequality in (15) is strict. We have to show that, for any tree T′=(V,E′) that differs from T in at least one edge, it holds that

∑(i,j)∈E′ ρij − ∑(i,j)∈E ρij > 0,  (A4)

where we let ρij = −log(χij) > 0.

We will now compare the summands in the two sums in (A4) in a pairwise fashion. To this end, we will construct a bijective mapping τ: E → E′ such that for any (i,j) ∈ E, the corresponding edge (h,l) = τ{(i,j)} ∈ E′ satisfies (i,j) ∈ ph(hl;T).

Consider the undirected bipartite graph G on the node set E ∪ E′, where (i,j) ∈ E is connected to (h,l) ∈ E′ if and only if (i,j) ∈ ph(hl;T). In this formulation, our goal is to find an E-saturating matching, that is, a matching such that every element of E is assigned one element in E′. A graphical illustration of this idea is provided in Figure A1.

Left and center: two trees T and T′. Right: bipartite graph between elements in E and E′. A link from (i,j)∈E to (h,l)∈E′ means that (i,j)∈ph(hl;T). The blue links indicate one possible matching τ:E→E′ in this case.
FIGURE A1

By Hall's marriage theorem (Hall, 1935), such a matching exists provided that for any subset C ⊂ E, the corresponding neighbourhood n(C) ⊂ E′ of elements in E′ that are connected to at least one of the elements in C satisfies

|n(C)| ≥ |C|.  (A5)

Let e1,…,ep be the edges in C, where p=|C|. Removing these edges from the tree T=(V,E) results in a graph (V, E ∖ C) with p+1 connected components, which we denote by V1,…,Vp+1.

Starting with component V1, we know from the connectedness of the tree T′ that there must be an edge in E′ between at least one of the elements of V1, say h1, and some l1 ∈ Vk1 for some k1 ≠ 1. Since h1 and l1 are in different connected components of (V, E ∖ C), the path ph(h1l1;T) must contain one of the edges in C, and therefore e′1 = (h1,l1) ∈ n(C).

Similarly, there must exist an edge e′2 = (h2,l2) between an element h2 ∈ V1 ∪ Vk1 and some l2 ∈ Vk2, k2 ∉ {1, k1}. This edge is necessarily different from e′1 as it has a node in Vk2, and the path ph(h2l2;T) must contain one of the edges in C because h2 and l2 are in different connected components of (V, E ∖ C). Thus e′2 ∈ n(C).

Continuing this argument inductively, we obtain p different edges in n(C), and therefore condition (A5) holds.
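For illustration, the E-saturating matching whose existence Hall's theorem guarantees can be computed with a standard augmenting-path routine; the following sketch and its toy inputs are ours, not part of the proof.

```python
# Augmenting-path bipartite matching: neighbours[a] lists the right-side
# nodes (edges of E') linked to the left-side node a (an edge of E).
def saturating_matching(neighbours):
    match = {}  # right-side node -> matched left-side node

    def try_assign(a, seen):
        for b in neighbours[a]:
            if b not in seen:
                seen.add(b)
                # take b if free, or re-route its current partner elsewhere
                if b not in match or try_assign(match[b], seen):
                    match[b] = a
                    return True
        return False

    for a in neighbours:
        if not try_assign(a, set()):
            return None  # Hall's condition fails: no saturating matching
    return {a: b for b, a in match.items()}

# Both left nodes can be saturated here; with neighbours {0: [0], 1: [0]}
# Hall's condition fails and the function returns None instead.
tau = saturating_matching({0: [0, 1], 1: [0]})
```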

In order to show inequality (A4), we rewrite the left-hand side as

∑(i,j)∈E ( ρτ{(i,j)} − ρij ).  (A6)

By construction of τ, for (h,l) = τ{(i,j)}, the path ph(hl;T) must contain the edge (i,j) and thus by (15)

ρhl ≥ ρij > 0.  (A7)

This means that all summands in (A6) are non-negative. Recall that we assume in the second part of Proposition 5 that the inequalities (15) are strict for (i,j) ≠ (h,l). Since there is at least one (h,l) ∈ E′ ∖ E, the first inequality in (A7) is strict for this edge and therefore the difference on the left-hand side of (A4) is indeed strictly positive. Thus the proof is complete.

Author notes

Funding information Natural Sciences and Engineering Research Council of Canada, Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung, Grant/Award Number: 161297

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)