Non-Parametric Bayesian Covariate-Dependent Multivariate Functional Clustering: An Application to Time-Series Data for Multiple Air Pollutants

Yang, Daewon; Choi, Taeryon; Lavigne, Eric; Chung, Yeonseung

doi:10.1111/rssc.12589

Abstract

Air pollution is a major threat to public health. Understanding the spatial distribution of air pollution concentration is of great interest to government and local authorities, as it identifies target areas for implementing air quality management policies. Cluster analysis has been widely used to identify groups of locations with similar profiles of average levels of multiple air pollutants, efficiently summarising the spatial pattern. This study aimed to cluster locations based on the seasonal patterns of multiple air pollutants while incorporating location-specific characteristics such as socio-economic indicators. For this purpose, we propose a novel non-parametric Bayesian sparse latent factor model for covariate-dependent multivariate functional clustering. Furthermore, we extend this model to conduct clustering with temporal dependency. The proposed methods are illustrated through a simulation study and applied to time-series data for daily mean concentrations of ozone ($O_{3}$), nitrogen dioxide (${NO}_{2}$), and fine particulate matter (${PM}_{2.5}$) collected for 25 cities in Canada in 1986–2015.

covariate-dependent clustering, Dirichlet process, Indian Buffet process, multivariate functional data, non-parametric Bayesian, temporal dependency

1 INTRODUCTION

There is a growing body of evidence that air pollution is a major threat to public health (WHO Regional Office for Europe, 2013). The adverse health effects of short-term and long-term exposure to different kinds of air pollutants have been well-documented (Héroux et al., 2015). Particulate matter (PM), ozone (⁠ $O_{3}$ ⁠), and nitrogen dioxide (⁠ $N O_{2}$ ⁠) are major pollutants that have aroused significant public health concerns. Understanding the spatial distribution of air pollution levels has been of great interest to the government and local authorities because it informs about target areas for implementing the policies for air quality management. Cluster analysis has been widely used to identify locations with similar average levels of multiple air pollutants, efficiently summarising the spatial distribution (Austin et al., 2013; Coker et al., 2018; Soares et al., 2018).

However, the current practice of cluster analysis for examining the spatial distribution of air pollution is limited in several respects. First, most previous studies conducted cluster analysis based on temporally averaged profiles of multiple pollutants, which ignores the seasonal pattern of air pollution levels and fails to separate locations with different seasonal patterns if they show similar average levels. Second, air pollution levels are often influenced by various local characteristics such as environmental, demographic, and socio-economic conditions, but such location-specific information has not been incorporated into clustering. Third, air pollution levels change over a long-term period, and the spatial pattern of clusters also changes over time; however, to our knowledge, no previous study has investigated the temporally changing pattern in clusters. Considering these limitations, the goal of this study was to propose a methodology to cluster locations based on the seasonal patterns of multiple air pollutant levels by incorporating location-specific characteristics. Furthermore, we aimed to formulate a model to investigate how the spatial pattern of clusters changes over a long-term period with temporal dependency.

Our study was motivated by air pollution data collected in Canada. The data included the daily mean concentrations of $O_{3}$ ⁠, ${NO}_{2}$ ⁠, and ${PM}_{2.5}$ for 25 cities in Canada, for the period of 1987–2012. Figure 1a shows the geographical locations of the 25 cities in Canada with three selected cities, St. John's NFL, Calgary and Hamilton highlighted by green, red, and blue, respectively. Figure 1c–e shows the daily mean concentrations for $O_{3}$ ⁠, ${NO}_{2}$ and ${PM}_{2.5}$ in 2010 for the three selected cities. The annual trajectories of the three pollutant levels show some seasonal patterns, and the seasonal patterns differ across cities and among the pollutants. From these observations, the first aim of our study is to cluster the cities based on the seasonal patterns of $O_{3}$ ⁠, ${NO}_{2}$ and ${PM}_{2.5}$ ⁠.

FIGURE 1

(a) Geographical locations and (b) scatter plot for gross domestic product and unemployment rate of 25 cities in Canada; Daily mean concentration of (c) ozone ($O_{3}$), (d) nitrogen dioxide (${NO}_{2}$) and (e) fine particulate matter (${PM}_{2.5}$) for three selected cities (St. John's NFL, Hamilton and Calgary) in 2010

The second aim is to cluster the seasonal trajectories of the three pollutants by incorporating city-specific geographical and socio-economic indicators. Previous literature has reported that air pollution levels tend to be spatially correlated and largely influenced by socio-economic conditions. We collected the spatial coordinates (i.e., longitude and latitude) and two socio-economic indicators, the gross domestic product (GDP) and the unemployment rate (%Unemp), in 2010 for each city. In Figure 1b, the values for GDP and %Unemp are plotted and quite different among the three highlighted cities. This may suggest that socio-economic conditions underlie the clusters of air pollution trajectories. Hence, by clustering the pollution trajectories together with city-specific features, we may examine how city-specific characteristics affect the annual trajectories of pollutants.

Finally, we aim to cluster the seasonal curves of the pollutants over different years with temporal dependency. Figure 2 shows the daily mean concentrations of $O_{3}$ and ${NO}_{2}$ in 1987, 1997 and 2007 for the three selected cities, Saint John NB, Hamilton and Calgary. The annual trajectory of $O_{3}$ has remained similar over different years, while that of ${NO}_{2}$ has changed over time. For example, the city of Calgary (blue) shows the convex seasonal curve of $O_{3}$ over all 3 years (Figures 2a,c,e), whereas the seasonal curve of ${NO}_{2}$ has changed over the years with a relatively flat pattern in 1987 (Figure 2b) but a U-shaped seasonality in 1997 and 2007 (Figures 2d,f). Therefore, by clustering the seasonal curves over multiple years with temporal dependency, we aim to examine how the clusters have changed smoothly over time.

FIGURE 2

Daily mean concentrations of ozone ($O_{3}$) and nitrogen dioxide (${NO}_{2}$) for three selected cities (St. John's NFL, Hamilton, and Calgary) in (a,b) 1987; (c,d) 1997; and (e,f) 2007, respectively

The statistical methodology of clustering the annual trajectories of air pollutants was proposed in earlier studies, where a K-means or a hierarchical clustering algorithm was applied to the vector of time-series data for an air pollutant level (Gramsch et al., 2006). However, these approaches suffer from the curse of dimensionality and the corresponding computational cost. The functional clustering approach was proposed by Ignaccolo et al. (2008), but this approach was applicable for clustering the trajectories of a single pollutant. Multiple pollutants tend to be correlated, and clusters should be identified based on the co-variation of these pollutants. Therefore, a multivariate functional clustering method is warranted to cluster the seasonal curves of multiple air pollutants.

In the statistical literature on functional data clustering, numerous studies have proposed methods for univariate functional data clustering in either a Bayesian (Heard et al., 2006; Ray & Mallick, 2006; Rodriguez & Dunson, 2014) or a frequentist framework (Abraham et al., 2003; Bouveyron & Jacques, 2011; Jacques & Preda, 2013; James & Sugar, 2003). However, few methodologies have been proposed for multivariate functional clustering. Jacques and Preda (2014) and Schmutz et al. (2020) proposed model-based clustering algorithms for multivariate functional data using functional principal component analysis (FPCA) from a truncated Karhunen–Loève expansion. Alternatively, there are methodologies based on the K-means clustering algorithm (Ieva et al., 2013; Martino et al., 2019; Tokushige et al., 2007). More recently, Bouveyron et al. (2022) proposed a functional latent block model for co-clustering multivariate functional data to study spatio-temporal patterns of air pollution data. However, all of these approaches are still limited for the aims of our study because they neither incorporate covariates nor cluster functional curves observed at multiple time points with temporal dependency.

Our research proposes a Bayesian non-parametric sparse latent factor model for conducting multivariate functional clustering while incorporating covariate information simultaneously. Our model represents the multivariate functional data as a finite-dimensional vector through a basis expansion, and the vector of basis coefficients is combined with covariates. Then, sparse latent factor modelling is applied to the combined vectors of basis coefficients and covariates to obtain a lower-dimensional representation. We follow the spirit of the sparse latent factor model proposed by Montagna et al. (2012) in the functional regression context, and adapt it to propose a methodology for multivariate functional clustering. For sparsity, we assign a spike and slab prior on the factor loadings using the Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2006; Knowles & Ghahramani, 2011). Finally, the factors are modelled using a Dirichlet process (DP) mixture to conduct model-based clustering within a fully Bayesian hierarchical model. Furthermore, we extend this approach to time-varying clustering with temporal dependency using the dynamic hierarchical Dirichlet process (dHDP) (Ren et al., 2008) mixture to model the latent factors.

The remainder of this paper is organised as follows. In Section 2, we describe our proposed methodology. In Section 3, we conduct a simulation study. In Section 4, we apply our proposed models to analyse the air pollution data in Canada. Section 5 includes discussions and concluding remarks.

2 METHODOLOGY

In this section, we propose a non-parametric Bayesian latent factor model for covariate-dependent multivariate functional clustering to cluster the seasonal curves of multiple air pollutants incorporating city-specific covariates. In Section 2.1, we describe the model for clustering the seasonal trajectories observed in a particular year. In Section 2.2, we extend the model to cluster the seasonal curves observed over multiple years with temporal dependency. In Section 2.3, we describe how to conduct the posterior inference.

2.1 Model for covariate-dependent multivariate functional clustering

For $i = 1, \dots, n$ and $j = 1, \dots, J$ ⁠, let $y_{i j} (v)$ be a noisy measurement of a true mean curve $f_{i j} (v)$ ⁠, which represents the seasonal trajectory of the $j$ th pollutant for the $i$ th city. We assume

$$y_{i j} (v) = f_{i j} (v) + ϵ_{i j} (v), \quad ϵ_{i j} (v) \sim N (0, ψ_{j}^{2}), \tag{1}$$

where $ϵ_{i j} (v)$ 's are independent errors following a normal distribution with $j$ -specific variance. The true mean curves are modelled through a basis expansion as follows:

$$f_{i j} (v) = \sum_{l = 1}^{R_{j}} τ_{i j l} b_{j l} (v), \tag{2}$$

where $\{b_{j 1} (v), \dots, b_{j R_{j}} (v)\}$ are the basis functions specific to the $j$th pollutant, with dimension $R_{j}$, and $\{τ_{i j 1}, \dots, τ_{i j R_{j}}\}$ are the basis coefficients for the $j$th pollutant of the $i$th city. Suppose that the functional data are observed at days $v_{i 1}, \dots, v_{i s_{i}}$. Letting $y_{i j} = {(y_{i j} (v_{i 1}), \dots, y_{i j} (v_{i s_{i}}))}^{'}$, the equations in (1) with (2) are expressed as follows:

$$y_{i j} = B_{i j} τ_{i j} + ϵ_{i j}, \quad ϵ_{i j} \sim N_{s_{i}} (0, ψ_{j}^{2} I_{s_{i}}), \tag{3}$$

where $B_{i j}$ is the basis matrix of size $s_{i} \times R_{j}$ whose $c$th row is $(b_{j 1} (v_{i c}), \dots, b_{j R_{j}} (v_{i c}))$, that is, the basis functions for the $j$th pollutant evaluated at day $v_{i c}$, and $τ_{i j} = {(τ_{i j 1}, \dots, τ_{i j R_{j}})}^{'}$ is the vector of the basis coefficients for the $j$th pollutant of the $i$th city. There are various options for the choice of basis. Among others, we use the reparametrised cubic B-spline basis (Kowal et al., 2017; Wand & Ormerod, 2008), which is a class of penalised splines based on the cubic B-spline basis functions. The motivation for this choice is that penalised splines can be represented as a linear mixed model (Crainiceanu et al., 2005), for which Bayesian inference relying on MCMC sampling can be implemented more efficiently with reparametrisation. Details of the reparametrised cubic B-spline are explained in Appendix S1.
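As a rough illustration of how a basis matrix such as $B_{i j}$ can be assembled, the sketch below evaluates a plain cubic B-spline basis with the Cox-de Boor recursion. It is not the reparametrised basis of Appendix S1; the clamped knot vector, the rescaling of days to $[0, 1]$ and the function name `bspline_basis` are assumptions made for this example only.

```python
def bspline_basis(v, knots, degree=3):
    """Evaluate all B-spline basis functions at v via the Cox-de Boor recursion."""
    n_int = len(knots) - 1
    # Index of the last non-empty knot interval; it is closed on the right so
    # that v equal to the final knot is still covered.
    last = max(i for i in range(n_int) if knots[i] < knots[i + 1])
    b = [1.0 if (knots[i] <= v < knots[i + 1]) or (i == last and v == knots[i + 1])
         else 0.0 for i in range(n_int)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0  # terms with a zero denominator are taken as 0
            if knots[i + d] > knots[i]:
                left = (v - knots[i]) / (knots[i + d] - knots[i]) * b[i]
            if knots[i + d + 1] > knots[i + 1]:
                right = ((knots[i + d + 1] - v)
                         / (knots[i + d + 1] - knots[i + 1]) * b[i + 1])
            nxt.append(left + right)
        b = nxt
    return b  # length R = len(knots) - degree - 1


# Hypothetical setting: 3 internal knots on [0, 1], hence R = 7 basis functions.
knots = [0.0] * 4 + [0.25, 0.5, 0.75] + [1.0] * 4
days = [0.0, 0.1, 0.5, 0.9, 1.0]          # observation days rescaled to [0, 1]
B = [bspline_basis(v, knots) for v in days]  # s x R basis matrix, one row per day
```

Each row of `B` sums to one (partition of unity), which is a convenient sanity check before fitting the coefficients.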

In Equation (3), the coefficient vector $τ_{i j}$ is a low-dimensional representation of the functional curve $f_{i j} (v)$. Now, we combine the coefficient vectors $τ_{i j}$ over all $j$'s to facilitate multivariate functional data clustering. Simultaneously, we combine the vector of covariates to allow for covariate-dependent clustering. Let $x_{i} = {(x_{i 1}, \dots, x_{i p})}^{'}$ be the vector of $p$ covariates of the $i$th city. Then, $θ_{i} = {(τ_{i 1}^{'}, \dots, τ_{i J}^{'}, x_{i}^{'})}^{'}$ becomes the combined vector of all basis coefficients and covariates, for which the dimension is $K = \sum_{j = 1}^{J} R_{j} + p$. For flexible modelling of the functional curves, a sufficiently large number of basis functions is chosen, in which case $\sum_{j = 1}^{J} R_{j}$ becomes large. Also, the vector of covariates can be high-dimensional and potentially highly correlated, in which case $p$ is unnecessarily large. Thus, to obtain a low-dimensional representation of the functional curves together with the covariate vector, we assume a sparse latent factor model for $θ_{i}$ to reduce the dimension as follows:

$$θ_{i} = Λ η_{i} + ζ_{i}, \quad ζ_{i} \sim N_{K} (0, Σ), \tag{4}$$

where $Λ$ is a $K \times r$ factor loading matrix with $r \ll K$, $η_{i}$ is the latent factor vector of the $i$th city, and $ζ_{i}$ is a residual vector following a multivariate normal distribution with zero mean vector and diagonal covariance matrix $Σ = \operatorname{diag} (σ_{1}^{2}, \dots, σ_{K}^{2})$. In Equation (4), $η_{i}$ summarises the functional data of multiple air pollutants and the $p$-dimensional covariates of the $i$th city.
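The generative reading of Equation (4) can be sketched as follows. The loading matrix, the residual standard deviations and the function name `simulate_theta` are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import random


def simulate_theta(Lambda, eta, sigma, rng):
    """Draw theta = Lambda @ eta + zeta, with zeta_k ~ N(0, sigma_k^2) independently."""
    return [sum(lam * e for lam, e in zip(row, eta)) + rng.gauss(0.0, s)
            for row, s in zip(Lambda, sigma)]


rng = random.Random(0)
# Hypothetical sparse loading matrix with K = 5 rows and r = 2 factors.
Lambda = [[1.0, 0.0], [0.8, 0.0], [0.0, 1.2], [0.0, -0.5], [0.0, 0.0]]
eta = [rng.gauss(0.0, 1.0) for _ in range(2)]   # latent factor of one city
sigma = [0.1] * 5                                # residual standard deviations
theta = simulate_theta(Lambda, eta, sigma, rng)  # K-dimensional combined vector
```

Rows of `Lambda` that are entirely zero correspond to elements of $θ_{i}$ that are explained by the residual term alone.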

Next, we assume the probability distribution for $η_{i}$ to conduct a model-based clustering for the functional curves jointly with covariates. As $η_{i}$ is a lower-dimensional representation of the functional curves and covariates, clustering based on the means of $η_{i}$ should be more efficient than clustering based on $θ_{i}$ ⁠, which is the original vector of basis coefficients and covariates. Specifically, we assume a DP mixture of normal distributions for $η_{i}$ as follows:

$$\begin{aligned} η_{i} & \sim N_{r} (μ_{i}, I_{r}), \\ μ_{i} & \sim G, \\ G & \sim DP (ν, G_{0}), \end{aligned} \tag{5}$$

where $μ_{i}$ follows a random distribution $G$ ⁠, which is sampled from a DP with a precision parameter $ν$ and a base distribution $G_{0}$ (Ferguson, 1973). The DP mixture model can be expressed using the stick-breaking representation (Sethuraman, 1994) as follows:

$$\begin{aligned} G & = \sum_{l = 1}^{\infty} π_{l} δ_{μ_{l}^{*}}, \\ π_{l} & = p_{l} \prod_{i = 1}^{l - 1} (1 - p_{i}), \quad l = 1, 2, \dots, \\ p_{l} & \sim \mathrm{Beta} (1, ν), \quad l = 1, 2, \dots, \\ μ_{l}^{*} & \sim G_{0}, \quad l = 1, 2, \dots \end{aligned} \tag{6}$$

Note that $G$ is a discrete distribution with countably infinitely many point masses. Because of this discreteness, Equation (5) performs model-based clustering: the $μ_{i}$'s share the values of the $μ_{l}^{*}$'s.
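The stick-breaking weights in Equation (6) can be sketched with a finite number of sticks. The truncation used here, with $p_{M} = 1$ so that the $M$ weights sum to one, follows the truncation approximation described in Section 2.3; the function name and the example values of $ν$ and $M$ are assumptions.

```python
import random


def truncated_stick_breaking(nu, M, rng):
    """Weights pi_1..pi_M with p_l ~ Beta(1, nu) for l < M and p_M = 1."""
    weights, remaining = [], 1.0
    for l in range(M):
        p = 1.0 if l == M - 1 else rng.betavariate(1.0, nu)
        weights.append(p * remaining)   # pi_l = p_l * prod_{i<l} (1 - p_i)
        remaining *= 1.0 - p            # length of the stick still unbroken
    return weights


rng = random.Random(0)
pi = truncated_stick_breaking(nu=1.0, M=20, rng=rng)
```

Smaller values of $ν$ concentrate the mass on the first few weights, which corresponds to fewer occupied clusters a priori.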

In Equation (4), $Λ$ is a factor loading matrix for which sparsity is desired. A “spike and slab” prior was proposed by West (2003) to induce sparsity as follows:

$$λ_{k d} \sim (1 - π_{k d}) δ_{0} (λ_{k d}) + π_{k d} N (λ_{k d}; 0, u_{d}^{2}), \quad k = 1, \dots, K, \quad d = 1, \dots, r, \tag{7}$$

where $λ_{k d}$ is the element in the $k$th row and $d$th column of $Λ$. One drawback of this prior is that $r$, the number of columns in $Λ$, must be specified a priori. To allow for uncertainty in $r$ while keeping the “spike and slab” feature to induce sparsity in the loadings, we use the IBP prior (Griffiths & Ghahramani, 2006; Knowles & Ghahramani, 2011) for the factor loading matrix. Specifically, we let $Λ$ be the element-wise product of a binary matrix $Z$ generated from the IBP and a scale matrix $V$ whose columns are random vectors generated from a multivariate Gaussian distribution, as follows:

$$\begin{aligned} Λ & = Z ⊙ V, \\ Z & \sim IBP (α, β), \\ v_{\cdot d} & \sim N (0, u_{d}^{2} I_{K}), \end{aligned} \tag{8}$$

where $v_{\cdot d}$ is the $d$th column of $V$. In this way, inducing sparsity is straightforward. At the same time, using the IBP prior allows the number of factors to be inferred within a well-defined theoretical framework (Knowles & Ghahramani, 2011) because the binary matrix $Z$ generated from an IBP has infinitely many columns of which only finitely many are non-zero, and the number of non-zero columns of $Z$ corresponds to $r$.
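To make the construction in Equation (8) concrete, the sketch below draws $Z$ from the usual sequential ("restaurant") scheme for the two-parameter IBP, in which the $i$th row takes an existing column with probability $m_{d} / (β + i - 1)$ and opens $\mathrm{Poisson} (α β / (β + i - 1))$ new columns. This sequential scheme, the helper names and the values of $α$, $β$ and $u_{d}$ are assumptions for illustration; the paper's own sampler is described in Appendix S1.

```python
import math
import random


def poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_ibp(K, alpha, beta, rng):
    """Draw a K-row binary matrix from the two-parameter IBP, built row by row."""
    columns = []  # each column is a list of K binary entries
    for i in range(1, K + 1):
        for col in columns:
            m = sum(col)  # number of earlier rows that use this factor
            col[i - 1] = 1 if rng.random() < m / (beta + i - 1) else 0
        for _ in range(poisson(alpha * beta / (beta + i - 1), rng)):
            new = [0] * K
            new[i - 1] = 1
            columns.append(new)
    return [[col[k] for col in columns] for k in range(K)]


rng = random.Random(1)
Z = sample_ibp(K=8, alpha=2.0, beta=1.0, rng=rng)
r = len(Z[0])                    # number of non-zero columns, i.e. factors
u = [1.0] * r                    # hypothetical slab standard deviations u_d
Lambda = [[z * rng.gauss(0.0, u[d]) for d, z in enumerate(row)] for row in Z]
```

The number of columns of `Z` is random, which is how the IBP prior leaves the number of factors $r$ to be inferred rather than fixed in advance.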

So far, we assumed that the covariates $x_{i}$ 's are continuous. However, if the covariates include categorical variables, the direct use of the latent factor model in (4) is not valid. We accommodate categorical covariates using a multinomial probit model. Let $z_{i} = {(z_{i 1}, \dots, z_{i q})}^{'}$ be a vector that represents a categorical covariate with $q$ categories. That is,

$$z_{i h} = \begin{cases} 1 & \text{if the } i\text{th location belongs to the } h\text{th category}, \\ 0 & \text{otherwise}, \end{cases}$$

for $h = 1, \dots, q$ ⁠. Then, we introduce a vector of auxiliary variables as $g_{i} = {(g_{i 1}, \dots, g_{i, q - 1})}^{'}$ such that $g_{i}$ is linked to $z_{i}$ through the probit link as follows (Albert & Chib, 1993; Holmes & Held, 2006; Zhang et al., 2014):

$$z_{i h} = \begin{cases} 1 & \text{if } h = \operatorname{argmax}_{1 \leq e \leq q - 1} \{g_{i e}\} \text{ and } g_{i h} > 0, \\ 0 & \text{if } \max_{1 \leq e \leq q - 1} \{g_{i e}\} \leq 0 . \end{cases}$$

That is, $g_{i}$ is a vector of the continuous auxiliary variables that represent the categorical variable. Then, we stack $τ_{i j}$ 's and $g_{i}$ into $θ_{i} = {(τ_{i 1}^{'}, \dots, τ_{i J}^{'}, g_{i}^{'})}^{'}$ and apply the latent factor modelling to $θ_{i}$ ⁠, as in equation (4).
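The link between the auxiliary vector $g_{i}$ and the indicator vector $z_{i}$ can be sketched as a small function. The sketch assumes the usual multinomial probit convention that the last category $q$ is the reference, selected when no auxiliary value is positive; the function name is hypothetical.

```python
def one_hot_category(g):
    """Map q-1 auxiliary values to a length-q indicator vector."""
    q = len(g) + 1
    z = [0] * q
    best = max(range(q - 1), key=lambda e: g[e])  # index of the largest g_e
    # Category best+1 if its auxiliary value is positive, else reference category q.
    z[best if g[best] > 0 else q - 1] = 1
    return z


z_first = one_hot_category([0.5, -1.0])   # first category is selected
z_ref = one_hot_category([-1.0, -2.0])    # no positive value: reference category
```

In the model the direction is reversed during inference: the observed category constrains the region in which $g_{i}$ may lie, and $g_{i}$ is then treated as a continuous block of $θ_{i}$.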

2.2 Model for covariate-dependent multivariate functional clustering with temporal dependency

We extend the proposed model in Section 2.1 to cluster the functional data observed at multiple time points with temporal dependency. For $i = 1, \dots, n$ ⁠, $j = 1, \dots, J$ ⁠, and $t = 1, \dots, T$ ⁠, let $y_{i j}^{t} (v)$ be the noisy measurement of the true mean curve $f_{i j}^{t} (v)$ ⁠, which represents the seasonal curve of the $j$ th pollutant for the $i$ th city in the $t$ th year. We assume that

$$\begin{aligned} y_{i j}^{t} (v) & = f_{i j}^{t} (v) + ϵ_{i j}^{t} (v), \quad ϵ_{i j}^{t} (v) \sim N (0, ψ_{j}^{2}), \\ f_{i j}^{t} (v) & = \sum_{l = 1}^{R_{j}} τ_{i j l}^{t} b_{j l} (v), \end{aligned} \tag{9}$$

where the $ϵ_{i j}^{t} (v)$'s are independent errors following a normal distribution with $j$-specific variance $ψ_{j}^{2}$, $\{b_{j 1} (v), \dots, b_{j R_{j}} (v)\}$ are the basis functions specific to the $j$th pollutant with dimension $R_{j}$, and $\{τ_{i j 1}^{t}, \dots, τ_{i j R_{j}}^{t}\}$ are the basis coefficients for the $j$th pollutant of the $i$th city in the $t$th year. Suppose that the functional data are observed at days $v = v_{i 1}, \dots, v_{i, s_{i}^{t}}$ for the $i$th city in the $t$th year. Letting $y_{i j}^{t} = {(y_{i j}^{t} (v_{i 1}), \dots, y_{i j}^{t} (v_{i s_{i}^{t}}))}^{'}$, the equations in (9) are expressed as follows:

$$y_{i j}^{t} = B_{i j}^{t} τ_{i j}^{t} + ϵ_{i j}^{t}, \quad ϵ_{i j}^{t} \sim N_{s_{i}^{t}} (0, ψ_{j}^{2} I_{s_{i}^{t}}), \tag{10}$$

where $B_{i j}^{t}$ is the basis matrix with the size of $s_{i}^{t} \times R_{j}$ ⁠, the $c$ th row is $(b_{j 1} (v_{i c}), \dots, b_{j R_{j}} (v_{i c}))$ ⁠, which are the basis functions for the $j$ th pollutant evaluated at day $v_{i c}$ ⁠, and $τ_{i j}^{t} = {(τ_{i j 1}^{t}, \dots, τ_{i j R_{j}}^{t})}^{'}$ is the vector of the basis coefficients for the $j$ th pollutant of the $i$ th city in the $t$ th year.

Letting $x_{i}^{t} = {(x_{i 1}^{t}, \dots, x_{i p}^{t})}^{'}$ be the covariate vector of the $i$th city in the $t$th year, we combine the coefficient vectors $τ_{i j}^{t} = {(τ_{i j 1}^{t}, \dots, τ_{i j R_{j}}^{t})}^{'}$ over all $j$'s and $x_{i}^{t}$ as $θ_{i}^{t} = {(τ_{i 1}^{t '}, \dots, τ_{i J}^{t '}, x_{i}^{t '})}^{'}$, for which the dimension is $K = \sum_{j = 1}^{J} R_{j} + p$. As in Section 2.1, we assume a sparse latent factor model for $θ_{i}^{t}$ to reduce the dimension as follows:

$$θ_{i}^{t} = Λ η_{i}^{t} + ζ_{i}^{t}, \quad ζ_{i}^{t} \sim N_{K} (0, Σ), \tag{11}$$

where $Λ$ is a $K \times r$ factor loading matrix with $r \ll K$, $η_{i}^{t}$ is the latent factor of the $i$th city in the $t$th year, and $ζ_{i}^{t}$ is a residual vector following a multivariate normal distribution with zero mean vector and diagonal covariance matrix $Σ = \operatorname{diag} (σ_{1}^{2}, \dots, σ_{K}^{2})$. As in Section 2.1, we express $Λ$ as the element-wise product of a binary matrix $Z$ and a scale matrix $V$, with an IBP prior on $Z$ and a Gaussian prior on $V$.

For temporally dependent clustering, the DP mixture is inappropriate as it does not incorporate temporal dependency. Instead, we use the dHDP proposed by Ren et al. (2008) as follows:

$$\begin{aligned} η_{i}^{t} & \sim N (μ_{i}^{t}, I_{r}), \\ μ_{i}^{t} \mid G_{t} & \sim G_{t}, \\ G_{t} & = (1 - {\tilde{w}}_{t - 1}) G_{t - 1} + {\tilde{w}}_{t - 1} H_{t - 1}, \\ G_{1} & \sim DP (α_{01}, G_{0}), \quad H_{t - 1} \sim DP (α_{0 t}, G_{0}), \\ G_{0} & \sim DP (γ, H) . \end{aligned} \tag{12}$$

Note that $G_{t}$ is modified from $G_{t - 1}$ by introducing an innovation distribution $H_{t - 1}$, through which temporally proximate data share the same atoms with the potential for innovation. Additionally, the discreteness of $G_{0}$ makes all $G_{t}$'s share the same atoms, so the dHDP encourages the sharing of common atoms more strongly between temporally proximate data than between widely separated data.
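The recursion for $G_{t}$ in Equation (12) can be illustrated on a finite set of shared atoms, where every distribution is just a vector of weights. The innovation weights ${\tilde{w}}_{t - 1}$ are treated as given numbers here, and the three-atom example values are hypothetical.

```python
def dhdp_mixture_path(G1, innovations, w):
    """Return [G_1, ..., G_T] as weight vectors over a shared set of atoms.

    innovations[t-1] holds the weights of H_t and w[t-1] the mixing weight used
    in G_{t+1} = (1 - w_t) * G_t + w_t * H_t.
    """
    path = [list(G1)]
    for H, w_prev in zip(innovations, w):
        prev = path[-1]
        path.append([(1.0 - w_prev) * g + w_prev * h for g, h in zip(prev, H)])
    return path


# Hypothetical example with three shared atoms and T = 3 years.
G1 = [0.7, 0.3, 0.0]
innovations = [[0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
w = [0.2, 0.5]
G = dhdp_mixture_path(G1, innovations, w)
```

A small mixing weight keeps $G_{t}$ close to $G_{t - 1}$, so adjacent years tend to share clusters, while a weight near one lets the innovation dominate.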

2.3 Prior specification and posterior inference

The Bayesian formulation of our model is completed by specifying the priors on the model parameters. First, we specify the priors for the model in Section 2.1. In Equations (1) and (4), we impose inverse-gamma priors on the $ψ_{j}^{2}$'s and $σ_{k}^{2}$'s: $ψ_{j}^{2} \sim IG (a_{ψ}, b_{ψ})$ and $σ_{k}^{2} \sim IG (a_{σ}, b_{σ})$. For the DP parameters in Equations (5) and (6), we specify $G_{0} = N (m, Ω)$ with $Ω = \operatorname{diag} (w_{1}^{2}, \dots, w_{r}^{2})$, and $ν \sim G (a_{ν}, b_{ν})$. In Equation (8), gamma priors are specified on $α$ and $β$ and an inverse-gamma prior on the $u_{d}^{2}$'s: $α \sim G (a_{α}, b_{α})$, $β \sim G (a_{β}, b_{β})$, and $u_{d}^{2} \sim IG (a_{u}, b_{u})$. For the hyperparameters, we assume $m \sim N (0, I_{r})$ and $w_{d}^{2} \sim IG (a_{w}, b_{w})$ for $d = 1, \dots, r$. Next, we specify the priors for the model in Section 2.2. In Equations (9) and (11), we impose inverse-gamma priors on the $ψ_{j}^{2}$'s and $σ_{k}^{2}$'s: $ψ_{j}^{2} \sim IG (a_{ψ}, b_{ψ})$ and $σ_{k}^{2} \sim IG (a_{σ}, b_{σ})$. For the dHDP priors in Equation (12), we assume $H = N (m, Ω)$ with $Ω = \operatorname{diag} (w_{1}^{2}, \dots, w_{r}^{2})$, $γ \sim G (γ; γ_{01}, γ_{02})$, and $α_{0 t} \sim G (α_{0 t}; c_{0}, d_{0})$ for $t = 1, \dots, T$. For the hyperparameters, we again assume $m \sim N (0, I_{r})$ and $w_{d}^{2} \sim IG (a_{w}, b_{w})$ for $d = 1, \dots, r$.

Posterior inference for both models in Sections 2.1 and 2.2 proceeds via Markov chain Monte Carlo (MCMC) sampling algorithms. For the model in Section 2.1, sampling is not straightforward because the parameters in the DP mixture model are infinite-dimensional. We used a truncation approximation based on the stick-breaking representation in Equation (6), which has been shown to approximate the DP mixture model well (Ishwaran & James, 2001). With truncation level $M$, we used only the first $M$ terms of the infinite sum for $G$ in Equation (6), with $p_{l} \sim \mathrm{Beta} (1, ν)$ for $l = 1, \dots, M - 1$ and $p_{M} = 1$. The truncation error bound with truncation level $M$ is calculated as $4 n \exp (- (M - 1) / ν)$, which converges to 0 exponentially as $M$ increases. Additionally, we introduce the latent configuration variables $L = (L_{1}, \dots, L_{n})$, where $L_{i} = j$ means that $μ_{i} = μ_{j}^{*}$ for $i = 1, \dots, n$ and $j = 1, \dots, M$. Posterior inference for the model in Section 2.2 is almost the same as for the model in Section 2.1, except for the dHDP inference. For dHDP inference, Ren et al. (2008) proposed a truncated stick-breaking representation approach with a sufficiently large truncation level $M$. The details of the MCMC sampling procedure for the models in Sections 2.1 and 2.2 are in Appendix S1.
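The truncation error bound $4 n \exp (- (M - 1) / ν)$ is easy to evaluate when choosing $M$. The sketch below computes it and searches for the smallest $M$ that meets a tolerance; the tolerance and the value $ν = 1$ are assumptions for illustration, whereas $n = 25$ matches the number of cities in the data.

```python
import math


def dp_truncation_bound(n, M, nu):
    """Error bound 4 n exp(-(M - 1) / nu) for truncating the DP at level M."""
    return 4.0 * n * math.exp(-(M - 1) / nu)


def smallest_truncation_level(n, nu, tol):
    """Smallest M whose truncation error bound falls below tol."""
    M = 1
    while dp_truncation_bound(n, M, nu) >= tol:
        M += 1
    return M


bound_at_20 = dp_truncation_bound(n=25, M=20, nu=1.0)
M_needed = smallest_truncation_level(n=25, nu=1.0, tol=1e-6)
```

Because the bound grows with $ν$, a posterior that favours large $ν$ (many clusters) calls for a larger truncation level.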

The posterior summary for clustering is not straightforward because each MCMC iteration leads to different clustering. We applied the method proposed by Dahl (2006), which finds the optimum clustering as a clustering structure observed at a particular MCMC iteration that minimises the distance from the pairwise clustering probability matrix. Here, the $(i, j)$ th element is the posterior probability that the $i$ th and $j$ th locations are assigned to the same cluster. To summarise cluster-specific mean trajectories, we first handled the label switching problem via the equivalence class representatives (ECR) algorithm (Papastamoulis & Iliopoulos, 2010; Papastamoulis, 2016). Briefly, the ECR algorithm finds a permutation of the labels to be applied in a given MCMC iteration as the one that reorders the corresponding allocations to become identical to the representative of its class. Once we permute the labels of MCMC samples through the ECR algorithm, we obtain the cluster-specific mean trajectories along with credible intervals using the MCMC samples of the basis coefficients averaged over the cities for each cluster. Posterior summaries for both the clustering results and cluster-specific mean trajectories are obtained after checking the convergence of multiple MCMC chains.
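A minimal sketch of the least-squares selection step of Dahl (2006) is given below: it averages the co-clustering indicator matrices over MCMC draws and picks the draw closest to that average in squared distance. The toy label vectors and function names are hypothetical, and the label-switching step (ECR) is not covered.

```python
def coclustering_matrix(labels):
    """Indicator matrix: entry (i, j) is 1 if items i and j share a cluster."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]


def dahl_least_squares(samples):
    """Return the sampled clustering closest to the pairwise probability matrix."""
    n = len(samples[0])
    mats = [coclustering_matrix(s) for s in samples]
    avg = [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
           for i in range(n)]

    def loss(m):
        return sum((m[i][j] - avg[i][j]) ** 2 for i in range(n) for j in range(n))

    best = min(range(len(samples)), key=lambda t: loss(mats[t]))
    return samples[best], avg


# Hypothetical MCMC draws of cluster labels for four locations.
draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]]
best_clustering, pairwise_prob = dahl_least_squares(draws)
```

The co-clustering matrix is invariant to relabelling, so the first two draws above count as the same partition even though their labels differ.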

3 SIMULATION STUDY

In this section, we conduct simulation studies to evaluate the performance of our proposed models compared with other existing methods. In Section 3.1, we evaluate the methods in covariate-dependent trivariate functional clustering. In Section 3.2, we assess the methods in covariate-dependent time-varying bivariate functional clustering.

3.1 Covariate-dependent trivariate functional clustering

We generate covariate-dependent trivariate functional data as follows. First, we generate continuous covariates. For $i = 1, \dots, n$ and $w = 1, \dots, p$, let $x_{i w}$ denote the $w$th continuous covariate of the $i$th city. We assume that there are three clusters with $n_{k} = n / 3$ cities in the $k$th cluster, $k = 1, 2, 3$. For each $k$, the $x_{i w}$'s are generated as follows:

$$\begin{aligned} x_{i 1} & \sim N (k + 1, σ_{1}^{2}), \quad x_{i 2} \sim N (k I (k = 2), σ_{2}^{2}), \\ x_{i 3} & \sim N (5 - k, σ_{3}^{2}), \quad x_{i 4} \sim N (k + 2, σ_{4}^{2}), \quad x_{i 5} \sim N (5 - k, σ_{5}^{2}), \\ x_{i 6} & \sim N ((k - 1) k I (k = 3) + k I (k = 1), σ_{6}^{2}), \quad x_{i 7} \sim N (k - 2, σ_{7}^{2}), \end{aligned} \tag{13}$$

where $σ_{w}^{2}$ is the variance of the $w$th covariate, specified in a later paragraph. Next, we generate categorical covariates. Let $g_{i l e}$ denote the underlying auxiliary variables for the $l$th categorical covariate ($l = 1, 2, 3$) of the $i$th city. We assume

$$\begin{aligned} g_{i 11} & \sim N (k {(- 1)}^{k - 1}, 1^{2}), \quad g_{i 21} \sim N (k {(- 1)}^{k}, 1^{2}), \quad g_{i 22} \sim N ({(- 1)}^{k - 1}, 1^{2}), \\ g_{i 31} & \sim N (k {(- 1)}^{k}, 1^{2}), \quad g_{i 32} \sim N ({(- 1)}^{k}, 1^{2}) . \end{aligned} \tag{14}$$

Given the $g_{i l e}$'s, the categorical covariates $z_{i l} = {(z_{i l 1}, \dots, z_{i l m_{l}})}^{'}$ are defined by

$$z_{i l h} = \begin{cases} 1 & \text{if } h = \operatorname{argmax}_{1 \leq e \leq m_{l} - 1} \{g_{i l e}\} \text{ and } g_{i l h} > 0, \\ 0 & \text{if } \max_{1 \leq e \leq m_{l} - 1} \{g_{i l e}\} \leq 0 . \end{cases} \tag{15}$$

That is, $z_{i l h} = 1$ means that the $l$ th categorical covariate of the $i$ th city belongs to the $h$ th category, and $m_{l}$ is the number of categories of the $l$ th categorical variable. Here, $m_{1} = 2$ ⁠, $m_{2} = 3$ and $m_{3} = 3$ ⁠.

Given the covariates $x_{i w}$ 's and $g_{i l e}$ 's generated, we generate the trivariate functional data. Let $y_{i j} (v)$ denote the noisy measurement of the functional curve of the $j$ th pollutant for the $i$ th city. We assume

$$\begin{aligned} y_{i 1} (v) & = f_{1} (v) + \frac{g_{i 11}}{10} v + \left(x_{i 1} + \frac{g_{i 22}}{10}\right) h_{1} (v) + \left(x_{i 2} + \frac{g_{i 21}}{10}\right) h_{2} (v) + x_{i 3} h_{3} (v) + ϵ_{i 1} (v), \\ y_{i 2} (v) & = f_{2} (v) + \frac{g_{i 11}}{10} + \left(x_{i 4} + \frac{g_{i 32}}{10}\right) h_{4} (v) + \left(x_{i 5} + \frac{g_{i 31}}{10}\right) h_{5} (v) + ϵ_{i 2} (v), \\ y_{i 3} (v) & = f_{3} (v) + x_{i 6} h_{6} (v) + x_{i 7} h_{7} (v) + ϵ_{i 3} (v), \end{aligned} \tag{16}$$

where $f_{j} (v)$ corresponds to the functional curve of the $j$th pollutant, which is independent of the covariates. We describe the specifications for $f_{j} (v)$ in the next paragraph. The functions linked to the covariates are set as $h_{1} (v) = 2 \log (v + 0.1) + 2$, $h_{2} (v) = 4 \cos (5 v) + 2$, $h_{3} (v) = \exp (v)$, $h_{4} (v) = 5 v^{3} + 1$, $h_{5} (v) = 3 \sin (3 v + 1 / 3) - 1$, $h_{6} (v) = \exp (2 v^{2})$, and $h_{7} (v) = 5 \cos (5 v^{3} + 4.5)$. We added Gaussian noise to each city-specific curve as $ϵ_{i 1} (v) \overset{iid}{\sim} N (0, 3^{2})$, $ϵ_{i 2} (v) \overset{iid}{\sim} N (0, 2^{2})$ and $ϵ_{i 3} (v) \overset{iid}{\sim} N (0, 4^{2})$. We consider equidistant grid points for $v$ as $v = 0, 1 / (s - 1), 2 / (s - 1), \dots, (s - 2) / (s - 1), 1$ for a constant $s$.

In order to evaluate the methods in various situations where the between-cluster separation is weak or strong with respect to either the functional curves or the covariates, we consider four different cases for $f_{j} (v)$ and $σ_{w}^{2}$ ⁠.

• (fWeak-coWeak) Both the functional curves and the covariates are weakly separated: $f_{1} (v) = 4 (k I (k = 2) - I (k \neq 2)) v$, $f_{2} (v) = 3.3 (I (k = 1) - I (k \neq 1)) + 1.1 I (k = 3) v$, $f_{3} (v) = k (3 v + 2.1)$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = σ_{7} = 1$, $σ_{2} = 1.5$, $σ_{6} = 2$
• (fWeak-coStrong) The functional curves are weakly and the covariates are strongly separated: $f_{1} (v) = 2 (k I (k = 2) - I (k \neq 2)) v$, $f_{2} (v) = 2 (I (k = 1) - I (k \neq 1)) + I (k = 3) v$, $f_{3} (v) = k (v + 0.5)$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = σ_{7} = 0.7$, $σ_{2} = 1.3$, $σ_{6} = 1.3$
• (fStrong-coWeak) The functional curves are strongly and the covariates are weakly separated: $f_{1} (v) = (k I (k = 2) - I (k \neq 2)) (4 v + 1)$, $f_{2} (v) = 4 (I (k = 1) - I (k \neq 1)) + 5 I (k = 3) v$, $f_{3} (v) = k (4 v + 2)$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = σ_{7} = 1$, $σ_{2} = 1.5$, $σ_{6} = 2$
• (fStrong-coStrong) Both the functional curves and the covariates are strongly separated: $f_{1} (v) = (k I (k = 2) - I (k \neq 2)) (4 v + 1)$, $f_{2} (v) = 4 (I (k = 1) - I (k \neq 1)) + 5 I (k = 3) v$, $f_{3} (v) = k (2 v + 1.5)$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = σ_{7} = 0.7$, $σ_{2} = 1.3$, $σ_{6} = 1.3$

The functional data and covariates in a sample dataset for each of the four cases are presented in Figures S1–S4. For each case, we generated 50 sets of data. To assess the sensitivity of the clustering results to the sample size, we considered $n = 30, 60$ and $s = 50, 100$ for each of the four cases.
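As an illustration of the data-generating mechanism, the sketch below simulates only the third variate $y_{i 3} (v)$ of equation (16) under the fStrong-coStrong setting, using the means of $x_{i 6}$ and $x_{i 7}$ from equation (13). Assigning the first $n / 3$ cities to cluster 1 and so on, the random seed and the function name are assumptions; the first two variates would follow the same pattern with the categorical auxiliary variables added.

```python
import math
import random


def simulate_third_variate(n, s, rng):
    """Curves y_i3(v) under fStrong-coStrong with three equally sized clusters."""
    grid = [c / (s - 1) for c in range(s)]      # equidistant grid on [0, 1]
    curves, clusters = [], []
    for i in range(n):
        k = i * 3 // n + 1                        # cluster label in {1, 2, 3}
        mean6 = (k - 1) * k * (k == 3) + k * (k == 1)
        x6 = rng.gauss(mean6, 1.3)                # sigma_6 = 1.3
        x7 = rng.gauss(k - 2, 0.7)                # sigma_7 = 0.7
        curve = [k * (2 * v + 1.5)                            # f_3(v)
                 + x6 * math.exp(2 * v ** 2)                  # x_i6 * h_6(v)
                 + x7 * 5 * math.cos(5 * v ** 3 + 4.5)        # x_i7 * h_7(v)
                 + rng.gauss(0.0, 4.0)                        # noise N(0, 4^2)
                 for v in grid]
        curves.append(curve)
        clusters.append(k)
    return grid, curves, clusters


grid, curves, clusters = simulate_third_variate(n=30, s=50, rng=random.Random(2024))
```

Repeating the call with 50 different seeds would give replicate datasets of the kind described above for one sample-size setting.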

We compare the proposed method in Section 2.1 with the following six methods. Hereafter, the proposed model is denoted by 'MFclust1', and the compared methods are described as follows:

(a) K-means: K-means clustering applied to the vector of functional coefficients and covariates
(b) MFclust0: the proposed method without including covariates
(c) GMFD: K-means clustering for multivariate functional data based on the Mahalanobis-type distance (Martino et al., 2019)
(d) Funclust: model-based multivariate functional clustering based on Gaussian mixture modelling of the FPCA scores (Jacques & Preda, 2014)
(e) FunHDDC: model-based multivariate functional clustering based on a functional latent mixture model, which fits the data into group-specific functional subspaces through a multivariate FPCA (Schmutz et al., 2020)
(f) K-means0: K-means clustering applied to the vector of covariates only

Note that the methods in (b)–(e) do not allow for incorporating covariates and the method in (f) uses only the covariates. Implementations of (c)–(e) are done through the R packages, gmfd, Funclustering and funHDDC, respectively. Note that the number of clusters must be specified for the methods in (a), (c), and (f).

For Funclust and FunHDDC, we use B-splines with 30 equidistant knots and follow the default settings of their R packages. For K-means, we use B-splines with 10 equidistant knots. For GMFD, we use the generalised Mahalanobis distance. For MFclust1 and MFclust0, we use the reparametrised cubic B-spline with 20 equidistant internal knots. For MCMC sampling in implementing MFclust1 and MFclust0, we truncate the DP at $M = 20$ in its stick-breaking representation. The hyperparameters are set as $a_{u} = b_{u} = a_{α} = b_{α} = a_{β} = b_{β} = a_{σ} = b_{σ} = a_{w} = b_{w} = a_{ψ} = b_{ψ} = 1$ and $a_{ν} = b_{ν} = 0.5$. We run MCMC for 15,000 iterations, discarding the first 10,000 samples as burn-in. The clustering performance is evaluated by the correct classification rate (CCR), reported as a percentage. The CCR is the proportion of correctly classified subjects in the total sample. As a proportion, it varies between 0 and 1, and the larger the CCR, the better the correspondence between the resulting clusters and the true partition. To deal with the label-switching problem, all possible permutations of cluster labels are considered and the maximum value of the CCR is taken (Jacques & Preda, 2014).
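The permutation-maximised CCR can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `correct_classification_rate` is hypothetical, labels are assumed to be sortable (e.g., integers), and when the estimated and true partitions have different numbers of clusters the smaller label set is padded so that every relabelling is a permutation.

```python
from itertools import permutations


def correct_classification_rate(true_labels, est_labels):
    """Maximum proportion of matching labels over all relabellings of est_labels."""
    if len(true_labels) != len(est_labels):
        raise ValueError("label vectors must have the same length")
    # Recode both label sets as 0, 1, 2, ... so they can index a permutation.
    true_ids = {lab: i for i, lab in enumerate(sorted(set(true_labels)))}
    est_ids = {lab: i for i, lab in enumerate(sorted(set(est_labels)))}
    k = max(len(true_ids), len(est_ids))
    t = [true_ids[x] for x in true_labels]
    e = [est_ids[x] for x in est_labels]
    best = 0
    # Exhaustive search is feasible only for a small number of clusters.
    for perm in permutations(range(k)):
        hits = sum(1 for ti, ei in zip(t, e) if perm[ei] == ti)
        best = max(best, hits)
    return best / len(t)


if __name__ == "__main__":
    truth = [1, 1, 2, 2, 3, 3]
    estimate = [2, 2, 3, 3, 1, 1]  # same partition, different label names
    print(correct_classification_rate(truth, estimate))
```

The exhaustive search costs $k!$ evaluations, which is acceptable for the handful of clusters considered here but would need an assignment solver for many clusters.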

Figure 3 shows the box-plot of the CCR calculated by different methods for each of the four cases when $(n, s) = (60, 100)$. MFclust1 performed best, showing the highest median with relatively low variability of the CCR in all cases. The K-means performed worse than the proposed MFclust1, but its CCR distribution was close to that of MFclust1 when the between-cluster separation was strong in both the functional data and covariates (Figure 3d). MFclust0, GMFD, Funclust and FunHDDC performed worse than MFclust1 because they do not incorporate covariates in clustering. K-means0 performed comparably with the K-means in general, and performed better than MFclust0 and GMFD, showing a higher median with small variability in the CCR when the cluster separation in covariates was strong (Figure 3b,d). These results suggest that the proposed method, MFclust1, improves clustering because it conducts clustering (1) based on a sparse representation of the multivariate functional data and (2) by incorporating the covariates. For other combinations of $(n, s)$, see Figures S5–S7, which confirm that the proposed model performs the most robustly across sample sizes. Figures S8–S11 present the cluster-specific mean curves and credible intervals for each case and $(n, s)$ combination, which show that the proposed models estimate the functional curves well.

FIGURE 3

Clustering results of the simulation study in Section 3.1: correct classification rates (CCR) for different methods for each of the four cases of between-cluster separation; (a) fWeak-coWeak, (b) fWeak-coStrong, (c) fStrong-coWeak, (d) fStrong-coStrong. The sample size combination is $(n, s) = (60, 100)$ ⁠.


Additionally, we investigated how much dimension reduction occurs in applying the proposed MFclust1 by monitoring the estimated number of factors through the MCMC sampling. Figure S12 shows that the estimated number of factors ($r$) is much smaller than the original dimension ($K = 84$) of the joint vector of basis coefficients and covariates. Finally, we conducted a sensitivity analysis to check whether the results are robust to different choices of hyperparameters for the DP and IBP in applying the proposed MFclust1 and MFclust0. Figure S13 indicates that the clustering results are quite robust to varying choices of hyperparameters.

3.2 Covariate-dependent bivariate functional clustering with temporal dependency

We generated bivariate functional data observed sequentially at four time points for 30 locations (i.e., $n^{t} = 30$ for $t = 1, \dots, T$ and $T = 4$). First, we generate continuous covariates. For $t = 1, \dots, T$, $i = 1, \dots, n^{t}$ and $w = 1, \dots, 6$, let $x_{i w}^{t}$ denote the $w$th continuous covariate of the $i$th city at the $t$th time point. We assume that there are five clusters and that there are $n_{k}^{t}$ cities in the $k$th cluster at the $t$th time point, $k = 1, \dots, 5$. For each $k$, the $x_{i w}^{t}$'s are generated as follows:

\begin{align} x_{i 1}^{t} & \sim N (k + 1, σ_{1}^{2}), \quad x_{i 2}^{t} \sim N ((6 - k) I (k \in \{2, 3\}) + (k + 1) I (k \notin \{2, 3\}), σ_{2}^{2}), \\ x_{i 3}^{t} & \sim N (6 - k, σ_{3}^{2}), \quad x_{i 4}^{t} \sim N (6 - k, σ_{4}^{2}), \\ x_{i 5}^{t} & \sim N (k - 2, σ_{5}^{2}), \quad x_{i 6}^{t} \sim N ((6 - k) I (k \in \{2, 3\}) + (k + 1) I (k \notin \{2, 3\}), σ_{6}^{2}), \end{align}

(17)

where $σ_{w}^{2}$ is the variance of the $w$th covariate, which is specified in later paragraphs.
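As a worked illustration of equation (17), the sketch below tabulates the cluster-specific means of the six covariates and draws one covariate vector for a city in cluster $k$. The function names `covariate_means` and `draw_covariates` are hypothetical, and the standard deviations passed in the example are those of the fWeak-coWeak case specified later in this section.

```python
import random


def covariate_means(k):
    """Means of x_1, ..., x_6 for a city in cluster k (k = 1, ..., 5), per equation (17)."""
    # Covariates 2 and 6 share the mean (6 - k) I(k in {2, 3}) + (k + 1) I(k not in {2, 3}).
    shared = (6 - k) if k in (2, 3) else (k + 1)
    return [k + 1, shared, 6 - k, 6 - k, k - 2, shared]


def draw_covariates(k, sigmas, rng):
    """Draw x_w ~ N(mean_w, sigma_w^2) independently for w = 1, ..., 6."""
    return [rng.gauss(m, s) for m, s in zip(covariate_means(k), sigmas)]


if __name__ == "__main__":
    rng = random.Random(0)  # seed chosen only for reproducibility of the sketch
    sigmas = [1.0, 1.5, 1.0, 1.0, 1.0, 1.5]  # fWeak-coWeak standard deviations
    for k in range(1, 6):
        print(k, covariate_means(k), draw_covariates(k, sigmas, rng))
```

Tabulating the means shows, for instance, that clusters 1 and 2 differ by one unit in covariates 1, 3, 4 and 5 and by two units in covariates 2 and 6, which is why the covariate standard deviations control how strongly the clusters are separated.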

Given the generated covariates $x_{i w}^{t}$, we generate bivariate functional data. Let $y_{i j}^{t} (v)$ denote the noisy measurement of the functional curve of the $j$th pollutant for the $i$th city at the $t$th time point. Then, we assume

\begin{align} y_{i 1}^{t} (v) & = f_{1} (v) + x_{i 1}^{t} h_{1} (v) + x_{i 2}^{t} h_{2} (v) + x_{i 3}^{t} h_{3} (v) + ϵ_{i 1}^{t} (v) \\ y_{i 2}^{t} (v) & = f_{2} (v) + x_{i 4}^{t} h_{4} (v) + x_{i 5}^{t} h_{5} (v) + x_{i 6}^{t} h_{6} (v) + ϵ_{i 2}^{t} (v), \end{align}

(18)

where $f_{j} (v)$ corresponds to the functional curve of the $j$th pollutant, which is independent of the covariates. We describe the specifications for $f_{j} (v)$ in later paragraphs. The functions that are linked to the covariates are set as $h_{1} (v) = 2 \log (v + 0.1)$, $h_{2} (v) = \cos (5 v) + 2$, $h_{3} (v) = \exp (v)$, $h_{4} (v) = \tan (v + 1 / 3)$, $h_{5} (v) = 5 \sin (3 v + 0.5) - 2$, $h_{6} (v) = v^{3} - 3 v^{2}$. Finally, we added Gaussian noise to each city-specific curve as $ϵ_{i 1}^{t} (v) \overset{iid}{\sim} N (0, 4^{2})$ and $ϵ_{i 2}^{t} (v) \overset{iid}{\sim} N (0, 4^{2})$. We consider equidistant grid points for $v$ as $v = 0, 1 / (s - 1), 2 / (s - 1), \dots, (s - 2) / (s - 1), 1$ for a constant $s = 100$.
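Equation (18) can be simulated directly once a covariate vector is available. The following Python sketch generates the two noisy curves for a single city at a single time point; the names `h_functions` and `simulate_city` are illustrative, and the baseline curves used in the example ($f_{1} (v) = 3 v + 3$, $f_{2} (v) = 10 v + 3$) are those of the fWeak-coWeak case given later in this section.

```python
import math
import random


def h_functions():
    """Covariate-linked functions h_1, ..., h_6 as specified in the text."""
    return [
        lambda v: 2 * math.log(v + 0.1),
        lambda v: math.cos(5 * v) + 2,
        lambda v: math.exp(v),
        lambda v: math.tan(v + 1 / 3),
        lambda v: 5 * math.sin(3 * v + 0.5) - 2,
        lambda v: v ** 3 - 3 * v ** 2,
    ]


def simulate_city(x, f1, f2, s=100, noise_sd=4.0, rng=None):
    """Return the grid and the curves y_1, y_2 of equation (18) for covariates x."""
    rng = rng or random.Random()
    h = h_functions()
    grid = [i / (s - 1) for i in range(s)]
    # Pollutant 1 uses covariates 1-3; pollutant 2 uses covariates 4-6.
    y1 = [f1(v) + sum(x[w] * h[w](v) for w in range(3)) + rng.gauss(0, noise_sd)
          for v in grid]
    y2 = [f2(v) + sum(x[w] * h[w](v) for w in range(3, 6)) + rng.gauss(0, noise_sd)
          for v in grid]
    return grid, y1, y2


if __name__ == "__main__":
    x = [2.0, 2.0, 5.0, 5.0, -1.0, 2.0]  # illustrative covariate vector
    grid, y1, y2 = simulate_city(x, lambda v: 3 * v + 3, lambda v: 10 * v + 3,
                                 rng=random.Random(0))
    print(len(grid), y1[0], y2[0])
```

Setting `noise_sd=0.0` returns the noise-free curves, which is convenient for checking the construction before adding noise.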

To induce temporal dependency in the clustering pattern, we generate $n_{k}^{t}$ as follows. First, we generate a $5 \times T$ matrix $P$ from a matrix normal distribution, $P \sim M N_{5, T} (J_{5, T}, 3 I_{5}, V)$, where $J_{5, T}$ is a $5 \times T$ matrix of ones, and $V$ is a $T \times T$ matrix. We set $V_{i j} = ρ^{| i - j |}$ with $ρ = 0.8$. Then, we generate $p_{t}^{*}$ from the Dirichlet distribution, $p_{t}^{*} \sim \text{Dirichlet} (P_{\cdot t})$, where $P_{\cdot t}$ is the $t$th column of $P$. Finally, we sample $n_{k}^{t}$ from a multinomial distribution with parameters $p_{t}^{*}$.
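Two pieces of this construction are easy to sketch with the Python standard library: the column covariance $V_{i j} = ρ^{| i - j |}$ and the Dirichlet-multinomial step that turns a parameter vector into cluster sizes. The sketch is not the authors' implementation. The text does not say how non-positive entries of $P$ are handled before the Dirichlet draw, so the code simply takes a positive parameter vector as input and raises an error otherwise; the matrix normal draw itself is omitted, and all function names are hypothetical.

```python
import random


def ar1_covariance(T, rho):
    """T x T matrix with entries rho^|i - j|."""
    return [[rho ** abs(i - j) for j in range(T)] for i in range(T)]


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    if any(a <= 0 for a in alpha):
        raise ValueError("Dirichlet parameters must be positive")
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def sample_multinomial(n, probs, rng):
    """Allocate n cities to clusters with probabilities probs and return the counts."""
    counts = [0] * len(probs)
    for idx in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(1)  # seed chosen only for reproducibility of the sketch
    for row in ar1_covariance(4, 0.8):
        print([round(val, 3) for val in row])
    alpha = [1.0, 2.0, 0.5, 1.5, 1.0]  # stands in for one positive column of P
    p_star = sample_dirichlet(alpha, rng)
    print(sample_multinomial(30, p_star, rng))
```

With $ρ = 0.8$ and $T = 4$, the correlation between the first and last time points is $0.8^{3} = 0.512$, so cluster proportions at distant time points remain moderately correlated.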

In order to evaluate the methods in various situations where the between-cluster separation is weak or strong in either the functional curves or the covariates, we consider four different cases for $f_{j} (v)$ and $σ_{w}^{2}$ for $w = 1, \dots, 6$ as below.

• (fWeak-coWeak) Both the functional curves and covariates are weakly separated: $f_{1} (v) = 3 v + 3$, $f_{2} (v) = 10 v + 3$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = 1$, $σ_{2} = σ_{6} = 1.5$
• (fWeak-coStrong) The functional curves are weakly separated and the covariates are strongly separated: $f_{1} (v) = 3 v + 1.5$, $f_{2} (v) = 10 v + 2$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = 0.7$, $σ_{2} = σ_{6} = 1.2$
• (fStrong-coWeak) The functional curves are strongly separated and the covariates are weakly separated: $f_{1} (v) = 3 v + 4$, $f_{2} (v) = 10 v + 7$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = 1$, $σ_{2} = σ_{6} = 1.5$
• (fStrong-coStrong) Both the functional curves and covariates are strongly separated: $f_{1} (v) = 3 v + 5$, $f_{2} (v) = 10 v + 8$ and $σ_{1} = σ_{3} = σ_{4} = σ_{5} = 0.7$, $σ_{2} = σ_{6} = 1.2$

The functional data and covariates in a sample dataset for each of the four cases are presented in Figures S14–S17. For each case, we generated 50 sets of data.

We compare the proposed method in Section 2.2 with the following eight methods. Hereafter, the proposed model is denoted by 'Temp-MFclust1' and the compared methods are described as follows:

(a) MFclust1: the proposed method in Section 2.1
(b) K-means: K-means clustering applied to the joint vector of functional coefficients and covariates
(c) Temp-MFclust0: the proposed method in Section 2.2 without including covariates
(d) MFclust0: the proposed method in Section 2.1 without including covariates
(e) GMFD: K-means clustering algorithm for multivariate functional data based on the Mahalanobis-type distance (Martino et al., 2019)
(f) Funclust: model-based multivariate functional clustering based on Gaussian mixture modelling of the FPCA scores (Jacques & Preda, 2014)
(g) FunHDDC: model-based multivariate functional clustering based on a functional latent mixture model, which fits the data into group-specific functional subspaces through a multivariate FPCA (Schmutz et al., 2020)
(h) K-means0: K-means clustering applied to the vector of covariates only

Note that all methods except for (c) do not assume temporal dependency and the methods in (c)–(g) do not incorporate covariates. The method in (h) uses only the covariates in the clustering. Implementations of (e)–(g) are done through the R packages, gmfd, Funclustering and funHDDC, respectively. Note that the number of clusters must be specified for the methods in (b), (e), and (h) a priori.

For the Funclust and FunHDDC, we use B-spline with 30 equidistant knots and follow the default setting in their R packages. For the K-means, we use B-spline with 10 equidistant knots. For the GMFD, we used the generalised Mahalanobis distance. For the Temp-MFclust1, MFclust1, Temp-MFclust0 and MFclust0, we use the reparametrised cubic B-spline with 20 equidistant internal knots. For MCMC sampling in implementing MFclust1 and MFclust0, we truncate the DP as $M = 20$ in its stick-breaking representation. The hyperparameters are set as $a_{u} = b_{u} = a_{α} = b_{α} = a_{β} = b_{β} = a_{σ} = b_{σ} = a_{w} = b_{w} = a_{ψ} = b_{ψ} = 1$ and $a_{ν} = b_{ν} = 0.5$ ⁠. We run MCMC for 15,000 iterations, discarding the first 10,000 samples for burn-in. For MCMC sampling in implementing the Temp-MFclust1 and Temp-MFclust0, we truncate the dHDP as $M = 20$ in its stick-breaking representation and the hyperparameters are set as $a_{u} = b_{u} = a_{α} = b_{α} = a_{β} = b_{β} = a_{σ} = b_{σ} = a_{w} = b_{w} = a_{ψ} = b_{ψ} = 1$ and $a_{0} = b_{0} = c_{0} = d_{0} = γ_{01} = γ_{02} = 1$ ⁠. We run MCMC for 25,000 iterations, discarding the first 20,000 samples for burn-in.

Figure 4 shows a box-plot of the CCR calculated using different methods. The Temp-MFclust1 shows the best performance and MFclust1 is the second best. Additionally, though not incorporating the covariates, the Temp-MFclust0 and MFclust0 perform relatively well. The K-means, GMFD, Funclust, FunHDDC, and K-means0 perform relatively poorly in all cases. These results suggest that the proposed method, Temp-MFclust1, improves clustering because it accounts for both temporal dependency and covariates, whereas MFclust1 and the Temp-MFclust0 each account for only one of the two. Figure S18 presents the cluster-specific mean curves and credible intervals for each case, which show that the proposed models estimate the functional curves well.

FIGURE 4

Clustering results of the simulation study in Section 3.2: correct classification rates (CCR) by different methods for each of the four cases of between-cluster separation; (a) fWeak-coWeak; (b) fWeak-coStrong; (c) fStrong-coWeak; (d) fStrong-coStrong. The sample size combination is $(n, s) = (60, 100)$ ⁠.


Additionally, we investigated how much dimension reduction occurs in applying the proposed MFclust1 by monitoring the estimated number of factors through the MCMC sampling. Figure S19 shows that the estimated number of factors ($r$) is much smaller than the original dimension ($K = 54$) of the joint vector of basis coefficients and covariates. Finally, we conducted a sensitivity analysis to check whether the results are robust to different choices of hyperparameters for the DP and IBP in applying the proposed MFclust1 and MFclust0. Figure S20 indicates that the clustering results are quite robust to varying choices of hyperparameters.

4 DATA APPLICATION

In this section, we apply the proposed models to the air pollution data described in Section 1. In Section 4.1, we cluster the annual trajectories of the daily mean concentrations of $O_{3}$ ⁠, ${NO}_{2}$ and ${PM}_{2.5}$ in 2010, while incorporating the spatial coordinates and two socio-economic indicators. In Section 4.2, we conduct clustering of the annual trajectories of ${NO}_{2}$ and $O_{3}$ in 1987, 1992, 1997, 2002, 2007 and 2012 with temporal dependency while incorporating the spatial coordinates.

4.1 Clustering the annual trajectories of $O_{3}$ ⁠, ${NO}_{2}$ and ${PM}_{2.5}$ in 2010

We analysed the time-series data for daily mean concentrations of ${NO}_{2}$, $O_{3}$ and ${PM}_{2.5}$ collected from 25 cities in Canada in 2010. We consider the spatial coordinates (i.e., latitude and longitude), GDP and %Unemp as city-specific covariates. We apply the proposed model described in Section 2.1, both with and without city-specific covariates, abbreviated as 'MFclust1' and 'MFclust0', respectively. Before clustering, we standardise each covariate. We use the reparametrised cubic B-spline basis with 20 equidistant internal knots. For posterior sampling, we truncate the DP with $M = 20$, for which the truncation error bound is calculated as $4 n \exp (- (M - 1) / ν) = 5.7 \times 10^{- 7}$ with $n = 25$ and $ν = 1$. We set the hyperparameters as follows: $a_{u} = b_{u} = a_{α} = b_{α} = a_{β} = b_{β} = a_{σ} = b_{σ} = a_{w} = b_{w} = a_{ψ} = b_{ψ} = 1$ and $a_{ν} = b_{ν} = 0.5$. We run the MCMC algorithm for 15,000 iterations and discard the first 10,000 samples as burn-in.
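The truncation error bound quoted above is simple to evaluate numerically, which also makes it easy to see how the bound changes with the truncation level $M$. The short Python sketch below is illustrative only (the function name `dp_truncation_bound` is hypothetical); evaluating the expression directly for $n = 25$, $M = 20$ and $ν = 1$ gives roughly $5.6 \times 10^{- 7}$, close to the value reported above.

```python
import math


def dp_truncation_bound(n, M, nu):
    """Evaluate the bound 4 n exp(-(M - 1) / nu) for a DP truncated at M atoms."""
    return 4 * n * math.exp(-(M - 1) / nu)


bound = dp_truncation_bound(n=25, M=20, nu=1.0)

if __name__ == "__main__":
    print(f"M = 20: {bound:.3e}")
    # The bound decays exponentially in M, so a moderate truncation level suffices.
    for M in (5, 10, 15, 20):
        print(M, f"{dp_truncation_bound(25, M, 1.0):.3e}")
```

Because the bound decays exponentially in $M$ but grows only linearly in $n$, the same truncation level remains adequate for considerably larger samples when $ν$ is of this order.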

Figure 5a–d shows the clustering results from MFclust0. Without incorporating the covariates, two clusters were identified (Table S1 lists the names of cities in each cluster). Cluster 1 (black) includes 23 cities (92%), for which the annual trajectory of $O_{3}$ peaks in April and troughs in October with large amplitude (Figure 5a), while the level of ${NO}_{2}$ peaks in February and troughs in August with small amplitude (Figure 5b). The level of ${PM}_{2.5}$ shows wiggly trajectories with small amplitudes. Cluster 2 (red) includes two cities (8%), Calgary and Edmonton, for which the level of $O_{3}$ peaks in April and troughs in January, whereas the level of ${NO}_{2}$ shows an opposite trajectory with a peak in January and a trough in July. The level of ${PM}_{2.5}$ shows wiggly trajectories. All three trajectories in cluster 2 show larger amplitudes than those of cluster 1.

FIGURE 5

Clustering results of the data application in Section 4.1: (a)–(c) and (e)–(g) cluster-specific mean curves with 95% credible intervals for $O_{3}$, ${NO}_{2}$ and ${PM}_{2.5}$ estimated from MFclust0 and MFclust1; (d), (h) map of clusters obtained from MFclust0 and MFclust1; (i) scatter plot for gross domestic product and unemployment rate with clusters indicated [Colour figure can be viewed at https://dbpia.nl.go.kr]


Figure 5e–i shows the clustering results from MFclust1. By incorporating the spatial coordinates, GDP and %Unemp, three clusters were identified (Table S2 lists the names of cities in each cluster). Compared with MFclust0, MFclust1 grouped cities that are geographically closer and more similar in socio-economic conditions, while still clustering based on the annual trajectories of $O_{3}$, ${NO}_{2}$ and ${PM}_{2.5}$. Cluster 1 (black) includes 17 cities (68%) located in the east, which show low GDP and relatively high levels of %Unemp (Figure 5h,i). Cluster 2 (red) is equivalent to cluster 2 identified by MFclust0, which includes the two cities, Calgary and Edmonton. Incorporating the GDP and %Unemp into clustering revealed that these two cities have a much higher GDP and a lower %Unemp. Finally, cluster 3 (blue) includes six western cities that show low GDP and low %Unemp, and whose seasonal fluctuations in the levels of ${NO}_{2}$ and ${PM}_{2.5}$ are similar to those of cluster 1.

Overall, our results indicate that the cities with higher GDP (Calgary and Edmonton) showed larger amplitudes in the seasonal patterns of $O_{3}$, ${NO}_{2}$ and ${PM}_{2.5}$, and higher levels of ${PM}_{2.5}$ throughout the year. In addition, western cities showed lower levels of $O_{3}$ than eastern cities based on the results of clustering with the spatial coordinates.

4.2 Clustering the annual trajectories of ${NO}_{2}$ and $O_{3}$ in 1987–2012 with temporal dependency

We analyse the time-series data for daily mean concentrations of ${NO}_{2}$ and $O_{3}$ collected for 25 cities in 1987, 1992, 1997, 2002, 2007 and 2012. For city-specific covariates, we include spatial coordinates. We apply the proposed model described in Section 2.2, with and without city-specific coordinates, denoted by 'Temp-MFclust1' and 'Temp-MFclust0', respectively. For the basis, we use a reparametrised cubic B-spline basis with 20 equidistant internal knots. For posterior sampling, we truncate the dHDP with $M = 20$ in the stick-breaking representation of $G_{0}$. The hyperparameters are set as follows: $a_{u} = b_{u} = a_{α} = b_{α} = a_{β} = b_{β} = a_{σ} = b_{σ} = a_{w} = b_{w} = a_{ψ} = b_{ψ} = a_{0} = b_{0} = c_{0} = d_{0} = γ_{01} = γ_{02} = 1$. We run the MCMC algorithm for 25,000 iterations and discard the first 20,000 samples as burn-in.

Since the results from the Temp-MFclust1 and Temp-MFclust0 are similar, we present only the result of the Temp-MFclust1 in Figure 6 (see Figure S22 and Table S3 for the result of Temp-MFclust0). By incorporating the spatial coordinates, five clusters were identified (Table S4 lists the names of cities in each cluster in the years 1987, 1992, 1997, 2002, 2007 and 2012). Figure 6a,b shows the cluster-specific mean curves for ${NO}_{2}$ and $O_{3}$. Figure 6c–h shows how the spatial pattern of the clusters changed over time. In 1987 (Figure 6c), the cities around Toronto belonged to cluster 2 (red) while most of the other cities belonged to cluster 1 (black). However, in 2002 (Figure 6f), none of the cities remained in clusters 1 and 2, as the cities around Toronto moved to cluster 5 (cyan) and other cities moved to clusters 3 (blue) and 4 (green); this pattern remains roughly the same in 2007 and 2012.

FIGURE 6

Clustering results from the Temp-MFclust1 in data application of Section 4.2; (a) and (b) cluster-specific mean curves with credible intervals for $O_{3}$ and ${NO}_{2}$ ⁠, respectively; (c)–(h) map of clusters in 1987, 1992, 1997, 2002, 2007 and 2012, respectively [Colour figure can be viewed at https://dbpia.nl.go.kr]


These results imply that, in earlier years, most of the cities showed lower levels of $O_{3}$ and higher levels of ${NO}_{2}$, similar to the trajectories of clusters 1 and 2. However, the cities around Toronto have changed their seasonal patterns, showing higher levels of $O_{3}$ and lower levels of ${NO}_{2}$. In addition, the two cities, Calgary and Edmonton, changed the seasonal curve of ${NO}_{2}$ dramatically from 1997, showing a U-shaped seasonality with a higher level of ${NO}_{2}$ in winter. Overall, our results indicate that the annual trajectories of $O_{3}$ and ${NO}_{2}$ have changed, showing temporal dependency from 1987 through 2012 in most of the 25 Canadian cities.

5 DISCUSSION AND CONCLUSION

In this paper, we propose a non-parametric Bayesian sparse latent factor model for covariate-dependent multivariate functional clustering and apply it to cluster the seasonal curves of multiple air pollutants incorporating location-specific covariates. We further extend the model to cluster the seasonal curves observed in multiple years with temporal dependency. Simulation studies show that our proposed models perform better than other competing methods. Our data application reveals that the annual trajectories of the levels of $O_{3}$ ⁠, ${NO}_{2}$ and ${PM}_{2.5}$ for 25 cities in Canada are grouped into two or five clusters, the geographical and socio-economic indicators seem to play a role in clustering the seasonal curves of pollutants, and the clustering structure has changed over the decades with temporal dependency.

In functional data clustering, a basis expansion representation of the functional data is a frequent solution for dimension reduction. One of the widely used tools to represent the functional curves through a low-dimensional vector is FPCA (Jacques & Preda, 2013, 2014; Schmutz et al., 2020), where a finite number of basis functions are derived by eigen-decomposition of the empirical covariance function of the observed curves, and each curve is then represented by a vector of eigenscores with respect to the estimated basis. It has been proven that a representation of the functional curve equivalent to the FPCA approach can be obtained if an orthonormal basis (e.g., a Fourier basis) is chosen and a usual multivariate PCA is applied to the corresponding basis coefficients (Ramsay & Silverman, 2005). Our proposed methods rely on the basis expansion with a reparametrised cubic B-spline basis, which does not lead to a formulation equivalent to the FPCA approach. This may be regarded as a drawback of the proposed model from the FPCA point of view. However, by including a sparse latent factor model within the hierarchy, our model leads to a formulation analogous to the FPCA approach, as noted in Montagna et al. (2012). Additionally, orthogonality in the basis enhances the interpretability of the elements in the eigen-decomposition, but this is not a primary concern in our application: our aim is to conduct covariate-dependent functional clustering, and the sparse latent factor model with a basis function approach is adopted as a dimension reduction tool that links the functional data with a potentially high-dimensional covariate vector.

In sparse Bayesian latent factor modelling, priors other than the IBP are available for the factor loading matrix, such as the multiplicative gamma process shrinkage prior (MGP) (Bhattacharya & Dunson, 2011). While the IBP induces sparsity by generating a binary matrix which allows many of the factor loadings to be zero, the MGP shrinks the elements in the loading matrix towards zero as the column index increases. Thus, a loading matrix generated from the MGP has infinitely many nonzero columns, and one needs to truncate the number of columns with an adaptive procedure in the MCMC sampling for practical use. Furthermore, Durante (2017) noted that in using the MGP, the desirable shrinkage property is not guaranteed if the hyperparameters are not carefully chosen depending on the model. Recently, Legramanti et al. (2020) proposed an alternative prior, called a cumulative shrinkage prior (CSP), to overcome this limitation of the MGP. Though the CSP was shown to perform better than the MGP in their simulation study, it was not compared with the IBP. Future research comparing these priors would be warranted to provide guidelines for practical users.

As our proposed models rely on MCMC sampling to approximate the posterior distributions for inference, fitting them takes much longer than fitting the competing models. For example, in Section 3.1, it took 5 min 45 s for the 15,000 MCMC iterations to fit the MFclust1 for the sample size combination of $(n, s) = (60, 100)$. Meanwhile, it took 0.66, 12.01, 0.58, 15.08, and 0.51 s for the competing methods of K-means, GMFD, Funclust, FunHDDC and K-means0, respectively. In Section 3.2, it took 11 min 40 s for the 25,000 iterations to fit the Temp-MFclust1, while 1.49, 1.06, 1.75, 1.81, and 0.51 s were taken for each of the competing methods. These quantities rely on an R (R Core Team, 2020) implementation run on an Intel Core i9-9900KF CPU desktop computer with 64GB of RAM. The increased computing time may be a drawback of our proposed methods, but the additional time may be worth spending to obtain better clustering results.

The proposed models are motivated by the problem of clustering the seasonal curves of multiple air pollutants, but they can be applied to a general setting of multivariate functional clustering when incorporating covariates and/or temporal dependency is desired. Our models do not require the functional data to be completely observed or observed at equally spaced grid points. Because we obtain a lower-dimensional representation of the functional data through basis expansion, our methods can be applied to multivariate functional data that are observed sparsely or at unequally spaced grid points. In addition, our model can accommodate either continuous or categorical covariates, which should be useful in many application areas such as social science, biology, and epidemiology.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions. R and C++ code used in this article is available at the GitHub website (https://github.com/yangdw01/MFclust) and the data used in the application can be requested from the corresponding author ([email protected]).

ACKNOWLEDGEMENTS

This research was supported by the Senior Research grant (2019R1A2C1086194) from the National Research Foundation (NRF) of Korea funded by the Ministry of Science, Information and Communication Technologies, the Basic Science Research grant (2019R1A2C1010018) from the NRF of Korea funded by the Ministry of Education, and the Government-wide R&D Fund project for Infectious Disease (HG18C0025).

REFERENCES

Abraham, C., Cornillon, P.A., Matzner-Lober, E. & Molinari, N. (2003) Unsupervised curve clustering using b-splines. Scandinavian Journal of Statistics, 30, 581–595.

Albert, J.H. & Chib, S. (1993) Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679.

Austin, E., Coull, B.A., Zanobetti, A. & Koutrakis, P. (2013) A framework to spatially cluster air pollution monitoring sites in us based on the pm2.5 composition. Environment International, 59, 244–254.

Bhattacharya, A. & Dunson, D. (2011) Sparse Bayesian infinite factor models. Biometrika, 98, 291–306.

Bouveyron, C. & Jacques, J. (2011) Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification, 5, 281–300.

Bouveyron, C., Jacques, J., Schmutz, A., Simoes, F. & Bottini, S. (2022) Co-clustering of multivariate functional data for the analysis of air pollution in the south of France. The Annals of Applied Statistics, 16(3), 1400–1422.

Coker, E., Liverani, S., Su, J.G. & Molitor, J. (2018) Multi-pollutant modeling through examination of susceptible subpopulations using profile regression. Current Environmental Health Reports, 5, 59–69.

Crainiceanu, A., Ruppert, D. & Wand, M. (2005) Bayesian analysis for penalized spline regression using winbugs. Journal of Statistical Software, 14, 1–24.

Dahl, D.B. (2006) Model-based clustering for expression data via a Dirichlet process mixture model, in Bayesian inference for gene expression and proteomics. Cambridge: Cambridge University Press.

Durante, D. (2017) A note on the multiplicative gamma process. Statistics & Probability Letters, 122, 198–204.

Ferguson, T.S. (1973) A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1, 209–230.

Gramsch, E., Cerecedabalic, F., Oyola, P. & Vonbaer, D. (2006) Examination of pollution trends in Santiago de Chile with cluster analysis of pm10 and ozone data. Atmospheric Environment, 40, 5464–5475.

Griffiths, L. & Ghahramani, Z. (2006) Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems, 18, 475–482.

Heard

,

N.A.

,

Holmes

,

C.C.

&

Stephens

,

D.A.

(

2006

)

A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: an application of Bayesian hierarchical clustering of curves

.

Journal of the American Statistical Association

,

101

,

18

–

29

.

Google Scholar

Crossref

WorldCat

Héroux

,

M.E.

,

Anderson

,

H.R.

,

Atkinson

,

R.

,

Brunekreef

,

B.

,

Cohen

,

A.

,

Forastiere

,

F.

et al. (

2015

)

Quantifying the health impacts of ambient air pollutants: recommendations of a WHO Europe project

.

International Journal of Public Health

,

60

,

619

–

627

.

Holmes

,

C.C.

&

Held

,

L.

(

2006

)

Bayesian auxiliary variable models for binary and multinomial regression

.

Bayesian Analysis

,

1

,

145

–

168

.

Google Scholar

OpenURL Placeholder Text

WorldCat

Ieva

,

F.

,

Paganoni

,

A.M.

,

Pigoli

,

D.

&

Vitelli

,

V.

(

2013

)

Multivariate functional clustering for the morphological analysis of electrocardiograph curves

.

Journal of the Royal Statistical Society, Series C

,

62

,

401

–

418

.

Google Scholar

Crossref

WorldCat

Ignaccolo

,

R.

,

Ghigo

,

S.

&

Giovenali

,

E.

(

2008

)

Analysis of air quality monitoring networks by functional clustering

.

Environmetrics

,

19

,

672

–

686

.

Google Scholar

Crossref

WorldCat

Ishwaran

,

H.

&

James

,

L.F.

(

2001

)

Gibbs sampling methods for stick-breaking priors

.

Journal of the American Statistical Association

,

96

,

161

–

173

.

Google Scholar

Crossref

WorldCat

Jacques

,

J.

&

Preda

,

C.

(

2013

)

Funclust: A curves clustering method using functional random variables density approximation

.

Neurocomputing

,

112

,

164

–

171

.

Google Scholar

Crossref

WorldCat

Jacques, J. & Preda, C. (2014) Model-based clustering for multivariate functional data. Computational Statistics and Data Analysis, 71, 92–106.

James, G.M. & Sugar, C.A. (2003) Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98, 397–408.

Knowles, D. & Ghahramani, Z. (2011) Nonparametric Bayesian sparse factor models with application to gene expression modeling. The Annals of Applied Statistics, 5, 1534–1552.

Kowal, D.R., Matteson, D.S. & Ruppert, D. (2017) A Bayesian multivariate functional dynamic linear model. Journal of the American Statistical Association, 112, 733–744.

Legramanti, S., Durante, D. & Dunson, D. (2020) Bayesian cumulative shrinkage for infinite factorizations. Biometrika, 107, 745–752.

Martino, A., Ghiglietti, A., Ieva, F. & Paganoni, A.M. (2019) A k-means procedure based on a Mahalanobis type distance for clustering multivariate functional data. Statistical Methods and Applications, 28, 301–322.

Montagna, S., Tokdar, S.T., Neelon, B. & Dunson, D. (2012) Bayesian latent factor regression for functional and longitudinal data. Biometrics, 68, 1064–1073.

Papastamoulis, P. (2016) label.switching: An R package for dealing with the label switching problem in MCMC outputs. Journal of Statistical Software, 69, 1–24.

Papastamoulis, P. & Iliopoulos, G. (2010) An artificial allocations based solution to the label switching problem in Bayesian analysis of mixtures of distributions. Journal of Computational and Graphical Statistics, 19, 313–331.

Ramsay, J.O. & Silverman, B.W. (2005) Functional data analysis. New York: Springer Science and Business Media.

R Core Team (2020) R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Ray, S. & Mallick, B. (2006) Functional clustering by Bayesian wavelet methods. Journal of the Royal Statistical Society: Series B, 68, 305–332.

Ren, L., Carin, L. & Dunson, D. (2008) The dynamic hierarchical Dirichlet process. In: Proceedings of the twenty-fifth international conference on machine learning, Helsinki.

Rodriguez, A. & Dunson, D.B. (2014) Functional clustering in nested designs: modeling variability in reproductive epidemiology studies. The Annals of Applied Statistics, 8, 1416–1442.

Schmutz, A., Jacques, J., Bouveyron, C., Cheze, L. & Martin, P. (2020) Clustering multivariate functional data in group-specific functional subspaces. Computational Statistics, 35, 1101–1131.

Sethuraman, J. (1994) A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.

Soares, J., Makar, P.A., Aklilu, Y. & Akingunola, A. (2018) The use of hierarchical clustering for the design of optimized monitoring networks. Atmospheric Chemistry and Physics, 18, 6543–6566.

Tokushige, S., Yadohisa, H. & Inada, K. (2007) Crisp and fuzzy k-means clustering algorithms for multivariate functional data. Computational Statistics, 22, 1–16.

Wand, M.P. & Ormerod, J.T. (2008) On semiparametric regression with O'Sullivan penalized splines. Australian & New Zealand Journal of Statistics, 50, 179–198.

West, M. (2003) Bayesian factor regression models in the “Large p, Small n” paradigm. In: Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M. et al. (Eds.) Bayesian statistics. Oxford: Oxford University Press, pp. 723–732.

WHO Regional Office for Europe. (2013) Review of evidence on health aspects of air pollution. REVIHAAP project: technical report, WHO/EURO: 2013-4101-43860-61757.

Zhang, Z., Wang, D., Dai, G. & Jordan, M.I. (2014) Matrix-variate Dirichlet process priors with applications. Bayesian Analysis, 9, 259–286.

Author notes

Funding information: Government-wide R&D Fund project for Infectious Disease Research, HG18C0025; National Research Foundation of Korea, 2019R1A2C1086194 and 2019R1A2C1010018.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://dbpia.nl.go.kr/journals/pages/open_access/funder_policies/chorus/standard_publication_model)

