Xiongtao Dai, Zhenhua Lin, Hans-Georg Müller, Modeling Sparse Longitudinal Data on Riemannian Manifolds, Biometrics, Volume 77, Issue 4, December 2021, Pages 1328–1341, https://doi.org/10.1111/biom.13385
Abstract
Modern data collection often entails longitudinal repeated measurements that assume values on a Riemannian manifold. Analyzing such longitudinal Riemannian data is challenging, because of both the sparsity of the observations and the nonlinear manifold constraint. Addressing this challenge, we propose an intrinsic functional principal component analysis for longitudinal Riemannian data. Information is pooled across subjects by estimating the mean curve with local Fréchet regression and smoothing the covariance structure of the linearized data on tangent spaces around the mean. Dimension reduction and imputation of the manifold-valued trajectories are achieved by utilizing the leading principal components and applying best linear unbiased prediction. We show that the proposed mean and covariance function estimates achieve state-of-the-art convergence rates. For illustration, we study the development of brain connectivity in a longitudinal cohort of Alzheimer's disease and normal participants by modeling the connectivity on the manifold of symmetric positive definite matrices with the affine-invariant metric. In a second illustration for irregularly recorded longitudinal emotion compositional data for unemployed workers, we show that the proposed method leads to nicely interpretable eigenfunctions and principal component scores. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative database.
1 Introduction
Functional data that assume values in a Euclidean space are typically considered as random elements of a Hilbert space (Horváth and Kokoszka, 2012; Hsing and Eubank, 2015; Wang et al., 2016). Specifically applicable in such linear spaces with their flat Euclidean geometry are key techniques such as functional principal component analysis (Chen and Lei, 2015) and functional regression (Kong et al., 2016). However, only recently has the analysis of nonlinear functional data been considered. In one strand of previous work, the entire functional trajectory is assumed to reside on a nonlinear manifold (Chen and Müller, 2012), while in the other, the response values of the functional or longitudinal data reside on a manifold (Yuan et al., 2012; Lin et al., 2017). The latter setting, where data objects are sparsely or densely observed random Riemannian manifold-valued functions, is increasingly encountered in practice. Examples where such data are encountered include neuroimaging (Cornea et al., 2017) and human kinetics studies (Telschow et al., 2019).
Our work is motivated by data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), where one aims to characterize and assess the progression of Alzheimer's disease through collecting longitudinal neuroimaging measures, such as functional magnetic resonance imaging (fMRI) scans. Functional connectivity obtained from resting-state fMRI (rs-fMRI) is known to be altered for Alzheimer's patients (Badhwar et al., 2017) and can be represented by the correlation matrix of brain region activation, which we model on the constrained manifold of symmetric positive-definite matrices. In spite of the relevance for many applied studies of functional connectivity (Ginestet et al., 2017), there is no firm statistical foundation to date for the study of the time-dynamics of data such as longitudinally observed correlation matrices. A second motivating example concerns longitudinal mood assessment of unemployed workers, where in each longitudinal survey, the participants reported the fraction of time during which they experienced four different moods ranging from bad, low, mild, to good. Such longitudinal compositional data consisting of repeated longitudinal observations of compositions that add up to one are encountered in many important applications, eg, repeated voting, consumer preference tracing, soil or air composition over time, and microbiome dynamics.
It has been assumed in virtually all previous studies of manifold-valued functional data that we are aware of (eg, Anirudh et al., 2017; Dai and Müller, 2018; Lin and Yao, 2019) that the recorded data consist of densely or continuously observed functions. A typical example is time-varying network data, where networks with a fixed number of nodes can be characterized by their graph Laplacians, which are symmetric positive semi-definite matrices; with this representation, networks can be viewed as time-varying metric space-valued objects, a connection that has been made in Dubey and Müller (2020). This connection implies that the manifold methodology of Dai and Müller (2018) and Lin and Yao (2019) is directly applicable to fully observed network trajectories. The Riemannian Functional Principal Component Analysis through Conditional Expectation (RPACE) approach introduced here then provides a tool to extend these previous results to longitudinal networks, which are irregularly and sparsely observed in time. In general, for most longitudinal biomedical and social studies, data are not continuously observed in time, and more often than not the time points at which measurements are available are irregular and sparse. Nevertheless, one still may assume that the data for each subject result from the realization of an underlying unobserved continuous-time stochastic process that takes values on a Riemannian manifold.
The current lack of suitable methodology for longitudinal manifold-valued data thus provides strong motivation for the development of suitable methods, including a version of principal component analysis, for sparsely observed and potentially noise-contaminated Riemannian longitudinal data. The principal components can subsequently be used to obtain best fits for subject-specific trajectories. Methodology and theory development that is necessary to achieve this goal is challenging, due to the combination of irregularity and scarcity of the repeated measurements that are available per subject on the one hand and the inherent nonlinearity of manifold-valued data on the other hand. Specifically, this development is a highly nontrivial extension of the methodology that is available for longitudinal Euclidean functional data (Yao et al., 2005; Li and Hsing, 2010; Zhang and Wang, 2016).
Functional principal component analysis has emerged as an important tool for analyzing infinite-dimensional functional data. For densely observed manifold-valued functional data, Dai and Müller (2018) studied an intrinsic Riemannian functional principal component analysis that utilizes the Fréchet mean curve defined at each observation time and Riemannian logarithm maps, which map the manifold-valued observations at each fixed time to a tangent space centered at the Fréchet mean of the observations at that time, and demonstrated its advantages over extrinsic approaches that ignore the manifold structure. While this approach utilized the ambient space in which the manifold is embedded, Lin and Yao (2019) subsequently extended the theory and developed a fully intrinsic approach. In such intrinsic approaches, Fréchet means (Fréchet, 1948) are used to extend the classical Euclidean mean to data on Riemannian manifolds (Patrangenaru et al., 2018).
For sparsely observed manifold-valued longitudinal data, the Riemannian functional principal components approach (Dai and Müller, 2018) that was specifically designed for dense observations is inapplicable. In this work, we instead propose a Riemannian version of the PACE approach (Yao et al., 2005), which aims to pool observations and borrow information across all subjects. For this, we obtain the Fréchet mean curve for longitudinal Riemannian data under a local Fréchet regression framework (Petersen and Müller, 2019), extending both the local linear smoothing paradigm for Euclidean data (Fan and Gijbels, 1996) and for independent Riemannian data (Yuan et al., 2012).
The main innovation and contributions of this paper are as follows: First, we develop a functional principal component analysis for sparsely observed Riemannian longitudinal data. The proposed methods can also handle densely observed Riemannian functional data contaminated with measurement errors, which existing methodology cannot. Second, a local polynomial estimate for the mean function of Riemannian longitudinal data is proposed, based on pooling the data across subjects and taking into account the dependency of the repeated measurements within subjects. Third, uniform rates of convergence are derived for the mean and covariance functions for sparse and dense Riemannian functional data under a unified framework, extending previous results for the Euclidean case (Li and Hsing, 2010; Zhang and Wang, 2016) by adopting an M-estimation framework. Fourth, we demonstrate the utility of our methods for longitudinal neuroimaging and social sciences data, as well as in simulation results that are included in the Supporting Information. Finally, an R implementation RFPCA is available on GitHub.
2 Methodology
2.1 Preliminaries for Manifolds
Let $\mathcal{M}$ be a d-dimensional smooth, connected, and geodesically complete Riemannian manifold isometrically embedded in an ambient space $\mathbb{R}^{D}$, where the positive integers $d \le D$ are the intrinsic and ambient dimensions, respectively. The tangent space $T_p\mathcal{M}$ at $p \in \mathcal{M}$ is a d-dimensional vector space consisting of all velocity vectors $v = \gamma'(0)$, where $\gamma$ is a differentiable curve on $\mathcal{M}$ with $\gamma(0) = p$ defined in a vicinity of 0. The tangent space is endowed with the Riemannian metric $\langle \cdot, \cdot \rangle_p$ induced from the inner product in the ambient Euclidean space, defined by $\langle u, v \rangle_p = u^\top v$ for $u, v \in T_p\mathcal{M}$. The geodesic distance $d(p, q)$ between $p, q \in \mathcal{M}$ is the infimum of the length taken over all piecewise differentiable curves on $\mathcal{M}$ joining p and q. A geodesic $\gamma(t)$ is a constant-speed curve on the manifold that is characterized by having a vanishing projection of its acceleration $\gamma''(t)$ onto $T_{\gamma(t)}\mathcal{M}$, for all t.
The Riemannian exponential map is defined as $\mathrm{Exp}_p(v) = \gamma_v(1)$, where $\gamma_v$ is a geodesic with initial location $\gamma_v(0) = p$ and initial velocity $\gamma_v'(0) = v \in T_p\mathcal{M}$, and the Riemannian logarithm map $\mathrm{Log}_p$ is the inverse of $\mathrm{Exp}_p$, assuming that it is well defined; see the left panel of Figure 1. We view $\mathcal{M}$ as a smooth submanifold of the ambient Euclidean space $\mathbb{R}^{D}$. Since an isometric embedding always exists for a suitable D due to the Nash embedding theorem (Nash, 1956), one does not need to specify an ambient space (Lee, 1997).
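As a concrete illustration of these two maps, the unit sphere embedded in $\mathbb{R}^D$ (which reappears later for compositional data) admits closed-form exponential and logarithm maps. The following sketch, with helper names of our choosing, implements them in plain Python:

```python
import math

def sphere_exp(p, v):
    """Riemannian exponential map on the unit sphere embedded in R^D:
    follow the great circle from p with initial velocity v, where v is
    tangent at p, ie <p, v> = 0."""
    nv = math.sqrt(sum(x * x for x in v))
    if nv < 1e-12:
        return list(p)
    return [math.cos(nv) * a + math.sin(nv) * b / nv for a, b in zip(p, v)]

def sphere_log(p, q):
    """Riemannian logarithm map on the unit sphere: the inverse of sphere_exp,
    well defined whenever q is not the antipodal point of p."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    u = [qi - dot * pi for pi, qi in zip(p, q)]  # project q onto the tangent space at p
    nu = math.sqrt(sum(x * x for x in u))
    if nu < 1e-12:
        return [0.0] * len(p)
    theta = math.acos(dot)  # geodesic distance d(p, q)
    return [theta * ui / nu for ui in u]
```

Note that `sphere_log` returns a vector whose Euclidean norm equals the geodesic distance, matching the general property that $\mathrm{Log}_p$ linearizes a neighborhood of p.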
Figure 1. Left: Illustration of the tangent space $T_{\mu(t)}\mathcal{M}$ at the mean $\mu(t)$; the value $X(t)$ of a process on the manifold $\mathcal{M}$ at observation time t; the process $L(t)$ that is obtained after applying the logarithm map and lies on the tangent space; and the actually observed value $Y$ with noise and its version $\mathrm{Log}_{\mu(t)}(Y)$ on the tangent space after applying the logarithm map, illustrating the observation model (5). Specifically, $Y = \mathrm{Exp}_{X(t)}(\varepsilon)$ for some random noise $\varepsilon$, and $L(t) = \mathrm{Log}_{\mu(t)}\{X(t)\}$. The dashed ellipse in the tangent space $T_{\mu(t)}\mathcal{M}$ represents the boundary of the domain $\mathcal{D}_{\mu(t)}$. Right: At each time t, the value of the eigenfunction $\phi_k(t)$, denoted by arrows, lies on the tangent space $T_{\mu(t)}\mathcal{M}$ at the mean $\mu(t)$, a point on the central black curve. The values of the eigenfunction at the first and last time points are bolded and are shown to lie on different tangent spaces. The solid and dotted blue curves lie on the manifold and represent the modes of variation of X along the direction of $\phi_k$, the kth eigenfunction, which are defined by $\mathrm{Exp}_{\mu(t)}\{\sqrt{\lambda_k}\,\phi_k(t)\}$ and $\mathrm{Exp}_{\mu(t)}\{-\sqrt{\lambda_k}\,\phi_k(t)\}$, $t \in \mathcal{T}$, respectively. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
2.2 Statistical Model
We define an $\mathcal{M}$-valued Riemannian random process $X$ as a D-dimensional vector-valued random process indexed by a compact interval $\mathcal{T}$ such that $X(t) \in \mathcal{M}$ for all $t \in \mathcal{T}$, and assume that the process X is of second-order, ie, for every $t \in \mathcal{T}$, there exists $p \in \mathcal{M}$ such that the Fréchet functional $M(p, t) = E\{d^2(X(t), p)\}$ is finite, where d is the above-defined geodesic distance. Defining the Fréchet mean function $\mu: \mathcal{T} \to \mathcal{M}$ as

$$\mu(t) = \operatorname*{arg\,min}_{p \in \mathcal{M}} E\{d^2(X(t), p)\},$$

the following assumption is required to ensure the existence and uniqueness of the Fréchet mean curve:
- (X0)
X is of second-order, and the Fréchet mean curve $\mu(t)$ exists and is unique for every $t \in \mathcal{T}$.
Assumption (X0) holds, for example, if the range of the process X is contained in a geodesic ball of sufficiently small radius (Afsari, 2011). Since it is assumed that $\mathcal{M}$ is geodesically complete, by the Hopf–Rinow theorem, its exponential map $\mathrm{Exp}_p$ at each p is defined on the entire tangent space $T_p\mathcal{M}$ at $p \in \mathcal{M}$. We define the domain $\mathcal{D}_p \subset T_p\mathcal{M}$ to be the interior of the collection of tangent vectors $v$ such that if $\gamma_v$ is a geodesic emanating from p with the direction v, then $\gamma_v([0, 1])$ is a minimizing geodesic. On the domain $\mathcal{D}_p$, the map $\mathrm{Exp}_p$ is injective with image $\mathrm{Im}(\mathrm{Exp}_p)$. The Riemannian logarithm map at p, denoted by $\mathrm{Log}_p$, is the inverse of $\mathrm{Exp}_p$, restricted to $\mathrm{Im}(\mathrm{Exp}_p)$; if $q = \mathrm{Exp}_p(v)$ for some $v \in \mathcal{D}_p$, then $\mathrm{Log}_p(q) = v$. The logarithm map $\mathrm{Log}_p$ effectively provides a local linear approximation of a neighborhood of p, mapping it onto the tangent space $T_p\mathcal{M}$.
To study the covariance structure of the random process X on tangent spaces, we require
- (X1)
For some constant $\epsilon > 0$, $X(t) \in \mathrm{Exp}_{\mu(t)}\{\mathcal{D}_{\mu(t), \epsilon}\}$ almost surely for all $t \in \mathcal{T}$, where $\mathcal{D}_{p, \epsilon}$ denotes the set $\{v \in \mathcal{D}_p : \inf_{u \in \partial\mathcal{D}_p} \|u - v\| \ge \epsilon\}$.

This condition requires $X(t)$ to stay away from the cut locus (Lee, 1997) of $\mu(t)$ uniformly for $t \in \mathcal{T}$, so that the logarithm map $\mathrm{Log}_{\mu(t)}$ is well defined for all t. It is not needed if $\mathrm{Exp}_{\mu(t)}$ is injective on $T_{\mu(t)}\mathcal{M}$ for all t. In the special case of a d-dimensional unit sphere, which we deal with for longitudinal compositional data, (X1) holds if $\mu$ is continuous and the distribution of $X(t)$ vanishes on an open set with positive volume that contains the antipodal point $-\mu(t)$. Under (X0) and (X1), the $T_{\mu(t)}\mathcal{M}$-valued logarithm process $L(t) = \mathrm{Log}_{\mu(t)}\{X(t)\}$ is well defined for all $t \in \mathcal{T}$.
An important observation (Bhattacharya and Patrangenaru, 2003) is that $E\{L(t)\} = 0$, and furthermore, that $E\|L(t)\|^2 < \infty$ for every $t \in \mathcal{T}$, where $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^D$, which makes it possible to define the covariance function of L by

$$\Gamma(s, t) = E\{L(s) L(t)^\top\}, \quad s, t \in \mathcal{T}.$$

This covariance function and its covariance operator admit the eigendecomposition

$$\Gamma(s, t) = \sum_{k=1}^{\infty} \lambda_k \phi_k(s) \phi_k(t)^\top,$$

with orthonormal eigenfunctions $\phi_k: \mathcal{T} \to \mathbb{R}^D$ and eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$, leading to the Karhunen–Loève representation (Grenander, 1950; Kleffe, 1973)

$$L(t) = \sum_{k=1}^{\infty} \xi_k \phi_k(t), \quad \xi_k = \int_{\mathcal{T}} L(t)^\top \phi_k(t)\, dt, \qquad (3)$$

where the $\xi_k$ are the uncorrelated Riemannian functional principal component scores with $E(\xi_k) = 0$ and $E(\xi_k^2) = \lambda_k$. We will utilize finitely truncated versions

$$X_K(t) = \mathrm{Exp}_{\mu(t)}\Bigl\{\sum_{k=1}^{K} \xi_k \phi_k(t)\Bigr\} \qquad (4)$$

for a given integer $K \ge 1$. The eigenfunctions $\phi_k$, defined on the tangent spaces, represent the linearized modes of variation of X; see the right panel of Figure 1.
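To make the truncated representation concrete, the following schematic sketch evaluates a $K = 1$ truncation on the unit sphere. The mean curve, eigenfunction, and score below are hypothetical choices for illustration, not quantities from the paper's data analyses:

```python
import math

def sphere_exp(p, v):
    # Riemannian exponential map on the unit sphere embedded in R^3.
    nv = math.sqrt(sum(x * x for x in v))
    if nv < 1e-12:
        return list(p)
    return [math.cos(nv) * a + math.sin(nv) * b / nv for a, b in zip(p, v)]

# Hypothetical mean curve mu(t) tracing the equator, and a unit tangent
# field phi_1(t); phi_1(t) lies in the tangent space at mu(t) because
# <mu(t), phi_1(t)> = 0 for every t.
def mu(t):
    return [math.cos(t), math.sin(t), 0.0]

def phi_1(t):
    return [0.0, 0.0, 1.0]  # tangent at every point of the equator

def X_K(t, xi_1):
    # K = 1 truncated process: X_1(t) = Exp_{mu(t)}(xi_1 * phi_1(t)).
    v = [xi_1 * c for c in phi_1(t)]
    return sphere_exp(mu(t), v)
```

By construction the truncated trajectory stays on the manifold for any score value, and a zero score returns the mean curve itself.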
To reflect longitudinal studies, for a sample $X_1, \dots, X_n$ of an $\mathcal{M}$-valued Riemannian random process X, we assume that each $X_i$ is only recorded at $m_i$ random time points $T_{i1}, \dots, T_{im_i} \in \mathcal{T}$, where each observation $X_i(T_{ij})$ is additionally corrupted by intrinsic random noise. Specifically, with random noise vectors $\varepsilon_{ij} \in T_{X_i(T_{ij})}\mathcal{M}$, one records noisy measurements

$$Y_{ij} = \mathrm{Exp}_{X_i(T_{ij})}(\varepsilon_{ij}), \quad j = 1, \dots, m_i, \ i = 1, \dots, n, \qquad (5)$$

where the measurement times $T_{ij}$ are identically distributed and independent of the predictors $X_i$, with $T_{ij} \sim f$ for some density f supported on $\mathcal{T}$. A graphical illustration of the sampling model (5) is in the left panel of Figure 1. We require
- (Y0)
Conditional on $X_i$, the $\varepsilon_{ij}$ are independent and are independent of the $T_{ij}$, with isotropic variance σ2 and $E\|\varepsilon_{ij}\|^2 < \infty$. Furthermore, the Fréchet mean of $Y_{ij}$ conditional on $X_i(T_{ij})$ is $X_i(T_{ij})$.
As the Fréchet mean of $Y_{ij}$ given $X_i(T_{ij})$ is $X_i(T_{ij})$, the assumption on $\varepsilon_{ij}$ implies that the noise is centered, $E(\varepsilon_{ij}) = 0$. The random noise $\varepsilon_{ij}$, although modeled in the tangent spaces, induces random noise on the manifold itself via Riemannian exponential maps, and could alternatively be modeled directly on the manifold, under the centering condition that the Fréchet mean of $Y_{ij}$ given $X_i(T_{ij})$ is $X_i(T_{ij})$, which is the equivalent of the usual centering condition for model errors in Euclidean space. The following condition is analogous to (X1) and is needed for $Y_{ij}$ to stay away from the cut locus. It imposes indirect constraints on the random noise. For example, it requires the random noise to be bounded when the cut locus is not empty. The condition is not needed when the underlying manifold $\mathcal{M}$ has nonpositive sectional curvature (Lee, 1997).
- (Y1)
For some constant $\epsilon > 0$, $Y_{ij} \in \mathrm{Exp}_{\mu(T_{ij})}\{\mathcal{D}_{\mu(T_{ij}), \epsilon}\}$ almost surely for all i and j.
2.3 Estimation
To deal with the sparse and irregularly spaced observations coming from a longitudinal study, we develop a new method for estimating the Fréchet mean function by harnessing the Fréchet regression framework that was originally designed for the case of independent observations (Petersen and Müller, 2019). We study an extension that is valid for dependent repeated measurements in a unified framework that covers both sparse and dense observations.
To construct a local polynomial smoother for manifold-valued responses, we define the local weight function at $t \in \mathcal{T}$ as

$$w(s, t; h) = \frac{1}{\hat\sigma_0^2(t)} K_h(s - t)\{\hat\mu_2(t) - \hat\mu_1(t)(s - t)\},$$

where $K_h(\cdot) = h^{-1} K(\cdot/h)$, $\hat\mu_r(t) = \sum_{i=1}^{n} w_i \sum_{j=1}^{m_i} K_h(T_{ij} - t)(T_{ij} - t)^r$ for $r = 0, 1, 2$, $\hat\sigma_0^2(t) = \hat\mu_0(t)\hat\mu_2(t) - \hat\mu_1(t)^2$, the $w_i$ are subject-specific weights satisfying $w_i \ge 0$, $\sum_{i=1}^{n} m_i w_i = 1$, K is a symmetric density function, and h is a bandwidth. The mean $\mu(t)$ is estimated by

$$\hat\mu(t) = \operatorname*{arg\,min}_{p \in \mathcal{M}} \hat M_n(p, t),$$

where we define the double-weighted Fréchet function as

$$\hat M_n(p, t) = \sum_{i=1}^{n} w_i \sum_{j=1}^{m_i} w(T_{ij}, t; h)\, d^2(Y_{ij}, p) \qquad (6)$$

for $p \in \mathcal{M}$ and $t \in \mathcal{T}$. For the special case $\mathcal{M} = \mathbb{R}^D$ where observations lie in a Euclidean space, $\hat M_n$ coincides with the sum of squared errors loss used in Zhang and Wang (2016). Two prominent schemes are to assign equal weight to each observation, resulting in $w_i = 1/\sum_{i=1}^{n} m_i$ (Yao et al., 2005), or to assign equal weight to each subject, ie, $w_i = 1/(n m_i)$ (Li and Hsing, 2010). As for the Euclidean case, the former scheme was found to work better for nondense and the latter for ultradense functional data, and hence we will adopt the former weight scheme in our implementations for sparsely sampled data.
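The two weighting schemes above are simple to compute; a minimal sketch (function names are ours) that also makes the shared normalization $\sum_i m_i w_i = 1$ explicit:

```python
def obs_weights(m):
    # Equal weight per observation (Yao et al., 2005): w_i = 1 / (sum_i m_i).
    total = sum(m)
    return [1.0 / total for _ in m]

def subj_weights(m):
    # Equal weight per subject (Li and Hsing, 2010): w_i = 1 / (n * m_i).
    n = len(m)
    return [1.0 / (n * mi) for mi in m]
```

For example, with subject observation counts `m = [2, 5, 3]`, both schemes satisfy the normalization, but the first downweights subjects with many observations less than the second does.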
To estimate the covariance structure, we first map the observed data into tangent spaces, obtaining D-dimensional column vectors $L_{ij} = \mathrm{Log}_{\hat\mu(T_{ij})}(Y_{ij})$, and then estimate the matrix-valued covariance function Γ by scatterplot smoothing (Yao et al., 2005) applied to the raw covariance matrices $C_{ijl} = L_{ij} L_{il}^\top$ for $j \ne l$. This leads to the estimates $\hat\Gamma(s, t) = \hat A_0(s, t)$, where

$$\{\hat A_0(s, t), \hat A_1(s, t), \hat A_2(s, t)\} = \operatorname*{arg\,min}_{A_0, A_1, A_2} \sum_{i=1}^{n} v_i \sum_{1 \le j \ne l \le m_i} K_{h_\Gamma}(T_{ij} - s) K_{h_\Gamma}(T_{il} - t)\, \bigl\|C_{ijl} - A_0 - A_1 (T_{ij} - s) - A_2 (T_{il} - t)\bigr\|_F^2.$$

Here, $\|\cdot\|_F$ is the matrix Frobenius norm, $h_\Gamma$ is a bandwidth, and the $v_i$ are weights satisfying $v_i \ge 0$ and $\sum_{i=1}^{n} m_i(m_i - 1) v_i = 1$, where we select $v_i = 1/\sum_{i=1}^{n} m_i(m_i - 1)$ in the sparse longitudinal case; see Zhang and Wang (2016, 2018) for other possible choices. Estimates for the eigenfunctions $\phi_k$ and eigenvalues $\lambda_k$ of Γ are then obtained as the eigenfunctions $\hat\phi_k$ and eigenvalues $\hat\lambda_k$ of $\hat\Gamma$.
In applications, one needs to choose appropriate bandwidths h and $h_\Gamma$, as well as the number of included components K. To select h for smoothing the mean function $\mu$, we adopt a generalized cross-validation (GCV) criterion

$$\mathrm{GCV}(h) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m_i} d^2\{Y_{ij}, \hat\mu_h(T_{ij})\}}{\{1 - |\mathcal{T}| K(0)/(N h)\}^2},$$

where $N = \sum_{i=1}^{n} m_i$ is the total number of observations, selecting h as the minimizer of GCV(h). While a similar GCV strategy can be applied to select the bandwidth $h_\Gamma$ for the covariance function, we propose to employ a simpler choice that ties $h_\Gamma$ to the mean bandwidth h, which we found to be computationally efficient and to perform well in simulations. To practically determine the number of components K included in the finitely truncated representation (4), we consider the fraction of variation explained (FVE)

$$\mathrm{FVE}(K) = \frac{\sum_{k=1}^{K} \hat\lambda_k}{\sum_{k=1}^{K_{\max}} \hat\lambda_k},$$

where $K_{\max}$ is the number of positive estimated eigenvalues, and choose the number of included components as the smallest K such that the FVE exceeds a specified threshold $0 < \gamma < 1$,

$$K^{*} = \min\{K \ge 1 : \mathrm{FVE}(K) \ge \gamma\}.$$

Commonly γ is set to 0.90, 0.95, or 0.99.
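The FVE-based choice of K reduces to a cumulative-sum scan over the estimated eigenvalues; a minimal sketch (the helper name is ours):

```python
def select_K(eigenvalues, gamma=0.95):
    # Choose the smallest K whose fraction of variation explained reaches
    # the threshold gamma: FVE(K) = sum_{k<=K} lambda_k / sum_k lambda_k.
    lam = [l for l in eigenvalues if l > 0]  # keep positive eigenvalue estimates
    total = sum(lam)
    cum = 0.0
    for K, l in enumerate(lam, start=1):
        cum += l
        if cum / total >= gamma:
            return K
    return len(lam)
```

For instance, with estimated eigenvalues 4, 2, 1, 1 and γ = 0.95, the first three components explain only 7/8 = 0.875 of the variation, so all four components are retained.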
2.4 Riemannian Functional Principal Component Analysis Through Conditional Expectation
The unobserved Riemannian functional principal component scores $\xi_{ik}$ in (4) need to be estimated from the discrete samples $Y_{ij}$. Approximating the integral in (3) is infeasible when the number of repeated measurements per curve is small, which is well known for the Euclidean case (Yao et al., 2005). We therefore develop in the following RPACE for tangent vector-valued processes. Throughout this subsection, expected values will be taken conditional on the observation time points $T_{ij}$.
Applying best linear unbiased prediction of $\xi_{ik}$, we obtain scores

$$\tilde\xi_{ik} = \mathrm{BLUP}(\xi_{ik} \mid \mathbf{Y}_i) = \lambda_k \boldsymbol{\phi}_{ik}^\top \boldsymbol{\Sigma}_i^{-1} \mathbf{L}_i, \qquad (9)$$

where BLUP denotes the best linear unbiased predictor; with the vectorization operation denoted as Vec( · ), $\mathbf{L}_i = \mathrm{Vec}\{(L_{i1}, \dots, L_{im_i})\}$ are the vectorized log-mapped noisy observations for subject i, $L_{ij} = \mathrm{Log}_{\mu(T_{ij})}(Y_{ij})$, $\boldsymbol{\phi}_{ik} = \mathrm{Vec}[\{\phi_k(T_{i1}), \dots, \phi_k(T_{im_i})\}]$, and $\boldsymbol{\Sigma}_i = \mathrm{Cov}(\mathbf{L}_i) + \sigma^2 I$, where I is the identity matrix. The entry of $\mathrm{Cov}(\mathbf{L}_i)$ corresponding to $(L_{ij})_a$ and $(L_{il})_b$ is $\{\Gamma(T_{ij}, T_{il})\}_{ab}$, where $(v)_a$ and $(A)_{ab}$ denote the ath and $(a, b)$th entry in a vector v and matrix A, respectively. Estimates $\tilde\xi_{ik}$ coincide with the conditional expectations $E(\xi_{ik} \mid \mathbf{Y}_i)$ if the joint distribution of $(\xi_{ik}, \mathbf{L}_i^\top)$ is elliptically contoured (Fang et al., 1990, Theorem 2.18) such as the Gaussian distribution.
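The BLUP is a finite-dimensional linear predictor once the mean, covariance, eigenvalues, eigenfunctions, and noise variance are in hand. The sketch below (our own helper, shown for the scalar case D = 1, with made-up inputs) assumes these quantities are known and evaluated at one subject's time points:

```python
import numpy as np

def blup_scores(L_i, Phi_i, lambdas, sigma2, Gamma_i):
    """BLUP scores xi_ik = lambda_k * phi_ik^T Sigma_i^{-1} L_i, where
    Sigma_i = Cov(L_i) + sigma^2 I at the subject's m time points.
    L_i:     (m,) centered log-mapped observations (D = 1 for simplicity)
    Phi_i:   (m, K) eigenfunctions evaluated at the time points
    lambdas: (K,) eigenvalues
    Gamma_i: (m, m) covariance Gamma(T_ij, T_il) at the time points
    """
    Sigma_i = Gamma_i + sigma2 * np.eye(len(L_i))
    sol = np.linalg.solve(Sigma_i, L_i)           # Sigma_i^{-1} L_i
    return np.asarray(lambdas) * (Phi_i.T @ sol)  # one score per component
```

As a sanity check, when the eigenfunction matrix has orthonormal columns and the noise variance is negligible, noise-free data generated from known scores are recovered almost exactly; with substantial noise the predictor shrinks the scores toward zero, as a BLUP should.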
Substituting estimates for the corresponding unknown quantities in (9), we obtain

$$\hat\xi_{ik} = \hat\lambda_k \hat{\boldsymbol{\phi}}_{ik}^\top \hat{\boldsymbol{\Sigma}}_i^{-1} \hat{\mathbf{L}}_i,$$

where $\hat{\mathbf{L}}_i = \mathrm{Vec}\{(\hat L_{i1}, \dots, \hat L_{im_i})\}$ with $\hat L_{ij} = \mathrm{Log}_{\hat\mu(T_{ij})}(Y_{ij})$, and $\hat{\boldsymbol{\phi}}_{ik}$ and $\hat{\boldsymbol{\Sigma}}_i$ are obtained from $\hat\mu$, the minimizer of (6), and from $\hat\Gamma$, $\hat\phi_k$, $\hat\lambda_k$, and $\hat\sigma^2$. The K-truncated processes

$$X_{iK}(t) = \mathrm{Exp}_{\mu(t)}\Bigl\{\sum_{k=1}^{K} \xi_{ik} \phi_k(t)\Bigr\}$$

are then estimated by

$$\hat X_{iK}(t) = \mathrm{Exp}_{\hat\mu(t)}\Bigl\{\sum_{k=1}^{K} \hat\xi_{ik} \hat\phi_k(t)\Bigr\}.$$
If one aims to estimate the underlying processes $X_i$ or $L_i$ at some fixed $t \in \mathcal{T}$, one can also proceed directly without estimating the scores. Best linear unbiased predictors are obtained as

$$\tilde L_i(t) = \mathrm{BLUP}\{L_i(t) \mid \mathbf{Y}_i\} = \boldsymbol{\Gamma}_i(t)^\top \boldsymbol{\Sigma}_i^{-1} \mathbf{L}_i,$$

with corresponding estimates

$$\hat X_i(t) = \mathrm{Exp}_{\hat\mu(t)}\bigl\{\hat{\boldsymbol{\Gamma}}_i(t)^\top \hat{\boldsymbol{\Sigma}}_i^{-1} \hat{\mathbf{L}}_i\bigr\}, \qquad (14)$$

obviating the need for finite truncation. Here, $\boldsymbol{\Gamma}_i(t)$ is the $D m_i \times D$ matrix obtained by stacking $\Gamma(T_{ij}, t)$, $j = 1, \dots, m_i$, and $\hat{\boldsymbol{\Gamma}}_i(t)$ is the plug-in estimate of $\boldsymbol{\Gamma}_i(t)$.
3 Asymptotic Properties
To derive the asymptotic properties of the estimates in Section 2, we require the following assumptions, in addition to conditions (X0) and (X1); see the Supporting Information for details.
- (M0)
The domain $\mathcal{T}$ is a compact interval and $\mathcal{M}$ is a bounded submanifold of $\mathbb{R}^D$.
- (K0)
The kernel function K is a Lipschitz continuous symmetric probability density function on [ − 1, 1].
- (X2)
Almost surely, the sample paths of X are twice continuously differentiable.
- (X3)
The density f of the measurement times $T_{ij}$ is positive and twice continuously differentiable.

Boundedness of the manifold as in (M0) can be replaced by compact support conditions on the random process X, the random noise $\varepsilon_{ij}$, and the noisy observations $Y_{ij}$. Conditions (X2) and (X3) concern the smoothness of the process and design density and are standard for the Euclidean case (Zhang and Wang, 2016).
Let $w(s, t; h)$ be the local weight function defined in Section 2.3; its boundedness for fixed $h > 0$ follows from the Cauchy–Schwarz inequality, and its finiteness is implied by the Lipschitz continuity of the kernel function K and the compactness of the domain $\mathcal{T}$. Define the weighted population Fréchet function $\tilde M_n(p, t) = E\{\hat M_n(p, t)\}$ and its minimizer $\tilde\mu(t) = \operatorname*{arg\,min}_{p \in \mathcal{M}} \tilde M_n(p, t)$. Two additional conditions (L0) and (L1) are needed.
- (L0)
The Fréchet mean functions $\mu(t)$ and $\tilde\mu(t)$ exist and are unique, the latter almost surely for all n.
Defining a real-valued function $F(v, t) = \tilde M_n\{\mathrm{Exp}_{\mu(t)}(v), t\}$ for $v \in \mathcal{D}_{\mu(t)}$, where $T_p\mathcal{M}$ denotes as before the tangent space at p and $\mathrm{Exp}_p$ the Riemannian exponential map, we assume
- (L1)
The Hessian of $F(\cdot, t)$ at $v = \mathrm{Log}_{\mu(t)}\{\tilde\mu(t)\}$ is uniformly positive definite along the mean function, ie, for its smallest eigenvalue $\lambda_{\min}(t)$ it holds that $\inf_{t \in \mathcal{T}} \lambda_{\min}(t) > 0$.
Conditions (L0) and (L1) ensure properly defined minima and are necessary for consistent estimation of the mean curve using M-estimation theory (Petersen and Müller, 2019). On a Riemannian manifold with positive sectional curvature bounded above, (L0) and (L1) are satisfied asymptotically if the support of the noisy observations $Y_{ij}$ in the local time window stays within a sufficiently small geodesic ball $B\{\mu(t), r\}$, where $B(p, r)$ is a geodesic ball with center p and radius r (Bhattacharya and Bhattacharya, 2012); this specifically holds for longitudinal compositional data mapped to the positive orthant of a unit sphere. The next two conditions on the bandwidths h and $h_\Gamma$ are needed to derive the rate of convergence of the mean and covariance estimates, respectively. For simplicity of presentation, we assume $m_i \equiv m$, noting that extensions to more general cases are straightforward (Zhang and Wang, 2016).
- (H1)
$h \to 0$ and $\log n/(n m h) \to 0$ as $n \to \infty$.
- (H2)
$h_\Gamma \to 0$, $\log n/(n m h_\Gamma^2) \to 0$, and $h = O(h_\Gamma)$ as $n \to \infty$.
Theorem 1 Assume that conditions (X0)-(X3), (Y0)-(Y1), (M0), (K0), (L0)-(L1), and (H1) hold. Then,

$$\sup_{t \in \mathcal{T}} d\{\hat\mu(t), \mu(t)\} = O_p\Biggl(h^2 + \sqrt{\frac{\log n}{n m h}} + \sqrt{\frac{\log n}{n}}\Biggr). \qquad (15)$$

Theorem 1 shows that the estimate $\hat\mu$ enjoys the same uniform convergence rate as in Zhang and Wang (2016) for the Euclidean case, so the presence of curvature does not impact the rate. The rate in (15) has three terms that correspond to three regimes that are characterized by the growth rate of m relative to the sample size: (a) When $m/n^{1/4} \to 0$, the observations per curve are sparse and the optimal choice $h \asymp \{\log n/(n m)\}^{1/5}$ yields the uniform rate $\{\log n/(n m)\}^{2/5}$. (b) When $m \asymp n^{1/4}$, corresponding to an intermediate case, the optimal choice $h \asymp \{\log n/(n m)\}^{1/5}$ leads to the uniform rate $(\log n/n)^{1/2}$ up to a logarithmic factor. (c) When $m/n^{1/4} \to \infty$, the observations are dense, and any choice of h with $1/m \ll h \le (\log n/n)^{1/4}$ gives rise to the uniform rate $(\log n/n)^{1/2}$.
We note that the transition from (a) to (c) is akin to a phase transition as observed in Hall et al. (2006). Our next result concerns the uniform rate for the estimator of Γ, the covariance function of the log-mapped data, extending a result of Zhang and Wang (2016) to manifold-valued functional data.
Theorem 2 Assume conditions (X0)-(X3), (Y0)-(Y1), (M0), (K0), (L0)-(L1), (H1), and (H2) hold. Then,

$$\sup_{s, t \in \mathcal{T}} \|\hat\Gamma(s, t) - \Gamma(s, t)\|_F = O_p\Biggl(h^2 + h_\Gamma^2 + \sqrt{\frac{\log n}{n m h_\Gamma}} + \sqrt{\frac{\log n}{n m^2 h_\Gamma^2}} + \sqrt{\frac{\log n}{n}}\Biggr).$$
Again, the above rate gives rise to three regimes that are determined by the growth rate of m relative to the sample size and are discussed in Web Appendix C in the Supporting Information, along with its implications for estimated eigenvalues and eigenfunctions, where corresponding rates are obtained by applying results from perturbation theory (Bosq, 2000).
4 Data Applications
4.1 Time-Evolving Functional Connectivity
The longitudinal development of brain functional connectivity, defined as the temporal dependency of neuronal activation patterns in different brain regions, has become a focus of recent investigations in brain imaging (Deoni et al., 2016; Dai et al., 2019) to quantify brain development and brain aging. In such studies, brain imaging modalities that include fMRI scans are often obtained longitudinally to quantify the coactivation of various brain regions over time, where activation of a region is inferred from elevated blood oxygen levels in the specified region. We focus here on quantifying the effects of brain aging in terms of longitudinally varying functional connectivity between brain regions, assessed using fMRI scans that are obtained from subjects in a relaxed state. Brain connectivity is measured by calculating a correlation (Worsley et al., 2005) that essentially corresponds to functional dynamic correlation (Dubin and Müller, 2005) between the average signals of various brain regions. The observed correlations are frequently contaminated with measurement errors from various sources (Laumann et al., 2017).
Recent work on longitudinal modeling of fMRI-based correlation matrices and connectivity has focused on linear models (Hart et al., 2018) with their associated restrictions or on graph-based methods under very limited designs where just two repeated observations are considered per subject (Kundu et al., 2019). In contrast to these previous approaches that are grounded in traditional Euclidean representations, we propose here a nonparametric approach for general longitudinal data that is highly flexible and is designed for objects that form Riemannian manifolds and where the fitted trajectories automatically stay in the same space; a directly comparable method does not exist yet. Once trajectories of correlation matrices have been obtained, they can then be converted into time-varying dynamic connectivity measures.
The data for our analysis are from ADNI, a longitudinal study that has recorded repeated rs-fMRI scans. Of central interest are changes in brain function for Alzheimer's patients after the onset of the disease. Since the time of onset is unobservable, we chose as a proxy the time of the first scan for each subject, the first time at which the diagnosis status is available, and for all subjects, we chose this approximate onset time as the origin of the time axis. The times $T_{ij}$, $j = 1, \dots, m_i$, when fMRI scans were obtained for the ith subject, are accordingly recorded relative to the time origin, which means that the first scan for each subject takes place at time $T_{i1} = 0$. The raw rs-fMRI data were first preprocessed by following a standard protocol that involves motion correction, slice timing correction, coregistration, normalization, and detrending. For each scan, we obtained the pairwise correlations between 10 brain regions that were identified as relevant for brain connectivity in Buckner et al. (2009). The resulting 10 × 10 correlation matrices constitute the manifold-valued responses $Y_{ij}$, $j = 1, \dots, m_i$, $i = 1, \dots, n$, observed at the scan times $T_{ij}$. For each subject, one thus has a random number of correlation matrices that are sparsely observed in time and may be noise-contaminated.
It is of interest to determine and compare the continuously interpolated mean trajectories for both Alzheimer's and normal subjects, for which we apply the proposed RPACE. The correlation matrices are modeled on the Riemannian manifold $\mathrm{SPD}(10)$ of 10 × 10 symmetric positive definite matrices, for which the tangent space $T_P$ at a $P \in \mathrm{SPD}(10)$ is represented by the collection of 10 × 10 symmetric matrices. We employ the affine-invariant Riemannian metric on $\mathrm{SPD}(10)$ (Pennec et al., 2006), which can be defined through $\langle U, V \rangle_P = \mathrm{tr}(P^{-1} U P^{-1} V)$ for symmetric matrices $U, V \in T_P$. While the representation of tangent spaces does not depend on the selected P, the Riemannian metric, exponential maps, and logarithm maps depend on P, as follows. For a symmetric positive definite P, the Riemannian exponential map $\mathrm{Exp}_P$ takes a symmetric matrix V to a symmetric positive definite matrix via $\mathrm{Exp}_P(V) = P^{1/2} \exp(P^{-1/2} V P^{-1/2}) P^{1/2}$, and the Riemannian logarithm map $\mathrm{Log}_P$ takes a symmetric positive definite Q to a symmetric matrix via $\mathrm{Log}_P(Q) = P^{1/2} \log(P^{-1/2} Q P^{-1/2}) P^{1/2}$, where $\exp$ and $\log$ here denote the matrix exponential and logarithm. The geodesic distance is $d(P, Q) = \|\log(P^{-1/2} Q P^{-1/2})\|_F$, where $\|A\|_F$ denotes the Frobenius norm of a matrix A. The proposed methods guarantee that the fitted objects lie on $\mathrm{SPD}(10)$ and are always SPD matrices, which can then be converted to correlation matrices. This is an important feature that distinguishes the proposed geometric approach from classical Euclidean methods, where there is no such guarantee. The affine-invariant metric endows $\mathrm{SPD}(10)$ with globally nonpositive curvature (Pennec et al., 2006), which ensures that Fréchet mean, exponential map, and logarithm map are always well defined.
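These affine-invariant exponential and logarithm maps are straightforward to implement via an eigendecomposition, since all matrix functions involved act on symmetric matrices. The following sketch (our own helper functions, using NumPy) computes them for symmetric positive definite inputs:

```python
import numpy as np

def _sym_fun(S, f):
    # Apply a scalar function f to a symmetric matrix S via eigendecomposition.
    w, V = np.linalg.eigh(S)
    return (V * f(w)) @ V.T

def spd_exp(P, V):
    # Affine-invariant exponential: Exp_P(V) = P^{1/2} expm(P^{-1/2} V P^{-1/2}) P^{1/2}.
    Ph = _sym_fun(P, np.sqrt)                       # P^{1/2}
    Pih = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))   # P^{-1/2}
    return Ph @ _sym_fun(Pih @ V @ Pih, np.exp) @ Ph

def spd_log(P, Q):
    # Affine-invariant logarithm: Log_P(Q) = P^{1/2} logm(P^{-1/2} Q P^{-1/2}) P^{1/2}.
    Ph = _sym_fun(P, np.sqrt)
    Pih = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return Ph @ _sym_fun(Pih @ Q @ Pih, np.log) @ Ph

def spd_dist(P, Q):
    # Geodesic distance: Frobenius norm of logm(P^{-1/2} Q P^{-1/2}).
    Pih = _sym_fun(P, lambda w: 1.0 / np.sqrt(w))
    return np.linalg.norm(_sym_fun(Pih @ Q @ Pih, np.log), 'fro')
```

Because the matrix exponential of a symmetric matrix has strictly positive eigenvalues, `spd_exp` always returns an SPD matrix, mirroring the guarantee that fitted objects remain on the manifold.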
The time window of interest in the longitudinal connectivity analysis is the period up to 1.1 years after the initial visit at $t = 0$, where the time domain is chosen to allow at least one full year of observations. After removing subjects with outlying signals or no repeated rs-fMRI measurements within 1.1 years of the initial visit, the sample consisted of 64 subjects, of whom 26 had a diagnosis of Alzheimer's disease and 38 were cognitively normal. A total of 215 scans were available, with two to four repeated measurements per subject. The sparsity and irregularity of these data pose difficulties for classical analyses, prevent the application of presmoothing, and render previous approaches (Dai and Müller, 2018) infeasible. The proposed RPACE method is geared toward such sparse and irregularly sampled manifold-valued functional data and guarantees consistent estimation.
The raw correlation matrices for three randomly selected subjects are displayed in the first row of each panel in Figure 2. The large heterogeneity in the raw correlations suggests the presence of substantial measurement errors. The eigenvalues decay slowly in this example, motivating the application of (14) to obtain noise-filtered fitted trajectories $\hat X_i$ without resorting to finite-dimensional truncation. The fitted correlations, obtained with the Gaussian kernel and displayed in the second row of each panel in Figure 2, are clearly smoother and less noisy compared to the raw correlations, which helps to delineate underlying trends.
Figure 2. Longitudinal functional connectivity of three randomly selected subjects, reflected by 10 × 10 correlation matrices for ten brain regions, where the subject in the upper left panel is cognitively normal and has two available measurements, and the subjects in the upper right and lower panels have been diagnosed with Alzheimer's disease and have three and four available measurements, respectively, with times of the measurements in years as indicated in the panels. Each panel includes observed (top row) and fitted (bottom row) correlation matrices. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
To further demonstrate the application of our methodology for comparisons of SPD(10)-valued connectivity matrices, we focus on the initial visit time $t = 0$, where a noisy raw correlation matrix $Y_{i1}$ is available for each subject, as well as the value of the noise-filtered fitted trajectory $\hat X_i(0)$. For a simple-minded approach, one could also consider the average connectivity matrix $\bar Y_i$, defined as the Fréchet mean of all correlation matrices observed at random times for the ith subject. For each of the raw, fitted, and average correlation matrices available per subject, we compare the correlation matrices of the subjects with Alzheimer's with those of the cognitively normal group. We apply the Fréchet analysis of variance test of Dubey and Müller (2019) for the two-sample comparison of the correlation matrices, where we refer to this paper for further details about the test. The P-values obtained with the bootstrap version of the test were smallest when using the fitted correlation matrices $\hat X_i(0)$, with larger P-values when using the raw correlation matrices at the first visit $Y_{i1}$ or the subject-specific correlation matrix averages $\bar Y_i$. This suggests that the noise reduction achieved by the proposed fitting procedure leads to more powerful inference.
4.2 Emotional Well-Being for Unemployed Workers
In this second data example, we analyze data from the Survey of Unemployed Workers in New Jersey (Krueger and Mueller, 2011) conducted in the fall of 2009 and the beginning of 2010, during which the unemployment rate in the United States peaked at 10% after the 2007-2008 financial crisis. The data are from a stratified random sample of unemployed workers, who were surveyed weekly for up to 12 weeks. Questionnaires included an entry survey, which assessed baseline characteristics such as household income, and weekly follow-ups regarding job search activities and emotional well-being. In each follow-up questionnaire, participants were asked to report the percentage of time they spent in each of four different moods.
We consider a sample of unemployed workers enrolled in the study who were not offered a job during the survey period. The times at which subject i responded to the jth survey were recorded as days since the start of the study. The overall weekly response rate was around 40%, and the number of responses per subject ranged from 1 to 12, with 25% of all subjects having only one recorded response; these data are thus a mixture of very sparse and somewhat sparse longitudinal observations. As subjects responded on different days of the week, the times at which the compositional mood vector was recorded are not only sparse but also irregularly spaced. At each response time, a compositional data vector is observed whose lth entry is the reported, and assumed to be noise-contaminated, proportion of time subject i spent in the lth mood in the previous week, with the four moods being bad, low/irritable, mildly pleasant, and good. The observed compositions reflect an underlying mood composition process whose value at day t is the vector of proportions of time a subject spent in each mood in the week preceding day t.
The proposed RPACE method was applied to the square-root transformed compositional data and the corresponding transformed compositional process, obtained by taking element-wise square roots of the mood proportions. Since compositional data are nonnegative and sum to one, the transformed vectors lie on the unit sphere (Scealy and Welsh, 2011; Dai and Müller, 2018). Two geometries might be considered as alternatives to the proposed spherical geometry: the Aitchison geometry (Aitchison, 1986), obtained through applying a log-ratio transformation, and the Euclidean geometry for the unaltered original compositions. The Aitchison geometry faces an immediate problem in this data application, because a substantial proportion of the mood compositions is zero, leaving the log-ratio undefined; this poses no problem for the proposed square-root transformation. While the compositional simplex can be naturally embedded into Euclidean space and endowed with the Euclidean geometry, this approach yields a geodesic distance that places equal emphasis on differences in entries of large and small magnitude. In many applications it is more sensible to attach higher importance to the small entries, and this is effectively achieved by the square-root transformation and also by the log transformation.
Bandwidths for the mean and covariance estimates were selected by GCV, using the Epanechnikov kernel. The fitted mood composition trajectories are displayed in the left panel of Figure 3 for four randomly selected subjects, where solid dots denote the reported moods, slightly jittered vertically if they overlap, and curves denote the fitted trajectories. The fits are obtained with eight components, selected according to the FVE criterion (8) with a high threshold, a reasonable choice in view of the large sample size. As the self-reported moods contain substantial aberrations from smooth trajectories, which we view as noise, the fitted trajectories do not pass through the raw observations. The mean trajectory is displayed in the right panel of Figure 3, indicating that the emotional well-being of subjects tends to deteriorate as the period of unemployment lengthens, with an overall increase in the proportion of bad mood and a decrease in the proportion of good mood.

Left: Longitudinal mood compositional data for four randomly selected unemployed workers, with raw observations shown as dots and fitted trajectories by the proposed method shown as solid curves, using eight eigencomponents. Overlapping dots were slightly jittered vertically. Right: The overall mean function. All functions are shown in the square root transformation scale. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
The first four eigenfunctions for mood composition trajectories are shown in Figure 4, where the first eigenfunction corresponds to the overall contrast between neutral-to-positive mood (good and mild) and negative moods (low and bad); the second eigenfunction represents emotional stability, which is a contrast between more neutral moods and extreme emotions (good and bad); the third eigenfunction corresponds to a shift of mood compositions to more positive moods, namely, from bad to low and from mild to good; the fourth eigenfunction encodes an increase of positive feelings and a decrease of negative ones over time. Here it is important to note that the sign of the eigenfunctions is arbitrary and could be reversed. The first four eigenfunctions together explain 95% of the total variation.
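Because eigenfunctions are identified only up to sign, a fixed sign convention is convenient when comparing fits across samples or software. One simple choice (our own illustrative helper, not a convention prescribed by the paper) flips each estimated eigenfunction, sampled on a grid, so that its largest-magnitude value is positive:

```python
import numpy as np

def align_sign(phi):
    """Resolve the sign indeterminacy of an eigenfunction sampled on a grid
    by making its largest-magnitude entry positive."""
    phi = np.asarray(phi, dtype=float)
    return phi if phi[np.argmax(np.abs(phi))] >= 0 else -phi
```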

The first four eigenfunctions for the longitudinal mood composition data in the square root transformation scale, with fraction of variation explained (FVE) displayed in the panel subtitles. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
To demonstrate that the scores obtained from the proposed approach are useful for downstream tasks such as regression, we explored the association between the second Riemannian principal component score, which corresponds to the proportion of extreme moods, and annual household income in 2008, a measure of financial stability. Collecting these scores for all subjects, we constructed kernel density estimates of the scores within each income category; see Figure 5. Participants with higher household income before losing their job, and thus higher financial stability, tend to have higher emotional stability, as demonstrated by the right-shifted score distributions and larger means (colored dots). The relationship between prior income and emotional stability appears to be nonlinear, especially for the lower income groups.

The distributions of the second Riemannian principal component score, encoding emotion stability, visualized as densities in dependence on the annual household income in 2008. Colored dots indicate the mean of this score for each income group. This figure appears in color in the electronic version of this article, and any mention of color refers to that version
5 Concluding Remarks
While the proposed RPACE approach has been found to perform very well in the data examples and simulations, and provides a valuable tool for longitudinal studies with complex data, its utility is limited for extremely sparse longitudinal designs with an average of around two measurements per subject. Such extremely sparse designs do occur in practical applications (Dai et al., 2019). In such cases, the mean and linearized covariance functions can still be reasonably estimated if the number of subjects is large, but the recovery of individual trajectories is unstable. Further limitations are encountered for manifolds with high curvature, where local linear approximations work less well, and for stratified or infinite-dimensional manifolds. Addressing these limitations is left for future research.
Data Availability Statement
The data that support the findings of this paper are publicly available at adni.loni.usc.edu and opr.princeton.edu/archive/njui/.
Acknowledgments
We thank the reviewers for their constructive comments. This research was supported by NSF grants DMS-1712864 and DMS-2014626. Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
Supplementary data
Web Appendices for the simulation referenced in Section 1, proofs and additional discussions for the theoretical results referenced in Section 3, and links to the data and code for reproducing the data applications are available with this paper at the Biometrics website on Wiley Online Library. An R package implementation is available at https://github.com/CrossD/RFPCA.