Ziwei Zhu, Tengyao Wang, Richard J. Samworth, High-Dimensional Principal Component Analysis with Heterogeneous Missingness, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 84, Issue 5, November 2022, Pages 2000–2031, https://doi.org/10.1111/rssb.12550
Abstract
We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.
1 INTRODUCTION
One of the ironies of working with Big Data is that missing data play an ever more significant role, and often present serious difficulties for analysis. For instance, a common approach to handling missing data is to perform a so-called complete-case analysis (Little & Rubin, 2019), where we restrict attention to individuals in our study with no missing attributes. When relatively few features are recorded for each individual, one can frequently expect a sufficiently large proportion of complete cases that, under an appropriate missing at random (MAR) hypothesis, a complete-case analysis may result in only a relatively small loss of efficiency. On the other hand, in high-dimensional regimes where there are many features of interest, there is often such a small proportion of complete cases that this approach becomes infeasible. As a very simple illustration of this phenomenon, imagine a data matrix in which each entry is missing independently with probability 0.01. When the number of features is small, a complete-case analysis would result in around 95% of the individuals (rows) being retained, but once the dimension grows into the hundreds, only around 5% of rows will have no missing entries.
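To make the decay of the complete-case proportion concrete, here is a minimal R sketch; the missingness probability 0.01 is taken from the example above, while the particular dimensions shown are illustrative choices of ours.

```r
# Proportion of rows with no missing entries when each of d entries is
# missing independently with probability 0.01
p_miss <- 0.01
d <- c(5, 50, 100, 300)
round((1 - p_miss)^d, 3)
# roughly 0.951, 0.605, 0.366 and 0.049: from about 95% complete rows down to about 5%
```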
The inadequacy of the complete-case approach in many applications has motivated numerous methodological developments in the field of missing data over the past 60 years or so, including imputation (Ford, 1983; Rubin, 2004), factored likelihood (Anderson, 1957) and maximum likelihood approaches (Dempster et al., 1977); see, for example, Little & Rubin (2019) for an introduction to the area. Recent years have also witnessed increasing emphasis on understanding the performance of methods for dealing with missing data in a variety of high-dimensional problems, including sparse regression (Belloni et al., 2017; Loh & Wainwright, 2012), classification (Cai & Zhang, 2018), sparse principal component analysis (Elsener & van de Geer, 2018) and covariance and precision matrix estimation (Loh & Tan, 2018; Lounici, 2014).
In this paper, we study the effects of missing data in one of the canonical problems of high-dimensional data analysis, namely dimension reduction via Principal Component Analysis (PCA). This is closely related to the topic of matrix completion, which has received a great deal of attention in the literature over the last decade or so (see, for example, Candès et al., 2011; Candès & Plan, 2010; Candès & Recht, 2009; Keshavan et al., 2010; Koltchinskii et al., 2011; Mazumder et al., 2010; Negahban & Wainwright, 2012). There, the focus is typically on accurate recovery of the missing entries, subject to a low-rank assumption on the signal matrix; by contrast, our focus is on estimation of the principal eigenspaces. Previously proposed methods for low-dimensional PCA with missing data include non-linear iterative partial least squares (Wold & Lyttkens, 1969), iterative PCA (Josse & Husson, 2012; Kiers, 1997) and its regularised variant (Josse et al., 2009); see Dray & Josse (2015) for a nice survey and comparative study. More broadly, the R-miss-tastic website https://rmisstastic.netlify.com/ provides a valuable resource on methods for handling missing data.
The importance of the problem of high-dimensional PCA with missing data derives from its wide variety of applications. For instance, in many commercial settings, one may have a matrix of customers and products, with entries recording the number of purchases. Naturally, there will typically be a high proportion of missing entries. Nevertheless, PCA can be used to identify items that distinguish the preferences of customers particularly effectively, to make recommendations to users of products they might like and to summarise efficiently customers' preferences. Later, we will illustrate such an application, on the Million Song Dataset, where we are able to identify particular songs that have substantial discriminatory power for users' preferences as well as other interesting characteristics of the user database. Other potential application areas include health data, where one may seek features that best capture the variation in a population, and where the corresponding principal component scores may be used to cluster individuals into subgroups (that may, for instance, receive different treatment regimens).
To formalise the problem we consider, suppose that the (partially observed) data matrix is of the form
for independent random matrices and , where is a low-rank matrix and is a noise matrix with independent and identically distributed entries having zero mean. The low-rank property of is encoded through the assumption that it is generated via
where has orthonormal columns and is a random matrix (with ) having independent and identically distributed rows with mean zero and covariance matrix . Note that when and are independent, the covariance matrix of has a -spiked structure; such covariance models have been studied extensively in both theory and applications (Cai et al., 2013; Fan et al., 2013; Johnstone & Lu, 2009; Paul, 2007).
We are interested in estimating the column space of , denoted by , which is also the -dimensional leading eigenspace of . Cho et al. (2017) considered a different but related model where in (2) is deterministic, and is not necessarily centred, so that is the top right singular space of . (By contrast, in our setting, , so the mean structure is uninformative for recovering .) Their model can be viewed as being obtained from the model (1) and (2) by conditioning on . In the context of a -homogeneous Missing Completely At Random (MCAR) observation model, where each entry of is observed independently with probability (independently of the data), Cho et al. (2017) studied the estimation of by , where is a simple estimator formed as the top eigenvectors of an observed-proportion weighted (OPW) version of the sample covariance matrix (here, the weighting is designed to achieve approximate unbiasedness). Our first contribution, in Section 2, is to provide a detailed, finite-sample analysis of this estimator in the model given by (1) and (2) together with a -homogeneous MCAR missingness structure, with a noise level of constant order. The differences between the settings necessitate completely different arguments, and reveal in particular a new phenomenon in the form of a phase transition in the attainable risk bound for the loss function, that is, the Frobenius norm of the diagonal matrix of the sines of the principal angles between and . Moreover, we also provide a minimax lower bound in the case of estimating a single principal component, which reveals that this estimator achieves the minimax optimal rate up to a poly-logarithmic factor.
While this appears to be a very encouraging story for the OPW estimator, it turns out that it is really only the starting point for a more complete understanding of high-dimensional PCA with missing data. For instance, in the noiseless case, the OPW estimator fails to provide exact recovery of the principal components. Moreover, it is the norm rather than the exception in applications that missingness is heterogeneous, in the sense that the probability of observing entries of varies (often significantly) across columns. For instance, in recommendation systems, some products will typically be more popular than others, and hence we observe more ratings in those columns. As another example, in meta-analyses of data from several studies, it is frequently the case that some covariates are common across all studies, while others appear only in a reduced proportion of them. In Section 2.2, we present an example to show that, even with an MCAR structure, PCA algorithms can break down entirely for such heterogeneous observation mechanisms when individual rows of can have large Euclidean norm. Intuitively, if we do not observe the interaction between the th and th columns of , then we cannot hope to estimate the th or th rows of , and this will cause substantial error if these rows of contain significant signal. This example illustrates that it is only possible to handle heterogeneous missingness in high-dimensional PCA with additional structure, and indicates that it is natural to measure the difficulty of the problem in terms of the incoherence among the entries of —that is, the maximum Euclidean norm of the rows of .
Our main contribution, then, is to propose a new, iterative algorithm, called primePCA (short for projected refinement for imputation of missing entries in Principal Component Analysis), in Section 3, to estimate , even with heterogeneous missingness. The main initialiser that we study for this algorithm is a modified version of the simple estimator discussed above, where the modification accounts for potential heterogeneity. Each iteration of primePCA projects the observed entries of onto the column space of the current estimate of to impute missing entries, and then updates our estimate of by computing the leading right singular space of the imputed data matrix. An illustration of the two steps of a single iteration of the primePCA algorithm in the case , is given in Figure 1.
An illustration of the two steps of a single iteration of the primePCA algorithm with d = 3 and K = 1. Black dots represent fully observed data points, while vertical dotted lines that emanate from them give an indication of their x3 coordinate values, as well as their projections onto the x1-x2 plane. The x1 coordinate of the orange data point and the x2 coordinate of the blue data point are unobserved, so the true observations lie on the respective solid lines through those points (which are parallel to the relevant axes). Starting from an input estimate of VK (left), given by the black arrow, we impute the missing coordinates as the closest points on the coloured lines to VK (middle), and then obtain an updated estimate of VK as the leading right singular vector of the imputed data matrix (right, with the old estimate in grey).
Our theoretical results reveal that in the noiseless setting, that is, , primePCA achieves exact recovery of the principal eigenspaces (with a geometric convergence rate) when the initial estimator is close to the truth and a sufficiently large proportion of the data are observed. Moreover, we also provide a performance guarantee for the initial estimator, showing that under appropriate conditions it satisfies the desired requirement with high probability, conditional on the observed missingness pattern. Code for our algorithm is available in the R package primePCA (Zhu et al., 2019).
To the best of our knowledge, primePCA is the first method for high-dimensional PCA that is designed to cope with settings where missingness is heterogeneous. Indeed, the previously mentioned works on high-dimensional PCA and other high-dimensional statistical problems with missing data have either focused on a uniform missingness setting or have imposed a lower bound on entrywise observation probabilities, which reduces to this uniform case. In particular, such results fail to distinguish in terms of the performance of their algorithms between a setting where one variable is observed with a very low probability and all other variables are fully observed, and a setting where all variables are observed with probability . A key contribution of our work is to account explicitly for the effect of a heterogeneous missingness mechanism, where the estimation error depends on average entrywise missingness rather than worst-case missingness; see the discussions after Theorem 4 and Proposition 2 below. In Section 4, the empirical performance of primePCA is compared with both that of the initialiser, and a popular method for matrix completion called softImpute (Hastie et al., 2015; Mazumder et al., 2010); we also discuss maximum likelihood approaches implemented via the Expectation–Maximisation (EM) algorithm, which can be used when the dimension is not too high. Our settings include a wide range of signal-to-noise ratios (SNRs), as well as Missing Completely At Random, MAR and Missing Not At Random (MNAR) examples (Little & Rubin, 2019; Seaman et al., 2013). These comparisons reveal that primePCA provides highly accurate and robust estimation of principal components, for instance outperforming the softImpute algorithm, even when the latter is allowed access to the oracle choice of regularisation parameter for each dataset. Our analysis of the Million Song Dataset is given in Section 5. In Section 6, we illustrate how some of the ideas in this work may be applied to other high-dimensional statistical problems involving missing data. Proofs of our main results are deferred to Section A in Appendix S1 (Zhu et al., 2019); auxiliary results and their proofs are given in Section B of Appendix S1.
1.1 Notation
For a positive integer , we write . For and , we define and . We let denote the unit Euclidean sphere in .
Given , we write for the diagonal matrix whose th diagonal entry is . We let denote the set of matrices in with orthonormal columns. For a matrix , and , we write if and for its entrywise norm, as well as for its -to- operator norm. We provide special notation for the (Euclidean) operator norm and the Frobenius norm by writing and respectively. We also write for the th largest singular value of , and define its nuclear norm by . If , we write for the matrix obtained by extracting the rows of that are in . For , the Hadamard product of and , denoted , is defined such that for any and .
2 THE OPW ESTIMATOR
In this section, we study a simple OPW estimator of the matrix of principal components. To define the estimator, let denote the event that the th entry of is observed. We define the revelation matrix by , and the partially observed data matrix
Our observed data are the pair (). Importantly, the fact that we observe allows us to distinguish between observed zeros and missing entries (even though these also appear as zeros in ). We first consider the simplest possible case, which we refer to as the -homogeneous observation model, where entries of the data matrix are observed independently and completely at random (i.e., independent of ), each with probability . Thus, for all , and and are independent for .
For , let and denote the th rows of and , respectively, and define . Writing and for its entrywise inverse, we have that under the -homogeneous observation model, and . Following Lounici (2013, 2014) and Cho et al. (2017), we consider the following weighted sample covariance matrix:
The reason for including the weight is to ensure that , so that is an unbiased estimator of . Related ideas appear in the work of Cai & Zhang (2016) on high-dimensional covariance matrix estimation with missing data; see also Little & Rubin (2019, section 3.4). In practice, is typically unknown and needs to be estimated. It is therefore natural to consider the following plug-in estimator :
where and denotes the proportion of observed entries in . The OPW estimator of , denoted , is the matrix formed from the top eigenvectors of .
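To illustrate the inverse-probability weighting idea behind the OPW estimator, the following R sketch rescales the naive second-moment matrix by 1/p̂ on the diagonal and 1/p̂² off the diagonal, since an entry product involves one observation indicator on the diagonal and two independent indicators off it. The function names are ours, and the sketch illustrates the weighting principle under the homogeneous model rather than transcribing the estimator displayed above.

```r
# Y_obs: n x d data matrix with unobserved entries set to zero
# Omega: n x d binary revelation matrix (1 = observed)
opw_covariance <- function(Y_obs, Omega) {
  n <- nrow(Y_obs)
  p_hat <- mean(Omega)                      # overall observed proportion
  S <- crossprod(Y_obs) / n                 # naive d x d second-moment matrix
  W <- matrix(1 / p_hat^2, ncol(Y_obs), ncol(Y_obs))
  diag(W) <- 1 / p_hat                      # one indicator on the diagonal, two off it
  W * S                                     # entrywise (Hadamard) reweighting
}

# OPW estimator: top K eigenvectors of the reweighted covariance estimate
opw_top_eigenvectors <- function(Y_obs, Omega, K) {
  eigen(opw_covariance(Y_obs, Omega), symmetric = TRUE)$vectors[, 1:K, drop = FALSE]
}
```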
2.1 Theory for homogeneous missingness
We begin by studying the theoretical performance of in a simple model that will allow us to reveal an interesting phase transition for the problem. For a random vector taking values in and for , we define its (Orlicz) -norm and a version that is invariant to invertible affine transformations by
respectively. Recall that we say is sub-Gaussian if .
In this preliminary section, we assume that is generated according to (1), (2) and (3), where:
- (A1)
, and are independent;
- (A2)
has independent and identically distributed rows with and ;
- (A3)
has independent and identically distributed entries with , and ;
- (A4)
for all ;
- (A5)
has independent entries.
Thus, (A1) ensures that the complete data matrix and the revelation matrix are independent; in other words, for now we work in a Missing Completely At Random (MCAR) setting. In a homoscedastic noise model, there is no loss of generality (by a scaling argument) in assuming that each entry of has unit variance, as in (A3). In many places in this work, it will be convenient to think intuitively of and in (A2)–(A4) as constants. In particular, if has multivariate normal rows and has normal entries, then we can simply take . For , under the same normality assumptions, we have , so this intuition amounts to thinking of the variance of each component of our data as being of constant order.
A natural measure of the performance of an estimator of is given by the Davis–Kahan loss
(Davis & Kahan, 1970). Our first theorem controls the risk of the OPW estimator; here and below, we write for the th largest eigenvalue of .
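For reference, this loss can be computed directly from the singular values of the K × K matrix of inner products between the two orthonormal bases, since these singular values are the cosines of the principal angles. A minimal R sketch, assuming (as described in the Introduction) that the loss is the Frobenius norm of the diagonal matrix of sines of the principal angles:

```r
# Frobenius-norm sin-Theta loss between the column spaces of two
# d x K matrices with orthonormal columns
sin_theta_loss <- function(V_hat, V) {
  s <- svd(crossprod(V_hat, V))$d           # cosines of the principal angles
  sqrt(sum(1 - pmin(s, 1)^2))               # ||sin Theta||_F (pmin guards against rounding above 1)
}
```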
Theorem 1. Assume (A1)–(A5) and that . Write . Then there exists a universal constant such that
(5) In particular, if , then there exists , depending only on and , such that
(6)
Theorem 1 reveals an interesting phase transition phenomenon. Specifically, if the signal strength is large enough that , then we should regard as the effective sample size, as might intuitively be expected. On the other hand, if , then the estimation problem is considerably more difficult and the effective sample size is of order . In fact, by inspecting the proof of Theorem 1, we see that in the high signal case, it is the difficulty of estimating the diagonal entries of that drives the rate, while when the signal strength is low, the bottleneck is the challenge of estimating the off-diagonal entries. By comparing (6) with the minimax lower bound result in Theorem 2 below, we see that this phase transition phenomenon is an inherent feature of this estimation problem, rather than an artefact of the proof techniques we used to derive the upper bound.
The condition in Theorem 1 is reasonable given the scaling requirement for consistency of the empirical eigenvectors (Johnstone & Lu, 2009; Shen et al., 2016; Wang & Fan, 2017). Indeed, Shen et al. (2016, Theorem 5.1) show that when , the top eigenvector of the sample covariance matrix estimator is consistent if and only if . If we regard as the effective sample size in our missing data PCA problem, then it is a sensible analogy to assume that here, which implies that the condition holds for large , up to poly-logarithmic factors.
As mentioned in the introduction, Cho et al. (2017) considered the different but related problem of singular space estimation in a model in which , where is a matrix of the form for a deterministic matrix , whose rows are not necessarily centred. In this setting, is the matrix of top right singular vectors of , and the same estimator can be applied. An important distinction is that, when the rows of are not centred and the entries of are of comparable magnitude, is of order , so when is regarded as a constant, it is natural to think of the singular values of as also being of order . Indeed, this is assumed in Cho et al. (2017). On the other hand, in our model, where the rows of have mean zero, assuming that the eigenvalues are of order would amount to an extremely strong requirement, essentially restricting attention to very highly spiked covariance matrices. Removing this condition in Theorem 1 requires completely different arguments.
In order to state our minimax lower bound, we let denote the class of distributions of pairs satisfying (A1), (A2), (A3) and (A5) with . Since we are now working with vectors instead of matrices, we write in place of .
Theorem 2. There exists a universal constant such that
where the infimum is taken over all estimators of .
Theorem 2 reveals that in Theorem 1 achieves the minimax optimal rate of estimation up to a poly-logarithmic factor when and are regarded as constants.
2.2 Heterogeneous observation mechanism
A key assumption of the theory in Section 2.1, which allowed even a very simple estimator to perform well, was that the missingness probability was homogeneous across the different entries of the matrix. On the other hand, the aim of this sub-section is to show that the situation changes dramatically once the data can be missing heterogeneously.
To this end, consider the following example. Suppose that is equal to or with equal probability, so that
In other words, for each , we observe precisely one of the first two entries of , together with all of the remaining entries. Let , where , and , where . Suppose that and let , and similarly assume that and set . Then and are identically distributed. However, the leading eigenvectors of and are respectively and , which are orthogonal!
Thus, it is impossible to simultaneously estimate consistently the leading eigenvector of both and from our observations. We note that it is the disproportionate weight of the first two coordinates in the leading eigenvector, combined with the failure to observe simultaneously the first two entries in the data, that makes the estimation problem intractable in this example. The understanding derived from this example motivates us to seek bounds on the error in high-dimensional PCA that depend on an incoherence parameter . The intuition here is that the maximally incoherent case is where each column of is a unit vector proportional to a vector whose entries are either 1 or , in which case and . On the other hand, in the worst case, when the columns of are the first standard basis vectors in , we have . Bounds involving incoherence have appeared previously in the literature on matrix completion (e.g., Candès & Plan, 2010; Keshavan et al., 2010), but for a different reason. There, the purpose is to control the principal angles between the true right singular space and the standard basis, which yields bounds on the number of observations required to infer the missing entries of the matrix. In our case, the incoherence condition controls the extent to which the loadings of the principal components of interest are concentrated in any single coordinate, and therefore the extent to which significant estimation error in a few components of the leading eigenvectors can affect the overall statistical performance. In the intractable example above, , and with such a large value of , heavy corruption from missingness in only a few entries spoils any chance of consistent estimation.
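The incoherence discussed above can be computed directly from the maximum Euclidean row norm of the matrix of principal components. The sqrt(d/K) normalisation in the sketch below, which makes the quantity equal to 1 in the maximally incoherent case and sqrt(d/K) in the worst case, is one common convention and should be checked against the formal definition in the paper; the function name is ours.

```r
# Incoherence of a d x K matrix with orthonormal columns: sqrt(d / K)
# times its largest Euclidean row norm.  Evenly spread rows give 1;
# columns equal to standard basis vectors give sqrt(d / K).
incoherence <- function(V) {
  d <- nrow(V); K <- ncol(V)
  sqrt(d / K) * sqrt(max(rowSums(V^2)))
}
```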
3 OUR NEW ALGORITHM FOR PCA WITH MISSING ENTRIES
We are now in a position to introduce and analyse our iterative algorithm primePCA to estimate , the principal eigenspace of the covariance matrix . The basic idea is to iterate between imputing the missing entries of the data matrix using a current (input) iterate , and then applying a singular value decomposition (SVD) to the completed data matrix. More precisely, for , we let denote the indices for which the corresponding entry of is observed, and regress the observed data on to obtain an estimate of the th row of . This is natural in view of the data generating mechanism . We then use to impute the missing values , retain the original observed entries as , and set our next (output) iterate to be the top right singular vectors of the imputed matrix . To motivate this final choice, observe that when , we have ; we therefore have the SVD , where and is diagonal with positive diagonal entries. This means that , so the column spaces of and coincide. For convenience, pseudocode of a single iteration of refinement in this algorithm is given in Algorithm 1.

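A minimal R sketch of a single refinement iteration, following the verbal description above: each row's observed entries are regressed onto the corresponding rows of the current estimate, the fitted scores are used to impute the missing entries, and the top right singular vectors of the completed matrix are returned. Variable names are ours, and skipping rows with fewer than K observed entries is a simplification of the screening discussed later.

```r
# One refinement iteration: impute missing entries of Y using the current
# estimate V_cur (d x K, orthonormal columns), then re-estimate V_K.
refine_once <- function(Y_obs, Omega, V_cur) {
  n <- nrow(Y_obs); K <- ncol(V_cur)
  Y_imp <- Y_obs
  for (i in seq_len(n)) {
    obs <- which(Omega[i, ] == 1)
    if (length(obs) < K) next                                   # simplification: skip information-poor rows
    u_i <- qr.solve(V_cur[obs, , drop = FALSE], Y_obs[i, obs])  # least-squares scores for row i
    mis <- which(Omega[i, ] == 0)
    Y_imp[i, mis] <- V_cur[mis, , drop = FALSE] %*% u_i         # impute; observed entries are retained
  }
  svd(Y_imp, nu = 0, nv = K)$v                                  # top K right singular vectors
}
```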
We now seek to provide formal justification for Algorithm 1. The recursive nature of the primePCA algorithm induces complex relationships between successive iterates, so to facilitate theoretical analysis, we will impose some conditions on the underlying data generating mechanism that may not hold in situations where we would like to apply the algorithm. Nevertheless, we believe that the analysis provides considerable insight into the performance of the primePCA algorithm, and these conditions are discussed extensively below; moreover, our simulations in Section 4 consider settings both within and outside the scope of our theory, and confirm its attractive and robust numerical performance.
In addition to the loss function , it will be convenient to define a slightly different notion of distance between subspaces. For any , we let be an SVD of . The two-to-infinity distance between and is then defined to be
We remark that the definition of does not depend on our choice of SVD and that for any , so that really represents a distance between the subspaces spanned by and . In fact, there is a sense in which the change-of-basis matrix tries to align the columns of as closely as possible with those of ; more precisely, if we change the norm from the two-to-infinity operator norm to the Frobenius norm, then uniquely solves the so-called Procrustes problem (Schönemann, 1966):
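The following R sketch computes this distance under the assumption, consistent with the discussion above, that the change-of-basis matrix is obtained from an SVD of the K × K matrix of inner products, and that the two-to-infinity norm of a matrix is its maximum Euclidean row norm; it is an illustration rather than a definitive implementation of the displayed definition.

```r
# Two-to-infinity distance between the column spaces of V_hat and V,
# after aligning V_hat to V via the orthogonal Procrustes rotation.
two_to_inf_dist <- function(V_hat, V) {
  s <- svd(crossprod(V_hat, V))                    # SVD of V_hat^T V
  W_align <- s$u %*% t(s$v)                        # best orthogonal alignment of the two bases
  sqrt(max(rowSums((V_hat %*% W_align - V)^2)))    # largest Euclidean row norm of the difference
}
```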
The following proposition considers the noiseless setting , and shows that, for any estimator that is close to , a single iteration of refinement in Algorithm 1 contracts the two-to-infinity distance between their column spaces, under appropriate conditions. We define .
Proposition 1. Let be as in Algorithm 1 and further let . We assume that and that . Suppose that and that the SVD of is of the form , where and for some . Then there exist , depending only on , such that whenever
- (i)
,
- (ii)
,
we have that
In order to understand the main conditions of Proposition 1, it is instructive to consider the case , as was illustrated in Figure 1, and initially to think of as a constant. In that case, condition (i) asks that the absolute value of every component of the difference between the vectors and is ; for intuition, if two vectors are uniformly distributed on , then each of their norms is ; in other words, we only ask that the initialiser is very slightly better than a random guess. In condition (ii), being less than 1 is equivalent to the proportion of missing data in each column being less than (where again depends only on ), and the conclusion is that the refine step contracts the initial two-to-infinity distance from by at least a factor of . In the noiseless setting of Proposition 1, the matrix of right singular vectors of has the same column span (and hence the same two-to-infinity norm) as . We can therefore gain some intuition about the scale of by considering the situation where is uniformly distributed on , so in particular, the columns of are uniformly distributed on . By Vershynin (2018, Theorem 5.1.4), we deduce that . On the other hand, when the distribution of is invariant under left multiplication by an orthogonal matrix (e.g. if has independent and identically distributed Gaussian rows), then is distributed uniformly on . Arguing as above, we see that, with high probability, we may take . This calculation suggests that we do not lose too much by thinking of as a constant (or at most, growing very slowly with and ).
To apply Proposition 1, we also require conditions on and . In practice, if either of these conditions is not satisfied, we first perform a screening step that restricts attention to a set of row indices for which the data contain sufficient information to estimate the principal components. This screening step is explicitly accounted for in Algorithm 2, as well as in the theory that justifies it. An alternative would be to seek to weight rows according to their utility for principal component estimation, but it seems difficult to implement this in a principled way that facilitates formal justification.

Algorithm 2: primePCA, an iterative algorithm for estimating VK given an initialiser
Algorithm 2 provides pseudocode for the iterative primePCA algorithm, given an initial estimator . The iterations continue until either we hit the convergence threshold or the maximum iteration number . Theorem 3 below guarantees that, in the noiseless setting of Proposition 1, the primePCA estimator converges to at a geometric rate.
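The loop structure of Algorithm 2 can be sketched as follows, reusing refine_once from the earlier sketch; the screening of information-poor rows and the exact convergence criterion used in the primePCA package are omitted, so this illustrates the iteration rather than reproducing the packaged implementation.

```r
# Iterate refinement steps from an initial estimate V_init until successive
# iterates stabilise or the maximum number of iterations is reached.
prime_pca_sketch <- function(Y_obs, Omega, V_init, thresh = 1e-5, max_iter = 1000) {
  V_cur <- V_init
  K <- ncol(V_init)
  for (iter in seq_len(max_iter)) {
    V_new <- refine_once(Y_obs, Omega, V_cur)
    # sin-Theta distance between successive iterates, used as a stopping rule
    delta <- sqrt(max(K - sum(svd(crossprod(V_new, V_cur))$d^2), 0))
    V_cur <- V_new
    if (delta < thresh) break
  }
  V_cur
}
```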
Theorem 3. For , let be the iterate of Algorithm 2 with input , , , , , and . Write , and let
where . Suppose that and that the SVD of is of the form , where and for some . Assume that
Then there exist , depending only on and , such that whenever
- (i)
,
- (ii)
,
we have that for every,
The condition that amounts to the very mild assumption that the algorithmic input is not exactly equal to any element of the set , though the conditions on and become milder as increases.
3.1 Initialisation
Theorem 3 provides a general guarantee on the performance of primePCA, but relies on finding an initial estimator that is sufficiently close to the truth . The aim of this sub-section, then, is to propose a simple initialiser and show that it satisfies the requirement of Theorem 3 with high probability, conditional on the missingness pattern.
Consider the following modified weighted sample covariance matrix
where for any ,
Here, the matrix replaces in (4) because, unlike in Section 2.1, we no longer wish to assume homogeneous missingness. We take as our initial estimator of the matrix of top eigenvectors of , denoted . Theorem 4 below studies the performance of this initialiser, in terms of its two-to-infinity norm error, and provides sufficient conditions for us to be able to apply Theorem 3. In particular, it ensures that the initialiser is reasonably well-aligned with the target . We write and for probabilities and expectations conditional on .
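A sketch of a heterogeneity-aware weighting in R: each (j, k) entry of the matrix of sums of products over observed pairs is divided by the number of rows in which both coordinates j and k are observed. This pairwise available-case rescaling is the natural analogue of the homogeneous OPW weighting and is offered as our illustration of the construction; the precise weight matrix is the one displayed above.

```r
# Heterogeneity-adjusted weighted covariance: rescale each (j, k) entry by
# the number of rows in which both coordinates j and k are observed.
weighted_cov_hetero <- function(Y_obs, Omega) {
  S <- crossprod(Y_obs)                       # sums of products over jointly observed pairs
  N_jk <- crossprod(Omega)                    # pairwise observation counts
  S / pmax(N_jk, 1)                           # guard against pairs that are never jointly observed
}

# Initialiser: top K eigenvectors of the reweighted covariance estimate
init_estimator <- function(Y_obs, Omega, K) {
  eigen(weighted_cov_hetero(Y_obs, Omega), symmetric = TRUE)$vectors[, 1:K, drop = FALSE]
}
```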
Theorem 4. Assume (A1)–(A4) and that . Suppose further that , that for all , and let . Then there exist , depending only on and , such that for every , if
(10) then
As a consequence, writing
where and are as in Theorem 3, we have that
The first part of Theorem 4 provides a general probabilistic upper bound for , after conditioning on the missingness pattern. This allows us, in the second part, to provide a guarantee on the probability with which is a good enough initialiser for Theorem 3 to apply. For intuition regarding , consider the MCAR setting with for . In that case, by Lemma 6, typical realisations of have and when is small. In particular, when , we expect to be small when and are both of the same order, and grow faster than
As a special case, in the -homogeneous model where for , the requirement on above is that it should grow faster than .
One of the attractions of our analysis is the fact that we are able to provide bounds that only depend on entrywise missingness probabilities in an average sense, as opposed to worst-case missingness probabilities. The refinements conferred by such bounds are particularly important when the missingness mechanism is heterogeneous, as typically encountered in practice. The averaging of missingness probabilities can be partially seen in Theorem 4, since and depend only on the and norms of each row of , but is even more evident in the proposition below, which gives a probabilistic bound on the original distance between and .
Proposition 2. Assume the same conditions as in Theorem 4. Then there exists a universal constant such that for any , if
(11) then
In this bound, we see that only depends on through the entrywise and norms of the whole matrix. Lemma 6 again provides probabilistic control of these norms under the -homogeneous missingness mechanism. In general, if the rows of are independent and identically distributed, but different covariates are missing with different probabilities, then off-diagonal entries of will concentrate around the reciprocals of the simultaneous observation probabilities of pairs of covariates. As such, for a typical realisation of , our bound in Proposition 2 depends only on the harmonic averages of these simultaneous observation probabilities and their squares. Such an averaging effect ensures that our method is effective in a much wider range of heterogeneous settings than previously allowed in the literature.
3.2 Weakening the missingness proportion condition for contraction
Theorem 3 provides a geometric contraction guarantee for the primePCA algorithm in the noiseless case. The price we pay for this strong conclusion, however, is a strong condition on the proportion of missingness that enters the contraction rate parameter through ; indeed in an asymptotic framework where the incoherence parameter grows with the sample size and/or dimension, the proportion of missingness would need to vanish asymptotically. Therefore, to complement our earlier theory, we present below Proposition 3 and Corollary 1. Proposition 3 is an analogue of the deterministic Proposition 1 in that it demonstrates that a single iteration of the primePCA algorithm yields a contraction provided that the input is sufficiently close to . The two main differences are first that the contraction is in terms of a Procrustes-type loss (see the discussion around (7)), which turns out to be convenient for Corollary 1; and second, the bound depends only on the incoherence of the matrix , and not on the corresponding quantity for .
Proposition 3. Let , let and let . Fix and with , and let , with . Suppose that are such that for every ,
(12) Assume further that
(13) Then the output of Algorithm 1 satisfies
where .
Interestingly, the proof of Proposition 3 proceeds in a very different fashion from that of Proposition 1. The key step is to bound the discrepancy between the principal components of the imputed data matrix in Algorithm 1 and using a modified version of Wedin's theorem (Wang, 2016). To achieve the desired contraction rate, instead of viewing the true data matrix as the reference matrix when calculating the perturbation, we choose a different reference matrix with the same top right singular space as but which is closer to in terms of the Frobenius norm. Such a reference shift sharpens the eigenspace perturbation bound.
The contraction rate in Proposition 3 is a sum of three terms, the first two of which are small provided that is small and is large respectively. On the other hand, the final term is small provided that no small subset of the rows of contributes excessively to its operator norm. For different missingness mechanisms, such a guarantee would need to be established probabilistically on a case-by-case basis; in Corollary 1 we illustrate how this can be done to achieve a high probability contraction in the simplest missingness model. Importantly, the proportion of missingness allowed, and hence the contraction rate parameter, no longer depend on the incoherence of , and can be of constant order.
Corollary 1. Consider the -homogeneous MCAR setting. Fix and with , and let , with . Suppose that , and are as in Proposition 3, let , and, fixing , suppose that
Suppose that . Then with probability at least , the output of Algorithm 1 satisfies
To understand the conclusion of Corollary 1, it is instructive to consider the special case . Here, and is the ratio of the and norms of the vector . When the entries of this vector are of comparable magnitude, is therefore of order , so the contraction rate is of order .
3.3 Other missingness mechanisms
Another interesting aspect of our theory is that the guarantees provided in Theorem 3 are deterministic. Provided we start with a sufficiently good initialiser, Theorem 3 describes the way in which the performance of primePCA improves over iterations. An attraction of this approach is that it offers the potential to study the performance of primePCA under more general missingness mechanisms. For instance, one setting of considerable practical interest is the MAR model, which postulates that our data vector and observation indicator vector satisfy
for all and . In other words, the probability of seeing a particular missingness pattern only depends on the data vector through components of this vector that are observed. Thus, if we want to understand the performance of primePCA under different missingness mechanisms, such as specific MAR (or even MNAR) models, all we require is an analogue of Theorem 4 on the performance of the initialiser in these new missingness settings. Such results, however, are likely to be rather problem-specific in nature, and it can be that choosing an initialiser based on available information on the dependence between the observations and the missingness mechanism makes it easier to prove the desired performance guarantees.
We now provide an example to illustrate how such initialisers can be constructed and analysed. Consider an MAR setting where the missingness pattern depends on the data matrix only through a fully observed categorical variable. In this case, we can construct a variant of the OPW estimator, denoted , by modifying the weighted sample covariance matrix in (8) to condition on the fully observed covariate, and then take the leading eigenvectors of an appropriate average of these conditional weighted sample covariance matrices. Specifically, suppose that our data consist of independent and identically distributed copies of , where , where is a categorical random variable taking values in and where is independent of for all . Writing , and , we have that , that is, is block diagonal. Thus, introducing the subscript for our th observation, as a starting point to construct , it is natural to consider an oracle estimator of , given by
where . Observe that we can write
where and where is the OPW estimator of based on the observations with . Hence, is unbiased for , because
In practice, when is unknown, we can estimate it by
and substitute this estimate into to obtain empirical estimators and of and , respectively. Finally, can be obtained as the matrix of top eigenvectors of the matrix
To sketch the way to bound the loss of such an initialiser, we can condition on and apply matrix Bernstein concentration inequalities similar to those in the proof of Theorem 1 to show that is close to for each . Simple binomial concentration bounds then allow us to combine these to control , and then apply a variant of the Davis–Kahan theorem to obtain a final result.
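A rough R sketch of such a group-conditional initialiser, reusing weighted_cov_hetero from the earlier sketch: the weighted covariance is computed separately within each level of the fully observed categorical variable and the pieces are combined with the empirical group proportions before extracting leading eigenvectors. The combination rule is our reading of the construction outlined above rather than a verbatim transcription of it.

```r
# OPWv-style initialiser: condition on a fully observed categorical variable
# g (a length-n vector of group labels), reweight within each group, then
# average the group-wise estimates with the empirical group proportions.
opwv_estimator <- function(Y_obs, Omega, g, K) {
  n <- nrow(Y_obs)
  groups <- split(seq_len(n), g)
  Sigma_hat <- Reduce(`+`, lapply(groups, function(idx) {
    (length(idx) / n) * weighted_cov_hetero(Y_obs[idx, , drop = FALSE],
                                            Omega[idx, , drop = FALSE])
  }))
  eigen(Sigma_hat, symmetric = TRUE)$vectors[, 1:K, drop = FALSE]
}
```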
While different initialisers can be designed and analysed theoretically in specific missingness settings, as shown in the example above, our empirical experience, nevertheless, is that regardless of the missingness mechanism, primePCA is extremely robust to the choice of initialiser. This is evident from the discussion of the performance of primePCA in MAR and MNAR settings given in Section 4.4.
4 SIMULATION STUDIES
In this section, we assess the empirical performance of primePCA, as proposed in Algorithm 2, with initialiser from Section 3.1, and denote the output of this algorithm by . In Sections 4.1, 4.2 and 4.3, we generate observations according to the model described in (1), (2) and (3) where the rows of the matrix are independent random vectors, for some positive semi-definite . We further generate the observation indicator matrix , independently of and , and investigate the following four missingness mechanisms that represent different levels of heterogeneity (an illustrative generator for these mechanisms is sketched after the list):
- (H1)
Homogeneous: for all ;
- (H2)
Mildly heterogeneous: for , where and independently;
- (H3)
Highly heterogeneous columns: for and all odd and for and all even .
- (H4)
Highly heterogeneous rows: for and all odd and for and all even .
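To fix ideas, the following R sketch generates revelation matrices under the four mechanisms. The specific probability values (0.7 for the homogeneous case, random factors on [0.6, 1] for (H2), and alternating 0.9/0.1 for (H3) and (H4)) are illustrative assumptions standing in for the values specified in the formulas above.

```r
# Illustrative generators for the four observation mechanisms (H1)-(H4);
# the probability values below are placeholder assumptions.
gen_Omega <- function(n, d, mechanism = c("H1", "H2", "H3", "H4")) {
  mechanism <- match.arg(mechanism)
  P <- switch(mechanism,
    H1 = matrix(0.7, n, d),                                             # homogeneous
    H2 = outer(runif(n, 0.6, 1), runif(d, 0.6, 1)),                     # mild row/column heterogeneity
    H3 = matrix(rep(c(0.9, 0.1), length.out = d), n, d, byrow = TRUE),  # alternating odd/even columns
    H4 = matrix(rep(c(0.9, 0.1), length.out = n), n, d)                 # alternating odd/even rows
  )
  matrix(rbinom(n * d, 1, as.vector(P)), n, d)
}
```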
In Sections 4.1, 4.2 and 4.3 below, we investigate primePCA in noiseless, noisy and misspecified settings, respectively. Section 4.4 is devoted to MAR and MNAR settings. In all cases, the average statistical error was estimated from 100 Monte Carlo repetitions of the experiment. For comparison, we also studied the softImpute algorithm (Hastie et al., 2015; Mazumder et al., 2010), which is considered to be state-of-the-art for matrix completion (Chi et al., 2018). This algorithm imputes the missing entries of by solving the following nuclear-norm-regularised optimisation problem:
where is to be chosen by the practitioner. The softImpute estimator of is then given by the matrix of top right singular vectors of . In practice, the optimisation is carried out by representing as , and performing alternating projections to update and iteratively. The fact that the softImpute algorithm was originally intended for matrix completion means that it treats the left and right singular vectors symmetrically, whereas the primePCA algorithm, which has the advantage of a clear geometric interpretation as exemplified in Figure 1, focuses on the target of inference in PCA, namely the leading right singular vectors.
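For comparison with primePCA, the classical soft-thresholded SVD iteration of Mazumder et al. (2010) can be sketched as follows; the softImpute package itself uses faster alternating least squares updates of the low-rank factors, so this is a conceptual illustration of the nuclear-norm-regularised imputation rather than the packaged algorithm.

```r
# Conceptual sketch of nuclear-norm-regularised imputation: repeatedly fill
# the missing entries with the current fit and soft-threshold the singular
# values of the completed matrix.
soft_impute_sketch <- function(Y_obs, Omega, lambda, max_iter = 200) {
  Z <- matrix(0, nrow(Y_obs), ncol(Y_obs))
  for (iter in seq_len(max_iter)) {
    filled <- Omega * Y_obs + (1 - Omega) * Z      # keep observed entries, impute the rest
    s <- svd(filled)
    d_thr <- pmax(s$d - lambda, 0)                 # soft-threshold the singular values
    Z <- s$u %*% (d_thr * t(s$v))                  # reconstruct the low-rank fit
  }
  Z
}
# The PCA estimate is then the matrix of top right singular vectors of Z.
```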
Figure 2 presents Monte Carlo estimates of for different choices of in two different settings. The first uses the noiseless set-up of Section 4.1, together with missingness mechanism (H1); the second uses the noisy setting of Section 4.2 with parameter and missingness mechanism (H2). We see that the error barely changes when varies within ; very similar plots were obtained for different data generation and missingness mechanisms, though we omit these for brevity. For definiteness, we therefore fixed throughout our simulation study.
4.1 Noiseless case
In the noiseless setting, we let , and also fix , , and . We set
In Figure 3, we present the (natural) logarithm of the estimated average loss of primePCA and softImpute under (H1), (H2), (H3) and (H4). We set the range of the vertical axis to be the same for each method to facilitate straightforward comparison. We see that the statistical error of primePCA decreases geometrically as the number of iterations increases, which confirms the conclusion of Theorem 3 in this noiseless setting. Moreover, after a moderate number of iterations, its performance is a substantial improvement on that of the softImpute algorithm, even if this latter algorithm is given access to an oracle choice of the regularisation parameter . The high statistical error of softImpute in these settings can be partly explained by the default value of the tuning parameter thresh in the softImpute package in R, which corresponds to the red curve in the right-hand panels of Figure 3. By reducing the values of thresh to and , corresponding to the green and blue curves in Figure 3, respectively, we were able to improve the performance of softImpute to some extent, though the statistical error is sensitive to the choice of the regularisation parameter . Moreover, even with the optimal choice of , it is not competitive with primePCA. Finally, we mention that for the 2000 iterations of setting (H2), primePCA took on average just under 10 min per repetition to compute, whereas the solution path of softImpute with took around 36 min per repetition.
Logarithms of the average Frobenius norm sin Θ error of primePCA and softImpute under various heterogeneity levels of missingness in the absence of noise. The four rows of plots, from top to bottom, correspond to (H1), (H2), (H3) and (H4).
4.2 Noisy case
Here, we generate the rows of as independent random vectors, independent of all other data. We maintain the same choices of , , and as in Section 4.1, set and vary to achieve different SNRs. In particular, defining , the choices correspond to the very low, low, medium and high SNR regimes, respectively. For an additional comparison, we consider a variant of the softImpute algorithm called hardImpute (Mazumder et al., 2010), which retains only a fixed number of top singular values in each iteration of matrix imputation; this can be achieved by setting the regularisation parameter in the softImpute function to be 0.
To avoid confounding our study of the statistical performance of the softImpute algorithm with the choice of regularisation parameter , we gave the softImpute algorithm a particularly strong form of oracle choice of , namely where was chosen for each individual repetition of the experiment, so as to minimise the loss function. Naturally, such a choice is not available to the practitioner. Moreover, in order to ensure the range of was wide enough to include the best softImpute solution, we set the argument in that algorithm to be 20.
In Table 1, we report the statistical error of primePCA after 2000 iterations of refinement, together with the corresponding statistical errors of our initial estimator primePCA_init and those of softImpute(oracle) and hardImpute. Remarkably, primePCA exhibits stronger performance than these other methods across each of the SNR regimes and different missingness mechanisms. We also remark that hardImpute is inaccurate and unstable, because it might converge to a local optimum that is far from the truth.
. | . | . | . | . | . |
---|---|---|---|---|---|
(H1) | hardImpute | ||||
softImpute(oracle) | |||||
primePCA_init | |||||
primePCA | |||||
(H2) | hardImpute | ||||
softImpute(oracle) | |||||
primePCA_init | |||||
primePCA | |||||
(H3) | hardImpute | ||||
softImpute(oracle) | |||||
primePCA_init | |||||
primePCA | |||||
(H4) | hardImpute | ||||
softImpute(oracle) | |||||
primePCA_init | |||||
primePCA |
4.3 Near low-rank case
In this sub-section, we set , , , , and fixed once for all experiments to be the top eigenvectors of one realisation of the sample covariance matrix of independent random vectors. Here , and we again generated the rows of as independent random vectors. Table 2 reports the average loss of estimating the top eigenvectors of , where varies from 1 to 5. Interestingly, even in this misspecified setting, primePCA is competitive with the oracle version of softImpute.
Average losses (with SEs in brackets) in the setting of Section 4.3 under (H1), (H2), (H3) and (H4)
. | . | . | . | . | . | . |
---|---|---|---|---|---|---|
(H1) | hardImpute | |||||
softImpute(oracle) | ||||||
primePCA_init | ||||||
primePCA | ||||||
(H2) | hardImpute | |||||
softImpute(oracle) | ||||||
primePCA_init | ||||||
primePCA | ||||||
(H3) | hardImpute | |||||
softImpute(oracle) | ||||||
primePCA_init | ||||||
primePCA | ||||||
(H4) | hardImpute | |||||
softImpute(oracle) | ||||||
primePCA_init | ||||||
primePCA |
4.4 Other missingness mechanisms
Finally in this section, we investigate the performance of primePCA, as well as other alternative algorithms, in settings where the MCAR hypothesis is not satisfied. We consider two simulation frameworks to explore both MAR (see (14)) and MNAR mechanisms. In the first, we assume that missingness depends on the data matrix only through a fully observed covariate, as in the example in Section 3.3. Specifically, for some , for , and for two matrices, the pair is generated as follows:
The other rows of are taken to be independent copies of . Thus, when , the matrices and are independent, and we are in an MCAR setting; when , the data are MAR but not MCAR, and measures the extent of departure from the MCAR setting. The covariance matrix of is
In this example, we can construct a variant of the OPW estimator, which we call the OPWv estimator, by exploiting the fact that, conditional on the fully observed first column of , the data are MCAR. To do this, let
where and are the weighted sample covariance matrices computed as in (8), based on data and , respectively. The OPWv estimator is the matrix of the first two eigenvectors of . Both the OPW and OPWv estimators are plausible initialisers for the primePCA algorithm.
In low-dimensional settings, likelihood-based approaches, often implemented via an EM algorithm, are popular for handling MAR data (14) (Rubin, 1976). In Table 3, we compare the performance of primePCA in this setting with that of an EM algorithm derived from the suggestion in Little & Rubin (2019, section 11.3), and consider both the OPW and OPWv estimators as initialisers. We set , , and took for both the primePCA and the EM algorithms. From the table, we see that the OPWv estimator is able to exploit the group structure of the data to improve upon the OPW estimator, especially for the larger value of . It is reassuring to find that the performance of primePCA is completely unaffected by the choice of initialiser, and, remarkably, it outperforms the OPWv estimator, even though the latter has access to additional model structure information. The worse root mean squared error of the EM algorithm is mainly due to its numerical instability when performing Schur complement computations.
Root mean squared errors of the loss function (with SEs in brackets) over 100 repetitions from the data-generating mechanism in (15) for the observed-proportion weighted (OPW) estimator and its class-weighted variant (OPWv), Expectation–Maximisation (EM) and primePCA with both the OPW and OPWv initialisers
. | . | OPW . | OPWv . | EM . | EMv . | primePCA . | primePCAv . |
---|---|---|---|---|---|---|---|
25 | 0.1 | ||||||
25 | 0.5 | ||||||
50 | 0.1 | ||||||
50 | 0.5 |
The second simulation framework is as follows. Let and let be a latent Bernoulli thinning matrix. The data matrix and revelation matrix are generated in such a way that and are independent,
(where the maximum of the empty set is by convention). As usual, we observe . In other words, viewing each as a -step standard Gaussian random walk, we observe Bernoulli-thinned paths of the process up to (and including) the hitting time of the threshold . We note that the observations satisfy the MAR hypothesis if and only if , and as decreases from 1, the mechanism becomes increasingly distant from MAR, as we become increasingly likely to fail to observe the threshold hitting time. We take .
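A sketch of this data-generating mechanism for the revelation pattern of a single dataset; the dimension, threshold and thinning probability below are illustrative placeholders for the values specified above, and the treatment of walks that never hit the threshold reflects our reading of the convention mentioned in parentheses.

```r
# Each row is a d-step standard Gaussian random walk; entries up to and
# including the first hitting time of the threshold tau are retained and
# then thinned independently with probability beta.
gen_random_walk_data <- function(n, d = 50, tau = 5, beta = 0.75) {
  Y <- t(apply(matrix(rnorm(n * d), n, d), 1, cumsum))      # random-walk rows
  Omega <- matrix(0L, n, d)
  for (i in seq_len(n)) {
    hit <- which(Y[i, ] >= tau)[1]                          # first hitting time of tau
    T_i <- if (is.na(hit)) d else hit                       # assumption: observe the whole path if tau is never hit
    Omega[i, seq_len(T_i)] <- rbinom(T_i, 1, beta)          # Bernoulli thinning
  }
  list(Y = Y, Y_obs = Y * Omega, Omega = Omega)
}
```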
In Table 4, we compare the performance of primePCA with that of the EM algorithm, and in both cases, we can initialise with either the OPW estimator or a mean-imputation estimator, obtained by imputing all missing entries by their respective population column means. We set , , , took for both primePCA and the EM algorithm, and took . From the table, we see that primePCA outperforms the EM algorithm except in the MAR case where , which is tailor-made for the likelihood-based EM approach. In fact, primePCA is highly robust statistically and stable computationally, performing well consistently across different missingness settings and initialisers. On the other hand, the EM algorithm exhibits a much heavier dependence on the initialiser: its statistical performance suffers when initialised with the poorer mean-imputation estimator and runs into numerical instability issues when initialised with the OPW estimator in the MNAR settings. We found that these instability issues are exacerbated in higher dimensions, and moreover, that the EM algorithm quickly becomes computationally infeasible. This explains why we did not run the EM algorithm on the larger-scale problems in Sections 4.1, 4.2 and 4.3, as well as the real data example in Section 5 below.
Root mean squared errors of the loss function (with SEs in brackets) over 100 repetitions from the data-generating mechanism in (16) for the mean-imputation (MI) estimator, the observed-proportion weighted (OPW) estimator, Expectation–Maximisation (EM) and primePCA, the latter two with both the MI and OPW initialisers (indicated in the table header)
|  | MI | OPW | EM (MI init.) | EM (OPW init.) | primePCA (MI init.) | primePCA (OPW init.) |
|---|---|---|---|---|---|---|
| 1 |  |  |  |  |  |  |
| 0.75 |  |  |  |  |  |  |
| 0.5 |  |  |  |  |  |  |
| 0.25 |  |  |  |  |  |  |
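Before moving on, here is a minimal sketch of the mean-imputation initialiser referred to above: missing entries are filled in with column means and the top-K right singular vectors of the completed matrix are returned. The paper imputes with population column means, which are known in the simulation; observed column means are used here as a stand-in.

```r
# Mean-imputation initialiser sketch: fill NAs with column means, then take the
# top-K right singular vectors of the completed matrix.
mean_impute_init <- function(Yobs, K) {
  col_means <- colMeans(Yobs, na.rm = TRUE)
  Yfill <- Yobs
  for (j in seq_len(ncol(Yobs))) {
    Yfill[is.na(Yfill[, j]), j] <- col_means[j]
  }
  svd(Yfill, nu = 0, nv = K)$v                     # d x K initial estimate
}
```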
5 REAL DATA ANALYSIS: MILLION SONG DATASET
We apply primePCA to a subset of the Million Song Dataset to analyse music preferences. The original data can be expressed as a matrix with users (rows) and songs (columns), with entries recording the number of times a song was played by a particular user. The proportion of non-missing entries in the matrix is . Since the matrix is very sparse, and since most songs have very few listeners, we enhance the SNR by restricting our attention to songs with at least 100 listeners ( songs in total). This improves the proportion of non-missing entries to . Further summary information about the filtered data is provided below (a sketch of this filtering step follows the summary tables):
(a) Quantiles of non-missing matrix entry values:

| Quantile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 3 | 5 | 8 | 500 |

| Quantile | 90% | 91% | 92% | 93% | 94% | 95% | 96% | 97% | 98% | 99% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 8 | 9 | 9 | 10 | 11 | 13 | 15 | 18 | 23 | 33 | 500 |

(b) Quantiles of the number of listeners for each song:

| Quantile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 100 | 108 | 117 | 126 | 139 | 154 | 178 | 214 | 272.8 | 455.6 | 5043 |

(c) Quantiles of the total play counts of each user:

| Quantile | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 0 | 0 | 1 | 3 | 4 | 6 | 9 | 14 | 21 | 38 | 1114 |

| Quantile | 90% | 91% | 92% | 93% | 94% | 95% | 96% | 97% | 98% | 99% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Value | 38 | 41 | 44 | 48 | 54 | 60 | 68 | 79 | 97 | 132 | 1114 |
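As a concrete illustration of the preprocessing described at the start of this section, the following R sketch keeps only songs with at least 100 listeners and reports the resulting proportion of non-missing entries; `plays` is a hypothetical name for the raw users-by-songs matrix, with NA for songs a user has never played.

```r
# Sketch of the filtering step: retain songs with at least `min_listeners`
# listeners and report the proportion of observed entries in the result.
filter_songs <- function(plays, min_listeners = 100) {
  n_listeners <- colSums(!is.na(plays))                      # listeners per song
  kept <- plays[, n_listeners >= min_listeners, drop = FALSE]
  prop_observed <- mean(!is.na(kept))                        # non-missing proportion
  message(sprintf("Kept %d songs; proportion observed: %.4f",
                  ncol(kept), prop_observed))
  kept
}
```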
We mention here one respect in which the data set does not conform exactly to the framework studied in this paper: we treat zero entries as missing data (as is very common in analyses of user-preference data sets). In practice, while a zero play count for a given song and user may indeed provide no indication of that user's level of preference for the song, it may instead reflect a dislike of it. To address this issue, following our main analysis we present a study of the robustness of our conclusions to different levels of true zeros in the data.
From point (a) above, we see that the distribution of play counts has an extremely heavy tail; in particular, the sample variances of the counts will be highly heterogeneous across songs. To guard against excessive influence from these outliers, we discretise the play counts into five interest levels as follows (a code sketch of this discretisation is given after the table):
| Play count | 1 | 2–3 | 4–6 | 7–10 |  |
|---|---|---|---|---|---|
| Level of interest | 1 | 2 | 3 | 4 | 5 |
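A minimal R sketch of this discretisation is given below. The final play-count bin is blank in the table above, so treating all counts above 10 as interest level 5 is an assumption.

```r
# Map raw play counts to interest levels 1-5; counts above 10 are assumed to
# fall in the top bin, and missing entries remain missing.
discretise_plays <- function(plays) {
  levels_mat <- matrix(
    as.integer(cut(plays, breaks = c(0, 1, 3, 6, 10, Inf), labels = 1:5)),
    nrow = nrow(plays), ncol = ncol(plays)
  )
  levels_mat[is.na(plays)] <- NA   # explicit: NAs in `plays` stay NA
  levels_mat
}
```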
We are now in a position to analyse the data using primePCA, noting that one of the attractions of estimating the principal eigenspaces in this setting (as opposed to matrix completion, for instance) is that it becomes straightforward to make recommendations to new users, without having to run the algorithm again from scratch. For and , let denote the level of interest of user in song , let and let . Our initial goal is to assess the top eigenvalues of to see if there is low-rank signal in . To this end, we first apply Algorithm 2 to obtain ; next, for each , we run Steps 2–5 of Algorithm 1 to obtain the estimated principal score , so that we can approximate by . This allows us to estimate by . Figure 4 displays the top eigenvalues of , which exhibit a fairly rapid decay, thereby providing evidence for the existence of low-rank signal in .
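The following R sketch illustrates the type of computation described above, though it is not the paper's Algorithm 1/2 code: given an estimate `Vhat` of the leading K-dimensional principal subspace (d x K, orthonormal columns) and the incomplete ratings matrix `Yobs` (n x d, with NAs), each user's observed entries are regressed onto the corresponding rows of `Vhat` to obtain scores, and the eigenvalues of the empirical score covariance are returned. Since `Vhat` has orthonormal columns, these coincide with the nonzero eigenvalues of the induced estimate of the covariance matrix.

```r
# Estimate principal scores by least squares on the observed coordinates, then
# return the eigenvalues of the empirical score covariance (illustrative sketch).
estimate_score_eigenvalues <- function(Yobs, Vhat) {
  n <- nrow(Yobs); K <- ncol(Vhat)
  U <- matrix(0, n, K)
  for (i in seq_len(n)) {
    o <- which(!is.na(Yobs[i, ]))
    if (length(o) >= K) {                          # rows with too few entries left at 0
      U[i, ] <- qr.solve(Vhat[o, , drop = FALSE], Yobs[i, o])  # least-squares scores
    }
  }
  S <- crossprod(U) / n                            # K x K score covariance
  eigen(S, symmetric = TRUE)$values                # eigenvalues, as plotted in Figure 4
}
```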

In the left panel of Figure 5, we present the estimate of the top two eigenvectors of the covariance matrix , with colours indicating the genre of each song. The outliers along the -axis of this plot are particularly interesting: they reveal songs that polarise opinion among users (see Table 5) and that best capture the variation in individuals' preferences for types of music, as measured by the first principal component. It is notable that Rock songs are overrepresented among the outliers (see Table 6), relative to, say, Country songs. Users who express a preference for particular songs are also more likely to enjoy songs that are nearby in the plot. Such information is therefore potentially commercially valuable, both as an efficient means of gauging users' preferences and for providing recommendations.
Figure 5: Plots of the first two principal components (left) and the associated scores (right).
| ID | Title | Artist | Genre |
|---|---|---|---|
| 1 | Your Hand In Mine | Explosions In The Sky | Rock |
| 2 | All These Things That I've Done | The Killers | Rock |
| 3 | Lady Marmalade | Christina Aguilera / Lil' Kim / Mya / Pink | Pop |
| 4 | Here It Goes Again | Ok Go | Rock |
| 5 | I Hate Pretending (Album Version) | Secret Machines | Rock |
| 6 | No Rain | Blind Melon | Rock |
| 7 | Comatose (Comes Alive Version) | Skillet | Rock |
| 8 | Life In Technicolor | Coldplay | Rock |
| 9 | New Soul | Yael Naïm | Pop |
| 10 | Blurry | Puddle Of Mudd | Rock |
| 11 | Give It Back | Polly Paulusma | Pop |
| 12 | Walking On The Moon | The Police | Rock |
| 13 | Face Down (Album Version) | The Red Jumpsuit Apparatus | Rock |
| 14 | Savior | Rise Against | Rock |
| 15 | Swing Swing | The All-American Rejects | Rock |
| 16 | Without Me | Eminem | Rap |
| 17 | Almaz | Randy Crawford | Pop |
| 18 | Hotel California | Eagles | Rock |
| 19 | Hey There Delilah | Plain White T's | Rock |
| 20 | Revelry | Kings Of Leon | Rock |
| 21 | Undo | Björk | Rock |
| 22 | You're The One | Dwight Yoakam | Country |
Genre distribution of the outliers (songs whose corresponding coordinate in the estimated leading principal component is of magnitude larger than 0.07)
|  | Rock | Pop | Electronic | Rap | Country | RnB | Latin | Others |
|---|---|---|---|---|---|---|---|---|
| Population (Total ) |  |  |  |  |  |  |  |  |
| Outliers (Total ) |  |  |  |  |  |  |  |  |
The right panel of Figure 5 presents the principal scores of the users, with frequent users (whose total song plays are in the top 10% of all users) in red and occasional users in blue. This plot reveals, for instance, that the second principal component is well aligned with general interest in the website. Returning to the left plot, we can now interpret a positive -coordinate for a particular song (which is the case for the large majority of songs) as being associated with an overall interest in the music provided by the site.
As discussed above, it may be the case that some of the entries that we have treated as missing in fact represent a user's aversion to a particular song. We therefore studied the robustness of our conclusions by replacing some of the missing entries with an interest level of 1 (i.e. the lowest level available). More precisely, for some , and independently for each user , we generated , and assigned an interest level of 1 to songs, chosen uniformly at random from those that this user had not previously heard through the site (a sketch of this experiment is given after Table 7). We then ran primePCA on this imputed dataset, obtaining estimators and of the two leading principal components. Denoting the original primePCA estimators for the two columns of by and , respectively, Table 7 reports the average of the inner product , where , based on 100 independent Monte Carlo experiments. Bearing in mind that the average absolute inner product between two independent random vectors chosen uniformly on is around 0.020, the table provides reassurance that our conclusions are robust to the treatment of missing entries.
Robustness assessment: average inner products (over 100 repetitions) between the top two eigenvectors obtained by running primePCA on the original data and on data with some of the missing entries imputed with an interest level of 1
|  |  |  |  |  |
|---|---|---|---|---|
| 0.05 |  |  |  |  |
| 0.1 |  |  |  |  |
| 0.2 |  |  |  |  |
Note: SEs are given in brackets.
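A minimal R sketch of the robustness experiment reported in Table 7 follows. The distribution of the per-user count of additionally imputed songs is elided in the text, so the Poisson choice below is a placeholder, and `estimate_top2` is a stand-in for whatever procedure (here, primePCA) returns the estimated top two principal components.

```r
# Impute interest level 1 at randomly chosen unheard songs for each user,
# re-estimate the top two principal components, and compare with the originals.
robustness_check <- function(levels_mat, estimate_top2, mean_extra = 5) {
  V0 <- estimate_top2(levels_mat)                  # original estimate (d x 2)
  imputed <- levels_mat
  for (i in seq_len(nrow(imputed))) {
    unheard <- which(is.na(imputed[i, ]))
    Ni <- min(rpois(1, mean_extra), length(unheard))   # placeholder distribution
    if (Ni > 0) {
      imputed[i, unheard[sample.int(length(unheard), Ni)]] <- 1  # lowest level
    }
  }
  V1 <- estimate_top2(imputed)                     # re-estimate on imputed data
  abs(colSums(V0 * V1))                            # |<v_j, v_j'>| for j = 1, 2
}
```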
6 DISCUSSION
Heterogeneous missingness is ubiquitous in contemporary, large-scale data sets, yet we currently understand very little about how existing procedures perform, or how they should be adapted, to cope with the challenges this presents. Here we attempt to distil the lessons learned from this study of high-dimensional PCA, in order to see how related ideas may be relevant in other statistical problems where one wishes to recover low-dimensional structure from data corrupted in a heterogeneous manner.
A key insight, as gleaned from Section 2.2, is that the way in which the heterogeneity interacts with the underlying structure of interest is crucial. In the worst case, the missingness may be constructed to conceal precisely the structure one seeks to uncover, thereby rendering the problem infeasible by any method. The only hope, then, in terms of providing theoretical guarantees, is to rule out such an adversarial interaction. This was achieved via our incoherence condition in Section 3, and we look forward to seeing how the relevant interactions between structure and heterogeneity can be controlled in other statistical problems such as those mentioned in the introduction. For instance, in sparse linear regression, one would anticipate that missingness of covariates with strong signal would be much more harmful than corresponding missingness for noise variables.
Our study also contributes to the broader understanding of the uses and limitations of spectral methods for estimating hidden low-dimensional structure in high-dimensional problems. We have seen that the OPW estimator is methodologically simple and, in the homogeneous missingness setting, achieves near-minimax optimality when the noise level is of constant order. Similar results have been obtained for spectral clustering for network community detection in stochastic block models (Rohe et al., 2011) and in low-rank-plus-sparse matrix estimation problems (Fan et al., 2013). On the other hand, the OPW estimator fails to provide exact recovery of the principal components in the noiseless setting. In the aforementioned problems, it has also been observed that refining an initial spectral estimator can enhance performance, particularly in high-SNR regimes (Gao et al., 2016; Zhang et al., 2018), as we were able to show for our primePCA algorithm. This suggests that such a refinement has the potential to confer a sharper dependence of the statistical error rate on the SNR than a vanilla spectral algorithm, and understanding this phenomenon in greater detail provides another interesting avenue for future research.
1 When , we have that is the sine of the acute angle between and . More generally, is the sum of the squares of the sines of the principal angles between the subspaces spanned by and .
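For concreteness, the quantity described in footnote 1 can be computed from the singular values of the cross-product of two orthonormal bases; the R sketch below returns the sum of squared sines of the principal angles (any normalisation used in the paper's loss is not applied here).

```r
# Sum of squared sines of the principal angles between the column spans of two
# d x k matrices with orthonormal columns (illustrative sketch).
sin_theta_loss <- function(V1, V2) {
  s <- svd(crossprod(V1, V2))$d        # cosines of the principal angles
  sum(1 - pmin(s, 1)^2)                # sum of squared sines
}
```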
2 In R, we set the random seed to be 2019 before generating .
3 To be completely precise, in our simulations, and were generated independently (and independently of all other randomness) and were drawn from Haar measure on ; however, these matrices were then fixed for every replication, so it is convenient to regard them as deterministic for the purposes of this description.
4 In fact, to try to improve the numerical stability of the EM procedure, we prevented the sample covariance estimators from exiting the cone of positive semi-definite matrices during the iterations, and took Moore–Penrose pseudoinverses, with eigenvalues below regarded as 0. Both of these modifications did indeed improve the algorithm, but some instability persisted. Moreover, use of the SWEEP operator (Beaton, 1964), which is designed to compute the Schur complement in a numerically stable way, failed to remedy the situation, yielding the same (increasing) log-likelihood trajectories as the vanilla algorithm.
5 Each iteration of the EM algorithm involves the inversion of matrices, where the dimension of the th such matrix is (i.e. ). Using standard matrix inversion algorithms, each iteration therefore has computational complexity of order , and moreover the number of iterations required for numerical convergence can be very large in higher dimensions. This meant that, even when , primePCA was nearly 50 times faster than the EM algorithm.
ACKNOWLEDGEMENTS
The authors thank the anonymous reviewers for their constructive feedback, which helped to improve the paper. Ziwei Zhu was supported by NSF grant DMS-2015366, Tengyao Wang was supported by EPSRC grant EP/T02772X/1 and Richard J. Samworth was supported by EPSRC grants EP/P031447/1 and EP/N031938/1, as well as ERC Advanced Grant 101019498.
REFERENCES
Author notes
Funding information Engineering and Physical Sciences Research Council, Grant/Award Numbers: EP/N031938/1; EP/P031447/1; EP/T02772X/1; H2020 European Research Council, Grant/Award Number: 101019498; NSF, Grant/Award Number: DMS-2015366