Abstract

We study the problem of high-dimensional Principal Component Analysis (PCA) with missing observations. In a simple, homogeneous observation model, we show that an existing observed-proportion weighted (OPW) estimator of the leading principal components can (nearly) attain the minimax optimal rate of convergence, which exhibits an interesting phase transition. However, deeper investigation reveals that, particularly in more realistic settings where the observation probabilities are heterogeneous, the empirical performance of the OPW estimator can be unsatisfactory; moreover, in the noiseless case, it fails to provide exact recovery of the principal components. Our main contribution, then, is to introduce a new method, which we call primePCA, that is designed to cope with situations where observations may be missing in a heterogeneous manner. Starting from the OPW estimator, primePCA iteratively projects the observed entries of the data matrix onto the column space of our current estimate to impute the missing entries, and then updates our estimate by computing the leading right singular space of the imputed data matrix. We prove that the error of primePCA converges to zero at a geometric rate in the noiseless case, and when the signal strength is not too small. An important feature of our theoretical guarantees is that they depend on average, as opposed to worst-case, properties of the missingness mechanism. Our numerical studies on both simulated and real data reveal that primePCA exhibits very encouraging performance across a wide range of scenarios, including settings where the data are not Missing Completely At Random.

1 INTRODUCTION

One of the ironies of working with Big Data is that missing data play an ever more significant role, and often present serious difficulties for analysis. For instance, a common approach to handling missing data is to perform a so-called complete-case analysis (Little & Rubin, 2019), where we restrict attention to individuals in our study with no missing attributes. When relatively few features are recorded for each individual, one can frequently expect a sufficiently large proportion of complete cases that, under an appropriate missing at random (MAR) hypothesis, a complete-case analysis may result in only a relatively small loss of efficiency. On the other hand, in high-dimensional regimes where there are many features of interest, there is often such a small proportion of complete cases that this approach becomes infeasible. As a very simple illustration of this phenomenon, imagine an n×d data matrix in which each entry is missing independently with probability 0.01. When d=5, a complete-case analysis would result in around 95% of the individuals (rows) being retained, but even when we reach d=300, only around 5% of rows will have no missing entries.
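The percentages quoted here follow from a one-line calculation: a row of d entries is complete with probability (1−0.01)^d. A quick check, under the stated missingness probability of 0.01:

```latex
% Probability that a row with d entries has no missing values, when each
% entry is missing independently with probability 0.01:
\mathbb{P}(\text{row complete}) = (1 - 0.01)^d, \qquad
0.99^{5} \approx 0.951, \qquad 0.99^{300} \approx 0.049.
```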

The inadequacy of the complete-case approach in many applications has motivated numerous methodological developments in the field of missing data over the past 60 years or so, including imputation (Ford, 1983; Rubin, 2004), factored likelihood (Anderson, 1957) and maximum likelihood approaches (Dempster et al., 1977); see, for example, Little & Rubin (2019) for an introduction to the area. Recent years have also witnessed increasing emphasis on understanding the performance of methods for dealing with missing data in a variety of high-dimensional problems, including sparse regression (Belloni et al., 2017; Loh & Wainwright, 2012), classification (Cai & Zhang, 2018), sparse principal component analysis (Elsener & van de Geer, 2018) and covariance and precision matrix estimation (Loh & Tan, 2018; Lounici, 2014).

In this paper, we study the effects of missing data in one of the canonical problems of high-dimensional data analysis, namely dimension reduction via Principal Component Analysis (PCA). This is closely related to the topic of matrix completion, which has received a great deal of attention in the literature over the last decade or so (Candès et al., 2011; Candès & Plan, 2010; Candès & Recht, 2009; Keshavan et al., 2010; Koltchinskii et al., 2011; Mazumder et al., 2010; Negahban & Wainwright, 2012) for example. There, the focus is typically on accurate recovery of the missing entries, subject to a low-rank assumption on the signal matrix; by contrast, our focus is on estimation of the principal eigenspaces. Previously proposed methods for low-dimensional PCA with missing data include non-linear iterative partial least squares (Wold & Lyttkens, 1969), iterative PCA (Josse & Husson, 2012; Kiers, 1997) and its regularised variant (Josse et al., 2009); see Dray & Josse (2015) for a nice survey and comparative study. More broadly, the R-miss-tastic website https://rmisstastic.netlify.com/ provides a valuable resource on methods for handling missing data.

The importance of the problem of high-dimensional PCA with missing data derives from its wide variety of applications. For instance, in many commercial settings, one may have a matrix of customers and products, with entries recording the number of purchases. Naturally, there will typically be a high proportion of missing entries. Nevertheless, PCA can be used to identify items that distinguish the preferences of customers particularly effectively, to make recommendations to users of products they might like and to summarise efficiently customers' preferences. Later, we will illustrate such an application, on the Million Song Dataset, where we are able to identify particular songs that have substantial discriminatory power for users' preferences as well as other interesting characteristics of the user database. Other potential application areas include health data, where one may seek features that best capture the variation in a population, and where the corresponding principal component scores may be used to cluster individuals into subgroups (that may, for instance, receive different treatment regimens).

To formalise the problem we consider, suppose that the (partially observed) n×d data matrix Y is of the form

Y = X + Z,   (1)

for independent random matrices X and Z, where X is a low-rank matrix and Z is a noise matrix with independent and identically distributed entries having zero mean. The low-rank property of X is encoded through the assumption that it is generated via

X = UV_K^⊤,   (2)

where V_K ∈ ℝ^{d×K} has orthonormal columns and U is a random n×K matrix (with n > K) having independent and identically distributed rows with mean zero and covariance matrix Σ_u. Note that when X and Z are independent, the covariance matrix of Y has a K-spiked structure; such covariance models have been studied extensively in both theory and applications (Cai et al., 2013; Fan et al., 2013; Johnstone & Lu, 2009; Paul, 2007).

We are interested in estimating the column space of V_K, denoted by Col(V_K), which is also the K-dimensional leading eigenspace of Σ_y := n^{-1}𝔼(Y^⊤Y). Cho et al. (2017) considered a different but related model where U in (2) is deterministic, and is not necessarily centred, so that V_K is the top K right singular space of 𝔼(Y). (By contrast, in our setting, 𝔼(Y) = 0, so the mean structure is uninformative for recovering V_K.) Their model can be viewed as being obtained from the model (1) and (2) by conditioning on U. In the context of a p-homogeneous Missing Completely At Random (MCAR) observation model, where each entry of Y is observed independently with probability p ∈ (0,1) (independently of Y), Cho et al. (2017) studied the estimation of Col(V_K) by Col(V̂_K), where V̂_K is a simple estimator formed as the top K eigenvectors of an observed-proportion weighted (OPW) version of the sample covariance matrix (here, the weighting is designed to achieve approximate unbiasedness). Our first contribution, in Section 2, is to provide a detailed, finite-sample analysis of this estimator in the model given by (1) and (2) together with a p-homogeneous MCAR missingness structure, with a noise level of constant order. The differences between the settings necessitate completely different arguments, and reveal in particular a new phenomenon in the form of a phase transition in the attainable risk bound for the sinΘ loss function, that is, the Frobenius norm of the diagonal matrix of the sines of the principal angles between V̂_K and V_K. Moreover, we also provide a minimax lower bound in the case of estimating a single principal component, which reveals that this estimator achieves the minimax optimal rate up to a poly-logarithmic factor.

While this appears to be a very encouraging story for the OPW estimator, it turns out that it is really only the starting point for a more complete understanding of high-dimensional PCA with missing data. For instance, in the noiseless case, the OPW estimator fails to provide exact recovery of the principal components. Moreover, it is the norm rather than the exception in applications that missingness is heterogeneous, in the sense that the probability of observing entries of Y varies (often significantly) across columns. For instance, in recommendation systems, some products will typically be more popular than others, and hence we observe more ratings in those columns. As another example, in meta-analyses of data from several studies, it is frequently the case that some covariates are common across all studies, while others appear only in a reduced proportion of them. In Section 2.2, we present an example to show that, even with an MCAR structure, PCA algorithms can break down entirely for such heterogeneous observation mechanisms when individual rows of VK can have large Euclidean norm. Intuitively, if we do not observe the interaction between the jth and kth columns of Y, then we cannot hope to estimate the jth or kth rows of VK, and this will cause substantial error if these rows of VK contain significant signal. This example illustrates that it is only possible to handle heterogeneous missingness in high-dimensional PCA with additional structure, and indicates that it is natural to measure the difficulty of the problem in terms of the incoherence among the entries of VK—that is, the maximum Euclidean norm of the rows of VK.

Our main contribution, then, is to propose a new, iterative algorithm, called primePCA (short for projected refinement for imputation of missing entries in Principal Component Analysis), in Section 3, to estimate VK, even with heterogeneous missingness. The main initialiser that we study for this algorithm is a modified version of the simple estimator discussed above, where the modification accounts for potential heterogeneity. Each iteration of primePCA projects the observed entries of Y onto the column space of the current estimate of VK to impute missing entries, and then updates our estimate of VK by computing the leading right singular space of the imputed data matrix. An illustration of the two steps of a single iteration of the primePCA algorithm in the case d=3, K=1 is given in Figure 1.

FIGURE 1
An illustration of the two steps of a single iteration of the primePCA algorithm with d=3 and K=1. Black dots represent fully observed data points, while vertical dotted lines that emanate from them give an indication of their x3 coordinate values, as well as their projections onto the x1-x2 plane. The x1 coordinate of the orange data point and the x2 coordinate of the blue data point are unobserved, so the true observations lie on the respective solid lines through those points (which are parallel to the relevant axes). Starting from an input estimate of V_K (left), given by the black arrow, we impute the missing coordinates as the closest points on the coloured lines to V_K (middle), and then obtain an updated estimate of V_K as the leading right singular vector of the imputed data matrix (right, with the old estimate in grey).

Our theoretical results reveal that in the noiseless setting, that is, Z=0, primePCA achieves exact recovery of the principal eigenspaces (with a geometric convergence rate) when the initial estimator is close to the truth and a sufficiently large proportion of the data are observed. Moreover, we also provide a performance guarantee for the initial estimator, showing that under appropriate conditions it satisfies the desired requirement with high probability, conditional on the observed missingness pattern. Code for our algorithm is available in the R package primePCA (Zhu et al., 2019).

To the best of our knowledge, primePCA is the first method for high-dimensional PCA that is designed to cope with settings where missingness is heterogeneous. Indeed, the previously mentioned works on high-dimensional PCA and other high-dimensional statistical problems with missing data have either focused on a uniform missingness setting or have imposed a lower bound on entrywise observation probabilities, which reduces to this uniform case. In particular, such results fail to distinguish in terms of the performance of their algorithms between a setting where one variable is observed with a very low probability p and all other variables are fully observed, and a setting where all variables are observed with probability p. A key contribution of our work is to account explicitly for the effect of a heterogeneous missingness mechanism, where the estimation error depends on average entrywise missingness rather than worst-case missingness; see the discussions after Theorem 4 and Proposition 2 below. In Section 4, the empirical performance of primePCA is compared with both that of the initialiser, and a popular method for matrix completion called softImpute (Hastie et al., 2015; Mazumder et al., 2010); we also discuss maximum likelihood approaches implemented via the Expectation–Maximisation (EM) algorithm, which can be used when the dimension is not too high. Our settings include a wide range of signal-to-noise ratios (SNRs), as well as Missing Completely At Random, MAR and Missing Not At Random (MNAR) examples (Little & Rubin, 2019; Seaman et al., 2013). These comparisons reveal that primePCA provides highly accurate and robust estimation of principal components, for instance outperforming the softImpute algorithm, even when the latter is allowed access to the oracle choice of regularisation parameter for each dataset. Our analysis of the Million Song Dataset is given in Section 5. In Section 6, we illustrate how some of the ideas in this work may be applied to other high-dimensional statistical problems involving missing data. Proofs of our main results are deferred to Section A in Appendix S1 (Zhu et al., 2019); auxiliary results and their proofs are given in Section B of Appendix S1.

1.1 Notation

For a positive integer T, we write [T] := {1, …, T}. For v = (v_1, …, v_d)^⊤ ∈ ℝ^d and p ∈ [1, ∞), we define ‖v‖_p := (∑_{j=1}^d |v_j|^p)^{1/p} and ‖v‖_∞ := max_{j∈[d]} |v_j|. We let S^{d−1} := {u ∈ ℝ^d : ‖u‖_2 = 1} denote the unit Euclidean sphere in ℝ^d.

Given u = (u_1, …, u_d)^⊤ ∈ ℝ^d, we write diag(u) ∈ ℝ^{d×d} for the diagonal matrix whose jth diagonal entry is u_j. We let 𝕆^{d_1×d_2} denote the set of matrices in ℝ^{d_1×d_2} with orthonormal columns. For a matrix A = (A_ij) ∈ ℝ^{d_1×d_2}, and p, q ∈ [1, ∞], we write ‖A‖_p := (∑_{i,j} |A_ij|^p)^{1/p} if 1 ≤ p < ∞ and ‖A‖_∞ := max_{i,j} |A_ij| for its entrywise ℓ_p norm, as well as ‖A‖_{p→q} := sup_{‖v‖_p=1} ‖Av‖_q for its p-to-q operator norm. We provide special notation for the (Euclidean) operator norm and the Frobenius norm by writing ‖A‖_op := ‖A‖_{2→2} and ‖A‖_F := ‖A‖_2 respectively. We also write σ_j(A) for the jth largest singular value of A, and define its nuclear norm by ‖A‖_* := ∑_{j=1}^{min(d_1,d_2)} σ_j(A). If S ⊆ [n], we write A_S ∈ ℝ^{|S|×d} for the matrix obtained by extracting the rows of A that are in S. For A, B ∈ ℝ^{d_1×d_2}, the Hadamard product of A and B, denoted A ∘ B, is defined such that (A ∘ B)_ij = A_ijB_ij for any i ∈ [d_1] and j ∈ [d_2].

2 THE OPW ESTIMATOR

In this section, we study a simple OPW estimator of the matrix of principal components. To define the estimator, let A_ij denote the event that the (i,j)th entry y_ij of Y is observed. We define the revelation matrix Ω = (ω_ij) ∈ {0,1}^{n×d} by ω_ij := 𝟙_{A_ij}, and the partially observed data matrix

Y_Ω := Y ∘ Ω.   (3)

Our observed data are the pair (Y_Ω, Ω). Importantly, the fact that we observe Ω allows us to distinguish between observed zeros and missing entries (even though these also appear as zeros in Y_Ω). We first consider the simplest possible case, which we refer to as the p-homogeneous observation model, where entries of the data matrix Y are observed independently and completely at random (i.e., independent of (U,Z)), each with probability p. Thus, ℙ(A_ij) = p ∈ (0,1) for all i ∈ [n], j ∈ [d], and A_ij and A_i′j′ are independent for (i,j) ≠ (i′,j′).

For i ∈ [n], let y_i and ω_i denote the ith rows of Y and Ω, respectively, and define ỹ_i := y_i ∘ ω_i. Writing P := 𝔼(ω_1ω_1^⊤) and W for its entrywise inverse, we have that under the p-homogeneous observation model, P = p²{1_d1_d^⊤ − (1 − p^{-1})I_d} and W = p^{-2}{1_d1_d^⊤ − (1 − p)I_d}. Following Lounici (2013, 2014) and Cho et al. (2017), we consider the following weighted sample covariance matrix:

G := (n^{-1} ∑_{i=1}^n ỹ_iỹ_i^⊤) ∘ W.

The reason for including the weight W is to ensure that 𝔼(G | Y) = n^{-1}Y^⊤Y, so that G is an unbiased estimator of Σ_y. Related ideas appear in the work of Cai & Zhang (2016) on high-dimensional covariance matrix estimation with missing data; see also Little & Rubin (2019, section 3.4). In practice, p is typically unknown and needs to be estimated. It is therefore natural to consider the following plug-in estimator Ĝ:

Ĝ := (n^{-1} ∑_{i=1}^n ỹ_iỹ_i^⊤) ∘ Ŵ,   (4)

where Ŵ = p̂^{-2}{1_d1_d^⊤ − (1 − p̂)I_d} and p̂ := (nd)^{-1}‖Ω‖_1 denotes the proportion of observed entries in Y. The OPW estimator of V_K, denoted V̂_K^OPW, is the d×K matrix formed from the top K eigenvectors of Ĝ.
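To make the construction concrete, the following minimal R sketch (ours, not the interface of the primePCA package) computes the OPW estimator from (Y_Ω, Ω) under the p-homogeneous model: it forms Ĝ as in (4) and extracts its top K eigenvectors.

```r
# A minimal R sketch of the OPW estimator under p-homogeneous missingness:
# form the weighted sample covariance matrix G-hat of (4) and take its
# top K eigenvectors.
opw_estimator <- function(Y_obs, Omega, K) {
  n <- nrow(Y_obs); d <- ncol(Y_obs)
  p_hat <- mean(Omega)                       # proportion of observed entries
  # Entrywise inverse of the estimated observation-probability matrix:
  # p_hat^{-2} off the diagonal, p_hat^{-1} on the diagonal.
  W_hat <- matrix(1 / p_hat^2, d, d)
  diag(W_hat) <- 1 / p_hat
  Y_tilde <- Y_obs * Omega                   # zero out unobserved entries
  G_hat <- (crossprod(Y_tilde) / n) * W_hat  # Hadamard-weighted covariance
  eigen(G_hat, symmetric = TRUE)$vectors[, seq_len(K), drop = FALSE]
}
```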

2.1 Theory for homogeneous missingness

We begin by studying the theoretical performance of V̂_K^OPW in a simple model that will allow us to reveal an interesting phase transition for the problem. For a random vector x taking values in ℝ^d and for r ≥ 1, we define its (Orlicz) ψ_r-norm and a version that is invariant to invertible affine transformations by

respectively. Recall that we say x is sub-Gaussian if ‖x‖_{ψ_2} < ∞.

In this preliminary section, we assume that (YΩ,Ω) is generated according to (1), (2) and (3), where:

  • (A1)

    U, Z and Ω are independent;

  • (A2)

    U has independent and identically distributed rows (ui:i[n]) with 𝔼u1=0 and u1ψ2τ;

  • (A3)

    Z=(zij)i[n],j[d] has independent and identically distributed entries with 𝔼z11=0, varz11=1 and z11ψ2τ;

  • (A4)

    y1j2ψ1M for all j[d];

  • (A5)

    Ω has independent Bern(p) entries.

Thus, (A1) ensures that the complete data matrix Y and the revelation matrix Ω are independent; in other words, for now we work in a Missing Completely At Random (MCAR) setting. In a homoscedastic noise model, there is no loss of generality (by a scaling argument) in assuming that each entry of Z has unit variance, as in (A3). In many places in this work, it will be convenient to think intuitively of τ and M in (A2)–(A4) as constants. In particular, if U has multivariate normal rows and Z has normal entries, then we can simply take τ = 1. For M, under the same normality assumptions, we have ‖y_{1j}²‖_{ψ_1} = var(y_{1j}), so this intuition amounts to thinking of the variance of each component of our data as being of constant order.

A natural measure of the performance of an estimator V̂_K of V_K is given by the Davis–Kahan sinΘ loss

L(V̂_K, V_K) := ‖sin Θ(V̂_K, V_K)‖_F

(Davis & Kahan, 1970). Our first theorem controls the risk of the OPW estimator; here and below, we write λ_k for the kth largest eigenvalue of Σ_u.

Theorem 1

Assume (A1)–(A5) and that n, d ≥ 2, dp ≥ 1. Write R := λ_1 + 1. Then there exists a universal constant C > 0 such that

(5)

In particular, if n ≥ d(log²d)(log²n)/(λ_1p + log d), then there exists C_{M,τ} > 0, depending only on M and τ, such that

(6)

Theorem 1 reveals an interesting phase transition phenomenon. Specifically, if the signal strength is large enough that λ_1 ≥ p^{-1}log d, then we should regard np as the effective sample size, as might intuitively be expected. On the other hand, if λ_1 < p^{-1}log d, then the estimation problem is considerably more difficult and the effective sample size is of order np². In fact, by inspecting the proof of Theorem 1, we see that in the high signal case, it is the difficulty of estimating the diagonal entries of Σ_y that drives the rate, while when the signal strength is low, the bottleneck is the challenge of estimating the off-diagonal entries. By comparing (6) with the minimax lower bound result in Theorem 2 below, we see that this phase transition phenomenon is an inherent feature of this estimation problem, rather than an artefact of the proof techniques we used to derive the upper bound.

The condition n ≥ d(log²d)(log²n)/(λ_1p + log d) in Theorem 1 is reasonable given the scaling requirement for consistency of the empirical eigenvectors (Johnstone & Lu, 2009; Shen et al., 2016; Wang & Fan, 2017). Indeed, Shen et al. (2016, Theorem 5.1) show that when λ_1 ≳ 1, the top eigenvector of the sample covariance matrix estimator is consistent if and only if d/(nλ_1) → 0. If we regard np as the effective sample size in our missing data PCA problem, then it is a sensible analogy to assume that d/(npλ_1) → 0 here, which implies that the condition n ≥ d(log²d)(log²n)/(λ_1p + log d) holds for large n, up to poly-logarithmic factors.

As mentioned in the introduction, Cho et al. (2017) considered the different but related problem of singular space estimation in a model in which Y = Θ + Z, where Θ is a matrix of the form UV_K^⊤ for a deterministic matrix U, whose rows are not necessarily centred. In this setting, V_K is the matrix of top K right singular vectors of Θ, and the same estimator V̂_K can be applied. An important distinction is that, when the rows of U are not centred and the entries of Θ are of comparable magnitude, ‖Θ‖_F is of order √(nd), so when K is regarded as a constant, it is natural to think of the singular values of Θ as also being of order √(nd). Indeed, this is assumed in Cho et al. (2017). On the other hand, in our model, where the rows of U have mean zero, assuming that the eigenvalues are of order √(nd) would amount to an extremely strong requirement, essentially restricting attention to very highly spiked covariance matrices. Removing this condition in Theorem 1 requires completely different arguments.

In order to state our minimax lower bound, we let 𝒫_{n,d}(λ_1, p) denote the class of distributions of pairs (Y_Ω, Ω) satisfying (A1), (A2), (A3) and (A5) with K = 1. Since we are now working with vectors instead of matrices, we write v in place of V_1.

Theorem 2

There exists a universal constant c > 0 such that

where the infimum is taken over all estimators v̂ = v̂(Y_Ω, Ω) of v.

Theorem 2 reveals that V̂_1^OPW in Theorem 1 achieves the minimax optimal rate of estimation up to a poly-logarithmic factor when M and τ are regarded as constants.

2.2 Heterogeneous observation mechanism

A key assumption of the theory in Section 2.1, which allowed even a very simple estimator to perform well, was that the missingness probability was homogeneous across the different entries of the matrix. On the other hand, the aim of this sub-section is to show that the situation changes dramatically once the data can be missing heterogeneously.

To this end, consider the following example. Suppose that ω is equal to (1,0,1,…,1)^⊤ or (0,1,1,…,1)^⊤ with equal probability, so that

In other words, for each i ∈ [n], we observe precisely one of the first two entries of y_i, together with all of the remaining (d−2) entries. Let Σ = I_d + αα^⊤, where α = (2^{-1/2}, 2^{-1/2}, 0, …, 0)^⊤ ∈ ℝ^d, and Σ′ = I_d + α′(α′)^⊤, where α′ = (2^{-1/2}, −2^{-1/2}, 0, …, 0)^⊤ ∈ ℝ^d. Suppose that y ∼ N_d(0, Σ) and let ỹ := y ∘ ω, and similarly assume that y′ ∼ N_d(0, Σ′) and set ỹ′ := y′ ∘ ω. Then (ỹ, ω) and (ỹ′, ω) are identically distributed. However, the leading eigenvectors of Σ and Σ′ are respectively α and α′, which are orthogonal!

Thus, it is impossible to simultaneously estimate consistently the leading eigenvector of both Σ and Σ′ from our observations. We note that it is the disproportionate weight of the first two coordinates in the leading eigenvector, combined with the failure to observe simultaneously the first two entries in the data, that makes the estimation problem intractable in this example. The understanding derived from this example motivates us to seek bounds on the error in high-dimensional PCA that depend on an incoherence parameter μ := (d/K)^{1/2}‖V_K‖_{2→∞} ∈ [1, (d/K)^{1/2}]. The intuition here is that the maximally incoherent case is where each column of V_K is a unit vector proportional to a vector whose entries are either 1 or −1, in which case ‖V_K‖_{2→∞} = (K/d)^{1/2} and μ = 1. On the other hand, in the worst case, when the columns of V_K are the first K standard basis vectors in ℝ^d, we have μ = (d/K)^{1/2}. Bounds involving incoherence have appeared previously in the literature on matrix completion (e.g., Candès & Plan, 2010; Keshavan et al., 2010), but for a different reason. There, the purpose is to control the principal angles between the true right singular space and the standard basis, which yields bounds on the number of observations required to infer the missing entries of the matrix. In our case, the incoherence condition controls the extent to which the loadings of the principal components of interest are concentrated in any single coordinate, and therefore the extent to which significant estimation error in a few components of the leading eigenvectors can affect the overall statistical performance. In the intractable example above, μ = (d/2)^{1/2}, and with such a large value of μ, heavy corruption from missingness in only a few entries spoils any chance of consistent estimation.
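For concreteness, here is a minimal R sketch (the function name is ours) of this incoherence parameter, computed from any matrix with orthonormal columns.

```r
# A minimal R sketch of the incoherence parameter mu:
# (d/K)^{1/2} times the maximum row-wise Euclidean norm of V_K.
incoherence <- function(V) {
  d <- nrow(V); K <- ncol(V)
  sqrt(d / K) * max(sqrt(rowSums(V^2)))
}
```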

3 OUR NEW ALGORITHM FOR PCA WITH MISSING ENTRIES

We are now in a position to introduce and analyse our iterative algorithm primePCA to estimate Col(V_K), the principal eigenspace of the covariance matrix Σ_y. The basic idea is to iterate between imputing the missing entries of the data matrix Y_Ω using a current (input) iterate V̂_K^(in), and then applying a singular value decomposition (SVD) to the completed data matrix. More precisely, for i ∈ [n], we let 𝒥_i denote the indices for which the corresponding entry of y_i is observed, and regress the observed data ỹ_{i,𝒥_i} = y_{i,𝒥_i} on (V̂_K^(in))_{𝒥_i} to obtain an estimate û_i of the ith row of U. This is natural in view of the data generating mechanism y_i = V_Ku_i + z_i. We then use ŷ_{i,𝒥_i^c} := (V̂_K^(in))_{𝒥_i^c}û_i to impute the missing values y_{i,𝒥_i^c}, retain the original observed entries as ŷ_{i,𝒥_i} := ỹ_{i,𝒥_i}, and set our next (output) iterate V̂_K^(out) to be the top K right singular vectors of the imputed matrix Ŷ := (ŷ_1, …, ŷ_n)^⊤. To motivate this final choice, observe that when Z = 0, we have rank(Y) = K; we therefore have the SVD Y = LΓR^⊤, where L ∈ 𝕆^{n×K}, R ∈ 𝕆^{d×K} and Γ ∈ ℝ^{K×K} is diagonal with positive diagonal entries. This means that R = V_KU^⊤LΓ^{-1}, so the column spaces of R and V_K coincide. For convenience, pseudocode of a single iteration of refinement in this algorithm is given in Algorithm 1.

Algorithm 1
refine(K, V̂_K^(in), Ω, Y_Ω), a single step of refinement of the current iterate V̂_K^(in)
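The following minimal R sketch (ours, not the primePCA package code) implements one such refinement step, assuming that every row has more than K observed entries and omitting the σ*-based screening discussed below.

```r
# A minimal R sketch of one refinement step (Algorithm 1), under the
# simplifying assumption that every row has more than K observed entries.
refine_step <- function(K, V_in, Omega, Y_obs) {
  n <- nrow(Y_obs); d <- ncol(Y_obs)
  Y_hat <- matrix(0, n, d)
  for (i in seq_len(n)) {
    J <- which(Omega[i, ] == 1)              # observed coordinates of row i
    # Regress the observed entries on the corresponding rows of V_in
    # to estimate the i-th row of U ...
    u_hat <- qr.solve(V_in[J, , drop = FALSE], Y_obs[i, J])
    y_hat <- as.vector(V_in %*% u_hat)       # ... and impute the full row,
    y_hat[J] <- Y_obs[i, J]                  # keeping the observed entries.
    Y_hat[i, ] <- y_hat
  }
  svd(Y_hat, nu = 0, nv = K)$v               # top K right singular vectors
}
```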

We now seek to provide formal justification for Algorithm 1. The recursive nature of the primePCA algorithm induces complex relationships between successive iterates, so to facilitate theoretical analysis, we will impose some conditions on the underlying data generating mechanism that may not hold in situations where we would like to apply the algorithm. Nevertheless, we believe that the analysis provides considerable insight into the performance of the primePCA algorithm, and these conditions are discussed extensively below; moreover, our simulations in Section 4 consider settings both within and outside the scope of our theory, and confirm its attractive and robust numerical performance.

In addition to the loss function L, it will be convenient to define a slightly different notion of distance between subspaces. For any V, Ṽ ∈ 𝕆^{d×K}, we let W_1DW_2^⊤ be an SVD of Ṽ^⊤V. The two-to-infinity distance between Ṽ and V is then defined to be

T(Ṽ, V) := ‖Ṽ − VW_2W_1^⊤‖_{2→∞}.

We remark that the definition of T(Ṽ, V) does not depend on our choice of SVD and that T(Ṽ, V) = T(ṼO_1, VO_2) for any O_1, O_2 ∈ 𝕆^{K×K}, so that T really represents a distance between the subspaces spanned by Ṽ and V. In fact, there is a sense in which the change-of-basis matrix W_2W_1^⊤ tries to align the columns of V as closely as possible with those of Ṽ; more precisely, if we change the norm from the two-to-infinity operator norm to the Frobenius norm, then W_2W_1^⊤ uniquely solves the so-called Procrustes problem (Schönemann, 1966):

W_2W_1^⊤ = argmin_{O∈𝕆^{K×K}} ‖VO − Ṽ‖_F.   (7)
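A minimal R sketch of this distance, under our reading of the definition above: align V to Ṽ by the Procrustes rotation W_2W_1^⊤ and take the largest row-wise Euclidean norm of the difference.

```r
# A minimal R sketch of the two-to-infinity distance T(V_tilde, V):
# align V to V_tilde by the Procrustes rotation W2 W1^T of (7), then take
# the maximum row-wise Euclidean norm of the difference.
two_to_inf_dist <- function(V_tilde, V) {
  s <- svd(t(V_tilde) %*% V)                 # SVD W1 D W2^T of V_tilde^T V
  O <- s$v %*% t(s$u)                        # W2 W1^T, the Procrustes solution
  max(sqrt(rowSums((V_tilde - V %*% O)^2)))  # ||V_tilde - V O||_{2 -> infinity}
}
```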

The following proposition considers the noiseless setting Z = 0, and shows that, for any estimator V̂_K^(in) that is close to V_K, a single iteration of refinement in Algorithm 1 contracts the two-to-infinity distance between their column spaces, under appropriate conditions. We define Ω^c := 1_n1_d^⊤ − Ω.

Proposition 1

Let V̂_K^(out) := refine(K, V̂_K^(in), Ω, Y_Ω) as in Algorithm 1 and further let Δ := T(V̂_K^(in), V_K). We assume that min_{i∈[n]} ‖ω_i‖_1 > K and that min_{i∈[n]} d^{1/2}σ_K((V̂_K^(in))_{𝒥_i})|𝒥_i|^{-1/2} ≥ 1/σ* > 0. Suppose that Z = 0 and that the SVD of Y is of the form LΓR^⊤, where ‖L‖_{2→∞} ≤ μ(K/n)^{1/2} and ‖R‖_{2→∞} ≤ μ(K/d)^{1/2} for some μ ≥ 1. Then there exist c_1, C > 0, depending only on σ*, such that whenever

  • (i)

    Δ ≤ c_1σ_K(Γ)/{K²μ⁴σ_1(Γ)√d},

  • (ii)

    ρ := CK²μ⁴σ_1(Γ)‖Ω^c‖_{1→1}/{σ_K(Γ)n} < 1,

we have that

In order to understand the main conditions of Proposition 1, it is instructive to consider the case K = 1, as was illustrated in Figure 1, and initially to think of μ as a constant. In that case, condition (i) asks that the absolute value of every component of the difference between the vectors V̂_1^(in) and V_1 is O(d^{-1/2}); for intuition, if two vectors are uniformly distributed on S^{d−1}, then each of their ℓ_∞ norms is O_p(d^{-1/2}log^{1/2}d); in other words, we only ask that the initialiser is very slightly better than a random guess. In condition (ii), ρ being less than 1 is equivalent to the proportion of missing data in each column being less than 1/(Cμ⁴) (where C again depends only on σ*), and the conclusion is that the refine step contracts the initial two-to-infinity distance from V_K by at least a factor of ρ. In the noiseless setting of Proposition 1, the matrix R of right singular vectors of Y has the same column span (and hence the same two-to-infinity norm) as V_K. We can therefore gain some intuition about the scale of μ by considering the situation where V_K is uniformly distributed on 𝕆^{d×K}, so in particular, the columns of V_K are uniformly distributed on S^{d−1}. By Vershynin (2018, Theorem 5.1.4), we deduce that ‖V_K‖_{2→∞} = O_p(√(K log d/d)). On the other hand, when the distribution of U is invariant under left multiplication by an orthogonal matrix (e.g. if U has independent and identically distributed Gaussian rows), then L is distributed uniformly on 𝕆^{n×K}. Arguing as above, we see that, with high probability, we may take μ ≲ max(log^{1/2}n, log^{1/2}d). This calculation suggests that we do not lose too much by thinking of μ as a constant (or at most, growing very slowly with n and d).

To apply Proposition 1, we also require conditions on min_{i∈[n]} ‖ω_i‖_1 and σ*. In practice, if either of these conditions is not satisfied, we first perform a screening step that restricts attention to a set of row indices for which the data contain sufficient information to estimate the K principal components. This screening step is explicitly accounted for in Algorithm 2, as well as in the theory that justifies it. An alternative would be to seek to weight rows according to their utility for principal component estimation, but it seems difficult to implement this in a principled way that facilitates formal justification.

Algorithm 2
primePCA, an iterative algorithm for estimating V_K given initialiser V̂_K^(0)

Algorithm 2 provides pseudocode for the iterative primePCA algorithm, given an initial estimator V^K(0). The iterations continue until either we hit the convergence threshold κ or the maximum iteration number niter. Theorem 3 below guarantees that, in the noiseless setting of Proposition 1, the primePCA estimator converges to VK at a geometric rate.
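The following minimal R sketch shows the main loop, using the refine_step and two_to_inf_dist sketches given earlier and omitting the screening step of Algorithm 2; the released package differs in its details.

```r
# A minimal R sketch of the primePCA iterations: repeatedly refine the
# current estimate until the change between iterates falls below kappa,
# or n_iter refinements have been performed. Screening of rows with too
# few observations (as in Algorithm 2) is omitted for brevity.
prime_pca_sketch <- function(K, V0, Omega, Y_obs, n_iter = 100, kappa = 1e-8) {
  V_cur <- V0
  for (t in seq_len(n_iter)) {
    V_new <- refine_step(K, V_cur, Omega, Y_obs)   # Algorithm 1 (sketch above)
    if (two_to_inf_dist(V_new, V_cur) < kappa) {   # convergence check
      return(V_new)
    }
    V_cur <- V_new
  }
  V_cur
}
```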

Theorem 3

For t ∈ [niter], let V̂_K^(t) be the tth iterate of Algorithm 2 with input K, V̂_K^(0), Ω ∈ {0,1}^{n×d}, Y_Ω ∈ ℝ^{n×d}, niter, σ* ∈ (0,∞) and κ = 0. Write Δ := T(V̂_K^(0), V_K), and let

where 𝒥_i := {j : ω_ij = 1}. Suppose that Z = 0 and that the SVD of Y is of the form LΓR^⊤, where ‖L‖_{2→∞} ≤ μ(K/n)^{1/2} and ‖R‖_{2→∞} ≤ μ(K/d)^{1/2} for some μ ≥ 1. Assume that

Then there exist c_1, C > 0, depending only on σ* and ε, such that whenever

  • (i)

    Δ ≤ c_1σ_K(Γ)/{K²μ⁴σ_1(Γ)√d},

  • (ii)

    ρ := CK²μ⁴σ_1(Γ)‖Ω^c‖_{1→1}/{σ_K(Γ)|ℐ|} < 1,

we have that for every t ∈ [niter],

The condition that ϵ>0 amounts to the very mild assumption that the algorithmic input σ is not exactly equal to any element of the set {|𝒥i|1/2σK((VK)𝒥i)d1/2:i[n],ωi1>K}, though the conditions on c1 and C become milder as ϵ increases.

3.1 Initialisation

Theorem 3 provides a general guarantee on the performance of primePCA, but relies on finding an initial estimator V^K(0) that is sufficiently close to the truth VK. The aim of this sub-section, then, is to propose a simple initialiser and show that it satisfies the requirement of Theorem 3 with high probability, conditional on the missingness pattern.

Consider the following modified weighted sample covariance matrix

G̃ := (n^{-1} ∑_{i=1}^n ỹ_iỹ_i^⊤) ∘ W̃,   (8)

where, for any j, k ∈ [d],

W̃_jk := n / ∑_{i=1}^n ω_ijω_ik.   (9)

Here, the matrix W̃ replaces Ŵ in (4) because, unlike in Section 2.1, we no longer wish to assume homogeneous missingness. We take as our initial estimator of V_K the matrix of top K eigenvectors of G̃, denoted Ṽ_K. Theorem 4 below studies the performance of this initialiser, in terms of its two-to-infinity norm error, and provides sufficient conditions for us to be able to apply Theorem 3. In particular, it ensures that the initialiser is reasonably well-aligned with the target V_K. We write ℙ_Ω and 𝔼_Ω for probabilities and expectations conditional on Ω.
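A minimal R sketch of this initialiser (ours, not the package interface), assuming as in Theorem 4 below that every pair of columns is observed together in at least one row:

```r
# A minimal R sketch of the initialiser of Section 3.1 for heterogeneous
# missingness: each (j,k) entry of the sample covariance is rescaled by
# the number of rows in which columns j and k are observed simultaneously.
init_estimator <- function(Y_obs, Omega, K) {
  Y_tilde <- Y_obs * Omega                # zero out unobserved entries
  counts <- crossprod(Omega)              # (j,k): number of co-observations;
                                          # assumed positive everywhere
  G_tilde <- crossprod(Y_tilde) / counts  # entrywise division, as in (8)-(9)
  eigen(G_tilde, symmetric = TRUE)$vectors[, seq_len(K), drop = FALSE]
}
```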

Theorem 4

Assume (A1)–(A4) and that n, d ≥ 2. Suppose further that ‖V_K‖_{2→∞} ≤ μ(K/d)^{1/2}, that ∑_{i=1}^n ω_ijω_ik > 0 for all j, k, and let R := λ_1 + 1. Then there exist c_{M,τ}, C_{M,τ} > 0, depending only on M and τ, such that for every ξ > 2, if

(10)

then

As a consequence, writing

where and c1are as in Theorem3, we have that

The first part of Theorem 4 provides a general probabilistic upper bound for T(Ṽ_K, V_K), after conditioning on the missingness pattern. This allows us, in the second part, to provide a guarantee on the probability with which Ṽ_K is a good enough initialiser for Theorem 3 to apply. For intuition regarding ℙ_Ω(A^c), consider the MCAR setting with p_jk := 𝔼(ω_1jω_1k) for j, k ∈ [d]. In that case, by Lemma 6, typical realisations of W̃ have ‖W̃‖_{∞→∞} ≤ 2max_{j∈[d]} ∑_{k∈[d]} p_jk^{-1} and ‖W̃‖_{2→∞} ≤ 2max_{j∈[d]} (∑_{k∈[d]} p_jk^{-2})^{1/2} when ∑_{j,k∈[d]} e^{-np_jk/8} is small. In particular, when n min_{j,k∈[d]} p_jk ≳ log d, we expect ℙ_Ω(A^c) to be small when λ_1 and λ_K are both of the same order, and grow faster than

As a special case, in the p-homogeneous model where p_jk = p²𝟙_{j≠k} + p𝟙_{j=k} for j, k ∈ [d], the requirement on λ_K above is that it should grow faster than max{(d²log d/(np²))^{1/3}, d log d/(np²)}.

One of the attractions of our analysis is the fact that we are able to provide bounds that only depend on entrywise missingness probabilities in an average sense, as opposed to worst-case missingness probabilities. The refinements conferred by such bounds are particularly important when the missingness mechanism is heterogeneous, as typically encountered in practice. The averaging of missingness probabilities can be partially seen in Theorem 4, since ‖W̃‖_{∞→∞} and ‖W̃‖_{2→∞} depend only on the ℓ_1 and ℓ_2 norms of each row of W̃, but is even more evident in the proposition below, which gives a probabilistic bound on the original sinΘ distance between Ṽ_K and V_K.

Proposition 2

Assume the same conditions as in Theorem 4. Then there exists a universal constant C > 0 such that for any ξ > 1, if

(11)

then

In this bound, we see that L(Ṽ_K, V_K) depends on W̃ only through the entrywise ℓ_1 and ℓ_2 norms of the whole matrix. Lemma 6 again provides probabilistic control of these norms under the p-homogeneous missingness mechanism. In general, if the rows of Ω are independent and identically distributed, but different covariates are missing with different probabilities, then the off-diagonal entries of W̃ will concentrate around the reciprocals of the simultaneous observation probabilities of pairs of covariates. As such, for a typical realisation of Ω, our bound in Proposition 2 depends only on the harmonic averages of these simultaneous observation probabilities and their squares. Such an averaging effect ensures that our method is effective in a much wider range of heterogeneous settings than previously allowed in the literature.

3.2 Weakening the missingness proportion condition for contraction

Theorem 3 provides a geometric contraction guarantee for the primePCA algorithm in the noiseless case. The price we pay for this strong conclusion, however, is a strong condition on the proportion of missingness that enters the contraction rate parameter ρ through ‖Ω^c‖_{1→1}; indeed, in an asymptotic framework where the incoherence parameter μ grows with the sample size and/or dimension, the proportion of missingness would need to vanish asymptotically. Therefore, to complement our earlier theory, we present below Proposition 3 and Corollary 1. Proposition 3 is an analogue of the deterministic Proposition 1 in that it demonstrates that a single iteration of the primePCA algorithm yields a contraction provided that the input V̂_K^(in) is sufficiently close to V_K. The two main differences are first that the contraction is in terms of a Procrustes-type loss (see the discussion around (7)), which turns out to be convenient for Corollary 1; and second, that the bound depends only on the incoherence of the matrix V_K, and not on the corresponding quantity for U.

Proposition 3

Let V̂_K^(in) ∈ 𝕆^{d×K}, let O := argmin_{Õ∈𝕆^{K×K}} ‖V̂_K^(in) − V_KÕ‖_F and let Ξ := V̂_K^(in) − V_KO. Fix U ∈ ℝ^{n×K} and V_K ∈ 𝕆^{d×K} with ‖V_K‖_{2→∞} ≤ μ(K/d)^{1/2}, and let Y := UV_K^⊤, with c := σ_K(Y)/‖Y‖_F. Suppose that κ_1, κ_2, κ_3 > 0 are such that for every i ∈ [n],

(12)

Assume further that

(13)

Then the output V̂_K^(out) := refine(K, V̂_K^(in), Ω, Y_Ω) of Algorithm 1 satisfies

where Ô := argmin_{Õ∈𝕆^{K×K}} ‖V̂_K^(out) − V_KÕ‖_F ∈ 𝕆^{K×K}.

Interestingly, the proof of Proposition 3 proceeds in a very different fashion from that of Proposition 1. The key step is to bound the discrepancy between the principal components of the imputed data matrix Y^ in Algorithm 1 and VK using a modified version of Wedin's theorem (Wang, 2016). To achieve the desired contraction rate, instead of viewing the true data matrix Y as the reference matrix when calculating the perturbation, we choose a different reference matrix Y˜ with the same top K right singular space as Y but which is closer to Y^ in terms of the Frobenius norm. Such a reference shift sharpens the eigenspace perturbation bound.

The contraction rate in Proposition 3 is a sum of three terms, the first two of which are small provided that ‖Ξ‖_op is small and d is large, respectively. On the other hand, the final term is small provided that no small subset of the rows of Ξ contributes excessively to its operator norm. For different missingness mechanisms, such a guarantee would need to be established probabilistically on a case-by-case basis; in Corollary 1 we illustrate how this can be done to achieve a high-probability contraction in the simplest missingness model. Importantly, the proportion of missingness allowed, and hence the contraction rate parameter, no longer depend on the incoherence of V_K, and can be of constant order.

Corollary 1

Consider the p-homogeneous MCAR setting. Fix U ∈ ℝ^{n×K} and V_K ∈ 𝕆^{d×K} with ‖V_K‖_{2→∞} ≤ μ(K/d)^{1/2}, and let Y := UV_K^⊤, with c := σ_K(Y)/‖Y‖_F. Suppose that V̂_K^(in), V̂_K^(out) ∈ 𝕆^{d×K}, O, Ô ∈ 𝕆^{K×K} and Ξ ∈ ℝ^{d×K} are as in Proposition 3, let C := ‖Ξ‖_op/‖Ξ‖_{2→∞}, and, fixing δ ∈ (0,1), suppose that

Suppose that dp ≥ 8log(3/δ). Then with probability at least 1 − δ, the output V̂_K^(out) := refine(K, V̂_K^(in), Ω, Y_Ω) of Algorithm 1 satisfies

To understand the conclusion of Corollary 1, it is instructive to consider the special case K = 1. Here, c = 1 and C is the ratio of the ℓ_2 and ℓ_∞ norms of the vector V̂_1^(in)sgn(V_1^⊤V̂_1^(in)) − V_1. When the entries of this vector are of comparable magnitude, C is therefore of order d^{1/2}, so the contraction rate is of order (1 − p)^{1/2} + d^{-1/2}.

3.3 Other missingness mechanisms

Another interesting aspect of our theory is that the guarantees provided in Theorem 3 are deterministic. Provided we start with a sufficiently good initialiser, Theorem 3 describes the way in which the performance of primePCA improves over iterations. An attraction of this approach is that it offers the potential to study the performance of primePCA under more general missingness mechanisms. For instance, one setting of considerable practical interest is the MAR model, which postulates that our data vector y = (y_1, …, y_d)^⊤ and observation indicator vector ω satisfy

(14)

for all ε = (ε_1, …, ε_d)^⊤ ∈ {0,1}^d and a = (a_1, …, a_d)^⊤ ∈ ℝ^d. In other words, the probability of seeing a particular missingness pattern only depends on the data vector through components of this vector that are observed. Thus, if we want to understand the performance of primePCA under different missingness mechanisms, such as specific MAR (or even MNAR) models, all we require is an analogue of Theorem 4 on the performance of the initialiser in these new missingness settings. Such results, however, are likely to be rather problem-specific in nature, and it can be that choosing an initialiser based on available information on the dependence between the observations and the missingness mechanism makes it easier to prove the desired performance guarantees.

We now provide an example to illustrate how such initialisers can be constructed and analysed. Consider an MAR setting where the missingness pattern depends on the data matrix only through a fully observed categorical variable. In this case, we can construct a variant of the OPW estimator, denoted V̂_K^OPWv, by modifying the weighted sample covariance matrix in (8) to condition on the fully observed covariate, and then take the leading eigenvectors of an appropriate average of these conditional weighted sample covariance matrices. Specifically, suppose that our data consist of independent and identically distributed copies (y_1, ω_1), …, (y_n, ω_n) of (y, ω) = (y_0, y_1, …, y_d, ω_0, ω_1, …, ω_d), where ω_0 = 1, where y_0 is a categorical random variable taking values in {1, …, L} and where (y_1, …, y_d)^⊤ | y_0 ∼ N_d(0, Σ_{y_0}) is independent of ω_j | y_0 ∼iid Bern(p_{y_0}) for all j ∈ [d]. Writing y_{−0} := (y_1, …, y_d)^⊤, ω_{−0} := (ω_1, …, ω_d)^⊤ and ỹ_{−0} := y_{−0} ∘ ω_{−0}, we have that Cov(y_0, y_{−0}) = 0, that is, Cov(y) is block diagonal. Thus, introducing the subscript i for our ith observation, as a starting point to construct V̂_K^OPWv, it is natural to consider an oracle estimator of Cov(y_{−0}), given by

where W_ℓ := p_ℓ^{-2}{1_d1_d^⊤ − (1 − p_ℓ)I_d}. Observe that we can write

G = ∑_{ℓ=1}^L (n_ℓ/n) G^(ℓ),

where n_ℓ := |{i : y_{i0} = ℓ}| and where G^(ℓ) := n_ℓ^{-1} ∑_{i: y_{i0}=ℓ} (ỹ_{i,−0}ỹ_{i,−0}^⊤) ∘ W_ℓ is the OPW estimator of Σ_ℓ based on the observations with y_{i0} = ℓ. Hence, G is unbiased for Cov(y_{−0}), because

In practice, when p_ℓ is unknown, we can estimate it by

and substitute this estimate into W_ℓ to obtain empirical estimators G̃^(ℓ) and G̃ of G^(ℓ) and G, respectively. Finally, V̂_K^OPWv can be obtained as the matrix of top K eigenvectors of the (d+1)×(d+1) matrix

To sketch the way to bound the sinΘ loss of such an initialiser, we can condition on y_{10}, …, y_{n0} and apply matrix Bernstein concentration inequalities similar to those in the proof of Theorem 1 to show that G̃^(ℓ) is close to Σ_ℓ for each ℓ. Simple binomial concentration bounds then allow us to combine these to control ‖G̃ − Cov(y_{−0})‖_op, and then apply a variant of the Davis–Kahan theorem to obtain a final result.
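As an illustration, here is a minimal R sketch of the d×d part of this construction (the function name is ours, and the embedding into the (d+1)×(d+1) matrix that includes y_0 itself is omitted): the OPW estimate is formed within each category and the results are averaged with weights n_ℓ/n.

```r
# A minimal R sketch of the category-conditional OPW construction described
# above. Y_obs and Omega contain only the d incompletely observed columns;
# y0 is the fully observed categorical covariate, coded 1,...,L.
opwv_cov <- function(Y_obs, Omega, y0) {
  n <- nrow(Y_obs); d <- ncol(Y_obs)
  G <- matrix(0, d, d)
  for (l in unique(y0)) {
    idx <- which(y0 == l)
    p_hat <- mean(Omega[idx, , drop = FALSE])       # within-category p estimate
    W_hat <- matrix(1 / p_hat^2, d, d); diag(W_hat) <- 1 / p_hat
    Y_tilde <- Y_obs[idx, , drop = FALSE] * Omega[idx, , drop = FALSE]
    G_l <- (crossprod(Y_tilde) / length(idx)) * W_hat
    G <- G + (length(idx) / n) * G_l                # weight by n_l / n
  }
  G
}
```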

While different initialisers can be designed and analysed theoretically in specific missingness settings, as shown in the example above, our empirical experience, nevertheless, is that regardless of the missingness mechanism, primePCA is extremely robust to the choice of initialiser. This is evident from the discussion of the performance of primePCA in MAR and MNAR settings given in Section 4.4.

4 SIMULATION STUDIES

In this section, we assess the empirical performance of primePCA, as proposed in Algorithm 2, with initialiser Ṽ_K from Section 3.1, and denote the output of this algorithm by V̂_K^prime. In Sections 4.1, 4.2 and 4.3, we generate observations according to the model described in (1), (2) and (3), where the rows of the matrix U are independent N_K(0, Σ_u) random vectors, for some positive semi-definite Σ_u ∈ ℝ^{K×K}. We further generate the observation indicator matrix Ω, independently of U and Z, and investigate the following four missingness mechanisms that represent different levels of heterogeneity:

  • (H1)

    Homogeneous: ℙ(ω_ij = 1) = 0.05 for all i ∈ [n], j ∈ [d];

  • (H2)

    Mildly heterogeneous: ℙ(ω_ij = 1) = P_iQ_j for i ∈ [n], j ∈ [d], where P_1, …, P_n ∼iid U[0, 0.2] and Q_1, …, Q_d ∼iid U[0.05, 0.95] independently;

  • (H3)

    Highly heterogeneous columns: ℙ(ω_ij = 1) = 0.19 for i ∈ [n] and all odd j ∈ [d], and ℙ(ω_ij = 1) = 0.01 for i ∈ [n] and all even j ∈ [d];

  • (H4)

    Highly heterogeneous rows: ℙ(ω_ij = 1) = 0.18 for j ∈ [d] and all odd i ∈ [n], and ℙ(ω_ij = 1) = 0.02 for j ∈ [d] and all even i ∈ [n].

In Sections 4.1, 4.2 and 4.3 below, we investigate primePCA in noiseless, noisy and misspecified settings, respectively. Section 4.4 is devoted to MAR and MNAR settings. In all cases, the average statistical error was estimated from 100 Monte Carlo repetitions of the experiment. For comparison, we also studied the softImpute algorithm (Hastie et al., 2015; Mazumder et al., 2010), which is considered to be state-of-the-art for matrix completion (Chi et al., 2018). This algorithm imputes the missing entries of Y by solving the following nuclear-norm-regularised optimisation problem:

Ŷ^soft ∈ argmin_{X∈ℝ^{n×d}} {(1/2)‖(Y − X) ∘ Ω‖_F² + λ‖X‖_*},

where λ > 0 is to be chosen by the practitioner. The softImpute estimator of V_K is then given by the matrix V̂_K^soft of top K right singular vectors of Ŷ^soft. In practice, the optimisation is carried out by representing X as AB^⊤, and performing alternating projections to update A ∈ ℝ^{n×K} and B ∈ ℝ^{d×K} iteratively. The fact that the softImpute algorithm was originally intended for matrix completion means that it treats the left and right singular vectors symmetrically, whereas the primePCA algorithm, which has the advantage of a clear geometric interpretation as exemplified in Figure 1, focuses on the target of inference in PCA, namely the leading right singular vectors.
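For intuition, the following minimal R sketch shows one soft-thresholded SVD step of the kind that underlies this objective (Mazumder et al., 2010); it is a sketch of the underlying proximal step, not the implementation used by the softImpute package.

```r
# A minimal R sketch of one soft-thresholded SVD step: fill the missing
# entries of Y with the current fit, compute an SVD, and soft-threshold
# the singular values at level lambda.
soft_impute_step <- function(Y_obs, Omega, X_cur, lambda) {
  Y_fill <- Y_obs * Omega + X_cur * (1 - Omega)   # impute with current fit
  s <- svd(Y_fill)
  dvals <- pmax(s$d - lambda, 0)                  # soft-threshold singular values
  s$u %*% (dvals * t(s$v))                        # reconstruct the new fit
}
```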

Figure 2 presents Monte Carlo estimates of 𝔼L(V̂_K^prime, V_K) for different choices of σ* in two different settings. The first uses the noiseless set-up of Section 4.1, together with missingness mechanism (H1); the second uses the noisy setting of Section 4.2 with parameter ν = 20 and missingness mechanism (H2). We see that the error barely changes when σ* varies within [2, 10]; very similar plots were obtained for different data generation and missingness mechanisms, though we omit these for brevity. For definiteness, we therefore fixed σ* = 3 throughout our simulation study.

FIGURE 2
Estimates of 𝔼L(V̂_K^prime, V_K) for various choices of σ* under (H1) in the noiseless setting of Section 4.1 (left) and (H2) in the noisy setting of Section 4.2 with ν=20 (right)

4.1 Noiseless case

In the noiseless setting, we let Z = 0, and also fix n = 2000, d = 500, K = 2 and Σ_u = 100I_2. We set

In Figure 3, we present the (natural) logarithm of the estimated average loss of primePCA and softImpute under (H1), (H2), (H3) and (H4). We set the range of the y-axis to be the same for each method to facilitate straightforward comparison. We see that the statistical error of primePCA decreases geometrically as the number of iterations increases, which confirms the conclusion of Theorem 3 in this noiseless setting. Moreover, after a moderate number of iterations, its performance is a substantial improvement on that of the softImpute algorithm, even if this latter algorithm is given access to an oracle choice of the regularisation parameter λ. The high statistical error of softImpute in these settings can be partly explained by the default value of the tuning parameter thresh in the softImpute package in R, namely 10^{-5}, which corresponds to the red curve in the right-hand panels of Figure 3. By reducing the values of thresh to 10^{-7} and 10^{-9}, corresponding to the green and blue curves in Figure 3, respectively, we were able to improve the performance of softImpute to some extent, though the statistical error is sensitive to the choice of the regularisation parameter λ. Moreover, even with the optimal choice of λ, it is not competitive with primePCA. Finally, we mention that for the 2000 iterations of setting (H2), primePCA took on average just under 10 min per repetition to compute, whereas the solution path of softImpute with thresh = 10^{-9} took around 36 min per repetition.

FIGURE 3
Logarithms of the average Frobenius norm sinΘ error of primePCA and softImpute under various heterogeneity levels of missingness in the absence of noise. The four rows of plots, from top to bottom, correspond to (H1), (H2), (H3) and (H4).

4.2 Noisy case

Here, we generate the rows of Z as independent N_d(0, I_d) random vectors, independent of all other data. We maintain the same choices of n, d, K and V_K as in Section 4.1, set Σ_u = ν²I_2 and vary ν > 0 to achieve different SNRs. In particular, defining SNR := Tr Cov(x_1)/Tr Cov(z_1), the choices ν = 10, 20, 40, 60 correspond to the very low, low, medium and high values SNR = 0.4, 1.6, 6.4, 14.4, respectively. For an additional comparison, we consider a variant of the softImpute algorithm called hardImpute (Mazumder et al., 2010), which retains only a fixed number of top singular values in each iteration of matrix imputation; this can be achieved by setting the argument λ in the softImpute function to be 0.
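These SNR values follow from a short calculation, using that V_K has orthonormal columns and that the noise has identity covariance in dimension d = 500:

```latex
\mathrm{SNR} := \frac{\operatorname{Tr}\operatorname{Cov}(x_1)}{\operatorname{Tr}\operatorname{Cov}(z_1)}
  = \frac{\operatorname{Tr}(V_K \Sigma_u V_K^\top)}{\operatorname{Tr}(I_d)}
  = \frac{2\nu^2}{500},
\qquad \nu \in \{10, 20, 40, 60\} \;\Rightarrow\; \mathrm{SNR} \in \{0.4,\, 1.6,\, 6.4,\, 14.4\}.
```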

To avoid confounding our study of the statistical performance of the softImpute algorithm with the choice of regularisation parameter λ, we gave the softImpute algorithm a particularly strong form of oracle choice of λ, namely where λ was chosen for each individual repetition of the experiment, so as to minimise the loss function. Naturally, such a choice is not available to the practitioner. Moreover, in order to ensure the range of λ was wide enough to include the best softImpute solution, we set the argument rank.max in that algorithm to be 20.

In Table 1, we report the statistical error of primePCA after 2000 iterations of refinement, together with the corresponding statistical errors of our initial estimator primePCA_init and those of softImpute(oracle) and hardImpute. Remarkably, primePCA exhibits stronger performance than these other methods across each of the SNR regimes and different missingness mechanisms. We also remark that hardImpute is inaccurate and unstable, because it might converge to a local optimum that is far from the truth.

TABLE 1
Average losses (with SEs in brackets) under (H1), (H2), (H3) and (H4)

                          ν=10            ν=20            ν=40            ν=60
(H1)  hardImpute          0.891 (0.005)   0.444 (0.001)   0.251 (0.001)   0.186 (0.0005)
      softImpute(oracle)  0.377 (0.0009)  0.186 (0.0004)  0.095 (0.0002)  0.064 (0.0002)
      primePCA_init       0.449 (0.001)   0.306 (0.001)   0.266 (0.001)   0.259 (0.001)
      primePCA            0.368 (0.001)   0.171 (0.0004)  0.084 (0.0002)  0.056 (0.0001)
(H2)  hardImpute          0.920 (0.006)   0.473 (0.001)   0.291 (0.001)   0.236 (0.001)
      softImpute(oracle)  0.519 (0.001)   0.308 (0.001)   0.185 (0.001)   0.141 (0.001)
      primePCA_init       0.549 (0.002)   0.399 (0.002)   0.357 (0.001)   0.349 (0.001)
      primePCA            0.475 (0.002)   0.232 (0.001)   0.115 (0.001)   0.077 (0.0005)
(H3)  hardImpute          0.792 (0.003)   0.479 (0.001)   0.385 (0.001)   0.427 (0.001)
      softImpute(oracle)  0.622 (0.002)   0.374 (0.001)   0.222 (0.001)   0.170 (0.001)
      primePCA_init       0.624 (0.002)   0.486 (0.001)   0.449 (0.001)   0.442 (0.001)
      primePCA            0.581 (0.002)   0.290 (0.001)   0.145 (0.001)   0.097 (0.0004)
(H4)  hardImpute          0.368 (0.001)   0.174 (0.0005)  0.089 (0.0003)  0.062 (0.0003)
      softImpute(oracle)  0.243 (0.0006)  0.121 (0.0002)  0.062 (0.0001)  0.042 (0.0001)
      primePCA_init       0.290 (0.0007)  0.203 (0.001)   0.175 (0.0005)  0.169 (0.0004)
      primePCA            0.238 (0.0006)  0.116 (0.0003)  0.058 (0.0002)  0.038 (0.0001)

4.3 Near low-rank case

In this sub-section, we set n = 2000, d = 500, K = 10, Σ_u = diag(2^{10}, 2^9, …, 2), and fixed V_K once for all experiments to be the top K eigenvectors of one realisation of the sample covariance matrix of n independent N_d(0, I_d) random vectors. Here d^{1/2}‖V_K‖_{2→∞} < 1.72, and we again generated the rows of Z as independent N_d(0, I_d) random vectors. Table 2 reports the average loss of estimating the top K̂ eigenvectors of Σ_y, where K̂ varies from 1 to 5. Interestingly, even in this misspecified setting, primePCA is competitive with the oracle version of softImpute.

TABLE 2
Average losses (with SEs in brackets) in the setting of Section 4.3 under (H1), (H2), (H3) and (H4)

                          K̂=1           K̂=2           K̂=3           K̂=4           K̂=5
(H1)  hardImpute          0.308 (0.002)  0.507 (0.002)  0.764 (0.004)  1.199 (0.006)  1.524 (0.004)
      softImpute(oracle)  0.107 (0.001)  0.182 (0.001)  0.275 (0.001)  0.401 (0.001)  0.596 (0.001)
      primePCA_init       0.203 (0.001)  0.345 (0.001)  0.554 (0.003)  1.074 (0.007)  1.427 (0.006)
      primePCA            0.141 (0.001)  0.200 (0.001)  0.269 (0.001)  0.374 (0.001)  0.580 (0.001)
(H2)  hardImpute          0.298 (0.002)  0.466 (0.002)  0.696 (0.003)  1.124 (0.006)  1.452 (0.004)
      softImpute(oracle)  0.188 (0.001)  0.283 (0.001)  0.410 (0.001)  0.562 (0.001)  0.751 (0.001)
      primePCA_init       0.285 (0.001)  0.443 (0.004)  0.757 (0.013)  1.201 (0.004)  1.533 (0.003)
      primePCA            0.190 (0.002)  0.267 (0.002)  0.368 (0.003)  0.543 (0.008)  0.797 (0.009)
(H3)  hardImpute          0.302 (0.001)  0.482 (0.002)  0.695 (0.002)  1.004 (0.006)  1.373 (0.004)
      softImpute(oracle)  0.206 (0.001)  0.338 (0.001)  0.492 (0.001)  0.664 (0.002)  0.878 (0.002)
      primePCA_init       0.341 (0.001)  0.528 (0.019)  1.097 (0.008)  1.306 (0.008)  1.597 (0.004)
      primePCA            0.222 (0.001)  0.330 (0.002)  0.452 (0.003)  0.641 (0.008)  0.919 (0.007)
(H4)  hardImpute          0.090 (0.001)  0.148 (0.001)  0.226 (0.001)  0.346 (0.002)  0.589 (0.007)
      softImpute(oracle)  0.071 (0.001)  0.112 (0.001)  0.164 (0.001)  0.233 (0.001)  0.332 (0.001)
      primePCA_init       0.139 (0.001)  0.220 (0.001)  0.325 (0.001)  0.475 (0.002)  0.805 (0.012)
      primePCA            0.098 (0.001)  0.135 (0.001)  0.176 (0.001)  0.236 (0.001)  0.328 (0.001)
K^=1K^=2K^=3K^=4K^=5
(H1)hardImpute0.308(0.002)0.507(0.002)0.764(0.004)1.199(0.006)1.524(0.004)
softImpute(oracle)0.107(0.001)0.182(0.001)0.275(0.001)0.401(0.001)0.596(0.001)
primePCA_init0.203(0.001)0.345(0.001)0.554(0.003)1.074(0.007)1.427(0.006)
primePCA0.141(0.001)0.200(0.001)0.269(0.001)0.374(0.001)0.580(0.001)
(H2)hardImpute0.298(0.002)0.466(0.002)0.696(0.003)1.124(0.006)1.452(0.004)
softImpute(oracle)0.188(0.001)0.283(0.001)0.410(0.001)0.562(0.001)0.751(0.001)
primePCA_init0.285(0.001)0.443(0.004)0.757(0.013)1.201(0.004)1.533(0.003)
primePCA0.190(0.002)0.267(0.002)0.368(0.003)0.543(0.008)0.797(0.009)
(H3)hardImpute0.302(0.001)0.482(0.002)0.695(0.002)1.004(0.006)1.373(0.004)
softImpute(oracle)0.206(0.001)0.338(0.001)0.492(0.001)0.664(0.002)0.878(0.002)
primePCA_init0.341(0.001)0.528(0.019)1.097(0.008)1.306(0.008)1.597(0.004)
primePCA0.222(0.001)0.330(0.002)0.452(0.003)0.641(0.008)0.919(0.007)
(H4)hardImpute0.090(0.001)0.148(0.001)0.226(0.001)0.3460.0020.589(0.007)
softImpute(oracle)0.071(0.001)0.112(0.001)0.164(0.001)0.233(0.001)0.332(0.001)
primePCA_init0.139(0.001)0.220(0.001)0.325(0.001)0.475(0.002)0.805(0.012)
primePCA0.098(0.001)0.135(0.001)0.176(0.001)0.236(0.001)0.328(0.001)
TABLE 2

Average losses (with SEs in brackets) in the setting of Section 4.3 under (H1), (H2), (H3) and (H4)

K^=1K^=2K^=3K^=4K^=5
(H1)hardImpute0.308(0.002)0.507(0.002)0.764(0.004)1.199(0.006)1.524(0.004)
softImpute(oracle)0.107(0.001)0.182(0.001)0.275(0.001)0.401(0.001)0.596(0.001)
primePCA_init0.203(0.001)0.345(0.001)0.554(0.003)1.074(0.007)1.427(0.006)
primePCA0.141(0.001)0.200(0.001)0.269(0.001)0.374(0.001)0.580(0.001)
(H2)hardImpute0.298(0.002)0.466(0.002)0.696(0.003)1.124(0.006)1.452(0.004)
softImpute(oracle)0.188(0.001)0.283(0.001)0.410(0.001)0.562(0.001)0.751(0.001)
primePCA_init0.285(0.001)0.443(0.004)0.757(0.013)1.201(0.004)1.533(0.003)
primePCA0.190(0.002)0.267(0.002)0.368(0.003)0.543(0.008)0.797(0.009)
(H3)hardImpute0.302(0.001)0.482(0.002)0.695(0.002)1.004(0.006)1.373(0.004)
softImpute(oracle)0.206(0.001)0.338(0.001)0.492(0.001)0.664(0.002)0.878(0.002)
primePCA_init0.341(0.001)0.528(0.019)1.097(0.008)1.306(0.008)1.597(0.004)
primePCA0.222(0.001)0.330(0.002)0.452(0.003)0.641(0.008)0.919(0.007)
(H4)hardImpute0.090(0.001)0.148(0.001)0.226(0.001)0.3460.0020.589(0.007)
softImpute(oracle)0.071(0.001)0.112(0.001)0.164(0.001)0.233(0.001)0.332(0.001)
primePCA_init0.139(0.001)0.220(0.001)0.325(0.001)0.475(0.002)0.805(0.012)
primePCA0.098(0.001)0.135(0.001)0.176(0.001)0.236(0.001)0.328(0.001)
K^=1K^=2K^=3K^=4K^=5
(H1)hardImpute0.308(0.002)0.507(0.002)0.764(0.004)1.199(0.006)1.524(0.004)
softImpute(oracle)0.107(0.001)0.182(0.001)0.275(0.001)0.401(0.001)0.596(0.001)
primePCA_init0.203(0.001)0.345(0.001)0.554(0.003)1.074(0.007)1.427(0.006)
primePCA0.141(0.001)0.200(0.001)0.269(0.001)0.374(0.001)0.580(0.001)
(H2)hardImpute0.298(0.002)0.466(0.002)0.696(0.003)1.124(0.006)1.452(0.004)
softImpute(oracle)0.188(0.001)0.283(0.001)0.410(0.001)0.562(0.001)0.751(0.001)
primePCA_init0.285(0.001)0.443(0.004)0.757(0.013)1.201(0.004)1.533(0.003)
primePCA0.190(0.002)0.267(0.002)0.368(0.003)0.543(0.008)0.797(0.009)
(H3)hardImpute0.302(0.001)0.482(0.002)0.695(0.002)1.004(0.006)1.373(0.004)
softImpute(oracle)0.206(0.001)0.338(0.001)0.492(0.001)0.664(0.002)0.878(0.002)
primePCA_init0.341(0.001)0.528(0.019)1.097(0.008)1.306(0.008)1.597(0.004)
primePCA0.222(0.001)0.330(0.002)0.452(0.003)0.641(0.008)0.919(0.007)
(H4)hardImpute0.090(0.001)0.148(0.001)0.226(0.001)0.3460.0020.589(0.007)
softImpute(oracle)0.071(0.001)0.112(0.001)0.164(0.001)0.233(0.001)0.332(0.001)
primePCA_init0.139(0.001)0.220(0.001)0.325(0.001)0.475(0.002)0.805(0.012)
primePCA0.098(0.001)0.135(0.001)0.176(0.001)0.236(0.001)0.328(0.001)

4.4 Other missingness mechanisms

Finally in this section, we investigate the performance of primePCA, as well as other alternative algorithms, in settings where the MCAR hypothesis is not satisfied. We consider two simulation frameworks to explore both MAR (see (14)) and MNAR mechanisms. In the first, we assume that missingness depends on the data matrix Y only through a fully observed covariate, as in the example in Section 3.3. Specifically, for some α ≥ 0, for K = 2, and for two matrices3 V_+, V_- ∈ 𝕆^{d×2}, the pair (y_1, ω_1) = (y_{10}, y_{11}, …, y_{1d}, ω_{10}, ω_{11}, …, ω_{1d}) is generated as follows:

(15)

The other rows of (Y, Ω) are taken to be independent copies of (y_1, ω_1). Thus, when α = 0, the matrices Y and Ω are independent, and we are in an MCAR setting; when α > 0, the data are MAR but not MCAR, and α measures the extent of the departure from the MCAR setting. The covariance matrix of y_1 is

In this example, we can construct a variant of the OPW estimator, which we call the OPWv estimator, by exploiting the fact that, conditional on the fully observed first column of Y, the data are MCAR. To do this, let

where G̃_+ and G̃_- are the weighted sample covariance matrices computed as in (8), based on the data (y_{ij}, ω_{ij})_{i: y_{i0}=1, j∈[d]} and (y_{ij}, ω_{ij})_{i: y_{i0}=-1, j∈[d]}, respectively. The OPWv estimator is the matrix of the first two eigenvectors of Σ̂^{OPWv}. Both the OPW and OPWv estimators are plausible initialisers for the primePCA algorithm.
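As a rough illustration of the idea (not the paper's exact construction, since the display above and the weighted covariance formula (8) are not reproduced here), the sketch below computes a pairwise-available-case covariance within each group defined by the first column and combines the two estimates with proportional weights; both the helper opw_cov and the proportional weighting are our own assumptions.

    # Stand-in for the weighted sample covariance in (8): average products of
    # entries over the rows in which both coordinates are observed (NA = missing).
    opw_cov <- function(Y) {
      Omega <- !is.na(Y)
      Y0 <- ifelse(Omega, Y, 0)
      pair_counts <- crossprod(Omega)        # rows with both entries observed
      crossprod(Y0) / pmax(pair_counts, 1)
    }

    # Class-conditional variant: estimate within each group given by the fully
    # observed covariate (coded +1 / -1), then combine and take two eigenvectors.
    opwv_estimator <- function(Y, group) {
      n <- nrow(Y)
      G_plus  <- opw_cov(Y[group ==  1, , drop = FALSE])
      G_minus <- opw_cov(Y[group == -1, , drop = FALSE])
      Sigma_hat <- (sum(group == 1) / n) * G_plus + (sum(group == -1) / n) * G_minus
      eigen(Sigma_hat, symmetric = TRUE)$vectors[, 1:2]
    }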

In low-dimensional settings, likelihood-based approaches, often implemented via an EM algorithm, are popular for handling MAR data (14) (Rubin, 1976). In Table 3, we compare the performance of primePCA in this setting with that of an EM algorithm derived from the suggestion in Little and Rubin (2019), section 11.3, and consider both the OPW and OPWv estimators as initialisers. We set n = 500, d ∈ {25, 50}, α ∈ {0.1, 0.5} and took K̂ = 2 for both the primePCA and the EM algorithms. From the table, we see that the OPWv estimator is able to exploit the group structure of the data to improve upon the OPW estimator, especially for the larger value of α. It is reassuring to find that the performance of primePCA is essentially unaffected by the choice of initialiser, and, remarkably, it outperforms the OPWv estimator, even though the latter has access to additional model structure information. The worse root mean squared error of the EM algorithm is mainly due to its numerical instability when performing Schur complement computations4.

TABLE 3

Root mean squared errors of the sinΘ loss function (with SEs in brackets) over 100 repetitions from the data-generating mechanism in (15) for the observed-proportion weighted (OPW) estimator, its class-weighted variant (OPWv), expectation–maximisation (EM) and primePCA, the latter two with either the OPW or the OPWv initialiser (indicated by the subscript v in the column headers)

d    α    OPW            OPWv           EM             EMv            primePCA       primePCAv
25   0.1  0.266 (0.005)  0.247 (0.004)  0.414 (0.045)  0.464 (0.053)  0.206 (0.004)  0.206 (0.004)
25   0.5  0.346 (0.009)  0.248 (0.005)  0.445 (0.047)  0.378 (0.056)  0.248 (0.014)  0.248 (0.008)
50   0.1  0.287 (0.003)  0.265 (0.003)  0.350 (0.032)  0.346 (0.032)  0.220 (0.002)  0.220 (0.002)
50   0.5  0.591 (0.025)  0.290 (0.003)  0.588 (0.033)  0.369 (0.030)  0.255 (0.005)  0.255 (0.005)

The second simulation framework is as follows. Let Σ := (min{j, k})_{j,k∈[d]} ∈ ℝ^{d×d} and let ξ = (ξ_{ij})_{i∈[n], j∈[d]} be a latent Bernoulli thinning matrix. The data matrix Y = (y_{ij})_{i∈[n], j∈[d]} and revelation matrix Ω = (ω_{ij})_{i∈[n], j∈[d]} are generated in such a way that Y and ξ are independent,

(16)

(with the usual convention for the maximum of the empty set). As usual, we observe (Y ∘ Ω, Ω). In other words, viewing each (y_{i1}, …, y_{id}) as a d-step standard Gaussian random walk, we observe Bernoulli-thinned paths of the process up to (and including) the hitting time of the threshold ±τ. We note that the observations satisfy the MAR hypothesis if and only if p = 1, and as p decreases from 1, the mechanism becomes increasingly distant from MAR, as we become increasingly likely to fail to observe the threshold hitting time. We take K = 1.
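The display in (16) is not reproduced above, but the mechanism can be simulated directly from this description. A minimal R sketch, assuming that the hitting time is the first index j with |y_{ij}| ≥ τ and that the whole (thinned) path is observed when the walk never reaches ±τ within d steps:

    set.seed(1)
    n <- 500; d <- 100; tau <- sqrt(d); p <- 0.5

    # Rows of Y are d-step standard Gaussian random walks, i.e. N_d(0, Sigma)
    # with Sigma_{jk} = min(j, k).
    Y <- t(apply(matrix(rnorm(n * d), n, d), 1, cumsum))

    # Bernoulli(p) thinning, applied only up to (and including) the hitting time.
    xi <- matrix(rbinom(n * d, 1, p), n, d)
    hit <- apply(abs(Y) >= tau, 1, function(z) if (any(z)) which(z)[1] else d)
    Omega <- xi * outer(hit, 1:d, ">=")       # omega_ij = xi_ij * 1{j <= hitting time}
    Yobs <- ifelse(Omega == 1, Y, NA)         # we observe (Y o Omega, Omega)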

In Table 4, we compare the performance of primePCA with that of the EM algorithm, and in both cases, we can initialise with either the OPW estimator or a mean-imputation estimator, obtained by imputing all missing entries by their respective population column means. We set n = 500, d = 100, τ = d^{1/2}, took K̂ = 1 for both primePCA and the EM algorithm, and took p ∈ {0.25, 0.5, 0.75, 1}. From the table, we see that primePCA outperforms the EM algorithm except in the MAR case where p = 1, which is tailor-made for the likelihood-based EM approach. In fact, primePCA is highly robust statistically and stable computationally, performing consistently well across the different missingness settings and initialisers. On the other hand, the EM algorithm exhibits a much heavier dependence on the initialiser: its statistical performance suffers when initialised with the poorer mean-imputation estimator, and it runs into numerical instability issues when initialised with the OPW estimator in the MNAR settings. We found that these instability issues are exacerbated in higher dimensions, and moreover, that the EM algorithm quickly becomes computationally infeasible5. This explains why we did not run the EM algorithm on the larger-scale problems in Sections 4.1, 4.2 and 4.3, as well as the real data example in Section 5 below.

TABLE 4

Root mean squared errors of the sinΘ loss function (with SEs in brackets) over 100 repetitions from the data-generating mechanism in (16) for the mean-imputation (MI) estimator, the observed-proportion weighted (OPW) estimator, expectation–maximisation (EM) and primePCA, the latter two with either the MI or the OPW initialiser (distinguished by subscripts in the table header)

p     MI             OPW            EM_MI          EM_OPW         primePCA_MI    primePCA_OPW
1     0.548 (0.004)  0.282 (0.004)  0.086 (0.003)  0.056 (0.002)  0.096 (0.002)  0.096 (0.002)
0.75  0.551 (0.004)  0.285 (0.004)  0.117 (0.004)  0.353 (0.041)  0.097 (0.002)  0.097 (0.002)
0.5   0.557 (0.005)  0.290 (0.004)  0.186 (0.025)  0.944 (0.013)  0.100 (0.002)  0.100 (0.002)
0.25  0.575 (0.005)  0.309 (0.005)  0.228 (0.005)  0.989 (0.001)  0.112 (0.002)  0.123 (0.006)

5 REAL DATA ANALYSIS: MILLION SONG DATASET

We apply primePCA to a subset of the Million Song Dataset6 to analyse music preferences. The original data can be expressed as a matrix with 110,000 users (rows) and 163,206 songs (columns), with entries representing the number of times a song was played by a particular user. The proportion of non-missing entries in the matrix is 0.008%. Since the matrix is very sparse, and since most songs have very few listeners, we enhance the SNR by restricting our attention to songs that have at least 100 listeners (1777 songs in total). This improves the proportion of non-missing entries to 0.23%. Further summary information about the filtered data is provided below:

(a) Quantiles of non-missing matrix entry values:

     0%    10%   20%   30%   40%   50%   60%   70%   80%    90%    100%
     1     1     1     1     1     1     2     3     5      8      500

     90%   91%   92%   93%   94%   95%   96%   97%   98%    99%    100%
     8     9     9     10    11    13    15    18    23     33     500

(b) Quantiles of the number of listeners for each song:

     0%    10%   20%   30%   40%   50%   60%   70%   80%    90%    100%
     100   108   117   126   139   154   178   214   272.8  455.6  5043

(c) Quantiles of the total play counts of each user:

     0%    10%   20%   30%   40%   50%   60%   70%   80%    90%    100%
     0     0     1     3     4     6     9     14    21     38     1114

     90%   91%   92%   93%   94%   95%   96%   97%   98%    99%    100%
     38    41    44    48    54    60    68    79    97     132    1114

We mention here a respect in which the data set does not conform exactly to the framework studied in the paper, namely that we treat zero entries as missing data (this is very common for analyses of user-preference data sets). In practice, while it may indeed be the case that a zero play count for song j by user i provides no indication of their level of preference for that song, it may also be the case that it reflects a dislike of that song. To address this issue, following our main analysis we will present a study of the robustness of our conclusions to different levels of true zeros in the data.

From point (a) above, we see that the distribution of play counts has an extremely heavy tail, and in particular the sample variances of the counts will be highly heterogeneous across songs. To guard against excessive influence from the outliers, we discretise the play counts into five interest levels as follows:

Play count           1    2–3   4–6   7–10   ≥11
Level of interest    1    2     3     4      5
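This binning can be implemented, for example, with base R's cut(); the following is a small illustrative helper of our own (play_counts stands for a vector of raw counts):

    # Map raw play counts to the five interest levels in the table above.
    to_interest_level <- function(play_counts) {
      cut(play_counts,
          breaks = c(0, 1, 3, 6, 10, Inf),   # bins: 1, 2-3, 4-6, 7-10, >= 11
          labels = 1:5, right = TRUE)        # right-closed intervals (0,1], (1,3], ...
    }

    to_interest_level(c(1, 2, 5, 9, 42))     # levels 1, 2, 3, 4, 5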

We are now in a position to analyse the data using primePCA, noting that one of the attractions of estimating the principal eigenspaces in this setting (as opposed to matrix completion, for instance) is that it becomes straightforward to make recommendations to new users, instead of having to run the algorithm again from scratch. For i = 1, …, n = 110,000 and j = 1, …, d = 1,777, let y_{ij} ∈ {1, …, 5} denote the level of interest of user i in song j, let K̂ = 10 and let ℐ := {i : ‖ω_i‖_1 > K̂}. Our initial goal is to assess the top K̂ eigenvalues of Σ_y to see if there is low-rank signal in Y = (y_{ij}). To this end, we first apply Algorithm 2 to obtain V̂_K̂^prime; next, for each i ∈ ℐ, we run Steps 2–5 of Algorithm 1 to obtain the estimated principal score û_i, so that we can approximate y_i by ŷ_i = V̂_K̂^prime û_i. This allows us to estimate Σ_y by Σ̂_y = n^{-1} ∑_{i∈ℐ} ŷ_i ŷ_iᵀ. Figure 4 displays the top K̂ eigenvalues of Σ̂_y, which exhibit a fairly rapid decay, thereby providing evidence for the existence of low-rank signal in Y.
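Algorithms 1 and 2 are not restated in this section; purely as an illustration of the score-and-covariance step just described, the sketch below assumes that each principal score is obtained by least-squares regression of the observed entries of a row on the corresponding rows of the current estimate V̂ (our reading of Steps 2–5), and then forms Σ̂_y:

    # Rough sketch: Yobs is the n x d interest-level matrix with NAs for missing
    # entries, and V_hat is a d x K matrix with orthonormal columns.
    estimate_sigma_y <- function(Yobs, V_hat) {
      n <- nrow(Yobs); d <- ncol(Yobs); K <- ncol(V_hat)
      S <- matrix(0, d, d)
      for (i in seq_len(n)) {
        obs <- which(!is.na(Yobs[i, ]))
        if (length(obs) <= K) next                          # skip rows outside the index set
        u_hat <- qr.solve(V_hat[obs, , drop = FALSE],       # least-squares projection of the
                          Yobs[i, obs])                     # observed entries onto span(V_hat)
        y_hat <- as.vector(V_hat %*% u_hat)                 # approximated complete row
        S <- S + tcrossprod(y_hat)
      }
      S / n
    }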

FIGURE 4

Leading eigenvalues of Σ̂_y

In the left panel of Figure 5, we present the estimate V̂_2^prime of the top two eigenvectors of the covariance matrix Σ_y, with colours indicating the genre of each song. The outliers along the x-axis of this plot are particularly interesting: they reveal songs that polarise opinion among users (see Table 5) and that best capture the variation in individuals' preferences for types of music, as measured by the first principal component. It is notable that Rock songs are overrepresented among the outliers (see Table 6), relative to, say, Country songs. Users who express a preference for particular songs are also more likely to enjoy songs that are nearby in the plot. Such information is therefore potentially commercially valuable, both as an efficient means of gauging users' preferences and for providing recommendations.

FIGURE 5

Plots of the first two principal components V̂_2^prime (left) and the associated scores {û_i}_{i=1}^n (right)

TABLE 5

Titles, artists and genres of the 22 outlier songs in Figure 5

ID  Title                               Artist                                       Genre
1   Your Hand In Mine                   Explosions In The Sky                        Rock
2   All These Things That I've Done     The Killers                                  Rock
3   Lady Marmalade                      Christina Aguilera / Lil' Kim / Mya / Pink   Pop
4   Here It Goes Again                  Ok Go                                        Rock
5   I Hate Pretending (Album Version)   Secret Machines                              Rock
6   No Rain                             Blind Melon                                  Rock
7   Comatose (Comes Alive Version)      Skillet                                      Rock
8   Life In Technicolor                 Coldplay                                     Rock
9   New Soul                            Yael Naïm                                    Pop
10  Blurry                              Puddle Of Mudd                               Rock
11  Give It Back                        Polly Paulusma                               Pop
12  Walking On The Moon                 The Police                                   Rock
13  Face Down (Album Version)           The Red Jumpsuit Apparatus                   Rock
14  Savior                              Rise Against                                 Rock
15  Swing Swing                         The All-American Rejects                     Rock
16  Without Me                          Eminem                                       Rap
17  Almaz                               Randy Crawford                               Pop
18  Hotel California                    Eagles                                       Rock
19  Hey There Delilah                   Plain White T's                              Rock
20  Revelry                             Kings Of Leon                                Rock
21  Undo                                Björk                                        Rock
22  You're The One                      Dwight Yoakam                                Country
TABLE 6

Genre distribution of the outliers (songs whose corresponding coordinate in the estimated leading principal component is of magnitude larger than 0.07)

                            Rock     Pop      Electronic  Rap     Country  RnB     Latin   Others
Population (Total = 1,777)  48.92%   18.53%   9.12%       7.15%   4.33%    2.35%   2.26%   7.34%
Outliers (Total = 22)       72.73%   18.18%   0%          4.54%   4.54%    0%      0%      0%

The right panel of Figure 5 presents the principal scores {û_i}_{i=1}^n of the users, with frequent users (whose total song plays are in the top 10% of all users) in red and occasional users in blue. This plot reveals, for instance, that the second principal component is well aligned with general interest in the website. Returning to the left plot, we can now interpret a positive y-coordinate for a particular song (which is the case for the large majority of songs) as being associated with an overall interest in the music provided by the site.

As discussed above, it may be the case that some of the entries that we have treated as missing in fact represent a user's aversion to a particular song. We therefore studied the robustness of our conclusions by replacing some of the missing entries with an interest level of 1 (i.e. the lowest level available). More precisely, for some α ∈ {0.05, 0.1, 0.2}, and independently for each user i ∈ [n], we generated R_i ∼ Poisson(α‖ω_i‖_1), and assigned an interest level of 1 to R_i uniformly randomly chosen songs that this user had not previously heard through the site. We then ran primePCA on this imputed dataset, obtaining estimators ṽ_1 and ṽ_2 of the two leading principal components. Denoting the original primePCA estimators of the two columns of V̂_2 by v̂_1 and v̂_2, respectively, Table 7 reports the average of the inner product ⟨v̂_j, ṽ_k⟩ for j, k ∈ {1, 2}, based on 100 independent Monte Carlo experiments. Bearing in mind that the average absolute inner product between two independent random vectors chosen uniformly on the unit sphere S^{1776} is around 0.020, this table reassures us that our conclusions are robust to the treatment of the missing entries.
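As an illustration, this perturbation step can be carried out as follows (a sketch of our own; Yobs denotes the n × d interest-level matrix with NAs for unobserved entries, and the Poisson rate α times the number of entries observed for user i follows the description above):

    perturb_with_dislikes <- function(Yobs, alpha) {
      n <- nrow(Yobs); d <- ncol(Yobs)
      for (i in seq_len(n)) {
        unheard <- which(is.na(Yobs[i, ]))                     # songs user i has not played
        n_obs <- d - length(unheard)                           # number of observed entries
        R_i <- rpois(1, alpha * n_obs)                         # Poisson number of imputed dislikes
        picks <- unheard[sample.int(length(unheard),           # uniformly chosen unheard songs
                                    min(R_i, length(unheard)))]
        Yobs[i, picks] <- 1                                    # assign the lowest interest level
      }
      Yobs
    }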

TABLE 7

Robustness assessment: average inner products (over 100 repetitions) between the top two eigenvectors obtained by running primePCA on the original data and with some of the missing entries imputed with an interest level of 1

α      ⟨v̂_1, ṽ_1⟩     ⟨v̂_1, ṽ_2⟩     ⟨v̂_2, ṽ_1⟩     ⟨v̂_2, ṽ_2⟩
0.05   0.816 (0.018)   0.042 (0.007)   0.012 (0.007)   0.910 (0.002)
0.1    0.756 (0.018)   0.027 (0.007)   0.070 (0.008)   0.893 (0.002)
0.2    0.546 (0.025)   0.067 (0.010)   0.085 (0.010)   0.859 (0.002)

Note: SEs are given in brackets.

6 DISCUSSION

Heterogeneous missingness is ubiquitous in contemporary, large-scale data sets, yet we currently understand very little about how existing procedures perform or should be adapted to cope with the challenges this presents. Here we attempt to extract the lessons learned from this study of high-dimensional PCA, in order to see how related ideas may be relevant in other statistical problems where one wishes to recover low-dimensional structure with data corrupted in a heterogeneous manner.

A key insight, as gleaned from Section 2.2, is that the way in which the heterogeneity interacts with the underlying structure of interest is crucial. In the worst case, the missingness may be constructed to conceal precisely the structure one seeks to uncover, thereby rendering the problem infeasible by any method. The only hope, then, in terms of providing theoretical guarantees, is to rule out such an adversarial interaction. This was achieved via our incoherence condition in Section 3, and we look forward to seeing how the relevant interactions between structure and heterogeneity can be controlled in other statistical problems such as those mentioned in the introduction. For instance, in sparse linear regression, one would anticipate that missingness of covariates with strong signal would be much more harmful than corresponding missingness for noise variables.

Our study also contributes to the broader understanding of the uses and limitations of spectral methods for estimating hidden low-dimensional structures in high-dimensional problems. We have seen that the OPW estimator is both methodologically simple and, in the homogeneous missingness setting, achieves near-minimax optimality when the noise level is of constant order. Similar results have been obtained for spectral clustering for network community detection in stochastic block models (Rohe et al., 2011) and in low-rank-plus-sparse matrix estimation problems (Fan et al., 2013). On the other hand, the OPW estimator fails to provide exact recovery of the principal components in the noiseless setting. In these aforementioned problems, it has also been observed that refinement of an initial spectral estimator can enhance performance, particularly in high SNR regimes (Gao et al., 2016; Zhang et al., 2018), as we were able to show for our primePCA algorithm. This suggests that such a refinement has the potential to confer a sharper dependence of the statistical error rate on the SNR compared with a vanilla spectral algorithm, and understanding this phenomenon in greater detail provides another interesting avenue for future research.

1 When K = 1, we have that L(V̂_1, V_1) is the sine of the acute angle between V̂_1 and V_1. More generally, L²(V̂_K, V_K) is the sum of the squares of the sines of the principal angles between the subspaces spanned by V̂_K and V_K.
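For reference, this loss is easy to compute numerically, since the cosines of the principal angles are the singular values of V̂_Kᵀ V_K. A small helper of our own (not part of the primePCA package):

    # Frobenius sin-Theta loss between two d x K matrices with orthonormal columns.
    sin_theta_loss <- function(V_hat, V) {
      s <- svd(crossprod(V_hat, V))$d      # cosines of the principal angles
      sqrt(sum(pmax(1 - s^2, 0)))          # square root of the sum of squared sines
    }

    # For K = 1 this reduces to the sine of the acute angle between the two vectors:
    v1 <- matrix(c(1, 0)); v2 <- matrix(c(cos(pi/6), sin(pi/6)))
    sin_theta_loss(v1, v2)                 # = sin(pi/6) = 0.5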

2 In R, we set the random seed to be 2019 before generating VK.

3 To be completely precise, in our simulations, V_+ and V_- were generated independently (and independently of all other randomness) and were drawn from Haar measure on 𝕆^{d×2}; however, these matrices were then fixed for every replication, so it is convenient to regard them as deterministic for the purposes of this description.

4 In fact, to try to improve the numerical stability of the EM procedure, we prevented the sample covariance estimators from exiting the cone of positive semi-definite matrices during the iterations and took Moore–Penrose pseudoinverses with eigenvalues below 10^{-10} regarded as 0. Both of these modifications did indeed improve the algorithm, but some instability persists. Moreover, use of the SWEEP operator (Beaton, 1964), which is designed to compute the Schur complement in a numerically stable way, failed to remedy the situation, yielding increasing log-likelihood trajectories identical to those of the vanilla algorithm.

5 Each iteration of the EM algorithm involves the inversion of n matrices, where the dimension of the ith such matrix is ∑_{j=1}^d ω_{ij} × ∑_{j=1}^d ω_{ij} (i.e. O(d) × O(d)). Using standard matrix inversion algorithms, then, each iteration has computational complexity of order n d³, and moreover the number of iterations required for numerical convergence can be very large in higher dimensions. This meant that even when d = 100, primePCA was nearly 50 times faster than the EM algorithm.

ACKNOWLEDGEMENTS

The authors thank the anonymous reviewers for helpful feedback, which helped to improve the paper. Ziwei Zhu was supported by NSF grant DMS-2015366, Tengyao Wang was supported by EPSRC grant EP/T02772X/1 and Richard J. Samworth was supported by EPSRC grants EP/P031447/1 and EP/N031938/1, as well as ERC Advanced Grant 101019498.

REFERENCES

Anderson, T.W. (1957) Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Journal of the American Statistical Association, 52, 200–203.

Beaton, A.E. (1964) The use of special matrix operators in statistical calculus. ETS Research Bulletin Series, 2, i–222.

Belloni, A., Rosenbaum, M. & Tsybakov, A.B. (2017) Linear and conic programming estimators in high dimensional errors-in-variables models. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 79, 939–956.

Cai, T.T., Ma, Z. & Wu, Y. (2013) Sparse PCA: optimal rates and adaptive estimation. The Annals of Statistics, 41, 3074–3110.

Cai, T.T. & Zhang, A. (2016) Minimax rate-optimal estimation of high-dimensional covariance matrices with incomplete data. Journal of Multivariate Analysis, 150, 55–74.

Cai, T.T. & Zhang, L. (2018) High-dimensional linear discriminant analysis: optimality, adaptive algorithm, and missing data. arXiv:1804.03018.

Candès, E.J., Li, X., Ma, Y. & Wright, J. (2011) Robust principal component analysis? Journal of the ACM, 58, 11:1–11:37.

Candès, E.J. & Plan, Y. (2010) Matrix completion with noise. Proceedings of the IEEE, 98, 925–936.

Candès, E.J. & Recht, B. (2009) Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9, 717–772.

Chi, Y., Lu, Y. & Chen, Y. (2018) Nonconvex optimization meets low-rank matrix factorization: an overview. arXiv:1809.09573.

Cho, J., Kim, D. & Rohe, K. (2017) Asymptotic theory for estimating the singular vectors and values of a partially-observed low rank matrix with noise. Statistica Sinica, 27, 1921–1948.

Davis, C. & Kahan, W.M. (1970) The rotation of eigenvectors by a perturbation III. SIAM Journal on Numerical Analysis, 7, 1–46.

Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 39, 1–38.

Dray, S. & Josse, J. (2015) Principal component analysis with missing values: a comparative survey of methods. Plant Ecology, 216, 657–667.

Elsener, A. & van de Geer, S. (2018) Sparse spectral estimation with missing and corrupted measurements. arXiv:1811.10443.

Fan, J., Liao, Y. & Mincheva, M. (2013) Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 75, 603–680.

Ford, B.L. (1983) An overview of hot-deck procedures. In: Madow, W.G., Olkin, I. & Rubin, D.B. (Eds.) Incomplete data in sample surveys, Vol. 2: theory and bibliographies. New York: Academic Press, pp. 185–207.

Gao, C., Ma, Z., Zhang, A.Y. & Zhou, H.H. (2016) Achieving optimal misclassification proportion in stochastic block models. Journal of Machine Learning Research, 18, 1–45.

Hastie, T., Mazumder, R., Lee, J.D. & Zadeh, R. (2015) Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16, 3367–3402.

Johnstone, I.M. & Lu, A.Y. (2009) On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104, 682–693.

Josse, J. & Husson, F. (2012) Handling missing values in exploratory multivariate data analysis methods. Journal de la société française de statistique, 153, 1–21.

Josse, J., Pagès, J. & Husson, F. (2009) Gestion des données manquantes en analyse en composantes principales. Journal de la société française de statistique, 150, 28–51.

Keshavan, R.H., Montanari, A. & Oh, S. (2010) Matrix completion from a few entries. IEEE Transactions on Information Theory, 56, 2980–2998.

Kiers, H.A.L. (1997) Weighted least squares fitting using ordinary least squares algorithms. Psychometrika, 62, 251–266.

Koltchinskii, V., Lounici, K. & Tsybakov, A.B. (2011) Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39, 2302–2329.

Little, R.J. & Rubin, D.B. (2019) Statistical analysis with missing data. Hoboken: John Wiley & Sons.

Loh, P.-L. & Tan, X.L. (2018) High-dimensional robust precision matrix estimation: cellwise corruption under ε-contamination. Electronic Journal of Statistics, 12, 1429–1467.

Loh, P.-L. & Wainwright, M.J. (2012) High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. The Annals of Statistics, 40, 1637–1664.

Lounici, K. (2013) Sparse principal component analysis with missing observations. In: Houdré, C. (Ed.) High dimensional probability VI. Basel: Birkhäuser, pp. 327–356.

Lounici, K. (2014) High-dimensional covariance matrix estimation with missing observations. Bernoulli, 20, 1029–1058.

Mazumder, R., Hastie, T. & Tibshirani, R. (2010) Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11, 2287–2322.

Negahban, S. & Wainwright, M.J. (2012) Restricted strong convexity and weighted matrix completion: optimal bounds with noise. Journal of Machine Learning Research, 13, 1665–1697.

Paul, D. (2007) Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17, 1617–1642.

Rohe, K., Chatterjee, S. & Yu, B. (2011) Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39, 1878–1915.

Rubin, D.B. (1976) Inference and missing data. Biometrika, 63, 581–592.

Rubin, D.B. (2004) Multiple imputation for nonresponse in surveys. Hoboken: John Wiley & Sons.

Schönemann, P. (1966) A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31, 1–10.

Seaman, S., Galati, J., Jackson, D. & Carlin, J. (2013) What is meant by "missing at random"? Statistical Science, 28, 257–268.

Shen, D., Shen, H., Zhu, H. & Marron, J. (2016) The statistics and mathematics of high dimension low sample size asymptotics. Statistica Sinica, 26, 1747–1770.

Vershynin, R. (2018) High-dimensional probability: an introduction with applications in data science. Cambridge: Cambridge University Press.

Wang, T. (2016) Spectral methods and computational trade-offs in high-dimensional statistical inference. Ph.D. thesis, University of Cambridge.

Wang, W. & Fan, J. (2017) Asymptotics of empirical eigen-structure for ultra-high dimensional spiked covariance model. The Annals of Statistics, 45, 1342–1374.

Wold, H. & Lyttkens, E. (1969) Nonlinear iterative partial least squares (NIPALS) estimation procedures. Bulletin of the International Statistical Institute, 43, 29–51.

Zhang, A., Cai, T.T. & Wu, Y. (2018) Heteroskedastic PCA: algorithm, optimality, and applications. arXiv:1810.08316.

Zhu, Z., Wang, T. & Samworth, R.J. (2019) primePCA: projected refinement for imputation of missing entries in principal component analysis. R package, version 1.2. Available from: https://CRAN.R-project.org/web/packages/primePCA/.
