Sequential Sufficient Dimension Reduction for Large p, Small n Problems
Xiangrong Yin and Haileab Hilafu
Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 77, Issue 4, September 2015, Pages 879–892, https://doi.org/10.1111/rssb.12093
Summary
We propose a new and simple framework for dimension reduction in the large p, small n setting. The framework decomposes the data into pieces, thereby enabling existing approaches for n>p to be adapted to n < p problems. Estimating a large covariance matrix, which is a very difficult task, is avoided. We propose two separate paths to implement the framework. Our paths provide sufficient procedures for identifying informative variables via a sequential approach. We illustrate the paths by using sufficient dimension reduction approaches, but the paths are very general. Empirical evidence demonstrates the efficacy of our paths. Additional simulations and applications are given in an on-line supplementary file.
1. Introduction
1.1. Sufficient dimension reduction
Sufficient dimension reduction (SDR) (Li, 1991; Cook, 1994, 1996) is a methodology for reducing the dimension of predictors while preserving the regression relationship with the response. More specifically, let S denote a subspace of the predictor space and let P S denote the orthogonal projection onto S. We assume that Y is a scalar response and X is a p × 1 predictor vector. If a subspace S is such that Y and X are independent conditioning on P S X, i.e. Y ∐ X|P S X, where '∐' represents independence, then P S X can be used as the predictor without loss of regression information. Such subspaces are called dimension reduction subspaces. The intersection of all such subspaces, if it itself satisfies the conditional independence, is called the central subspace (CS) (Cook, 1994), with its dimensionality denoted by d. Under mild conditions (Cook, 1996; Yin et al., 2008), the CS is well defined and unique.
Many methods have been proposed to estimate the CS or part of it. Among others, these include inverse approaches, such as sliced inverse regression (SIR) (Li, 1991) and the sliced average variance estimate (Cook and Weisberg, 1991); forward approaches, such as Hristache et al. (2001a, 2001b), the minimum average variance estimate (Xia et al., 2002), sliced regression (Wang and Xia, 2008) and Dalalyan et al. (2008); and correlation approaches, such as canonical correlation (Fung et al., 2002), Kullback–Leibler distance (Yin and Cook, 2005), the Fourier transform (Zhu and Zeng, 2006) and reproducing kernel Hilbert space methods (Fukumizu et al., 2004, 2009).
1.2. Sufficient variable selection
Another issue in modelling data is variable selection. Note that the terms model selection and variable selection are often used interchangeably in the literature because most approaches are based on models. However, there is a distinction between them (Li et al., 2005). We give a formal definition of sufficient variable selection (SVS) that is similar to that of Cook (2004), who proposed methods for testing the significance of variables in the SDR setting.
Definition 1. If there is a p × q matrix α (q⩽p), where the columns of α consist of unit vectors e j with jth element 1 and 0 otherwise, such that Y ∐ X|α T X, then the column space of α is called a variable selection space. The intersection of all such spaces, if it itself satisfies the conditional independence condition above, is called the central variable selection space, with dimension s.
Without loss of generality, we can take α = (I s, 0)T, which is also discussed by Cook (2004). Immediately by this definition, the CS is contained in the central variable selection space and d⩽s. Moreover, if the CS exists, then the central variable selection space exists and is unique. There are differences between the two spaces. For instance, it is possible that the model is not sparse at all, so that a basis of the central variable selection space is I p, but 1⩽d⩽s; for example, d = 1 with β T = (1,…,1) a basis of the CS. It is also possible that the model is not sparse at all, so that a basis of the central variable selection space is I p and 1⩽d < s, and yet the individual solutions can be sparse; for example, d = 2, where each of the two basis directions of the CS is individually sparse although together they involve all the predictors.
Another difference is that the CS is invariant under affine transformations. However, owing to the covariance matrix Σx, sparsity of the CS in one scale of the predictors may not result in sparsity in another scale, i.e. a sparse solution in one scale may correspond to a non-sparse solution in another. Thus, estimation of the SVS may be even more difficult.
One approach to achieve SVS is to combine SDR with a penalization approach, i.e. first apply an SDR method to estimate the CS, and then apply a penalization approach to the estimated CS to obtain an estimate of the central variable selection space. See Ni et al. (2005), Li and Nachtsheim (2006), Li (2007), Zhou and He (2008), Wang and Yin (2008) and Chen et al. (2010).
1.3. Large p, small n problems
The model-free SDR methods, and the SVS methods that combine SDR with a penalization approach, have achieved great success, but they fail in the large p, small n setting. This is due to the operational difficulty of calculating a large sample covariance matrix and its inverse. Only limited progress has been made in attempts to overcome this; see Chiaromonte and Martinelli (2002), Li and Yin (2008), Cook et al. (2007) and Zhu et al. (2010a).
The model-based penalization approaches such as the lasso (Tibshirani, 1996), smoothly clipped absolute deviation (Fan and Li, 2001) and the Dantzig selector (Candès and Tao, 2007), have been very useful and widely used. However, when the dimension p≫n, as is usually the case with high dimensional data sets in bioinformatics, machine learning and pattern recognition, these methods may not perform well owing to the simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability (Fan et al., 2009). For overviews of statistical challenges with high dimensionality, see, for example, Donoho (2000) and Fan and Li (2006).
To overcome these problems, Fan and Lv (2008) proposed a sure independence screening (SIS) method to select important variables in ultrahigh dimensional linear models. See also Huang et al. (2008), Fan et al. (2009), Fan and Song (2010) and Hall and Miller (2009). Although screening methods have achieved great success, most of them are model based and rely on single-index models.
In this paper, we take a completely new approach and propose a novel but simple framework to tackle the large p, small n problem. The framework decomposes the data into different pieces and undertakes sufficient reduction of the data sequentially. We propose two paths. Our paths provide a sufficient way of identifying informative variables by using joint information among the predictors. The proposed paths allow existing methods that have difficulty dealing with large p, small n data to be adapted to overcome this problem. Estimating a large covariance matrix, which is a very difficult task, is avoided. The two paths are very general and provide a novel approach to solving the large p, small n problem. There are many choices for the implementation of the paths. We shall, however, illustrate them by using SDR and SVS approaches.
The paper is organized as follows. In Section 2, we introduce the framework, and then propose path I and path II, followed by discussions on some related issues. In Section 3, we illustrate our paths via simulations and a data analysis. Concluding remarks are given in Section 4. We also provide an on-line supplementary file for additional materials. R code for the algorithms and a sample data set are available from
http://wileyonlinelibrary.com/journal/rss-datasets
2. The proposed framework
Let X 1 and X 2 be random vectors, and R(X 1) be a vector function of X 1. Then, a simple application of proposition 4.6 in Cook (1998) implies the following result.
Proposition 1. Either statement (a) or statement (b) implies statement (c) below:
- a.
(Y, X 2) ∐ X 1|R(X 1);
- b.
X 2 ∐ X 1|{Y, R(X 1)} and Y ∐ X 1|R(X 1);
- c.
Y ∐ X 1|{R(X 1), X 2}.
For the implementation of the framework proposed, in this paper, we shall explore two paths through statements (a) and (b) in proposition 1. Statement (a) appears to be a natural choice when the response variable is quantitative. We refer to this procedure as path I. Consider (X 2, Y) as a multivariate response; then statement (a) becomes a usual multivariate response problem where any of the existing statistical methods for multivariate responses can be applied. Details about the development of path I are given in Section 2.1. When the response variable is categorical, statement (b) will be the natural choice. We refer to this procedure as path II. Consider X 2 as a multivariate response. Conditioning on Y, the first part of statement (b) is another multivariate response problem, whereas the second part is a usual univariate categorical response problem. The development of path II is given in Section 2.2.
Although we focus on large p, small n problems, the framework can be used when p < n, which opens up two new approaches (paths) for analysing traditional data.
The two paths depend on the choice of p 1. Let k = ⌈p/p 1⌉ be the number of blocks, i.e. p/p 1 rounded up to an integer. If p 1 = p (and, hence, k = 1), our paths reduce to a traditional approach that models Y|X directly, which we refer to as a 'one-step' method. If p 1 = 1, our paths become testing independence of X 1 and (Y, X 2) under statement (a), or independence of X 1 and X 2 given Y together with independence of X 1 and Y under statement (b): a 'p-step' method. Note that this p-step method differs from the usual SIS (Fan and Lv, 2008) and related approaches, as those methods use only marginal information about X 1 and Y. In general, our k-step approach suggests a way to solve multi-index models with n≪p, providing a sufficient way to reduce the data.
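As a minimal illustration of the block decomposition (the numbers p and p 1 below are hypothetical), the predictor indices can be split into k consecutive blocks as follows.

```r
# Split p predictor indices into k = ceiling(p / p1) consecutive blocks of size
# (at most) p1; the leading block plays the role of X1 at each step.
p  <- 1000   # hypothetical number of predictors
p1 <- 20     # hypothetical block size, chosen so that n > p1
k  <- ceiling(p / p1)
blocks <- split(seq_len(p), ceiling(seq_len(p) / p1))
length(blocks)   # equals k
```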
The paths proposed are very general and do not put any restriction on the form of R(X 1). In this paper, we shall use R(X 1) = β 1 T X 1, where β 1 is a matrix, for achieving SDR, and R(X 1) = α T X 1, with α a selection matrix as in definition 1, for achieving SVS. To implement our two paths, we also need methods that can work with multiple responses. There is a huge literature dealing with multiple responses (e.g. Lounici et al. (2009)). We believe that, with suitable modification, many multiresponse methods can be used in our two paths. In this paper, we restrict our attention to SDR approaches that deal with multiple responses.
2.1. Path I
2.1.1. Projective resampling sliced inverse regression
With R(X 1) = β 1 T X 1 and (Y, X 2) treated as a multivariate response, path I is concerned with the estimation of the CS for the regression of (Y, X 2) on X 1. Many methods have been developed for estimating such a subspace, for instance, Cook and Setodji (2003), Yin and Bura (2006) and Li et al. (2008). Here, we shall take a procedure using the projective resampling approach that was developed by Li et al. (2008). More details on the projective resampling approach can be found in the on-line supplementary file.
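To fix ideas, the sketch below illustrates the projective resampling idea in R: the multivariate response is repeatedly projected onto random directions on the unit sphere and a slicing estimator is applied to each univariate projection. It is a simplified stand-in rather than the exact PRSIR estimator of Li et al. (2008); the helper sir_dirs() is a bare-bones sliced-inverse-regression routine and both function names are ours.

```r
# Bare-bones SIR: slice the (univariate) response, compute slice means of the
# centred predictors and take leading eigenvectors of solve(Sigma) %*% M.
sir_dirs <- function(X, y, n_slices = 5, d = 1) {
  n <- nrow(X)
  Xc <- scale(X, center = TRUE, scale = FALSE)
  Sigma <- cov(X)
  slices <- cut(rank(y, ties.method = "first"), breaks = n_slices, labels = FALSE)
  M <- matrix(0, ncol(X), ncol(X))
  for (h in unique(slices)) {
    idx <- slices == h
    mh  <- colMeans(Xc[idx, , drop = FALSE])
    M   <- M + (sum(idx) / n) * tcrossprod(mh)
  }
  ev <- eigen(solve(Sigma, M))
  Re(ev$vectors[, seq_len(d), drop = FALSE])
}

# Simplified projective resampling: accumulate the outer products of the
# directions obtained from m random projections of the multivariate response.
prsir_dirs <- function(X, Ymat, m = 500, n_slices = 5, d = 1) {
  q <- ncol(Ymat)
  K <- matrix(0, ncol(X), ncol(X))
  for (j in seq_len(m)) {
    u <- rnorm(q); u <- u / sqrt(sum(u^2))       # random direction on the unit sphere
    yproj <- as.vector(Ymat %*% u)               # univariate projected response
    B <- sir_dirs(X, yproj, n_slices, d)
    K <- K + tcrossprod(B) / m
  }
  eigen(K)$vectors[, seq_len(d), drop = FALSE]   # estimated basis for the reduction
}
```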
When applying projective resampling sliced inverse regression (PRSIR) to find the reduced variable β 1 T X 1, we must estimate the dimensionality of β 1. Many methods have been proposed thus far, for instance, the sequential asymptotic test (Li, 1991; Bura and Cook, 2001), the modified Bayes information criterion (Zhu et al., 2006) and the sparse eigendecomposition method (Zhu et al., 2010b). We adopt Li's asymptotic χ 2-test (Li, 1991) in this paper.
2.1.2. Sufficient variable selection approach
For SVS, path I is to estimate the corresponding variable selection space for the regression of (Y, X 2) on X 1. We shall illustrate this estimation by combining PRSIR with penalization approaches. In particular, we adopt Li's (2007) approach. See the on-line supplementary file for details.
2.1.3. Estimation procedure
Having presented all the necessary machinery for the implementation of path I, we now give the explicit estimation procedure. First, order the predictor vector X by a robust marginal utility measure; we use the distance correlation method of Li et al. (2012). The ordered predictor vector is still denoted by X.
- Step 1:
decompose X into (X 1 T, X 2 T)T, where X 1 is a p 1×1 vector such that n>p 1, and consider the problem of estimating the CS for the regression of the multivariate response (Y, X 2) on X 1.
- Step 2:
for the SDR solution, apply PRSIR to the regression of (Y, X 2) on X 1 and find the reduced variable β 1 T X 1; for the SVS solution, apply PRSIR incorporated in Li's (2007) approach to the same regression and find the reduced variable consisting of the selected components of X 1.
- Step 3:
replace the predictor X by the vector that combines the reduced variable from step 2 with X 2, and go back to step 1.
Repeat steps 1–3 until all the variables in the original predictor vector X have been used in step 1. A detailed algorithm is given in the on-line supplementary file. Note that our path calls several methods into one procedure so that it can sufficiently recover the informative variables. This may increase computational complexity, which is discussed in the supplementary file.
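As a schematic illustration only, the loop below strings steps 1–3 together for the SDR version of path I. It assumes the prsir_dirs() helper sketched in Section 2.1.1, a predictor matrix that has already been ordered marginally and, for simplicity, a fixed per-step dimension d1 in place of the χ 2-test; the SVS (penalized) variant is omitted and the function name is ours.

```r
# Schematic path I cycle (SDR version): peel off the leading block X1, reduce it
# against the multivariate response (Y, X2), splice the reduced variable back in
# front of X2 and repeat until every original predictor has entered a block.
path1_sdr <- function(X, y, p1 = 20, d1 = 1, n_slices = 5, m = 200) {
  repeat {
    p1_cur <- min(p1, ncol(X))
    X1 <- X[, seq_len(p1_cur), drop = FALSE]                 # step 1: current block
    X2 <- if (ncol(X) > p1_cur) X[, -seq_len(p1_cur), drop = FALSE] else NULL
    Ymat  <- if (is.null(X2)) cbind(y) else cbind(y, X2)     # (Y, X2) as the response
    beta1 <- prsir_dirs(X1, Ymat, m = m, n_slices = n_slices,
                        d = min(d1, ncol(X1)))               # step 2: reduce X1
    X <- cbind(X1 %*% beta1, X2)                             # step 3: replace X
    if (is.null(X2)) break                                   # all blocks processed
  }
  X   # columns are the final reduced predictors
}
```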
2.2. Path II
We defer the theoretical justification of path II to the on-line supplementary file. Next we describe the procedure of path II.
2.2.1. Partial sufficient dimension reduction
In path II, we must deal with the regression of X 2 on X 1 given Y, and the regression of Y on X 1. The former is to identify the partial CS (Chiaromonte et al., 2002), but with a multivariate response, whereas the latter is to identify the usual CS (Cook, 1994, 1996). To the best of our knowledge, the only available multivariate response SDR method in the presence of categorical predictors is the projective resampling partial sliced inverse regression (PRPSIR), proposed by Hilafu and Yin (2013). Here, we shall use PRPSIR for estimating the partial CS; see the on-line supplementary file for more details on this method. For estimating the CS of Y given X 1, we simply adopt SIR (Li, 1991).
2.2.2. Partial sufficient variable selection
For SVS, path II seeks to identify the informative components of X 1, and we must deal with two parts: the regression of X 2 on X 1 given Y, and the regression of Y on X 1. The estimation of the former becomes a new estimation problem of partial sufficient variable selection, which parallels what we described in Section 1.2. For this, we adopt PRPSIR (Hilafu and Yin, 2013) incorporated in Li (2007). To estimate the latter, we use SIR (Li, 1991) incorporated in Li (2007).
2.2.3. Estimation procedure
Our estimation procedure of path II is as follows.
- Step 1:
decompose X into (X 1 T, X 2 T)T, where X 1 is a p 1×1 vector such that n>p 1, and consider the problems of estimating the partial CS for the regression of X 2 on X 1 given Y and the CS for the regression of Y on X 1.
- Step 2:
for the SDR solution, apply PRPSIR to the regression of X 2 on X 1 given Y, and find the reduced variable α 1 T X 1; for the SVS solution, apply PRPSIR incorporated in Li's (2007) approach to the same regression, and find the corresponding sparse reduced variable.
- Step 3:
for the SDR solution, apply SIR to the regression of Y on X 1, and find the reduced variable α 2 T X 1; for the SVS solution, apply SIR incorporated in Li's (2007) approach to the same regression, and find the corresponding sparse reduced variable.
- Step 4:
set β 1 = (α 1, α 2), replace the predictor X by the vector that combines β 1 T X 1 with X 2 and go back to step 1.
When we combine α 1 and α 2 to obtain β 1, we may simply use the singular value decomposition to remove redundant directions in case α 1 and α 2 have common estimated directions, though more refined methods could be used; a small sketch is given below. We repeat steps 1–4 until all the variables in the original predictor vector X have been used in step 1. A detailed algorithm is in the on-line supplementary file.
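A minimal sketch of this singular-value-decomposition device (the helper name and tolerance are ours): stack the two direction matrices and keep only the left singular vectors whose singular values are non-negligible.

```r
# Combine alpha1 and alpha2 into beta1 while dropping (nearly) common directions.
combine_directions <- function(alpha1, alpha2, tol = 1e-8) {
  A  <- cbind(alpha1, alpha2)
  sv <- svd(A)
  keep <- sv$d > tol * sv$d[1]            # discard negligible singular values
  sv$u[, keep, drop = FALSE]              # orthonormal basis of span(alpha1, alpha2)
}
```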
2.3. Effective use of path I and path II
Our two paths may depend on how the predictors are grouped. Grouping is determined by two factors: the choice of p 1 and the order of X.
For path II, we find that, if we try different partitions of the original predictor set, the solutions are very stable, i.e. the change of a reasonable choice of p 1 (for which the method selected works well) will not affect the results (see the on-line supplementary file). A different ordering of the predictor vector does not change the results significantly either. Therefore, path II is quite stable. The reason for this is that the second part of statement (b) that deals with the marginal relationships between X 1 and Y plays an important role in keeping informative variables. In proposition 1, the second part of statement (b) indicates that marginally dependent variables will be in the sufficient part of statement (c). Therefore, marginal dependence may be critical and, regardless of the order of the variables, path II will use this information.
For path I, the order of the predictors is important, whereas the choice of p 1 is much less critical (see some empirical evidence in the on-line supplementary file). If we spread the informative variables out in different blocks, then the signal in each block may be weak, making it difficult to recover them, i.e., when we condition on one block, the other informative variables increase the magnitude of the error in the analysis in this step. Thus, in theory, we would like to put relevant variables together. But, we do not know which variables are relevant. Therefore, we borrow the strength from the second part of statement (b) in path II. We devise an ordering procedure, such as using marginal utility measures to put informative variables (in the marginal sense) together. Therefore, ordering using marginal information is important because the marginally dependent variables are in the sufficient set of statement (c).
For path I, it is desirable to put potential informative variables towards the end. If we put them at the front (first block), it could be more difficult to recover them, because at this point more noise variables are in the response. This, after some steps, potentially adds noise variables to the estimate, diluting the accuracy of the estimate. However, when potential informative variables are at the end, the method will probably first discard irrelevant variables (reducing noise) and then, towards the end, recover the important variables with less noise and accurate estimates. This, unfortunately, means that path I relies on the marginal order of the predictors. However, a possible solution is to turn the response into a categorical variable (by slicing) and to use path II.
Although the choice of p 1 is less critical, we should choose p 1 as large as possible, provided that the method selected works well in the traditional setting (i.e. n>p 1). For instance, for SIR (Li, 1991), p 1 can run from 2 to n−5 (Cook et al., 2012). This is because, if a model is abundant, a larger p 1 will reduce the magnitude of the conditional error, i.e. the variance of Y|X, and strengthen the signals. If a model is sparse, a larger p 1 still strengthens the signal by reducing the dimension of the long response vector. Consequently, if a one-step solution is available, i.e. p 1 = p works, then it is preferred.
In conclusion, to use the two paths effectively, we suggest that one should pick a larger p 1 for a method that works well in a traditional setting (n>p 1), and use it as the block size. For path I, order the predictors by using a marginal utility measure (e.g. Zhu et al. (2011) and Li et al. (2012)) by ranking from the weakest association to the strongest association, whereas, for path II, no particular ordering is necessary. Nevertheless, as a referee pointed out, the stability, or, say, the difference between two estimates from two different partitions, is difficult to quantify rigorously. Although we agree with this, we borrow the ensemble idea from Yin and Li (2011) to provide a partial remedy (see limited evidence in the on-line supplementary file).
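As a small sketch of the ordering step for path I (assuming the energy package, whose dcor() computes the distance correlation underlying the screening utility of Li et al. (2012); the helper name is ours), the predictors can be ranked from the weakest to the strongest marginal association so that the potentially informative variables end up at the back:

```r
library(energy)   # provides dcor() for the distance correlation

# Reorder the columns of X by increasing marginal distance correlation with y,
# so that weakly associated predictors come first and strong ones come last.
order_by_dcor <- function(X, y) {
  utility <- apply(X, 2, function(xj) dcor(xj, y))
  X[, order(utility, decreasing = FALSE), drop = FALSE]
}
```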
2.4. Theoretical properties
Theoretical properties, such as asymptotic properties as a referee pointed out, are difficult to establish for our sequential procedure. Our procedure is unlike most studies that concern large p, small n problems which offer a one-step solution. Nevertheless, we offer some explanations to understand the difficulties that are posed by the sequential approach, but we leave it largely as an open problem. To our limited knowledge, we can consider three cases: A1, s⩽p < n, where p is fixed and n → ∞; A2, s < n < p, n and p → ∞; A3, n < s = αp for some α ∈ (0,1], p → ∞.
Asymptotic results for case A1 can be obtained straightforwardly following from Li et al. (2008) as described in the on-line supplementary file.
Let p 1 and q 1 be the respective dimensions of X 1 and (Y, X 2). Consider the simplest case of the two-step solution. Note that p 1 + q 1 = p + 1; when p → ∞, we can have three different scenarios:
- a.
q 1 fixed but p 1→∞;
- b.
q 1→∞, but p 1 fixed;
- c.
p 1, q 1→∞.
For scenario (a), because q 1 is fixed, by a suitable adaptation of the projective resampling idea by Li et al. (2008), we can convert scenario (a) into a univariate response and adapt the one-step approach of either A2 where Wu and Li (2011) developed results, or A3 where Cook et al. (2012) developed results. Although the idea seems straightforward, the technicalities are still difficult as this is a two-step solution combining projective resampling, and A2 or A3. It is even more difficult to deal with scenarios (b) and (c) because they bring a new situation where the dimension of the response tends to ∞. Although we can use the projective resampling idea to convert the response vector into a univariate response, we do not know yet what conditions to put on the response before we study the properties for scenario (b) and even more so for scenario (c), where the predictor vector tends to ∞ as well.
We believe that each set-up can result in an independent research paper, if results can be obtained. Owing to such difficulties and fresh challenges, it is best to leave a complete asymptotic study when n and p tend to ∞ as a future research topic.
3. Simulations and applications
In this section, we assess the efficacy of the proposed paths through simulations and application to real data.
3.1. Simulations
In the simulations, for an estimate B̂ of a basis B of the true subspace, both assumed to be semi-orthogonal without loss of generality, the estimation error is measured by the distance between the subspaces spanned by these matrices, Δf = ‖P B−P B̂‖F, the Frobenius norm of the difference between the corresponding projection matrices (Li et al., 2005). The corresponding benchmark distance in this paper is the average of 10000 simulated distances over two respective random matrices following Li et al. (2008). Another measurement is the absolute correlation coefficient |r| between the true sufficient predictor and its estimate, whose benchmark is the average over 10000 simulated values of |r|, which are calculated by randomly simulating the direction and the data vector.
The accuracy of sparsity is measured by the true positive rate TPR, which is defined as the ratio of the number of correctly identified active predictors to the number of truly active predictors, and the false positive rate FPR, which is defined as the ratio of the number of falsely identified active predictors to the total number of inactive predictors. The measures TPR and FPR are also known as the sensitivity and 1-specificity and, ideally, we wish to have TPR to be close to 1 and FPR to be close to 0 at the same time.
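A minimal sketch of these accuracy measures (all object names are ours; B and Bhat are basis matrices of the true and estimated subspaces, and truth and selected are logical vectors flagging truly and estimated active predictors):

```r
# Frobenius-norm distance between the projections onto span(B) and span(Bhat).
delta_f <- function(B, Bhat) {
  PB    <- B    %*% solve(crossprod(B),    t(B))
  PBhat <- Bhat %*% solve(crossprod(Bhat), t(Bhat))
  norm(PB - PBhat, type = "F")
}

# True and false positive rates for variable selection.
tpr <- function(truth, selected) sum(selected & truth)  / sum(truth)
fpr <- function(truth, selected) sum(selected & !truth) / sum(!truth)
```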
For each model setting, 100 replicates of the data are generated, unless stated otherwise. In each model we consider three cases: case 1, the x i s and the ɛ are independent and identically distributed N(0, 1); case 2, the predictor x i is generated as x i = Σx 1/2 W, where Σx is a positive definite matrix whose (j 1, j 2)th entry induces moderate correlation among the predictors and W is N(0,I p), and the ɛ are independent and identically distributed N(0, 1), independent of the x i s; case 3, the predictor x i is generated in the same way, but with a Σx whose (j 1, j 2)th entry induces high correlation among the predictors, and the ɛ are again independent and identically distributed N(0, 1), independent of the x i s.
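For concreteness, a sketch of generating such predictors is given below; the exact (j 1, j 2)th entry of Σx is not reproduced here, so the autoregressive form rho^|j 1−j 2| and the value of rho are placeholders only (rho = 0 corresponds to case 1, a moderate rho to case 2 and a larger rho to case 3).

```r
# Generate n rows of p correlated predictors with a placeholder covariance
# Sigma[j1, j2] = rho^|j1 - j2| (hypothetical form; rho = 0 gives independence).
gen_predictors <- function(n, p, rho = 0) {
  Sigma <- rho ^ abs(outer(seq_len(p), seq_len(p), "-"))
  L <- chol(Sigma)                       # Sigma = t(L) %*% L
  W <- matrix(rnorm(n * p), n, p)        # rows of W are N(0, I_p)
  W %*% L                                # rows are N(0, Sigma)
}
```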
In our simulations, we use a fixed p 1 = 20 (the dimension of X 1) in each step. In the PRSIR step we use m = 2000, whereas in the PRPSIR step we take m = 500 to save some computing time. In addition, when we compare the accuracy of the estimates, in each step we retain only the dimensions that are not 0, i.e., if the estimated dimension is 0, we delete all the variables that are used as predictors in this step; if it is bigger than 0, we retain either the true d (path I) or the actual estimated dimensions (path II). When we assess how accurately the dimensions are estimated, we use the actual estimated dimension in each step. For testing the dimensionality in both paths, the χ 2-test (Li, 1991) is set at α = 0.05, unless otherwise stated. In the tables reported, a non-sparse estimate means that we used only the SDR method in the paths, whereas a sparse estimate means that we used the SDR method with the penalization approach.
We study the following two models: model 1 is a sparse non-linear example for path I with a continuous response, whereas model 2 is a sparse example for path II with a discrete response but a two-dimensional structure.
3.1.1. Model 1
Table 1 reports the results for model 1. As indicated in the last column, the χ 2-test tends to overestimate the dimension. But TPR and FPR are very accurate for all three cases, indicating that the procedure does a good job of identifying the active predictors. For independent (case 1) and moderately correlated data (case 2), the non-sparse solutions are very accurate, in terms of both Δf and |r|. When the predictors are highly correlated (case 3), the estimation is less accurate as expected, as PRSIR is not specially devised to deal with such data. The sparse solutions for all cases are very reasonable, in terms of Δf and |r|, but are generally inferior to non-sparse estimates. This agrees with the general fact that sparse estimates are good in selecting variables but tend to induce bias in estimation.
Table 1. Results for model 1

| Case | Non-sparse Δf | Non-sparse |r| | Sparse Δf | Sparse |r| | TPR | FPR | % of d‡ |
|---|---|---|---|---|---|---|---|
| 1 | 0.2482 (0.0552) | 0.9852 (0.0068) | 0.5721 (0.0915) | 0.9081 (0.0458) | 1 | 0.0244 | 8 (d = 1), 63 (d = 2), 29 (d = 3) |
| 2 | 0.3642 (0.0792) | 0.9862 (0.0066) | 0.8744 (0.2580) | 0.8280 (0.1165) | 0.9975 | 0.0241 | 12 (d = 1), 53 (d = 2), 35 (d = 3) |
| 3 | 1.0032 (0.1399) | 0.9749 (0.0121) | 1.3078 (0.1236) | 0.7545 (0.1861) | 1 | 0.1277 | 7 (d = 1), 51 (d = 2), 42 (d = 3) |

For Δf and |r|, the results reported are the averages over 100 replicates; the values in parentheses are the standard errors. TPR and FPR refer to the sparse estimates. Benchmark Δf = 1.4138; benchmark |r| = 0.0442 for case 1, 0.0468 for case 2 and 0.0679 for case 3.
‡Percentage of replicates with the given estimated dimension (χ 2-test at α = 0.05).
3.1.2. Model 2
Table 2 reports the results for model 2. This is a sparse model but two dimensional. The estimated dimensions are very accurate as indicated in the last column. Again TPR and FPR for all three cases show the efficacy of path II. As expected, when the dimension increases, the estimation accuracy in terms of both Δf and |r| decreases.
Table 2. Results for model 2

| Case | Non-sparse Δf | Non-sparse |r| | Sparse Δf | Sparse |r| | TPR | FPR | % of d‡ |
|---|---|---|---|---|---|---|---|
| 1 | 1.5626 (0.0894) | 0.4253/0.6518 (0.1444/0.2106); 0.3687/0.4922 (0.1732/0.1696) | 1.6126 (0.1009) | 0.4011/0.5132 (0.1252/0.1963); 0.2811/0.4232 (0.2311/0.1423) | 0.9887 | 0.1750 | 97 (d = 2), 3 (d = 3) |
| 2 | 1.6582 (0.0912) | 0.3779/0.7343 (0.1144/0.1591); 0.4736/0.4351 (0.1424/0.1504) | 1.878 (0.1207) | 0.2464/0.4053 (0.1267/0.1940); 0.4078/0.5318 (0.1119/0.1452) | 1 | 0.1676 | 2 (d = 1), 92 (d = 2), |
| 3 | 1.8427 (0.0822) | 0.3545/0.6032 (0.1918/0.1580); 0.4112/0.4559 (0.1424/0.1120) | 1.9268 (0.0820) | 0.2312/0.3901 (0.0967/0.2146); 0.4117/0.3618 (0.1212/0.1124) | 1 | 0.2172 | 3 (d = 1), 94 (d = 2), |

For Δf and |r|, the results reported are the averages over 100 replicates; the values in parentheses are the standard errors. The two pairs of |r| values per case correspond to the two dimensions of the model. TPR and FPR refer to the sparse estimates. Benchmark Δf = 1.9993; benchmark |r| = 0.0595 for case 1, 0.0607 for case 2 and 0.0716 for case 3.
‡Percentage of replicates with the given estimated dimension (χ 2-test at α = 0.05).
In all our simulations, we report the ‘raw’ results from the path once it finishes the first cycle. These reported ‘raw’ results already seem sufficiently accurate. However, re-estimation may be used to improve the results, i.e., once the path stops after the first cycle, we use the remaining active variables to obtain another solution. An illustration of our approach for an abundant model and a comparison of our approach with the one-step estimation method can be found in the on-line supplementary file.
3.2. Leukaemia data
In this section, we shall use path II to analyse a leukaemia data set, which comes from high density Affymetrix oligonucleotide arrays. This data set was first analysed by Golub et al. (1999), and later by Chiaromonte and Martinelli (2002), Dudoit et al. (2002) and Fan and Fan (2008), among others. The data are available from http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. There are 72 samples from two classes of acute leukaemia: 47 in acute lymphoblastic leukaemia and 25 in acute myeloid leukaemia. There are 7129 genes. Before we apply our path II approach to the data, we preprocess them following Golub et al. (1999). Three preprocessing steps were applied:
- a.
thresholding, gene expression readings of 100 or fewer were set to 100 and expression readings of 16000 or more were set to 16000;
- b.
screening, only genes where max−min> 500 and max/min> 5 were included, where max and min refer to the maximum and minimum readings of a gene expression among the 72 samples respectively;
- c.
transformation, gene expression readings of the genes selected were log-transformed.
The data were then summarized by a 72 × 3571 matrix X = (x ij), where x ij denotes the expression level for gene j in messenger ribonucleic acid sample i. We further standardized the data so that observations have mean 0 and variance 1 across variables (genes).
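A compact sketch of these preprocessing steps and of the standardization (the function name is ours; expr is assumed to be the raw 72 × 7129 expression matrix with samples in rows, and the base of the logarithm is not restated here):

```r
preprocess <- function(expr) {
  expr <- pmin(pmax(expr, 100), 16000)                   # (a) threshold to [100, 16000]
  keep <- apply(expr, 2, function(g)
    (max(g) - min(g) > 500) && (max(g) / min(g) > 5))    # (b) screen genes
  X <- log(expr[, keep, drop = FALSE])                   # (c) log-transform
  t(scale(t(X)))                                         # each sample: mean 0, variance 1
}
```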
In the application of our method, we run the SIR and PRPSIR steps and form the estimated space by combining directions from these steps, if any. At each step, we use Li's χ 2-test (Li, 1991) to determine the structural dimension. Because of the computational cost, tuning parameters for the sparse SDR estimation of Li (2007) are chosen only at the outset, the first step, and are applied to the remaining steps. We set the number of predictors (genes) that are considered at each step to be 20 (p 1 = 20), and the number of slices to be 5.
3.2.1. Initial stage estimation
When path II stops, the initial estimate has dimension 1 and it contains 28 non-zero coefficients—it selects 28 genes. Fig. 1(a) shows a boxplot of the sparse estimate of the direction. Fig. 1(b) presents a boxplot of the non-sparse estimate of the direction with all the selected 28 genes by refitting the 72 × 28 data.

Fig. 1. Sufficient summary plot for the leukaemia data with two categories
3.2.2. Refined stage estimation
In the refined stage we form a new data matrix of size 72 × 28. Since now n = 72 and p = 28, we can just apply the usual SIR method, with a shrinkage procedure applied. The estimate contains 11 non-zero coefficients, i.e. it selects 11 genes. Fig. 1(c) shows a boxplot of the sufficient predictor that is formed by using this sparse estimate of the direction. Fig. 1(d) presents a boxplot of the sufficient predictor from a non-sparse estimate by using only the 11 selected genes, by refitting the 72 × 11 data. There is no further reduction from the 11 genes, and thus we have the final estimate of 11 genes. Fig. 1(d) is our final summary plot, which clearly shows the separation of the two groups.
Since we used SIR and its variations, we would like to check the conditional normality of X|Y. Note that these methods require a linearity condition (Cook, 1998), which is implied by conditional normality. Indeed, we used the Shapiro–Wilk test for normality, with p-values of 0.4569 or greater, and the Bartlett test for constant variance, with p-value 0.2955. These results and a Q–Q plot (not reported here) show that a conditional normality assumption seems very reasonable.
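These checks can be reproduced along the following lines; the text does not spell out exactly which quantities were tested, so this sketch applies the tests gene by gene, with Xsel the 72 × 11 matrix of selected genes and grp the two-class label (both names assumed):

```r
# Shapiro-Wilk normality test within each class and Bartlett test of equal
# variances across classes, applied to each selected gene.
shapiro_p  <- apply(Xsel, 2, function(gene)
  tapply(gene, grp, function(x) shapiro.test(x)$p.value))
bartlett_p <- apply(Xsel, 2, function(gene) bartlett.test(gene, grp)$p.value)
range(shapiro_p)    # for reference, the text reports Shapiro p-values of 0.4569 or greater
range(bartlett_p)   # and a Bartlett p-value of 0.2955
```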
In two-group analysis, re-estimation using reduced variables is applied, which demonstrates the improvement and supports our earlier comments in Section 3.1.
4. Discussion
Analysis of large p, small n data is important but difficult, for at least two reasons: the sample covariance matrix is not invertible, and manipulating a large predictor vector is computationally demanding. In this paper we proposed a simple framework to tackle large p, small n problems. The framework differs from the usual methodologies in that it decomposes the predictor vector into pieces so that inverting the sample covariance matrix is not an issue, and a large response vector can be manipulated. It provides a sufficient way to solve the problem. We illustrated the framework via two paths using SDR approaches. However, the framework does not have to be restricted to our chosen approaches. Many methods that work for n>p data can be adopted in the framework, as long as a multiple-response vector can be handled. Hence, it opens possible research directions for large p, small n problems. However, the asymptotic properties of the framework when p → ∞, extended from the one-step solution, are not yet clear. In addition, differences between solutions from different partitions of the predictor vector may be difficult to quantify rigorously, though an ensemble idea can provide a promising solution. We leave these as open problems.
Acknowledgements
The authors thank the Joint Editor, an Associate Editor and two referees for their valuable comments, which greatly improved the paper. This work is supported in part by National Science Foundation grant DMS-1205564.
References