Summary

We propose a new and simple framework for dimension reduction in the large p, small n setting. The framework decomposes the data into pieces, thereby enabling existing approaches for n>p to be adapted to n < p problems. Estimating a large covariance matrix, which is a very difficult task, is avoided. We propose two separate paths to implement the framework. Our paths provide sufficient procedures for identifying informative variables via a sequential approach. We illustrate the paths by using sufficient dimension reduction approaches, but the paths are very general. Empirical evidence demonstrates the efficacy of our paths. Additional simulations and applications are given in an on-line supplementary file.

1. Introduction

1.1. Sufficient dimension reduction

Sufficient dimension reduction (SDR) (Li, 1991; Cook, 1994, 1996) is a methodology for reducing the dimension of predictors while preserving the regression relationship with the response. More specifically, let S denote a subspace of Rp, and let PS denote the orthogonal projection onto S. We assume that Y is a scalar response and X is a p × 1 predictor vector. If a subspace S is such that Y and X are independent conditionally on PSX, i.e. Y⨿X|PSX, where '⨿' denotes independence, then PSX can be used as the predictor without loss of regression information. Such subspaces S are called dimension reduction subspaces. The intersection of all such subspaces S, if it itself satisfies the conditional independence, is called the central subspace (CS) (Cook, 1994) and is denoted by SY|X, with its dimensionality denoted by d. Under mild conditions (Cook, 1996; Yin et al., 2008), the CS is well defined and unique.

Many methods have been proposed to estimate SY|X or part of it. Among others, these include inverse approaches, such as sliced inverse regression (SIR) (Li, 1991) and the sliced average variance estimate (Cook and Weisberg, 1991); forward approaches, such as those of Hristache et al. (2001a, 2001b), the minimum average variance estimate (Xia et al., 2002), sliced regression (Wang and Xia, 2008) and Dalalyan et al. (2008); and correlation approaches, such as canonical correlation (Fung et al., 2002), the Kullback–Leibler distance (Yin and Cook, 2005), the Fourier transform (Zhu and Zeng, 2006) and reproducing kernel Hilbert space methods (Fukumizu et al., 2004, 2009).

1.2. Sufficient variable selection

Another issue in modelling data is variable selection. The two terms model selection and variable selection are often used interchangeably in the literature because most approaches are model based. However, there is a distinction between them (Li et al., 2005). We give a formal definition of sufficient variable selection (SVS) that is similar to that of Cook (2004), who proposed methods for testing the significance of variables in the SDR setting.

Definition 1

If there is a p × q matrix α (q ⩽ p), whose columns are unit vectors ej with jth element 1 and all other elements 0, such that Y⨿X|αTX, then the column space of α is called a variable selection space. The intersection of all such spaces, if it itself satisfies the conditional independence condition above, is called the central variable selection space, denoted by SY|XV, with dimension s.

Without loss of generality, we can take α = (Is, 0)T, which is also discussed by Cook (2004). Immediately from this definition, SY|X ⊆ SY|XV and d ⩽ s. Moreover, if SY|X exists, then SY|XV exists and is unique. There are differences between SY|X and SY|XV. For instance, it is possible that the model is not sparse at all, so that a basis of SY|XV is Ip, yet 1 ⩽ d < s: for example, d = 1 with βT = (1,…,1) a basis of SY|X. It is also possible that the model is not sparse, so that a basis of SY|XV is Ip and 1 ⩽ d < s, while the individual directions are sparse: for example, d = 2 with β1T = (1,1,0,…,0) and β2T = (0,0,1,…,1), satisfying β1T + β2T = βT = (1,…,1).

Another difference is that the CS is invariant under affine transformations. However, owing to the covariance matrix Σx, sparsity of SY|XV may not result in sparsity of SY|ZV, or vice versa, i.e. a sparse solution in one scale may correspond to a non-sparse solution in another scale. Thus, estimation in SVS may be even more difficult.

One approach to achieve SVS is to combine SDR with a penalization approach, i.e. first apply an SDR method to estimate SY|X, and then apply a penalization approach to the estimated SY|X to obtain an estimated SY|XV. See Ni et al. (2005), Li and Nachtsheim (2006), Li (2007), Zhou and He (2008), Wang and Yin (2008) and Chen et al. (2010).

1.3. Large p, small n problems

The model-free SDR methods and the SVS methods that combine SDR with a penalization approach have achieved great success, but they fail in the large p, small n setting. This is due to the operational difficulty of calculating a large sample covariance matrix and its inverse. Only limited progress has been made towards overcoming this; see Chiaromonte and Martinelli (2002), Li and Yin (2008), Cook et al. (2007) and Zhu et al. (2010a).

The model-based penalization approaches, such as the lasso (Tibshirani, 1996), smoothly clipped absolute deviation (Fan and Li, 2001) and the Dantzig selector (Candès and Tao, 2007), have been very useful and widely used. However, when the dimension p is much larger than n, as is usually the case with high dimensional data sets in bioinformatics, machine learning and pattern recognition, these methods may not perform well owing to the simultaneous challenges of computational expediency, statistical accuracy and algorithmic stability (Fan et al., 2009). For overviews of the statistical challenges of high dimensionality, see, for example, Donoho (2000) and Fan and Li (2006).

To overcome these problems, Fan and Lv (2008) proposed the sure independence screening (SIS) method to select important variables in ultrahigh dimensional linear models. See also Huang et al. (2008), Fan et al. (2009), Fan and Song (2010) and Hall and Miller (2009). Although screening methods have achieved great success, most of them are model based and restricted to single-index models.

In this paper, we take a completely new approach and propose a novel but simple framework to tackle the large p, small n problem. The framework decomposes the data into different pieces and undertakes sufficient reduction of the data sequentially. We propose two paths. Our paths provide a sufficient way of identifying informative variables by using joint information among the predictors. The proposed paths allow existing methods that have difficulty dealing with large p, small n data to be adapted to overcome this problem. Estimating a large covariance matrix, which is a very difficult task, is avoided. The two paths are very general and provide a novel approach to solving the large p, small n problem. There are many choices for the implementation of the paths. We shall, however, illustrate them by using SDR and SVS approaches.

The paper is organized as follows. In Section 2, we introduce the framework, and then propose path I and path II, followed by discussions on some related issues. In Section 3, we illustrate our paths via simulations and a data analysis. Concluding remarks are given in Section 4. We also provide an on-line supplementary file for additional materials. R code for the algorithms and a sample data set are available from

http://wileyonlinelibrary.com/journal/rss-datasets

2. The proposed framework

Let X  1 and X  2 be random vectors, and R(X  1) be a vector function of X  1. Then, a simple application of proposition 4.6 in Cook (1998) implies the following result.

Proposition 1

Either statement (a) or statement (b) implies statement (c) below:

  • (a) X1⨿(X2,Y)|R(X1);

  • (b) X1⨿X2|{R(X1),Y} and X1⨿Y|R(X1);

  • (c) X1⨿Y|{R(X1),X2}.

Proposition 1 is fundamental and simple. However, it has a very important implication. Suppose that statement (c) holds; then

P(Y|X1,X2) = P(Y|R(X1),X2),    (1)

which implies that the distribution of Y|(X1,X2) is the same as that of Y|{R(X1),X2}. If the dimension of R(X1) is less than that of X1, then we achieve SDR. This interpretation of equation (1) is the key observation for the proposed framework in the case of p ≫ n. Write XT=(X1T,X2T) and select X1 so that its dimension p1 is less than the sample size n and the sample covariance matrix of X1, Σ̂x1, is invertible. Now suppose that we reduce X1 to R(X1). In what follows, consider R(X1) and X2 as the new predictor X, form new X1 and X2, and obtain a new R(X1) as above. Continue in this way until no further reduction is possible. If at any stage the number of reduced variables is already smaller than n, then we do not have to go through this procedure; instead, at this point, treat it as a traditional problem of small p and large n. Thus, the key is to find R(X1) such that statement (c) is satisfied. This can be forced to hold if either statement (a) or statement (b) is true. Therefore, proposition 1 provides a framework that uses the SDR idea as a bridge to overcome the large p, small n problem.
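To make the recursion concrete, here is a minimal R sketch of the sequential scheme. The block size p1, the stopping rule and the toy reducer are illustrative assumptions only; reduce_fn stands in for whatever n > p1 method (for example an SDR estimator) is plugged into the framework.

```r
# Sequential reduction skeleton: repeatedly split X into a leading block X1
# (p1 columns) and the rest X2, reduce X1 to R(X1), and recombine.
sequential_reduce <- function(X, y, p1, reduce_fn) {
  repeat {
    p <- ncol(X)
    if (p <= p1) return(X)          # small enough: stop and treat as a usual n > p problem
    X1 <- X[, 1:p1, drop = FALSE]
    X2 <- X[, (p1 + 1):p, drop = FALSE]
    R1 <- reduce_fn(X1, X2, y)      # returns an n x d1 matrix of reduced variables
    X  <- cbind(R1, X2)             # the reduced block joins the remaining predictors
  }
}

# Toy reducer (for illustration only): keep the single linear combination of X1
# most correlated with y, ignoring X2.
toy_reduce <- function(X1, X2, y) {
  b <- cor(X1, y)
  X1 %*% (b / sqrt(sum(b^2)))
}

set.seed(1)
X <- matrix(rnorm(100 * 500), 100, 500)
y <- X[, 1] + rnorm(100)
Z <- sequential_reduce(X, y, p1 = 50, toy_reduce)
dim(Z)   # far fewer columns than the original 500
```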

For the implementation of the framework proposed, in this paper, we shall explore two paths through statements (a) and (b) in proposition 1. Statement (a) appears to be a natural choice when the response variable is quantitative. We refer to this procedure as path I. Consider (X  2, Y) as a multivariate response; then statement (a) becomes a usual multivariate response problem where any of the existing statistical methods for multivariate responses can be applied. Details about the development of path I are given in Section 2.1. When the response variable is categorical, statement (b) will be the natural choice. We refer to this procedure as path II. Consider X  2 as a multivariate response. Conditioning on Y, the first part of statement (b) is another multivariate response problem, whereas the second part is a usual univariate categorical response problem. The development of path II is given in Section 2.2.

Although we focus on large p, small n problems, the framework can be used when p < n, which opens up two new approaches (paths) for analysing traditional data.

The two paths depend on the choice of p1. Let k = ⌈p/p1⌉, the integer p/p1 rounded up. If p1 = p (and, hence, k = 1), our paths reduce to a traditional approach modelling Y|X directly, which we refer to as a 'one-step' method. If p1 = 1, our paths become testing independence of X1 and (Y, X2) under statement (a), or independence of X1 and X2 given Y together with independence of X1 and Y under statement (b): a 'p-step' method. Note that this p-step method differs from the usual SIS (Fan and Lv, 2008) and related approaches, as those methods use only marginal information about X1 and Y. In general, our k-step approach opens a way to solve multi-index models with n ≪ p, providing a sufficient way to reduce the data.

The proposed paths are very general and do not put any restriction on the form of R(X1). In this paper, we shall use R(X1)=β1TX1, with β1 a matrix for achieving SDR and β1T=(Id1,0) for achieving SVS, respectively. To implement our two paths, we also need methods that can work with multiple responses. There is a huge literature dealing with multiple responses (e.g. Lounici et al. (2009)). We believe that, with suitable modification, many multiresponse methods can be used in our two paths. In this paper, we restrict our attention to SDR approaches that deal with multiple responses.

2.1. Path I

Let XT=(X1T,…,XkT) and R(Xi)=βiTXi, where βi is a matrix. Statement (a) in proposition 1, applied to each block Xi with the remaining blocks and Y treated as (X2, Y), results in

Xi⨿(Y,X1,…,Xi−1,Xi+1,…,Xk)|βiTXi ⇒ Xi⨿Y|(βiTXi,X1,…,Xi−1,Xi+1,…,Xk),  i = 1,…,k.

By setting i = 1, we have X1⨿(Y,X2,…,Xk)|β1TX1 ⇒ X1⨿Y|(β1TX1,X2,…,Xk). Thus, P(Y|X)=P(Y|β1TX1,X2,…,Xk). In general, this leads to P(Y|X)=P(Y|β1TX1,X2,…,Xk)=…=P(Y|β1TX1,β2TX2,…,βkTXk). Let B1=diag(β1,…,βk) be the corresponding block diagonal matrix; then P(Y|X)=P(Y|B1TX), or Y⨿X|B1TX. Hence, the subspace that is spanned by the columns of B1 is an SDR subspace, so SY|X ⊆ span(B1). We take another step from B1TX, if needed. Thus, theoretically, in the population sense, the path will always return a subspace containing the CS. Replacing βiTXi by R(Xi), one can prove the same result for a general function R(X); thus we omit the details.

2.1.1. Projective resampling sliced inverse regression

With R(X1)=β1TX1 and the stacked response YT=(X2T,Y), path I is concerned with the estimation of SY|X, the CS with a multivariate response. Many methods have been developed for estimating such a subspace, for instance, Cook and Setodji (2003), Yin and Bura (2006) and Li et al. (2008). Here, we shall use the projective resampling approach that was developed by Li et al. (2008). More details on the projective resampling approach can be found in the on-line supplementary file.

When applying projective resampling sliced inverse regression (PRSIR) to find the reduced variable β1TX1, we must estimate the dimensionality of β  1. Many methods have been proposed thus far, for instance, the sequential asymptotic test (Li, 1991; Bura and Cook, 2001), the modified Bayes information criterion (Zhu et al., 2006) and the sparse eigendecomposition method (Zhu et al., 2010b). We adopt Li's asymptotic χ  2-test (Li, 1991) in this paper.
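For concreteness, the following self-contained R sketch implements plain SIR on a univariate response together with Li's (1991) sequential χ2-test, in which n times the sum of the p − m smallest eigenvalues is compared with a χ2 distribution on (p − m)(H − m − 1) degrees of freedom. The number of slices H and the significance level are illustrative choices, and this is plain SIR, not the PRSIR estimator itself.

```r
# Sliced inverse regression kernel and Li's (1991) sequential chi-squared test.
sir_fit <- function(X, y, H = 5) {
  n <- nrow(X); p <- ncol(X)
  Xs <- scale(X, center = TRUE, scale = FALSE)
  Sig <- crossprod(Xs) / n
  Rchol <- chol(Sig)
  Z <- Xs %*% solve(Rchol)                    # standardized predictors
  slice <- cut(rank(y, ties.method = "first"), H, labels = FALSE)
  M <- matrix(0, p, p)
  for (h in 1:H) {
    idx <- slice == h
    mh <- colMeans(Z[idx, , drop = FALSE])    # slice mean of standardized X
    M <- M + (sum(idx) / n) * tcrossprod(mh)
  }
  eig <- eigen(M, symmetric = TRUE)
  # Directions in the original X scale are solve(Rchol) %*% eig$vectors[, 1:d].
  list(values = eig$values, vectors = eig$vectors, to_x_scale = solve(Rchol),
       n = n, H = H)
}

# Sequential test of d = m against d > m; returns the estimated dimension.
sir_dim_test <- function(fit, alpha = 0.05) {
  p <- length(fit$values)
  for (m in 0:(p - 1)) {
    stat <- fit$n * sum(fit$values[(m + 1):p])
    df <- (p - m) * (fit$H - m - 1)
    if (df <= 0 || pchisq(stat, df, lower.tail = FALSE) > alpha) return(m)
  }
  p
}
```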

2.1.2. Sufficient variable selection approach

For SVS, path I is to estimate SY|XV. We shall illustrate the estimation of SY|XV by combining PRSIR with penalization approaches. In particular, we adopt Li's (2007) approach. See the on-line supplementary file for details.

2.1.3. Estimation procedure

Having presented all the necessary machinery for the implementation of path I, we now give the explicit estimation procedure. First, order the predictor vector X by a robust marginal utility measure; we use the distance correlation method of Li et al. (2012). The ordered predictor vector is still denoted by X.

  • Step 1:

    decompose X∈Rp into XT=(X1T,X2T), where X1 is a p1×1 vector with n>p1, and consider the problem of estimating X1⨿(X2,Y)|β1TX1.

  • Step 2:

    for the SDR solution, apply PRSIR to the problem of YT=(X2T,Y)|X1, and find the reduced variable β1TX1; for the SVS solution, apply PRSIR incorporated in Li's (2007) approach to the problem of YT=(X2T,Y)|X1, and find the reduced variable β1TX1.

  • Step 3:

    replace predictor X by (β1TX1,X2) and go back to step 1.

Repeat steps 1–3 until all the variables in the original predictor vector X have been used in step 1. A detailed algorithm is given in the on-line supplementary file. Note that our path combines several methods into one procedure so that it can sufficiently recover the informative variables. This may increase the computational complexity, which is discussed in the supplementary file.
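As a rough illustration of step 2, the following R sketch turns the stacked response (X2T, Y) into many univariate pseudo-responses by projecting onto random unit vectors, averages the resulting SIR kernel matrices and keeps the leading directions of X1. This is only a simplified rendering of the projective resampling idea of Li et al. (2008), not their exact estimator; the number of projections m, the number of slices H and the fixed dimension d1 are illustrative assumptions.

```r
# Simplified projective-resampling SIR for step 2 of path I:
# reduce X1 given the stacked multivariate response (X2, Y).
pr_sir_step <- function(X1, X2, y, d1 = 1, m = 200, H = 5) {
  n <- nrow(X1); p1 <- ncol(X1)
  Ystack <- cbind(X2, y)                       # stacked response (X2^T, Y), used as is
  Xs <- scale(X1, center = TRUE, scale = FALSE)
  Rchol <- chol(crossprod(Xs) / n)
  Z <- Xs %*% solve(Rchol)                     # standardized X1
  M <- matrix(0, p1, p1)
  for (j in 1:m) {
    v <- rnorm(ncol(Ystack)); v <- v / sqrt(sum(v^2))    # random unit projection
    u <- as.numeric(Ystack %*% v)                        # univariate pseudo-response
    slice <- cut(rank(u, ties.method = "first"), H, labels = FALSE)
    for (h in 1:H) {
      idx <- slice == h
      mh <- colMeans(Z[idx, , drop = FALSE])
      M <- M + (sum(idx) / n) * tcrossprod(mh) / m       # average the SIR kernels
    }
  }
  eta <- eigen(M, symmetric = TRUE)$vectors[, 1:d1, drop = FALSE]
  beta1 <- solve(Rchol) %*% eta                # back to the original X1 scale
  X1 %*% beta1                                 # the reduced variable beta1^T X1
}
```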

2.2. Path II

We defer the theoretical justification of path II to the on-line supplementary file. Next we describe the procedure of path II.

2.2.1. Partial sufficient dimension reduction

In path II, we must deal with X1⨿X2|(β1TX1,Y) and X1⨿Y|β1TX1. The former is to identify SX2|X1(Y), the partial CS (Chiaromonte et al., 2002) but with multivariate responses, whereas the latter is to identify SY|X1, the usual CS (Cook, 1994, 1996). To the best of our knowledge, the only available multivariate response SDR in the presence of categorical predictors is the projective resampling partial sliced inverse regression (PRPSIR), proposed by Hilafu and Yin (2013). Here, we shall use PRPSIR for estimating SX2|X1(Y); see the on-line supplementary file for more details on this method. For estimating SY|X1, we simply adopt SIR (Li, 1991).

2.2.2. Partial sufficient variable selection

With R(X1)=β1TX1 where β1T=(Id1,0), and YT=(X2T,Y), path II seeks to estimate SY|XV. However, in path II we must deal with two parts: SX2|X1(Y) and SY|X1. The estimation of SX2|X1(Y) becomes a new estimation problem of partial sufficient variable selection, which is parallel to what we described in Section 1.2. For this, we adopt PRPSIR (Hilafu and Yin, 2013) incorporated in Li (2007). To estimate SY|X1V, we use SIR (Li, 1991) incorporated in Li (2007).

2.2.3. Estimation procedure

Our estimation procedure of path II is as follows.

  • Step 1:

    decompose X∈Rp into XT=(X1T,X2T), where X1 is a p1×1 vector with n>p1, and consider the problems of estimating SX2|X1(Y) and SY|X1.

  • Step 2:

    for the SDR solution, apply PRPSIR to the problem of SX2|X1(Y), and find the reduced variable α1TX1; for the SVS solution, apply PRPSIR incorporated in Li's (2007) approach to the problem of SX2|X1(Y), and find the reduced variable α1TX1.

  • Step 3:

    for the SDR solution, apply SIR to the problem of SY|X1, and find the reduced variable α2TX1; for the SVS solution, apply SIR incorporated in Li's (2007) approach to the problem of SY|X1, and find the reduced variable α2TX1.

  • Step 4:

    set β  1 = (α  1, α  2), replace predictor X by (β1TX1,X2) and go back to step 1.

When we combine α1 and α2 to obtain β1, we may simply use the singular value decomposition to remove redundant directions in case α1 and α2 have common estimated directions, though other refined methods could be used. We repeat steps 1–4 until all the variables in the original predictor vector X have been used in step 1. A detailed algorithm is given in the on-line supplementary file.
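A small R sketch of this combination step (the tolerance used to declare a singular value negligible is an arbitrary choice):

```r
# Combine the two estimated direction sets and remove redundant directions
# via a singular value decomposition of the concatenated matrix.
combine_directions <- function(alpha1, alpha2, tol = 1e-8) {
  B <- cbind(alpha1, alpha2)
  sv <- svd(B)
  keep <- sv$d > tol * sv$d[1]      # drop directions already spanned by the others
  sv$u[, keep, drop = FALSE]        # orthonormal basis for span(alpha1, alpha2)
}

# Example: the second column of alpha2 duplicates a direction in alpha1.
alpha1 <- cbind(c(1, 0, 0, 0), c(0, 1, 0, 0))
alpha2 <- cbind(c(0, 0, 1, 0), c(1, 0, 0, 0))
beta1 <- combine_directions(alpha1, alpha2)
ncol(beta1)   # 3, not 4
```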

2.3. Effective use of path I and path II

Our two paths may depend on how the predictors are grouped. Grouping is determined by two factors: the choice of p  1 and the order of X.

For path II, we find that, if we try different partitions of the original predictor set, the solutions are very stable, i.e. changing p1 within a reasonable range (for which the selected method works well) does not affect the results (see the on-line supplementary file). A different ordering of the predictor vector does not change the results significantly either. Therefore, path II is quite stable. The reason is that the second part of statement (b), which deals with the marginal relationship between X1 and Y, plays an important role in retaining informative variables. In proposition 1, the second part of statement (b) indicates that marginally dependent variables will be in the sufficient part of statement (c). Therefore, marginal dependence may be critical and, regardless of the order of the variables, path II will use this information.

For path I, the order of the predictors is important, whereas the choice of p1 is much less critical (see the empirical evidence in the on-line supplementary file). If we spread the informative variables out over different blocks, then the signal in each block may be weak, making the informative variables difficult to recover: when we condition on one block, the informative variables in the other blocks increase the magnitude of the error in the analysis at this step. Thus, in theory, we would like to put relevant variables together. But we do not know which variables are relevant. Therefore, we borrow strength from the second part of statement (b) in path II and devise an ordering procedure, using marginal utility measures to put variables that are informative in the marginal sense together. Ordering using marginal information is important because the marginally dependent variables are in the sufficient set of statement (c).
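For concreteness, here is a self-contained R sketch of such an ordering based on the sample distance correlation, implemented directly (the dcor function of the energy package could be used instead). Predictors are ranked from the weakest to the strongest marginal association, so that the marginally informative variables end up together at the end of the ordered vector.

```r
# Sample distance correlation between a single predictor x and the response y.
dcor1 <- function(x, y) {
  dcenter <- function(v) {
    D <- as.matrix(dist(v))
    sweep(sweep(D, 1, rowMeans(D)), 2, colMeans(D)) + mean(D)   # double centring
  }
  A <- dcenter(x); B <- dcenter(y)
  dcov <- sqrt(mean(A * B))
  dvar <- sqrt(sqrt(mean(A * A)) * sqrt(mean(B * B)))
  if (dvar == 0) 0 else dcov / dvar
}

# Reorder the columns of X from weakest to strongest marginal association with y.
order_by_dcor <- function(X, y) {
  scores <- apply(X, 2, dcor1, y = y)
  X[, order(scores), drop = FALSE]
}
```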

For path I, it is desirable to put potentially informative variables towards the end. If we put them at the front (in the first block), they could be more difficult to recover, because at that point more noise variables are in the response. This, after some steps, potentially adds noise variables to the estimate, diluting its accuracy. However, when the potentially informative variables are at the end, the method will probably first discard irrelevant variables (reducing noise) and then, towards the end, recover the important variables with less noise and more accurate estimates. This, unfortunately, means that path I relies on the marginal order of the predictors. However, a possible solution is to turn the response into a categorical variable (by slicing) and to use path II.

Although the choice of p1 is less critical, we should choose p1 to be as large as possible while the method still works well in a traditional setting (i.e. n>p1). For instance, for SIR (Li, 1991), p1 can run from 2 to n−5 (Cook et al., 2012). This is because, if a model is abundant, a larger p1 will reduce the magnitude of the conditional error, i.e. the variance of Y|X, and strengthen the signals. If a model is sparse, a larger p1 still strengthens the signal by reducing the dimension of the long response vector. Consequently, if a one-step solution is available, i.e. p1 = p works, then this is preferred.

In conclusion, to use the two paths effectively, we suggest that one should pick a larger p  1 for a method that works well in a traditional setting (n>p  1), and use it as the block size. For path I, order the predictors by using a marginal utility measure (e.g. Zhu et al. (2011) and Li et al. (2012)) by ranking from the weakest association to the strongest association, whereas, for path II, no particular ordering is necessary. Nevertheless, as a referee pointed out, the stability, or, say, the difference between two estimates from two different partitions, is difficult to quantify rigorously. Although we agree with this, we borrow the ensemble idea from Yin and Li (2011) to provide a partial remedy (see limited evidence in the on-line supplementary file).

2.4. Theoretical properties

Theoretical properties, such as the asymptotic properties that a referee pointed out, are difficult to establish for our sequential procedure, unlike for most studies of large p, small n problems, which offer a one-step solution. Nevertheless, we offer some explanations of the difficulties that are posed by the sequential approach, but we leave it largely as an open problem. To our limited knowledge, we can consider three cases: A1, s ⩽ p < n, where p is fixed and n → ∞; A2, s < n < p, with n and p → ∞; A3, n < s = αp for some α ∈ (0,1], with p → ∞.

Asymptotic results for case A1 can be obtained straightforwardly following from Li et al. (2008) as described in the on-line supplementary file.

Let p  1 and q  1 be the respective dimensions of X  1 and (Y, X  2). Consider the simplest case of the two-step solution. Note that p  1 + q  1 = p + 1; when p → ∞, we can have three different scenarios:

  • a.

    q  1 fixed but p  1→∞;

  • b.

    q  1→∞, but p  1 fixed;

  • c.

    p  1, q  1→∞.

For scenario (a), because q  1 is fixed, by a suitable adaptation of the projective resampling idea by Li et al. (2008), we can convert scenario (a) into a univariate response and adapt the one-step approach of either A2 where Wu and Li (2011) developed results, or A3 where Cook et al. (2012) developed results. Although the idea seems straightforward, the technicalities are still difficult as this is a two-step solution combining projective resampling, and A2 or A3. It is even more difficult to deal with scenarios (b) and (c) because they bring a new situation where the dimension of the response tends to ∞. Although we can use the projective resampling idea to convert the response vector into a univariate response, we do not know yet what conditions to put on the response before we study the properties for scenario (b) and even more so for scenario (c), where the predictor vector tends to ∞ as well.

We believe that each set-up can result in an independent research paper, if results can be obtained. Owing to such difficulties and fresh challenges, it is best to leave a complete asymptotic study when n and p tend to ∞ as a future research topic.

3. Simulations and applications

In this section, we assess the efficacy of the proposed paths through simulations and application to real data.

3.1. Simulations

In the simulations, for an estimate B̂ of B0, both assumed to be semiorthogonal without loss of generality, the estimation error is measured by the distance between the subspaces spanned by these matrices, Δf(B̂, B0) = ‖PB̂ − PB0‖, where ‖·‖ is the Frobenius norm (Li et al., 2005). The corresponding benchmark distance in this paper is the average of 10000 simulated distances between two random matrices, following Li et al. (2008). Another measure is the absolute correlation coefficient |r| between the true sufficient predictor and its estimate, whose benchmark is the average over 10000 simulated values of |r|, each calculated by randomly simulating the direction and the data vector.
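A direct R translation of this distance, assuming the bases are supplied as p × d matrices of full column rank:

```r
# Frobenius-norm distance between the subspaces spanned by B_hat and B0.
delta_f <- function(B_hat, B0) {
  proj <- function(B) B %*% solve(crossprod(B)) %*% t(B)   # projection onto span(B)
  norm(proj(B_hat) - proj(B0), type = "F")
}

# Two unrelated random directions in p = 1000 give a distance close to sqrt(2),
# consistent with the benchmark value 1.4138 quoted below Table 1.
set.seed(1)
B0 <- qr.Q(qr(matrix(rnorm(1000), 1000, 1)))
Bh <- qr.Q(qr(matrix(rnorm(1000), 1000, 1)))
delta_f(Bh, B0)
```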

The accuracy of sparsity recovery is measured by the true positive rate TPR, defined as the ratio of the number of correctly identified active predictors to the number of truly active predictors, and the false positive rate FPR, defined as the ratio of the number of falsely identified active predictors to the total number of inactive predictors. The measures TPR and FPR are also known as the sensitivity and 1 − specificity and, ideally, we wish TPR to be close to 1 and FPR to be close to 0 at the same time.
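These two rates reduce to simple comparisons of index sets; a two-function R version, where selected and active are integer vectors of predictor positions and p is the total number of predictors:

```r
# True and false positive rates for a selected set of predictor indices.
tpr <- function(selected, active) length(intersect(selected, active)) / length(active)
fpr <- function(selected, active, p) length(setdiff(selected, active)) / (p - length(active))
```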

For each model setting, 100 replicates of the data are generated, unless stated otherwise. In each model we consider three cases: case 1, the xi and ɛ are independent and identically distributed N(0, 1); case 2, the predictor xi is generated as xi = Σx^{1/2}W, where Σx is a positive definite matrix with (j1, j2)th entry 0.5^{|j1−j2|} and W is N(0, Ip), and the ɛ are independent and identically distributed N(0, 1), independent of the xi; case 3, the predictor xi is generated as xi = Σx^{1/2}W, where Σx is a positive definite matrix with (j1, j2)th entry 0.9^{|j1−j2|} and W is N(0, Ip), and the ɛ are independent and identically distributed N(0, 1), independent of the xi.
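As an illustration of the predictor designs, a short R sketch that draws case 2 predictors and errors (case 3 changes 0.5 to 0.9, and case 1 corresponds to the identity covariance); the default n and p are the model 1 settings:

```r
# Generate n observations of a p-dimensional predictor with
# cov(x_j1, x_j2) = rho^|j1 - j2| (rho = 0.5 for case 2), plus i.i.d. N(0,1) errors.
gen_case2 <- function(n = 200, p = 1000, rho = 0.5) {
  Sigma <- rho^abs(outer(1:p, 1:p, "-"))
  W <- matrix(rnorm(n * p), n, p)
  X <- W %*% chol(Sigma)          # rows of X then have covariance Sigma
  eps <- rnorm(n)
  list(X = X, eps = eps)
}
```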

In our simulations, we use a fixed p1 = 20 (the dimension of X1) in each step. In the PRSIR step we use m = 2000, and in the PRPSIR step we take m = 500 to save some computing time. In addition, when we compare the accuracy of the estimates, in each step we retain only the directions corresponding to a non-zero estimated dimension, i.e., if the estimated dimension is 0, we delete all the variables that are used as predictors in this step; if it is bigger than 0, we retain either the true d (path I) or the actual estimated dimensions (path II). When we assess the accuracy of the estimated dimensions, we use the actual estimated dimension in each step. For testing the dimensionality in both paths, the χ2-test (Li, 1991) is set at α = 0.05, unless otherwise stated. In the tables reported, 'non-sparse estimate' means that we used only the SDR method in the paths, whereas 'sparse estimate' means that we used the SDR method with the penalization approach.

We study the following two models: model 1 is a sparse non-linear example for path I with continuous response, whereas model 2 is a sparse example for path II with discrete response but two dimensional.

3.1.1. Model 1

Model 1 is the sparse non-linear model
with p = 1000, n = 200 and the number of active variables s = 4.

Table 1 reports the results for model 1. As indicated in the last column, the χ  2-test tends to overestimate the dimension. But TPR and FPR are very accurate for all three cases, indicating that the procedure does a good job of identifying the active predictors. For independent (case 1) and moderately correlated data (case 2), the non-sparse solutions are very accurate, in terms of both Δf and |r|. When the predictors are highly correlated (case 3), the estimation is less accurate as expected, as PRSIR is not specially devised to deal with such data. The sparse solutions for all cases are very reasonable, in terms of Δf and |r|, but are generally inferior to non-sparse estimates. This agrees with the general fact that sparse estimates are good in selecting variables but tend to induce bias in estimation.

Table 1. Accuracy for model 1

Case | Non-sparse Δf | Non-sparse |r| | Sparse Δf | Sparse |r| | TPR | FPR | % of estimated d
1 | 0.2482 (0.0552) | 0.9852 (0.0068) | 0.5721 (0.0915) | 0.9081 (0.0458) | 1 | 0.0244 | 8 (d = 1), 63 (d = 2), 29 (d = 3)
2 | 0.3642 (0.0792) | 0.9862 (0.0066) | 0.8744 (0.2580) | 0.8280 (0.1165) | 0.9975 | 0.0241 | 12 (d = 1), 53 (d = 2), 35 (d = 3)
3 | 1.0032 (0.1399) | 0.9749 (0.0121) | 1.3078 (0.1236) | 0.7545 (0.1861) | 1 | 0.1277 | 7 (d = 1), 51 (d = 2), 42 (d = 3)

For Δf and |r|, the results reported are the average over 100 replicates; the values in parentheses are the standard errors. Benchmark Δf = 1.4138; |r| = 0.0442 for case 1, 0.0468 for case 2 and 0.0679 for case 3. The χ2-test is at α = 0.05.


3.1.2. Model 2

Model 2 is the sparse categorical two-dimensional model
where β1T=(1,1,1,1,0,…,0) and β2T=(0,0,0,0,0,0,1,1,1,1,0,…,0), I(·) is an indicator function, p = 1000 and n = 200. This model is an example from Zhu and Zeng (2006). The response variable Y takes four values: 0, 1, 2 or 3.

Table 2 reports the results for model 2. This is a sparse model but two dimensional. The estimated dimensions are very accurate as indicated in the last column. Again TPR and FPR for all three cases show the efficacy of path II. As expected, when the dimension increases, the estimation accuracy in terms of both Δf and |r| decreases.

Table 2. Accuracy for model 2

Case | Non-sparse Δf | Non-sparse |r| | Sparse Δf | Sparse |r| | TPR | FPR | % of estimated d
1 | 1.5626 (0.0894) | 0.4253/0.6518 (0.1444/0.2106), 0.3687/0.4922 (0.1732/0.1696) | 1.6126 (0.1009) | 0.4011/0.5132 (0.1252/0.1963), 0.2811/0.4232 (0.2311/0.1423) | 0.9887 | 0.1750 | 97 (d = 2), 3 (d = 3)
2 | 1.6582 (0.0912) | 0.3779/0.7343 (0.1144/0.1591), 0.4736/0.4351 (0.1424/0.1504) | 1.878 (0.1207) | 0.2464/0.4053 (0.1267/0.1940), 0.4078/0.5318 (0.1119/0.1452) | 1 | 0.1676 | 2 (d = 1), 92 (d = 2),
3 | 1.8427 (0.0822) | 0.3545/0.6032 (0.1918/0.1580), 0.4112/0.4559 (0.1424/0.1120) | 1.9268 (0.0820) | 0.2312/0.3901 (0.0967/0.2146), 0.4117/0.3618 (0.1212/0.1124) | 1 | 0.2172 | 3 (d = 1), 94 (d = 2),

For Δf and |r|, the results reported are the average over 100 replicates; the values in parentheses are the standard errors. Benchmark Δf = 1.9993; |r| = 0.0595 for case 1, 0.0607 for case 2 and 0.0716 for case 3. The χ2-test is at α = 0.05.


In all our simulations, we report the ‘raw’ results from the path once it finishes the first cycle. These reported ‘raw’ results already seem sufficiently accurate. However, re-estimation may be used to improve the results, i.e., once the path stops after the first cycle, we use the remaining active variables to obtain another solution. An illustration of our approach for an abundant model and a comparison of our approach with the one-step estimation method can be found in the on-line supplementary file.

3.2. Leukaemia data

In this section, we shall use path II to analyse a leukaemia data set, which comes from high density Affymetrix oligonucleotide arrays. This data set was first analysed by Golub et al. (1999), and later by Chiaromonte and Martinelli (2002), Dudoit et al. (2002) and Fan and Fan (2008), among others. The data are available from http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. There are 72 samples from two classes of acute leukaemia: 47 in acute lymphoblastic leukaemia and 25 in acute myeloid leukaemia. There are 7129 genes. Before we apply our path II approach to the data, we preprocess them following Golub et al. (1999). Three preprocessing steps were applied:

  • a.

    thresholding, gene expression readings of 100 or fewer were set to 100 and expression readings of 16000 or more were set to 16000;

  • b.

    screening, only genes with max − min > 500 and max/min > 5 were included, where max and min refer to the maximum and minimum readings of a gene expression among the 72 samples respectively;

  • c.

    transformation, gene expression readings of the genes selected were log-transformed.

The data were then summarized by a 72 × 3571 matrix X = (x  ij), where x  ij denotes the expression level for gene j in messenger ribonucleic acid sample i. We further standardized the data so that observations have mean 0 and variance 1 across variables (genes).
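The three preprocessing steps and the standardization translate into a few lines of R; here expr is assumed to be the raw 72 × 7129 expression matrix with samples in rows, and standardizing each gene column is one common reading of the standardization described above:

```r
# Preprocess the raw 72 x 7129 leukaemia expression matrix (samples in rows).
preprocess_leukaemia <- function(expr) {
  expr <- pmin(pmax(expr, 100), 16000)              # (a) floor at 100, cap at 16000
  mx <- apply(expr, 2, max); mn <- apply(expr, 2, min)
  expr <- expr[, mx - mn > 500 & mx / mn > 5]       # (b) keep genes with max - min > 500 and max/min > 5
  scale(log10(expr))                                # (c) log-transform, then standardize each gene
}
```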

In the application of our method, we run the SIR and PRPSIR steps and form the estimated space by combining directions from these steps, if any. At each step, we use Li's χ  2-test (Li, 1991) to determine the structural dimension. Because of the computational cost, tuning parameters for the sparse SDR estimation of Li (2007) are chosen only at the outset, the first step, and are applied to the remaining steps. We set the number of predictors (genes) that are considered at each step to be 20 (p  1 = 20), and the number of slices to be 5.

3.2.1. Initial stage estimation

When path II stops, the initial estimate has dimension 1 and it contains 28 non-zero coefficients—it selects 28 genes. Fig. 1(a) shows a boxplot of the sparse estimate of the direction. Fig. 1(b) presents a boxplot of the non-sparse estimate of the direction with all the selected 28 genes by refitting the 72 × 28 data.

Fig. 1. Sufficient summary plot for the leukaemia data with two categories

3.2.2. Refined stage estimation

In the refined stage we form a new data matrix of size 72 × 28. Since now n = 72 and p = 28, we can simply apply the usual SIR method with the shrinkage procedure applied. The estimate contains 11 non-zero coefficients; it selects 11 genes. Fig. 1(c) shows a boxplot of the sufficient predictor that is formed by using this sparse estimate of the direction. Fig. 1(d) presents a boxplot of the sufficient predictor from a non-sparse estimate using only the 11 selected genes, by refitting the 72 × 11 data. There is no further reduction from the 11 genes, and thus we have the final estimate of 11 genes. Fig. 1(d) is our final summary plot, which clearly shows the separation of the two groups.

Since we used SIR and its variations, we would like to check the conditional normality of X|Y. Note that these methods require a linearity condition (Cook, 1998), which is implied by conditional normality. We use the Shapiro–Wilk test for normality, with p-values of 0.4569 or greater, and the Bartlett test for constant variance, with p-value 0.2955. These results and a QQ-plot (not reported here) suggest that a conditional normality assumption is very reasonable.
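Such checks are straightforward with the base R functions shapiro.test and bartlett.test; a sketch applied gene by gene, where X is assumed to be the reduced 72 × 11 data matrix and y the two-class label:

```r
# Check conditional normality of X | Y: Shapiro-Wilk within each class for every gene,
# and Bartlett's test of equal variances across the two classes.
check_conditional_normality <- function(X, y) {
  shapiro_p  <- apply(X, 2, function(x) min(tapply(x, y, function(v) shapiro.test(v)$p.value)))
  bartlett_p <- apply(X, 2, function(x) bartlett.test(x, y)$p.value)
  c(min_shapiro = min(shapiro_p), min_bartlett = min(bartlett_p))
}
```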

In this two-group analysis, re-estimation using the reduced variables is applied; the improvement that it brings supports our earlier comments in Section 3.1.

4. Discussion

Analysis of large p, small n data is important but difficult, for at least two reasons: the sample covariance matrix is not invertible, and manipulating a very large predictor vector is demanding. In this paper we proposed a simple framework to tackle large p, small n problems. The framework differs from the usual methodologies in that it decomposes the predictor vector into pieces, so that inverting the sample covariance matrix is not an issue and a large response vector can be handled. It provides a sufficient way to solve the problem. We illustrated the framework via two paths using SDR approaches. However, the framework does not have to be restricted to our chosen approaches: many methods that work for n>p data can be adopted in the framework, as long as a multiple-response vector can be handled. Hence, it opens up possible research directions for large p, small n problems. However, the asymptotic properties of the framework when p → ∞, extended from the one-step solution, are not yet clear. In addition, differences between solutions from different partitions of the predictor vector may be difficult to quantify rigorously, though an ensemble idea can provide a promising solution. We leave these as open problems.

Acknowledgements

The authors thank the Joint Editor, an Associate Editor and two referees for their valuable comments, which greatly improved the paper. This work is supported in part by National Science Foundation grant DMS-1205564.

References

1. Bura, E. and Cook, R. D. (2001) Extending sliced inverse regression: the weighted chi-squared test. J. Am. Statist. Ass., 96, 996–1003.
2. Candès, E. and Tao, T. (2007) The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35, 2313–2351.
3. Chen, X., Zou, C. and Cook, R. D. (2010) Coordinate-independent sparse sufficient dimension reduction and variable selection. Ann. Statist., 38, 3696–3723.
4. Chiaromonte, F., Cook, R. D. and Li, B. (2002) Sufficient dimension reduction in regression with categorical predictors. Ann. Statist., 30, 475–497.
5. Chiaromonte, F. and Martinelli, J. (2002) Dimension reduction strategies for analyzing global gene expression data with a response. Math. Biosci., 176, 123–144.
6. Cook, R. D. (1994) On the interpretation of regression plots. J. Am. Statist. Ass., 89, 177–190.
7. Cook, R. D. (1996) Graphics for regressions with a binary response. J. Am. Statist. Ass., 91, 983–992.
8. Cook, R. D. (1998) Regression Graphics: Ideas for Studying Regressions through Graphics. New York: Wiley.
9. Cook, R. D. (2004) Testing predictor contributions in sufficient dimension reduction. Ann. Statist., 32, 1062–1092.
10. Cook, R. D., Forzani, L. and Rothman, A. (2012) Estimating sufficient reductions of the predictors in abundant high-dimensional regressions. Ann. Statist., 40, 353–384.
11. Cook, R. D., Li, B. and Chiaromonte, F. (2007) Dimension reduction in regression without matrix inversion. Biometrika, 94, 569–584.
12. Cook, R. D. and Setodji, C. (2003) A model-free test for reduced rank in multivariate regression. J. Am. Statist. Ass., 98, 340–351.
13. Cook, R. D. and Weisberg, S. (1991) Discussion of Li (1991). J. Am. Statist. Ass., 86, 328–332.
14. Dalalyan, A., Juditsky, A. and Spokoiny, V. (2008) A new algorithm for estimating the effective dimension-reduction subspace. J. Mach. Learn. Res., 9, 1647–1678.
15. Donoho, D. L. (2000) High-dimensional data analysis: the curses and blessings of dimensionality. American Mathematical Society Conf. Math Challenges of the 21st Century.
16. Dudoit, S., Fridlyand, J. and Speed, T. P. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statist. Ass., 97, 77–87.
17. Fan, J. and Fan, Y. (2008) High-dimensional classification using features annealed independence rules. Ann. Statist., 36, 2605–2637.
18. Fan, J. and Li, R. (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Ass., 96, 1348–1360.
19. Fan, J. and Li, R. (2006) Statistical challenges with high dimensionality: feature selection in knowledge discovery. In Proc. Int. Congr. Mathematicians (eds M. Sanz-Sole, J. Soria, J. L. Varona and J. Verdera), vol. III, pp. 595–622. Freiburg: European Mathematical Society.
20. Fan, J. and Lv, J. (2008) Sure independence screening for ultrahigh dimensional feature space (with discussion). J. R. Statist. Soc. B, 70, 849–911.
21. Fan, J., Samworth, R. and Wu, Y. (2009) Ultrahigh dimensional feature selection: beyond the linear model. J. Mach. Learn. Res., 10, 2013–2038.
22. Fan, J. and Song, R. (2010) Sure independence screening in generalized linear models with NP-dimensionality. Ann. Statist., 38, 3567–3604.
23. Fukumizu, K., Bach, F. and Jordan, M. (2004) Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5, 73–99.
24. Fukumizu, K., Bach, F. and Jordan, M. (2009) Kernel dimension reduction in regression. Ann. Statist., 37, 1871–1905.
25. Fung, W. K., He, X., Liu, L. and Shi, P. (2002) Dimension reduction based on canonical correlation. Statist. Sin., 12, 1093–1113.
26. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
27. Hall, P. and Miller, H. (2009) Using generalised correlation to effect variable selection in very high dimensional problems. J. Computnl Graph. Statist., 18, 533–550.
28. Hilafu, H. and Yin, X. (2013) Sufficient dimension reduction and statistical modeling of plasma concentrations. Computnl Statist. Data Anal., 63, 139–147.
29. Hristache, M., Juditsky, A., Polzehl, J. and Spokoiny, V. (2001a) Structure adaptive approach for dimension reduction. Ann. Statist., 29, 1537–1566.
30. Hristache, M., Juditsky, A. and Spokoiny, V. (2001b) Direct estimation of the index coefficient in a single-index model. Ann. Statist., 29, 595–623.
31. Huang, J., Horowitz, J. and Ma, S. (2008) Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann. Statist., 36, 587–613.
32. Li, K.-C. (1991) Sliced inverse regression for dimension reduction (with discussion). J. Am. Statist. Ass., 86, 316–342.
33. Li, L. (2007) Sparse sufficient dimension reduction. Biometrika, 94, 603–613.
34. Li, L., Cook, R. D. and Nachtsheim, C. J. (2005) Model-free variable selection. J. R. Statist. Soc. B, 67, 285–299.
35. Li, L. and Nachtsheim, C. J. (2006) Sparse sliced inverse regression. Technometrics, 48, 503–510.
36. Li, B., Wen, S. and Zhu, L. (2008) On a projective resampling method for dimension reduction with multivariate responses. J. Am. Statist. Ass., 103, 1177–1186.
37. Li, L. and Yin, X. (2008) Sliced inverse regression with regularizations. Biometrics, 64, 124–131.
38. Li, B., Zha, H. and Chiaromonte, F. (2005) Contour regression: a general approach to dimension reduction. Ann. Statist., 33, 1580–1616.
39. Li, R., Zhong, W. and Zhu, L.-P. (2012) Feature screening via distance correlation learning. J. Am. Statist. Ass., 107, 1129–1139.
40. Lounici, K., Pontil, M., Tsybakov, A. and van de Geer, S. (2009) Taking advantage of sparsity in multitask learning. In Proc. Conf. Learning Theory, Montréal, June 18th–21st.
41. Ni, L., Cook, R. D. and Tsai, C. L. (2005) A note on shrinkage sliced inverse regression. Biometrika, 92, 242–247.
42. Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58, 267–288.
43. Wang, H. and Xia, Y. (2008) Sliced regression for dimension reduction. J. Am. Statist. Ass., 103, 811–821.
44. Wang, Q. and Yin, X. (2008) A nonlinear multi-dimensional variable selection method for high dimensional data: sparse MAVE. Computnl Statist. Data Anal., 52, 4512–4520.
45. Wu, Y. and Li, L. (2011) Asymptotic properties of sufficient dimension reduction with a diverging number of predictors. Statist. Sin., 21, 707–730.
46. Xia, Y., Tong, H., Li, W. K. and Zhu, L.-X. (2002) An adaptive estimation of dimension reduction space (with discussion). J. R. Statist. Soc. B, 64, 363–410.
47. Yin, X. and Bura, E. (2006) Moment based dimension reduction for multivariate response regression. J. Statist. Planng Inf., 136, 3675–3688.
48. Yin, X. and Cook, R. D. (2005) Direction estimation in single-index regressions. Biometrika, 92, 371–384.
49. Yin, X. and Li, B. (2011) Sufficient dimension reduction based on an ensemble of minimum average variance estimators. Ann. Statist., 39, 3392–3416.
50. Yin, X., Li, B. and Cook, R. D. (2008) Successive direction extraction for estimating the central subspace in a multiple-index regression. J. Multiv. Anal., 99, 1733–1757.
51. Zhou, J. and He, X. (2008) Dimension reduction based on constrained canonical correlation and variable filtering. Ann. Statist., 36, 1649–1668.
52. Zhu, L.-P., Li, L., Li, R. and Zhu, L. (2011) Model-free feature screening for ultrahigh-dimensional data. J. Am. Statist. Ass., 106, 1464–1475.
53. Zhu, L., Miao, B. and Peng, H. (2006) On sliced inverse regression with large dimensional covariates. J. Am. Statist. Ass., 101, 630–643.
54. Zhu, L.-P., Yin, X. and Zhu, L. (2010a) Dimension reduction for correlated data: an alternating inverse regression. J. Computnl Graph. Statist., 19, 887–899.
55. Zhu, L.-P., Yu, Z. and Zhu, L. (2010b) A sparse eigen-decomposition estimation in semi-parametric regression. Computnl Statist. Data Anal., 54, 976–986.
56. Zhu, Y. and Zeng, P. (2006) Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Am. Statist. Ass., 101, 1638–1651.
