Yaghoub Rahimi, Sung Ha Kang, Yifei Lou, A lifted ℓ1 framework for sparse recovery, Information and Inference: A Journal of the IMA, Volume 13, Issue 1, March 2024, iaad055, https://doi.org/10.1093/imaiai/iaad055
Abstract
We introduce a lifted |$\ell _1$| (LL1) regularization framework for the recovery of sparse signals. The proposed LL1 regularization is a generalization of several popular regularization methods in the field and is motivated by recent advancements in re-weighted |$\ell _1$| approaches for sparse recovery. Through a comprehensive analysis of the relationships between existing methods, we identify two distinct types of lifting functions that guarantee equivalence to the |$\ell _0$| minimization problem, which is a key objective in sparse signal recovery. To solve the LL1 regularization problem, we propose an algorithm based on the alternating direction method of multipliers and provide proof of convergence for the unconstrained formulation. Our experiments demonstrate the improved performance of the LL1 regularization compared with state-of-the-art methods, confirming the effectiveness of our proposed framework. In conclusion, the LL1 regularization presents a promising and flexible approach to sparse signal recovery and invites further research in this area.
1. Introduction
Compressed sensing [14] plays an important role in many applications including signal processing, medical imaging, matrix completion, feature selection and machine learning [18,24,33]. One important assumption in compressed sensing is that a signal of interest is sparse or compressible, which allows high-dimensional data to be represented efficiently by only a few meaningful components. Compressed sensing often involves sparse recovery from an under-determined linear system, which can be formulated mathematically as minimizing the |$\ell _0 $| ‘pseudo-norm’, i.e.
(1.1)$$ \begin{align}& \min_{\mathbf x \in \mathbb{R}^n} \|\mathbf x\|_0 \quad \text{s.t.} \quad A \mathbf x = \mathbf b,\end{align} $$
where |$A \in \mathbb{R}^{m \times n}$| is called a sensing matrix and |$\mathbf b \in \mathbb{R}^m $| is a measurement vector. Since the minimization problem (1.1) is NP-hard [34], various regularization functionals are proposed to approximate the |$\ell _0 $| penalty. The |$\ell _1 $| norm is widely used as a convex relaxation of |$\ell _0,$| which is called LASSO [46] in statistics or basis pursuit [13] in signal processing. Due to its convex nature, the |$\ell _1$| norm is computationally tractable to optimize, with exact recovery guarantees based on the restricted isometry property (RIP) and/or the null space property (NSP) [9,15,47]. Alternatively, there are non-convex models, i.e. concave with respect to the positive cone, that outperform the convex |$\ell _1$| approach in terms of improved accuracy of identifying sparse solutions. For example, |$\ell _p (0<p<1)$| [12,26,56,57], smoothly clipped absolute deviation (SCAD) [17], minimax concave penalty (MCP) [61], capped |$\ell _1$| (CL1) [30,43,64] and transformed |$\ell _1 $| (TL1) [31,62,63] are separable and concave penalties. Some non-separable non-convex penalties include sorted |$\ell _1 $| [23], |$\ell _1-\ell _2 $| [4,28,29,58,59] and |$\ell _1/\ell _2 $| [39,51,55]. Properties of sparsity-promoting functions are developed in [42]. In addition, a large volume of literature discusses greedy algorithms [25,33] that minimize the |$\ell _0$| regularization, such as orthogonal matching pursuit [49], CoSaMP [35], iterative hard thresholding (IHT) [7], conjugate gradient IHT [6] and algebraic pursuits [11].
In this paper, we generalize a class of re-weighted approaches [10,20,53] for sparse recovery based on the fact that a convex function can be smoothly approximated by subtracting a strongly convex function from the objective function [36]. In particular, we propose a lifted regularization, which we refer to as lifted |$\ell _1$| (LL1),
(1.2)$$ \begin{align}& F_{g, \alpha}^U (\mathbf x) = \min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u),\end{align} $$
where we denote |$|\mathbf x | = [ |x_1|,...,|x_n|]\in \mathbb R^n$| and |$\langle \cdot , \cdot \rangle $| is the standard Euclidean inner product. In (1.2), the variable |$\mathbf u $| plays the role of weights with |$U$| as a set of restrictions on |$\mathbf u $|, e.g. |$U = [0,\infty )^n $| or |$ U = [0,1]^n $|. The function |$g:\mathbb{R}^n \rightarrow \mathbb{R} $| is decreasing near zero (please refer to Definition 1 for the notion of a multi-variable decreasing function) to ensure a non-trivial solution of (1.2), and |$\alpha> 0$| is a parameter. We can consider a more general function |$g(\mathbf u; \alpha )$| instead of |$\alpha g(\mathbf u)$| when connecting the proposed model to several existing sparsity-promoting regularizations in Section 2.2. We focus on two specific types of |$g$| functions (see Definition 2) that ensure the existence and uniqueness of the minimum.
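To make the definition concrete, the following minimal numerical sketch (our own, not the authors' code) evaluates |$F_{g,\alpha }^U$| in (1.2) for a separable |$g$| and a box set |$U$| by brute-force minimization over a one-dimensional grid of candidate weights per coordinate; the function name and grid resolution are arbitrary choices.

```python
import numpy as np

def lifted_l1(x, g_i, u_grid, alpha):
    """Evaluate F_{g,alpha}^U(x) = min_{u in U} <u,|x|> + alpha*g(u) for a
    separable g(u) = sum_i g_i(u_i) and a box set U discretized by u_grid."""
    absx = np.abs(np.asarray(x, dtype=float))
    # objective value for every (coordinate, candidate weight) pair
    vals = np.outer(absx, u_grid) + alpha * g_i(u_grid)
    u = u_grid[vals.argmin(axis=1)]          # per-coordinate minimizer
    return (u * absx).sum() + alpha * g_i(u).sum(), u

# example with the Type B choice g_i(u) = -u^2/2 on U = [0,1]^n (see Section 3.3)
value, weights = lifted_l1([0.0, 0.3, 2.0], lambda u: -0.5 * u**2,
                           np.linspace(0.0, 1.0, 1001), alpha=1.0)
```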
To find a sparse signal from a set of linear measurements, we propose the following constrained minimization problem:
(1.3)$$ \begin{align}& \min_{\mathbf x \in \mathbb{R}^n,\, \mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) \quad \text{s.t.} \quad A \mathbf x = \mathbf b.\end{align} $$
We lift the dimension of a single vector |$\mathbf x\in \mathbb R^n$| in the original |$\ell _0$| problem (1.1) into a joint optimization over |$\mathbf x $| and |$\mathbf u$| in the proposed model (1.3). We can establish the equivalence between the two if the function |$g$| and the set |$U$| satisfy certain conditions. For measurements with noise, we consider an unconstrained formulation, i.e.
(1.4)$$ \begin{align}& \min_{\mathbf x \in \mathbb{R}^n,\, \mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \frac{\gamma}{2} \|A \mathbf x - \mathbf b\|_2^2,\end{align} $$
where |$\gamma>0$| is a balancing parameter.
From an algorithmic point of view, the lifted form enables us to minimize over two variables, each with a fast implementation, and can lead to a better local minimum in a higher dimension than the original formulation [2,27,36,60]. The idea of updating |$\mathbf u$| while finding the sparsest solution |$\mathbf x$| is also related to the classic graduated non-convexity algorithm [5,41] that constructs a family of functions (1.3) approaching the true cost function (1.1). To minimize (1.3) and (1.4), we apply the alternating direction method of multipliers (ADMM) [8,19], and conduct a convergence analysis of ADMM for solving the unconstrained problem (1.4). We demonstrate in experiments that the proposed approach outperforms the state of the art. Our contributions are threefold:
- We propose a unified model that generalizes many existing regularizations and re-weighted algorithms for sparse recovery problems;
- We establish the equivalence between the proposed model (1.3) and the |$\ell _0$| model (1.1);
- We present an efficient algorithm that is supported by a convergence analysis.
The rest of the paper is organized as follows. In Section 2, we present details of the proposed model, including its properties with an exact sparse recovery guarantee. Section 3 describes the ADMM implementation and its convergence. Section 4 presents experimental results to demonstrate how this lifted model improves the sparse recovery over the state of the art. Finally, concluding remarks are given in Section 5.
2. The Proposed Lifted L1 Framework
We first introduce definitions and notations that are used throughout the paper. The connection to well-known sparsity-promoting regularizations is presented in Section 2.2. We present useful properties of LL1 in Section 2.3, and establish its equivalence to the original |$\ell _0$| problem in Section 2.4.
2.1 Definitions and Notations
We mark any vector in bold; specifically, |$\mathbf 1$| denotes the all-ones vector and |$\mathbf 0$| denotes the zero vector. For a vector |$\mathbf x \in \mathbb{R}^n, $| we define |$|x|_{(i)} $| as the |$i $|-th largest component of |$|\mathbf x| $|. We say that a vector |$\mathbf x \in \mathbb{R}^n $| majorizes |$ \mathbf x^{\prime} \in \mathbb{R}^n$| with the notation |$|\mathbf x| \succ |\mathbf x^{\prime}|,$| if we have |$\sum _{i \leq j} |x|_{(i)} \leq \sum _{i \leq j} |x^{\prime}|_{(i)} \ \forall 1\leq j < n$| and |$\sum _{i = 1}^n |x_i| = \sum _{i = 1}^n |x_i^{\prime}| $|. For two vectors |$\mathbf x,\mathbf y\in \mathbb R^n,$| the notation |$\mathbf x\leq \mathbf y$| means each element of |$\mathbf x$| is smaller than or equal to the corresponding element in |$\mathbf y;$| similarly for |$\mathbf x\geq \mathbf y$|, |$\mathbf x>\mathbf y$| and |$\mathbf x<\mathbf y$|. The positive cone refers to the set |$\mathbb{R}^n_+ = \{ \mathbf x \in \mathbb{R}^n \mid \mathbf x \geq \mathbf 0 \} $|. A rectangular subset of |$ \mathbb{R}^n$| is a Cartesian product of intervals aligned with the canonical axes. We define two elementwise operators, |$\max (\mathbf x,\mathbf y)$| and |$\mathbf x \odot \mathbf y,$| both returning a vector formed by taking the maximum and the product, respectively, of every pair of corresponding components. The proposed regularization (1.2) can be expressed by the |$\delta $|-function,
(2.1)$$ \begin{align}& F_{g, \alpha}^U (\mathbf x) = \min_{\mathbf u \in \mathbb{R}^n} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \delta_U(\mathbf u), \quad \text{where}\ \delta_U(\mathbf u) = \begin{cases} 0 & \text{if}\ \mathbf u \in U, \\ +\infty & \text{otherwise}. \end{cases}\end{align} $$
We use (1.2) and (2.1) interchangeably throughout the paper. The constrained LL1 problem (1.3) is equivalent to
(2.2)$$ \begin{align}& \min_{\mathbf x, \mathbf u} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \delta_U(\mathbf u) + \delta_{\varOmega}(\mathbf x),\end{align} $$
where |$\varOmega = \{ \mathbf x \in \mathbb{R}^n \mid A \mathbf x = \mathbf b \} $|. We denote |$[n]$| as the set |$ \{1,2,\dots ,n \} $|, |$S_n $| as the symmetric group of |$n $| elements and a permutation |$\pi \in S_n$| of a vector |$\mathbf x$| is defined as |$\mathbf x \circ \pi = \left (x_{\pi (1)}, \dots , x_{\pi (n)} \right ).$| We summarize the relevant properties of a function as follows.
Definition 1. Let |$g:\mathbb{R}^n \rightarrow \mathbb{R} \cup \{ +\infty , -\infty \} $| be a function. We say that:
- The function |$g$| is separable if there exists a set of functions |$\{g_i: \mathbb{R} \rightarrow \mathbb{R}\}$| for |$i \in [n]$| s.t.$$ \begin{align*}& g(\mathbf{x}) = \sum_{i = 1}^n g_i(x_i), \quad \forall \mathbf{x} =[x_1,\cdots, x_n]. \end{align*} $$
- The function |$g$| is strongly convex with parameter |$\mu> 0 $| if(2.3)$$ \begin{align}& g(\mathbf y) \geq g(\mathbf x) + \left\langle \nabla g(\mathbf x), \mathbf y - \mathbf x \right\rangle + \frac{\mu}{2} \| \mathbf y - \mathbf x\|_2^2, \quad \forall \mathbf x, \; \mathbf y.\end{align} $$
- The function |$g$| is symmetric if |$ g(\mathbf x) = g(\mathbf x \circ \pi ), \; \forall \mathbf x \in \mathbb{R}^n $| and |$\forall \pi \in S_n $|.
- The function |$g$| is coercive if |$ g(\mathbf x) \rightarrow +\infty $| as |$ \| \mathbf x \| \rightarrow +\infty $|.
- The function |$g$| is decreasing on |$ U$| if |$g(\mathbf x) \leq g(\mathbf y), \; \forall \mathbf x \geq \mathbf y$| and |$\mathbf x, \; \mathbf y \in U $|.
2.2 Connections to Sparsity Promoting Models
Many existing models can be understood as special cases of the proposed LL1 model (1.3). We start with two recent works. One is a joint minimization model [65] between the weights and the vector,
where |$\epsilon> 0$| is a fixed number and |$|\mathbf x| - \epsilon $| is the vector obtained by subtracting |$\epsilon $| from all components of |$|\mathbf x|$|. With the assumption that the weights are binary, the authors of [65] established the equivalence between (1.1) and (2.4) for a small enough |$\epsilon $|. Another related work is the trimmed LASSO [1,3],
$$ \begin{align*}& \sum_{i = k+1}^{n} |x|_{(i)} \;=\; \min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle,\end{align*} $$
which is equivalent to the LL1 form on the right with |$U = \{ \mathbf u \in \{0,1\}^n \ \mid \ \|\mathbf u\|_1 = n-k \} $|. On the left, the sum is over the |$n-k$| smallest components of the vector |$\mathbf x $| for a given sparsity |$ k$|.
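A quick numerical check of this equivalence (our own sketch, brute-forcing the binary set |$U$|):

```python
import numpy as np
from itertools import combinations

x = np.array([3.0, -0.2, 1.5, 0.05, -2.0])
k, n = 2, x.size
# sum of the n-k smallest magnitudes of x
smallest_sum = np.sort(np.abs(x))[: n - k].sum()
# LL1 form: minimize <u,|x|> over binary u with ||u||_1 = n-k
ll1_min = min(np.abs(x)[list(S)].sum() for S in combinations(range(n), n - k))
assert np.isclose(smallest_sum, ll1_min)
```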
In what follows, we consider a more general form of |$g(\mathbf u; \alpha ),$| as opposed to |$\alpha g(\mathbf u),$| i.e.
(2.5)$$ \begin{align}& F_{g}^U (\mathbf x) = \min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + g(\mathbf u; \alpha ).\end{align} $$
As a consequence of the Fenchel–Moreau theorem, regularizations that are concave on the positive cone are of the form (2.5). In other words, we have the following theorem:
Theorem 1. Any proper and lower semi-continuous function |$J(\mathbf x) $| that is concave on the positive cone can be lifted by a convex function |$g:\mathbb{R}^n \rightarrow \mathbb{R} $| and a set |$U $| such that |$J(|\mathbf x|) = F_g^U(\mathbf x) $| as in (2.5). Here |$g(\mathbf u):= \sup _{\mathbf x \geq \mathbf 0} \left \langle \mathbf x, -\mathbf u \right \rangle + J(\mathbf x) $| and |$U = \{ \mathbf u \in \mathbb{R}^n \mid g(\mathbf u) \neq +\infty \}$|.
Please refer to Appendix A and Appendix B for the proof of Theorem 1 and detailed computations, respectively. As many sparsity-promoting functions satisfy the assumptions in Theorem 1, we can rewrite |$J(\mathbf x)$| in the form of |$F_g^U$|. Using Theorem 1, we relate (2.5) to the following functionals that are widely used to promote sparsity.
- (i) The |$\ell _p $| model [12,26] is defined as |$ J^{\ell _p} (\mathbf x) = \sum _{i=1}^{n} |x_i| ^{p}.$| As |$p \rightarrow 0 $|, |$\ell _p^p $| approaches to |$\ell _0 $|, while it reduces to |$\ell _1$| for |$p=1$|. Much research focuses on |$p=1/2,$| due to a closed-form solution [56,57] in the optimization process. By choosing |$ g(\mathbf u; p) = \frac{1-p}{p} \sum _{i = 1}^n u_i^{\frac{p}{p-1}}$| and |$U = \mathbb{R}^n_+ $|, we can express the |$\ell _p $| regularization as$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| + \frac{1-p}{p} u_i^{\frac{p}{p-1}} \right) \ \mid \ u_i \geq 0 \right\}. \end{align*} $$
- (ii) The log-sum penalty [10,38] is given by |$ J^{\log }_a (\mathbf x) = \sum _{i=1}^{n} \log (|x_i| + a), $| for a small positive parameter |$a$| to make the logarithmic function well-defined. The re-weighted |$ \ell _1$| algorithm [10] (IRL1) minimizes |$J^{\log }_a (\mathbf x)$|, which is equivalent to (2.5) in that$$ \begin{align*} & \min_{\mathbf u} \Bigg\{ \sum_{i=1}^{n} \left( u_i|x_i| + a u_i - \log(u_i) \right) \mid u_i \geq 0 \Bigg\}. \end{align*} $$
- (iii) Smoothly clipped absolute deviation (SCAD) [17] is defined by(2.6)$$ \begin{align}& J_{a, b}^{\text{SCAD}} (\mathbf x) = \sum_{i=1}^{n} f_{a, b}^{\text{SCAD}} (x_i), \text{where}\ f_{a, b} ^{\text{SCAD}}(t) = \left\{ \begin{array}{@{}ll} a |t| & \text{if}\ |t| \leq a, \\[3pt] \displaystyle\frac{2 a b |t| - t^2 - a^2}{2(b - 1)} & \text{if}\ a < |t| \leq a b, \\[6pt] \displaystyle\frac{(b + 1)a^2}{2} & \text{if}\ |t|> ab. \end{array}\right.\end{align} $$This penalty is designed to reduce the bias of the |$\ell _1 $| model. For |$g(\mathbf u, a, b) = -ab \| \mathbf u \|_1 + (b - 1) \frac{\| \mathbf u \|_2^2}{2} $| and |$U = [0, a]^n $|, |$J_{a, b}^{\text{SCAD}}$| is equivalent to$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| - ab u_i + (b - 1)\frac{u_i^2} {2} \right) \ \mid \ u_i \in [0, a] \right\}. \end{align*} $$
- (iv) Mini-max concave penalty (MCP) [61] is defined by(2.7)$$ \begin{align}& J_{a, b}^{\text{MCP}} (\mathbf x) = \sum_{i=1}^{n} f_{a, b}^{\text{MCP}} (x_i), \text{ where }\ f_{a, b}^{\text{MCP}} (t) = \left\{ \begin{array}{@{}ll} a |t| - \displaystyle\frac{t^2}{2 b} & \text{if}\ |t| \leq ab, \\[3pt] \frac{1}{2} b a^2 & \text{if}\ |t|> ab. \end{array}\right.\end{align} $$With the same purpose of reducing bias as SCAD, MCP consists of two piece-wise defined functions, which is simpler than SCAD. For |$g(\mathbf u, a, b) = -b (a \| \mathbf u\|_1 - \frac{\| \mathbf u\|_2^2}{2}) $| and |$U = \mathbb{R}^n_+ $|, we can rewrite |$J_{a, b}^{\text{MCP}}$| as$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| + b \left(\frac{u_i^2}{2} -a u_i \right) \right) \ | \ u_i \geq 0 \right\}. \end{align*} $$
- (v) Capped |$\ell _1 $| (CL1) [64] is defined as |$ J^{\text{CL1}}_a (\mathbf x) = \sum _{i=1}^{n} \min \{ |x_i|, a \},$| with a positive parameter |$a>0.$| As |$a \rightarrow 0 $|, the function |$J^{\text{CL1}}_a / a $| approaches |$\ell _0.$| The CL1 penalty is unbiased and has fewer internal parameters than SCAD/MCP. For |$g(\mathbf u,a) = -a\| \mathbf u\|_1 $| and |$U = [0,1]^n $|, the CL1 regularization can be expressed as$$ \begin{align*} & \min_{ \mathbf u} \Bigg\{ \sum_{i=1}^{n}\left( u_i|x_i| -a u_i \right) \ | \ 0 \leq u_i \leq 1 \Bigg\}. \end{align*} $$It is similar to the model in [65], except for a binary restriction on |$\mathbf u$| as in (2.4).
- (vi) Transformed |$\ell _1 $| (TL1) [31] is defined as |$ J_a^{\text{TL1}} (\mathbf x) = \sum _{i=1}^{n} \frac{(a+1)|x_i|}{a + |x_i|}. $| It reduces to |$\ell _0$| for |$a=0,$| and converges to |$\ell _1$| as |$a\rightarrow \infty .$| The TL1 regularization is Lipschitz continuous, unbiased and equivalent to$$ \begin{align*} & \min_{\mathbf u} \Bigg\{ \sum_{i=1}^{n} \left( u_i|x_i| +a u_i - 2 \sqrt{a u_i} \right) \ | \ u_i \geq 0 \Bigg\}. \end{align*} $$
- (vii) Error function penalty (ERF) [20] is defined by(2.8)$$ \begin{align}& J_{\sigma}^{\text{ERF}} (\mathbf x) = \sum_{i=1}^{n} f_{\sigma}^{\text{ERF}} (|x_i|), \text{ where }\ f_{\sigma}^{\text{ERF}} (t) = \int_{0}^{t} e^{-\tau^2 / \sigma^2} d \tau.\end{align} $$This model is less biased than |$\ell _1,$| and gives a good approximation to |$\ell _0 $| for a small value of |$\sigma $|. Let |$h(t) = \int _{t}^{1} \sqrt{- \log ( \tau )} d \tau $|; then for |$g(\mathbf u, \sigma ) = \sigma \sum _{i=1}^{n} h(u_i) $| and |$U = [0,1]^n $|, the ERF function is equivalent to$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| + \sigma h(u_i) \right) \ \mid \ u_i \in [0, 1] \right\}. \end{align*} $$
- (viii) The |$\ell _1-\ell _2$| penalty, defined as |$ J^{\text{L1-L2}} (\mathbf x) = \|\mathbf x \|_1 - \|\mathbf x\|_2, $| is an example of non-separable regularization, which has been proven to perform very well when the matrix |$A$| is highly coherent [29,59]. The |$\ell _1-\ell _2$| regularization is equivalent to$$ \begin{align*} & \min_{ \mathbf u} \Bigg\{ \sum_{i=1}^{n} u_i|x_i| \ \ \ | \ \ \ \|\mathbf 1 - \mathbf u \|_2 \leq 1 \Bigg\}. \end{align*} $$In this case, the corresponding |$g$| function is zero, and the set |$U$| is a non-rectangular set, thus non-separable.
We summarize these existing regularizations with their corresponding |$g$| function and |$U$| set in Table 1. All these |$g$| functions are separable, and hence we can plot |$g$| as a univariate function. As illustrated in Fig. 1, each |$g$| function is decreasing on a small interval near the origin, thus motivating the conditions on the function |$ g$| presented in Definition 2 as well as Theorems 5 and 6 for exact recovery guarantees.
Summary of various regularizations and their corresponding |$g$| function and |$U$| set
Model | Regularization | Function |$ g$| | Set |$U $| |
---|---|---|---|
|$ \ell _p$| | |$\sum _{i=1}^{n} |x_i|^{p} $| | |$\frac{1-p}{p} \sum _{i = 1}^n u_i^{\frac{p}{p-1}} $| | |$ \mathbb{R}^n_+$| |
log-sum | |$ \sum _{i=1}^{n} \log (|x_i| + a)$| | |$ a \| \mathbf u \|_1 - \sum _{i=1}^n \log (u_i)$| | |$ \mathbb{R}^n_+$| |
SCAD | (2.6) | |$-ab \| \mathbf u \|_1 + (b - 1) \frac{\| \mathbf u \|_2^2}{2} $| | |$[0, a]^n $| |
MCP | (2.7) | |$-b (a \| \mathbf u\|_1 - \frac{\| \mathbf u\|_2^2}{2}) $| | |$ \mathbb{R}^n_+$| |
CL1 | |$\sum _{i=1}^{n} \min \{ |x_i|, a \} $| | |$-a\| \mathbf u\|_1$| | |$ [0,1]^n$| |
TL1 | |$ \sum _{i=1}^{n} \frac{(a+1)|x_i|}{a + |x_i|} $| | |$ a \|\mathbf u \|_1 - 2 \sum _{i} (\sqrt{a u_i})$| | |$ \mathbb{R}^n_+$| |
ERF | (2.8) | |$\sigma \sum _i h(u_i)$| | |$ [0,1]^n$| |
|$\ell _1-\ell _2$| | |$\|\mathbf x\|_1-\|\mathbf x\|_2$| | 0 | |$\{\|\mathbf 1-\mathbf u\|_2\leq 1\}$| |
Fig. 1. The associated |$g$| functions for the models listed in Table 1, grouped into two categories: (a) SCAD with |$a = 1, b = 2$|, CL1 with |$a = 1$| and ERF with |$\sigma = 1 $|; and (b) |$\ell _p $| with |$p = 1/2 $|, log-sum with |$a = 1 $|, MCP with |$a = 1, b = 3 $| and TL1 with |$a = 4 $|. This motivates Type B and Type C in Definition 2.
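As a sanity check of two rows of Table 1, the following sketch (ours, not from the paper) verifies numerically that the CL1 and MCP lifts reproduce the original penalties up to the constant shifts derived in Appendix B, namely |$F_g^U + a = J^{\text{CL1}}_a$| and |$F_g^U + a^2 b/2 = J^{\text{MCP}}_{a,b}$|.

```python
import numpy as np

a, b = 1.0, 3.0
u = np.linspace(0.0, 10.0, 100001)            # dense grid of candidate weights
for x in np.linspace(-4.0, 4.0, 41):
    # CL1 row: g(u) = -a*u on U = [0,1]
    u1 = u[u <= 1.0]
    F_cl1 = np.min(u1 * abs(x) - a * u1)
    assert np.isclose(F_cl1 + a, min(abs(x), a), atol=1e-3)
    # MCP row: g(u) = -b*(a*u - u^2/2) on U = [0, inf), truncated to the grid
    F_mcp = np.min(u * abs(x) - b * (a * u - u**2 / 2))
    J_mcp = a * abs(x) - x**2 / (2 * b) if abs(x) <= a * b else a**2 * b / 2
    assert np.isclose(F_mcp + a**2 * b / 2, J_mcp, atol=1e-3)
```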
2.3 Properties of the Proposed Regularization
There is a wide range of analysis related to concave and symmetric regularization functions based on RIP and NSP conditions [9,15,47,48] regarding the sensing matrix |$A $|. Our general model |$F^U_{g,\alpha }$| satisfies all the NSP-related conditions discussed in [48] so that the exact sparse recovery can be guaranteed. Theorem 2 summarizes some important properties of the proposed regularization.
Theorem 2. For any |$\mathbf x \in \mathbb{R}^n,$| |$\alpha> 0,$| and a feasible set |$U $| on the weights |$\mathbf u, $| the function |$F_{g, \alpha }^U (\mathbf x)$| defined in (1.2) has the following properties:
(i) The function |$F^U_{g, \alpha }(\mathbf x) = -\left ( \alpha g + \delta _{U} \right )^*(-|\mathbf x|),$| where |$f^*$| denotes the convex conjugate of a function |$f$|, thus |$F^U_{g, \alpha }(\mathbf x)$| is concave in the positive cone.
(ii) If |$g $| is symmetric on |$ U $|, then |$ F^U_{g, \alpha }$| is symmetric on |$\mathbb{R}^n $|.
(iii) If |$g $| is separable on |$ U $|, then |$ F^U_{g, \alpha }$| is separable on |$\mathbb{R}^n$|.
(iv) If |$g $| is separable and symmetric on |$U $|, |$ F^U_{g, \alpha }$| satisfies the increasing property on |$\mathbb{R}^n_+ $|, i.e. |$F^U_{g, \alpha } (|\mathbf x|) \geq F^U_{g, \alpha } ( |\mathbf x^{\prime}|) $| for any |$|\mathbf x| \geq |\mathbf x^{\prime}| $|, and reverses the order of majorization, i.e. |$F^U_{g, \alpha } (|\mathbf x|) \leq F^U_{g, \alpha } ( |\mathbf x^{\prime}|) $| if |$|\mathbf x| \succ |\mathbf x^{\prime}|$|, where |$\succ $| is defined in Section 2.1.
- (v) If |$g $| is separable and |$U $| is rectangular, then |$ F^U_{g, \alpha }$| satisfies the sub-additive property on |$\mathbb{R}^n $|, i.e.$$ \begin{align*}& F^U_{g, \alpha}(\mathbf x_1 + \mathbf x_2) \leq F^U_{g, \alpha}(\mathbf x_1) + F^U_{g, \alpha}(\mathbf x_2), \quad \forall \mathbf x_1, \mathbf x_2 \in \mathbb{R}^n. \end{align*} $$The equality holds if |$\mathbf x_1 $| and |$\mathbf x_2 $| have disjoint support, and each coordinate of |$g $| has the same minimum.
- (vi) Let |$U $| be compact and |$g $| continuous. Then |$F_{g,\alpha }^U $| is continuous and the set of sub-differentials of |$-F_{g,\alpha }^U $| at the point |$\mathbf x \geq 0 $| is given by$$ \begin{align*}& \partial (-F_{g, \alpha}^U) (\mathbf x) = - \text{Conv} \left(\arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \right), \end{align*} $$where |$\text{Conv} $| denotes the convex hull of the points. In addition, the function |$F_{g,\alpha }^U $| is differentiable at |$\mathbf x$| if there exists a unique solution |$\mathbf u$| for minimizing |$F_{g,\alpha }^U (\mathbf x)$|. Consequently, if |$g $| is strongly convex, |$F_{g,\alpha }^U $| is continuously differentiable on the positive cone.
- (vii) If |$g(0) = 0 $| and |$g$| takes its minimum value at some point in |$ U \subseteq \mathbb{R}^n_+$|, then we have for |$\alpha _1> \alpha _2 $| that$$ \begin{align*}& F^U_{g, \alpha_1}(\mathbf x) \leq F^U_{g, \alpha_2}(\mathbf x), \quad \forall \mathbf x \in \mathbb{R}^n. \end{align*} $$
(i) Recall the convex conjugate of a function |$f$| is defined as |$ f^*(\mathbf y)=\sup \{\langle \mathbf x, \mathbf y\rangle -f(\mathbf x)\}.$| Comparing it with the definition |$F^U_{g, \alpha }(\mathbf x)$| in (1.2), we have |$F^U_{g, \alpha }(\mathbf x) = -\left ( \alpha g + \delta _{U} \right )^*(-|\mathbf x|)$|. As convex conjugate is always convex, |$F_{g, \alpha }^{U}(\mathbf x)$| is concave on the positive cone.
(ii) If |$ g$| is symmetric, it is straightforward that |$F_{g, \alpha }^{U} $| is also symmetric.
(iii) Since |$g $| is a separable function, |$\min _{\mathbf u\in U} f(\mathbf x, \mathbf u)$| breaks down into |$n $| scalar problems, and hence |$ F^U_{g, \alpha }$| is separable.
- (iv) The increasing property follows from the fact that we have for every |$\mathbf u \in U $|$$ \begin{align*}& \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \geq \left\langle \mathbf u, |\mathbf x^{\prime} | \right\rangle + \alpha g(\mathbf u) \quad \forall |\mathbf x| \geq |\mathbf x^{\prime}|. \end{align*} $$Taking the minimum of both sides with respect to |$\mathbf u $| proves the increasing property. The reverse order of majorization can be proved in the same way as in [48, Proposition 2.10].
(v) The sub-additive property can be proved in the same way as in [48, Lemma 2.7].
- (vi) It is straightforward that(2.9)$$ \begin{align}& -F^U_{g, \alpha}(\mathbf x) = -\min_{\mathbf u\in U} \left\langle \mathbf u, | \mathbf x| \right\rangle + \alpha g(\mathbf u) = \max _{\mathbf u\in U} -\left\langle \mathbf u, | \mathbf x| \right\rangle - \alpha g(\mathbf u).\end{align} $$We set |$f_{\mathbf u}(\mathbf x) = -\left \langle \mathbf u, | \mathbf x| \right \rangle - \alpha g(\mathbf u)$| for each |$\mathbf u \in U $|. As |$g$| is continuous, the map |$\mathbf u \mapsto f_{\mathbf u}(\mathbf x)$| is continuous for every |$\mathbf x $|, and for each |$\mathbf u \in U $|, the function |$f_{\mathbf u} $| is continuous. For |$\mathbf x \geq 0 $|, it follows from the Ioffe–Tikhomirov theorem [21, Proposition 6.3] that$$ \begin{align*} \partial (-F_{g, \alpha}^U) (\mathbf x) & = \text{Conv} \left\{ \cup_{\mathbf u \in T(\mathbf x)} \partial f_{\mathbf u}(\mathbf x) \right\} = \text{Conv} \left\{ \cup_{\mathbf u \in T(\mathbf x)} \{-\mathbf u \} \right\} \\ & = \text{Conv} \{ -\mathbf u \mid -F_{g, \alpha}^U(\mathbf x) = f_{\mathbf u}(\mathbf x) \} = \text{Conv} \{ -\mathbf u \mid \mathbf u \in \arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \} \\ & = -\text{Conv} \{ \arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \}, \end{align*} $$where |$ T(\mathbf x) = \{ \mathbf u \mid -F_{g, \alpha }^U(\mathbf x) = f_{\mathbf u}(\mathbf x) \}$|.
(vii) Given |$\mathbf x \in \mathbb{R}^n $| with |$\alpha _2$|, we denote |$ \mathbf u^*_2 \in \arg \min _{\mathbf u \in U} \left \langle \mathbf u, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u), $| which implies |$ F^U_{g, \alpha _2}(\mathbf x) = \left \langle \mathbf u^*_2, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u^*_2) $| and |$\left \langle \mathbf u^*_2, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u^*_2) \leq \left \langle \mathbf 0, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf 0) = 0.$| It further follows from |$\left \langle \mathbf u^*_2, |\mathbf x | \right \rangle \geq 0$| that |$g(\mathbf u^*_2) \leq 0 $|. For |$\alpha _1> \alpha _2 $|, we have |$\alpha _1 g(\mathbf u^*_2) \leq \alpha _2 g(\mathbf u^*_2)$| and hence |$ F^U_{g, \alpha _1}(\mathbf x) \leq \left \langle \mathbf u_2^*, |\mathbf x | \right \rangle + \alpha _1 g(\mathbf u^*_2) \leq \left \langle \mathbf u_2^*, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u^*_2) = F^U_{g, \alpha _2}(\mathbf x). $|
Using properties (i)–(v), we can prove that every |$s $|-sparse vector |$\mathbf x $| is the unique solution to (1.3) if and only if |$F_{g, \alpha } $| satisfies the generalized null space property (gNSP) [48] of order |$s$|. A function |$ F$| satisfies the gNSP of order |$s $| corresponding to a matrix |$A $| if
$$ \begin{align*}& F(\mathbf v_S) < F(\mathbf v_{\bar S}), \quad \forall \mathbf v \in \ker(A) \setminus \{\mathbf 0\} \ \text{and all}\ S\ \text{with}\ |S| \leq s. \end{align*} $$
Note that |$S \subseteq \{1, \dots , n \}$|, |$\bar{S}$| is the complement of |$S$|, and |$\mathbf v_S$| refers to the vector with the same coordinates as |$\mathbf v$| except zero for indices in |$\bar{S}$|. Please refer to [48, Theorem 4.3] for more details on gNSP. The property (vi) has algorithmic benefits, as many optimization algorithms are designed for continuously differentiable functions. We show in Theorem 3 that |$F^U_{g, \alpha }$| is related to |$\ell _0$| and |$\ell _1$| if |$g$| is separable (without the assumption of strong convexity on |$g$|). The relationship of |$F^U_{g, \alpha }$| to iteratively re-weighted algorithms, e.g. [10,20], is characterized in Theorem 4.
Theorem 3. Suppose |$U= [0,1]^n$| and |$g$| is separable, i.e. |$g(\mathbf u) = \sum _{i=1}^n g_i(u_i)$| with each |$g_i$| being a strictly decreasing function on |$[0,1]$| with a bounded derivative. If |$g_i(0) =0$| and |$g_i(1) = -1 $| for |$1\leq i \leq n$|, we have that for |$\mathbf x \in \mathbb{R}^n $| there are |$ \alpha _0 \leq \alpha _1$| such that
(i) |$ \frac 1 {\alpha } F^U_{g, \alpha } (\mathbf x) + n = \|\mathbf x\|_0 $|, for all |$0< \alpha \leq \alpha _0 $|;
(ii) |$ F^U_{g, \alpha }(\mathbf x) -\alpha g(\mathbf 1) = \|\mathbf x\|_1$|, for all |$\alpha \geq \alpha _1 $|.
Corollary 1. (i) |$ \frac 1 {\alpha } F^U_{g, \alpha } + n \rightarrow \ell _0 $|, as |$\alpha \rightarrow 0 $|;
(ii) |$ F^U_{g, \alpha } -\alpha g(\mathbf 1) \rightarrow \ell _1$|, as |$\alpha \rightarrow +\infty $|.
- (i) For any fixed |$\mathbf x \in \mathbb{R}^n $|, we consider the derivative of the objective in (1.2) with respect to each component |$u_i$|, i.e. |$f_i:=|x_i| + \alpha g^{\prime}(u_i)$|. If |$x_i = 0,$| then |$f_i$| is negative since |$g$| is decreasing, and hence the minimum is attained at |$u_i = 1$|. If |$x_i \neq 0$|, |$f_i$| is positive for a small enough |$\alpha ,$| due to the assumption that |$g^{\prime}$| is bounded. A positive derivative implies that the objective is increasing, and hence the minimum is attained at |$u_i = 0 $|. In summary, if |$\alpha $| is sufficiently small, then we obtain that |$u_i=1$| if |$x_i=0$| and |$u_i=0$| if |$x_i \neq 0$|. For this choice of |$\mathbf u, $| we get that |$ \left \langle \mathbf u, |\mathbf x | \right \rangle = 0$| and$$ \begin{align*} &\sum_{i} g_i(u_i) = \sum_{x_i = 0} g_i(u_i) = \sum_{x_i = 0} (-1) = \|\mathbf x \|_0 - n, \end{align*} $$which implies that |$F_{g, \alpha } (\mathbf x) = \alpha (\|\mathbf x \|_0 - n)$| for a small enough |$\alpha $| that depends on the choice of |$\mathbf x$|. By letting |$\alpha \rightarrow 0$|, we have |$F_{g, \alpha } (\mathbf x) / \alpha = \|\mathbf x \|_0 - n$| for all |$\mathbf x.$|
(ii) Since |$g(1) < g(0)$|, there exists a value of |$u_i \in (0,1] $| with |$g^{\prime}(u_i) < 0 $|, and so for large enough |$\alpha $|, the derivative |$ |x_i| + \alpha g^{\prime}(u_i) $| is always negative. It further follows from the decreasing function |$g$| that the minimizer is always attained at |$\mathbf u = \mathbf 1$|, which yields the desired result. Similarly to (i), by letting |$\alpha \rightarrow +\infty ,$| the analysis holds for all |$\mathbf x.$|
Theorem 3 presents the ideal choice for the weight, i.e. |$u_i=1$| if |$x_i=0$| and |$u_i=0$| if |$x_i \neq 0$|. A similar idea of zero weights on the known support was explored in [32,40,50], which is referred to as weighted |$\ell _1$|. In addition, Theorem 3 implies that the function |$ \frac{1}{\alpha }F_{g,\alpha }^U + n$| approximates the |$\ell _0$| norm from below. We can define a function |$H(\mathbf x, \alpha ):= \frac 1 {\alpha } F^U_{g, \alpha } (\mathbf x) + n: \mathbb{R}^n \times [0,\alpha _0] \rightarrow \mathbb{R}$| as a transformation between |$\| \mathbf x \|_0 $| and |$\frac 1 {\alpha _0} F^U_{g, \alpha _0} (\mathbf x) + n $| for a fixed |$\alpha _0$|. As characterized in Corollary 1, this relationship motivates us to consider a type of homotopic algorithm (discussed in Section 3) to better approximate the desired |$\ell _0$| norm, although |$H(\mathbf x, \alpha )$| is not a homotopy itself (it is not continuous with respect to |$\mathbf x$|).
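As a small illustration of Theorem 3 (our own sketch), take the Type B choice |$g_i(u) = -u$|, which satisfies |$g_i(0)=0$| and |$g_i(1)=-1$|; for this choice the inner minimization in (1.2) reduces to |$\min (|x_i|-\alpha , 0)$| per coordinate, and both claims of the theorem can be checked directly.

```python
import numpy as np

def F(x, alpha):
    # F_{g,alpha}^U(x) for g_i(u) = -u on U = [0,1]^n: per coordinate min(|x_i| - alpha, 0)
    return np.minimum(np.abs(x) - alpha, 0.0).sum()

x = np.array([0.0, 1.3, 0.0, -0.7, 2.1])
n = x.size
alpha_small, alpha_large = 0.1, 10.0   # below the smallest nonzero |x_i|, above the largest
assert np.isclose(F(x, alpha_small) / alpha_small + n, np.count_nonzero(x))  # Theorem 3(i)
assert np.isclose(F(x, alpha_large) + alpha_large * n, np.abs(x).sum())      # Theorem 3(ii), g(1) = -n
```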
If we do not require all the |$g_i$| functions to attain the same value at |$1$| as in Theorem 3, but rather allow |$g_i(1)$| to take different values, then the proposed regularization |$ F^U_{g, \alpha }$| is equivalent to a weighted |$\ell _1 $| model with a certain shift of |$g(\mathbf 1)$|; see Theorem 4.
Since each derivative |$g_i^{\prime}$| is bounded, there exists a positive number |$M_{\mathbf x}$| (depending on |$\mathbf x$|) such that |$|x_i| + \alpha g^{\prime}(u_i)$| is negative for |$ \alpha> M_{\mathbf x}$|. As a result, the minimum of |$\langle u_i, |x_i|\rangle + \alpha g_i(u_i)$| is attained at |$u_i = 1 $|. For a sufficiently large |$\alpha ,$| the minimizer is |$\mathbf u^* =\mathbf{1} $|. By letting |$\alpha \rightarrow +\infty ,$| (2.10) holds for all |$\mathbf x.$|
2.4 Exact Recovery Analysis
There are many models approximating the |$\ell _0$| minimization problem (1.1), and yet only a few of them have exact recovery guarantees. Motivated by the equivalence [65] between the |$\ell _0 $| model (1.1) and (2.4) with a sufficiently small parameter |$\epsilon $|, we give conditions on the |$g $| function to establish the equivalence between our proposed model (1.3) and (1.1). Note that the weight vector |$\mathbf u $| in our formulation is not binary, but takes continuous values. By taking Table 1 and Fig. 1 into account, we consider two types of |$g $| functions defined as follows:
Definition 2. Let |$g:\mathbb{R}^n \rightarrow \mathbb{R} \cup \{ +\infty , -\infty \} $| be a separable function with |$g(\mathbf u) = \sum _{i = 1}^n g_i(u_i),$| for |$\mathbf u =[u_1,\cdots ,u_n]\in \mathbb{R}^n $|. We define
Type B: All |$g_i $| functions have bounded derivatives on |$[0,1],$| and are strictly decreasing on |$[0,1] $| with the same value at |$0 $| and |$ 1$|, i.e. there exist two constants |$a, b \in \mathbb{R} $| such that |$g_i(0) = a, g_i(1) = b, \forall i\in [n].$|
Type C: All |$g_i $| functions are convex on |$[0, \infty )$| with the same value at |$0 $| and the same minimum at a point other than zero, i.e. there exist two constants |$a>b \in \mathbb{R}$| such that |$g_i(0) = a $| and |$\min _{t \geq 0} g_i(t) = b, \forall i\in [n]$|.
An important characteristic both types of functions share is that they are decreasing near zero. Type B functions are defined on a bounded interval, and we enforce a box constraint on |$\mathbf u$| for strictly decreasing |$g$|. Type C refers to convex functions defined on an unbounded interval due to |$\lim _{t \rightarrow \infty } g(t)> g(0).$| Note that Theorems 3 and 4 hold when |$g$| is a Type B or Type C function. We establish the equivalence between (1.3) and (1.1) for Type B and Type C functions in Theorems 5 and 6, respectively.
Theorem 5. Suppose |$g $| is a Type B function in Definition 2 on |$U = [0,1]^n $|. There exists |$\alpha> 0$| such that the model (1.3) is equivalent to (1.1), i.e. if |$(\mathbf x^*,\mathbf u^*) $| is a minimizer of (1.3), then |$\mathbf x^* $| is a minimizer of (1.1); conversely, if |$\mathbf x^* $| is a minimizer of (1.1) then by taking |$u^*_i = 1 $| for |$x^*_i = 0 $| and |$u^*_i = 0 $| otherwise, |$(\mathbf x^*,\mathbf u^*) $| is a minimizer of (1.3).
Since |$g $| is a Type B function, we represent |$g(\mathbf u) = \sum _{i=1}^n g_i(u_i)$| with each |$g_i$| strictly decreasing and having bounded derivatives on |$[0,1].$| Without loss of generality, we assume that |$g_i(0) = 0 $| and |$g_i(1) = -1, \forall i\in [n].$| Denote |$ s:= \min _{\mathbf x} \{ \|\mathbf x\|_0 \mid A \mathbf x = \mathbf b \}$| and |$ \epsilon _0:= \min _{\mathbf x} \{ |\mathbf x|_{(s)} \mid A \mathbf x = \mathbf b \}. $| Here |$\epsilon _0> 0;$| otherwise there exists a solution to (1.1) with sparsity less than |$s$|. Since |$g_i $| has bounded derivatives, there exists a scalar |$ \alpha> 0 $| such that |$-\frac{\epsilon _0}{\alpha } < g^{\prime}_i(u), \forall u \in [0,1]$| and |$i \in [n] $|.
Next we show that |$\mathbf x^* $| must have sparsity |$ s$|. If |$ |x^*_i| \in (0,\epsilon _0) $| for some |$i$|, then we have |$u_i|x^*_i| + \alpha g_i(u_i)> \alpha g_i(1).$| Note that the inequality is strict, forcing |$ f(\mathbf x^*, \mathbf u^*)$| to be strictly greater than the lower bound |$-\alpha (n-s) $|, which is a contradiction. Therefore, if |$|x^*_i| < \epsilon _0,$| we must have |$ x^*_i = 0$|. Based on the definition of |$\epsilon _0,$| we have |$\|\mathbf x^* \|_0 = s $|, which implies |$\mathbf x^* $| is a minimizer of (1.1).
Conversely, if |$\mathbf x^* $| is a solution of (1.1), then |$\mathbf x^*$| satisfies |$A\mathbf x^*=\mathbf b.$| With the choice of |$ u_i^* = 0$| for |$ |x_i^*| \neq 0$| and |$u_i^* = 1 $| otherwise, we get |$(\mathbf x^*, \mathbf u^*)$| is a minimizer of (1.3) such that the objective function attains the minimum value |$-\alpha (n-s) $|.
Theorem 6. Suppose |$U = [0,\infty )^n $| and |$g $| is a Type C function in Definition 2. Then there exists a constant |$\alpha> 0$| such that the model (1.3) is equivalent to (1.1).
3. Numerical Algorithms and Convergence Analysis
We describe in Section 3.1 the ADMM [8,19] for solving the general model, with convergence analysis presented in Section 3.2. In Section 3.3, we discuss closed-form solutions of the |$\mathbf u$|-subproblem for two specific choices of |$g$|.
3.1 The Proposed Algorithm
We define a function |$\psi (\cdot )$| to unify the constrained and the unconstrained formulations, i.e. |$\psi (\mathbf x) = \delta _\varOmega (\mathbf x)$| for (1.3) and |$\psi (\mathbf x) = \frac \gamma 2 \|A\mathbf x-\mathbf b\|_2^2$| for (1.4). We introduce a new variable |$\mathbf y$| in order to apply ADMM to minimize
(3.1)$$ \begin{align}& \min_{\mathbf x, \mathbf y, \mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \psi(\mathbf y) \quad \text{s.t.} \quad \mathbf x = \mathbf y.\end{align} $$
The corresponding augmented Lagrangian becomes
(3.2)$$ \begin{align}& L_\rho(\mathbf u, \mathbf x, \mathbf y; \mathbf v) = \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \delta_U(\mathbf u) + \psi(\mathbf y) + \left\langle \mathbf v, \mathbf x - \mathbf y \right\rangle + \frac{\rho}{2} \|\mathbf x - \mathbf y\|_2^2,\end{align} $$
where |$\mathbf v$| is the Lagrangian dual variable and |$\rho $| is a positive parameter. The ADMM scheme involves the following iterations:
(3.3)$$ \begin{align}& \begin{cases} \mathbf u^{k+1} = \arg\min_{\mathbf u} L_\rho(\mathbf u, \mathbf x^{k}, \mathbf y^{k}; \mathbf v^{k}), \\ \mathbf x^{k+1} = \arg\min_{\mathbf x} L_\rho(\mathbf u^{k+1}, \mathbf x, \mathbf y^{k}; \mathbf v^{k}), \\ \mathbf y^{k+1} = \arg\min_{\mathbf y} L_\rho(\mathbf u^{k+1}, \mathbf x^{k+1}, \mathbf y; \mathbf v^{k}), \\ \mathbf v^{k+1} = \mathbf v^{k} + \rho\, (\mathbf x^{k+1} - \mathbf y^{k+1}). \end{cases}\end{align} $$
The original problem (1.3) jointly minimizes |$\mathbf u $| and |$\mathbf x,$| which are updated separately in (3.3). In particular, the |$\mathbf u$|-subproblem can be expressed as
(3.4)$$ \begin{align}& \mathbf u^{k+1} = \arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x^{k}| \right\rangle + \alpha g(\mathbf u).\end{align} $$
In general, one may not find a closed-form solution to (3.4). For a separable function |$g$| and a rectangular set |$ U$|, the |$\mathbf u $|-update simplifies into |$n $| one-dimensional minimization problems; refer to Section 3.3 for the |$\mathbf u $|-update with two specific |$g$| functions that are used in experiments. The |$\mathbf x$|-update has a closed-form solution given by
where |$ \mathbf{shrink}(\mathbf v, \mathbf u) = \mathrm{sign}( \mathbf v) \odot \max (|\mathbf v|-\mathbf u,\mathbf 0). $|
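In code, the componentwise soft-shrinkage operator is a direct transcription of the formula above (the exact arguments fed to it in the |$\mathbf x$|-update follow the sign convention of the augmented Lagrangian):

```python
import numpy as np

def shrink(v, u):
    """Componentwise soft-thresholding with a vector of thresholds u >= 0:
    shrink(v, u) = sign(v) * max(|v| - u, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - u, 0.0)
```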
For the constrained formulation, i.e. |$\psi (\mathbf y) = \delta _{\varOmega }(\mathbf y) $|, the |$\mathbf y$|-subproblem becomes
It is equivalent to a projection onto the affine solution set of |$A \mathbf x = \mathbf b $|, which has a closed-form solution.
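A minimal sketch of this affine projection (assuming |$A$| has full row rank; the point being projected is formed from |$\mathbf x^{k+1}$| and the dual variable according to (3.2), which we do not restate here):

```python
import numpy as np

def project_affine(z, A, b):
    """Euclidean projection of z onto the affine set {y : A y = b}:
    y = z - A^T (A A^T)^{-1} (A z - b), assuming A has full row rank."""
    return z - A.T @ np.linalg.solve(A @ A.T, A @ z - b)
```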
For the unconstrained formulation, |$\psi (\mathbf y) = \frac{\gamma }{2} \| A \mathbf y - \mathbf b \|_2^2 $|, the |$\mathbf y$|-subproblem also has a closed-form solution by solving a linear system
It is worth noting that for the unconstrained formulation, a more accurate choice for |$\psi $| is |$\psi (\mathbf y) = \frac{\alpha \gamma }{2} \| A \mathbf y - \mathbf b \|_2^2 $|. The reason for this is further explained in Remark 2.
The ADMM iterations (3.3) minimize the general model (2.2) for a fixed value of |$\alpha $|. Following Theorem 3 and Corollary 1, we consider a type of homotopy optimization (also known as continuation approach) [16,52] to update |$\alpha $| in order to better approximate the |$\ell _0$| norm. In particular, we gradually decrease |$\alpha $| to |$0$| while optimizing (3.1) for each |$\alpha $|. Algorithm 1 summarizes the overall iteration for the proposed approach.
We remark that letting |$\alpha $| approach |$0$| is not exactly a homotopy algorithm, as the transformation between |$\| \mathbf x \|_0 $| and |$\frac 1 {\alpha _0} F^U_{g, \alpha _0} (\mathbf x) + n $| is not continuous. We observe empirically that the rate at which |$\alpha $| decays to zero plays a critical role in the performance of sparse recovery. On the other hand, we should minimize |$\frac 1 \alpha F^U_{g, \alpha } (\mathbf x) +\frac \gamma 2\|A\mathbf x-\mathbf b\|_2^2$| to approximate the |$\ell _0$| norm. This formulation requires the inversion of |$(\rho I_n+\gamma \alpha A^TA)$| for different |$\alpha $| in the |$\mathbf y$|-update, which is computationally expensive, as opposed to pre-computing the inverse of |$(\rho I_n+\gamma A^TA)$| with a fixed value |$\gamma .$|
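For concreteness, here is a hedged sketch of the overall iteration for the unconstrained model (1.4) with the Type C choice |$g_2$| from Section 3.3. The update order, sign conventions, |$\alpha $| schedule, iteration counts and the absence of a stopping rule are our assumptions and only approximate Algorithm 1; the pre-computed factor |$(\rho I_n+\gamma A^TA)^{-1}$| follows the remark above.

```python
import numpy as np

def ll1_admm_unconstrained(A, b, gamma=1.0, rho=1.0, alpha0=1.0,
                           decay=0.9, outer=30, inner=50):
    """Sketch of the LL1 ADMM iterations for (1.4) with g_2(u) = ||u||^2/2 - ||u||_1,
    U = [0, inf)^n. Schedules and iteration counts are placeholder assumptions."""
    n = A.shape[1]
    x = np.zeros(n); y = np.zeros(n); v = np.zeros(n)
    M = np.linalg.inv(rho * np.eye(n) + gamma * A.T @ A)   # pre-computed once
    Atb = A.T @ b
    alpha = alpha0
    for _ in range(outer):                 # homotopy: gradually decrease alpha
        for _ in range(inner):             # inner ADMM iterations (3.3)
            u = np.maximum(1.0 - np.abs(x) / alpha, 0.0)           # u-update (closed form)
            z = y - v / rho
            x = np.sign(z) * np.maximum(np.abs(z) - u / rho, 0.0)  # x-update via shrink
            y = M @ (gamma * Atb + rho * x + v)                    # y-update (linear system)
            v = v + rho * (x - y)                                  # dual update
        alpha *= decay
    return x
```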
3.2 Convergence Analysis for ADMM
We prove the convergence for the ADMM method (3.3) for the unconstrained case, i.e. |$\psi (\mathbf y) = \frac{\gamma }{2} \|A \mathbf y - \mathbf b \|_2^2 $|. The steps that we take to prove the convergence of (3.3) are similar to the convergence of the standard ADMM method. Yet, since we update |$\mathbf x$| and |$\mathbf u$| separately, the method (3.3) is not actually the ADMM iterations for the Lagrangian (3.2), and the function in the optimization (3.1) is not jointly convex with respect to |$\mathbf u$| and |$\mathbf x$|. Note that we apply an adaptive |$\alpha $| update in Algorithm 1, but the convergence analysis is restricted to a fixed |$\alpha $| value. In addition, we assume that |$ g$| is a Type B or Type C function that is continuously differentiable on |$ U$|.
In the case of Type C functions with |$ U = [0,\infty )^n$|, it follows from optimality conditions for each sub-problem in (3.3) that there exists |$ \mathbf s^{k+1} \geq 0$| and |$\mathbf p^{k+1} \in \partial |\mathbf x^{k+1} | $| such that
with |$\mathbf s^{k+1} \odot \mathbf u^{k+1} = \mathbf 0$|.
Note that for the Type B functions with |$U = [0,1]^n $|, the optimality condition for |$u$| sub-problem is that there exists |$\mathbf s^{k+1}, \mathbf t^{k+1} \geq 0 $| such that |$ \mathbf s^{k+1} \odot \mathbf u^{k+1} = 0, \ \mathbf t^{k+1} \odot (\mathbf u^{k+1} - \mathbf 1) = \mathbf 0,$| and |$ \mathbf 0 = |\mathbf x^k| + \alpha \nabla g(\mathbf u^{k+1}) - \mathbf s^{k+1} + \mathbf t^{k+1}$|.
[Stationary points] Suppose the sequence |$\{\mathbf u^k, \mathbf x^k, \mathbf y^k, \mathbf v^k\}$| is generated by (3.3). If |$g\in \mathcal C^1(\mathbb R)$| is a Type B or Type C function and |$\rho>\sqrt{2}\gamma C_A$|, and |$(\mathbf u^k, \mathbf x^k)$| is bounded, then every limit point of |$\{\mathbf u^k, \mathbf x^k, \mathbf y^k, \mathbf v^k\},$| denoted by |$\{ \mathbf u^*, \mathbf x^*, \mathbf y^*, \mathbf v^* \} $|, is a stationary point of |$L_\rho (\mathbf u, \mathbf x, \mathbf y; \mathbf v) $| and also |$\{\mathbf x^*, \mathbf u^*\}$| is a stationary point of (1.4).
The boundedness of |$L_\rho ,$| |$\mathbf u^k$| and |$\mathbf x^k$| together with (3.8) implies that |$\mathbf y^k$| is bounded, and hence |$\mathbf v^k$| is bounded due to |$\mathbf v^{k} = \nabla \psi (\mathbf y^{k}).$| Then the Bolzano–Weierstrass Theorem guarantees that there exists a subsequence, denoted by |$\{ \mathbf x^{k_s}, \mathbf y^{k_s}, \mathbf u^{k_s}, \mathbf v^{k_s} \}$|, that converges to a limit point, i.e. |$(\mathbf x^{k_s}, \mathbf y^{k_s}, \mathbf u^{k_s}, \mathbf v^{k_s}) \rightarrow (\mathbf x^*, \mathbf y^*, \mathbf u^*, \mathbf v^*). $|
By Theorem 8, we get |$\mathbf x^{k_s}-\mathbf y^{k_s} \rightarrow \mathbf 0, $| leading to |$\mathbf x^* = \mathbf y^* $|, and |$(\mathbf x^{k_s-1}, \mathbf y^{k_s-1}) \rightarrow (\mathbf x^*, \mathbf y^*),$| and hence we have |$\mathbf v^{k_s-1} \rightarrow \mathbf v^* $|. Let |$\mathbf p^{k_s} $| be the corresponding variables in the optimality condition (3.6). As |$\mathbf p^{k_s} \in \partial |\mathbf x^{k_s} |,$| we know |$\mathbf p^{k_s} $| is bounded by |$[-1,1].$| Therefore, there exists a limit point of the sequence |$\mathbf p^{k_s}.$| Without loss of generality, we assume it is the sequence itself, i.e. |$\mathbf p^{k_s}\rightarrow \mathbf p^*,$| and hence we have |$\mathbf p^* \in \partial |\mathbf x^* | $|.
Type B: If |$ g$| is a type B function then |$\mathbf u \in [0,1]^n $| and hence the optimality condition for the |$\mathbf u- $|update is |$\mathbf 0 = |\mathbf x^k| + \alpha \nabla g(\mathbf u^{k+1}) - \mathbf s^{k+1} + \mathbf t^{k+1},$| with |$\mathbf s^{k+1} \odot \mathbf u^{k+1} = \mathbf 0$| and |$\mathbf t^{k+1} \odot (\mathbf u^{k+1}-1) = \mathbf 0$| with |$\mathbf s^{k+1} \geq 0 $| and |$\mathbf t^{k+1} \geq 0 $|.
3.3 Algorithm Updates for Different Lifting Functions
Here we consider two examples of |$g $| functions for which the |$\mathbf u$|-subproblem has a closed-form solution: |$g_1(\mathbf u) = -\frac{1}{2} \| \mathbf u\|_2^2 $|, a Type B function with |$U_1 = [0,1]^n$|, and |$g_2(\mathbf u) = \frac{1}{2} \| \mathbf u\|_2^2 - \|\mathbf u \|_1 $|, a Type C function with |$U_2 = [0,\infty )^n$|. For these combinations, the update (3.4) simplifies to the componentwise formulas
$$ \begin{align*}& u_i^{k+1} = \begin{cases} 1 & \text{if}\ |x_i^{k}| \leq \alpha/2, \\ 0 & \text{otherwise}, \end{cases} \ \ \text{for}\ (g_1, U_1), \qquad u_i^{k+1} = \max\left(1 - \frac{|x_i^{k}|}{\alpha},\, 0\right) \ \ \text{for}\ (g_2, U_2).\end{align*} $$
Note that for this choice of |$ g_2$|, the proposed model simplifies to
which can be solved by quadratic programming with linear constraints.
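The two closed-form weight updates above can be implemented and verified against a brute-force grid search as follows (our own sketch):

```python
import numpy as np

def u_update_g1(x, alpha):
    # argmin_{u in [0,1]} u*|x_i| - (alpha/2)*u^2: concave in u, so pick the better endpoint
    return (np.abs(x) <= alpha / 2).astype(float)

def u_update_g2(x, alpha):
    # argmin_{u >= 0} u*|x_i| + alpha*(u^2/2 - u): zero the derivative, project onto [0, inf)
    return np.maximum(1.0 - np.abs(x) / alpha, 0.0)

x, alpha = np.array([0.1, 0.6, 3.0]), 1.0
grid = np.linspace(0.0, 5.0, 50001)
for xi, u1, u2 in zip(np.abs(x), u_update_g1(x, alpha), u_update_g2(x, alpha)):
    g1_obj = grid[grid <= 1] * xi - 0.5 * alpha * grid[grid <= 1] ** 2
    g2_obj = grid * xi + alpha * (grid ** 2 / 2 - grid)
    assert abs(grid[grid <= 1][g1_obj.argmin()] - u1) < 1e-3
    assert abs(grid[g2_obj.argmin()] - u2) < 1e-3
```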
4. Numerical Experiments
We demonstrate the performance of Algorithm 1 with |$\epsilon =0.01$| and two specific |$g$| functions discussed in Section 3.3. We compare with the following sparsity promoting regularizations: |$\ell _1 $| [13], |$\ell _{1/2}$| [57], transformed |$\ell _1 $| (TL1) [62], |$\ell _1-\ell _2 $| [59] and ERF [20]. For more related models, see [54]. For each model, we consider both constrained and unconstrained formulations. Specifically for the |$\ell _p$| model, we adopt the iteratively reweighted least-squares algorithm [12] in the constrained case and use the half thresholding [57] as a proximal operator for minimizing the unconstrained |$\ell _{1/2}$| formulation. Both |$\ell _1-\ell _2$| and TL1 are minimized by the difference of convex algorithm (DCA) for the best performance as reported in [59,62]. We use the online code provided by the authors of [20] to solve for the ERF model. We use the default values of model parameters suggested in respective papers; note that |$\ell _1$| and |$\ell _1-\ell _2 $| do not involve any parameters. In Appendix C, we present a comparison between using ADMM and DCA for the proposed model. All the experiments are conducted on a Windows desktop with CPU (Intel i7-6700, 3.19GHz) and MATLAB (R2021a).
4.1 Constrained Models
We examine the performance of finding a sparse solution that satisfies the constraint |$A \mathbf x = \mathbf b$|. We consider two types of sensing matrices, Gaussian and over-sampled discrete cosine transform (DCT). The Gaussian matrix is generated based on the multivariate normal distribution |$\mathcal{N}\ (0, \varSigma ) $|, where |$\varSigma _{i,j} = (1-r) \delta (i = j) + r $| for a parameter |$r> 0.$| Note that |$\delta (i = j)$| is |$1$| if |$i = j$| and zero otherwise. The over-sampled DCT matrix is defined by |$A = [\mathbf a_1,..., \mathbf a_n] \in \mathbb{R}^{m \times n}$| with each column defined as
where |$\mathbf w $| is a uniformly random vector and |$F \in \mathbb{R}_+ $| is a scalar. The larger |$F$| is, the larger the coherence of the matrix |$A$| is, making it more challenging to find a sparse solution.
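The two families of sensing matrices can be generated as below (our sketch; the over-sampled DCT column |$\mathbf a_j = \cos (2\pi \mathbf w j / F)/\sqrt{m}$| with |$\mathbf w$| uniform in |$[0,1]^m$| is the common construction in the cited works and is an assumption here, since the exact display is omitted above).

```python
import numpy as np

def gaussian_matrix(m, n, r, seed=0):
    """Rows drawn from N(0, Sigma) with Sigma_ij = (1 - r)*delta(i == j) + r."""
    rng = np.random.default_rng(seed)
    Sigma = (1 - r) * np.eye(n) + r * np.ones((n, n))
    return rng.multivariate_normal(np.zeros(n), Sigma, size=m)

def oversampled_dct(m, n, F, seed=0):
    """Columns a_j = cos(2*pi*w*j/F)/sqrt(m) with w uniform in [0,1]^m (assumed form)."""
    rng = np.random.default_rng(seed)
    w = rng.random(m)
    j = np.arange(1, n + 1)
    return np.cos(2 * np.pi * np.outer(w, j) / F) / np.sqrt(m)

A_gauss = gaussian_matrix(64, 1024, r=0.2)
A_dct = oversampled_dct(64, 1024, F=5)
```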
We fix the dimension as |$64 \times 1024 $| for the Gaussian and DCT matrices, while generating Gaussian matrices with |$ r \in \{0, 0.2, 0.8 \}$| and DCT matrices with |$F \in \{1, 5, 10 \}$|. The ground-truth vector |$ \mathbf x_g \in \mathbb{R}^n$| is simulated as an |$s$|-sparse signal, where |$s$| is the number of non-zero entries, each drawn from the normal distribution |$\mathcal{N}(0, 1) $|, with the support index set drawn uniformly at random. We evaluate the performance by success rates, where a ‘successful’ reconstruction refers to the case when the distance between the output vector |$\mathbf x $| and the ground truth |$\mathbf x_g $| is less than |$10^{-2} $|, i.e.
Figure 2 presents success rates for both Gaussian and DCT matrices, and demonstrates that the proposed LL1 outperforms the state of the art in all the testing cases. For the Gaussian matrices, the parameter |$r$| has little effect on the performance, as we observe the same ranking of these models under various |$r$| values. As for the DCT matrices, the parameter |$F$| influences the coherence of the resulting matrix. For smaller |$F$| values, |$\ell _p$| is the second best, while TL1 and |$\ell _1-\ell _2$| perform well for coherent matrices (for |$F=10$|). With a well-chosen |$g$| function, the proposed LL1 framework always achieves the best results among the competing methods. The results of LL1 using |$g_1$| with |$U_1$| and |$g_2$| with |$U_2$| are similar. This phenomenon illustrates that our model works best because it is equivalent to the |$\ell _0 $| model for a small enough |$\alpha $|.
Fig. 2. Success rate comparison among all the competing methods based on Gaussian matrices (left) with |$r = 0, 0.8$| and DCT matrices (right) with |$F = 1, 10$|.
4.2 Unconstrained Models
We consider the unconstrained |$\ell _0$| model for comparison on noisy data:
where |$\gamma $| is a regularization parameter. We consider signals of length |$512 $| with sparsity |$130 $|, and |$m $| measurements |$\mathbf b $|, determined by a Gaussian sensing matrix |$A $|. The columns of |$A $| are normalized with mean zero and unit norm. Gaussian noise with mean zero and standard deviation |$\sigma $| is also added to the measurements. To evaluate the performance of the algorithms, we consider the mean square error (MSE) of the output signal |$\mathbf x$| with respect to the ground-truth solution |$\mathbf x^* $| using the formula
For each algorithm, we compute the average MSE over 100 realizations while ranging the number of measurements in |$60 <m <120 $|. Figure 3 presents the comparison results for two noise levels |$\sigma \in \{10^{-6}, 0.01 \} $|. All the algorithms perform poorly with few measurements, and as the number of measurements |$m $| increases, their MSE decreases. For the smaller noise level (|$\sigma =10^{-6}$|), our approach works almost perfectly with around |$100$| measurements, while other algorithms either require more measurements to achieve a nearly perfect MSE or are unable to do so. Figure 3c presents the computational times, which suggests that LL1 performs as fast as the |$\ell _1 $| model while at the same time it has the lowest recovery error.
When the noise level is high, for instance |$\sigma = 0.1 $|, it is almost impossible to reconstruct the ground-truth signal using any number of measurements. In such cases, our algorithm finds a signal that is sparser and has a smaller objective value for any choice of the regularization parameter |$\gamma $|.
5. Concluding Remarks
In this paper, we proposed a lifted |$\ell _1$| model for sparse recovery, which describes a class of regularizations. Specifically, we established the connections of this framework to various existing methods that aim to promote sparsity of the model solution. Furthermore, we proved that our method can exactly recover the sparsest solution under a constrained formulation. We advocated the use of ADMM to solve the proposed model, supported by a convergence analysis. An alternative approach using DCA was discussed in Appendix C, showing the efficiency of ADMM over DCA. Experimental results on both noise-free and noisy cases illustrate that the proposed framework outperforms the state-of-the-art methods in terms of accuracy and is comparable with the convex |$\ell _1$| approach in terms of computational time.
One future work involves the convergence analysis of ADMM for solving the constrained model. One difficulty lies in the fact that the corresponding function |$ \psi (\cdot )$| is a |$\delta $|-function, which is neither differentiable nor coercive, and as a result, the proof presented in Section 3.2 for the unconstrained minimization is not applicable for the constrained case. We observe that the ADMM algorithm for the constrained case does converge and the augmented Lagrangian is decreasing. This empirical evidence suggests the potential to prove the convergence or the sufficient decrease of the augmented Lagrangian, which will be left as future work.
Funding
NSF CAREER 1846690; Simons Foundation grant 584960.
Data Availability Statement
No new data were generated or analysed in support of this review.
References
A. Proof of Theorem 1
B. Finding the Lift Corresponding to Existing Models
Given a regularization function |$J(\mathbf x),$| we want to find a proper |$g$| function and a set |$U$| such that |$J(\mathbf x)=F_g^U(\mathbf x)$| up to a constant. We assume |$J(\mathbf x) = J(|\mathbf x|)$| since |$F_g^U$| satisfies this condition. Consequently, we only need to consider the case |$\mathbf x \geq 0$| so that the notation |$\frac{\partial J }{\partial |x_i|} $| should be in place of |$\frac{\partial J }{\partial x_i} $| for |$ x_i \geq 0$|. Using Theorem 1, we can directly find |$F_{g}^U$|; however, sometimes it might be easier to use the following observation, which leads to simpler computations. Suppose |$F_{g}^U$| has a unique minimizer |$\mathbf u$|; then |$\mathbf u$| satisfies |$\mathbf u = \nabla _{\mathbf x} F_{g}^U(\mathbf x)= \nabla _{\mathbf x} J(\mathbf x).$| Assuming that the minimum of (2.5) is finite, the optimality condition gives |$|\mathbf x| + \nabla _{\mathbf u} g = 0$| for |$\mathbf u \in int(U),$| where |$int(U)$| denotes the interior of the set |$U $|. (Note that |$|\mathbf x| + \nabla _{\mathbf u} g$| can have non-zero coordinates on the boundary of |$U$|.) Thus, we only need to solve the following two equations for a function |$g $| with respect to |$\mathbf u $| on the feasible set |$U$|:
(B.1)$$ \begin{align}& \begin{cases} \mathbf u = \nabla_{|\mathbf x|}\, J(\mathbf x), \\ |\mathbf x| + \nabla_{\mathbf u}\, g(\mathbf u) = \mathbf 0. \end{cases}\end{align} $$
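Before the worked items below, here is a small symbolic sketch (ours, using sympy) that carries out the recipe (B.1) for the log-sum penalty of item (ii) and recovers the corresponding row of Table 1:

```python
import sympy as sp

u, x, a = sp.symbols("u x a", positive=True)

# first equation of (B.1): u = dJ/d|x| with J = log(x + a); solve for x in terms of u
x_of_u = sp.solve(sp.Eq(u, sp.diff(sp.log(x + a), x)), x)[0]
# second equation of (B.1): |x| + g'(u) = 0, so g'(u) = -x(u); integrate to recover g
g = sp.integrate(-x_of_u, u)
print(sp.simplify(g))   # a*u - log(u), i.e. the log-sum row of Table 1 (up to a constant)
```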
- (i) |$\ell _p $| model: Consider |$J = J^{\ell _p} / p $|, and note that |$\frac{\partial J} {\partial |x_i|} = |x_i|^{p-1} $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into$$ \begin{align*}& \begin{cases} u_i = |x_i|^{p-1}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$for all |$i $|. From the first equation we get that |$|x_i| = u_i^{\frac{1}{p-1}} $| and then from the second equation we get |$ g_i^{\prime}(u_i) = -u_i^{\frac{1}{p-1}}$|. A solution for |$g $| is |$g_i(u_i) = \frac{1-p}{p} u_i^{\frac{p}{p-1}} $| for |$u_i \geq 0 $|. Finally, taking |$U = \mathbb{R}^n_+ $| and |$g(\mathbf u) = \sum _{i} \frac{1-p}{p} u_i^{\frac{p}{p-1}} $|, one can check that |$F_{g}^U = J $|.
- (ii) log-sum penalty: Consider |$J = J^{\log }_a $|, and note that |$\frac{\partial J} {\partial |x_i|} = \frac{1}{|x_i| + a} $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into$$ \begin{align*}& \begin{cases} u_i = \frac{1}{|x_i| + a}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$for all |$i $|. From the first equation we get that |$|x_i| = \frac{1}{u_i} - a $| and then from the second equation we get |$ g_i^{\prime}(u_i) = a - \frac{1}{u_i}$|. A solution for |$g $| is |$g_i(u_i) = a u_i - \log (u_i) $| for |$u_i> 0 $|. Finally, taking |$U = \mathbb{R}^n_{>0} $| and |$g(\mathbf u) = \sum _{i} \left ( a u_i - \log (u_i) \right ) $|, one can check that |$F_{g}^U + 1 = J $|.
- (iii) Smoothly clipped absolute deviation (SCAD): Consider |$J = J_{\gamma , \lambda }^{\text{SCAD}} $|, and note that(B.2)$$ \begin{align}& \frac{\partial f_{\lambda, \gamma} ^{\text{SCAD}}(t)} {\partial |t|}= \left\{ \begin{array}{@{}ll} \lambda & \text{if}\ |t| \leq \lambda, \\ \frac{\lambda \gamma - t}{\gamma - 1} & \text{if}\ \lambda < |t| \leq \gamma \lambda, \\ 0 & \text{if}\ |t|> \gamma \lambda. \end{array} \right.\end{align} $$For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, the first equation in (B.1) simplifies into$$ \begin{align*}& u_i = \left\{ \begin{array}{@{}ll} \lambda & \text{if}\ |x_i| \leq \lambda, \\ \frac{\lambda \gamma - |x_i|}{\gamma - 1} & \text{if}\ \lambda < |x_i| \leq \gamma \lambda, \\ 0 & \text{if}\ |x_i|> \gamma \lambda. \end{array} \right. \end{align*} $$for all |$i $|. In the case of |$\lambda < |x_i| \leq \gamma \lambda , $| we get that |$|x_i| = \gamma \lambda - (\gamma - 1) u_i, $| which means we should have |$ g_{i}^{\prime}(u_i) = -\gamma \lambda + (\gamma - 1) u_i$| and |$u_i\leq \lambda $|. By taking |$U = [0,\lambda ]^n $| and |$g(\mathbf u) = \sum _{i} \left ( -\gamma \lambda u_i + (\gamma - 1) \frac{u_i^2}{2} \right ) $|, one can check that |$F_{g}^U + \frac{(\gamma + 1)\lambda ^2}{2} = J $|.
- (iv) Minimax concave penalty: Consider |$J = J_{\gamma , \lambda }^{\text{MCP}} $|, and note that
(B.3)
$$ \begin{align}& \frac{\partial f_{\lambda, \gamma} ^{\text{MCP}}(t)} {\partial |t|}= \left\{ \begin{array}{@{}ll} \lambda - \frac{|t|}{\gamma} & \text{if}\ |t| \leq \gamma \lambda, \\ 0 & \text{if}\ |t|> \gamma \lambda. \end{array} \right.\end{align} $$
For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, the first equation in (B.1) simplifies into
$$ \begin{align*}& u_i = \left\{ \begin{array}{@{}ll} \lambda - \frac{|x_i|}{\gamma} & \text{if}\ |x_i| \leq \gamma \lambda, \\ 0 & \text{if}\ |x_i|> \gamma \lambda, \end{array} \right. \end{align*} $$
for all |$i $|. When |$ |x_i| \leq \gamma \lambda ,$| we obtain |$|x_i| = \gamma (\lambda - u_i),$| which implies |$ g_{i}^{\prime}(u_i) = -\gamma (\lambda - u_i)$| from the second equation in (B.1). Therefore, we set |$U = [0,\infty )^n $| and |$g(\mathbf u) = \sum _{i} \left ( -\gamma (\lambda u_i - \frac{u_i^2}{2}) \right ) $|, leading to |$F_{g}^U + \frac{1}{2} \gamma \lambda ^2 = J $|.
- (v) Capped |$\ell _1 $| model: Consider |$J = J^{\text{CL1}}_a $|, and note that
(B.4)
$$ \begin{align}& \frac{\partial J}{\partial |x_i|} = \left\{ \begin{array}{@{}ll} 1 & \text{if}\ |x_i| < a, \\ 0 & \text{if}\ |x_i|> a. \end{array} \right.\end{align} $$
For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, the first equation in (B.1) simplifies into
$$ \begin{align*}& u_i = \left\{ \begin{array}{@{}ll} 1 & \text{if}\ |x_i| < a, \\ 0 & \text{if}\ |x_i|> a, \end{array} \right. \end{align*} $$
for all |$i $|. Note that the second equation in (B.1) holds only when the minimizer lies in the interior of the set |$U $|. Consider |$U = [0,1]^n $|. Since the minimizer is on the boundary in this case, we instead need a function |$g $| for which |$ | x_i| + g_{i}^{\prime}(u_i)$| does not vanish in the interior of |$U$|: specifically, |$ | x_i| + g_{i}^{\prime}(u_i) < 0$| whenever |$|x_i| < a $| (so the minimizer is |$u_i = 1$|) and |$ | x_i| + g_{i}^{\prime}(u_i)> 0$| whenever |$|x_i|> a $| (so the minimizer is |$u_i = 0$|). This is achieved by |$g_{i}^{\prime}(u_i) = -a $|, for which a solution is |$ g_{i}(u_i) = -a u_i $|. Finally, taking |$U = [0,1]^n $| and |$g(\mathbf u) = \sum _{i} \left ( -a u_i \right ) $|, one can check that |$F_{g}^U + a = J $|.
- (vi) Transformed |$\ell _1 $| model: Consider |$J = J_a^{\text{TL1}} / (a+1) $|, and note that |$\frac{\partial J} {\partial |x_i|} = \frac{a}{(a + |x_i|)^2} $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into
$$ \begin{align*}& \begin{cases} u_i = \frac{a}{(a + |x_i|)^2}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$
for all |$i $|. The first equation gives |$|x_i| = \sqrt{\frac{a}{u_i}} - a $|, and then the second yields |$ g_i^{\prime}(u_i) = a - \sqrt{\frac{a}{u_i}}$|. A solution for |$g $| is |$g_i(u_i) = a u_i - 2 \sqrt{a u_i} $| for |$u_i \geq 0 $|. Finally, taking |$U = \mathbb{R}^n_+ $| and |$g(\mathbf u) = \sum _{i} \left ( a u_i - 2 \sqrt{a u_i} \right ) $|, one can check that |$ F_{g}^U + 1 = J $|.
- (vii) Error function penalty: Consider |$J = J_{\sigma }^{\text{ERF}} $|, and note that |$\frac{\partial J} {\partial |x_i|} = e^{-x_i^2/\sigma ^2} $| with |$e^{-x_i^2/\sigma ^2} \in (0,1] $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into
$$ \begin{align*}& \begin{cases} u_i = e^{-x_i^2/\sigma^2}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$
for all |$i $|. The first equation gives |$|x_i| = \sigma \sqrt{-\log (u_i)} $|, and then the second yields |$ g_i^{\prime}(u_i) = -\sigma \sqrt{- \log (u_i)}$|. A solution for |$g $| is |$g_i(u_i) = \sigma \int _{u_i}^{1} \sqrt{- \log (\tau )} d \tau $| for |$u_i \in (0,1] $|. Finally, taking |$U = [0,1]^n $| and |$g(\mathbf u) = \sigma \sum _{i} \int _{u_i}^{1} \sqrt{- \log (\tau )} d \tau $|, one can check that |$ F_{g}^U = J $|.
- (viii) |$\ell _1-\ell _2$| penalty: For |$J = J^{L1-L2}$|, working with derivatives is a bit challenging, so we directly apply Theorem 1 to find the lifted form. First note that
$$ \begin{align*} & \sup_{\mathbf x \geq 0} \left\langle \mathbf x, \mathbf y \right\rangle - \|\mathbf x\|_2 = \begin{cases} + \infty & \|\mathbf y_+\|_2> 1, \\ 0 & \|\mathbf y_+\|_2 \leq 1. \end{cases} \end{align*} $$
Hence, we get
$$ \begin{align*} & g(\mathbf u) = \sup_{\mathbf x \geq 0} \left\langle \mathbf x, -\mathbf u \right\rangle + \|\mathbf x\|_1 - \|\mathbf x\|_2 = \begin{cases} + \infty & \|(\mathbf 1 - \mathbf u)_+\|_2> 1, \\ 0 & \|(\mathbf 1 - \mathbf u)_+\|_2 \leq 1. \end{cases} \end{align*} $$
Therefore, we take the set |$U = \{\mathbf u \mid \|(\mathbf 1 - \mathbf u)_+\|_2 \leq 1 \}$| and the function |$g = 0$|. Notice that we can relax the set |$U$| to |$\{\mathbf u \mid \|\mathbf 1 - \mathbf u\|_2 \leq 1 \}$| and still obtain the same |$J$| function.
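The closed forms above lend themselves to a quick numerical sanity check: for a scalar |$x$|, minimizing |$u|x| + g(u)$| over a fine grid of |$u \in U$| should reproduce |$J(x)$| up to the stated constant. The Python sketch below performs this check for the |$\ell _p$| case in (i) and the transformed |$\ell _1$| case in (vi); the parameter values |$p = 0.5$|, |$a = 1$|, the grid bounds and the test points are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

def lifted_value(x_abs, g, u_grid):
    """Evaluate F_g^U(x) = min_{u in U} u*|x| + g(u) by brute force on a 1-D grid."""
    return np.min(u_grid * x_abs + g(u_grid))

p, a = 0.5, 1.0                      # illustrative parameters (not from the paper)
xs = np.linspace(0.1, 3.0, 20)       # test points; x = 0 is avoided for the l_p case

# (i) l_p: J(x) = |x|^p / p,  g(u) = (1-p)/p * u^{p/(p-1)},  U = [0, inf)
u_lp = np.linspace(1e-6, 50.0, 400_000)
g_lp = lambda u: (1 - p) / p * u ** (p / (p - 1))
err_lp = max(abs(lifted_value(x, g_lp, u_lp) - x ** p / p) for x in xs)

# (vi) scaled TL1: J(x) = |x| / (a + |x|),  g(u) = a*u - 2*sqrt(a*u),  U = [0, inf);
#      here F_g^U + 1 = J, so the constant 1 is added before comparing
u_tl1 = np.linspace(1e-8, 2.0, 400_000)
g_tl1 = lambda u: a * u - 2.0 * np.sqrt(a * u)
err_tl1 = max(abs(lifted_value(x, g_tl1, u_tl1) + 1.0 - x / (a + x)) for x in xs)

print(f"max |F - J|     for l_p: {err_lp:.2e}")
print(f"max |F + 1 - J| for TL1: {err_tl1:.2e}")
```

Both reported errors should be on the order of the grid resolution; an analogous check works for the other separable penalties in this appendix.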
C. Comparing the ADMM- and DCA-based algorithms
As an alternative to ADMM, one can minimize the proposed models (1.3) and (1.4) by the graduated non-convexity algorithm [5] or the DCA [22,44,45]. Specifically for DCA, since the function |$ -F_{g,\alpha }^U$| is convex on the positive cone, we use the algorithm introduced in [37], where we only need to find the sub-differential of the function on |$\mathbb{R}^n_+ $|. We use the function |$\psi (\cdot )$| to unify the constrained and unconstrained formulations, which yields the model
Since the function |$F_{g,\alpha }^U $| is concave on |$\mathbb{R}^n_+ $|, the objective can be written as a difference of two convex functions, i.e. |$F_{g, \alpha }^U(\mathbf x) + \psi (\mathbf x) = h_1(\mathbf x) - h_2(\mathbf x) $|, where |$h_1(\mathbf x) = \psi (\mathbf x) + \frac{\beta }{2} \|\mathbf x \|_2^2 $| and |$ h_2(\mathbf x) = \frac{\beta }{2} \| \mathbf x \|_2^2 - F_{g, \alpha }^U (\mathbf x) $| for |$\beta \geq 0.$| An interesting feature of the DCA form is that if |$g $| is of Type B or Type C, then for |$\mathbf x \geq 0 $| the sub-differentials take the form
For |$\mathbf x \in \mathbb{R}^n $|, take |$\mathbf u_{g, \alpha }(\mathbf x^k) \in \arg \min _{\mathbf u \in U} \left \langle \mathbf u, |\mathbf x^k | \right \rangle + \alpha g(\mathbf u) $|; then the DCA iterations become
We implement the DCA iterations (C.2) with |$\beta = 0 $| for simplicity and efficiency, as opposed to |$ \beta> 0$|. In addition, we consider an adaptive scheme to update |$\alpha ,$| which is adopted in Algorithm 2.
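For concreteness, the Python sketch below shows one plausible instantiation of this DCA scheme with |$\beta = 0$|, taking |$\psi (\mathbf x) = \frac{\gamma }{2}\|A\mathbf x - \mathbf b\|_2^2$| as a stand-in for the unconstrained data-fidelity term and using the log-sum lifting |$g_i(u_i) = a u_i - \log (u_i)$| from B(ii), for which the |$\mathbf u$|-update has the closed form |$u_i = \alpha /(|x_i| + \alpha a)$|. The weighted-|$\ell _1$| subproblem is solved by a plain proximal-gradient (ISTA) inner loop. All function names, the inner solver and the parameter values are illustrative assumptions; this is a minimal sketch, not the authors' Algorithm 2.

```python
import numpy as np

def soft_threshold(z, tau):
    """Component-wise soft-thresholding: the prox of the weighted l1 term."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def weighted_lasso(A, b, w, gamma, x0, n_inner=200):
    """Solve min_x <w, |x|> + (gamma/2)||Ax - b||^2 by proximal gradient (ISTA)."""
    L = gamma * np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
    x = x0.copy()
    for _ in range(n_inner):
        grad = gamma * A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, w / L)  # threshold level w_i / L per coordinate
    return x

def dca_log_sum(A, b, a=1.0, gamma=100.0, alpha0=1.0, eta=0.01, n_outer=50):
    """DCA-style outer loop (beta = 0) with the log-sum lifting g_i(u) = a*u - log(u),
    for which argmin_{u>0} u|x| + alpha*g(u) is u = alpha / (|x| + alpha*a)."""
    x = np.zeros(A.shape[1])
    alpha = alpha0
    for _ in range(n_outer):
        u = alpha / (np.abs(x) + alpha * a)      # closed-form u-update (re-weighting)
        x = weighted_lasso(A, b, u, gamma, x)    # weighted-l1 x-update
        alpha *= 1.0 - eta                       # exponential decay of alpha
    return x

# toy usage on a random sparse-recovery instance
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
x_true = np.zeros(256)
x_true[rng.choice(256, 8, replace=False)] = rng.standard_normal(8)
b = A @ x_true
x_hat = dca_log_sum(A, b)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

Swapping in a different lifting only changes the |$\mathbf u$|-update; for instance, the capped-|$\ell _1$| lifting of B(v) replaces it by the hard rule |$u_i = 1$| if |$|x_i| < \alpha a$| and |$u_i = 0$| otherwise.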

Fig. C1. Success rate comparison of ADMM and DCA with two |$g$| functions and different |$\eta $| for Gaussian matrices (left) with |$r = 0$| and |$r = 0.8 $| and DCT matrices (right) with |$F = 1, 5$|.
We compare ADMM (Algorithm 1) and DCA (Algorithm 2) for minimizing the same constrained formulation (1.3) with |$g_1$| and |$g_2$| discussed in Section 3.3. We are particularly interested in the algorithmic behaviour when |$\alpha $| is updated dynamically. As stated in Theorem 5, |$\alpha $| should be small enough to approximate the |$\ell _0$| solution. A common scheme is an exponential decay of the form |$\alpha ^{k+1} = (1 - \eta ) \alpha ^k$| for |$\eta \in (0,1)$|. If |$ \eta $| is close to |$ 1$|, then |$\alpha $| converges to zero too quickly and the algorithm cannot reach a good local minimum, as this is nearly equivalent to setting |$ \alpha = 0$| from the very beginning. On the other hand, if |$ \eta $| is close to |$0 $|, then |$\alpha $| decreases to zero slowly; as a result, Algorithm 1 may terminate before |$\alpha $| is small enough for |$F^U_{g, \alpha }$| to approximate the |$\ell _0$| norm.
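This trade-off can be made concrete from the decay formula itself: |$\alpha ^k = (1-\eta )^k \alpha ^0$|, so |$\alpha $| falls below a tolerance |$\tau $| after roughly |$\log (\tau /\alpha ^0)/\log (1-\eta )$| iterations. The short computation below tabulates this count for several |$\eta $|; the initial value |$\alpha ^0 = 1$| and the tolerance |$\tau = 10^{-6}$| are arbitrary choices for illustration.

```python
import math

alpha0, tol = 1.0, 1e-6                  # illustrative starting value and target tolerance
for eta in (0.001, 0.01, 0.1, 0.5):
    # smallest k such that (1 - eta)**k * alpha0 < tol
    k = math.ceil(math.log(tol / alpha0) / math.log(1.0 - eta))
    print(f"eta = {eta:5.3f}: alpha drops below {tol:g} after {k} iterations")
```

With |$\eta = 0.5$|, |$\alpha $| is essentially zero after about twenty iterations, mimicking |$\alpha = 0$| from the start, whereas |$\eta = 0.001$| requires roughly fourteen thousand iterations, so the algorithm may well stop before |$\alpha $| is small enough; this matches the behaviour described above.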
The comparison between ADMM and DCA on Gaussian and DCT matrices is presented in Fig. C1. Fixing |$\eta = 0.1,$| we select the optimal |$\rho = 2^j $| for ADMM with |$j \in \{-1,0,1,2,3,4,5,6,7,8 \}$| that achieves the smallest relative error to the ground truth. Then, using the optimal parameters |$\rho $| and |$\gamma ,$| Fig. C1 presents the ADMM results for |$\eta \in \{ 0.001,0.01,0.1 \}$| and the DCA results for |$\eta \in \{0.01,0.1 \} $|. In all cases, ADMM is superior to DCA in that it is less sensitive to |$\eta $|. In addition, DCA consists of two loops and is hence generally slower than ADMM. Our experiments suggest that |$\eta = 0.01 $| is a suitable choice. These comparisons show that the lifted form works well with ADMM, but not necessarily with DCA. This may be because DCA usually converges within a few iterations, which is not enough for |$\alpha $| to approach |$0$|. One may incorporate the decrease of |$\alpha $| into the |$\mathbf x$|-update, or adapt |$\alpha $| according to the result of each iteration, which is left for future work.