Yaghoub Rahimi, Sung Ha Kang, Yifei Lou, A lifted ℓ1 framework for sparse recovery, Information and Inference: A Journal of the IMA, Volume 13, Issue 1, March 2024, iaad055, https://doi.org/10.1093/imaiai/iaad055
Abstract
We introduce a lifted |$\ell _1$| (LL1) regularization framework for the recovery of sparse signals. The proposed LL1 regularization is a generalization of several popular regularization methods in the field and is motivated by recent advancements in re-weighted |$\ell _1$| approaches for sparse recovery. Through a comprehensive analysis of the relationships between existing methods, we identify two distinct types of lifting functions that guarantee equivalence to the |$\ell _0$| minimization problem, which is a key objective in sparse signal recovery. To solve the LL1 regularization problem, we propose an algorithm based on the alternating direction method of multipliers and provide proof of convergence for the unconstrained formulation. Our experiments demonstrate the improved performance of the LL1 regularization compared with state-of-the-art methods, confirming the effectiveness of our proposed framework. In conclusion, the LL1 regularization presents a promising and flexible approach to sparse signal recovery and invites further research in this area.
1. Introduction
Compressed sensing [14] plays an important role in many applications including signal processing, medical imaging, matrix completion, feature selection and machine learning [18,24,33]. One important assumption in compressed sensing is that a signal of interest is sparse or compressible, which allows high-dimensional data to be represented efficiently by only a few meaningful components. Compressed sensing often involves sparse recovery from an under-determined linear system, which can be formulated mathematically as minimizing the |$\ell _0 $| ‘pseudo-norm’, i.e.
(1.1)$$ \begin{align}& \min_{\mathbf x \in \mathbb{R}^n} \|\mathbf x\|_0 \quad \text{s.t.} \quad A \mathbf x = \mathbf b,\end{align} $$
where |$A \in \mathbb{R}^{m \times n}$| is called a sensing matrix and |$\mathbf b \in \mathbb{R}^m $| is a measurement vector. Since the minimization problem (1.1) is NP-hard [34], various regularization functionals are proposed to approximate the |$\ell _0 $| penalty. The |$\ell _1 $| norm is widely used as a convex relaxation of |$\ell _0,$| which is called LASSO [46] in statistics or basis pursuit [13] in signal processing. Due to its convex nature, the |$\ell _1$| norm is computationally tractable to optimize, with exact recovery guarantees based on the restricted isometry property (RIP) and/or the null space property (NSP) [9,15,47]. Alternatively, there are non-convex models, i.e. concave with respect to the positive cone, that outperform the convex |$\ell _1$| approach in terms of improved accuracy of identifying sparse solutions. For example, |$\ell _p (0<p<1)$| [12,26,56,57], smoothly clipped absolute deviation (SCAD) [17], minimax concave penalty (MCP) [61], capped |$\ell _1$| (CL1) [30,43,64] and transformed |$\ell _1 $| (TL1) [31,62,63] are separable and concave penalties. Some non-separable non-convex penalties include sorted |$\ell _1 $| [23], |$\ell _1-\ell _2 $| [4,28,29,58,59] and |$\ell _1/\ell _2 $| [39,51,55]. Properties of sparsity-promoting functions are developed in [42]. In addition, a large volume of literature discusses greedy algorithms [25,33] that minimize the |$\ell _0$| regularization, such as orthogonal matching pursuit [49], CoSaMP [35], iterative hard thresholding (IHT) [7], conjugate gradient IHT [6] and algebraic pursuits [11].
In this paper, we generalize a class of re-weighted approaches [10,20,53] for sparse recovery based on the fact that a convex function can be smoothly approximated by subtracting a strongly convex function from the objective function [36]. In particular, we propose a lifted regularization, which we refer to as lifted |$\ell _1$| (LL1),
(1.2)$$ \begin{align}& F_{g, \alpha}^U (\mathbf x) = \min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u),\end{align} $$
where we denote |$|\mathbf x | = [ |x_1|,...,|x_n|]\in \mathbb R^n$| and |$\langle \cdot , \cdot \rangle $| is the standard Euclidean inner product. In (1.2), the variable |$\mathbf u $| plays the role of weights with |$U$| as a set of restrictions on |$\mathbf u $|, e.g. |$U = [0,\infty )^n $| or |$ U = [0,1]^n $|. The function |$g:\mathbb{R}^n \rightarrow \mathbb{R} $| is decreasing near zero (please refer to Definition 1 for the notion of a multi-variable decreasing function) to ensure a non-trivial solution of (1.2), and |$\alpha> 0$| is a parameter. We can consider a more general function |$g(\mathbf u; \alpha )$| instead of |$\alpha g(\mathbf u)$| when connecting the proposed model to several existing sparsity-promoting regularizations in Section 2.2. We focus on two specific types of |$g$| functions (see Definition 2) that ensure the existence and uniqueness of the minimum.
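To make the definition concrete, the following minimal numerical sketch (our own, not the authors' code) evaluates |$F_{g,\alpha }^U$| in (1.2) for a separable |$g$| and a box set |$U$| by brute-force minimization over a one-dimensional grid of candidate weights per coordinate; the function name and grid resolution are arbitrary choices.

```python
import numpy as np

def lifted_l1(x, g_i, u_grid, alpha):
    """Evaluate F_{g,alpha}^U(x) = min_{u in U} <u,|x|> + alpha*g(u) for a
    separable g(u) = sum_i g_i(u_i) and a box set U discretized by u_grid."""
    absx = np.abs(np.asarray(x, dtype=float))
    # objective value for every (coordinate, candidate weight) pair
    vals = np.outer(absx, u_grid) + alpha * g_i(u_grid)
    u = u_grid[vals.argmin(axis=1)]          # per-coordinate minimizer
    return (u * absx).sum() + alpha * g_i(u).sum(), u

# example with the Type B choice g_i(u) = -u^2/2 on U = [0,1]^n (see Section 3.3)
value, weights = lifted_l1([0.0, 0.3, 2.0], lambda u: -0.5 * u**2,
                           np.linspace(0.0, 1.0, 1001), alpha=1.0)
```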
To find a sparse signal from a set of linear measurements, we propose the following constrained minimization problem:
(1.3)$$ \begin{align}& \min_{\mathbf x \in \mathbb{R}^n,\, \mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) \quad \text{s.t.} \quad A \mathbf x = \mathbf b.\end{align} $$
We lift the dimension of a single vector |$\mathbf x\in \mathbb R^n$| in the original |$\ell _0$| problem (1.1) into a joint optimization over |$\mathbf x $| and |$\mathbf u$| in the proposed model (1.3). We can establish the equivalence between the two if the function |$g$| and the set |$U$| satisfy certain conditions. For measurements with noise, we consider an unconstrained formulation, i.e.
(1.4)$$ \begin{align}& \min_{\mathbf x \in \mathbb{R}^n,\, \mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \frac{\gamma}{2} \|A \mathbf x - \mathbf b\|_2^2,\end{align} $$
where |$\gamma>0$| is a balancing parameter.
From an algorithmic point of view, the lifted form enables us to minimize over two variables, each with a fast implementation, and can lead to a better local minimum in a higher dimension than the original formulation [2,27,36,60]. The idea of updating |$\mathbf u$| while finding the sparsest solution |$\mathbf x$| is also related to the classic graduated non-convexity algorithm [5,41] that constructs a family of functions (1.3) approaching the true cost function (1.1). To minimize (1.3) and (1.4), we apply the alternating direction method of multipliers (ADMM) [8,19], and conduct a convergence analysis of ADMM for solving the unconstrained problem (1.4). We demonstrate in experiments that the proposed approach outperforms the state of the art. Our contributions are threefold:
- We propose a unified model that generalizes many existing regularizations and re-weighted algorithms for sparse recovery problems;
- We establish the equivalence between the proposed model (1.3) and the |$\ell _0$| model (1.1);
- We present an efficient algorithm that is supported by a convergence analysis.
The rest of the paper is organized as follows. In Section 2, we present details of the proposed model, including its properties with an exact sparse recovery guarantee. Section 3 describes the ADMM implementation and its convergence. Section 4 presents experimental results to demonstrate how this lifted model improves the sparse recovery over the state of the art. Finally, concluding remarks are given in Section 5.
2. The Proposed Lifted L1 Framework
We first introduce definitions and notations that are used throughout the paper. The connection to well-known sparsity-promoting regularizations is presented in Section 2.2. We present useful properties of LL1 in Section 2.3, and establish its equivalence to the original |$\ell _0$| problem in Section 2.4.
2.1 Definitions and Notations
We mark any vector in bold; specifically, |$\mathbf 1$| denotes the all-ones vector and |$\mathbf 0$| denotes the zero vector. For a vector |$\mathbf x \in \mathbb{R}^n, $| we define |$|x|_{(i)} $| as the |$i $|-th largest component of |$|\mathbf x| $|. We say that a vector |$\mathbf x \in \mathbb{R}^n $| majorizes |$ \mathbf x^{\prime} \in \mathbb{R}^n$| with the notation |$|\mathbf x| \succ |\mathbf x^{\prime}|,$| if we have |$\sum _{i \leq j} |x|_{(i)} \leq \sum _{i \leq j} |x^{\prime}|_{(i)} \ \forall 1\leq j < n$| and |$\sum _{i = 1}^n |x_i| = \sum _{i = 1}^n |x_i^{\prime}| $|. For two vectors |$\mathbf x,\mathbf y\in \mathbb R^n,$| the notation |$\mathbf x\leq \mathbf y$| means each element of |$\mathbf x$| is smaller than or equal to the corresponding element in |$\mathbf y;$| similarly for |$\mathbf x\geq \mathbf y$|, |$\mathbf x>\mathbf y$| and |$\mathbf x<\mathbf y$|. The positive cone refers to the set |$\mathbb{R}^n_+ = \{ \mathbf x \in \mathbb{R}^n \mid \mathbf x \geq \mathbf 0 \} $|. A rectangular subset of |$ \mathbb{R}^n$| is a Cartesian product of intervals aligned with the canonical axes. We define two elementwise operators, |$\max (\mathbf x,\mathbf y)$| and |$\mathbf x \odot \mathbf y,$| both returning a vector formed by taking the maximum and the product, respectively, of every pair of corresponding components. The proposed regularization (1.2) can be expressed by the |$\delta $|-function,
(2.1)$$ \begin{align}& F_{g, \alpha}^U (\mathbf x) = \min_{\mathbf u \in \mathbb{R}^n} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \delta_U(\mathbf u), \quad \text{where}\ \delta_U(\mathbf u) = \begin{cases} 0 & \text{if}\ \mathbf u \in U, \\ +\infty & \text{otherwise}. \end{cases}\end{align} $$
We use (1.2) and (2.1) interchangeably throughout the paper. The constrained LL1 problem (1.3) is equivalent to
(2.2)$$ \begin{align}& \min_{\mathbf x, \mathbf u} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \delta_U(\mathbf u) + \delta_{\varOmega}(\mathbf x),\end{align} $$
where |$\varOmega = \{ \mathbf x \in \mathbb{R}^n \mid A \mathbf x = \mathbf b \} $|. We denote |$[n]$| as the set |$ \{1,2,\dots ,n \} $|, |$S_n $| as the symmetric group of |$n $| elements and a permutation |$\pi \in S_n$| of a vector |$\mathbf x$| is defined as |$\mathbf x \circ \pi = \left (x_{\pi (1)}, \dots , x_{\pi (n)} \right ).$| We summarize the relevant properties of a function as follows.
Definition 1. Let |$g:\mathbb{R}^n \rightarrow \mathbb{R} \cup \{ +\infty , -\infty \} $| be a function. We say that:
- The function |$g$| is separable if there exists a set of functions |$\{g_i: \mathbb{R} \rightarrow \mathbb{R}\}$| for |$i \in [n]$| s.t.$$ \begin{align*}& g(\mathbf{x}) = \sum_{i = 1}^n g_i(x_i), \quad \forall \mathbf{x} =[x_1,\cdots, x_n]. \end{align*} $$
- The function |$g$| is strongly convex with parameter |$\mu> 0 $| if(2.3)$$ \begin{align}& g(\mathbf y) \geq g(\mathbf x) + \left\langle \nabla g(\mathbf x), \mathbf y - \mathbf x \right\rangle + \frac{\mu}{2} \| \mathbf y - \mathbf x\|_2^2, \quad \forall \mathbf x, \; \mathbf y.\end{align} $$
- The function |$g$| is symmetric if |$ g(\mathbf x) = g(\mathbf x \circ \pi ), \; \forall \mathbf x \in \mathbb{R}^n $| and |$\forall \pi \in S_n $|.
- The function |$g$| is coercive if |$ g(\mathbf x) \rightarrow +\infty $| as |$ \| \mathbf x \| \rightarrow +\infty $|.
- The function |$g$| is decreasing on |$ U$| if |$g(\mathbf x) \leq g(\mathbf y), \; \forall \mathbf x \geq \mathbf y$| and |$\mathbf x, \; \mathbf y \in U $|.
2.2 Connections to Sparsity Promoting Models
Many existing models can be understood as special cases of the proposed LL1 model (1.3). We start with two recent works. One is a joint minimization model [65] between the weights and the vector,
where |$\epsilon> 0$| is a fixed number and |$|\mathbf x| - \epsilon $| is the vector obtained by subtracting |$\epsilon $| from all components of |$|\mathbf x|$|. With the assumption that the weights are binary, the authors of [65] established the equivalence between (1.1) and (2.4) for a small enough |$\epsilon $|. Another related work is the trimmed LASSO [1,3],
$$ \begin{align*}& \sum_{i = k+1}^{n} |x|_{(i)} \;=\; \min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle,\end{align*} $$
which is equivalent to the LL1 form on the right with |$U = \{ \mathbf u \in \{0,1\}^n \ \mid \ \|\mathbf u\|_1 = n-k \} $|. On the left, the sum is over the |$n-k$| smallest components of the vector |$\mathbf x $| for a given sparsity |$ k$|.
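A quick numerical check of this equivalence (our own sketch, brute-forcing the binary set |$U$|):

```python
import numpy as np
from itertools import combinations

x = np.array([3.0, -0.2, 1.5, 0.05, -2.0])
k, n = 2, x.size
# sum of the n-k smallest magnitudes of x
smallest_sum = np.sort(np.abs(x))[: n - k].sum()
# LL1 form: minimize <u,|x|> over binary u with ||u||_1 = n-k
ll1_min = min(np.abs(x)[list(S)].sum() for S in combinations(range(n), n - k))
assert np.isclose(smallest_sum, ll1_min)
```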
In what follows, we consider a more general form of |$g(\mathbf u; \alpha ),$| as opposed to |$\alpha g(\mathbf u),$| i.e.
(2.5)$$ \begin{align}& F_{g}^U (\mathbf x) = \min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + g(\mathbf u; \alpha ).\end{align} $$
As a consequence of the Fenchel–Moreau theorem, regularizations that are concave on the positive cone are of the form (2.5). In other words, we have the following theorem:
Theorem 1. Any proper and lower semi-continuous function |$J(\mathbf x) $| that is concave on the positive cone can be lifted by a convex function |$g:\mathbb{R}^n \rightarrow \mathbb{R} $| and a set |$U $| such that |$J(|\mathbf x|) = F_g^U(\mathbf x) $| as in (2.5). Here |$g(\mathbf u):= \sup _{\mathbf x \geq \mathbf 0} \left \langle \mathbf x, -\mathbf u \right \rangle + J(\mathbf x) $| and |$U = \{ \mathbf u \in \mathbb{R}^n \mid g(\mathbf u) \neq +\infty \}$|.
Please refer to Appendix A and Appendix B for the proof of Theorem 1 and detailed computations, respectively. As many sparsity-promoting functions satisfy the assumptions in Theorem 1, we can rewrite |$J(\mathbf x)$| in the form of |$F_g^U$|. Using Theorem 1, we relate (2.5) to the following functionals that are widely used to promote sparsity.
- (i) The |$\ell _p $| model [12,26] is defined as |$ J^{\ell _p} (\mathbf x) = \sum _{i=1}^{n} |x_i| ^{p}.$| As |$p \rightarrow 0 $|, |$\ell _p^p $| approaches to |$\ell _0 $|, while it reduces to |$\ell _1$| for |$p=1$|. Much research focuses on |$p=1/2,$| due to a closed-form solution [56,57] in the optimization process. By choosing |$ g(\mathbf u; p) = \frac{1-p}{p} \sum _{i = 1}^n u_i^{\frac{p}{p-1}}$| and |$U = \mathbb{R}^n_+ $|, we can express the |$\ell _p $| regularization as$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| + \frac{1-p}{p} u_i^{\frac{p}{p-1}} \right) \ \mid \ u_i \geq 0 \right\}. \end{align*} $$
- (ii) The log-sum penalty [10,38] is given by |$ J^{\log }_a (\mathbf x) = \sum _{i=1}^{n} \log (|x_i| + a), $| for a small positive parameter |$a$| to make the logarithmic function well-defined. The re-weighted |$ \ell _1$| algorithm [10] (IRL1) minimizes |$J^{\log }_a (\mathbf x)$|, which is equivalent to (2.5) in that$$ \begin{align*} & \min_{\mathbf u} \Bigg\{ \sum_{i=1}^{n} \left( u_i|x_i| + a u_i - \log(u_i) \right) \mid u_i \geq 0 \Bigg\}. \end{align*} $$
- (iii) Smoothly clipped absolute deviation (SCAD) [17] is defined by(2.6)$$ \begin{align}& J_{a, b}^{\text{SCAD}} (\mathbf x) = \sum_{i=1}^{n} f_{a, b}^{\text{SCAD}} (x_i), \text{where}\ f_{a, b} ^{\text{SCAD}}(t) = \left\{ \begin{array}{@{}ll} a |t| & \text{if}\ |t| \leq a, \\[3pt] \displaystyle\frac{2 a b |t| - t^2 - a^2}{2(b - 1)} & \text{if}\ a < |t| \leq a b, \\[6pt] \displaystyle\frac{(b + 1)a^2}{2} & \text{if}\ |t|> ab. \end{array}\right.\end{align} $$This penalty is designed to reduce the bias of the |$\ell _1 $| model. For |$g(\mathbf u, a, b) = -ab \| \mathbf u \|_1 + (b - 1) \frac{\| \mathbf u \|_2^2}{2} $| and |$U = [0, a]^n $|, |$J_{a, b}^{\text{SCAD}}$| is equivalent to$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| - ab u_i + (b - 1)\frac{u_i^2} {2} \right) \ \mid \ u_i \in [0, a] \right\}. \end{align*} $$
- (iv) Mini-max concave penalty (MCP) [61] is defined by(2.7)$$ \begin{align}& J_{a, b}^{\text{MCP}} (\mathbf x) = \sum_{i=1}^{n} f_{a, b}^{\text{MCP}} (x_i), \text{ where }\ f_{a, b}^{\text{MCP}} (t) = \left\{ \begin{array}{@{}ll} a |t| - \displaystyle\frac{t^2}{2 b} & \text{if}\ |t| \leq ab, \\[3pt] \frac{1}{2} b a^2 & \text{if}\ |t|> ab. \end{array}\right.\end{align} $$With the same purpose of reducing bias as SCAD, MCP consists of two piece-wise defined functions, which is simpler than SCAD. For |$g(\mathbf u, a, b) = -b (a \| \mathbf u\|_1 - \frac{\| \mathbf u\|_2^2}{2}) $| and |$U = \mathbb{R}^n_+ $|, we can rewrite |$J_{a, b}^{\text{MCP}}$| as$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| + b \left(\frac{u_i^2}{2} -a u_i \right) \right) \ | \ u_i \geq 0 \right\}. \end{align*} $$
- (v) Capped |$\ell _1 $| (CL1) [64] is defined as |$ J^{\text{CL1}}_a (\mathbf x) = \sum _{i=1}^{n} \min \{ |x_i|, a \},$| with a positive parameter |$a>0.$| As |$a \rightarrow 0 $|, the function |$J^{\text{CL1}}_a / a $| approaches |$\ell _0.$| The CL1 penalty is unbiased and has fewer internal parameters than SCAD/MCP. For |$g(\mathbf u,a) = -a\| \mathbf u\|_1 $| and |$U = [0,1]^n $|, the CL1 regularization can be expressed as$$ \begin{align*} & \min_{ \mathbf u} \Bigg\{ \sum_{i=1}^{n}\left( u_i|x_i| -a u_i \right) \ | \ 0 \leq u_i \leq 1 \Bigg\}. \end{align*} $$It is similar to the model in [65], except for a binary restriction on |$\mathbf u$| as in (2.4).
- (vi) Transformed |$\ell _1 $| (TL1) [31] is defined as |$ J_a^{\text{TL1}} (\mathbf x) = \sum _{i=1}^{n} \frac{(a+1)|x_i|}{a + |x_i|}. $| It reduces to |$\ell _0$| for |$a=0,$| and converges to |$\ell _1$| as |$a\rightarrow \infty .$| The TL1 regularization is Lipschitz continuous, unbiased and equivalent to$$ \begin{align*} & \min_{\mathbf u} \Bigg\{ \sum_{i=1}^{n} \left( u_i|x_i| +a u_i - 2 \sqrt{a u_i} \right) \ | \ u_i \geq 0 \Bigg\}. \end{align*} $$
- (vii) Error function penalty (ERF) [20] is defined by(2.8)$$ \begin{align}& J_{\sigma}^{\text{ERF}} (\mathbf x) = \sum_{i=1}^{n} f_{\sigma}^{\text{ERF}} (|x_i|), \text{ where }\ f_{\sigma}^{\text{ERF}} (t) = \int_{0}^{t} e^{-\tau^2 / \sigma^2} d \tau.\end{align} $$This model is less biased than |$\ell _1,$| and gives a good approximation to |$\ell _0 $| for a small value of |$\sigma $|. Let |$h(t) = \int _{t}^{1} \sqrt{- \log ( \tau )} d \tau $|; then for |$g(\mathbf u, \sigma ) = \sigma \sum _{i=1}^{n} h(u_i) $| and |$U = [0,1]^n $|, the ERF function is equivalent to$$ \begin{align*}& \min_{\mathbf u} \left\{ \sum_{i=1}^{n} \left( u_i|x_i| + \sigma h(u_i) \right) \ \mid \ u_i \in [0, 1] \right\}. \end{align*} $$
- (viii) The |$\ell _1-\ell _2$| penalty, defined as |$ J^{\text{L1-L2}} (\mathbf x) = \|\mathbf x \|_1 - \|\mathbf x\|_2, $| is an example of non-separable regularization, which has been proven to perform very well when the matrix |$A$| is highly coherent [29,59]. The |$\ell _1-\ell _2$| regularization is equivalent to$$ \begin{align*} & \min_{ \mathbf u} \Bigg\{ \sum_{i=1}^{n} u_i|x_i| \ \ \ | \ \ \ \|\mathbf 1 - \mathbf u \|_2 \leq 1 \Bigg\}. \end{align*} $$In this case, the corresponding |$g$| function is zero, and the set |$U$| is a non-rectangular set, thus non-separable.
We summarize these existing regularizations with their corresponding |$g$| function and |$U$| set in Table 1. All these |$g$| functions are separable, and hence we can plot |$g$| as a univariate function. As illustrated in Fig. 1, each |$g$| function is decreasing on a small interval near the origin, thus motivating the conditions on the function |$ g$| presented in Definition 2 as well as Theorems 5 and 6 for exact recovery guarantees.
Summary of various regularizations and their corresponding |$g$| function and |$U$| set
Model | Regularization | Function |$ g$| | Set |$U $| |
---|---|---|---|
|$ \ell _p$| | |$\sum _{i=1}^{n} |x_i|^{p} $| | |$\frac{1-p}{p} \sum _{i = 1}^n u_i^{\frac{p}{p-1}} $| | |$ \mathbb{R}^n_+$| |
log-sum | |$ \sum _{i=1}^{n} \log (|x_i| + a)$| | |$ a \| \mathbf u \|_1 - \sum _{i=1}^n \log (u_i)$| | |$ \mathbb{R}^n_+$| |
SCAD | (2.6) | |$-ab \| \mathbf u \|_1 + (b - 1) \frac{\| \mathbf u \|_2^2}{2} $| | |$[0, a]^n $| |
MCP | (2.7) | |$-b (a \| \mathbf u\|_1 - \frac{\| \mathbf u\|_2^2}{2}) $| | |$ \mathbb{R}^n_+$| |
CL1 | |$\sum _{i=1}^{n} \min \{ |x_i|, a \} $| | |$-a\| \mathbf u\|_1$| | |$ [0,1]^n$| |
TL1 | |$ \sum _{i=1}^{n} \frac{(a+1)|x_i|}{a + |x_i|} $| | |$ a \|\mathbf u \|_1 - 2 \sum _{i} (\sqrt{a u_i})$| | |$ \mathbb{R}^n_+$| |
ERF | (2.8) | |$\sigma \sum _i h(u_i)$| | |$ [0,1]^n$| |
|$\ell _1-\ell _2$| | |$\|\mathbf x\|_1-\|\mathbf x\|_2$| | 0 | |$\{\|\mathbf 1-\mathbf u\|_2\leq 1\}$| |
Fig. 1. The associated |$g$| functions for the models listed in Table 1, grouped into two categories: (a) SCAD with |$a = 1, b = 2$|, CL1 with |$a = 1$| and ERF with |$\sigma = 1 $|; and (b) |$\ell _p $| with |$p = 1/2 $|, log-sum with |$a = 1 $|, MCP with |$a = 1, b = 3 $| and TL1 with |$a = 4 $|. This motivates Type B and Type C in Definition 2.
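As a sanity check of two rows of Table 1, the following sketch (ours, not from the paper) verifies numerically that the CL1 and MCP lifts reproduce the original penalties up to the constant shifts derived in Appendix B, namely |$F_g^U + a = J^{\text{CL1}}_a$| and |$F_g^U + a^2 b/2 = J^{\text{MCP}}_{a,b}$|.

```python
import numpy as np

a, b = 1.0, 3.0
u = np.linspace(0.0, 10.0, 100001)            # dense grid of candidate weights
for x in np.linspace(-4.0, 4.0, 41):
    # CL1 row: g(u) = -a*u on U = [0,1]
    u1 = u[u <= 1.0]
    F_cl1 = np.min(u1 * abs(x) - a * u1)
    assert np.isclose(F_cl1 + a, min(abs(x), a), atol=1e-3)
    # MCP row: g(u) = -b*(a*u - u^2/2) on U = [0, inf), truncated to the grid
    F_mcp = np.min(u * abs(x) - b * (a * u - u**2 / 2))
    J_mcp = a * abs(x) - x**2 / (2 * b) if abs(x) <= a * b else a**2 * b / 2
    assert np.isclose(F_mcp + a**2 * b / 2, J_mcp, atol=1e-3)
```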
2.3 Properties of the Proposed Regularization
There is a wide range of analysis related to concave and symmetric regularization functions based on RIP and NSP conditions [9,15,47,48] regarding the sensing matrix |$A $|. Our general model |$F^U_{g,\alpha }$| satisfies all the NSP-related conditions discussed in [48] so that the exact sparse recovery can be guaranteed. Theorem 2 summarizes some important properties of the proposed regularization.
Theorem 2. For any |$\mathbf x \in \mathbb{R}^n,$| |$\alpha> 0,$| and a feasible set |$U $| on the weights |$\mathbf u, $| the function |$F_{g, \alpha }^U (\mathbf x)$| defined in (1.2) has the following properties:
(i) The function |$F^U_{g, \alpha }(\mathbf x) = -\left ( \alpha g + \delta _{U} \right )^*(-|\mathbf x|),$| where |$f^*$| denotes the convex conjugate of a function |$f$|, thus |$F^U_{g, \alpha }(\mathbf x)$| is concave in the positive cone.
(ii) If |$g $| is symmetric on |$ U $|, then |$ F^U_{g, \alpha }$| is symmetric on |$\mathbb{R}^n $|.
(iii) If |$g $| is separable on |$ U $|, then |$ F^U_{g, \alpha }$| is separable on |$\mathbb{R}^n$|.
(iv) If |$g $| is separable and symmetric on |$U $|, |$ F^U_{g, \alpha }$| satisfies the increasing property on |$\mathbb{R}^n_+ $|, i.e. |$F^U_{g, \alpha } (|\mathbf x|) \geq F^U_{g, \alpha } ( |\mathbf x^{\prime}|) $| for any |$|\mathbf x| \geq |\mathbf x^{\prime}| $|, and reverses the order of majorization, i.e. |$F^U_{g, \alpha } (|\mathbf x|) \leq F^U_{g, \alpha } ( |\mathbf x^{\prime}|) $| if |$|\mathbf x| \succ |\mathbf x^{\prime}|$|, where |$\succ $| is defined in Section 2.1.
- (v) If |$g $| is separable and |$U $| is rectangular, then |$ F^U_{g, \alpha }$| satisfies the sub-additive property on |$\mathbb{R}^n $|, i.e.$$ \begin{align*}& F^U_{g, \alpha}(\mathbf x_1 + \mathbf x_2) \leq F^U_{g, \alpha}(\mathbf x_1) + F^U_{g, \alpha}(\mathbf x_2), \quad \forall \mathbf x_1, \mathbf x_2 \in \mathbb{R}^n. \end{align*} $$The equality holds if |$\mathbf x_1 $| and |$\mathbf x_2 $| have disjoint support, and each coordinate of |$g $| has the same minimum.
- (vi) Let |$U $| be compact and |$g $| continuous. Then |$F_{g,\alpha }^U $| is continuous and the set of sub-differentials of |$-F_{g,\alpha }^U $| at the point |$\mathbf x \geq 0 $| is given by$$ \begin{align*}& \partial (-F_{g, \alpha}^U) (\mathbf x) = - \text{Conv} \left(\arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \right), \end{align*} $$where |$\text{Conv} $| denotes the convex hull of the points. In addition, the function |$F_{g,\alpha }^U $| is differentiable at |$\mathbf x$| if there exists a unique solution |$\mathbf u$| for minimizing |$F_{g,\alpha }^U (\mathbf x)$|. Consequently, if |$g $| is strongly convex, |$F_{g,\alpha }^U $| is continuously differentiable on the positive cone.
- (vii) If |$g(0) = 0 $| and |$g$| takes its minimum value at some point in |$ U \subseteq \mathbb{R}^n_+$|, then we have for |$\alpha _1> \alpha _2 $| that$$ \begin{align*}& F^U_{g, \alpha_1}(\mathbf x) \leq F^U_{g, \alpha_2}(\mathbf x), \quad \forall \mathbf x \in \mathbb{R}^n. \end{align*} $$
(i) Recall the convex conjugate of a function |$f$| is defined as |$ f^*(\mathbf y)=\sup \{\langle \mathbf x, \mathbf y\rangle -f(\mathbf x)\}.$| Comparing it with the definition |$F^U_{g, \alpha }(\mathbf x)$| in (1.2), we have |$F^U_{g, \alpha }(\mathbf x) = -\left ( \alpha g + \delta _{U} \right )^*(-|\mathbf x|)$|. As convex conjugate is always convex, |$F_{g, \alpha }^{U}(\mathbf x)$| is concave on the positive cone.
(ii) If |$ g$| is symmetric, it is straightforward that |$F_{g, \alpha }^{U} $| is also symmetric.
(iii) Since |$g $| is a separable function, |$\min _{\mathbf u\in U} f(\mathbf x, \mathbf u)$| breaks down into |$n $| scalar problems, and hence |$ F^U_{g, \alpha }$| is separable.
- (iv) The increasing property follows from the fact that we have for every |$\mathbf u \in U $|$$ \begin{align*}& \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \geq \left\langle \mathbf u, |\mathbf x^{\prime} | \right\rangle + \alpha g(\mathbf u) \quad \forall |\mathbf x| \geq |\mathbf x^{\prime}|. \end{align*} $$Taking the minimum of both sides with respect to |$\mathbf u $| proves the increasing property. The reverse order of majorization can be proved in the same way as in [48, Proposition 2.10].
(v) The sub-additive property can be proved in the same way as in [48, Lemma 2.7].
- (vi) It is straightforward that(2.9)$$ \begin{align}& -F^U_{g, \alpha}(\mathbf x) = -\min_{\mathbf u\in U} \left\langle \mathbf u, | \mathbf x| \right\rangle + \alpha g(\mathbf u) = \max _{\mathbf u\in U} -\left\langle \mathbf u, | \mathbf x| \right\rangle - \alpha g(\mathbf u).\end{align} $$We set |$f_{\mathbf u}(\mathbf x) = -\left \langle \mathbf u, | \mathbf x| \right \rangle - \alpha g(\mathbf u)$| for each |$\mathbf u \in U $|. As |$g$| is continuous, the map |$\mathbf u \mapsto f_{\mathbf u}(\mathbf x)$| is continuous for every |$\mathbf x $|, and for each |$\mathbf u \in U $|, the function |$f_{\mathbf u} $| is continuous. For |$\mathbf x \geq 0 $|, it follows from the Ioffe–Tikhomirov theorem [21, Proposition 6.3] that$$ \begin{align*} \partial (-F_{g, \alpha}^U) (\mathbf x) & = \text{Conv} \left\{ \cup_{\mathbf u \in T(\mathbf x)} \partial f_{\mathbf u}(\mathbf x) \right\} = \text{Conv} \left\{ \cup_{\mathbf u \in T(\mathbf x)} \{-\mathbf u \} \right\} \\ & = \text{Conv} \{ -\mathbf u \mid -F_{g, \alpha}^U(\mathbf x) = f_{\mathbf u}(\mathbf x) \} = \text{Conv} \{ -\mathbf u \mid \mathbf u \in \arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \} \\ & = -\text{Conv} \{ \arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x | \right\rangle + \alpha g(\mathbf u) \}, \end{align*} $$where |$ T(\mathbf x) = \{ \mathbf u \mid -F_{g, \alpha }^U(\mathbf x) = f_{\mathbf u}(\mathbf x) \}$|.
(vii) Given |$\mathbf x \in \mathbb{R}^n $| with |$\alpha _2$|, we denote |$ \mathbf u^*_2 \in \arg \min _{\mathbf u \in U} \left \langle \mathbf u, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u), $| which implies |$ F^U_{g, \alpha _2}(\mathbf x) = \left \langle \mathbf u^*_2, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u^*_2) $| and |$\left \langle \mathbf u^*_2, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u^*_2) \leq \left \langle \mathbf 0, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf 0) = 0.$| It further follows from |$\left \langle \mathbf u^*_2, |\mathbf x | \right \rangle \geq 0$| that |$g(\mathbf u^*_2) \leq 0 $|. For |$\alpha _1> \alpha _2 $|, we have |$\alpha _1 g(\mathbf u^*_2) \leq \alpha _2 g(\mathbf u^*_2)$| and hence |$ F^U_{g, \alpha _1}(\mathbf x) \leq \left \langle \mathbf u_2^*, |\mathbf x | \right \rangle + \alpha _1 g(\mathbf u^*_2) \leq \left \langle \mathbf u_2^*, |\mathbf x | \right \rangle + \alpha _2 g(\mathbf u^*_2) = F^U_{g, \alpha _2}(\mathbf x). $|
Using properties (i)–(v), we can prove that every |$s $|-sparse vector |$\mathbf x $| is the unique solution to (1.3) if and only if |$F_{g, \alpha } $| satisfies the generalized null space property (gNSP) [48] of order |$s$|. A function |$ F$| satisfies the gNSP of order |$s $| corresponding to a matrix |$A $| if
$$ \begin{align*}& F(\mathbf v_S) < F(\mathbf v_{\bar S}), \quad \forall \mathbf v \in \ker(A) \setminus \{\mathbf 0\} \ \text{and all}\ S\ \text{with}\ |S| \leq s. \end{align*} $$
Note that |$S \subseteq \{1, \dots , n \}$|, |$\bar{S}$| is the complement of |$S$|, and |$\mathbf v_S$| refers to the vector with the same coordinates as |$\mathbf v$| except zero for indices in |$\bar{S}$|. Please refer to [48, Theorem 4.3] for more details on gNSP. The property (vi) has algorithmic benefits, as many optimization algorithms are designed for continuously differentiable functions. We show in Theorem 3 that |$F^U_{g, \alpha }$| is related to |$\ell _0$| and |$\ell _1$| if |$g$| is separable (without the assumption of strong convexity on |$g$|). The relationship of |$F^U_{g, \alpha }$| to iteratively re-weighted algorithms, e.g. [10,20], is characterized in Theorem 4.
Theorem 3. Suppose |$U= [0,1]^n$| and |$g$| is separable, i.e. |$g(\mathbf u) = \sum _{i=1}^n g_i(u_i)$| with each |$g_i$| being a strictly decreasing function on |$[0,1]$| with a bounded derivative. If |$g_i(0) =0$| and |$g_i(1) = -1 $| for |$1\leq i \leq n$|, we have that for |$\mathbf x \in \mathbb{R}^n $| there are |$ \alpha _0 \leq \alpha _1$| such that
(i) |$ \frac 1 {\alpha } F^U_{g, \alpha } (\mathbf x) + n = \|\mathbf x\|_0 $|, for all |$0< \alpha \leq \alpha _0 $|;
(ii) |$ F^U_{g, \alpha }(\mathbf x) -\alpha g(\mathbf 1) = \|\mathbf x\|_1$|, for all |$\alpha \geq \alpha _1 $|.
Corollary 1. (i) |$ \frac 1 {\alpha } F^U_{g, \alpha } + n \rightarrow \ell _0 $|, as |$\alpha \rightarrow 0 $|;
(ii) |$ F^U_{g, \alpha } -\alpha g(\mathbf 1) \rightarrow \ell _1$|, as |$\alpha \rightarrow +\infty $|.
- (i) For any fixed |$\mathbf x \in \mathbb{R}^n $|, we consider the derivative of the objective in (1.2) with respect to each component |$u_i$|, i.e. |$f_i:=|x_i| + \alpha g^{\prime}(u_i)$|. If |$x_i = 0,$| then |$f_i$| is negative since |$g$| is decreasing, and hence the minimum is attained at |$u_i = 1$|. If |$x_i \neq 0$|, |$f_i$| is positive for a small enough |$\alpha ,$| due to the assumption that |$g^{\prime}$| is bounded. A positive derivative implies that the objective is increasing, and hence the minimum is attained at |$u_i = 0 $|. In summary, if |$\alpha $| is sufficiently small, then we obtain that |$u_i=1$| if |$x_i=0$| and |$u_i=0$| if |$x_i \neq 0$|. For this choice of |$\mathbf u, $| we get that |$ \left \langle \mathbf u, |\mathbf x | \right \rangle = 0$| and$$ \begin{align*} &\sum_{i} g_i(u_i) = \sum_{x_i = 0} g_i(u_i) = \sum_{x_i = 0} (-1) = \|\mathbf x \|_0 - n, \end{align*} $$which implies that |$F_{g, \alpha } (\mathbf x) = \alpha (\|\mathbf x \|_0 - n)$| for a small enough |$\alpha $| that depends on the choice of |$\mathbf x$|. By letting |$\alpha \rightarrow 0$|, we have |$F_{g, \alpha } (\mathbf x) / \alpha = \|\mathbf x \|_0 - n$| for all |$\mathbf x.$|
(ii) Since |$g(1) < g(0)$|, there exists a value of |$u_i \in (0,1] $| with |$g^{\prime}(u_i) < 0 $|, and so for large enough |$\alpha $|, the derivative |$ |x_i| + \alpha g^{\prime}(u_i) $| is always negative. It further follows from the decreasing function |$g$| that the minimizer is always attained at |$\mathbf u = \mathbf 1$|, which yields the desired result. Similarly to (i), by letting |$\alpha \rightarrow +\infty ,$| the analysis holds for all |$\mathbf x.$|
Theorem 3 presents the ideal choice for the weight, i.e. |$u_i=1$| if |$x_i=0$| and |$u_i=0$| if |$x_i \neq 0$|. A similar idea of zero weights on the known support was explored in [32,40,50], which is referred to as weighted |$\ell _1$|. In addition, Theorem 3 implies that the function |$ \frac{1}{\alpha }F_{g,\alpha }^U + n$| approximates the |$\ell _0$| norm from below. We can define a function |$H(\mathbf x, \alpha ):= \frac 1 {\alpha } F^U_{g, \alpha } (\mathbf x) + n: \mathbb{R}^n \times [0,\alpha _0] \rightarrow \mathbb{R}$| as a transformation between |$\| \mathbf x \|_0 $| and |$\frac 1 {\alpha _0} F^U_{g, \alpha _0} (\mathbf x) + n $| for a fixed |$\alpha _0$|. As characterized in Corollary 1, this relationship motivates us to consider a type of homotopic algorithm (discussed in Section 3) to better approximate the desired |$\ell _0$| norm, although |$H(\mathbf x, \alpha )$| is not a homotopy itself (it is not continuous with respect to |$\mathbf x$|).
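As a small illustration of Theorem 3 (our own sketch), take the Type B choice |$g_i(u) = -u$|, which satisfies |$g_i(0)=0$| and |$g_i(1)=-1$|; for this choice the inner minimization in (1.2) reduces to |$\min (|x_i|-\alpha , 0)$| per coordinate, and both claims of the theorem can be checked directly.

```python
import numpy as np

def F(x, alpha):
    # F_{g,alpha}^U(x) for g_i(u) = -u on U = [0,1]^n: per coordinate min(|x_i| - alpha, 0)
    return np.minimum(np.abs(x) - alpha, 0.0).sum()

x = np.array([0.0, 1.3, 0.0, -0.7, 2.1])
n = x.size
alpha_small, alpha_large = 0.1, 10.0   # below the smallest nonzero |x_i|, above the largest
assert np.isclose(F(x, alpha_small) / alpha_small + n, np.count_nonzero(x))  # Theorem 3(i)
assert np.isclose(F(x, alpha_large) + alpha_large * n, np.abs(x).sum())      # Theorem 3(ii), g(1) = -n
```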
If we do not require all the |$g_i$| functions to attain the same value at |$1$| as in Theorem 3, but rather allow |$g_i(1)$| to take different values, then the proposed regularization |$ F^U_{g, \alpha }$| is equivalent to a weighted |$\ell _1 $| model with a certain shift of |$g(\mathbf 1)$|; see Theorem 4.
Since each derivative |$g_i^{\prime}$| is bounded, there exists a positive number |$M_{\mathbf x}$| (depending on |$\mathbf x$|) such that |$|x_i| + \alpha g^{\prime}(u_i)$| is negative for |$ \alpha> M_{\mathbf x}$|. As a result, the minimum of |$\langle u_i, |x_i|\rangle + \alpha g_i(u_i)$| is attained at |$u_i = 1 $|. For a sufficiently large |$\alpha ,$| the minimizer is |$\mathbf u^* =\mathbf{1} $|. By letting |$\alpha \rightarrow +\infty ,$| (2.10) holds for all |$\mathbf x.$|
2.4 Exact Recovery Analysis
There are many models approximating the |$\ell _0$| minimization problem (1.1), and yet only a few of them have exact recovery guarantees. Motivated by the equivalence [65] between the |$\ell _0 $| model (1.1) and (2.4) with a sufficiently small parameter |$\epsilon $|, we give conditions on the |$g $| function to establish the equivalence between our proposed model (1.3) and (1.1). Note that the weight vector |$\mathbf u $| in our formulation is not binary, but takes continuous values. By taking Table 1 and Fig. 1 into account, we consider two types of |$g $| functions defined as follows:
Definition 2. Let |$g:\mathbb{R}^n \rightarrow \mathbb{R} \cup \{ +\infty , -\infty \} $| be a separable function with |$g(\mathbf u) = \sum _{i = 1}^n g_i(u_i),$| for |$\mathbf u =[u_1,\cdots ,u_n]\in \mathbb{R}^n $|. We define
Type B: All |$g_i $| functions have bounded derivatives on |$[0,1],$| and are strictly decreasing on |$[0,1] $| with the same value at |$0 $| and |$ 1$|, i.e. there exist two constants |$a, b \in \mathbb{R} $| such that |$g_i(0) = a, g_i(1) = b, \forall i\in [n].$|
Type C: All |$g_i $| functions are convex on |$[0, \infty )$| with the same value at |$0 $| and the same minimum at a point other than zero, i.e. there exist two constants |$a>b \in \mathbb{R}$| such that |$g_i(0) = a $| and |$\min _{t \geq 0} g_i(t) = b, \forall i\in [n]$|.
An important characteristic both types of functions share is that they are decreasing near zero. Type B functions are defined on a bounded interval, and we enforce a box constraint on |$\mathbf u$| for strictly decreasing |$g$|. Type C refers to convex functions defined on an unbounded interval due to |$\lim _{t \rightarrow \infty } g(t)> g(0).$| Note that Theorems 3 and 4 hold when |$g$| is a Type B or Type C function. We establish the equivalence between (1.3) and (1.1) for Type B and Type C functions in Theorems 5 and 6, respectively.
Theorem 5. Suppose |$g $| is a Type B function in Definition 2 on |$U = [0,1]^n $|. There exists |$\alpha> 0$| such that the model (1.3) is equivalent to (1.1), i.e. if |$(\mathbf x^*,\mathbf u^*) $| is a minimizer of (1.3), then |$\mathbf x^* $| is a minimizer of (1.1); conversely, if |$\mathbf x^* $| is a minimizer of (1.1) then by taking |$u^*_i = 1 $| for |$x^*_i = 0 $| and |$u^*_i = 0 $| otherwise, |$(\mathbf x^*,\mathbf u^*) $| is a minimizer of (1.3).
Since |$g $| is a Type B function, we represent |$g(\mathbf u) = \sum _{i=1}^n g_i(u_i)$| with each |$g_i$| strictly decreasing and having bounded derivatives on |$[0,1].$| Without loss of generality, we assume that |$g_i(0) = 0 $| and |$g_i(1) = -1, \forall i\in [n].$| Denote |$ s:= \min _{\mathbf x} \{ \|\mathbf x\|_0 \mid A \mathbf x = \mathbf b \}$| and |$ \epsilon _0:= \min _{\mathbf x} \{ |\mathbf x|_{(s)} \mid A \mathbf x = \mathbf b \}. $| Here |$\epsilon _0> 0;$| otherwise there exists a solution to (1.1) with sparsity less than |$s$|. Since |$g_i $| has bounded derivatives, there exists a scalar |$ \alpha> 0 $| such that |$-\frac{\epsilon _0}{\alpha } < g^{\prime}_i(u), \forall u \in [0,1]$| and |$i \in [n] $|.
Next we show that |$\mathbf x^* $| must have sparsity |$ s$|. If |$ |x^*_i| \in (0,\epsilon _0) $| for some |$i$|, then we have |$u_i|x^*_i| + \alpha g_i(u_i)> \alpha g_i(1).$| Note that the inequality is strict, forcing |$ f(\mathbf x^*, \mathbf u^*)$| to be strictly greater than the lower bound |$-\alpha (n-s) $|, which is a contradiction. Therefore, if |$|x^*_i| < \epsilon _0,$| we must have |$ x^*_i = 0$|. Based on the definition of |$\epsilon _0,$| we have |$\|\mathbf x^* \|_0 = s $|, which implies |$\mathbf x^* $| is a minimizer of (1.1).
Conversely, if |$\mathbf x^* $| is a solution of (1.1), then |$\mathbf x^*$| satisfies |$A\mathbf x^*=\mathbf b.$| With the choice of |$ u_i^* = 0$| for |$ |x_i^*| \neq 0$| and |$u_i^* = 1 $| otherwise, we get |$(\mathbf x^*, \mathbf u^*)$| is a minimizer of (1.3) such that the objective function attains the minimum value |$-\alpha (n-s) $|.
Theorem 6. Suppose |$U = [0,\infty )^n $| and |$g $| is a Type C function in Definition 2. Then there exists a constant |$\alpha> 0$| such that the model (1.3) is equivalent to (1.1).
3. Numerical Algorithms and Convergence Analysis
We describe in Section 3.1 the ADMM [8,19] for solving the general model, with convergence analysis presented in Section 3.2. In Section 3.3, we discuss closed-form solutions of the |$\mathbf u$|-subproblem for two specific choices of |$g$|.
3.1 The Proposed Algorithm
We define a function |$\psi (\cdot )$| to unify the constrained and the unconstrained formulations, i.e. |$\psi (\mathbf x) = \delta _\varOmega (\mathbf x)$| for (1.3) and |$\psi (\mathbf x) = \frac \gamma 2 \|A\mathbf x-\mathbf b\|_2^2$| for (1.4). We introduce a new variable |$\mathbf y$| in order to apply ADMM to minimize
(3.1)$$ \begin{align}& \min_{\mathbf x, \mathbf y, \mathbf u \in U} \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \psi(\mathbf y) \quad \text{s.t.} \quad \mathbf x = \mathbf y.\end{align} $$
The corresponding augmented Lagrangian becomes
(3.2)$$ \begin{align}& L_\rho(\mathbf u, \mathbf x, \mathbf y; \mathbf v) = \left\langle \mathbf u, |\mathbf x| \right\rangle + \alpha g(\mathbf u) + \delta_U(\mathbf u) + \psi(\mathbf y) + \left\langle \mathbf v, \mathbf x - \mathbf y \right\rangle + \frac{\rho}{2} \|\mathbf x - \mathbf y\|_2^2,\end{align} $$
where |$\mathbf v$| is the Lagrangian dual variable and |$\rho $| is a positive parameter. The ADMM scheme involves the following iterations:
(3.3)$$ \begin{align}& \begin{cases} \mathbf u^{k+1} = \arg\min_{\mathbf u} L_\rho(\mathbf u, \mathbf x^{k}, \mathbf y^{k}; \mathbf v^{k}), \\ \mathbf x^{k+1} = \arg\min_{\mathbf x} L_\rho(\mathbf u^{k+1}, \mathbf x, \mathbf y^{k}; \mathbf v^{k}), \\ \mathbf y^{k+1} = \arg\min_{\mathbf y} L_\rho(\mathbf u^{k+1}, \mathbf x^{k+1}, \mathbf y; \mathbf v^{k}), \\ \mathbf v^{k+1} = \mathbf v^{k} + \rho\, (\mathbf x^{k+1} - \mathbf y^{k+1}). \end{cases}\end{align} $$
The original problem (1.3) jointly minimizes |$\mathbf u $| and |$\mathbf x,$| which are updated separately in (3.3). In particular, the |$\mathbf u$|-subproblem can be expressed as
(3.4)$$ \begin{align}& \mathbf u^{k+1} = \arg\min_{\mathbf u \in U} \left\langle \mathbf u, |\mathbf x^{k}| \right\rangle + \alpha g(\mathbf u).\end{align} $$
In general, one may not find a closed-form solution to (3.4). For a separable function |$g$| and a rectangular set |$ U$|, the |$\mathbf u $|-update simplifies into |$n $| one-dimensional minimization problems; refer to Section 3.3 for the |$\mathbf u $|-update with two specific |$g$| functions that are used in experiments. The |$\mathbf x$|-update has a closed-form solution given by
where |$ \mathbf{shrink}(\mathbf v, \mathbf u) = \mathrm{sign}( \mathbf v) \odot \max (|\mathbf v|-\mathbf u,\mathbf 0). $|
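In code, the componentwise soft-shrinkage operator is a direct transcription of the formula above (the exact arguments fed to it in the |$\mathbf x$|-update follow the sign convention of the augmented Lagrangian):

```python
import numpy as np

def shrink(v, u):
    """Componentwise soft-thresholding with a vector of thresholds u >= 0:
    shrink(v, u) = sign(v) * max(|v| - u, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - u, 0.0)
```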
For the constrained formulation, i.e. |$\psi (\mathbf y) = \delta _{\varOmega }(\mathbf y) $|, the |$\mathbf y$|-subproblem becomes
It is equivalent to a projection onto the affine solution set of |$A \mathbf x = \mathbf b $|, which has a closed-form solution.
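A minimal sketch of this affine projection (assuming |$A$| has full row rank; the point being projected is formed from |$\mathbf x^{k+1}$| and the dual variable according to (3.2), which we do not restate here):

```python
import numpy as np

def project_affine(z, A, b):
    """Euclidean projection of z onto the affine set {y : A y = b}:
    y = z - A^T (A A^T)^{-1} (A z - b), assuming A has full row rank."""
    return z - A.T @ np.linalg.solve(A @ A.T, A @ z - b)
```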
For the unconstrained formulation, |$\psi (\mathbf y) = \frac{\gamma }{2} \| A \mathbf y - \mathbf b \|_2^2 $|, the |$\mathbf y$|-subproblem also has a closed-form solution by solving a linear system
It is worth noting that for the unconstrained formulation, a more accurate choice for |$\psi $| is |$\psi (\mathbf y) = \frac{\alpha \gamma }{2} \| A \mathbf y - \mathbf b \|_2^2 $|. The reason for this is further explained in Remark 2.
The ADMM iterations (3.3) minimize the general model (2.2) for a fixed value of |$\alpha $|. Following Theorem 3 and Corollary 1, we consider a type of homotopy optimization (also known as continuation approach) [16,52] to update |$\alpha $| in order to better approximate the |$\ell _0$| norm. In particular, we gradually decrease |$\alpha $| to |$0$| while optimizing (3.1) for each |$\alpha $|. Algorithm 1 summarizes the overall iteration for the proposed approach.
We remark that letting |$\alpha $| approach |$0$| is not exactly a homotopy algorithm, as the transformation between |$\| \mathbf x \|_0 $| and |$\frac 1 {\alpha _0} F^U_{g, \alpha _0} (\mathbf x) + n $| is not continuous. We observe empirically that the rate at which |$\alpha $| decays to zero plays a critical role in the performance of sparse recovery. On the other hand, we should minimize |$\frac 1 \alpha F^U_{g, \alpha } (\mathbf x) +\frac \gamma 2\|A\mathbf x-\mathbf b\|_2^2$| to approximate the |$\ell _0$| norm. This formulation requires the inversion of |$(\rho I_n+\gamma \alpha A^TA)$| for different |$\alpha $| in the |$\mathbf y$|-update, which is computationally expensive, as opposed to pre-computing the inverse of |$(\rho I_n+\gamma A^TA)$| with a fixed value |$\gamma .$|
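For concreteness, here is a hedged sketch of the overall iteration for the unconstrained model (1.4) with the Type C choice |$g_2$| from Section 3.3. The update order, sign conventions, |$\alpha $| schedule, iteration counts and the absence of a stopping rule are our assumptions and only approximate Algorithm 1; the pre-computed factor |$(\rho I_n+\gamma A^TA)^{-1}$| follows the remark above.

```python
import numpy as np

def ll1_admm_unconstrained(A, b, gamma=1.0, rho=1.0, alpha0=1.0,
                           decay=0.9, outer=30, inner=50):
    """Sketch of the LL1 ADMM iterations for (1.4) with g_2(u) = ||u||^2/2 - ||u||_1,
    U = [0, inf)^n. Schedules and iteration counts are placeholder assumptions."""
    n = A.shape[1]
    x = np.zeros(n); y = np.zeros(n); v = np.zeros(n)
    M = np.linalg.inv(rho * np.eye(n) + gamma * A.T @ A)   # pre-computed once
    Atb = A.T @ b
    alpha = alpha0
    for _ in range(outer):                 # homotopy: gradually decrease alpha
        for _ in range(inner):             # inner ADMM iterations (3.3)
            u = np.maximum(1.0 - np.abs(x) / alpha, 0.0)           # u-update (closed form)
            z = y - v / rho
            x = np.sign(z) * np.maximum(np.abs(z) - u / rho, 0.0)  # x-update via shrink
            y = M @ (gamma * Atb + rho * x + v)                    # y-update (linear system)
            v = v + rho * (x - y)                                  # dual update
        alpha *= decay
    return x
```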
3.2 Convergence Analysis for ADMM
We prove the convergence for the ADMM method (3.3) for the unconstrained case, i.e. |$\psi (\mathbf y) = \frac{\gamma }{2} \|A \mathbf y - \mathbf b \|_2^2 $|. The steps that we take to prove the convergence of (3.3) are similar to the convergence of the standard ADMM method. Yet, since we update |$\mathbf x$| and |$\mathbf u$| separately, the method (3.3) is not actually the ADMM iterations for the Lagrangian (3.2), and the function in the optimization (3.1) is not jointly convex with respect to |$\mathbf u$| and |$\mathbf x$|. Note that we apply an adaptive |$\alpha $| update in Algorithm 1, but the convergence analysis is restricted to a fixed |$\alpha $| value. In addition, we assume that |$ g$| is a Type B or Type C function that is continuously differentiable on |$ U$|.
In the case of Type C functions with |$ U = [0,\infty )^n$|, it follows from optimality conditions for each sub-problem in (3.3) that there exists |$ \mathbf s^{k+1} \geq 0$| and |$\mathbf p^{k+1} \in \partial |\mathbf x^{k+1} | $| such that
with |$\mathbf s^{k+1} \odot \mathbf u^{k+1} = \mathbf 0$|.
Note that for the Type B functions with |$U = [0,1]^n $|, the optimality condition for |$u$| sub-problem is that there exists |$\mathbf s^{k+1}, \mathbf t^{k+1} \geq 0 $| such that |$ \mathbf s^{k+1} \odot \mathbf u^{k+1} = 0, \ \mathbf t^{k+1} \odot (\mathbf u^{k+1} - \mathbf 1) = \mathbf 0,$| and |$ \mathbf 0 = |\mathbf x^k| + \alpha \nabla g(\mathbf u^{k+1}) - \mathbf s^{k+1} + \mathbf t^{k+1}$|.
[Stationary points] Suppose the sequence |$\{\mathbf u^k, \mathbf x^k, \mathbf y^k, \mathbf v^k\}$| is generated by (3.3). If |$g\in \mathcal C^1(\mathbb R)$| is a Type B or Type C function and |$\rho>\sqrt{2}\gamma C_A$|, and |$(\mathbf u^k, \mathbf x^k)$| is bounded, then every limit point of |$\{\mathbf u^k, \mathbf x^k, \mathbf y^k, \mathbf v^k\},$| denoted by |$\{ \mathbf u^*, \mathbf x^*, \mathbf y^*, \mathbf v^* \} $|, is a stationary point of |$L_\rho (\mathbf u, \mathbf x, \mathbf y; \mathbf v) $| and also |$\{\mathbf x^*, \mathbf u^*\}$| is a stationary point of (1.4).
The boundedness of |$L_\rho ,$| |$\mathbf u^k$| and |$\mathbf x^k$| together with (3.8) implies that |$\mathbf y^k$| is bounded, and hence |$\mathbf v^k$| is bounded due to |$\mathbf v^{k} = \nabla \psi (\mathbf y^{k}).$| Then the Bolzano–Weierstrass Theorem guarantees that there exists a subsequence, denoted by |$\{ \mathbf x^{k_s}, \mathbf y^{k_s}, \mathbf u^{k_s}, \mathbf v^{k_s} \}$|, that converges to a limit point, i.e. |$(\mathbf x^{k_s}, \mathbf y^{k_s}, \mathbf u^{k_s}, \mathbf v^{k_s}) \rightarrow (\mathbf x^*, \mathbf y^*, \mathbf u^*, \mathbf v^*). $|
By Theorem 8, we get |$\mathbf x^{k_s}-\mathbf y^{k_s} \rightarrow \mathbf 0, $| leading to |$\mathbf x^* = \mathbf y^* $|, and |$(\mathbf x^{k_s-1}, \mathbf y^{k_s-1}) \rightarrow (\mathbf x^*, \mathbf y^*),$| and hence we have |$\mathbf v^{k_s-1} \rightarrow \mathbf v^* $|. Let |$\mathbf p^{k_s} $| be the corresponding variables in the optimality condition (3.6). As |$\mathbf p^{k_s} \in \partial |\mathbf x^{k_s} |,$| we know |$\mathbf p^{k_s} $| is bounded by |$[-1,1].$| Therefore, there exists a limit point of the sequence |$\mathbf p^{k_s}.$| Without loss of generality, we assume it is the sequence itself, i.e. |$\mathbf p^{k_s}\rightarrow \mathbf p^*,$| and hence we have |$\mathbf p^* \in \partial |\mathbf x^* | $|.
Type B: If |$ g$| is a type B function then |$\mathbf u \in [0,1]^n $| and hence the optimality condition for the |$\mathbf u- $|update is |$\mathbf 0 = |\mathbf x^k| + \alpha \nabla g(\mathbf u^{k+1}) - \mathbf s^{k+1} + \mathbf t^{k+1},$| with |$\mathbf s^{k+1} \odot \mathbf u^{k+1} = \mathbf 0$| and |$\mathbf t^{k+1} \odot (\mathbf u^{k+1}-1) = \mathbf 0$| with |$\mathbf s^{k+1} \geq 0 $| and |$\mathbf t^{k+1} \geq 0 $|.
3.3 Algorithm Updates for Different Lifting Functions
Here we consider two examples of |$g $| functions for which the |$\mathbf u$|-subproblem has a closed-form solution: |$g_1(\mathbf u) = -\frac{1}{2} \| \mathbf u\|_2^2 $|, a Type B function with |$U_1 = [0,1]^n$|, and |$g_2(\mathbf u) = \frac{1}{2} \| \mathbf u\|_2^2 - \|\mathbf u \|_1 $|, a Type C function with |$U_2 = [0,\infty )^n$|. For these combinations, the update (3.4) simplifies to the componentwise formulas
$$ \begin{align*}& u_i^{k+1} = \begin{cases} 1 & \text{if}\ |x_i^{k}| \leq \alpha/2, \\ 0 & \text{otherwise}, \end{cases} \ \ \text{for}\ (g_1, U_1), \qquad u_i^{k+1} = \max\left(1 - \frac{|x_i^{k}|}{\alpha},\, 0\right) \ \ \text{for}\ (g_2, U_2).\end{align*} $$
Note that for this choice of |$ g_2$|, the proposed model simplifies to
which can be solved by quadratic programming with linear constraints.
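The two closed-form weight updates above can be implemented and verified against a brute-force grid search as follows (our own sketch):

```python
import numpy as np

def u_update_g1(x, alpha):
    # argmin_{u in [0,1]} u*|x_i| - (alpha/2)*u^2: concave in u, so pick the better endpoint
    return (np.abs(x) <= alpha / 2).astype(float)

def u_update_g2(x, alpha):
    # argmin_{u >= 0} u*|x_i| + alpha*(u^2/2 - u): zero the derivative, project onto [0, inf)
    return np.maximum(1.0 - np.abs(x) / alpha, 0.0)

x, alpha = np.array([0.1, 0.6, 3.0]), 1.0
grid = np.linspace(0.0, 5.0, 50001)
for xi, u1, u2 in zip(np.abs(x), u_update_g1(x, alpha), u_update_g2(x, alpha)):
    g1_obj = grid[grid <= 1] * xi - 0.5 * alpha * grid[grid <= 1] ** 2
    g2_obj = grid * xi + alpha * (grid ** 2 / 2 - grid)
    assert abs(grid[grid <= 1][g1_obj.argmin()] - u1) < 1e-3
    assert abs(grid[g2_obj.argmin()] - u2) < 1e-3
```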
4. Numerical Experiments
We demonstrate the performance of Algorithm 1 with |$\epsilon =0.01$| and two specific |$g$| functions discussed in Section 3.3. We compare with the following sparsity promoting regularizations: |$\ell _1 $| [13], |$\ell _{1/2}$| [57], transformed |$\ell _1 $| (TL1) [62], |$\ell _1-\ell _2 $| [59] and ERF [20]. For more related models, see [54]. For each model, we consider both constrained and unconstrained formulations. Specifically for the |$\ell _p$| model, we adopt the iteratively reweighted least-squares algorithm [12] in the constrained case and use the half thresholding [57] as a proximal operator for minimizing the unconstrained |$\ell _{1/2}$| formulation. Both |$\ell _1-\ell _2$| and TL1 are minimized by the difference of convex algorithm (DCA) for the best performance as reported in [59,62]. We use the online code provided by the authors of [20] to solve for the ERF model. We use the default values of model parameters suggested in respective papers; note that |$\ell _1$| and |$\ell _1-\ell _2 $| do not involve any parameters. In Appendix C, we present a comparison between using ADMM and DCA for the proposed model. All the experiments are conducted on a Windows desktop with CPU (Intel i7-6700, 3.19GHz) and MATLAB (R2021a).
4.1 Constrained Models
We examine the performance of finding a sparse solution that satisfies the constraint |$A \mathbf x = \mathbf b$|. We consider two types of sensing matrices, Gaussian and over-sampled discrete cosine transform (DCT). The Gaussian matrix is generated based on the multivariate normal distribution |$\mathcal{N}\ (0, \varSigma ) $|, where |$\varSigma _{i,j} = (1-r) \delta (i = j) + r $| for a parameter |$r> 0.$| Note that |$\delta (i = j)$| is |$1$| if |$i = j$| and zero otherwise. The over-sampled DCT matrix is defined by |$A = [\mathbf a_1,..., \mathbf a_n] \in \mathbb{R}^{m \times n}$| with each column defined as
where |$\mathbf w $| is a uniformly random vector and |$F \in \mathbb{R}_+ $| is a scalar. The larger |$F$| is, the larger the coherence of the matrix |$A$| is, making it more challenging to find a sparse solution.
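The two families of sensing matrices can be generated as below (our sketch; the over-sampled DCT column |$\mathbf a_j = \cos (2\pi \mathbf w j / F)/\sqrt{m}$| with |$\mathbf w$| uniform in |$[0,1]^m$| is the common construction in the cited works and is an assumption here, since the exact display is omitted above).

```python
import numpy as np

def gaussian_matrix(m, n, r, seed=0):
    """Rows drawn from N(0, Sigma) with Sigma_ij = (1 - r)*delta(i == j) + r."""
    rng = np.random.default_rng(seed)
    Sigma = (1 - r) * np.eye(n) + r * np.ones((n, n))
    return rng.multivariate_normal(np.zeros(n), Sigma, size=m)

def oversampled_dct(m, n, F, seed=0):
    """Columns a_j = cos(2*pi*w*j/F)/sqrt(m) with w uniform in [0,1]^m (assumed form)."""
    rng = np.random.default_rng(seed)
    w = rng.random(m)
    j = np.arange(1, n + 1)
    return np.cos(2 * np.pi * np.outer(w, j) / F) / np.sqrt(m)

A_gauss = gaussian_matrix(64, 1024, r=0.2)
A_dct = oversampled_dct(64, 1024, F=5)
```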
We fix the dimension as |$64 \times 1024 $| for the Gaussian and DCT matrices, while generating Gaussian matrices with |$ r \in \{0, 0.2, 0.8 \}$| and DCT matrices with |$F \in \{1, 5, 10 \}$|. The ground-truth vector |$ \mathbf x_g \in \mathbb{R}^n$| is simulated as an |$s$|-sparse signal, where |$s$| is the number of non-zero entries, each drawn from the normal distribution |$\mathcal{N}(0, 1) $|, with the support index set drawn uniformly at random. We evaluate the performance by success rates, where a ‘successful’ reconstruction refers to the case when the distance between the output vector |$\mathbf x $| and the ground truth |$\mathbf x_g $| is less than |$10^{-2} $|, i.e.
Figure 2 presents success rates for both Gaussian and DCT matrices, and demonstrates that the proposed LL1 outperforms the state of the art in all the testing cases. For the Gaussian matrices, the parameter |$r$| has little effect on the performance, as we observe the same ranking of these models under various |$r$| values. As for the DCT matrices, the parameter |$F$| influences the coherence of the resulting matrix. For smaller |$F$| values, |$\ell _p$| is the second best, while TL1 and |$\ell _1-\ell _2$| perform well for coherent matrices (for |$F=10$|). With a well-chosen |$g$| function, the proposed LL1 framework always achieves the best results among the competing methods. The results of LL1 using |$g_1$| with |$U_1$| and |$g_2$| with |$U_2$| are similar. This phenomenon illustrates that our model works best because it is equivalent to the |$\ell _0 $| model for a small enough |$\alpha $|.
Fig. 2. Success rate comparison among all the competing methods based on Gaussian matrices (left) with |$r = 0, 0.8$| and DCT matrices (right) with |$F = 1, 10$|.
4.2 Unconstrained Models
We consider the unconstrained |$\ell _0$| model for comparison on noisy data:
where |$\gamma $| is a regularization parameter. We consider signals of length |$512 $| with sparsity |$130 $|, and |$m $| measurements |$\mathbf b $|, determined by a Gaussian sensing matrix |$A $|. The columns of |$A $| are normalized with mean zero and unit norm. Gaussian noise with mean zero and standard deviation |$\sigma $| is also added to the measurements. To evaluate the performance of the algorithms, we consider the mean square error (MSE) of the output signal |$\mathbf x$| with respect to the ground-truth solution |$\mathbf x^* $| using the formula
For each algorithm, we compute the average MSE over 100 realizations while ranging the number of measurements in |$60 <m <120 $|. Figure 3 presents the comparison results for two noise levels |$\sigma \in \{10^{-6}, 0.01 \} $|. All the algorithms perform poorly with few measurements, and as the number of measurements |$m $| increases, their MSE decreases. For the smaller noise level (|$\sigma =10^{-6}$|), our approach works almost perfectly with around |$100$| measurements, while other algorithms either require more measurements to achieve a nearly perfect MSE or are unable to do so. Figure 3c presents the computational times, which suggests that LL1 performs as fast as the |$\ell _1 $| model while at the same time it has the lowest recovery error.
When the noise level is high, for instance |$\sigma = 0.1 $|, it is almost impossible to reconstruct the ground-truth signal using any number of measurements. In such cases, our algorithm finds a signal that is sparser and has a smaller objective value for any choice of the regularization parameter |$\gamma $|.
5. Concluding Remarks
In this paper, we proposed a lifted |$\ell _1$| model for sparse recovery, which describes a class of regularizations. Specifically, we established the connections of this framework to various existing methods that aim to promote sparsity of the model solution. Furthermore, we proved that our method can exactly recover the sparsest solution under a constrained formulation. We advocated the use of ADMM to solve the proposed model, supported by a convergence analysis. An alternative approach using DCA was discussed in Appendix C, showing the efficiency of ADMM over DCA. Experimental results on both noise-free and noisy cases illustrate that the proposed framework outperforms the state-of-the-art methods in terms of accuracy and is comparable with the convex |$\ell _1$| approach in terms of computational time.
One future work involves the convergence analysis of ADMM for solving the constrained model. One difficulty lies in the fact that the corresponding function |$ \psi (\cdot )$| is a |$\delta $|-function, which is neither differentiable nor coercive, and as a result, the proof presented in Section 3.2 for the unconstrained minimization is not applicable for the constrained case. We observe that the ADMM algorithm for the constrained case does converge and the augmented Lagrangian is decreasing. This empirical evidence suggests the potential to prove the convergence or the sufficient decrease of the augmented Lagrangian, which will be left as future work.
Funding
NSF CAREER 1846690; Simons Foundation grant 584960.
Data Availability Statement
No new data were generated or analysed in support of this review.
References
A. Proof of Theorem 1
B. Finding the Lift Corresponding to Existing Models
Given a regularization function |$J(\mathbf x),$| we want to find a proper |$g$| function and a set |$U$| such that |$J(\mathbf x)=F_g^U(\mathbf x)$| up to a constant. We assume |$J(\mathbf x) = J(|\mathbf x|)$| since |$F_g^U$| satisfies this condition. Consequently, we only need to consider the case |$\mathbf x \geq 0$| so that the notation |$\frac{\partial J }{\partial |x_i|} $| should be in place of |$\frac{\partial J }{\partial x_i} $| for |$ x_i \geq 0$|. Using Theorem 1, we can directly find |$F_{g}^U$|; however, sometimes it might be easier to use the following observation, which leads to simpler computations. Suppose |$F_{g}^U$| has a unique minimizer |$\mathbf u$|; then |$\mathbf u$| satisfies |$\mathbf u = \nabla _{\mathbf x} F_{g}^U(\mathbf x)= \nabla _{\mathbf x} J(\mathbf x).$| Assuming that the minimum of (2.5) is finite, the optimality condition gives |$|\mathbf x| + \nabla _{\mathbf u} g = 0$| for |$\mathbf u \in int(U),$| where |$int(U)$| denotes the interior of the set |$U $|. (Note that |$|\mathbf x| + \nabla _{\mathbf u} g$| can have non-zero coordinates on the boundary of |$U$|.) Thus, we only need to solve the following two equations for a function |$g $| with respect to |$\mathbf u $| on the feasible set |$U$|:
(B.1)$$ \begin{align}& \begin{cases} \mathbf u = \nabla_{|\mathbf x|}\, J(\mathbf x), \\ |\mathbf x| + \nabla_{\mathbf u}\, g(\mathbf u) = \mathbf 0. \end{cases}\end{align} $$
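Before the worked items below, here is a small symbolic sketch (ours, using sympy) that carries out the recipe (B.1) for the log-sum penalty of item (ii) and recovers the corresponding row of Table 1:

```python
import sympy as sp

u, x, a = sp.symbols("u x a", positive=True)

# first equation of (B.1): u = dJ/d|x| with J = log(x + a); solve for x in terms of u
x_of_u = sp.solve(sp.Eq(u, sp.diff(sp.log(x + a), x)), x)[0]
# second equation of (B.1): |x| + g'(u) = 0, so g'(u) = -x(u); integrate to recover g
g = sp.integrate(-x_of_u, u)
print(sp.simplify(g))   # a*u - log(u), i.e. the log-sum row of Table 1 (up to a constant)
```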
- (i) |$\ell _p $| model: Consider |$J = J^{\ell _p} / p $|, and note that |$\frac{\partial J} {\partial |x_i|} = |x_i|^{p-1} $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into$$ \begin{align*}& \begin{cases} u_i = |x_i|^{p-1}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$for all |$i $|. From the first equation we get that |$|x_i| = u_i^{\frac{1}{p-1}} $| and then from the second equation we get |$ g_i^{\prime}(u_i) = -u_i^{\frac{1}{p-1}}$|. A solution for |$g $| is |$g_i(u_i) = \frac{1-p}{p} u_i^{\frac{p}{p-1}} $| for |$u_i \geq 0 $|. Finally, taking |$U = \mathbb{R}^n_+ $| and |$g(\mathbf u) = \sum _{i} \frac{1-p}{p} u_i^{\frac{p}{p-1}} $|, one can check that |$F_{g}^U = J $|.
- (ii) log-sum penalty: Consider |$J = J^{\log }_a $|, and note that |$\frac{\partial J} {\partial |x_i|} = \frac{1}{|x_i| + a} $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into$$ \begin{align*}& \begin{cases} u_i = \frac{1}{|x_i| + a}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$for all |$i $|. From the first equation we get that |$|x_i| = \frac{1}{u_i} - a $| and then from the second equation we get |$ g_i^{\prime}(u_i) = a - \frac{1}{u_i}$|. A solution for |$g $| is |$g_i(u_i) = a u_i - \log (u_i) $| for |$u_i> 0 $|. Finally, taking |$U = \mathbb{R}^n_{>0} $| and |$g(\mathbf u) = \sum _{i} \left ( a u_i - \log (u_i) \right ) $|, one can check that |$F_{g}^U + 1 = J $|.
- (iii) Smoothly clipped absolute deviation (SCAD): Consider |$J = J_{\gamma , \lambda }^{\text{SCAD}} $|, and note that(B.2)$$ \begin{align}& \frac{\partial f_{\lambda, \gamma} ^{\text{SCAD}}(t)} {\partial |t|}= \left\{ \begin{array}{@{}ll} \lambda & \text{if}\ |t| \leq \lambda, \\ \frac{\lambda \gamma - t}{\gamma - 1} & \text{if}\ \lambda < |t| \leq \gamma \lambda, \\ 0 & \text{if}\ |t|> \gamma \lambda. \end{array} \right.\end{align} $$For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, the first equation in (B.1) simplifies into$$ \begin{align*}& u_i = \left\{ \begin{array}{@{}ll} \lambda & \text{if}\ |x_i| \leq \lambda, \\ \frac{\lambda \gamma - |x_i|}{\gamma - 1} & \text{if}\ \lambda < |x_i| \leq \gamma \lambda, \\ 0 & \text{if}\ |x_i|> \gamma \lambda. \end{array} \right. \end{align*} $$for all |$i $|. In the case of |$\lambda < |x_i| \leq \gamma \lambda , $| we get that |$|x_i| = \gamma \lambda - (\gamma - 1) u_i, $| which means we should have |$ g_{i}^{\prime}(u_i) = -\gamma \lambda + (\gamma - 1) u_i$| and |$u_i\leq \lambda $|. By taking |$U = [0,\lambda ]^n $| and |$g(\mathbf u) = \sum _{i} \left ( -\gamma \lambda u_i + (\gamma - 1) \frac{u_i^2}{2} \right ) $|, one can check that |$F_{g}^U + \frac{(\gamma + 1)\lambda ^2}{2} = J $|.
- (iv) Minimax concave penalty: Consider |$J = J_{\gamma , \lambda }^{\text{MCP}} $|, and note that
(B.3)
$$ \begin{align}& \frac{\partial f_{\lambda, \gamma} ^{\text{MCP}}(t)} {\partial |t|}= \left\{ \begin{array}{@{}ll} \lambda - \frac{|t|}{\gamma} & \text{if}\ |t| \leq \gamma \lambda, \\ 0 & \text{if}\ |t|> \gamma \lambda. \end{array} \right.\end{align} $$
For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, the first equation in (B.1) simplifies into
$$ \begin{align*}& u_i = \left\{ \begin{array}{@{}ll} \lambda - \frac{|x_i|}{\gamma} & \text{if}\ |x_i| \leq \gamma \lambda, \\ 0 & \text{if}\ |x_i|> \gamma \lambda, \end{array} \right. \end{align*} $$
for all |$i $|. When |$ |x_i| \leq \gamma \lambda ,$| we obtain |$|x_i| = \gamma (\lambda - u_i),$| which implies |$ g_{i}^{\prime}(u_i) = -\gamma (\lambda - u_i)$| from the second equation in (B.1). Therefore, we set |$U = [0,\infty )^n $| and |$g(\mathbf u) = \sum _{i} \left ( -\gamma (\lambda u_i - \frac{u_i^2}{2}) \right ) $|, leading to |$F_{g}^U + \frac{1}{2} \gamma \lambda ^2 = J $|.
- (v) Capped |$\ell _1 $| model: Consider |$J = J^{\text{CL1}}_a $|, and note that
(B.4)
$$ \begin{align}& \frac{\partial J}{\partial |x_i|} = \left\{ \begin{array}{@{}ll} 1 & \text{if}\ |x_i| < a, \\ 0 & \text{if}\ |x_i|> a. \end{array} \right.\end{align} $$
For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, the first equation in (B.1) simplifies into
$$ \begin{align*}& u_i = \left\{ \begin{array}{@{}ll} 1 & \text{if}\ |x_i| < a, \\ 0 & \text{if}\ |x_i|> a, \end{array} \right. \end{align*} $$
for all |$i $|. Note that the second equation in (B.1) holds only when the minimizer lies in the interior of the set |$U $|. Consider |$U = [0,1]^n $|. Since the minimizer is on the boundary in this case, we instead need a function |$g $| for which |$ | x_i| + g_{i}^{\prime}(u_i)$| does not vanish in the interior of |$U$|: specifically, |$ | x_i| + g_{i}^{\prime}(u_i) < 0$| whenever |$|x_i| < a $| (so the minimizer is |$u_i = 1$|) and |$ | x_i| + g_{i}^{\prime}(u_i)> 0$| whenever |$|x_i|> a $| (so the minimizer is |$u_i = 0$|). This is achieved by |$g_{i}^{\prime}(u_i) = -a $|, for which a solution is |$ g_{i}(u_i) = -a u_i $|. Finally, taking |$U = [0,1]^n $| and |$g(\mathbf u) = \sum _{i} \left ( -a u_i \right ) $|, one can check that |$F_{g}^U + a = J $|.
- (vi) Transformed |$\ell _1 $| model: Consider |$J = J_a^{\text{TL1}} / (a+1) $|, and note that |$\frac{\partial J} {\partial |x_i|} = \frac{a}{(a + |x_i|)^2} $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into
$$ \begin{align*}& \begin{cases} u_i = \frac{a}{(a + |x_i|)^2}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$
for all |$i $|. The first equation gives |$|x_i| = \sqrt{\frac{a}{u_i}} - a $|, and then the second yields |$ g_i^{\prime}(u_i) = a - \sqrt{\frac{a}{u_i}}$|. A solution for |$g $| is |$g_i(u_i) = a u_i - 2 \sqrt{a u_i} $| for |$u_i \geq 0 $|. Finally, taking |$U = \mathbb{R}^n_+ $| and |$g(\mathbf u) = \sum _{i} \left ( a u_i - 2 \sqrt{a u_i} \right ) $|, one can check that |$ F_{g}^U + 1 = J $|.
- (vii) Error function penalty: Consider |$J = J_{\sigma }^{\text{ERF}} $|, and note that |$\frac{\partial J} {\partial |x_i|} = e^{-x_i^2/\sigma ^2} $| with |$e^{-x_i^2/\sigma ^2} \in (0,1] $|. For |$g(\mathbf u) = \sum g_i(u_i) $| and |$\mathbf x \in \mathbb{R}^n$|, (B.1) simplifies into
$$ \begin{align*}& \begin{cases} u_i = e^{-x_i^2/\sigma^2}, \\ | x_i| + g_{i}^{\prime}(u_i) = 0, \end{cases} \end{align*} $$
for all |$i $|. The first equation gives |$|x_i| = \sigma \sqrt{-\log (u_i)} $|, and then the second yields |$ g_i^{\prime}(u_i) = -\sigma \sqrt{- \log (u_i)}$|. A solution for |$g $| is |$g_i(u_i) = \sigma \int _{u_i}^{1} \sqrt{- \log (\tau )} d \tau $| for |$u_i \in (0,1] $|. Finally, taking |$U = [0,1]^n $| and |$g(\mathbf u) = \sigma \sum _{i} \int _{u_i}^{1} \sqrt{- \log (\tau )} d \tau $|, one can check that |$ F_{g}^U = J $|.
- (viii) |$\ell _1-\ell _2$| penalty: For |$J = J^{L1-L2}$|, working with derivatives is a bit challenging, so we directly apply Theorem 1 to find the lifted form. First note that
$$ \begin{align*} & \sup_{\mathbf x \geq 0} \left\langle \mathbf x, \mathbf y \right\rangle - \|\mathbf x\|_2 = \begin{cases} + \infty & \|\mathbf y_+\|_2> 1, \\ 0 & \|\mathbf y_+\|_2 \leq 1. \end{cases} \end{align*} $$
Hence, we get
$$ \begin{align*} & g(\mathbf u) = \sup_{\mathbf x \geq 0} \left\langle \mathbf x, -\mathbf u \right\rangle + \|\mathbf x\|_1 - \|\mathbf x\|_2 = \begin{cases} + \infty & \|(\mathbf 1 - \mathbf u)_+\|_2> 1, \\ 0 & \|(\mathbf 1 - \mathbf u)_+\|_2 \leq 1. \end{cases} \end{align*} $$
Therefore, we take the set |$U = \{\mathbf u \mid \|(\mathbf 1 - \mathbf u)_+\|_2 \leq 1 \}$| and the function |$g = 0$|. Notice that we can relax the set |$U$| to |$\{\mathbf u \mid \|\mathbf 1 - \mathbf u\|_2 \leq 1 \}$| and still obtain the same |$J$| function.
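The closed forms above lend themselves to a quick numerical sanity check: for a scalar |$x$|, minimizing |$u|x| + g(u)$| over a fine grid of |$u \in U$| should reproduce |$J(x)$| up to the stated constant. The Python sketch below performs this check for the |$\ell _p$| case in (i) and the transformed |$\ell _1$| case in (vi); the parameter values |$p = 0.5$|, |$a = 1$|, the grid bounds and the test points are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

def lifted_value(x_abs, g, u_grid):
    """Evaluate F_g^U(x) = min_{u in U} u*|x| + g(u) by brute force on a 1-D grid."""
    return np.min(u_grid * x_abs + g(u_grid))

p, a = 0.5, 1.0                      # illustrative parameters (not from the paper)
xs = np.linspace(0.1, 3.0, 20)       # test points; x = 0 is avoided for the l_p case

# (i) l_p: J(x) = |x|^p / p,  g(u) = (1-p)/p * u^{p/(p-1)},  U = [0, inf)
u_lp = np.linspace(1e-6, 50.0, 400_000)
g_lp = lambda u: (1 - p) / p * u ** (p / (p - 1))
err_lp = max(abs(lifted_value(x, g_lp, u_lp) - x ** p / p) for x in xs)

# (vi) scaled TL1: J(x) = |x| / (a + |x|),  g(u) = a*u - 2*sqrt(a*u),  U = [0, inf);
#      here F_g^U + 1 = J, so the constant 1 is added before comparing
u_tl1 = np.linspace(1e-8, 2.0, 400_000)
g_tl1 = lambda u: a * u - 2.0 * np.sqrt(a * u)
err_tl1 = max(abs(lifted_value(x, g_tl1, u_tl1) + 1.0 - x / (a + x)) for x in xs)

print(f"max |F - J|     for l_p: {err_lp:.2e}")
print(f"max |F + 1 - J| for TL1: {err_tl1:.2e}")
```

Both reported errors should be on the order of the grid resolution; an analogous check works for the other separable penalties in this appendix.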
C. Comparing the ADMM- and DCA-based algorithms
As an alternative to ADMM, one can minimize the proposed models (1.3) and (1.4) by the graduated non-convexity algorithm [5] or the DCA [22,44,45]. Specifically for DCA, since the function |$ -F_{g,\alpha }^U$| is convex on the positive cone, we use the algorithm introduced in [37], where we only need to find the sub-differential of the function on |$\mathbb{R}^n_+ $|. We use the function |$\psi (\cdot )$| to unify the constrained and unconstrained formulations, which yields the model
Since the function |$F_{g,\alpha }^U $| is concave on |$\mathbb{R}^n_+ $|, the objective can be written as a difference of two convex functions, i.e. |$F_{g, \alpha }^U(\mathbf x) + \psi (\mathbf x) = h_1(\mathbf x) - h_2(\mathbf x) $|, where |$h_1(\mathbf x) = \psi (\mathbf x) + \frac{\beta }{2} \|\mathbf x \|_2^2 $| and |$ h_2(\mathbf x) = \frac{\beta }{2} \| \mathbf x \|_2^2 - F_{g, \alpha }^U (\mathbf x) $| for |$\beta \geq 0.$| An interesting feature of the DCA form is that if |$g $| is of Type B or Type C, then for |$\mathbf x \geq 0 $| the sub-differentials take the form
For |$\mathbf x \in \mathbb{R}^n $|, take |$\mathbf u_{g, \alpha }(\mathbf x^k) \in \arg \min _{\mathbf u \in U} \left \langle \mathbf u, |\mathbf x^k | \right \rangle + \alpha g(\mathbf u) $|; then the DCA iterations become
We implement the DCA iterations (C.2) with |$\beta = 0 $| for simplicity and efficiency, as opposed to |$ \beta> 0$|. In addition, we consider an adaptive scheme to update |$\alpha ,$| which is adopted in Algorithm 2.
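For concreteness, the Python sketch below shows one plausible instantiation of this DCA scheme with |$\beta = 0$|, taking |$\psi (\mathbf x) = \frac{\gamma }{2}\|A\mathbf x - \mathbf b\|_2^2$| as a stand-in for the unconstrained data-fidelity term and using the log-sum lifting |$g_i(u_i) = a u_i - \log (u_i)$| from B(ii), for which the |$\mathbf u$|-update has the closed form |$u_i = \alpha /(|x_i| + \alpha a)$|. The weighted-|$\ell _1$| subproblem is solved by a plain proximal-gradient (ISTA) inner loop. All function names, the inner solver and the parameter values are illustrative assumptions; this is a minimal sketch, not the authors' Algorithm 2.

```python
import numpy as np

def soft_threshold(z, tau):
    """Component-wise soft-thresholding: the prox of the weighted l1 term."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def weighted_lasso(A, b, w, gamma, x0, n_inner=200):
    """Solve min_x <w, |x|> + (gamma/2)||Ax - b||^2 by proximal gradient (ISTA)."""
    L = gamma * np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
    x = x0.copy()
    for _ in range(n_inner):
        grad = gamma * A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, w / L)  # threshold level w_i / L per coordinate
    return x

def dca_log_sum(A, b, a=1.0, gamma=100.0, alpha0=1.0, eta=0.01, n_outer=50):
    """DCA-style outer loop (beta = 0) with the log-sum lifting g_i(u) = a*u - log(u),
    for which argmin_{u>0} u|x| + alpha*g(u) is u = alpha / (|x| + alpha*a)."""
    x = np.zeros(A.shape[1])
    alpha = alpha0
    for _ in range(n_outer):
        u = alpha / (np.abs(x) + alpha * a)      # closed-form u-update (re-weighting)
        x = weighted_lasso(A, b, u, gamma, x)    # weighted-l1 x-update
        alpha *= 1.0 - eta                       # exponential decay of alpha
    return x

# toy usage on a random sparse-recovery instance
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
x_true = np.zeros(256)
x_true[rng.choice(256, 8, replace=False)] = rng.standard_normal(8)
b = A @ x_true
x_hat = dca_log_sum(A, b)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```

Swapping in a different lifting only changes the |$\mathbf u$|-update; for instance, the capped-|$\ell _1$| lifting of B(v) replaces it by the hard rule |$u_i = 1$| if |$|x_i| < \alpha a$| and |$u_i = 0$| otherwise.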

Fig. C1. Success rate comparison of ADMM and DCA with two |$g$| functions and different |$\eta $| for Gaussian matrices (left) with |$r = 0$| and |$r = 0.8 $| and DCT matrices (right) with |$F = 1, 5$|.
We compare ADMM (Algorithm 1) and DCA (Algorithm 2) for minimizing the same constrained formulation (1.3) with |$g_1$| and |$g_2$| discussed in Section 3.3. We are particularly interested in the algorithmic behaviour when |$\alpha $| is updated dynamically. As stated in Theorem 5, |$\alpha $| should be small enough to approximate the |$\ell _0$| solution. A common scheme is an exponential decay of the form |$\alpha ^{k+1} = (1 - \eta ) \alpha ^k$| for |$\eta \in (0,1)$|. If |$ \eta $| is close to |$ 1$|, then |$\alpha $| converges to zero too quickly and the algorithm cannot reach a good local minimum, as this is nearly equivalent to setting |$ \alpha = 0$| from the very beginning. On the other hand, if |$ \eta $| is close to |$0 $|, then |$\alpha $| decreases to zero slowly; as a result, Algorithm 1 may terminate before |$\alpha $| is small enough for |$F^U_{g, \alpha }$| to approximate the |$\ell _0$| norm.
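This trade-off can be made concrete from the decay formula itself: |$\alpha ^k = (1-\eta )^k \alpha ^0$|, so |$\alpha $| falls below a tolerance |$\tau $| after roughly |$\log (\tau /\alpha ^0)/\log (1-\eta )$| iterations. The short computation below tabulates this count for several |$\eta $|; the initial value |$\alpha ^0 = 1$| and the tolerance |$\tau = 10^{-6}$| are arbitrary choices for illustration.

```python
import math

alpha0, tol = 1.0, 1e-6                  # illustrative starting value and target tolerance
for eta in (0.001, 0.01, 0.1, 0.5):
    # smallest k such that (1 - eta)**k * alpha0 < tol
    k = math.ceil(math.log(tol / alpha0) / math.log(1.0 - eta))
    print(f"eta = {eta:5.3f}: alpha drops below {tol:g} after {k} iterations")
```

With |$\eta = 0.5$|, |$\alpha $| is essentially zero after about twenty iterations, mimicking |$\alpha = 0$| from the start, whereas |$\eta = 0.001$| requires roughly fourteen thousand iterations, so the algorithm may well stop before |$\alpha $| is small enough; this matches the behaviour described above.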
The comparison between ADMM and DCA on Gaussian and DCT matrices is presented in Fig. C1. Fixing |$\eta = 0.1,$| we select the optimal |$\rho = 2^j $| for ADMM with |$j \in \{-1,0,1,2,3,4,5,6,7,8 \}$| that achieves the smallest relative error to the ground truth. Then, using the optimal parameters |$\rho $| and |$\gamma ,$| Fig. C1 presents the ADMM results for |$\eta \in \{ 0.001,0.01,0.1 \}$| and the DCA results for |$\eta \in \{0.01,0.1 \} $|. In all cases, ADMM is superior to DCA in that it is less sensitive to |$\eta $|. In addition, DCA consists of two loops and is hence generally slower than ADMM. Our experiments suggest that |$\eta = 0.01 $| is a suitable choice. These comparisons show that the lifted form works well with ADMM, but not necessarily with DCA. This may be because DCA usually converges within a few iterations, which is not enough for |$\alpha $| to approach |$0$|. One may incorporate the decrease of |$\alpha $| into the |$\mathbf x$|-update, or adapt |$\alpha $| according to the result of each iteration, which is left for future work.