Abstract

We propose the use of low bit-depth Sigma-Delta and distributed noise-shaping methods for quantizing the random Fourier features (RFFs) associated with shift-invariant kernels. We prove that our quantized RFFs—even in the case of |$1$|-bit quantization—allow a high-accuracy approximation of the underlying kernels, and the approximation error decays at least polynomially fast as the dimension of the RFFs increases. We also show that the quantized RFFs can be further compressed, yielding an excellent trade-off between memory use and accuracy. Namely, the approximation error now decays exponentially as a function of the bits used. The quantization algorithms we propose are intended for digitizing RFFs without explicit knowledge of the application for which they will be used. Nevertheless, as we empirically show by testing the performance of our methods on several machine learning tasks, our methods compare favourably with other state-of-the-art quantization methods.

1. Introduction

Kernel methods have long been demonstrated as effective techniques in various machine learning applications, cf. [32, 33]. Given a dataset |$\mathscr{X}\subset \mathbb{R}^{d}$| with |$|\mathscr{X}|=N$|, kernel methods implicitly map data points to a high, possibly infinite, dimensional feature space |$\mathscr{H}$| via |$\phi :\mathscr{X}\rightarrow \mathscr{H}$|. However, instead of working directly in that space, one relies on a kernel function |$k(x,y):= \langle \phi (x),\phi (y)\rangle _{\mathscr{H}}$| that returns the inner products between the feature embeddings. Nevertheless, in cases where |$N$| is large, using nonlinear kernels for applications like, say, support vector machines (SVMs) and logistic regression requires the expensive computation of the |$N\times N$| Gram matrix of the data [24]. To overcome this bottleneck, one popular approach is to ‘linearize’ |$k$| using the random Fourier features (RFFs) originally proposed by [29], which in turn build on Bochner’s theorem [26]. Given a continuous, shift-invariant real-valued kernel |$k(x,y)=\kappa (x-y)$| with |$\kappa (0)=1$|, |$\kappa $| is the (inverse) Fourier transform of a probability measure |$\varLambda $| over |$\mathbb{R}^{d}$| and we have

|$k(x,y)=\kappa (x-y)=\int _{\mathbb{R}^{d}} e^{i\omega ^{\top }(x-y)}\,\mathrm{d}\varLambda (\omega )={\mathbb{E}}_{\omega \sim \varLambda }\bigl [\cos \bigl (\omega ^{\top }(x-y)\bigr )\bigr ].$| (1.1)

As an example, the radial basis function (RBF) kernel |$k(x,y)=\exp (-\|x-y\|_{2}^{2}/2\sigma ^{2})$| corresponds to the multivariate normal distribution |$\varLambda =\mathscr{N}(0,\sigma ^{-2}I_{d})$|⁠. Following [29], for a target dimension |$m$|⁠, the associated RFFs (without normalization) are

|$z(x):=\cos (\varOmega ^{\top } x+\xi )\in \mathbb{R}^{m},$| (1.2)

where |$\varOmega :=(\omega _{1},\ldots ,\omega _{m})\in \mathbb{R}^{d\times m}$| is a random matrix generated as |$\omega _{j}\overset{iid}{\sim } \varLambda $| and |$\xi \in \mathbb{R}^{m}$| is a random vector with |$\xi _{j}\overset{iid}{\sim } U([0,2\pi ))$| for all |$j$|⁠. Additionally, the identity |${\mathbb{E}}(\langle z(x),z(y)\rangle )=\frac{m}{2}k(x,y)$| implies that the inner product of low-dimensional features |$\sqrt{\frac{2}{m}}z(x)$|⁠, |$\sqrt{\frac{2}{m}}z(y)$| can approximate |$k(x,y)$| in kernel-based algorithms. Learning a linear model on the (normalized) RFFs then amounts to using the approximation

|$\widehat{k}_{\text{RFF}}(x,y):=\frac{2}{m}\langle z(x),z(y)\rangle $| (1.3)

as a reference kernel during training. For instance, performing linear SVM and linear ridge regression on RFFs amounts to training kernel SVM and kernel ridge regression with the kernel |$\widehat{k}_{\text{RFF}}$|. Using RFFs in this way, with an adjustable dimension |$m$|, can markedly speed up training on large-scale data and alleviate the memory burden of storing the kernel matrix. As an additional and very important benefit, the entire kernel function |$k$| is approximated accurately, i.e. the approximation error |$|k(x,y)-\widehat{k}_{\text{RFF}}(x,y)|$| has been shown to be small, particularly when |$m$| is large, e.g. in [3–5, 30, 35, 37].
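To fix ideas, the following minimal NumPy sketch draws RFFs for the RBF kernel and checks the approximation (1.3) against the exact kernel value; the helper name, the seed handling and the toy data are illustrative assumptions rather than the setup used in our experiments.

```python
import numpy as np

def rbf_rff_features(X, m, sigma=1.0, seed=0):
    """Unnormalized RFFs z(x) = cos(Omega^T x + xi) for the RBF kernel
    k(x, y) = exp(-||x - y||_2^2 / (2 sigma^2)); here omega_j ~ N(0, sigma^{-2} I_d)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Omega = rng.normal(scale=1.0 / sigma, size=(d, m))  # columns omega_j drawn i.i.d. from Lambda
    xi = rng.uniform(0.0, 2.0 * np.pi, size=m)          # dither xi_j ~ U([0, 2*pi))
    return np.cos(X @ Omega + xi)                       # one row z(x) per data point

# quick check that (2/m) <z(x), z(y)> approximates k(x, y)
rng = np.random.default_rng(1)
x, y = rng.normal(size=5), rng.normal(size=5)
Z = rbf_rff_features(np.vstack([x, y]), m=20000)
k_hat = (2.0 / Z.shape[1]) * (Z[0] @ Z[1])
k_true = np.exp(-np.linalg.norm(x - y) ** 2 / 2.0)
print(k_true, k_hat)  # the two values should be close for large m
```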

The need for large |$m$| for guaranteeing good generalization performance on large datasets [1, 25, 27, 38] provides an opportunity for further savings in memory usage. Rather than store the RFFs in full precision, quantization methods have been proposed to encode the RFFs (1.2) into a sequence of bits and subsequently approximate |$k(x,y)$| by taking inner products between quantized RFFs, thereby introducing a new level of approximation. One of our goals is to propose quantization techniques that favourably trade off approximation accuracy against the number of bits used.

1.1 Related work

To make the discussion more precise, let us start by defining the |$2K$|-level quantization alphabet that we use throughout as

|$\mathscr{A}:=\Bigl \{\pm \tfrac{2j-1}{2K-1}: j=1,\ldots ,K\Bigr \},$| (1.4)

and note that one can use |$b:=\log _{2}(2K)$| bits to represent each element of |$\mathscr{A}$|⁠. The goal of quantization in the RFF context is to map |$z(x)=\cos (\varOmega ^{T} x+ \xi ) \in{\mathbb{R}}^{m} \mapsto q(x) \in \mathscr{A}^{m}$|⁠. We will be interested in very small values of |$K$|⁠, particularly |$K=1$|⁠, which corresponds to very few bits per RFF sample.

It is natural to start our discussion of quantization methods with the simplest quantizer, namely memoryless scalar quantization (MSQ), where we round each coordinate of the input vector |$z\in \mathbb{R}^{m}$| to the nearest element in |$\mathscr{A}$|. Specifically, |$Q_{\mathrm{MSQ}}:\mathbb{R}^{m}\rightarrow \mathscr{A}^{m}$| is defined coordinatewise by |$(Q_{\mathrm{MSQ}}(z))_{i}:=\text{argmin}_{v\in \mathscr{A}}|z_{i}-v|$| for |$i=1,\ldots ,m$|.

Moreover, by setting |$K=1$|⁠, one can get a binary embedding |$Q_{\mathrm{MSQ}}(z)=\text{sign}(z)$| with |$\mathscr{A}=\{-1,1\}$| where |$\text{sign}$| is an element-wise operation. This yields the so-called |$1$|-bit universal quantizer [8, 31] for RFFs, which generates a distorted (biased) kernel

(1.5)

Although replacing the |$\text{sign}$| function in (1.5) by |$Q_{\mathrm{MSQ}}$| with |$K>1$| and renormalizing the inner product accordingly can alleviate the distortion, there are better choices in terms of approximation error. In [23], a Lloyd–Max quantization scheme is designed based on MSQ where, rather than using the evenly spaced alphabet in (1.4), one constructs a specific alphabet for each |$K$|. Recently, with an eye towards asymmetric sensor network applications, an asymmetric semi-quantized scheme (SemiQ) was proposed in [31] and shown to be unbiased. It generates |$\widehat{k}_{s}(x,y)$|, the inner product between an unquantized RFF vector and a quantized one, i.e.

(1.6)

However, this asymmetric setting is restrictive for many kernel machines because it only applies at the inference stage; the model still has to be trained on unquantized RFFs. Another unbiased quantization scheme resorts to injecting randomness into the quantization and is known as randomized rounding [40], or stochastic quantization (StocQ) [23]. Specifically, for each |$z\in \mathbb{R}$|, one chooses the two consecutive alphabet elements |$s, t\in \mathscr{A}$| with |$z\in [s, t]$|. Then one randomly assigns the quantization via |$\mathrm{P}(Q_{\mathrm{StocQ}}(z)=s) = \frac{t-z}{t-s}$| and |$\mathrm{P}(Q_{\mathrm{StocQ}}(z)=t) = \frac{z-s}{t-s}$|. It follows that

|$\widehat{k}_{\text{StocQ}}(x,y):=\frac{2}{m}\bigl \langle Q_{\mathrm{StocQ}}(z(x)),\,Q_{\mathrm{StocQ}}(z(y))\bigr \rangle ,$| (1.7)

where |$Q_{\mathrm{StocQ}}$| operates on each component separately. Due to the Bernoulli sampling for |$Q_{\mathrm{StocQ}}$|⁠, the quantization process involves additional randomness for each dimension of RFFs, which leads to extra variance especially in the case of binary embedding, i.e. |$b=1$|⁠. Note that the distinct normalization factors are chosen in (1.6) and (1.7) such that the expectations of |$\hat{k}_{s}$| and |$\hat{k}_{StocQ}$| are equal to |$k(x,y)$|⁠. Moreover, the kernel approximation error for |$\widehat{k}_{s}$| and |$\widehat{k}_{\text{StocQ}}$| is bounded by |$O(m^{-1/2})$| with high probability, see [31, 40].
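For concreteness, here is a small NumPy sketch of the stochastic rounding rule just described; it assumes the evenly spaced alphabet of (1.4), and the function name and parameter choices are illustrative only.

```python
import numpy as np

def stocq(z, K=1, seed=0):
    """Stochastic quantization (randomized rounding) of z in [-1, 1]^m to the
    2K-level evenly spaced alphabet of (1.4), with step 2/(2K - 1).
    Each entry is rounded to one of its two neighbouring levels with probabilities
    chosen so that E[Q(z_i)] = z_i."""
    rng = np.random.default_rng(seed)
    step = 2.0 / (2 * K - 1)
    # index of the lower neighbouring level s <= z <= t, levels are -1, -1 + step, ..., 1
    lower_idx = np.clip(np.floor((z + 1.0) / step), 0, 2 * K - 2).astype(int)
    s = -1.0 + step * lower_idx
    t = s + step
    prob_up = (z - s) / step                 # P(Q(z) = t) = (z - s)/(t - s)
    return np.where(rng.random(z.shape) < prob_up, t, s)

z = np.random.default_rng(2).uniform(-1, 1, size=10)
print(z)
print(stocq(z, K=1))                         # K = 1 gives a 1-bit quantizer with A = {-1, 1}
```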

1.2 Methods and contributions

We explore the use of |$\varSigma \varDelta $| [14, 15, 18] and distributed noise-shaping [9, 10] quantization methods on RFFs. These techniques, explicitly defined and discussed in Section 2 and Appendix  A, yield superior performance to methods based on scalar quantization in contexts ranging from bandlimited function quantization [14, 18], to quantization of linear measurements [6, 7], of compressed sensing measurements [19], of nonlinear measurements [20] and even for binary embeddings that preserve (Euclidean) distances [21, 41]. It is therefore natural to wonder whether they can also yield superior performance in the RFF context. Let |$Q_{\varSigma \varDelta }^{(r)}$| be the |$r$|-th order |$\varSigma \varDelta $| quantizer and let |$Q_{\beta }$| be the distributed noise shaping quantizer with |$\beta \in (1,2)$|⁠, and let |$\widetilde{V}_{\varSigma \varDelta }$| and |$\widetilde{V}_{\beta }$| be their associated sparse condensation matrices defined in Section 2. Then our method approximates kernels via

|$\widehat{k}^{(r)}_{\varSigma \varDelta }(x,y):=\bigl \langle \widetilde{V}_{\varSigma \varDelta }\,Q_{\varSigma \varDelta }^{(r)}(z(x)),\ \widetilde{V}_{\varSigma \varDelta }\,Q_{\varSigma \varDelta }^{(r)}(z(y))\bigr \rangle $| (1.8)

and

|$\widehat{k}_{\beta }(x,y):=\bigl \langle \widetilde{V}_{\beta }\,Q_{\beta }(z(x)),\ \widetilde{V}_{\beta }\,Q_{\beta }(z(y))\bigr \rangle .$| (1.9)

Specifically, given large-scale data |$\mathscr{T}$| contained in a compact set |$\mathscr{X}\subset \mathbb{R}^{d}$|⁠, we put forward Algorithm 1 to generate and store quantized RFFs such that one can subsequently use them for training and inference using linear models.

[Algorithm 1: generation and storage of the quantized RFFs, i.e. the condensed bitstreams |$\widetilde{V}q(x)$|, for subsequent training and inference with linear models.]

For illustration, Appendix  B presents a pointwise comparison of the above kernel approximations on a synthetic toy dataset. A summary of our contributions is as follows.

  • We give the first detailed analysis of |$\varSigma \varDelta $| and distributed noise-shaping schemes for quantizing RFFs. Specifically, Theorem 3.1 provides a uniform upper bound for the errors |$|\widehat{k}^{(r)}_{\varSigma \varDelta }(x,y)-k(x,y)|$| and |$|\widehat{k}_{\beta }(x,y)-k(x,y)|$| over compact (possibly infinite) sets. Our analysis shows that the quantization error decays fast as |$m$| grows. Additionally, Theorem 3.3 provides spectral approximation guarantees for first-order |$\varSigma \varDelta $| quantized RFF approximation of kernels.

  • Our methods allow a further reduction in the number of bits used. Indeed, to implement (1.8) and (1.9) in practice, one would store and transmit the condensed bitstreams |$\widetilde{V}_{\varSigma \varDelta }Q_{\varSigma \varDelta }^{(r)}(z(x))$| or |$\widetilde{V}_{\beta } Q_{\beta }(z(x))$|⁠. For example, since the matrices |$\widetilde{V}_{\varSigma \varDelta }$| are sparse and essentially populated by bounded integers, each sample can be represented by fewer bits, as summarized in Table 1.

  • We illustrate the benefits of our proposed methods in several numerical experiments involving kernel ridge regression (KRR), kernel SVM and two-sample tests based on maximum mean discrepancy (MMD) (all in Section 4). Our experiments show that |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are comparable with the semi-quantization scheme and outperform the other fully quantized method mentioned above, both when we fix the number of RFF features |$m$| and when we fix the number of bits used to store each quantized RFF vector.

Table 1

The memory usage to store each encoded sample.

Method: RFFs | |$Q_{\text{SemiQ}}$| | |$Q_{\text{StocQ}}$| | |$Q_{\varSigma \varDelta }^{(r)}$| | |$Q_{\beta }$|
Memory: |$32m$| | |$32m$| | |$mb$| | |$O(p(b+r\log _{2}\lambda ))$| | |$mb^{*}$|

*This can be reduced to |$mb\log _{2}\beta $| for certain |$\beta $|.

2. Noise shaping quantization preliminaries

The methods we consider herein are special cases of noise shaping quantization schemes (see, e.g. [11]). For a fixed alphabet |$\mathscr{A}$| and each dimension |$m$|⁠, such schemes are associated with an |$m \times m$| lower triangular matrix |$H$| with unit diagonal, and are given by a map |$Q: {\mathbb{R}}^{m} \to \mathscr{A}^{m}$| with |$y\mapsto q$| designed to satisfy |$y-q = H u$|⁠. The schemes are called stable if |$\|u\|_{\infty } \leq C$|⁠, where |$C$| is independent of |$m$|⁠. Among these noise shaping schemes, we will be interested in stable |$r{\text{th}}$|-order |$\varSigma \varDelta $| schemes |$Q_{\varSigma \varDelta }^{(r)}$| [15, 18], and distributed noise shaping schemes |$Q_{\beta }$| [9, 10]. For example, in the case of |$\varSigma \varDelta $| with |$r=1$|⁠, the entries |$q_{i}, i=1,...,m$| of the vector |$q=Q_{\varSigma \varDelta }^{(1)}(y)$| are assigned iteratively via

|$q_{i}=Q_{\mathrm{MSQ}}(y_{i}+u_{i-1}),\qquad u_{i}=u_{i-1}+y_{i}-q_{i},\qquad u_{0}=0,$| (2.1)

where |$Q_{\mathrm{MSQ}}(z)=\text{argmin}_{v\in \mathscr{A}}|z-v|$|. This yields the difference equation |$y-q=Du$|, where |$D$| is the first-order difference matrix given by |$D_{ij}=1$| if |$i=j$|, |$D_{ij}=-1$| if |$i=j+1$| and |$D_{ij}=0$| otherwise. Stable |$\varSigma \varDelta $| schemes with |$r>1$| are more complicated to construct (see Appendix A), but satisfy

|$y-q=D^{r}u.$| (2.2)

On the other hand, a distributed noise-shaping quantizer |$Q_{\beta }:\mathbb{R}^{m}\rightarrow \mathscr{A}^{m}$| converts the input vector |$y\in \mathbb{R}^{m}$| to |$q=Q_{\beta }(y)\in \mathscr{A}^{m}$| such that

|$y-q=Hu,$| (2.3)

where, again, |$\|u\|_{\infty }\leq C$|⁠. Here, denoting the |$p\times p$| identity matrix by |$I_{p}$| and the Kronecker product by |$\otimes $|⁠, |$H$| is a block diagonal matrix defined as |$H:=I_{p}\otimes H_{\beta } \in \mathbb{R}^{m\times m}$|⁠, where |$H_{\beta }\in \mathbb{R}^{\lambda \times \lambda }$| is given by |$({H_{\beta }})_{ij}=1$| if |$i=j$|⁠, |$({H_{\beta }})_{ij}=-\beta $| if |$i=j+1$| and |$0$| otherwise. Defining |$\widetilde{H}:=I_{m}-H$|⁠, one can implement the quantization step |$q=Q_{\beta }(y)$| via the following iterations for |$i=1,2,\ldots ,m$| [9, 10]:

|$q_{i}=Q_{\mathrm{MSQ}}\bigl (y_{i}+(\widetilde{H}u)_{i}\bigr ),\qquad u_{i}=y_{i}+(\widetilde{H}u)_{i}-q_{i},$| (2.4)

where |$Q_{\mathrm{MSQ}}(z)=\text{argmin}_{v\in \mathscr{A}}|z-v|$|⁠. The stability of (2.4) is discussed in Appendix  A. It is worth mentioning that since |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are sequential quantization methods, they cannot be implemented entirely in parallel. On the other hand, blocks of size |$\lambda $| can still be run in parallel. Next, we adopt the definition of a condensation operator in [9, 21, 41].
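The greedy iterations (2.1) and (2.4) can be summarized by the following NumPy sketch; it is an illustrative implementation for the |$b=1$| alphabet |$\{-1,1\}$| (helper names and the block length are our own choices), not the exact code used in Section 4.

```python
import numpy as np

def msq(t, levels):
    """Memoryless scalar quantizer: nearest element of the alphabet."""
    return levels[np.argmin(np.abs(levels - t))]

def sigma_delta_1(y, levels=np.array([-1.0, 1.0])):
    """First-order Sigma-Delta (2.1): q_i = Q_MSQ(y_i + u_{i-1}), u_i = u_{i-1} + y_i - q_i."""
    q, u = np.empty_like(y), 0.0
    for i, yi in enumerate(y):
        q[i] = msq(yi + u, levels)
        u += yi - q[i]
    return q

def beta_quantizer(y, beta=1.5, lam=8, levels=np.array([-1.0, 1.0])):
    """Distributed noise shaping (2.4): within each block of length lambda,
    (H~ u)_i = beta * u_{i-1}, so q_i = Q_MSQ(y_i + beta * u_{i-1}) and
    u_i = y_i + beta * u_{i-1} - q_i, with the state reset at every block boundary."""
    q = np.empty_like(y)
    for start in range(0, len(y), lam):
        u = 0.0
        for i in range(start, min(start + lam, len(y))):
            q[i] = msq(y[i] + beta * u, levels)
            u = y[i] + beta * u - q[i]
    return q

y = np.random.default_rng(3).uniform(-1, 1, size=16)
print(sigma_delta_1(y))
print(beta_quantizer(y, beta=1.5, lam=8))
```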

 

Definition 2.1.

(⁠|$\varSigma \varDelta $| condensation operator) Let |$p$|⁠, |$r$|⁠, |$\lambda $| be fixed positive integers such that |$\lambda =r\widetilde{\lambda }-r+1$| for some integer |$\widetilde{\lambda }$|⁠. Let |$m=\lambda p$| and |$v$| be a row vector in |$\mathbb{R}^{\lambda }$| whose entry |$v_{j}$| is the |$j$|-th coefficient of the polynomial |$(1+z+\ldots +z^{\widetilde{\lambda }-1})^{r}$|⁠. Define the condensation operator |$V_{\varSigma \varDelta }\in \mathbb{R}^{p\times m}$| as |$V_{\varSigma \varDelta }:=I_{p}\otimes v.$|

For example, when |$r=1$|, |$\lambda =\widetilde{\lambda }$| and the vector |$v\in{\mathbb{R}}^{\lambda }$| is simply the vector of all ones, while when |$r=2$|, |$\lambda =2\widetilde{\lambda }-1$| and |$v=(1,2,\ldots , \widetilde{\lambda }-1, \widetilde{\lambda }, \widetilde{\lambda }-1,\ldots , 2,1)\in \mathbb{R}^{\lambda }$|. As we will see in the proof of Theorem 3.1, the quantization error is bounded by controlling |$\|V_{\varSigma \varDelta } D^{r} u\|_{2}$|. The specific construction of |$V_{\varSigma \varDelta }$| in Definition 2.1 accomplishes two things. First, it allows us to obtain favourable bounds on |$\|V_{\varSigma \varDelta } D^{r} u\|_{2}$|. Second, as |$V_{\varSigma \varDelta }$| is the Kronecker product of a vector having |$O(\lambda )$| non-zeros with the identity matrix, it is sparse and hence computationally efficient to apply. For analogous reasons, we define the condensation operator for |$Q_{\beta }$| as follows.

 

Definition 2.2.

(Distributed noise-shaping condensation operator) Let |$p,\lambda $| be positive integers and fix |$\beta \in (1,2)$|⁠. Let |$m=\lambda p$| and |$v_{\beta }:=(\beta ^{-1}, \beta ^{-2}, \ldots , \beta ^{-\lambda })\in \mathbb{R}^{\lambda }$| be a row vector. Define the distributed noise-shaping condensation operator |$V_{\beta }\in \mathbb{R}^{p\times m}$| as |$V_{\beta }:=I_{p}\otimes v_{\beta }.$|

We will also need the normalized condensation operators given by

|$\widetilde{V}_{\varSigma \varDelta }:=\sqrt{\tfrac{2}{p}}\,\frac{V_{\varSigma \varDelta }}{\|v\|_{2}},\qquad \widetilde{V}_{\beta }:=\sqrt{\tfrac{2}{p}}\,\frac{V_{\beta }}{\|v_{\beta }\|_{2}}.$| (2.5)

If |$\widetilde{V}$| is either of the two normalized matrices in (2.5), Lemma D.1 (Appendix  D) shows that

|${\mathbb{E}}\bigl \langle \widetilde{V}z(x),\widetilde{V}z(y)\bigr \rangle =k(x,y).$| (2.6)
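The following NumPy sketch illustrates Definitions 2.1 and 2.2 together with the normalization (2.5) and the quantized kernel estimate (1.8); the random |$\pm 1$| bitstreams stand in for actual quantized RFFs, and all names and sizes are illustrative assumptions.

```python
import numpy as np

def condensation_sigma_delta(p, lam_tilde, r):
    """V_SigmaDelta = I_p kron v, where v holds the coefficients of
    (1 + z + ... + z^{lam_tilde - 1})^r, so lambda = r*lam_tilde - r + 1 (Definition 2.1)."""
    v = np.array([1.0])
    for _ in range(r):
        v = np.convolve(v, np.ones(lam_tilde))     # repeated polynomial multiplication
    return np.kron(np.eye(p), v), v

def condensation_beta(p, lam, beta):
    """V_beta = I_p kron (beta^{-1}, ..., beta^{-lambda}) (Definition 2.2)."""
    v_beta = beta ** (-np.arange(1, lam + 1, dtype=float))
    return np.kron(np.eye(p), v_beta), v_beta

def normalize(V, v, p):
    """Normalized operator as in (2.5), scaled so that E<V~ z(x), V~ z(y)> = k(x, y)."""
    return np.sqrt(2.0 / p) * V / np.linalg.norm(v)

p, lam_tilde, r = 4, 8, 2
V_sd, v = condensation_sigma_delta(p, lam_tilde, r)
lam = V_sd.shape[1] // p                           # lambda = r*lam_tilde - r + 1
V_sd_tilde = normalize(V_sd, v, p)

# kernel estimate (1.8) for two (already quantized) bitstreams qx, qy in {-1, +1}^m;
# here random signs stand in for Q_SigmaDelta(z(x)), Q_SigmaDelta(z(y))
m = p * lam
rng = np.random.default_rng(4)
qx, qy = rng.choice([-1.0, 1.0], size=m), rng.choice([-1.0, 1.0], size=m)
k_hat = (V_sd_tilde @ qx) @ (V_sd_tilde @ qy)
print(lam, k_hat)
```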

3. Main results and space complexity

Our approach to quantizing RFFs given by (1.8) and (1.9) is justified by (2.6), along with the observation that, for our noise-shaping schemes, we have |$q=z-Hu$| with guarantees that |$\|\widetilde{V}Hu\|_{2}$| is small.

Moreover, as we will see in Section 3.1, we are able to control the approximation error so that |$\widehat{k}_{\varSigma \varDelta }(x,y)\approx k(x,y)$| and |$\widehat{k}_{\beta }(x,y)\approx k(x,y)$| hold with high probability. In fact, Theorem 3.1 shows more: the approximation errors of the quantized kernel estimators in (1.8) and (1.9) decay polynomially and exponentially, respectively, as a function of |$m$|, the dimension of the RFFs. Armed with this result, in Section 3.2 we also present a brief analysis of the space complexity associated with our quantized RFFs, and show that the approximation error due to quantization decays exponentially as a function of the number of bits needed.

Additionally, in various applications such as KRR, spectral error bounds on the kernel may be more pertinent than point-wise bounds. For example, it was shown in [4, 40] that the expected loss of KRR performed using an approximation of the true kernel is bounded by a function of the spectral error in the kernel approximation (Lemma 2 of [4], Proposition 1 of [40]). In Theorem 3.3, we provide spectral approximation guarantees for first-order |$\varSigma \varDelta $| quantized RFF approximation of kernels, in the spirit of the analogous guarantees in [40] for StocQ.

3.1 Approximation error bounds

3.1.1 Point-wise error bounds on the approximation

We begin with Theorem 3.1, with its proof in Appendix  D.

 

Theorem 3.1.
Let |$\mathscr{X}\subseteq \mathbb{R}^{d}$| be compact and |$k:\mathscr{X}\times \mathscr{X}\rightarrow \mathbb{R}$| be a normalized, i.e. |$k(0,0)=1$|⁠, shift-invariant kernel. Let |$\varLambda $| be its corresponding probability measure as in (1.1), and suppose that the second moment |$\sigma _{\varLambda }^{2}={\mathbb{E}}_{\omega \sim \varLambda } \|\omega \|_{2}^{2}$| exists. Let |$\beta \in (1,2)$|⁠, |$p,r\in \mathbb{N}$|⁠, |$\lambda =O(\sqrt{p\log ^{-1} p})\in \mathbb{N}$| and |$m=\lambda p$|⁠. Let |$C_{1}$| be the constant defined in Proposition A.1. For |$x,y \in \mathscr{X}$|⁠, and |$b$|-bit alphabet |$\mathscr{A}$| in (1.4) with |$b=\log _{2}(2K)$|⁠, consider the approximated kernels |$\widehat{k}^{(r)}_{\varSigma \varDelta }(x,y)$| and |$\widehat{k}_{\beta }(x,y)$| defined as in (1.8) and (1.9), respectively. Then there exist positive constants |$\{\alpha _{i}\}_{i=1}^{10}$| that are independent of |$m,p,\lambda $| such that
(3.1)
holds with probability at least |$1-\alpha _{1} p^{-1-\alpha _{2}}-\alpha _{3} \exp (-\alpha _{4} p^{1/2}+\alpha _{5} \log p)$|⁠, and
(3.2)
holds with probability exceeding |$1-\alpha _{6} p^{-1-\alpha _{7}}-\alpha _{8} \exp (-\alpha _{9} p^{1/2}+\alpha _{10} \log p)$|⁠.

Note that the first error term in (3.1) and (3.2) results from the condensation of RFFs, i.e. Theorem D.3, while the remaining two error terms are due to the corresponding quantization schemes.

3.1.2 Spectral approximation guarantees for first-order Sigma-Delta quantized RFFs

We begin with the definition of a |$(\varDelta _{1}, \varDelta _{2})$|-spectral approximation of a matrix, as the error bounds |$\varDelta _{1}$| and |$\varDelta _{2}$| play a key role in bounding the generalization error in various applications such as KRR (Lemma 2 of [4], Proposition 1 of [40]).

 

Definition 3.2

(|$(\varDelta _{1},\varDelta _{2})$|-spectral approximation) Given |$\varDelta _{1}, \varDelta _{2}>0$|, a matrix |$A$| is a |$(\varDelta _{1},\varDelta _{2})$|-spectral approximation of another matrix |$B$| if |$(1-\varDelta _{1})B \preccurlyeq A \preccurlyeq (1+\varDelta _{2})B$|.

For the tractability of obtaining spectral error bounds, in this section we consider a variation of the Sigma-Delta scheme for |$r=1$|⁠. In particular, given a |$b$|-bit alphabet as in (1.4) with |$b=\log _{2}(2K)$|⁠, we consider the following first-order |$\varSigma \varDelta $| quantization scheme for an RFF vector |$z(x) \in [-1,1]^{m}$| corresponding to a data point |$x \in \mathbb{R}^{d}$|⁠, where the state variable |$(u_{x})_{0}$| is initialized as a random number, i.e.

|$q_{i}=Q_{\mathrm{MSQ}}\bigl (z(x)_{i}+(u_{x})_{i-1}\bigr ),\qquad (u_{x})_{i}=(u_{x})_{i-1}+z(x)_{i}-q_{i},\qquad i=1,\ldots ,m,$| (3.3)

where |$q \in \mathscr{A}^{m}$| represents the |$\varSigma \varDelta $| quantization of |$z(x)$| and |$(u_{x})_{0}$| is drawn randomly from the uniform distribution on |$\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$|⁠.
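A minimal sketch of the variant (3.3) follows; compared with (2.1), the only changes are the |$b$|-bit alphabet and the random initial state |$(u_{x})_{0}$|, and the code below is illustrative rather than the implementation used for Theorem 3.3.

```python
import numpy as np

def sigma_delta_1_random_init(z, b=1, seed=0):
    """First-order Sigma-Delta as in (3.3), with a b-bit evenly spaced alphabet
    (2K = 2^b levels) and (u_x)_0 drawn uniformly from [-1/(2^b - 1), 1/(2^b - 1)]."""
    rng = np.random.default_rng(seed)
    K = 2 ** (b - 1)
    levels = np.linspace(-1.0, 1.0, 2 * K)        # evenly spaced 2K-level alphabet
    u = rng.uniform(-1.0, 1.0) / (2 ** b - 1)     # random initial state (u_x)_0
    q = np.empty_like(z)
    for i, zi in enumerate(z):
        q[i] = levels[np.argmin(np.abs(levels - (zi + u)))]  # q_i = Q_MSQ(z(x)_i + (u_x)_{i-1})
        u = u + zi - q[i]                                    # (u_x)_i = (u_x)_{i-1} + z(x)_i - q_i
    return q

z = np.cos(np.random.default_rng(5).uniform(0, 2 * np.pi, size=12))  # a toy RFF vector
print(sigma_delta_1_random_init(z, b=2))
```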

Let |$Q_{\varSigma \varDelta }$| be the first-order |$\varSigma \varDelta $| quantizer represented by (3.3) and let |$\widetilde{V}_{\varSigma \varDelta }$| be the associated sparse condensation matrix as in Definition 2.1. Then, as in (1.8), the entries of the corresponding approximation |$\hat{K}_{\varSigma \varDelta }$| of the kernel matrix |$K$| are given by |$(\hat{K}_{\varSigma \varDelta })_{ij}=\bigl \langle \widetilde{V}_{\varSigma \varDelta }\,Q_{\varSigma \varDelta }(z(x_{i})),\ \widetilde{V}_{\varSigma \varDelta }\,Q_{\varSigma \varDelta }(z(x_{j}))\bigr \rangle $| for data points |$x_{i}, x_{j}$|.

Now, we state Theorem 3.3 whose proof can be found in Appendix  E.

 

Theorem 3.3.
Let |$\hat{K}_{\varSigma \varDelta }$| be an approximation of a true kernel matrix |$K$| using |$m$|-feature first-order |$\varSigma \varDelta $| quantized RFF (as in (3.3)) with a |$b$|-bit alphabet (as in (1.4)) and |$m=\lambda p$|⁠. Then given |$\varDelta _{1} \geq 0, \varDelta _{2} \geq \frac{\delta }{\eta }$|⁠, where |$\eta>0$| represents the regularization and |$\delta = \frac{8 + \frac{26}{3p}}{\lambda (2^{b}-1)^{2}}$|⁠, we have

The above result differs from the spectral bound results presented in [40] for StocQ in one particular aspect, namely the lower bound requirement on |$\varDelta _{2}$|: the lower bound for |$\varDelta _{2}$| in Theorem 3.3 for first-order |$\varSigma \varDelta $| quantization has another controllable parameter |$\lambda $| in addition to the number of bits |$b$|. Specifically, provided |$8\gg \frac{26}{3p}$|, we have |$\delta \approx \frac{8}{\lambda (2^{b}-1)^{2}}$|, which is monotonically decreasing in |$\lambda $|. For instance, with |$b=1$| and |$\lambda =16$| this gives |$\delta \approx 1/2$|, while |$b=2$| and |$\lambda =16$| give |$\delta \approx 1/18$|.

3.2 Space complexity

At first glance, Theorem 3.1 shows that |$Q_{\beta }$| has faster quantization error decay as a function of |$\lambda $| (hence |$m$|⁠) as compared with |$Q_{\varSigma \varDelta }^{(r)}$|⁠. However, a further compression of the bit-stream resulting from the latter is possible, and results in a similar performance of the two methods from the perspective of bit-rate versus approximation error, as we will now show.

Indeed, our methods entail training and testing linear models on condensed bitstreams |$\widetilde{V}q\in \widetilde{V}\mathscr{A}^{m}\subset{\mathbb{R}}^{p}$|, where |$q$| is the quantized RFF vector generated by |$Q_{\varSigma \varDelta }^{(r)}$| or |$Q_{\beta }$|, and |$\widetilde{V}$| is the corresponding normalized condensation operator. Thus, when considering the space complexity associated with our methods, the relevant quantity is the number of bits needed to encode |$\widetilde{V}q$|. To that end, by storing the normalization factors in |$\widetilde{V}$| (see (2.5)) separately using a constant number of bits, we can simply ignore them when considering space complexity. Let us now consider |$b$|-bit alphabets |$\mathscr{A}$| with |$b=\log _{2}(2K)$|. Since the entries of |$v$| are integer valued and |$\|v\|_{1}=O(\lambda ^{r})$|, one can store |$\widetilde{V}_{\varSigma \varDelta }q$| using |$B:=O(p\log _{2}(2K\|v\|_{1}))=O(p(b+r\log _{2}\lambda ))$| bits. Then |$\lambda ^{-r} \approx 2^{-c B/p}$| and thus the dominant error terms in (3.1) decay exponentially as a function of the bit-rate |$B$|. On the other hand, for distributed noise shaping each coordinate of |$\widetilde{V}_{\beta } q$| is a linear combination of |$\lambda $| components of |$q$|, so each coordinate takes on at most |$(2K)^{\lambda }$| values. This implies we need |$p\log _{2}(2K)^{\lambda }=mb$| bits to store |$\widetilde{V}_{\beta } q$| in the worst case.
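To make the counts behind Table 1 concrete, the short sketch below computes the per-sample storage of the condensed |$\varSigma \varDelta $| bitstream and compares it with the uncompressed |$b$|-bit streams; the constants are treated loosely (no entropy coding), and the parameter values are illustrative.

```python
import numpy as np

def bits_sigma_delta(p, lam_tilde, r, b):
    """Approximate bits to store V_SigmaDelta q: p entries, each a signed combination of
    lambda alphabet elements with integer weights v, so |entry| <= ||v||_1 and roughly
    log2(2K * ||v||_1) bits per entry, where K = 2^{b-1}."""
    v = np.array([1.0])
    for _ in range(r):
        v = np.convolve(v, np.ones(lam_tilde))
    K = 2 ** (b - 1)
    return int(p * np.ceil(np.log2(2 * K * v.sum())))

p, lam_tilde, r, b = 64, 16, 2, 1
lam = r * lam_tilde - r + 1
m = p * lam
print("RFF (32-bit floats):       ", 32 * m)
print("StocQ / beta (worst case): ", m * b)
print("condensed Sigma-Delta:     ", bits_sigma_delta(p, lam_tilde, r, b))
```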

 

Remark 3.1.

Although |$mb$| is a tight upper bound for arbitrary |$\beta \in (1,2)$|, an interesting observation is that the number of bits used to store |$\widetilde{V}_{\beta } q$| can be smaller than |$mb$| for special choices of |$\beta $|, e.g. when |$\beta ^{k}=\beta +1$| with integer |$k>1$|. For example, if |$k=2$| and |$b=1$|, then |$\beta =(\sqrt{5}+1)/2$| is the golden ratio and one can see that |$v_{\beta }=(\beta ^{-1},\ldots ,\beta ^{-\lambda })$| satisfies |$v_{\beta }(i)=v_{\beta }(i+1)+v_{\beta }(i+2)$| for |$1\leq i\leq \lambda -2$|. Since |$b=1$|, we have |$q\in \{\pm 1\}^{m}$| and |$\widetilde{V}_{\beta } q$| (ignoring the normalizer) can be represented by |$p\log _{2}(\beta ^{\lambda })=m\log _{2}(\beta )<m$| bits. If we denote the number of bits used to encode each RFF vector by |$R:=m\log _{2}(\beta )$|, then (3.2) shows that |$\beta ^{-\lambda }=2^{-\lambda R/m}=2^{-R/p}$| dominates the error. In other words, up to constants, the error is essentially equal to the error obtained by a |$\lambda $|-bit MSQ quantization of a |$p$|-dimensional RFF embedding.

If we assume that each full-precision RFF is represented by |$32$| bits, then the storage cost per sample for both full-precision RFF and semi-quantized scheme |$Q_{\text{SemiQ}}$| in (1.6) is |$32m$|⁠. Because |$Q_{\text{StocQ}}$| in (1.7) does not admit further compression, it needs |$mb$| bits. A comparison of space complexity of different methods is summarized in Table 1.

4. Numerical experiments

We have established that both |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are memory-efficient and approximate their intended kernels well. In this section, we will verify via numerical experiments that they perform favourably compared with other baselines on machine learning tasks.

4.1 Kernel ridge regression

KRR [28] corresponds to the ridge regression (linear least squares with |$\ell _{2}$| regularization) in a reproducing kernel Hilbert space. We synthesize |$N=5000$| highly nonlinear data samples |$(x_{i},y_{i})\in \mathbb{R}^{5}\times \mathbb{R}$| such that for each |$i$|⁠, we draw each component of |$x_{i}\in \mathbb{R}^{5}$| uniformly from |$[-1,1)$| and use it to generate

where |$\gamma _{1}=\gamma _{2}=\gamma _{3}=[1,1,\ldots , 1]^{\top }\in \mathbb{R}^{5}$| and |$\varepsilon _{i}\sim \mathscr{N}(0,\frac{1}{4})$|⁠. This is split into |$4000$| samples used for training and |$1000$| samples for testing. Given an RBF kernel |$k(x,y)=\exp (-\gamma \|x-y\|_{2}^{2})$| with |$\gamma =1/d=0.2$|⁠, by the Representer theorem, our predictor is of the form |$\widehat{f}(x)=\sum _{i=1}^{N}\alpha _{i} k(x_{i},x)$| where the coefficient vector |$\alpha :=(\alpha _{1},\ldots ,\alpha _{N})\in \mathbb{R}^{N}$| is obtained by solving |$(K+\eta I_{N})\alpha =y$|⁠. Here, |$K=(k(x_{i},x_{j}))\in \mathbb{R}^{N\times N}$| is the kernel matrix and |$\eta =1$| is the regularization parameter.
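The KRR pipeline just described can be sketched as follows; the synthetic regression function below is only schematic (the exact target used in our experiments is not reproduced here), and the unquantized RFF kernel approximation stands in for any of the approximations |$\widehat{k}$|.

```python
import numpy as np

rng = np.random.default_rng(6)
d, N_train, N_test, m, eta, gamma = 5, 4000, 1000, 512, 1.0, 0.2

# schematic nonlinear data; the exact target used in the paper is not reproduced here
X = rng.uniform(-1, 1, size=(N_train + N_test, d))
y = np.sin(3 * X.sum(axis=1)) + 0.5 * rng.normal(size=N_train + N_test)
X_tr, X_te, y_tr, y_te = X[:N_train], X[N_train:], y[:N_train], y[N_train:]

# RFFs for k(x, y) = exp(-gamma * ||x - y||^2), i.e. omega ~ N(0, 2*gamma*I_d)
Omega = rng.normal(scale=np.sqrt(2 * gamma), size=(d, m))
xi = rng.uniform(0, 2 * np.pi, size=m)
Z_tr, Z_te = np.cos(X_tr @ Omega + xi), np.cos(X_te @ Omega + xi)

# KRR with the approximate kernel K_hat = (2/m) Z_tr Z_tr^T: solve (K_hat + eta I) alpha = y
K_hat = (2.0 / m) * Z_tr @ Z_tr.T
alpha = np.linalg.solve(K_hat + eta * np.eye(N_train), y_tr)
y_pred = (2.0 / m) * Z_te @ (Z_tr.T @ alpha)   # f_hat(x) = sum_i alpha_i k_hat(x_i, x)
print("test MSE:", np.mean((y_pred - y_te) ** 2))
```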

Since the dimension of RFFs satisfies |$m=\lambda p$|⁠, there is a trade-off between |$p$| and |$\lambda $|⁠. According to Theorem 3.1, increasing the embedding dimension |$p$| can reduce the error caused by compressing RFFs, while larger |$\lambda $| leads to smaller quantization error and makes the memory usage of |$Q_{\varSigma \varDelta }^{(r)}$| more efficient (see Table 1). Beyond this, all hyperparameters, e.g. |$\lambda $|⁠, |$\beta $|⁠, are tuned based on cross validation. In our experiment, we consider the kernel approximations |$\widehat{k}_{\text{RFF}}$|⁠, |$\widehat{k}_{\text{StocQ}}$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$| with |$\lambda =15$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| with |$\lambda =15$|⁠, and |$\widehat{k}_{\beta }$| with |$\beta =1.9$|⁠, |$\lambda =12$|⁠. These are applied for both training (solving for |$\alpha $|⁠) and testing (computing |$\widehat{f}(x)$| based on |$\alpha $|⁠), while the semi-quantized scheme |$\widehat{k}_{s}$| is only used for testing and its coefficient vector |$\alpha $| is learned using |$\widehat{k}_{\text{RFF}}$| on the training set. Furthermore, according to [31], |$\widehat{k}_{s}$| can be used in two scenarios during the testing stage:

  1. Training data are unquantized RFFs, while test data are quantized, i.e. |$\widehat{f}(x)=\sum _{i=1}^{N}\alpha _{i} \widehat{k}_{s}(x_{i},x)$|⁠;

  2. Quantize training data and leave testing points as RFFs, i.e. |$\widehat{f}(x)=\sum _{i=1}^{N}\alpha _{i} \widehat{k}_{s}(x,x_{i})$|⁠.

We summarize the KRR results averaged over |$30$| runs for |$b=1$| bit quantizers in Fig. 1, in which the solid curves represent our methods and the dashed lines depict the other baselines. Note that in both cases, the noise-shaping quantizer |$Q_{\beta }$| achieves the lowest test mean squared error among all quantization schemes, and it even outperforms the semi-quantized scheme |$\widehat{k}_{s}$| with respect to the number of measurements |$m$|. Moreover, owing to the further compression they admit, |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are more memory-efficient than the fully quantized scheme |$Q_{\text{StocQ}}$| in terms of bits used per sample. More experiments for |$b=2,3$| can be found in Appendix C.

Fig. 1.

KRR with |$b=1$|⁠. The labels RFF, |$s1$|⁠, |$s2$|⁠, StocQ, |$r1$|⁠, |$r2$|⁠, |$\beta $| represent |$\widehat{k}_{\text{RFF}}$|⁠, |$\widehat{k}_{s}$| for scenarios (1), (2), |$\widehat{k}_{\text{StocQ}}$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| and |$\widehat{k}_{\beta }$| respectively.

4.2 Kernel SVM

To illustrate the performance of our methods on classification tasks, we use kernel SVM [32, 36] to evaluate the different kernel approximations on the UCI ML hand-written digits dataset [2, 39], in which |$N=1797$| grayscale images comprise |$C=10$| classes and are vectorized into |$d=64$|-dimensional vectors. Additionally, all pixel values are scaled to the range |$[0,1]$| and we randomly split the dataset into |$80\%$| for training and |$20\%$| for testing. As for the classifier, we use the soft-margin SVM with regularization parameter |$R=1$|.

Note that in the binary classification case, i.e. labels |$y_{i}\in \{-1,1\}$|⁠, our goal is to learn the coefficients |$\alpha _{i}$|⁠, the intercept |$b$|⁠, and the index set of support vectors |$S$| in a decision function during the training stage:

|$g(x)=\text{sign}\Bigl (\sum _{i\in S}\alpha _{i} y_{i} k(x,x_{i})+b\Bigr ).$| (4.1)

Here, we use an RBF kernel |$k(x,y)=\exp (-\gamma \|x-y\|_{2}^{2})$| with |$\gamma =1/(d\sigma _{0}^{2})\approx 0.11$|, where |$\sigma _{0}^{2}$| equals the variance of the training data. In the multi-class case, we implement the ‘one-versus-one’ approach, where |$\frac{C(C-1)}{2}$| classifiers are constructed and each is trained on data from two classes. In our experiment, we found that a large embedding dimension |$p=m/\lambda $| is needed, and the approximations |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$| with |$\lambda =2$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| with |$\lambda =3$|, and |$\widehat{k}_{\beta }$| with |$\beta =1.1$|, |$\lambda =2$|, are used for both the training phase (obtaining |$\alpha _{i}$|, |$b$| and |$S$| in (4.1)) and the testing phase (predicting the class of an incoming sample |$x$| via |$g(x)$|), whereas the asymmetric scheme |$\widehat{k}_{s}$| is only used for inference, with its parameters in (4.1) learned from |$\widehat{k}_{\text{RFF}}$| during the training stage. Moreover, as before there are two versions of |$\widehat{k}_{s}$| used for making predictions:

  1. Keep the support vectors as unquantized RFFs and quantize the test point |$x$|⁠, i.e. substitute |$\widehat{k}_{s}(x_{i}, x)$| for |$k(x,x_{i})$| in (4.1);

  2. Quantize the support vectors and leave the testing point |$x$| as unquantized RFFs, i.e. replace |$k(x,x_{i})$| in (4.1) with |$\widehat{k}_{s}(x, x_{i})$|⁠.

For each binary quantization scheme (with |$b=1$|), the average test accuracy over |$30$| independent runs is plotted in Fig. 2. We observe that, with respect to |$m$|, |$Q_{\beta }$| substantially outperforms the other fully quantized schemes, including |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\text{StocQ}}$|, but, as expected, it is still worse than the semi-quantized methods. Memory efficiency is characterized in the right plot by plotting the test accuracy against the storage cost (in bits) per sample. Note that both |$Q_{\beta }$| and |$Q_{\varSigma \varDelta }^{(1)}$| have a significant advantage over the baseline method |$Q_{\text{StocQ}}$|, which means that our methods require less memory to achieve the same test accuracy when |$b=1$|. See Appendix C for additional experimental results with |$b=2,3$|.

Fig. 2.

Kernel SVM with |$b=1$|⁠. The labels RFF, |$s1$|⁠, |$s2$|⁠, StocQ, |$r1$|⁠, |$r2$|⁠, |$\beta $| represent |$\widehat{k}_{\text{RFF}}$|⁠, |$\widehat{k}_{s}$| for scenarios (1), (2), |$\widehat{k}_{\text{StocQ}}$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| and |$\widehat{k}_{\beta }$| respectively.

4.3 Maximum mean discrepancy

Given two distributions |$p$| and |$q$|, and a kernel |$k$| over |$\mathscr{X}\subset \mathbb{R}^{d}$|, the MMD has been shown to play an important role in the two-sample test [17], in which the null hypothesis |$\mathscr{H}_{0}: p=q$| is tested against the alternative hypothesis |$\mathscr{H}_{1}: p\neq q$|. The square of the MMD distance can be computed by |$\mathrm{MMD}_{k}^{2}(p,q)={\mathbb{E}}[k(x,x^{\prime })]-2\,{\mathbb{E}}[k(x,y)]+{\mathbb{E}}[k(y,y^{\prime })],$|

where |$x,x^{\prime }\overset{iid}{\sim } p$| and |$y,y^{\prime }\overset{iid}{\sim }q$|. Here, we set |$k$| to an RBF kernel, which is characteristic [34], implying that |$\mathrm{MMD}_{k}(p,q)$| is a metric, i.e. |$\mathrm{MMD}_{k}(p,q)=0\iff p=q$|, and that the following hypothesis test is consistent.

In our experiment, the distribution |$p$| is supported on a quadrant of the unit circle, while |$q$| is generated by perturbing |$p$| by a gap of size |$\delta $| at various regions, see Fig. 3a. Let |$n=60$| and choose finite samples |$X=\{x_{1},\ldots ,x_{n}\}\sim p$| and |$Y=\{y_{1},\ldots ,y_{n}\}\sim q$|⁠. Then |$\mathrm{MMD}_{k}(p,q)$| can be estimated by

|$\widehat{\mathrm{MMD}}_{k}^{2}(X,Y)=\frac{1}{n(n-1)}\sum _{i\neq j}k(x_{i},x_{j})+\frac{1}{n(n-1)}\sum _{i\neq j}k(y_{i},y_{j})-\frac{2}{n^{2}}\sum _{i,j}k(x_{i},y_{j}).$| (4.2)
Fig. 3.

Two distributions and the MMD values based on the RBF kernel.

Under the null hypothesis |$\mathscr{H}_{0}$|⁠, one can get the empirical distribution of (4.2) by reshuffling the data samples |$X\cup Y$| many times (⁠|$t=2000$|⁠) and recomputing |$\widehat{\mathrm{MMD}}_{k}^{2}(X^{\prime },Y^{\prime })$| on each partition |$X^{\prime }\cup Y^{\prime }$|⁠. For a significance level of |$\alpha =0.05$|⁠, |$\mathscr{H}_{0}$| is rejected if the original |$\widehat{\mathrm{MMD}}_{k}^{2}(X,Y)$| is greater than the |$(1-\alpha )$| quantile from the empirical distribution. Figure 3c shows that the empirical distributions of (4.2) under both |$\mathscr{H}_{0}$| and |$\mathscr{H}_{1}$| are separated well, where we use the ground truth RBF kernel with small bandwidth |$\sigma =0.05$|⁠.
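The permutation test described above can be sketched as follows, using the exact RBF kernel for the MMD estimate (4.2); the perturbation of |$p$| is schematic, and any of the quantized approximations |$\widehat{k}$| could be substituted for the kernel evaluations.

```python
import numpy as np

def rbf_gram(A, B, sigma=0.05):
    """Pairwise RBF kernel values exp(-||a - b||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=0.05):
    """Unbiased estimator (4.2) of MMD_k^2 between the samples X and Y."""
    n = len(X)
    Kxx, Kyy, Kxy = rbf_gram(X, X, sigma), rbf_gram(Y, Y, sigma), rbf_gram(X, Y, sigma)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (n * (n - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(7)
n, t, alpha = 60, 2000, 0.05
theta = rng.uniform(0, np.pi / 2, size=(2, n))
X = np.stack([np.cos(theta[0]), np.sin(theta[0])], axis=1)         # quadrant of the unit circle
Y = np.stack([np.cos(theta[1]), np.sin(theta[1])], axis=1) * 1.05  # schematic perturbation of p

stat = mmd2_unbiased(X, Y)
pooled = np.vstack([X, Y])
null = []
for _ in range(t):                      # reshuffle the pooled samples under H0
    perm = rng.permutation(2 * n)
    null.append(mmd2_unbiased(pooled[perm[:n]], pooled[perm[n:]]))
print("reject H0:", stat > np.quantile(null, 1 - alpha))
```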

In order to compare the different quantization methods when |$b=1$|, we use the following approximations with optimal |$\lambda $| to perform the permutation test: |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$| with |$\lambda =4$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| with |$\lambda =5$|, and |$\widehat{k}_{\beta }$| with |$\beta =1.5$|, |$\lambda =4$|. Due to the symmetry in (4.2), |$\widehat{k}_{s}$| can be implemented without worrying about the order of its inputs. Additionally, if the probability of a Type II error, i.e. the false negative rate, is denoted by |$\beta $| (not to be confused with the quantizer parameter), then the statistical power of our test is defined by |$\mathrm{power}:=1-\beta =\mathrm{P}(\text{reject }\mathscr{H}_{0}\mid \mathscr{H}_{1}).$|

In other words, the power equals the proportion of MMD values under |$\mathscr{H}_{1}$| that exceed the |$(1-\alpha )$| quantile of the MMD distribution under |$\mathscr{H}_{0}$|. In Fig. 4, we observe that, compared with the other fully quantized schemes, |$Q_{\beta }$| has the greatest power with respect to |$m$|. The performance of the semi-quantized scheme is close to that of the plain RFF approximation, but it requires more storage space, as discussed in Section 3. Moreover, Fig. 5 presents the corresponding changes in the MMD distributions under |$\mathscr{H}_{0}$| and |$\mathscr{H}_{1}$|, in which the overlap between the two distributions is considerably reduced as |$m$| increases. Regarding the number of bits per sample, both |$Q_{\varSigma \varDelta }^{(1)}$| and |$Q_{\beta }$| have a remarkable advantage over |$Q_{\text{StocQ}}$|. Extra results for |$b=2,3$| can be found in Appendix C.

Fig. 4.

Power of the permutation test with |$b=1$|⁠. The labels RFF, |$s$|⁠, StocQ, |$r1$|⁠, |$r2$|⁠, |$\beta $| represent |$\widehat{k}_{\text{RFF}}$|⁠, |$\widehat{k}_{s}$|⁠, |$\widehat{k}_{\text{StocQ}}$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$|⁠, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| and |$\widehat{k}_{\beta }$|⁠, respectively.

Fig. 5.

The empirical distributions of MMD values under |$\mathscr{H}_{0}$| and |$\mathscr{H}_{1}$|⁠.

5. Conclusion

In order to reduce the memory requirements for training and storing kernel machines, we proposed a framework that uses Sigma-Delta and distributed noise-shaping quantization schemes, |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$|, to approximate shift-invariant kernels. We showed that these fully deterministic quantization schemes use fewer bits than other baselines without compromising performance. Importantly, we showed that, for all pairs of signals from an infinite low-complexity set, the approximations satisfy uniform probabilistic error bounds that decay exponentially as the number of bits used increases. Empirically, we illustrated across popular kernel machines that the proposed quantization methods achieve strong performance both as a function of the dimension of the RFF embedding and as a function of the number of bits used, especially in the case of binary embedding.

Data Availability Statement

The data underlying this article are available in the UCI Machine Learning Repository, at https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits.

Funding

JZ was partially supported by grants NSF DMS 2012546 and 2012266. AC was partially supported by NSF DMS 1819222, 2012266. RS was partially supported by NSF DMS 2012546 and a UCSD senate research award.

References

1. Agrawal, R., Campbell, T., Huggins, J. & Broderick, T. (2019) Data-dependent compression of random features for large-scale kernel approximation. The 22nd International Conference on Artificial Intelligence and Statistics (Chaudhuri, K. & Sugiyama, M. eds). PMLR, pp. 1822–1831.

2. Alpaydin, E. & Kaynak, C. (1998) Cascading classifiers. Kybernetika (Prague), 34, 369–374.

3. Avron, H., Clarkson, K. L. & Woodruff, D. P. (2017) Faster kernel ridge regression using sketching and preconditioning. SIAM J. Matrix Anal. Appl., 38, 1116–1138.

4. Avron, H., Kapralov, M., Musco, C., Musco, C., Velingker, A. & Zandieh, A. (2017) Random Fourier features for kernel ridge regression: approximation bounds and statistical guarantees. International Conference on Machine Learning (Precup, D. & Teh, Y. W. eds). PMLR, pp. 253–262.

5. Bach, F. (2017) On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Res., 18, 714–751.

6. Benedetto, J. J., Powell, A. M. & Yilmaz, Ö. (2006) Second-order Sigma–Delta (|$\varSigma \varDelta $|) quantization of finite frame expansions. Appl. Comput. Harmon. Anal., 20, 126–148.

7. Benedetto, J. J., Powell, A. M. & Yilmaz, Ö. (2006) Sigma-Delta quantization and finite frames. IEEE Trans. Inf. Theory, 52, 1990–2005.

8. Boufounos, P. T. & Rane, S. (2013) Efficient coding of signal distances using universal quantized embeddings. 2013 Data Compression Conference (DCC), pp. 251–260.

9. Chou, E. & Güntürk, C. S. (2016) Distributed noise-shaping quantization: I. Beta duals of finite frames and near-optimal quantization of random measurements. Constr. Approx., 44, 1–22.

10. Chou, E. & Güntürk, C. S. (2017) Distributed noise-shaping quantization: II. Classical frames. Excursions in Harmonic Analysis, vol. 5. Birkhäuser, Cham, pp. 179–198.

11. Chou, E., Güntürk, C. S., Krahmer, F., Saab, R. & Yilmaz, Ö. (2015) Noise-shaping quantization methods for frame-based and compressive sampling systems. Sampling Theory, a Renaissance. Birkhäuser, Cham, pp. 157–184.

12. Cucker, F. & Smale, S. (2002) On the mathematical foundations of learning. Bull. Am. Math. Soc., 39, 1–49.

13. Danzer, L. (1963) Helly’s theorem and its relatives. Convexity, Proc. Symp. Pure Math., vol. 7. Amer. Math. Soc., pp. 101–180.

14. Daubechies, I. & DeVore, R. (2003) Approximating a bandlimited function using very coarsely quantized data: a family of stable sigma-delta modulators of arbitrary order. Ann. of Math. (2), 158, 679–710.

15. Deift, P., Krahmer, F. & Güntürk, C. S. (2011) An optimal family of exponentially accurate one-bit Sigma-Delta quantization schemes. Commun. Pure Appl. Math., 64, 883–919.

16. Foucart, S. & Rauhut, H. (2013) A Mathematical Introduction to Compressive Sensing. Birkhäuser, New York, NY.

17. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. (2012) A kernel two-sample test. J. Mach. Learn. Res., 13, 723–773.

18. Güntürk, C. S. (2003) One-bit sigma-delta quantization with exponential accuracy. Commun. Pure Appl. Math., 56, 1608–1630.

19. Güntürk, C. S., Lammers, M., Powell, A. M., Saab, R. & Yilmaz, Ö. (2013) Sobolev duals for random frames and |$\varSigma \varDelta $| quantization of compressed sensing measurements. Found. Comput. Math., 13, 1–36.

20. Huynh, T. (2016) Accurate quantization in redundant systems: from frames to compressive sampling and phase retrieval. Ph.D. thesis, New York University.

21. Huynh, T. & Saab, R. (2020) Fast binary embeddings and quantized compressed sensing with structured matrices. Commun. Pure Appl. Math., 73, 110–149.

22. Krahmer, F., Saab, R. & Ward, R. (2012) Root-exponential accuracy for coarse quantization of finite frame expansions. IEEE Trans. Inf. Theory, 58, 1069–1079.

23. Li, X. & Li, P. (2021) Quantization algorithms for random Fourier features. The 38th International Conference on Machine Learning (Meila, M. & Zhang, T. eds). PMLR, pp. 6369–6380.

24. Lin, C.-J. (2007) Large-Scale Kernel Machines. MIT Press, Cambridge, MA, USA.

25. Liu, F., Huang, X., Chen, Y. & Suykens, J. A. (2022) Random features for kernel approximation: a survey on algorithms, theory, and beyond. IEEE Trans. Pattern Anal. Mach. Intell., 44, 7128–7148.

26. Loomis, L. H. (2013) Introduction to Abstract Harmonic Analysis. Courier Corporation.

27. May, A., Garakani, A. B., Lu, Z., Guo, D., Liu, K., Bellet, A., Fan, L., Collins, M., Hsu, D., Kingsbury, B., Picheny, M. & Sha, F. (2019) Kernel approximation methods for speech recognition. J. Mach. Learn. Res., 20, 1–36.

28. Murphy, K. P. (2012) Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, USA.

29. Rahimi, A. & Recht, B. (2007) Random features for large-scale kernel machines. Adv. Neural Inf. Process. Syst., 20, 1177–1184.

30. Rudi, A. & Rosasco, L. (2017) Generalization properties of learning with random features. NIPS (Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S. & Garnett, R. eds). Curran Associates, Inc., pp. 3215–3225.

31. Schellekens, V. & Jacques, L. (2020) Breaking the waves: asymmetric random periodic features for low-bitrate kernel machines. Information and Inference: A Journal of the IMA, 11, 385–421.

32. Schölkopf, B. & Smola, A. J. (2018) Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Adaptive Computation and Machine Learning series. MIT Press, Cambridge, MA, USA.

33. Shawe-Taylor, J. & Cristianini, N. (2004) Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.

34. Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B. & Lanckriet, G. R. (2010) Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11, 1517–1561.

35. Sriperumbudur, B. K. & Szabó, Z. (2015) Optimal rates for random Fourier features. Proceedings of the 28th International Conference on Neural Information Processing Systems (Cortes, C., Lawrence, N., Lee, D., Sugiyama, M. & Garnett, R. eds), vol. 1. Curran Associates, Inc., pp. 1144–1152.

36. Steinwart, I. & Christmann, A. (2008) Support Vector Machines. Springer Science & Business Media, New York, NY, USA.

37. Sutherland, D. J. & Schneider, J. (2015) On the error of random Fourier features. Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 862–871.

38. Tu, S., Roelofs, R., Venkataraman, S. & Recht, B. (2016) Large scale kernel learning using block coordinate descent. arXiv preprint arXiv:1602.05310.

39. Xu, L., Krzyzak, A. & Suen, C. Y. (1992) Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Syst. Man Cybern., 22, 418–435.

40. Zhang, J., May, A., Dao, T. & Ré, C. (2019) Low-precision random Fourier features for memory-constrained kernel approximation. The 22nd International Conference on Artificial Intelligence and Statistics (Chaudhuri, K. & Sugiyama, M. eds). PMLR, pp. 1264–1274.

41. Zhang, J. & Saab, R. (2021) Faster binary embeddings for preserving Euclidean distances. International Conference on Learning Representations.

A. Stable quantization methods

The general definition for stable |$Q_{\varSigma \varDelta }^{(r)}$|⁠. Although it is a non-trivial task to design a stable |$Q_{\varSigma \varDelta }^{(r)}$| for |$r>1$|⁠, families of |$\varSigma \varDelta $| quantization schemes that achieve this goal have been designed [14, 15, 18], and we adopt the version in [15]. Specifically, an |$r$|-th order |$\varSigma \varDelta $| quantization scheme may also arise from the following difference equation

|$y-q=H*v,$| (A.1)

where * is the convolution operator and the sequence |$H:=D^{r} g$| with |$g\in \ell ^{1}$|⁠. Then any bounded solution |$v$| of (A.1) gives rise to a bounded solution |$u$| of (2.2) via |$u=g*v$|⁠. By change of variables, (2.2) can be reformulated as (A.1). By choosing a proper filter |$h:=\delta ^{(0)}-H$|⁠, where |$\delta ^{(0)}$| denotes the Kronecker delta sequence supported at |$0$|⁠, one can implement (A.1) by |$v_{i}=(h*v)_{i}+y_{i}-q_{i}$| and the corresponding stable quantization scheme |$Q_{\varSigma \varDelta }^{(r)}$| reads as

|$q_{i}=Q_{\mathrm{MSQ}}\bigl ((h*v)_{i}+y_{i}\bigr ),\qquad v_{i}=(h*v)_{i}+y_{i}-q_{i}.$| (A.2)

Furthermore, the above design leads to the following result from [15, 22], which exploits the constant |$c(K,\mu , r)$| to bound |$\|u\|_{\infty }$|⁠.

 

Proposition A.1.
There exists a universal constant |$C>0$| such that the |$\varSigma \varDelta $| schemes (2.1) and (A.2) with alphabet |$\mathscr{A}$| in (1.4), are stable, and
where |$C_{1}=\bigl ( \bigl \lceil \frac{\pi ^{2}}{(\cosh ^{-1}\gamma )^{2}}\bigr \rceil \frac{e}{\pi } \bigr )$| with |$\gamma :=2K-(2K-1)\mu $|⁠.

Note that even with the |$b=1$| bit alphabet, i.e. |$K=1$| and |$\mathscr{A}=\{-1, 1\}$|⁠, stability can be guaranteed with

The stability of |$Q_{\beta }$|⁠. The relevant result for stability of the noise-shaping quantization schemes (2.4) is the following proposition, which can be simply proved by induction or can be found in [10].

 

Proposition A.2.
The noise-shaping scheme (2.4) with alphabet |$\mathscr{A}$| in (1.4) is stable and

B. A comparison of kernel approximations

In Fig. B.6, we evaluate approximated kernels (1.8) and (1.9) in Section 1 on |$n=1000$| pairs of points |$\{x_{i}, y_{i}\}_{i=1}^{n}$| in |$\mathbb{R}^{d}$| with |$d=50$| such that for each |$i$|

Fig. B.6.

Kernel Approximations with |$b=3$|⁠.

Moreover, each data point |$x_{i}$| is represented by |$m=3000$| RFF features and we use |$3$|-bit quantizers to guarantee good performance for all methods. The target RBF kernel (red curves) is |$k(x,y)=\exp (-\|x-y\|_{2}^{2}/2\sigma ^{2})$| with |$\gamma :=1/(2\sigma ^{2})=\frac{1}{5}$|, and the evaluated pairs (black dots) have their |$\ell _{2}$| distances |$\|x-y\|_{2}$| uniformly distributed in the range |$[0,5]$|. We see that both |$\widehat{k}_{\varSigma \varDelta }^{(r)}$| and |$\widehat{k}_{\beta }$| approximate |$k$| well.

C. More figures in Section 4

KRR. In Fig. C.7 and Fig. C.8, we show the KRR results for |$b=2$| and |$b=3$|, respectively. As in Section 4, we see that the proposed methods |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| perform strongly both in terms of |$m$| and in terms of the number of bits used per sample.

Fig. C.7.

KRR with |$b=2$|⁠.

Fig. C.8.

KRR with |$b=3$|⁠.

Kernel SVM. Figures C.9 and C.10 illustrate the performance of kernel SVM with |$b=2$| and |$b=3$|, respectively. As expected, the gap across the various schemes shrinks when we use multi-bit quantizers, where |$Q_{\text{StocQ}}$| becomes comparable with |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$|.

Fig. C.9.

Kernel SVM with |$b=2$|⁠.

Fig. C.10.

Kernel SVM with |$b=3$|⁠.

MMD. As a continuation of the two-sample test in Section 4, Fig. C.11 and Fig. C.12 show that both the semi-quantized scheme and |$Q_{\text{StocQ}}$| perform better with respect to |$m$|, while |$Q_{\varSigma \varDelta }^{(r)}$| saves more memory than the other quantization methods.

Fig. C.11.

Power of the permutation test with |$b=2$|⁠.

Fig. C.12.

Power of the permutation test with |$b=3$|⁠.

D. Proof of Theorem 3.1

Given |$x,y\in \mathscr{X}\subset \mathbb{R}^{d}$|, we use either stable |$Q_{\varSigma \varDelta }^{(r)}$| or stable |$Q_{\beta }$| to quantize their RFFs |$z(x)$|, |$z(y)$| as in (1.2). Then we get quantized sequences |$q_{x}:=Q(z(x))$| and |$q_{y}:=Q(z(y))$| in |$\mathscr{A}^{m}$|, and expect that both |$\widehat{k}^{(r)}_{\varSigma \varDelta }(x,y)=\langle \widetilde{V}_{\varSigma \varDelta }q_{x},\widetilde{V}_{\varSigma \varDelta }q_{y}\rangle $| and |$\widehat{k}_{\beta }(x,y)=\langle \widetilde{V}_{\beta }q_{x},\widetilde{V}_{\beta }q_{y}\rangle $| approximate the ground truth |$k(x,y)$| well.

In the case of |$\varSigma \varDelta $| quantization, by (2.2), one can get |$q_{x}=z(x)-D^{r}u_{x}$| and |$q_{y}=z(y)-D^{r}u_{y}$|. It follows that

|$\widehat{k}^{(r)}_{\varSigma \varDelta }(x,y)-k(x,y)=\bigl (\langle \widetilde{V}_{\varSigma \varDelta }z(x),\widetilde{V}_{\varSigma \varDelta }z(y)\rangle -k(x,y)\bigr )-\langle \widetilde{V}_{\varSigma \varDelta }z(x),\widetilde{V}_{\varSigma \varDelta }D^{r}u_{y}\rangle -\langle \widetilde{V}_{\varSigma \varDelta }D^{r}u_{x},\widetilde{V}_{\varSigma \varDelta }z(y)\rangle +\langle \widetilde{V}_{\varSigma \varDelta }D^{r}u_{x},\widetilde{V}_{\varSigma \varDelta }D^{r}u_{y}\rangle .$|

The triangle inequality implies that

|$\bigl |\widehat{k}^{(r)}_{\varSigma \varDelta }(x,y)-k(x,y)\bigr |\leq \underbrace{\bigl |\langle \widetilde{V}_{\varSigma \varDelta }z(x),\widetilde{V}_{\varSigma \varDelta }z(y)\rangle -k(x,y)\bigr |}_{\mathrm{(I)}}+\underbrace{\bigl |\langle \widetilde{V}_{\varSigma \varDelta }z(x),\widetilde{V}_{\varSigma \varDelta }D^{r}u_{y}\rangle \bigr |}_{\mathrm{(II)}}+\underbrace{\bigl |\langle \widetilde{V}_{\varSigma \varDelta }D^{r}u_{x},\widetilde{V}_{\varSigma \varDelta }z(y)\rangle \bigr |}_{\mathrm{(III)}}+\underbrace{\bigl |\langle \widetilde{V}_{\varSigma \varDelta }D^{r}u_{x},\widetilde{V}_{\varSigma \varDelta }D^{r}u_{y}\rangle \bigr |}_{\mathrm{(IV)}}.$| (D.1)

Similarly, for the noise-shaping quantization, one can derive the following inequality based on (2.3),

|$\bigl |\widehat{k}_{\beta }(x,y)-k(x,y)\bigr |\leq \underbrace{\bigl |\langle \widetilde{V}_{\beta }z(x),\widetilde{V}_{\beta }z(y)\rangle -k(x,y)\bigr |}_{\mathrm{(I)}}+\underbrace{\bigl |\langle \widetilde{V}_{\beta }z(x),\widetilde{V}_{\beta }Hu_{y}\rangle \bigr |}_{\mathrm{(II)}}+\underbrace{\bigl |\langle \widetilde{V}_{\beta }Hu_{x},\widetilde{V}_{\beta }z(y)\rangle \bigr |}_{\mathrm{(III)}}+\underbrace{\bigl |\langle \widetilde{V}_{\beta }Hu_{x},\widetilde{V}_{\beta }Hu_{y}\rangle \bigr |}_{\mathrm{(IV)}}.$| (D.2)

In order to control the kernel approximation errors in (D.1) and (D.2), we need to bound the four terms (I), (II), (III) and (IV) on the right-hand side.

D.1 Useful lemmata

In this section, we present the following well-known concentration inequalities and relevant lemmata.

 

Theorem D.1.
(Hoeffding’s inequality [16]) Let |$X_{1},\ldots , X_{M}$| be a sequence of independent random variables such that |${\mathbb{E}} X_{l}=0$| and |$|X_{l}|\leq B_{l}$| almost surely for all |$1\leq l\leq M$|⁠. Then for all |$t>0$|⁠,

 

Theorem D.2.
(Bernstein’s inequality [16]) Let |$X_{1},\ldots , X_{M}$| be independent random variables with zero mean such that |$|X_{l}|\leq K$| almost surely for all |$1\leq l\leq M$| and some constant |$K>0$|⁠. Furthermore assume |${\mathbb{E}}|X_{l}|^{2}\leq \sigma _{l}^{2}$| for constants |$\sigma _{l}>0$| for all |$1\leq l\leq M$|⁠. Then for all |$t>0$|⁠,
where |$\sigma ^{2}:=\sum _{l=1}^{M}\sigma _{l}^{2}$|⁠.

Additionally, one can compute the moments of |$\cos (\omega _{i}^{\top } x+\xi _{i})\cos (\omega _{j}^{\top } y +\xi _{j})$| as follows.

 

Lemma D.1.
Suppose |$x,y\in \mathbb{R}^{d}$| with RFFs |$z(x)$| and |$z(y)$| as in (1.2). Let |$\widetilde{V}$| be either of the two normalized condensation operators defined in (2.5). Then
(D.3)
 
(D.4)
 
(D.5)

 

Proof.
(i) Using trigonometric identities, the independence of |$\omega _{j}$| and |$\xi _{j}$| and formula (1.1), we get
(ii) If |$i=j$|⁠, then
Similarly, when |$i\neq j$| we have
(iii) According to (D.3), we have |${\mathbb{E}}( z(x)z(y)^{\top })=\frac{1}{2}k(x,y)I_{m}$| and thus

 

Lemma D.2.
Let |$x,y\in \mathbb{R}^{d}$| and |$\varepsilon>0$|⁠. Then
 

 

Proof.
(i) We first consider the case of |$\varSigma \varDelta $| scheme, i.e. (I) in (D.1). Note that
where |$S_{1}(x,y),\ldots , S_{p}(x,y)$| are i.i.d. with
Due to (D.3), (D.4), and
one can get
and
for all |$1\leq i\leq p$|⁠, it follows immediately from Bernstein’s inequality that
(ii) Since the proof in part (i) works for all vectors |$v\in \mathbb{R}^{\lambda }$| with non-negative components, a similar result holds for the noise-shaping case by replacing |$V_{\varSigma \varDelta }$| and |$v$| by |$V_{\beta }$| and |$v_{\beta }$|⁠, respectively.

 

Lemma D.3.
Let |$x\in \mathbb{R}^{d}$| and |$\varepsilon>0$|⁠. Then
 

 

Proof.
(i) In the case of |$\varSigma \varDelta $| quantization, we note that |$V_{\varSigma \varDelta }=I_{p}\otimes v$| and
where |$R_{i}(x):= \frac{1}{\|v\|_{2}^{2}}\sum _{j=1}^{\lambda } v_{j} z(x)_{(i-1)\lambda +j}=\frac{1}{\|v\|_{2}^{2}}\sum _{j=1}^{\lambda } v_{j}\cos (\omega _{(i-1)\lambda +j}^{\top } x+\xi _{(i-1)\lambda +j})$| for |$1\leq i\leq p$|⁠.
Since |${\mathbb{E}}(v_{j}^{2}\cos ^{2}(\omega _{(i-1)\lambda +j}^{\top } x+\xi _{(i-1)\lambda +j}))=v_{j}^{2}/2$| and |$|v_{j}\cos (\omega _{(i-1)\lambda +j}^{\top } x+\xi _{(i-1)\lambda +j})|\leq \|v\|_{\infty }$| holds for all |$i$| and |$j$|⁠, we can apply Theorem D.2 to |$R_{i}(x)$| with |$K=\|v\|_{\infty }$|⁠, |$M=\lambda $|⁠, and |$\sigma ^{2}=\frac{\|v\|_{2}^{2}}{2}$|⁠. Specifically, for all |$t>0$|⁠, we have
(D.6)
Since
by union bound, we have
where the last inequality is due to (D.6).

(ii) Substituting |$V_{\varSigma \varDelta }$| with |$V_{\beta }=I_{p}\otimes v_{\beta }$| leads to a verbatim proof for the second inequality.

D.2 Upper bound of (I)

This section is devoted to deriving an upper bound of the term (I) in (D.1), (D.2). Here, we adapt the proof techniques used in [37].

According to Theorem 3.1, |$\mathscr{X}$| is a compact subset of |$\mathbb{R}^{d}$| with diameter  |$\ell =\mathrm{diam}(\mathscr{X})>0$|⁠. Then |$\mathscr{X}^{2}:=\mathscr{X}\times \mathscr{X}$| is a compact set in |$\mathbb{R}^{2d}$| with diameter |$\sqrt{2}\ell $|⁠. Additionally, the second moment of distribution |$\varLambda $|⁠, that is, |$\sigma _{\varLambda }^{2}:={\mathbb{E}}_{\omega \sim \varLambda }\|\omega \|_{2}^{2}=\text{tr}(\nabla ^{2} \kappa (0))$| exists where |$\nabla ^{2} \kappa $| is the Hessian matrix of |$\kappa $| in (1.1). We will need the following results in order to obtain a uniform bound of term (I) over |$\mathscr{X}$|⁠, via an |$\varepsilon $|-net argument.

 

Lemma D.4.
[12] Let |$B_{2}^{d}(\eta ):=\{x\in \mathbb{R}^{d}: \|x\|_{2}\leq \eta \}$|⁠. Then the covering number |$\mathscr{N}(B_{2}^{d}(\eta ),\varepsilon )$| satisfies

 

Lemma D.5.
(Jung’s theorem [13]) Let |$K\subseteq \mathbb{R}^{d}$| be compact with |$\mathrm{diam}(K)>0$|. Then |$K$| is contained in a closed ball with radius |$R\leq \mathrm{diam}(K)\sqrt{\frac{d}{2(d+1)}}.$|
The boundary case of equality is attained by the regular |$n$|-simplex.

We can now prove the following theorem controlling term (I).

 

Theorem D.3.
Let |$\varepsilon , \eta _{1}>0$|⁠. Then
and

 

Proof.
Indeed, the following proof techniques are independent of the choice of row vector |$v$| in |$\widetilde{V}_{\varSigma \varDelta }$|⁠. So we only prove the case related to |$\widetilde{V}_{\varSigma \varDelta }$| and everything works for |$\widetilde{V}_{\beta }$| by replacing |$v$| with |$v_{\beta }$|⁠. Let
Recall that |${\mathbb{E}}(s(x,y))=k(x,y)$| and
where |$S_{1}(x,y),\ldots , S_{p}(x,y)$| are i.i.d. with
According to Lemma D.5, |$\mathscr{X}^{2}\subseteq \mathbb{R}^{2d}$| is enclosed in a closed ball with radius |$\ell \sqrt{\frac{2d}{2d+1}}$|⁠. By Lemma D.4, one can cover |$\mathscr{X}^{2}$| using an |$\eta _{1}$|-net with at most |$\Bigl (\frac{4\ell }{\eta _{1}}\sqrt{\frac{2d}{2d+1}}\Bigr )^{2d}\leq T_{1}:= \bigl (\frac{4\ell }{\eta _{1}}\big )^{2d}$| balls of radius |$\eta _{1}$|⁠. Let |$c_{i}=(x_{i}, y_{i})$| denote their centres for |$1\leq i\leq T_{1}$|⁠.
For |$1\leq l\leq d$| we have
Then
Since |$\Bigl |\frac{\partial s}{\partial x_{l}}(x,y)\Big | $| is dominated by an integrable function, one can interchange expectations and partial derivatives. In particular,
and similarly
It follows that
(D7)
Let |$L_{f}=\|\nabla f(x^{*}, y^{*})\|_{2}$| be the Lipschitz constant, where |$(x^{*},y^{*})=\text{argmax}_{(x,y)\in \mathscr{X}^{2}}\|\nabla f(x,y) \|_{2}$|. Applying the law of total expectation and (D.7) gives
(D8)
Note that
By the Cauchy–Schwarz inequality and the fact that |$\|v\|_{1}\leq \sqrt{\lambda }\|v\|_{2}$|, we have
Then
and similarly
Plugging the above results into (D.8) shows
Let |$\varepsilon>0$|⁠. Markov’s inequality implies
(D9)
By union bound and Lemma D.2, we get
(D10)
If |$|f(c_{i})|<\varepsilon /2$| for all |$i$| and |$L_{f}<\varepsilon /(2\eta _{1})$|, then |$|f(x,y)|<\varepsilon $| for all |$(x,y)\in \mathscr{X}^{2}$|. It follows immediately from (D.9) and (D.10) that

D.3 Upper bound of (II) and (III)

By symmetry it suffices to bound (II) in (D.1), (D.2) and the same upper bound holds for (III).

 

Theorem D.4.
Let |$\varepsilon ,\eta _{2}>0$|⁠. Then we have
and
where |$c(K,r)$| and |$c(K,\beta )$| are upper bounds of the |$\ell _{\infty }$| norm of state vectors in Proposition A.1 and Proposition A.2, respectively.

 

Proof.
(i) We first prove the case associated with |$\widetilde{V}_{\varSigma \varDelta }$|. Since |$\mathrm{diam}(\mathscr{X})=\ell $|, by Lemma D.4 and Lemma D.5, one can cover |$\mathscr{X}$| using an |$\eta _{2}$|-net with at most |$T_{2}:=\big ( \frac{2\sqrt{2}\ell }{\eta _{2}}\big )^{d}$| balls of radius |$\eta _{2}$|. Let |$x_{k}$| denote their centres for |$1\leq k\leq T_{2}$|. For |$x\in \mathbb{R}^{d}$|, define
where |$g_{i}(x):=\sum _{j=1}^{\lambda } v_{j}\cos (\omega _{(i-1)\lambda +j}^{\top } x+\xi _{(i-1)\lambda +j})$|⁠. By the triangle inequality, we have
where |$x^{*}_{i} =\text{argmax}_{x\in \mathscr{X}}\|\nabla g_{i}(x) \|_{2}$| and |$L_{g}:=\frac{1}{p\|v\|_{2}^{2}} \sum _{i=1}^{p} \|\nabla g_{i}(x_{i}^{*})\|_{2}$|⁠. It follows that
Applying the Cauchy–Schwarz inequality gives
Taking expectation on both sides leads to
Let |$\varepsilon>0$|⁠. Markov’s inequality implies
(D11)
By union bound and Lemma D.3, we get
(D12)
If |$|g(x_{i})|<\varepsilon /2$| for all |$i$| and |$L_{g}<\varepsilon /(2\eta _{2})$|, then |$|g(x)|<\varepsilon $| for all |$x\in \mathscr{X}$|. It follows immediately from (D.11) and (D.12) that
(D13)
Because |$\|V_{\varSigma \varDelta }D^{r}\|_{\infty }=2^{r}$| and |$\|u_{y}\|_{\infty }\leq c(K, r):=c(K,1,r)$| in Proposition A.1, we have
Therefore, one can get
(ii) By repeating the argument leading to (D.13) with |$V_{\varSigma \varDelta }$| replaced by |$V_{\beta }$|, one can get
(D14)
Due to |$\|V_{\beta } H\|_{\infty } = \beta ^{-\lambda }$| and |$\|u_{y}\|_{\infty }\leq c(K,\beta )$| in Proposition A.2, we get
(D15)
It follows from (D.14) and (D.15) that

D.4 Upper bound of (IV)

 

Theorem D.5.

Let |$r\in \mathbb{N}^{+}$| and |$\beta \in (1,2)$|⁠.

  1. If |$u_{x}$|⁠, |$u_{y}$| are state vectors of the |$\varSigma \varDelta $| quantizer |$Q_{\varSigma \varDelta }^{(r)}$|⁠, then
    where |$c(K,r)$| is the upper bound of the |$\ell _{\infty }$| norm of state vectors in Proposition A.1 and |$c(r)>0$| is a constant related to |$r$|⁠.
  2. If |$u_{x}$|⁠, |$u_{y}$| are state vectors of the noise-shaping quantizer |$Q_{\beta }$|⁠, then
    where |$c(K,\beta )$| is the upper bound of the |$\ell _{\infty }$| norm of state vectors in Proposition A.2.

 

Proof.
(i) The Cauchy–Schwarz inequality implies
One can easily verify that |$V_{\varSigma \varDelta }D^{r}$| is a sparse matrix such that each row has at most |$r+1$| non-zero entries |$\{w_{0},w_{1},\ldots ,w_{r} \}$| of the following form:
Since |$\max \{\|u_{x}\|_{\infty },\|u_{y}\|_{\infty }\}\leq c(K,r)$| as indicated by Proposition A.1, we have |$\|V_{\varSigma \varDelta }D^{r} u_{x}\|_{2}\leq c(K, r)c(r)\sqrt{p}$| and |$\|V_{\varSigma \varDelta }D^{r} u_{y}\|_{2}\leq c(K,r)c(r)\sqrt{p}$|. So the above inequality becomes
where the last inequality is due to |$\|v\|_{2}^{2}\geq \lambda ^{2r-1}r^{-2r}$|⁠.
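Concretely, this is a worked restatement of the chain above, assuming the normalization |$\widetilde{V}_{\varSigma \varDelta }=\frac{\sqrt{2}}{\sqrt{p}\|v\|_{2}}V_{\varSigma \varDelta }$| (as in the proof of Lemma E.5): |$\bigl |\langle \widetilde{V}_{\varSigma \varDelta }D^{r}u_{x},\widetilde{V}_{\varSigma \varDelta }D^{r}u_{y}\rangle \bigr |\leq \frac{2}{p\|v\|_{2}^{2}}\|V_{\varSigma \varDelta }D^{r}u_{x}\|_{2}\|V_{\varSigma \varDelta }D^{r}u_{y}\|_{2}\leq \frac{2c(K,r)^{2}c(r)^{2}}{\|v\|_{2}^{2}}\leq \frac{2c(K,r)^{2}c(r)^{2}r^{2r}}{\lambda ^{2r-1}}$|.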
(ii) In the case of noise-shaping quantization, similarly, we have
Note that |$V_{\beta } H=(I_{p}\otimes v_{\beta })(I_{p}\otimes H_{\beta }) =I_{p}\otimes (v_{\beta } H_{\beta })$| with |$v_{\beta } H_{\beta }=(0,0,\ldots ,0,\beta ^{-\lambda })\in \mathbb{R}^{1\times \lambda }$|⁠, and |$\max \{\|u_{x}\|_{\infty },\|u_{y}\|_{\infty }\}\leq c(K,\beta )$| by Proposition A.2. It follows that |$\|V_{\beta } Hu_{x}\|_{2}\leq \beta ^{-\lambda }\sqrt{p}\|u_{x}\|_{\infty }$| and |$\|V_{\beta } Hu_{y}\|_{2}\leq \beta ^{-\lambda }\sqrt{p}\|u_{y}\|_{\infty }$|⁠. Then one can get
where the last inequality comes from |$\|v_{\beta } \|_{2}\geq \beta ^{-1}$|⁠.
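For reference, with the standard choice |$v_{\beta }=(\beta ^{-1},\beta ^{-2},\ldots ,\beta ^{-\lambda })$| (an assumption consistent with |$\|v_{\beta }\|_{\infty }=\beta ^{-1}$| used in Appendix D.5 and with |$v_{\beta } H_{\beta }=(0,\ldots ,0,\beta ^{-\lambda })$| above), one has |$\|v_{\beta }\|_{2}^{2}=\sum _{j=1}^{\lambda }\beta ^{-2j}\geq \beta ^{-2}$|.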

D.5 Proof of Theorem 3.1

 

Proof.

Recall that the kernel approximation errors in (D.1) and (D.2) can be bounded by four terms: (I), (II), (III), (IV).

(i) For the |$\varSigma \varDelta $| scheme, in Theorem D.3, one can choose |$\varepsilon =O(\sqrt{p^{-1}\log p})$|, |$\lambda =O(\sqrt{p\log ^{-1} p})$| and |$\eta _{1}=O(p^{-2-\alpha })$| with |$\alpha>0$|. Moreover, since |$\|v\|_{2}^{2}\geq \lambda ^{2r-1}r^{-2r}$| and |$\|v\|_{\infty }=O(\lambda ^{r-1})$| (see Lemma 4.6 in [21]), in Theorem D.4, we can choose |$\varepsilon =O(c(K,r)\lambda ^{-r+1}\log ^{1/2}p)$| and |$\eta _{2}=O(\lambda ^{-r-1}\log ^{1/2}p)$|. Then (3.1) follows immediately by combining the above results with part (1) of Theorem D.5.

(ii) As for the noise-shaping scheme, in Theorem D.3, we choose the same parameters as in part (i): |$\varepsilon =O(\sqrt{p^{-1}\log p})$|, |$\lambda =O(\sqrt{p\log ^{-1} p})$| and |$\eta _{1}=O(p^{-2-\alpha })$| with |$\alpha>0$|. However, since |$\|v_{\beta }\|_{2}^{2}\geq \beta ^{-2}$| and |$\|v_{\beta }\|_{\infty }=\beta ^{-1}$|, we set different values |$\varepsilon =O(c(K,\beta )\beta ^{-\lambda +1}\sqrt{p})$| and |$\eta _{2}=O(p^{-1})$| in Theorem D.4. Therefore, (3.2) holds by applying the above results and part (2) of Theorem D.5.

E. Proof of Theorem 3.3

The structure of the proof of Theorem 3.3 closely follows the methods used in [40]. We start with some useful lemmata that aid in proving Theorem 3.3.

Given a |$b$|-bit alphabet as in (1.4) with |$b=\log _{2}(2K)$|⁠, we consider the following first-order |$\varSigma \varDelta $| quantization scheme for a random Fourier feature vector |$z(x) \in [-1,1]^{m}$| corresponding to a data point |$x \in \mathbb{R}^{d}$|⁠, where the state variable |$(u_{x})_{0}$| is initialized as a random number, i.e.

The corresponding recurrence equation can be written as
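For concreteness, here is a minimal Python sketch of this recursion (our own illustrative helper, not code from the paper; the function name sigma_delta_quantize is ours), assuming the greedy rule that rounds |$z_{k}+u_{k-1}$| to the nearest element of the |$b$|-bit alphabet in (1.4):

    import numpy as np

    def sigma_delta_quantize(z, b, rng=None):
        # First-order Sigma-Delta quantization of z in [-1, 1]^m with a b-bit alphabet.
        # Illustrative sketch: greedy rounding of z_k + u_{k-1} to the nearest element of
        # A = {a/(2^b - 1) : a = +-1, +-3, ..., +-(2^b - 1)}, with the initial state u_0
        # drawn uniformly from [-1/(2^b - 1), 1/(2^b - 1)] as in Lemma E.1 below.
        rng = np.random.default_rng() if rng is None else rng
        delta = 1.0 / (2 ** b - 1)
        alphabet = delta * np.arange(-(2 ** b - 1), 2 ** b, 2)  # odd multiples of delta
        u = rng.uniform(-delta, delta)                          # random initial state u_0
        q = np.empty(len(z))
        states = np.empty(len(z))
        for k, zk in enumerate(z):
            w = zk + u                                          # z_{k+1} + u_k
            q[k] = alphabet[np.argmin(np.abs(alphabet - w))]    # quantize to the alphabet
            u = w - q[k]                                        # u_{k+1} = u_k + z_{k+1} - q_{k+1}
            states[k] = u
        return q, states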

 

Lemma E.1.
Given the following first-order Sigma-Delta quantization scheme with a |$b$|-bit alphabet as in (1.4), for a vector |$z \in [-1,1]^{m}$|,
for each |$k=0,1,\cdots m-1$|⁠, we have |$u_{k} \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$|⁠.

 

Proof.

Let the inductive hypothesis be |$u_{k} \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$|⁠. Note that this is true by definition for |$u_{0}$|⁠.

Case:  |$\frac{j}{2^{b}-1} \leq z_{k+1} \leq \frac{j+1}{2^{b}-1}$|⁠, where |$j \in \{ 1, 3,\cdots 2^{b}-3\}$|⁠.

|$u_{k} \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$| implies that |$z_{k+1}+u_{k} \sim U\left [-\frac{1}{2^{b}-1} + z_{k+1},\frac{1}{2^{b}-1} + z_{k+1}\right ]$|⁠. Since by assumption, |$\frac{j}{2^{b}-1} \leq z_{k+1} \leq \frac{j+1}{2^{b}-1}$| we see that |$z_{k+1}+u_{k} \in [\frac{j-1}{2^{b}-1},\frac{j+2}{2^{b}-1}]$| and thus
which in turn implies that
Now we can compute the CDF of |$u_{k+1}$| (conditioned on |$z$|⁠) as follows:
Note that in the third equality we use the fact that |$\frac{j}{2^{b}-1} \leq z_{k+1} \leq \frac{j+1}{2^{b}-1}$| implies |$ \frac{j-1}{2^{b}-1} - z_{k+1} \leq -\frac{1}{2^{b}-1}$|⁠. Now note that
and
Thus
which shows that |$u_{k+1} \: | \: z \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$|⁠.

Showing that |$u_{k+1} \: | \: z \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$| in the remaining cases, namely |$\frac{j}{2^{b}-1} \leq z_{k+1} \leq \frac{j+1}{2^{b}-1}$| with |$j \in \{ 0, 2,\cdots, 2^{b}-2\}$|, |$-\frac{j+1}{2^{b}-1} \leq z_{k+1} \leq -\frac{j}{2^{b}-1}$| with |$j \in \{ 1, 3,\cdots, 2^{b}-3\}$|, and |$-\frac{j+1}{2^{b}-1} \leq z_{k+1} \leq -\frac{j}{2^{b}-1}$| with |$j \in \{ 0, 2,\cdots, 2^{b}-2\}$|, follows from a similar argument and is omitted for brevity. Thus, by induction, we have shown that |$u_{k} \: | \: z \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$| for every |$k$|.
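As a quick numerical sanity check of the state boundedness used in the induction above (using the illustrative sigma_delta_quantize sketch given earlier):

    rng = np.random.default_rng(0)
    z = rng.uniform(-1.0, 1.0, size=100_000)        # synthetic input in [-1, 1]^m
    b = 2
    q, states = sigma_delta_quantize(z, b, rng=rng)
    delta = 1.0 / (2 ** b - 1)
    assert np.max(np.abs(states)) <= delta + 1e-12  # every u_k stays in [-delta, delta]

Note that the distributional statement of Lemma E.1 concerns the law of |$u_{k}$| over the randomness of the initial state (conditioned on |$z$|), so verifying it empirically would require many independent runs at a fixed index |$k$|.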

For the subsequent sections, we adopt the following notation. Let |$A$| be the matrix whose rows are the vectors |$\{ (\tilde{V}z(x))^{T}\}_{x}$|, |$B$| be the matrix whose rows are the vectors |$\{ (\tilde{V}D^{r}u_{x})^{T}\}_{x}$| and |$C$| be the matrix whose first column is |$\frac{\sqrt{2}}{\sqrt{p}\|v\|_{2}} (u_{0}^{x_{1}} \: u_{0}^{x_{2}} \: \cdots \: u_{0}^{x_{n}})^{T}$| and whose other columns are zero. Let the columns of |$A, B, C$| be denoted by |$A_{i}, B_{i}, C_{i}$|, respectively. Now the corresponding approximation to the kernel can be written as
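In this notation, the quantized kernel approximation is |$\hat{K}_{\varSigma \varDelta }=(A-B-C)(A-B-C)^{T}$| (this identification is also used in the proof of Lemma E.5, where |$A_{i}-B_{i}-C_{i}$| is the |$i$|-th column of the matrix whose rows are |$\{(\tilde{V}Q(z(x)))^{T}\}_{x}$|).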

 

Lemma E.2.
where each |$\varLambda _{i}$| is a diagonal matrix with positive diagonal entries, |$\delta> 0$| and |$I$| is the identity matrix.

 

Proof.
We begin by noting that
where we have used the result from Lemma D.1 that |$\mathbb{E}[AA^{T}] = K$|. Now, let |$F:= \tilde{V}D^{r} \in \mathbb{R}^{p \times m}$| and |$\{f_{i}^{T}\}_{i=1}^{p}$| denote the rows of |$F$|. Then let
where |$B_{i}$| is the |$i$|-th column of |$B$|, |$\{x_{1}, \cdots x_{n}\}$| represent the individual data samples and |$u_{x_{j}} \in \mathbb{R}^{m}$| for each |$j = 1, \cdots , n$|. Note that the vectors |$u_{x_{j}}$| (when conditioned on |$Z$|) are mutually independent across data points |$x_{j}$|, with entries distributed as |$U[-1/(2^{b}-1),1/(2^{b}-1)]$|. Thus,
By a similar argument, |$\mathbb{E}[B_{i}A_{i}^{T}] = 0$| and |$\mathbb{E}[\sum _{i = 1}^{p} A_{i}C_{i}^{T}] = \mathbb{E}[\sum _{i = 1}^{p} C_{i}A_{i}^{T}] = 0$|⁠. Now,
First we note that the vectors |$u_{x_{j}}$| (when conditioned on |$Z$|) are mutually independent across data points |$x_{j}$|, with entries distributed as |$U[-1/(2^{b}-1),1/(2^{b}-1)]$|, and thus |$\mathbb{E}[BB^{T}]$| is a diagonal matrix. Then note that each row of |$V_{\varSigma \varDelta }D$| has at most two non-zero entries, which are |$\pm 1$|. Thus,
Further, since |$r=1$|, |$\|v\|_{2}^{2} = \lambda $|, which implies |$|f_{i}^{T}u_{x_{j}}| \leq \frac{2^{3/2}}{\sqrt{p\lambda }(2^{b}-1)}$| for each |$j$|. Thus, the diagonal matrix |$\mathbb{E}[B_{i}B_{i}^{T}]$| satisfies |$\mathbb{E}[B_{i}B_{i}^{T}] \preccurlyeq \frac{8}{p\lambda (2^{b}-1)^{2}} I$|, which in turn implies |$\mathbb{E}[BB^{T}] \preccurlyeq \frac{8}{\lambda (2^{b}-1)^{2}} I$|. Now,
Thus, by similar reasoning as in prior paragraphs, |$\mathbb{E}[B_{1}C_{1}^{T}]$| is a diagonal matrix and also
and thus
So |$\mathbb{E}[-B_{1}C_{1}^{T}] \preccurlyeq \frac{4}{p\lambda (2^{b}-1)^{2}} I$|⁠. Similarly, |$\mathbb{E}[-C_{1}B_{1}^{T}] \preccurlyeq \frac{4}{p\lambda (2^{b}-1)^{2}} I$|⁠. Now,
and thus |$\mathbb{E}[C_{1}C_{1}^{T}] \preccurlyeq \frac{2r^{2r}}{3(2^{b}-1)^{2}}\cdot \frac{1}{p\lambda ^{2r-1}}\, I$|. Putting together the bounds for each of the terms, we get
where |$\varLambda : = \mathbb{E}[BB^{T}] - \mathbb{E}[B_{1}C_{1}^{T}] - \mathbb{E}[C_{1}B_{1}^{T}] + \mathbb{E}[C_{1}C_{1}^{T}]$| is a diagonal matrix and
Thus |$\varLambda \preccurlyeq \delta I$|, where
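For |$r=1$| (so that |$\|v\|_{2}^{2}=\lambda $|), summing the four diagonal bounds above recovers the value of |$\delta $| stated in Theorem E.2: |$\delta = \frac{8}{\lambda (2^{b}-1)^{2}} + \frac{4}{p\lambda (2^{b}-1)^{2}} + \frac{4}{p\lambda (2^{b}-1)^{2}} + \frac{2}{3p\lambda (2^{b}-1)^{2}} = \frac{8 + \frac{26}{3p}}{\lambda (2^{b}-1)^{2}}$|.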

 

Lemma E.3.
[40] Let |$\eta>0$|, and let |$K$| and |$\hat{K}$| be symmetric positive semi-definite matrices. Then
where, |$M: = (K + \eta I)^{-1/2}$|⁠.

 

Proof.
The proof is obtained using the following sequence of equivalent statements.
Note that the assumptions made on |$K,\hat{K}$| and |$\eta $| imply that |$\hat{K} + \eta I$| is invertible and also |$(K + \eta I)^{-1/2}$| exists.

 

Lemma E.4.
[40] Let |$0 \preccurlyeq \varLambda \preccurlyeq \delta I$|, where |$\delta>0$|. Also let |$\eta>0$|, let |$K$| and |$\hat{K}$| be symmetric positive semi-definite matrices, and let |$M:=(K + \eta I)^{-1/2}$|. Then

 

Proof.
We begin by noting that |$0 \preccurlyeq M\varLambda M$| since |$M$| is invertible, |$0 \preccurlyeq \varLambda $| and, for all |$x \neq 0$|, |$x^{T}M \varLambda Mx = (Mx)^{T} \varLambda (Mx)\geq 0$|. Thus
Also, note that |$\|M\|_{2}^{2} = \|M^{2}\|_{2} = \|(K+\eta I)^{-1}\|_{2}$| (⁠|$M$| and |$M^{2}$| are symmetric) and |$\|M \varLambda M\| \leq \| \varLambda \|\|(K+\eta I)^{-1}\|$|⁠. Also since |$0 \preccurlyeq K$| (positive semi-definite kernel), we have that |$\|(K+\eta I)^{-1}\|_{2} \leq \frac{1}{\eta }$|⁠. Hence, we get

 

Theorem E.1.
(Matrix–Bernstein inequality [40]) Consider a finite sequence |$\{S_{i}\}$| of random Hermitian matrices of the same size and assume that
Let |$S = \sum _{i} S_{i}$| and |$\mathbb{E}[S^{2}] \preccurlyeq W$|⁠, i.e. |$W$| is a semi-definite upper bound for the second moment of |$S$|⁠. Then, for |$t \geq 0$|⁠,

Recall our notation: |$A$| is the matrix whose rows are the vectors |$\{ (\tilde{V}z(x))^{T}\}_{x}$|, |$B$| is the matrix whose rows are the vectors |$\{ (\tilde{V}D^{r}u_{x})^{T}\}_{x}$| and |$C$| is the matrix whose first column is |$\frac{\sqrt{2}}{\sqrt{p}\|v\|_{2}} (u_{0}^{x_{1}}, u_{0}^{x_{2}}, \cdots , u_{0}^{x_{n}})^{T}$| and whose other columns are zero. Also the columns of |$A, B, C$| are denoted by |$A_{i}, B_{i}, C_{i}$|, respectively. Additionally let |$K_{i}: = \mathbb{E}[A_{i}A_{i}^{T}]$|, |$M:= (K + \eta I_{n})^{-1/2}$| and

(E.1)

Note that, by design, |$\mathbb{E}[S_{i}] = 0$|. We will now show that the remaining assumptions required to apply the Matrix–Bernstein inequality hold for the sequence of matrices |$\{S_{i}\}_{i=1}^{p}$|.

 

Lemma E.5.
The |$2$|-norm of |$S_{i}$| (defined in (E.1)) is bounded for each |$i=1, \cdots , p$| and |$\mathbb{E}[S^{2}]$| has a semi-definite upper bound, where |$S = \sum _{i} S_{i}$|. In particular,
where |$n$| is the number of data samples, |$\eta $| is the regularization parameter, |$\lambda $| and |$p$| are the |$\varSigma \varDelta $| parameters with |$m=\lambda p$|, and |$\tilde{M}:= M(K+\varLambda )M^{T}$|.

 

Proof.
(i) |$\lambda _{max}(S_{i})$| is bounded. Let |$u_{i}:= M(A_{i}-B_{i}-C_{i})$|⁠, then |$S_{i} = u_{i}u_{i}^{T} - \mathbb{E}[u_{i}u_{i}^{T}]$|⁠. First note that
Also, |$A_{i}-B_{i}-C_{i}$| is the |$i$|-th column of the matrix which has as its rows the vectors |$\{\tilde{V}Q(z(x))^{T}\}_{x}$|⁠. Thus, in general
where |$g_{i}^{T}$| denotes the |$i$|-th row of |$\tilde{V}$|⁠. Also note that the entries of |$Q(z(x))$| are in |$\mathscr{A} = \Bigl \{\frac{a}{2K-1}\, \Big |\, a= \pm 1,\pm 3, \ldots , \pm (2K-1) \Bigr \}$|⁠. Thus,
Note that |$\tilde{V} = \frac{\sqrt{2}}{\sqrt{p}\| v\|_{2}}(I_{p} \otimes v)$|⁠, where for |$r=1$|⁠, |$v \in \mathbb{R}^{\lambda }$| is the vector of all ones, which implies |$\|\tilde{V}\|_{\infty } = \frac{\sqrt{2\lambda }}{\sqrt{p}}$|⁠. Thus,
Further, since by definition |$M = (K+\eta I)^{-1/2}$| we have
Thus we see that
So |$\|S_{i}\| \leq 2l$| implies |$\lambda _{max}(S_{i}) \leq 2l$|⁠. Note that |$K_{i}$| and |$ \varLambda _{i}$| are expectations of symmetric matrices and thus |$S_{i}$| is symmetric.
(ii) |$\mathbb{E}[S^{2}]$| has a semi-definite upper bound.
Now,
where |$\tilde{M}:= M(K+ \varLambda )M^{T}$| and thus |$\mathbb{E}[S^{2}] \preccurlyeq l\tilde{M}$|⁠.

Now we are in a position to prove Theorem 3.3 of the main text, which we restate for convenience.

 

Theorem E.2.
Let |$\hat{K}_{\varSigma \varDelta }$| be an approximation of a true kernel matrix |$K$| using |$m$|-feature first-order |$\varSigma \varDelta $| quantized RFFs (as in (3.3)) with a |$b$|-bit alphabet (as in (1.4)) and |$m=\lambda p$|. Then, given |$\varDelta _{1} \geq 0, \varDelta _{2} \geq \frac{\delta }{\eta }$|, where |$\eta>0$| represents the regularization parameter and |$\delta = \frac{8 + \frac{26}{3p}}{\lambda (2^{b}-1)^{2}}$|, we have
where |$l = \frac{2n\lambda }{p\eta ^{2}}$|⁠.

 

Proof.
We apply the Matrix–Bernstein inequality (Theorem E.1) to |$\{S_{i}\}_{i=1}^{p}$| (defined in (E.1)) to obtain that, given |$t_{2} \geq 0$|,
Now, since |$\lambda _{max}(S) = -\lambda _{min}(-S)$|⁠, by repeating an identical argument for |$-S$| we obtain that given |$t_{1} \geq 0$|⁠,
Putting the above two equations together with the fact that |$M = (K + \eta I_{n})^{-1/2}$| we obtain that for |$t_{1}, t_{2} \geq 0$|⁠,
Thus, by lemmas E.2, E.3, E.4 and E.5, for the |$\varSigma \varDelta $|-quantized RFF kernel |$\hat{K}_{\varSigma \varDelta }$|⁠, given |$\varDelta _{1} \geq 0, \varDelta _{2} \geq \frac{\delta }{\eta }$| we have the following spectral approximation result:
where |$\tilde{M} = M(K+ \varLambda )M$|⁠, |$M = (K + \eta I_{n})^{-1/2}$|⁠, |$l=\frac{2n\lambda }{p\eta ^{2}}$| is the upper bound for |$\|u_{i}u_{i}^{T}\|$| computed in lemma E.5 and |$\delta = \frac{8 + \frac{26}{3p}}{\lambda (2^{b}-1)^{2}}$| is the bound computed in lemma E.2 such that |$\mathbb{E}[\hat{K}_{\varSigma \varDelta }] \preccurlyeq K + \delta I$|⁠. Now, note that
Also, since |$\tilde{M}$| is positive semi-definite, we know that |$\|\tilde{M}\|_{2}= \lambda _{max}(\tilde{M})$| and thus |$\text{tr}(\tilde{M}) \leq \text{rank}(\tilde{M})\|\tilde{M}\|_{2}$|, which implies |$\frac{\text{tr}(\tilde{M})}{\|\tilde{M}\|_{2}} \leq \text{rank}(\tilde{M}) \leq n$|. Hence,
