Jinjie Zhang, Harish Kannan, Alexander Cloninger, Rayan Saab, Sigma-Delta and distributed noise-shaping quantization methods for random Fourier features, Information and Inference: A Journal of the IMA, Volume 13, Issue 1, March 2024, iaad052, https://doi.org/10.1093/imaiai/iaad052
Abstract
We propose the use of low bit-depth Sigma-Delta and distributed noise-shaping methods for quantizing the random Fourier features (RFFs) associated with shift-invariant kernels. We prove that our quantized RFFs—even in the case of |$1$|-bit quantization—allow a high-accuracy approximation of the underlying kernels, and the approximation error decays at least polynomially fast as the dimension of the RFFs increases. We also show that the quantized RFFs can be further compressed, yielding an excellent trade-off between memory use and accuracy. Namely, the approximation error now decays exponentially as a function of the bits used. The quantization algorithms we propose are intended for digitizing RFFs without explicit knowledge of the application for which they will be used. Nevertheless, as we empirically show by testing the performance of our methods on several machine learning tasks, our method compares favourably with other state-of-the-art quantization methods.
1. Introduction
Kernel methods have long been demonstrated to be effective in various machine learning applications, cf. [32, 33]. Given a dataset |$\mathscr{X}\subset \mathbb{R}^{d}$| with |$|\mathscr{X}|=N$|, kernel methods implicitly map data points to a high-dimensional, possibly infinite-dimensional, feature space |$\mathscr{H}$| by |$\phi :\mathscr{X}\rightarrow \mathscr{H}$|. However, instead of working directly in that space, one preserves the inner products between feature embeddings via a kernel function |$k(x,y):= \langle \phi (x),\phi (y)\rangle _{\mathscr{H}}$|. Nevertheless, in cases where |$N$| is large, using nonlinear kernels for applications like, say, support vector machines (SVMs) and logistic regression requires the expensive computation of the |$N\times N$| Gram matrix of the data [24]. To overcome this bottleneck, one popular approach is to ‘linearize’ |$k$| using the random Fourier features (RFFs) originally proposed by [29], which in turn build on Bochner’s theorem [26]. Given a continuous, shift-invariant real-valued kernel |$k(x,y)=\kappa (x-y)$| with |$\kappa (0)=1$|, |$\kappa $| is the (inverse) Fourier transform of a probability measure |$\varLambda $| over |$\mathbb{R}^{d}$| and we have

$$k(x,y)=\kappa (x-y)=\int _{\mathbb{R}^{d}}\cos \big (\omega ^{\top }(x-y)\big )\,\mathrm{d}\varLambda (\omega ). \tag{1.1}$$
As an example, the radial basis function (RBF) kernel |$k(x,y)=\exp (-\|x-y\|_{2}^{2}/2\sigma ^{2})$| corresponds to the multivariate normal distribution |$\varLambda =\mathscr{N}(0,\sigma ^{-2}I_{d})$|. Following [29], for a target dimension |$m$|, the associated RFFs (without normalization) are

$$z(x):=\cos (\varOmega ^{T} x+\xi )\in \mathbb{R}^{m}, \tag{1.2}$$
where |$\varOmega :=(\omega _{1},\ldots ,\omega _{m})\in \mathbb{R}^{d\times m}$| is a random matrix generated as |$\omega _{j}\overset{iid}{\sim } \varLambda $| and |$\xi \in \mathbb{R}^{m}$| is a random vector with |$\xi _{j}\overset{iid}{\sim } U([0,2\pi ))$| for all |$j$|. Additionally, the identity |${\mathbb{E}}(\langle z(x),z(y)\rangle )=\frac{m}{2}k(x,y)$| implies that the inner product of the low-dimensional features |$\sqrt{\frac{2}{m}}z(x)$|, |$\sqrt{\frac{2}{m}}z(y)$| can approximate |$k(x,y)$| in kernel-based algorithms. Learning a linear model on the (normalized) RFFs then amounts to using the approximation

$$\widehat{k}_{\text{RFF}}(x,y):=\frac{2}{m}\langle z(x),z(y)\rangle \tag{1.3}$$
as a reference kernel during training. For instance, performing linear SVM or linear ridge regression on RFFs amounts to training a nonlinear kernel SVM or kernel ridge regression with |$\widehat{k}_{\text{RFF}}$|. It turns out that using RFFs in this way, with adjustable dimension |$m$|, can remarkably speed up training for large-scale data and alleviate the memory burden of storing the kernel matrix. As an additional and very important benefit, the entire kernel function |$k$| is approximated accurately, i.e. the approximation error |$|k(x,y)-\widehat{k}_{\text{RFF}}(x,y)|$| has been shown to be small, particularly when |$m$| is large, e.g. in [3–5, 30, 35, 37].
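As a quick numerical check of (1.2) and (1.3), the following sketch compares the RFF estimate with the exact RBF kernel. The bandwidth |$\sigma =1$|, the dimensions and the test points are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 5, 20000, 1.0          # assumed illustrative dimensions and bandwidth

# Omega has columns omega_j ~ N(0, sigma^{-2} I_d); xi_j ~ U([0, 2*pi)), as in (1.2).
Omega = rng.normal(scale=1.0 / sigma, size=(d, m))
xi = rng.uniform(0.0, 2.0 * np.pi, size=m)

def z(x):
    """Unnormalized random Fourier features z(x) = cos(Omega^T x + xi)."""
    return np.cos(Omega.T @ x + xi)

x, y = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)

k_exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))  # RBF kernel
k_rff = (2.0 / m) * (z(x) @ z(y))     # the estimator (1.3)

print(k_exact, k_rff)
```

With |$m=20000$| the deviation is on the order of |$m^{-1/2}$|, consistent with the error bounds cited above.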
The need for a large |$m$| to guarantee good generalization performance on large datasets [1, 25, 27, 38] provides an opportunity for further savings in memory usage. Rather than store the RFFs in full precision, quantization methods have been proposed to encode the RFFs (1.2) into a sequence of bits and subsequently approximate |$k(x,y)$| by taking inner products between quantized RFFs, thereby introducing a new level of approximation. One of our goals is to propose quantization techniques that favourably trade off approximation accuracy against the number of bits used.
1.1 Related work
To make the discussion more precise, let us start by defining the |$2K$|-level quantization alphabet that we use throughout as

$$\mathscr{A}:=\left \{\frac{2k-1}{2K-1}: k=-K+1,\ldots ,K\right \}, \tag{1.4}$$
and note that one can use |$b:=\log _{2}(2K)$| bits to represent each element of |$\mathscr{A}$|. The goal of quantization in the RFF context is to map |$z(x)=\cos (\varOmega ^{T} x+ \xi ) \in{\mathbb{R}}^{m} \mapsto q(x) \in \mathscr{A}^{m}$|. We will be interested in very small values of |$K$|, particularly |$K=1$|, which corresponds to very few bits per RFF sample.
It is natural to start our discussion of quantization methods with the simplest quantizer, namely memoryless scalar quantization (MSQ), which rounds each coordinate of the input vector |$z\in \mathbb{R}^{m}$| to the nearest element of |$\mathscr{A}$|. Specifically, |$Q_{\mathrm{MSQ}}:\mathbb{R}^{m}\rightarrow \mathscr{A}^{m}$| is defined by

$$\big (Q_{\mathrm{MSQ}}(z)\big )_{j}:=\mathop{\text{argmin}}_{v\in \mathscr{A}}|z_{j}-v|,\qquad j=1,\ldots ,m.$$
Moreover, by setting |$K=1$|, one obtains a binary embedding |$Q_{\mathrm{MSQ}}(z)=\text{sign}(z)$| with |$\mathscr{A}=\{-1,1\}$|, where |$\text{sign}$| acts element-wise. This yields the so-called |$1$|-bit universal quantizer [8, 31] for RFFs, which generates a distorted (biased) kernel

$$\widehat{k}_{\mathrm{MSQ}}(x,y):=\frac{1}{m}\langle \text{sign}(z(x)),\text{sign}(z(y))\rangle . \tag{1.5}$$
Although replacing the |$\text{sign}$| function in (1.5) by |$Q_{\mathrm{MSQ}}$| with |$K>1$| and renormalizing the inner product accordingly can alleviate the distortion, there are better choices in terms of approximation error. In [23], a Lloyd–Max quantization scheme is designed based on the MSQ where, rather than using the evenly spaced alphabet in (1.4), one constructs a specific alphabet for each |$K$|. Recently, with an eye towards asymmetric sensor network applications, an asymmetric semi-quantized scheme (SemiQ) was proposed in [31] and shown to be unbiased. It generates |$\widehat{k}_{s}(x,y)$|, the inner product between an unquantized RFF vector and a quantized one, i.e.
However, this asymmetric setting is restrictive for many kernel machines because it only applies at the inference stage; the model must still be trained on unquantized RFFs. Another unbiased quantization scheme resorts to injecting randomness into the quantization and is known as randomized rounding [40], or stochastic quantization (StocQ) [23]. Specifically, for each |$z\in \mathbb{R}$|, one chooses the two consecutive points |$s, t\in \mathscr{A}$| with |$z\in [s, t]$| and randomly assigns the quantization via |$\mathrm{P}(Q_{\mathrm{StocQ}}(z)=s) = \frac{t-z}{t-s}$|, |$\mathrm{P}(Q_{\mathrm{StocQ}}(z)=t) = \frac{z-s}{t-s}$|. It follows that
where |$Q_{\mathrm{StocQ}}$| operates on each component separately. Due to the Bernoulli sampling in |$Q_{\mathrm{StocQ}}$|, the quantization process involves additional randomness in each dimension of the RFFs, which leads to extra variance, especially in the case of binary embedding, i.e. |$b=1$|. Note that the distinct normalization factors in (1.6) and (1.7) are chosen such that the expectations of |$\widehat{k}_{s}$| and |$\widehat{k}_{\text{StocQ}}$| are equal to |$k(x,y)$|. Moreover, the kernel approximation error for |$\widehat{k}_{s}$| and |$\widehat{k}_{\text{StocQ}}$| is bounded by |$O(m^{-1/2})$| with high probability, see [31, 40].
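The randomized rounding rule above can be sketched in a few lines; a minimal illustration (our own, with an assumed |$K=2$|, i.e. |$b=2$| bits) that also checks the unbiasedness property |${\mathbb{E}}[Q_{\mathrm{StocQ}}(z)]=z$| empirically:

```python
import numpy as np

rng = np.random.default_rng(1)

K = 2                                                  # 2K-level alphabet, b = log2(2K) bits
# Evenly spaced alphabet on [-1, 1] as in (1.4): here [-1, -1/3, 1/3, 1].
A = (2.0 * np.arange(-K + 1, K + 1) - 1) / (2 * K - 1)

def stoc_q(z):
    """Stochastic quantizer: for z in [s, t], return t w.p. (z - s)/(t - s), else s."""
    idx = np.searchsorted(A, z)                        # A[idx-1] <= z <= A[idx]
    s, t = A[idx - 1], A[idx]
    p = (z - s) / (t - s)
    return np.where(rng.random(z.shape) < p, t, s)

z = rng.uniform(-1, 1, size=5)
draws = np.stack([stoc_q(z) for _ in range(20000)])
bias = np.abs(draws.mean(axis=0) - z).max()            # should be ~0 (unbiased)
print(bias)
```

The extra variance mentioned above is visible here: each draw deviates from |$z$| by up to a full alphabet step, and only the average is unbiased.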
1.2 Methods and contributions
We explore the use of |$\varSigma \varDelta $| [14, 15, 18] and distributed noise-shaping [9, 10] quantization methods on RFFs. These techniques, explicitly defined and discussed in Section 2 and Appendix A, yield superior performance to methods based on scalar quantization in contexts ranging from bandlimited function quantization [14, 18], to quantization of linear measurements [6, 7], of compressed sensing measurements [19], of nonlinear measurements [20] and even for binary embeddings that preserve (Euclidean) distances [21, 41]. It is therefore natural to wonder whether they can also yield superior performance in the RFF context. Let |$Q_{\varSigma \varDelta }^{(r)}$| be the |$r$|-th order |$\varSigma \varDelta $| quantizer, let |$Q_{\beta }$| be the distributed noise-shaping quantizer with |$\beta \in (1,2)$|, and let |$\widetilde{V}_{\varSigma \varDelta }$| and |$\widetilde{V}_{\beta }$| be their associated sparse condensation matrices defined in Section 2. Then our method approximates kernels via

$$\widehat{k}_{\varSigma \varDelta }^{(r)}(x,y):=\big \langle \widetilde{V}_{\varSigma \varDelta }Q_{\varSigma \varDelta }^{(r)}(z(x)),\ \widetilde{V}_{\varSigma \varDelta }Q_{\varSigma \varDelta }^{(r)}(z(y))\big \rangle \tag{1.8}$$
and

$$\widehat{k}_{\beta }(x,y):=\big \langle \widetilde{V}_{\beta }Q_{\beta }(z(x)),\ \widetilde{V}_{\beta }Q_{\beta }(z(y))\big \rangle . \tag{1.9}$$
Specifically, given large-scale data |$\mathscr{T}$| contained in a compact set |$\mathscr{X}\subset \mathbb{R}^{d}$|, we put forward Algorithm 1 to generate and store quantized RFFs such that one can subsequently use them for training and inference using linear models.
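The overall pipeline of (1.8) can be sketched end to end for the simplest case |$r=1$|, |$b=1$|. This is our own illustrative sketch: the parameters, the |$\sqrt{2/m}$| normalization of the condensation operator (chosen so that |${\mathbb{E}}\langle \widetilde{V}z(x),\widetilde{V}z(y)\rangle =k(x,y)$|) and the test points are assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(6)
d, lam, p, sigma = 3, 20, 500, 1.0
m = lam * p

Omega = rng.normal(scale=1.0 / sigma, size=(d, m))
xi = rng.uniform(0.0, 2.0 * np.pi, size=m)

def z(x):
    return np.cos(Omega.T @ x + xi)

def sd1(y):
    """First-order Sigma-Delta with the one-bit alphabet {-1, 1}."""
    q, s = np.empty_like(y), 0.0
    for i, yi in enumerate(y):
        q[i] = 1.0 if s + yi >= 0 else -1.0
        s += yi - q[i]
    return q

def condense(q):
    # V = I_p (x) (1, ..., 1): blockwise sums, with an assumed sqrt(2/m) normalization.
    return np.sqrt(2.0 / m) * q.reshape(p, lam).sum(axis=1)

x, y = 0.3 * rng.standard_normal(d), 0.3 * rng.standard_normal(d)
k_true = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

q_x, q_y = sd1(z(x)), sd1(z(y))
k_sd = condense(q_x) @ condense(q_y)      # the approximation (1.8), r = 1
print(k_true, k_sd)
```

Only the |$p$|-dimensional condensed vectors (and the one-bit streams behind them) need to be stored, which is the source of the memory savings discussed below.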
For illustration, Appendix B presents a pointwise comparison of the above kernel approximations on a synthetic toy dataset. A summary of our contributions is as follows.
We give the first detailed analysis of |$\varSigma \varDelta $| and distributed noise-shaping schemes for quantizing RFFs. Specifically, Theorem 3.1 provides a uniform upper bound for the errors |$|\widehat{k}^{(r)}_{\varSigma \varDelta }(x,y)-k(x,y)|$| and |$|\widehat{k}_{\beta }(x,y)-k(x,y)|$| over compact (possibly infinite) sets. Our analysis shows that the quantization error decays fast as |$m$| grows. Additionally, Theorem 3.3 provides spectral approximation guarantees for first-order |$\varSigma \varDelta $| quantized RFF approximation of kernels.
Our methods allow a further reduction in the number of bits used. Indeed, to implement (1.8) and (1.9) in practice, one would store and transmit the condensed bitstreams |$\widetilde{V}_{\varSigma \varDelta }Q_{\varSigma \varDelta }^{(r)}(z(x))$| or |$\widetilde{V}_{\beta } Q_{\beta }(z(x))$|. For example, since the matrices |$\widetilde{V}_{\varSigma \varDelta }$| are sparse and essentially populated by bounded integers, each sample can be represented by fewer bits, as summarized in Table 1.
We illustrate the benefits of our proposed methods in several numerical experiments involving kernel ridge regression (KRR), kernel SVM and two-sample tests based on maximum mean discrepancy (MMD) (all in Section 4). Our experiments show that |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are comparable with the semi-quantization scheme and outperform the other fully quantized methods mentioned above, both when we fix the number of RFF features |$m$| and when we fix the number of bits used to store each quantized RFF vector.
Table 1 Space complexity: bits needed to store each (quantized) RFF vector

| Method | RFFs | $Q_{\text{SemiQ}}$ | $Q_{\text{StocQ}}$ | $Q_{\varSigma \varDelta }^{(r)}$ | $Q_{\beta }$ |
|---|---|---|---|---|---|
| Memory | $32m$ | $32m$ | $mb$ | $O(p(b+r\log _{2}\lambda ))$ | $mb^{*}$ |

*This can be reduced to |$mb\log _{2}\beta $| for certain |$\beta $|.
2. Noise shaping quantization preliminaries
The methods we consider herein are special cases of noise-shaping quantization schemes (see, e.g. [11]). For a fixed alphabet |$\mathscr{A}$| and each dimension |$m$|, such a scheme is associated with an |$m \times m$| lower triangular matrix |$H$| with unit diagonal, and is given by a map |$Q: {\mathbb{R}}^{m} \to \mathscr{A}^{m}$| with |$y\mapsto q$| designed to satisfy |$y-q = H u$|. The scheme is called stable if |$\|u\|_{\infty } \leq C$|, where |$C$| is independent of |$m$|. Among these noise-shaping schemes, we will be interested in stable |$r{\text{th}}$|-order |$\varSigma \varDelta $| schemes |$Q_{\varSigma \varDelta }^{(r)}$| [15, 18] and distributed noise-shaping schemes |$Q_{\beta }$| [9, 10]. For example, in the case of |$\varSigma \varDelta $| with |$r=1$|, the entries |$q_{i}$|, |$i=1,\ldots ,m$|, of the vector |$q=Q_{\varSigma \varDelta }^{(1)}(y)$| are assigned iteratively via

$$q_{i}=Q_{\mathrm{MSQ}}(u_{i-1}+y_{i}),\qquad u_{i}=u_{i-1}+y_{i}-q_{i},\qquad u_{0}=0, \tag{2.1}$$
where |$Q_{\mathrm{MSQ}}(z)=\text{argmin}_{v\in \mathscr{A}}|z-v|$|. This yields the difference equation |$y-q=Du$|, where |$D$| is the first-order difference matrix given by |$D_{ij}=1$| if |$i=j$|, |$D_{ij}=-1$| if |$i=j+1$| and |$0$| otherwise. Stable |$\varSigma \varDelta $| schemes with |$r>1$| are more complicated to construct (see Appendix A), but satisfy

$$y-q=D^{r}u,\qquad \|u\|_{\infty }\leq C. \tag{2.2}$$
On the other hand, a distributed noise-shaping quantizer |$Q_{\beta }:\mathbb{R}^{m}\rightarrow \mathscr{A}^{m}$| converts the input vector |$y\in \mathbb{R}^{m}$| to |$q=Q_{\beta }(y)\in \mathscr{A}^{m}$| such that

$$y-q=Hu, \tag{2.3}$$
where, again, |$\|u\|_{\infty }\leq C$|. Here, denoting the |$p\times p$| identity matrix by |$I_{p}$| and the Kronecker product by |$\otimes $|, |$H$| is a block diagonal matrix defined as |$H:=I_{p}\otimes H_{\beta } \in \mathbb{R}^{m\times m}$|, where |$H_{\beta }\in \mathbb{R}^{\lambda \times \lambda }$| is given by |$({H_{\beta }})_{ij}=1$| if |$i=j$|, |$({H_{\beta }})_{ij}=-\beta $| if |$i=j+1$| and |$0$| otherwise. Defining |$\widetilde{H}:=I_{m}-H$|, one can implement the quantization step |$q=Q_{\beta }(y)$| via the following iterations for |$i=1,2,\ldots ,m$| [9, 10]:

$$q_{i}=Q_{\mathrm{MSQ}}\big (y_{i}+(\widetilde{H}u)_{i}\big ),\qquad u_{i}=y_{i}+(\widetilde{H}u)_{i}-q_{i}, \tag{2.4}$$
where |$Q_{\mathrm{MSQ}}(z)=\text{argmin}_{v\in \mathscr{A}}|z-v|$|. The stability of (2.4) is discussed in Appendix A. It is worth mentioning that since |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are sequential quantization methods, they cannot be implemented entirely in parallel; blocks of size |$\lambda $| can, however, still be run in parallel. Next, we adopt the definition of a condensation operator from [9, 21, 41].
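Before turning to condensation, the two recursions above can be checked numerically. The following sketch (our own, with the one-bit alphabet and illustrative |$\beta $|, |$\lambda $|) runs both quantizers on a random input and verifies the defining relations |$y-q=Du$| and |$y-q=Hu$|, which hold by construction; the stability bound |$\|u\|_{\infty }\leq 1$| is checked only for the first-order |$\varSigma \varDelta $| scheme.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([-1.0, 1.0])                      # one-bit alphabet (K = 1)
msq = lambda t: A[np.argmin(np.abs(A - t))]    # Q_MSQ: nearest alphabet element

def sigma_delta_1(y):
    """First-order Sigma-Delta: q_i = Q_MSQ(u_{i-1} + y_i), u_i = u_{i-1} + y_i - q_i."""
    q, u, s = np.empty_like(y), np.empty_like(y), 0.0
    for i, yi in enumerate(y):
        q[i] = msq(s + yi)
        s = s + yi - q[i]
        u[i] = s
    return q, u

def beta_quantizer(y, beta, lam):
    """Distributed noise shaping: within each length-lam block,
    q_i = Q_MSQ(y_i + beta*u_{i-1}), u_i = y_i + beta*u_{i-1} - q_i."""
    q, u = np.empty_like(y), np.empty_like(y)
    for i, yi in enumerate(y):
        carry = beta * u[i - 1] if i % lam else 0.0    # state resets at block starts
        q[i] = msq(yi + carry)
        u[i] = yi + carry - q[i]
    return q, u

y = rng.uniform(-1, 1, size=64)

q_sd, u_sd = sigma_delta_1(y)
Du = u_sd - np.concatenate(([0.0], u_sd[:-1]))         # (D u)_i = u_i - u_{i-1}
err_sd = np.abs(y - q_sd - Du).max()                   # check y - q = D u

beta, lam = 1.2, 8
q_b, u_b = beta_quantizer(y, beta, lam)
Hu = u_b.copy()
for i in range(1, len(y)):
    if i % lam:                                        # -beta on the subdiagonal of H_beta
        Hu[i] -= beta * u_b[i - 1]
err_b = np.abs(y - q_b - Hu).max()                     # check y - q = H u
print(err_sd, err_b, np.abs(u_sd).max())
```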
Definition 2.1 (|$\varSigma \varDelta $| condensation operator). Let |$p$|, |$r$|, |$\lambda $| be fixed positive integers such that |$\lambda =r\widetilde{\lambda }-r+1$| for some integer |$\widetilde{\lambda }$|. Let |$m=\lambda p$| and let |$v$| be the row vector in |$\mathbb{R}^{\lambda }$| whose entry |$v_{j}$| is the |$j$|-th coefficient of the polynomial |$(1+z+\ldots +z^{\widetilde{\lambda }-1})^{r}$|. Define the condensation operator |$V_{\varSigma \varDelta }\in \mathbb{R}^{p\times m}$| as |$V_{\varSigma \varDelta }:=I_{p}\otimes v.$|
For example, when |$r=1$|, |$\lambda =\widetilde{\lambda }$| and the vector |$v\in{\mathbb{R}}^{\lambda }$| is simply the all-ones vector, while when |$r=2$|, |$\lambda =2\widetilde{\lambda }-1$| and |$v=(1,2,\ldots , \widetilde{\lambda }-1, \widetilde{\lambda }, \widetilde{\lambda }-1,\ldots , 2,1)\in \mathbb{R}^{\lambda }$|. As we will see in the proof of Theorem 3.1, the quantization error is bounded by controlling |$\|V_{\varSigma \varDelta } D^{r} u\|_{2}$|. The specific construction of |$V_{\varSigma \varDelta }$| in Definition 2.1 accomplishes two things. First, it allows us to obtain favourable bounds on |$\|V_{\varSigma \varDelta } D^{r} u\|_{2}$|. Secondly, as |$V_{\varSigma \varDelta }$| is the Kronecker product of a vector having |$O(\lambda )$| non-zeros with the identity matrix, it is sparse, hence computationally efficient to apply. For analogous reasons, we define the condensation operator for |$Q_{\beta }$| as follows.
Definition 2.2 (Distributed noise-shaping condensation operator). Let |$p,\lambda $| be positive integers and fix |$\beta \in (1,2)$|. Let |$m=\lambda p$| and let |$v_{\beta }:=(\beta ^{-1}, \beta ^{-2}, \ldots , \beta ^{-\lambda })\in \mathbb{R}^{\lambda }$| be a row vector. Define the distributed noise-shaping condensation operator |$V_{\beta }\in \mathbb{R}^{p\times m}$| as |$V_{\beta }:=I_{p}\otimes v_{\beta }.$|
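Both condensation operators are straightforward to assemble as Kronecker products; a minimal sketch with illustrative sizes of our choosing (for |$r=2$|, the coefficient vector |$v$| is obtained by convolving the all-ones vector with itself |$r$| times):

```python
import numpy as np

p, lam_tilde, r = 3, 4, 2
lam = r * lam_tilde - r + 1          # lambda = r*lam_tilde - r + 1 = 7
m = lam * p

# v_j = j-th coefficient of (1 + z + ... + z^{lam_tilde - 1})^r, via repeated convolution.
v = np.array([1.0])
for _ in range(r):
    v = np.convolve(v, np.ones(lam_tilde))
V_sd = np.kron(np.eye(p), v)         # Sigma-Delta condensation operator (Definition 2.1)

beta = 1.5
v_beta = beta ** -np.arange(1, lam + 1)   # (beta^{-1}, ..., beta^{-lambda})
V_beta = np.kron(np.eye(p), v_beta)  # distributed noise-shaping operator

print(v, V_sd.shape, V_beta.shape)
```

For |$r=2$|, |$\widetilde{\lambda }=4$| this reproduces the triangular weights |$v=(1,2,3,4,3,2,1)$|, and each row of either operator has only |$\lambda $| non-zeros, reflecting the sparsity noted above.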
If |$\widetilde{V}$| is either of the two normalized matrices in (2.5), Lemma D.1 (Appendix D) shows that
3. Main results and space complexity
Our approach to quantizing RFFs given by (1.8) and (1.9) is justified by (2.6), along with the observation that, for our noise-shaping schemes, we have |$q=z-Hu$| with guarantees that |$\|\widetilde{V}Hu\|_{2}$| is small.
Moreover, as we will see in Section 3.1, we are able to control the approximation error such that |$\widehat{k}_{\varSigma \varDelta }(x,y)\approx k(x,y)$| and |$\widehat{k}_{\beta }(x,y)\approx k(x,y)$| hold with high probability. In fact, Theorem 3.1 shows more: the approximation errors of the quantized kernel estimators in (1.8) and (1.9) exhibit polynomial and exponential decay, respectively, as a function of |$m$|, the dimension of the RFFs. Armed with this result, in Section 3.2 we also present a brief analysis of the space complexity associated with our quantized RFFs, and show that the approximation error due to quantization decays exponentially as a function of the number of bits used.
Additionally, in various applications such as KRR, spectral error bounds on the kernel may be more pertinent than point-wise bounds. For example, it was shown in [4, 40] that the expected loss of KRR performed using an approximation of the true kernel is bounded by a function of the spectral error in the kernel approximation (Lemma 2 of [4], Proposition 1 of [40]). In Theorem 3.3, we provide spectral approximation guarantees for first-order |$\varSigma \varDelta $| quantized RFF approximation of kernels, in the spirit of the analogous guarantees in [40] for StocQ.
3.1 Approximation error bounds
3.1.1 Point-wise error bounds on the approximation
We begin with Theorem 3.1, with its proof in Appendix D.
Note that the first error term in (3.1) and (3.2) results from the condensation of RFFs, i.e. Theorem D.3, while the remaining two error terms are due to the corresponding quantization schemes.
3.1.2 Spectral approximation guarantees for first-order Sigma-Delta quantized RFFs
We begin with a definition of a |$(\varDelta _{1}, \varDelta _{2})$|-spectral approximation of a matrix as the error bounds |$\varDelta _{1}$| and |$\varDelta _{2}$| play a key role in bounding the generalization error in various applications such as KRR (Lemma 2 of [4], Proposition 1 of [40]).
|$(\varDelta _{1},\varDelta _{2})$|-spectral approximation. Given |$\varDelta _{1}, \varDelta _{2}>0$|, a matrix |$A$| is a |$(\varDelta _{1},\varDelta _{2})$|-spectral approximation of another matrix |$B$| if |$(1-\varDelta _{1})B \preccurlyeq A \preccurlyeq (1+\varDelta _{2})B$|.
To make the spectral error bounds tractable, in this section we consider a variation of the Sigma-Delta scheme for |$r=1$|. In particular, given a |$b$|-bit alphabet as in (1.4) with |$b=\log _{2}(2K)$|, we consider the following first-order |$\varSigma \varDelta $| quantization scheme for an RFF vector |$z(x) \in [-1,1]^{m}$| corresponding to a data point |$x \in \mathbb{R}^{d}$|, where the state variable |$(u_{x})_{0}$| is initialized as a random number, i.e.

$$q_{i}=Q_{\mathrm{MSQ}}\big ((u_{x})_{i-1}+z_{i}(x)\big ),\qquad (u_{x})_{i}=(u_{x})_{i-1}+z_{i}(x)-q_{i},\qquad i=1,\ldots ,m, \tag{3.3}$$
where |$q \in \mathscr{A}^{m}$| represents the |$\varSigma \varDelta $| quantization of |$z(x)$| and |$(u_{x})_{0}$| is drawn randomly from the uniform distribution on |$\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$|.
Let |$Q_{\varSigma \varDelta }$| be the first-order |$\varSigma \varDelta $| quantizer represented by (3.3) and let |$\widetilde{V}_{\varSigma \varDelta }$| be the associated sparse condensation matrix as in Definition 2.1. Then the elements of the corresponding approximation |$\hat{K}_{\varSigma \varDelta }$| of the kernel matrix |$K$| are given by

$$(\hat{K}_{\varSigma \varDelta })_{ij}=\big \langle \widetilde{V}_{\varSigma \varDelta }Q_{\varSigma \varDelta }(z(x_{i})),\ \widetilde{V}_{\varSigma \varDelta }Q_{\varSigma \varDelta }(z(x_{j}))\big \rangle .$$
Now, we state Theorem 3.3 whose proof can be found in Appendix E.
The above result differs from the spectral bounds presented in [40] for StocQ in one particular aspect: the lower bound on |$\varDelta _{2}$| in Theorem 3.3 for first-order |$\varSigma \varDelta $| quantization involves a second controllable parameter |$\lambda $| in addition to the number of bits |$b$|. Specifically, provided |$8\gg \frac{26}{3p}$|, we have |$\delta \approx \frac{8}{\lambda (2^{b}-1)^{2}}$|, which is monotonically decreasing in |$\lambda $|.
3.2 Space complexity
At first glance, Theorem 3.1 shows that |$Q_{\beta }$| has faster quantization error decay as a function of |$\lambda $| (hence |$m$|) as compared with |$Q_{\varSigma \varDelta }^{(r)}$|. However, a further compression of the bit-stream resulting from the latter is possible, and results in a similar performance of the two methods from the perspective of bit-rate versus approximation error, as we will now show.
Indeed, our methods entail training and testing linear models on condensed bitstreams |$\widetilde{V}q\in \widetilde{V}\mathscr{A}^{m}\subset{\mathbb{R}}^{p}$|, where |$q$| is the quantized RFF vector generated by |$Q_{\varSigma \varDelta }^{(r)}$| or |$Q_{\beta }$|, and |$\widetilde{V}$| is the corresponding normalized condensation operator. Thus, when considering the space complexity associated with our methods, the relevant quantity is the number of bits needed to encode |$\widetilde{V}q$|. To that end, by storing the normalization factors in |$\widetilde{V}$| (see (2.5)) separately using a constant number of bits, we can simply ignore them when considering space complexity. Let us now consider |$b$|-bit alphabets |$\mathscr{A}$| with |$b=\log _{2}(2K)$|. Since the entries of |$v$| are integer valued and |$\|v\|_{1}=O(\lambda ^{r})$|, one can store |$\widetilde{V}_{\varSigma \varDelta }q$| using |$B:=O(p\log _{2}(2K\|v\|_{1}))=O(p(b+r\log _{2}\lambda ))$| bits. Then |$\lambda ^{-r} \approx 2^{-c B/p}$|, and thus the dominant error terms in (3.1) decay exponentially as a function of the bit-rate |$B$|. On the other hand, for distributed noise shaping, each coordinate of |$\widetilde{V}_{\beta } q$| is a linear combination of |$\lambda $| components of |$q$|, so each coordinate takes on at most |$(2K)^{\lambda }$| values. This implies we need |$p\log _{2}(2K)^{\lambda }=mb$| bits to store |$\widetilde{V}_{\beta } q$| in the worst case.
Although |$mb$| is a tight upper bound for arbitrary |$\beta \in (1,2)$|, an interesting observation is that the number of bits used to store |$\widetilde{V}_{\beta } q$| can be smaller than |$mb$| for special choices of |$\beta $|, e.g. when |$\beta ^{k}=\beta +1$| for an integer |$k>1$|. For example, if |$k=2$| and |$b=1$|, then |$\beta =(\sqrt{5}+1)/2$| is the golden ratio and one can see that |$v_{\beta }=(\beta ^{-1},\ldots ,\beta ^{-\lambda })$| satisfies |$v_{\beta }(i)=v_{\beta }(i+1)+v_{\beta }(i+2)$| for |$1\leq i\leq \lambda -2$|. Since |$b=1$|, we have |$q\in \{\pm 1\}^{m}$|, and |$\widetilde{V}_{\beta } q$| (ignoring the normalizer) can be represented by |$p\log _{2}(\beta ^{\lambda })=m\log _{2}(\beta )<m$| bits. If we denote the number of bits used to encode each RFF vector by |$R:=m\log _{2}(\beta )$|, then (3.2) shows that |$\beta ^{-\lambda }=2^{-\lambda R/m}=2^{-R/p}$| dominates the error. In other words, up to constants, the error is essentially equal to that of a |$\lambda $|-bit MSQ quantization of a |$p$|-dimensional RFF embedding.
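The golden-ratio recursion that enables this saving, and the resulting sub-|$m$| bit count, can be verified directly (a small check of our own):

```python
import numpy as np

beta = (1 + np.sqrt(5)) / 2            # golden ratio: beta^2 = beta + 1
lam = 10
v_beta = beta ** -np.arange(1, lam + 1)

# beta^{-i} = beta^{-(i+1)} + beta^{-(i+2)} follows from beta^2 = beta + 1,
# i.e. v_beta(i) = v_beta(i+1) + v_beta(i+2).
recur_err = np.abs(v_beta[:-2] - (v_beta[1:-1] + v_beta[2:])).max()

# Bits per block: lam * log2(beta) ~ 0.694 * lam, strictly less than lam (= lam * b, b = 1).
bits_per_block = lam * np.log2(beta)
print(recur_err, bits_per_block)
```

So every block of |$\lambda $| one-bit symbols condenses to roughly |$0.69\lambda $| bits, matching the |$mb\log _{2}\beta $| entry in Table 1.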
If we assume that each full-precision RFF is represented by |$32$| bits, then the storage cost per sample for both full-precision RFF and semi-quantized scheme |$Q_{\text{SemiQ}}$| in (1.6) is |$32m$|. Because |$Q_{\text{StocQ}}$| in (1.7) does not admit further compression, it needs |$mb$| bits. A comparison of space complexity of different methods is summarized in Table 1.
4. Numerical experiments
We have established that both |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are memory-efficient and approximate their intended kernels well. In this section, we will verify via numerical experiments that they perform favourably compared with other baselines on machine learning tasks.
4.1 Kernel ridge regression
KRR [28] is ridge regression (linear least squares with |$\ell _{2}$| regularization) in a reproducing kernel Hilbert space. We synthesize |$N=5000$| highly nonlinear data samples |$(x_{i},y_{i})\in \mathbb{R}^{5}\times \mathbb{R}$| as follows: for each |$i$|, we draw each component of |$x_{i}\in \mathbb{R}^{5}$| uniformly from |$[-1,1)$| and use it to generate
where |$\gamma _{1}=\gamma _{2}=\gamma _{3}=[1,1,\ldots , 1]^{\top }\in \mathbb{R}^{5}$| and |$\varepsilon _{i}\sim \mathscr{N}(0,\frac{1}{4})$|. This is split into |$4000$| samples used for training and |$1000$| samples for testing. Given an RBF kernel |$k(x,y)=\exp (-\gamma \|x-y\|_{2}^{2})$| with |$\gamma =1/d=0.2$|, by the Representer theorem, our predictor is of the form |$\widehat{f}(x)=\sum _{i=1}^{N}\alpha _{i} k(x_{i},x)$| where the coefficient vector |$\alpha :=(\alpha _{1},\ldots ,\alpha _{N})\in \mathbb{R}^{N}$| is obtained by solving |$(K+\eta I_{N})\alpha =y$|. Here, |$K=(k(x_{i},x_{j}))\in \mathbb{R}^{N\times N}$| is the kernel matrix and |$\eta =1$| is the regularization parameter.
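The exact-kernel baseline for this experiment is short to implement. Below is a minimal sketch with a reduced sample size and an assumed toy target function (ours, not the paper's data-generating model), using the same |$\gamma =0.2$| and |$\eta =1$| as in the text:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_train, n_test, gamma, eta = 5, 200, 50, 0.2, 1.0

def rbf(X, Y):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

X = rng.uniform(-1, 1, size=(n_train, d))
y = np.sin(X.sum(axis=1)) + 0.1 * rng.standard_normal(n_train)  # assumed toy target
X_test = rng.uniform(-1, 1, size=(n_test, d))

# Representer theorem: f_hat(x) = sum_i alpha_i k(x_i, x), with (K + eta*I) alpha = y.
K = rbf(X, X)
alpha = np.linalg.solve(K + eta * np.eye(n_train), y)
y_pred = rbf(X_test, X) @ alpha
print(y_pred.shape)
```

In the quantized variants, the exact kernel evaluations in `K` and `rbf(X_test, X)` are replaced by the inner products of condensed quantized RFFs, leaving the solve step unchanged.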
Since the dimension of RFFs satisfies |$m=\lambda p$|, there is a trade-off between |$p$| and |$\lambda $|. According to Theorem 3.1, increasing the embedding dimension |$p$| can reduce the error caused by compressing RFFs, while larger |$\lambda $| leads to smaller quantization error and makes the memory usage of |$Q_{\varSigma \varDelta }^{(r)}$| more efficient (see Table 1). Beyond this, all hyperparameters, e.g. |$\lambda $|, |$\beta $|, are tuned based on cross validation. In our experiment, we consider the kernel approximations |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$| with |$\lambda =15$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| with |$\lambda =15$|, and |$\widehat{k}_{\beta }$| with |$\beta =1.9$|, |$\lambda =12$|. These are applied for both training (solving for |$\alpha $|) and testing (computing |$\widehat{f}(x)$| based on |$\alpha $|), while the semi-quantized scheme |$\widehat{k}_{s}$| is only used for testing and its coefficient vector |$\alpha $| is learned using |$\widehat{k}_{\text{RFF}}$| on the training set. Furthermore, according to [31], |$\widehat{k}_{s}$| can be used in two scenarios during the testing stage:
Training data are unquantized RFFs, while test data are quantized, i.e. |$\widehat{f}(x)=\sum _{i=1}^{N}\alpha _{i} \widehat{k}_{s}(x_{i},x)$|;
Quantize training data and leave testing points as RFFs, i.e. |$\widehat{f}(x)=\sum _{i=1}^{N}\alpha _{i} \widehat{k}_{s}(x,x_{i})$|.
We summarize the KRR results averaged over |$30$| runs for |$b=1$| bit quantizers in Fig. 1, in which the solid curves represent our methods and the dashed lines depict the other baselines. Note that in both cases, the noise-shaping quantizer |$Q_{\beta }$| achieves the lowest test mean squared error among all quantization schemes, and it even outperforms the semi-quantization scheme |$\widehat{k}_{s}$| with respect to the number of measurements |$m$|. Moreover, owing to the further compression advantage, |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| are more memory-efficient than the fully quantized scheme |$Q_{\text{StocQ}}$| in terms of bits used per sample. More experiments for |$b=2,3$| can be found in Appendix C.

Fig. 1. KRR with |$b=1$|. The labels RFF, |$s1$|, |$s2$|, StocQ, |$r1$|, |$r2$|, |$\beta $| represent |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{s}$| for scenarios (1) and (2), |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| and |$\widehat{k}_{\beta }$|, respectively.
4.2 Kernel SVM
To illustrate the performance of our methods on classification tasks, we perform kernel SVM [32, 36] to evaluate the different kernel approximations on the UCI ML hand-written digits dataset [2, 39], which consists of |$N=1797$| grayscale images in |$C=10$| classes, each vectorized to a |$d=64$|-dimensional vector. Additionally, all pixel values are scaled to the range |$[0,1]$|, and we randomly split the dataset into |$80\%$| for training and |$20\%$| for testing. As the classifier, we use the soft-margin SVM with regularization parameter |$R=1$|.
Note that in the binary classification case, i.e. labels |$y_{i}\in \{-1,1\}$|, our goal during the training stage is to learn the coefficients |$\alpha _{i}$|, the intercept |$b$| and the index set of support vectors |$S$| in the decision function

$$g(x)=\text{sign}\Big (\sum _{i\in S}\alpha _{i}y_{i}k(x,x_{i})+b\Big ). \tag{4.1}$$
Here, we use an RBF kernel |$k(x,y)=\exp (-\gamma \|x-y\|_{2}^{2})$| with |$\gamma =1/(d\sigma _{0}^{2})\approx 0.11$|, where |$\sigma _{0}^{2}$| is the variance of the training data. In the multi-class case, we implement the ‘one-versus-one’ approach, in which |$\frac{C(C-1)}{2}$| classifiers are constructed, each trained on data from two classes. In our experiment, we found that a large embedding dimension |$p=m/\lambda $| is needed, and the approximations |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$| with |$\lambda =2$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| with |$\lambda =3$| and |$\widehat{k}_{\beta }$| with |$\beta =1.1$|, |$\lambda =2$| are used for both the training (obtaining |$\alpha _{i}$|, |$b$| and |$S$| in (4.1)) and testing (predicting the class of an incoming sample |$x$| by |$g(x)$|) phases, whereas the asymmetric scheme |$\widehat{k}_{s}$| is only used for inference, with its parameters in (4.1) learned from |$\widehat{k}_{\text{RFF}}$| during the training stage. Moreover, as before, there are two versions of |$\widehat{k}_{s}$| used for making predictions:
Keep the support vectors as unquantized RFFs and quantize the test point |$x$|, i.e. substitute |$\widehat{k}_{s}(x_{i}, x)$| for |$k(x,x_{i})$| in (4.1);
Quantize the support vectors and leave the testing point |$x$| as unquantized RFFs, i.e. replace |$k(x,x_{i})$| in (4.1) with |$\widehat{k}_{s}(x, x_{i})$|.
For each binary quantization scheme (with |$b=1$|), the average test accuracy over |$30$| independent runs is plotted in Fig. 2. We observe that, with respect to |$m$|, |$Q_{\beta }$| substantially outperforms the other fully quantized schemes, including |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\text{StocQ}}$|, but, as expected, it is still worse than the semi-quantized methods. Memory efficiency is characterized in the right plot, which shows test accuracy against the storage cost (in bits) per sample. Note that both |$Q_{\beta }$| and |$Q_{\varSigma \varDelta }^{(1)}$| have a significant advantage over the baseline method |$Q_{\text{StocQ}}$|, which means that our methods require less memory to achieve the same test accuracy when |$b=1$|. See Appendix C for additional experimental results with |$b=2,3$|.

Fig. 2. Kernel SVM with |$b=1$|. The labels RFF, |$s1$|, |$s2$|, StocQ, |$r1$|, |$r2$|, |$\beta $| represent |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{s}$| for scenarios (1) and (2), |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| and |$\widehat{k}_{\beta }$|, respectively.
4.3 Maximum mean discrepancy
Given two distributions |$p$| and |$q$|, and a kernel |$k$| over |$\mathscr{X}\subset \mathbb{R}^{d}$|, the MMD has been shown to play an important role in the two-sample test [17], which tests the null hypothesis |$\mathscr{H}_{0}: p=q$| against the alternative hypothesis |$\mathscr{H}_{1}: p\neq q$|. The squared MMD distance can be computed by

$$\mathrm{MMD}_{k}^{2}(p,q)={\mathbb{E}}[k(x,x^{\prime })]+{\mathbb{E}}[k(y,y^{\prime })]-2\,{\mathbb{E}}[k(x,y)],$$
where |$x,x^{\prime }\overset{iid}{\sim } p$| and |$y,y^{\prime }\overset{iid}{\sim }q$|. Here, we set |$k$| to an RBF kernel, which is characteristic [34], implying that |$\mathrm{MMD}_{k}(p,q)$| is a metric, i.e. |$\mathrm{MMD}_{k}(p,q)=0\iff p=q$|, and that the following hypothesis test is consistent.
In our experiment, the distribution |$p$| is supported on a quadrant of the unit circle, while |$q$| is generated by perturbing |$p$| by a gap of size |$\delta $| in various regions; see Fig. 3a. Let |$n=60$| and draw finite samples |$X=\{x_{1},\ldots ,x_{n}\}\sim p$| and |$Y=\{y_{1},\ldots ,y_{n}\}\sim q$|. Then |$\mathrm{MMD}_{k}(p,q)$| can be estimated by

Under the null hypothesis |$\mathscr{H}_{0}$|, one can obtain the empirical distribution of (4.2) by reshuffling the data samples |$X\cup Y$| many times (|$t=2000$|) and recomputing |$\widehat{\mathrm{MMD}}_{k}^{2}(X^{\prime },Y^{\prime })$| for each reshuffled partition |$X^{\prime }\cup Y^{\prime }$|. For a significance level of |$\alpha =0.05$|, |$\mathscr{H}_{0}$| is rejected if the original |$\widehat{\mathrm{MMD}}_{k}^{2}(X,Y)$| exceeds the |$(1-\alpha )$| quantile of the empirical distribution. Figure 3c shows that the empirical distributions of (4.2) under |$\mathscr{H}_{0}$| and |$\mathscr{H}_{1}$| are well separated when we use the ground-truth RBF kernel with small bandwidth |$\sigma =0.05$|.
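The permutation test described above can be sketched as follows. This is an illustrative implementation under our own naming (the unbiased estimator of (4.2) plus the reshuffling procedure); the kernel, the number of permutations |$t$| and the level |$\alpha $| are arguments.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.05):
    """RBF kernel matrix with bandwidth sigma."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_unbiased(Kxx, Kyy, Kxy):
    """Unbiased estimator of MMD^2 from precomputed kernel blocks."""
    n, m = Kxx.shape[0], Kyy.shape[0]
    xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return xx + yy - 2.0 * Kxy.mean()

def permutation_test(X, Y, kernel, t=2000, alpha=0.05, seed=0):
    """Reject H0: p = q when the observed MMD^2 exceeds the (1 - alpha)
    quantile of its permutation (reshuffling) distribution."""
    rng = np.random.default_rng(seed)
    Z, n = np.vstack([X, Y]), len(X)
    K = kernel(Z, Z)  # kernel on the pooled sample, computed once

    def stat(idx):
        a, b = idx[:n], idx[n:]
        return mmd2_unbiased(K[np.ix_(a, a)], K[np.ix_(b, b)], K[np.ix_(a, b)])

    observed = stat(np.arange(len(Z)))
    null = np.array([stat(rng.permutation(len(Z))) for _ in range(t)])
    threshold = np.quantile(null, 1.0 - alpha)
    return observed, threshold, observed > threshold
```

Precomputing the pooled kernel matrix once and re-indexing it per permutation is what keeps the |$t=2000$| reshuffles cheap.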
In order to compare the quantization methods when |$b=1$|, we use the following approximations with optimal |$\lambda $| to perform the permutation test: |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$| with |$\lambda =4$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| with |$\lambda =5$|, and |$\widehat{k}_{\beta }$| with |$\beta =1.5$|, |$\lambda =4$|. Due to the symmetry in (4.2), |$\widehat{k}_{s}$| can be applied without regard to the order of its inputs. Additionally, if the probability of a Type II error, i.e. the false negative rate, is denoted by |$\beta $|, then the statistical power of our test is defined by
In other words, the power equals the proportion of MMD values under |$\mathscr{H}_{1}$| that exceed the |$(1-\alpha )$| quantile of the MMD distribution under |$\mathscr{H}_{0}$|. In Fig. 4, we observe that, compared with the other fully quantized schemes, |$Q_{\beta }$| has the greatest power as a function of |$m$|. The performance of the semi-quantized scheme is close to that of the plain RFF approximation, but it requires more storage space, as discussed in Section 3. Moreover, Fig. 5 presents the corresponding MMD distributions under |$\mathscr{H}_{0}$| and |$\mathscr{H}_{1}$|, whose overlap is considerably reduced as |$m$| increases. In terms of the number of bits per sample, both |$Q_{\varSigma \varDelta }^{(1)}$| and |$Q_{\beta }$| have a remarkable advantage over |$Q_{\text{StocQ}}$|. Additional results for |$b=2,3$| can be found in Appendix C.

Power of the permutation test with |$b=1$|. The labels RFF, |$s$|, StocQ, |$r1$|, |$r2$|, |$\beta $| represent |$\widehat{k}_{\text{RFF}}$|, |$\widehat{k}_{s}$|, |$\widehat{k}_{\text{StocQ}}$|, |$\widehat{k}_{\varSigma \varDelta }^{(1)}$|, |$\widehat{k}_{\varSigma \varDelta }^{(2)}$| and |$\widehat{k}_{\beta }$|, respectively.

The empirical distributions of MMD values under |$\mathscr{H}_{0}$| and |$\mathscr{H}_{1}$|.
5. Conclusion
In order to reduce the memory requirements of training and storing kernel machines, we proposed a framework that uses Sigma-Delta and distributed noise-shaping quantization schemes, |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$|, to approximate shift-invariant kernels. We showed that these fully deterministic quantization schemes save more bits than competing baselines without compromising performance. Importantly, we showed that, for all pairs of signals from an infinite low-complexity set, the approximations satisfy uniform probabilistic error bounds that decay exponentially in the number of bits used. Empirically, we illustrated on popular kernel machines that the proposed quantization methods achieve strong performance both as a function of the dimension of the RFF embedding and as a function of the number of bits used, especially in the case of binary embeddings.
Data Availability Statement
The data underlying this article are available in the UCI Machine Learning Repository, at https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits.
Funding
JZ was partially supported by grants NSF DMS 2012546 and 2012266. AC was partially supported by NSF DMS 1819222, 2012266. RS was partially supported by NSF DMS 2012546 and a UCSD senate research award.
References
A. Stable quantization methods
The general definition of a stable |$Q_{\varSigma \varDelta }^{(r)}$|. Although designing a stable |$Q_{\varSigma \varDelta }^{(r)}$| for |$r>1$| is a non-trivial task, families of |$\varSigma \varDelta $| quantization schemes that achieve this goal exist [14, 15, 18], and we adopt the version in [15]. Specifically, an |$r$|-th order |$\varSigma \varDelta $| quantization scheme may also arise from the following difference equation
where |$*$| is the convolution operator and the sequence |$H:=D^{r} g$| with |$g\in \ell ^{1}$|. Then any bounded solution |$v$| of (A.1) gives rise to a bounded solution |$u$| of (2.2) via |$u=g*v$|; conversely, by a change of variables, (2.2) can be reformulated as (A.1). By choosing a proper filter |$h:=\delta ^{(0)}-H$|, where |$\delta ^{(0)}$| denotes the Kronecker delta sequence supported at |$0$|, one can implement (A.1) via |$v_{i}=(h*v)_{i}+y_{i}-q_{i}$|, and the corresponding stable quantization scheme |$Q_{\varSigma \varDelta }^{(r)}$| reads as
Furthermore, the above design leads to the following result from [15, 22], which exploits the constant |$c(K,\mu , r)$| to bound |$\|u\|_{\infty }$|.
Note that even with the |$b=1$| bit alphabet, i.e. |$K=1$| and |$\mathscr{A}=\{-1, 1\}$|, stability can be guaranteed with
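For concreteness, here is a sketch (ours, not the paper's code) of the simplest stable scheme: the greedy first-order |$\varSigma \varDelta $| quantizer with the |$1$|-bit alphabet |$\mathscr{A}=\{-1,1\}$|. An easy induction shows |$\|u\|_{\infty }\leq 1$| whenever |$\|y\|_{\infty }\leq 1$|, which is the |$r=1$| instance of the stability discussed above; higher-order stable schemes require the filter |$h$|.

```python
import numpy as np

def sigma_delta_1bit(y):
    """Greedy first-order Sigma-Delta quantizer onto A = {-1, +1}:
    q_i = sign(u_{i-1} + y_i), u_i = u_{i-1} + y_i - q_i.
    If ||y||_inf <= 1 then |w| = |u_{i-1} + y_i| <= 2, so the state
    satisfies ||u||_inf <= 1 for all i (stability)."""
    u, umax = 0.0, 0.0
    q = np.empty_like(y)
    for i, yi in enumerate(y):
        w = u + yi
        q[i] = 1.0 if w >= 0 else -1.0
        u = w - q[i]
        umax = max(umax, abs(u))
    return q, umax
```

Summing the recurrence telescopes: |$\sum _{i}(y_{i}-q_{i})=u_{N}-u_{0}$|, so the running averages of |$y$| and |$q$| differ by at most |$\|u\|_{\infty }/N$|.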
The stability of |$Q_{\beta }$|. The relevant stability result for the noise-shaping quantization schemes (2.4) is the following proposition, which can be proved by a simple induction or found in [10].
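A minimal sketch (ours) of the greedy |$1$|-bit noise-shaping quantizer follows. The assumption |$\|y\|_{\infty }\leq 2-\beta $| below is a simple sufficient condition for stability in this special two-level case, not the general statement of the proposition.

```python
import numpy as np

def beta_quantize_1bit(y, beta=1.1):
    """Greedy noise-shaping (beta-encoder) quantizer onto {-1, +1}:
    q_i = sign(beta * u_{i-1} + y_i), u_i = beta * u_{i-1} + y_i - q_i.
    If ||y||_inf <= 2 - beta and |u_{i-1}| <= 1, then |w| <= 2, so the
    state stays bounded: ||u||_inf <= 1."""
    u = 0.0
    q = np.empty_like(y)
    for i, yi in enumerate(y):
        w = beta * u + yi
        q[i] = 1.0 if w >= 0 else -1.0
        u = w - q[i]
    return q, u
```

The recurrence gives |$y_{i}-q_{i}=u_{i}-\beta u_{i-1}$|, so with |$v_{\beta }=(\beta ^{-1},\ldots ,\beta ^{-\lambda })$| and |$u_{0}=0$| the sum |$\langle v_{\beta }, y-q\rangle =\beta ^{-\lambda }u_{\lambda }$| telescopes; this is the source of the exponential error decay in |$\lambda $|.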
B. A comparison of kernel approximations
In Fig. B.6, we evaluate the approximate kernels (1.8) and (1.9) from Section 1 on |$n=1000$| pairs of points |$\{x_{i}, y_{i}\}_{i=1}^{n}$| in |$\mathbb{R}^{d}$| with |$d=50$| such that for each |$i$|

Moreover, each data point |$x_{i}$| is represented by |$m=3000$| RFF features, and we use |$3$|-bit quantizers to guarantee good performance for all methods. The target RBF kernel (red curves) is |$k(x,y)=\exp (-\|x-y\|_{2}^{2}/2\sigma ^{2})$| with |$\gamma :=1/2\sigma ^{2}=\frac{1}{5}$|, and the pairs are generated so that their |$\ell _{2}$| distances |$\|x_{i}-y_{i}\|_{2}$| are uniformly distributed in the range |$[0,5]$|; the approximations are shown as black dots. We see that both |$\widehat{k}_{\varSigma \varDelta }^{(r)}$| and |$\widehat{k}_{\beta }$| approximate |$k$| well.
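The comparison above can be reproduced in miniature with the following self-contained sketch (ours; for simplicity it uses the |$1$|-bit first-order |$\varSigma \varDelta $| scheme with |$\lambda =10$| rather than the |$3$|-bit quantizers of Fig. B.6). It quantizes unnormalized RFFs |$z(x)=\cos (Wx+\xi )$| and forms the condensed inner product with |$v=(1,\ldots ,1)$|.

```python
import numpy as np

def rffs(X, m, gamma, rng):
    # Unnormalized RFFs z(x) = cos(W x + xi), entries in [-1, 1].
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(m, X.shape[1]))
    xi = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return np.cos(X @ W.T + xi)

def sd1(z):
    # Greedy first-order Sigma-Delta onto {-1, +1}.
    u = 0.0
    q = np.empty_like(z)
    for i, zi in enumerate(z):
        w = u + zi
        q[i] = 1.0 if w >= 0 else -1.0
        u = w - q[i]
    return q

def condensed_inner(ax, ay, lam):
    # <V~ ax, V~ ay> with V~ = sqrt(2/p)/||v||_2 (I_p (x) v), v = (1,...,1):
    # sum entries within each of the p blocks of length lam, then rescale.
    p = len(ax) // lam
    sx = ax[: p * lam].reshape(p, lam).sum(axis=1)
    sy = ay[: p * lam].reshape(p, lam).sum(axis=1)
    return (2.0 / (p * lam)) * (sx @ sy)

rng = np.random.default_rng(0)
d, m, lam, gamma = 50, 5000, 10, 0.2
x = rng.normal(size=(1, d))
y = x + 0.3 * rng.normal(size=(1, d))        # a moderately distant point
Z = rffs(np.vstack([x, y]), m, gamma, rng)
k_true = np.exp(-gamma * ((x - y) ** 2).sum())
k_hat = condensed_inner(sd1(Z[0]), sd1(Z[1]), lam)
```

Despite storing only one bit per feature, the condensed estimate `k_hat` lands near the true kernel value; increasing |$\lambda $| or the bit depth tightens the approximation.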
C. More figures in Section 4
KRR. In Fig. C.7 and Fig. C.8, we show the KRR results for |$b=2$| and |$b=3$|, respectively. As in Section 4, the proposed methods |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$| perform strongly both in terms of |$m$| and in terms of the number of bits used per sample.


Kernel SVM. Figure C.9 and Fig. C.10 illustrate the performance of kernel SVM with |$b=2$| and |$b=3$|, respectively. As expected, the gap between the various schemes shrinks when we use multibit quantizers, and |$Q_{\text{StocQ}}$| becomes comparable with |$Q_{\varSigma \varDelta }^{(r)}$| and |$Q_{\beta }$|.


MMD. Continuing the two-sample test of Section 4, Fig. C.11 and Fig. C.12 show that both the semi-quantized scheme and |$Q_{\text{StocQ}}$| perform better with respect to |$m$|, while |$Q_{\varSigma \varDelta }^{(r)}$| saves more memory than the other quantization methods.


D. Proof of Theorem 3.1
Given |$x,y\in \mathscr{X}\subset \mathbb{R}^{d}$|, we use either stable |$Q_{\varSigma \varDelta }^{(r)}$| or stable |$Q_{\beta }$| to quantize their RFFs |$z(x)$|, |$z(y)$| as in (1.2). Then we get quantized sequences
and expect that both
approximate the ground truth |$k(x,y)$| well.
In the case of |$\varSigma \varDelta $| quantization, by (2.2), one can get
It follows that
The triangle inequality implies that
Similarly, for the noise-shaping quantization, one can derive the following inequality based on (2.3),
In order to control the kernel approximation errors in (D.1) and (D.2), we need to bound the four terms (I), (II), (III) and (IV) on the right-hand side.
D.1 Useful lemmata
In this section, we present the following well-known concentration inequalities and relevant lemmata.
Additionally, one can compute the moments of |$\cos (\omega _{i}^{\top } x+\xi _{i})\cos (\omega _{j}^{\top } y +\xi _{j})$| as follows.
(ii) Substituting |$V_{\varSigma \varDelta }$| with |$V_{\beta }=I_{p}\otimes v_{\beta }$| leads to a verbatim proof for the second inequality.
D.2 Upper bound of (I)
This section is devoted to deriving an upper bound of the term (I) in (D.1), (D.2). Here, we adapt the proof techniques used in [37].
According to Theorem 3.1, |$\mathscr{X}$| is a compact subset of |$\mathbb{R}^{d}$| with diameter |$\ell =\mathrm{diam}(\mathscr{X})>0$|. Then |$\mathscr{X}^{2}:=\mathscr{X}\times \mathscr{X}$| is a compact set in |$\mathbb{R}^{2d}$| with diameter |$\sqrt{2}\ell $|. Additionally, the second moment of the distribution |$\varLambda $|, that is, |$\sigma _{\varLambda }^{2}:={\mathbb{E}}_{\omega \sim \varLambda }\|\omega \|_{2}^{2}=\text{tr}(\nabla ^{2} \kappa (0))$|, exists, where |$\nabla ^{2} \kappa $| is the Hessian matrix of |$\kappa $| in (1.1). We will need the following results in order to obtain a uniform bound on term (I) over |$\mathscr{X}$|, via an |$\varepsilon $|-net argument.
We can now prove the following theorem controlling term (I).
D.3 Upper bound of (II) and (III)
By symmetry, it suffices to bound (II) in (D.1) and (D.2); the same upper bound holds for (III).
D.4 Upper bound of (IV)
Let |$r\in \mathbb{N}^{+}$| and |$\beta \in (1,2)$|.
- If |$u_{x}$|, |$u_{y}$| are state vectors of the |$\varSigma \varDelta $| quantizer |$Q_{\varSigma \varDelta }^{(r)}$|, then$$ \begin{align*} & \bigl|\langle \widetilde{V}_{\varSigma\varDelta}D^{r}u_{x}, \widetilde{V}_{\varSigma\varDelta}D^{r}u_{y}\rangle\bigr|\leq\frac{c(K,r)^{2}c(r)}{\lambda^{2r-1}}, \end{align*} $$where |$c(K,r)$| is the upper bound on the |$\ell _{\infty }$| norm of the state vectors from Proposition A.1 and |$c(r)>0$| is a constant depending only on |$r$|.
- If |$u_{x}$|, |$u_{y}$| are state vectors of the noise-shaping quantizer |$Q_{\beta }$|, then$$ \begin{align*} & \bigl|\langle \widetilde{V}_{\beta} Hu_{x}, \widetilde{V}_{\beta} Hu_{y}\rangle\bigr| \leq \frac{2c(K,\beta)^{2}}{\beta^{2\lambda -2}}, \end{align*} $$where |$c(K,\beta )$| is the upper bound on the |$\ell _{\infty }$| norm of the state vectors from Proposition A.2.
D.5 Proof of Theorem 3.1
Recall that the kernel approximation errors in (D.1) and (D.2) can be bounded by four terms: (I), (II), (III), (IV).
(i) For the |$\varSigma \varDelta $| scheme, in Theorem D.3, one can choose |$\varepsilon =O(\sqrt{p^{-1}\log p})$|, |$\lambda =O(\sqrt{p\log ^{-1} p})$| and |$\eta _{1}=O(p^{-2-\alpha })$| with |$\alpha>0$|. Moreover, since |$\|v\|_{2}^{2}\geq \lambda ^{2r-1}r^{-2r}$| and |$\|v\|_{\infty }=O(\lambda ^{r-1})$| (see Lemma 4.6 in [21]), in Theorem D.4, we can choose |$\varepsilon =O(c(K,r)\lambda ^{-r+1}\log ^{1/2}p)$| and |$\eta _{2}=O(\lambda ^{-r-1}\log ^{1/2}p)$|. Then (3.1) follows immediately by combining the above results with part (1) of Theorem D.5.
(ii) As for the noise-shaping scheme, in Theorem D.3, we choose the same parameters as in part (i): |$\varepsilon =O(\sqrt{p^{-1}\log p})$|, |$\lambda =O(\sqrt{p\log ^{-1} p})$| and |$\eta _{1}=O(p^{-2-\alpha })$| with |$\alpha>0$|. However, since |$\|v_{\beta }\|_{2}^{2}\geq \beta ^{-2}$| and |$\|v_{\beta }\|_{\infty }=\beta ^{-1}$|, we set different values |$\varepsilon =O(c(K,\beta )\beta ^{-\lambda +1}\sqrt{p})$| and |$\eta _{2}=O(p^{-1})$| in Theorem D.4. Therefore, (3.2) follows by applying the above results together with part (2) of Theorem D.5.
E. Proof of Theorem 3.3
The proof of Theorem 3.3 closely follows the methods used in [40]. We start with some useful lemmata that aid in proving Theorem 3.3.
Given a |$b$|-bit alphabet as in (1.4) with |$b=\log _{2}(2K)$|, we consider the following first-order |$\varSigma \varDelta $| quantization scheme for a random Fourier feature vector |$z(x) \in [-1,1]^{m}$| corresponding to a data point |$x \in \mathbb{R}^{d}$|, where the state variable |$(u_{x})_{0}$| is initialized as a random number, i.e.
The corresponding recurrence equation can be written as
Let the inductive hypothesis be |$u_{k} \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$|. Note that this is true by definition for |$u_{0}$|.
Case: |$\frac{j}{2^{b}-1} \leq z_{k+1} \leq \frac{j+1}{2^{b}-1}$|, where |$j \in \{ 1, 3,\cdots ,2^{b}-3\}$|.
Showing that |$u_{k+1} \: | \: z \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$| in the remaining cases, namely |$\frac{j}{2^{b}-1} \leq z_{k+1} \leq \frac{j+1}{2^{b}-1}$| with |$j \in \{ 0, 2,\cdots ,2^{b}-2\}$|, and |$-\frac{j+1}{2^{b}-1} \leq z_{k+1} \leq -\frac{j}{2^{b}-1}$| with |$j \in \{ 1, 3,\cdots ,2^{b}-3\}$| or |$j \in \{ 0, 2,\cdots ,2^{b}-2\}$|, follows a similar argument, which we omit for brevity. Thus, by induction, |$u_{k+1} \: | \: z \sim U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$| for all |$k$|.
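The induction above can also be checked numerically. The following simulation (ours) runs the dithered first-order scheme with |$b=2$| over many draws of |$u_{0}$| for one fixed input sequence |$z$|; the empirical distribution of the final state matches |$U\left [-\frac{1}{2^{b}-1},\frac{1}{2^{b}-1}\right ]$|.

```python
import numpy as np

def dithered_sd(z, b, rng):
    """First-order Sigma-Delta with the 2^b-level alphabet
    {-1, -1 + delta, ..., 1}, delta = 2/(2^b - 1), and dithered start
    u_0 ~ U[-delta/2, delta/2]; returns the final state u_n."""
    delta = 2.0 / (2 ** b - 1)
    u = rng.uniform(-delta / 2.0, delta / 2.0)
    for zk in z:
        w = u + zk
        # round w to the nearest alphabet level -1 + idx * delta
        idx = np.clip(np.round((w + 1.0) / delta), 0, 2 ** b - 1)
        u = w - (-1.0 + idx * delta)
    return u
```

Since each step maps the cell |$[-\delta /2,\delta /2]$| to itself by a piecewise translation, the uniform law of |$u_{0}$| is preserved at every step, which is exactly what the induction argues.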
For the subsequent sections, we adopt the following notation. Let |$A$| be the matrix whose rows are the vectors |$\{ (\tilde{V}z(x))^{T}\}_{x}$|, |$B$| the matrix whose rows are the vectors |$\{ (\tilde{V}D^{r}u_{x})^{T}\}_{x}$|, and |$C$| the matrix whose first column is |$\frac{\sqrt{2}}{\sqrt{p}\|v\|_{2}} (u_{0}^{x_{1}} \: u_{0}^{x_{2}} \: \cdots \: u_{0}^{x_{n}})^{T}$| and whose remaining columns are zero. Denote the columns of |$A, B, C$| by |$A_{i}, B_{i}, C_{i}$|, respectively. The corresponding approximation to the kernel can then be written as
Recall our notation: |$A$| is the matrix whose rows are the vectors |$\{ (\tilde{V}z(x))^{T}\}_{x}$|, |$B$| is the matrix whose rows are the vectors |$\{ (\tilde{V}D^{r}u_{x})^{T}\}_{x}$|, and |$C$| is the matrix whose first column is |$\frac{\sqrt{2}}{\sqrt{p}\|v\|_{2}} (u_{0}^{x_{1}}, u_{0}^{x_{2}}, \cdots , u_{0}^{x_{n}})^{T}$| with all other columns zero; the columns of |$A, B, C$| are denoted by |$A_{i}, B_{i}, C_{i}$|, respectively. Additionally, let |$K_{i}: = \mathbb{E}[A_{i}A_{i}^{T}]$|, |$M:= (K + \eta I_{n})^{-1/2}$| and
Note that, by design, |$\mathbb{E}[S_{i}] = 0$|. We will now show that the remaining assumptions required to apply the matrix Bernstein inequality hold for the sequence of matrices |$\{S_{i}\}_{i=1}^{p}$|.
We are now in a position to prove Theorem 3.3 of the main text, which we restate for convenience.