% chapter_6/2_dictionary_learning.tex
\section{Dictionary learning}
\subsection{Introduction}
Sparse dictionary learning is a representation learning method that aims to find a sparse representation of the input data as a linear combination of basic elements. These elements form the dictionary.
The problem of sparse coding consists of unsupervised learning of the dictionary and the coefficient matrices. Thus, given only unlabeled data, we aim to learn the set of dictionary atoms or
basis functions that provide a good fit to the observed data. Sparse coding is important in a variety of domains. Sparse coding of natural images has yielded dictionary atoms which resemble the receptive fields of neurons in the visual cortex. Other important applications include compressed sensing and signal recovery.
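One common way to make this precise (a standard formulation, stated here for orientation only; the estimators analyzed later in this chapter differ in their details) is as a joint optimization over a dictionary matrix $A$, whose columns are the atoms, and a coefficient matrix $X$, whose columns are the sparse codes of the data points collected as the columns of $Y$:
\begin{equation*}
\min_{A,\,X}\ \tfrac{1}{2}\lVert Y - AX\rVert_F^2 + \lambda\lVert X\rVert_1
\qquad\text{subject to}\quad \lVert A_j\rVert_2 \le 1 \ \text{ for every column } A_j \text{ of } A,
\end{equation*}
where $\lVert X\rVert_1$ denotes the entrywise $\ell_1$ norm, $\lambda>0$ controls the sparsity level, and the norm constraint on the atoms removes the scaling ambiguity between $A$ and $X$.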
\subsubsection{Problem formulation}
Dictionary learning has a particularly simple setup as follows. Given $n$ i.i.d. samples $y^i\in\mathbb{R}^d$, $i\in [n]$ from the generative model
However, when one of $A$ and $x$ is held fixed, the problem reduces to a linear regression with response variable $y$ and design matrix given by the fixed factor, so classical optimization methods for least-squares problems can be used. This observation leads to various alternating minimization heuristics, which have proven effective in practice. Meanwhile, theoretical guarantees for alternating minimization, such as guaranteed convergence to a global optimum and local rates of convergence, have also been developed.
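Concretely, writing the model informally as $y^i \approx A x^i$ with sparse $x^i$, the two subproblems take the following form (a sketch; the $\ell_1$ penalty with a generic parameter $\lambda>0$ is one common way to enforce sparsity and is not necessarily the exact estimator analyzed below):
\begin{align*}
\text{$x$-update (fix $A$):}\quad & x^i \leftarrow \arg\min_{x\in\mathbb{R}^r}\ \tfrac12\lVert y^i - A x\rVert_2^2 + \lambda\lVert x\rVert_1, \qquad i\in[n],\\
\text{$A$-update (fix $X$):}\quad & A \leftarrow \arg\min_{A\in\mathbb{R}^{d\times r}}\ \tfrac12\lVert Y - A X\rVert_F^2,
\end{align*}
where $Y = [y^1,\dots,y^n]$ and $X = [x^1,\dots,x^n]$.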
A generic \textit{alternating minimization} (AM) algorithm works as follows. First, initialize $A^{(0)}$ and $x^{(0)}$. In each epoch, first fix $A$ and update $x$: $x^+ \leftarrow\phi(A, x)$; then fix $x^+$ and update $A$: $A^+ \leftarrow\psi(A, x^+)$. The updates $\phi$ and $\psi$ might involve randomness, such as coordinate-wise updates through a random permutation or an SGD-type method with random sampling of the data.
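As an illustration only, the following is a minimal sketch of this generic scheme in Python/NumPy; the particular choices of $\phi$ (one proximal-gradient step with soft-thresholding) and $\psi$ (least squares via the pseudo-inverse, followed by column normalization) are assumptions made for the sketch and are not the updates analyzed below.
\begin{verbatim}
import numpy as np

def soft_threshold(z, tau):
    # Entrywise soft-thresholding, a standard sparsity-inducing proximal step.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def alternating_minimization(Y, A0, X0, n_epochs=50, step=0.1, tau=0.05):
    # Generic AM for Y ~ A X with sparse X.
    # phi: update X with A fixed (one proximal gradient step on a lasso objective).
    # psi: update A with X fixed (least squares), then renormalize the atoms.
    A, X = A0.copy(), X0.copy()
    for _ in range(n_epochs):
        grad_X = A.T @ (A @ X - Y)          # gradient of 0.5*||Y - AX||_F^2 in X
        X = soft_threshold(X - step * grad_X, step * tau)
        A = Y @ np.linalg.pinv(X)           # least-squares dictionary update
        A /= np.maximum(np.linalg.norm(A, axis=0, keepdims=True), 1e-12)
    return A, X
\end{verbatim}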
Here, we discuss two specific AM algorithms and their respective theoretical convergence/recovery guarantees, that is, conditions on the data-generating process and on the algorithm parameters under which the iterates converge to the true dictionary with high probability. Both algorithms, as the current theoretical guarantees suggest, achieve \textit{local linear convergence}: the dictionary iterates $A^{(t)}$ satisfy $\|A^{(t)} - A^*\|\leq C \cdot\eta^t \|A^{(0)} - A^*\|$ for some $C>0$ and $0<\eta<1$, where $A^*$ is the true underlying dictionary, as long as $A^{(0)}$ is sufficiently close to $A^*$. The first algorithm below has a provably larger radius of convergence, that is, $A^{(0)}$ is allowed to be further away from $A^*$. It also makes a weaker assumption on the dictionary matrix, namely an upper bound on $\|A^*\|_\infty$ instead of on its operator norm, thereby allowing the operator norm to grow with the dimension.
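A direct consequence of this bound is that only logarithmically many iterations are needed to reach a target accuracy $\epsilon$:
\begin{equation*}
t \;\ge\; \frac{\log\bigl(C\,\lVert A^{(0)} - A^*\rVert/\epsilon\bigr)}{\log(1/\eta)}
\quad\Longrightarrow\quad
\lVert A^{(t)} - A^*\rVert \;\le\; C\,\eta^t\,\lVert A^{(0)} - A^*\rVert \;\le\; \epsilon,
\end{equation*}
which matches the $\mathcal{O}(\log(1/\epsilon))$ iteration counts appearing later in this section.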
\subsection{Alternating minimization algorithm I}
The first algorithm is the one considered in \cite{chatterji2017alternating}, in which the authors provide theoretical guarantees of local linear convergence. At a high level, this algorithm combines a specific robust sparse least-squares estimation subroutine (for updating the coefficients $x$) with a simple gradient step (for updating the dictionary $A$).
We first introduce the robust sparse least-squares estimation subroutine, which is defined as follows. Let $\gamma, \lambda, \nu > 0$ be tuning parameters and $R>0$ be an upper bound on the entry-wise maximum dictionary estimation error ($\max_{i,j} |A_{ij} - A^*_{ij}|$). Let $(\hat{\theta}, \hat{t}, \hat{u})\in\mathbb{R}^r \times\mathbb{R}_+\times\mathbb{R}_+$ be the solution to the following convex minimization problem:
Next, we discuss the various assumptions on problem data and algorithm parameters used in \cite{chatterji2017alternating} in order to establish convergence guarantees for the algorithms.
\subsubsection{Assumptions on the problem}
In order to develop convergence guarantees for Algorithm \ref{alg:am-for-dl-with-mus-estimator}, it is customary to assume a fixed, deterministic ground-truth dictionary $A^*$ and coefficients $x^{i*}$ drawn i.i.d. from a distribution satisfying regularity conditions such as bounded variance, independent coordinates, and sparse support. Moreover, to prove local convergence of any optimization algorithm for a non-convex problem with many local optima, it is typically necessary to assume that the initial iterate is close to a particular local optimum and that different optima are sufficiently ``far apart''.
\item (C5) Mean and variance of variables in the support: $\mathbb{E}(x_i^*|x_i^*\neq0) = 0$ and $\mathbb{E}((x_i^*)^2|x_i^*\neq0) = 1$ for all $i$.
\end{itemize}
\subsubsection{Assumptions on algorithm parameters}
The following are the assumptions on Algorithm \ref{alg:am-for-dl-with-mus-estimator} as well as the sparse estimation subroutine \ref{eq:mu-selector}.
\begin{itemize}
\item The stepsize $\eta$ for the gradient update of $A$ should satisfy $\frac{3r}{4s}\leq\eta\leq\frac{r}{s}$.
The empirical gradient can be decomposed into an ``expectation'' term and a ``deviation'' (variance) term.
A deterministic convergence result can then be proven by working with the expectation terms and the MUS estimation errors (in particular, MUS is guaranteed to estimate the sign correctly in every iteration). Making use of concentration inequalities, the behavior of the deviation terms and of the MUS estimation errors can then be controlled.
\subsubsection{Convergence Results}
By a lemma in \cite{chatterji2017alternating}, Algorithm \ref{alg:am-for-dl-with-mus-estimator} estimates the sign of $x^*$ correctly during each loop, i.e., $\mathrm{sgn}(x)=\mathrm{sgn}(x^*)$. According to Assumption (B1), the initial value of $A$ is close to the true dictionary $A^*$ up to permutation.
Between iteration $t$ and iteration $t+1$, let us consider the entries of $A$. Denote by $a^*_{ij}$ the $(i,j)$-th entry of the true dictionary $A^*$, by $a_{ij}$ the corresponding entry of the iterate $A^{(t)}$ at loop $t$, and by $a'_{ij}$ that of $A^{(t+1)}$ at loop $t+1$. Denote by $x^{m*}_k$ the $k$-th coordinate of the $m$-th true coefficient vector, and by $x^m_k$ the $k$-th coordinate of its estimate at step $t$. For simplicity, write $R^{(t)}=R$ and $n^{(t)}=n$. Let $\bar g_{ij}$ be the $(i,j)$-th entry of the gradient computed from the $n$ samples at step $t$, and let $g_{ij}$ be the $(i,j)$-th entry of its expected value. We have:
We have $\eta|\epsilon_n|\leq R/8\leq R^{(0)}/8$ with probability at least $1-\delta$ in each iteration. Thus, by taking a union bound over the iterations, we are guaranteed to remain in the initial ball of radius $R^{(0)}$ with high probability, which completes the proof.
In this section we describe another alternating minimization algorithm, used in \cite{ref1} to learn the dictionary matrix $A^*$. This algorithm implements a specific form of alternating minimization: the coefficients are estimated via $\ell_1$ minimization and the dictionary via least-squares estimation. The paper shows that, under a set of conditions on the dictionary matrix $A^*$, the coefficient matrix $X^*$, and the sample complexity, the algorithm 1) converges locally at a linear rate, and 2) recovers the dictionary matrix $A^*$ to arbitrary accuracy. In fact, under the specified conditions, at most $\mathcal{O}(\log_2 1/\epsilon)$ rounds of alternating minimization are needed to approximate each dictionary atom within an absolute error of $\pm \epsilon$. We use $A(i)$ and $X(i)$ to denote the dictionary estimate and the coefficient estimate after the $i$-th round of alternating minimization. \\
Input: Samples $Y$, initial dictionary $A(0)$, accuracy sequence $\epsilon_t$ and sparsity level $s$. (The way to estimate $A(0)$ and the selection of $\epsilon_t$'s will be explained later in the paper.)
The algorithm above alternates between two procedures: a sparse recovery step that estimates the coefficients for a fixed dictionary via $\ell_1$ minimization, and a step that estimates the dictionary for a fixed coefficient matrix via least-squares estimation. Note that the ``Threshold'' step in the algorithm serves to guarantee that the resulting coefficient matrix is sparse. The ``Normalize'' step serves to normalize each dictionary atom to have $\ell_2$ norm equal to $1$; we may assume this normalization because we can always rescale the rows of the coefficient matrix accordingly.
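For concreteness, a minimal sketch of one such round in Python/NumPy is given below; the ISTA-style lasso solver stands in for the exact $\ell_1$-minimization decoder of \cite{ref1}, and the regularization and threshold levels are illustrative assumptions rather than the parameters prescribed by the paper.
\begin{verbatim}
import numpy as np

def ista_lasso(A, y, lam, n_iters=200):
    # Approximately solve argmin_x 0.5*||y - A x||_2^2 + lam*||x||_1 via ISTA
    # (a stand-in for the exact l1-minimization decoding step).
    x = np.zeros(A.shape[1])
    L = np.linalg.norm(A, 2) ** 2           # Lipschitz constant of the smooth part
    for _ in range(n_iters):
        z = x - A.T @ (A @ x - y) / L       # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding
    return x

def am_round(Y, A, lam, thresh):
    # One round of the alternating scheme described above.
    # Sparse recovery: decode each sample with the current dictionary.
    X = np.column_stack([ista_lasso(A, Y[:, m], lam) for m in range(Y.shape[1])])
    # Threshold: zero out small entries so that X stays sparse.
    X[np.abs(X) < thresh] = 0.0
    # Least squares: update the dictionary for the fixed coefficient matrix.
    A_new = Y @ np.linalg.pinv(X)
    # Normalize: rescale each dictionary atom to unit l2 norm.
    A_new /= np.maximum(np.linalg.norm(A_new, axis=0, keepdims=True), 1e-12)
    return A_new, X
\end{verbatim}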
\subsubsection{Lemmas, theorems, and proof}
The paper has two main theorems, based on two sets of assumptions. Assumptions (A1)-(A7) give local linear convergence (Theorem \ref{thm:thm3-1}), but (A5) requires knowledge of an estimate of $A$ not too far from $A^*$ as initialization. Based on a similar set of assumptions, (B1) and (B3)-(B5), Theorem \ref{thm:thm3-2} (a specialization of Theorem 2.1 from \cite{ref1}) gives an initialization that satisfies (A5), thus showing that Algorithm 1 is feasible and yields exact recovery (Corollary \ref{cor:cor3-1}).
We will use the shorthand Supp($v$) and Supp($W$) to denote the set of non-zero entries of $v$ and $W$ respectively. $||w||_p$ denotes the $\ell_p$ norm of the vector $w$; by default, $||w||$ denotes the $\ell_2$ norm of $w$. $||W||_2$ denotes the spectral norm of the matrix $W$, and $||W||_{\infty}$ denotes the largest element of $W$ in magnitude. For a matrix $X$, $X^i$, $X_i$ and $X_j^i$ denote the $i^{th}$ row, the $i^{th}$ column and the $(i, j)^{th}$ element of $X$ respectively.
\subsubsection{Assumption A, Theorem \ref{thm:thm3-1}}
Without loss of generality, assume that the elements are normalized: $\lVert A_i^* \rVert_2=1,~\forall i\in[r]$.
Assumptions:
Another way to understand this theorem is that, based on Assumption (A6), the initialization should have error no greater than $\mathcal{O}(1/s^2)$. Thus, we can view this $\mathcal{O}(1/s^2)$ as the size of the basin of attraction: as long as $A(0)$ is within this basin, Algorithm 1 will succeed.
(B1) \textbf{Incoherent Dictionary Elements}: Without loss of generality, assume that all the elements are normalized: $\lVert A_i^*\rVert_2=1$ for $i\in [r]$. We assume a pairwise incoherence condition on the dictionary elements: for some constant $\mu_0>0$, $\lvert \langle A_i^*,A_j^*\rangle\rvert < \frac{\mu_0}{\sqrt{d}}$ for all $i\neq j$.
Notice that the sets of Assumptions A and B are similar (though some of the B assumptions are stricter): (B3) is stricter than (A3) in that it adds a lower bound on $\lvert X_j^{*i} \rvert$; assumptions (B1) and (B4) imply (A1).
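As a small numerical illustration (not part of \cite{ref1}), the pairwise incoherence condition in (B1) can be checked for a candidate dictionary by inspecting the off-diagonal entries of its Gram matrix:
\begin{verbatim}
import numpy as np

def incoherence_ok(A, mu0):
    # Check max_{i != j} |<A_i, A_j>| < mu0 / sqrt(d) for a dictionary A
    # whose columns are (re)normalized to unit l2 norm.
    d, r = A.shape
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    G = np.abs(A.T @ A)           # |<A_i, A_j>| for all pairs of atoms
    np.fill_diagonal(G, 0.0)      # ignore the diagonal terms <A_i, A_i> = 1
    return G.max() < mu0 / np.sqrt(d)
\end{verbatim}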
\subsubsection{Lemmas}
\begin{lemma}
(Error in sparse recovery) Let $\Delta X:=X(t)-X^*$. Assume that $2\mu_0s/\sqrt{d}\leq0.1$ and $\sqrt{s\epsilon_t}\leq0.1$. Then we have $Supp(\Delta X) \subset Supp(X^*)$ and the error bound $\lVert\Delta X \rVert_{\infty} \leq9s\epsilon_t$.
\label{lemma:l1}
\end{equation*}
\label{def:def2}
\end{definition}
\subsubsection{Proof of Theorem \ref{thm:thm3-1} (by induction)}
The idea of the proof is to use the distance functions defined in Definitions \ref{def:def1} and \ref{def:def2} and Lemma \ref{lemma:l4} to provide an upper bound on $\min_{z\in\{-1,+1\}}\lVert zA_i(t)-A_i^*\rVert$ for all $i \in [r]$.
The key observation is that for each update for $A$, we have
\subsection{Experiments}
The paper presents three experimental results that validate the theoretical guarantees proved above: a) the advantage of alternating minimization over one-shot initialization, b) the linear convergence of alternating minimization, and c) the sample complexity of alternating minimization. \\
Data generation: Each entry of the dictionary matrix $A$ is chosen i.i.d. from $\mathcal{N}(0, 1/\sqrt{d})$. The support of each column of $X$ is chosen independently and uniformly from the set of all $s$-subsets of $[r]$. Each non-zero element of $X$ is chosen uniformly at random from $[-2, -1] \cup [1, 2]$. We measure the error in the recovery of the dictionary by $\mathrm{error}(A) = \max_i \sqrt{1-\frac{\langle A_i, A_i^* \rangle ^2}{||A_i||_2^2 ||A_i^*||_2^2}}$. Observe that the data generated in this way satisfies Assumptions A and B.\\
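A sketch of this synthetic setup and of the error metric in Python/NumPy is given below; reading $\mathcal{N}(0, 1/\sqrt{d})$ as a normal distribution with standard deviation $1/\sqrt{d}$, and matching the columns of $A$ to those of $A^*$ before computing the error, are assumptions of the sketch.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def generate_data(d, r, n, s):
    # Dictionary entries i.i.d. normal; each column of X has a uniformly random
    # s-sized support with non-zero magnitudes drawn uniformly from [1, 2].
    A_star = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, r))
    X_star = np.zeros((r, n))
    for m in range(n):
        support = rng.choice(r, size=s, replace=False)
        signs = rng.choice([-1.0, 1.0], size=s)
        X_star[support, m] = signs * rng.uniform(1.0, 2.0, size=s)
    return A_star, X_star, A_star @ X_star    # Y = A* X*

def dictionary_error(A, A_star):
    # error(A) = max_i sqrt(1 - <A_i, A_i*>^2 / (||A_i||^2 ||A_i*||^2)),
    # assuming the columns of A are already aligned with those of A_star.
    num = np.sum(A * A_star, axis=0) ** 2
    den = np.sum(A ** 2, axis=0) * np.sum(A_star ** 2, axis=0)
    return float(np.max(np.sqrt(np.clip(1.0 - num / den, 0.0, None))))
\end{verbatim}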
\end{enumerate}
\subsection{Future Directions}
\begin{enumerate}
\item[(1)] The current sparse recovery step in the alternating minimization algorithm decodes the coefficients individually for each sample. There may be better algorithms that decode the coefficients for all samples at the same time. Such algorithms would enable control over properties across samples, such as the number of samples per dictionary element.