
Commit bcc2495

Fixed sections in dict learning
1 parent a2f5631 commit bcc2495

2 files changed: +25 -14 lines changed

chapter_6/2_dictionary_learning.tex

+23 -13
@@ -1,12 +1,13 @@

\section{Dictionary learning}
-\subsection{Introduction}
+
Sparse dictionary learning is a representation learning method that aims to find a sparse representation of the input data as a linear combination of basic elements. These elements form the dictionary.

The problem of sparse coding consists of the unsupervised learning of the dictionary and the coefficient matrices. Thus, given only unlabeled data, we aim to learn the set of dictionary atoms, or
basis functions, that provide a good fit to the observed data. Sparse coding is important in a variety of domains. Sparse coding of natural images has yielded dictionary atoms that resemble the receptive fields of neurons in the visual cortex. Other important applications include compressed sensing and signal recovery.

-\subsection{Problem formulation}
+\subsubsection{Problem formulation}
+
Dictionary learning has a particularly simple setup, as follows. Given $n$ i.i.d. samples $y^i\in \mathbb{R}^d$, $i\in [n]$, from the generative model
\begin{align}
y^i = A^* x^{i*}, \ i\in [n], \label{eq:gen-model}
@@ -15,13 +16,12 @@ \subsection{Problem formulation}
However, fixing one of $A$ and $x$, the problem reduces to a linear regression with response variable $y$ and covariates given by either $A$ or $X$, for which various classical optimization methods for least-squares problems can be used. This leads to various alternating minimization heuristics, which have proven effective in practice. Meanwhile, theoretical guarantees for alternating minimization - such as guaranteed convergence to a global optimum and local rates of convergence - have also been developed.


-% \section{Alternating minimization: algorithms and theoretical guarantees}
-
A generic \textit{alternating minimization} algorithm works as follows. First, initialize $A^{(0)}$ and $x^{(0)}$. In each epoch, first fix $A$ and update $x$: $x^+ \leftarrow \phi(A, x)$; then fix $x^+$ and update $A$: $A^+ \leftarrow \psi(A, x^+)$. The updates $\phi$ and $\psi$ might involve randomness, such as coordinate-wise updates through a random permutation or an SGD-type method with random sampling of the data.
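
To make this generic scheme concrete, here is a minimal Python sketch (not part of the source; the update maps $\phi$ and $\psi$ are instantiated with illustrative choices, namely a proximal-gradient sparse coding step and a gradient-plus-normalization dictionary step):

import numpy as np

def soft_threshold(z, lam):
    # Entry-wise soft thresholding, a common proxy for a sparse coding update.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def alternating_minimization(Y, A0, n_epochs=50, lam=0.1, step=0.01):
    """Generic AM loop: alternately update the coefficients X and the dictionary A.

    Y  : (d, n) data matrix whose columns are the samples y^i
    A0 : (d, r) initial dictionary estimate
    """
    A = A0.copy()
    X = np.zeros((A.shape[1], Y.shape[1]))
    for _ in range(n_epochs):
        # phi: update the coefficients with A fixed (one proximal gradient step here).
        X = soft_threshold(X - step * A.T @ (A @ X - Y), lam)
        # psi: update the dictionary with X fixed (one gradient step on the least-squares loss).
        A = A - step * (A @ X - Y) @ X.T
        # Keep the dictionary columns at unit norm.
        A /= np.maximum(np.linalg.norm(A, axis=0, keepdims=True), 1e-12)
    return A, X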

Here, we discuss two specific AM algorithms and their respective theoretical convergence/recovery guarantees - that is, under assumptions on the data-generating process and on the algorithm parameters, the iterates converge to the true dictionary with high probability. Both algorithms, as the current theoretical guarantees suggest, achieve \textit{local linear convergence}; that is, the dictionary iterates $A^{(t)}$ satisfy $\|A^{(t)} - A^*\| \leq C \cdot \eta^t \|A^{(0)} - A^*\|$ for some $C>0$ and $0<\eta<1$, where $A^*$ is the true underlying dictionary, as long as $A^{(0)}$ is sufficiently close to $A^*$. The first algorithm below has a provably larger radius of convergence; that is, $A^{(0)}$ can be further away from $A^*$. It also makes a weaker assumption on the dictionary matrix, namely an upper bound on $\|A^*\|_\infty$ instead of on its operator norm, while allowing the operator norm to grow with the dimension.

\subsection{Alternating minimization algorithm I}
+
The first algorithm is the one considered in \cite{chatterji2017alternating}, in which the authors provide theoretical guarantees of local linear convergence. At a high level, this algorithm combines a specific robust sparse least-squares estimation subroutine (for updating the coefficients $x$) with a simple gradient step (for updating the dictionary $A$).
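
For concreteness, a hedged sketch of such a dictionary gradient step (a generic least-squares gradient update; the exact update and scaling analyzed in \cite{chatterji2017alternating} may differ):

import numpy as np

def dictionary_gradient_step(A, Y, X, eta):
    """One gradient step on (1/2n) * ||Y - A X||_F^2 with the coefficients X held fixed.

    A : (d, r) current dictionary estimate
    Y : (d, n) samples
    X : (r, n) coefficients returned by the sparse estimation subroutine
    eta : stepsize
    """
    n = Y.shape[1]
    grad = (A @ X - Y) @ X.T / n  # gradient of the least-squares loss in A
    return A - eta * grad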

We first introduce the robust sparse least-squares estimation subroutine, which is defined as follows. Let $\gamma, \lambda, \nu > 0$ be tuning parameters and let $R>0$ be an upper bound on the entry-wise maximum dictionary estimation error ($\max_{i,j} |A_{ij} - A^*_{ij}|$). Let $(\hat{\theta}, \hat{t}, \hat{u})\in \mathbb{R}^r \times \mathbb{R}_+\times \mathbb{R}_+$ be the solution to the following convex minimization problem:
@@ -68,7 +68,7 @@ \subsection{Alternating minimization algorithm I}

Next, we discuss the various assumptions on the problem data and the algorithm parameters used in \cite{chatterji2017alternating} in order to establish convergence guarantees for the algorithms.

-\subsection{Assumptions on the problem}
+\subsubsection{Assumptions on the problem}

In order to develop convergence guarantees for Algorithm \ref{alg:am-for-dl-with-mus-estimator}, it is customary to assume a fixed, deterministic ground-truth dictionary $A^*$, with the $x^{i*}$ drawn i.i.d. from a distribution satisfying various regularity assumptions such as bounded variance, independent coordinates, and sparse support. In order to prove local convergence of an optimization algorithm for a non-convex problem with many local optima, it is usually necessary to assume that the initial iterate is close to a particular local optimum, while different optima are sufficiently ``far apart''.

@@ -101,7 +101,8 @@ \subsection{Assumptions on the problem}
\item (C5) Mean and variance of variables in the support: $\mathbb{E}(x_i^*|x_i^*\neq 0) = 0$ and $\mathbb{E}((x_i^*)^2|x_i^*\neq 0) = 1$ for all $i$.
\end{itemize}

-\subsection{Assumptions on algorithm parameters}
+\subsubsection{Assumptions on algorithm parameters}
+
The following are the assumptions on Algorithm \ref{alg:am-for-dl-with-mus-estimator} as well as the sparse estimation subroutine \ref{eq:mu-selector}.
\begin{itemize}
\item The stepsize $\eta$ for the gradient update of $A$ should satisfy $\frac{3r}{4s}\leq \eta \leq \frac{r}{s}$.
@@ -125,8 +126,8 @@ \subsection{Assumptions on algorithm parameters}
an ``expectation'' term and a ``deviation'' (variance) term.
A deterministic convergence result can then be proven by working with the expectation terms and the MUS estimation errors (in particular, MUS is guaranteed to estimate the sign correctly in every iteration). Then, making use of concentration inequalities, the behavior of the deviation terms and of the MUS estimation errors can be controlled.

-% Copy those assumptions directly, and add some explanation where possible.
-\subsection{Convergence Results}
+\subsubsection{Convergence Results}
+
By a lemma in \cite{chatterji2017alternating}, Algorithm \ref{alg:am-for-dl-with-mus-estimator} estimates the sign of $x^*$ correctly in each loop, i.e. $\mathrm{sgn}(x)=\mathrm{sgn}(x^*)$. According to Assumption (B1), the initial value of $A$ is close to the true dictionary $A^*$ up to permutation.

Between time $t$ and time $t+1$, let us consider the entries of $A$. Denote by $a^*_{ij}$ the $(i,j)$-th entry of the true dictionary $A^*$, by $a_{i,j}$ that of the current dictionary iterate $A^{(t)}$ at loop $t$, and by $a'_{i,j}$ that at loop $t+1$. Denote by $x^{m^*}_k$ the $k$-th coordinate of the $m$-th true covariate, and by $x^m_k$ the $k$-th coordinate of the estimate of the $m$-th covariate at step $t$. Write $R^{(t)}=R$ and $n^{(t)}=n$ for simplicity. Let $\bar g_{ij}$ be the $(i,j)$-th entry of the gradient computed with $n$ samples at step $t$, and $g_{ij}$ the $(i,j)$-th entry of the expected gradient. We have:
@@ -183,6 +184,7 @@ \subsection{Convergence Results}
We have $\eta|\epsilon_n|\leq R/8\leq R^{(0)}/8$ with probability at least $1-\delta$ in each iteration. Thus, by taking a union bound over the iterations, we are guaranteed to remain in the initial ball of radius $R^{(0)}$ with high probability, which completes the proof.

\subsection{Alternating Minimization Algorithm II}
+
In this section we describe another alternating minimization algorithm, used in \cite{ref1} to learn the dictionary matrix $A^*$. This paper implements a different specific kind of alternating minimization: estimating the coefficients via $\ell_1$ minimization and estimating the dictionary via least-squares estimation. The paper gives a theoretical guarantee for this alternating minimization algorithm: under a set of conditions on the dictionary matrix $A^*$, the coefficient matrix $X^*$, and the sample complexity, the algorithm 1) achieves local linear convergence, and 2) recovers the dictionary matrix $A^*$ to arbitrary accuracy. In fact, under the specified conditions, at most $\mathcal{O}(\log_2 1/\epsilon)$ rounds of alternating minimization are needed to ensure that each dictionary atom is approximated within an absolute error of $\pm \epsilon$. We use $A(i)$ and $X(i)$ to denote the dictionary estimate and coefficient estimate after the $i$-th round of alternating minimization. \\

Input: Samples $Y$, initial dictionary $A(0)$, accuracy sequence $\epsilon_t$, and sparsity level $s$. (The way to estimate $A(0)$ and the selection of the $\epsilon_t$'s will be explained later.)
@@ -205,12 +207,14 @@ \subsection{Alternating Minimization Algorithm II}
The algorithm above alternates between two procedures: a sparse recovery step that estimates the coefficients given a fixed dictionary via $\ell_1$ minimization, and a step that estimates the dictionary given a fixed coefficient matrix via least-squares estimation. Note that the ``Threshold'' step in the algorithm guarantees that the resulting coefficient matrix is sparse. The ``Normalize'' step normalizes each dictionary atom to have $\ell_2$ norm $1$; we may assume this without loss of generality because the columns of the coefficient matrix can always be rescaled accordingly.
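
For illustration, here is a minimal Python sketch of one round of this alternation (not the authors' implementation; scikit-learn's Lasso is used as a stand-in for the $\ell_1$ minimization step, and the regularization weight and the use of $\epsilon_t$ as the threshold level are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Lasso

def am_round(Y, A, eps_t, lam=0.05):
    """One round of the alternation: l1-based sparse recovery, hard thresholding,
    least-squares dictionary update, and column normalization.

    Y : (d, n) samples, A : (d, r) current dictionary estimate, eps_t : accuracy parameter.
    """
    r, n = A.shape[1], Y.shape[1]
    # Sparse recovery: estimate each coefficient column by l1-regularized regression.
    X = np.zeros((r, n))
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    for i in range(n):
        X[:, i] = lasso.fit(A, Y[:, i]).coef_
    # Threshold: zero out small entries so that the coefficient matrix stays sparse.
    X[np.abs(X) < eps_t] = 0.0
    # Least-squares dictionary update: minimize ||Y - A X||_F^2 over A with X fixed.
    A_new = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
    # Normalize: rescale each dictionary atom to unit l2 norm.
    A_new /= np.maximum(np.linalg.norm(A_new, axis=0, keepdims=True), 1e-12)
    return A_new, X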


-\subsection{Lemmas, theorems, and proof}
+\subsubsection{Lemmas, theorems, and proof}
+
The paper has two main theorems, based on two sets of assumptions. Assumptions (A1)-(A7) give local linear convergence (Theorem \ref{thm:thm3-1}), but (A5) requires knowledge of an estimate of $A$ not too far from $A^*$ as initialization. Based on a similar set of assumptions, (B1) and (B3)-(B5), Theorem \ref{thm:thm3-2} (a specialization of Theorem 2.1 from \cite{ref1}) gives an initialization that satisfies (A5), thus showing that Algorithm 1 is feasible and yields exact recovery (Corollary \ref{cor:cor3-1}).

We will use the shorthand Supp($v$) and Supp($W$) to denote the sets of non-zero entries of $v$ and $W$, respectively. $||w||_p$ denotes the $\ell_p$ norm of a vector $w$; by default $||w||$ denotes the $\ell_2$ norm of $w$. $||W||_2$ denotes the spectral norm of a matrix $W$, and $||W||_{\infty}$ denotes the largest entry of $W$ in magnitude. For a matrix $X$, $X^i$, $X_i$, and $X_j^i$ denote the $i^{th}$ row, the $i^{th}$ column, and the $(i, j)^{th}$ element of $X$, respectively.

-\subsection{Assumption A, Theorem \ref{thm:thm3-1}}
+\subsubsection{Assumption A, Theorem \ref{thm:thm3-1}}
+
Without loss of generality, assume that the elements are normalized: $\lVert A_i^* \rVert_2=1,~\forall i\in[r]$.
Assumptions:

@@ -256,7 +260,8 @@ \subsection{Assumption A, Theorem \ref{thm:thm3-1}}

Another way to understand this theorem is that, based on Assumption (A6), the initialization should have error no greater than $\mathcal{O}(1/s^2)$. Thus, we can view this $\mathcal{O}(1/s^2)$ as the size of the basin of attraction: as long as $A(0)$ is within this basin, Algorithm 1 will succeed.

-\subsection{Assumption B, Theorem \ref{thm:thm3-2}}
+\subsubsection{Assumption B, Theorem \ref{thm:thm3-2}}
+
Assumptions:

(B1) \textbf{Incoherent Dictionary Elements}: Without loss of generality, assume that all the elements are normalized: $\lVert A_i^*\rVert_2=1$ for $i\in [r]$. We assume a pairwise incoherence condition on the dictionary elements: for some constant $\mu_0>0$, $\lvert \langle A_i^*,A_j^*\rangle\rvert < \frac{\mu_0}{\sqrt{d}}$.
@@ -295,7 +300,8 @@ \subsection{Assumption B, Theorem \ref{thm:thm3-2}}

Notice that the sets of Assumptions A and B are similar (though some of the B assumptions are stricter): (B3) is stricter than (A3) in that it adds a lower bound on $\lvert X_j^{*i} \rvert$, and assumptions (B1) and (B4) together imply (A1).

-\subsection{Lemmas}
+\subsubsection{Lemmas}
+
\begin{lemma}
(Error in sparse recovery) Let $\Delta X:=X(t)-X^*$. Assume that $2\mu_0s/\sqrt{d}\leq 0.1$ and $\sqrt{s\epsilon_t}\leq 0.1$. Then we have $Supp(\Delta X) \subset Supp(X^*)$ and the error bound $\lVert \Delta X \rVert_{\infty} \leq 9s\epsilon_t$.
\label{lemma:l1}
@@ -337,7 +343,9 @@ \subsection{Lemmas}
\end{equation*}
\label{def:def2}
\end{definition}
-\subsection{Proof of Theorem \ref{thm:thm3-1} (by induction)}
+
+\subsubsection{Proof of Theorem \ref{thm:thm3-1} (by induction)}
+
The idea of the proof is to use the distance functions defined in Definitions \ref{def:def1} and \ref{def:def2}, together with Lemma \ref{lemma:l4}, to provide an upper bound for $\min_{z\in \{-1,+1\}}\lVert zA_i(t)-A_i^*\rVert,~\forall i \in [r]$.
The key observation is that for each update of $A$, we have
\begin{equation*}
@@ -391,6 +399,7 @@ \subsection{Proof of Theorem \ref{thm:thm3-1} (by induction)}


\subsection{Experiments}
+
The paper presents three experimental results that validate the theoretical guarantees: a) the advantage of alternating minimization over one-shot initialization, b) the linear convergence of alternating minimization, and c) the sample complexity of alternating minimization. \\

Data generation: Each entry of the dictionary matrix $A$ is chosen i.i.d. from $\mathcal{N}(0, 1/\sqrt{d})$. The support of each column of $X$ is chosen independently and uniformly from the set of all $s$-subsets of $[r]$. Each non-zero element of $X$ is chosen uniformly at random from $[-2, -1] \cup [1, 2]$. We measure the error in the recovery of the dictionary by $error(A) = \max_i \sqrt{1-\frac{\langle A_i, A_i^* \rangle ^2}{||A_i||_2^2 ||A_i^*||_2^2}}$. Observe that the generated data satisfies Assumptions A and B.\\
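
A small Python sketch of this data-generation scheme and error metric (written for this summary, not the paper's code; the dimensions d, r, s, n and the reading of $1/\sqrt{d}$ as the standard deviation are assumptions):

import numpy as np

def generate_data(d=64, r=128, s=3, n=5000, seed=0):
    rng = np.random.default_rng(seed)
    # Dictionary entries i.i.d. N(0, 1/sqrt(d)) (1/sqrt(d) taken as the standard deviation).
    A = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, r))
    # Each column of X has a uniformly random s-subset support,
    # with non-zero entries uniform on [-2, -1] U [1, 2].
    X = np.zeros((r, n))
    for i in range(n):
        supp = rng.choice(r, size=s, replace=False)
        X[supp, i] = rng.uniform(1.0, 2.0, size=s) * rng.choice([-1.0, 1.0], size=s)
    return A, X, A @ X

def dictionary_error(A_hat, A_star):
    # error(A) = max_i sqrt(1 - <A_i, A_i*>^2 / (||A_i||^2 ||A_i*||^2)).
    num = np.sum(A_hat * A_star, axis=0) ** 2
    den = np.sum(A_hat ** 2, axis=0) * np.sum(A_star ** 2, axis=0)
    return np.max(np.sqrt(np.clip(1.0 - num / den, 0.0, None)))
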
@@ -406,6 +415,7 @@ \subsection{Experiments}
\end{enumerate}

\subsection{Future Directions}
+
\begin{enumerate}
\item[(1)] The current sparse recovery step in the alternating minimization algorithm decodes the coefficients individually for each sample. There may be better algorithms that decode the coefficients for all samples simultaneously. Such algorithms would enable control over properties across samples, such as the number of samples assigned to each dictionary element.

chapter_6/chapter_6.tex

+2 -1
@@ -1,5 +1,6 @@
\chapter{Linear Dimensionality Reduction}
\begin{refsection}
-\input{chapter_6/1_linear_dim_red}
+\input{chapter_6/1_linear_dim_red}
+\input{chapter_6/2_dictionary_learning}
\printbibliography[heading=subbibliography]
\end{refsection}
