
\chapter{Experiments}\label{experiments}

This chapter presents the experiments performed in this dissertation. The rest of the chapter is organized as follows. We first describe the corpora used in the experiments. We then describe the experimental setup used to perform the thematic role assignment (TRA) task with the Word2Vec-ESN model and the variant proposed in the previous chapter. Finally, we describe the individual experiments performed for the TRA task.

\section{Corpora and pre-processing}\label{corpora}

\subsection{Corpora For TRA Task}

To allow a fair comparison between the Word2Vec-ESN model and the $\theta RARes$ model (described in section \ref{sec:xavier_model}), we used the same corpora that Hinaut et al. \cite{xavier:2013:RT, tra:xavier_hri} used to perform the thematic role assignment task with the $\theta RARes$ model. Thus corpus-373, corpus-462 and corpus-90582, containing 373, 462 and 90582 sentences respectively, were used for the thematic role assignment task.

\paragraph{Corpus-462 and Corpus-90582: } The sentences in corpus-462 and corpus-90582 were generated by Hinaut et al. using a context-free grammar for the English language and were used for the TRA task \cite{xavier:2013:RT}. Each sentence in these corpora contains verbs that take 1, 2 or 3 clause elements. For example, the sentences `The man \textit{jump}', `The boy \textit{cut} an Apple' and `John \textit{gave} the ball to Marie' have verbs whose clause elements are agent; agent and object; and agent, object and recipient, respectively. The sentences in the corpora have a maximum of four nouns and two verbs \cite{xavier:2013:RT}. Each sentence contains at most one relative clause, and the verb in a relative clause can take 1 or 2 clause elements (i.e., no recipient), e.g. `The dog that bit the cat chased the boy'. The constructions in both corpus-462 and corpus-90582 have the following form:

\begin{enumerate}[noitemsep]
\item walk giraffe $<\!o\!>$ AP $<\!/o\!>$ ; the giraffe walk -s . \# ['the', 'X', 'X', '-s', '.']
\item cut beaver fish , kiss fish girl $<\!o\!>$ APO , APO $<\!/o\!>$ ; the beaver cut -s the fish that kiss -s the girl . \# ['the', 'X', 'X', '-s', 'the', 'X', 'that', 'X', '-s', 'the', 'X', '.']
\end{enumerate}

Each construction in the corpus is divided into four parts. The first part describes the meaning of the sentence using semantic words (or open class words) in the order predicate, agent, object, recipient. The second part (between `$<\!o\!>$' and `$<\!/o\!>$') describes the order of the thematic roles of the semantic words as they appear in the raw sentence. The third part (between `;' and `\#') is the raw sentence with verb inflections (e.g. `-s'), and the fourth part is the abstract representation of the sentence, in which the semantic words are removed and replaced with `X' \cite{xavier:2013:RT}.

Corpus-90582 contains 90582 sentences along with the coded meaning of each sentence. This corpus is redundant: multiple sentences with different grammatical structures share the same coded meaning (see fig. \ref{fig:meaning_realtions}). In total, there are only 2043 distinct coded meanings \cite{xavier:2013:RT}. The corpus has the additional property that, along with sentences with complete coded meanings, it also contains sentences with incomplete meanings. For example, the sentence ``The Ball was given to the Man'' has no \textit{`Agent'}, and thus the meaning of the sentence is ``given(-, ball, man)''. The corpus also contains $5268$ pairs and $184$ triplets of ambiguous sentences, i.e. $10536$ and $552$ sentences respectively. Thus in total $12.24\%$ of the sentences (i.e. $5268 \times 2 + 184 \times 3 = 11088$ out of $90582$) are ambiguous: they have a similar grammatical structure but different coded meanings \cite{xavier:2013:RT}.

\paragraph{Corpus-373: }Apart from corpus-462 and corpus-90582, which were artificially constructed using a context-free grammar, we also used corpus-373. Corpus-373 consists of the instructions collected from participants interacting with a humanoid robot (iCub) in a Human-Robot Interaction study of language acquisition conducted by Hinaut et al. \cite{tra:xavier_hri}. In the study, the robot first performed one or two actions by placing the available objects at a location (e.g. left, right, middle) while the participants observed the actions. The participants were then asked to instruct the robot, in natural language, to perform the same actions again. The corpus thus contains 373 complex instructions to perform single or double actions with temporal correlation (see the Action Performing task in Hinaut et al. \cite{tra:tra:xavier_hri} for more details). For example, the instruction ``Point to the guitar'' is a single-action command, whereas the instruction ``Before you put the guitar on the left put the trumpet on the right'' is a complex command with two actions, where the second action is specified before the first. This data is thus complex enough to test the learnability and generalization of the model for the TRA task.

\paragraph{Pre-processing: } Recall that the sentences in corpus-462 and corpus-90582 represent verbs together with their inflections (suffixes ``-s'', ``-ed'', ``-ing''). We preprocessed the constructions in corpus-462 and corpus-90582 to obtain raw sentences without separate verb inflections. First, all words are lowercased, and then the verbs with inflections are replaced by their conjugated forms\footnote{service used to find verb conjugations: \text{http://www.scientificpsychic.com/cgi-bin/verbconj2.pl}}. The conjugation to be used depends on the inflection attached to the verb. For example, the sentences `The giraffe walk -s' and `John eat -ed the apple' are changed to `The giraffe walks' and `John ate the apple' respectively. This preprocessing was done because the distributed word representations learned by the word2vec model already capture the syntactic relations that were imposed by the verb inflections, e.g. $vec(walks) - vec(walk) \approx vec(talks) - vec(talk)$. We also added a `$<start>$' token at the beginning and an `$<end>$' token at the end of each sentence to mark its boundaries.
  
\subsection{Corpus For Training Word2Vec Model}

To train the word2vec model, we used the Wikipedia corpus\footnote{https://dumps.wikimedia.org/enwiki/latest/} ($\approx$ 14 GB) and obtained low-dimensional distributed embeddings of words. The corpus contains 2,658,629 unique words. We chose Wikipedia because we needed a general-purpose dataset to obtain the vector representations of words. The Word2Vec model does not produce good-quality vector representations when trained on a small corpus, so a general-purpose dataset with billions of words is required to obtain good word embeddings. In general, the more words we have, the better the vector representations. Once the vector representations of the words in the Wikipedia data are obtained, the model can be trained further on our domain-specific datasets (corpus-462 and corpus-90582), with more bias toward the domain-specific data (by repeating the dataset during training), to update the previously learned vector representations.

\section{Experimental Setup}

\subsection{Obtaining Word Embeddings} \label{get_word_embeddings}

Before using the model for the TRA task, we need the word embeddings. To obtain the vector representations of words, we first trained the Word2Vec model on the Wikipedia dataset. We used the skip-gram model with negative sampling (see Chapter \ref{basics} for more details), as it is reported to accelerate training and to generate better word vectors than the CBOW approach \cite{w2v:mikolov_2013_efficient, w2v:mikolov_2013_distributed}.

To train the Word2Vec model we used a hidden layer with 50 units (the desired dimension of the word embeddings) and a context window of $\pm 5$. A negative sampling size of 5 was chosen, i.e. 5 noise words that do not appear in the context of the current input word are drawn randomly from the vocabulary. As the Wikipedia corpus is huge, it contains words such as ``a'', ``the'' and ``in'' that occur millions of times. These frequent words are therefore discarded with the probability given by equation \ref{eqn:subsampling}, with a subsampling threshold of $t = 10^{-5}$. We ignored all words that appear fewer than $5$ times in the corpus. Stochastic gradient descent was used to update the network weights. The initial learning rate was set to $\alpha = 0.025$ and drops linearly to $\alpha_{min} = 0.0001$ as training progresses.
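
As a rough illustration of these two settings, and assuming that equation \ref{eqn:subsampling} has the form proposed by Mikolov et al. \cite{w2v:mikolov_2013_distributed}, a word $w$ with relative frequency $f(w)$ in the corpus is discarded with probability
\[
P_{discard}(w) = 1 - \sqrt{\frac{t}{f(w)}} .
\]
For a hypothetical frequent word with $f(w) = 0.05$ (an assumed value, not one measured on the Wikipedia corpus) and $t = 10^{-5}$, this gives $P_{discard}(w) = 1 - \sqrt{2 \times 10^{-4}} \approx 0.986$, whereas a word with $f(w) \leq t$ is never discarded. Similarly, if the learning rate decays linearly with the fraction $p \in [0, 1]$ of training completed, its value is $0.025 - (0.025 - 0.0001)\,p$, e.g. $\approx 0.0126$ halfway through training ($p = 0.5$).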

The word embeddings obtained from training on the Wikipedia dataset are accurate enough to capture semantic relationships between words, e.g. $vec(Paris) - vec(France) + vec(Germany) \approx vec(Berlin)$. Before the model is trained on the Wikipedia corpus, a vocabulary of words is created, and once this vocabulary has been created it is not possible to add new words to it. However, when a domain-specific corpus (e.g. corpus-462, corpus-373 etc.) is used to further train the Word2Vec model, some of its words may not be present in the previously generated vocabulary, and the distributed embeddings of these new words cannot be obtained. To enable online training of the Word2Vec model, we therefore needed to update the vocabulary of the model whenever new words are encountered. Unfortunately, neither the C++ implementation\footnote{https://code.google.com/archive/p/word2vec/} nor the Gensim Python implementation \cite{w2v:gensim_api} of Word2Vec supported updating the vocabulary once it had been created. We therefore implemented online training\footnote{The code is adapted from-  http://rutumulkar.com/blog/2015/word2vec} of word2vec by modifying and extending the Gensim API. New words that are not present in the existing vocabulary are now added and initialized with random weights. The model can then be trained in the usual manner to learn the distributed embeddings of the new words. Although the vocabulary can now be updated in an online manner, the embedding of a newly added word is of poor quality if the word occurs only a few times in the new corpus. This can be improved by repeating the new dataset several times before training the model\footnote{Idea discussed on: https://groups.google.com/forum/$\#$!topic/gensim/Z9fr0B88X0w}.

With this online version of word2vec training in place, we extended the word2vec model by resuming training on the domain-specific corpora (corpus-462 and corpus-90582). While updating the model on the new dataset we do not discard any word, irrespective of its count, so that we obtain vector embeddings for all the words in our domain-specific corpora. Once the Word2Vec model is trained, the generated word embeddings are normalized using the $L_2$ norm before they are used. The trained word2vec model is then ready to be used with the Word2Vec-ESN model.
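
Written out, the normalization maps the $50$-dimensional embedding $v_w$ of a word $w$ (the symbol $v_w$ is introduced here only for illustration) to
\[
\hat{v}_w = \frac{v_w}{\|v_w\|_2}, \qquad \|v_w\|_2 = \sqrt{\sum_{i=1}^{50} v_{w,i}^{2}} ,
\]
so that every embedding has unit length and the dot product of two normalized embeddings equals their cosine similarity.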

\subsection{ESN reservoir initialization}

The size of the reservoir is one of the most important parameters of an ESN, and it is often recommended to use as large a reservoir as is computationally affordable, provided an appropriate regularization method is used \cite{esn:practical_guide}. The larger the reservoir, the easier it is for the ESN to learn from the input signal. We therefore chose a reservoir of 1000 (unless otherwise specified) leaky integrator neurons with \textit{tanh} activation function. The input-to-hidden and hidden-to-hidden weights were randomly and sparsely generated from a normal distribution with mean 0 and variance 1. As thematic role assignment does not use any feedback from the readout layer, we do not use any output-to-hidden weights. The state update equations \ref{eqn:res_update} and \ref{eqn:res_state} thus change to:

\begin{equation} \label{eqn:res_new_update}
x'(n) = \tanh\left( W^{res} x(n-1) + W^{in} u(n) \right)
\end{equation}
\begin{equation} \label{eqn:res_new_state}
x(n) = (1-\alpha) x(n-1) + \alpha x'(n)
\end{equation}

To generate the sparse weights, fixed fanout numbers of $F_{hh} = 10$ and $F_{ih} = 2$ were used, i.e. each reservoir neuron was randomly connected to $10$ other reservoir neurons and each input neuron was connected to only $2$ reservoir neurons. With a fixed fanout number, the cost of a reservoir state update scales linearly with the reservoir size \cite{esn:practical_guide}.
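
To give a sense of the resulting sparsity, for a reservoir of $N = 1000$ neurons (with $N$ used here only as shorthand for the reservoir size) the recurrent weight matrix contains
\[
N \times F_{hh} = 1000 \times 10 = 10^{4}
\]
non-zero entries out of $N^{2} = 10^{6}$, i.e. a density of $1\%$. Likewise, with the $50$-dimensional word embeddings as input, the input weight matrix contains $50 \times F_{ih} = 100$ non-zero entries out of $50 \times 1000 = 5 \times 10^{4}$. For a reservoir of $5000$ neurons, the number of non-zero recurrent weights grows only to $5 \times 10^{4}$, whereas a dense matrix would have $2.5 \times 10^{7}$ entries.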

\subsection{Input and Output Coding}

As specified in chapter \ref{approach}, the Word2Vec-ESN model and its variant process sentences differently, and thus their input and output coding also differ. However, the initialization of the reservoir weights is the same for both models.

\paragraph{Word2Vec-ESN model:} A raw sentence is presented to the model, and each word in the sentence is processed across time by both the word2vec model and the ESN. The word2vec model outputs a word embedding of dimension $E_{v} = 50$, which is then used as the input to the ESN. The input layer of the ESN thus has $50$ neurons. For the output coding, the topologically modified but equivalent representation was used \cite{xavier:2013:RT}. The readout layer of the ESN contains 24 $(4 \times 3 \times 2)$ neurons, as the sentences in the corpus have a maximum of 4 nouns, each with 3 possible roles (Agent, Object and Recipient) with respect to a maximum of 2 verbs. Each neuron in the readout layer thus codes for one role of one noun with respect to one verb. An output neuron has an activation of $1$ if the corresponding role is present in the sentence, and $-1$ otherwise. When corpus-90582 was used for training, the number of reservoir neurons was raised to 5000 and the number of readout neurons was increased to 30 $(5 \times 3 \times 2)$, as the sentences of this corpus contain a maximum of 5 nouns.

\paragraph{Word2Vec-ESN model variant:} Recall that in the Word2Vec-ESN model variant a raw sentence is presented to the model such that each word (argument), together with the verb (predicate) with respect to which the word is currently processed, is input to the model across time (see section \ref{sec:model_variant}). A sentence is processed as many times as there are verbs in the sentence. The word2vec model first takes this argument-predicate pair as input and outputs a vector of dimension $2 \times E_{v} = 2 \times 50$, which is then used as the input to the ESN. The input layer thus has $100$ neurons, where the first $50$ neurons encode the vector representation of the word and the remaining $50$ neurons encode the verb with respect to which the word is being processed. Unlike in the Word2Vec-ESN model, the size of the readout layer always remains the same: it contains 5 neurons, each coding for a role: Predicate (P), Agent (A), Object (O), Recipient (R) and No Role (XX), for both corpus-462 and corpus-90582. When corpus-90582 was used, the number of reservoir neurons was increased to 3000. A readout neuron of the ESN has an activation of $1$ if the input word-verb (argument-predicate) pair has the corresponding role, and $-1$ otherwise.
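
In the notation of the state update equations, the input at time step $n$ can be sketched as the concatenation
\[
u(n) = \left[\, vec(w_n) \; ; \; vec(p) \,\right] ,
\]
a vector with $100$ components, where $w_n$ denotes the word presented at step $n$ and $p$ the verb currently under consideration (both symbols are introduced here only for illustration). For example, when the sentence `John gave the ball to Marie' is processed with respect to the verb `gave', the pair (John, gave) has the role Agent. Assuming the readout neurons are ordered as (P, A, O, R, XX), which is an assumption about the ordering and not something fixed by the description above, the target output for this pair is $(-1, 1, -1, -1, -1)$.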

\subsection{Learning ESN parameters} \label{grid_search}

An echo state network has several parameters that must be optimized for the proposed model to perform well on the thematic role assignment task. Some parameters, such as the reservoir size, sparsity and distribution of non-zero weights, are straightforward to set \cite{esn:practical_guide}, whereas others, such as the Spectral Radius (SR), Input Scaling (IS) and Leak Rate (LR), are task dependent and are often tuned over multiple trials. To identify these parameters, we performed a grid search over the parameter space. As the search space can be very large, a coarse grid with wide parameter ranges was first explored to find the optimal region, i.e. the region with low sentence error for the Word2Vec-ESN model and high F1-score for the model variant. This region was then used for a fine search to identify the optimal parameters \cite{esn:practical_guide}. As the proposed model and its variant process sentences differently and have different training objectives, the ESN parameters of the two models were optimized separately. The Word2Vec-ESN model parameters were also optimized separately for the sentence continuous learning (SCL) mode and the sentence final learning (SFL) mode. Here we describe the parameter optimization for the general case of corpus-462.

\paragraph{Word2Vec-ESN model: } Corpus-462 was used to optimize the reservoir parameters. To find the optimal parameters of the Word2Vec-ESN model, the model was trained and tested over a range of parameters using 10-fold cross-validation: the 462 sentences of corpus-462 were randomly split into 10 equally sized subsets (i.e. each subset with $\approx$ 46 sentences). The model was trained on the sentences from 9 subsets and then tested on the remaining subset. This process was repeated 10 times so that each subset was used for testing once. A reservoir of 1000 neurons and a fixed regularization coefficient of $\beta = 10^{-6}$ for ridge regression were used. By exploring the parameter space we identified the optimal parameters as $SR = 2.4$, $IS = 2.5$ and $LR = 0.07$ in SCL mode and $SR = 2.2$, $IS = 2.3$ and $LR = 0.13$ in SFL mode.
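
For reference, the closed-form ridge regression solution for the readout weights, in the standard formulation described in \cite{esn:practical_guide} (the implementation used here may differ in details such as a bias term), is
\[
W^{out} = Y^{target} X^{T} \left( X X^{T} + \beta I \right)^{-1} ,
\]
where the columns of $X$ are the collected reservoir states $x(n)$, the columns of $Y^{target}$ are the corresponding desired readout activations, and $I$ is the identity matrix; the symbols $X$ and $Y^{target}$ are introduced here only for illustration. Note also that, since $462 = 8 \times 46 + 2 \times 47$, a split into 10 folds that is as even as possible yields eight test folds of 46 sentences and two of 47.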

\paragraph{Word2Vec-ESN model variant: } For the Word2Vec-ESN model variant, we identified the optimal parameters with a grid search on the same corpus-462, again using 10-fold cross-validation. Keeping all other parameters, i.e. the reservoir size and regularization coefficient, identical to those of the Word2Vec-ESN model described above, the optimal parameters were identified as $SR = 0.7$, $IS = 1.15$ and $LR = 0.1$.

Later, in experiment \ref{exp-2}, we also test the performance of the model variant on sentences transformed to grammatical form and using localist word vectors. We therefore also need to find the optimal parameters for this transformed dataset. Keeping all conditions the same, we performed a grid search over the parameter space and identified the optimal parameters as $SR = 1.2$, $IS = 0.7$ and $LR = 0.3$.

\section{Experiments}

In this section we describe the experiments performed in this thesis. Unless otherwise specified, all the experiments use the optimal parameters identified during the grid search for the Word2Vec-ESN model and its variant, described in section \ref{grid_search}.

\subsection{Experiment-1: Model performance on a small corpus}\label{exp-1}

To determine the model's capability to predict the thematic roles of sentences using word2vec embeddings, we first used a limited set of 26 sentences (sentences 15 to 40) from corpus-45. The chosen sentences have distinct surface forms and grammatical structures (e.g. active, passive, dative-passive). Using this limited set of sentences we performed simulations with both the Word2Vec-ESN model and its variant. Both models were first trained and tested on all the sentences and then tested using Leave-one-Out (LoO) cross-validation: training on 25 sentences and testing on the remaining sentence, such that every sentence is tested once.
 
For the simulations with the Word2Vec-ESN model, a reservoir of 1000 neurons with reservoir parameters SR=?, IS=?, LR=? was used, whereas for the simulations with the model variant a reservoir of 500 neurons with reservoir parameters SR=?, IS=?, LR=? was used. The parameters for both models were found by exploring the parameter space using LoO cross-validation. This experiment remains a toy demonstration; we explore the generalization capability of the model further with extended corpora in experiments \ref{exp-2} and \ref{exp-5}.

\subsection{Experiment-2: Generalization Capabilities} \label{exp-2}

In experiment \ref{exp-1}, we performed simulations with a limited set of constructions. To test the generalization capability of the model, we examined its behaviour with an extended corpus of 462 sentences (see corpus-462 in section \ref{corpora}).

\paragraph{Word2Vec-ESN model: }We performed simulations with the Word2Vec-ESN model using 10-fold cross-validation. A reservoir of 1000 neurons was used. To compare the generalization ability of the Word2Vec-ESN model with that of the $\theta RARes$ model, simulations were performed in both learning modes, i.e. SCL and SFL. The reservoir parameters identified during the grid search for each learning mode were used, i.e. $SR = 2.4$, $IS = 2.5$ and $LR = 0.07$ in SCL mode and $SR = 2.2$, $IS = 2.3$ and $LR = 0.13$ in SFL mode (see section \ref{grid_search}).

\paragraph{Word2Vec-ESN model variant: }Recall that the Word2Vec-ESN model variant processes sentences differently from the Word2Vec-ESN model (see section \ref{sec:model_variant} in chapter \ref{approach}). To compare the effect on the model variant of processing raw sentences with distributed word embeddings against processing sentences transformed to grammatical form with localist word vectors, we performed two simulations. Both simulations were performed using a reservoir of 1000 neurons.

The first simulation was performed on the raw sentences with reservoir parameters $SR = 0.7$, $IS = 1.15$ and $LR = 0.1$.

In the second simulation we used sentences transformed to grammatical form together with localist word vectors, i.e. the model variant was used without the word2vec unit. We used the reservoir parameters $SR = 1.2$, $IS = 0.7$ and $LR = 0.3$, which were identified previously by exploring the parameter space (see section \ref{grid_search}).

\subsection{Experiment-3: Effect of Corpus structure} \label{exp-3}

Recall that the sentences in corpus-462 were created from a context-free grammar and therefore have an inherent grammatical structure. The models may thus exploit this underlying grammatical structure to some extent for learning and generalizing. To test this hypothesis, and to demonstrate that the model is not generalizing on some other regularity in the corpus, in this experiment we removed the inherent grammatical structure from the sentences by randomizing the word order within each sentence \cite{xavier:2013:RT}. Such a test also gives insight into what the model is actually learning and whether it is overfitting. Overfitting typically occurs when the corpus size is significantly smaller than the number of trainable parameters \cite{xavier:2013:RT}. The Word2Vec-ESN model with a reservoir of 1000 neurons and 24 readout neurons has 24000 $(24 \times 1000)$ trainable parameters, whereas the model variant with a reservoir of $1000$ neurons and $5$ readout neurons has 5000 $(5 \times 1000)$ trainable parameters. In both cases, the number of trainable parameters is significantly greater than the corpus size (i.e. 462 sentences), so overfitting is possible.

In this experiment, we presented the corpus with scrambled sentences (i.e. without any grammatical structure) to both the Word2Vec-ESN model and its variant. Keeping the reservoir parameters the same as in experiment \ref{exp-2}, a 10-fold cross-validation was performed. The cross-validation errors obtained in the previous experiment on corpus-462 with its inherent grammatical structure can then be compared with the cross-validation errors obtained on the scrambled corpus. If the model is not overfitting and is learning from the grammatical structure, then it should perform worse on the scrambled corpus. If the model is overfitting, however, its generalization performance should not differ much between the corpus with and without grammatical structure.

\subsection{Experiment-4: Effect of Reservoir size} \label{exp-4}

The most important hyperparameter affecting the performance of the model is the size of the reservoir (i.e. the number of neurons in the reservoir). Adding neurons to the reservoir is also computationally inexpensive, because the number of read-out weights ($W^{out}$) scales linearly with the number of neurons in the reservoir \cite{esn:learn_gs}. To determine the effect of reservoir size on the performance of the Word2Vec-ESN model and its variant, simulations were performed over a range of reservoir sizes\footnote[1]{Reservoir sizes explored in Word2Vec-ESN model: [90, 291, 493, 695, 896, 1098, 1776, 2882, 3988, 5094]} \footnote[2]{Reservoir sizes explored in model variant: [50, 100, 250, 400, 600, 800, 1050, 1600, 2250, 3320, 3860, 4500]} using 5 instances of the model (i.e. the reservoir weights of the model were initialized using 5 different random generator seeds).

\paragraph{Word2Vec-ESN model: } Using the Word2Vec-ESN model, simulations were performed over a range of reservoir sizes\footnotemark[1] in both SCL and SFL modes. We used the same optimal reservoir parameters as in experiment \ref{exp-2}, i.e. $SR = 2.4$, $IS = 2.5$ and $LR = 0.07$ in SCL mode and $SR = 2.2$, $IS = 2.3$ and $LR = 0.13$ in SFL mode.

\paragraph{Word2Vec-ESN model variant: } Keeping all parameters identical to those of experiment \ref{exp-2}, we also explored the behavior of the Word2Vec-ESN model variant over a range of reservoir sizes\footnotemark[2], both when using raw sentences with word2vec word embeddings and when using the grammatical form of sentences with localist word representations.

\subsection{Experiment-5: Effect of Corpus size} \label{exp-5}

To investigate the effect of corpus size and the scaling capability of the model, we used the extended corpus-90582 (see section \ref{corpora}) in this experiment. As the corpus also contains $12\%$ ambiguous sentences, which impede the learning and generalization of the model, this experiment also tests the model's ability to process ambiguous sentences.

To study the generalization capability of the model with different corpus sizes, 6 sub-corpora were created by randomly sampling $6\%$, $12\%$, $25\%$, $50\%$, $75\%$ and $100\%$ of the sentences from the original corpus of 90582 sentences \cite{end-to-end}.

\paragraph{Word2Vec-ESN model: } Each of the generated sub-corpora was presented to the model and a 2-fold cross-validation was performed: the model was trained on one half of the sub-corpus and tested on the other half, and then the roles of the two halves were swapped. This experiment was performed only with the Word2Vec-ESN model, in both learning modes (SCL and SFL), using 5 model instances, each with a reservoir of 5000 neurons. All other parameters were kept identical to those of experiment \ref{exp-2}.
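
As a rough guide to the amount of data involved, the sampling fractions correspond to sub-corpora of approximately
\[
0.06 \times 90582 \approx 5435, \quad 0.12 \times 90582 \approx 10870, \quad 0.25 \times 90582 \approx 22646,
\]
\[
0.50 \times 90582 = 45291, \quad 0.75 \times 90582 \approx 67937, \quad 1.00 \times 90582 = 90582
\]
sentences (the exact counts depend on how the sampling rounds, which is not specified here). With 2-fold cross-validation, the model is therefore trained on roughly $2700$ sentences for the smallest sub-corpus and on $45291$ sentences for the full corpus.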

\subsection{Experiment-6: Generalization on new corpus} \label{exp-6}

One may argue that the previously used corpora (corpus-462 and corpus-90582), which were artificially constructed using a grammar, add a bias that makes it easier for the model to learn and generalize on the thematic role assignment task. To address this concern, in this experiment we used corpus-373 (see section \ref{corpora}), collected by Hinaut et al. \cite{tra:tra:xavier_hri} in a Human-Robot Interaction (HRI) study of language acquisition.

\paragraph{Word2Vec-ESN model: }To test the generalization of the Word2Vec-ESN model on this corpus, simulations were performed in SCL mode with leave-one-out (LoO) cross-validation on 10 model instances. We chose SCL mode and LoO so that the results can be compared with those obtained using the $\theta RARes$ model \cite{tra:tra:xavier_hri}. For this experiment we did not search for the best parameters; instead we used the optimized parameters obtained on corpus-462 and used in experiment \ref{exp-2}, i.e. $SR = 2.4$, $IS = 2.5$ and $LR = 0.07$. Doing so also enables us to test the robustness of the model parameters, learned previously on corpus-462, on the new corpus.