\chapter{Conclusion and Future Work}\label{conclusion}
Previously, we saw the $\theta$RARes model, which processes sentences transformed into grammatical form and uses localist word representations to solve the TRA task \cite{xavier:2013:RT}. We have also seen the proposed neuro-inspired language models (Word2Vec-$\theta$RARes and Word2Vec-ESN classifier) and the experiments performed with them. In this chapter, we conclude this research, first by summarizing the findings from the experiments performed, and then by outlining the possible future work identified during the development of this study.
\section{Conclusion}
This thesis started with the goal of testing the hypothesis that distributed word representations, in combination with a recurrent neural network, could be used to achieve better performance on the TRA task. To test this hypothesis, the Word2Vec-$\theta$RARes language model and the Word2Vec-ESN classifier were proposed, and several experiments were performed with them. The outcomes of these experiments lead to the conclusions that follow.
\subsection{Generalization: Performance on unseen sentences}
One basic conclusion from this research is that, in the SCL mode, the proposed Word2Vec-$\theta$RARes model using distributed word embeddings performs better than the $\theta$RARes model, which processes the sentences transformed into grammatical form using localist word vectors.
\paragraph{Performance on a limited number of sentences:} It was shown in Experiment-1 that the Word2Vec-$\theta$RARes model learned the sentences thoroughly when trained on a limited number of sentences but failed to generalize to unseen sentences. The $\theta$RARes model, in contrast, was able to generalize better on the small corpus because of the generalization imposed on the data by replacing the semantic words with the `SW' token during pre-processing. This leads to the conclusion that the $\theta$RARes model performs better on a small corpus, whereas the Word2Vec-$\theta$RARes model fails to do so.
\paragraph{Performance on corpus-462 and corpus-90582:} Recall that the extended corpus-462 and corpus-90582 contain sentences with verb inflections (i.e. -ed, -ing, -s). For these corpora, the $\theta$RARes model processes the sentences transformed into grammatical form using localist word vectors. Thus, its input layer contains a unit for each closed-class word, a unit to encode semantic words (the `SW' token) and additional units to encode the verb inflections \cite{xavier:2013:RT}. For the Word2Vec-$\theta$RARes model, on the other hand, the inflected verbs in the corpora were replaced with the corresponding verb conjugations during preprocessing, and the topologically modified coded meaning was used for the experiments with corpus-462 and corpus-90582. Using verb inflections as an input to the $\theta$RARes model and the topologically modified coded meaning with the Word2Vec-$\theta$RARes model does not satisfy the requirements of an end-to-end system: both models have to depend on a POS tagger to extract the verb inflections and the verbs, respectively, from a sentence. The performance of each model therefore becomes dependent on the accuracy of the parser. Assuming that the parser is $100\%$ accurate, the Word2Vec-$\theta$RARes model generalized on both corpora but performed much better in the SCL mode than in the SFL mode, whereas the reverse holds for the $\theta$RARes model. Additionally, the Word2Vec-$\theta$RARes model outperforms the $\theta$RARes model in the SCL mode. The word embeddings learned from nearby words and the capability of the ESN to model long-range dependencies in the sentences can be credited for this behavior. In the SFL mode, the performance of the Word2Vec-$\theta$RARes and $\theta$RARes models remained almost equivalent. So, it can be concluded that when both models depend on a parser, the Word2Vec-$\theta$RARes model performs better than the $\theta$RARes model.
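To make the difference between the two input codings concrete, the following minimal sketch contrasts them; the vocabulary, unit layout and embedding dimension are illustrative placeholders and not the exact input layer used by either model.
\begin{verbatim}
import numpy as np

# Illustrative localist coding: one unit per closed-class word, one unit
# for the generic semantic-word token 'SW', and extra units for the verb
# inflections (-ed, -ing, -s). Vocabulary and layout are placeholders.
closed_class = ['the', 'by', 'to', 'was', 'that', 'it']
inflections = ['-ed', '-ing', '-s']
units = closed_class + ['SW'] + inflections

def localist_vector(token):
    """One-hot vector for a token; semantic words collapse onto 'SW'."""
    vec = np.zeros(len(units))
    key = token if token in units else 'SW'
    vec[units.index(key)] = 1.0
    return vec

# Distributed coding: the word's embedding is fed directly to the
# reservoir, so no transformation to grammatical form is needed.
# 'w2v' stands for any mapping word -> embedding (e.g. trained
# Word2Vec vectors); it is a placeholder here.
def distributed_vector(token, w2v):
    return w2v[token]  # e.g. a 50-dimensional dense vector

print(localist_vector('the'))  # activates the 'the' unit
print(localist_vector('dog'))  # activates the generic 'SW' unit
\end{verbatim}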
The Word2Vec-ESN classifier also generalized with high classification scores when the raw sentences were processed using distributed word embeddings. When the sentences transformed into grammatical form and encoded with localist vector representations were used instead, the classifier did not generalize well, as shown by the low classification scores. Overall, the results indicate that distributed word embeddings improve the performance of the Word2Vec-ESN classifier over localist word vectors.
\paragraph{Performance on corpus-373:} As corpus-373 does not contain verb inflections, the $\theta$RARes model processes its sentences without input units for verb inflections, and the Word2Vec-$\theta$RARes model does not use the topologically modified coding. This makes both models independent of the parser, and thus their performance can be compared directly. As shown in Experiment-6, the Word2Vec-$\theta$RARes model outperforms the $\theta$RARes model. Hence, it can be concluded that the semantic and syntactic information captured in distributed word vectors helps the Word2Vec-$\theta$RARes model to enhance its performance on the TRA task.
\subsection{Effect of reservoir and corpus size}
It was also shown that the generalization ability of the Word2Vec-$\theta$RARes model on the TRA task increases with increasing reservoir and corpus size but asymptotes at a certain point. With the increase in both reservoir and corpus size, the Word2Vec-$\theta$RARes model generalized better in the SCL mode than in the SFL mode. The generalization ability in both learning modes becomes stable once the corpus and reservoir size are increased beyond the asymptotic point. Overall, the conclusion to be drawn is that, with a large corpus and a large reservoir, the Word2Vec-$\theta$RARes model generalizes better. Moreover, with increasing corpus and reservoir size, the $\theta$RARes model performs better in the SFL mode, whereas the Word2Vec-$\theta$RARes model outperforms the $\theta$RARes model in the SCL mode.
The Word2Vec-ESN classifier, on the other hand, achieved a high classification score even with a small reservoir of 250 neurons, and its performance does not deviate much with a further increase in reservoir size. Thus the model can be used with a small reservoir, which makes it computationally cheaper for the TRA task.
\subsection{Structure of corpus}
The results of Experiments 2 and 3 show that the Word2Vec-$\theta$RARes model generalizes well in both the SCL and SFL modes when the inherent grammatical structure is present in the sentences, but fails to generalize when the word order in the sentences is scrambled. Thus we can conclude that the Word2Vec-$\theta$RARes model is not generalizing from some incidental regularity in the corpus but is, instead, exploiting the inherent grammatical structure of the sentences to learn and generalize.
The Word2Vec-ESN classifier, on the other hand, generalized with high classification scores on the scrambled corpus in configuration-1 (with distributed word embeddings). This behavior is not surprising; the explanation lies in the way the Word2Vec-ESN classifier is trained. Recall that, unlike the Word2Vec-$\theta$RARes model, the Word2Vec-ESN classifier is trained to classify an argument-predicate pair to a role, so the model learns the mapping between argument-predicate pairs and the corresponding roles. Therefore, even if the word order is changed, the model utilizes the semantic and syntactic information, as well as the information about context words captured in the word embeddings, to classify the argument-predicate pairs correctly. Hence, two conclusions can be drawn from these results. First, the Word2Vec-ESN classifier can perform better with distributed embeddings in cases where sentences are not completely grammatically correct. Second, this can be considered a drawback for the TRA task, as the model always classifies the argument-predicate pairs correctly even when the sentences do not make any sense.
In configuration-2 (with localist word vectors), the Word2Vec-ESN classifier failed to generalize on the scrambled corpus. The reason is the transformation of the sentences into abstract grammatical form: using the same `SW' token for all semantic words and encoding the words in a localist fashion prevents the model from assigning the appropriate thematic role to the input argument-predicate pair, leading to a failure to generalize.
\subsection{Robustness of the model}
Recall that the Word2Vec-$\theta$RARes model processes corpus-373 without using the topologically modified coded meaning. Nevertheless, it was shown in Experiment-6 that the model generalized gracefully on corpus-373 using the same model parameters learned from corpus-462 with the topologically modified coded meaning. Thus, it can also be concluded that the model parameters are robust enough to generalize to new corpora for the TRA task. There remains, however, the possibility of further optimizing the performance of the Word2Vec-$\theta$RARes model on corpus-373 by tuning the ESN parameters on this corpus.
\subsection{Dimension of word vectors}
Mikolov et al. \cite{w2v:mikolov_2013_efficient,w2v:mikolov_2013_distributed} showed that higher-dimensional word embeddings perform better on the word analogy task than word vectors with lower dimensions. We have seen in the results of Experiment-7 that the dimension of the distributed word vectors does not significantly affect the performance of the Word2Vec-$\theta$RARes model on the TRA task: larger word vector dimensions increased the computational cost, but the performance gain was negligible. So, following the principle of Occam's razor, which states that ``It is vain to do with more what can be done with fewer'' \cite{razor:franklin_2002}, it can be concluded that smaller word vectors should be preferred over larger ones with the Word2Vec-$\theta$RARes model for the TRA task.
\subsection{Online re-analysis of sentence meanings}
In addition to generalizing to unseen sentences, the proposed Word2Vec-$\theta$RARes model was also shown to perform an online re-analysis of the sentence meaning upon the arrival of each word. The read-out activity of the model changes each time a new word is input, and the read-out activations at any time step can be interpreted as an estimated prediction of the sentence meaning given the input received so far \cite{xavier:2013:RT}. This behavior is not new in itself: online re-analysis of sentence meanings was also observed with the $\theta$RARes model. However, the use of word embeddings learned from the context words, and thus possibly encapsulating information about nearby words, enables the Word2Vec-$\theta$RARes model to make a very early prediction of the sentence meaning. This behavior seems natural in humans when a sentence is being heard: a person listening to a sentence starts predicting its meaning as soon as the first few words are heard, making assumptions based on previous experience. This advantage of early prediction can be useful for Human-Robot Interaction (HRI), as a robot can start executing actions, using the output activations, even before the completion of the sentence \cite{tra:xavier_hri}.
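This online re-analysis can be illustrated with a minimal, self-contained sketch of a leaky-integrator ESN processing a sentence word by word. All sizes, weights and embeddings below are random placeholders rather than the parameters actually used in the experiments; in practice the read-out weights are learned by regression.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 50, 500, 12   # illustrative sizes only

W_in = rng.uniform(-1.0, 1.0, (n_res, n_in))    # fixed random input weights
W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))  # fixed random recurrent weights
W_out = rng.uniform(-0.1, 0.1, (n_out, n_res))  # learned by regression in practice
leak = 0.3                                      # leak rate (illustrative)

def step(x, u):
    """One leaky-integrator reservoir update for a word embedding u."""
    return (1 - leak) * x + leak * np.tanh(W_in @ u + W_res @ x)

x = np.zeros(n_res)
word_embeddings = [rng.standard_normal(n_in) for _ in range(5)]  # stand-in sentence

for t, u in enumerate(word_embeddings):
    x = step(x, u)
    y = W_out @ x
    # y is the read-out after word t: the current estimate of the
    # sentence meaning, re-analysed as each new word arrives.
    print(t, np.round(y[:3], 2))
\end{verbatim}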
The Word2Vec-ESN classifier treats the TRA task as a classification problem: the training objective of the classifier is to classify argument-predicate pairs into thematic roles. Thus, unlike the Word2Vec-$\theta$RARes model, the Word2Vec-ESN classifier was not teacher-forced with the meaning of the whole sentence during training, but instead with the thematic role of the current argument-predicate pair. Therefore, the Word2Vec-ESN classifier cannot perform an online re-analysis of the sentence meaning.
\subsection{Word2Vec-$\theta$RARes vs Word2Vec-ESN classifier}
We have seen that the Word2Vec-$\theta$RARes model and the Word2Vec-ESN classifier process the sentences differently and are evaluated using different metrics, so it would not be fair to claim that one of them performs better than the other. However, it can be argued that, given a large corpus, both models perform better when distributed word embeddings are used rather than localist word vectors. Both models could be employed for the TRA task; each has its own advantages and limitations, which are stated below.
\paragraph{Input coding:} The Word2Vec-$\theta$RARes model processes a sentence word by word, taking one word as input at each time step, whereas the Word2Vec-ESN classifier takes an argument-predicate pair as input. To identify the predicates (verbs) in a sentence, the Word2Vec-ESN classifier has to depend on a syntactic parser, so its performance also depends on the accuracy of that parser. This is one of the limitations of the Word2Vec-ESN classifier.
\paragraph{Output coding:} The output units of the Word2Vec-$\theta$RARes model encode the thematic roles of the semantic words. Thus, in order to encode the thematic roles for all possible semantic words in any sentence, the maximum number of semantic words a sentence in the corpus can contain must be known in advance. Moreover, the number of output units grows with the number of semantic words and the number of thematic roles a semantic word can take. An increase in the number of output units also increases the number of learnable reservoir-to-output weights ($W^{out}$) and hence the computational cost. Unlike in the Word2Vec-$\theta$RARes model, the number of output units in the Word2Vec-ESN classifier is independent of the number of semantic words a sentence can have: it is always equal to the number of thematic roles a word can take. The limited number of output units therefore makes the Word2Vec-ESN classifier computationally cheap.
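As a rough illustration (the numbers are hypothetical; the exact output coding follows \cite{xavier:2013:RT}): if $N_{sw}$ is the maximum number of semantic words per sentence, $N_{r}$ the number of thematic roles and $N_{res}$ the reservoir size, then the Word2Vec-$\theta$RARes read-out needs on the order of $N_{sw} \times N_{r}$ output units, whereas the Word2Vec-ESN classifier always needs $N_{r}$, and in both cases the number of learnable read-out weights is $N_{out} \times N_{res}$. With, say, $N_{sw} = 6$, $N_{r} = 4$ and $N_{res} = 1000$, this amounts to roughly $24\,000$ read-out weights for the Word2Vec-$\theta$RARes model versus $4\,000$ for the Word2Vec-ESN classifier.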
\paragraph{Reservoir size:} The Word2Vec-$\theta$RARes model requires a larger reservoir to generalize, whereas the Word2Vec-ESN classifier generalizes with a small reservoir. Generalization with a small reservoir also makes the Word2Vec-ESN classifier computationally cheap and hence a preferable choice over the Word2Vec-$\theta$RARes model.
\paragraph{Online re-analysis of sentence meaning:} As discussed earlier, the Word2Vec-$\theta$RARes model has the capability to perform an online re-analysis of the sentence meaning across time. In some cases, the model even predicts the meaning of a sentence before its completion. This advantage of online re-analysis and early prediction is not available with the Word2Vec-ESN classifier.
\section{Future Work}
In this research, the proposed Word2Vec-ESN classifier assumes that the syntactic parser or POS tagger used to identify verbs is $100\%$ accurate. It would be interesting to explore the behavior of the Word2Vec-ESN classifier with a highly accurate syntactic parser such as the Stanford parser \cite{parser:stanford}, the Charniak parser \cite{charniak_parser:2000}, the Bikel parser \cite{parser:bikel:2004} or the Berkeley parser \cite{parser:berkley:2006}.
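As a simple illustration of the predicate-identification step that such a parser would perform, a hypothetical sketch using an off-the-shelf POS tagger (here NLTK, which is not one of the parsers cited above) could look as follows:
\begin{verbatim}
# Hypothetical sketch of predicate (verb) identification with an
# off-the-shelf POS tagger; one of the parsers cited above would
# replace this step in practice.
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger'
             # NLTK data packages are installed

def extract_predicates(sentence):
    """Return the tokens tagged as verbs (Penn Treebank tags 'VB*')."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    return [word for word, tag in tagged if tag.startswith('VB')]

print(extract_predicates("The ball was pushed by the boy"))
# expected output (tagger-dependent): ['was', 'pushed']
\end{verbatim}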
As described earlier, the Word2Vec model implementation was modified to allow the vocabulary of an already trained model to be updated: new words can be added to the vocabulary and distributed embeddings learned for them. The modified implementation of the Word2Vec model can thus be extended to make the proposed models ever-learning systems for the TRA task. Updating the Word2Vec model after every sentence was also attempted during this research, but because the computation time was very high and the performance gain negligible, it was left for future work. It would be interesting to explore how the model can be updated in an online manner with minimal computation time, so that it can feasibly be deployed on a robotic platform.
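A minimal sketch of such an incremental vocabulary update, assuming the gensim implementation of Word2Vec (recent versions expose a vocabulary-update flag; the modified implementation used in this work may differ in its details), could look like this:
\begin{verbatim}
# Minimal sketch of incremental vocabulary update with gensim's Word2Vec.
# Parameter names follow recent gensim versions (e.g. vector_size); the
# modified implementation used in this thesis may differ.
from gensim.models import Word2Vec

old_sentences = [["the", "boy", "pushed", "the", "ball"]]
new_sentences = [["the", "girl", "kicked", "the", "toy"]]  # unseen words

model = Word2Vec(old_sentences, vector_size=50, min_count=1)

# Add the new words to the vocabulary and continue training on them.
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences),
            epochs=model.epochs)

print(model.wv["girl"][:5])  # an embedding now exists for the new word
\end{verbatim}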
For this work, the combination of the Word2Vec model and an ESN was used. The ESN has the advantage of modeling sequential data, so the sequential and temporal aspects of a sentence are taken into account in this study for thematic role assignment. However, the dependencies between the thematic roles of the words in a sentence were not taken into consideration during learning. To model the conditional probability distribution of the thematic roles of words, Conditional Random Fields (CRFs), a log-linear model, could be used \cite{crf:intro:sutton}. CRFs have been among the most successful approaches for classification and sequential data labeling tasks \cite{end-to-end, esn:esn_crf}. Thus, the Word2Vec-ESN classifier proposed in this study could be used with an additional CRF layer to model the temporal dependencies between the thematic roles, conditioned on the input sentence, allowing the resulting model to capture the concealed temporal dynamics present in the sentences \cite{esn:esn_crf}. Note that adding a CRF to the Word2Vec-$\theta$RARes model would not be meaningful because, unlike the Word2Vec-ESN classifier, that model takes the meaning of the whole sentence as the teacher output to learn thematic roles.
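For reference, a linear-chain CRF models the conditional probability of a role sequence $y = (y_1, \ldots, y_T)$ given an input sequence $x$ as \cite{crf:intro:sutton}
\[
p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T}
\exp\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\]
where the $f_k$ are feature functions (which, in the proposed extension, could be computed from the read-out activations of the Word2Vec-ESN classifier), the $\lambda_k$ are their learned weights and $Z(x)$ normalizes over all possible role sequences.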
Distributed word vectors obtained by training the Word2Vec model on a corpus in one language (e.g., English) can also be translated to find the most similar word in another target language (e.g., French). This translation is achieved by linearly projecting (rotating and scaling) the word vectors of the source language onto the space of the target language \cite{w2v:language_similarities}. The Word2Vec-$\theta$RARes language model could therefore also be investigated further for multiple language acquisition \cite{hinaut_multiple_lang}.
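Concretely, in that approach a translation matrix $W$ is learned from a small dictionary of word-vector pairs $\{(x_i, z_i)\}_{i=1}^{n}$, with $x_i$ the vector of a source-language word and $z_i$ the vector of its target-language translation, by solving
\[
\min_{W} \sum_{i=1}^{n} \left\| W x_i - z_i \right\|^2 ,
\]
after which a new source word is translated by projecting its vector with $W$ and retrieving the closest target-language word vector (e.g. by cosine similarity) \cite{w2v:language_similarities}.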