Two Methods for Domain Adaptation of Bilingual Tasks: Delightfully Simple and Broadly Applicable, Hangya et al., 2018
Paper, Tags: #nlp
Contributions:
- We test a simple method for domain adaptation of bilingual word embeddings and evaluate the adapted embeddings on 2 bilingual tasks from 2 different domains:
- cross-lingual Twitter sentiment classification
- medical bilingual lexicon induction
- We tailor a broadly applicable semi-supervised classification method from computer vision to these tasks.
We study 2 bilingual tasks that depend on bilingual word embeddings (BWEs). BWE adaptation:
- We adapt monolingual word embeddings (MWEs) to the target domain for both the source and target languages by training them on a mix of general and target-domain unlabeled data.
- We use post-hoc mapping: a seed lexicon is used to transform the word embeddings of the 2 languages into the same vector space (see the mapping sketch below).
Our approach is simple and task-independent.
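To make the post-hoc mapping step concrete, here is a minimal sketch of one standard implementation (a least-squares linear map in the spirit of Mikolov et al., 2013, not necessarily the exact variant used in the paper); the function names and the dict-based embedding layout are illustrative:

```python
import numpy as np

def learn_mapping(src_vecs, tgt_vecs, seed_lexicon):
    """Learn a linear map W such that src_vecs[s] @ W ~ tgt_vecs[t]
    for each (s, t) translation pair in the seed lexicon."""
    pairs = [(s, t) for s, t in seed_lexicon
             if s in src_vecs and t in tgt_vecs]
    X = np.stack([src_vecs[s] for s, _ in pairs])  # shape (n, d_src)
    Y = np.stack([tgt_vecs[t] for _, t in pairs])  # shape (n, d_tgt)
    # Least-squares solution of min_W ||X W - Y||_F
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def map_to_target_space(src_vecs, W):
    """Project every source-language vector into the target space,
    so both languages share one vector space."""
    return {w: v @ W for w, v in src_vecs.items()}
```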
BWEs trained on general-domain texts usually perform worse when used in a system for a specific domain, for two reasons:
- Vocabularies of specific domains contain words that aren't used in general text, e.g., names of medicines or diseases.
- The meaning of a word varies across domains.
Our method adapts general-domain BWEs in a way that preserves the semantic knowledge from general-domain data while leveraging monolingual domain-specific data to create domain-specific BWEs. The approach is applicable to any language pair for which monolingual data is available.
We first train MWEs for each language on the concatenation of monolingual out-of-domain and in-domain data:
- out-of-domain data allows us to create accurate distributed representations of common vocabulary
- in-domain data embeds domain-specific words
We then map the two MWEs using a small seed lexicon to create the adapted BWEs.
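As a sketch of the full pipeline, assuming gensim (≥ 4.x) for word2vec training; the corpus file names, language pair, and hyperparameters below are illustrative, not taken from the paper:

```python
from itertools import chain
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def train_adapted_mwe(general_corpus, domain_corpus, dim=300):
    """Train monolingual embeddings on the concatenation of
    out-of-domain and in-domain text (one tokenized sentence per line)."""
    sentences = list(chain(LineSentence(general_corpus),
                           LineSentence(domain_corpus)))
    model = Word2Vec(sentences, vector_size=dim, window=5,
                     min_count=5, sg=1, workers=4)
    return {w: model.wv[w] for w in model.wv.index_to_key}

# Hypothetical corpora for the medical domain.
src_vecs = train_adapted_mwe("en_general.txt", "en_medical.txt")
tgt_vecs = train_adapted_mwe("de_general.txt", "de_medical.txt")

# Map the source MWEs into the target space using the seed lexicon
# (see learn_mapping / map_to_target_space above). A real seed lexicon
# would contain thousands of translation pairs.
seed_lexicon = [("disease", "Krankheit"), ("doctor", "Arzt")]
W = learn_mapping(src_vecs, tgt_vecs, seed_lexicon)
src_in_tgt_space = map_to_target_space(src_vecs, W)
```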
Our delightfully simple, task-independent method to adapt BWEs to a specific domain uses only unlabeled monolingual data.