Paper, Tags: #nlp
We study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer LM, and BERT) with a suite of 17 diverse probing tasks.
We quantify differences in the transferability of individual layers within contextualizers.
We compare LM pretraining with 11 supervised pretraining tasks. With the pretraining dataset held fixed, LM pretraining transfers best on average, but pretraining on a task closely related to the target task yields better performance on that task. LM pretraining on more data gives the best results overall.
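As a concrete illustration of this probing setup, here is a minimal sketch (not the paper's code): freeze a pretrained contextualizer, extract token representations, and train only a linear classifier on top for a token-level probing task such as POS tagging. It assumes a Hugging Face `bert-base-cased` checkpoint and toy tag ids purely for illustration.

```python
# Minimal linear-probe sketch: frozen contextualizer + trainable linear layer.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")
encoder.eval()                       # the contextualizer stays frozen
for p in encoder.parameters():
    p.requires_grad = False

# Toy probing data: pre-tokenized sentences with one (hypothetical) tag id per word.
sentences = [["The", "cat", "sat"], ["Dogs", "bark", "loudly"]]
tags      = [[0, 1, 2], [1, 2, 3]]
num_tags  = 4

probe = nn.Linear(encoder.config.hidden_size, num_tags)   # the only trained parameters
optim = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):
    for words, labels in zip(sentences, tags):
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**enc).last_hidden_state[0]   # (seq_len, hidden)
        # Keep the first subword of each word to align with word-level tags.
        word_ids = enc.word_ids()
        keep = [i for i, w in enumerate(word_ids)
                if w is not None and w != word_ids[i - 1]]
        feats = hidden[keep]                               # (num_words, hidden)
        logits = probe(feats)
        loss = loss_fn(logits, torch.tensor(labels))
        optim.zero_grad()
        loss.backward()
        optim.step()
```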
This paper asks and answers:
- What features of language do these CWRs (contextual word representations) capture, and what do they miss?
- How and why does transferability vary across representation layers in contextualizers?
- How does the choice of pretraining task affect the vectors' learned linguistic knowledge and transferability?
Our analysis reveals the following insights:
- Linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge.
- The first-layer output of LSTM contextualizers is consistently the most transferable, whereas for transformers it is the middle layers (see the layer-wise probing sketch after this list).
- Higher layers in LSTMs are more task-specific (and less general), while transformer layers don't exhibit the same monotonic trend.
- LM pretraining yields representations that are more transferable in general than those from 11 other candidate pretraining tasks, although pretraining on a closely related task yields the strongest results for individual end tasks.
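To make the layer-wise comparison concrete, here is a hypothetical sketch (again not the authors' code) that extracts every layer's frozen output from a BERT checkpoint; one would then train and score a separate linear probe, as in the earlier sketch, on each layer's features and compare accuracies across layers.

```python
# Hypothetical layer-wise extraction: get frozen features from every layer.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
encoder.eval()

enc = tokenizer("The cat sat on the mat .", return_tensors="pt")
with torch.no_grad():
    # hidden_states: tuple of (num_layers + 1) tensors, one per layer
    # (index 0 is the embedding layer), each of shape (batch, seq_len, hidden).
    hidden_states = encoder(**enc).hidden_states

for layer, h in enumerate(hidden_states):
    feats = h[0]          # (seq_len, hidden) frozen features for this layer
    # Train/evaluate a linear probe on `feats` as in the sketch above and
    # record its accuracy; comparing accuracies across layers is the
    # layer-wise transferability analysis summarized in the list above.
    print(layer, feats.shape)
```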