Paper, Tags: #nlp, #architectures
BERT, by modeling bidirectional contexts, achieves better performance than pretraining approaches based on autoregressive modeling. However, because it relies on corrupting the input with masks, BERT neglects the dependency between masked positions and suffers from a pretrain-finetune discrepancy.
XLNet, a generalized autoregressive pretraining method:
- Enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order.
- Overcomes the limitations of BERT thanks to its autoregressive formulation.
- Integrates ideas from Transformer-XL.
Unsupervised methods first pretrain NNs on large-scale unlabeled text corpora, and then finetune the models on downstream tasks. Autoregressive (AR) language modeling and autoencoding (AE) have been the two most successful pretraining objectives.
- AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model (Peters et al., 2018a), factorizing the likelihood into a forward or backward product (formalized in the sketch after this list). Because AR LMs are only trained to encode a uni-directional context (either forward or backward), they can't model deep bidirectional contexts.
- AE-based pretraining reconstructs the original data from a corrupted input (Devlin et al., 2018). Given an input sequence, a certain portion of tokens is replaced by a special symbol [MASK], and the model is trained to recover the original tokens from the corrupted version. These models can use bidirectional contexts for reconstruction, but artificial symbols like [MASK] are absent from real data, resulting in a pretrain-finetune discrepancy. BERT also assumes that the predicted tokens are independent of each other given the unmasked tokens.
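A rough formalization of the two objectives, in the paper's notation (x is the sequence, x-hat the corrupted sequence, x-bar the set of masked tokens, and m_t = 1 marks a masked position):

```latex
% AR language modeling: forward autoregressive factorization
\[
\max_\theta \; \log p_\theta(\mathbf{x}) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid \mathbf{x}_{<t}\right)
\]

% AE (BERT): reconstruct the masked tokens \bar{x} from the corrupted input \hat{x};
% summing independent per-token terms is exactly the independence assumption
\[
\max_\theta \; \log p_\theta(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \;\approx\; \sum_{t=1}^{T} m_t \,\log p_\theta\!\left(x_t \mid \hat{\mathbf{x}}\right)
\]
```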
XLNet leverages the best of AR language modeling and AE while avoiding their limitations:
- XLNet maximizes the expected log likelihood of a sequence with respect to all possible permutations of the factorization order (see the objective sketch after this list). In expectation, each position learns to utilize contextual information from all positions, capturing bidirectional context.
- XLNet doesn't rely on data corruption, so there is no pretrain-finetune discrepancy.
- XLNet integrates the segment recurrence mechanism and relative positional encoding scheme of Transformer-XL.
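A sketch of the permutation language modeling objective from the paper, where Z_T is the set of all permutations of the index sequence [1, ..., T], and z_t, z_{<t} denote the t-th element and the first t-1 elements of a permutation z:

```latex
% Permutation language modeling objective
\[
\max_\theta \;\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_\theta\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
\]
```

Because the parameters theta are shared across all factorization orders, each token learns, in expectation, to condition on every other token in the sequence; this is how bidirectional context enters an otherwise autoregressive objective.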
AR language modeling performs pretraining by maximizing the likelihood under the forward autoregressive factorization. In comparison, BERT is based on denoising autoencoding: it first constructs a corrupted version of the input by randomly setting a portion of tokens to a special symbol [MASK], and then tries to reconstruct the original input. Pros and cons of the two pretraining techniques (illustrated with a toy sketch after the list):
- Independence assumption
  - BERT factorizes the joint probability based on an independence assumption
  - AR LMs don't have this assumption
- Input noise
  - BERT's input contains artificial symbols like [MASK], creating a pretrain-finetune discrepancy
  - AR LMs don't rely on any input corruption
- Context dependency
  - The AR representation is only conditioned on the tokens to the left
  - The BERT representation has access to the contextual information on both sides
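A toy sketch of the "input noise" and "context dependency" points (hypothetical code for intuition only, not the actual preprocessing of either model):

```python
# Contrast how one sentence becomes training examples under the two objectives.
tokens = ["New", "York", "is", "a", "city"]

# AE / BERT-style: corrupt the input with [MASK] and recover the masked tokens.
# The artificial [MASK] symbol never occurs at finetuning time (input noise),
# and each masked token is predicted independently of the others.
masked_positions = {0, 1}
bert_input = ["[MASK]" if i in masked_positions else t for i, t in enumerate(tokens)]
bert_targets = {i: tokens[i] for i in sorted(masked_positions)}
print(bert_input)      # ['[MASK]', '[MASK]', 'is', 'a', 'city']
print(bert_targets)    # {0: 'New', 1: 'York'}

# AR / left-to-right LM: the input is never corrupted, but every prediction is
# conditioned only on the tokens to its left (no bidirectional context).
for t in range(1, len(tokens)):
    print(tokens[:t], "->", tokens[t])
```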
We propose the permutation language modeling objective that not only retains the benefit of AR models but also allows models to capture bidirectional contexts.
The architecture uses two-stream self-attention for target-aware representations (Section 2.3 of the paper). Since the objective function fits in the AR framework, XLNet incorporates the state-of-the-art AR LM, Transformer-XL, into pretraining.
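A minimal sketch (my own illustrative code, not the paper's implementation) of how a sampled factorization order turns into per-stream attention masks; it assumes every token is a prediction target and ignores the Transformer-XL recurrence:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5
z = rng.permutation(T)           # factorization order: z[t] is the token predicted at step t
rank = np.empty(T, dtype=int)
rank[z] = np.arange(T)           # rank[i] = step at which token i appears in the order

# Content stream h(x_{z<=t}): token i may attend to token j if j comes no later than i
# in the factorization order (so it can see its own content).
content_mask = (rank[None, :] <= rank[:, None]).astype(int)

# Query stream g(x_{z<t}, z_t): token i may attend only to tokens that come strictly
# earlier, so the target position never sees its own content (only its position).
query_mask = (rank[None, :] < rank[:, None]).astype(int)

print("factorization order:", z)
print("content-stream mask:\n", content_mask)
print("query-stream mask:\n", query_mask)
```

The query stream is the one that actually predicts the target, which is why it must see the target's position but not its content; the content stream builds the representations that other positions attend to.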
Both BERT and XLNet perform partial prediction, only predicting a subset of tokens in the sequence. The independence assumption prevents BERT from modeling dependencies between targets, whereas XLNet always learns more dependency pairs given the same targets and thus contains denser effective training signals.
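A concrete instance of the dependency-pair argument, using the [New, York, is, a, city] example from the paper's comparison with BERT: suppose both models pick {New, York} as prediction targets and XLNet samples a factorization order that places New before York. Then:

```latex
% BERT's independence assumption drops the (New -> York) dependency term
\[
\mathcal{J}_{\text{BERT}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city})
\]
% XLNet captures it under this factorization order
\[
\mathcal{J}_{\text{XLNet}} = \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New}, \text{is a city})
\]
```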
AR LMs only cover dependencies on the left context, while XLNet covers both directions in expectation over all factorization orders. Approaches like ELMo concatenate forward and backward models in a shallow manner, which is not sufficient for modeling deep interactions between the two directions.