Given a partial description like 'she opened the hood of the car', humans can reason about the situation and anticipate what might come next: 'then, she examined the engine'. We introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning.
SWAG is a new dataset with 113k multiple-choice questions about a rich spectrum of grounded situations. To address the challenges of annotation artifacts and human biases present in many datasets, we propose adversarial filtering (AF), which constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers and using them to filter the data. This filtering methodology allows for cost-effective construction of a large-scale dataset while substantially reducing known annotation artifacts.
To account for AF's aggressive filtering, we use state-of-the-art LMs to massively oversample a diverse set of potential counterfactuals. While humans can solve the resulting inference problems with 88% accuracy, various competitive models struggle on our task (<60%).
In this task, we focus on whether a multiple-choice ending describes a possible (future) world that can be anticipated from the situation described in the premise, even when it's not strictly entailed. Many such inferences require a rich understanding of everyday physical situations, including object affordances and frame semantics.
A natural first step toward grounded commonsense inference is creating a large-scale dataset, but recent work has shown that human-written datasets are susceptible to annotation artifacts: unintended stylistic patterns that give away clues to the gold labels.
In this paper, we use AF to construct SWAG, a dataset of 113k multiple-choice questions. We start with pairs of temporally adjacent video captions, each with a context and a follow-up event that we know is physically possible. We then use a state-of-the-art LM fine-tuned on this data to massively oversample a diverse set of possible negative sentence endings. Next, we filter these candidate endings aggressively and adversarially with a committee of trained models to obtain a population of de-biased endings with stylistic features similar to the real ones. Finally, these filtered counterfactuals are validated by crowd workers.
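To make the filtering loop concrete, here is a minimal, self-contained sketch of an AF-style iteration, using a bag-of-words logistic regression as a stand-in for the stylistic ensemble; the data layout and helper names are illustrative, not the authors' code.

```python
# Sketch of adversarial filtering: repeatedly split the data, train a
# discriminator on one half, and on the other half swap out "easy" negatives
# for harder ones drawn from the oversampled pool of LM-generated endings.
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def adversarial_filter(examples, n_iters=10, n_keep=5):
    # Each example: {"gold": str, "candidates": [str], "pool": [str]}
    for _ in range(n_iters):
        random.shuffle(examples)
        split = len(examples) // 2
        train, heldout = examples[:split], examples[split:]

        # Train a discriminator: gold endings are positives, generated ones negatives.
        texts = [ex["gold"] for ex in train] + [c for ex in train for c in ex["candidates"]]
        labels = [1] * len(train) + [0] * sum(len(ex["candidates"]) for ex in train)
        vec = CountVectorizer().fit(texts)
        clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)

        # On the held-out half, keep only the endings the discriminator finds
        # hardest to distinguish from real ones; return the rest to the pool.
        for ex in heldout:
            pool = ex["candidates"] + ex["pool"]
            scores = clf.predict_proba(vec.transform(pool))[:, 1]
            ranked = [c for _, c in sorted(zip(scores.tolist(), pool), reverse=True)]
            ex["candidates"], ex["pool"] = ranked[:n_keep], ranked[n_keep:]
    return examples
```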
We frame the task as predicting which event is most likely to occur next in a video. A context c = (s, n) consists of a complete sentence s and a noun phrase n that begins a second sentence, along with a list of possible verb-phrase endings V = {v1, ..., v4}. The model must select the most appropriate verb phrase.
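For concreteness, a toy instance in this format might look as follows (the wording is invented for illustration, not drawn from the dataset):

```python
# One SWAG-style multiple-choice item: the context c = (s, n) plus four
# candidate verb-phrase endings, exactly one of which is the found ending.
example = {
    "s": "She opened the hood of the car.",       # complete first sentence s
    "n": "Then, she",                              # noun phrase starting the second sentence
    "endings": [
        "examined the engine.",                    # gold (found) ending
        "gave a speech about highway safety.",
        "folded the car neatly and walked away.",
        "was awarded a medal by the engine.",
    ],
    "label": 0,
}
```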
A dataset being adversarial for a model f means that f won't generalize, even when evaluated on test data from the same distribution; concretely, a dataset is adversarial if we expect high empirical error over all leave-one-out train/test splits.
This process can be thought of as increasing the overall entropy of the dataset: given a strong model f0 that's compatible with a random subset of the data, we aim to ensure it can't generalize to the held-out set.
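One way to read this criterion operationally: average the error of f when it is trained on all but one example and tested on the held-out one. A rough sketch follows, where train and predict are placeholders for fitting and evaluating the model family, not concrete APIs.

```python
# Leave-one-out empirical error: high values mean the dataset is adversarial
# for the model family, i.e. fitting one split does not transfer to the rest.
def leave_one_out_error(dataset, train, predict):
    errors = 0
    for i, held_out in enumerate(dataset):
        model = train(dataset[:i] + dataset[i + 1:])          # fit on everything else
        errors += int(predict(model, held_out) != held_out["label"])
    return errors / len(dataset)
```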
To generate counterfactuals for SWAG, we use an LSTM LM conditioned on context from the video captions. We first train on BookCorpus, then fine-tune on the video-caption dataset. We use the LM to sample 1024 unique endings for a partial caption. We sample the endings rather than using beam-search decoding, since beam search biases the generated endings to be of lower perplexity. The generated endings are marked as 'gibberish' by humans only 9% of the time.
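A rough sketch of this oversampling step is below; the lm interface (init_state/step) and the function names are assumptions for illustration, not the authors' implementation.

```python
# Sample many unique endings from a conditional LM by drawing tokens from the
# softmax distribution (rather than beam search), which keeps the endings diverse.
import torch

def sample_endings(lm, context_ids, eos_id, n_samples=1024, max_len=25):
    endings = set()
    while len(endings) < n_samples:
        tokens, state = [], lm.init_state(context_ids)        # condition on the caption context
        for _ in range(max_len):
            prev = tokens[-1] if tokens else context_ids[-1]
            logits, state = lm.step(prev, state)
            next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
            if next_id == eos_id:
                break
            tokens.append(next_id)
        endings.add(tuple(tokens))                            # keep only unique endings
    return [list(e) for e in endings]
```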
Our final classifier is an ensemble of 4 stylistic models:
- A multilayer perceptron (MLP)
- A BoW model that averages the word embeddings of the second sentence as features
- A one-layer CNN
- A biLSTM over the most common words in the second sentence.
We concatenate their final representations and pass through an MLP. On every adversarial iteration, the ensemble is trained jointly to minimize cross-entropy.
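As a rough PyTorch sketch of how such an ensemble can be wired together (dimensions, feature choices, and pooling are placeholders rather than the paper's exact configuration):

```python
# Four small encoders of the candidate ending, concatenated and scored by a
# final MLP; per-ending scores can then be softmaxed over the candidate set
# and trained jointly with cross-entropy, as described above.
import torch
import torch.nn as nn

class StylisticEnsemble(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hid=100, n_feats=4):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.feat_mlp = nn.Sequential(nn.Linear(n_feats, hid), nn.ReLU())   # MLP over shallow features
        self.cnn = nn.Conv1d(emb_dim, hid, kernel_size=3, padding=1)        # one-layer CNN
        self.lstm = nn.LSTM(emb_dim, hid // 2, bidirectional=True, batch_first=True)
        self.out = nn.Sequential(nn.Linear(emb_dim + 3 * hid, hid), nn.ReLU(), nn.Linear(hid, 1))

    def forward(self, ending_ids, feats):
        e = self.emb(ending_ids)                        # (batch, seq, emb_dim)
        bow = e.mean(dim=1)                             # BoW: average word embeddings
        cnn = self.cnn(e.transpose(1, 2)).max(dim=2).values
        lstm = self.lstm(e)[0][:, -1, :]                # biLSTM, final time step
        mlp = self.feat_mlp(feats)
        return self.out(torch.cat([bow, cnn, lstm, mlp], dim=-1))  # real-vs-generated score
```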
Over the adversarial iterations, performance drops from 60% to close to random chance. Confusing the perplexity-based MLP alone isn't enough to lower the performance of the ensemble; only once the other models are added does the ensemble accuracy drop substantially, suggesting that our approach is effective at reducing stylistic artifacts.
Workers on Amazon Mechanical Turk were given the caption context and 6 candidate endings: 1 found ending and 5 adversarially sampled endings. Turkers labeled each ending independently as likely, unlikely, or gibberish, and selected the best and second-best endings.
The best model that only uses the ending is the LSTM with ELMo embeddings, which obtains 43.6%, and is improved with context: by 3.1% when given the initial noun phrase, and by an additional 4% when also given the first sentence.
Further improvement is gained from models that compute pairwise representations of the inputs. Our results suggest that humans, even untrained, converge to a consensus.
SWAG requires a unique type of temporal reasoning. Our use of videos results in wide coverage of dynamic and temporal situations, and our dataset suffers from few lexical biases.
SWAG is a challenging testbed for NLI models, but because the adversarial models are purely stylistic and focus only on the second sentence, subtle artifacts are still likely to remain in the dataset.