Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference., McCoy et al., 2019
Paper, Tags: #nlp, #linguistics, #datasets
Machine learning systems can score well on a test set by relying on heuristics that work for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another.
We hypothesize that statistical NLI models might adopt 3 fallible syntactic heuristics:
- lexical overlap heuristic
- subsequence heuristic
- constituent heuristic
To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail.
Models trained on MNLI, including BERT, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there's substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress.
Neural networks often adopt shallow heuristics that succeed for the majority of training examples, rather than learning the underlying generalizations they are intended to capture.
The current paper addresses this issue in the domain of NLI, the task of determining whether a premise sentence entails (implies the truth of) a hypothesis sentence. A model might, for example, assign the label contradiction to any input containing the word "not", since "not" often appears in the contradiction examples of NLI training sets.
The focus of our work is on heuristics based on superficial syntactic properties. For instance, an NLI system might assume that the premise entails any hypothesis whose words all appear in the premise, which is not necessarily the case.
We evaluate four popular NLI models, including BERT, on the HANS dataset. On the examples where the heuristics fail, the models barely exceeded 0% accuracy in most cases. HANS-style examples can be added to models' training sets to reduce their reliance on the heuristics. The dataset and code are available in the paper's GitHub repository.
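A copy of HANS also appears to be mirrored on the Hugging Face hub; the dataset identifier and field names in the sketch below are assumptions about that mirror rather than details from the paper.

```python
# Minimal sketch: load HANS from the Hugging Face hub and inspect one example.
# Assumptions: the dataset id "hans" and the fields premise / hypothesis /
# label / heuristic exist on that mirror (not specified in the paper itself).
from datasets import load_dataset

hans = load_dataset("hans", split="validation")
example = hans[0]
print(example["premise"], "||", example["hypothesis"])
print("label:", example["label"], "| heuristic:", example["heuristic"])
```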
HANS targets three heuristics, each illustrated here by an example where it fails ("-/>" marks a hypothesis that is not entailed); a toy sketch of the lexical overlap heuristic follows the list:
- Lexical overlap
  - Assume that a premise entails all hypotheses constructed from words in the premise
  - The doctor was paid by the actor -/> The doctor paid the actor
- Subsequence
  - Assume that a premise entails all of its contiguous subsequences
  - The doctor near the actor danced -/> The actor danced
- Constituent
  - Assume that a premise entails all complete subtrees in its parse tree
  - If the artist slept, the actor ran -/> The artist slept
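To make this concrete, here is a toy sketch (not from the paper) of a classifier that implements the lexical overlap heuristic directly; it is right for the wrong reasons on many MNLI pairs, but wrong on HANS-style examples.

```python
# Toy sketch of the lexical overlap heuristic as a trivial baseline classifier:
# predict "entailment" whenever every hypothesis word also appears in the premise.
# (Illustrative only; not code from the paper.)

def lexical_overlap_heuristic(premise: str, hypothesis: str) -> str:
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# Fails on a HANS-style lexical overlap example:
print(lexical_overlap_heuristic(
    "The doctor was paid by the actor",
    "The doctor paid the actor",
))  # -> "entailment", but the correct label is non-entailment
```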
There are 2 reasons why we expect these heuristics to be adopted by a statistical learner trained on standard NLI training datasets such as SNLI or MNLI. First, the MNLI training dataset contains far more examples supporting the heuristics than examples contradicting them.
The second reason is that the input representations used by these models make them susceptible to these heuristics. The lexical overlap heuristic disregards the order of the words in the sentence and considers only their identity, so it is likely to be adopted by bag-of-words NLI models. The constituent heuristic appeals to components of the parse tree, so it can be adopted by tree-based NLI models.
For each heuristic, we generate five templates for examples supporting the heuristic and five templates for examples that contradict it. Example:
The N1 P the N2 V -/> The N2 V (e.g., The lawyer by the actor ran -/> The actor ran)
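As a rough illustration (not the paper's actual generation code or word lists), a template like this can be instantiated as follows:

```python
# Hypothetical sketch of template instantiation in the spirit of HANS;
# the word lists and the template chosen here are illustrative, not the paper's.
import random

NOUNS = ["lawyer", "actor", "doctor", "artist"]
PREPS = ["by", "near", "behind"]
VERBS = ["ran", "danced", "slept"]

def instantiate_subsequence_template(n=3, seed=0):
    """Fill the template 'The N1 P the N2 V -/> The N2 V' (non-entailment)."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        n1, n2 = rng.sample(NOUNS, 2)
        p, v = rng.choice(PREPS), rng.choice(VERBS)
        premise = f"The {n1} {p} the {n2} {v}"
        hypothesis = f"The {n2} {v}"
        examples.append((premise, hypothesis, "non-entailment"))
    return examples

for premise, hypothesis, label in instantiate_subsequence_template():
    print(f"{premise}  -/>  {hypothesis}   [{label}]")
```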
We generated 1k examples from each template, for a total of 10k examples per heuristic. Dataset controls:
- Plausibility: we ensure that all generated sentences are plausible (a sentence like "the book read the student" is excluded)
- Selectional criteria: some example types depend on the availability of lexically-specific verb frames.
We evaluated four models. The first three exemplify popular strategies for representing the input sentence; BERT is included as a state-of-the-art reference:
- DA (Decomposable Attention): bag-of-words representation
- ESIM (Enhanced Sequential Inference Model): sequential structure
- SPINN (Stack-augmented Parser-Interpreter Neural Network): syntactic parse tree
- BERT: state-of-the-art Transformer model for MNLI
We trained all models on MNLI, which uses three labels: entailment, contradiction, and neutral. We labelled our dataset with only entailment and non-entailment, because the distinction between contradiction and neutral was often unclear for our cases.
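When scoring on HANS, a model's contradiction and neutral predictions are both counted as non-entailment. A minimal sketch of that collapsing step (assuming predictions are already available as label strings; the helper names are mine):

```python
# Minimal sketch of the label collapsing used when scoring a three-way MNLI
# classifier on the two-way HANS labels. (Illustrative helper, not the paper's code.)

MNLI_TO_HANS = {
    "entailment": "entailment",
    "contradiction": "non-entailment",
    "neutral": "non-entailment",
}

def hans_accuracy(predictions, gold_labels):
    """predictions: MNLI labels from the model; gold_labels: HANS labels."""
    collapsed = [MNLI_TO_HANS[p] for p in predictions]
    correct = sum(c == g for c, g in zip(collapsed, gold_labels))
    return correct / len(gold_labels)

# Example: a model that always predicts "entailment" gets every
# non-entailment HANS case wrong.
print(hans_accuracy(["entailment", "entailment"],
                    ["entailment", "non-entailment"]))  # 0.5
```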
All models almost always assigned the correct label in the cases where the label is entailment, i.e., where the correct answer is in line with the hypothesized heuristics. But they all performed poorly, with accuracies below 10% in most cases, on the examples where the heuristics make incorrect predictions.
DA and ESIM had near-zero performance across all three heuristics, suggesting that these models treat them as a single phenomenon. SPINN had the best performance on the subsequence cases, which might be due to the tree-based nature of its input. However, SPINN did worse than the other models on constituent cases where the correct answer is entailment.
BERT did slightly worse than SPINN on the subsequence cases, but performed noticeably better than all other models on both the constituent and lexical overlap cases (though still far below chance). BERT's relative success might be due to a greater tendency to incorporate word order information compared to the other models.
BERT achieved 39% accuracy on conjunction, but 0% on subject/object swap. BERT achieved 49% accuracy at determining that a clause embedded under if and other conditional words isn't entailed, but 0% at identifying that the clause outside of the conditional clause is also not entailed.
Though the heuristics form a hierarchy (the constituent heuristic is a special case of the subsequence heuristic, which is a special case of the lexical overlap heuristic), this hierarchy doesn't necessarily predict the models' performance: BERT, for example, performed worse on the subsequence heuristic than on the constituent one.
Do the heuristics arise from the architecture or the training set?
The fact that SPINN did better than ESIM and DA on the constituent and subsequence cases suggests that MNLI contains some signal counteracting the syntactic heuristics tested by HANS. SPINN's structural inductive biases allow it to leverage this signal, while the other models' biases don't.
Other evidence, however, suggests that the main cause is insufficient signal in the MNLI training set rather than the models' architectures.
Is the dataset simply too difficult? We recruited about 100 participants, who achieved 76% accuracy on the non-expert cases and 97% on the expert cases, far above the models. Annotators found HANS to be harder than MNLI.
We conclude that despite the impressive accuracies of state-of-the-art models on standard evaluations, there is still much progress to be made, and that targeted, challenging datasets such as HANS are important for determining whether models are learning what they are intended to learn.