Paper, Tags: #nlp, #text-summarization
Current metrics for assessing summarization algorithms don't account for factual consistency with source documents. We propose a weakly supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary.
Training data is generated by applying a series of rule-based transformations to the sentences of the source documents. The model is trained on three tasks:
- Identify whether sentences remain factually consistent after transformation
- Extract a span in the source documents to support the consistency prediction
- Extract a span in the summary sentence that is inconsistent, if one exists
Human evaluation shows that the auxiliary span extraction tasks help annotators verify factual consistency.
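As a rough sketch of how the three objectives might share one encoder, the PyTorch snippet below attaches a classification head and two linear span heads to BERT. The class name, head layout, and output format here are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative three-task head layout on a shared BERT encoder (assumed
# design; the paper's exact modules and loss weighting may differ).
import torch.nn as nn
from transformers import BertModel

class ConsistencySketch(nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Task 1: consistent vs. inconsistent, predicted from the [CLS] token.
        self.label_head = nn.Linear(hidden, 2)
        # Task 2: start/end logits for the supporting span in the source.
        self.source_span_head = nn.Linear(hidden, 2)
        # Task 3: start/end logits for the inconsistent span in the claim.
        self.claim_span_head = nn.Linear(hidden, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        tokens = out.last_hidden_state                 # per-token vectors
        label_logits = self.label_head(tokens[:, 0])   # [CLS] position
        src_start, src_end = self.source_span_head(tokens).unbind(dim=-1)
        clm_start, clm_end = self.claim_span_head(tokens).unbind(dim=-1)
        return label_logits, (src_start, src_end), (clm_start, clm_end)
```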
Common approaches to summarization:
- Extractive: the model directly copies the salient parts of the source document
- Abstractive: the important parts are paraphrased to form novel sentences
- Hybrid: a combination of the two
State-of-the-art solutions use self-attentive Transformer blocks, attention and copying mechanisms, and multi-objective training strategies, including reinforcement learning (RL).
We address the problem of verifying factual consistency between source documents and generated summaries. Recent studies show that up to 30% of summaries generated by abstractive models contain factual inconsistencies.
We propose a BERT-based model for verifying factual consistency, and we add specialized modules that explain which portions of both documents are pertinent to the model's decision.
Checking factual consistency at a sentence-sentence level, where each sentence of the summary is verified against each sentence of the source document, is insufficient. Errors made by summarization models are most often related to the use of incorrect entity names, numbers, and pronouns; other errors, such as negations and common-sense mistakes, occur less often.
Taking these insights into account, we propose and test a document-sentence approach for factual consistency checking, where each sentence of the summary is verified against the entire body of the source document.
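As a small illustration of that setup, the helper below pairs each summary sentence with the full source document; the use of NLTK's sentence splitter is an assumed choice for the sketch.

```python
# Document-sentence pairing: each summary sentence is checked against the
# entire source document (the sentence splitter here is an assumed choice).
from nltk.tokenize import sent_tokenize

def make_consistency_pairs(source_document: str, summary: str):
    """Yield one (document, claim) input pair per summary sentence."""
    for claim in sent_tokenize(summary):
        yield source_document, claim
```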
There are currently no supervised training datasets for factual consistency checking. Creating a large-scale, high-quality dataset with strong supervision collected from human annotators is prohibitively expensive and time-consuming, so alternative approaches to acquiring training data are necessary.
We propose using an artificial, weakly supervised dataset for the task at hand. We introduce the following transformations (a toy sketch of two of them follows the list):
- Paraphrasing
- Entity and number swapping
- Pronoun swapping
- Sentence negation
- Noise injection
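As a toy illustration of how such examples could be generated, the snippet below implements crude versions of pronoun and number swapping; the paper's actual pipeline relies on NER and more careful matching, so these regex-based stand-ins are assumptions.

```python
# Toy versions of two negative-example transformations; treat these
# regex-based swaps as illustrative stand-ins, not the paper's pipeline.
import random
import re

PRONOUN_GROUPS = [["he", "she"], ["him", "her"], ["his", "hers"]]

def swap_pronoun(sentence: str) -> str:
    """Replace the first matching pronoun with another from its group."""
    for group in PRONOUN_GROUPS:
        for pronoun in group:
            if re.search(rf"\b{pronoun}\b", sentence):
                alternative = random.choice([p for p in group if p != pronoun])
                return re.sub(rf"\b{pronoun}\b", alternative, sentence, count=1)
    return sentence

def swap_number(sentence: str) -> str:
    """Shift the first number in the sentence to make the claim inconsistent."""
    match = re.search(r"\d+", sentence)
    if match is None:
        return sentence
    new_value = str(int(match.group()) + random.randint(1, 9))
    return sentence[:match.start()] + new_value + sentence[match.end():]
```

In this scheme, paraphrasing preserves the consistent label, while swaps and negation produce inconsistent examples.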
We use the uncased, base BERT architecture for a two-way classification task (consistent vs. inconsistent), with a single-layer classifier on top of the [CLS] token. We refer to this model as the factual consistency checking model, FactCC.
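A minimal sketch of this setup with the Hugging Face Transformers library is shown below. The label mapping and the example texts are assumptions, and the snippet loads a fresh, untrained classification head rather than the trained FactCC checkpoint.

```python
# Sketch of the FactCC classification setup: uncased base BERT with a
# 2-way classification head over the [CLS] token.
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # consistent vs. inconsistent

document = "The quarterly report shows revenue grew 12 percent."
claim = "Revenue fell 12 percent."  # a factually inconsistent claim

# Document and claim are packed into one input: [CLS] doc [SEP] claim [SEP]
inputs = tokenizer(document, claim, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = logits.argmax(dim=-1).item()
```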
We use news articles from CNN/DailyMail as source documents; the resulting dataset has 1M training examples with balanced classes.
Despite being trained in a (document, sentence) setting, our model transfers well to the (sentence, sentence) setting and outperforms all other NLI models, including BERT fine-tuned on the MNLI dataset.
Most of the errors made by our model relate to common-sense mistakes made by the summarization models. Such errors are easy for humans to spot but hard to define as a set of transformations that would allow them to be added to the training data.
In addition, certain types of errors stemming from dependencies between different sentences within the summary, such as temporal inconsistencies or incorrect coreference, are not handled by the document-sentence setting used in this work.