
Sentiment analysis is not solved! Assessing and probing sentiment classification, Barnes et al., 2019

Paper, Dataset, Tags: #nlp, #datasets

It is not clear what outstanding conceptual challenges remain for sentiment analysis. In this work, we attempt to discover which challenges still pose a problem for English sentiment classifiers and to provide a challenging dataset.

We collect the subset of sentences that an ensemble of state-of-the-art sentiment classifiers misclassify and then annotate them for 18 linguistic and paralinguistic phenomena, such as negation, sarcasm, modality, etc.

Finally, we provide a case study that demonstrates the usefulness of the dataset for probing the performance of a given sentiment classifier with respect to linguistic phenomena.

Previous work

Over the last 15 years, approaches based on sentiment lexicons and n-gram features have been replaced by models that can exploit compositionality or implicitly learn relations between tokens. These neural models push the state of the art to over 90% accuracy on binary sentence-level sentiment analysis.

However, these quantitative improvements have not been accompanied by a thorough analysis of the corresponding qualitative differences: we know the newer models score better, but not what they have actually learned to handle. As a result, the outstanding conceptual challenges in sentiment analysis remain unclear.

We train three state-of-the-art classifiers (BERT, ELMo, and a BiLSTM) plus a BoW classifier on 6 sentence-level sentiment datasets. We collect the subset of sentences that all models misclassify and annotate them for 18 linguistic and paralinguistic phenomena.

In sentiment analysis (SA), negation has historically been a major problem; proposed solutions include determining negation scope and modelling semantic composition. Verbal polarity shifters and the linguistic phenomenon of modality have also been studied.

Experimental setup

We choose to focus on sentence-level classification for 3 reasons: sentence-level classification is a popular and useful task, there's a large amount of high-quality annotated data available, and annotation of linguistic phenomena is easier at sentence-level than document-level.

1. Datasets

We extract the incorrectly predicted sentences from these datasets:

  • MPQA: the Multi-Perspective Question Answering opinion corpus provides contextual polarity annotations for English news documents from the world press.
  • OpeNER: the Open Polarity Enhanced Named Entity Recognition sentiment datasets contain hotel reviews annotated for 4-class sentiment (strong positive, positive, negative, strong negative).
  • SemEval: a tweet classification dataset containing tweets collected and annotated for 3-class sentiment.
  • Stanford Sentiment Treebank (SST): 11.8k English sentences from movie reviews, annotated for sentiment at each node of a constituency parse tree.
  • Täckström dataset: 3.6k sentences of product reviews annotated at both document and sentence level for 3-class sentiment.
  • Thelwall dataset: 6.3k microblog sentences annotated for both positive and negative sentiment on a scale from 1 to 5.

2. Models

  • BERT
  • ELMo
  • BiLSTM
  • BoW classifier
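The notes do not describe the BoW classifier in detail; as a rough illustration only, a minimal sketch of such a baseline, assuming a TF-IDF bag-of-words representation and a linear classifier (the training data below is made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical sentence-level training data with 3-class labels.
train_texts = ["great hotel , lovely staff", "the room was fine", "worst stay ever"]
train_labels = ["positive", "neutral", "negative"]

# Bag-of-words baseline: unigram + bigram TF-IDF features fed to a linear classifier.
bow_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
bow_clf.fit(train_texts, train_labels)

print(bow_clf.predict(["the staff were not helpful"]))
```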

3. Model performance

ELMo and BERT are the best-performing models, with an average difference of 11.8 points between the pretrained language models and the standard models (BoW and BiLSTM). Error rates range from 8.3 on OpeNER to 20.5 on SST, indicating that the datasets differ in difficulty due to domain and annotation characteristics.

Challenging dataset

We collect the subset of test sentences that all of the sentiment systems predicted incorrectly, which yields a dataset of 836 sentences with similar numbers of positive, neutral, and negative labels and fewer strong labels.
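A minimal sketch of how such a challenge subset can be assembled, assuming index-aligned gold labels and per-model predictions (variable names and data are illustrative, not from the authors' code):

```python
# Gold labels and per-model predictions for the same test sentences, index-aligned.
sentences = ["not bad at all", "yeah, great, another delayed flight"]
gold = ["positive", "negative"]
predictions = {  # one prediction list per system
    "bert":   ["positive", "positive"],
    "elmo":   ["positive", "positive"],
    "bilstm": ["negative", "positive"],
    "bow":    ["negative", "positive"],
}

# Keep only the sentences that *every* model gets wrong.
challenge = [
    (sent, label)
    for i, (sent, label) in enumerate(zip(sentences, gold))
    if all(preds[i] != label for preds in predictions.values())
]
print(f"{len(challenge)} sentence(s) misclassified by all models")  # -> 1
```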

Dataset analysis

We annotate these instances using 19 linguistic and paralinguistic labels. We also annotate the polarity of the sentence manually irrespective of the gold label in order to locate possible annotation errors during our analysis.

Mixed polarity: the largest set of errors, with 185 sentences labelled, are 'mixed polarity' sentences, where two different polarities are expressed either towards two separate entities or towards the same entity. The first case can be handled with a fine-grained approach (aspect-level or targeted sentiment), but the second is more difficult. Nearly a third of these errors contain "but" clauses, which could be correctly classified by splitting them.
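As a rough illustration of the "but"-clause observation, a simple heuristic (not from the paper) keeps only the clause after the contrastive conjunction, which usually carries the dominant polarity:

```python
import re

def dominant_clause(sentence: str) -> str:
    """Heuristic: for 'X but Y' sentences, the clause after 'but'
    tends to determine the overall sentiment."""
    parts = re.split(r"\bbut\b", sentence, maxsplit=1, flags=re.IGNORECASE)
    return parts[-1].strip()

print(dominant_clause("The food was great, but the service was painfully slow."))
# -> "the service was painfully slow."
```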

Non-standard spelling: most errors in this category are labeled either negative or positive, with almost no strong sentiment, because the noisier datasets don't contain the strong labels. Around 30% of examples contain hashtags that clearly express the sentiment of the whole sentence. This indicates the need to properly deal with hashtags in order to correctly classify sentiment.
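A small preprocessing sketch for sentiment-bearing hashtags (a common approach, not necessarily what the paper's systems do): strip the '#' and split camel-cased hashtags into words so a word-based classifier can see the sentiment tokens:

```python
import re

def expand_hashtags(text: str) -> str:
    """Turn '#WorstAirlineEver' into 'worst airline ever' so the sentiment
    words become visible to a word-based classifier."""
    def split_tag(match: re.Match) -> str:
        tag = match.group(1)
        words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", tag)
        return " ".join(w.lower() for w in words) if words else tag
    return re.sub(r"#(\w+)", split_tag, text)

print(expand_hashtags("Thanks for nothing #WorstAirlineEver"))
# -> "Thanks for nothing worst airline ever"
```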

Idioms: incorporating idiomatic information from external data sources may be necessary to improve the classification of sentences within this category.

Strong labels: strong negative sentiment is often expressed in an understated or ironic manner, and such examples often contain sarcasm or non-standard spelling, while strong positive examples often feature difficult vocabulary and morphologically creative uses of language.

Negation: there's no specific negator that's more difficult to resolve regarding its effect on sentiment classification.

World knowledge: in 81 sentences, world knowledge was necessary to correctly classify the sentence. This category is highly correlated with sarcasm and irony.

Amplified: amplifiers occur mainly in the strong examples.

Comparative: comparative sentiment, with 68 errors, is known to be difficult, as it's necessary to determine which entity is on which side of the inequality.
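To make the difficulty concrete, here is a naive, purely illustrative pattern that tries to recover which entity sits on which side of the comparison; it only covers the simplest surface form, which is exactly why comparative sentiment is hard:

```python
import re

# Naive pattern: "<entity A> is/are <comparative> than <entity B>".
COMPARATIVE = re.compile(
    r"(?P<a>\w+(?:\s\w+)?)\s+(?:is|are|was|were)\s+"
    r"(?P<cmp>\w+er|better|worse)\s+than\s+(?P<b>\w+(?:\s\w+)?)",
    re.IGNORECASE,
)

m = COMPARATIVE.search("The sequel is worse than the original film")
if m:
    # 'a' is the entity being judged, 'b' the entity it is compared against.
    print(m.group("a"), "|", m.group("cmp"), "|", m.group("b"))
```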

Sarcasm/irony: these 58 errors are mainly present in the negative and strong negative examples of the dataset.

Shifters, emoji, modality, morphology, and reducers are covered in section 5 of the paper.

Case study: training with phrase-level annotations

We evaluate a model that has access to more compositional information: we fine-tune the best-performing model, BERT, on SST and test on our dataset. The model trained with SST's phrase-level annotations achieves 55.1 accuracy, vs. 53 for the model trained only on sentence-level annotations.
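SST distributes its annotations as PTB-style trees in which every node carries a 0-4 sentiment label; a sketch of how phrase-level training examples could be extracted from such a tree (the parser below is illustrative, not the authors' code):

```python
import re

def sst_phrases(tree: str):
    """Extract (phrase, label) pairs from an SST-style tree, e.g.
    '(3 (2 not) (4 (2 very) (4 good)))', where each node carries a 0-4 label."""
    tokens = re.findall(r"\(|\)|[^\s()]+", tree)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = int(tokens[pos])
        pos += 1
        words, pairs = [], []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child_words, child_pairs = parse()
                words.extend(child_words)
                pairs.extend(child_pairs)
            else:
                words.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return words, pairs + [(" ".join(words), label)]

    _, pairs = parse()
    return pairs

# Every labelled node, including sub-phrases like 'very good', becomes a training example.
print(sst_phrases("(3 (2 not) (4 (2 very) (4 good)))"))
```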

Using the error annotations in the challenge dataset, we find that results improve greatly on sentences with the labels negation, world knowledge, amplified, emoji, and reduced, while the model performs worse on irony and shifters, and equally on morphology.
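A sketch of the kind of probing the error annotations enable, assuming each challenge example carries a gold label, a model prediction, and a set of phenomenon tags (the records below are made up for illustration):

```python
from collections import defaultdict

# Hypothetical challenge-set records: gold label, model prediction, phenomenon tags.
examples = [
    {"gold": "negative", "pred": "negative", "tags": {"negation"}},
    {"gold": "positive", "pred": "negative", "tags": {"sarcasm/irony", "world knowledge"}},
    {"gold": "negative", "pred": "positive", "tags": {"negation", "modality"}},
]

# Per-phenomenon accuracy: how well the probed model does on sentences
# carrying each linguistic/paralinguistic label.
correct = defaultdict(int)
total = defaultdict(int)
for ex in examples:
    for tag in ex["tags"]:
        total[tag] += 1
        correct[tag] += int(ex["pred"] == ex["gold"])

for tag in sorted(total):
    print(f"{tag:20s} {correct[tag] / total[tag]:.2f} ({total[tag]} examples)")
```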