📌 NLP Clustering Analysis

Text-as-Data NLP Project | R (quanteda), K-means Clustering, Topic Modeling

Analysis of EU Council negotiation speeches using unsupervised learning to identify compromise-oriented language patterns across different political orientations.

📖 Project Summary

This project analyzes negotiation speeches from the Council of the European Union to understand how governments communicate during international policy deliberations. Using TF-IDF weighting and K-means clustering, it identifies linguistic patterns related to negotiation strategies, especially compromise, and evaluates how these patterns differ between Europhile and Euroskeptic governments.

The analysis demonstrates end-to-end text processing, unsupervised learning, and political interpretation using real-world legislative speech data.

🎯 Key Research Questions

How is public opinion reflected—explicitly or implicitly—in EU negotiation speeches?
Do governments with different political orientations (Europhile vs Euroskeptic) use systematically different language?
Can we automatically detect compromise-related rhetoric using clustering methods?
What major topics structure EU Council deliberations?

🛠 Technical Stack

R (4.0+)
tidyverse — data manipulation
quanteda, quanteda.textplots — text analysis & visualization
topicmodels — LDA topic modeling
Machine Learning: TF-IDF, K-means clustering

📦 Dataset

Source

Speeches delivered by national ministers during meetings of the Council of the European Union.

Dataset originates from:

"Government Rhetoric and the Representation of Public Opinion in International Negotiations"
(Wratil et al., 2023)

Content

3,631 negotiation speeches (documents)
Spoken during legislative and policy deliberations
English transcripts generated using automatic speech recognition

Metadata Included

Country (origin government)
Minister identity
Date
Meeting type
Political orientation (Europhile / Euroskeptic)
Topic category
Speech text

🔬 Methodology

1. Preprocessing

Input: 3,631 negotiation documents

Pipeline:

Tokenization — split text into words
Stopword removal — remove common English words (the, and, of)
Punctuation & number removal
No stemming, to preserve policy-specific vocabulary
(e.g., a stemmer would reduce both "directive" and "direction" to the same stem, and the exact form "compromise" matters)
Document-Feature Matrix (DFM) creation
Feature trimming — keep terms appearing in ≥5% of documents

Output: 3,631 documents × 389 features
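
The preprocessing steps above can be sketched with quanteda roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual script: the four toy sentences are invented stand-ins for the speeches, and the object names (`speeches`, `toks`, `dfm_eu`, `dfm_eu_trim`) are hypothetical.

```r
library(quanteda)

# Toy stand-in for the speech corpus (invented sentences, NOT the real data)
speeches <- c(
  d1 = "The presidency supports the compromise proposal on the directive.",
  d2 = "We cannot support the proposal; 27 member states disagree.",
  d3 = "Parliament and the presidency reached a compromise on the text.",
  d4 = "The directive needs 2 further amendments before we agree."
)

# Tokenize, dropping punctuation and numbers
toks <- tokens(corpus(speeches), remove_punct = TRUE, remove_numbers = TRUE)

# Remove common English stopwords; no stemming step is applied
toks <- tokens_remove(toks, stopwords("en"))

# Build the document-feature matrix (dfm() lowercases by default)
dfm_eu <- dfm(toks)

# Keep only terms that appear in at least 5% of documents
dfm_eu_trim <- dfm_trim(dfm_eu, min_docfreq = 0.05, docfreq_type = "prop")

dim(dfm_eu_trim)  # documents x features
```

With only four toy documents the 5% threshold removes nothing; on the full corpus the same trimming step is what yields the reduced feature set reported above.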

2. Exploratory NLP

Word Frequency Analysis

Extracted top 30 most frequent terms
Created word cloud visualization

Key negotiation vocabulary identified:

proposal, compromise, presidency, member, support, directive, parliament
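
A frequency table and word cloud of this kind could be produced along the following lines. The sketch assumes quanteda's built-in `data_corpus_inaugural` as a stand-in corpus, since the EU Council data is not bundled with any package; `dfm_demo` and `top_terms` are hypothetical names, and the `max_words` value is an illustrative choice.

```r
library(quanteda)
library(quanteda.textplots)

# Stand-in corpus shipped with quanteda (NOT the EU Council speeches)
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfm_demo <- dfm(toks)

# Top 30 most frequent terms, returned as a named vector sorted by count
top_terms <- topfeatures(dfm_demo, n = 30)
head(top_terms)

# Word cloud of the most frequent terms; the seed fixes the layout
set.seed(1)
textplot_wordcloud(dfm_demo, max_words = 100)
```

Wrapping the plotting call in `png("results/figures/wordcloud.png")` and `dev.off()` would be one way to write the figure to disk.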

3. Unsupervised Clustering (K-means, K=25)

Method:

Applied TF-IDF weighting to emphasize distinctive words
Ran K-means clustering (25 clusters, 50 random starts)
Identified clusters based on top terms
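
The clustering method above might look like the following sketch. Several things here are assumptions rather than the project's confirmed code: the stand-in corpus is quanteda's `data_corpus_inaugural`, `k` is set to 5 because that corpus has only a few dozen documents (the analysis above uses 25), and "top terms" are taken to be the highest-weighted features of each cluster centroid, which is one common reading of that step.

```r
library(quanteda)

# Stand-in corpus (NOT the EU Council speeches)
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfm_demo <- dfm_trim(dfm(toks), min_docfreq = 0.05, docfreq_type = "prop")

# TF-IDF weighting: down-weights terms that occur in many documents
dfm_tfidf_demo <- dfm_tfidf(dfm_demo)

# K-means with multiple random starts; the best of the 50 runs is kept
set.seed(123)
k <- 5  # assumption for the small stand-in corpus; the README uses 25
km <- kmeans(as.matrix(dfm_tfidf_demo), centers = k, nstart = 50)

# Top 5 terms per cluster, read off the centroid weights (assumed approach)
top_terms <- apply(km$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:5]
})
t(top_terms)       # one row per cluster
table(km$cluster)  # cluster sizes
```

Using many random starts matters because K-means only finds a local optimum; `nstart = 50` keeps the solution with the lowest total within-cluster sum of squares.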

Findings:

4 clusters strongly associated with "compromise" language:
- Frequent terms: compromise, agreement, text, proposal, directive

Political Analysis:

Merged cluster assignments with government metadata
Calculated Euroskeptic proportion per cluster

Category	Euroskeptic %
Compromise clusters	10.3%
Overall dataset	11.7%

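→ Speeches in the compromise clusters come from Euroskeptic governments slightly less often than speeches overall (10.3% vs. 11.7%, a gap of 1.4 percentage points), so compromise-heavy speech is slightly more common among Europhile governments.
(The direction is consistent with political science theory, though the difference is small.)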
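
The merge and per-cluster calculation can be illustrated as below. All values are invented for the example and do not reproduce the figures in the table; the data frames, column names (`doc_id`, `cluster`, `euroskeptic`) and the choice of which clusters count as "compromise" clusters are hypothetical.

```r
library(dplyr)

# Hypothetical cluster assignments (e.g. taken from a kmeans() result)
clusters <- data.frame(
  doc_id  = paste0("speech_", 1:10),
  cluster = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 1)
)

# Hypothetical government metadata (illustrative values only)
metadata <- data.frame(
  doc_id      = paste0("speech_", 1:10),
  euroskeptic = c(FALSE, TRUE, FALSE, FALSE, FALSE,
                  FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Merge cluster assignments with metadata
speech_meta <- left_join(clusters, metadata, by = "doc_id")

# Share of speeches from Euroskeptic governments within each cluster
by_cluster <- speech_meta %>%
  group_by(cluster) %>%
  summarise(n_speeches = n(), pct_euroskeptic = 100 * mean(euroskeptic))

# Compare the clusters flagged as compromise-related with the full dataset
compromise_clusters <- c(2, 3)  # hypothetical selection
summary_tbl <- data.frame(
  category = c("Compromise clusters", "Overall dataset"),
  pct_euroskeptic = c(
    100 * mean(speech_meta$euroskeptic[speech_meta$cluster %in% compromise_clusters]),
    100 * mean(speech_meta$euroskeptic)
  )
)

by_cluster
summary_tbl
```

Comparing the pooled compromise clusters against the overall share, rather than cluster by cluster, mirrors the two-row table above.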

4. Topic Modeling (LDA, K=10)

LDA was applied to a separate corpus, US State of the Union speeches (SOTU_WithText.csv), rather than to the EU Council speeches, to show that the pipeline generalizes.

Extracted interpretable themes:

Foreign policy & war
Governance and institutions
National identity
Programs, economy, energy

This suggests that the text-as-data pipeline carries over to other political corpora.
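
A ten-topic LDA fit of this kind could be sketched as follows. The example assumes quanteda's built-in `data_corpus_inaugural` in place of the State of the Union file, which is not distributed with any package; the seed, the number of terms shown, and the object names are illustrative choices.

```r
library(quanteda)
library(topicmodels)

# Stand-in corpus (NOT the State of the Union data used in the project)
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfm_demo <- dfm_trim(dfm(toks), min_docfreq = 0.05, docfreq_type = "prop")

# topicmodels expects a DocumentTermMatrix of raw counts, so convert the dfm
dtm <- convert(dfm_demo, to = "topicmodels")

# Fit LDA with K = 10 topics; the seed makes the fit reproducible
lda_fit <- LDA(dtm, k = 10, control = list(seed = 123))

# Most probable terms per topic, used to label themes by hand
terms(lda_fit, 8)

# Most likely topic for each document
head(topics(lda_fit))
```

Topic labels such as those listed above are assigned by reading the top terms; the model itself returns only unnamed topics.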

📊 Key Insights

✅ Public opinion is rarely referenced explicitly, but negotiation behavior patterns track political constraints implicitly.

✅ Europhile governments exhibit slightly more compromise-oriented rhetoric than Euroskeptic governments (Euroskeptic share of 10.3% in compromise clusters vs. 11.7% overall).

✅ EU negotiation discourse clusters around themes like:

Legislation & regulatory directives
Coordination with Parliament
Intergovernmental agreement-building

✅ Topic models fitted to the State of the Union corpus recover broad, recurring political themes.

🚀 Reproduction Instructions

Prerequisites

Install required R packages:

install.packages(c("tidyverse", "quanteda", "quanteda.textplots", "topicmodels"))

Step-by-Step

Clone this repository:

git clone https://github.com/HOYALIM/nlp-clustering-analysis.git
cd nlp-clustering-analysis

Add data files to data/raw/:
- corpus_final.RData (EU Council speeches)
- SOTU_WithText.csv (for LDA demo, optional)

Run analysis scripts in order:

# 1. Preprocessing
source("src/01_preprocessing.R")

# 2. Word cloud generation
source("src/02_wordcloud.R")

# 3. K-means clustering & political analysis
source("src/03_clustering_kmeans.R")

# 4. LDA topic modeling (optional)
source("src/04_topic_modeling_LDA.R")

View results:
- Figures: results/figures/
- Tables: results/tables/

📁 Project Structure

nlp-clustering-analysis/
│
├── data/
│   ├── raw/                    # Original datasets
│   │   ├── corpus_final.RData
│   │   └── SOTU_WithText.csv
│   └── processed/              # Intermediate outputs
│       ├── dfm_eu_trim.rds
│       ├── dfm_eu_tfidf.rds
│       └── kmeans_results.rds
│
├── src/                        # Analysis scripts (run in order)
│   ├── 01_preprocessing.R
│   ├── 02_wordcloud.R
│   ├── 03_clustering_kmeans.R
│   └── 04_topic_modeling_LDA.R
│
├── results/
│   ├── figures/                # Visualizations
│   │   └── wordcloud.png
│   └── tables/                 # Summary statistics
│       ├── top_features.csv
│       ├── cluster_top_terms.csv
│       └── cluster_eurosceptic_stats.csv
│
├── docs/                       # Documentation
│   └── methodology_notes.md
│
└── README.md                   # This file

💡 What This Project Demonstrates

✨ Ability to build a full NLP pipeline from raw text → structured insights

✨ Proficiency with quanteda, unsupervised ML, and political text interpretation

✨ Experience analyzing real-world institutional speech data

✨ Strong capability to connect statistical modeling with domain-specific theory

✨ Clear communication of complex quantitative findings

📚 References

Wratil, C., Wäckerle, J., & Proksch, S.-O. (2023). Government Rhetoric and the Representation of Public Opinion in International Negotiations. American Political Science Review.

📄 License

MIT License — feel free to use this project for educational or research purposes.

👤 Author

Ho Lim
Data Science Student | Text Analytics & Machine Learning
GitHub: @HOYALIM

🙏 Acknowledgments

Dataset provided by Wratil et al. (2023)

⭐ If you find this project useful, please consider giving it a star!
