Text-as-Data NLP Project | R (quanteda), K-means Clustering, Topic Modeling
Analysis of EU Council negotiation speeches using unsupervised learning to identify compromise-oriented language patterns across different political orientations.
This project analyzes negotiation speeches from the Council of the European Union to understand how governments communicate during international policy deliberations. Using text-as-data NLP methods (TF-IDF weighting, K-means clustering, and LDA topic modeling), the project identifies linguistic patterns related to negotiation strategies, especially compromise, and evaluates how these patterns differ between Europhile and Euroskeptic governments.
The analysis demonstrates end-to-end text processing, unsupervised learning, and political interpretation using real-world legislative speech data.
- How is public opinion reflected, explicitly or implicitly, in EU negotiation speeches?
- Do governments with different political orientations (Europhile vs Euroskeptic) use systematically different language?
- Can we automatically detect compromise-related rhetoric using clustering methods?
- What major topics structure EU Council deliberations?
- R (4.0+)
- tidyverse: data manipulation
- quanteda, quanteda.textplots: text analysis & visualization
- topicmodels: LDA topic modeling
- Machine Learning: TF-IDF, K-means clustering
Speeches delivered by national ministers during meetings of the Council of the European Union.
Dataset originates from:
"Government Rhetoric and the Representation of Public Opinion in International Negotiations"
(Wratil et al., 2023)
- 3,631 negotiation speeches (documents)
- Spoken during legislative and policy deliberations
- English transcripts generated using automatic speech recognition
- Country (origin government)
- Minister identity
- Date
- Meeting type
- Political orientation (Europhile / Euroskeptic)
- Topic category
- Speech text
Input: 3,631 negotiation documents
Pipeline:
- Tokenization: split text into words
- Stopword removal: remove common English words (the, and, of)
- Punctuation & number removal
- No stemming: preserve policy-specific vocabulary (e.g., "directive" ≠ "direction"; the exact form of "compromise" matters)
- Document-Feature Matrix (DFM) creation
- Feature trimming: keep terms appearing in ≥5% of documents
Output: 3,631 documents × 389 features
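The pipeline above can be sketched with quanteda roughly as follows. This is a minimal illustration rather than the contents of `src/01_preprocessing.R`: the toy speeches and the object names (`speeches`, `toks`, `dfm_trimmed`) are assumptions made for the example.

```r
library(quanteda)

# Toy speeches standing in for the real corpus (hypothetical text).
speeches <- c(
  s1 = "The presidency proposal is a good compromise, and we support it.",
  s2 = "We cannot support the directive in 2023 without a compromise text.",
  s3 = "Parliament asked for 3 changes to the proposal."
)

# Tokenize, dropping punctuation and numbers; no stemming is applied.
toks <- tokens(speeches, remove_punct = TRUE, remove_numbers = TRUE)

# Remove common English stopwords (matching is case-insensitive by default).
toks <- tokens_remove(toks, stopwords("en"))

# Build the document-feature matrix; dfm() lowercases features by default.
dfm_all <- dfm(toks)

# Keep only terms that appear in at least 5% of documents.
dfm_trimmed <- dfm_trim(dfm_all, min_docfreq = 0.05, docfreq_type = "prop")

dim(dfm_trimmed)  # documents x features
```

With only three toy documents every surviving term clears the 5% threshold, so the trimming step matters only on a corpus of realistic size.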
- Extracted top 30 most frequent terms
- Created word cloud visualization
Key negotiation vocabulary identified:
proposal, compromise, presidency, member, support, directive, parliament
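As a rough sketch of this step, `topfeatures()` returns the most frequent terms of a DFM and `textplot_wordcloud()` draws the cloud. The toy matrix below is hypothetical, and the plotting arguments are illustrative choices, not the settings used in `src/02_wordcloud.R`.

```r
library(quanteda)

# Toy document-feature matrix (hypothetical documents).
toy_dfm <- dfm(tokens(c(
  d1 = "proposal compromise presidency proposal",
  d2 = "support directive proposal",
  d3 = "parliament compromise member"
)))

# Most frequent terms, sorted by count (the project reports the top 30).
top_terms <- topfeatures(toy_dfm, n = 10)
print(top_terms)

# Draw the word cloud only in an interactive session with the package installed.
if (interactive() && requireNamespace("quanteda.textplots", quietly = TRUE)) {
  quanteda.textplots::textplot_wordcloud(toy_dfm, min_count = 1, max_words = 30)
}
```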
Method:
- Applied TF-IDF weighting to emphasize distinctive words
- Ran K-means clustering (25 clusters, 50 random starts)
- Identified clusters based on top terms
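A minimal sketch of the method, assuming a small toy DFM in place of the trimmed EU matrix. It uses quanteda's `dfm_tfidf()` and base R's `kmeans()`; the six documents and the choice of 3 clusters are illustrative only (the project reports 25 clusters and 50 random starts).

```r
library(quanteda)

set.seed(42)  # k-means uses random starts, so fix the seed for reproducibility

# Toy documents forming three obvious groups (hypothetical text).
toy_dfm <- dfm(tokens(c(
  d1 = "compromise agreement text compromise",
  d2 = "compromise proposal agreement",
  d3 = "fisheries quota fishing",
  d4 = "fishing quota stocks",
  d5 = "budget deficit spending",
  d6 = "budget spending rules"
)))

# Weight counts by TF-IDF so that distinctive terms count for more.
toy_tfidf <- dfm_tfidf(toy_dfm)

# Cluster the weighted documents; nstart controls the number of random starts.
km <- kmeans(as.matrix(toy_tfidf), centers = 3, nstart = 50)

# Top terms per cluster: the largest entries of each cluster centroid.
top_terms <- apply(km$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:3]
})
print(t(top_terms))   # one row per cluster
print(table(km$cluster))
```

Reading clusters off their centroid terms is one common way to label them; whether the project ranks terms by centroid weight or by raw frequency within the cluster is not stated here.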
Findings:
- 4 clusters strongly associated with "compromise" language:
  - Frequent terms: compromise, agreement, text, proposal, directive
Political Analysis:
- Merged cluster assignments with government metadata
- Calculated Euroskeptic proportion per cluster
| Category | Euroskeptic % |
|---|---|
| Compromise clusters | 10.3% |
| Overall dataset | 11.7% |
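→ Euroskeptic governments are slightly under-represented in the compromise clusters (10.3% vs. 11.7% overall, a gap of 1.4 percentage points), so compromise-heavy speech is slightly more common among Europhile governments.
(Consistent with political science theory)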
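The comparison in the table can be sketched in base R as below. The data frames, column names, and cluster IDs are hypothetical, and pooling all speeches in the compromise clusters is an assumption; the project may instead use tidyverse verbs or average the per-cluster shares.

```r
# Toy cluster assignments and metadata (hypothetical values, not project data).
clusters <- data.frame(
  doc_id  = paste0("s", 1:8),
  cluster = c(1, 1, 1, 2, 2, 3, 3, 3)
)
metadata <- data.frame(
  doc_id      = paste0("s", 1:8),
  euroskeptic = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Merge cluster assignments with government metadata.
merged <- merge(clusters, metadata, by = "doc_id")

# Share of Euroskeptic speeches within each cluster, in percent.
by_cluster <- aggregate(euroskeptic ~ cluster, data = merged,
                        FUN = function(x) 100 * mean(x))
print(by_cluster)

# Pooled share in the clusters flagged as compromise-related (here: 1 and 2),
# compared with the share in the whole dataset.
compromise_ids   <- c(1, 2)
compromise_share <- 100 * mean(merged$euroskeptic[merged$cluster %in% compromise_ids])
overall_share    <- 100 * mean(merged$euroskeptic)
```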
LDA topic modeling was also applied to US State of the Union speeches to demonstrate the generalizability of the pipeline.
Extracted interpretable themes:
- Foreign policy & war
- Governance and institutions
- National identity
- Programs, economy, energy
This suggests that the text-as-data pipeline carries over to other political corpora.
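A minimal sketch of fitting LDA on a quanteda DFM with the topicmodels package. The toy documents, the number of topics (`k = 2`), and the seed are assumptions for illustration; the settings in `src/04_topic_modeling_LDA.R` may differ.

```r
library(quanteda)
library(topicmodels)

# Toy documents with two loose themes (hypothetical text).
toy_dfm <- dfm(tokens(c(
  d1 = "war peace troops allies war",
  d2 = "peace treaty allies troops",
  d3 = "economy jobs energy taxes",
  d4 = "energy jobs economy programs"
)))

# Convert the quanteda DFM to the format topicmodels expects, then fit LDA.
dtm <- convert(toy_dfm, to = "topicmodels")
lda <- LDA(dtm, k = 2, control = list(seed = 123))

# Most probable terms for each topic (one column per topic).
top_topic_terms <- terms(lda, 5)
print(top_topic_terms)
```

On a corpus this small the topics are not meaningful; the point is only the shape of the calls.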
- Public opinion is rarely referenced explicitly, but negotiation behavior patterns track political constraints implicitly.
- Europhile governments exhibit slightly more compromise-oriented rhetoric than Euroskeptic governments (Euroskeptic share of 10.3% in compromise clusters vs. 11.7% overall).
- EU negotiation discourse clusters around themes like:
  - Legislation & regulatory directives
  - Coordination with Parliament
  - Intergovernmental agreement-building
- Topic models validate the presence of broad, recurring legislative themes.
- Install required R packages:

      install.packages(c("tidyverse", "quanteda", "quanteda.textplots", "topicmodels"))

- Clone this repository:

      git clone https://github.com/HOYALIM/nlp-clustering-analysis.git
      cd nlp-clustering-analysis

- Add data files to data/raw/:
  - corpus_final.RData (EU Council speeches)
  - SOTU_WithText.csv (for LDA demo, optional)

- Run analysis scripts in order:

      # 1. Preprocessing
      source("src/01_preprocessing.R")
      # 2. Word cloud generation
      source("src/02_wordcloud.R")
      # 3. K-means clustering & political analysis
      source("src/03_clustering_kmeans.R")
      # 4. LDA topic modeling (optional)
      source("src/04_topic_modeling_LDA.R")

- View results:
  - Figures: results/figures/
  - Tables: results/tables/
    nlp-clustering-analysis/
    │
    ├── data/
    │   ├── raw/                    # Original datasets
    │   │   ├── corpus_final.RData
    │   │   └── SOTU_WithText.csv
    │   └── processed/              # Intermediate outputs
    │       ├── dfm_eu_trim.rds
    │       ├── dfm_eu_tfidf.rds
    │       └── kmeans_results.rds
    │
    ├── src/                        # Analysis scripts (run in order)
    │   ├── 01_preprocessing.R
    │   ├── 02_wordcloud.R
    │   ├── 03_clustering_kmeans.R
    │   └── 04_topic_modeling_LDA.R
    │
    ├── results/
    │   ├── figures/                # Visualizations
    │   │   └── wordcloud.png
    │   └── tables/                 # Summary statistics
    │       ├── top_features.csv
    │       ├── cluster_top_terms.csv
    │       └── cluster_eurosceptic_stats.csv
    │
    ├── docs/                       # Documentation
    │   └── methodology_notes.md
    │
    └── README.md                   # This file
✨ Ability to build a full NLP pipeline from raw text to structured insights
✨ Proficiency with quanteda, unsupervised ML, and political text interpretation
✨ Experience analyzing real-world institutional speech data
✨ Strong capability to connect statistical modeling with domain-specific theory
✨ Clear communication of complex quantitative findings
Wratil, C., Bailer, S., Gessler, T., Glavaš, G., & Groll, T. (2023). Government Rhetoric and the Representation of Public Opinion in International Negotiations. Working Paper.
MIT License. Feel free to use this project for educational or research purposes.
Ho Lim
Data Science Student | Text Analytics & Machine Learning
GitHub: @HOYALIM
- Dataset provided by Wratil et al. (2023)
⭐ If you find this project useful, please consider giving it a star!
