📌 NLP Clustering Analysis

Text-as-Data NLP Project | R (quanteda), K-means Clustering, Topic Modeling

Analysis of EU Council negotiation speeches using unsupervised learning to identify compromise-oriented language patterns across different political orientations.

R License: MIT


📖 Project Summary

This project analyzes negotiation speeches from the Council of the European Union to understand how governments communicate during international policy deliberations. Using modern NLP methods, the project identifies linguistic patterns related to negotiation strategies, especially compromise, and evaluates how these patterns differ between Europhile and Euroskeptic governments.

The analysis demonstrates end-to-end text processing, unsupervised learning, and political interpretation using real-world legislative speech data.


🎯 Key Research Questions

  1. How is public opinion reflectedβ€”explicitly or implicitlyβ€”in EU negotiation speeches?
  2. Do governments with different political orientations (Europhile vs Euroskeptic) use systematically different language?
  3. Can we automatically detect compromise-related rhetoric using clustering methods?
  4. What major topics structure EU Council deliberations?

🛠 Technical Stack

  • R (4.0+)
  • tidyverse: data manipulation
  • quanteda, quanteda.textplots: text analysis & visualization
  • topicmodels: LDA topic modeling
  • Machine Learning: TF-IDF, K-means clustering

📦 Dataset

Source

Speeches delivered by national ministers during meetings of the Council of the European Union.

Dataset originates from:

"Government Rhetoric and the Representation of Public Opinion in International Negotiations"
(Wratil et al., 2023)

Content

  • 3,631 negotiation speeches (documents)
  • Spoken during legislative and policy deliberations
  • English transcripts generated using automatic speech recognition

Metadata Included

  • Country (origin government)
  • Minister identity
  • Date
  • Meeting type
  • Political orientation (Europhile / Euroskeptic)
  • Topic category
  • Speech text

🔬 Methodology

1. Preprocessing

Input: 3,631 negotiation documents

Pipeline:

  1. Tokenization: split text into words
  2. Stopword removal: remove common English words (the, and, of)
  3. Punctuation & number removal
  4. No stemming: preserve policy-specific vocabulary
    (e.g., "directive" ≠ "direction"; the exact form of "compromise" matters)
  5. Document-Feature Matrix (DFM) creation
  6. Feature trimming: keep terms appearing in ≥5% of documents

Output: 3,631 documents × 389 features
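
The pipeline above can be sketched with quanteda roughly as follows. This is a minimal illustration, not the contents of src/01_preprocessing.R: the four toy speeches and all object names are assumptions, and the trimming threshold is raised to 50% so that trimming is visible on such a small corpus (the project itself reports a 5% threshold).

```r
library(quanteda)

# Hypothetical stand-in for the EU Council speeches (not the real data)
speeches <- c(
  d1 = "We support the presidency compromise on the directive.",
  d2 = "The proposal needs 3 further changes before we can support it.",
  d3 = "Parliament and the presidency reached a compromise on the proposal.",
  d4 = "Our delegation cannot support this directive in 2023."
)

# Steps 1 and 3: tokenize, dropping punctuation and numbers
toks <- tokens(speeches, remove_punct = TRUE, remove_numbers = TRUE)

# Step 2: remove common English stopwords
toks <- tokens_remove(toks, stopwords("en"))

# Step 4: no stemming, so tokens keep their exact surface form

# Step 5: build the document-feature matrix
dfm_eu <- dfm(toks)

# Step 6: keep only terms that appear in a minimum share of documents
min_share <- 0.5  # assumption for the toy corpus; the project reports 0.05
dfm_eu_trim <- dfm_trim(dfm_eu, min_docfreq = min_share, docfreq_type = "prop")

print(dim(dfm_eu_trim))
print(featnames(dfm_eu_trim))
```

Using `docfreq_type = "prop"` makes the threshold a proportion of documents rather than an absolute count, which matches how the trimming rule is described above.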


2. Exploratory NLP

Word Frequency Analysis

  • Extracted top 30 most frequent terms
  • Created word cloud visualization

Key negotiation vocabulary identified:

  • proposal, compromise, presidency, member, support, directive, parliament
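
A minimal sketch of the frequency step, assuming a small hypothetical corpus in place of the real speeches. `topfeatures()` ranks terms by total count; the word cloud call is guarded so it only runs in an interactive session with quanteda.textplots installed. The project reports the top 30 terms; the toy example asks for 5.

```r
library(quanteda)

# Hypothetical toy texts (not the real data)
texts <- c(
  d1 = "The presidency compromise is a good compromise.",
  d2 = "We support the compromise proposal.",
  d3 = "Parliament must support the directive."
)

toks <- tokens(texts, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
dfm_toy <- dfm(toks)

# Most frequent terms, sorted by total count across documents
n_top <- 5  # the project reports the top 30
top_terms <- topfeatures(dfm_toy, n = n_top)
print(top_terms)

# Optional word cloud; skipped in non-interactive runs
if (interactive() && requireNamespace("quanteda.textplots", quietly = TRUE)) {
  quanteda.textplots::textplot_wordcloud(dfm_toy, min_count = 1, max_words = 30)
}
```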

Figure: Word Cloud (results/figures/wordcloud.png)


3. Unsupervised Clustering (K-means, K=25)

Method:

  1. Applied TF-IDF weighting to emphasize distinctive words
  2. Ran K-means clustering (25 clusters, 50 random starts)
  3. Identified clusters based on top terms
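
The three steps above could look roughly like this. It is a sketch on hypothetical toy texts with 2 clusters instead of 25; the object names are assumptions, and the way top terms are read off the cluster centers is one plausible approach, not necessarily the one used in src/03_clustering_kmeans.R.

```r
library(quanteda)

set.seed(42)  # K-means uses random starts, so fix the seed

# Hypothetical toy texts with two obvious themes (not the real data)
texts <- c(
  a1 = "compromise agreement text proposal",
  a2 = "compromise agreement directive",
  a3 = "compromise proposal directive text",
  b1 = "fisheries quota fishing fleet",
  b2 = "fisheries quota fleet",
  b3 = "fishing quota fisheries"
)

dfm_toy <- dfm(tokens(texts))

# Step 1: TF-IDF weighting down-weights terms that occur in many documents
dfm_tfidf_toy <- dfm_tfidf(dfm_toy)

# Step 2: K-means on the dense matrix (the project uses centers = 25)
km <- kmeans(as.matrix(dfm_tfidf_toy), centers = 2, nstart = 50)

# Step 3: describe each cluster by the terms with the largest center weights
top_terms <- lapply(seq_len(nrow(km$centers)), function(i) {
  names(sort(km$centers[i, ], decreasing = TRUE))[1:3]
})

print(km$cluster)
print(top_terms)
```

`nstart = 50` reruns the algorithm from 50 random initializations and keeps the solution with the lowest within-cluster sum of squares, which reduces sensitivity to a poor starting point.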

Findings:

  • 4 clusters strongly associated with "compromise" language:
    • Frequent terms: compromise, agreement, text, proposal, directive

Political Analysis:

  • Merged cluster assignments with government metadata
  • Calculated Euroskeptic proportion per cluster

| Category | Euroskeptic % |
|---|---|
| Compromise clusters | 10.3% |
| Overall dataset | 11.7% |

→ Compromise-heavy speech is slightly more common among Europhile governments: the Euroskeptic share is 10.3% in the compromise clusters versus 11.7% overall
(Consistent with political science theory)
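
A minimal base R sketch of the comparison behind these figures. The data frames, column names (`doc_id`, `cluster`, `euroskeptic`), and the choice of which clusters count as "compromise" clusters are all hypothetical; the numbers it produces are illustrative and unrelated to the reported 10.3% and 11.7%.

```r
# Hypothetical cluster assignments (one row per speech)
assignments <- data.frame(
  doc_id  = paste0("d", 1:8),
  cluster = c(1, 1, 1, 1, 2, 2, 3, 3)
)

# Hypothetical government metadata for the same speeches
metadata <- data.frame(
  doc_id      = paste0("d", 1:8),
  euroskeptic = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Merge cluster assignments with metadata
merged <- merge(assignments, metadata, by = "doc_id")

# Euroskeptic proportion per cluster (mean of a logical is a proportion)
share_by_cluster <- aggregate(euroskeptic ~ cluster, data = merged, FUN = mean)

# Compare the clusters flagged as compromise-related with the full dataset
compromise_clusters <- c(1, 3)  # assumption for illustration
share_compromise <- mean(merged$euroskeptic[merged$cluster %in% compromise_clusters])
share_overall <- mean(merged$euroskeptic)

print(share_by_cluster)
print(c(compromise = share_compromise, overall = share_overall))
```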


4. Topic Modeling (LDA, K=10)

Applied to US State of the Union speeches to demonstrate generalizability of the pipeline.

Extracted interpretable themes:

  • Foreign policy & war
  • Governance and institutions
  • National identity
  • Programs, economy, energy

This suggests that the text-as-data pipeline transfers to other political corpora.
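
A minimal sketch of fitting LDA with topicmodels on a quanteda DFM. The toy texts and object names are assumptions, and the example uses 2 topics rather than the 10 reported above; it does not reproduce src/04_topic_modeling_LDA.R.

```r
library(quanteda)
library(topicmodels)

# Hypothetical toy texts (not the State of the Union data)
texts <- c(
  t1 = "war peace treaty nations army war",
  t2 = "army treaty war nations defense",
  t3 = "peace nations treaty defense army",
  t4 = "economy energy jobs programs budget economy",
  t5 = "budget jobs economy programs energy",
  t6 = "energy programs budget jobs economy"
)

dfm_toy <- dfm(tokens(texts))

# topicmodels expects a DocumentTermMatrix, so convert the DFM first
dtm <- convert(dfm_toy, to = "topicmodels")

# Fit LDA; the project uses k = 10, and the seed makes the fit reproducible
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# Top terms per topic and the most likely topic for each document
top_terms_lda <- terms(lda_fit, 3)
doc_topics <- topics(lda_fit)

print(top_terms_lda)
print(doc_topics)
```

Topic labels such as "Foreign policy & war" are assigned by reading the top terms; the model itself only returns numbered topics.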


📊 Key Insights

✅ Public opinion is rarely referenced explicitly, but negotiation behavior patterns track political constraints implicitly.

✅ Europhile governments exhibit slightly more compromise-oriented rhetoric than Euroskeptic governments (Euroskeptic share of 10.3% in compromise clusters vs. 11.7% overall).

✅ EU negotiation discourse clusters around themes like:

  • Legislation & regulatory directives
  • Coordination with Parliament
  • Intergovernmental agreement-building

✅ Topic models recover broad, recurring political themes (demonstrated on the State of the Union corpus).


🚀 Reproduction Instructions

Prerequisites

Install required R packages:

install.packages(c("tidyverse", "quanteda", "quanteda.textplots", "topicmodels"))

Step-by-Step

  1. Clone this repository:

    git clone https://github.com/HOYALIM/nlp-clustering-analysis.git
    cd nlp-clustering-analysis
  2. Add data files to data/raw/:

    • corpus_final.RData (EU Council speeches)
    • SOTU_WithText.csv (for LDA demo, optional)
  3. Run analysis scripts in order:

    # 1. Preprocessing
    source("src/01_preprocessing.R")
    
    # 2. Word cloud generation
    source("src/02_wordcloud.R")
    
    # 3. K-means clustering & political analysis
    source("src/03_clustering_kmeans.R")
    
    # 4. LDA topic modeling (optional)
    source("src/04_topic_modeling_LDA.R")
  4. View results:

    • Figures: results/figures/
    • Tables: results/tables/

📁 Project Structure

nlp-clustering-analysis/
│
├── data/
│   ├── raw/                    # Original datasets
│   │   ├── corpus_final.RData
│   │   └── SOTU_WithText.csv
│   └── processed/              # Intermediate outputs
│       ├── dfm_eu_trim.rds
│       ├── dfm_eu_tfidf.rds
│       └── kmeans_results.rds
│
├── src/                        # Analysis scripts (run in order)
│   ├── 01_preprocessing.R
│   ├── 02_wordcloud.R
│   ├── 03_clustering_kmeans.R
│   └── 04_topic_modeling_LDA.R
│
├── results/
│   ├── figures/                # Visualizations
│   │   └── wordcloud.png
│   └── tables/                 # Summary statistics
│       ├── top_features.csv
│       ├── cluster_top_terms.csv
│       └── cluster_eurosceptic_stats.csv
│
├── docs/                       # Documentation
│   └── methodology_notes.md
│
└── README.md                   # This file

💡 What This Project Demonstrates

✨ Ability to build a full NLP pipeline from raw text → structured insights

✨ Proficiency with quanteda, unsupervised ML, and political text interpretation

✨ Experience analyzing real-world institutional speech data

✨ Strong capability to connect statistical modeling with domain-specific theory

✨ Clear communication of complex quantitative findings


📚 References

Wratil, C., Bailer, S., Gessler, T., Glavaš, G., & Groll, T. (2023). Government Rhetoric and the Representation of Public Opinion in International Negotiations. Working Paper.


📄 License

MIT License: feel free to use this project for educational or research purposes.


👤 Author

Ho Lim
Data Science Student | Text Analytics & Machine Learning
GitHub: @HOYALIM


🙏 Acknowledgments

  • Dataset provided by Wratil et al. (2023)

⭐ If you find this project useful, please consider giving it a star!
