This repository focuses on topic modeling techniques that leverage BERT-based keyword extraction. We explore three main approaches:
- Domain Adaptation – Applying keyword extraction in a specialized domain (e.g., agriculture).
- Multilingual Extension – Handling documents in the Greek language.
- NER-based Preprocessing – Using Named Entity Recognition to filter key entities before extracting keywords.

Repository layout:
- domain_adaptation/ covers the agriculture-domain adaptation approach.
- multilingual/ includes all code for multilingual (Greek) modeling.
- ner_preprocessing/ implements NER-based entity filtering.
- utils/ has utility scripts for logging, helper functions, etc.

- Install Dependencies

    pip install spacy nltk scikit-learn requests
    pip install torch sentence-transformers keybert thefuzz
    python -m spacy download el_core_news_sm

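
After installing, it can help to confirm that every package is importable before running a pipeline. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the list uses the import names of the packages above (for example, scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the packages installed above. Note that some differ from
# the pip names: scikit-learn -> sklearn, sentence-transformers -> sentence_transformers.
REQUIRED = [
    "spacy", "nltk", "sklearn", "requests",
    "torch", "sentence_transformers", "keybert", "thefuzz",
]


def missing_packages(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

This only checks that the modules can be located; it does not verify that the Greek spaCy model `el_core_news_sm` was downloaded.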
- Running via main.py

We provide a single entry point in main.py that accepts a parameter specifying which approach to run:

    python main.py --approach domain

Runs the domain adaptation pipeline.

    python main.py --approach multilingual

Runs the multilingual (Greek) pipeline.

    python main.py --approach ner

Runs the NER-based preprocessing pipeline.

Inside main.py, these commands map to the corresponding scripts in their respective folders.

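
The mapping from the `--approach` value to a folder could look roughly like the sketch below. It is an illustration under stated assumptions, not the actual contents of main.py: the approach names and folder names come from this README, while `APPROACH_FOLDERS`, `build_parser`, and `resolve_folder` are hypothetical names, and how each folder's script is actually invoked is left out.

```python
import argparse
from typing import Sequence

# Approach names and folders are taken from this README; the way main.py
# really dispatches to them is an assumption.
APPROACH_FOLDERS = {
    "domain": "domain_adaptation",
    "multilingual": "multilingual",
    "ner": "ner_preprocessing",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that only accepts the three documented approaches."""
    parser = argparse.ArgumentParser(description="Run one keyword-extraction pipeline.")
    parser.add_argument(
        "--approach",
        required=True,
        choices=sorted(APPROACH_FOLDERS),
        help="Which pipeline to run: domain, multilingual, or ner.",
    )
    return parser


def resolve_folder(argv: Sequence[str]) -> str:
    """Parse the arguments and return the folder whose pipeline should run."""
    args = build_parser().parse_args(argv)
    return APPROACH_FOLDERS[args.approach]


print(resolve_folder(["--approach", "domain"]))  # domain_adaptation
```

Restricting `--approach` with `choices` means an unsupported value makes argparse exit with a usage error instead of failing later inside a pipeline.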
During each run, the code may generate:
- Logs: Training and validation logs for model performance tracking.
- Metrics: Precision, Recall, F1 scores for keyword extraction.
- Comparison: We compare the final results (baseline vs. extended approaches) in our final report.
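
For reference, the Precision, Recall, and F1 scores listed above can be computed by comparing extracted keywords against a gold list. The sketch below assumes exact matching after lowercasing and trimming whitespace; the repository may use a different matching rule (for example, fuzzy matching), and the function name `keyword_prf` and the sample keywords are hypothetical.

```python
def keyword_prf(predicted, gold):
    """Return (precision, recall, f1) for exact-match keyword sets.

    Assumption: keywords are compared case-insensitively after stripping
    surrounding whitespace, and duplicates count once.
    """
    pred = {k.strip().lower() for k in predicted}
    ref = {k.strip().lower() for k in gold}
    true_positives = len(pred & ref)

    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only, not results from this project.
predicted = ["soil moisture", "Irrigation", "crop yield", "tractor"]
gold = ["irrigation", "crop yield", "fertilizer"]
p, r, f1 = keyword_prf(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```

Here two of the four predicted keywords appear in the gold list of three, giving a precision of 2/4, a recall of 2/3, and an F1 of 4/7.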