This repository contains the implementation and data for quantifying the concept of "Seisyun" (青春), a term widely used in Japan to describe youth, through a novel measure called Seisyun Information Entropy. The method combines advanced natural language processing techniques and large language models (LLMs) to establish a quantitative framework for analyzing and understanding this abstract concept.
The study introduces a mathematical and computational framework to measure the "youthfulness" of a text by considering three key factors:
- Unusualness: the probability of word predictions from a pre-trained BERT model, used to quantify how unexpected the text is.
- Positivity: sentiment analysis, used to evaluate how positive the text is.
- Fluency: an evaluation of the grammatical correctness of the text.
These metrics are integrated into a single measure, Seisyun Information Entropy, using information theory principles.
The Seisyun Information Entropy for a given text $S$ combines the three probabilities via a negative log (surprisal); see the manuscript for the exact formulation:

$$E_{seisyun}(S) = -\log\bigl(P_{unusual}(S)\,P_{positive}(S)\,P_{fluency}(S)\bigr)$$

Where:
- $P_{unusual}(S)$: the average word prediction probability.
- $P_{positive}(S)$: the probability that the text is classified as positive.
- $P_{fluency}(S)$: the probability that the text is grammatically correct.
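As a minimal sketch of how the three probabilities can be folded into a single surprisal-style score, the snippet below multiplies them and takes a negative log. The function name `seisyun_entropy` and the use of base-2 logarithms are illustrative assumptions, not the paper's exact definition:

```python
import math

def seisyun_entropy(p_unusual: float, p_positive: float, p_fluency: float) -> float:
    """Combine three probabilities into one surprisal-style score (in bits).

    Illustrative only: the joint probability of the three factors is
    converted to information content via a negative log.
    """
    joint = p_unusual * p_positive * p_fluency
    return -math.log2(joint)

# Example: three moderately likely scores of 0.5 each.
print(seisyun_entropy(0.5, 0.5, 0.5))  # -> 3.0 bits
```

Lower probabilities on any factor raise the score, so highly unexpected (yet positive and fluent) text yields a larger value.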
The framework leverages multiple pre-trained BERT models:

- Unusualness: Tohoku BERT (Japanese), which predicts word probabilities to assess unexpectedness.
- Positivity: Sentiment-Enhanced BERT, which analyzes sentiment polarity.
- Fluency: Fluency-Scoring BERT, which evaluates grammatical correctness.
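The unusualness factor is the average probability a masked language model assigns to each word in its context. The sketch below shows that averaging loop with a toy unigram table standing in for Tohoku BERT; `predict_prob` is a hypothetical placeholder for a real masked-LM forward pass, and the token probabilities are invented:

```python
from typing import Callable, List

def avg_prediction_probability(
    tokens: List[str],
    predict_prob: Callable[[List[str], int], float],
) -> float:
    """Average probability the model assigns to each token in context.

    `predict_prob(tokens, i)` should return the masked-LM probability of
    tokens[i] given the rest of the sentence; here it is a stand-in for
    an actual BERT call.
    """
    probs = [predict_prob(tokens, i) for i in range(len(tokens))]
    return sum(probs) / len(probs)

# Toy predictor: a fixed lookup table instead of a real model.
toy_table = {"青春": 0.01, "は": 0.4, "眩しい": 0.05}
toy_predict = lambda toks, i: toy_table.get(toks[i], 0.001)

p_unusual = avg_prediction_probability(["青春", "は", "眩しい"], toy_predict)
print(p_unusual)  # low average probability suggests unusual text
```

In the real pipeline each position would be masked in turn and scored by the Japanese BERT model rather than a lookup table.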
- `module/`: Core modules for calculating Seisyun Information Entropy.
- `main.py`: Main script to execute the entropy calculations.
- `seisyun.csv`: Dataset containing text samples related to "Seisyun".
- `goodness.csv`: Dataset with human-annotated "youthfulness" scores.
- `comparison.csv`: Results comparing calculated entropy with human scores.
- `plot.png`: Visualization of the comparison results.
- `requirement.txt`: List of dependencies required to run the project.
- `.gitignore`: Specifies files and directories to be ignored by git.
- `LICENSE`: License information for the repository.
- Python 3.8+
- PyTorch
- Hugging Face Transformers
Install dependencies:

```bash
pip install -r requirement.txt
```

To reproduce the results from the thesis, run:

```bash
python main.py
```

Experimental results showed a weak positive correlation (r = 0.333) between Seisyun Information Entropy and manually generated "youthfulness" rankings of texts. A significance test (p = 0.035) confirmed that the correlation is statistically significant at the 5% level.
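The reported correlation can be checked against `comparison.csv` with a plain Pearson computation. The snippet below implements r from scratch; the score lists are illustrative values, not the actual experiment data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Illustrative values only, not the repository's data.
entropy_scores = [2.1, 3.4, 2.8, 4.0, 3.1]
human_scores = [1.0, 3.0, 2.0, 5.0, 4.0]
print(pearson_r(entropy_scores, human_scores))  # positive correlation
```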
This project is licensed under the MIT License - see the LICENSE file for details.
The complete manuscript is accessible online at the following address:
https://web.cshe.nagoya-u.ac.jp/support/student/contest/img/2024_kotama.pdf
If you use this work, please cite:
```bibtex
@unpublished{kotama:seisyun_entropy_llm,
  author = {Takanori Kotama},
  title  = {A Quantitative Definition of Youth Using Large Language Models},
  year   = {2025},
  note   = {Award-winning paper of the Nagoya University Student Paper Contest (Encouragement Prize), presented in 2025},
  url    = {https://web.cshe.nagoya-u.ac.jp/support/student/contest/img/2024_kotama.pdf}
}
```