Large Reasoning Models (LRMs) have achieved rapid progress, yet existing benchmarks—focused on mathematics, programming, or common-sense reasoning—suffer from poor long-context evaluation, weak difficulty control, ambiguous answers, and narrow coverage of reasoning paradigms.
GrAlgoBench is a benchmark of graph algorithm problems for evaluating LRMs. Graph tasks naturally provide:
- Effective long-context reasoning → graph descriptions induce long inputs, testing context scalability.
- Scalable difficulty control → complexity grows smoothly with graph size (8–160 nodes).
- Standardized evaluation → outputs are integers/nodes/edges, enabling exact and programmatic checking (see the sketch after this list).
- Diverse reasoning paradigms → tasks span Enumeration, Exploration, and Intuition, mapping to brute-force, search, and greedy paradigms.
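To make these properties concrete, here is a minimal, purely illustrative sketch: it builds a random graph description of a chosen size (so the input grows with the graph) and checks an integer answer by exact match. The prompt format and generators used by the actual benchmark live in data_generation/ and will differ.

```python
import random

def make_graph_task(num_nodes: int, edge_prob: float = 0.3, seed: int = 0):
    """Illustrative MaxDegree-style task; GrAlgoBench's real generators differ."""
    rng = random.Random(seed)
    edges = [(u, v) for u in range(num_nodes) for v in range(u + 1, num_nodes)
             if rng.random() < edge_prob]
    # The textual edge list is what makes the input long as the graph grows.
    prompt = (
        f"The graph has {num_nodes} nodes, numbered 0 to {num_nodes - 1}.\n"
        + "\n".join(f"Edge: {u} -- {v}" for u, v in edges)
        + "\nWhat is the maximum degree? Answer with a single integer."
    )
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return prompt, max(degree)

def exact_match(model_output: str, gold: int) -> bool:
    """Programmatic checking: parse the reply as an integer and compare exactly."""
    try:
        return int(model_output.strip()) == gold
    except ValueError:
        return False

for n in (8, 40, 160):  # difficulty and input length scale smoothly with graph size
    prompt, gold = make_graph_task(n)
    print(n, "prompt chars:", len(prompt), "gold:", gold)
```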
Experiments on nine tasks across three categories uncover two major weaknesses of current LRMs:
- Performance collapse under long contexts → accuracy drops sharply as graph size or input length grows, due to step-by-step execution errors, weak memory, and redundant reasoning.
- Ineffective self-verification → models often engage in verbose self-checking that inflates reasoning traces but rarely improves correctness, becoming the main driver of over-thinking.
By addressing the shortcomings of prior benchmarks, GrAlgoBench establishes graph algorithm problems as a rigorous, multidimensional, and application-relevant testbed for advancing the study of reasoning in LRMs.
GrAlgoBench/
├── data_generation/ # Scripts for dataset construction
├── Inference/ # Model inference scripts and configs
├── error_analysis/ # Scripts for analyzing model errors
├── overthinking/ # Overthinking analysis module
├── label/ # Response labeling and segmentation
├── judge/ # Segment effectiveness judgment
├── entropy_analysis/ # Token entropy analysis and wordclouds
├── logs/ # Default log directory
├── results/ # Saved inference results
├── scripts/ # Helper bash scripts for batch running
├── common_utils.py # Shared utilities and configurations
└── README.md # Documentation
conda create -n gralgobench python=3.10
conda activate gralgobench
pip install -r requirements.txt

To build the datasets, run:

python ./data_generation/build_dataset.py
Datasets are organized as:
dataset/
├── MKC_easy.pkl
├── MKC_medium.pkl
├── MST_hard.pkl
└── ...
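Each .pkl file appears to hold the generated instances for one task at one difficulty (e.g., MST_hard). The snippet below is a hypothetical way to peek inside a file; the exact record schema is whatever data_generation/build_dataset.py writes, so the printed fields may differ.

```python
import pickle

# Hypothetical inspection snippet: the record layout inside each .pkl file is
# defined by data_generation/build_dataset.py and may differ from this guess.
with open("dataset/MST_hard.pkl", "rb") as f:
    samples = pickle.load(f)

print(type(samples))
if isinstance(samples, (list, tuple)) and samples:
    print(len(samples), "samples; first record:")
    print(samples[0])
```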
Single-task example:
python Inference/infer_open.py \
--LLM Qwen3-8B \
--task MST \
--difficulty medium \
--batch_size 32 \
  --gpu_num 4

Batch execution via script:

bash scripts/infer_open.sh

To analyze model errors, follow these steps:
- Reformat raw responses. This step parses and normalizes model outputs into a consistent structure:

  python error_analysis/reformat.py

- Run the error analysis script. Specify the task, evaluation model, and the model that generated the responses:

  python error_analysis/error_analysis.py \
    --task MST \
    --llm gpt5_mini \
    --response_generated_from_what_model Qwen3-8B
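For intuition, the hypothetical sketch below shows the kind of normalization the reformat step performs: pulling a final answer out of a free-form reasoning trace so that the error analysis can compare it against the gold value. The actual logic lives in error_analysis/reformat.py and is likely more involved.

```python
import re

def extract_final_answer(raw_response: str):
    """Hypothetical normalization: recover the last boxed or 'Answer:'-style
    value from a free-form reasoning trace (the real reformat.py may differ)."""
    # Prefer an explicit \boxed{...} span if the model produced one.
    boxed = re.findall(r"\\boxed\{([^}]*)\}", raw_response)
    if boxed:
        return boxed[-1].strip()
    # Otherwise fall back to the last "Answer: ..." line.
    answers = re.findall(r"(?im)^\s*answer\s*:\s*(.+)$", raw_response)
    return answers[-1].strip() if answers else None

print(extract_final_answer("...long reasoning...\nAnswer: 42"))   # -> "42"
print(extract_final_answer(r"so the result is \boxed{17}."))      # -> "17"
```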
GrAlgoBench provides comprehensive analysis tools to understand model behavior and reasoning patterns across multiple dimensions:
Analyze model self-verification and reasoning redundancy patterns:
./overthinking/run_overthinking.sh 0,1 \
--LLM Qwen3-32B \
--task_type graph \
--batch_size 32
./overthinking/run_overthinking.sh 0,1 \
--LLM Qwen3-32B \
--task_type math_competition \
  --batch_size 16

Segment and label model reasoning steps for fine-grained analysis:
./label/run_label.sh 0,1 \
--LLM Qwen3-32B \
--task_type graph \
--batch_size 32
./label/run_label.sh 0,1 \
--LLM Qwen3-32B \
--task_type math_competition \
  --batch_size 16

Evaluate the effectiveness of individual reasoning segments:
./judge/run_judge.sh 0,1 \
--LLM Qwen3-32B \
--task_type graph \
--batch_size 32
./judge/run_judge.sh 0,1 \
--LLM Qwen3-32B \
--task_type math \
  --batch_size 16

Analyze token-level uncertainty and model confidence patterns:
./entropy_analysis/run_entropy_analysis.sh infer 0,1 \
--LLM Qwen3-32B --task MaxDegree --difficulty easy
./entropy_analysis/run_entropy_analysis.sh analyze 0,1 \
--min_freq 40000 --top_k 100
./entropy_analysis/run_entropy_analysis.sh wordcloud 0,1
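As a rough illustration of what token-level uncertainty means here: the entropy of the model's next-token distribution can be computed per generated token from the logits, and the wordcloud step then visualizes which tokens tend to carry high entropy. The sketch below is a generic numpy version under those assumptions, not the repository's actual implementation.

```python
import numpy as np

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Illustrative only: per-token Shannon entropy (in nats) of the next-token
    distribution, from a [seq_len, vocab_size] logits array. The pipeline in
    entropy_analysis/ may compute and aggregate this differently."""
    # Numerically stable softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(5, 32000))  # 5 generated tokens, 32k-token vocab
print(token_entropies(fake_logits))        # one entropy value per token
```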