pandu-rao/logitest

LogiTest



Introduction

LogiTest is a benchmarking tool that evaluates how well large language models (LLMs) simplify boolean logic expressions. It generates random boolean expressions, computes a reference solution with SymPy's symbolic logic engine, asks an LLM for its simplified form, and compares the two results to produce an accuracy score. Results are appended to a CSV file.



Requirements

Development Environment

  • Python 3.14+
  • An API key for the configured LLM provider

Production Environment

  • An API key for the configured LLM provider

Installation

$ pip install .

Or using uv:

$ uv sync

Configuration

  • Set the LLM model via the LOGITEST_LLM_MODEL environment variable (default: anthropic/claude-sonnet-4-5)
  • Set the API key via the LOGITEST_LLM_MODEL_API_KEY environment variable

$ export LOGITEST_LLM_MODEL=anthropic/claude-sonnet-4-5
$ export LOGITEST_LLM_MODEL_API_KEY=your-api-key-here
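
As a rough illustration of how these two variables might be read from Python, here is a minimal sketch. The function name `load_config` and the decision to raise when the key is missing are assumptions made for this example; the README does not describe how LogiTest itself handles a missing key.

```python
import os

# Default model name as stated in the Configuration section.
DEFAULT_MODEL = "anthropic/claude-sonnet-4-5"


def load_config(env=None):
    """Return (model, api_key) read from an environment-like mapping.

    Hypothetical helper: LogiTest's real configuration code may differ.
    """
    env = os.environ if env is None else env
    model = env.get("LOGITEST_LLM_MODEL", DEFAULT_MODEL)
    api_key = env.get("LOGITEST_LLM_MODEL_API_KEY")
    if not api_key:
        # Assumption: fail early rather than send an unauthenticated request.
        raise RuntimeError("LOGITEST_LLM_MODEL_API_KEY is not set")
    return model, api_key
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise without touching the real environment.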

Development Workflow

The project uses SymPy for reference solutions and the llm library for model interaction. The core flow is:

  1. Boolean expressions are generated using SymPy symbols and operators
  2. Expressions are converted to Python syntax and passed to the LLM
  3. The LLM response is parsed back into a SymPy expression in disjunctive normal form (DNF)
  4. The LLM solution is compared against the output of SymPy's simplify_logic
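
The four steps above can be sketched with SymPy alone. This is an illustrative sketch, not LogiTest's actual code: the names `random_expr`, `evaluate`, and `ask_llm` are hypothetical, the model call is replaced by any callable that maps a prompt string to a reply string, and the way random expressions are built is an assumption. The README does not say whether the comparison in step 4 is structural or truth-table based, so the sketch reports both.

```python
import random

from sympy import symbols
from sympy.logic.boolalg import And, Equivalent, Not, Or, simplify_logic, to_dnf
from sympy.logic.inference import satisfiable
from sympy.parsing.sympy_parser import parse_expr


def random_expr(atoms, rng, num_terms=3):
    """Step 1: build a random OR of AND-terms over possibly negated atoms."""
    terms = []
    for _ in range(num_terms):
        chosen = rng.sample(list(atoms), rng.randint(2, len(atoms)))
        literals = [atom if rng.random() < 0.5 else Not(atom) for atom in chosen]
        terms.append(And(*literals))
    return Or(*terms)


def evaluate(expr, atoms, ask_llm):
    """Steps 2-4 for one expression; ask_llm is any callable str -> str."""
    reference = simplify_logic(expr)
    # Step 2: str() prints SymPy booleans with Python's &, | and ~ operators.
    reply = ask_llm(str(expr))
    # Step 3: parse the reply with the known atoms, then convert to DNF.
    names = {str(atom): atom for atom in atoms}
    llm_solution = to_dnf(parse_expr(reply, local_dict=names))
    # Step 4: equivalent iff "the two formulas differ" is unsatisfiable.
    equivalent = not satisfiable(Not(Equivalent(reference, llm_solution)))
    return {
        "reference": reference,
        "llm_solution": llm_solution,
        "equivalent": equivalent,
        "identical": llm_solution == reference,
    }


if __name__ == "__main__":
    a, b, c = symbols("a b c")
    expr = random_expr([a, b, c], random.Random(1024))
    # Stand-in "model" that echoes the prompt back without simplifying it.
    print(evaluate(expr, [a, b, c], lambda prompt: prompt))
```

Keeping the model behind a plain callable makes the flow testable offline; a real run would swap in a function that queries the configured LLM.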

Usage

Run LogiTest with one randomly generated expression:

$ python logitest.py

Examples

$ python logitest.py
$ python logitest.py --model anthropic/claude-opus-4-6
$ python logitest.py --include-default
$ python logitest.py --num-exprs 2
$ python logitest.py --num-exprs 2 --num-tries 2
$ python logitest.py --num-exprs 2 --num-tries 2 --extra-effort
$ python logitest.py --num-exprs 2 --num-tries 2 --extra-effort --atoms 'a,b,c,d'
$ python logitest.py --num-exprs 2 --seed 1024

The output is a grid table showing each expression, the reference solution, the LLM solution, metadata, and a pass/fail result (0 marks a failure in the sample below), followed by an overall accuracy score:

+-+----------------------------+------------------+----------------------+------+
|#|expression                  |ref_solution      |llm_solution          |result|
+=+============================+==================+======================+======+
|1|(b & ~a) | (a & b & c)      |b & (c | ~a)      |(b & ~a) | (a & b & c)|     0|
+-+----------------------------+------------------+----------------------+------+
|2|(a & c) | (b & c) | (b & ~a)|(a & c) | (b & ~a)|b | (a & c)           |     0|
+-+----------------------------+------------------+----------------------+------+
accuracy: 0.0 | avg_tries: 1.0 | correct: 0/2
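
The summary line can be reproduced from per-expression records with a few lines of Python. This is a sketch under stated assumptions: accuracy is taken to be correct/total, avg_tries the mean number of tries per expression, and a result of 1 is taken to mean a pass. The helper name `summarize` is hypothetical and LogiTest's own calculation may differ.

```python
def summarize(rows):
    """Format a summary line from (result, tries) pairs.

    Assumes result is 1 for a pass and 0 for a fail.
    """
    total = len(rows)
    correct = sum(result for result, _ in rows)
    # Guard against an empty run to avoid dividing by zero.
    accuracy = correct / total if total else 0.0
    avg_tries = sum(tries for _, tries in rows) / total if total else 0.0
    return f"accuracy: {accuracy} | avg_tries: {avg_tries} | correct: {correct}/{total}"


# Two failed expressions, one try each, as in the sample table above.
print(summarize([(0, 1), (0, 1)]))
```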
