Skip to content

lkondratova/llm-benchmarking-suite

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

LLM Benchmarking Suite

Designing evaluation systems for large language models in scientific and data-intensive domains, with a focus on reproducibility, ambiguity detection, and non-deterministic workflows.


Overview

This repository presents a collection of expert-level evaluation tasks and frameworks for assessing LLM performance beyond simple correctness.

The work focuses on realistic scenarios where:

  • multiple valid solutions may exist
  • outputs depend on workflow or methodological choices
  • ambiguity in task design affects reproducibility
  • naive grading fails to distinguish reasoning from shortcuts

The goal is to demonstrate how to design robust, fair, and scientifically grounded evaluation systems.


Key Themes

  • Evaluation under uncertainty
    Handling non-deterministic outputs from pipelines and workflows

  • Scientific reasoning assessment
    Evaluating model performance on bioinformatics and data analysis tasks

  • Ambiguity detection and resolution
    Identifying underspecified prompts and improving reproducibility

  • Rubric-driven evaluation
    Designing structured grading systems that reward correct reasoning

  • Model failure analysis
    Understanding hallucinations, shortcuts, and reasoning gaps


Repository Structure

tasks/
├── task_00_algorithm_reconstruction/
├── task_01_non_deterministic_pipeline/
├── task_02_deterministic_scientific_calculation/

### To be added
frameworks/
├── deterministic_vs_nondeterministic.md
├── grading_strategies.md

Task Design Philosophy

Each task in this repository is designed to test a specific dimension of model capability:

Task Type Purpose
Deterministic computation Precision and correctness
Pipeline variability Handling non-deterministic outputs
QC interpretation Scientific reasoning
Reasoning traps Detecting shortcut behavior
Ambiguity detection Improving task design
Framework design System-level evaluation thinking

Evaluation Approach

Unlike traditional benchmarks, this repository emphasizes:

  • tolerance-aware grading for non-deterministic outputs
  • multiple valid solution paths where appropriate
  • explicit handling of methodological variability
  • separation of reasoning quality from final answer correctness

Example: What Makes a Task Challenging

A task is considered challenging if it requires:

  • distinguishing between systematic vs stochastic variation
  • interpreting outputs rather than computing them
  • resolving ambiguity in problem definition
  • applying domain-specific reasoning
  • avoiding common shortcut solutions

Why This Matters

As LLMs are increasingly applied to scientific and technical domains, evaluation must move beyond:

  • exact string matching
  • single ground-truth answers
  • oversimplified benchmarks

This repository explores more realistic evaluation strategies aligned with real-world use cases.


About

PhD in Genetics & Genomics with experience in bioinformatics, multi-omics data analysis, and AI evaluation. Focused on building reproducible, fair, and interpretable evaluation systems for complex model outputs.


Notes

  • All tasks are independently designed or reconstructed from publicly available methods
  • No proprietary datasets or internal materials are included
  • Emphasis is placed on clarity, reproducibility, and evaluation rigor

About

AI evaluation frameworks for scientific reasoning, benchmarking, and non-deterministic workflows

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors