Designing evaluation systems for large language models in scientific and data-intensive domains, with a focus on reproducibility, ambiguity detection, and non-deterministic workflows.
This repository presents a collection of expert-level evaluation tasks and frameworks for assessing LLM performance beyond simple correctness.
The work focuses on realistic scenarios where:
- multiple valid solutions may exist
- outputs depend on workflow or methodological choices
- ambiguity in task design affects reproducibility
- naive grading fails to distinguish reasoning from shortcuts
The goal is to demonstrate how to design robust, fair, and scientifically grounded evaluation systems.
-
Evaluation under uncertainty
Handling non-deterministic outputs from pipelines and workflows -
Scientific reasoning assessment
Evaluating model performance on bioinformatics and data analysis tasks -
Ambiguity detection and resolution
Identifying underspecified prompts and improving reproducibility -
Rubric-driven evaluation
Designing structured grading systems that reward correct reasoning -
Model failure analysis
Understanding hallucinations, shortcuts, and reasoning gaps
tasks/
├── task_00_algorithm_reconstruction/
├── task_01_non_deterministic_pipeline/
├── task_02_deterministic_scientific_calculation/
### To be added
frameworks/
├── deterministic_vs_nondeterministic.md
├── grading_strategies.md
Each task in this repository is designed to test a specific dimension of model capability:
| Task Type | Purpose |
|---|---|
| Deterministic computation | Precision and correctness |
| Pipeline variability | Handling non-deterministic outputs |
| QC interpretation | Scientific reasoning |
| Reasoning traps | Detecting shortcut behavior |
| Ambiguity detection | Improving task design |
| Framework design | System-level evaluation thinking |
Unlike traditional benchmarks, this repository emphasizes:
- tolerance-aware grading for non-deterministic outputs
- multiple valid solution paths where appropriate
- explicit handling of methodological variability
- separation of reasoning quality from final answer correctness
A task is considered challenging if it requires:
- distinguishing between systematic vs stochastic variation
- interpreting outputs rather than computing them
- resolving ambiguity in problem definition
- applying domain-specific reasoning
- avoiding common shortcut solutions
As LLMs are increasingly applied to scientific and technical domains, evaluation must move beyond:
- exact string matching
- single ground-truth answers
- oversimplified benchmarks
This repository explores more realistic evaluation strategies aligned with real-world use cases.
PhD in Genetics & Genomics with experience in bioinformatics, multi-omics data analysis, and AI evaluation. Focused on building reproducible, fair, and interpretable evaluation systems for complex model outputs.
- All tasks are independently designed or reconstructed from publicly available methods
- No proprietary datasets or internal materials are included
- Emphasis is placed on clarity, reproducibility, and evaluation rigor