Research note: Climate science domain benchmark — first cross-domain eval run #12
Summary
First benchmark run against an external domain (climate science), contributed by @harshareddy832 in #9. This note documents what we found, what broke, and what it revealed about the architecture.
Setup
- Domain file: eval/domains/climate_science.json
- Models: cerebras/llama3.1-8b (small baseline), groq/llama-3.3-70b-versatile (large baseline + NoUse run)
- Judge: groq/llama-3.3-70b-versatile
- Seeding: brain.add() — 120 relations inserted from Q&A pairs and key concepts (sketch below)
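As a rough sketch of the seeding step (brain.add() and the 120-relation count are from this note; the import path, the triple-style signature, and the JSON schema are all assumptions):

```python
import json

# `Brain` stands in for the NoUse graph object; the import path and
# constructor are hypothetical; only brain.add() appears in this note.
from nouse import Brain

brain = Brain()

# Assumed layout for the domain file: a list of Q&A entries, each with
# optional key concepts. The real climate_science.json may differ.
with open("eval/domains/climate_science.json") as f:
    entries = json.load(f)

count = 0
for e in entries:
    # One relation per Q&A pair...
    brain.add(e["question"], "answered_by", e["answer"])
    count += 1
    # ...plus one per key concept (assumed split behind the 120 total).
    for concept in e.get("concepts", []):
        brain.add(e["question"], "mentions", concept)
        count += 1

print(f"seeded {count} relations")
```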
Results

[Results table: cerebras/llama3.1-8b (small baseline), groq/llama-3.3-70b-versatile (large baseline), groq/llama-3.3-70b-versatile (NoUse); per-run scores not recovered]

Aggregate scores are low and not the main finding — see below.
What actually happened
Individual question result (climate_009)
Run B explicitly cited the graph: the model knew what it knew and why — with an evidence score attached. That is the intended behavior.
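For concreteness, a graph-citing answer of the kind Run B produced might be recorded like this (the field names, the tuple shape, and the 0.87 value are illustrative assumptions; only "cited the graph, with an evidence score attached" is from the run):

```python
# Hypothetical record shape; every field name here is an assumption.
run_b_result = {
    "question_id": "climate_009",
    "answer": "...",  # the model's answer text
    "graph_citations": [
        {
            # cited (subject, predicate, object) from the seeded graph
            "relation": ("...", "...", "..."),
            "evidence_score": 0.87,  # illustrative value, not from the run
        }
    ],
}
```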
What confounded the aggregate scores
Failed model calls came back as [ERROR] and were scored 0, dragging the aggregates down.
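To make the confound concrete, here is a minimal illustration of errors-scored-as-zero versus errors excluded (the numbers are made up; only the score-0 treatment is from the run):

```python
# Hypothetical per-question scores where failed calls returned "[ERROR]".
raw = [0.8, "[ERROR]", 0.7, "[ERROR]", 0.6]

# How the run aggregated: errors become 0 but stay in the denominator.
scored = [0.0 if r == "[ERROR]" else r for r in raw]
with_errors = sum(scored) / len(scored)  # 0.42

# The same answers with failed calls excluded instead of zeroed.
ok = [r for r in raw if r != "[ERROR]"]
without_errors = sum(ok) / len(ok)       # 0.70
```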
What this revealed: the cold-start problem

NoUse cannot help in a domain it has never seen. This is expected — but it means cross-domain benchmarking requires domain seeding first.
This prompted two new architectural issues:
- scoping graph knowledge by model, sessions, or domain
- a cross-model epistemic cache with zero-token replay and Hebbian correlation

The hypothesis from #10: if NoUse seeds domain knowledge reactively when a query misses, aggregate scores should converge toward the internal benchmark results.
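A minimal sketch of that reactive-seeding flow (control flow only; brain.query, seed_domain, and generate_answer are assumed placeholder names, not NoUse's actual API):

```python
def answer_with_reactive_seeding(question, domain, brain):
    """On a graph miss, seed the domain, then retry the lookup.

    Hypothetical control flow for the #10 hypothesis; every call below
    is an assumed placeholder, not a documented NoUse function.
    """
    hits = brain.query(question)        # assumed lookup API
    if not hits:                        # cold start: domain never seen
        seed_domain(brain, domain)      # e.g. the brain.add() loop from Setup
        hits = brain.query(question)    # retry once the graph is seeded
    return generate_answer(question, evidence=hits)  # assumed generation step
```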
Next steps
Null results and methodology failures are as useful as wins. If you replicate this or run it on a different domain, post results here.